--- title: "Parallel Processing" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Parallel Processing} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` When processing large datasets, parallel generation can significantly reduce execution time. This tutorial covers efficient batch processing strategies with localLLM. ## Why Parallel Processing? Sequential processing with a for-loop processes one prompt at a time. Parallel processing batches multiple prompts together, sharing computation and reducing overhead. In benchmarks, `generate_parallel()` typically completes in **60–70% of the time** compared to sequential `generate()` calls (1.3×–1.7× speedup depending on model size). ## Using generate_parallel() ### Basic Usage ```{r} library(localLLM) # Load model model <- model_load("Llama-3.2-3B-Instruct-Q5_K_M.gguf", n_gpu_layers = 999) # Create context with batch support ctx <- context_create( model, n_ctx = 2048, n_seq_max = 10 # Allow up to 10 parallel sequences ) # Define prompts prompts <- c( "What is the capital of France?", "What is the capital of Germany?", "What is the capital of Italy?" ) # Format prompts formatted_prompts <- sapply(prompts, function(p) { messages <- list( list(role = "system", content = "Answer concisely."), list(role = "user", content = p) ) apply_chat_template(model, messages) }) # Process in parallel results <- generate_parallel(ctx, formatted_prompts, max_tokens = 50) print(results) ``` ``` #> [1] "The capital of France is Paris." #> [2] "The capital of Germany is Berlin." #> [3] "The capital of Italy is Rome." ``` ### Progress Tracking Progress reporting is enabled by default in interactive sessions (`progress = interactive()`). To force it in non-interactive scripts, set `progress = TRUE` explicitly: ```{r} results <- generate_parallel( ctx, formatted_prompts, max_tokens = 50, progress = TRUE # force progress bar even in non-interactive mode ) ``` ``` #> Processing 100 prompts... #> [##########----------] 50% #> [####################] 100% #> Done! ``` ## Text Classification Example Here's a complete example classifying news articles: ```{r} library(localLLM) # Load sample dataset data("ag_news_sample", package = "localLLM") # Load model model <- model_load("Llama-3.2-3B-Instruct-Q5_K_M.gguf", n_gpu_layers = 999) # Create context (n_seq_max determines max parallel prompts) ctx <- context_create(model, n_ctx = 1048, n_seq_max = 10) # Prepare all prompts all_prompts <- character(nrow(ag_news_sample)) for (i in seq_len(nrow(ag_news_sample))) { messages <- list( list(role = "system", content = "You are a helpful assistant."), list(role = "user", content = paste0( "Classify this news article into exactly one category: ", "World, Sports, Business, or Sci/Tech. 
", "Respond with only the category name.\n\n", "Title: ", ag_news_sample$title[i], "\n", "Description: ", substr(ag_news_sample$description[i], 1, 100), "\n\n", "Category:" )) ) all_prompts[i] <- apply_chat_template(model, messages) } # Process all samples in parallel results <- generate_parallel( context = ctx, prompts = all_prompts, max_tokens = 5, seed = 92092, progress = TRUE, clean = TRUE ) # Extract predictions ag_news_sample$LLM_result <- sapply(results, function(x) { trimws(gsub("\\n.*$", "", x)) }) # Calculate accuracy accuracy <- mean(ag_news_sample$LLM_result == ag_news_sample$class) cat("Accuracy:", round(accuracy * 100, 1), "%\n") ``` ``` #> Accuracy: 87.0 % ``` ## Sequential vs Parallel Comparison ### Sequential (For Loop) ```{r} # Sequential approach ag_news_sample$LLM_result <- NA ctx <- context_create(model, n_ctx = 512) system.time({ for (i in seq_len(nrow(ag_news_sample))) { formatted_prompt <- all_prompts[i] output <- generate(ctx, formatted_prompt, max_tokens = 5, seed = 92092) ag_news_sample$LLM_result[i] <- trimws(output) } }) ``` ``` #> user system elapsed #> 0.62 0.08 41.55 ``` ### Parallel ```{r} # Parallel approach ctx <- context_create(model, n_ctx = 1048, n_seq_max = 10) system.time({ results <- generate_parallel( ctx, all_prompts, max_tokens = 5, seed = 92092, progress = TRUE ) }) ``` ``` #> user system elapsed #> 0.38 0.04 24.08 ``` **Result**: ~42% faster with parallel processing (1.73×). ### Benchmark: Multiple Models Tested on Apple M3 Pro (18 GB unified memory), 100 AG News classification prompts, `ctx_size = 512`, `max_tokens = 50`, `n_seq_max = 10`: | Model | Sequential | Parallel (10×) | Speedup | |---|---|---|---| | Llama-3.2-3B-Instruct-Q5_K_M | 41.6 sec | 24.1 sec | **1.73×** | | Gemma-3-4B-it-QAT-Q5_K_M | 41.3 sec | 30.0 sec | 1.38× | | OLMo-3-7B-Instruct-Q5_K_M | 61.5 sec | 43.3 sec | 1.42× | | Gemma-4-26B-A4B-it-IQ2_XXS | 69.2 sec | 52.9 sec | 1.31× | On Apple Silicon (M3 Pro), smaller models tend to show **higher parallel speedup** than larger ones. The GPU is underutilised during single-sequence inference for small models, so batching provides more headroom. Larger models approach GPU saturation even at `n_seq_max = 1`, leaving less room for parallel gains. > **Note on reasoning models**: DeepSeek-R1 and similar reasoning models (QwQ, Gemma 4) > output a thinking block before the final answer (e.g. `...answer`). > For classification tasks, strip the thinking section before evaluating predictions: > > ```r > clean_pred <- function(x) { > # Remove thinking block, keep only text after closing tag > x <- gsub(".*?", "", x, perl = TRUE) > trimws(gsub("\n.*", "", trimws(x))) > } > ``` ## Using quick_llama() for Batches The simplest approach for parallel processing is passing a vector to `quick_llama()`: ```{r} # quick_llama automatically uses parallel mode for vectors prompts <- c( "Summarize: Climate change is affecting global weather patterns...", "Summarize: The stock market reached new highs today...", "Summarize: Scientists discovered a new species of deep-sea fish..." 
### Memory Usage

Parallel processing uses more memory. Monitor with:

```{r}
hw <- hardware_profile()
cat("Available RAM:", round(hw$ram_total / 1e9, 1), "GB\n")
cat("GPU:", hw$gpu$name, "\n")
```

### Batch Size Recommendations

| Dataset Size | Recommended n_seq_max |
|--------------|------------------------|
| < 100        | 4-8                    |
| 100-1000     | 8-16                   |
| > 1000       | 16-32 (memory permitting) |

## Error Handling

If a prompt fails, the result will contain an error message:

```{r}
results <- generate_parallel(ctx, prompts, max_tokens = 50)

# Check for errors
for (i in seq_along(results)) {
  if (grepl("^Error:", results[i])) {
    cat("Prompt", i, "failed:", results[i], "\n")
  }
}
```

## Complete Workflow

```{r}
library(localLLM)

# 1. Setup
model <- model_load("Llama-3.2-3B-Instruct-Q5_K_M.gguf", n_gpu_layers = 999)
ctx <- context_create(model, n_ctx = 2048, n_seq_max = 10)

# 2. Prepare prompts
data("ag_news_sample", package = "localLLM")
prompts <- sapply(seq_len(nrow(ag_news_sample)), function(i) {
  messages <- list(
    list(role = "system", content = "Classify news articles."),
    list(role = "user", content = paste0(
      "Category (World/Sports/Business/Sci/Tech): ",
      ag_news_sample$title[i]
    ))
  )
  apply_chat_template(model, messages)
})

# 3. Process in batches with progress
results <- generate_parallel(
  ctx, prompts,
  max_tokens = 10,
  seed = 42,
  progress = TRUE,
  clean = TRUE
)

# 4. Extract and evaluate
predictions <- sapply(results, function(x) trimws(gsub("\\n.*", "", x)))
accuracy <- mean(predictions == ag_news_sample$class)
cat("Accuracy:", round(accuracy * 100, 1), "%\n")
```

## Summary

| Function | Use Case |
|----------|----------|
| `generate()` | Single prompts, interactive use |
| `generate_parallel()` | Batch processing, large datasets |
| `quick_llama(vector)` | Quick batch processing |
| `explore()` | Multi-model comparison with batching |

## Tips

1. **Set `n_seq_max`** when creating a context for parallel use
2. **Scale `n_ctx`** with `n_seq_max` to give each sequence enough space
3. **Progress is shown automatically** in interactive sessions; set `progress = TRUE` to force it in scripts
4. **Use `clean = TRUE`** to automatically strip control tokens
5. **Set a consistent `seed`** for reproducibility across batches
6. **Set `verbosity = 0`** in scripts and automated pipelines to prevent backend log lines from appearing in output files or `R CMD check` output. `generate_parallel()` already defaults to `verbosity = 0`; suppress loading output by passing `verbosity = 0` to `model_load()` and `context_create()` as well:

```{r}
# Fully silent batch pipeline
model <- model_load("model.gguf", verbosity = 0)
ctx <- context_create(model, n_seq_max = 8, verbosity = 0)
results <- generate_parallel(ctx, prompts, max_tokens = 50, progress = FALSE)
```

## Next Steps

- **[Model Comparison](tutorial-model-comparison.html)**: Compare multiple models
- **[Basic Text Generation](tutorial-basic-generation.html)**: Learn the core API
- **[Reproducible Output](reproducible-output.html)**: Ensure reproducibility