---
title: "Benchmarking bigPLScox"
shorttitle: "Benchmarking bigPLScox"
author:
- name: "Frédéric Bertrand"
  affiliation:
  - Cedric, Cnam, Paris
  email: frederic.bertrand@lecnam.net
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Benchmarking bigPLScox}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "figures/benchmarking-",
  fig.width = 7,
  fig.height = 5,
  dpi = 150,
  message = FALSE,
  warning = FALSE
)
LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE")
```

# Motivation

The **bigPLScox** package now provides three distinct engines for extracting PLS components in Cox regression: the original matrix-based solver (`big_pls_cox()`), the new Armadillo backend (`big_pls_cox_fast()`), and the stochastic gradient descent routine (`big_pls_cox_gd()`). This vignette shows how to benchmark those implementations against `coxgpls()` and a standard `survival::coxph()` fit.

We rely on the [`bench`](https://bench.r-lib.org) package to collect timing and memory statistics. All chunks are guarded by the `LOCAL` flag so that CRAN builds can skip the heavier computations.

# Dependencies

```{r packages, include=FALSE}
needed <- c("bigmemory", "bench", "survival", "plsRcox", "bigPLScox")
has <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
if (!all(has)) {
  missing <- paste(needed[!has], collapse = ", ")
  knitr::opts_chunk$set(eval = FALSE)
  message("Note: skipping code execution because these packages are missing: ", missing)
}
```

```{r packages_lib}
library(bigPLScox)
library(survival)
library(bench)
library(bigmemory)
library(plsRcox)
```

# Simulated data

We reuse the helper `dataCox()` to generate survival outcomes with right censoring. Adjust `n` and `p` to stress-test the solvers on larger problems.
```{r simulate-data}
set.seed(2024)
sim_design <- dataCox(
  n = 2000,
  lambda = 2,
  rho = 1.5,
  x = matrix(rnorm(2000 * 50), ncol = 50),
  beta = c(1, 3, rep(0, 48)),
  cens.rate = 5
)
cox_data <- list(
  x = as.matrix(sim_design[, -(1:3)]),
  time = sim_design$time,
  status = sim_design$status
)
X_big <- bigmemory::as.big.matrix(cox_data$x)
```

# Running the benchmark

1. **Dense matrix scenario** – compares the classical `coxgpls()` workflow and the dense variants of the big-memory solvers against `survival::coxph()`.
2. **Big-memory scenario** – evaluates `big_pls_cox_fast()` and `big_pls_cox_gd()` directly on a `big.matrix`, illustrating the benefit of the shared-memory interface.

The number of iterations can be increased for more stable measurements when running locally.

# Dense matrix comparison

```{r dense-benchmark}
bench_dense <- bench::mark(
  coxgpls = coxgpls(cox_data$x, cox_data$time, cox_data$status, ncomp = 5),
  big_pls = big_pls_cox(cox_data$x, cox_data$time, cox_data$status, ncomp = 5),
  big_pls_fast = big_pls_cox_fast(cox_data$x, cox_data$time, cox_data$status, ncomp = 5),
  big_pls_bb = big_pls_cox_gd(X_big, cox_data$time, cox_data$status, ncomp = 5),
  big_pls_gd = big_pls_cox_gd(X_big, cox_data$time, cox_data$status, ncomp = 5, method = "gd"),
  coxph = coxph(Surv(cox_data$time, cox_data$status) ~ cox_data$x, ties = "breslow"),
  iterations = 75,
  check = FALSE
)
bench_dense$expression <- names(bench_dense$expression)
bench_dense[, c("expression", "median", "itr/sec", "mem_alloc")]
```

A higher `itr/sec` value indicates faster throughput. The fast backend should provide the best compromise between speed and accuracy in the dense setting.

# Big-memory comparison

The next experiment reuses the `big.matrix` version of the predictors and adds the SGD solver. Because `coxgpls()` requires an in-memory matrix, it is not included in this round.
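A `big.matrix` does not have to live in RAM at all: `bigmemory` can memory-map the predictors from disk, which is what makes this round relevant for out-of-memory data. As an illustrative sketch (the file names here are hypothetical, and `cox_data` comes from the simulation chunk above), a file-backed copy could be created like this:

```r
library(bigmemory)

## Hypothetical file-backed variant of X_big: the predictor matrix is written
## to a binary backing file and memory-mapped, so it can exceed available RAM.
X_fb <- as.big.matrix(
  cox_data$x,
  backingfile    = "predictors.bin",
  descriptorfile = "predictors.desc",
  backingpath    = tempdir()
)

## A later session can re-attach to the same data without reloading it:
X_fb2 <- attach.big.matrix(file.path(tempdir(), "predictors.desc"))
```

A file-backed `X_fb` can be passed to the big-memory solvers in place of `X_big`; only the descriptor needs to be shared between processes, which is also how the shared-memory interface avoids copies.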
```{r big-benchmark}
bench_big <- bench::mark(
  big_pls_fast = big_pls_cox_fast(X_big, cox_data$time, cox_data$status, ncomp = 5),
  big_pls = big_pls_cox(cox_data$x, cox_data$time, cox_data$status, ncomp = 5),
  big_pls_gd = big_pls_cox_gd(X_big, cox_data$time, cox_data$status, ncomp = 5, max_iter = 300),
  iterations = 75,
  check = FALSE
)
bench_big$expression <- names(bench_big$expression)
bench_big[, c("expression", "median", "itr/sec", "mem_alloc")]
```

The classical solver is still run on the dense matrix for reference, but only the fast and SGD routines operate on the `big.matrix` without copying. Expect the fast backend to dominate runtime while SGD maintains scalability for out-of-memory workflows.

# Visualising the results

`bench` includes `autoplot()` and `plot()` helpers for visual inspection of the timing distributions.

```{r dense-plot}
plot(bench_dense, type = "jitter")
```

```{r big-plot}
plot(bench_big, type = "jitter")
```

Additional geometries, such as ridge plots, are available via `autoplot(bench_big, type = "ridge")`.

# Exporting benchmark tables

To integrate the benchmark with automated pipelines, store the results under the `inst/benchmarks/` directory. Each run can include metadata such as dataset parameters, the number of iterations, or git commit identifiers.

```{r export-benchmark, eval = FALSE}
if (!dir.exists("inst/benchmarks/results")) {
  dir.create("inst/benchmarks/results", recursive = TRUE)
}
write.csv(bench_dense[, 1:9],
          file = "inst/benchmarks/results/benchmark-dense.csv",
          row.names = FALSE)
write.csv(bench_big[, 1:9],
          file = "inst/benchmarks/results/benchmark-big.csv",
          row.names = FALSE)
```

# Additional scripts

The package also ships with standalone scripts under `inst/benchmarks/` that mirror this vignette while exposing additional configuration points.
Run them from the repository root as:

```bash
Rscript inst/benchmarks/cox-benchmark.R
Rscript inst/benchmarks/benchmark_bigPLScox.R
Rscript inst/benchmarks/cox_pls_benchmark.R
```

Each script accepts environment variables to adjust the problem size and stores its results in `inst/benchmarks/results/` with time-stamped filenames.
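The exact variable names are script-specific; as a hypothetical sketch of the pattern (the `BENCH_*` names below are illustrative, not the scripts' actual interface), a script can read its configuration from the environment and build a time-stamped output path like this:

```r
## Illustrative configuration block for a benchmark script: the problem size
## falls back to the defaults used in this vignette when no variable is set.
n_obs  <- as.integer(Sys.getenv("BENCH_N", "2000"))
n_pred <- as.integer(Sys.getenv("BENCH_P", "50"))
n_iter <- as.integer(Sys.getenv("BENCH_ITERATIONS", "75"))

## Time-stamped filename under inst/benchmarks/results/, so repeated runs
## never overwrite one another.
out_file <- file.path(
  "inst/benchmarks/results",
  sprintf("benchmark-%s.csv", format(Sys.time(), "%Y%m%d-%H%M%S"))
)
```

Under this convention, a larger problem would be launched as, e.g., `BENCH_N=10000 BENCH_P=500 Rscript inst/benchmarks/cox-benchmark.R`.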