---
title: "Fast big-memory workflows with bigPLScox"
shorttitle: "Fast big-memory workflows"
author:
- name: "Frédéric Bertrand"
  affiliation:
  - Cedric, Cnam, Paris
  email: frederic.bertrand@lecnam.net
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Fast big-memory workflows with bigPLScox}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "figures/fast-big--",
  fig.width = 7,
  fig.height = 4.5,
  dpi = 150,
  message = FALSE,
  warning = FALSE
)
LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE")
```

# Introduction

Large survival datasets often outgrow the capabilities of purely in-memory algorithms. **bigPLScox** therefore offers three flavours of Partial Least Squares (PLS) components for Cox models:

1. `big_pls_cox()` – the original R/C++ hybrid that expects dense matrices;
2. `big_pls_cox_fast()` – the new Armadillo backend with variance-one components and support for both dense matrices and `bigmemory::big.matrix` objects; and
3. `big_pls_cox_gd()` – a stochastic gradient descent (SGD) solver for streaming or disk-backed data.

Together, these backends make it possible to analyse datasets that are too large to fit in RAM. This vignette demonstrates how to build file-backed matrices, run the fast backends, and reconcile their outputs. It complements the introductory vignette `vignette("getting-started", package = "bigPLScox")` and focuses on workflow patterns specific to large datasets.

# Simulating a large survival dataset

We generate a synthetic dataset with correlated predictors and a known linear predictor. The same object is used to illustrate the dense and big-memory interfaces.

```{r simulate-bigmatrix, cache=TRUE, eval=LOCAL}
library(bigPLScox)
set.seed(2024)

n_obs  <- 5000
n_pred <- 100
k_true <- 3

# Block-correlated covariance for the first three blocks of ten predictors
Sigma <- diag(n_pred)
for (b in 0:2) {
  idx <- (b * 10 + 1):(b * 10 + 10)
  Sigma[idx, idx] <- 0.7
  diag(Sigma[idx, idx]) <- 1
}
L <- chol(Sigma)
Z <- matrix(rnorm(n_obs * n_pred), nrow = n_obs)
X_dense <- Z %*% L

# Three sparse weight vectors defining the true latent components
w1 <- numeric(n_pred); w1[1:4]  <- c( 1.0,  0.8,  0.6, -0.5)
w2 <- numeric(n_pred); w2[5:8]  <- c(-0.7,  0.9, -0.4,  0.5)
w3 <- numeric(n_pred); w3[9:12] <- c( 0.6, -0.5,  0.7,  0.8)
w1 <- w1 / sqrt(sum(w1^2))
w2 <- w2 / sqrt(sum(w2^2))
w3 <- w3 / sqrt(sum(w3^2))

t1 <- as.numeric(scale(drop(X_dense %*% w1), center = TRUE, scale = TRUE))
t2 <- as.numeric(scale(drop(X_dense %*% w2), center = TRUE, scale = TRUE))
t3 <- as.numeric(scale(drop(X_dense %*% w3), center = TRUE, scale = TRUE))

beta_true <- c(1.0, 0.6, 0.3)
eta <- beta_true[1] * t1 + beta_true[2] * t2 + beta_true[3] * t3

# Exponential event times under a proportional-hazards model
lambda0 <- 0.05
u <- runif(n_obs)
time_event <- -log(u) / (lambda0 * exp(eta))

# Calibrate the censoring rate on a fixed set of uniform draws so that
# uniroot() solves a deterministic, monotone equation
target_event <- 0.65
u_cens <- runif(n_obs)
f <- function(lc) {
  mean(time_event <= -log(u_cens) / lc) - target_event
}
lambda_c <- uniroot(f, c(1e-4, 1), tol = 1e-4)$root
time_cens <- -log(u_cens) / lambda_c

time <- pmin(time_event, time_cens)
status <- as.integer(time_event <= time_cens)
```

```{r bigmatrixbigmatrix, cache=TRUE, eval=LOCAL}
if (requireNamespace("bigmemory", quietly = TRUE)) {
  library(bigmemory)
  big_dir <- tempfile("bigPLScox-")
  dir.create(big_dir)
  bin_path  <- file.path(big_dir, "X.bin")
  desc_path <- file.path(big_dir, "X.desc")
  if (file.exists(bin_path)) unlink(bin_path)
  if (file.exists(desc_path)) unlink(desc_path)
  X_big <- bigmemory::as.big.matrix(
    X_dense,
    backingpath = big_dir,
    backingfile = "X.bin",
    descriptorfile = "X.desc"
  )
  X_big[1:6, 1:6]
}
```

# Dense-matrix solvers

The legacy solver `big_pls_cox()` performs the component extraction in R before fitting the Cox model. The new `big_pls_cox_fast()` backend exposes the same arguments but runs entirely in C++ for a substantial speed-up.

```{r dense-solvers, cache=TRUE, eval=LOCAL}
fit_legacy <- big_pls_cox(
  X = X_dense,
  time = time,
  status = status,
  ncomp = k_true
)
fit_fast_dense <- big_pls_cox_fast(
  X = X_dense,
  time = time,
  status = status,
  ncomp = k_true
)
list(
  legacy = head(fit_legacy$scores),
  fast = head(fit_fast_dense$scores)
)
list(
  legacy = apply(fit_legacy$scores, 2, var),
  fast = apply(fit_fast_dense$scores, 2, var)
)
```

The `summary()` method reports key diagnostics, including the final Cox coefficients and the number of predictors retained per component.

```{r dense-summary, cache=TRUE, eval=LOCAL}
summary(fit_fast_dense)
```

# Switching to file-backed matrices

When working with a `big.matrix`, the same function operates on the shared memory pointer without copying data back to R. This is ideal for datasets that do not fit entirely in RAM.

```{r fast-big, cache=TRUE, eval=LOCAL}
if (exists("X_big")) {
  fit_fast_big <- big_pls_cox_fast(
    X = X_big,
    time = time,
    status = status,
    ncomp = k_true
  )
  summary(fit_fast_big)
}
```

The resulting scores, loadings, and centring parameters mirror the dense fit, which simplifies debugging and incremental model building; the quick check below makes this claim concrete.
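As a sanity check, the sketch below compares the dense and file-backed fits directly. It relies only on the `$scores` matrices used above; because PLS components are defined up to sign, it looks at absolute column-wise correlations rather than exact equality.

```{r check-mirror, cache=TRUE, eval=LOCAL}
if (exists("fit_fast_dense") && exists("fit_fast_big")) {
  # Absolute column-wise correlations: values near 1 indicate matching components
  diag(abs(cor(fit_fast_dense$scores, fit_fast_big$scores)))
}
```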
# Gradient descent for streaming data

The SGD solver trades a small amount of accuracy for drastically reduced memory usage. Provide the same `big.matrix` object along with the desired number of components.

```{r gd-fit, cache=TRUE, eval=LOCAL}
if (exists("X_big")) {
  fit_gd <- big_pls_cox_gd(
    X = X_big,
    time = time,
    status = status,
    ncomp = k_true,
    max_iter = 2000,
    learning_rate = 0.05,
    tol = 1e-8
  )
  str(fit_gd)
}
```

# Comparing the latent subspaces

Component bases are unique only up to orthogonal rotations. Comparing the linear predictors generated by each solver verifies that they span the same subspace.

```{r compare-solvers, cache=TRUE, eval=LOCAL}
if (exists("fit_fast_dense") && exists("fit_fast_big") && exists("fit_gd")) {
  eta_fast_dense <- drop(fit_fast_dense$scores %*% fit_fast_dense$cox_fit$coefficients)
  eta_fast_big <- drop(fit_fast_big$scores %*% fit_fast_big$cox_fit$coefficients)
  eta_gd <- drop(fit_gd$scores %*% fit_gd$cox_fit$coefficients)
  list(
    correlation_fast_dense_big = cor(eta_fast_dense, eta_fast_big),
    correlation_fast_dense_gd = cor(eta_fast_dense, eta_gd),
    concordance = c(
      fast_dense = survival::concordance(survival::Surv(time, status) ~ eta_fast_dense)$concordance,
      fast_big = survival::concordance(survival::Surv(time, status) ~ eta_fast_big)$concordance,
      gd = survival::concordance(survival::Surv(time, status) ~ eta_gd)$concordance
    )
  )
}
```

# Predictions on new data

Use `predict()` with `type = "components"` to obtain latent scores for new observations and `type = "risk"` for the corresponding risk scores. Supplying explicit centring and scaling parameters ensures consistent projections across solvers; a manual projection is sketched after the chunk below.

```{r predict-new, cache=TRUE, eval=LOCAL}
X_new <- matrix(rnorm(10 * n_pred), nrow = 10)
scores_new <- predict(
  fit_fast_dense,
  newdata = X_new,
  type = "components"
)
risk_new <- predict(
  fit_fast_dense,
  newdata = X_new,
  type = "risk"
)
list(scores = scores_new, risk = risk_new)
```
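For transparency, the following sketch reproduces the projection by hand. It assumes, hypothetically, that the fitted object stores its centring vector, scaling vector, and loadings under the names `center`, `scale`, and `loadings`; inspect `str(fit_fast_dense)` to confirm the actual element names before relying on it.

```{r manual-projection, cache=TRUE, eval=LOCAL}
if (exists("fit_fast_dense") &&
    all(c("center", "scale", "loadings") %in% names(fit_fast_dense))) {
  # Centre and scale the new data with the training parameters, then project.
  # Depending on the backend, the projection may instead go through the PLS
  # weight matrix; a large discrepancy below would point to that convention.
  X_new_std <- scale(X_new,
                     center = fit_fast_dense$center,
                     scale  = fit_fast_dense$scale)
  scores_manual <- X_new_std %*% fit_fast_dense$loadings
  max(abs(scores_manual - as.matrix(scores_new)))
}
```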
# Timing snapshot

The `bench` package provides lightweight instrumentation for comparing solvers. The chunk below contrasts the classical implementation, the fast backend, and the SGD routine on the simulated dataset.

```{r timing, cache=TRUE, eval=LOCAL}
if (requireNamespace("bench", quietly = TRUE) && exists("X_big")) {
  bench_res <- bench::mark(
    big_pls = big_pls_cox(X_dense, time, status, ncomp = k_true),
    fast_dense = big_pls_cox_fast(X_dense, time, status, ncomp = k_true),
    fast_big = big_pls_cox_fast(X_big, time, status, ncomp = k_true),
    gd = big_pls_cox_gd(X_big, time, status, ncomp = k_true, max_iter = 1500),
    iterations = 30,
    check = FALSE
  )
  bench_res$expression <- names(bench_res$expression)
  bench_res[, c("expression", "median", "itr/sec", "mem_alloc")]
}
```

```{r timing-plot, cache=TRUE, eval=LOCAL}
if (exists("bench_res")) {
  plot(bench_res, type = "jitter")
}
```

# Cleaning up backing files

Temporary backing files can be removed once the analysis is complete. In production pipelines you will typically keep the descriptor (`.desc`) file alongside the binary matrix so the data can be re-attached later; a minimal re-attachment sketch is given at the end of this vignette.

```{r cleanup, cache=TRUE, eval=LOCAL}
if (exists("X_big")) {
  rm(X_big)
  file.remove(file.path(big_dir, "X.bin"))
  file.remove(file.path(big_dir, "X.desc"))
  unlink(big_dir, recursive = TRUE)
}
```

# Further resources

* The introductory vignette demonstrates how to combine the fast backend with the matrix-based modelling functions and deviance residual utilities.
* `vignette("bigPLScox-benchmarking", package = "bigPLScox")` provides a reproducible benchmark that includes `survival::coxph()` and `coxgpls()`.
* The help pages (`?big_pls_cox_fast`, `?big_pls_cox_gd`) describe all tuning parameters in detail, including `keepX` for component-wise sparsity.
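# Re-attaching a stored matrix

If the backing files from the cleanup section are kept rather than deleted, a later R session can re-open the matrix without reloading the original data. The sketch below is not evaluated here (the files were removed above); it assumes the `X.bin`/`X.desc` pair still exists in `big_dir` and uses `bigmemory::attach.big.matrix()`.

```{r reattach, eval=FALSE}
# Re-attach the file-backed matrix from its descriptor (no data copied into RAM).
# If the backing file is not found, see ?attach.big.matrix for how to point
# the function at the directory that holds it.
library(bigmemory)
X_big <- attach.big.matrix(file.path(big_dir, "X.desc"))
dim(X_big)
X_big[1:3, 1:3]
```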