The bigPLScox package now provides three distinct
engines for extracting PLS components in Cox regression: the original
matrix-based solver (big_pls_cox), the new Armadillo
backend (big_pls_cox_fast), and the stochastic gradient
descent routine (big_pls_cox_gd). This vignette shows how
to benchmark those implementations against coxgpls() and
the standard survival::coxph() fit.
We rely on the bench package to collect
timing and memory statistics. All chunks are wrapped in the
LOCAL flag so that CRAN builds can skip the heavier
computations.
We reuse the helper dataCox() to generate survival
outcomes with right censoring. Adjust n and p
to stress-test the solvers on larger problems.
set.seed(2024)
sim_design <- dataCox(
n = 2000,
lambda = 2,
rho = 1.5,
x = matrix(rnorm(2000 * 50), ncol = 50),
beta = c(1, 3, rep(0, 48)),
cens.rate = 5
)
cox_data <- list(
x = as.matrix(sim_design[, -(1:3)]),
time = sim_design$time,
status = sim_design$status
)
X_big <- bigmemory::as.big.matrix(cox_data$x)coxgpls() workflow and the dense variants of the big-memory
solvers against survival::coxph().big_pls_cox_fast() and big_pls_cox_gd()
directly on a big.matrix, illustrating the benefit of the
shared-memory interface.The number of iterations can be increased for more stable measurements when running locally.
bench_dense <- bench::mark(
coxgpls = coxgpls(
cox_data$x,
cox_data$time,
cox_data$status,
ncomp = 5
),
big_pls = big_pls_cox(cox_data$x, cox_data$time, cox_data$status, ncomp = 5),
big_pls_fast = big_pls_cox_fast(cox_data$x, cox_data$time, cox_data$status, ncomp = 5),
big_pls_bb = big_pls_cox_gd(X_big, cox_data$time, cox_data$status, ncomp = 5),
big_pls_gd = big_pls_cox_gd(X_big, cox_data$time, cox_data$status, ncomp = 5, method = "gd"),
coxph = coxph(Surv(cox_data$time, cox_data$status) ~ cox_data$x, ties = "breslow"),
iterations = 75,
check = FALSE
)
bench_dense$expression <- names(bench_dense$expression)
bench_dense[, c("expression", "median", "itr/sec", "mem_alloc")]
#> # A tibble: 6 × 4
#> expression median `itr/sec` mem_alloc
#> <chr> <bch:tm> <dbl> <bch:byt>
#> 1 coxgpls 40.7ms 24.2 31MB
#> 2 big_pls 314.2ms 3.15 4.2MB
#> 3 big_pls_fast 24.6ms 40.5 5.35MB
#> 4 big_pls_bb 869.1ms 1.15 5.55MB
#> 5 big_pls_gd 869.3ms 1.15 5.47MB
#> 6 coxph 60.8ms 16.4 13.33MBA higher itr/sec value indicates faster throughput. The
fast backend should provide the best compromise between speed and
accuracy in the dense setting.
The next experiment reuses the big.matrix version of the
predictors and adds the SGD solver. Because coxgpls()
requires an in-memory matrix, it is not included in this round.
bench_big <- bench::mark(
big_pls_fast = big_pls_cox_fast(X_big, cox_data$time, cox_data$status, ncomp = 5),
big_pls = big_pls_cox(cox_data$x, cox_data$time, cox_data$status, ncomp = 5),
big_pls_gd = big_pls_cox_gd(X_big, cox_data$time, cox_data$status, ncomp = 5, max_iter = 300),
iterations = 75,
check = FALSE
)
bench_big$expression <- names(bench_big$expression)
bench_big[, c("expression", "median", "itr/sec", "mem_alloc")]
#> # A tibble: 3 × 4
#> expression median `itr/sec` mem_alloc
#> <chr> <bch:tm> <dbl> <bch:byt>
#> 1 big_pls_fast 20.3ms 49.1 3.35MB
#> 2 big_pls 295.6ms 3.38 4.11MB
#> 3 big_pls_gd 144.2ms 6.92 5.44MBThe classical solver is still run on the dense matrix for reference,
but only the fast and SGD routines operate on the
big.matrix without copying. Expect the fast backend to
dominate runtime while SGD maintains scalability for out-of-memory
workflows.
bench includes autoplot() and
plot() helpers for visual inspection of the timing
distribution.
Additional geometries, such as ridge plots, are available via
autoplot(bench_res, type = "jitter").
To integrate the benchmark with automated pipelines, store the
results under the inst/benchmarks/ directory. Each run can
include meta-data such as dataset parameters, number of iterations, or
git commit identifiers.
if (!dir.exists("inst/benchmarks/results")) {
dir.create("inst/benchmarks/results", recursive = TRUE)
}
write.csv(bench_dense[, 1:9], file = "inst/benchmarks/results/benchmark-dense.csv", row.names = FALSE)
write.csv(bench_big[, 1:9], file = "inst/benchmarks/results/benchmark-big.csv", row.names = FALSE)The package also ships with standalone scripts under
inst/benchmarks/ that mirror this vignette while exposing
additional configuration points. Run them from the repository root
as:
Rscript inst/benchmarks/cox-benchmark.R
Rscript inst/benchmarks/benchmark_bigPLScox.R
Rscript inst/benchmarks/cox_pls_benchmark.REach script accepts environment variables to adjust the problem size
and stores results in inst/benchmarks/results/ with
time-stamped filenames.