| Title: | Partial Least Squares Regression Models with Big Matrices |
|---|---|
| Description: | Fast partial least squares (PLS) for dense and out-of-core data. Provides SIMPLS (straightforward implementation of a statistically inspired modification of the PLS method) and NIPALS (non-linear iterative partial least-squares) solvers, plus kernel-style PLS variants ('kernelpls' and 'widekernelpls') with parity to 'pls'. Optimized for 'bigmemory'-backed matrices with streamed cross-products and chunked BLAS (Basic Linear Algebra Subprograms) operations (XtX/XtY and XXt/YX), optional file-backed score sinks, and deterministic testing helpers. Includes an auto-selection strategy that chooses between XtX SIMPLS, XXt (wide) SIMPLS, and NIPALS based on (n, p) and a configurable memory budget. About the package, Bertrand and Maumy (2023) <https://hal.science/hal-05352069> and <https://hal.science/hal-05352061> highlighted fitting and cross-validating PLS regression models to big data. For more details about some of the techniques featured in the package, see Dayal and MacGregor (1997) <doi:10.1002/(SICI)1099-128X(199701)11:1%3C73::AID-CEM435%3E3.0.CO;2-%23>, Rosipal & Trejo (2001) <https://www.jmlr.org/papers/v2/rosipal01a.html>, Tenenhaus, Viennet, and Saporta (2007) <doi:10.1016/j.csda.2007.01.004>, Rosipal (2004) <doi:10.1007/978-3-540-45167-9_17>, Rosipal (2019) <https://ieeexplore.ieee.org/document/8616346>, Song, Wang, and Bai (2024) <doi:10.1016/j.chemolab.2024.105238>. Includes kernel logistic PLS with 'C++'-accelerated alternating iteratively reweighted least squares (IRLS) updates, streamed reproducing kernel Hilbert space (RKHS) solvers with reusable centering statistics, and bootstrap diagnostics with graphical summaries for coefficients, scores, and cross-validation workflows, alongside dedicated plotting utilities for individuals, variables, ellipses, and biplots. The streaming backend uses far less memory than the dense in-memory solvers and keeps memory bounded across data sizes. For PLS1, streaming is often fast enough while preserving a small memory footprint; for PLS2 it remains competitive with a bounded footprint. On small problems that fit comfortably in RAM (random-access memory), dense in-memory solvers are slightly faster; the crossover occurs as n or p grow and the Gram/cross-product cost dominates. |
| Authors: | Frederic Bertrand [cre, aut] |
| Maintainer: | Frederic Bertrand <[email protected]> |
| License: | GPL-3 |
| Version: | 0.7.2 |
| Built: | 2026-05-31 08:35:42 UTC |
| Source: | https://github.com/fbertran/bigplsr |
Provides partial least squares regression for big data. It allows for missing data in the explanatory variables and supports repeated k-fold cross-validation of such models using various criteria. Bootstrap confidence intervals are also available.
Maintainer: Frederic Bertrand
Authors:
Myriam Maumy
Maumy, M., Bertrand, F. (2023). PLS models and their extension for big data. Joint Statistical Meetings (JSM 2023), Toronto, ON, Canada.
Maumy, M., Bertrand, F. (2023). bigPLS: Fitting and cross-validating PLS-based Cox models to censored big data. BioC2023 — The Bioconductor Annual Conference, Dana-Farber Cancer Institute, Boston, MA, USA. Poster. https://doi.org/10.7490/f1000research.1119546.1
Useful links:
Report bugs at https://github.com/fbertran/bigPLSR/issues

    set.seed(123)
    X <- matrix(rnorm(60), nrow = 20)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r", algorithm = "simpls")
    head(pls_predict_response(fit, X, ncomp = 2))

Finalize pls objects

    .finalize_pls_fit(fit, algorithm)

| Argument | Description |
|---|---|
| fit | Fitted object |
| algorithm | Name of the algorithm used to fit the object |

The fit object with normalized naming and class attributes.
Compute the column means and grand mean of the kernel matrix
without materialising it in memory. The input design matrix must be stored as
a bigmemory::big.matrix (or descriptor), and the kernel is evaluated by
iterating over row/column chunks.

    bigPLSR_stream_kstats(
      Xbm, kernel, gamma, degree, coef0,
      chunk_rows = getOption("bigPLSR.predict.chunk_rows", 8192L),
      chunk_cols = getOption("bigPLSR.predict.chunk_cols", 8192L)
    )

| Argument | Description |
|---|---|
| Xbm | A |
| kernel | Kernel name passed to |
| gamma, degree, coef0 | Kernel hyper-parameters. |
| chunk_rows, chunk_cols | Numbers of rows/columns to process per chunk. |

A list with entries r (column means) and g
(grand mean) of the kernel matrix.
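
As a rough illustration of the chunked evaluation described above, the following base R sketch accumulates the column sums of a kernel matrix block by block and never holds more than one block at a time. It is not the package's implementation: the helper names `rbf_block()` and `stream_kstats_sketch()` are hypothetical, an ordinary matrix stands in for the big.matrix, and the RBF kernel is assumed to be `exp(-gamma * ||x - y||^2)`.

```r
# Sketch only: base R stand-in for chunked kernel statistics (RBF kernel assumed).
rbf_block <- function(A, B, gamma) {
  # Squared Euclidean distances between the rows of A and the rows of B
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * tcrossprod(A, B)
  exp(-gamma * pmax(d2, 0))
}

stream_kstats_sketch <- function(X, gamma, chunk_rows = 64L, chunk_cols = 64L) {
  n <- nrow(X)
  col_sums <- numeric(n)
  for (j0 in seq(1L, n, by = chunk_cols)) {
    jj <- j0:min(j0 + chunk_cols - 1L, n)
    for (i0 in seq(1L, n, by = chunk_rows)) {
      ii <- i0:min(i0 + chunk_rows - 1L, n)
      # Only this block of the kernel matrix exists in memory
      K_block <- rbf_block(X[ii, , drop = FALSE], X[jj, , drop = FALSE], gamma)
      col_sums[jj] <- col_sums[jj] + colSums(K_block)
    }
  }
  r <- col_sums / n            # column means
  list(r = r, g = mean(r))     # grand mean = mean of the column means
}

set.seed(1)
X <- matrix(rnorm(150 * 4), 150, 4)
ks <- stream_kstats_sketch(X, gamma = 0.5, chunk_rows = 40L, chunk_cols = 32L)
K_full <- rbf_block(X, X, 0.5)  # materialised here only to check the sketch
```

On this small example `ks$r` and `ks$g` can be compared with `colMeans(K_full)` and `mean(K_full)`; for large n only the chunked path is practical.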
Fast IRLS for binomial logit with class weights

    cpp_irls_binomial(TT, ybin, w_class = NULL, maxit = 50L, tol = 1e-08)

| Argument | Description |
|---|---|
| TT | n x A numeric matrix of latent scores (no intercept column) |
| ybin | integer vector of {0,1} labels (length n) |
| w_class | optional length-2 numeric vector: weights for classes c(w0, w1) |
| maxit | max IRLS iterations |
| tol | relative tolerance on parameter change |

list(beta = A-vector, b = scalar intercept, fitted = n-vector, iter = integer, converged = logical)
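
For readers unfamiliar with IRLS, the base R sketch below shows one way a class-weighted logistic fit on latent scores could be iterated. It mirrors the argument and return names documented above but is not the C++ routine: `irls_binomial_sketch()` is a hypothetical name, and the variance guard and the exact convergence rule are assumptions.

```r
# Sketch only: class-weighted IRLS for a binomial logit model in base R.
irls_binomial_sketch <- function(TT, ybin, w_class = NULL, maxit = 50L, tol = 1e-8) {
  Z <- cbind(1, TT)                        # intercept column added internally
  cw <- if (is.null(w_class)) rep(1, length(ybin)) else w_class[ybin + 1L]
  theta <- numeric(ncol(Z))
  converged <- FALSE
  for (iter in seq_len(maxit)) {
    eta <- drop(Z %*% theta)
    mu <- 1 / (1 + exp(-eta))
    v <- pmax(mu * (1 - mu), 1e-10)        # guard against zero variance (assumption)
    W <- cw * v                            # IRLS weights times class weights
    z <- eta + (ybin - mu) / v             # working response
    theta_new <- drop(solve(crossprod(Z, W * Z), crossprod(Z, W * z)))
    delta <- sqrt(sum((theta_new - theta)^2)) / max(1, sqrt(sum(theta^2)))
    theta <- theta_new
    if (delta < tol) { converged <- TRUE; break }
  }
  list(beta = theta[-1], b = theta[1],
       fitted = drop(1 / (1 + exp(-Z %*% theta))),
       iter = iter, converged = converged)
}

set.seed(42)
n <- 200
TT <- matrix(rnorm(n * 2), n, 2)
ybin <- rbinom(n, 1, plogis(0.5 + TT %*% c(1, -1)))
fit_irls <- irls_binomial_sketch(TT, ybin)
fit_glm <- glm(ybin ~ TT, family = binomial())   # unweighted reference fit
cbind(irls = c(fit_irls$b, fit_irls$beta), glm = coef(fit_glm))
```

Without class weights the sketch should agree with `glm()` up to convergence tolerance; passing `w_class = c(w0, w1)` rescales each observation's contribution by the weight of its class.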
Internal kernel and wide-kernel PLS solver

    cpp_kernel_pls(X, Y, ncomp, tol, wide)

| Argument | Description |
|---|---|
| X | Centered design matrix. |
| Y | Centered response matrix. |
| ncomp | Maximum number of components. |
| tol | Numerical tolerance. |
| wide | Whether to use the wide-kernel update. |

A list containing the kernel PLS factors.
Pre-computed runtime comparisons between bigPLSR (dense and big.memory backends) and reference implementations from the pls and mixOmics packages.

    data(external_pls_benchmarks)

A data frame with 384 rows and 11 columns:
Character vector identifying the task ("pls1" or "pls2").
PLS algorithm used for the benchmark (e.g., "simpls").
Package providing the implementation.
Median execution time in seconds.
Iterations per second recorded by bench::mark().
Memory usage in bytes recorded by bench::mark().
Number of observations in the simulated dataset.
Number of predictors (X) in the simulated dataset.
Number of responses (Y) in the simulated dataset.
Number of extracted components.
Helpful context on dependencies or configuration.
Fix task = "pls1" and select algorithms in "kernelpls",
"nipals" or "simpls" to get a full factorial design.
Fix task = "pls1" and fix algorithm = "widekernelpls" to get a
full factorial design.
Fix task = "pls2" and select algorithms in "kernelpls",
"nipals" or "simpls" to get a full factorial design.
Fix task = "pls2" and fix algorithm = "widekernelpls" to get a
full factorial design.
Generated via inst/scripts/external_pls_benchmarks.R.

    data("external_pls_benchmarks", package = "bigPLSR")

    sub_pls1 <- subset(external_pls_benchmarks, task == "pls1" & algorithm != "widekernelpls")
    sub_pls1$n <- factor(sub_pls1$n)
    sub_pls1$p <- factor(sub_pls1$p)
    sub_pls1$q <- factor(sub_pls1$q)
    sub_pls1$ncomp <- factor(sub_pls1$ncomp)
    if (exists("replications")) replications(~ package + algorithm + task + n + p + ncomp, data = sub_pls1)

    sub_pls1_wide <- subset(external_pls_benchmarks, task == "pls1" & algorithm == "widekernelpls")
    sub_pls1_wide$n <- factor(sub_pls1_wide$n)
    sub_pls1_wide$p <- factor(sub_pls1_wide$p)
    sub_pls1_wide$q <- factor(sub_pls1_wide$q)
    sub_pls1_wide$ncomp <- factor(sub_pls1_wide$ncomp)
    if (exists("replications")) replications(~ package + algorithm + task + n + p + ncomp, data = sub_pls1_wide)

    sub_pls2 <- subset(external_pls_benchmarks, task == "pls2" & algorithm != "widekernelpls")
    sub_pls2$n <- factor(sub_pls2$n)
    sub_pls2$p <- factor(sub_pls2$p)
    sub_pls2$q <- factor(sub_pls2$q)
    sub_pls2$ncomp <- factor(sub_pls2$ncomp)
    if (exists("replications")) replications(~ package + algorithm + task + n + p + ncomp, data = sub_pls2)

    sub_pls2_wide <- subset(external_pls_benchmarks, task == "pls2" & algorithm == "widekernelpls")
    sub_pls2_wide$n <- factor(sub_pls2_wide$n)
    sub_pls2_wide$p <- factor(sub_pls2_wide$p)
    sub_pls2_wide$q <- factor(sub_pls2_wide$q)
    sub_pls2_wide$ncomp <- factor(sub_pls2_wide$ncomp)
    if (exists("replications")) replications(~ package + algorithm + task + n + p + ncomp, data = sub_pls2_wide)

Converts the accumulated KF-PLS state into a SIMPLS-equivalent fitted
model (using the current sufficient statistics). The result is compatible
with predict.big_plsr().

    kf_pls_state_fit(state, tol = 1e-08)

| Argument | Description |
|---|---|
| state | External pointer created by |
| tol | Numeric tolerance for the inner SIMPLS step. |

A list with PLS factors and coefficients, classed as big_plsr.

    n <- 200; p <- 30; m <- 2; A <- 3
    X <- matrix(rnorm(n*p), n, p)
    Y <- X[,1:2] %*% matrix(c(0.7, -0.3, 0.2, 0.9), 2, m) + matrix(rnorm(n*m, sd=0.2), n, m)
    state <- kf_pls_state_new(p, m, A, lambda = 0.99, q_proc = 1e-6)
    # stream in mini-batches
    bs <- 64
    for (i in seq(1, n, by = bs)) {
      idx <- i:min(i+bs-1, n)
      kf_pls_state_update(state, X[idx, , drop=FALSE], Y[idx, , drop=FALSE])
    }
    fit <- kf_pls_state_fit(state)  # returns a big_plsr-compatible list
    # predict via your existing predict.big_plsr (linear case)
    Yhat <- cbind(1, scale(X, center = fit$x_means, scale = FALSE)) %*%
      rbind(fit$intercept, fit$coefficients)

Create a persistent Kalman–filter PLS (KF-PLS) state that accumulates
cross-products from streaming mini-batches and later produces a
big_plsr-compatible fit via kf_pls_state_fit().

    kf_pls_state_new(p, m, ncomp, lambda = 0.99, q_proc = 0, r_meas = 0)

| Argument | Description |
|---|---|
| p | Integer, number of predictors (columns of |
| m | Integer, number of responses (columns of |
| ncomp | Integer, number of latent components to extract at fit time. |
| lambda | Numeric in (0,1], forgetting factor (closer to 1 = slower decay). |
| q_proc | Non-negative numeric, process-noise magnitude (adds a ridge to |
| r_meas | Reserved measurement-noise parameter (not used by the minimal API yet; kept for forward compatibility). |

The state maintains exponentially weighted cross-moments
with forgetting factor lambda.
When lambda >= 0.999999 and q_proc == 0, the backend switches to an
exact accumulation mode that matches concatenating all chunks (no decay).
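
To make the role of the forgetting factor concrete, here is a minimal base R sketch of exponentially weighted cross-product accumulation. Everything in it is illustrative: `ew_state_new()`, `ew_state_update()` and the fields `Cxx` and `Cxy` are hypothetical names, the decay is assumed to be applied once per chunk, and the running means that the real state also tracks are left out.

```r
# Sketch only: exponentially weighted cross-products, one decay step per chunk.
ew_state_new <- function(p, m, lambda = 1) {
  list(Cxx = matrix(0, p, p), Cxy = matrix(0, p, m), lambda = lambda)
}

ew_state_update <- function(state, X_chunk, Y_chunk) {
  # Down-weight what has been seen so far, then add the new chunk
  state$Cxx <- state$lambda * state$Cxx + crossprod(X_chunk)
  state$Cxy <- state$lambda * state$Cxy + crossprod(X_chunk, Y_chunk)
  state
}

set.seed(1)
X <- matrix(rnorm(120 * 5), 120, 5)
Y <- matrix(rnorm(120 * 2), 120, 2)
st <- ew_state_new(p = 5, m = 2, lambda = 1)   # lambda = 1: no decay
for (i in seq(1, nrow(X), by = 50)) {
  idx <- i:min(i + 49, nrow(X))
  st <- ew_state_update(st, X[idx, , drop = FALSE], Y[idx, , drop = FALSE])
}
```

With `lambda = 1` the accumulated matrices equal `crossprod(X)` and `crossprod(X, Y)` of the concatenated data, which is the behaviour the exact mode is described as reproducing. Unlike the package's external pointer, this sketch returns a new list rather than updating in place.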
An external pointer to an internal KF-PLS state (opaque object) that
you pass to kf_pls_state_update() and then to
kf_pls_state_fit() to produce model coefficients.
kf_pls_state_update(), kf_pls_state_fit(), pls_fit()
(use algorithm = "kf_pls" for the one-shot dense path).

    set.seed(1)
    n <- 1000; p <- 50; m <- 2
    X1 <- matrix(rnorm(n/2 * p), n/2, p)
    X2 <- matrix(rnorm(n/2 * p), n/2, p)
    B <- matrix(rnorm(p*m), p, m)
    Y1 <- scale(X1, TRUE, FALSE) %*% B + 0.05*matrix(rnorm(n/2*m), n/2, m)
    Y2 <- scale(X2, TRUE, FALSE) %*% B + 0.05*matrix(rnorm(n/2*m), n/2, m)
    st <- kf_pls_state_new(p, m, ncomp = 4, lambda = 0.99, q_proc = 1e-6)
    kf_pls_state_update(st, X1, Y1)
    kf_pls_state_update(st, X2, Y2)
    fit <- kf_pls_state_fit(st)  # returns a big_plsr-compatible list
    preds <- predict(bigPLSR::.finalize_pls_fit(fit, "kf_pls"), rbind(X1, X2))
    head(preds)

Feed one chunk (X_chunk, Y_chunk) to an existing KF-PLS state created by
kf_pls_state_new(). The function updates exponentially weighted means and
cross-products (or exact sufficient statistics when in exact mode).

    kf_pls_state_update(state, X_chunk, Y_chunk)

| Argument | Description |
|---|---|
| state | External pointer produced by |
| X_chunk | Numeric matrix with the same number of columns |
| Y_chunk | Numeric matrix with |

Call this repeatedly for each incoming batch. When you want model
coefficients (weights/loadings/intercepts), call
kf_pls_state_fit(), which solves SIMPLS on the accumulated
cross-moments without re-materializing all past data.
Invisibly returns state, updated in place.
kf_pls_state_new(), kf_pls_state_fit()
PLS biplot

    plot_pls_biplot(
      object, comps = c(1L, 2L), scale_variables = 1,
      circle = TRUE, circle_col = "grey85", arrow_col = "firebrick",
      groups = NULL, ellipse = TRUE, ellipse_level = 0.95, ellipse_n = 200L,
      group_col = NULL, ...
    )

| Argument | Description |
|---|---|
| object | A fitted PLS model with scores and loadings. |
| comps | Components to display. |
| scale_variables | Scaling factor applied to variable loadings. |
| circle | Logical; draw a unit circle behind loadings. |
| circle_col | Colour of the unit circle guide. |
| arrow_col | Colour for loading arrows. |
| groups | Optional factor or character vector defining groups for individuals. When supplied, group-specific colours are used and, if |
| ellipse | Logical; draw group confidence ellipses when |
| ellipse_level | Confidence level for group ellipses (between 0 and 1). |
| ellipse_n | Number of points used to draw each ellipse. |
| group_col | Optional vector of colours for the groups. Recycled as needed. |
| ... | Additional arguments passed to |

Invisibly returns NULL after drawing the biplot.

    set.seed(123)
    X <- matrix(rnorm(60), nrow = 20)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    plot_pls_biplot(fit)

Boxplots of bootstrap coefficient distributions

    plot_pls_bootstrap_coefficients(boot_result, responses = NULL, variables = NULL, ...)

| Argument | Description |
|---|---|
| boot_result | Result returned by |
| responses | Optional character vector selecting response columns. |
| variables | Optional character vector selecting predictor variables. |
| ... | Additional arguments passed to |

Invisibly returns NULL after drawing the boxplots.
Visualise the variability of latent scores obtained through
pls_bootstrap() when return_scores = TRUE.

    plot_pls_bootstrap_scores(boot_result, components = NULL, observations = NULL, ...)

| Argument | Description |
|---|---|
| boot_result | Result returned by |
| components | Optional vector of component indices or names to include. |
| observations | Optional vector of observation indices or names to include. |
| ... | Additional arguments passed to |

Invisibly returns NULL after drawing the boxplots.
Plot individual scores

    plot_pls_individuals(
      object, comps = c(1L, 2L), labels = NULL, groups = NULL,
      ellipse = TRUE, ellipse_level = 0.95, ellipse_n = 200L,
      group_col = NULL, ...
    )

| Argument | Description |
|---|---|
| object | A fitted PLS model with scores. |
| comps | Components to plot (length two). |
| labels | Optional character vector of point labels. |
| groups | Optional factor or character vector defining groups for individuals. When supplied, group-specific colours are used and, if |
| ellipse | Logical; draw group confidence ellipses when |
| ellipse_level | Confidence level for the ellipses (between 0 and 1). |
| ellipse_n | Number of points used to draw each ellipse. |
| group_col | Optional vector of colours for the groups. Recycled as needed. |
| ... | Additional plotting parameters passed to |

Invisibly returns NULL after drawing the plot.

    set.seed(123)
    X <- matrix(rnorm(60), nrow = 20)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    plot_pls_individuals(fit)

Plot variable loadings

    plot_pls_variables(
      object, comps = c(1L, 2L), circle = TRUE, circle_col = "grey80",
      arrow_col = "steelblue", arrow_scale = 1, ...
    )

| Argument | Description |
|---|---|
| object | A fitted PLS model. |
| comps | Components to display (length two). |
| circle | Logical; draw the unit circle. |
| circle_col | Colour of the unit circle. |
| arrow_col | Colour of the variable arrows. |
| arrow_scale | Scaling applied to variable vectors. |
| ... | Additional plotting parameters passed to |

Invisibly returns NULL after drawing the plot.

    set.seed(123)
    X <- matrix(rnorm(60), nrow = 20)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    plot_pls_variables(fit)

Plot Variable Importance in Projection (VIP)

    plot_pls_vip(object, comps = NULL, threshold = 1, palette = c("#4575b4", "#d73027"), ...)

| Argument | Description |
|---|---|
| object | A fitted PLS model. |
| comps | Components to aggregate. Defaults to all available. |
| threshold | Optional threshold to highlight influential variables. |
| palette | Colour palette used for bars. |
| ... | Additional parameters passed to |

Invisibly returns the VIP scores used to create the bar plot.

    set.seed(123)
    X <- matrix(rnorm(40), nrow = 10)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(10, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    plot_pls_vip(fit)

Draw bootstrap replicates of a fitted PLS model, refitting on each resample.

    pls_bootstrap(
      X, Y, ncomp, R = 100L,
      algorithm = c("simpls", "nipals", "kernelpls", "widekernelpls"),
      backend = "arma", conf = 0.95, seed = NULL,
      type = c("xy", "xt"), parallel = c("none", "future"),
      future_seed = TRUE, return_scores = FALSE, ...
    )

| Argument | Description |
|---|---|
| X | Predictor matrix. |
| Y | Response matrix or vector. |
| ncomp | Number of components. |
| R | Number of bootstrap replications. |
| algorithm | Backend algorithm ("simpls", "nipals", "kernelpls" or "widekernelpls"). |
| backend | Backend argument passed to the fitting routine. |
| conf | Confidence level. |
| seed | Optional seed. |
| type | Character; bootstrap scheme, e.g. |
| parallel | Logical or character; if |
| future_seed | Logical or integer; forwarded to |
| return_scores | Logical; if |
| ... | Additional arguments forwarded to |

A list with bootstrap estimates and summaries.

    set.seed(123)
    X <- matrix(rnorm(60), nrow = 20)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
    pls_bootstrap(X, y, ncomp = 2, R = 20)

Cross-validate PLS models

    pls_cross_validate(
      X, Y, ncomp, folds = 5L, type = c("kfold", "loo"),
      algorithm = c("simpls", "nipals", "kernelpls", "widekernelpls"),
      backend = "arma", metrics = c("rmse", "mae", "r2"), seed = NULL,
      parallel = c("none", "future"), future_seed = TRUE, ...
    )

| Argument | Description |
|---|---|
| X | Predictor matrix as accepted by |
| Y | Response matrix or vector as accepted by |
| ncomp | Integer; components grid to evaluate. |
| folds | Number of folds (ignored when |
| type | Either "kfold" (default) or "loo". |
| algorithm | Backend algorithm: "simpls", "nipals", "kernelpls" or "widekernelpls". |
| backend | Backend passed to |
| metrics | Metrics to compute (subset of "rmse", "mae", "r2"). |
| seed | Optional seed for reproducibility. |
| parallel | Logical or character; same semantics as in |
| future_seed | Logical or integer; reproducible seeds for parallel evaluation. |
| ... | Passed to |

A list containing per-fold metrics and their summary across folds.

    set.seed(123)
    X <- matrix(rnorm(60), nrow = 20)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
    pls_cross_validate(X, y, ncomp = 2, folds = 3)

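
The shape of a k-fold evaluation (per-fold metrics plus a summary across folds) can be sketched in base R as follows. This is not how `pls_cross_validate()` is implemented: `kfold_rmse_sketch()` is a hypothetical helper, ordinary least squares via `lm.fit()` stands in for the PLS fit, and only RMSE is computed.

```r
# Sketch only: k-fold cross-validation with an OLS stand-in for the PLS model.
kfold_rmse_sketch <- function(X, y, folds = 3L, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(X)
  fold_id <- sample(rep_len(seq_len(folds), n))   # balanced random fold labels
  per_fold <- vapply(seq_len(folds), function(k) {
    test <- fold_id == k
    fit <- lm.fit(cbind(1, X[!test, , drop = FALSE]), y[!test])
    pred <- drop(cbind(1, X[test, , drop = FALSE]) %*% fit$coefficients)
    sqrt(mean((y[test] - pred)^2))                # held-out RMSE for fold k
  }, numeric(1))
  list(per_fold = per_fold,
       summary = c(mean = mean(per_fold), sd = sd(per_fold)))
}

set.seed(123)
X <- matrix(rnorm(60), nrow = 20)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
cv_sketch <- kfold_rmse_sketch(X, y, folds = 3, seed = 123)
cv_sketch$summary
```

Every observation is held out exactly once, so the fold-level errors are computed on data the corresponding fit has not seen.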
Select components from cross-validation results

    pls_cv_select(cv_result, metric = c("rmse", "mae", "r2"), minimise = NULL)

| Argument | Description |
|---|---|
| cv_result | Result returned by |
| metric | Metric to optimise. |
| minimise | Logical; whether the metric should be minimised. |

Selected number of components.

    set.seed(123)
    X <- matrix(rnorm(60), nrow = 20)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
    cv <- pls_cross_validate(X, y, ncomp = 2, folds = 3)
    pls_cv_select(cv, metric = "rmse")

Dispatches to a dense (Arma/BLAS) backend for in-memory matrices or to a streaming big.matrix backend when X (or Y) is a big.matrix. The algorithm can be one of "simpls" (default), "nipals", "kernelpls", "widekernelpls", "rkhs" (Rosipal & Trejo), "klogitpls", "sparse_kpls", "rkhs_xy" (double RKHS), and "kf_pls" (Kalman-filter PLS, streaming).
The "kernelpls" paths now include a streaming XX'
variant for big.matrix inputs, with an optional row-chunking loop
controlled by chunk_cols.

    pls_fit(
      X, y, ncomp, tol = 1e-08,
      backend = c("auto", "arma", "bigmem"),
      mode = c("auto", "pls1", "pls2"),
      algorithm = c("auto", "simpls", "nipals", "kernelpls", "widekernelpls",
                    "rkhs", "klogitpls", "sparse_kpls", "rkhs_xy", "kf_pls"),
      scores = c("none", "r", "big"),
      chunk_size = 10000L, chunk_cols = NULL,
      scores_name = "scores",
      scores_target = c("auto", "new", "existing"),
      scores_bm = NULL, scores_backingfile = NULL, scores_backingpath = NULL,
      scores_descriptorfile = NULL, scores_colnames = NULL,
      return_scores_descriptor = FALSE, coef_threshold = NULL,
      kernel = c("linear", "rbf", "poly", "sigmoid"),
      gamma = 1, degree = 3L, coef0 = 0,
      approx = c("none", "nystrom", "rff"), approx_rank = NULL,
      class_weights = NULL
    )

| Argument | Description |
|---|---|
| X | numeric matrix or |
| y | numeric vector/matrix or |
| ncomp | number of latent components |
| tol | numeric tolerance used in the core solver |
| backend | one of |
| mode | one of |
| algorithm | one of |
| scores | one of |
| chunk_size | chunk size for the bigmem backend |
| chunk_cols | columns chunk size for the bigmem backend |
| scores_name | name for dense scores (or output big.matrix) |
| scores_target | one of |
| scores_bm | optional existing big.matrix or descriptor for scores |
| scores_backingfile | Character; file name for file-backed scores (when |
| scores_backingpath | Character; directory for the file-backed scores. Defaults to |
| scores_descriptorfile | Character; descriptor file name for the file-backed scores. |
| scores_colnames | optional character vector for score column names |
| return_scores_descriptor | logical; if TRUE and scores is big.matrix, add |
| coef_threshold | Optional non-negative value used to hard-threshold the fitted coefficients after model estimation. When supplied, absolute coefficients strictly below the threshold are set to zero via |
| kernel | kernel name for RKHS/KPLS ( |
| gamma | RBF/sigmoid/poly scale parameter |
| degree | polynomial degree |
| coef0 | polynomial/sigmoid bias |
| approx | kernel approximation: |
| approx_rank | rank (columns / features) for the approximation |
| class_weights | optional numeric weights for classes in |

a list with coefficients, intercept, weights, loadings, means,
and optionally $scores.

    set.seed(123)
    X <- matrix(rnorm(60), nrow = 20)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r", algorithm = "simpls")
    head(pls_predict_response(fit, X, ncomp = 2))

Compute information criteria for component selection

    pls_information_criteria(object, X, Y, max_comp = NULL)

| Argument | Description |
|---|---|
| object | A fitted PLS model. |
| X | Training design matrix. |
| Y | Training response matrix or vector. |
| max_comp | Maximum number of components to consider. |

A data frame with RSS, RMSE, AIC and BIC per component.

    set.seed(123)
    X <- matrix(rnorm(60), nrow = 20)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    pls_information_criteria(fit, X, y)

Predict responses from a PLS fit

    pls_predict_response(object, newdata, ncomp = NULL)

| Argument | Description |
|---|---|
| object | A fitted PLS model. |
| newdata | Predictor matrix for scoring. |
| ncomp | Number of components to use. |

A numeric matrix or vector of predictions.

    set.seed(123)
    X <- matrix(rnorm(40), nrow = 10)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(10, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    pls_predict_response(fit, X, ncomp = 2)

Predict latent scores from a PLS fit

    pls_predict_scores(object, newdata, ncomp = NULL)

| Argument | Description |
|---|---|
| object | A fitted PLS model. |
| newdata | Predictor matrix for scoring. |
| ncomp | Number of components to use. |

Matrix of component scores.

    set.seed(123)
    X <- matrix(rnorm(40), nrow = 10)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(10, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    pls_predict_scores(fit, X, ncomp = 2)

Component selection via information criteria

    pls_select_components(object, X, Y, criteria = c("aic", "bic"), max_comp = NULL)

| Argument | Description |
|---|---|
| object | A fitted PLS model. |
| X | Training design matrix. |
| Y | Training response matrix or vector. |
| criteria | Character vector specifying which criteria to compute. |
| max_comp | Maximum number of components to consider. |

A list with the per-component table and the selected components.

    set.seed(123)
    X <- matrix(rnorm(60), nrow = 20)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    pls_select_components(fit, X, y)

Naive sparsity control by coefficient thresholding

    pls_threshold(object, threshold)

| Argument | Description |
|---|---|
| object | A fitted PLS model. |
| threshold | Values below this absolute magnitude are set to zero. |

A modified copy of object with thresholded coefficients.

    set.seed(123)
    X <- matrix(rnorm(40), nrow = 10)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(10, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2)
    pls_threshold(fit, threshold = 0.05)

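
The thresholding rule itself is simple enough to show on a bare coefficient vector. The helper below is a hypothetical base R illustration, not the package function; it applies the "strictly below the threshold in absolute value" rule to made-up coefficients.

```r
# Sketch only: hard-thresholding of a coefficient vector or matrix.
threshold_coefficients <- function(beta, threshold) {
  beta[abs(beta) < threshold] <- 0   # strict inequality: values equal to the threshold are kept
  beta
}

beta <- c(x1 = 0.20, x2 = -0.03, x3 = 0.05, x4 = -0.70)  # illustrative values
threshold_coefficients(beta, threshold = 0.05)
```

Only `x2` is zeroed here, because `x3` sits exactly at the threshold rather than below it.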
Variable importance in projection (VIP) scores

    pls_vip(object, comps = NULL)

| Argument | Description |
|---|---|
| object | A fitted PLS model. |
| comps | Components used to compute the VIP scores. Defaults to all available components. |

A named numeric vector of VIP scores.

    set.seed(123)
    X <- matrix(rnorm(40), nrow = 10)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(10, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    pls_vip(fit)

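
For orientation, the usual textbook VIP definition for a single response can be reproduced with a small NIPALS fit in base R. The sketch below is independent of the package: `pls1_nipals_sketch()` and `vip_sketch()` are hypothetical helpers, and it is an assumption that `pls_vip()` uses this same definition.

```r
# Sketch only: minimal PLS1 (NIPALS) fit and the textbook VIP formula.
pls1_nipals_sketch <- function(X, y, ncomp) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  W <- matrix(0, ncol(Xc), ncomp)
  Tm <- matrix(0, nrow(Xc), ncomp)
  q <- numeric(ncomp)
  for (a in seq_len(ncomp)) {
    w <- drop(crossprod(Xc, yc))
    w <- w / sqrt(sum(w^2))                   # unit-norm weight vector
    tt <- drop(Xc %*% w)                      # score vector
    pl <- drop(crossprod(Xc, tt)) / sum(tt^2) # X loading
    q[a] <- sum(yc * tt) / sum(tt^2)          # y loading
    Xc <- Xc - tcrossprod(tt, pl)             # deflate X
    yc <- yc - q[a] * tt                      # deflate y
    W[, a] <- w
    Tm[, a] <- tt
  }
  list(weights = W, scores = Tm, y_loadings = q)
}

vip_sketch <- function(fit) {
  W <- fit$weights
  ss <- fit$y_loadings^2 * colSums(fit$scores^2)  # y sum of squares explained per component
  Wn <- sweep(W^2, 2, colSums(W^2), "/")          # squared, column-normalised weights
  sqrt(nrow(W) * drop(Wn %*% ss) / sum(ss))
}

set.seed(123)
X <- matrix(rnorm(40), nrow = 10)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(10, sd = 0.1)
vip <- vip_sketch(pls1_nipals_sketch(X, y, ncomp = 2))
vip
```

Under this definition the squared VIP scores sum to the number of predictors, which is why a threshold of 1 is the common rule of thumb for flagging influential variables.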
Predict method for big_plsr objects

    ## S3 method for class 'big_plsr'
    predict(
      object, newdata, ncomp = NULL,
      type = c("response", "scores", "prob", "class"),
      ...
    )

| Argument | Description |
|---|---|
| object | A fitted PLS model produced by |
| newdata | Matrix or |
| ncomp | Number of components to use for prediction. |
| type | One of "response" (default), "scores", "prob" or "class". |
| ... | Unused, for compatibility with the generic. |

Predicted responses or component scores.

    set.seed(123)
    X <- matrix(rnorm(40), nrow = 10)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(10, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    predict(fit, X, ncomp = 2)

Print a summary.big_plsr object

    ## S3 method for class 'summary.big_plsr'
    print(x, ...)

| Argument | Description |
|---|---|
| x | A |
| ... | Passed to lower-level print methods. |

x, invisibly.

    set.seed(123)
    X <- matrix(rnorm(40), nrow = 10)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(10, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    print(summary(fit))

Summarise bootstrap estimates

    summarise_pls_bootstrap(boot_result)

| Argument | Description |
|---|---|
| boot_result | Result returned by |

A data frame containing mean, standard deviation, percentile and BCa confidence intervals for each coefficient.
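
The percentile part of such a summary can be illustrated in base R. The sketch below is not the package's code: `summarise_boot_sketch()` is a hypothetical helper, the replicates come from a pairs bootstrap of ordinary least squares coefficients used as a stand-in for PLS coefficients, and BCa intervals are left out.

```r
# Sketch only: mean, standard deviation and percentile interval per coefficient.
summarise_boot_sketch <- function(boot_coefs, conf = 0.95) {
  alpha <- (1 - conf) / 2
  data.frame(
    variable = colnames(boot_coefs),
    mean = colMeans(boot_coefs),
    sd = apply(boot_coefs, 2, sd),
    lower = apply(boot_coefs, 2, quantile, probs = alpha),
    upper = apply(boot_coefs, 2, quantile, probs = 1 - alpha),
    row.names = NULL
  )
}

set.seed(123)
X <- matrix(rnorm(60), nrow = 20, dimnames = list(NULL, paste0("x", 1:3)))
y <- X[, 1] - 0.5 * X[, 2] + rnorm(20, sd = 0.1)
R <- 200
boot_coefs <- t(replicate(R, {
  idx <- sample(nrow(X), replace = TRUE)               # resample (x, y) pairs
  lm.fit(cbind(1, X[idx, ]), y[idx])$coefficients[-1]  # drop the intercept
}))
colnames(boot_coefs) <- colnames(X)
boot_summary <- summarise_boot_sketch(boot_coefs)
boot_summary
```

Each row describes one coefficient; a percentile interval that excludes zero suggests the corresponding predictor is stable across resamples.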
Summarize a big_plsr model

    ## S3 method for class 'big_plsr'
    summary(object, ..., X = NULL, Y = NULL)

| Argument | Description |
|---|---|
| object | A fitted PLS model. |
| ... | Unused. |
| X | Optional design matrix to recompute reconstruction metrics. |
| Y | Optional response matrix/vector. |

An object of class summary.big_plsr.

    set.seed(123)
    X <- matrix(rnorm(40), nrow = 10)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(10, sd = 0.1)
    fit <- pls_fit(X, y, ncomp = 2, scores = "r")
    summary(fit)
