| Title: | Principal Component Analysis for 'bigmemory' Matrices |
|---|---|
| Description: | High-performance principal component analysis routines that operate directly on bigmemory::big.matrix() objects. The package avoids materialising large matrices in memory by streaming data through 'BLAS' and 'LAPACK' kernels and provides helpers to derive scores, loadings, correlations, and contribution diagnostics, including utilities that stream results into 'bigmemory'-backed matrices for file-based workflows. Additional interfaces expose 'scalable' singular value decomposition, robust PCA, and robust SVD algorithms so that users can explore large matrices while tempering the influence of outliers. 'Scalable' principal component analysis is also implemented, following Elgamal, Yabandeh, Aboulnaga, Mustafa, and Hefeeda (2015) <doi:10.1145/2723372.2751520>. |
| Authors: | Frederic Bertrand [aut, cre] |
| Maintainer: | Frederic Bertrand <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.9.1 |
| Built: | 2026-05-24 07:01:16 UTC |
| Source: | https://github.com/fbertran/bigpcacpp |
The bigPCAcpp package provides high-performance principal component analysis
routines that work directly with bigmemory::big.matrix objects. Data are
streamed through BLAS and LAPACK kernels so large, file-backed matrices can be
analysed without materialising dense copies in R. Companion helpers compute
scores, loadings, correlations, and contributions, including streaming
variants that write results to bigmemory::big.matrix destinations used by
file-based pipelines.
pca_bigmatrix(), pca_stream_bigmatrix()
    library(bigmemory)
    mat <- as.big.matrix(matrix(rnorm(20), nrow = 5))
    result <- pca_bigmatrix(mat)
    result$sdev
A dataset summarising wall-clock performance for the main PCA entry points
in bigPCAcpp across matrices of increasing size. The benchmarks compare
the classical block-based decomposition, the streaming variant that writes
rotations as it progresses, the scalable stochastic PCA implementation, and
the base R stats::prcomp() routine for reference.
A data frame with 360 rows and 14 columns:
Human-readable size label ("small", "medium", "large", "xlarge").
Number of rows in the simulated matrix.
Number of columns in the simulated matrix.
Number of components requested.
Computation strategy ("classical", "streaming", "scalable", or "prcomp").
Replication index for repeated runs.
User CPU time in seconds returned by base::system.time().
System CPU time in seconds.
Elapsed (wall-clock) time in seconds.
Logical flag indicating whether the run completed without errors.
Name of the backend reported by the result object when the computation succeeded.
Recorded iteration count for iterative methods when
available (otherwise NA).
Logical convergence flag for iterative methods when available.
Error message captured for failed runs (otherwise NA).
Generated by scripts/run_benchmark.R using randomly simulated
in-memory matrices (no file-backed storage).
    data(benchmark_results)
Results returned by pca_bigmatrix(), pca_stream_bigmatrix(), and
pca_robust() inherit from the bigpca class. The objects store the
component standard deviations, rotation/loadings, and optional scores while
recording which computational backend produced them. Standard S3 generics
such as summary() and plot() are implemented for convenience.
bigpca objects are lists produced by pca_bigmatrix(),
pca_stream_bigmatrix(), pca_robust(), and related helpers. They mirror
the structure of base R's prcomp() outputs while tracking additional
metadata for large-scale and streaming computations.
See also pca_bigmatrix(), pca_stream_bigmatrix(), pca_robust(),
pca_plot_scree(), pca_plot_scores(), pca_plot_contributions(),
pca_plot_correlation_circle(), and pca_plot_biplot().
sdev
Numeric vector of component standard deviations.
rotation
Numeric matrix whose columns contain the variable loadings (principal axes).
center, scale
Optional numeric vectors describing the centring and scaling applied to each variable when fitting the model.
scores
Optional numeric matrix of principal component scores when computed alongside the decomposition.
column_sd
Numeric vector of marginal standard deviations for each input variable.
eigenvalues
Numeric vector of eigenvalues associated with the retained components.
explained_variance, cumulative_variance
Numeric vectors summarising the fraction of variance explained by individual components and the corresponding cumulative totals.
covariance
Sample covariance matrix used to derive the components.
nobs
Number of observations used in the decomposition.
The class also records the computation backend via
attr(x, "backend"), enabling downstream methods to adjust their
behaviour for streamed or robust results.
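The relationship between the variance-related elements listed above can be sketched in base R. This is a minimal illustration that uses stats::prcomp() as a stand-in for a bigpca fit; it assumes the fractions are normalised by the sum of the available eigenvalues, which this page does not state explicitly for the package.

```r
# Stand-in fit: prcomp() on a small simulated matrix (not a bigpca object)
set.seed(123)
x <- matrix(rnorm(40), nrow = 10)
fit <- prcomp(x, center = TRUE, scale. = FALSE)

# Component variances are the squared standard deviations
eigenvalues <- fit$sdev^2

# Fraction of variance per component and its running total
# (assumption: normalised by the sum of all eigenvalues)
explained_variance <- eigenvalues / sum(eigenvalues)
cumulative_variance <- cumsum(explained_variance)

round(cbind(eigenvalues, explained_variance, cumulative_variance), 3)
```

When only a subset of components is retained, the denominator matters: dividing by the total variance of the input and dividing by the sum of the retained eigenvalues give different fractions.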
pca_bigmatrix(), pca_stream_bigmatrix(), summary.bigpca(),
print.summary.bigpca(), plot.bigpca()
Perform principal component analysis (PCA) directly on a
bigmemory::big.matrix without copying the data into R memory. The
exported helpers mirror the structure of base R's prcomp() while avoiding
the need to materialise large matrices.
    resolve_big_pointer(x, arg, allow_null = FALSE)

    pca_scores_bigmatrix(
      xpMat,
      rotation,
      center,
      scale,
      ncomp = -1L,
      block_size = 1024L
    )

    pca_variable_loadings(rotation, sdev)

    pca_variable_correlations(rotation, sdev, column_sd, scale = NULL)

    pca_variable_contributions(loadings)

    pca_individual_contributions(scores, sdev, total_weight = NA_real_)

    pca_individual_cos2(scores)

    pca_variable_cos2(correlations)

    ## S3 method for class 'bigpca'
    summary(object, ...)

    ## S3 method for class 'summary.bigpca'
    print(x, digits = max(3, getOption("digits") - 3), ...)

    ## S3 method for class 'bigpca'
    plot(
      x,
      y,
      type = c("scree", "contributions", "correlation_circle", "biplot"),
      max_components = 25L,
      component = 1L,
      top_n = 20L,
      components = c(1L, 2L),
      data = NULL,
      draw = TRUE,
      ...
    )
x |
A |
arg |
Character string naming the argument being validated. Used to construct informative error messages. |
allow_null |
Logical flag indicating whether |
xpMat |
Either a |
rotation |
A rotation matrix such as the |
center |
For |
scale |
Optional numeric vector of scaling factors returned by
|
ncomp |
Number of components to retain. Use a non-positive value to keep all components returned by the decomposition. |
block_size |
Number of rows to process per block when streaming data through BLAS kernels. Larger values improve throughput at the cost of additional memory. |
sdev |
A numeric vector of component standard deviations, typically the
|
column_sd |
A numeric vector with the marginal standard deviation of
each original variable. When |
loadings |
A numeric matrix such as the result of
|
scores |
For |
total_weight |
Optional positive scalar giving the effective number of
observations to use when computing contributions. Defaults to the number of
rows in |
correlations |
For |
object |
A |
... |
Additional arguments passed to plotting helpers. |
digits |
Number of significant digits to display when printing importance metrics. |
y |
Currently unused. |
type |
The plot to draw. Options include "scree" (variance explained), "contributions" (top contributing variables), "correlation_circle" (variable correlations with selected components), and "biplot" (joint display of scores and loadings). |
max_components |
Maximum number of components to display in scree plots. |
component |
Component index to highlight when drawing contribution plots. |
top_n |
Number of variables to display in contribution plots. |
components |
Length-two integer vector selecting the components for correlation circle and biplot views. |
data |
Optional data source (matrix, data frame, |
draw |
Logical; if |
For pca_bigmatrix(), a bigpca object mirroring a prcomp result
with elements sdev, rotation, optional center and scale vectors,
column_sd, eigenvalues, explained_variance, cumulative_variance, and
the sample covariance matrix. The object participates in S3 generics such as
summary() and plot().
A numeric matrix of scores with rows corresponding to observations and columns to retained components.
A numeric matrix containing variable loadings for each component.
A numeric matrix of correlations between variables and components.
A numeric matrix where each entry represents the contribution of a variable to a component.
For summary.bigpca(), a summary.bigpca object containing
component importance measures.
pca_scores_bigmatrix(): Project observations into principal component
space while streaming from a big.matrix.
pca_variable_loadings(): Compute variable loadings (covariances between
original variables and components).
pca_variable_correlations(): Compute variable-component correlations given
column standard deviations.
pca_variable_contributions(): Derive the relative contribution of each variable
to the retained components.
pca_individual_contributions(): Compute the relative contribution of individual
observations to each component.
pca_individual_cos2(): Compute squared cosine values measuring the quality
of representation for individual observations.
pca_variable_cos2(): Compute squared cosine values measuring the quality
of representation for variables.
summary(bigpca): Summarise the component importance metrics for a
bigpca result.
print(summary.bigpca): Print the component importance summary produced by
summary.bigpca().
plot(bigpca): Visualise PCA diagnostics such as scree, correlation
circle, contribution, and biplot displays.
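The quantities computed by the variable-level helpers can be illustrated with textbook PCA definitions in base R. This is a sketch only: it uses stats::prcomp() rather than the package functions, it treats loadings as covariances with unit-variance components, and the package may apply a different normalisation (for example percentages for contributions).

```r
set.seed(123)
x <- matrix(rnorm(40), nrow = 10)
fit <- prcomp(x, center = TRUE, scale. = FALSE)

rotation <- unname(fit$rotation)
sdev <- fit$sdev
column_sd <- apply(x, 2, sd)

# Loadings: each rotation column scaled by its component standard deviation
loadings <- sweep(rotation, 2, sdev, `*`)

# Correlations: loadings divided by the standard deviation of each variable
correlations <- sweep(loadings, 1, column_sd, `/`)

# Contributions: share of each variable in a component (columns sum to 1)
contributions <- sweep(loadings^2, 2, colSums(loadings^2), `/`)

# Squared cosines for variables: squared correlations
variable_cos2 <- correlations^2
```

With all components retained, the correlations computed this way agree with cor(x, fit$x), and the squared cosines of each variable sum to one.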
bigpca, pca_scores_bigmatrix(), pca_variable_loadings(),
pca_variable_correlations(), pca_variable_contributions(), and the
streaming variants pca_stream_bigmatrix() and companions.
    set.seed(123)
    mat <- bigmemory::as.big.matrix(matrix(rnorm(40), nrow = 10))
    pca <- pca_bigmatrix(mat, center = TRUE, scale = TRUE, ncomp = 3)
    scores <- pca_scores_bigmatrix(mat, pca$rotation, pca$center, pca$scale, ncomp = 3)
    loadings <- pca_variable_loadings(pca$rotation, pca$sdev)
    correlations <- pca_variable_correlations(pca$rotation, pca$sdev, pca$column_sd, pca$scale)
    contributions <- pca_variable_contributions(loadings)
    list(scores = scores, loadings = loadings, correlations = correlations, contributions = contributions)
Combines principal component scores and variable loadings in a single
scatter plot. The helper accepts both standard matrices and
bigmemory::big.matrix inputs, extracting only the requested component
columns. When draw = TRUE, the function scales the loadings to match the
score ranges, draws optional axes, overlays loading arrows, and labels
observations when requested.
    pca_plot_biplot(
      scores,
      loadings,
      components = c(1L, 2L),
      draw = TRUE,
      draw_axes = TRUE,
      draw_arrows = TRUE,
      label_points = FALSE,
      ...
    )
scores |
Matrix or |
loadings |
Matrix or |
components |
Integer vector of length two selecting the components to display. |
draw |
Logical; set to |
draw_axes |
Logical; when |
draw_arrows |
Logical; when |
label_points |
Logical; when |
... |
Additional graphical parameters passed to |
A list containing the selected components, extracted scores,
original loadings, scaled loadings (loadings_scaled), and the applied
scale_factor. The list is returned invisibly. When draw = TRUE, a biplot
is produced using base graphics.
Highlights the variables that contribute most to a selected principal
component. The helper works with dense matrices returned by
pca_variable_contributions() as well as with bigmemory::big.matrix
objects via sampling.
    pca_plot_contributions(
      contributions,
      component = 1L,
      top_n = 20L,
      draw = TRUE,
      ...
    )
contributions |
Contribution matrix where rows correspond to variables and columns to components. |
component |
Integer index of the component to visualise. |
top_n |
Number of variables with the largest absolute contribution to include in the bar plot. |
draw |
Logical; set to |
... |
Additional arguments passed to |
A data frame with the variables and their contributions is returned
invisibly. When draw = TRUE, a bar plot of the top variables is produced.
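The selection step behind the top_n argument can be sketched as follows. The contribution values are made up for illustration, and the column names of the returned data frame (variable, contribution) are assumptions rather than the documented output of pca_plot_contributions().

```r
# Hypothetical contribution matrix: 4 variables x 2 components
contributions <- matrix(
  c(0.50, 0.10,
    0.05, 0.60,
    0.30, 0.20,
    0.15, 0.10),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("V", 1:4), c("PC1", "PC2"))
)

component <- 1L
top_n <- 2L

# Rank variables by absolute contribution to the chosen component
ord <- order(abs(contributions[, component]), decreasing = TRUE)
keep <- head(ord, top_n)

top <- data.frame(
  variable = rownames(contributions)[keep],
  contribution = contributions[keep, component]
)
top
```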
Visualises the correlation between each variable and a pair of principal components. The variables are projected onto the unit circle, where points near the perimeter indicate strong correlation with the selected components.
    pca_plot_correlation_circle(
      correlations,
      components = c(1L, 2L),
      labels = NULL,
      draw = TRUE,
      ...
    )
correlations |
Matrix or |
components |
Length-two integer vector specifying the principal components to display. |
labels |
Optional character vector specifying the labels to display for
each variable. When |
draw |
Logical; set to |
... |
Additional graphical parameters passed to |
A data frame with variable, PCx, and PCy columns representing the
projected correlations, where PCx/PCy correspond to the requested
component indices. The data frame is returned invisibly.
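Why the projected variables stay inside the unit circle can be checked numerically. The sketch below uses base R only (stats::prcomp() and cor()), not the package helper: for each variable, the squared correlations over all components sum to one, so the squared distance from the origin on any pair of components cannot exceed one.

```r
set.seed(1)
x <- matrix(rnorm(200), nrow = 50)
fit <- prcomp(x, center = TRUE, scale. = TRUE)

# Variable-component correlations (variables in rows, components in columns)
correlations <- unname(cor(x, fit$x))

# Coordinates on the requested pair of components
components <- c(1L, 2L)
coords <- correlations[, components]

# Distance from the origin; values close to 1 mean the variable is
# well represented by the selected components
radius <- sqrt(rowSums(coords^2))
radius
```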
Streams a subset of observations through the PCA rotation and plots their scores on the requested components. Sampling keeps the drawn subset small so graphics remain interpretable even when the source big matrix contains millions of rows.
    pca_plot_scores(
      x,
      rotation,
      center = numeric(),
      scale = numeric(),
      components = c(1L, 2L),
      max_points = 5000L,
      sample = c("uniform", "head"),
      seed = NULL,
      draw = TRUE,
      ...
    )
x |
Either a |
rotation |
A rotation matrix such as |
center |
Optional centering vector. Use |
scale |
Optional scaling vector. Use |
components |
Length-two integer vector selecting the principal components to display. |
max_points |
Maximum number of observations to sample for the plot. |
sample |
Strategy for selecting rows. |
seed |
Optional seed to make the sampling reproducible. |
draw |
Logical; set to |
... |
Additional graphical parameters forwarded to |
A list containing indices (the sampled row indices) and scores
(the corresponding score matrix) is returned invisibly. When draw = TRUE
a scatter plot is produced.
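The two row-selection strategies can be illustrated with a small helper. pick_rows() is a hypothetical function written for this sketch and is not exported by the package; the way pca_plot_scores() actually draws its sample may differ.

```r
# Hypothetical helper: choose at most max_points row indices out of n
pick_rows <- function(n, max_points = 5000L,
                      strategy = c("uniform", "head"), seed = NULL) {
  strategy <- match.arg(strategy)
  if (n <= max_points) return(seq_len(n))            # nothing to subsample
  if (strategy == "head") return(seq_len(max_points)) # first rows only
  if (!is.null(seed)) set.seed(seed)                  # reproducible draw
  sort(sample.int(n, max_points))                     # uniform, no replacement
}

idx_uniform <- pick_rows(1e6, max_points = 1000L, seed = 42)
idx_head <- pick_rows(1e6, max_points = 1000L, strategy = "head")
range(idx_uniform)
range(idx_head)
```

Uniform sampling gives a picture of the whole matrix, while "head" avoids random access and only reads the first rows, which can be cheaper for file-backed data but is only representative when the rows are not ordered.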
Displays the proportion of variance explained by the leading principal components. The function caps the number of displayed components to keep the visualization legible on very high-dimensional problems.
    pca_plot_scree(
      pca_result,
      max_components = 25L,
      cumulative = TRUE,
      draw = TRUE,
      ...
    )
pca_result |
A list created by |
max_components |
Maximum number of components to display. Defaults to 25 or the available number of components, whichever is smaller. |
cumulative |
Logical flag indicating whether to overlay the cumulative explained variance line. |
draw |
Logical; set to |
... |
Additional parameters passed to |
A list with component, explained, and cumulative vectors is
returned invisibly. When draw = TRUE, the function produces a scree plot
using base graphics.
These helpers visualise the results returned by pca_bigmatrix() and its
companions without requiring users to materialise dense intermediate
structures. Each plotting function optionally samples the inputs so the
default output remains responsive even when the underlying big matrix spans
millions of observations.
pca_bigmatrix(), pca_variable_loadings(),
pca_variable_contributions()
Compute principal component analysis (PCA) using robust measures of location and scale so that extreme observations have a reduced influence on the resulting components. The implementation centres each variable by its median and, when requested, scales by the median absolute deviation (MAD) before performing an iteratively reweighted singular value decomposition that down-weights observations with unusually large reconstruction errors.
    pca_robust(x, center = TRUE, scale = FALSE, ncomp = NULL)
x |
A numeric matrix, data frame, or an object coercible to a numeric matrix. Missing values are not supported. |
center |
Logical; should variables be centred by their median before applying PCA? |
scale |
Logical; when |
ncomp |
Number of components to retain. Use |
A bigpca object mirroring the structure of
pca_bigmatrix() with robust estimates of location, scale, and
variance metrics.
    set.seed(42)
    x <- matrix(rnorm(50), nrow = 10)
    x[1, 1] <- 25 # outlier
    robust <- pca_robust(x, ncomp = 2)
    robust$sdev
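The robust centring and scaling step described for pca_robust() can be sketched in base R. This only illustrates median and MAD standardisation on the same simulated data as the example above; it assumes the default consistency constant of stats::mad(), which the package may or may not use, and it does not reproduce the reweighted decomposition.

```r
set.seed(42)
x <- matrix(rnorm(50), nrow = 10)
x[1, 1] <- 25  # single gross outlier in the first column

# Robust location and scale per column
med <- apply(x, 2, median)
mad_scale <- apply(x, 2, mad)

# Centre by the median, then scale by the MAD
x_robust <- sweep(sweep(x, 2, med, `-`), 2, mad_scale, `/`)

# The outlier pulls the mean and sd of column 1 far more than median and MAD
c(mean = mean(x[, 1]), median = med[1])
c(sd = sd(x[, 1]), mad = mad_scale[1])
```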
Implements the scalable PCA (sPCA) procedure of Elgamal et al. (2015), which uses block power iterations to approximate the leading principal components while streaming the data in manageable chunks. The algorithm only requires matrix-vector products, allowing large matrices to be processed without materialising the full cross-product in memory.
    pca_spca(
      x,
      ncomp = NULL,
      center = TRUE,
      scale = FALSE,
      block_size = 2048L,
      max_iter = 50L,
      tol = 1e-04,
      seed = NULL,
      return_scores = FALSE,
      verbose = FALSE
    )

    pca_spca_R(
      x,
      ncomp = NULL,
      center = TRUE,
      scale = FALSE,
      block_size = 2048L,
      max_iter = 50L,
      tol = 1e-04,
      seed = NULL,
      return_scores = FALSE,
      verbose = FALSE
    )
x |
A numeric matrix, data frame, |
ncomp |
Number of principal components to retain. Use |
center |
Logical; should column means be subtracted before performing PCA? |
scale |
Logical; when |
block_size |
Number of rows to stream per block when computing column statistics and matrix-vector products. |
max_iter |
Maximum number of block power iterations. |
tol |
Convergence tolerance applied to the Frobenius norm of the difference between successive subspace projectors. |
seed |
Optional integer seed used to initialise the random starting basis. |
return_scores |
Logical; when |
verbose |
Logical; when |
A bigpca object containing the approximate PCA solution with the
same structure as pca_bigmatrix(). The result includes component standard
deviations, rotation/loadings, optional scores, column statistics, and
variance summaries. Additional metadata is stored in
attr(result, "iterations") (number of iterations performed),
attr(result, "tolerance") (requested tolerance), and
attr(result, "converged") (logical convergence flag).
Tarek Elgamal, Maysam Yabandeh, Ashraf Aboulnaga, Waleed Mustafa, and Mohamed Hefeeda (2015). sPCA: Scalable Principal Component Analysis for Big Data on Distributed Platforms. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. https://doi.org/10.1145/2723372.2751520.
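The idea of a block power iteration that streams row blocks and stops when successive subspace projectors agree can be sketched in base R. This is a generic subspace iteration written for illustration, not the package's compiled implementation and not necessarily the exact procedure of Elgamal et al.; cross_apply() is a hypothetical helper, and the tolerance and iteration cap are chosen for this toy example.

```r
set.seed(1)
n <- 200; p <- 6; k <- 2
x <- matrix(rnorm(n * p), n, p) %*% diag(c(5, 3, 1, 1, 1, 1))
mu <- colMeans(x)

# Hypothetical helper: apply the sample covariance to Q one row block at a
# time, so the p x p cross-product is never formed explicitly
cross_apply <- function(x, mu, Q, block_size = 64L) {
  out <- matrix(0, ncol(x), ncol(Q))
  for (start in seq(1L, nrow(x), by = block_size)) {
    rows <- start:min(start + block_size - 1L, nrow(x))
    xb <- sweep(x[rows, , drop = FALSE], 2, mu)  # centre the block
    out <- out + crossprod(xb, xb %*% Q)
  }
  out / (nrow(x) - 1)
}

# Random orthonormal starting basis
Q <- qr.Q(qr(matrix(rnorm(p * k), p, k)))
converged <- FALSE
for (iter in seq_len(50L)) {
  Q_new <- qr.Q(qr(cross_apply(x, mu, Q)))
  # Frobenius norm of the difference between successive projectors
  delta <- norm(tcrossprod(Q_new) - tcrossprod(Q), type = "F")
  Q <- Q_new
  if (delta < 1e-8) { converged <- TRUE; break }
}

# Rayleigh-Ritz step: rotate the basis into principal axes
B <- crossprod(Q, cross_apply(x, mu, Q))
eig <- eigen((B + t(B)) / 2, symmetric = TRUE)
rotation <- Q %*% eig$vectors
sdev <- sqrt(eig$values)
```

On this small problem the leading standard deviations can be compared with prcomp(x)$sdev[1:2]. The convergence rate depends on the gap between the retained and discarded eigenvalues, so data with a flat spectrum may need more iterations or a looser tolerance.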
Variants of the PCA helpers that stream results directly into
bigmemory::big.matrix objects, enabling file-backed workflows without
materialising dense R matrices.
    pca_spca_stream_bigmatrix(
      xpMat,
      xpRotation = NULL,
      center = TRUE,
      scale = FALSE,
      ncomp = -1L,
      block_size = 2048L,
      max_iter = 50L,
      tol = 1e-04,
      seed = NULL,
      return_scores = FALSE,
      verbose = FALSE
    )

    pca_scores_stream_bigmatrix(
      xpMat,
      xpDest,
      rotation,
      center,
      scale,
      ncomp = -1L,
      block_size = 1024L
    )

    pca_variable_loadings_stream_bigmatrix(xpRotation, sdev, xpDest)

    pca_variable_correlations_stream_bigmatrix(
      xpRotation,
      sdev,
      column_sd,
      scale = NULL,
      xpDest
    )

    pca_variable_contributions_stream_bigmatrix(xpLoadings, xpDest)
xpMat |
Either a |
xpRotation |
For |
center |
For |
scale |
Optional numeric vector of scaling factors returned by
|
ncomp |
Number of components to retain. Use a non-positive value to keep all components returned by the decomposition. |
block_size |
Number of rows to process per block when streaming data through BLAS kernels. Larger values improve throughput at the cost of additional memory. |
max_iter |
Maximum number of block power iterations. |
tol |
Convergence tolerance applied to the Frobenius norm of the difference between successive subspace projectors. |
seed |
Optional integer seed used to initialise the random starting basis. |
return_scores |
Logical; when |
verbose |
Logical; when |
xpDest |
Either a |
rotation |
A rotation matrix such as the |
sdev |
A numeric vector of component standard deviations, typically the
|
column_sd |
A numeric vector of variable standard deviations used to scale the correlations when the PCA was performed on unscaled data. |
xpLoadings |
For |
For pca_stream_bigmatrix(), the same bigpca object as
pca_bigmatrix() with the
addition of a rotation_stream_bigmatrix element referencing the populated
big.matrix when xpRotation is supplied. For
pca_spca_stream_bigmatrix(), the same scalable PCA structure as
pca_spca() with the optional pointer populated when provided.
The external pointer supplied in xpDest, invisibly.
pca_scores_stream_bigmatrix(): Stream PCA scores into a destination
big.matrix.
pca_variable_loadings_stream_bigmatrix(): Populate big.matrix objects with derived variable
diagnostics.
pca_variable_correlations_stream_bigmatrix(): Stream variable correlations into a
destination big.matrix.
pca_variable_contributions_stream_bigmatrix(): Stream variable contributions into a
destination big.matrix.
    set.seed(456)
    mat <- bigmemory::as.big.matrix(matrix(rnorm(30), nrow = 6))
    ncomp <- 2
    rotation_store <- bigmemory::big.matrix(ncol(mat), ncomp, type = "double")
    pca_stream <- pca_stream_bigmatrix(mat, xpRotation = rotation_store, ncomp = ncomp)
    score_store <- bigmemory::big.matrix(nrow(mat), ncomp, type = "double")
    pca_scores_stream_bigmatrix(
      mat,
      score_store,
      pca_stream$rotation,
      pca_stream$center,
      pca_stream$scale,
      ncomp = ncomp
    )
    loadings_store <- bigmemory::big.matrix(ncol(mat), ncomp, type = "double")
    pca_variable_loadings_stream_bigmatrix(
      pca_stream$rotation_stream_bigmatrix,
      pca_stream$sdev,
      loadings_store
    )
    correlation_store <- bigmemory::big.matrix(ncol(mat), ncomp, type = "double")
    pca_variable_correlations_stream_bigmatrix(
      pca_stream$rotation_stream_bigmatrix,
      pca_stream$sdev,
      pca_stream$column_sd,
      pca_stream$scale,
      correlation_store
    )
    contribution_store <- bigmemory::big.matrix(ncol(mat), ncomp, type = "double")
    pca_variable_contributions_stream_bigmatrix(
      loadings_store,
      contribution_store
    )
Compute principal component scores and quality metrics for supplementary individuals (rows) projected into an existing PCA solution.
    pca_supplementary_individuals(
      data,
      rotation,
      sdev,
      center = NULL,
      scale = NULL,
      total_weight = NA_real_
    )
data |
Matrix-like object whose rows correspond to supplementary individuals and columns to the original variables. |
rotation |
Rotation matrix from the PCA model (e.g. the |
sdev |
Numeric vector of component standard deviations associated with
|
center |
Optional numeric vector giving the centring applied to each variable when fitting the PCA. Defaults to zero centring. |
scale |
Optional numeric vector describing the scaling applied to each
variable when fitting the PCA. When |
total_weight |
Optional positive scalar passed to
|
A list with elements scores, contributions, and cos2.
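The projection of supplementary individuals can be sketched with base R. The example uses stats::prcomp() as the fitted model and standard definitions for scores and squared cosines; it is not the package's code, and the contributions element is left out because its normalisation is not specified on this page.

```r
set.seed(99)
active <- matrix(rnorm(60), nrow = 15)
fit <- prcomp(active, center = TRUE, scale. = TRUE)

# Two supplementary individuals measured on the same four variables
new_rows <- matrix(rnorm(8), nrow = 2)

# Apply the centring and scaling of the fitted model, then rotate
z <- sweep(sweep(new_rows, 2, fit$center, `-`), 2, fit$scale, `/`)
scores <- z %*% fit$rotation

# Squared cosines: share of each individual's squared distance to the
# origin that is captured by each component
cos2 <- scores^2 / rowSums(z^2)
```

The scores agree with predict(fit, new_rows), and with every component retained the squared cosines of each individual sum to one.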
Compute loadings, correlations, contributions, and cos^2 values for supplementary variables (columns) given component scores for the active individuals.
    pca_supplementary_variables(data, scores, sdev, center = NULL)
data |
Matrix-like object whose columns correspond to supplementary variables measured on the active individuals. |
scores |
Numeric matrix of component scores for the active individuals. |
sdev |
Numeric vector of component standard deviations associated with
|
center |
Optional numeric vector specifying the centring to apply to each
supplementary variable. When |
A list with elements loadings, correlations, contributions, and
cos2.
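For supplementary variables, the same quantities can be illustrated from the scores of the active individuals. This is a base R sketch under textbook definitions (correlations with the components, loadings as covariances with unit-variance components); it does not call the package, and the supplementary columns noise and copy are invented for the example.

```r
set.seed(5)
active <- matrix(rnorm(80), nrow = 20)
fit <- prcomp(active, center = TRUE, scale. = FALSE)
scores <- fit$x

# Two supplementary variables: pure noise and a copy of an active variable
supp <- cbind(noise = rnorm(20), copy = active[, 1])

# Correlation of each supplementary variable with each component
correlations <- cor(supp, scores)
cos2 <- correlations^2

# Loadings: covariance with the standardised (unit-variance) scores
loadings <- cov(supp, sweep(scores, 2, fit$sdev, `/`))
```

The copied variable recovers the loadings of the active variable it duplicates, which is a convenient sanity check for any implementation.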
Compute the singular value decomposition (SVD) of a
bigmemory::big.matrix without materialising it as a base R matrix.
Blocks of rows are streamed through BLAS before LAPACK is invoked so that
even moderately large matrices can be decomposed efficiently.
    svd_bigmatrix(
      xpMat,
      nu = -1L,
      nv = -1L,
      block_size = 1024L,
      method = c("dgesdd", "dgesvd")
    )
xpMat |
Either a |
nu |
Number of left singular vectors to return. Use a negative value to
request the default of |
nv |
Number of right singular vectors to return. Use a negative value to
request the default of |
block_size |
Number of rows to process per block when streaming data into BLAS kernels. Larger values can improve throughput at the cost of additional temporary memory. |
method |
LAPACK backend used to compute the decomposition. The default
uses the divide-and-conquer routine |
A list with components u, d, and v analogous to base R's
svd() output. When nu or nv are zero the corresponding matrix has
zero columns.
    set.seed(42)
    mat <- bigmemory::as.big.matrix(matrix(rnorm(20), nrow = 5))
    svd_res <- svd_bigmatrix(mat, nu = 2, nv = 2)
    svd_res$d
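One way to stream row blocks before a dense factorisation is to accumulate the cross-product block by block and decompose the small result. The sketch below does this in base R on an ordinary matrix; it is an assumption about the general approach and not a description of how svd_bigmatrix() is implemented.

```r
set.seed(42)
x <- matrix(rnorm(20), nrow = 5)
block_size <- 2L

# Accumulate t(x) %*% x from row blocks; only one block is in use at a time
gram <- matrix(0, ncol(x), ncol(x))
for (start in seq(1L, nrow(x), by = block_size)) {
  rows <- start:min(start + block_size - 1L, nrow(x))
  gram <- gram + crossprod(x[rows, , drop = FALSE])
}

# Singular values are the square roots of the eigenvalues of the Gram matrix;
# the eigenvectors give the right singular vectors up to sign
eig <- eigen(gram, symmetric = TRUE)
d <- sqrt(pmax(eig$values, 0))
v <- eig$vectors
d
```

Forming the Gram matrix squares the condition number, so this route loses accuracy for very small singular values compared with calling LAPACK on the matrix itself.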
Compute the iteratively reweighted SVD using the high-performance C++
implementation. The interface mirrors svd_robust_R() while delegating the
heavy lifting to compiled code.
    svd_robust(
      x,
      ncomp,
      max_iter = 25L,
      tol = sqrt(.Machine$double.eps),
      huber_k = 1.345
    )
x |
Numeric matrix for which the decomposition should be computed. |
ncomp |
Number of leading components to retain. |
max_iter |
Maximum number of reweighting iterations. |
tol |
Convergence tolerance applied to successive changes in the row weights and singular values. |
huber_k |
Tuning constant controlling the aggressiveness of the Huber weight function. Larger values down-weight fewer observations. |
A list containing the left and right singular vectors (u and v),
the singular values (d), the final row weights (weights), and the number
of iterations required for convergence (iterations).
Internal helper used by pca_robust() to compute a singular value
decomposition that is less sensitive to individual rows with extreme values.
The routine alternates between computing the SVD of a row-weighted matrix and
updating the weights via a Huber-type scheme based on the reconstruction
residuals.
    svd_robust_R(
      x,
      ncomp,
      max_iter = 25L,
      tol = sqrt(.Machine$double.eps),
      huber_k = 1.345
    )
x |
Numeric matrix for which the decomposition should be computed. |
ncomp |
Number of leading components to retain. |
max_iter |
Maximum number of reweighting iterations. |
tol |
Convergence tolerance applied to successive changes in the row weights and singular values. |
huber_k |
Tuning constant controlling the aggressiveness of the Huber weight function. Larger values down-weight fewer observations. |
A list containing the left and right singular vectors (u and v),
the singular values (d), the final row weights (weights), and the number
of iterations required for convergence (iterations). The structure mirrors
base R's base::svd() output with additional metadata.
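The alternation between a row-weighted SVD and a Huber-type weight update can be sketched in a few lines of base R. This is an illustrative rank-one version with assumed details (square-root row weighting, residual norms scaled by a MAD about zero, convergence on the weights only); huber_weights() is a hypothetical helper and the package routines may differ in each of these choices.

```r
# Hypothetical helper: Huber-type weights from non-negative residual norms
huber_weights <- function(r, k = 1.345) {
  s <- mad(r, center = 0)             # robust scale of the residual norms
  if (s <= 0) return(rep(1, length(r)))
  pmin(1, k * s / r)                  # weight 1 inside the threshold
}

set.seed(7)
n <- 100
x <- cbind(rnorm(n, sd = 10), rnorm(n), rnorm(n))  # one dominant direction
x[1, ] <- c(0, 15, 0)                              # row far from that direction

w <- rep(1, n)
for (iter in seq_len(25L)) {
  # SVD of the row-weighted matrix, keeping one right singular vector
  v <- svd(sqrt(w) * x, nu = 0, nv = 1)$v
  # Reconstruction residuals of the unweighted rows
  resid <- x - x %*% tcrossprod(v)
  w_new <- huber_weights(sqrt(rowSums(resid^2)))
  change <- max(abs(w_new - w))
  w <- w_new
  if (change < sqrt(.Machine$double.eps)) break
}

which.min(w)  # the row that was down-weighted the most
```

A larger k leaves more rows at full weight, which matches the description of huber_k above. Note that a single extreme row can dominate the initial unweighted SVD and mask itself, so schemes of this kind work best when the bulk of the data carries a clear low-rank signal.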