| Title: | Methods and Reproducible Workflows for Partial Least Squares with Missing Data |
|---|---|
| Description: | Methods-first tooling for reproducing and extending the partial least squares regression studies on incomplete data described in Nengsih et al. (2019) <doi:10.1515/sagmb-2018-0059>. The package provides simulation helpers, missingness generators, imputation wrappers, component-selection utilities, real-data diagnostics, and reproducible study orchestration for Nonlinear Iterative Partial Least Squares (NIPALS)-Partial Least Squares (PLS) workflows. |
| Authors: | Titin Agustin Nengsih [aut], Frederic Bertrand [aut, cre], Myriam Maumy-Bertrand [aut] |
| Maintainer: | Frederic Bertrand <[email protected]> |
| License: | GPL-3 |
| Version: | 0.2.0 |
| Built: | 2026-05-13 09:38:21 UTC |
| Source: | https://github.com/fbertran/misspls |
Create MCAR or MAR missingness on the predictor matrix x. Missingness is
generated column-wise so that each predictor receives approximately the same
missing-data proportion, matching the simulation strategy used in the
original work.
add_missingness(x, y, mechanism = c("MCAR", "MAR"), missing_prop, seed = NULL, mar_y_bias = 0.8)

| Argument | Description |
|---|---|
| x | Predictor matrix or data frame. |
| y | Numeric response vector. |
| mechanism | Missingness mechanism: "MCAR" or "MAR". |
| missing_prop | Missing-data proportion as a fraction ( |
| seed | Optional random seed. If supplied, it is used only for this call. |
| mar_y_bias | Proportion of missing values assigned to the upper half of the observed |

A list with components x_incomplete, missing_mask,
missing_prop, mechanism, and seed.
sim <- simulate_pls_data(n = 20, p = 10, true_ncomp = 2, seed = 1)
miss <- add_missingness(sim$x, sim$y, mechanism = "MCAR", missing_prop = 10, seed = 2)
mean(is.na(miss$x_incomplete))
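The column-wise strategy described above can be illustrated with a small base-R sketch. It is not the package's implementation of add_missingness(): the helper name `sketch_columnwise_missing` is hypothetical, and the MAR rule (a share `y_bias` of the missing cells in each column is placed in rows whose response lies above the median of y) is an assumption based on the description of mar_y_bias.

```r
# Hypothetical helper: a sketch, NOT the package's add_missingness().
sketch_columnwise_missing <- function(x, y = NULL, prop = 0.1,
                                      mechanism = c("MCAR", "MAR"),
                                      y_bias = 0.8) {
  mechanism <- match.arg(mechanism)
  x <- as.matrix(x)
  n <- nrow(x)
  n_miss <- round(prop * n)  # same number of missing cells in every column
  if (mechanism == "MAR") {
    stopifnot(!is.null(y), length(y) == n)
    upper <- which(y > stats::median(y))   # rows in the upper half of y
    lower <- setdiff(seq_len(n), upper)
    # Assumed rule: y_bias of the missing cells go to the upper half of y
    n_up <- min(round(y_bias * n_miss), length(upper))
    n_lo <- min(n_miss - n_up, length(lower))
  }
  for (j in seq_len(ncol(x))) {
    rows <- if (mechanism == "MCAR") {
      sample.int(n, n_miss)                # rows chosen uniformly at random
    } else {
      c(upper[sample.int(length(upper), n_up)],
        lower[sample.int(length(lower), n_lo)])
    }
    x[rows, j] <- NA
  }
  x
}

set.seed(1)
x <- matrix(rnorm(200), nrow = 20, ncol = 10)
y <- rnorm(20)
x_mcar <- sketch_columnwise_missing(x, prop = 0.1)
x_mar  <- sketch_columnwise_missing(x, y, prop = 0.2, mechanism = "MAR")
colMeans(is.na(x_mcar))  # identical missing proportion in every column
```

Because the missing cells are drawn separately for each column, every predictor ends up with the same number of missing values, which is the property the description emphasises.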
Bromhexine in pharmaceutical syrup used in the article and thesis.
bromhexine
A misspls_dataset list with components:
Dataset name.
A numeric 23 x 64 predictor matrix.
A numeric response vector of length 23.
A data frame with response y and predictors x1 to x64.
A short source reference.
Dataset preprocessing notes.
Additional study notes.
Goicoechea and Olivieri (1999a), calibration and test files bundled in
extra_docs/pls_data.
Compute correlation summaries and VIF-style diagnostics for a packaged real dataset.
diagnose_real_data(dataset, cor_threshold = 0.7)

| Argument | Description |
|---|---|
| dataset | A packaged dataset name or |
| cor_threshold | Absolute-correlation threshold used when reporting predictor pairs and predictor-response associations. |

A list with correlation and VIF summaries.
diag_bromhexine <- diagnose_real_data("bromhexine")
names(diag_bromhexine)
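To make the two kinds of diagnostics concrete, the following base-R sketch reports predictor pairs above an absolute-correlation threshold and computes classical variance inflation factors on simulated data. It is an illustration under stated assumptions, not the code behind diagnose_real_data(): the data and all object names are made up, and the classical VIF used here requires more observations than predictors, which is not the case for some of the packaged datasets (hence, presumably, the "VIF-style" wording).

```r
set.seed(42)
n <- 50
z <- rnorm(n)
# x1 and x2 share the latent variable z, x3 is independent noise
x <- cbind(x1 = z + rnorm(n, sd = 0.3),
           x2 = z + rnorm(n, sd = 0.3),
           x3 = rnorm(n))
y <- z + rnorm(n, sd = 0.5)

cor_threshold <- 0.7
r <- cor(x)

# Predictor pairs whose absolute correlation exceeds the threshold
idx <- which(abs(r) > cor_threshold & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  cor  = r[idx]
)

# Predictors strongly associated with the response
r_xy <- cor(x, y)[, 1]
high_xy <- names(r_xy)[abs(r_xy) > cor_threshold]

# Classical VIF: diagonal of the inverse predictor correlation matrix
vif <- diag(solve(r))

high_pairs
vif
```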
Apply one of the imputation strategies used in the article and thesis.
impute_pls_data(x, method = c("mice", "knn", "svd"), seed = NULL, m, k = 15L, svd_rank = 10L, svd_maxiter = 1000L)

| Argument | Description |
|---|---|
| x | Incomplete predictor matrix or data frame. |
| method | Imputation method: "mice", "knn", or "svd". |
| seed | Optional random seed forwarded to stochastic imputers when supported. |
| m | Number of imputations for |
| k | Number of neighbours for |
| svd_rank | Target rank for |
| svd_maxiter | Maximum number of iterations for the fallback SVD routine. |

A misspls_imputation object.
sim <- simulate_pls_data(n = 20, p = 10, true_ncomp = 2, seed = 1)
miss <- add_missingness(sim$x, sim$y, mechanism = "MCAR", missing_prop = 10, seed = 2)
imp <- impute_pls_data(miss$x_incomplete, method = "knn", seed = 3)
length(imp$datasets)
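The svd_rank and svd_maxiter arguments suggest an iterative low-rank scheme. A generic version of that idea can be sketched in base R as follows. This is an assumption about the general technique (iterative truncated-SVD imputation), not a copy of the package's fallback routine; the helper name `sketch_svd_impute` and the `tol` argument are hypothetical.

```r
# Hypothetical helper: generic iterative SVD imputation, for illustration only.
sketch_svd_impute <- function(x, rank = 2L, maxiter = 100L, tol = 1e-6) {
  x <- as.matrix(x)
  miss <- is.na(x)
  # Initialise missing cells with the observed column means
  col_means <- colMeans(x, na.rm = TRUE)
  filled <- x
  filled[miss] <- col_means[col(x)[miss]]
  for (iter in seq_len(maxiter)) {
    # Rank-limited reconstruction of the current completed matrix
    s <- svd(filled, nu = rank, nv = rank)
    approx <- s$u %*% diag(s$d[seq_len(rank)], nrow = rank) %*% t(s$v)
    delta <- sum((filled[miss] - approx[miss])^2)
    filled[miss] <- approx[miss]  # observed cells are never modified
    if (delta < tol) break        # stop once the imputed cells stabilise
  }
  filled
}

set.seed(3)
scores   <- matrix(rnorm(40), nrow = 20, ncol = 2)
loadings <- matrix(rnorm(20), nrow = 2, ncol = 10)
x_full <- scores %*% loadings            # low-rank toy data
x_miss <- x_full
x_miss[sample(length(x_miss), 20)] <- NA # 10% of cells set to missing
x_imp <- sketch_svd_impute(x_miss, rank = 2L)
anyNA(x_imp)
```

Only the missing cells are overwritten at each iteration, so the observed values are returned unchanged.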
Octane in gasoline from NIR data used in the article and thesis.
octane
A misspls_dataset list with components:
Dataset name.
A numeric 68 x 493 predictor matrix.
A numeric response vector of length 68.
A data frame with response y and predictors x1 to x493.
A short source reference.
Dataset preprocessing notes.
Additional study notes.
Goicoechea and Olivieri (2003), calibration and test files bundled in
extra_docs/pls_data.
Los Angeles ozone pollution complete-case dataset used in the article and thesis.
ozone_complete
A misspls_dataset list with components:
Dataset name.
A numeric 203 x 12 predictor matrix.
A numeric response vector of length 203.
A data frame with response y and predictors x1 to x12.
A short source reference.
Dataset preprocessing notes.
Additional study notes.
mlbench::Ozone, restricted to the 203 complete observations used
in the published analysis.
Run a real-data study
run_real_data_study(
  dataset,
  seed = NULL,
  missing_props = seq(5, 50, 5),
  mechanisms = c("MCAR", "MAR"),
  reps = 1L,
  baseline_reps = 100L,
  max_ncomp = 12L,
  criteria = c("Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", "BIC-DoF"),
  incomplete_methods = c("nipals_standard", "nipals_adaptative"),
  imputation_methods = c("mice", "knn", "svd"),
  folds = 10L,
  mar_y_bias = 0.8
)

| Argument | Description |
|---|---|
| dataset | A packaged dataset name or |
| seed | Optional base random seed. |
| missing_props | Missing-data proportions as fractions or percentages. |
| mechanisms | Missing-data mechanisms. |
| reps | Number of replicate missingness draws for each mechanism and proportion. |
| baseline_reps | Number of repeated complete-data |
| max_ncomp | Maximum number of extracted components. |
| criteria | Criteria evaluated on incomplete and imputed data. |
| incomplete_methods | Incomplete-data NIPALS workflows. |
| imputation_methods | Imputation methods. |
| folds | Number of folds used by |
| mar_y_bias | MAR bias parameter passed to add_missingness(). |

A data frame with one row per study run.
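Since missing_props accepts "fractions or percentages", both scales have to be mapped to one. The sketch below shows one plausible convention, in which values greater than 1 are read as percentages. This rule and the helper name `normalize_prop` are assumptions for illustration; the package may resolve the two scales differently, in particular for the ambiguous value 1.

```r
# Hypothetical helper: map mixed fraction/percentage input to fractions.
normalize_prop <- function(p) {
  stopifnot(is.numeric(p), all(p >= 0), all(p <= 100))
  # Assumed convention: anything above 1 is a percentage
  ifelse(p > 1, p / 100, p)
}

normalize_prop(seq(5, 50, 5))   # default grid, given as percentages
normalize_prop(c(0.05, 0.10))   # the same kind of grid, given as fractions
```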
Run the simulation workflows used in the article and thesis.
run_simulation_study(
  dimensions = list(c(500L, 100L), c(500L, 20L), c(100L, 20L), c(80L, 25L), c(60L, 33L), c(40L, 50L), c(20L, 100L)),
  true_ncomp = c(2L, 4L, 6L),
  missing_props = seq(5, 50, 5),
  mechanisms = c("MCAR", "MAR"),
  reps = 1L,
  seed = NULL,
  max_ncomp = 8L,
  criteria = c("Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", "BIC-DoF"),
  incomplete_methods = c("nipals_standard", "nipals_adaptative"),
  imputation_methods = c("mice", "knn", "svd"),
  folds = 10L,
  mar_y_bias = 0.8
)

| Argument | Description |
|---|---|
| dimensions | List of |
| true_ncomp | Vector of true component counts. |
| missing_props | Missing-data proportions as fractions or percentages. |
| mechanisms | Missing-data mechanisms. |
| reps | Number of replicates. |
| seed | Optional base random seed. |
| max_ncomp | Maximum number of extracted components. |
| criteria | Criteria evaluated on complete and imputed data. |
| incomplete_methods | Incomplete-data NIPALS workflows. |
| imputation_methods | Imputation methods. |
| folds | Number of folds used by |
| mar_y_bias | MAR bias parameter passed to add_missingness(). |

A data frame with one row per study run.
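The size of the default design can be gauged by crossing the scenario-level arguments in base R. The sketch below only enumerates missingness scenarios; it assumes that each element of dimensions is an (observations, predictors) pair, and it does not claim to reproduce the rows of the returned data frame, which may be further split by method and criterion. All object names are illustrative.

```r
dimensions <- list(c(500L, 100L), c(500L, 20L), c(100L, 20L), c(80L, 25L),
                   c(60L, 33L), c(40L, 50L), c(20L, 100L))

# One row per combination of dimension, true component count,
# mechanism, missing-data proportion, and replicate
grid <- expand.grid(
  dim_id       = seq_along(dimensions),
  true_ncomp   = c(2L, 4L, 6L),
  mechanism    = c("MCAR", "MAR"),
  missing_prop = seq(5, 50, 5),
  rep          = 1L,
  stringsAsFactors = FALSE
)

# Assumption: first entry is the number of observations, second the number of predictors
grid$n <- vapply(dimensions[grid$dim_id], `[`, integer(1), 1L)
grid$p <- vapply(dimensions[grid$dim_id], `[`, integer(1), 2L)

nrow(grid)  # 7 * 3 * 2 * 10 * 1 = 420 scenarios
head(grid)
```

With 420 scenarios before any method or criterion is applied, reducing dimensions, missing_props, or reps is a sensible first step when testing a workflow.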
Select the number of components for complete, imputed, or incomplete-data PLS workflows.
select_ncomp(
  x,
  y,
  method = c("complete", "nipals_standard", "nipals_adaptative"),
  criterion = c("Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", "BIC-DoF"),
  max_ncomp,
  seed = NULL,
  folds = 10L,
  threshold = 0.0975
)

| Argument | Description |
|---|---|
| x | Predictor matrix, dataset object, or |
| y | Numeric response vector. This may be omitted when |
| method | Selection workflow: "complete", "nipals_standard", or "nipals_adaptative". |
| criterion | Selection criterion: "Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", or "BIC-DoF". |
| max_ncomp | Maximum number of components to consider. |
| seed | Optional random seed used by the cross-validation and imputation aggregation steps. |
| folds | Number of cross-validation folds used by |
| threshold | Threshold applied to |

A one-row data frame describing the selected component count.
sim <- simulate_pls_data(n = 25, p = 10, true_ncomp = 2, seed = 1)
select_ncomp(sim$x, sim$y, method = "complete", criterion = "AIC", max_ncomp = 4, seed = 2)
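The default threshold of 0.0975 matches the usual Q2 stopping rule in PLS regression, where component h is kept only while Q2_h = 1 - PRESS_h / RSS_(h-1) stays at or above 0.0975. The sketch below applies that rule to made-up PRESS and RSS values. It assumes that threshold refers to the Q2 criteria and does not reproduce the internals of select_ncomp(); the helper name `select_by_q2` is hypothetical.

```r
# Hypothetical helper: Q2 stopping rule on precomputed PRESS and RSS values.
select_by_q2 <- function(press, rss, threshold = 0.0975) {
  # press[h]: cross-validated prediction error sum of squares with h components
  # rss[h]:   residual sum of squares with h - 1 components (rss[1] = total SS)
  q2 <- 1 - press / rss
  below <- which(q2 < threshold)
  # Keep all components before the first one that falls below the threshold
  ncomp <- if (length(below) == 0L) length(q2) else below[1L] - 1L
  list(q2 = q2, ncomp = ncomp)
}

# Made-up numbers, for illustration only
press <- c(40, 18, 11.5, 11.2)
rss   <- c(100, 35, 12, 10.5)
res <- select_by_q2(press, rss)
round(res$q2, 3)
res$ncomp  # 2: the third component is the first with Q2 below 0.0975
```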
Simulate a univariate-response PLS dataset using the Li et al.-style
generator available in plsRglm.
simulate_pls_data(n, p, true_ncomp, seed = NULL, model = "li2002")

| Argument | Description |
|---|---|
| n | Number of observations. |
| p | Number of predictors. |
| true_ncomp | True number of latent components. |
| seed | Optional random seed. If supplied, it is used only for this call. |
| model | Simulation model. Only |

A list with components x, y, data, true_ncomp, seed, and
model.
sim <- simulate_pls_data(n = 20, p = 10, true_ncomp = 2, seed = 42)
str(sim)
Summarize simulation or real-data study results
summarize_simulation_study(results)

| Argument | Description |
|---|---|
| results | A results data frame returned by |

A grouped summary data frame.
sim_results <- run_simulation_study(
  dimensions = list(c(30, 12)),
  true_ncomp = 2,
  missing_props = numeric(0),
  mechanisms = character(0),
  reps = 2,
  seed = 1
)
summarize_simulation_study(sim_results)
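A grouped summary of this kind can be sketched in base R with aggregate(). The toy data frame below is made up, and its column names (method, criterion, true_ncomp, selected) are assumptions rather than the documented layout of the results returned by the study functions. The sketch computes, for each method and criterion, the share of runs in which the selected component count equals the true one.

```r
# Made-up results table; column names are hypothetical
results <- data.frame(
  method     = rep(c("complete", "knn"), each = 4),
  criterion  = rep(c("AIC", "Q2-LOO"), times = 4),
  true_ncomp = 2L,
  selected   = c(2L, 2L, 3L, 2L, 2L, 1L, 2L, 2L)
)

# Flag runs where the selected count matches the true count
results$correct <- results$selected == results$true_ncomp

# Proportion of correct selections per method and criterion
summary_df <- aggregate(correct ~ method + criterion, data = results, FUN = mean)
summary_df
```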
Tetracycline in serum used in the article and thesis.
tetracycline
A misspls_dataset list with components:
Dataset name.
A numeric 107 x 101 predictor matrix.
A numeric response vector of length 107.
A data frame with response y and predictors x1 to x101.
A short source reference.
Dataset preprocessing notes.
Additional study notes.
Goicoechea and Olivieri (1999b), calibration and test files bundled in
extra_docs/pls_data.