| Title: | Stability-Selection via Correlated Resampling for Beta-Regression Models |
|---|---|
| Description: | Adds variable-selection functions for Beta regression models (both mean and phi submodels) so they can be used within the 'SelectBoost' algorithm. Includes stepwise AIC, BIC, and corrected AIC on betareg() fits, 'gamlss'-based LASSO/Elastic-Net, a pure 'glmnet' iterative re-weighted least squares-based selector with an optional standardization speedup, and 'C++' helpers for iterative re-weighted least squares working steps and precision updates. Also provides a fastboost_interval() variant for interval responses, comparison helpers, and a flexible simulator simulation_DATA.beta() for interval-valued data. For more details see Bertrand and Maumy (2023) <doi:10.7490/f1000research.1119552.1>, <https://hal.science/hal-05352047>, and <https://hal.science/hal-05352056>. |
| Authors: | Frederic Bertrand [cre, aut] |
| Maintainer: | Frederic Bertrand <[email protected]> |
| License: | GPL-3 |
| Version: | 0.4.5 |
| Built: | 2026-05-17 09:11:10 UTC |
| Source: | https://github.com/fbertran/selectboost.beta |
Uses gamlss.lasso::gnet() to fit ENet on the mean submodel of
gamlss(dist = BE). The routine assumes complete cases and does not expose
offsets or precision-model terms.
betareg_enet_gamlss(X, Y, method = c("IC", "CV"), ICpen = c("BIC", "AIC", "HQC"), alpha = 1, trace = FALSE)
X |
Numeric matrix (n × p) of mean-submodel predictors. |
Y |
Numeric response in (0,1). Values are squeezed to (0,1) internally. |
method |
|
ICpen |
Penalty for |
alpha |
Elastic-net mixing (1 = LASSO, 0 = ridge). |
trace |
Logical; print stepwise trace. |
Named numeric vector of coefficients as in betareg_lasso_gamlss().
gamlss.lasso::gnet(), gamlss::gamlss(), gamlss.dist::BE()
Runs an IRLS loop with Beta working responses/weights and calls
glmnet on the weighted least-squares surrogate. Supports BIC/AIC/CV
model choice and an optional prestandardize speedup. The helper uses only
the mean submodel, requires complete cases, and does not expose offset terms.
betareg_glmnet(
  X, Y, alpha = 1, choose = c("bic", "aic", "cv"), nfolds = 5, n_iter = 6,
  tol = 1e-05, standardize = TRUE, lambda = NULL, phi_init = 20,
  update_phi = TRUE, phi_maxit = 5, prestandardize = FALSE, trace = FALSE
)
X |
Numeric matrix (n × p) of mean-submodel predictors. |
Y |
Numeric response in (0,1). Values are squeezed to (0,1) internally. |
alpha |
Elastic-net mixing parameter. |
choose |
One of "bic", "aic", or "cv". |
nfolds |
Folds for CV when choose = "cv". |
n_iter |
Max IRLS iterations; |
tol |
Convergence tolerance for the IRLS parameter change (Euclidean norm
of the difference in |
standardize |
Forwarded to |
lambda |
Optional fixed lambda; if |
phi_init |
Initial precision (phi). |
update_phi |
Logical; update phi inside the IRLS loop. |
phi_maxit |
Newton steps for phi update. |
prestandardize |
If |
trace |
Logical; print IRLS progress. |
Named numeric vector (Intercept) + colnames(X) with zeros for
unselected variables.
glmnet::glmnet(), glmnet::cv.glmnet()
set.seed(1)
X <- matrix(rnorm(500), 100, 5)
Y <- plogis(X[, 1] - 0.5 * X[, 3])
Y <- rbeta(100, Y * 40, (1 - Y) * 40)
betareg_glmnet(X, Y, alpha = 1, choose = "bic", prestandardize = TRUE)
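The IRLS loop described for betareg_glmnet() can be illustrated with a minimal base-R sketch. It is not the package's implementation: it assumes a logit link, holds the precision phi fixed, and solves each weighted least-squares step with stats::lm.wfit() instead of a penalised glmnet fit. The function name beta_irls_sketch is hypothetical.

```r
# Unpenalised IRLS for a logit-link Beta regression with phi held fixed.
# Assumptions of this sketch: fixed phi, no penalty, complete cases.
beta_irls_sketch <- function(X, y, phi = 20, n_iter = 25, tol = 1e-8) {
  Xd <- cbind("(Intercept)" = 1, X)
  beta <- c(qlogis(mean(y)), rep(0, ncol(X)))
  for (it in seq_len(n_iter)) {
    eta <- drop(Xd %*% beta)
    mu <- plogis(eta)
    dmu <- mu * (1 - mu)                         # d mu / d eta for the logit link
    ystar <- qlogis(y)                           # log(y / (1 - y))
    mustar <- digamma(mu * phi) - digamma((1 - mu) * phi)
    v <- trigamma(mu * phi) + trigamma((1 - mu) * phi)
    w <- phi^2 * v * dmu^2                       # working weights
    z <- eta + (ystar - mustar) / (phi * v * dmu)  # working response
    beta_new <- lm.wfit(Xd, z, w)$coefficients   # weighted least-squares step
    converged <- sqrt(sum((beta_new - beta)^2)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

set.seed(1)
X <- matrix(rnorm(1000), 500, 2, dimnames = list(NULL, c("x1", "x2")))
mu <- plogis(X[, 1] - 0.5 * X[, 2])
y <- rbeta(500, mu * 40, (1 - mu) * 40)
fit <- beta_irls_sketch(X, y, phi = 40)
round(fit, 2)
```

Replacing the lm.wfit() call with a weighted glmnet fit on the same working response and weights would give the penalised surrogate that the description refers to.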
Uses gamlss::ri() (L1 penalty) in a gamlss(dist = BE) mean submodel to
select variables. The helper works on complete cases of X/Y, targets the
mean component, and does not yet expose offset handling.
betareg_lasso_gamlss(X, Y, method = c("ML", "GAIC"), k = 2, degf = NULL, lambda = NULL, trace = FALSE)
X |
Numeric matrix (n × p) of mean-submodel predictors. |
Y |
Numeric response in (0,1). Values are squeezed to (0,1) internally. |
method |
|
k |
Penalty multiplier for GAIC when |
degf |
Optional degrees of freedom for the L1 term. |
lambda |
Optional penalty strength. |
trace |
Logical; print stepwise trace. |
Named numeric vector of coefficients (Intercept) + colnames(X),
with 0 for unselected variables.
gamlss::gamlss(), gamlss::ri(), gamlss.dist::BE()
set.seed(1)
X <- matrix(rnorm(300), 100, 3)
Y <- plogis(X[, 1])
Y <- rbeta(100, Y * 30, (1 - Y) * 30)
betareg_lasso_gamlss(X, Y, method = "GAIC", k = 2)
Fits a Beta regression with optional joint selection of the mean and
precision (phi) submodels using betareg::betareg(). The routine performs a
greedy forward/backward search using the requested information criterion and
returns coefficients aligned with the supplied design matrix. By default
(direction_phi = "none") the search covers the mean submodel only; supplying
X_phi together with another direction_phi extends it to the precision
submodel. The selectors require complete cases and do not expose offsets.
Observation weights are passed through to betareg() when provided.
betareg_step_aic(
  X, Y, direction = "both", link = "logit", link.phi = "log", type = "ML",
  trace = FALSE, max_steps = NULL, epsilon = 1e-08, X_phi = NULL,
  direction_phi = c("none", "both", "forward", "backward"), weights = NULL
)
X |
Numeric matrix (n × p) of mean-submodel predictors. |
Y |
Numeric response in (0,1). Values are squeezed to (0,1) internally. |
direction |
Stepwise direction for the mean submodel: |
link |
Link for the mean submodel (passed to |
link.phi |
Link for precision parameter. Default |
type |
Likelihood type for |
trace |
Logical; print stepwise trace. |
max_steps |
Integer; maximum number of greedy steps (default |
epsilon |
Numeric; minimum improvement required to accept a move
(default |
X_phi |
Optional matrix of candidate predictors for the precision (phi)
submodel. When |
direction_phi |
Stepwise direction for the precision submodel.
Defaults to |
weights |
Optional non-negative observation weights passed to
|
Named numeric vector of length p_mean + p_phi + 1 containing the
intercept, mean coefficients, phi-intercept (prefixed by "phi|"), and
phi coefficients (also prefixed by "phi|"). Non-selected variables have
coefficient 0.
set.seed(1)
X <- matrix(rnorm(200), 100, 2)
Y <- plogis(0.5 + X[, 1] - X[, 2])
betareg_step_aic(X, Y)
Y <- rbeta(100, Y * 20, (1 - Y) * 20)
betareg_step_aic(X, Y)
Greedy forward/backward search minimizing AICc computed on betareg fits with
optional precision-submodel selection and observation weights.
betareg_step_aicc(
  X, Y, direction = "both", link = "logit", link.phi = "log", type = "ML",
  trace = FALSE, max_steps = NULL, epsilon = 1e-08, X_phi = NULL,
  direction_phi = c("none", "both", "forward", "backward"), weights = NULL
)
X |
Numeric matrix (n × p) of mean-submodel predictors. |
Y |
Numeric response in (0,1). Values are squeezed to (0,1) internally. |
direction |
Stepwise direction for the mean submodel: |
link |
Link for the mean submodel (passed to |
link.phi |
Link for precision parameter. Default |
type |
Likelihood type for |
trace |
Logical; print stepwise trace. |
max_steps |
Maximum number of greedy steps (default |
epsilon |
Minimal AICc improvement to accept a move. |
X_phi |
Optional matrix of candidate predictors for the precision (phi)
submodel. When |
direction_phi |
Stepwise direction for the precision submodel.
Defaults to |
weights |
Optional non-negative observation weights passed to
|
See betareg_step_aic().
set.seed(1)
X <- matrix(rnorm(400), 100, 4)
Y <- plogis(X[, 1] + 0.5 * X[, 2])
betareg_step_aicc(X, Y)
Y <- rbeta(100, Y * 25, (1 - Y) * 25)
betareg_step_aicc(X, Y)
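For reference, the usual small-sample correction is AICc = AIC + 2k(k + 1)/(n - k - 1), with k estimated parameters and n observations. The sketch below computes it for any fitted model that supports logLik() and nobs(). It assumes this standard formula and is illustrated on an lm() fit; the helper name aicc_sketch is hypothetical, and the package may compute the criterion differently.

```r
# Standard AICc correction applied to a fitted model object
aicc_sketch <- function(fit) {
  k <- attr(stats::logLik(fit), "df")  # number of estimated parameters
  n <- stats::nobs(fit)                # number of observations
  stats::AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}

fit <- lm(dist ~ speed, data = cars)
c(AIC = AIC(fit), AICc = aicc_sketch(fit))
```

The correction term vanishes as n grows relative to k, so AICc and AIC rank models almost identically in large samples.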
Stepwise Beta regression by BIC
betareg_step_bic(
  X, Y, direction = "both", link = "logit", link.phi = "log", type = "ML",
  trace = FALSE, max_steps = NULL, epsilon = 1e-08, X_phi = NULL,
  direction_phi = c("none", "both", "forward", "backward"), weights = NULL
)
X |
Numeric matrix (n × p) of mean-submodel predictors. |
Y |
Numeric response in (0,1). Values are squeezed to (0,1) internally. |
direction |
Stepwise direction for the mean submodel: |
link |
Link for the mean submodel (passed to |
link.phi |
Link for precision parameter. Default |
type |
Likelihood type for |
trace |
Logical; print stepwise trace. |
max_steps |
Integer; maximum number of greedy steps (default |
epsilon |
Numeric; minimum improvement required to accept a move
(default |
X_phi |
Optional matrix of candidate predictors for the precision (phi)
submodel. When |
direction_phi |
Stepwise direction for the precision submodel.
Defaults to |
weights |
Optional non-negative observation weights passed to
|
See betareg_step_aic().
set.seed(1)
X <- matrix(rnorm(300), 100, 3)
Y <- plogis(X[, 1])
betareg_step_bic(X, Y)
Y <- rbeta(100, Y * 30, (1 - Y) * 30)
betareg_step_bic(X, Y)
Bootstraps the dataset B times and records how often each variable is
selected by each selector. Observations containing NA in either X or Y
are removed prior to resampling. Column names are abbreviated internally and
mapped back to the originals in the output just like in
compare_selectors_single().
compare_selectors_bootstrap(X, Y, B = 50, include_enet = TRUE, seed = NULL)
X |
Numeric matrix (n × p) of mean-submodel predictors. |
Y |
Numeric response in (0,1). Values are squeezed to (0,1) internally. |
B |
Number of bootstrap replications. |
include_enet |
Logical; include ENet if |
seed |
Optional RNG seed. |
Long data frame with columns selector, variable, freq in [0,1],
n_success, and n_fail. The freq column reports the share of bootstrap
replicates where a variable was selected by the corresponding selector.
Values near 1 signal high stability whereas small values indicate weak
evidence. n_success counts the successful fits contributing to the
frequency estimate (excluding failed replicates), while n_fail records the
number of unsuccessful fits. A "failures" attribute attached to the
returned data frame lists the replicate indices and messages for any
encountered errors.
set.seed(1)
X <- matrix(rnorm(300), 100, 3)
Y <- plogis(X[, 1])
Y <- rbeta(100, Y * 30, (1 - Y) * 30)
freq <- compare_selectors_bootstrap(X, Y, B = 10, include_enet = FALSE)
head(freq)
subset(freq, freq > 0.8)
# Increase B until the reported frequencies stabilise. For example,
freq_big <- compare_selectors_bootstrap(X, Y, B = 200, include_enet = FALSE)
stats::aggregate(freq ~ selector, freq_big, summary)
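The tallying behind the freq column can be sketched in base R. This is an illustration only: toy_selector is a hypothetical stand-in that zeroes coefficients whose t statistic in a linear model is at most 2 in absolute value, and boot_freq_sketch is not part of the package. Any selector returning a coefficient vector with zeros for unselected variables could be plugged in.

```r
# Hypothetical toy selector: linear model, keep predictors with |t| > 2
toy_selector <- function(X, y) {
  fit <- lm(y ~ X)
  tval <- summary(fit)$coefficients[-1, "t value"]
  cf <- coef(fit)[-1]
  cf[abs(tval) <= 2] <- 0
  names(cf) <- colnames(X)
  cf
}

# Resample rows with replacement and record how often each variable is kept
boot_freq_sketch <- function(X, y, selector, B = 50) {
  sel <- replicate(B, {
    idx <- sample.int(nrow(X), replace = TRUE)
    selector(X[idx, , drop = FALSE], y[idx]) != 0
  })
  rowMeans(sel)  # share of replicates in which each variable was selected
}

set.seed(1)
X <- matrix(rnorm(300), 100, 3, dimnames = list(NULL, paste0("x", 1:3)))
y <- X[, 1] + rnorm(100)
freq <- boot_freq_sketch(X, y, toy_selector, B = 50)
freq
```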
Convenience wrapper that runs AIC/BIC/AICc stepwise, GAMLSS LASSO (and ENet
when available), and the pure glmnet IRLS selector, then collates coefficients
into a long table for comparison. Observations containing NA in either X
or Y are removed prior to fitting. Column names are temporarily shortened
to satisfy selector requirements and avoid clashes; the outputs remap them to
the original labels before returning so the reported variables always match
the input design.
compare_selectors_single(X, Y, include_enet = TRUE)
X |
Numeric matrix (n × p) of mean-submodel predictors. |
Y |
Numeric response in (0,1). Values are squeezed to (0,1) internally. |
include_enet |
Logical; include ENet if |
A list with:
Named coefficient vectors for each selector.
Long data frame with columns selector, variable, coef, selected.
set.seed(1)
X <- matrix(rnorm(300), 100, 3)
Y <- plogis(X[, 1])
Y <- rbeta(100, Y * 30, (1 - Y) * 30)
single <- compare_selectors_single(X, Y, include_enet = FALSE)
head(single$table)
Merge single-run results and bootstrap frequencies
compare_table(single_tab, freq_tab = NULL)
single_tab |
Data frame returned in |
freq_tab |
Optional frequency table from |
Merged data frame.
single_tab <- data.frame(
  selector = rep(c("AIC", "BIC"), each = 3),
  variable = rep(paste0("x", 1:3), times = 2),
  coef = c(0.5, 0, -0.2, 0.6, 0.1, -0.3)
)
single_tab$selected <- single_tab$coef != 0
freq_tab <- data.frame(
  selector = rep(c("AIC", "BIC"), each = 3),
  variable = rep(paste0("x", 1:3), times = 2),
  freq = c(0.9, 0.15, 0.4, 0.85, 0.3, 0.25)
)
compare_table(single_tab, freq_tab)
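Conceptually, the merge amounts to a keyed join on the selector and variable columns. The base-R sketch below assumes a left join that keeps every row of the single-run table. The exact join and column order used by compare_table() are not stated here, so treat this as an approximation.

```r
single_tab <- data.frame(
  selector = rep(c("AIC", "BIC"), each = 3),
  variable = rep(paste0("x", 1:3), times = 2),
  coef = c(0.5, 0, -0.2, 0.6, 0.1, -0.3)
)
single_tab$selected <- single_tab$coef != 0
freq_tab <- data.frame(
  selector = rep(c("AIC", "BIC"), each = 3),
  variable = rep(paste0("x", 1:3), times = 2),
  freq = c(0.9, 0.15, 0.4, 0.85, 0.3, 0.25)
)

# Left join: keep all single-run rows, attach bootstrap frequencies if present
merged <- merge(single_tab, freq_tab, by = c("selector", "variable"), all.x = TRUE)
merged
```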
Repeats selection on interval-valued responses by sampling a pseudo-response
from each interval (uniformly or midpoint), tallying variable selection
frequencies across B replicates.
fastboost_interval(
  X, Y_low, Y_high, func, B = 100, sample = c("uniform", "midpoint"),
  version = "glmnet", use.parallel = FALSE, seed = NULL, ...
)
X |
Numeric matrix (n × p). |
Y_low, Y_high |
Interval bounds in [0,1]. Rows with missing bounds are dropped. |
func |
Function |
B |
Number of interval resamples. |
sample |
|
version |
Ignored (reserved for future). |
use.parallel |
Use |
seed |
Optional RNG seed. Scoped via |
... |
Extra args forwarded to |
A list with:
B × (p+1) matrix of coefficients over replicates.
Named vector of selection frequencies for each predictor.
# suppose you have interval data (Y_low, Y_high)
set.seed(1)
n <- 120; p <- 6
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
mu <- plogis(X[, 1] - 0.5 * X[, 2]); Y <- rbeta(n, mu * 25, (1 - mu) * 25)
Y_low <- pmax(0, Y - 0.05); Y_high <- pmin(1, Y + 0.05)
fb <- fastboost_interval(X, Y_low, Y_high,
  func = function(X, y) betareg_glmnet(X, y, choose = "bic", prestandardize = TRUE),
  B = 40)
sort(fb$freq, decreasing = TRUE)
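The two sampling schemes named in the description can be sketched as follows. The helper sample_interval_sketch is hypothetical and only illustrates how one pseudo-response per row might be drawn from the bounds; the package's internal handling (for example squeezing or missing bounds) is not reproduced.

```r
# Draw one pseudo-response per interval, uniformly or at the midpoint
sample_interval_sketch <- function(y_low, y_high, sample = c("uniform", "midpoint")) {
  sample <- match.arg(sample)
  if (sample == "midpoint") {
    (y_low + y_high) / 2
  } else {
    stats::runif(length(y_low), min = y_low, max = y_high)
  }
}

set.seed(1)
y_low <- c(0.10, 0.40, 0.70)
y_high <- c(0.20, 0.55, 0.90)
y_mid <- sample_interval_sketch(y_low, y_high, "midpoint")
y_uni <- sample_interval_sketch(y_low, y_high, "uniform")
```

With "midpoint" every replicate uses the same pseudo-response, so variation across replicates comes only from the selector itself; "uniform" adds a fresh draw per replicate.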
Visual comparison of coefficients returned by each selector. Uses ggplot2 when it is installed and falls back to a base R image otherwise.
plot_compare_coeff(single_tab)
single_tab |
Data frame as returned by |
A ggplot object when ggplot2 is available; otherwise draws a base R image.
demo_tab <- data.frame(
  selector = rep(c("AIC", "BIC"), each = 3),
  variable = rep(paste0("x", 1:3), times = 2),
  coef = c(0.6, 0, -0.2, 0.55, 0.05, -0.3)
)
demo_tab$selected <- demo_tab$coef != 0
plot_compare_coeff(demo_tab)
Visual comparison of bootstrap selection frequencies by selector. Uses ggplot2 when it is installed and falls back to a base R image otherwise.
plot_compare_freq(freq_tab)
freq_tab |
Data frame as returned by |
A ggplot object when ggplot2 is available; otherwise draws a base R image.
freq_tab <- data.frame(
  selector = rep(c("AIC", "BIC"), each = 3),
  variable = rep(paste0("x", 1:3), times = 2),
  freq = c(0.85, 0.2, 0.45, 0.75, 0.35, 0.3)
)
plot_compare_freq(freq_tab)
Apply a selector to a collection of resampled designs
sb_apply_selector_manual(X_norm, resamples, Y, selector, ..., keep_template = TRUE)
X_norm |
Normalised design matrix. |
resamples |
List of matrices returned by |
Y |
Numeric response. |
selector |
Variable-selection routine; function or character string. If it is a function, the selector name should be added as the fun.name attribute. |
... |
Extra arguments passed to the selector. |
keep_template |
Logical; when |
A numeric matrix of coefficients with one column per resample.
sb_beta() orchestrates all SelectBoost stages—normalisation, correlation
analysis, grouping, correlated resampling, and stability tallying—while using
the beta-regression selectors provided by this package. It can operate on
point-valued or interval-valued responses and automatically squeezes the
outcome into (0, 1) unless instructed otherwise.
sb_beta(
  X, Y = NULL, selector = betareg_step_aic, corrfunc = "cor", B = 100,
  step.num = 0.1, steps.seq = NULL, version = c("glmnet", "lars"),
  squeeze = TRUE, use.parallel = FALSE, seed = NULL, verbose = FALSE,
  threshold = 1e-04, interval = c("none", "uniform", "midpoint"),
  Y_low = NULL, Y_high = NULL, ...
)
X |
Numeric design matrix. Coerced with |
Y |
Numeric response vector. Values are squeezed to the open unit
interval with the standard SelectBoost transformation unless |
selector |
Selection routine. Defaults to |
corrfunc |
Correlation function passed to |
B |
Number of replicates to generate. |
step.num |
Step length for the automatically generated |
steps.seq |
Optional user-supplied grid of absolute correlation thresholds. |
version |
Either |
squeeze |
Logical; ensure the response lies in |
use.parallel |
Logical; enable parallel resampling and selector fits when supported by the current R session. |
seed |
Optional integer seed for reproducibility. The seed is scoped via
|
verbose |
Logical; emit progress messages. |
threshold |
Numeric tolerance for considering a coefficient selected. |
interval |
Interval-resampling mode: |
Y_low, Y_high |
Interval bounds in |
... |
Additional arguments forwarded to |
The returned object carries a rich set of attributes:
"c0.seq" – the grid of absolute-correlation thresholds explored during
resampling.
"steps.seq" – the raw sequence (if any) used to construct the grid.
"selector" – the selector identifier (function name or expression).
"B" – number of resampled designs passed to the selector.
"interval" – the interval sampling mode ("none", "uniform", or
"midpoint").
"resample_diagnostics" – per-threshold data frames with summary
statistics on the cached correlated draws.
These attributes mirror the historical SelectBoost beta implementation so the object can be consumed by existing plotting and reporting utilities.
Matrix of selection frequencies with one row per c0 level and class
"sb_beta". See Details for the recorded attributes.
set.seed(42)
sim <- simulation_DATA.beta(n = 80, p = 4, s = 2)
# increase B for real applications
res <- sb_beta(sim$X, sim$Y, B = 5)
res
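Squeezing a response into the open unit interval is commonly done with the transformation (y (n - 1) + 0.5) / n. The sketch below assumes that formula; whether sb_beta() uses exactly this variant is not stated here, and the helper name squeeze_sketch is hypothetical.

```r
# Map values in [0, 1] into (0, 1); assumes the (y * (n - 1) + 0.5) / n rule
squeeze_sketch <- function(y) {
  n <- length(y)
  (y * (n - 1) + 0.5) / n
}

y <- c(0, 0.25, 1)
y_sq <- squeeze_sketch(y)
y_sq
```

Boundary values matter because the Beta density is not defined at exactly 0 or 1, so unsqueezed data with such values would make the likelihood fail.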
sb_beta_interval() forwards to sb_beta() while activating interval sampling
so that beta-regression SelectBoost runs can ingest lower/upper response
bounds directly. It mirrors fastboost_interval() but reuses the correlated
resampling pipeline of sb_beta().
sb_beta_interval(X, Y_low, Y_high, selector = betareg_step_aic, sample = c("uniform", "midpoint"), Y = NULL, ...)
X |
Numeric design matrix. Coerced with |
Y_low, Y_high |
Interval bounds in |
selector |
Selection routine. Defaults to |
sample |
Interval sampling scheme passed to the |
Y |
Optional point-valued response. Supply it when you wish to keep the
observed mean response but still resample within |
... |
Additional arguments forwarded to |
See sb_beta(). The returned object carries the same
"sb_beta"-class attributes describing the correlation thresholds,
resampling diagnostics, selector, and number of replicates.
set.seed(1)
sim <- simulation_DATA.beta(n = 120, p = 5, s = 2)
y_low <- pmax(sim$Y - 0.05, 0)
y_high <- pmin(sim$Y + 0.05, 1)
interval_fit <- sb_beta_interval(
  sim$X, Y_low = y_low, Y_high = y_high, B = 5, step.num = 0.4
)
attr(interval_fit, "interval")
sb_beta() results
These S3 helpers make it easier to inspect and visualise the
correlation-threshold grid returned by sb_beta(). They surface the stored
attributes, reshape the selection frequencies into tidy summaries, and produce
quick ggplot2 visualisations for interactive use.
## S3 method for class 'sb_beta'
print(x, digits = 3, ...)
## S3 method for class 'sb_beta'
summary(object, ...)
## S3 method for class 'summary.sb_beta'
print(x, digits = 3, n = 10, ...)
autoplot.sb_beta(object, variables = NULL, ...)
x, object |
An object of class |
digits |
Number of decimal places to display when printing. |
... |
Additional arguments passed on to lower-level methods. |
n |
Number of rows to show from the summary table when printing. |
variables |
Optional character vector of variables to retain in the plotted output. |
summary.sb_beta() returns an object of class summary.sb_beta
containing a tidy data frame of selection frequencies. The plotting and
printing methods are invoked for their side effects and return the input
object invisibly.
set.seed(42)
sim <- simulation_DATA.beta(n = 50, p = 4, s = 2)
fit <- sb_beta(sim$X, sim$Y, B = 5, step.num = 0.5)
print(fit)
summary(fit)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  autoplot.sb_beta(fit)
}
These helpers expose the individual stages of the SelectBoost workflow so
that beta-regression selectors can be combined with correlation-aware
resampling directly from SelectBoost.beta. They normalise the design matrix,
derive correlation structures, form groups of correlated predictors, generate
Gaussian surrogates that mimic the observed dependency structure, and apply a
user-provided selector on each resampled design.
sb_normalize(X, center = NULL, scale = NULL, eps = 1e-08)
sb_compute_corr(X, corrfunc = "cor")
sb_group_variables(corr_mat, c0)
X |
Numeric matrix of predictors. |
center |
Optional centering vector recycled to the number of columns.
Defaults to the column means of |
scale |
Optional scaling vector recycled to the number of columns.
Defaults to the column-wise |
eps |
Small positive constant used when normalising columns. |
corrfunc |
Function or character string used to compute pairwise
associations. Defaults to |
corr_mat |
Numeric matrix of associations. |
c0 |
Threshold applied to the absolute correlations. |
sb_normalize() returns a centred and scaled copy of X.
sb_compute_corr() returns the association matrix.
sb_group_variables() returns a list of integer vectors, one per
variable, describing the correlated group it belongs to.
sb_normalize(matrix(rnorm(20), 5))
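The three stages (normalise, correlate, group) can be sketched in base R. The helpers normalize_sketch and group_sketch are hypothetical. Scaling to unit Euclidean norm and the >= comparison against c0 are assumptions of this sketch rather than documented behaviour.

```r
# Centre each column and scale it to unit Euclidean norm (assumed scaling)
normalize_sketch <- function(X) {
  Xc <- sweep(X, 2, colMeans(X))
  sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")
}

# For each variable, list the variables whose |correlation| reaches c0
group_sketch <- function(corr_mat, c0) {
  lapply(seq_len(ncol(corr_mat)), function(j) which(abs(corr_mat[j, ]) >= c0))
}

set.seed(1)
z <- rnorm(50)
X <- cbind(a = z + rnorm(50, sd = 0.1), b = z + rnorm(50, sd = 0.1), c = rnorm(50))
Xn <- normalize_sketch(X)
groups <- group_sketch(cor(Xn), c0 = 0.8)
groups
```

Each variable always belongs to its own group because its self-correlation is 1, so a variable with no strong partners ends up in a group of size one.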
Generate correlated design replicates for a set of groups
sb_resample_groups(X_norm, groups, B = 100, jitter = 1e-06, seed = NULL, use.parallel = FALSE, cache = NULL)
X_norm |
Normalised design matrix. |
groups |
Correlation structure. Either a list as returned by
|
B |
Number of replicates to generate. |
jitter |
Numeric value added to covariance diagonals for stability. |
seed |
Optional integer seed for reproducibility. The seed is scoped via
|
use.parallel |
Logical; when |
cache |
Optional environment or named list used to cache previously generated surrogates. Passing the same cache across calls reuses draws for identical groups. |
When every group has size one (no correlated variables) the function
simply returns B copies of X_norm. A warning is issued in that situation
so downstream code can avoid mistaking the replicated designs for genuinely
resampled surrogates. The covariance matrices underpinning each correlated
draw are cached in the supplied cache environment; reusing the environment
across calls lets sb_resample_groups() skip redundant covariance
decompositions for identical groups and speeds up iterative workflows.
An object of class sb_resamples, i.e. a list of length B whose
elements are resampled design matrices. The object exposes per-group
diagnostics in its "diagnostics" attribute and returns the cache via the
"cache" attribute for reuse.
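One way to picture a correlated Gaussian draw for a single group is shown below. This is a sketch under stated assumptions, not the package's algorithm: it takes the sample covariance of the grouped columns, adds jitter to the diagonal, and multiplies standard normal draws by the Cholesky factor. The helper name resample_group_sketch is hypothetical, and caching and diagnostics are omitted.

```r
# Replace the columns in `group` by Gaussian surrogates with matching covariance
resample_group_sketch <- function(X_norm, group, jitter = 1e-6) {
  n <- nrow(X_norm)
  k <- length(group)
  Sigma <- stats::cov(X_norm[, group, drop = FALSE]) + diag(jitter, k)
  Z <- matrix(stats::rnorm(n * k), n, k) %*% chol(Sigma)  # cov(Z) is about Sigma
  out <- X_norm
  out[, group] <- Z
  out
}

set.seed(1)
z <- rnorm(200)
X <- cbind(z + rnorm(200, sd = 0.1), z + rnorm(200, sd = 0.1), rnorm(200))
Xn <- scale(X)
surrogate <- resample_group_sketch(Xn, group = c(1, 2))
cor(surrogate[, 1], surrogate[, 2])
```

The Cholesky factor depends only on the group's covariance, which is why caching it per group, as the text describes, avoids repeated decompositions.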
Compute selection frequencies from coefficient paths
sb_selection_frequency(coef_matrix, version = c("glmnet", "lars"), threshold = 1e-04)
coef_matrix |
Matrix produced by |
version |
Either |
threshold |
Coefficients with absolute value below this threshold are treated as zero. |
Numeric vector of selection frequencies.
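The thresholding rule can be illustrated on a small coefficient matrix with one column per resample. The helper selection_frequency_sketch is hypothetical; it assumes row names are present and that the intercept row is excluded from the tally, and it ignores the version argument.

```r
# Share of resamples in which |coefficient| exceeds the threshold
selection_frequency_sketch <- function(coef_matrix, threshold = 1e-4) {
  keep <- rownames(coef_matrix) != "(Intercept)"  # assumed: intercept not tallied
  rowMeans(abs(coef_matrix[keep, , drop = FALSE]) > threshold)
}

coefs <- cbind(
  c(0.3, 1.2, 0,    0.5),
  c(0.2, 0.9, 0,    0),
  c(0.4, 1.1, 1e-5, 0),
  c(0.1, 1.0, 0,    0.3)
)
rownames(coefs) <- c("(Intercept)", "x1", "x2", "x3")
sel_freq <- selection_frequency_sketch(coefs)
sel_freq
# x1 = 1, x2 = 0, x3 = 0.5
```

The 1e-5 entry for x2 falls below the default threshold and therefore counts as zero.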
Simulate interval Beta-regression data (flexible)
simulation_DATA.beta(
  n, p, s = min(5L, p), beta_size = 1, a0 = 0,
  X_dist = c("gaussian", "t", "bernoulli"), corr = c("indep", "ar1", "block"),
  rho = 0, block_size = 5L, df = 5, prob = 0.5, active_idx = NULL, phi = 20,
  mechanism = c("jitter", "quantile", "mixed"), mix_prob = 0.5,
  delta = 0.05, delta_low = NULL, delta_high = NULL,
  alpha = 0.1, alpha_low = NULL, alpha_high = NULL,
  na_rate = 0, na_side = c("random", "left", "right"),
  centerX = FALSE, scaleX = FALSE, seed = NULL
)
n, p |
Sample size and number of predictors. |
s |
Number of active (nonzero) coefficients. |
beta_size |
Scalar (applied with alternating ± signs) or numeric vector of length at least s. |
a0 |
Intercept (logit scale). |
X_dist |
Distribution for X: |
corr |
Correlation structure: |
rho |
AR(1) correlation or within-block correlation. |
block_size |
Block size when |
df |
Degrees of freedom for |
prob |
Success prob for |
active_idx |
Optional integer vector of active feature indices (length s). If NULL, uses 1:s. |
phi |
Precision parameter: scalar, length-n vector, or function |
mechanism |
Interval mechanism per row: |
mix_prob |
Probability of jitter when |
delta |
Symmetric jitter half-width (scalar / vector / function). |
delta_low, delta_high |
Asymmetric jitter widths (override |
alpha |
Miscoverage for quantile intervals (scalar / vector / function). |
alpha_low, alpha_high |
Asymmetric miscoverage (override |
na_rate |
Fraction of rows with a missing bound (default 0). |
na_side |
Which bound to drop: |
centerX, scaleX |
Whether to center/scale X before returning. |
seed |
RNG seed. |
list with X, Y, Y_low, Y_high, mu, beta, a0, phi, info, settings.
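The core of the simplest configuration (independent Gaussian predictors, logit link, symmetric jitter intervals) can be sketched in a few lines. The helper simulate_beta_interval_sketch is hypothetical and covers only this case. Correlation structures, quantile intervals, and missing bounds are left out.

```r
# Beta responses in the mean/precision parameterisation with jitter intervals
simulate_beta_interval_sketch <- function(n, p, s = 2, phi = 20, delta = 0.05, a0 = 0) {
  X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", seq_len(p))))
  beta <- c(rep(c(1, -1), length.out = s), rep(0, p - s))  # alternating signs
  mu <- plogis(a0 + drop(X %*% beta))                      # mean on (0, 1)
  Y <- rbeta(n, mu * phi, (1 - mu) * phi)                  # shape1, shape2
  list(
    X = X, Y = Y,
    Y_low = pmax(Y - delta, 0), Y_high = pmin(Y + delta, 1),
    mu = mu, beta = beta
  )
}

set.seed(1)
sim_sketch <- simulate_beta_interval_sketch(n = 100, p = 4, s = 2)
str(sim_sketch, max.level = 1)
```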