This vignette details the simulations tools provided with the selectboost package by providing five examples of use.
If you are a Linux/Unix or a Macos user, you can install a version of
SelectBoost with support for doMC
from github with:
We want to creates NDatasets = 200 datasets with length(group) = 10 variables and N = 10 observations. In that example we want 9 groups:
The correlation structure of the explanatory variables of the dataset
is provided by group
and the intra-group Pearson
correlation value for each of the groups by cor_group
. A
value must be provided even for single variable groups and the number of
variables is length of the group
vector. Use the
simulation_cor
function to create the correlation matrix
(CM
).
Then generation of an explanatory dataset with N = 10 observations is made by the
simulation_X
function.
A response can now be added to the dataset by the
simulation_Data
function. We have to specifiy the support
of the response, i.e. the explanatory variables that will be used in the
linear model created to compute the response. The support is given by
the supp
vector whose entries are either 0 or 1. The
length of the supp
vector must be equal to the number of
explanatory variables and if the ientry is equal to 1, it means that the ivariable will be used to derive the
response value, whereas if the ientry is equal to 0, it means that the ivariable will not be used to derive
the response value (beta<-rep(0,length(supp))
). The
values of the coefficients for the explanatory variables that are in the
support of the response are random (either absolute value and sign) and
given by
beta[which(supp==1)]<-runif(sum(supp),minB,maxB)*(rbinom(sum(supp),1,.5)*2-1)
.
Hence, the user can specify their minimal absolute value with the
minB
option and their maximal absolute value with the
maxB
option. The stn
option is a scaling
factor for the noise added to the response vector
((t(beta)%*%var(X)%*%beta)/stn
, with X
the
matrix of explanatory variables). The higher the stn
value,
the smaller the noise: for instance for a given X
dataset,
an stn
value α
times larger will result in a noise exactly $\sqrt{\alpha}$ times smaller.
To generate multiple datasets, repeat steps 2 and 3, for instance use
a for
loop. We create NDatasets = 200
datasets and assign them to the objects DATA_exemple1_nb_1
to DATA_exemple1_nb_200
.
set.seed(3141)
NDatasets=200
for(i in 1:NDatasets){
X<-simulation_X(N,CM)
assign(paste("DATA_exemple1_nb_",i,sep=""),simulation_DATA(X,supp,minB,maxB,stn))
}
We now check the correlation structure of the explanatory variable. First we compute the mean correlation matrix.
corr_sum=matrix(0,length(group),length(group))
for(i in 1:NDatasets){
corr_sum=corr_sum+cor(get(paste("DATA_exemple1_nb_",i,sep=""))$X)
}
corr_mean=corr_sum/NDatasets
Then we display and plot that the mean correlation matrix.
coef_sum=rep(0,length(group))
names(coef_sum)<-paste("x",1:length(group),sep="")
error_counter=0
for(i in 1:NDatasets){
tempdf=data.frame(cbind(Y=get(paste("DATA_exemple1_nb_",i,sep=""))$Y,
get(paste("DATA_exemple1_nb_",i,sep=""))$X))
tempcoef=coef(lm(Y~.-1,data=tempdf))
if(is.null(tempcoef)){
cat("Error in lm fit, skip coefficients\n")
error_counter=error_counter+1
} else{
coef_sum=coef_sum+abs(tempcoef)
}
}
error_counter
coef_mean=coef_sum/NDatasets
All fits were sucessful. Then we display and plot that the mean coefficient vector values.
Reduce the noise in the response for the new responses by a factor $\sqrt{5000/50}=10$. 1/stn ⋅ βsupporttVar(X)βsupport where βsupport is the vector of coefficients wh
set.seed(3141)
stn <- 5000
for(i in 1:NDatasets){
X<-simulation_X(N,CM)
assign(paste("DATA_exemple1_bis_nb_",i,sep=""),simulation_DATA(X,supp,minB,maxB,stn))
}
Since it is the same explanatory dataset for response generation, we can compare the σ between those NDatasets = 200 datasets.
stn_ratios=rep(0,NDatasets)
for(i in 1:NDatasets){
stn_ratios[i]<-get(paste("DATA_exemple1_nb_",i,sep=""))$sigma/
get(paste("DATA_exemple1_bis_nb_",i,sep=""))$sigma
}
all(sapply(stn_ratios,all.equal,10))
All the ratios are equal to 10 as anticipated.
Since, the correlation structure is the same as before, we do not
need to check it again. As befor, we infer the coefficients values of a
linear model using the lm
function.
coef_sum_bis=rep(0,length(group))
names(coef_sum_bis)<-paste("x",1:length(group),sep="")
error_counter_bis=0
for(i in 1:NDatasets){
tempdf=data.frame(cbind(Y=get(paste("DATA_exemple1_bis_nb_",i,sep=""))$Y,
get(paste("DATA_exemple1_bis_nb_",i,sep=""))$X))
tempcoef=coef(lm(Y~.-1,data=tempdf))
if(is.null(tempcoef)){
cat("Error in lm fit, skip coefficients\n")
error_counter_bis=error_counte_bisr+1
} else{
coef_sum_bis=coef_sum_bis+abs(tempcoef)
}
}
error_counter_bis
coef_mean_bis=coef_sum_bis/NDatasets
All fits were sucessful. Then we display and plot that the mean
coefficient vector values. As expected, the noise reduction enhances the
retrieval of the true
mean coefficient absolute values by
the models.
The simulation process looks sucessfull. What are the confidence indices for those variables?
We want to creates NDatasets = 200 datasets with length(group) = 50 variables and N = 20 observations. In that example we want 1 group:
We now check the correlation structure of the explanatory variable. First we compute the mean correlation matrix.
corr_sum=matrix(0,length(group),length(group))
for(i in 1:NDatasets){
corr_sum=corr_sum+cor(get(paste("DATA_exemple2_nb_",i,sep=""))$X)
}
corr_mean=corr_sum/NDatasets
Then we display and plot that the mean correlation matrix.
coef_sum=rep(0,length(group))
coef_lasso_sum=rep(0,length(group))
names(coef_sum)<-paste("x",1:length(group),sep="")
names(coef_lasso_sum)<-paste("x",1:length(group),sep="")
error_counter=0
for(i in 1:NDatasets){
tempdf=data.frame(cbind(Y=get(paste("DATA_exemple2_nb_",i,sep=""))$Y,
get(paste("DATA_exemple2_nb_",i,sep=""))$X))
tempcoef=coef(lm(Y~.-1,data=tempdf))
require(lars)
lasso.1 <- lars::lars(x=get(paste("DATA_exemple2_nb_",i,sep=""))$X,
y=get(paste("DATA_exemple2_nb_",i,sep=""))$Y, type="lasso",
trace=FALSE, normalize=FALSE, intercept = FALSE)
# cv.lars() uses crossvalidation to estimate optimal position in path
cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple2_nb_",i,sep=""))$X,
y=get(paste("DATA_exemple2_nb_",i,sep=""))$Y,
plot.it=FALSE, type="lasso")
# Use the "+1SE rule" to find best model:
# Take the min CV and add its SE ("limit").
# Find smallest model that has its own CV within this limit (at "s.cv.1")
limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)]
s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))]
# Print out coefficients at optimal s.
coef_lasso_sum=coef_lasso_sum+abs(coef(lasso.1, s=s.cv.1, mode="fraction"))
if(is.null(tempcoef)){
cat("Error in lm fit, skip coefficients\n")
error_counter=error_counter+1
} else{
coef_sum=coef_sum+abs(tempcoef)
}
}
error_counter
coef_mean=coef_sum/NDatasets
coef_lasso_mean=coef_lasso_sum/NDatasets
With regular least squares and lasso estimators all fits were sucessful, yet only 20 variables coefficients could be estimated with regular least squares estimates for the linear model. Then we display and plot that the mean coefficient vector values for the least squares estimates.
coef_lasso_mean
barplot(coef_lasso_mean,ylim=c(0,1.5))
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
The simulation process looks sucessfull: the lasso estimates retrives mostly the correct variables, yet the other ones are also selected sometimes. What are the confidence indices for those variables?
We want to creates NDatasets = 200
datasets with length(supp) = 100
variables and N = 24
observations. In that example we use real data for the X
variables that we sample from all the 1650 probesets that are differentially
expressed between the two conditions US and S. The main interest of that
simulation is that the correlation structure of the X
dataset will be a real one.
First retrieve the datasets and get the differentially expressed probesets. Run the code to get additionnal plots.
require(CascadeData)
data(micro_S)
data(micro_US)
require(Cascade)
micro_US<-as.micro_array(micro_US,c(60,90,240,390),6)
micro_S<-as.micro_array(micro_S,c(60,90,240,390),6)
S<-geneSelection(list(micro_S,micro_US),list("condition",c(1,2),1),-1)
Sel<-micro_S@microarray[S@name,]
Generates the datasets sampling for each of them 100 probesets expressions among the 1650 that were selected and linking the response to the expressions of the first five probesets.
Here are the plots of an example of correlation structure, namely for
DATA_exemple3_nb_200$X
. Run the code to get the
graphics.
coef_sum=rep(0,length(supp))
coef_lasso_sum=rep(0,length(supp))
names(coef_sum)<-paste("x",1:length(supp),sep="")
names(coef_lasso_sum)<-paste("x",1:length(supp),sep="")
error_counter=0
for(i in 1:NDatasets){
tempdf=data.frame(cbind(Y=get(paste("DATA_exemple3_nb_",i,sep=""))$Y,
get(paste("DATA_exemple3_nb_",i,sep=""))$X))
tempcoef=coef(lm(Y~.-1,data=tempdf))
require(lars)
lasso.1 <- lars::lars(x=get(paste("DATA_exemple3_nb_",i,sep=""))$X,
y=get(paste("DATA_exemple3_nb_",i,sep=""))$Y, type="lasso",
trace=FALSE, normalize=FALSE, intercept = FALSE)
# cv.lars() uses crossvalidation to estimate optimal position in path
cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple3_nb_",i,sep=""))$X,
y=get(paste("DATA_exemple3_nb_",i,sep=""))$Y,
plot.it=FALSE, normalize=FALSE, intercept = FALSE, type="lasso")
# Use the "+1SE rule" to find best model:
# Take the min CV and add its SE ("limit").
# Find smallest model that has its own CV within this limit (at "s.cv.1")
limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)]
s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))]
# Print out coefficients at optimal s.
coef_lasso_sum=coef_lasso_sum+abs(coef(lasso.1, s=s.cv.1, mode="fraction"))
if(is.null(tempcoef)){
cat("Error in lm fit, skip coefficients\n")
error_counter=error_counter+1
} else{
coef_sum=coef_sum+abs(tempcoef)
}
}
error_counter
coef_mean=coef_sum/NDatasets
coef_lasso_mean=coef_lasso_sum/NDatasets
With regular least squares and lasso estimators all fits were sucessful, yet only 20 variables coefficients could be estimated with regular least squares estimates for the linear model. Then we display and plot that the mean coefficient vector values for the least squares estimates.
coef_lasso_mean
barplot(coef_lasso_mean,ylim=c(0,1.5))
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
The simulation process looks sucessfull: the lasso estimates retrives mostly the correct variables, yet the other ones are also selected sometimes. What are the confidence indices for those variables?
We want to creates NDatasets = 101
datasets with length(supp) = 100
variables and N = 18
observations. In that example we use real data for the variables that
are the 101 probesets that are the more
differentially expressed between the two conditions US and S. We create
101 datasets by leaving one of the
variables out each time and using it as the response that shall be
predicted. We also only use for the explanatory variables the
observations that are the measurements for the 1st, 2nd and 3rd
timepoints and for the responses the observations that are the
measurements of the same variables for the 2nd, 3rd and 4th timepoints.
The main interest of that simulation is that the correlation structure
of the X
dataset will be a real one and that it is a
typical setting for cascade network reverse-engineering in genomics or
proteomics, see the Cascade
package for more details.
First retrieve the datasets and get the differentially expressed probesets. Run the code to get additionnal plots.
require(CascadeData)
data(micro_S)
data(micro_US)
require(Cascade)
micro_US<-as.micro_array(micro_US,c(60,90,240,390),6)
micro_S<-as.micro_array(micro_S,c(60,90,240,390),6)
S<-geneSelection(list(micro_S,micro_US),list("condition",c(1,2),1),101)
Sel<-micro_S@microarray[S@name,]
suppt<-rep(1:4,6)
supp<-c(1,1,1,1,1,rep(0,95)) #not used since we use one of the probeset expressions as response
minB<-1 #not used since we use one of the probeset expressions as response
maxB<-2 #not used since we use one of the probeset expressions as response
stn<-50 #not used since we use one of the probeset expressions as response
NDatasets<-101
set.seed(3141)
for(i in 1:NDatasets){
#the explanatory variables are the values for the 1st, 2nd and 3rd timepoints
X<-t(as.matrix(Sel[-i,suppt!=4]))
Xnorm<-t(t(X)/sqrt(colSums(X*X)))
DATA<-simulation_DATA(Xnorm,supp,minB,maxB,stn)
#the reponses are the values for the 2nd, 3rd and 4th timepoints
DATA$Y<-as.vector(t(Sel[i,suppt!=1]))
assign(paste("DATA_exemple4_nb_",i,sep=""),DATA)
rm(DATA)
}
Here are the plots of an example of correlation structure, namely for
DATA_exemple3_nb_200$X
. Run the code to get the
graphics.
coef_sum=rep(0,length(supp)+1)
coef_lasso_sum=rep(0,length(supp)+1)
names(coef_sum)<-paste("x",1:(length(supp)+1),sep="")
names(coef_lasso_sum)<-paste("x",1:(length(supp)+1),sep="")
error_counter=0
for(i in 1:NDatasets){
tempdf=data.frame(cbind(Y=get(paste("DATA_exemple4_nb_",i,sep=""))$Y,
get(paste("DATA_exemple4_nb_",i,sep=""))$X))
tempcoef=coef(lm(Y~.-1,data=tempdf))
require(lars)
lasso.1 <- lars::lars(x=get(paste("DATA_exemple4_nb_",i,sep=""))$X,
y=get(paste("DATA_exemple4_nb_",i,sep=""))$Y, type="lasso",
trace=FALSE, normalize=FALSE, intercept = FALSE)
# cv.lars() uses crossvalidation to estimate optimal position in path
cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple4_nb_",i,sep=""))$X,
y=get(paste("DATA_exemple4_nb_",i,sep=""))$Y,
plot.it=FALSE, normalize=FALSE, intercept = FALSE, type="lasso")
# Use the "+1SE rule" to find best model:
# Take the min CV and add its SE ("limit").
# Find smallest model that has its own CV within this limit (at "s.cv.1")
limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)]
s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))]
# Print out coefficients at optimal s.
coef_lasso_sum[-i]=coef_lasso_sum[-i]+abs(coef(lasso.1, s=s.cv.1, mode="fraction"))
if(is.null(tempcoef)){
cat("Error in lm fit, skip coefficients\n")
error_counter=error_counter+1
} else{
coef_sum[-i]=coef_sum[-i]+abs(tempcoef)
}
}
error_counter
coef_mean=coef_sum/NDatasets
coef_lasso_mean=coef_lasso_sum/NDatasets
With regular least squares and lasso estimators all fits were sucessful, yet only 20 variables coefficients could be estimated with regular least squares estimates for the linear model. Then we display and plot that the mean coefficient vector values for the least squares estimates.
Some probesets seem explanatory for many other ones (=hubs). What are the confidence indices for those variables?
We want to creates NDatasets = 200 datasets with length(group) = 500 variables and N = 25 observations. In that example we want 1 group:
We now check the correlation structure of the explanatory variable. First we compute the mean correlation matrix.
corr_sum=matrix(0,length(group),length(group))
for(i in 1:NDatasets){
corr_sum=corr_sum+cor(get(paste("DATA_exemple5_nb_",i,sep=""))$X)
}
corr_mean=corr_sum/NDatasets
Then we display and plot that the mean correlation matrix.
coef_sum=rep(0,length(group))
coef_lasso_sum=rep(0,length(group))
names(coef_sum)<-paste("x",1:length(group),sep="")
names(coef_lasso_sum)<-paste("x",1:length(group),sep="")
error_counter=0
for(i in 1:NDatasets){
tempdf=data.frame(cbind(Y=get(paste("DATA_exemple5_nb_",i,sep=""))$Y,
get(paste("DATA_exemple5_nb_",i,sep=""))$X))
tempcoef=coef(lm(Y~.-1,data=tempdf))
require(lars)
lasso.1 <- lars::lars(x=get(paste("DATA_exemple5_nb_",i,sep=""))$X,
y=get(paste("DATA_exemple5_nb_",i,sep=""))$Y, type="lasso",
trace=FALSE, normalize=FALSE, intercept = FALSE)
# cv.lars() uses crossvalidation to estimate optimal position in path
cv.lasso.1 <- lars::cv.lars(x=get(paste("DATA_exemple5_nb_",i,sep=""))$X,
y=get(paste("DATA_exemple5_nb_",i,sep=""))$Y,
plot.it=FALSE, type="lasso")
# Use the "+1SE rule" to find best model:
# Take the min CV and add its SE ("limit").
# Find smallest model that has its own CV within this limit (at "s.cv.1")
limit <- min(cv.lasso.1$cv) + cv.lasso.1$cv.error[which.min(cv.lasso.1$cv)]
s.cv.1 <- cv.lasso.1$index[min(which(cv.lasso.1$cv < limit))]
# Print out coefficients at optimal s.
coef_lasso_sum=coef_lasso_sum+abs(coef(lasso.1, s=s.cv.1, mode="fraction"))
if(is.null(tempcoef)){
cat("Error in lm fit, skip coefficients\n")
error_counter=error_counter+1
} else{
coef_sum=coef_sum+abs(tempcoef)
}
}
error_counter
coef_mean=coef_sum/NDatasets
coef_lasso_mean=coef_lasso_sum/NDatasets
With regular least squares and lasso estimators all fits were sucessful, yet only 20 variables coefficients could be estimated with regular least squares estimates for the linear model. Then we display and plot that the mean coefficient vector values for the least squares estimates.
head(coef_lasso_mean, 40)
barplot(coef_lasso_mean,ylim=c(0,1.5))
abline(h=(minB+maxB)/2,lwd=2,lty=2,col="blue")
The simulation process looks sucessfull: the lasso estimates retrives mostly the correct variables, yet the other ones are also selected sometimes. What are the confidence indices for those variables?