4.1. Simulation Settings
In each example, the data
$({X}_{1}^{T},{Y}_{1}),({X}_{2}^{T},{Y}_{2}),\dots ,({X}_{n}^{T},{Y}_{n})$ are independent copies of a pair
$({X}^{T},Y)$, where the conditional distribution of the response
Y given
$X=x$ is a binomial distribution with probability of success
${\pi}_{i}$. We generate
$x={({X}_{1},{X}_{2},\dots ,{X}_{p})}^{T}$ from a multivariate normal distribution with mean
$\mathbf{0}$ and covariance matrix
$\Sigma ={\left({\sigma}_{ij}\right)}_{p\times p}$ with ${\sigma}_{ij}={\rho}^{|i-j|}$. We consider 5 values of
$\rho $, from small to large, to generate
X with different correlation strengths among the
p predictors: independence (
$\rho =0$), low correlation (
$\rho =0.2$), moderate correlation (
$\rho =0.4$), high correlation (
$\rho =0.6$) and very high correlation (
$\rho =0.8$). We vary the size of the non-sparse set of coefficients as
$s=2,3,4$ with varying signal strengths, and set the number of parameters to
$p=200$ and
$p=600$. We apply the logit link function to generate the binomial proportion
${\pi}_{i}$ and then generate the binary response variable
Y. We consider 6 different models, which are presented in
Table 1 with different covariates. The true coefficients for these 6 models are
$\mathbf{\beta}=(2,3)$,
$\mathbf{\beta}=(2,-3)$,
$\mathbf{\beta}=(2,3,3)$,
$\mathbf{\beta}=(2,-3,3)$,
$\mathbf{\beta}=(2,3,3,3)$, and
$\mathbf{\beta}=(2,-3,3,-3)$ and the same constant term
${\beta}_{0}=1$. Note that these parameter values are chosen somewhat arbitrarily, with easily recognizable numbers used for brevity; the patterns and trends of the simulation results do not depend on the particular parameter values. The proposed PB-SIS method is thus compared with the MMLE and Kolmogorov filter methods under all
$2\times 6=12$ simulation settings. All simulation results are based on 1000 replicates.
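As a concrete illustration of this data-generating mechanism, the following Python sketch (our own illustration; the function name and defaults are hypothetical, not from the paper) draws X from $N(\mathbf{0},\Sigma)$ with ${\sigma}_{ij}={\rho}^{|i-j|}$ and generates the binary response through the logit link:

```python
import numpy as np

def generate_data(n=100, p=200, rho=0.4, beta=(2.0, 3.0), beta0=1.0, seed=0):
    """Draw X ~ N(0, Sigma) with Sigma[i, j] = rho**|i - j| and
    Y | X = x ~ Bernoulli(pi) with logit(pi) = beta0 + x[:s] @ beta,
    where the first s = len(beta) coordinates form the non-sparse set."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1)-type covariance
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eta = beta0 + X[:, :len(beta)] @ np.asarray(beta)    # linear predictor
    pi = 1.0 / (1.0 + np.exp(-eta))                      # logit link
    Y = rng.binomial(1, pi)
    return X, Y

X, Y = generate_data()
```

Setting `rho=0.0` recovers the independence design, since the covariance matrix then reduces to the identity.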
For each simulation, we use
${\mathcal{P}}_{1}$, the proportion of submodels
${\mathcal{M}}_{d}$ of size
d that contain all the true predictors among the 1000 replications, together with the computing time, to evaluate the performance of each setting. For the threshold value
d, we follow [
9] and choose
d to be
${d}_{1}=\lfloor n/\log n\rfloor $,
${d}_{2}=\lfloor 2n/\log n\rfloor $ and
${d}_{3}=\lfloor 3n/\log n\rfloor $ throughout our simulations to empirically examine the effect of the cutoff, where
$\lfloor \cdot \rfloor $ denotes the floor function. Since in our simulation settings we take
$n=100$, we have
${d}_{1}=21$,
${d}_{2}=43$, and
${d}_{3}=65$. We also evaluate each method by summarizing the median minimum model size (MMMS) of the selected models and its robust estimate of the standard deviation (RSD), where the RSD is the interquartile range (IQR) divided by 1.34, as given in [
11].
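In code, the cutoffs and the RSD can be computed as follows (a small sketch of our own; note that the stated values 21, 43 and 65 correspond to taking the floor of $kn/\log n$ for $k=1,2,3$):

```python
import numpy as np

n = 100
# Cutoffs d_k = floor(k * n / log n) for k = 1, 2, 3; with n = 100 these are 21, 43, 65
d1, d2, d3 = (int(np.floor(k * n / np.log(n))) for k in (1, 2, 3))

def rsd(min_model_sizes):
    """Robust estimate of the standard deviation: IQR divided by 1.34."""
    q75, q25 = np.percentile(min_model_sizes, [75, 25])
    return (q75 - q25) / 1.34

print(d1, d2, d3)  # 21 43 65
```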
Regarding the principle for choosing the value of
d, Ref. [
9] set
$d=n/\log\left(n\right)$ as one possible choice, which is conservative yet effective. Their preference is to select sufficiently many features in the first stage; when
d is not very small, the selection results are not very sensitive to the choice of
d. Clearly, a larger
d means a larger probability of including the true model
${\mathcal{M}}_{\ast}$ in the submodel
${\mathcal{M}}_{d}$. Provided that
$d=n/\log\left(n\right)$ is large enough, we can use it as the threshold. Doing so detects all significant predictors in the selected subset, and the
${\mathcal{P}}_{1}$ value is large. Therefore, the principle for choosing
d is to obtain a relatively large value of
d to ensure that the first-stage selection includes all important predictors in the submodel
${\mathcal{M}}_{d}$. The simulation results in the next subsection will show that taking
${d}_{1}=\lfloor n/\log n\rfloor $,
${d}_{2}=\lfloor 2n/\log n\rfloor $ and
${d}_{3}=\lfloor 3n/\log n\rfloor $ as thresholds results in the
${\mathcal{P}}_{1}$ values being close to 1, verifying that these thresholds perform effectively in the proposed feature screening method.
4.2. Presentation of Simulation Results for Logit Models
We present a series of simulation results where the response variable is generated from GLMs for binary data using the logit link. We summarize simulation results for the 6 models in
Table 1. The proportion
${\mathcal{P}}_{1}$ and computing time are tabulated in Tables 2–7, and the MMMS and the associated RSD are summarized in Tables 8–13.
The simulation results for models 1 to 6, where data are generated using the logit link, are tabulated in
Table 2,
Table 3,
Table 4,
Table 5,
Table 6 and
Table 7. From
Table 2, we can see that all the proportions
${\mathcal{P}}_{1}$ are close to 1, which illustrates the sure screening property. The MMLE screening procedure usually has a higher proportion
${\mathcal{P}}_{1}$ than the other two methods, but it takes much longer computing time than the PB-SIS and Kolmogorov filter methods in all settings. Even though the proportion
${\mathcal{P}}_{1}$ for PB-SIS is slightly lower than MMLE when
$\rho =0$ and
$\rho =0.2$, the difference is very small. The biggest difference in the proportion
${\mathcal{P}}_{1}$ is only
$1.3\%$ between PB-SIS and MMLE when
$\rho =0$ and
$p=600$. When
$\rho $ is greater than 0.4, the PB-SIS and MMLE have the exact same proportion
${\mathcal{P}}_{1}$. When we consider computational cost, however, the PB-SIS method can be implemented much faster than the MMLE method. The average computing times for the PB-SIS and MMLE methods in logit model 1 are 41.85 s and 579.18 s when
$p=200$, and 282.05 s and 1289.69 s when
$p=600$. As
p increases from 200 to 600, the computing time grows by a factor of 6.74 for PB-SIS and 2.23 for MMLE. The Kolmogorov filter method has the lowest proportion
${\mathcal{P}}_{1}$ and moderate computing time in each setting. Since all coefficients in logit model 1 are assigned positive values, the proportions
${\mathcal{P}}_{1}$ do not depend on the correlation among the predictors. Even for highly correlated predictors, all three feature screening methods can still successfully select all the true predictors. For example, the proportions
${\mathcal{P}}_{1}$ are all equal to
$100\%$ when
$\rho =0.6$ and
$\rho =0.8$. In addition, the proportion
${\mathcal{P}}_{1}$ decreases as the dimensionality increases. As the number of features increases from
$p=200$ to
$p=600$, the proportions
${\mathcal{P}}_{1}$ decrease in most settings.
The proportion
${\mathcal{P}}_{1}$ and computing time for logit model 2 are reported in
Table 3. In logit model 2, the two true covariates are assigned different signs. The
${\mathcal{P}}_{1}$ values of PB-SIS and MMLE are still very close, which means the two screening procedures perform equally well in most settings. However, when we compare the computing time of the different methods, we observe that PB-SIS takes much less time than MMLE in all settings. If we compare covariance structures with different
$\rho $’s, the settings in which the predictors are independent (
$\rho =0$) or have low correlation (
$\rho =0.2$) typically perform better than those with high (
$\rho =0.6$) or very high (
$\rho =0.8$) correlation for all three screening procedures. This is because the probabilities of selecting some unimportant variables are inflated by the adjacent important ones when the predictors are highly correlated. Some unimportant predictors may then be selected, since they are strongly correlated with the true predictors, which weakens the probability of selecting all true predictors.
Table 4 depicts the proportion
${\mathcal{P}}_{1}$ and computing time for logit model 3. Similar conclusions can be drawn from
Table 4 as from
Table 2. All proportions
${\mathcal{P}}_{1}$ of all three screening approaches are close to one, which means all three approaches are able to select all important predictors in this setting. As the submodel size
d increases, the proportions
${\mathcal{P}}_{1}$ for all three approaches increase as well. Thus increasing the submodel size
d is helpful for increasing the proportion
${\mathcal{P}}_{1}$. The computing time does not change too much as the submodel size
d increases. If we would like to obtain a higher proportion
${\mathcal{P}}_{1}$, we can choose a larger threshold
d. However, a larger threshold
d means the model becomes more complex; there is a trade-off between model complexity and selection accuracy. Our suggestion is to choose the smaller submodel size
$d=\lfloor n/\log\left(n\right)\rfloor $, since the small growth in the proportion
${\mathcal{P}}_{1}$ is not worth doubling or tripling the model complexity.
Table 5 reports the proportion
${\mathcal{P}}_{1}$ and computing time for logit model 4, in which the three true covariates are assigned different signs. PB-SIS and MMLE perform equally well, and the PB-SIS approach is more efficient when
$\rho =0$,
$\rho =0.2$,
$\rho =0.4$ or
$\rho =0.6$. However, when predictors are highly correlated (
$\rho =0.8$), all three feature screening methods fail to detect the important predictors. This is because, when predictors are highly correlated (
$\rho $ = 0.8), each predictor’s contribution to the response variable can be cancelled out, especially for predictors with opposite signs.
The proportion
${\mathcal{P}}_{1}$ and computing time for logit model 5 and logit model 6 are summarized in
Table 6 and
Table 7. For logit model 5, we observe a qualitative pattern similar to logit models 1 and 3. The PB-SIS and MMLE approaches perform equally well, and the PB-SIS approach yields a comparable computing time. The Kolmogorov filter approach performs a little worse than PB-SIS in both selection accuracy and computing time. We also observe that the proportion
${\mathcal{P}}_{1}$ increases as the correlation
$\rho $ increases. From
Table 7, the simulation results show that PB-SIS and MMLE perform equally well in selection accuracy, while the PB-SIS approach has a lower computational cost than MMLE when predictors are independent or have low correlation. Similar to the logit model 2 and logit model 4 results, when predictors are highly correlated, all three feature screening approaches tend to fail to select the important predictors.
Table 8 summarizes the MMMS, the median size of the smallest model containing all true predictors, for logit model 1, together with its RSD. These two values can be used to measure the effectiveness of a screening method, and the MMMS avoids the issue of choosing a threshold
d. From
Table 8, we can observe that the PB-SIS and MMLE methods perform equally well, and the Kolmogorov filter approach performs a little worse than both in all settings. The Kolmogorov filter also has a somewhat larger RSD due to outliers, which make the minimum model size more spread out in some cases. For the high and very high correlation settings, the RSD values for PB-SIS and MMLE are larger, which means the minimum model size has higher variability when the covariates are highly correlated with each other.
Table 9 depicts the MMMS and RSD for logit model 2. We observe results similar to those for logit model 1. PB-SIS and MMLE still perform well in selecting all important variables when predictors are independent or have low correlation. However, all three feature screening procedures fail to detect the important predictors when predictors are highly correlated (
$\rho $ = 0.8), especially the Kolmogorov filter method. For example, when the correlation is high, the MMMS of the Kolmogorov filter is 16 and 33 for
$p=200$ and
$p=600$, and the RSD values reach 30.60 and 79.85, respectively. This means the minimum-size models containing all important predictors are very spread out over the 1000 replications and may contain outliers. This is mainly because each predictor’s contribution to the response variable is cancelled out when the predictors have different signs and are highly correlated with each other.
Table 10 summarizes the MMMS and RSD for logit model 3. The PB-SIS and MMLE approaches are more robust in selecting important predictors than the Kolmogorov filter in most settings. The MMMS values for PB-SIS and MMLE are almost the same in all settings, and MMLE usually has the smallest RSD values among the three feature screening procedures. The Kolmogorov filter method still performs a little worse than the PB-SIS and MMLE methods. In general, the three screening approaches do not differ much when the number of true predictors is small and the coefficients have the same sign.
Table 11 presents the simulation results for logit model 4 in terms of the MMMS and the associated RSD. The results illustrate that PB-SIS and MMLE have more effective and consistent performance than the Kolmogorov filter method when
$\rho $ = 0, 0.2 or 0.4. We also notice that the MMMS and the associated RSD usually increase as the dimension or the correlation level increases. When predictors are highly correlated, the PB-SIS, MMLE and Kolmogorov filter methods all fail to select the important predictors. For example, when
$\rho $ = 0.8, the MMMS of PB-SIS, MMLE and Kolmogorov filter procedures are 105, 105 and 144 for
p = 200 and 140, 140 and 199 for
p = 600, which are much larger than the true model size of 3.
The simulation results for logit model 5 in terms of the MMMS and the associated RSD are presented in
Table 12. The overall pattern for logit model 5 is similar to logit models 1 and 3. The PB-SIS and MMLE methods still outperform the Kolmogorov filter method in selection effectiveness, and the Kolmogorov filter method has a larger MMMS and associated RSD than PB-SIS and MMLE in almost all settings.
The simulation results of the MMMS and the associated RSD for logit model 6 are summarized in
Table 13. From
Table 13, we can observe that, as the correlation increases, the MMMS and the associated RSD usually increase as well for the PB-SIS, MMLE and Kolmogorov filter approaches. We also see that, as the dimension increases from 200 to 600, the MMMS increases for all three feature screening approaches. Among the three approaches, the PB-SIS method usually achieves the smallest MMMS in most settings. When predictors are highly correlated, all three feature screening methods fail to select the important predictors. As discussed before, this is because the contributions of predictors with opposite signs may cancel out when the predictors are highly correlated.
4.3. Simulations in Two-Stage Approach
We investigate the selection performance of the two-stage PB-SIS method with different penalties. We consider the LASSO penalty, the SCAD penalty and the MCP, along with four tuning parameter selection criteria: cross-validation (CV), the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the extended Bayesian information criterion (EBIC). In this section, only the logit link is applied to generate the binomial proportion
${\pi}_{i}$ and the binary response
Y. We use the same model settings as
Section 4.1, which are presented in
Table 1. In the first stage, PB-SIS is conducted to obtain the submodel
${\mathcal{M}}_{d}$ with size
$d=\lfloor n/\log\left(n\right)\rfloor $. Then, in the second stage, the three penalized methods are applied to further select important predictors and recover the final sparse model. All the simulation results are based on 1000 replicates.
We evaluate the performance of the two-stage PB-SIS approach based on
${\mathcal{P}}_{2}$, the proportion of final models containing all the true predictors among the 1000 replications, and the mean final model size. The proportion
${\mathcal{P}}_{2}$ and mean model size are summarized for model 1 to model 6 in
Table 14,
Table 15,
Table 16,
Table 17,
Table 18 and
Table 19, and the mean final model size after regularization is reported in parentheses. We use the
$\mathbf{SIS}$ package in
$\mathbf{R}$ to implement the penalized methods in the second stage. The
$tune.fit$ function in the
$\mathbf{SIS}$ package fits a generalized linear model via penalized maximum likelihood, with available penalties such as LASSO, SCAD and MCP as implemented in the
$\mathbf{glmnet}$ and
$\mathbf{ncvreg}$ packages. The number of folds used in cross-validation is 10, and the loss function used to select the final model is the deviance.
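The two-stage idea can be sketched end-to-end in a few lines of Python (a stand-in illustration of our own: a simple marginal-correlation ranking replaces the PB-SIS statistic, and a proximal-gradient LASSO-penalized logistic fit with a fixed $\lambda $ replaces $tune.fit$ with glmnet/ncvreg; all function names here are hypothetical):

```python
import numpy as np

def screen(X, y, d):
    """Stage 1 (stand-in): rank features by absolute marginal correlation with y
    and keep the top d; the PB-SIS statistic would replace this ranking."""
    xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(xc.T @ yc) / (np.linalg.norm(xc, axis=0) * np.linalg.norm(yc))
    return np.sort(np.argsort(corr)[::-1][:d])

def l1_logistic(X, y, lam, iters=2000, lr=0.1):
    """Stage 2 (stand-in): LASSO-penalized logistic regression fitted by
    proximal gradient descent (the soft-threshold step applies the L1 penalty)."""
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))   # fitted probabilities
        grad = X.T @ (pi - y) / n                  # gradient of the logistic loss
        b0 -= lr * np.mean(pi - y)                 # intercept is not penalized
        step = b - lr * grad
        b = np.sign(step) * np.maximum(np.abs(step) - lr * lam, 0.0)
    return b0, b

# Toy run: true model uses X1, X2 with beta = (2, 3) and intercept 1, as in model 1
rng = np.random.default_rng(1)
n, p = 500, 200
X = rng.standard_normal((n, p))
prob = 1.0 / (1.0 + np.exp(-(1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1])))
y = rng.binomial(1, prob).astype(float)
keep = screen(X, y, d=int(np.floor(n / np.log(n))))   # submodel of size floor(n/log n)
_, coef = l1_logistic(X[:, keep], y, lam=0.1)
selected = keep[np.abs(coef) > 1e-8]                  # final sparse model
```

In practice the $tune.fit$ call plays the role of `l1_logistic` here, with $\lambda $ chosen by CV, AIC, BIC or EBIC rather than fixed.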
The proportion
${\mathcal{P}}_{2}$ and mean model size for model 1 and model 2 are tabulated in
Table 14 and
Table 15. For models 1 and 2, the number of true parameters is two. In general, we observe that the PB-SIS+LASSO two-stage approaches with different tuning parameter selection criteria have the highest proportions
${\mathcal{P}}_{2}$, while the PB-SIS+MCP two-stage approaches with different tuning parameter selection criteria yield the sparsest models among the three penalties. Even though the PB-SIS+LASSO two-stage approaches usually have the highest proportion
${\mathcal{P}}_{2}$, they also give the largest final model sizes for all tuning parameter selection criteria. Furthermore, the PB-SIS+SCAD two-stage approach using EBIC to select the tuning parameter occasionally fails to select the important predictors. For example, in
Table 14, the proportions
${\mathcal{P}}_{2}$ for the PB-SIS+SCAD two-stage approach using EBIC are just 0.742 and 0.599 when
$p=200$ and
$p=600$, which are the smallest among all the two-stage approaches. We also notice that, as the dimension
p increases, the proportion
${\mathcal{P}}_{2}$ decreases and the mean model size increases for all three penalties.
Table 16 and
Table 17 summarize the proportion
${\mathcal{P}}_{2}$ and the mean model size for models 3 and 4, which both contain three true parameters. For these two models, we observe an overall pattern similar to models 1 and 2. The final models selected by the PB-SIS+MCP two-stage approach with different tuning parameter selection criteria usually have the smallest size among the three penalties. The PB-SIS+SCAD two-stage approaches return final models of moderate size, and the PB-SIS+LASSO two-stage approaches return the largest final models. If we consider the proportion
${\mathcal{P}}_{2}$, the PB-SIS+LASSO two-stage approaches with different tuning parameter selection criteria have the largest proportion
${\mathcal{P}}_{2}$. We can conclude that the PB-SIS+LASSO two-stage approach performs better in selection accuracy and the PB-SIS+MCP two-stage approach performs better in finding the sparsest model.
The simulation results for models 5 and 6 in terms of the proportion
${\mathcal{P}}_{2}$ and mean model size are presented in
Table 18 and
Table 19. The overall performance of the PB-SIS+LASSO, PB-SIS+SCAD and PB-SIS+MCP two-stage approaches for models 5 and 6 is similar to that for models 1 to 4. The PB-SIS+LASSO two-stage approaches with different tuning parameter selection criteria have the highest proportion
${\mathcal{P}}_{2}$, along with the largest mean model sizes. On the other hand, the PB-SIS+MCP two-stage approaches with different tuning parameter selection criteria end up with the smallest model size on average, with a slightly smaller proportion
${\mathcal{P}}_{2}$ than the PB-SIS+LASSO and PB-SIS+SCAD two-stage approaches. Therefore, there is a trade-off between selection accuracy and final model size for these two-stage methods. Our suggestion is to choose the two-stage PB-SIS+LASSO method when we care more about selecting all true predictors, while the two-stage PB-SIS+MCP approach is a better choice if we would like to find the sparsest final model.
We now remark on the choice of a criterion for selecting the tuning parameter
$\lambda $. In the simulations, as mentioned prior to Algorithm 1, one can use cross-validation (CV), AIC [
21], BIC [
22] or EBIC [
23] to choose tuning parameter
$\lambda $, each of which serves as a model selection criterion. Depending on the properties of these criteria, we can choose one for selecting the tuning parameter
$\lambda $ based on different needs. CV chooses the model with the best out-of-sample predictive accuracy. AIC is an efficient model selection criterion but is not consistent: it chooses the model with the minimum disparity between a candidate model and the true model, and is likely to select an overfitted model including more predictors than the true model. BIC is consistent, which means that asymptotically BIC chooses the true model. EBIC, the extended BIC, is consistent as well; it may incur a small loss in the positive selection rate but tightly controls the false discovery rate (see [
23]). In many applications, CV or BIC is used to select the tuning parameter
$\lambda $.