Consistent Estimation of Generalized Linear Models with High Dimensional Predictors via Stepwise Regression

Predictive models play a central role in decision making. Penalized regression approaches, such as the least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after applying variable screening methods to reduce the number of variables. We propose a stepwise procedure for fitting generalized linear models with ultrahigh dimensional predictors. Our procedure delivers a final model; controls both false negatives and false positives; and yields consistent estimates, which are useful for gauging the actual effect sizes of risk factors. Simulations and applications to two clinical studies verify the utility of the method.


Introduction
In the era of precision medicine, constructing interpretable and accurate predictive models, based on patients' demographic characteristics, clinical conditions and molecular biomarkers, has been crucial for disease prevention, early diagnosis and targeted therapy [1]. When the number of predictors is moderate, penalized regression approaches such as the least absolute shrinkage and selection operator (LASSO) by [2] have been used to construct predictive models and explain the impacts of the selected predictors. However, in ultrahigh dimensional settings where p is in the exponential order of n, penalized methods may incur computational challenges [3], may not reach globally optimal solutions and often generate biased estimates [4]. Sure independence screening (SIS), proposed by [5], has emerged as a powerful tool for modeling ultrahigh dimensional data. However, the method relies on a partial faithfulness assumption, which stipulates that jointly important variables must be marginally important, an assumption that may not always be realistic. To relieve this condition, some iterative procedures, such as ISIS [5], have been adopted to repeatedly screen variables based on the residuals from previous iterations, but with heavy computation and unclear theoretical properties. Conditional screening approaches (see, e.g., [6]) have, to some extent, addressed the challenge. However, screening methods do not directly generate a final model, and post-screening regularization methods, such as LASSO, are recommended by [5] to produce a final model.
Let L(u, v) = uv − b(u) and E_n{f(ξ)} = n^{-1} Σ_{i=1}^n f(ξ_i) denote the mean of {f(ξ_i)}_{i=1}^n for a sequence of i.i.d. random variables ξ_i (i = 1, ..., n) and a non-random function f(·). Based on the i.i.d. observations, the log-likelihood function is ℓ(β) = E_n{L(X^T β, Y)}. We use β* = (β*_0, β*_1, ..., β*_p)^T to denote the true value of β. Then the true model is M = {j : β*_j ≠ 0, j ≥ 1} ∪ {0}, which consists of the intercept and all variables with nonzero effects. The overarching goals of ultrahigh dimensional data analysis are to identify M and estimate β*_j for j ∈ M. While most of the relevant literature [8,9] focuses on estimating M, this work accomplishes both the identification of M and the estimation of β*_j. When p is in the exponential order of n, we aim to generate a predictive model that contains the true model with high probability, and to provide consistent estimates of the regression coefficients. We further introduce the following notation. For a generic index set S ⊂ {0, 1, ..., p} and a (p + 1)-dimensional vector A, we use S^c to denote the complement of S and A_S = {A_j : j ∈ S} to denote the subvector of A corresponding to S. For instance, if S = {2, 3, 4}, then X_iS = (X_i2, X_i3, X_i4)^T. Moreover, denote by ℓ_S(β_S) = E_n{L(X_S^T β_S, Y)} the log-likelihood of the regression model of Y on X_S, and denote by β̂_S the maximizer of ℓ_S(β_S). Under model (1), we elaborate on the idea of stepwise selection (details in the supplementary materials), consisting of the forward and backward stages.
Forward stage: We start with F_0, a set of variables that needs to be included according to some a priori knowledge, such as clinically important factors and conditions. If no such information is available, F_0 is set to be {0}, corresponding to a null model. We sequentially add covariates as follows. Let F_k ⊂ {0, 1, ..., p} be the index set of the selected covariates upon completion of the kth step, with k ≥ 0. At the (k + 1)th step, we append new variables to F_k one at a time and refit GLMs: for every j ∈ F_k^c, we let F_{k,j} = F_k ∪ {j}, obtain β̂_{F_{k,j}} by maximizing ℓ_{F_{k,j}}(β_{F_{k,j}}), and compute the increment of the log-likelihood, ℓ_{F_{k,j}}(β̂_{F_{k,j}}) − ℓ_{F_k}(β̂_{F_k}). Then the index of a new candidate variable is determined to be

j_{k+1} = argmax_{j ∈ F_k^c} {ℓ_{F_{k,j}}(β̂_{F_{k,j}}) − ℓ_{F_k}(β̂_{F_k})},

and we update F_{k+1} = F_k ∪ {j_{k+1}}. We then need to decide whether to stop at the kth step or move on to the (k + 1)th step with F_{k+1}. To do so, we use the following EBIC criterion:

EBIC(F) = −2 ℓ_F(β̂_F) + |F| n^{-1} (log n + 2 η_1 log p),   (3)

where the second term is motivated by [17] and |F| denotes the cardinality of a set F. The forward selection stops if EBIC(F_{k+1}) > EBIC(F_k). We denote the stopping step by k* and the set of variables selected so far by F_{k*}.
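For concreteness, the forward stage can be sketched in the Gaussian case, where maximizing the log-likelihood reduces to least squares. This is an illustrative sketch only: the function names, the `qmax` cap on the model size and the exact placement of η_1 in the EBIC penalty are our assumptions, not the authors' implementation.

```python
import numpy as np

def rss(X, y, idx):
    """Residual sum of squares of the OLS fit of y on an intercept
    plus the columns of X indexed by idx (Gaussian GLM)."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in idx])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return float(resid @ resid)

def ebic(X, y, idx, eta1):
    """Gaussian EBIC: n*log(RSS/n) stands in for -2 x log-likelihood."""
    n, p = X.shape
    return n * np.log(rss(X, y, idx) / n) + len(idx) * (np.log(n) + 2 * eta1 * np.log(p))

def forward_stage(X, y, eta1=0.5, qmax=20):
    """Greedy forward selection: at each step add the variable giving the
    largest log-likelihood increment; stop when EBIC no longer decreases."""
    F = []                              # null model: intercept only
    crit = ebic(X, y, F, eta1)
    while len(F) < qmax:
        candidates = [j for j in range(X.shape[1]) if j not in F]
        if not candidates:
            break
        # smallest RSS after addition = largest likelihood increment
        j_new = min(candidates, key=lambda j: rss(X, y, F + [j]))
        new_crit = ebic(X, y, F + [j_new], eta1)
        if new_crit > crit:             # EBIC went up: stop the forward stage
            break
        F, crit = F + [j_new], new_crit
    return F
```

With strong signals, the EBIC penalty typically halts the loop shortly after the true variables have entered the model.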
Backward stage: Upon completion of the forward stage, backward elimination, starting with B_0 = F_{k*}, sequentially drops covariates as follows. Let B_k be the index set of the remaining covariates upon completion of the kth step of the backward stage, with k ≥ 0. At the (k + 1)th backward step and for every j ∈ B_k, we let B_{k/j} = B_k \ {j}, obtain β̂_{B_{k/j}} by maximizing ℓ_{B_{k/j}}(β_{B_{k/j}}), and calculate the difference of the log-likelihoods between these two nested models: ℓ_{B_k}(β̂_{B_k}) − ℓ_{B_{k/j}}(β̂_{B_{k/j}}). The variable that can be removed from the current set of variables is indexed by

j_{k+1} = argmin_{j ∈ B_k} {ℓ_{B_k}(β̂_{B_k}) − ℓ_{B_{k/j}}(β̂_{B_{k/j}})},

and we set B_{k+1} = B_k \ {j_{k+1}}. We determine whether to stop at the kth step or move on to the (k + 1)th step of the backward stage according to the following BIC criterion:

BIC(B) = −2 ℓ_B(β̂_B) + η_2 |B| n^{-1} log n.   (4)

If BIC(B_{k+1}) > BIC(B_k), we end the backward stage at the kth step. Let k** denote the stopping step; we declare the selected model B_{k**} to be the final model. Thus, M̂ = B_{k**} is the estimate of M. As the backward stage starts with the k* variables selected by forward selection, k** cannot exceed k*.
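The backward stage admits a similar sketch, again using Gaussian least squares for illustration; the helper is repeated so the snippet is self-contained, and scaling the BIC penalty multiplicatively by η_2 is our assumption based on the stopping rule described above. With η_2 = 0 the penalty vanishes and no variable is ever dropped, so forward selection is recovered as a special case.

```python
import numpy as np

def rss(X, y, idx):
    """Residual sum of squares of the OLS fit of y on an intercept
    plus the columns of X indexed by idx."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in idx])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return float(resid @ resid)

def bic(X, y, idx, eta2):
    """Gaussian BIC with the penalty scaled by eta2 (our assumption)."""
    n = len(y)
    return n * np.log(rss(X, y, idx) / n) + eta2 * len(idx) * np.log(n)

def backward_stage(X, y, B0, eta2=2.0):
    """Greedy backward elimination: drop the variable whose removal costs
    the least log-likelihood; stop when BIC no longer decreases."""
    B = list(B0)
    crit = bic(X, y, B, eta2)
    while B:
        # smallest RSS after deletion = cheapest variable to remove
        j_drop = min(B, key=lambda j: rss(X, y, [k for k in B if k != j]))
        reduced = [k for k in B if k != j_drop]
        new_crit = bic(X, y, reduced, eta2)
        if new_crit > crit:             # BIC went up: keep the current model
            break
        B, crit = reduced, new_crit
    return B
```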
A strength of our algorithm, termed STEPWISE hereafter, is the added flexibility of η_1 and η_2 in the stopping criteria for controlling false negatives and false positives. For example, a value of η_1 close to zero in the forward stage will likely include more variables, and thus incur more false positives but fewer false negatives, whereas a large η_1 will recruit too few variables and cause too many false negatives. Similarly, in the backward stage, a large η_2 eliminates more variables and therefore further reduces false positives, and vice versa for a small η_2. While finding optimal η_1 and η_2 is not trivial, our numerical experience suggests that a small η_1 and a large η_2 may well balance the false negatives and positives. When η_2 = 0, no variables can be dropped after forward selection; hence, our proposal includes forward selection as a special case.
Moreover, [8] proposed a sequentially conditioning approach based on offset terms that absorb prior information. However, our numerical experiments indicate that the offset approach may be suboptimal compared with our full stepwise optimization, as demonstrated in the simulation studies.

Theoretical Properties
For a column vector v, let ‖v‖_q denote its L_q-norm for any q ≥ 1. For simplicity, we denote the L_2-norm of v by ‖v‖ and denote vv^T by v^{⊗2}. We use C_1, C_2, ..., to denote generic constants that do not depend on n and may change from line to line. The following regularity conditions are set.

5. There exist two positive constants, κ_min and κ_max, such that 0 < κ_min ≤ Λ(E{X_S^{⊗2}}) ≤ κ_max < ∞ uniformly in S ⊂ {0, 1, ..., p} satisfying |S| ≤ q, where Λ(A) is the collection of all eigenvalues of a square matrix A.

6. min_{S: M ⊄ S, |S| ≤ q} D_S ≥ C n^{-α} for some constants C > 0 and α > 0.

Condition (1), as assumed in [8,18], is an alternative to the Lipschitz assumption [5,19]. The bound q on the model size allowed in the selection procedure is often required in model-based screening methods (see, e.g., [8,20-22]). The bound should be large enough that the correct model can be included, but not too large; otherwise, excessive noise variables would be included, leading to unstable and inconsistent estimates. Indeed, Conditions (1) and (6) reveal that the range of q depends on the true model size |M|, the minimum signal strength n^{-α} and the total number of covariates p. The upper bound of q is o((n^{1−4α}/log p) ∧ (n^{1/3}/log p)), ensuring the consistency of EBIC [17]. Condition (1) also implies that the parameter space under consideration can be restricted to B := {β ∈ R^{p+1} : ‖β‖_1 ≤ K_2}, for any model S with |S| ≤ q. Condition (2), as assumed in [23,24], reflects that data are often standardized at the pre-processing stage. Condition (3) ensures that Y has a light tail, and is satisfied by Gaussian and discrete data, such as binary and count data [25]. Condition (4) is satisfied by common GLMs, such as the Gaussian, binomial, Poisson and gamma models. Condition (5) represents the sparse Riesz condition [26], and Condition (6) is a strong "irrepresentable" condition, stating that M cannot be represented by a set of variables that does not include the true model. It further implies that adding a signal variable to a mis-specified model will increase the log-likelihood by a certain lower bound [8]. The required signal rate is comparable to the conditions of other sequential methods; see, e.g., [7,22].
Theorem 1 develops a lower bound of the increment of the log-likelihood if the true model M is not yet included in a selected model S.
Theorem 1 shows that, before the true model is included in the selected model, we can append a variable that will increase the log-likelihood by at least C_1 n^{-2α} with probability tending to 1. This ensures that, in the forward stage, the proposed STEPWISE approach will keep searching for signal variables until the true model is contained. To see this, suppose at the kth step of the forward stage that F_k satisfies M ⊄ F_k and |F_k| < q, and let r be the index selected by STEPWISE. By Theorem 1, we obtain that, for any η_1 > 0, when n is sufficiently large, with probability at least 1 − 6 exp(−6q log p),

EBIC(F_{k+1}) − EBIC(F_k) ≤ −2C_1 n^{-2α} + n^{-1}(log n + 2η_1 log p) < 0,

where the last inequality is due to Condition (6). Therefore, with high probability the forward stage of STEPWISE continues as long as M ⊄ F_k and |F_k| < q. We next establish an upper bound on the number of steps in the forward stage needed to include the true model.
then there exists some constant C 2 > 2 such that M ⊂ F k , for some F k in the forward stage of STEPWISE and k ≤ C 2 |M|, with probability at least 1 − 18 exp(−4q log p).
The "max" condition, as assumed in Section 5.3 of [27], relaxes the partial orthogonality assumption that X_{M^c} is independent of X_M, and ensures that, with probability tending to 1, appending a signal variable increases the log-likelihood more than adding a noise variable does, uniformly over all possible models S satisfying M ⊄ S, |S| < q. This entails that the proposed procedure is much more likely to select a signal variable, in lieu of a noise variable, at each step. Since EBIC is a consistent model selection criterion [28,29], the following theorem guarantees termination of the proposed procedure with M ⊂ F_k for some k. Theorem 3 ensures that the forward stage of STEPWISE stops within a finite number of steps and covers the true model with probability at least 1 − q exp(−3q log p) ≥ 1 − exp(−2q log p). We next consider the backward stage and provide a probability bound on removing a signal variable from a set that contains the set of true signals M. Theorem 4 indicates that, with probability at least 1 − 6 exp(−6q log p), BIC would increase when removing a signal variable from a model that contains the true model. That is, with high probability, backward elimination only reduces false positives.
Recall that F_{k*} denotes the model selected at the end of the forward stage. By Theorem 2, M ⊂ F_{k*} with probability at least 1 − 18 exp(−4q log p). Then Theorem 4 implies that at each step of the backward stage, a signal variable will not be removed from the model with probability at least 1 − 6 exp(−6q log p). By Theorem 2, |F_{k*}| ≤ C_2|M|; thus, the backward elimination carries out at most (C_2 − 1)|M| steps. Combining the results of Theorems 2 and 4 yields that M ⊂ M̂ with probability at least 1 − 18 exp(−4q log p) − 6(C_2 − 1)|M| exp(−6q log p). Let β̂ be the estimate of β* in model (1) at the termination of STEPWISE. By convention, the estimates of the coefficients of the unselected covariates are 0. The theorem warrants that the proposed STEPWISE yields consistent estimates, a property not shared by many regularized methods, including LASSO; our later simulations verify this. Proofs of the main theorems and lemmas are provided in Appendix A.

Simulation Studies
We compared the proposal with competing methods, including penalized methods, such as the least absolute shrinkage and selection operator (LASSO) and the differential geometric least angle regression (dgLARS) [11,12]; the forward regression (FR) approach [7]; the sequentially conditioning (SC) approach [8]; and screening methods, such as sure independence screening (SIS) [5], which is popular in practice. As SIS does not directly generate a predictive model, we applied LASSO to the top [n/log(n)] variables chosen by SIS and denote the procedure by SIS+LASSO. As the FR, SC and STEPWISE approaches all involve forward searching, to make them comparable we applied the same stopping rule, namely, Equation (3) with the same γ, to their forward steps. In particular, the STEPWISE approach with η_1 = γ and η_2 = 0 is equivalent to FR and asymptotically equivalent to SC. By varying γ in FR and SC between γ_L and γ_H, we explored the impact of γ on inducing false positives and negatives. In our numerical studies, we fixed γ_H = 10 and set γ_L = η_1. To choose η_1 and η_2 in (3) and (4) in STEPWISE, we performed 5-fold cross-validation to minimize the mean squared prediction error (MSPE), and reported the results in Table 1. Since the proposed STEPWISE algorithm uses the (E)BIC criterion, for a fair comparison we chose the tuning parameter in dgLARS by the BIC criterion as well, and denote the corresponding approach by dgLARS(BIC). The regularization parameter in LASSO was chosen via two approaches: (1) taking the model with the smallest BIC on the LASSO path, denoted by LASSO(BIC); and (2) using the one-standard-error rule, denoted by LASSO(1SE), which chooses the most parsimonious model whose cross-validation error is no more than one standard error above that of the best model [30]. Denote by X_i = (X_i1, ..., X_ip)^T and β = (β_1, ..., β_p)^T the covariate vector for subject i (i = 1, ..., n) and the true coefficient vector, respectively. The following five examples generated X_i^T β, the inner product of the coefficient and covariate vectors for each individual, which was used to generate outcomes from the normal, binomial and Poisson models.
Example 1. For each i, X_i^T β was formed with β_j = (−1)^{B_j}(4 log n/√n + |Z_j|) for j = 1, ..., p_0 and β_j = 0 otherwise, where B_j was a binary random variable with P(B_j = 1) = 0.4 and Z_j was generated from a standard normal distribution; p_0 = 8. The X_ij s were independently generated from a standardized exponential distribution, that is, exp(1) − 1. Here, and also in the other examples, c (specified later) controls the signal strength.
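The coefficient and covariate generation just described can be sketched as follows; the seed and the placement of the multiplier c are ours for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, p0, c = 400, 1000, 8, 1.0

# beta_j = c * (-1)^{B_j} * (4 log n / sqrt(n) + |Z_j|) for j <= p0, else 0
B = rng.binomial(1, 0.4, size=p0)
Z = rng.standard_normal(p0)
beta = np.zeros(p)
beta[:p0] = c * (-1.0) ** B * (4 * np.log(n) / np.sqrt(n) + np.abs(Z))

# standardized exponential covariates: exp(1) - 1 has mean 0 and variance 1
X = rng.exponential(1.0, size=(n, p)) - 1.0

eta = X @ beta                          # linear predictor X_i^T beta
y = eta + rng.standard_normal(n)        # responses under the normal model
```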

Example 2.
This scenario is the same as Example 1 except that X ij was independently generated from a standard normal distribution.

Example 3. For each i,
where β_j = 2j for 1 ≤ j ≤ p_0 and p_0 = 5. We simulated every component of Z_i = (Z_ij) ∈ R^p and W_i = (W_ij) ∈ R^p independently from a standard normal distribution and generated each X_ij by combining Z_ij and W_ij.

Example 4. The first 500 X_ij s were generated from a multivariate normal distribution with mean 0 and a covariance matrix with all diagonal entries equal to 1 and cov(X_ij, X_ij') = 0.5^{|j−j'|} for 1 ≤ j, j' ≤ 500. The remaining p − 500 X_ij s were generated through autoregressive processes with X_i,501 ∼ Unif(−2, 2) and X_ij = 0.5 X_i,j−1 + 0.5 X*_ij for j = 502, ..., p, where the X*_ij ∼ Unif(−2, 2) were generated independently. The coefficients β_j for j = 1, ..., 7, 501, ..., 507 were generated from (−1)^{B_j}(4 log n/√n + |Z_j|), where B_j was a binary random variable with P(B_j = 1) = 0.4 and Z_j followed a standard normal distribution. The remaining β_j were zeros.
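The composite covariate design just described, with the first 500 columns AR(1)-correlated normals (correlation 0.5^{|j−j'|}) and the remainder a uniform autoregressive chain, can be simulated as follows; the 0-based column indices and the AR(1) recursion for the normal block are our implementation choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 1000

X = np.empty((n, p))
# columns 0..499: N(0, Sigma) with Sigma_{jk} = 0.5^{|j-k|}, simulated via
# the stationary AR(1) recursion X_j = 0.5 X_{j-1} + sqrt(1 - 0.25) eps_j
X[:, 0] = rng.standard_normal(n)
for j in range(1, 500):
    X[:, j] = 0.5 * X[:, j - 1] + np.sqrt(1 - 0.25) * rng.standard_normal(n)

# columns 500..p-1: uniform autoregressive process with Unif(-2, 2) innovations
X[:, 500] = rng.uniform(-2, 2, size=n)
for j in range(501, p):
    X[:, j] = 0.5 * X[:, j - 1] + 0.5 * rng.uniform(-2, 2, size=n)
```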
Example 5. The covariate vectors X_i were generated from a multivariate normal distribution with mean 0 and a covariance matrix with all diagonal entries equal to 1 and cov(X_ij, X_ij') = 0.9^{|j−j'|} for 1 ≤ j, j' ≤ p. All of the coefficients were zero except those of X_i1, X_i2 and X_i,100.
Examples 1 and 3 were adopted from [7], while Examples 2 and 4 were borrowed from [5] and [15], respectively. We then generated the responses from the following three models.

Normal model: Y_i = c X_i^T β + ε_i, with ε_i drawn independently from N(0, 1). Binomial model: Y_i was Bernoulli with P(Y_i = 1 | X_i) = exp(c X_i^T β)/{1 + exp(c X_i^T β)}. Poisson model: Y_i was Poisson with mean exp(c X_i^T β). We considered n = 400 and p = 1000 throughout all of the examples. We specified the magnitude of the coefficients in the GLMs with a constant multiplier, c, which was set for the normal, binomial and Poisson models, respectively, to (1, 1, 0.3), (1, 1.5, 0.3), (1, 1, 0.1), (1, 1.5, 0.3) and (1, 3, 2) in Examples 1-5. For each parameter configuration, we simulated 500 independent data sets. We evaluated the performance of the methods by the criteria of true positives (TP), false positives (FP), the estimated probability of including the true model (PIT), the mean squared error (MSE) of β̂ and the mean squared prediction error (MSPE). To compute the MSPE, we randomly partitioned the samples into training (75%) and testing (25%) sets. The models obtained from the training datasets were used to predict the responses in the testing datasets. Tables 2-4 report the average TP, FP, PIT, MSE and MSPE over the 500 datasets, along with standard deviations. The findings are summarized below.
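The 75/25 train/test evaluation of the MSPE can be sketched as follows; `fit`/`predict` below use plain OLS as a placeholder for any of the fitted models, and the names are ours rather than the authors' code.

```python
import numpy as np

def mspe(X, y, fit, predict, train_frac=0.75, seed=0):
    """Mean squared prediction error on a random train/test split."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    train, test = perm[:n_train], perm[n_train:]
    model = fit(X[train], y[train])
    return float(np.mean((y[test] - predict(model, X[test])) ** 2))

# placeholder model: ordinary least squares with an intercept
fit = lambda X, y: np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)[0]
predict = lambda coef, X: np.column_stack([np.ones(len(X)), X]) @ coef
```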
First, the proposed STEPWISE method was able to detect all of the true signals with nearly zero FPs. Specifically, in all of the examples, STEPWISE outperformed the other methods by detecting more TPs with fewer FPs, whereas LASSO, SIS+LASSO and dgLARS included many more FPs.
Second, though a smaller γ in FR and SC led to the inclusion of all TPs with a PIT close to 1, it incurred more FPs. On the other hand, a larger γ may eliminate some TPs, resulting in a smaller PIT and a larger MSPE.
Third, for the normal model, the STEPWISE method yielded an MSE close to 0, the smallest among all of the competing methods. The binary and Poisson data challenged all of the methods, and the MSEs for all of the methods were non-negligible; however, the STEPWISE method still produced the lowest MSE. These results verify the consistency of β̂, which distinguishes the proposed STEPWISE method from the other regularized methods and highlights its ability to characterize the effects of high dimensional predictors more accurately.

Note: SIS+LASSO(BIC), sure independence screening by [5] followed by LASSO(BIC); SIS+LASSO(1SE), sure independence screening followed by LASSO(1SE); dgLARS(BIC), differential geometric least angle regression by [11,12] with the tuning parameter chosen to give the smallest BIC on the dgLARS path; SC(γ), sequentially conditioning approach by [8]; FR(γ), forward regression by [7]; STEPWISE, the proposed method. In FR and SC, the smaller and larger values of γ are presented as γ_L and γ_H, respectively; p_0 denotes the number of true signals. LASSO(1SE), LASSO(BIC), SIS and dgLARS were conducted via the R packages glmnet [31], ncvreg [32], screening [33] and dglars [34], respectively.

A Study of Gene Regulation in the Mammalian Eye
To demonstrate the utility of our proposed method, we analyzed a microarray dataset from [35] comprising 120 twelve-week-old male rats selected for eye tissue harvesting. The dataset contained 31,042 probe sets (Affymetrix GeneChip Rat Genome 230 2.0 Array); see [35] for a more detailed description of the data.
Although our method is applicable to the original 31,042 probe sets, many probes had very small variances and were unlikely to be informative for correlative analyses. Therefore, using variance as the screening criterion, we selected the 5000 genes with the largest expression variances and correlated them with the gene TRIM32, which has been found to cause Bardet-Biedl syndrome, a genetically heterogeneous disease of multiple organ systems including the retina [36].
We applied the proposed STEPWISE method to the dataset with n = 120 and p = 5000, treating the TRIM32 gene expression as the response variable and the expressions of the 5000 genes as the predictors. With no prior biological information available, we started with the empty set. To choose η_1 and η_2, we carried out 5-fold cross-validation to minimize the mean squared prediction error (MSPE) over the grid η_1 ∈ {0, 0.25, 0.5, 1} and η_2 ∈ {1, 2, 3, 4, 5}, which gave η_1 = 1 and η_2 = 4. We performed the same procedure to choose γ for FR and SC. The regularization parameters in LASSO and dgLARS were selected to minimize the BIC values.
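The 5-fold cross-validation used to choose (η_1, η_2) over this grid can be sketched generically. Here `fit_predict` is a stand-in for refitting STEPWISE at a given (η_1, η_2) and predicting on the held-out fold; `make_stepwise` in the trailing comment is a hypothetical wrapper, not the authors' code.

```python
import numpy as np
from itertools import product

def cv_mspe(X, y, fit_predict, k=5, seed=0):
    """k-fold cross-validated mean squared prediction error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit_predict(X[train], y[train], X[test])
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))

# the grid used for the eye data: eta1 in {0, 0.25, 0.5, 1}, eta2 in {1,...,5}
grid = list(product([0, 0.25, 0.5, 1], [1, 2, 3, 4, 5]))
# best = min(grid, key=lambda pair: cv_mspe(X, y, make_stepwise(*pair)))  # make_stepwise: hypothetical
```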
In the forward stage, STEPWISE selected the probes 1376747_at, 1381902_at, 1382673_at and 1375577_at, and the backward stage eliminated probe 1375577_at. The STEPWISE procedure produced the following final predictive model: TRIM32 = 4.6208 + 0.2310 × (1376747_at) + 0.1914 × (1381902_at) + 0.1263 × (1382673_at). Table A1 in Appendix B presents the numbers of overlapping genes among the competing methods. It shows that two of the three probes selected by our method, 1381902_at and 1376747_at, were also discovered by the other methods, except for dgLARS.
Next, we performed leave-one-out cross-validation (LOOCV) to obtain the distribution of the model size (MS) and the MSPE for the competing methods.
As reported in Table 5 and Figure 1, LASSO, SIS+LASSO and dgLARS tended to select more variables than the forward approaches and STEPWISE. Among all of the methods, STEPWISE selected the fewest variables but attained almost the same MSPE as the other methods.

Note: The mean squared prediction error (MSPE) was averaged over 120 splits. LASSO, least absolute shrinkage and selection operator with the regularization parameter that gives the smallest BIC; SIS+LASSO, sure independence screening by [5] followed by LASSO; dgLARS, differential geometric least angle regression by [11,12] with the tuning parameter that gives the smallest BIC; SC(γ), sequentially conditioning approach by [8]; FR(γ), forward regression by [7]; STEPWISE, the proposed method. STEPWISE was performed with η_1 = 1 and η_2 = 4; FR and SC were performed with γ = 1.

Figure 1. Box plot of model sizes for each method over 120 different training samples from the mammalian eye data set. STEPWISE was performed with η_1 = 1 and η_2 = 4, and FR and SC were conducted with γ = 1.

An Esophageal Squamous Cell Carcinoma Study
Esophageal squamous cell carcinoma (ESCC), the most common histological type of esophageal cancer, is known to be associated with poor overall survival, making early diagnosis crucial for treatment and disease management [37]. Several studies have investigated the roles of circulating microRNAs (miRNAs) in diagnosis of ESCC [38].
Using a clinical study that investigated the roles of miRNAs on the ESCC [39], we aimed to use miRNAs to predict ESCC risks and estimate their impacts on the development of ESCC. Specifically, with a dataset of serum profiling of 2565 miRNAs from 566 ESCC patients and 4965 controls without cancer, we demonstrated the utility of the proposed STEPWISE method in predicting ESCC with miRNAs.
To validate our findings, samples were randomly divided into a training set (n_1 = 566, p = 2565) and a testing set (n_2 = 4965, p = 2565). We used a balanced sampling scheme (283 cases and 283 controls) in the training set; a design yielding an equal number of cases and controls in the training set has proved useful [39] for handling imbalanced outcomes such as those encountered here.
The training set consisted of 283 patients with ESCC (median age 65 years, 79% male) and 283 control patients (median age 68 years, 46.3% male), and the testing set consisted of 283 patients with ESCC (median age 67 years, 85.7% male) and 4682 control patients (median age 67.5 years, 44.5% male). Control patients without ESCC came from three sources: 323 individuals from the National Cancer Center Biobank (NCCB); 2670 individuals from the Biobank of the National Center for Geriatrics and Gerontology (NCGG); and 1972 individuals from Minoru Clinic (MC). More detailed characteristics of the cases and controls in the training and testing sets are given in Table 6. We defined the binary outcome variable to be 1 if the subject was a case and 0 otherwise. As age and gender (0 = female, 1 = male) are important risk factors for ESCC [40,41] and it is common to adjust for them in clinical models, we set the initial set in STEPWISE to be F_0 = {age, gender}. With η_1 = 0 and η_2 = 3.5, also chosen by 5-fold CV, our procedure recruited three miRNAs. More specifically, miR-4783-3p, miR-320b, miR-1225-3p and miR-6789-5p were selected from among the 2565 miRNAs by the forward stage on the training set, and the backward stage then eliminated miR-6789-5p.
Our findings were biologically meaningful, as the selected miRNAs had been identified by other cancer studies as well. Specifically, miR-320b was found to promote colorectal cancer proliferation and invasion by competing with its homologous miR-320a [42]. In addition, serum levels of miR-320 family members were associated with clinical parameters and diagnosis in prostate cancer patients [43]. Reference [44] showed that miR-4783-3p was one of the miRNAs that could increase the risk of colorectal cancer death among rectal cancer cases. Finally, miR-1225-5p inhibited proliferation and metastasis of gastric carcinoma through repressing insulin receptor substrate-1 and activation of β-catenin signaling [45].
Aiming to identify a final model without resorting to a pre-screening procedure that may miss important biomarkers, we applied STEPWISE to reach a predictive model for ESCC based on patients' demographics and miRNAs. In the testing dataset, the model had an area under the receiver operating characteristic curve (AUC) of 0.99 and achieved a high accuracy of 0.96, with a sensitivity and specificity of 0.97 and 0.95, respectively. Additionally, using the testing cohort, we evaluated the performance of the models sequentially selected by STEPWISE. Starting with a model containing age and gender, STEPWISE selected miR-4783-3p, miR-320b and miR-1225-3p in turn. Figure 2, showing the corresponding receiver operating characteristic (ROC) curves for these sequential models, reveals the improvement gained by sequentially adding predictors to the model and justifies the importance of these variables in the final model. In addition, Figure 2e illustrates that adding an extra miRNA selected by FR and SC made little improvement to the model's predictive power.
Furthermore, we conducted a subgroup analysis within the testing cohort to study how the sensitivity of the final model differed by cancer stage, one of the most important risk factors. The sensitivities for stage 0, i.e., non-invasive cancer (n = 27), and stages 1 (n = 128), 2 (n = 57), 3 (n = 61) and 4 (n = 10) were 1.00, 0.98, 0.97, 0.97 and 1.00, respectively. We next evaluated how the specificity varied across controls from different data sources. The specificities for the control groups NCCB (n = 306), NCGG (n = 2512) and MC (n = 1864) were 0.99, 0.99 and 0.98, respectively. The results indicate the robust performance of the miRNA-based model across cancer stages and data sources.
Finally, to compare STEPWISE with the other competing methods, we repeatedly applied the aforementioned balanced sampling procedure and split the ESCC data into training and testing sets 100 times. We obtained the MSPE and the averages of accuracy, sensitivity, specificity and AUC. Figure 3 reports the model size of each method. Though STEPWISE selected fewer variables than the other variable selection methods (for example, LASSO selected 11-31 variables and dgLARS selected 12-51 variables), it achieved comparable prediction accuracy, specificity, sensitivity and AUC (see Table 7), evidencing the utility of STEPWISE for generating parsimonious models while maintaining competitive predictability. We used the R software [46] to obtain the numerical results in Sections 4 and 5 with the following packages: ggplot2 [47], ncvreg [32], glmnet [31], dglars [34] and screening [33].

Discussion
We have proposed STEPWISE to produce final models in ultrahigh dimensional settings, without resorting to a pre-screening step. We have shown that the method identifies or includes the true model with probability going to 1, and produces consistent coefficient estimates, which are useful for properly interpreting the actual impacts of risk factors. The theoretical properties of STEPWISE were established under mild conditions, which are worth discussing. As covariates are often standardized in practice for various reasons, Condition (2) is assumed without loss of generality. Conditions (3) and (4) are generally satisfied by common GLMs, including the Gaussian, binomial, Poisson and gamma models. Condition (5) is also often satisfied in practice; Proposition 2 in [26] may be used as a tool to verify it. Conditions (1) and (6) are in good faith with the unknown true model size |M| and the minimum signal strength n^{-α} in practice. The "irrepresentable" Condition (6) is strong and may not hold in some real datasets; see, e.g., [48,49]. However, the condition does hold under some commonly used covariance structures, including the AR(1) and compound symmetry structures [48].
As shown in simulation studies and real data analyses, STEPWISE tends to generate models as predictive as the other well-known methods, with fewer variables (Figure 3). Parsimonious models are useful for biomedical studies as they explain data with a small number of important predictors, and offer practitioners a realistic list of biomarkers to investigate. With categorical outcome data frequently observed in biomedical studies (e.g., histology types of cancer), STEPWISE can be extended to accommodate multinomial classification, with more involved notation and computation. We will pursue this elsewhere.
There are several open questions. First, our final model was determined by using (E)BIC, which involves two extra parameters, η_1 and η_2. In our numerical experiments, we used cross-validation to choose them, which seemed to work well; however, more in-depth research is needed to find optimal values that strike a balance between false positives and false negatives. Second, despite our consistent estimates, drawing inferences based on them remains challenging. Statistical inference, which accounts for uncertainty in estimation, is key for properly interpreting analysis results and drawing appropriate conclusions. Our asymptotic results, nevertheless, are a stepping stone toward this important problem.

Appendix A

In the proof of Lemma A6, we show that max_{|S|≤q} ‖β̂_S − β*_S‖ ≤ A_2 K^{-1}(q^2 log p/n)^{1/2} under Ω. Then, given an index set S and β_S such that |S| < q, ‖β_S − β*_S‖ ≤ A_2 K^{-1}(q^2 log p/n)^{1/2}, and for any j ∈ S^c, where the second inequality follows from the event Ω and Lemma A5.
Thus, there exists some constant C_1 that does not depend on n such that where the first inequality follows from β̂_{S+j} being the maximizer of (A1) and the second inequality follows from Conditions (1) and (6).
Withdrawing the restriction to $\Omega$, we obtain the corresponding unconditional probability bound.

Proof of Theorem 2. We have shown that our forward stage will not stop when $M\subseteq S$ and $|S|<q$, with probability converging to 1. For any $r\in S^c\cap M^c$, $\beta^*_{S+r}$ is the unique solution to the equation $E\big[\big(Y-\mu(\beta_{S+r}^TX_{S\cup\{r\}})\big)X_{S\cup\{r\}}\big]=0$. By the mean value theorem, where $\tilde{\beta}_{S+r}$ is some point between $\beta^*_{S+r}$ and $(\beta_S^{*T},0)^T$ and $e_r$ is a vector of length $|S|+1$ whose $r$th element is 1.
For $k>|M|$, $M\subseteq S_k$ implies that at least $k-|M|$ noise variables are selected within the $k$ steps. Then, for $k=C_2|M|$ with $C_2>2$, it follows that $M\subset S_{C_2|M|}$ with probability at least $1-18\exp(-4q\log p)$.
Proof of Theorem 3. By Theorem 2, $M$ is included in $F_k$ for some $k<q$ with probability going to 1. Therefore, the forward stage stops at the $k$th step if $\mathrm{EBIC}(F_{k+1})>\mathrm{EBIC}(F_k)$.
By Theorem 1.1 in [51] and $|S|\le q$, we can find some constant $C_5$ such that the stated bound holds, where $C_4$ is some constant that depends only on $C_5$. Thus, the quantity in question is bounded by $\sigma_{\max}\kappa_{\max}$ for all $\beta_S\in B^0_S(d)$ and all $S$ with $M\subseteq S$ and $|S|<q$. Then, by (A5), the bound holds uniformly over all $S$ satisfying $M\subseteq S$ and $|S|\le q$, with probability at least $1-\exp(-3q\log p)$. Hence, condition (A4) in [17] is satisfied with probability at least $1-\exp(-3q\log p)$.
Additionally, for any $\beta_S\in B^0_S(d)$, the corresponding bound holds. Hence, condition (A5) in [17] is satisfied uniformly over all $S$ such that $M\subseteq S$ and $|S|\le q$, with probability at least $1-\exp(-3q\log p)$.
Then (A4) can be shown by following the proof of Equation (3.2) in [17]. Thus, our forward stage stops at the $k$th step with probability at least $1-\exp(-3q\log p)$.
Proof of Theorem 4. Suppose that a covariate $X_r$ is removed from $S$. For any $r\in M$, since $M\subseteq S$ and $r$ is the only element of $(S\setminus\{r\})^c\cap M$, by Lemma A1 and (A2), the bound holds with probability at least $1-6\exp(-6q\log p)$. From the proof of Theorem 1, we have, for any $\eta_2>0$, $\mathrm{BIC}(S)-\mathrm{BIC}(S\setminus\{r\})\le -2C_1n^{-2\alpha}+\eta_2 n^{-1}\log n<0$, uniformly over $r\in M$ and $S$ satisfying $M\subset S$ and $|S|\le q$, with probability at least $1-6\exp(-6q\log p)$.

Additional Lemmas and Proofs
Lemma A1. Given a model $S$ such that $|S|<q$ and $M\nsubseteq S$, under Condition (6): (i) there exists $r\in S^c\cap M$ such that $\beta^*_{S+r}\neq(\beta_S^{*T},0)^T$. (ii) Suppose Conditions (1), (2) and (6') hold. There exists $r\in S^c\cap M$ such that $\beta^{*T}$

Proof. As $\beta^*_{S+j}$ is the maximizer of $E[\ell_{S\cup\{j\}}(\beta_{S+j})]$, by the concavity of $E[\ell_{S\cup\{j\}}(\beta_{S+j})]$, $\beta^*_{S+j}$ is the solution to the equation $E\big[\big(Y-\mu(\beta_S^{*T}X_S+\beta_jX_j)\big)X_{S\cup\{j\}}\big]=0$. (i): Suppose that $\beta^*_{S+j}=(\beta_S^{*T},0)^T$ for all $j\in S^c\cap M$. Then the resulting identity violates Condition (6). Therefore, we can find an $r\in S^c\cap M$ such that $\beta^*_{S+r}\neq(\beta_S^{*T},0)^T$. (ii): Let $r\in S^c\cap M$ satisfy $E\big[\big(\mu(\beta^{*T}X)-\mu(\beta_S^{*T}X_S)\big)X_r\big]>Cn^{-\alpha}$. Without loss of generality, we assume that $X_r$ is the last element of $X_{S\cup\{r\}}$. By the mean value theorem, where $\tilde{\beta}_{S+r}$ is some point between $\beta^*_{S+r}$ and $(\beta_S^{*T},0)^T$ and $e_r$ is a vector of length $|S|+1$ whose $r$th element is 1.

Proof. By Conditions
where the last inequality follows from $E[|Y|]\le M+\mu_{\max}$, as shown in the proof of Lemma A2, when $n$ is sufficiently large such that $20\sqrt{q\log p/n}<1$. Since $10(M+\mu_{\max})\sqrt{q\log p/n}=o(1)$, the claim follows.

Lemma A4. Given an index set $S$ and $r\in S^c$, let $B^0_S(d):=\{\beta_S:\|\beta_S-\beta^*_S\|\le d/(K\sqrt{|S|})\}$. Under Conditions (1)-(3), when $n$ is sufficiently large such that $10\sqrt{q\log p/n}<1$, we have:
1.
$\big|\mathbb{G}_n\big[L(\beta_S^TX_S,Y)-L(\beta_S^{*T}X_S,Y)\big]\big|\le 20A_1d\sqrt{q\log p}$, uniformly over $\beta_S\in B^0_S(d)$ and $|S|\le q$, with probability at least $1-4\exp(-6q\log p)$.
Proof. (1): Let $R_{|S|}(d)$ be the $|S|$-dimensional ball centered at $0$ with radius $d/(K\sqrt{|S|})$. Then $B^0_S(d)=R_{|S|}(d)+\beta^*_S$. Let $\mathcal{C}_{|S|}:=\{C(\xi_k)\}$ be a collection of cubes that cover the ball $R_{|S|}(d)$, where $C(\xi_k)$ is a cube containing $\xi_k$ with sides of length $d/(K\sqrt{|S|}n^2)$ and $\xi_k$ is some point in $R_{|S|}(d)$. As the volume of $C(\xi_k)$ is $\big(d/(K\sqrt{|S|}n^2)\big)^{|S|}$ and the volume of $R_{|S|}(d)$ is less than $\big(2d/(K\sqrt{|S|})\big)^{|S|}$, we can select the $\xi_k$ so that no more than $(4n^2)^{|S|}$ cubes are needed to cover $R_{|S|}(d)$; we thus assume $|\mathcal{C}_{|S|}|\le(4n^2)^{|S|}$. For any $\xi\in C(\xi_k)$, $\|\xi-\xi_k\|\le d/(Kn^2)$. In addition, let $T_{1S}(\xi):=\mathbb{E}_n[Y\xi^TX_S]$. Given any $\xi\in R_{|S|}(d)$, there exists $C(\xi_k)\in\mathcal{C}_{|S|}$ such that $\xi\in C(\xi_k)$; we then decompose the difference into terms I, II and III. We deal with III first. By the mean value theorem, there exists a $\tilde{\xi}$ between $\xi$ and $\xi_k$ such that the bound holds, where the last inequality follows from Lemma A2 and $A_1=M+2\mu_{\max}$. Next, we evaluate II. By Condition (2), $|X_{iS}^T\xi|\le\|X_{iS}\|\,\|\xi\|\le\big(d/(K\sqrt{|S|})\big)\sqrt{|S|}K=d$, for all $\xi\in R_{|S|}(d)$. Then, by Lemma A2 and Bernstein's inequality, the claim follows when $n$ is sufficiently large such that $10\sqrt{q\log p/n}\le 1$.
We now assess I, following the same arguments as in Lemma A3. (2): We evaluate the $m$th moment of $L(\beta_S^{*T}X_S,Y)$.
By the same arguments used in (1), we obtain that $P\Big(\sup_{|S|\le q}\big|\mathbb{G}_nL(\beta_S^{*T}X_S,Y)\big|\ge 10(A_1K^2+b_{\max})\sqrt{q\log p}\Big)\le\sum_{s=1}^{q}(ep/s)^s\,2\exp(-10q\log p)\le 2\exp(-8q\log p)$.

Proof.
Define $\Omega(d):=\Big\{\sup_{\beta_S\in B^0_S(d),\,|S|\le q}\big|\mathbb{G}_n\big[L(\beta_S^TX_S,Y)-L(\beta_S^{*T}X_S,Y)\big]\big|<20A_1d\sqrt{q\log p}\Big\}$.
By Lemma A4, the event $\Omega(d)$ holds with probability at least $1-4\exp(-6q\log p)$. Thus, in the proof of Lemma A6, we shall assume that $\Omega(d)$ holds with $d=A_2\sqrt{q^3\log p/n}$ for some $A_2>20(\sigma_{\min}\kappa_{\min})^{-1}K^2A_1$.
Then, by the concavity of $\ell_S(\cdot)$, we obtain that $\max_{|S|\le q}\|\hat{\beta}_S-\beta^*_S\|\le A_2K^{-1}\sqrt{q^2\log p/n}$.

Appendix B. Additional Results in the Applications
Note: LASSO, SIS+LASSO and dgLARS selected 20, 17 and 33 miRNAs, respectively; we report only the top 10 miRNAs.