1. Introduction
With the exponential growth of data sets in various fields over the past two decades, numerous methods have been proposed to address coefficient sparsity in high-dimensional statistical models, such as bridge regression [1], the LASSO [2], SCAD and other folded-concave penalties [3], and the Dantzig selector [4]. While these methods have demonstrated their effectiveness both theoretically and practically, real-world scenarios present new challenges, such as identifying disease-causing genes among millions of candidate genes or pinpointing the key factors driving stock price fluctuations in vast amounts of business data. To tackle ultra-high dimensional data, a range of techniques has emerged. One notable technique is Sure Independence Screening (SIS), initially developed by Fan and Lv [5] to screen out irrelevant factors before conducting variable selection in ultra-high dimensional linear models. Numerous further developments build on SIS [6,7,8,9]. However, although computationally efficient, these methods ignore correlations among the covariates. Consequently, additional procedures have been proposed to address this limitation, including ISIS [6], FR [10], and SMLE [11].
The aforementioned approaches, which are all based on the maximum likelihood function or Pearson's correlation, become invalid in the presence of outliers. Therefore, robust methods have been extensively studied in the literature. Although quantile regression [12] is effective in handling heterogeneous data, its substantially higher computational cost compared with least squares motivates the study of asymmetric least squares (ALS) regression, i.e., expectile regression [13,14,15,16]. ALS regression provides a more comprehensive picture of the conditional distribution than ordinary least squares (OLS) by assigning different squared-error losses to positive and negative residuals. Moreover, its smooth differentiability greatly reduces computational cost and facilitates theoretical analysis. Building upon ALS and quantile regression, numerous methods have been proposed to handle heterogeneous, high-dimensional data, such as [17,18] for variable selection and [19,20,21,22,23,24] for feature screening. The study of [25] proposed an expectile partial correlation screening (EPCS) procedure that sequentially identifies important variables for expectile regression in ultra-high dimensions and proved that this procedure leads to a sure screening set. Another robust parametric technique, DPD-SIS [26,27], has been developed for ultra-high dimensional linear and generalized linear models. This approach is based on the robust minimum density power divergence estimator [28], but it remains limited to marginal screening and does not account for correlations between features. In addition, DPD-SIS cannot handle heterogeneity, which is often a feature of ultra-high dimensional data.
In the context of heterogeneity and outliers in the data, we propose a new method called Robust Weighted Expectile Regression (RoWER), which combines the error criterion with expectile regression to achieve robustness and address heterogeneity. Furthermore, we develop a sparsity-restricted RoWER (SRoWER) approach for feature screening. Under general assumptions, we show that SRoWER enjoys the sure screening property. Numerical studies validate the robustness and efficacy of SRoWER. Our SRoWER method has three advantages: (1) it provides more reliable screening results, particularly in the presence of outliers in both the covariates and the response; (2) under heteroscedasticity, it yields superior estimation and feature-screening performance, as demonstrated in simulation studies; and (3) it can be solved efficiently by an iterative hard-thresholding-based algorithm.
The remaining sections of this article are organized as follows. Section 2 introduces the model and the RoWER method. In Section 3, we present the SRoWER method for feature screening and establish its sure independent screening property. Section 4 describes simulation studies and a real data analysis that evaluate the finite-sample performance of the SRoWER method. Concluding remarks are provided in Section 5. The proofs of the main results can be found in Appendix A.
3. The SRoWER and Sure Screening Property
Let $s$ be any subset of $\{1,\ldots,p\}$, which corresponds to a submodel with the relevant regression coefficient vector $\boldsymbol{\beta}_{s}$ and the design matrix $\mathbf{X}_{s}$. In addition, let $\|\cdot\|$ be the $\ell_2$-norm and $\|\cdot\|_{0}$ be the $\ell_0$-norm, which denotes the number of non-zero components of a vector. The size of a model $s$ is denoted by $|s|$. The true model is represented by $s_{*}$, with $\boldsymbol{\beta}_{*}$ being the true regression coefficient vector.
3.1. The IHT Algorithm
For the RoWER objective function, assuming that $\boldsymbol{\beta}$ is sparse with $\|\boldsymbol{\beta}\|_{0}\le k$ for some known $k$, the RoWER method with sparsity restriction (SRoWER) yields an estimator of $\boldsymbol{\beta}$ defined as the minimizer of the objective over all $k$-sparse coefficient vectors; see (5). Throughout, the support of a coefficient vector stands for the set of subscripts of its non-zero components.
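In generic notation (with $Q_{n}$ standing in for the RoWER objective of Section 2, a symbol assumed here purely for illustration), a sparsity-restricted estimator of this type takes the form

$$\hat{\boldsymbol{\beta}}_{k} \in \operatorname*{arg\,min}_{\boldsymbol{\beta}\in\mathbb{R}^{p},\ \|\boldsymbol{\beta}\|_{0}\le k} Q_{n}(\boldsymbol{\beta}),$$

which is the generic shape of constrained problems of the type referred to in (5).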
For feature screening, the goal is to retain a relatively small number of the $p$ features. Many studies have proposed methods to solve such problems. For example, Mallat and Zhang [31] proposed the matching pursuit algorithm. Moreover, the hard thresholding method proposed by Blumensath and Davies [32] is particularly effective for linear models. We now follow the idea of the iterative hard thresholding (IHT) algorithm to compute the SRoWER estimate. For $\boldsymbol{\beta}$ within a neighborhood of a given point, the IHT works with a local quadratic approximation of the objective, given in (6), in which $u$ is a scale parameter.
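A quadratic surrogate of the kind commonly used in IHT (written here with assumed notation: $Q_{n}$ for the objective, $\nabla Q_{n}$ for its gradient, and $u>0$ for the scale parameter) is

$$h(\boldsymbol{\gamma}\mid\boldsymbol{\beta}) = Q_{n}(\boldsymbol{\beta}) + (\boldsymbol{\gamma}-\boldsymbol{\beta})^{\top}\nabla Q_{n}(\boldsymbol{\beta}) + \frac{u}{2}\,\|\boldsymbol{\gamma}-\boldsymbol{\beta}\|^{2},$$

which upper-bounds $Q_{n}(\boldsymbol{\gamma})$ near $\boldsymbol{\beta}$ once $u$ dominates the local curvature; minimizing it in $\boldsymbol{\gamma}$ under the sparsity constraint reduces to a gradient step followed by hard thresholding.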
By (6), an approximate solution of (5) can be obtained by the iterative procedure in (7). The optimization in (7) is equivalent to (8). If there were no sparsity constraint, (8) would admit an explicit analytic solution. Due to the sparsity restriction, however, the update is obtained by keeping the $k$ components of that unconstrained solution with the largest absolute values and setting the remaining components to zero, i.e., by applying a hard thresholding function with threshold $r$, where $r$ is the $k$-th largest component in absolute value. Given the sparse solution $\boldsymbol{\beta}^{(t)}$ obtained at the $t$-th iteration, iterating (8) amounts to repeatedly applying this hard-thresholded update.
The ultra-high dimensional case often involves a huge amount of computation, including large matrix operations. However, the use of the hard thresholding function largely eliminates this issue. Moreover, the update naturally incorporates information on the correlations between predictors. Theorem 1 shows that the objective value decreases as the number of iterations increases.
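As an illustration, the following is a minimal sketch of an IHT-style update of this kind, assuming a generic smooth loss (an asymmetric least squares loss is used here as a stand-in for the RoWER objective); the function names, the weighting, and the step-size rule are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def als_loss_grad(beta, X, y, tau=0.5):
    """Gradient of an asymmetric least squares (expectile) loss.
    This is only a stand-in for the RoWER objective defined in Section 2."""
    r = y - X @ beta
    w = np.where(r >= 0, tau, 1.0 - tau)          # asymmetric weights on residuals
    return -2.0 * X.T @ (w * r) / len(y)

def hard_threshold(v, k):
    """Keep the k entries of v with the largest absolute values; zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def iht(X, y, k, u=None, tau=0.5, n_iter=200, beta0=None):
    """Iterative hard thresholding for a k-sparse minimizer of a smooth loss.
    u is the scale (step-size) parameter; by default it is set to a Lipschitz
    bound on the gradient of this loss, so the quadratic surrogate majorizes it."""
    n, p = X.shape
    if u is None:
        u = 2.0 * max(tau, 1.0 - tau) * np.linalg.eigvalsh(X.T @ X / n).max() + 1e-6
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    for _ in range(n_iter):
        grad = als_loss_grad(beta, X, y, tau)
        beta = hard_threshold(beta - grad / u, k)  # gradient step, then hard threshold
    return beta

# Small usage example on synthetic data
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 3.0
y = X @ beta_true + rng.standard_normal(n)
print("selected features:", np.nonzero(iht(X, y, k=10))[0])
```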
Theorem 1. Let $\{\boldsymbol{\beta}^{(t)}\}$ be the sequence obtained by (7). If the scale parameter is no smaller than the maximum eigenvalue of the associated Gram matrix, then the value of the objective decreases as the number of iterations increases, i.e., each iteration does not increase the objective.
3.2. Sure Screening Property
This subsection establishes the sure screening property of feature screening based on the SRoWER method. Define the collections of over-fitted models and under-fitted models as the candidate models that do and do not contain the true model $s_{*}$, respectively. When $p$, the size of the true model, $k$, and the other model quantities vary along with the sample size $n$, we provide the asymptotic behavior of the SRoWER screening procedure. Additionally, we make the following assumptions, some of which are purely technical and serve only to aid the theoretical understanding of the SRoWER method.
- (A1)
$\log p = O(n^{\delta})$ for some $0<\delta<1$.
- (A2)
There exist positive constants and some non-negative constants such that two requirements hold: a lower bound on the signal strength of the true coefficients and an upper bound on the sparsity of the true model (see the discussion below).
- (A3)
There exists a constant such that the regularity bound needed for the theoretical derivation holds.
- (A4)
Suppose that the random errors are i.i.d. sub-Gaussian random variables.
- (A5)
There exists a constant such that, for sufficiently large $n$, a uniform identifiability-type bound holds for every candidate model lying in the complement of the set of over-fitted models.
Condition (A1) allows $p$ to diverge exponentially with $n$, which is a common setting in the ultra-high dimensional literature. The two requirements in Condition (A2) are crucial for establishing the sure screening property: the former implies that the signals of the true model are stronger than the random errors, so they are detectable, while the latter implies that the sparsity of the true coefficient vector makes sure screening possible with a screening size $k$ of the appropriate order. Condition (A3) is a regular condition for the theoretical derivation. Condition (A4) is the same as the assumption in [17]. Condition (A5) is similar to that in [11].
Theorem 2. Suppose that Conditions (A1)–(A5) are satisfied. Let $\hat{s}$ be the estimated model obtained by the SRoWER with size $k$; then, the probability that $\hat{s}$ contains the true model $s_{*}$ tends to one as $n \to \infty$.
By using feature screening, important features that are highly correlated with the response variable are retained in $\hat{s}$. However, it should be noted that there is no explicit choice of $k$, because it depends on the dimension of the problem. Note also that the IHT algorithm needs an initial estimate. To further enhance computational efficiency, the LASSO estimate is chosen as the initial value of the iterations. The following theorem shows that, with the initial value obtained using LASSO, the IHT-implemented SRoWER satisfies the sure independent screening property within a finite number of iterations.
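Before stating the result, here is a minimal sketch of how such a LASSO initialization can be computed in practice, using scikit-learn's Lasso; `iht` refers to the illustrative IHT sketch in Section 3.1, and the penalty value below is an arbitrary choice, not the $\lambda$ of Theorem 3.

```python
from sklearn.linear_model import Lasso

def lasso_initial(X, y, lam=0.1):
    """LASSO fit used only to initialize the IHT iterations.
    lam is an arbitrary illustrative value of the tuning parameter."""
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    model.fit(X, y)
    return model.coef_

# beta0 = lasso_initial(X, y)
# beta_hat = iht(X, y, k=10, beta0=beta0)   # iht: the sketch from Section 3.1
```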
Theorem 3. Let $\boldsymbol{\beta}^{(t)}$ be the $t$-th update of the IHT procedure, let the scale parameter be chosen sufficiently large, and let $\hat{s}^{(t)}$ be the corresponding set of screened features. The initial value of the iteration is the LASSO estimate, where the tuning parameter $\lambda$ satisfies suitable rate conditions. Then, under Conditions (A1)–(A5), for any finite number of iterations $t$, the probability that $\hat{s}^{(t)}$ contains the true model $s_{*}$ tends to one.
3.3. The Choice of k
For the SRoWER method, a prespecified $k$ is needed, as in [4,6,11]. Here, we treat $k$ as a tuning parameter that controls model complexity and determine $k$ by minimizing the EBIC score of the model selected with a given $k$. The study of [33] proposed the EBIC for model selection over large model spaces. Here, we use it to determine $k$ so that the SRoWER can be compared with the EPCS proposed by [25], which also uses the EBIC for model selection.
Note that the EBIC selector for determining $k$ in principle requires searching over all candidate model sizes. To balance computation and model selection accuracy in practice, we minimize the EBIC score over a restricted range of candidate values of $k$.
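As an illustration of this tuning step, the sketch below selects $k$ by minimizing one common form of the EBIC over a small grid; the Gaussian-likelihood EBIC formula, the refitting step, and the grid are illustrative assumptions rather than the paper's exact criterion.

```python
import numpy as np

def ebic_score(X, y, support, gamma=0.5):
    """EBIC in its common Gaussian-likelihood form (Chen and Chen, 2008):
    n*log(RSS/n) + |s|*log(n) + 2*gamma*|s|*log(p).  Illustrative only."""
    n, p = X.shape
    s = np.asarray(support)
    if s.size == 0:
        rss = np.sum((y - y.mean()) ** 2)
    else:
        beta_s, *_ = np.linalg.lstsq(X[:, s], y, rcond=None)
        rss = np.sum((y - X[:, s] @ beta_s) ** 2)
    return n * np.log(rss / n) + s.size * np.log(n) + 2 * gamma * s.size * np.log(p)

def choose_k(X, y, screen, k_grid):
    """Pick k from k_grid by minimizing EBIC on the screened support.
    `screen` is any function mapping (X, y, k) to an index set of size <= k,
    e.g. a wrapper around the IHT-based screener sketched in Section 3.1."""
    scores = [ebic_score(X, y, screen(X, y, k)) for k in k_grid]
    return k_grid[int(np.argmin(scores))]
```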