Robust Variable Selection and Estimation Based on Kernel Modal Regression

Model-free variable selection has attracted increasing interest recently due to its flexibility in algorithmic design and outstanding performance in real-world applications. However, most of the existing statistical methods are formulated under the mean square error (MSE) criterion, and susceptible to non-Gaussian noise and outliers. As the MSE criterion requires the data to satisfy Gaussian noise condition, it potentially hampers the effectiveness of model-free methods in complex circumstances. To circumvent this issue, we present a new model-free variable selection algorithm by integrating kernel modal regression and gradient-based variable identification together. The derived modal regression estimator is related closely to information theoretic learning under the maximum correntropy criterion, and assures algorithmic robustness to complex noise by replacing learning of the conditional mean with the conditional mode. The gradient information of estimator offers a model-free metric to screen the key variables. In theory, we investigate the theoretical foundations of our new model on generalization-bound and variable selection consistency. In applications, the effectiveness of the proposed method is verified by data experiments.


Introduction
Variable selection has attracted increasing attention in the machine learning community due to the massive requirements of high-dimensional data mining. Under different motivations, many variable selection methods have been constructed and shown promising performance in various applications. From the viewpoint of hypothesis function space, there are mainly two types of variable selection approaches with respect to linear assumption and nonlinear additive assumption, respectively. For the linear model assumption, variable selection algorithms are usually formulated based on the least-squares empirical risk and the sparsity-induced regularization, which include Least Absolute Shrinkage and Selection Operator (Lasso) [1], Group Lasso [2] and Elastic net [3] as special examples. For the nonlinear additive model assumption, various additive models have been developed to relax the linear restriction on regression function [4,5]. It is well known that additive models enjoy the flexibility and interpretability of their representation and can remedy the curse of dimensionality of high-dimensional nonparametric regression [6][7][8]. Typical examples of additive models include Sparse Additive Models (SpAM) [9], Component Selection and Smoothing Operator (COSSO) [10] and Group Sparse Additive Models (GroupSpAM) [11]. Most of the above approaches are formulated under of each gradient associated with hypothesis function. The above three steps assure the robustness and flexibility of our new approach.
Our main contributions can be summarized as follows.
• We formulate a new RGVS method by integrating the RMR in RKHS and the model-free strategy for variable screening. This algorithm can be implemented via the half-quadratic optimization [32]. To our knowledge, this algorithm is the first one for robust model-free variable selection.

•
In theory, the proposed method enjoys statistical consistency on regression estimator under much general conditions on data noise and hypothesis space. In particular, the learning rate with polynomial decay O(n − 2 5 ) is obtained, which is faster than O(n − 1 7 ) in [20] for linear RMR. It should be noted that our work is established under the MRC, while all previous model-free methods are formulated under the MSE criterion. In addition, variable selection consistency is obtained for our approach under a self-calibration condition.

•
In application, the proposed RGVS shows the empirical effectiveness on both simulated and real-world data sets. In particular, our approach can achieve much better performance than the model-free algorithm in [12] for complex noise data, e.g., containing Chi-square noise, Exponential noise, and Student noise. Experimental results together with theoretical analysis support the effectiveness of our approach.
The rest of this paper is organized as follows. After recalling the preliminaries of modal regression, we formulate the RGVS algorithm in Section 2. Then, theoretical analysis, optimization algorithm, and empirical evaluation are provided from Section 3 to Section 5 respectively. Finally, we conclude this paper in Section 6. Table 1. Properties of different regression algorithms.

Gradient-Based Variable Selection in Modal Regression
Let X ∈ R p and Y ∈ R be a compact input space and an output space, respectively. We consider the following data-generating setting where x ∈ X , y ∈ Y and is a random noise. For the feasibility of theoretical analysis, we denote the intrinsic distribution of (x, y) ∈ Z := (X , Y ) generated in (1) as ρ. Let z = {(x i , y i )} n i=1 ∈ Z n be empirical observations drawn independently according to the unknown distribution ρ. Unlike sparse methods with certain model assumption (e.g., Lasso [1], SpAM [9]), the gradient-based sparse algorithms [12,13] mainly aim at screening out the informative variables according to the gradient information of intrinsic function. For input vector u = (u 1 , ..., u p ) T ∈ R p , the variable information is characterized by the gradient function g * j (u) := ∂ f * (u)/∂u j . Clearly, g * j (u) = 0 implies that the j-th variable is uninformative [12,17]. Considering an 2 -norm measure on the partial derivatives, we denote the true active set as where g * j 2 2 = X (g * j (x)) 2 dρ X (x) and ρ X is the marginal distribution of ρ.
Indeed, all the gradient-based variable selection algorithms [12,13,17] are constructed under Tikhonov regularization scheme in RKHS H K [33,34]. The RKHS H K associated with the Mercer kernel K is the closure of the linear span of {K x := K(x, ·) : x ∈ X }. Such a Mercer kernel K : X × X → R is a symmetric and positive semi-definite function. Denote < ·, · > K as the inner product in H K , the reproducing properties of RKHS means < f ,

Gradient-Based Variable Selection Based on Kernel Least-Squares Regression
In this subsection, we recall the gradient-based variable selection algorithm in [12] associated with least-squares error metric. When the noise in (1) satisfies E( |X) = 0 (i.e., Gaussian noise), the regression function equals to the conditional mean, which can be represented by Here ρ Y|X=x denotes the conditional distribution of Y given x. Theoretically, the regression function f * in (3) is the minimizer of expected least-squares risk As ρ is unknown in practice, we cannot get f * directly by minimizing E ( f ) over certain hypothesis space. Given training samples z, the empirical risk with respect to the expected risk E ( f ) is denoted by The gradient-based variable selection algorithm in [12] depends on the estimator defined as below: where λ > 0 is the regularization parameter and || f || K is the kernel-norm of f . The properties of RKHS [33] assure thatf where K n (x) = (K(x 1 , x), · · · , K(x n , x)) T ∈ R n andα z = (α 1 , · · · ,α n ) T ∈ R n . Denote K = (K n (x 1 ), · · · , K n (x n )) ∈ R n×n and Y = (y 1 , ..., y n ) T ∈ R n , the closed-form solution is Following Lemma 1 in [12], for any j ∈ {1, · · · , p}, we haveg After imposing the empirical norm ong j , i.e., we get the estimated active setS where v n is a pre-configured constant for variable selection.
The general variable selection method has shown some theoretical advantages in [12], e.g., the representation flexibility and the computation feasibility. However, the gradient-based method [12] may result in a degraded performance for real-world data without the zero-mean noise condition. Inspired by the modal regression [19,35] to learn the conditional mode, we propose a new robust gradient-based variable selection method under much general noise condition.

Robust Gradient-Based Variable Selection Based on Kernel Modal Regression
Unlike the traditional zero-mean noise assumption [12,17], the modal regression requires that the conditional mode of random noise is zero for any x ∈ X , i.e., where P |X is the conditional density of conditioned on X. In fact, this assumption imposes no restrictions on conditional mean, and can include the heavy-tailed noise, the skewed noise, and outliers. Then, we can verify that the mode-regression function where P Y|X (·|X) denotes the conditional density of Y conditioned on x ∈ X . It is worth noting that P Y|X (·|X = x) is assumed to be unique and existing here. As shown in [19,20], f * in (6) is the maximizer of the MRC over all measurable functions, which is defined as The maximizer of R( f ) is difficult to be obtained since both P Y|X and ρ X are unknown. Fortunately, Theorem 5 of [19] has proved that R( f ) = P E f (0), where P E f (0) is the density function of E f = Y − f (x) at 0 and which can be easily approximated by the kernel density method [20]. With the help of modal kernel K σ : R × R → R for the density estimation, we can formulate an empirical kernel density estimatorP E f at 0 Setting φ( In addition, the modal regression also can be interpreted by minimizing a mode-induced error metric [19]. When φ(u) ≤ φ(0) for any u ∈ R, the mode-induced loss can be defined as which is related closely with the correntropy-induced loss in [19,36]. Given training samples z = {(x i , y i )} n i=1 , we can formulate the RMR in RKHS as where λ > 0 is a turning parameter that controls the complexity of the hypothesis space, and f 2 K = f , f K is the kernel-norm of f ∈ H K . Denoteα = (α 1 , ...,α n ) T ∈ R n , K n (x) = (K(x 1 , x), ..., K(x n , x)) T ∈ R n and K = (K(x i , x j )) n i,j=1 ∈ R n×n . From the representer theorem of kernel methods, we can deduce that From Lemma 1 in [12], we know that for any f ∈ H K and u = (u 1 , ..., u p ) T ∈ R p , The empirical measure on gradient functionĝ j (x) is Then, the identified active set can be written aŝ where v n is a positive threshold selected under the sample-adaptive tuning framework [37].

Generalization-Bound and Variable Selection Consistency
This section establishes the theoretical guarantees on generalization ability and variable selection for the proposed RGVS. Firstly, we introduce some necessary assumptions.
Observe that some smoothing kernels meet Assumption 1, such as Gaussian kernel and Logistic kernel, etc. Assumption 2. The conditional density function P |X is second-order differentiable and P |X ∞ is bounded.

Assumption 3.
Let C s be a space of s-times continuous differentiable functions. Assume that sup x∈X K(x, x) < ∞ with K ∈ C s with s > 0, and for a given constant M, the target function satisfies f * ∈ H K with f * ∞ ≤ M. Assumption 3 has been used extensively in learning theory literatures, see, e.g., [38][39][40][41][42][43][44]. In particular, the Gaussian kernel belongs to C ∞ .
Our error analysis begins with the following inequality in [19], where the relationship between R σ ( f ) and R( f ) is provided.

This indicates us to bound the excess risk
To be specific, we further make an error decomposition as follows.
Proof. According to the definition of f z in (9), we have Then, we can deduce that This together with Lemma 1 yields the desired result.
) characterizes the divergence between the data-free risk R σ ( f ) and the empirical risk R σ z ( f ). To establish its uniform estimation, we need to give the upper bound of f z K firstly.
According to the definition of f z , we have Then, Lemma 3. For f z in (9), there holds This motivates us to measure the capacity of B r through the empirical covering number [45].
Then, the 2 -empirical covering number of function set F is defined as Next, we introduce a concentration inequality established in [46].

Lemma 4.
Let T be a function set associated with function t. Suppose that there are some constants B, c s , c θ > 0 and s ∈ [0, 1] satisfying t ∞ ≤ B, Et 2 ≤ c s (Et) s for any t ∈ T . If for 0 < θ < 2 and log N 2 (T , ) ≤ c θ −θ , ∀ > 0, then for any 0 < δ < 1 and given where c θ is a constant only depending on θ and Proof. Denote a function-based random variable set by Under Assumption 1, for any f 1 , f 2 ∈ B r , we have Combining the above inequality and the properties of empirical covering number [40,41], we have where θ is defined in (13). According to Assumption 1, there exists t ∞ ≤ φ ∞ σ . Furthermore, we get where Recalling (14) and (15), we know Lemma 4 holds true for t ∈ T with c θ = c θ r θ σ −2θ , B = φ ∞ σ , s = 0, and c s = c 2 σ −1 + c 3 σ. That is to say, for any t ∈ T and 0 < δ < 1, with confidence 1 − δ Combining Lemma 2 and (16) where C n,σ,λ is positive constants independently of n, σ, λ.
Theorem 1 provides the upper bound to the excess risk of f z under the MRC, which extends the previous ERM-based analysis in [19] to the regularized learning scheme. In addition, we can further bound f z − f * 2 L 2 ρ X after imposing Assumption 3 in [19].
The learning rate derived in Corollary 1 is faster than O(n − 1 7 ) for the linear regularized modal regression [20]. Meanwhile, it should be noted that some kernel functions meet K ∈ C ∞ , e.g., Gaussian kernel, Sigmoid kernel, and Logistic kernel.
Since the proposed RGVS employs the non-convex mode-induced loss, our variable selection analysis is completely different from kernel method with least-squares loss [12]. Here, we introduce the following self-calibration inequality, which addresses that a weak convergence on risk implies a strong convergence in kernel-norm under certain conditions. Assumption 4. For any given σ and B r with r = φ Assumption 4 characterizes the concentration of our estimator near f * with the kernel-norm metric. Indeed, the current restriction is related to Assumption 4 in [12], Theorem 2.7 in [47] for quartile regression, and the so-called RNI condition in [48,49] as well.
Proof. As shown in [12], by direct computation, there holds where HS denotes the Hilbert-Schmidt operator on , and a 1 is a positive constant. The concentration inequality for kernel operator in [17] states that with confidence 1 − δ. Meanwhile, with the similar proof of Theorem 1, we can deduce that This excess risk estimation together with Assumption 4 implies that where a 4 > 0 is a constant independently of n, δ, λ. Now we turn to investigate the relationship betweenŜ in (12) and S * in (2). Firstly, we suppose there exists some j ∈ S * but j / ∈Ŝ. That is to say ĝ j 2 n v n . By Assumption 5 with C 2 = 2a 4 log(p/δ), we have which contradicts with (21). This implies that S * ⊂Ŝ with confidence 1 − δ. Secondly, we suppose there exists some j ∈Ŝ but j / ∈ S * . This means g * j 2 which contradicts with (21) with confidence 1 − δ. Therefore, the desired property follows by combining these two results.
Theorem 2 demonstrates that the identified variables are consistent with truly informative variables with probability 1 as n → ∞. This result guarantees the variable selection performance of our approach, provided that the active variables have enough gradient signal. In the future, it is necessary to further investigate the self-calibration assumption for RMR in RKHS.
When choosing Gaussian kernel as the modal kernel, the modal regression is consistent with regression under the maximum correntropy criterion (MCC) [36]. In terms of the breakdown point theory, Theorem 24 in [19] established the robustness characterization of kernel regression under MCC and Theorem 3 in [36] provided robust analysis for RMR. These results imply the robustness of our approach.

Optimization Algorithm
With the help of half-quadratic (HQ) optimization [32], the maximization problem (9) can be transformed into a weighted least-squares problem, and then get the estimator via the ADMM [18]. Indeed, the kernel-based RMR (9) can be implemented directly by the optimization strategy in [36,51] for Gaussian kernel-based modal representation, and in [20] for Epanechnikov kernel-based modal representation. For completeness, we provide the optimization steps of (9) associated with Logistic kernel-based density estimation.
Consider a convex function As illustrated in [52], a convex function f (a) and its convex conjugate function g(b) satisfy According to the Logistic-based representation φ and (22), we have Applying (23) into (10), we can obtain the augmented objective function max α∈R n ,b∈R n 1 nσ where α = (α 1 , ..., α n ) T ∈ R n , and b = (b 1 , ..., b n ) T ∈ R n is the auxiliary vector. Then the maximization problem (24) can be solved by the following iterative optimization algorithm. According to Theorem 1 in [20], we have arg max b (ab − g(b)) = f (a). Then, for a fixed α, b i can be updated by For K = (K(x i , x j )) n i,j=1 ∈ R n×n and Y = (y 1 , ..., y n ) ∈ R n , the problem (25) can be rewritten as where diag(·) is an operator that transforms the vector into a diagonal matrix.
When α is obtained from (26), we can calculate the gradient-based measure ĝ j 2 n by (11) directly. Then we apply a pre-specified threshold v n to identify the truly active setŜ n = {j : ĝ j 2 n > v n }. Here, the threshold v n is selected by the stability-based criterion [37], which include two steps as below. Firstly, the training samples are randomly divided into two subsets, and the identified active variable sets J z,1k and J z,2k are obtained under given v n for the k-th splitting of training samples. Then, the threshold v n is updated by maximizing the Cohen kappa statistical measure 1 T T ∑ k=1 κ(J z,1k , J z,2k ).
The optimization steps of RGVS are summarized in Algorithm 1.

Empirical Assessments
This section assesses the empirical performance of our proposed method on simulated and real-world datasets. Three variable selection methods are introduced as the baselines, which include Least Absolute Shrinkage and Selection Operator (Lasso) [1], Sparse Additive Models (SpAM) [9], and General Variable Selection Method (GM) [12].

Simulated Data
Now we evaluate our approach on two synthetic data used in [12,13]. The first example is a simple additive function and the second one is a function that includes interaction terms.

Example 1.
We generate the p-dimension input x i = (x i1 , ..., x ip ) by x ij = W ij +ηV i 1+η , where both W ij and V i are extracted from the uniform distribution U(−0.5, 0.5) and η = 0.2. The output y i is generated by ) and i is a random noise. Here, we consider the Gaussian noise N (0, 1), the Chi-square noise X 2 (2), the Student noise t(2), and the Exponential noise E(2), respectively. Example 2. This example follows the way of Example 1 to generate data. The differences are that W ij and V i are extracted from the same distribution U(0, 1) and the true function f * ( For each evaluation, we consider training set with different size n = 100, 150, 200 and dimension p = 150. To make sure the results are reliable, each evaluation is repeated 50 times. Since the truly informative variables are usually unknown in practice, we evaluate the algorithmic performance according to the average squares error(ASE) defined as ASE : To better evaluate the algorithmic performance, we also adopt some metrics used in [12,13] to measure the quality of model regression, e.g., Cp (correct-fitting), SIZE (the average number of selected variables), TP (the average number of the selected true informative variables), FP (the average number of the selected uninformative variables), Up (under-fitting probability), Op (over-fitting probability). The detail result is summarized in Tables 2 and 3. To further support the competitive performance of the proposed method, we also provide the experimental results on ASE in Figure 1 and Cp in Figure 2 with n = [100 : 50 : 300] and p = 100, 200, 400. Figures 1 and 2 show that our method has always performed well with different n.
Empirical evaluations on simulated examples verify the promising performance of RGVS on variable selection and regression estimation, even for data with non-Gaussian noises (e.g., the Chi-square noise X 2 (2), the Student noise t(2), and the Exponential noise E(2)). Meanwhile, GM and RGVS have similar performance under the Gaussian noise setting, which is consistent with our motivation for algorithmic design.

Real-World Data
We now evaluate our RGVS on Auto-Mpg and Requirements of buildings, which are all collected from UCI. Since the variable number is very limited for the current datasets, 100 irrelative variables are added, which are generated from the distribution of U(−0.5, 0.5).
Auto-Mpg data describes the mile per gallon of automobile (MPG). It contains 398 samples and 7 variables, including Cylinders, Displacement, Horsepower, Weight, Acceleration, Model year, and Origin. The second real data sets is obtained to assess the heating load and cooling load requirements of buildings which contains 768 samples and 8 input variables, including Relative Compactness, Surface Area, Wall Area, Roof Area, Overall Height, Orientation, Glazing Area, and Glazing Area Distribution. In particular, it has two response variables (heating load and cooling load). Now, we use the 5-fold cross validation to tune the hyper-parameters and employ the relative sum of the squared errors (RSSE) to measure learning performance. Here RSSE = ∑ x∈X test ( f (x) − f z (x)) 2 / ∑ x∈X test ( f (x) − E( f )) 2 , where f z is the estimator of f and E( f ) denotes the average value of f on the test set X test . Experimental results are reported in Tables 4 and 5.
As shown in Table 4, our method identifies similar variables as GM, but can achieve the smaller RSSE. At same time, SpAM and Lasso tend to select less variables than GM and RGVS, which may discard the truly informative variable for regression estimation. Table 5 shows RGVS has better performance for both the Heating Load data and the Cooling Load data. All these empirical evaluations validate the effectiveness of our learning strategy consistently. Table 4. Learning performance on Auto-Mpg.

Conclusions
This paper proposes a new RGVS method rooted in kernel modal regression. The main advantages of RGVS are its flexibility on mimicking the decision function and adaptivity on screening the truly active variables. The proposed approach is evaluated by the theoretical analysis on the generalization error and variable selection, and by the empirical results on data experiments. In theory, our method can achieve the polynomial decay rate with O(n − 2 5 ). In applications, our model has shown the competitive performance for data with non-Gaussian noises.