Article

Robust Variable Selection and Estimation Based on Kernel Modal Regression

College of Science, Huazhong Agricultural University, Wuhan 430070, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Entropy 2019, 21(4), 403; https://doi.org/10.3390/e21040403
Submission received: 19 February 2019 / Revised: 8 April 2019 / Accepted: 9 April 2019 / Published: 16 April 2019
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract
Model-free variable selection has attracted increasing interest recently due to its flexibility in algorithmic design and outstanding performance in real-world applications. However, most existing statistical methods are formulated under the mean square error (MSE) criterion and are susceptible to non-Gaussian noise and outliers. As the MSE criterion requires the data to satisfy a Gaussian noise condition, it potentially hampers the effectiveness of model-free methods in complex circumstances. To circumvent this issue, we present a new model-free variable selection algorithm by integrating kernel modal regression and gradient-based variable identification. The derived modal regression estimator is closely related to information theoretic learning under the maximum correntropy criterion, and assures algorithmic robustness to complex noise by replacing learning of the conditional mean with the conditional mode. The gradient information of the estimator offers a model-free metric to screen the key variables. In theory, we investigate the theoretical foundations of our new model in terms of the generalization bound and variable selection consistency. In applications, the effectiveness of the proposed method is verified by data experiments.

1. Introduction

Variable selection has attracted increasing attention in the machine learning community due to the massive requirements of high-dimensional data mining. Under different motivations, many variable selection methods have been constructed and have shown promising performance in various applications. From the viewpoint of the hypothesis function space, there are mainly two types of variable selection approaches, based on a linear assumption and a nonlinear additive assumption, respectively. Under the linear model assumption, variable selection algorithms are usually formulated based on the least-squares empirical risk and a sparsity-inducing regularization, which include the Least Absolute Shrinkage and Selection Operator (Lasso) [1], the Group Lasso [2] and the Elastic net [3] as special examples. Under the nonlinear additive model assumption, various additive models have been developed to relax the linear restriction on the regression function [4,5]. It is well known that additive models enjoy flexible and interpretable representations and can remedy the curse of dimensionality in high-dimensional nonparametric regression [6,7,8]. Typical examples of additive models include Sparse Additive Models (SpAM) [9], the Component Selection and Smoothing Operator (COSSO) [10] and Group Sparse Additive Models (GroupSpAM) [11]. Most of the above approaches are formulated under the Tikhonov regularization scheme with a special hypothesis function space (e.g., a linear function space, or a nonlinear function space with additive structure).
More recently, some efforts have been made in [12,13,14,15] to alleviate the restriction on the hypothesis function space, requiring only that the regression function belongs to a reproducing kernel Hilbert space (RKHS). In contrast to traditional structural assumptions on the regression function, these methods identify the important variables via the gradient of a kernel-based estimator. There are two strategies to improve the model flexibility through the gradient information of the predictor. One follows the learning gradient methods in [13,14,16], where the functional gradient is used to construct the loss function for forming the empirical risk. Under this strategy, two model-free variable selection methods are presented by combining the error metric associated with the gradient information of the estimator with the coefficient-based $\ell_{2,1}$-regularizer in [13] and the $\|\cdot\|_K$-regularizer in [14], respectively. In particular, variable selection consistency is also established based on the properties of the RKHS and mild parameter conditions (e.g., on the regularization parameter and the width of the kernel). The other follows the structural sparsity issue in [15,17], where the functional gradient is employed to construct the sparsity-inducing regularization term. Rosasco et al. [17] propose a least-squares regularization scheme with nonparametric sparsity, which can be solved by an iterative procedure associated with the theory of RKHS and proximal methods. Gregorová et al. [15] introduce a nonparametric structured sparsity by considering two regularizers based on partial derivatives and solve the resulting optimization problem with the alternating direction method of multipliers (ADMM) [18]. Moreover, to further improve the computational feasibility, a three-step variable selection algorithm is developed in [12] with the help of three building blocks: kernel ridge regression, the functional gradient in RKHS, and a hard threshold. Meanwhile, the effectiveness of the proposed algorithm in [12] is supported by theoretical guarantees on variable selection consistency and empirical verification on simulated data.
Although the aforementioned methods show promising performance for identifying the active variables, all of them rely heavily on the least-squares loss under the MSE criterion, which is sensitive to non-Gaussian noise [19,20], e.g., heavy-tailed noise, skewed noise, and outliers. In essence, learning methods under MSE aim to find an approximation to the conditional mean based on empirical observations. When the data are contaminated by a complex noise without zero mean, the mean-based estimator can hardly recover the intrinsic regression function. This motivates us to formulate a new variable selection strategy in terms of another criterion based on a different statistical quantity (e.g., the conditional mode). Following the research line in [12,19], we consider a new robust variable selection method by integrating modal regression (for estimating the conditional mode function) and variable screening based on functional derivatives. To the best of our knowledge, this is the first paper to address robust model-free variable selection.
Statistical models for learning the conditional mode can be traced back to [21,22], and include the local modal regression in [23,24] and the global modal regression in [25,26,27]. Recently, the idea of modal regression has been successfully incorporated into machine learning methods, from theoretical analysis [19] to application-oriented studies (e.g., cognitive impairment prediction [20] and cluster estimation [28]). In particular, Feng et al. [19] consider a learning theory approach to modal regression and illustrate some relations between modal regression and learning under the maximum correntropy criterion [29,30,31]. In addition, Wang et al. [20] formulate a regularized modal regression (RMR) under the modal regression criterion (MRC), and establish its theoretical characteristics in terms of generalization ability, robustness, and sparsity. It is natural to extend the RMR under the linear regression assumption to the general model-free variable selection setting.
Inspired by the recent works in [12,19], we propose a new robust gradient-based variable selection method (RGVS) by integrating the RMR in RKHS and the model-free strategy for variable screening. Here, the kernel-based RMR is used to construct a robust estimator, which can reveal the true conditional mode even when facing data with non-Gaussian noise and outliers. Moreover, we evaluate the information quantity of each input variable by computing the corresponding gradient of the estimator. Finally, a hard threshold is used to identify the truly active variables after computing the empirical norm of each gradient associated with the hypothesis function. These three steps assure the robustness and flexibility of our new approach.
To better highlight the novelty of RGVS, we present Table 1 to illustrate its relation to other related methods, e.g., linear models (Lasso [1], RMR [20]), additive models (SpAM [9], COSSO [10]), and the general variable selection method (GM) [12].
Our main contributions can be summarized as follows.
  • We formulate a new RGVS method by integrating the RMR in RKHS and the model-free strategy for variable screening. This algorithm can be implemented via the half-quadratic optimization [32]. To our knowledge, this algorithm is the first one for robust model-free variable selection.
  • In theory, the proposed method enjoys statistical consistency of the regression estimator under much more general conditions on the data noise and hypothesis space. In particular, a learning rate with polynomial decay $O(n^{-2/5})$ is obtained, which is faster than the rate $O(n^{-1/7})$ in [20] for linear RMR. It should be noted that our analysis is established under the MRC, while all previous model-free methods are formulated under the MSE criterion. In addition, variable selection consistency is obtained for our approach under a self-calibration condition.
  • In application, the proposed RGVS shows empirical effectiveness on both simulated and real-world data sets. In particular, our approach can achieve much better performance than the model-free algorithm in [12] for data with complex noise, e.g., Chi-square, Exponential, and Student noise. Experimental results together with theoretical analysis support the effectiveness of our approach.
The rest of this paper is organized as follows. After recalling the preliminaries of modal regression, we formulate the RGVS algorithm in Section 2. Then, theoretical analysis, the optimization algorithm, and empirical evaluations are provided in Sections 3–5, respectively. Finally, we conclude this paper in Section 6.

2. Gradient-Based Variable Selection in Modal Regression

Let $\mathcal{X}\subset\mathbb{R}^p$ and $\mathcal{Y}\subset\mathbb{R}$ be a compact input space and an output space, respectively. We consider the following data-generating setting
$y = f^*(x) + \epsilon,$  (1)
where $x\in\mathcal{X}$, $y\in\mathcal{Y}$ and $\epsilon$ is a random noise. For the feasibility of theoretical analysis, we denote the intrinsic distribution of $(x,y)\in\mathcal{Z}:=(\mathcal{X},\mathcal{Y})$ generated in (1) as $\rho$. Let $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^{n}\in\mathcal{Z}^n$ be empirical observations drawn independently according to the unknown distribution $\rho$. Unlike sparse methods with a certain model assumption (e.g., Lasso [1], SpAM [9]), the gradient-based sparse algorithms [12,13] mainly aim at screening out the informative variables according to the gradient information of the intrinsic function. For an input vector $u=(u_1,\dots,u_p)^T\in\mathbb{R}^p$, the variable information is characterized by the gradient function $g_j^*(u):=\partial f^*(u)/\partial u_j$. Clearly, $g_j^*(u)=0$ implies that the $j$-th variable is uninformative [12,17]. Considering an $\ell_2$-norm measure on the partial derivatives, we denote the true active set as
$S^* = \{\, j : \|g_j^*\|_2^2 > 0 \,\},$  (2)
where $\|g_j^*\|_2^2 = \int_{\mathcal{X}}(g_j^*(x))^2\, d\rho_X(x)$ and $\rho_X$ is the marginal distribution of $\rho$.
Indeed, all the gradient-based variable selection algorithms [12,13,17] are constructed under the Tikhonov regularization scheme in an RKHS $\mathcal{H}_K$ [33,34]. The RKHS $\mathcal{H}_K$ associated with a Mercer kernel $K$ is the closure of the linear span of $\{K_x := K(x,\cdot) : x\in\mathcal{X}\}$, where a Mercer kernel $K:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ is a symmetric and positive semi-definite function. Denoting $\langle\cdot,\cdot\rangle_K$ as the inner product in $\mathcal{H}_K$, the reproducing property of the RKHS means $\langle f, K_x\rangle_K = f(x)$ for all $f\in\mathcal{H}_K$.

2.1. Gradient-Based Variable Selection Based on Kernel Least-Squares Regression

In this subsection, we recall the gradient-based variable selection algorithm in [12] associated with the least-squares error metric. When the noise $\epsilon$ in (1) satisfies $E(\epsilon|X)=0$ (e.g., zero-mean Gaussian noise), the regression function equals the conditional mean, which can be represented by
$f^*(x) = E(Y|X=x) = \int_{\mathcal{Y}} y \, d\rho_{Y|X=x}(y).$  (3)
Here $\rho_{Y|X=x}$ denotes the conditional distribution of $Y$ given $x$. Theoretically, the regression function $f^*$ in (3) is the minimizer of the expected least-squares risk
$\mathcal{E}(f) = \int_{\mathcal{Z}} (y - f(x))^2\, d\rho(x,y).$
As $\rho$ is unknown in practice, we cannot get $f^*$ directly by minimizing $\mathcal{E}(f)$ over a certain hypothesis space. Given training samples $\mathbf{z}$, the empirical risk with respect to the expected risk $\mathcal{E}(f)$ is denoted by
$\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2.$
The gradient-based variable selection algorithm in [12] depends on the estimator defined below:
$\tilde{f}_{\mathbf{z}} = \arg\min_{f\in\mathcal{H}_K}\big\{\mathcal{E}_{\mathbf{z}}(f) + \lambda\|f\|_K^2\big\},$
where $\lambda>0$ is the regularization parameter and $\|f\|_K$ is the kernel norm of $f$. The properties of the RKHS [33] assure that
$\tilde{f}_{\mathbf{z}}(x) = \sum_{i=1}^{n}\tilde{\alpha}_i K(x_i, x) = (\tilde{\alpha}_{\mathbf{z}})^T K_n(x),$
where $K_n(x) = (K(x_1,x),\dots,K(x_n,x))^T\in\mathbb{R}^n$ and $\tilde{\alpha}_{\mathbf{z}} = (\tilde{\alpha}_1,\dots,\tilde{\alpha}_n)^T\in\mathbb{R}^n$. Denoting $\mathbf{K} = (K_n(x_1),\dots,K_n(x_n))\in\mathbb{R}^{n\times n}$ and $Y = (y_1,\dots,y_n)^T\in\mathbb{R}^n$, the closed-form solution is
$\tilde{\alpha}_{\mathbf{z}} = (\mathbf{K}^T\mathbf{K} + n\lambda\mathbf{K})^{-1}\mathbf{K}Y.$
Following Lemma 1 in [12], for any $j\in\{1,\dots,p\}$, we have $\tilde{g}_j(x) = (\tilde{\alpha}_{\mathbf{z}})^T\partial_j K_n(x)$, where
$\partial_j K_n(u) = \Big(\frac{\partial K(x_1,u)}{\partial u_j},\dots,\frac{\partial K(x_n,u)}{\partial u_j}\Big)^T,\quad u = (u_1,\dots,u_p)^T\in\mathcal{X}.$
After imposing the empirical norm on $\tilde{g}_j$, i.e.,
$\|\tilde{g}_j\|_n^2 = \frac{1}{n}\sum_{i=1}^{n}(\tilde{g}_j(x_i))^2,$
we get the estimated active set
$\tilde{S} = \{\, j : \|\tilde{g}_j\|_n^2 > v_n \,\},$
where $v_n$ is a pre-configured constant for variable selection.
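The following minimal sketch (Python/NumPy) makes this least-squares pipeline concrete for a Gaussian kernel, for which the partial derivatives of $K$ are available in closed form. The function names are illustrative, and the kernel ridge solve is simplified to $(K + n\lambda I)^{-1}y$, which coincides with the closed form above whenever $K$ is invertible; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_gram(X1, X2, h):
    # K[i, l] = exp(-||X1[i] - X2[l]||^2 / (2 h^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * h ** 2))

def empirical_grad_norms(alpha, X, h):
    # Empirical norms ||g_j||_n^2 of the partial derivatives of
    # f(u) = sum_i alpha_i K(x_i, u), evaluated at the training points.
    # For the Gaussian kernel, dK(x_i, u)/du_j = K(x_i, u) * (x_ij - u_j) / h^2.
    n, p = X.shape
    K = gaussian_gram(X, X, h)
    norms = np.zeros(p)
    for j in range(p):
        dK = K * (X[:, j][:, None] - X[:, j][None, :]) / h ** 2   # dK[i, l] at u = x_l
        g_j = alpha @ dK                                          # g_j(x_l), l = 1, ..., n
        norms[j] = np.mean(g_j ** 2)
    return norms

def gm_variable_selection(X, y, lam, h, v_n):
    # Least-squares baseline: kernel ridge estimate, then hard-threshold
    # the empirical gradient norms.
    n = X.shape[0]
    K = gaussian_gram(X, X, h)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)           # kernel ridge coefficients
    norms = empirical_grad_norms(alpha, X, h)
    return np.where(norms > v_n)[0], norms
```

The same screening step (gradient norms plus a hard threshold) is reused by RGVS below; only the way the coefficient vector is estimated changes.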
The general variable selection method has shown some theoretical advantages in [12], e.g., representation flexibility and computational feasibility. However, the gradient-based method [12] may suffer degraded performance on real-world data that violate the zero-mean noise condition. Inspired by modal regression [19,35], which learns the conditional mode, we propose a new robust gradient-based variable selection method under a much more general noise condition.

2.2. Robust Gradient-Based Variable Selection Based on Kernel Modal Regression

Unlike the traditional zero-mean noise assumption [12,17], modal regression requires that the conditional mode of the random noise $\epsilon$ is zero for any $x\in\mathcal{X}$, i.e.,
$\mathrm{mode}(\epsilon|X=x) = \arg\max_{t\in\mathbb{R}} P_{\epsilon|X}(t|X=x) = 0,$
where $P_{\epsilon|X}$ is the conditional density of $\epsilon$ conditioned on $X$. In fact, this assumption imposes no restrictions on the conditional mean, and it covers heavy-tailed noise, skewed noise, and outliers. Then, we can verify that the mode-regression function satisfies
$f^*(x) = \mathrm{mode}(Y|X=x) = \arg\max_{t\in\mathbb{R}} P_{Y|X}(t|X=x),$  (6)
where $P_{Y|X}(\cdot|X=x)$ denotes the conditional density of $Y$ conditioned on $x\in\mathcal{X}$. It is worth noting that $P_{Y|X}(\cdot|X=x)$ is assumed to exist and to have a unique global mode here. As shown in [19,20], $f^*$ in (6) is the maximizer of the MRC over all measurable functions, which is defined as
$\mathcal{R}(f) = \int_{\mathcal{X}} P_{Y|X}(f(x)|X=x)\, d\rho_X(x).$
The maximizer of $\mathcal{R}(f)$ is difficult to obtain since both $P_{Y|X}$ and $\rho_X$ are unknown. Fortunately, Theorem 5 of [19] proves that $\mathcal{R}(f) = P_{E_f}(0)$, where $P_{E_f}(0)$ is the density function of $E_f = Y - f(X)$ at $0$, which can be easily approximated by the kernel density method [20]. With the help of a modal kernel $K_\sigma:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}$ for the density estimation, we can formulate an empirical kernel density estimator $\hat{P}_{E_f}$ at $0$:
$\hat{P}_{E_f}(0) = \frac{1}{n\sigma}\sum_{i=1}^{n} K_\sigma(y_i - f(x_i), 0) = \frac{1}{n\sigma}\sum_{i=1}^{n} K_\sigma(y_i, f(x_i)) := \mathcal{R}_{\mathbf{z}}^{\sigma}(f).$
Setting $\phi\big(\frac{y-f(x)}{\sigma}\big) := K_\sigma(y, f(x))$, we get the corresponding expected version
$\mathcal{R}^{\sigma}(f) = \frac{1}{\sigma}\int_{\mathcal{X}\times\mathcal{Y}} \phi\Big(\frac{y-f(x)}{\sigma}\Big)\, d\rho(x,y).$
In addition, modal regression can also be interpreted as minimizing a mode-induced error metric [19]. When $\phi(u)\leq\phi(0)$ for any $u\in\mathbb{R}$, the mode-induced loss can be defined as
$L_\sigma(y - f(x)) = \sigma^{-1}\big(\phi(0) - \phi((y-f(x))/\sigma)\big),$
which is closely related to the correntropy-induced loss in [19,36]. Given training samples $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^{n}$, we can formulate the RMR in RKHS as
$f_{\mathbf{z}} = \arg\max_{f\in\mathcal{H}_K} \mathcal{R}_{\mathbf{z}}^{\sigma}(f) - \lambda\|f\|_K^2,$  (9)
where $\lambda>0$ is a tuning parameter that controls the complexity of the hypothesis space, and $\|f\|_K^2 = \langle f, f\rangle_K$ is the kernel norm of $f\in\mathcal{H}_K$.
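The mode-induced loss above is bounded by $\phi(0)/\sigma$, which is the source of the robustness. The small sketch below (a toy illustration, not the authors' code) compares it with the squared loss for two representing functions satisfying Assumption 1 later on: the normalized Gaussian and the Logistic kernel.

```python
import numpy as np

def gaussian_phi(u):
    # Normalized Gaussian representing function.
    return np.exp(-u ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def logistic_phi(u):
    # Logistic representing function: phi(u) = 1 / (e^u + 2 + e^{-u}), integrates to 1.
    return 1.0 / (np.exp(u) + 2.0 + np.exp(-u))

def mode_induced_loss(residual, sigma, phi=gaussian_phi):
    # L_sigma(r) = (phi(0) - phi(r / sigma)) / sigma; bounded in r.
    return (phi(0.0) - phi(residual / sigma)) / sigma

r = np.array([0.1, 1.0, 10.0])          # the last residual mimics an outlier
print(mode_induced_loss(r, 0.5))        # saturates near phi(0)/sigma for the outlier
print(mode_induced_loss(r, 0.5, logistic_phi))
print(r ** 2)                           # the squared loss grows without bound
```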
Denote $\hat{\alpha}_{\mathbf{z}} = (\hat{\alpha}_{\mathbf{z},1},\dots,\hat{\alpha}_{\mathbf{z},n})^T\in\mathbb{R}^n$, $K_n(x) = (K(x_1,x),\dots,K(x_n,x))^T\in\mathbb{R}^n$ and $\mathbf{K} = (K(x_i,x_j))_{i,j=1}^{n}\in\mathbb{R}^{n\times n}$. From the representer theorem of kernel methods, we can deduce that
$f_{\mathbf{z}}(x) = \sum_{i=1}^{n}\hat{\alpha}_{\mathbf{z},i} K(x_i, x) = \hat{\alpha}_{\mathbf{z}}^T K_n(x),$
with
$\hat{\alpha}_{\mathbf{z}} = \arg\max_{\alpha\in\mathbb{R}^n} \frac{1}{n\sigma}\sum_{i=1}^{n}\phi\Big(\frac{y_i - K_n^T(x_i)\alpha}{\sigma}\Big) - \lambda\alpha^T\mathbf{K}\alpha.$  (10)
From Lemma 1 in [12], we know that for any $f\in\mathcal{H}_K$ and $u = (u_1,\dots,u_p)^T\in\mathbb{R}^p$,
$\hat{g}_j(u) = \frac{\partial f(u)}{\partial u_j} = \Big\langle f, \frac{\partial K(u,\cdot)}{\partial u_j}\Big\rangle_K.$
The empirical measure on the gradient function $\hat{g}_j(x)$ is
$\|\hat{g}_j\|_n^2 = \frac{1}{n}\sum_{i=1}^{n}(\hat{g}_j(x_i))^2 = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{\alpha}_{\mathbf{z}}^T\partial_j K_n(x_i)\big)^2.$  (11)
Then, the identified active set can be written as
$\hat{S} = \{\, j : \|\hat{g}_j\|_n^2 > v_n \,\},$  (12)
where $v_n$ is a positive threshold selected under the sample-adaptive tuning framework [37].
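As a concrete reading of (9)–(11), the following sketch evaluates the regularized modal-regression objective for a candidate coefficient vector, assuming a Gaussian modal kernel (so that $\phi$ is the normalized Gaussian density); maximizing this quantity over $\alpha$ is deferred to the half-quadratic procedure of Section 4, and the gradient-norm screening then reuses the routine sketched in Section 2.1.

```python
import numpy as np

def modal_objective(alpha, K, y, sigma, lam):
    # R_z^sigma(f) - lam * ||f||_K^2 for f(x) = sum_i alpha_i K(x_i, x),
    # with a normalized Gaussian modal kernel: the first term is simply the
    # kernel density estimate of the residuals y_i - f(x_i) evaluated at zero.
    n = len(y)
    residuals = y - K @ alpha
    phi = np.exp(-(residuals / sigma) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
    density_at_zero = phi.sum() / (n * sigma)
    return density_at_zero - lam * alpha @ K @ alpha
```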

3. Generalization-Bound and Variable Selection Consistency

This section establishes the theoretical guarantees on generalization ability and variable selection for the proposed RGVS. Firstly, we introduce some necessary assumptions.
Assumption 1.
The representing function $\phi$ associated with the modal kernel $K_\sigma:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}_+$ satisfies: (i) $\phi$ is bounded with $\int_{\mathbb{R}}\phi(u)\,du = 1$, $\phi(u) = \phi(-u)$ and $\phi(u)\leq\phi(0)$, $\forall u\in\mathbb{R}$; (ii) $\phi(\cdot)$ is differentiable with $\|\phi'\|_\infty < \infty$ and $\int_{\mathbb{R}} u^2\phi(u)\,du < \infty$.
Observe that some smoothing kernels meet Assumption 1, such as Gaussian kernel and Logistic kernel, etc.
Assumption 2.
The conditional density function $P_{\epsilon|X}$ is second-order differentiable and $P''_{\epsilon|X}$ is bounded.
Assumption 2 has been used in [19,20]; together with Assumption 1, it assures an upper bound on $|\mathcal{R}(f) - \mathcal{R}^{\sigma}(f)|$.
Assumption 3.
Let $C^s$ be the space of $s$-times continuously differentiable functions. Assume that $\sup_{x\in\mathcal{X}} K(x,x) < \infty$ with $K\in C^s$ for some $s>0$, and that, for a given constant $M$, the target function satisfies $f^*\in\mathcal{H}_K$ with $\|f^*\|_\infty\leq M$.
Assumption 3 has been used extensively in the learning theory literature; see, e.g., [38,39,40,41,42,43,44]. In particular, the Gaussian kernel belongs to $C^\infty$.
Our error analysis begins with the following inequality in [19], which provides the relationship between $\mathcal{R}^{\sigma}(f)$ and $\mathcal{R}(f)$.
Lemma 1.
Under Assumptions 1–2, there holds
$\big|\mathcal{R}(f^*) - \mathcal{R}(f) - \big(\mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f)\big)\big| \leq c_1\sigma^2$
for any measurable function $f:\mathcal{X}\rightarrow\mathbb{R}$, where $c_1 = \|P''_{\epsilon|X}\|_\infty\int_{\mathbb{R}} u^2\phi(u)\,du$.
This allows us to bound the excess risk $\mathcal{R}(f^*) - \mathcal{R}(f_{\mathbf{z}})$ via estimating $\mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f_{\mathbf{z}})$. To be specific, we further make an error decomposition as follows.
Lemma 2.
Under Assumptions 1–3, there holds
$\mathcal{R}(f^*) - \mathcal{R}(f_{\mathbf{z}}) \leq \mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f_{\mathbf{z}}) - \big(\mathcal{R}_{\mathbf{z}}^{\sigma}(f^*) - \mathcal{R}_{\mathbf{z}}^{\sigma}(f_{\mathbf{z}})\big) + \lambda\|f^*\|_K^2 + c_1\sigma^2.$
Proof. 
According to the definition of $f_{\mathbf{z}}$ in (9), we have
$\mathcal{R}_{\mathbf{z}}^{\sigma}(f^*) - \lambda\|f^*\|_K^2 - \big(\mathcal{R}_{\mathbf{z}}^{\sigma}(f_{\mathbf{z}}) - \lambda\|f_{\mathbf{z}}\|_K^2\big) \leq 0.$
Then, we can deduce that
$\mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f_{\mathbf{z}}) = \big[\mathcal{R}^{\sigma}(f^*) - \mathcal{R}_{\mathbf{z}}^{\sigma}(f^*)\big] + \big[\mathcal{R}_{\mathbf{z}}^{\sigma}(f^*) - \lambda\|f^*\|_K^2 - \big(\mathcal{R}_{\mathbf{z}}^{\sigma}(f_{\mathbf{z}}) - \lambda\|f_{\mathbf{z}}\|_K^2\big)\big] + \lambda\|f^*\|_K^2 - \lambda\|f_{\mathbf{z}}\|_K^2 + \big[\mathcal{R}_{\mathbf{z}}^{\sigma}(f_{\mathbf{z}}) - \mathcal{R}^{\sigma}(f_{\mathbf{z}})\big] \leq \mathcal{R}^{\sigma}(f^*) - \mathcal{R}_{\mathbf{z}}^{\sigma}(f^*) + \mathcal{R}_{\mathbf{z}}^{\sigma}(f_{\mathbf{z}}) - \mathcal{R}^{\sigma}(f_{\mathbf{z}}) + \lambda\|f^*\|_K^2,$
that is,
$\mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f_{\mathbf{z}}) \leq \mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f_{\mathbf{z}}) - \big(\mathcal{R}_{\mathbf{z}}^{\sigma}(f^*) - \mathcal{R}_{\mathbf{z}}^{\sigma}(f_{\mathbf{z}})\big) + \lambda\|f^*\|_K^2.$
This together with Lemma 1 yields the desired result. □
Observe that $\mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f_{\mathbf{z}}) - \big(\mathcal{R}_{\mathbf{z}}^{\sigma}(f^*) - \mathcal{R}_{\mathbf{z}}^{\sigma}(f_{\mathbf{z}})\big)$ characterizes the divergence between the data-free risk $\mathcal{R}^{\sigma}(f)$ and the empirical risk $\mathcal{R}_{\mathbf{z}}^{\sigma}(f)$. To establish a uniform estimate of this quantity, we first need to bound $\|f_{\mathbf{z}}\|_K$.
According to the definition of $f_{\mathbf{z}}$, we have
$\mathcal{R}_{\mathbf{z}}^{\sigma}(0) \leq \mathcal{R}_{\mathbf{z}}^{\sigma}(f_{\mathbf{z}}) - \lambda\|f_{\mathbf{z}}\|_K^2.$
Then,
$\|f_{\mathbf{z}}\|_K \leq \sqrt{\frac{\mathcal{R}_{\mathbf{z}}^{\sigma}(f_{\mathbf{z}}) - \mathcal{R}_{\mathbf{z}}^{\sigma}(0)}{\lambda}} \leq \sqrt{\frac{\|\phi\|_\infty}{\lambda\sigma}}.$
Lemma 3.
For $f_{\mathbf{z}}$ in (9), there holds
$\|f_{\mathbf{z}}\|_K \leq \sqrt{\frac{\|\phi\|_\infty}{\lambda\sigma}}.$
Lemma 3 tells us that $f_{\mathbf{z}}\in B_r$ with $r = \sqrt{\|\phi\|_\infty/(\lambda\sigma)}$ for any $\mathbf{z}\in\mathcal{Z}^n$, where $B_r = \{f\in\mathcal{H}_K : \|f\|_K\leq r\}$. This motivates us to measure the capacity of $B_r$ through the empirical covering number [45].
Definition 1.
Suppose that $\mathcal{F}$ is a set of functions on $\mathbf{x} = \{x_1,\dots,x_n\}$ with the $\ell_2$-empirical metric $d_{2,\mathbf{x}}(f,g) = \big(\frac{1}{n}\sum_{i=1}^{n}(f(x_i)-g(x_i))^2\big)^{1/2}$, $f,g\in\mathcal{F}$. Then, the $\ell_2$-empirical covering number of the function set $\mathcal{F}$ is defined as
$\mathcal{N}_2(\mathcal{F},\epsilon) = \sup_{n\in\mathbb{N}}\sup_{\mathbf{x}}\mathcal{N}_{2,\mathbf{x}}(\mathcal{F},\epsilon),\quad \epsilon>0,$
where
$\mathcal{N}_{2,\mathbf{x}}(\mathcal{F},\epsilon) = \inf\big\{ l\in\mathbb{N} : \exists\,\{f_j\}_{j=1}^{l}\subset\mathcal{F}\ \mathrm{s.t.}\ \mathcal{F}\subset\textstyle\bigcup_{j=1}^{l} B(f_j,\epsilon) \big\}$
with $B(f_j,\epsilon) = \{f\in\mathcal{F} : d_{2,\mathbf{x}}(f,f_j) < \epsilon\}$.
Next, we introduce a concentration inequality established in [46].
Lemma 4.
Let $\mathcal{T}$ be a set of functions $t$. Suppose that there are constants $B, c_s, c_\theta > 0$ and $s\in[0,1]$ satisfying $\|t\|_\infty\leq B$ and $Et^2\leq c_s(Et)^s$ for any $t\in\mathcal{T}$. If, for some $0<\theta<2$, $\log\mathcal{N}_2(\mathcal{T},\epsilon)\leq c_\theta\epsilon^{-\theta}$ for all $\epsilon>0$, then for any $0<\delta<1$ and given $\mathbf{z}=\{z_i\}_{i=1}^{n}\in\mathcal{Z}^n$, there holds, with confidence at least $1-\delta$,
$Et - \frac{1}{n}\sum_{i=1}^{n}t(z_i) \leq \frac{1}{2}\eta^{1-s}(Et)^s + c'_\theta\eta + 2\Big(\frac{c_s\log(1/\delta)}{n}\Big)^{\frac{1}{2-s}} + \frac{18B\log(1/\delta)}{n},\quad \forall t\in\mathcal{T},$
where $c'_\theta$ is a constant depending only on $\theta$ and
$\eta = \max\Big\{ c_s^{\frac{2-\theta}{4-2s+s\theta}}\Big(\frac{c_\theta}{n}\Big)^{\frac{2}{4-2s+s\theta}},\ B^{\frac{2-\theta}{2+\theta}}\Big(\frac{c_\theta}{n}\Big)^{\frac{2}{2+\theta}} \Big\}.$
Theorem 1.
Under Assumptions 1–3, taking $\sigma = n^{-1/5}$ and $\lambda = n^{-2/5}$, we have, for any $0<\delta<1$,
$\mathcal{R}(f^*) - \mathcal{R}(f_{\mathbf{z}}) \leq C n^{-\zeta}\log\big(\tfrac{1}{\delta}\big)$
with confidence at least $1-\delta$, where $\zeta = \min\big\{\frac{8-9\theta}{20},\ \frac{8-6\theta}{5(2+\theta)},\ \frac{2}{5}\big\}$ and
$\theta = \begin{cases} \frac{2p}{p+2s}, & 0 < s \leq 1,\\[2pt] \frac{2p}{p+2}, & 1 < s \leq 1+\frac{p}{2},\\[2pt] \frac{p}{s}, & s > 1+\frac{p}{2}. \end{cases}$  (13)
Proof. 
Denote a function-based random variable set by
$\mathcal{T} = \Big\{ t(z) := t_f(z) = \frac{1}{\sigma}\Big[\phi\Big(\frac{y-f^*(x)}{\sigma}\Big) - \phi\Big(\frac{y-f(x)}{\sigma}\Big)\Big] : f\in B_r \Big\}.$
Under Assumption 1, for any $f_1, f_2\in B_r$, we have
$|t_{f_1}(z) - t_{f_2}(z)| = \frac{1}{\sigma}\Big|\phi\Big(\frac{y-f_1(x)}{\sigma}\Big) - \phi\Big(\frac{y-f_2(x)}{\sigma}\Big)\Big| \leq \frac{\|\phi'\|_\infty}{\sigma}\Big|\frac{y-f_1(x)}{\sigma} - \frac{y-f_2(x)}{\sigma}\Big| \leq \frac{\|\phi'\|_\infty}{\sigma^2}|f_1(x) - f_2(x)|.$
Combining the above inequality and the properties of the empirical covering number [40,41], we have
$\log\mathcal{N}_2(\mathcal{T},\epsilon) \leq \log\mathcal{N}_2\Big(B_1, \frac{\epsilon\sigma^2}{r\|\phi'\|_\infty}\Big) \leq C_\theta r^\theta\sigma^{-2\theta}\epsilon^{-\theta},$  (14)
where $\theta$ is defined in (13).
According to Assumption 1, there holds $\|t\|_\infty\leq\frac{\|\phi\|_\infty}{\sigma}$. Furthermore, we get
$Et^2 \leq \frac{\|\phi\|_\infty}{\sigma}\, E\Big[\frac{1}{\sigma}\phi\Big(\frac{y-f^*(x)}{\sigma}\Big) - \frac{1}{\sigma}\phi\Big(\frac{y-f(x)}{\sigma}\Big)\Big] = \frac{\|\phi\|_\infty}{\sigma}\big(\mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f)\big) \leq \frac{\|\phi\|_\infty}{\sigma}\big(P_{E_{f^*}}(0) - P_{E_f}(0) + c_1\sigma^2\big) \leq \frac{\|\phi\|_\infty}{\sigma}\big(P_{E_{f^*}}(0) - P_{E_f}(0)\big) + \|\phi\|_\infty c_1\sigma \leq \sigma^{-1}c_2 + \sigma c_3,$  (15)
where $c_2 = \|\phi\|_\infty\big(P_{E_{f^*}}(0) - P_{E_f}(0)\big)$ and $c_3 = c_1\|\phi\|_\infty$ are bounded constants.
Recalling (14) and (15), we know that Lemma 4 holds true for $t\in\mathcal{T}$ with $c_\theta = C_\theta r^\theta\sigma^{-2\theta}$, $B = \frac{\|\phi\|_\infty}{\sigma}$, $s = 0$, and $c_s = c_2\sigma^{-1} + c_3\sigma$. That is to say, for any $t\in\mathcal{T}$ and $0<\delta<1$, with confidence $1-\delta$,
$\mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f) - \big(\mathcal{R}_{\mathbf{z}}^{\sigma}(f^*) - \mathcal{R}_{\mathbf{z}}^{\sigma}(f)\big) \leq \Big(\frac{1}{2} + c'_\theta\Big)\max\Big\{ (c_2\sigma^{-1}+c_3\sigma)^{\frac{2-\theta}{4}}\Big(\frac{C_\theta r^\theta\sigma^{-2\theta}}{n}\Big)^{\frac{1}{2}},\ \Big(\frac{\|\phi\|_\infty}{\sigma}\Big)^{\frac{2-\theta}{2+\theta}}\Big(\frac{C_\theta r^\theta\sigma^{-2\theta}}{n}\Big)^{\frac{2}{2+\theta}} \Big\} + 2\Big(\frac{(c_2\sigma^{-1}+c_3\sigma)\log(1/\delta)}{n}\Big)^{\frac{1}{2}} + \frac{18\|\phi\|_\infty\log(1/\delta)}{n\sigma}.$  (16)
Combining Lemma 2 and (16) with $r = \sqrt{\|\phi\|_\infty/(\lambda\sigma)}$, we have, with confidence at least $1-\delta$,
$\mathcal{R}(f^*) - \mathcal{R}(f_{\mathbf{z}}) \leq C'\log\big(\tfrac{1}{\delta}\big)\Big( \max\big\{ \sigma^{-\frac{2+5\theta}{4}} n^{-\frac{1}{2}},\ \sigma^{-\frac{2+4\theta}{2+\theta}} n^{-\frac{2}{2+\theta}}\lambda^{-\frac{\theta}{2+\theta}} \big\} + n^{-\frac{1}{2}}\sigma^{-\frac{1}{2}} + \lambda + \sigma^2 \Big),$  (17)
where $C'$ is a positive constant independent of $n$, $\sigma$, $\lambda$.
Setting $\sigma^2 = n^{-\frac{1}{2}}\sigma^{-\frac{1}{2}}$ and $\lambda = \sigma^2$, i.e., $\sigma^{5/2} = n^{-1/2}$, we have $\sigma = n^{-1/5}$ and $\lambda = n^{-2/5}$. Putting these selected parameters into (17), we get the desired estimation. □
Theorem 1 provides an upper bound on the excess risk of $f_{\mathbf{z}}$ under the MRC, which extends the previous ERM-based analysis in [19] to the regularized learning scheme. In addition, we can further bound $\|f_{\mathbf{z}} - f^*\|_{L^2_{\rho_X}}^2$ after imposing Assumption 3 in [19].
Corollary 1.
Let the conditions of Theorem 1 hold and assume that $K\in C^\infty$. Then
$\mathcal{R}(f^*) - \mathcal{R}(f_{\mathbf{z}}) \leq O\big(n^{-2/5}\log(1/\delta)\big)$
with confidence at least $1-\delta$.
The learning rate derived in Corollary 1 is faster than the rate $O(n^{-1/7})$ for the linear regularized modal regression [20]. Meanwhile, it should be noted that several kernel functions satisfy $K\in C^\infty$, e.g., the Gaussian kernel, the Sigmoid kernel, and the Logistic kernel.
Since the proposed RGVS employs the non-convex mode-induced loss, our variable selection analysis is completely different from that of the kernel method with the least-squares loss [12]. Here, we introduce the following self-calibration inequality, which states that weak convergence in risk implies strong convergence in the kernel norm under certain conditions.
Assumption 4.
For any given $\sigma$ and $B_r$ with $r = \|\phi\|_\infty^{1/2}\, n^{1/4}\sigma^{-1/4}$, there exists a universal constant $C_1 > 0$ such that
$\mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f) \geq C_1\|f^* - f\|_K^2,\quad \forall f\in B_r.$
Assumption 4 characterizes the concentration of our estimator near $f^*$ in the kernel-norm metric. Indeed, this restriction is related to Assumption 4 in [12], to Theorem 2.7 in [47] for quantile regression, and to the so-called RNI condition in [48,49] as well.
In addition, the following condition is required, which implies that the gradient functions associated with the truly informative variables are well separated from zero. Similar assumptions can also be found in [12,50]. For simplicity, we denote $\|g_j\|_2^2 := \int_{\mathcal{X}}(g_j(x))^2\, d\rho_X(x)$.
Assumption 5.
There exists some constant $C_2 > 0$ such that
$\min_{j\in S^*}\|g_j^*\|_2^2 > C_2\, n^{-\min\{\frac{1}{8},\, \frac{4-\theta}{16+8\theta}\}}.$
Theorem 2.
Let Assumptions 1–5 be true. For any given $\sigma > \max\{n^{-\frac{4-\theta}{8+14\theta}},\, n^{-\frac{2}{4+10\theta}}\}$, set $\lambda = n^{-\frac{1}{2}}\sigma^{-\frac{1}{2}}$ in (9) and $v_n = C_2\, n^{-\min\{\frac{1}{8},\,\frac{4-\theta}{16+8\theta}\}}$ in (12). Then $\mathrm{Prob}\{\hat{S} = S^*\}\rightarrow 1$ as $n\rightarrow\infty$.
Proof. 
As shown in [12], by direct computation, there holds
$\big|\|\hat{g}_j\|_n^2 - \|g_j^*\|_2^2\big| \leq a_1^3\|f_{\mathbf{z}} - f^*\|_K + \|\hat{D}_j^*\hat{D}_j - D_j^*D_j\|_{HS},\quad \forall j,$  (18)
where $\|\cdot\|_{HS}$ denotes the Hilbert–Schmidt norm on $\mathcal{H}_K$, $D_j^*D_j f = \int_{\mathcal{X}}\partial_j K_x\, g_j(x)\, d\rho_X(x)$, $\hat{D}_j^*\hat{D}_j f = \frac{1}{n}\sum_{i=1}^{n}\partial_j K_{x_i}\, g_j(x_i)$, and $a_1$ is a positive constant. The concentration inequality for kernel operators in [17] states that
$\|\hat{D}_j^*\hat{D}_j - D_j^*D_j\|_{HS} \leq \frac{8\kappa^2}{\sqrt{n}}\log\Big(\frac{4p}{\delta}\Big)$  (19)
with confidence $1-\delta$.
Meanwhile, following the proof of Theorem 1, we can deduce that
$\mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f_{\mathbf{z}}) \leq a_2\log(1/\delta)\Big( \max\big\{ \sigma^{-\frac{2+5\theta}{4}} n^{-\frac{1}{2}},\ \sigma^{-\frac{2+4\theta}{2+\theta}} n^{-\frac{2}{2+\theta}}\lambda^{-\frac{\theta}{2+\theta}} \big\} + n^{-\frac{1}{2}}\sigma^{-\frac{1}{2}} + \lambda \Big)$
with confidence at least $1-\delta$, where $a_2$ is a positive constant. Setting $\lambda = n^{-\frac{1}{2}}\sigma^{-\frac{1}{2}}$ and using $\sigma^{-\frac{2+5\theta}{4}} n^{-\frac{1}{4}} \leq 1$ and $\sigma^{-\frac{4+7\theta}{4+2\theta}} n^{-\frac{4-\theta}{8+4\theta}} \leq 1$, we further get
$\mathcal{R}^{\sigma}(f^*) - \mathcal{R}^{\sigma}(f_{\mathbf{z}}) \leq a_3\log(1/\delta)\, n^{-\min\{\frac{1}{4},\,\frac{4-\theta}{8+4\theta}\}}$
with confidence $1-\delta$, where $a_3$ is a positive constant. This excess risk estimate together with Assumption 4 implies that
$\|f_{\mathbf{z}} - f^*\|_K^2 \leq a_3 C_1^{-1}\log(1/\delta)\, n^{-\min\{\frac{1}{4},\,\frac{4-\theta}{8+4\theta}\}}$  (20)
with confidence $1-\delta$.
Combining (18)–(20), we have, with confidence $1-\delta$,
$\big|\|\hat{g}_j\|_n^2 - \|g_j^*\|_2^2\big| \leq a_4\log(p/\delta)\, n^{-\min\{\frac{1}{8},\,\frac{4-\theta}{16+8\theta}\}},\quad \forall j,$  (21)
where $a_4 > 0$ is a constant independent of $n$, $\delta$, $\lambda$.
Now we turn to investigate the relationship between $\hat{S}$ in (12) and $S^*$ in (2). First, suppose there exists some $j\in S^*$ but $j\notin\hat{S}$, that is, $\|\hat{g}_j\|_n^2 \leq v_n$. By Assumption 5 with $C_2 = 2a_4\log(p/\delta)$, we have
$\big|\|\hat{g}_j\|_n^2 - \|g_j^*\|_2^2\big| \geq \|g_j^*\|_2^2 - \|\hat{g}_j\|_n^2 > a_4\log(p/\delta)\, n^{-\min\{\frac{1}{8},\,\frac{4-\theta}{16+8\theta}\}},$
which contradicts (21). This implies that $S^*\subseteq\hat{S}$ with confidence $1-\delta$.
Second, suppose there exists some $j\in\hat{S}$ but $j\notin S^*$. This means $\|g_j^*\|_2^2 = 0$ and $\|\hat{g}_j\|_n^2 > v_n$. Then
$\big|\|\hat{g}_j\|_n^2 - \|g_j^*\|_2^2\big| = \|\hat{g}_j\|_n^2 > v_n = a_4\log(p/\delta)\, n^{-\min\{\frac{1}{8},\,\frac{4-\theta}{16+8\theta}\}},$
which contradicts (21) with confidence $1-\delta$. Therefore, the desired property follows by combining these two results. □
Theorem 2 demonstrates that the identified variables coincide with the truly informative variables with probability tending to 1 as $n\rightarrow\infty$. This result guarantees the variable selection performance of our approach, provided that the active variables carry enough gradient signal. In the future, it is necessary to further investigate the self-calibration assumption for RMR in RKHS.
When the Gaussian kernel is chosen as the modal kernel, modal regression coincides with regression under the maximum correntropy criterion (MCC) [36]. In terms of the breakdown point theory, Theorem 24 in [19] establishes the robustness characterization of kernel regression under the MCC, and Theorem 3 in [36] provides a robustness analysis for RMR. These results imply the robustness of our approach.

4. Optimization Algorithm

With the help of half-quadratic (HQ) optimization [32], the maximization problem (9) can be transformed into a weighted least-squares problem, whose estimator can then be obtained, e.g., via the ADMM [18]. Indeed, the kernel-based RMR (9) can be implemented directly by the optimization strategy in [36,51] for the Gaussian kernel-based modal representation, and in [20] for the Epanechnikov kernel-based modal representation. For completeness, we provide the optimization steps of (9) associated with the Logistic kernel-based density estimation.
Consider the convex function
$f(a) = \frac{1}{\exp(\sqrt{a}) + 2 + \exp(-\sqrt{a})},\quad a > 0.$
As illustrated in [52], a convex function $f(a)$ and its convex conjugate function $g(b)$ satisfy
$f(a) = \max_{b}\big(ab - g(b)\big).$  (22)
According to the Logistic-based representation $\phi$ and (22), we have
$\phi(t) = f(t^2) = \max_{b}\big(t^2 b - g(b)\big),\quad t\in\mathbb{R}.$  (23)
Applying (23) to (10), we obtain the augmented objective function
$\max_{\alpha\in\mathbb{R}^n,\, b\in\mathbb{R}^n} \frac{1}{n\sigma}\sum_{i=1}^{n}\Big[ b_i\Big(\frac{y_i - \alpha^T K_n(x_i)}{\sigma}\Big)^2 - g(b_i) \Big] - \lambda\alpha^T\mathbf{K}\alpha,$  (24)
where $\alpha = (\alpha_1,\dots,\alpha_n)^T\in\mathbb{R}^n$, and $b = (b_1,\dots,b_n)^T\in\mathbb{R}^n$ is the auxiliary vector. Then the maximization problem (24) can be solved by the following iterative optimization algorithm.
According to Theorem 1 in [20], we have $\arg\max_{b}(ab - g(b)) = f'(a)$. Then, for a fixed $\alpha$, $b_i$ can be updated by $b_i = f'\big(\big(\frac{y_i - \alpha^T K_n(x_i)}{\sigma}\big)^2\big)$. Once $b$ is settled down, update $\alpha$ via
$\arg\max_{\alpha\in\mathbb{R}^n} \sum_{i=1}^{n}\frac{b_i}{\sigma}\big(y_i - \alpha^T K_n(x_i)\big)^2 - \lambda\alpha^T\mathbf{K}\alpha.$  (25)
For $\mathbf{K} = (K(x_i,x_j))_{i,j=1}^{n}\in\mathbb{R}^{n\times n}$ and $Y = (y_1,\dots,y_n)^T\in\mathbb{R}^n$, problem (25) can be rewritten as
$\arg\min_{\alpha\in\mathbb{R}^n}\ (Y - \mathbf{K}\alpha)^T\,\mathrm{diag}\Big(\frac{b}{\sigma}\Big)(Y - \mathbf{K}\alpha) + \lambda\alpha^T\mathbf{K}\alpha,$  (26)
where $\mathrm{diag}(\cdot)$ is an operator that transforms a vector into a diagonal matrix. By setting $\partial\big[(Y - \mathbf{K}\alpha)^T\mathrm{diag}(b/\sigma)(Y - \mathbf{K}\alpha) + \lambda\alpha^T\mathbf{K}\alpha\big]/\partial\alpha = 0$, we have
$\alpha = 4\Big(\mathbf{K}\,\mathrm{diag}\Big(\frac{b}{\sigma}\Big) + \lambda I\Big)^{-1}\mathrm{diag}\Big(\frac{b}{\sigma}\Big)Y.$
When $\alpha$ is obtained by solving (26), we can calculate the gradient-based measure $\|\hat{g}_j\|_n^2$ by (11) directly. Then we apply a pre-specified threshold $v_n$ to identify the estimated active set $\hat{S}_n = \{j : \|\hat{g}_j\|_n^2 > v_n\}$. Here, the threshold $v_n$ is selected by the stability-based criterion [37], which includes the following two steps. First, the training samples are randomly split into two subsets, and the identified active variable sets $J_{\mathbf{z},1}^{k}$ and $J_{\mathbf{z},2}^{k}$ are obtained under a given $v_n$ for the $k$-th split of the training samples. Then, $v_n$ is chosen by maximizing the averaged Cohen's kappa statistic $\frac{1}{T}\sum_{k=1}^{T}\kappa(J_{\mathbf{z},1}^{k}, J_{\mathbf{z},2}^{k})$.
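A rough sketch of this stability-based tuning step is given below. Here `fit_and_screen` is a hypothetical placeholder standing for one run of RGVS (or GM) on a subsample at threshold v_n that returns the selected index set, and the handling of the degenerate all-agree case is an assumption of this sketch rather than a prescription from [37].

```python
import numpy as np

def cohen_kappa(J1, J2, p):
    # Agreement between two selected-variable sets, viewed as binary labelings
    # of the p candidate variables.
    a = np.zeros(p, dtype=int); a[list(J1)] = 1
    b = np.zeros(p, dtype=int); b[list(J2)] = 1
    po = np.mean(a == b)
    pe = np.mean(a) * np.mean(b) + (1 - np.mean(a)) * (1 - np.mean(b))
    return 1.0 if pe >= 1.0 else (po - pe) / (1.0 - pe)   # degenerate case: full agreement

def stability_select_threshold(X, y, thresholds, fit_and_screen, T=20, seed=None):
    # Pick v_n maximizing the average kappa between the variable sets identified
    # on two random halves of the training data over T random splits.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros(len(thresholds))
    for _ in range(T):
        perm = rng.permutation(n)
        half1, half2 = perm[: n // 2], perm[n // 2:]
        for t, v in enumerate(thresholds):
            J1 = fit_and_screen(X[half1], y[half1], v)
            J2 = fit_and_screen(X[half2], y[half2], v)
            scores[t] += cohen_kappa(J1, J2, p)
    return thresholds[int(np.argmax(scores))]
```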
The optimization steps of RGVS are summarized in Algorithm 1.
Algorithm 1: Optimization algorithm of RGVS with the Logistic kernel
Input: Samples $\mathbf{z}$, the modal representation $\phi$ (Logistic kernel), Mercer kernel $K$;
Initialization: $t = 0$, $\alpha^0$, bandwidth $\sigma$, Max-iter $= 10^2$, $\varepsilon = 10^{-3}$;
Obtain $f_{\mathbf{z}}$ in RKHS:
      While $\alpha$ not converged and $t <$ Max-iter:
            1. With $\alpha^t$ fixed, update $b_i^{t+1} = \dfrac{\exp(q) - \exp(-q)}{2q\big(\exp(q) + 2 + \exp(-q)\big)^2}$, $q = \dfrac{y_i - K_n^T(x_i)\alpha^t}{\sigma}$;
            2. With $b^{t+1}$ fixed, update $\alpha^{t+1} = 4\Big(\mathbf{K}\,\mathrm{diag}\Big(\dfrac{b^{t+1}}{\sigma}\Big) + \lambda I\Big)^{-1}\mathrm{diag}\Big(\dfrac{b^{t+1}}{\sigma}\Big)Y$;
            3. Check the convergence condition: $\|\alpha^{t+1} - \alpha^t\|_2 < \varepsilon$;
            4. $t \leftarrow t + 1$;
      End While
      Output: $\hat{\alpha}_{\mathbf{z}} = \alpha^{t+1}$;
Variable Selection: $\hat{S}_n = \big\{ j : \frac{1}{n}\sum_{i=1}^{n}\big(\hat{\alpha}_{\mathbf{z}}^T\partial_j K_n(x_i)\big)^2 > v_n \big\}$.
Output: $\hat{S}_n$
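Since Algorithm 1 is stated for the Logistic kernel, the following sketch shows an analogous fixed-point (HQ-style) update for the Gaussian modal kernel $\phi(u)=\exp(-u^2/2)$, obtained by setting the gradient of the objective in (9)–(10) to zero. The scaling constant $2n\sigma^3\lambda$ and the simplification that cancels $\mathbf{K}$ (assuming $\mathbf{K}$ is invertible) are derivation choices of this sketch, not formulas taken from the paper.

```python
import numpy as np

def rgvs_alpha_gaussian(K, y, sigma, lam, max_iter=100, tol=1e-3):
    # Stationarity of (1/(n*sigma)) sum_i phi((y_i - K_i alpha)/sigma) - lam*alpha'K alpha
    # with phi(u) = exp(-u^2/2) reads  K W (y - K alpha) = 2 n sigma^3 lam K alpha,
    # where W = diag(w) and w_i = exp(-r_i^2 / (2 sigma^2)) down-weights outliers.
    # Cancelling K yields the reweighted ridge fixed point iterated below.
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(max_iter):
        r = y - K @ alpha
        w = np.exp(-r ** 2 / (2.0 * sigma ** 2))
        alpha_new = np.linalg.solve(w[:, None] * K + 2.0 * n * sigma ** 3 * lam * np.eye(n),
                                    w * y)
        if np.linalg.norm(alpha_new - alpha) < tol:
            return alpha_new
        alpha = alpha_new
    return alpha
```

After convergence, the resulting coefficients are plugged into the gradient-norm screening of (11)–(12), exactly as in the Variable Selection step of Algorithm 1.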

5. Empirical Assessments

This section assesses the empirical performance of our proposed method on simulated and real-world datasets. Three variable selection methods are introduced as the baselines, which include Least Absolute Shrinkage and Selection Operator (Lasso) [1], Sparse Additive Models (SpAM) [9], and General Variable Selection Method (GM) [12].
In all experiments, the RKHS $\mathcal{H}_K$ associated with the Gaussian kernel $K_h(u,v) = \exp\big(-\frac{\|u-v\|_2^2}{2h^2}\big)$ is employed as the hypothesis function space. For simplicity, we denote by $RGVS_{Gau}$ and $RGVS_{Log}$ the proposed RGVS method with the Gaussian modal kernel and the Logistic modal kernel, respectively. In the simulated experiments, we generate three datasets (with identical sample size) independently as the training set, the validation set, and the testing set, respectively. The hyper-parameters are tuned via grid search on the validation set, and the corresponding grids are as follows: (i) the regularization parameter $\lambda$: $\{10^{-3}, 5\times 10^{-3}, 10^{-2}, 5\times 10^{-2}, \dots, 1, 5, 10\}$; (ii) the bandwidths $\sigma$ and $h$: $\{1 + 10^{-1}i,\ i = 0, 1, \dots, 100\}$; (iii) the threshold $v_n$: $\{10^{-3} + 0.1t,\ t = 0, \dots, 60\}$.

5.1. Simulated Data

Now we evaluate our approach on two synthetic datasets used in [12,13]. The first example is a simple additive function, and the second one involves interaction terms.
Example 1.
We generate the $p$-dimensional input $x_i = (x_{i1},\dots,x_{ip})$ by $x_{ij} = \frac{W_{ij} + \eta V_i}{1 + \eta}$, where both $W_{ij}$ and $V_i$ are drawn from the uniform distribution $U(-0.5, 0.5)$ and $\eta = 0.2$. The output $y_i$ is generated by $y_i = f^*(x_i) + \epsilon_i$, where
$f^*(x_i) = 5x_{i1} + 4(x_{i2} - 1)^2 + 0.5\sin(\pi x_{i3}) + \cos(\pi x_{i3}) + 1.5(\sin(\pi x_{i3}))^2 + 2.5(\sin(\pi x_{i3}))^3 + 2(\cos(\pi x_{i3}))^3 + \frac{6\sin(\pi x_{i4})}{2 - \sin(\pi x_{i4})}$
and $\epsilon_i$ is a random noise. Here, we consider the Gaussian noise $N(0,1)$, the Chi-square noise $\chi^2(2)$, the Student noise $t(2)$, and the Exponential noise $E(2)$, respectively.
Example 2.
This example follows the data-generating scheme of Example 1. The differences are that $W_{ij}$ and $V_i$ are drawn from the distribution $U(0,1)$ and the true function is $f^*(x_i) = 20\, x_{i1} x_{i2} x_{i3} + 5 x_{i4}^2 + 5 x_{i5}$.
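A minimal data-generation sketch for Example 1 is given below (Example 2 follows the same pattern with $U(0,1)$ inputs and its own $f^*$). The centering of the uniform distribution to $U(-0.5, 0.5)$, the "$2-\sin$" denominator, and the parameterization of the Exponential noise $E(2)$ as scale 2 are reconstructions/assumptions of this sketch.

```python
import numpy as np

def generate_example1(n, p, noise="gaussian", eta=0.2, seed=None):
    # x_ij = (W_ij + eta * V_i) / (1 + eta); only the first four variables are informative.
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, size=(n, p))
    V = rng.uniform(-0.5, 0.5, size=(n, 1))
    X = (W + eta * V) / (1.0 + eta)
    s3, c3 = np.sin(np.pi * X[:, 2]), np.cos(np.pi * X[:, 2])
    s4 = np.sin(np.pi * X[:, 3])
    f_star = (5 * X[:, 0] + 4 * (X[:, 1] - 1) ** 2
              + 0.5 * s3 + c3 + 1.5 * s3 ** 2 + 2.5 * s3 ** 3 + 2 * c3 ** 3
              + 6 * s4 / (2 - s4))
    eps = {"gaussian": rng.normal(0.0, 1.0, n),
           "chisq": rng.chisquare(2, n),
           "student": rng.standard_t(2, n),
           "exponential": rng.exponential(2.0, n)}[noise]
    return X, f_star + eps, f_star
```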
For each evaluation, we consider training sets with different sizes $n = 100, 150, 200$ and dimension $p = 150$. To make the results reliable, each evaluation is repeated 50 times. Since the truly informative variables are usually unknown in practice, we evaluate the algorithmic performance according to the average squared error (ASE), defined as $ASE := \frac{1}{n}\sum_{i=1}^{n}(f^*(x_i) - f_{\mathbf{z}}(x_i))^2$. To better evaluate the algorithmic performance, we also adopt some metrics used in [12,13] to measure the quality of variable selection, i.e., Cp (correct-fitting probability), SIZE (the average number of selected variables), TP (the average number of selected truly informative variables), FP (the average number of selected uninformative variables), Up (under-fitting probability), and Op (over-fitting probability). The detailed results are summarized in Table 2 and Table 3. To further support the competitive performance of the proposed method, we also provide the experimental results on ASE in Figure 1 and on Cp in Figure 2 with $n = 100, 150, \dots, 300$ and $p = 100, 200, 400$. Figure 1 and Figure 2 show that our method performs well consistently across different $n$.
Empirical evaluations on the simulated examples verify the promising performance of RGVS on variable selection and regression estimation, even for data with non-Gaussian noise (e.g., the Chi-square noise $\chi^2(2)$, the Student noise $t(2)$, and the Exponential noise $E(2)$). Meanwhile, GM and RGVS have similar performance under the Gaussian noise setting, which is consistent with our motivation for the algorithmic design.

5.2. Real-World Data

We now evaluate our RGVS on the Auto-Mpg data and on a dataset describing the heating and cooling load requirements of buildings, both collected from the UCI repository. Since the number of variables is very limited for these datasets, 100 irrelevant variables are added, which are generated from the distribution $U(-0.5, 0.5)$.
The Auto-Mpg data describe the miles per gallon (MPG) of automobiles. It contains 398 samples and 7 variables, including Cylinders, Displacement, Horsepower, Weight, Acceleration, Model year, and Origin. The second real-world dataset is used to assess the heating load and cooling load requirements of buildings; it contains 768 samples and 8 input variables, including Relative Compactness, Surface Area, Wall Area, Roof Area, Overall Height, Orientation, Glazing Area, and Glazing Area Distribution. In particular, it has two response variables (heating load and cooling load).
We use 5-fold cross-validation to tune the hyper-parameters and employ the relative sum of squared errors (RSSE) to measure the learning performance. Here $RSSE = \sum_{x\in X_{test}}(f(x) - f_{\mathbf{z}}(x))^2 \big/ \sum_{x\in X_{test}}(f(x) - E(f))^2$, where $f_{\mathbf{z}}$ is the estimator of $f$ and $E(f)$ denotes the average value of $f$ on the test set $X_{test}$. Experimental results are reported in Table 4 and Table 5.
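For reference, a one-line sketch of the RSSE metric just defined; on real data the observed test responses play the role of $f$, which is an assumption of this sketch since the true regression function is unknown.

```python
import numpy as np

def rsse(y_test, y_pred):
    # Relative sum of squared errors: values below 1 beat the mean predictor.
    y_test, y_pred = np.asarray(y_test, float), np.asarray(y_pred, float)
    return np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
```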
As shown in Table 4, our method identifies similar variables to GM, but achieves a smaller RSSE. At the same time, SpAM and Lasso tend to select fewer variables than GM and RGVS, which may discard truly informative variables for regression estimation. Table 5 shows that RGVS has better performance on both the Heating Load and the Cooling Load data. All these empirical evaluations consistently validate the effectiveness of our learning strategy.

6. Conclusions

This paper proposes a new RGVS method rooted in kernel modal regression. The main advantages of RGVS are its flexibility in approximating the regression function and its adaptivity in screening the truly active variables. The proposed approach is supported by theoretical analysis of the generalization error and variable selection consistency, and by empirical results on data experiments. In theory, our method achieves a polynomial decay rate of $O(n^{-2/5})$. In applications, our model shows competitive performance for data with non-Gaussian noise.

Author Contributions

Methodology, B.S., H.C.; software, C.G., Y.W.; validation, B.S., H.X. and Y.W.; formal analysis, C.G., H.C.; writing—original draft preparation, C.G.,Y.W.; writing—review and editing, H.C. and H.X.

Funding

This research was funded in part by the National Natural Science Foundation of China (NSFC) under Grants 11601174 and 11671161.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar]
  2. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. 2006, 68, 49–67. [Google Scholar] [CrossRef] [Green Version]
  3. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320. [Google Scholar] [CrossRef]
  4. Stone, C.J. Additive regression and other nonparametric models. Ann. Stat. 1985, 13, 689–705. [Google Scholar] [CrossRef]
  5. Hastie, T.J.; Tibshirani, R.J. Generalized Additive Models; Chapman and Hall: London, UK, 1990. [Google Scholar]
  6. Kandasamy, K.; Yu, Y. Additive approximations in high dimensional nonparametric regression via the SALSA. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016. [Google Scholar]
  7. Kohler, M.; Krzyżak, A. Nonparametric regression based on hierarchical interaction models. IEEE Trans. Inf. Theory 2017, 63, 1620–1630. [Google Scholar] [CrossRef]
  8. Chen, H.; Wang, X.; Huang, H. Group sparse additive machine. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 198–208. [Google Scholar]
  9. Ravikumar, P.; Liu, H.; Lafferty, J.; Wasserman, L. SpAM: Sparse additive models. J. R. Stat. Soc. Ser. B 2009, 71, 1009–1030. [Google Scholar] [CrossRef]
  10. Lin, Y.; Zhang, H.H. Component selection and smoothing in multivariate nonparametric regression. Ann. Stat. 2007, 34, 2272–2297. [Google Scholar] [CrossRef]
  11. Yin, J.; Chen, X.; Xing, E.P. Group sparse additive models. In Proceedings of the International Conference on Machine Learning (ICML), Edinburgh, UK, 26 June–1 July 2012. [Google Scholar]
  12. He, X.; Wang, J.; Lv, S. Scalable kernel-based variable selection with sparsistency. arXiv 2018, arXiv:1802.09246. [Google Scholar]
  13. Yang, L.; Lv, S.; Wang, J. Model-free variable selection in reproducing kernel Hilbert space. J. Mach. Learn. Res. 2016, 17, 1–24. [Google Scholar]
  14. Ye, G.; Xie, X. Learning sparse gradients for variable selection and dimension reduction. Mach. Learn. 2012, 87, 303–355. [Google Scholar] [CrossRef] [Green Version]
  15. Gregorová, M.; Kalousis, A.; Marchand-Maillet, S. Structured nonlinear variable selection. arXiv 2018, arXiv:1805.06258. [Google Scholar]
  16. Mukherjee, S.; Zhou, D.X. Learning coordinate covariances via gradients. J. Mach. Learn. Res. 2006, 7, 519–549. [Google Scholar]
  17. Rosasco, L.; Villa, S.; Mosci, S.; Santoro, M.; Verri, A. Nonparametric sparsity and regularization. J. Mach. Learn. Res. 2013, 14, 1665–1714. [Google Scholar]
  18. Boyd, S.; Parikh, N.; Chu, E.; Peleato, B. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 2011, 3, 1–122. [Google Scholar] [CrossRef]
  19. Feng, Y.; Fan, J.; Suykens, J.A.K. A statistical learning approach to modal regression. arXiv 2017, arXiv:1702.05960. [Google Scholar]
  20. Wang, X.; Chen, H.; Cai, W.; Shen, D.; Huang, H. Regularized modal regression with applications in cognitive impairment prediction. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1448–1458. [Google Scholar]
  21. Parzen, E. On estimation of a probability density function and mode. Ann. Math. Stat. 1962, 33, 1065–1076. [Google Scholar] [CrossRef]
  22. Chernoff, H. Estimation of the mode. Ann. Inst. Stat. Math. 1964, 16, 31–41. [Google Scholar] [CrossRef]
  23. Yao, W.; Lindsay, B.G.; Li, R. Local modal regression. J. Nonparametr. Stat. 2012, 24, 647–663. [Google Scholar] [CrossRef]
  24. Chen, Y.C.; Genovese, C.R.; Tibshirani, R.J.; Wasserman, L. Nonparametric modal regression. Ann. Stat. 2014, 44, 489–514. [Google Scholar] [CrossRef]
  25. Collomb, G.; Härdle, W.; Hassani, S. A note on prediction via estimation of the conditional mode function. J. Stat. Plan. Inference 1986, 15, 227–236. [Google Scholar] [CrossRef]
  26. Lee, M.J. Mode regression. J. Econom. 1989, 42, 337–349. [Google Scholar] [CrossRef]
  27. Sager, T.W.; Thisted, R.A. Maximum likelihood estimation of isotonic modal regression. Ann. Stat. 1982, 10, 690–707. [Google Scholar] [CrossRef]
  28. Li, J.; Ray, S.; Lindsay, B. A nonparametric statistical approach to clustering via mode identification. J. Mach. Learn. Res. 2007, 8, 1687–1723. [Google Scholar]
  29. Liu, W.; Pokharel, P.P.; Príncipe, J.C. Correntropy: Properties and applications in non-Gaussian signal processing. IEEE Trans. Signal Process. 2007, 55, 5286–5298. [Google Scholar] [CrossRef]
  30. Príncipe, J.C. Information Theoretic Learning: Rényi’s Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010. [Google Scholar]
  31. Feng, Y.; Huang, X.; Shi, L.; Yang, Y.; Suykens, J.A.K. Learning with the maximum correntropy criterion induced losses for regression. J. Mach. Learn. Res. 2015, 16, 993–1034. [Google Scholar]
  32. Nikolova, M.; Ng, M.K. Analysis of half-quadratic minimization methods for signal and image recovery. SIAM J. Sci. Comput. 2005, 27, 937–966. [Google Scholar] [CrossRef]
  33. Aronszajn, N. Theory of Reproducing Kernels. Trans. Am. Math. Soc. 1950, 68, 337–404. [Google Scholar] [CrossRef]
  34. Cucker, F.; Zhou, D.X. Learning Theory: An Approximation Theory Viewpoint; Cambridge University Press: Cambridge, UK, 2007. [Google Scholar]
  35. Yao, W.; Li, L. A new regression model: Modal linear regression. Scand. J. Stat. 2013, 41, 656–671. [Google Scholar] [CrossRef]
  36. Chen, H.; Wang, Y. Kernel-based sparse regression with the correntropy-induced loss. Appl. Comput. Harmon. Anal. 2018, 44, 144–164. [Google Scholar] [CrossRef]
  37. Sun, W.; Wang, J.; Fang, Y. Consistent selection of tuning parameters via variable selection stability. J. Mach. Learn. Res. 2012, 14, 3419–3440. [Google Scholar]
  38. Zou, B.; Li, L.; Xu, Z. The generalization performance of ERM algorithm with strongly mixing observations. Mach. Learn. 2009, 75, 275–295. [Google Scholar] [CrossRef] [Green Version]
  39. Guo, Z.C.; Zhou, D.X. Concentration estimates for learning with unbounded sampling. Adv. Comput. Math. 2013, 38, 207–223. [Google Scholar] [CrossRef]
  40. Shi, L.; Feng, Y.; Zhou, D.X. Concentration estimates for learning with ℓ1-regularizer and data dependent hypothesis spaces. Appl. Comput. Harmon. Anal. 2011, 31, 286–302. [Google Scholar] [CrossRef]
  41. Shi, L. Learning theory estimates for coefficient-based regularized regression. Appl. Comput. Harmon. Anal. 2013, 34, 252–265. [Google Scholar] [CrossRef] [Green Version]
  42. Chen, H.; Pan, Z.; Li, L.; Tang, Y. Error analysis of coefficient-based regularized algorithm for density-level detection. Neural Comput. 2013, 25, 1107–1121. [Google Scholar] [CrossRef]
  43. Zou, B.; Xu, C.; Lu, Y.; Tang, Y.Y.; Xu, J.; You, X. k-Times markov sampling for SVMC. IEEE Trans. Neural Networks Learn. Syst. 2018, 29, 1328–1341. [Google Scholar] [CrossRef] [PubMed]
  44. Li, L.; Li, W.; Zou, B.; Wang, Y.; Tang, Y.Y.; Han, H. Learning with coefficient-based regularized regression on Markov resampling. IEEE Trans. Neural Networks Learn. Syst. 2018, 29, 4166–4176. [Google Scholar]
  45. Steinwart, I.; Christmann, A. Support Vector Machines; Springer Science and Business Media: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  46. Wu, Q.; Ying, Y.; Zhou, D.X. Multi-kernel regularized classifiers. J. Complex. 2007, 23, 108–134. [Google Scholar] [CrossRef] [Green Version]
  47. Steinwart, I.; Christmann, A. Estimating conditional quantiles with the help of the pinball loss. Bernoulli 2011, 17, 211–225. [Google Scholar] [CrossRef]
  48. Belloni, A.; Chernozhukov, V. ℓ1-penalized quantile regression in high-dimensional sparse models. Ann. Stat. 2009, 39, 82–130. [Google Scholar] [CrossRef]
  49. Kato, K. Group Lasso for high dimensional sparse quantile regression models. arXiv 2011, arXiv:1103.1458. [Google Scholar]
  50. Lv, S.; Lin, H.; Lian, H.; Huang, J. Oracle inequalities for sparse additive quantile regression in reproducing kernel Hilbert space. Ann. Stat. 2018, 46, 781–813. [Google Scholar] [CrossRef]
  51. Wang, Y.; Tang, Y.Y.; Li, L. Correntropy matching pursuit with application to robust digit and face recognition. IEEE Trans. Cybern. 2017, 47, 1354–1366. [Google Scholar] [CrossRef] [PubMed]
  52. Rockafellar, R.T. Convex Analysis; Princeton Univ. Press: Princeton, NJ, USA, 1997. [Google Scholar]
Figure 1. The average squared error (ASE) vs. the sample size n under different noise settings (panels A and B correspond to Example 1 and Example 2, respectively).
Figure 2. The correct-fitting probability (Cp) vs. the sample size n under different noise settings (panels A and B correspond to Example 1 and Example 2, respectively).
Table 1. Properties of different regression algorithms.

                     Lasso [1]   RMR [20]   SpAM [9]   COSSO [10]   GM [12]      Ours
Learning criterion   MSE         MRC        MSE        MSE          MSE          MRC
Model assumption     linear      linear     additive   additive     model-free   model-free
Table 2. The averaged performance on simulated data in Example 1 (left block of columns) and Example 2 (right block of columns).
Noise | (n, p) | Method | SIZE, TP, FP, Up, Op, Cp, ASE (Example 1) | SIZE, TP, FP, Up, Op, Cp, ASE (Example 2)
(100, 150)Lasso3.923.920.000.360.000.641.3694.404.280.120.440.120.445.112
n < p SpAM4.123.920.200.080.160.761.0755.024.980.040.040.040.921.611
GM4.123.880.240.120.160.721.1235.144.980.160.040.120.841.775
R G V S G a u 4.003.920.080.080.040.881.0035.125.000.120.000.080.921.565
R G V S L o g 3.843.800.040.200.040.761.1315.124.920.200.080.160.761.914
N ( 0 , 1 ) (150, 150)Lasso4.203.920.040.160.120.721.2454.484.280.200.400.160.444.794
Gaussian Noise n = p SpAM4.004.000.000.000.001.000.8045.005.000.000.000.001.001.612
GM3.963.920.040.080.040.881.0115.045.000.040.000.040.961.627
R G V S G a u 3.963.960.000.040.000.960.8995.005.000.000.000.001.001.500
R G V S L o g 3.963.800.040.080.040.881.0835.025.000.020.000.030.971.622
(200, 150)Lasso4.003.920.080.200.000.801.2524.524.520.000.400.000.602.507
n > p SpAM4.004.000.000.000.001.000.8295.045.000.040.000.020.981.497
GM3.963.960.000.040.000.961.0125.065.000.060.000.040.961.528
R G V S G a u 4.004.000.000.000.001.000.8135.005.000.000.000.001.001.485
R G V S L o g 3.963.960.000.040.000.960.9155.005.000.000.000.001.001.453
(100, 150)Lasso3.723.640.080.360.040.605.8544.323.720.600.680.120.206.888
n < p SpAM4.243.760.480.240.280.484.3425.525.000.520.000.400.604.328
GM4.243.800.440.200.360.444.0124.484.440.040.300.050.653.611
R G V S G a u 4.163.960.200.040.200.762.9375.084.920.160.060.160.783.486
R G V S L o g 4.183.900.280.160.120.722.6815.084.840.240.080.180.743.968
X 2 ( 2 ) (150, 150)Lasso5.323.841.480.160.480.365.3925.164.161.000.600.160.244.503
Chi-square Noise n = p SpAM4.043.960.080.040.080.882.7655.325.000.320.000.240.763.748
GM4.003.880.120.120.080.802.8734.984.920.060.050.050.904.173
R G V S G a u 3.963.920.040.080.040.882.8095.025.000.020.000.020.982.929
R G V S L o g 4.083.960.120.040.120.842.0975.045.000.040.000.040.963.519
(200, 150)Lasso4.244.000.240.000.280.725.8054.364.320.040.520.040.443.754
n > p SpAM4.084.000.080.000.080.922.4635.045.000.040.000.040.963.634
GM4.044.000.040.000.040.962.5235.185.000.180.000.200.803.816
R G V S G a u 3.963.960.000.040.000.962.4495.005.000.000.000.001.002.989
R G V S L o g 3.963.960.000.040.000.961.7385.005.000.000.000.001.003.457
(100, 150)Lasso3.463.460.000.600.000.404.6314.644.000.640.600.200.204.567
n < p SpAM4.283.880.400.120.280.604.5995.845.000.840.000.440.564.224
GM4.203.640.560.360.360.283.9415.364.680.680.320.280.404.528
R G V S G a u 4.203.880.320.120.200.683.2745.064.820.240.140.140.723.907
R G V S L o g 3.963.800.160.200.160.642.7755.124.920.200.040.160.803.667
E ( 2 ) (150, 150)Lasso5.303.661.640.200.440.364.7475.644.161.480.480.280.244.786
Exponential Noise n = p SpAM4.043.960.080.040.080.883.4035.285.000.280.160.000.844.969
GM4.083.960.120.040.120.843.1774.984.920.060.080.040.884.129
R G V S G a u 4.004.000.000.000.001.002.7245.024.980.040.020.040.942.964
R G V S L o g 4.004.000.000.000.001.002.6435.004.960.040.040.040.923.918
(200, 150)Lasso3.803.800.000.200.000.804.2914.684.600.080.280.080.643.669
n > p SpAM4.004.000.000.000.001.002.9885.245.000.240.000.200.804.808
GM3.963.960.000.040.000.963.0164.984.980.000.040.000.963.878
R G V S G a u 4.004.000.000.000.001.002.8845.005.000.000.000.001.003.041
R G V S L o g 3.963.920.040.090.000.913.1134.964.960.000.040.000.963.771
(100, 150)Lasso4.923.801.120.280.320.402.3016.523.922.600.640.200.166.971
n < p SpAM4.903.801.10.240.200.561.6987.924.723.200.240.440.324.658
GM5.003.641.360.320.320.361.5515.684.321.320.400.320.283.561
R G V S G a u 4.143.940.200.050.100.850.8225.004.840.160.080.160.762.308
R G V S L o g 4.143.880.260.120.160.721.2084.964.800.160.160.120.722.339
t ( 2 ) (150, 150)Lasso5.083.721.360.240.400.361.7936.323.802.520.680.200.126.020
Student Noise n = p SpAM4.304.000.300.000.320.680.9555.445.000.440.000.280.722.739
GM4.043.800.240.160.160.681.0465.564.600.960.280.080.642.557
R G V S G a u 4.004.000.000.000.001.000.7574.984.980.000.080.000.921.716
R G V S L o g 3.923.880.040.120.040.841.1694.964.960.000.040.000.961.723
(200, 150)Lasso5.003.921.080.320.200.481.2625.444.361.080.440.280.282.976
n > p SpAM4.104.000.100.000.270.731.0605.645.000.640.000.280.722.427
GM4.003.960.040.040.040.921.0115.204.720.480.200.040.762.350
R G V S G a u 4.044.000.040.000.100.900.6815.005.000.000.000.001.001.517
R G V S L o g 4.044.000.040.000.040.960.8844.964.960.000.040.000.961.672
Table 3. The averaged performance on simulated data in Example 1.
Noise | (n, p) | Method | SIZE, TP, FP, Up, Op, Cp, ASE
(300, 500)Lasso1.981.980.001.000.000.001.98
n < p GM4.044.000.040.000.040.960.80
R G V S G a u 4.064.000.060.000.060.940.63
R G V S L o g 4.143.980.160.010.030.960.88
N ( 0 , 1 ) (500, 500)Lasso1.921.920.001.000.000.001.35
Gaussian Noise n = p GM4.064.000.060.000.060.940.78
R G V S G a u 4.024.000.020.000.020.980.59
R G V S L o g 4.044.000.040.000.020.980.74
(700, 500)Lasso1.881.880.001.000.000.001.55
n > p GM4.044.000.040.000.040.960.77
R G V S G a u 4.024.000.020.000.020.980.62
R G V S L o g 4.004.000.000.000.001.000.73
(300, 500)Lasso1.801.800.001.000.000.004.45
n < p GM4.184.000.180.000.140.862.92
R G V S G a u 4.094.000.090.000.110.892.39
R G V S L o g 4.063.880.180.120.140.741.95
X 2 ( 2 ) (500, 500)Lasso1.741.740.001.000.000.004.62
Chi-square Noise n = p GM4.144.000.140.000.140.863.01
R G V S G a u 4.084.000.080.000.060.942.22
R G V S L o g 4.043.980.060.020.060.921.82
(700, 500)Lasso1.861.860.001.000.000.004.37
n > p GM4.284.000.280.000.240.762.96
R G V S G a u 4.024.000.020.000.020.982.13
R G V S L o g 4.024.000.020.000.020.981.72
(300, 500)Lasso2.042.040.001.000.000.004.25
n < p GM3.943.870.070.130.050.823.14
R G V S G a u 4.024.000.020.000.020.982.36
R G V S L o g 3.983.940.040.060.020.921.92
E ( 2 ) (500, 500)Lasso1.941.940.001.000.000.004.34
Exponential Noise n = p GM4.124.000.120.000.100.902.35
R G V S G a u 3.993.960.030.040.030.932.37
R G V S L o g 4.024.000.020.000.020.981.71
(700, 500)Lasso1.901.900.001.000.000.004.67
n > p GM4.084.000.080.000.060.942.33
R G V S G a u 3.993.990.000.010.000.991.74
R G V S L o g 4.054.000.050.000.050.951.92
(300, 500)Lasso1.961.960.001.000.000.004.63
n < p GM3.503.460.040.240.040.722.48
R G V S G a u 4.143.940.200.060.100.840.82
R G V S L o g 4.003.980.020.020.000.980.90
t ( 2 ) (500, 500)Lasso1.761.760.000.980.000.023.83
Student Noise n = p GM4.304.000.300.000.160.841.96
R G V S G a u 4.004.000.000.000.001.000.76
R G V S L o g 4.024.000.020.000.010.990.75
(700, 500)Lasso1.961.960.000.960.000.042.46
n > p GM4.064.000.060.000.040.961.95
R G V S G a u 4.044.000.040.000.060.940.68
R G V S L o g 4.004.000.000.000.001.000.74
Table 4. Learning performance on Auto-Mpg.
Variable | CyL | DISP | HPOWER | WEIG | ACCELER | YEAR | ORIGN | RSSE(std)
Lasso-----0.5918(0.3762)
SpAM----0.2754(0.0191)
GM-0.2547(0.0313)
RGVS G a u --0.1425(0.0277)
RGVS L o g -0.1379(0.0183)
Table 5. Learning performance on Heating Load (upper block) and Cooling Load (lower block).
Variable | RC | SA | WA | RA | OH | ORIENT | GA | GAD | RSSE(std)
Lasso------0.1739(0.0801)
SpAM------0.1684(0.0045)
GM---0.1244(0.0383)
RGVS G a u ---0.0935(0.0099)
RGVS L o g ----0.1110(0.0066)
Lasso-----0.2119(0.0926)
SpAM------0.1910(0.0131)
GM---0.1515(0.0120)
RGVS G a u ---0.1339(0.0116)
RGVS L o g ---0.1368(0.0077)
