Error Bound of Mode-Based Additive Models

Due to their flexibility and interpretability, additive models are powerful tools for high-dimensional mean regression and variable selection. However, least-squares-based mean regression models are sensitive to non-Gaussian noise, and there is a need to improve their robustness. This paper considers estimation and variable selection via modal regression in reproducing kernel Hilbert spaces (RKHSs). Based on the mode-induced metric and a two-fold Lasso-type regularizer, we propose a sparse modal regression algorithm and establish its excess generalization error bound. Experimental results demonstrate the effectiveness of the proposed model.


Introduction
Regression estimation and variable selection are two important tasks for high-dimensional data mining [1]. Sparse additive models [2,3], which aim to handle both tasks simultaneously, have been extensively investigated in the mean regression setting. As a class of models between linear and nonparametric regression, these methods inherit the flexibility of nonparametric regression and the interpretability of linear regression. Typical methods include COSSO [4], SpAM [2], and its variants, such as Group SpAM [3], SAM [5], Group SAM [6], SALSA [7], MAM [8], SSAM [9], and ramp-SAM [10]. Viewed through the lens of nonparametric regression, the additive structure of the hypothesis space is crucial for overcoming the curse of dimensionality [7,11,12].
Usually, the aforementioned models are limited to estimating the conditional mean under the mean-squared error (MSE) criterion. However, under complex non-Gaussian noise (e.g., skewed or heavy-tailed noise), mean-based approaches have difficulty extracting the intrinsic trends, resulting in degraded performance. Beyond traditional mean regression, it is therefore interesting to formulate a new regression framework under the (conditional) mode-based criterion. With the help of the recent works in [13-19], this paper proposes a new robust sparse additive model, rooted in modal regression associated with an RKHS.
As an alternative to mean regression, modal regression has been investigated both in terms of statistical behavior [14,15,17] and in real-world applications [20,21]. Yao [14] proposed a modal linear regression algorithm and characterized its theoretical properties under a global mode assumption. As a natural extension of the Lasso [22], Wang et al. [15] considered regularized modal regression and established generalization bounds and variable selection consistency. Feng et al. [17] studied modal regression from a learning theory viewpoint and illustrated its relation to MCC [23,24]. Different from the above global approaches, local modal regression algorithms were formulated in [16,25] with convergence guarantees. The recent survey [26] gives a general overview of modal regression, and a more comprehensive list of references can be found there.
The proposed robust additive models are formulated under the Tikhonov regularization scheme and involve three building blocks: the mode-based metric, the RKHS-based hypothesis space, and two Lasso-type penalties. Since linear function spaces, polynomial function spaces, and Sobolev/Besov spaces are special cases of RKHSs, the kernel-based function space is more flexible than traditional spline-based spaces or other dictionary-based hypothesis spaces [2,5,27-29]. The mode-induced regression metric is robust to non-Gaussian noise according to theoretical and empirical evaluations [14,15,17]. The two-fold penalty addresses both the sparsity and the smoothness of the estimator and has shown promising performance for mean regression [2,29-31]. Therefore, different from mean-based kernel regression and additive models, the mode-based approach enjoys robustness and interpretability simultaneously due to its metric criterion and trade-off penalty. The estimator of our approach can be obtained by integrating half-quadratic (HQ) optimization [32] and second-order cone programming (SOCP) [33].
The rest of this article is organized as follows. After introducing the robust additive model in Section 2, we state its generalization error bound in Section 3. Section 4 reports the experimental evaluation, and Section 5 ends this paper with a brief conclusion.

Modal Regression
In this section, we recall the basic background on modal regression [19,34]. Let X be a compact subset of R^p associated with the input covariate vector and Y ⊆ R be the response set. In this paper, we consider the following nonparametric model:

Y = f*(X) + ε,  (1)

where X = (X_1, . . . , X_p)^T ∈ X, Y ∈ Y, and ε is a random noise. For feasibility, we denote by ρ the underlying joint distribution of (X, Y) generated by (1). Different from traditional mean regression under the noise condition E(ε|X = x) = 0 (e.g., Gaussian noise), we only require that the mode of the conditional distribution of ε equal zero at each x ∈ X. That is,

mode(ε|X = x) := arg max_t P_{ε|X}(t|X = x) = 0,  (2)

where P_{ε|X} is the conditional density of ε given X. Notice that this zero-mode condition does not require homogeneity or symmetry of the noise ε, and non-Gaussian noises (e.g., skewed noise, heavy-tailed noise) are not excluded. From (1) and (2), we further deduce that

f*(u) = arg max_t P_{Y|X}(t|X = u) = mode(Y|X = u),

where u = (u_1, . . . , u_p)^T ∈ X and P_{Y|X} denotes the density of Y conditional on X. The purpose of modal regression is then to find the target function f* from an empirical sample z := {(x_i, y_i)}_{i=1}^n drawn independently from ρ. For modal regression, the performance of a predictor f : X → R is measured by the mode-based metric

R(f) = ∫_X P_{Y|X}(f(x)|X = x) dρ_X(x),  (3)

where ρ_X is the marginal distribution of ρ with respect to the input space X.
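To make the zero-mode noise condition concrete, the following minimal Python sketch (our own illustration, not the authors' code; the target function and the Exp(1) noise are assumptions) generates data from a model of the form (1) with skewed noise whose mode is zero but whose mean is not, so the conditional mode tracks f* while the conditional mean is shifted.

```python
# Minimal sketch of model (1) with skewed, zero-mode noise: Exp(1) noise has
# conditional mode 0 but mean 1, so the conditional mode of Y recovers f*(x)
# while the conditional mean does not.
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 3
X = rng.uniform(0.0, 1.0, size=(n, p))          # covariates in [0, 1]^p
f_star = lambda X: np.sin(2 * np.pi * X[:, 0])  # hypothetical target function
eps = rng.exponential(scale=1.0, size=n)        # skewed noise with mode(eps) = 0
Y = f_star(X) + eps                             # model (1)

# Near x_1 ~ 0.25 we have f*(x) = 1: the sample mean of Y is biased by E[eps] = 1,
# while a histogram-based mode estimate stays close to f*(x).
idx = np.abs(X[:, 0] - 0.25) < 0.02
ys = Y[idx]
hist, edges = np.histogram(ys, bins=30)
mode_est = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
print(f"conditional mean ~ {ys.mean():.2f}, conditional mode ~ {mode_est:.2f}, f* = 1.00")
```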
Although the target function f* is the maximizer of R(f) over all measurable functions, it cannot be estimated directly by maximizing (3) due to the unknown P_{Y|X} and ρ_X. Fortunately, some indirect density-estimation-based strategies were proposed in [14,15,17]. As shown in Theorem 5 of [17], R(f) equals the density of the random variable E_f := Y − f(X) evaluated at zero, i.e., R(f) = P_{E_f}(0). Therefore, we can find an approximation of f* by maximizing the empirical version of P_{E_f}(0) with the help of kernel density estimation (KDE). Let K_σ : R × R → R_+ be a kernel with bandwidth σ and representing function φ, i.e., K_σ(u, v) = φ((u − v)/σ). Typical kernels used in KDE include the Gaussian kernel, the Epanechnikov kernel, the logistic kernel, and the sigmoid kernel. The KDE-based estimator of P_{E_f}(0) is defined as

R̂_σ(f) = (1/(nσ)) Σ_{i=1}^n φ((y_i − f(x_i))/σ).

Learning models for modal regression are usually formulated by Tikhonov regularization schemes associated with the empirical metric R̂_σ(f); see, e.g., [15,35]. Naturally, the data-free modal regression metric corresponding to R̂_σ(f) can be defined as

R_σ(f) = (1/σ) E[φ((Y − f(X))/σ)].

In theory, the learning performance of an estimator f : X → R can be evaluated in terms of the excess risk R(f*) − R(f) (see [17]).
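The KDE-based estimator R̂_σ(f) is straightforward to compute. Below is a minimal Python sketch (our own notation, assuming the Gaussian representing function): it evaluates (1/(nσ)) Σ_i φ((y_i − f(x_i))/σ) from a vector of residuals.

```python
# Minimal sketch of the KDE-based empirical modal metric with a Gaussian
# representing function phi: it estimates the density of Y - f(X) at zero.
import numpy as np

def gaussian_phi(u):
    """Gaussian representing function used in the kernel density estimator."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def empirical_modal_metric(residuals, sigma):
    """KDE-based estimate of R_hat_sigma(f) from the residuals y_i - f(x_i)."""
    return np.mean(gaussian_phi(residuals / sigma)) / sigma

# Usage: residuals concentrated near zero score higher than systematically
# shifted residuals, so maximizing R_hat_sigma favors the conditional mode.
rng = np.random.default_rng(1)
good = rng.normal(0.0, 0.1, size=200)   # small residuals
bad = rng.normal(1.0, 0.1, size=200)    # shifted residuals from a poor predictor
print(empirical_modal_metric(good, sigma=0.5), empirical_modal_metric(bad, sigma=0.5))
```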

Remark 1.
As illustrated in [17], when taking K_σ as a Gaussian kernel, modal regression maximizing R_σ(f) is consistent with learning under the maximum correntropy criterion (MCC). By employing different kernels, we can obtain a rich family of evaluation metrics for more robust estimation.

Mode-Based Sparse Additive Models
The additive model is formulated as

Y = Σ_{j=1}^p f*_j(X_j) + ε,  (4)

where X_j ∈ X_j (j = 1, 2, . . . , p), Y ∈ Y, and the f*_j are unknown component functions. By employing nonlinear hypothesis spaces with an additive structure, the additive model provides better flexibility for regression estimation and variable selection [19]. In [28], the theoretical properties of the sparse additive model with the quantile loss function were discussed. We introduce basic notation and assumptions in a similar way.
Suppose that E f*_j(X_j) = 0 and ‖f*_j‖_{K_j} ≤ 1 for each f*_j in (4) with j ∈ S. Here, f*_j : X_j → R is an unknown univariate function in a reproducing kernel Hilbert space (RKHS) H_j := H_{K_j} associated with kernel K_j and norm ‖·‖_{K_j} [30,31], and S ⊆ {1, . . . , p} is an intrinsic subset with cardinality |S| < p. This means that each observation (x_i, y_i) is generated according to

y_i = Σ_{j∈S} f*_j(x_ij) + ε_i,  i = 1, . . . , n,

where the noise satisfies condition (2). For any given j ∈ {1, . . . , p}, denote B_r(H_j) = {g ∈ H_j : ‖g‖_{K_j} ≤ r}. The hypothesis space considered here is defined by

F := { f = Σ_{j=1}^p f_j : f_j ∈ B_1(H_j) },  (5)

which is a subset of the additive RKHS H = { f = Σ_{j=1}^p f_j : f_j ∈ H_j }. For each X_j with corresponding marginal distribution ρ_{X_j}, we write ‖f_j‖_2^2 := E f_j^2(X_j) and define the empirical norm of each f_j as

‖f_j‖_n^2 := (1/n) Σ_{i=1}^n f_j^2(x_ij).

With the help of the mode-based metric (3) and the hypothesis space (5), we formulate the mode-based sparse additive model as

f̂ = arg max_{f ∈ F} { R̂_σ(f) − λ_1 Σ_{j=1}^p ‖f_j‖_n − λ_2 Σ_{j=1}^p ‖f_j‖_{K_j} },  (6)

where (λ_1, λ_2) is a pair of positive regularization parameters. The first regularization term is sparsity-promoting [11,36], and the second one guarantees smoothness of the solution.
By the representer theorem, each component admits the expansion f_j(·) = Σ_{i=1}^n α_ij K_j(x_ij, ·) with coefficient vector α_j = (α_1j, . . . , α_nj)^T ∈ R^n, so that ‖f_j‖_n = n^{−1/2} ‖K_j α_j‖_2 and ‖f_j‖_{K_j} = (α_j^T K_j α_j)^{1/2}. The optimal coefficients with respect to (6) are then the solution of the following nonconvex optimization:

{α̂_j}_{j=1}^p = arg max_{α_1, . . . , α_p ∈ R^n} { (1/(nσ)) Σ_{i=1}^n φ((y_i − Σ_{j=1}^p K_ji^T α_j)/σ) − λ_1 Σ_{j=1}^p n^{−1/2} ‖K_j α_j‖_2 − λ_2 Σ_{j=1}^p (α_j^T K_j α_j)^{1/2} },

where K_ji = (K_j(x_1j, x_ij), . . . , K_j(x_nj, x_ij))^T ∈ R^n and K_j = (K_j(x_ij, x_lj))_{i,l=1}^n = (K_j1, . . . , K_jn) ∈ R^{n×n}.
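For concreteness, the following Python sketch (hypothetical helper names; Gaussian component kernels are an assumption, and the code only evaluates the objective rather than solving the nonconvex problem) shows how the objective in (6) can be computed in the coefficient form above, using ‖f_j‖_n = n^{−1/2}‖K_j α_j‖_2 and ‖f_j‖_{K_j} = (α_j^T K_j α_j)^{1/2}.

```python
# Minimal sketch: evaluate the penalized modal objective (6) in coefficient form.
import numpy as np

def gram_1d(xj, bandwidth=0.2):
    """Gaussian Gram matrix K_j for a single covariate column xj (an assumption)."""
    d = xj[:, None] - xj[None, :]
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def objective(alphas, grams, y, sigma, lam1, lam2):
    """Penalized empirical modal metric of (6), to be maximized over the alphas."""
    n = y.shape[0]
    fitted = sum(K @ a for K, a in zip(grams, alphas))          # sum_j K_j alpha_j
    residuals = y - fitted
    modal_term = np.mean(np.exp(-0.5 * (residuals / sigma) ** 2)) / (
        sigma * np.sqrt(2.0 * np.pi))                           # R_hat_sigma(f)
    sparsity = sum(np.linalg.norm(K @ a) / np.sqrt(n) for K, a in zip(grams, alphas))
    smoothness = sum(np.sqrt(max(a @ K @ a, 0.0)) for K, a in zip(grams, alphas))
    return modal_term - lam1 * sparsity - lam2 * smoothness

# Usage on random data: one coefficient vector per covariate.
rng = np.random.default_rng(2)
n, p = 60, 5
X = rng.uniform(size=(n, p))
y = np.sin(2 * np.pi * X[:, 0]) + rng.exponential(1.0, n)
grams = [gram_1d(X[:, j]) for j in range(p)]
alphas = [rng.normal(scale=0.01, size=n) for _ in range(p)]
print(objective(alphas, grams, y, sigma=0.5, lam1=0.1, lam2=0.01))
```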

Remark 2.
There are various combinations of sparsity and smoothness regularization for additive models [2,3,29-31]. The regularization in this paper adopts a two-fold group Lasso scheme, which was employed in [28] for quantile regression; it also differs from the coefficient-based regularized modal regression in [19].

Remark 3.
From a computational viewpoint, the proposed algorithm (6) can be transformed into a regularized least-squares regression problem by HQ optimization [32]. The transformed problem can then be tackled efficiently by SOCP [33].
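As a rough illustration of the HQ idea (a simplified, unpenalized linear variant, not the authors' full HQ/SOCP solver), each iteration below fixes auxiliary weights w_i = exp(−r_i²/(2σ²)) computed from the current residuals and then solves a weighted least-squares problem, so outliers receive small weights.

```python
# Minimal sketch of half-quadratic iterations for a linear-in-features modal fit.
import numpy as np

def hq_linear_modal_fit(X, y, sigma=0.5, iters=20):
    """Alternate between HQ weight updates and weighted least squares."""
    n, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # least-squares initialization
    for _ in range(iters):
        r = y - X @ beta
        w = np.exp(-0.5 * (r / sigma) ** 2)          # HQ auxiliary weights
        W = np.diag(w)
        # Each HQ step solves a quadratic surrogate (weighted least squares).
        beta = np.linalg.solve(X.T @ W @ X + 1e-8 * np.eye(d), X.T @ W @ y)
    return beta

# Usage: skewed noise pulls the mean fit upward, while the HQ/modal fit
# downweights large residuals and tracks the conditional mode.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(200), rng.uniform(size=200)])
y = 1.0 + 2.0 * X[:, 1] + rng.exponential(0.3, 200)
print(hq_linear_modal_fit(X, y))
```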

Error Analysis
This section states upper bounds for the excess quantity R(f*) − R(f̂). For ease of presentation, we only consider the special setting where H_j ≡ H_{j'} for all j, j' ∈ {1, . . . , p}, and we denote ⊕_{j=1}^p H_j by H_K with sup_x K(x, x) ≤ 1. Recall that the Mercer kernel K : X × X → R admits the following spectral expansion [38]:

K(x, x') = Σ_{ℓ≥1} b_ℓ ψ_ℓ(x) ψ_ℓ(x'),

where {b_ℓ}_{ℓ≥1} are the nonincreasing eigenvalues and {ψ_ℓ}_{ℓ≥1} are the corresponding orthonormal eigenfunctions. To evaluate the complexity of H_K in terms of the decay rate of the eigenvalues {b_ℓ}_{ℓ≥1} [27,28], we adopt Assumption 1 in [28] as the basis of our analysis.
As illustrated in [27,28], the requirement s < 1 is a weak condition since Σ_ℓ b_ℓ = E K(x, x) ≤ 1. In particular, the decay b_ℓ ≍ ℓ^{−2h} holds for the Sobolev space of smoothness h. To describe the hypothesis space in the RKHS, we also adopt Assumption 2 in [28].

Remark 4.
To understand the statistical performance of the proposed estimator without any "correlatedness" conditions on the covariates, Rademacher complexity [39] was used to measure functional complexity in [28]. Our analysis draws on that approach.
In general, Assumption 2 is stronger than Assumption 1 and is satisfied when the RKHS is continuously embeddable in a Sobolev space. For uniformly bounded {ψ_ℓ}_{ℓ≥1}, this sup-norm condition is consistent with Assumption 1.
For any given independent input variables {x_i}_{i=1}^n ⊂ X, define the Rademacher complexity

R_n(H_K) = E_ε sup_{g ∈ B_1(H_K)} (1/n) |Σ_{i=1}^n ε_i g(x_i)|,

where {ε_i}_{i=1}^n is an i.i.d. sequence of Rademacher variables taking the values ±1 with probability 1/2 each. As shown in [40], R_n(H_K) can be bounded in terms of the eigenvalues {b_ℓ}_{ℓ≥1}. Moreover, based on Assumption 1, we define the complexity rate γ_n used in the error bound below. The main idea of our error analysis is to first establish a probabilistic result for a suitably defined event and then investigate the behavior of f̂ in (6) conditional on that event.
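For the unit ball B_1(H_K), the supremum inside the Rademacher complexity has the closed form n^{−1}(ε^T K ε)^{1/2} by the reproducing property, which makes a Monte Carlo estimate easy to compute. The Python sketch below (a standard identity, not taken from the paper; the Gaussian Gram matrix is an assumption) illustrates this.

```python
# Minimal sketch: Monte Carlo estimate of the empirical Rademacher complexity of
# the unit ball B_1(H_K), using sup_{||g||_K<=1} |sum_i eps_i g(x_i)| = sqrt(eps' K eps).
import numpy as np

def empirical_rademacher(K, num_draws=2000, rng=None):
    """Average n^{-1} * sqrt(eps' K eps) over i.i.d. Rademacher draws eps."""
    rng = rng or np.random.default_rng(0)
    n = K.shape[0]
    vals = []
    for _ in range(num_draws):
        eps = rng.choice([-1.0, 1.0], size=n)     # Rademacher variables
        vals.append(np.sqrt(max(eps @ K @ eps, 0.0)) / n)
    return float(np.mean(vals))

# Usage with a Gaussian Gram matrix on uniform inputs (so sup_x K(x, x) <= 1 holds).
rng = np.random.default_rng(4)
x = rng.uniform(size=100)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)
print(empirical_rademacher(K))
```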

Remark 5.
To analyze the behavior of the regularized estimator conditional on this event, several basic facts about empirical processes were introduced in [28]. Our analysis can be cast into that framework, and we introduce the relevant lemmas from [28] as stepping stones.

Lemma 1. Let Assumptions 1 and 2 hold. If log p/√n ≤ 1, it holds that:

The following lemma (see also Theorem 4 in [41]) demonstrates the relationship between the empirical norm ‖·‖_n and the norm ‖·‖_2 for functions in H_K.

Lemma 2. For A ≥ 1 and any given p̄ ≥ p with log p̄ ≥ 2 log log n, there exists a constant c such that the empirical norm ‖·‖_n and the norm ‖·‖_2 are related as stated in Theorem 4 of [41].

Lemma 3. Let {z_i}_{i=1}^n ⊂ Z be independent random variables, and let Γ be a class of real-valued functions on Z satisfying suitable boundedness conditions with positive constants η_n and ι_n; then the associated empirical process concentrates as stated in [28].

For any given ∆_− and ∆_+, define the function class F(∆_−, ∆_+) as in [28]. Lemma 4 then bounds the empirical process uniformly over F(∆_−, ∆_+) in terms of the exponent 1 + α and a positive constant c_*.
The proof of Lemma 4 is derived from the proof of Proposition 1 in [28] for quantile regression. We now state our main result on the error bound.

Theorem 1. Let the regularization parameters of f̂ defined in (6) be λ_1 = √ξ γ_n and λ_2 = ξ γ_n^2. Under Assumptions 1 and 2, for any p̄ ≥ p such that log p ≤ √n and log p̄ ≥ 2 log log n, there exists a constant A ≥ 2 such that, with probability at least 1 − 2p̄^{−A},

R(f*) − R(f̂) = O(n^{−1/(4+4α)}).

Proof (sketch). We first verify that f̂ ∈ F(∆_−, ∆_+) with ∆_− ≤ ep̄ and ∆_+ ≤ ep̄. With the choices λ_2 = λ_1^2 = ξγ_n^2, this follows from the fact that ‖f_j‖_n ≤ ‖f_j‖_{K_j} ≤ 1 for any f_j ∈ B_1(H_{K_j}). According to Lemma 4 and (11), the stated bound holds with probability at least 1 − 2p̄^{−A}. Notice that log p̄ ≥ 2 log log n implies e^{−p̄} ≤ n^{−2} ≤ γ_n. Combining this with Theorem 9 in [17] and setting σ = (‖φ‖_∞ η(t_0) √ξ γ_n)^{1/4}, we obtain the desired result.
The proof of Theorem 1 is inspired by that of Theorem 1 in [28]; see [28] for more details. According to Theorem 1, we conclude that the mode-based SpAM achieves a learning rate with polynomial decay O(n^{−1/(4+4α)}), since α ∈ [0, 1] and A, p̄ are positive constants.

Experimental Evaluation
To demonstrate the efficiency of our method, in this section, we evaluate the proposed model on synthetic datasets. The data in R^p with dimension p = 5 and p = 10 were generated randomly according to the uniform distribution on the interval [0, 1]. We then computed the MSE of our estimator f̂. Figures 1-3 depict the MSE of f̂ for the parameter pairs (λ_1, λ_2) = (0, 1), (1, 0), and (1, 1), respectively, as the number of samples n varies from 50/60 to 80/90. The models were implemented with YALMIP [43] in the MATLAB environment, calling fmincon to solve the optimization problem. From the figures, we observe that the MSE tends to decrease as the number of samples n increases under all three parameter settings, which verifies that our method is effective for the regression of high-dimensional data.
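A minimal Python sketch of this evaluation loop is given below (the target function, the noise, and the placeholder median predictor are assumptions, since the text only specifies uniform covariates on [0, 1] and the sample-size range); in practice, the modal-regression solver would replace the placeholder estimator.

```python
# Minimal sketch of the synthetic evaluation loop: uniform covariates on [0, 1]^p
# with p in {5, 10}, growing sample sizes, and MSE tracking for each setting.
import numpy as np

def make_data(n, p, rng):
    """Covariates drawn uniformly on [0, 1]^p, as in the experimental section."""
    X = rng.uniform(0.0, 1.0, size=(n, p))
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.exponential(0.5, n)
    return X, y

rng = np.random.default_rng(5)
for p in (5, 10):
    for n in (50, 60, 70, 80, 90):
        X, y = make_data(n, p, rng)
        # Placeholder estimator: replace with the modal-regression solver;
        # the constant median predictor only illustrates the evaluation loop.
        f_hat = np.full(n, np.median(y))
        mse = float(np.mean((y - f_hat) ** 2))
        print(p, n, round(mse, 3))
```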

Conclusions
In this work, we proposed a mode-based sparse additive model and established its generalization error bound. The theoretical results extend the previous mean-based analysis to the mode-based approach. We showed that the mode-based SpAM achieves a learning rate with polynomial decay O(n^{−1/(4+4α)}), which is comparable to the previous result O(n^{−1/7}) in [15]. In the future, it will be important to further explore the variable selection consistency of the proposed model.

Data Availability Statement:
The synthetic data generation procedure for the simulation experiments is described in the experimental section.