Online Gradient Descent for Kernel-Based Maximum Correntropy Criterion

In the framework of statistical learning, we study the online gradient descent algorithm generated by correntropy-induced losses in reproducing kernel Hilbert spaces (RKHS). As a generalized correlation measure, correntropy has been widely applied in practice owing to its prominent robustness. Although the online gradient descent method is an efficient way to implement the maximum correntropy criterion (MCC) in non-parametric estimation, no consistency analysis or rigorous error bound has been available. We provide a theoretical understanding of the online algorithm for MCC and show that, with a suitably chosen scaling parameter, its convergence rate can be min–max optimal (up to a logarithmic factor) in regression analysis. Our results show that the scaling parameter plays an essential role in both robustness and consistency.


Introduction
Regression analysis is an important problem in many fields of science. The traditional least squares method may be the most used algorithm for regression in practice. However, it relies only on the mean squared error and belongs to second-order statistics, whose optimality depends heavily on the assumption of Gaussian noise. Thus, it usually performs poorly when the noise is not normally distributed. Alternative approaches have been proposed to deal with outliers or heavy-tailed distributions. A generalized correlation function named correntropy [1] was introduced as a substitute for the least squares loss, and the maximum correntropy criterion (MCC) [2][3][4][5] is used to improve robustness under non-Gaussian and heavy-tailed error distributions. Recently, MCC has been successfully applied in many real applications, e.g., wind power forecasting and pattern recognition [6,7].
In the standard framework of statistical learning, let X be an explanatory variable taking values in a compact metric space (X, d) ⊂ R^n, and let Y be a real response variable with Y ∈ Y ⊂ R. Here we investigate the application of MCC in the regression model

Y = f_ρ(X) + ε,   (1)

where ε is the noise and f_ρ(x) is the regression function, defined as the conditional mean E(Y|X = x) at each x ∈ X. The purpose of regression is to estimate the unknown target function f_ρ from the sample z = {z_i = (x_i, y_i)}_{i=1}^T, drawn independently from the underlying unknown probability distribution ρ on Z := X × Y. For a hypothesis function f : X → Y and a scaling parameter σ > 0, the correntropy between f(X) and Y is defined by

V_σ(f) := E[G((f(X) − Y)² / (2σ²))],

where G(u) := exp{−u}, u ∈ R. For the given sample z, the empirical form of V_σ is

V̂_σ(f) := (1/T) Σ_{i=1}^T G((f(x_i) − y_i)² / (2σ²)).

When applied to regression problems, MCC maximizes the empirical correntropy V̂_σ over a certain hypothesis space H, that is, f_z := arg max_{f∈H} V̂_σ(f). MCC in regression problems has shown its efficiency in cases where the noise is non-Gaussian or contains large outliers, see [8][9][10]. It has also drawn much attention in the signal processing, machine learning and optimization communities [2,5,[11][12][13][14]].
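The empirical correntropy V̂_σ above is simple to evaluate numerically. The following sketch is only a restatement of the definition with G(u) = exp{−u}; the sample values are illustrative.

```python
import numpy as np

def empirical_correntropy(f_vals, y, sigma):
    """Empirical correntropy: the average of G((f(x_i) - y_i)^2 / (2 sigma^2)),
    with G(u) = exp(-u)."""
    u = (np.asarray(f_vals) - np.asarray(y)) ** 2 / (2.0 * sigma ** 2)
    return float(np.mean(np.exp(-u)))

# A perfect fit attains the maximum value 1; any misfit lowers the correntropy.
y = np.array([0.5, -1.0, 2.0])
print(empirical_correntropy(y, y, sigma=1.0))       # 1.0
print(empirical_correntropy(np.zeros(3), y, 1.0))   # strictly between 0 and 1
```

Since G is bounded by 1 and decays exponentially in the squared residual, a single large residual contributes almost nothing, which is the source of MCC's robustness.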
Let K : X × X → R be a Mercer kernel, i.e., a continuous, symmetric and positive semi-definite function. Here K is said to be positive semi-definite if, for any finite set {u_1, · · · , u_m} ⊂ X and any m ∈ N, the matrix (K(u_i, u_j))_{i,j=1}^m is positive semi-definite. The RKHS (H_K, ‖·‖_K) associated with the Mercer kernel K is defined as the completion of the linear span of the function set {K_x := K(x, ·) : x ∈ X}. It has the reproducing property

f(x) = ⟨f, K_x⟩_K for any f ∈ H_K and x ∈ X.   (2)

Since X is compact, the RKHS H_K is contained in C(X), the space of continuous functions on X with the norm ‖f‖_∞ := sup_{x∈X} |f(x)|. For instance, when X ⊂ R^n and α > n/2, the Sobolev space H^α(X) is an RKHS. For more families of RKHS in statistical learning, one can refer to [15]. Denote κ := sup_{x∈X} √K(x, x); then, by the reproducing property (2), there holds

‖f‖_∞ ≤ κ‖f‖_K for any f ∈ H_K.   (3)

Denote by ℓ_σ : R × R → R the correntropy-induced regression loss, given by

ℓ_σ(t, y) := σ²(1 − G((t − y)² / (2σ²))).

Associated with this regression loss ℓ_σ and the RKHS H_K, MCC for the regression model (1) in the context of learning theory is reformulated as

f_z := arg min_{f∈H_K} (1/T) Σ_{i=1}^T ℓ_σ(f(x_i), y_i).   (4)

Notice that ℓ_σ is not convex, so MCC algorithms are usually implemented by various gradient descent methods [14,16,17]. In this paper, we use the following online gradient descent method to solve the optimization scheme (4), since it is scalable to large datasets and applicable to situations where the samples arrive sequentially.
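Under the normalization ℓ_σ(t, y) = σ²(1 − exp(−(t − y)²/(2σ²))) assumed here (maximizing V̂_σ is equivalent to minimizing this loss up to an affine rescaling), a short numerical check confirms the two regimes of ℓ_σ: least-squares-like for small residuals and bounded for large ones.

```python
import numpy as np

def corr_loss(t, y, sigma):
    """Correntropy-induced loss under the assumed normalization:
    sigma^2 * (1 - exp(-(t - y)^2 / (2 sigma^2)))."""
    return sigma ** 2 * (1.0 - np.exp(-(t - y) ** 2 / (2.0 * sigma ** 2)))

sigma = 2.0
# Small residual: close to the least squares loss (t - y)^2 / 2.
print(corr_loss(0.01, 0.0, sigma))   # ~ 0.01^2 / 2 = 5e-5
# Large residual: the loss saturates near sigma^2 instead of growing quadratically.
print(corr_loss(100.0, 0.0, sigma))  # ~ sigma^2 = 4.0
```

The boundedness of ℓ_σ by σ² is exactly what caps the influence of gross outliers on the empirical objective.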
Given the sample z, the online gradient descent method for MCC is defined by f_1 = 0 and

f_{t+1} = f_t − η ℓ′_σ(f_t(x_t), y_t) K_{x_t},   t = 1, . . . , T,   (5)

where η > 0 is the step size and ℓ′_σ denotes the derivative of ℓ_σ with respect to its first variable.
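Iteration (5) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the derivative ℓ′_σ(t, y) = (t − y) exp(−(t − y)²/(2σ²)) induced by G, and uses an illustrative Gaussian kernel, target function, and parameter values.

```python
import numpy as np

def mcc_online(xs, ys, kernel, eta, sigma):
    """Online gradient descent (5) for kernel MCC, starting from f_1 = 0.

    The iterate stays in the span of {K_{x_1}, ..., K_{x_t}}, so it is stored
    in dual form f_t = sum_i a_i K(x_i, .); each step only appends one term
    -eta * l'_sigma(f_t(x_t), y_t) * K_{x_t}.
    """
    coeffs, centers = [], []
    for x, y in zip(xs, ys):
        # Evaluate f_t(x_t) via the reproducing property.
        f_x = sum(a * kernel(c, x) for a, c in zip(coeffs, centers))
        r = f_x - y
        # Assumed derivative of l_sigma in its first variable:
        # (t - y) * exp(-(t - y)^2 / (2 sigma^2)), i.e., the least squares
        # gradient damped for large residuals.
        coeffs.append(-eta * r * np.exp(-r ** 2 / (2.0 * sigma ** 2)))
        centers.append(x)
    return lambda u: sum(a * kernel(c, u) for a, c in zip(coeffs, centers))

# Usage: Gaussian kernel, noisy samples of a smooth target on [-1, 1].
rng = np.random.default_rng(0)
kernel = lambda u, v: np.exp(-(u - v) ** 2 / 0.5)
xs = rng.uniform(-1.0, 1.0, 400)
ys = np.sin(np.pi * xs) + 0.1 * rng.standard_normal(400)
f_hat = mcc_online(xs, ys, kernel, eta=0.5, sigma=2.0)
```

One pass over the T samples costs O(T²) kernel evaluations in this naive dual form; the per-step cost is what makes the method attractive for sequentially arriving data.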
In the literature, most MCC algorithms have been implemented for linear models and cannot be applied to the analysis of data with nonlinear structures. Kernel methods provide efficient non-parametric learning algorithms for dealing with nonlinear features, so RKHS are used in this work as hypothesis spaces in the design of learning algorithms.
Online algorithms for MCC have been used in practical applications for more than a decade, but there is still a lack of theoretical guarantees or rigorous analysis of their asymptotic convergence. Because the optimization problem arising from MCC is not convex, the global convergence of the online algorithm (5) for MCC is not unconditionally guaranteed, which also makes the theoretical analysis of MCC essentially difficult. In fact, vast numerical studies show that MCC can yield robust estimators while keeping convenient convergence properties. Thus, our goal is to fill the gap between the theoretical analysis and the optimization process, so that the output of the online algorithm (5) can be shown to converge to a global minimizer, which existing work cannot ensure. To this end, we study how well the iterate f_{T+1} generated by (5) after T iterations approximates the regression function f_ρ. We derive an explicit error rate for (5) with a suitable choice of step sizes, which is competitive with existing rates in regression analysis. We also show that the scaling parameter σ plays an important role in providing both robustness and a fast convergence rate.

Preliminaries and Main Results
We begin with some preliminaries and notations. Throughout the paper, we assume that the unknown distribution ρ on Z = X × Y can be decomposed into the marginal distribution ρ_X on X and the conditional distribution ρ(·|x) at each x ∈ X. We also require that |Y| ≤ M almost surely for some M > 1. In regression analysis, the approximation power of f_{T+1} generated by (5) is usually measured by the mean squared error in L²_{ρX}, i.e., ‖f_{T+1} − f_ρ‖²_ρ, where ‖f‖_ρ := (∫_X |f(x)|² dρ_X)^{1/2}. To present our main result on the error bound for f_{T+1} − f_ρ, we now state the assumption on the target function f_ρ. Define the integral operator L_K : L²_{ρX} → L²_{ρX} associated with the kernel K by

L_K f := ∫_X f(x) K_x dρ_X(x).   (6)

By the reproducing property (2) of H_K, for any f ∈ H_K it can be expressed as L_K f = ∫_X ⟨f, K_x⟩_K K_x dρ_X(x). Since K is a Mercer kernel, L_K is compact and positive. Denote by L^r_K the r-th power of L_K; it is well defined for any r > 0 by the spectral theorem. Let {λ_i}_{i≥1} be the eigenvalues of L_K, arranged in decreasing order; the corresponding eigenfunctions {φ_i}_{i≥1} form an orthonormal basis of the L²_{ρX} space. Hence, the regularity space L^r_K(L²_{ρX}) is well defined for any r > 0, and there holds

‖L^{1/2}_K g‖_K ≤ ‖g‖_ρ for any g ∈ L²_{ρX}.   (7)

Throughout the paper, the following regularity assumption holds for f_ρ:

f_ρ = L^r_K g for some g ∈ L²_{ρX} and r > 0.   (8)

This assumption is called the source condition [19] in inverse problems, and it characterizes the smoothness of the target function f_ρ; obviously, the larger the parameter r, the higher the regularity of f_ρ. The general source conditions considered in inverse problems usually take the form

f_ρ = ψ(L_K) h,   (9)

where ψ is non-decreasing with ψ(0) = 0, called the index function. It is clear that, when r > 1/2, the above assumption (8) is a special case of (9) with ψ(L_K) = L^{r−1/2}_K and h = L^{1/2}_K g. It should be pointed out that our analysis in this work can also be applied to more general cases by taking the source conditions (9).
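In spectral form, a standard way to read the definitions above (a sketch consistent with the eigenpairs {λ_i, φ_i}, not quoted verbatim from the paper) is

```latex
L_K f = \sum_{i\ge 1} \lambda_i \,\langle f,\phi_i\rangle_\rho \,\phi_i ,
\qquad
L_K^{\,r} f = \sum_{i\ge 1} \lambda_i^{\,r} \,\langle f,\phi_i\rangle_\rho \,\phi_i ,
```

so the source condition f_ρ = L^r_K g with g ∈ L²_{ρX} amounts to the coefficient decay Σ_{i≥1} λ_i^{−2r} ⟨f_ρ, φ_i⟩²_ρ < ∞, which becomes more restrictive as r grows.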
We are now in a position to state our convergence rates for (5) in the L²_{ρX}-space as well as in H_K, obtained by choosing the step size η := η(T). For brevity, let κ = 1 without loss of generality, and denote the expectation E_{z_1,···,z_t} by E_t for each t ∈ N.
Theorem 1. Let {f_t}_{t=1}^{T+1} be generated by (5). Suppose that assumption (8) holds for some r > 0. Take η = T^{−2r/(2r+1)} and σ ≥ T^{(2r+5)/(4(2r+1))}. Then

E_T ‖f_{T+1} − f_ρ‖²_ρ ≤ C T^{−2r/(2r+1)} log T,   (10)

and, if r > 1/2,

E_T ‖f_{T+1} − f_ρ‖²_K ≤ C′ T^{−(2r−1)/(2r+1)} log T,   (11)

where the constants C, C′ are independent of T and σ and will be given in the proof.

Remark 1.
Besides the error ‖f_{T+1} − f_ρ‖_ρ, the error bound (11) in the H_K-norm is also given when r > 1/2, i.e., when f_ρ ∈ H_K. By (3), it leads to the pointwise convergence of f_{T+1} to f_ρ, since for each u ∈ X,

|f_{T+1}(u) − f_ρ(u)| ≤ ‖f_{T+1} − f_ρ‖_∞ ≤ κ‖f_{T+1} − f_ρ‖_K.

Compared with the global error ‖f_{T+1} − f_ρ‖_ρ, the error rate in H_K characterizes the local performance of (5) and is much stronger. Furthermore [18], when the kernel K lies in C^α(X × X) for some α > 0, its associated RKHS H_K can be embedded into C^{α/2}(X), the space of functions whose partial derivatives up to order α/2 are continuous, with ‖f‖_{C^{α/2}(X)} = Σ_{|s|≤α/2} ‖D^s f‖_∞. So, convergence in H_K implies that f_{T+1} converges to f_ρ in C^{α/2}, which ensures the convergence of the derivatives of f_{T+1} to those of f_ρ.

Remark 2.
It has been proved in [20] that the min–max optimal rate for regression problems is of order O(T^{−2r/(2r+s)}) when there exist constants C_s > 0 and 0 < s ≤ 1 such that the following effective dimension condition holds:

Trace((L_K + λI)^{−1} L_K) ≤ C_s λ^{−s} for any λ > 0,

where Trace(·) denotes the trace of the operator. This condition measures the complexity [15,20,21] of H_K with respect to the marginal distribution ρ_X. It is always satisfied with s = 1 by taking the constant C_s = Trace(L_K). Hence, the min–max optimal rate for capacity-independent cases is of order O(T^{−2r/(2r+1)}), obtained by taking the universal parameter s = 1.
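The effective dimension is easy to estimate numerically: the eigenvalues of the normalized kernel matrix K/m approximate those of L_K when ρ_X is the empirical distribution of the points, so the trace above becomes a finite sum. A sketch with an illustrative Gaussian kernel; all values are hypothetical.

```python
import numpy as np

def effective_dimension(eigs, lam):
    """Trace((L_K + lam I)^{-1} L_K) = sum_i lambda_i / (lambda_i + lam)."""
    return float(np.sum(eigs / (eigs + lam)))

# Empirical eigenvalues: for m points, the matrix (K(x_i, x_j)) / m
# approximates the integral operator L_K.
m = 200
x = np.linspace(-1.0, 1.0, m)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5) / m
eigs = np.linalg.eigvalsh(K)
eigs = eigs[eigs > 0]  # discard numerically negative round-off eigenvalues

# The condition always holds with s = 1: N(lam) <= Trace(L_K) / lam.
lam = 0.1
print(effective_dimension(eigs, lam) <= np.sum(eigs) / lam)  # True
```

Since λ_i/(λ_i + λ) ≤ λ_i/λ, the s = 1 bound with C_s = Trace(L_K) is immediate, matching the capacity-independent case in the remark.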
When σ ≥ T^{(2r+5)/(4(2r+1))}, our convergence rate in the L²_{ρX}-norm is of order O(T^{−2r/(2r+1)} log T). Thus, it is nearly optimal in the capacity-independent sense: up to a logarithmic factor, it matches the min–max optimal rate above. We also find that the convergence rates (10) and (11) keep decreasing as the regularity parameter r increases. Hence, the online algorithm (5) does not suffer from the saturation phenomenon present in Tikhonov regularization schemes [22], where the error rate of the estimators does not improve once r is outside the range (0, 1]. This again shows the advantage of the online algorithm (5).

Remark 3.
The recent paper [2] investigated the approximation ability of the empirical scheme (4) over general hypothesis spaces H. That work shows that, with a complexity parameter 0 < β ≤ 2, an error rate depending on β holds if the scaling parameter is chosen as σ = T^{1/(2+β)}. To be fair, we do not take the capacity of H into consideration, i.e., we take β = 2. Then their order reduces to O(T^{−1/2}), which is far from capacity-independent optimality and inferior to our rates.
In the work [17], iterative regularization techniques (also called early stopping) are applied to the optimization problems associated with general robust losses, including the correntropy-induced loss ℓ_σ, where the whole sample z is presented at each iteration. Their analysis is carried out under a polynomial decay of the eigenvalues {λ_i}, i.e., there exist constants C_b > 0 and b ≥ 1 such that λ_i ≤ C_b i^{−b}. This decay is also a measurement of the complexity of H_K; see [21]. Recall that the compactness of X implies that Σ_i λ_i < ∞ and λ_i ≤ c i^{−1} for some c > 0, so their capacity-independent rate corresponds to b = 1. Comparing that rate with ours, our results in (10) are superior in the case 0 < r < 1/2. This shows in theory that the online algorithm (5) for MCC can achieve a better approximation rate when f_ρ is not in H_K.

Remark 4.
It is easy to check that the roots of the second derivative of ℓ_σ (as a function of the residual f(x) − y) are ±σ: when |f(x) − y| < σ, the loss is convex and behaves like the least squares loss; when |f(x) − y| ≥ σ, the loss becomes concave and rapidly flattens as |f(x) − y| goes to infinity. This implies that ℓ_σ satisfies the redescending property and, with a suitably chosen scaling parameter σ, can reject gross outliers while keeping prediction accuracy. In Theorem 1, we observe that σ should be large enough to guarantee good convergence, which coincides with the work in [2]; they also pointed out that a too small σ may prevent the estimator from converging to f_ρ. In a recent paper [23], correntropy with small σ is interpreted as modal regression. According to the above discussion and empirical studies [2,14,17], we conclude that the value of σ determines the learning target, and a moderate σ may be more appropriate for balancing convergence and robustness in practice.
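The sign change at ±σ can be verified numerically. Writing the loss as a function of the residual u = f(x) − y under the normalization assumed here, ℓ_σ(u) = σ²(1 − exp(−u²/(2σ²))), its second derivative is exp(−u²/(2σ²))(1 − u²/σ²):

```python
import numpy as np

def loss_dd(u, sigma):
    """Second derivative in u of sigma^2 * (1 - exp(-u^2 / (2 sigma^2))):
    exp(-u^2 / (2 sigma^2)) * (1 - u^2 / sigma^2)."""
    return np.exp(-u ** 2 / (2.0 * sigma ** 2)) * (1.0 - u ** 2 / sigma ** 2)

sigma = 1.5
print(loss_dd(0.5 * sigma, sigma) > 0)          # True: convex for |u| < sigma
print(np.isclose(loss_dd(sigma, sigma), 0.0))   # True: inflection at |u| = sigma
print(loss_dd(2.0 * sigma, sigma) < 0)          # True: concave for |u| > sigma
```

The positive factor exp(−u²/(2σ²)) never vanishes, so the convexity region is governed entirely by the sign of 1 − u²/σ².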
Based on the above remarks, we see that the convergence rate of online kernel-based MCC is comparable to that of least squares in the literature [24]. Meanwhile, MCC's redescending property produces robustness under various noise distributions, including sub-Gaussian, Student's t and Cauchy distributions. All of these show the superiority of MCC in a variety of applications, such as clustering, classification and feature selection [14]. At the end of this section, we point out that, although our work is carried out under the boundedness condition on Y, it can be extended to more general situations such as moment conditions [20].

Proofs of Main Result
In this section, we prove our main results in Theorem 1. First, we derive a uniform bound for the iteration sequence {f_t}_{t=1}^{T+1} generated by (5).
Proof. We prove (12) by induction. It is trivial that (12) holds for t = 1. Suppose that (12) holds for some t ≥ 1. Then, by the reproducing property (2) and the iteration (5), we bound ‖f_{t+1}‖_K in terms of ‖f_t‖_K; estimating each part of the resulting inequality and combining the estimates shows that (12) also holds for t + 1. Then the proof is completed.
Next, we establish a proposition which is crucial for proving the convergence rates in Theorem 1; it is closely related to the generalization error of f_t. Define the generalization error E(f) for any measurable function f : X → R by

E(f) := ∫_Z (f(x) − y)² dρ.

The regression function f_ρ that we want to learn or approximate is a minimizer of E(f), that is, f_ρ = arg min_f E(f). A simple computation yields the relation

E(f) − E(f_ρ) = ‖f − f_ρ‖²_ρ.   (13)

For brevity, set the operator π^t_k(L_K) := ∏^t_{j=k} (I − ηL_K) for k, t ∈ N, and π^t_{t+1}(L_K) := I.

Proposition 1. Let {f_t} be generated by (5). If the step size satisfies 0 < η < 1, then the bound (14) holds; furthermore, if f_ρ ∈ H_K, the bound (15) holds, where ∆_t is defined in the proof.
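The relation (13) follows from a one-line expansion: inserting ±f_ρ(x) and using the conditional mean property ∫_Y (f_ρ(x) − y) dρ(y|x) = 0, which kills the cross term,

```latex
\mathcal{E}(f)
= \int_Z \bigl( f(x) - f_\rho(x) + f_\rho(x) - y \bigr)^2 \, d\rho
= \| f - f_\rho \|_\rho^2 + \mathcal{E}(f_\rho),
```

which is (13) after moving E(f_ρ) to the left-hand side.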

Proof.
Denote ∆_t and define the random variable ξ(f_t, z_t) accordingly. By (5), we have an error decomposition for any t ∈ N. Applying this equality iteratively from t = T down to t = 1 and using f_1 = 0, we get (17). It then follows from the elementary inequality ‖g_1 + g_2‖²_ρ ≤ 2‖g_1‖²_ρ + 2‖g_2‖²_ρ for any g_1, g_2 ∈ L²_{ρX} that (18) holds. To prove (14), we consider the first term on the right-hand side of (18). Observe that f_t depends only on {z_1, · · · , z_{t−1}}, not on z_t. Thus, by the fact that ∫_Y y dρ(y|x) = f_ρ(x), we obtain (20). We next consider the second term on the right-hand side of (19), which can be rewritten so that, when t < l ≤ T, the cross terms vanish by (20); obviously, the same holds for l < t ≤ T. So, with (7), we get the corresponding bound. To bound E_t‖ξ(f_t, z_t)‖²_K, we use (3) in the last inequality. Applying Lemma A1 with β = 1/2, l = t + 1 and k = T, and combining the above analysis, we obtain (21). Now, we estimate the last term on the right-hand side of (19); using (20) again, we obtain (22). Plugging (21) and (22) into (19), we get a bound which, together with (18), yields the desired conclusion (14). Now we turn to bounding f_{T+1} − f_ρ in the H_K-norm. By (17) again and following a procedure similar to the estimate of (14), and noticing that ‖π^T_{t+1}(L_K)‖ ≤ 1, the bound (15) is obtained.
To apply the error bounds on f_{T+1} − f_ρ in Proposition 1, we need to estimate the generalization error E(f_t).
Applying (14) with T = t yields (26). Since the Gaussian G is Lipschitz continuous, we have a bound for each 1 ≤ k ≤ T, where the last inequality is derived from (3). Notice that, for 0 < η ≤ 1, there holds ‖π^t_k(L_K)‖ ≤ ∏^t_{l=k} ‖I − ηL_K‖ ≤ 1 for each 1 ≤ k ≤ t ≤ T. Then the last term on the right-hand side of (26) is bounded as in (27). For the first term 2‖π^t_1(L_K) f_ρ‖²_ρ, it is easy to see that 2‖π^t_1(L_K) f_ρ‖²_ρ ≤ 2‖f_ρ‖²_ρ. Putting the above estimates into (26) and using the relation (13) yields (28). By the restriction (24) on η and Lemma A3, we can bound the remaining term. Plugging it into (28), the proof is completed.
With these preliminaries in place, we shall prove our main results.
For the second term on the right-hand side of (14), the choice of η and T in Theorem 1 implies that the restriction (24) holds. Then we can substitute the bound (12) into (25) and obtain the corresponding estimate for t ≥ 2. Finally, we bound the last term on the right-hand side of (14), using the estimate (27) and the bound (12) on {f_t}. Based on the above analysis, the conclusion (10) is obtained by taking C = 8((r/e)^r + 1)²‖g‖²_ρ + 16((1/2e)^{1/2} + 1)² c_{M,ρ} + 32M⁶.