Article

Online Gradient Descent for Kernel-Based Maximum Correntropy Criterion

1 School of Mathematics and Statistics, South-Central University for Nationalities, Wuhan 430074, China
2 School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Entropy 2019, 21(7), 644; https://doi.org/10.3390/e21070644
Submission received: 15 May 2019 / Revised: 14 June 2019 / Accepted: 24 June 2019 / Published: 29 June 2019
(This article belongs to the Special Issue Entropy Based Inference and Optimization in Machine Learning)

Abstract
In the framework of statistical learning, we study the online gradient descent algorithm generated by correntropy-induced losses in reproducing kernel Hilbert spaces (RKHS). As a generalized correlation measure, correntropy has been widely applied in practice owing to its merits in robustness. Although the online gradient descent method is an efficient way to implement the maximum correntropy criterion (MCC) in non-parametric estimation, no consistency analysis or rigorous error bounds have been available for it. We provide a theoretical understanding of the online algorithm for MCC and show that, with a suitably chosen scaling parameter, its convergence rate can be min–max optimal (up to a logarithmic factor) in the regression analysis. Our results show that the scaling parameter plays an essential role in both robustness and consistency.

1. Introduction

Regression analysis is an important problem in many fields of science. The traditional least squares method may be the most used algorithm for regression in practice. However, it relies only on the mean squared error and belongs to second-order statistics, so its optimality depends heavily on the assumption of Gaussian noise; it usually performs poorly when the noise is not normally distributed. Alternative approaches have been proposed to deal with outliers or heavy-tailed distributions. A generalized correlation function named correntropy [1] was introduced as a substitute for the least squares loss, and the maximum correntropy criterion (MCC) [2,3,4,5] is used to improve robustness in situations of non-Gaussian and heavy-tailed error distributions. Recently, MCC has succeeded in many real applications, e.g., wind power forecasting and pattern recognition [6,7].
In the standard framework of statistical learning, let $X$ be an explanatory variable taking values in a compact metric space $(\mathcal{X}, d)$ with $\mathcal{X} \subset \mathbb{R}^n$, and let $Y$ be a real response variable with values in $\mathcal{Y} \subset \mathbb{R}$. Here we investigate the application of MCC in the following regression model
$$Y = f_\rho(X) + \epsilon, \qquad \mathbb{E}(\epsilon \mid X = x) = 0, \qquad (1)$$
where $\epsilon$ is the noise and $f_\rho(x)$ is the regression function, defined as the conditional mean $\mathbb{E}(Y \mid X = x)$ at each $x \in \mathcal{X}$. The purpose of regression is to estimate the unknown target function $f_\rho$ from the sample $\mathbf{z} = \{z_i = (x_i, y_i)\}_{i=1}^T$, drawn independently from the underlying unknown probability distribution $\rho$ on $Z := \mathcal{X} \times \mathcal{Y}$. For a hypothesis function $f: \mathcal{X} \to \mathcal{Y}$ and a scaling parameter $\sigma > 0$, the correntropy between $f(X)$ and $Y$ is defined by
$$V_\sigma(f) := \mathbb{E}\, G\!\left(-\frac{(f(X) - Y)^2}{2\sigma^2}\right),$$
where $G(u)$ is the exponential function $\exp(u)$, $u \in \mathbb{R}$. For the given sample $\mathbf{z}$, the empirical form of $V_\sigma$ is
$$\hat{V}_\sigma(f) := \frac{1}{T} \sum_{i=1}^T G\!\left(-\frac{(f(x_i) - y_i)^2}{2\sigma^2}\right).$$
When applied to regression problems, MCC maximizes the empirical correntropy $\hat{V}_\sigma$ over a certain hypothesis space $\mathcal{H}$, that is,
$$f_{\mathbf{z}, \mathcal{H}} := \arg\max_{f \in \mathcal{H}} \hat{V}_\sigma(f).$$
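As a toy numerical illustration (not from the paper; the responses and bandwidth below are invented), the empirical correntropy rewards predictors that stay close to the bulk of the data rather than chasing outliers:

```python
import numpy as np

def empirical_correntropy(f_vals, y, sigma):
    """V_hat_sigma(f) = (1/T) * sum_i exp(-(f(x_i) - y_i)^2 / (2 * sigma^2))."""
    r = np.asarray(f_vals) - np.asarray(y)
    return float(np.mean(np.exp(-r**2 / (2.0 * sigma**2))))

# hypothetical toy responses: three inliers near 0 and one gross outlier
y = np.array([0.1, -0.2, 0.0, 5.0])
c_robust = np.zeros(4)          # constant predictor near the inliers
c_mean = np.full(4, y.mean())   # least squares solution, pulled toward the outlier
v_robust = empirical_correntropy(c_robust, y, sigma=1.0)
v_mean = empirical_correntropy(c_mean, y, sigma=1.0)
print(v_robust, v_mean)  # the robust constant attains the higher correntropy
```

Maximizing $\hat{V}_\sigma$ thus prefers the predictor that fits the inliers, unlike the least squares fit.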
MCC in regression problems has shown its efficiency for cases where the noise is non-Gaussian or contains large outliers; see [8,9,10]. It has also drawn much attention in the signal processing, machine learning and optimization communities [2,5,11,12,13,14].
Let $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel, i.e., a continuous, symmetric and positive semi-definite function. Here $K$ is positive semi-definite if, for any finite set $\{u_1, \dots, u_m\} \subset \mathcal{X}$ and any $m \in \mathbb{N}$, the matrix $\big(K(u_i, u_j)\big)_{i,j=1}^m$ is positive semi-definite. The RKHS $(\mathcal{H}_K, \|\cdot\|_K)$ associated with the Mercer kernel $K$ is defined as the completion of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in \mathcal{X}\}$. It has the reproducing property
$$f(x) = \langle f, K_x \rangle_K \qquad (2)$$
for any $f \in \mathcal{H}_K$ and $x \in \mathcal{X}$. Since $\mathcal{X}$ is compact, the RKHS $\mathcal{H}_K$ is contained in $C(\mathcal{X})$, the space of continuous functions on $\mathcal{X}$ with the norm $\|f\|_\infty := \sup_{x \in \mathcal{X}} |f(x)|$. Moreover, if $\mathcal{X}$ is a Euclidean ball in $\mathbb{R}^n$, then for any $\alpha > \frac{n}{2}$ the Sobolev space $H^\alpha(\mathcal{X})$ is an RKHS. For more families of RKHS in statistical learning, one can refer to [15]. Denote $\kappa := \sup_{x \in \mathcal{X}} \sqrt{K(x, x)}$; then, by the reproducing property (2), there holds
$$\|f\|_\infty \le \kappa \|f\|_K \quad \text{for any } f \in \mathcal{H}_K. \qquad (3)$$
Denote by $\ell_\sigma: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ the correntropy-induced regression loss, given by
$$\ell_\sigma(u, v) := \sigma^2 \left(1 - G\!\left(-\frac{(u - v)^2}{2\sigma^2}\right)\right) = \sigma^2 \left(1 - \exp\!\left(-\frac{(u - v)^2}{2\sigma^2}\right)\right).$$
Associated with this regression loss $\ell_\sigma$ and the RKHS $\mathcal{H}_K$, MCC for the regression model (1) is reformulated in the context of learning theory as
$$f_{\mathbf{z}} := \arg\min_{f \in \mathcal{H}_K} \frac{1}{T} \sum_{i=1}^T \ell_\sigma(f(x_i), y_i). \qquad (4)$$
Since $\ell_\sigma$ is not convex, MCC algorithms are usually implemented by various gradient descent methods [14,16,17]. In this paper, we use the following online gradient descent method to solve the optimization scheme (4), since it is scalable to large datasets and applicable to situations where the samples arrive in sequence.
Definition 1.
Given the sample $\mathbf{z} = \{z_i = (x_i, y_i)\}_{i=1}^T$, the online gradient descent method for MCC is defined by $f_1 = 0$ and
$$f_{t+1} = f_t - \eta\, \ell'_\sigma(f_t(x_t), y_t)\, K_{x_t}, \qquad t \in \mathbb{N}, \qquad (5)$$
where $\eta > 0$ is the step size and $\ell'_\sigma$ denotes the derivative of $\ell_\sigma$ with respect to its first variable.
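A minimal sketch of iteration (5) with a Gaussian kernel, storing the iterate as a kernel expansion over the inputs seen so far; the synthetic data, kernel bandwidth, step size, and scaling parameter below are illustrative choices, not values prescribed by the paper:

```python
import numpy as np

def online_mcc(xs, ys, sigma=2.0, eta=0.2, gamma=10.0):
    """One pass of f_{t+1} = f_t - eta * l'_sigma(f_t(x_t), y_t) * K_{x_t}.

    The iterate f_t = sum_i a_i K(x_i, .) is kept as centers and coefficients;
    l'_sigma(u, v) = exp(-(u - v)^2 / (2 sigma^2)) * (u - v).
    """
    k = lambda x, xp: np.exp(-gamma * (x - xp) ** 2)  # Gaussian (Mercer) kernel
    centers, coefs = [], []
    f = lambda x: sum(a * k(c, x) for a, c in zip(coefs, centers))
    for xt, yt in zip(xs, ys):
        r = f(xt) - yt
        centers.append(xt)
        coefs.append(-eta * np.exp(-r**2 / (2 * sigma**2)) * r)
    return f

rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 400)
ys = np.sin(np.pi * xs) + 0.1 * rng.standard_normal(400)
f_hat = online_mcc(xs, ys)
grid = np.linspace(-0.9, 0.9, 50)
err = np.mean([(f_hat(x) - np.sin(np.pi * x)) ** 2 for x in grid])
print(err)  # one-pass estimate tracks sin(pi * x) far better than f = 0
```

Note that each step adds one kernel center, so a single pass costs $O(T^2)$ kernel evaluations in this naive form.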
In the literature, most MCC algorithms have been implemented for linear models and cannot be applied to analysis of data with nonlinear structures. Kernel methods provide efficient non-parametric learning algorithms for dealing with nonlinear features. So, RKHS are used in this work as hypothesis spaces in the design of learning algorithms.
An online algorithm for MCC has been used in practical applications for more than a decade, but a theoretical guarantee or rigorous analysis of its asymptotic convergence is still lacking. Because the optimization problem arising from MCC is not convex, global convergence of the online algorithm (5) for MCC is not unconditionally guaranteed, which makes the theoretical analysis essentially difficult. In fact, extensive numerical studies show that MCC can yield robust estimators while retaining convenient convergence properties. Our goal is thus to fill the gap between the theoretical analysis and the optimization process, so that the output of the online algorithm (5) converges to a global minimizer, which existing work cannot ensure. To this end, we study the ability of $f_{T+1}$, generated by (5) at the $T$-th iteration, to approximate the regression function $f_\rho$. We derive an explicit error rate for (5) with a suitable choice of step sizes, which is competitive with the known rates in regression analysis. In this work, we show that the scaling parameter $\sigma$ plays an important role in providing both robustness and a fast convergence rate.

2. Preliminaries and Main Results

We begin with some preliminaries and notation. Throughout the paper, we assume that the unknown distribution $\rho$ on $Z = \mathcal{X} \times \mathcal{Y}$ can be decomposed into the marginal distribution $\rho_X$ on $\mathcal{X}$ and the conditional distribution $\rho(\cdot \mid x)$ at each $x \in \mathcal{X}$. We also require that $|Y| \le M$ almost surely for some $M > 1$. In regression analysis, the approximation power of $f_{T+1}$ given by (5) is usually measured by the mean squared error in the $L^2_{\rho_X}$-metric $\|f_{T+1} - f_\rho\|_\rho$, where
$$\|\cdot\|_\rho = \|\cdot\|_{L^2_{\rho_X}} := \left(\int_{\mathcal{X}} |\cdot|^2 \, d\rho_X\right)^{1/2}.$$
To present our main result on the error bound for $f_{T+1} - f_\rho$, we state the assumption on the target function $f_\rho$. Define the integral operator $L_K: L^2_{\rho_X} \to L^2_{\rho_X}$ associated with the kernel $K$ by
$$L_K(f) := \int_{\mathcal{X}} f(x) K_x \, d\rho_X, \qquad f \in L^2_{\rho_X}. \qquad (6)$$
By the reproducing property (2) of $\mathcal{H}_K$, for any $f \in \mathcal{H}_K$ it can be expressed as
$$L_K(f) = \int_{\mathcal{X}} \langle f, K_x \rangle_K \, K_x \, d\rho_X.$$
Since $K$ is a Mercer kernel, $L_K$ is compact and positive. Denote by $L_K^r$ the $r$-th power of $L_K$; it is well defined for any $r > 0$ by the spectral theorem. Let $\{\lambda_i\}_{i \ge 1}$ be the eigenvalues of $L_K$, arranged in decreasing order. The corresponding eigenfunctions $\{\phi_i\}_{i \ge 1}$ form an orthonormal basis of $L^2_{\rho_X}$. Hence, the regularity space $L_K^r(L^2_{\rho_X})$ is expressed as [18]
$$L_K^r(L^2_{\rho_X}) := \left\{ f = \sum_{i=1}^\infty \lambda_i^r a_i \phi_i \; : \; \|L_K^{-r} f\|_\rho^2 = \sum_{i=1}^\infty a_i^2 < \infty \right\}.$$
It implies that for any $r_1 > r_2 > 0$, there holds $L_K^{r_1}(L^2_{\rho_X}) \subset L_K^{r_2}(L^2_{\rho_X})$. In particular, $L_K^r(L^2_{\rho_X}) \subset \mathcal{H}_K$ for any $r \ge \frac{1}{2}$, and $L_K^{1/2}(L^2_{\rho_X}) = \mathcal{H}_K$ with
$$\|f\|_K = \|L_K^{-1/2} f\|_\rho, \qquad f \in \mathcal{H}_K. \qquad (7)$$
Throughout the paper, the following regularity assumption on $f_\rho$ holds:
$$f_\rho = L_K^r(g) \quad \text{for some } r > 0 \text{ and } g \in L^2_{\rho_X}, \qquad (8)$$
so that $\|L_K^{-r} f_\rho\|_\rho = \|g\|_\rho$.
This assumption is called the source condition [19] in inverse problems, and it characterizes the smoothness of the target function $f_\rho$: the larger the parameter $r$, the higher the regularity of $f_\rho$. The general source conditions considered in inverse problems usually take the form
$$f_\rho = \psi(L_K)\, h \quad \text{for some } h \in \mathcal{H}_K, \qquad (9)$$
where $\psi$ is non-decreasing with $\psi(0) = 0$, called the index function. Clearly, when $r > \frac{1}{2}$, the assumption (8) is a special case of (9) with $\psi(L_K) = L_K^{r - 1/2}$ and $h = L_K^{1/2} g$. It should be pointed out that our analysis in this work also applies to the more general source conditions (9).
We are now in a position to state our convergence rates for (5) in the $L^2_{\rho_X}$-norm as well as in $\mathcal{H}_K$, obtained by choosing the step size $\eta := \eta(T)$. For brevity, let $\kappa = 1$ without loss of generality and denote the expectation $\mathbb{E}_{z_1, \dots, z_t}$ by $\mathbb{E}_t$ for each $t \in \mathbb{N}$.
Theorem 1.
Define $\{f_t\}_{t=1}^{T+1}$ by (5). Suppose that assumption (8) holds for some $r > 0$. Take $\eta = T^{-\frac{2r}{2r+1}}$ and $T > \big(24\,((1/2e)^{1/2} + 1)^2 \log T\big)^{\frac{2r+1}{2r}}$; then
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_\rho^2 \le C \max\left\{ T^{-\frac{2r}{2r+1}} \log T, \; T^{\frac{5}{2r+1}} \sigma^{-4} \right\} \qquad (10)$$
and, if $r > \frac{1}{2}$,
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_K^2 \le C' \max\left\{ T^{-\frac{2r-1}{2r+1}}, \; T^{\frac{5}{2r+1}} \sigma^{-4} \right\} \qquad (11)$$
where the constants $C, C'$ are independent of $T$ and $\sigma$ and will be given in the proof.
Remark 1.
Besides the error $\|f_{T+1} - f_\rho\|_\rho$, the error bound (11) in the $\mathcal{H}_K$-norm is also given when $r > \frac{1}{2}$, i.e., when $f_\rho \in \mathcal{H}_K$. By (3), it leads to the pointwise convergence of $f_{T+1}$ to $f_\rho$, since for each $u \in \mathcal{X}$, $|f_{T+1}(u) - f_\rho(u)| \le \|f_{T+1} - f_\rho\|_K$. Compared with the global error $\|f_{T+1} - f_\rho\|_\rho$, the error rate in $\mathcal{H}_K$ characterizes the local performance of (5) and is much stronger. Furthermore [18], when the kernel $K$ lies in $C^\alpha(\mathcal{X} \times \mathcal{X})$ for some $\alpha > 0$, its associated RKHS $\mathcal{H}_K$ can be embedded into $C^{\alpha/2}(\mathcal{X})$, the space of functions whose partial derivatives up to order $\alpha/2$ are continuous, with $\|f\|_{C^{\alpha/2}(\mathcal{X})} = \sum_{|s| \le \alpha/2} \|D^s f\|_\infty$. So, convergence in $\mathcal{H}_K$ implies that $f_{T+1}$ converges to $f_\rho$ in $C^{\alpha/2}$, which ensures the convergence of the derivatives of $f_{T+1}$ to those of $f_\rho$.
Remark 2.
It has been proved in [20] that the min–max optimal rate for regression problems is of order $O\big(T^{-\frac{2r}{2r+s}}\big)$ when there exist constants $C_s > 0$ and $0 < s \le 1$ such that the following effective dimension condition holds:
$$\mathrm{Trace}\big((L_K + \lambda I)^{-1} L_K\big) \le C_s \lambda^{-s} \quad \text{for any } \lambda > 0,$$
where $\mathrm{Trace}(\cdot)$ denotes the trace of the operator. This condition measures the complexity [15,20,21] of $\mathcal{H}_K$ with respect to the marginal distribution $\rho_X$. It is always satisfied with $s = 1$ by taking the constant $C_s = \mathrm{Trace}(L_K)$. Hence, the min–max optimal rate for capacity-independent cases is of order $O\big(T^{-\frac{2r}{2r+1}}\big)$, obtained by taking the universal parameter $s = 1$.
When $\sigma \ge T^{\frac{2r+5}{4(2r+1)}}$, our convergence rate in the $L^2_{\rho_X}$-norm is of order $O\big(T^{-\frac{2r}{2r+1}} \log T\big)$. Thus, it is nearly optimal in the capacity-independent sense: up to a logarithmic factor, it matches the min–max optimal rate above. We also find that the convergence rates (10) and (11) keep decreasing as the regularity parameter $r$ increases. Hence, the online algorithm (5) does not suffer from the saturation phenomenon of Tikhonov regularization schemes [22], where the error rate of the estimators does not improve once $r$ is outside the range $(0, 1]$. This again shows the advantage of the online algorithm (5).
Remark 3.
The recent paper [2] investigated the approximation ability of the empirical scheme (4) over general hypothesis spaces $\mathcal{H}$. That work shows that, with a complexity parameter $0 < \beta \le 2$, the error rate is of order $O(T^{-\frac{2}{2+\beta}})$ if the scaling parameter is $\sigma = T^{\frac{1}{2+\beta}}$. For a fair comparison, we ignore the capacity of $\mathcal{H}$ by taking $\beta = 2$. Then their rate reduces to $O(T^{-\frac{1}{2}})$, which is far from the capacity-independent optimal rate and inferior to ours.
In [17], iterative regularization techniques (also called early stopping) are used to solve the optimization problems associated with general robust losses, including the correntropy-induced loss $\ell_\sigma$, where the whole sample $\mathbf{z}$ is presented at each iteration. In their analysis, under a polynomial decay of the eigenvalues $\{\lambda_i\}$, namely that there exist constants $c_b > 0$ and $b \ge 1$ such that
$$\lambda_i \le c_b\, i^{-b}, \qquad i \ge 1,$$
the obtained rate is $O(T^{-\frac{2br}{2br+1}})$ if $r \ge \frac{1}{2}$; otherwise, it is $O(T^{-\frac{2br}{b+1}})$. This decay is also a measure of the complexity of $\mathcal{H}_K$; please refer to [21]. Recall that the compactness of $\mathcal{X}$ implies that $\sum_i \lambda_i < \infty$ and $\lambda_i \le c\, i^{-1}$ for some $c > 0$. So, their rate in the capacity-independent case is $O(T^{-\frac{2r}{2r+1}})$ if $r \ge \frac{1}{2}$; otherwise, it is $O(T^{-r})$. We can see that our results in (10) are superior in the case $0 < r < \frac{1}{2}$. This shows in theory that the online algorithm (5) for MCC can achieve a better approximation rate when $f_\rho$ is not in $\mathcal{H}_K$.
Remark 4.
It is easy to check that the roots of the second derivative of $\ell_\sigma$ are $\pm\sigma$; i.e., when $|f(x) - y| < \sigma$, the loss is convex and behaves like the least squares loss, while when $|f(x) - y| \ge \sigma$, the loss becomes concave and rapidly flattens as $|f(x) - y|$ goes to infinity. It implies that $\ell_\sigma$ satisfies the redescending property: with a suitably chosen scaling parameter $\sigma$, $\ell_\sigma$ can reject gross outliers while keeping prediction accuracy. In Theorem 1, we observe that $\sigma$ should be large enough to guarantee good convergence, which coincides with the work in [2]; they also pointed out that too small a $\sigma$ may prevent the estimator from converging to $f_\rho$. In a recent paper [23], correntropy with small $\sigma$ is interpreted as modal regression. According to the above discussion and empirical studies [2,14,17], we conclude that the value of $\sigma$ determines the learning target, and a moderate $\sigma$ may be more appropriate for balancing convergence and robustness in practice.
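The convex/concave switch at $|u| = \sigma$ and the vanishing gradient for gross outliers can be checked directly in the residual $u = f(x) - y$; the choice $\sigma = 1$ below is arbitrary:

```python
import numpy as np

sigma = 1.0
# correntropy-induced loss and its derivatives as functions of the residual u
loss   = lambda u: sigma**2 * (1 - np.exp(-u**2 / (2 * sigma**2)))
dloss  = lambda u: u * np.exp(-u**2 / (2 * sigma**2))
d2loss = lambda u: (1 - u**2 / sigma**2) * np.exp(-u**2 / (2 * sigma**2))

c_inner = d2loss(0.5)   # > 0: convex, least-squares-like for |u| < sigma
c_outer = d2loss(2.0)   # < 0: concave beyond |u| = sigma
g_out = dloss(100.0)    # ~ 0: a gross outlier contributes almost no gradient
print(c_inner, c_outer, g_out)
```

The loss itself is bounded by $\sigma^2$, so a single corrupted sample can perturb the empirical objective by at most $\sigma^2 / T$.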
Based on the above remarks, the convergence rate of online kernel-based MCC is comparable to that of least squares in the literature [24]. Meanwhile, MCC's redescending property provides robustness to various noise distributions, including sub-Gaussian, Student's t, and Cauchy distributions. This all shows the superiority of MCC in a variety of applications, such as clustering, classification and feature selection [14]. At the end of this section, we point out that although our work is carried out under the boundedness condition on $Y$, it can be extended to more general situations such as moment conditions [20].

3. Proofs of Main Result

In this section, we prove our main results in Theorem 1. First, we derive the uniform bound for the iteration sequence { f t } t = 1 T + 1 by (5).
Lemma 1.
Define $\{f_t\}_{t=1}^{T+1}$ by (5). If $0 < \eta \le 1$, then
$$\|f_t\|_K \le M \eta^{1/2} (t-1)^{1/2}, \qquad t \in \mathbb{N}. \qquad (12)$$
Proof. 
We prove (12) by induction. It is trivial that (12) holds for $t = 1$. Suppose (12) holds for some $t \ge 1$. Notice that $\ell'_\sigma(f_t(x_t), y_t) = G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big)(f_t(x_t) - y_t)$. Write (5) as $f_{t+1} = f_t - \eta H_t$, where $H_t = G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big)(f_t(x_t) - y_t) K_{x_t}$. Then, by (2),
$$\|f_{t+1}\|_K^2 = \|f_t\|_K^2 - 2\eta \langle f_t, H_t \rangle_K + \eta^2 \|H_t\|_K^2 = \|f_t\|_K^2 - 2\eta\, G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big)(f_t(x_t) - y_t) f_t(x_t) + \eta^2 \|H_t\|_K^2$$
and
$$\|H_t\|_K^2 = G\big(-\tfrac{(f_t(x_t) - y_t)^2}{\sigma^2}\big)(f_t(x_t) - y_t)^2 K(x_t, x_t) \le G\big(-\tfrac{(f_t(x_t) - y_t)^2}{\sigma^2}\big)(f_t(x_t) - y_t)^2.$$
Then, writing $G_t := G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big)$, we have
$$\|f_{t+1}\|_K^2 \le \|f_t\|_K^2 + \eta \Big[ \eta\, G_t (f_t(x_t) - y_t)^2 - 2 (f_t(x_t) - y_t) f_t(x_t) \Big] G_t.$$
For the bracketed part of the above inequality, we have
$$\eta G_t (f_t(x_t) - y_t)^2 - 2 (f_t(x_t) - y_t) f_t(x_t) = (\eta G_t - 2)(f_t(x_t) - y_t)^2 - 2 (f_t(x_t) - y_t) y_t = -(2 - \eta G_t)\left(f_t(x_t) - y_t + \frac{y_t}{2 - \eta G_t}\right)^2 + \frac{y_t^2}{2 - \eta G_t}.$$
Since $\eta \le 1$ and $G_t \le 1$, it follows that $\eta G_t - 2 < 0$ and $2 - \eta G_t \ge 1$. Recall that $|y| \le M$ for all $y \in \mathcal{Y}$; then
$$\eta G_t (f_t(x_t) - y_t)^2 - 2 (f_t(x_t) - y_t) f_t(x_t) \le \frac{y_t^2}{2 - \eta G_t} \le M^2.$$
Based on the above analysis,
$$\|f_{t+1}\|_K^2 \le \|f_t\|_K^2 + \eta M^2 G_t \le \|f_t\|_K^2 + \eta M^2 \le M^2 \eta (t-1) + \eta M^2 = M^2 \eta t.$$
Then the proof is completed. □
Next, we establish a proposition that is crucial to proving the convergence rates in Theorem 1; it is closely related to the generalization error of $f_t$. Define the generalization error $\mathcal{E}(f)$ of a measurable function $f: \mathcal{X} \to \mathbb{R}$ by
$$\mathcal{E}(f) = \int_Z (f(x) - y)^2 \, d\rho.$$
The regression function $f_\rho$ that we want to learn or approximate is a minimizer of $\mathcal{E}(f)$; that is,
$$f_\rho = \arg\min \{ \mathcal{E}(f) : f \text{ is a measurable function from } \mathcal{X} \text{ to } \mathcal{Y} \}.$$
A simple computation yields the relation, for $f: \mathcal{X} \to \mathbb{R}$,
$$\|f - f_\rho\|_\rho^2 = \mathcal{E}(f) - \mathcal{E}(f_\rho). \qquad (13)$$
For brevity, set the operators $\pi_k^t(L_K) := \prod_{j=k}^t (I - \eta L_K)$ for $k, t \in \mathbb{N}$ and $\pi_{t+1}^t(L_K) := I$.
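As a sanity check (with an invented discrete distribution), the relation $\|f - f_\rho\|_\rho^2 = \mathcal{E}(f) - \mathcal{E}(f_\rho)$ above can be verified by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.integers(0, 3, size=200_000)                 # uniform rho_X on {0, 1, 2}
f_rho = np.array([0.5, -1.0, 2.0])                    # regression function values
ys = f_rho[xs] + 0.3 * rng.standard_normal(xs.size)   # Y = f_rho(X) + eps

def gen_error(f_vals):
    """Monte Carlo estimate of E(f) = int_Z (f(x) - y)^2 d rho."""
    return float(np.mean((f_vals[xs] - ys) ** 2))

f = np.array([1.0, 0.0, 1.5])
excess = gen_error(f) - gen_error(f_rho)              # E(f) - E(f_rho)
norm_sq = float(np.mean((f - f_rho)[xs] ** 2))        # ||f - f_rho||_rho^2
print(excess, norm_sq)  # the two quantities agree up to sampling noise
```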
Proposition 1.
Define $\{f_t\}_{t=1}^{T+1}$ by (5). If the step size satisfies $0 < \eta < 1$, then we have
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_\rho^2 \le 2 \|\pi_1^T(L_K) f_\rho\|_\rho^2 + 2\eta^2 \sum_{t=1}^T \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)} \mathbb{E}_{t-1} \mathcal{E}(f_t) + 2\, \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_\rho^2; \qquad (14)$$
furthermore, if $f_\rho \in \mathcal{H}_K$,
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_K^2 \le 2 \|\pi_1^T(L_K) f_\rho\|_K^2 + 2\eta^2 \sum_{t=1}^T \mathbb{E}_{t-1} \mathcal{E}(f_t) + 2\, \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_K^2, \qquad (15)$$
where $\Delta_t$ is defined in the proof.
Proof. 
Denote
$$\Delta_t = \Big( G(0) - G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big) \Big)(f_t(x_t) - y_t) K_{x_t} = \Big( 1 - G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big) \Big)(f_t(x_t) - y_t) K_{x_t}$$
and define the random variable $\xi(f_t, z_t) := L_K(f_t - f_\rho) - (f_t(x_t) - y_t) K_{x_t}$.
By (5), we have that for any $t \in \mathbb{N}$,
$$f_{t+1} - f_\rho = f_t - f_\rho - \eta (f_t(x_t) - y_t) K_{x_t} + \eta \Delta_t = (I - \eta L_K)(f_t - f_\rho) + \eta\, \xi(f_t, z_t) + \eta \Delta_t. \qquad (16)$$
Applying the above equality iteratively from $t = T$ down to $t = 1$, we get, by $f_1 = 0$,
$$f_{T+1} - f_\rho = -\pi_1^T(L_K) f_\rho + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K)\, \xi(f_t, z_t) + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t. \qquad (17)$$
It follows from the elementary inequality $\|g_1 + g_2\|_\rho^2 \le 2\|g_1\|_\rho^2 + 2\|g_2\|_\rho^2$, valid for any $g_1, g_2 \in L^2_{\rho_X}$, that
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_\rho^2 \le 2\, \mathbb{E}_T \Big\| -\pi_1^T(L_K) f_\rho + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 + 2\, \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_\rho^2. \qquad (18)$$
To prove (14), we consider the first term on the right-hand side of (18):
$$\mathbb{E}_T \Big\| -\pi_1^T(L_K) f_\rho + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 = \|\pi_1^T(L_K) f_\rho\|_\rho^2 + \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 - 2\, \mathbb{E}_T \Big\langle \pi_1^T(L_K) f_\rho, \; \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\rangle_\rho. \qquad (19)$$
Observe that $f_t$ depends only on $\{z_1, \dots, z_{t-1}\}$ and not on $z_t$. Thus, by the fact that $\int_{\mathcal{Y}} y \, d\rho(y \mid x) = f_\rho(x)$, we have
$$\mathbb{E}_{z_t}\, \xi(f_t, z_t) = 0, \qquad t = 1, \dots, T. \qquad (20)$$
We consider the second term on the right-hand side of (19). It can be rewritten as
$$\mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 = \eta^2 \sum_{t=1}^T \sum_{l=1}^T \mathbb{E}_T \big\langle \pi_{t+1}^T(L_K) \xi(f_t, z_t), \; \pi_{l+1}^T(L_K) \xi(f_l, z_l) \big\rangle_\rho.$$
When $t < l \le T$, by (20),
$$\mathbb{E}_T \big\langle \pi_{t+1}^T(L_K) \xi(f_t, z_t), \pi_{l+1}^T(L_K) \xi(f_l, z_l) \big\rangle_\rho = \mathbb{E}_{l-1} \mathbb{E}_{z_l} \big\langle \pi_{t+1}^T(L_K) \xi(f_t, z_t), \pi_{l+1}^T(L_K) \xi(f_l, z_l) \big\rangle_\rho = \mathbb{E}_{l-1} \big\langle \pi_{t+1}^T(L_K) \xi(f_t, z_t), \pi_{l+1}^T(L_K)\, \mathbb{E}_{z_l} \xi(f_l, z_l) \big\rangle_\rho = 0.$$
Obviously, the same holds for $l < t \le T$. So, with (7), we get
$$\eta^2 \sum_{t=1}^T \sum_{l=1}^T \mathbb{E}_T \big\langle \pi_{t+1}^T(L_K) \xi(f_t, z_t), \pi_{l+1}^T(L_K) \xi(f_l, z_l) \big\rangle_\rho = \eta^2 \sum_{t=1}^T \mathbb{E}_t \big\| \pi_{t+1}^T(L_K) \xi(f_t, z_t) \big\|_\rho^2 \le \eta^2 \sum_{t=1}^T \big\| \pi_{t+1}^T(L_K) L_K^{1/2} \big\|^2 \, \mathbb{E}_t \big\| L_K^{-1/2} \xi(f_t, z_t) \big\|_\rho^2 = \eta^2 \sum_{t=1}^T \big\| \pi_{t+1}^T(L_K) L_K^{1/2} \big\|^2 \, \mathbb{E}_t \|\xi(f_t, z_t)\|_K^2.$$
To bound $\mathbb{E}_t \|\xi(f_t, z_t)\|_K^2$, note that $\xi(f_t, z_t) = \mathbb{E}_{z_t}\big[(f_t(x_t) - y_t) K_{x_t}\big] - (f_t(x_t) - y_t) K_{x_t}$, so
$$\mathbb{E}_t \|\xi(f_t, z_t)\|_K^2 \le \mathbb{E}_{t-1} \mathbb{E}_{z_t} \big\| (f_t(x_t) - y_t) K_{x_t} \big\|_K^2 \le \mathbb{E}_{t-1} \mathbb{E}_{z_t} (f_t(x_t) - y_t)^2 = \mathbb{E}_{t-1} \mathcal{E}(f_t),$$
where the last inequality is derived from (3) and $\kappa = 1$. Applying Lemma A1 with $\beta = \frac{1}{2}$, $l = t + 1$ and $k = T$, we have
$$\sum_{t=1}^T \big\| \pi_{t+1}^T(L_K) L_K^{1/2} \big\|^2 = \sum_{t=1}^{T-1} \big\| \pi_{t+1}^T(L_K) L_K^{1/2} \big\|^2 + \big\| \pi_{T+1}^T(L_K) L_K^{1/2} \big\|^2 \le \sum_{t=1}^{T-1} \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)} + 1 \le \sum_{t=1}^T \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)}.$$
Based on the above analysis, we have
$$\mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 \le \eta^2 \sum_{t=1}^T \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)} \mathbb{E}_{t-1} \mathcal{E}(f_t). \qquad (21)$$
Now we estimate the last term on the right-hand side of (19). Using (20) again, we have
$$\mathbb{E}_T \Big\langle \pi_1^T(L_K) f_\rho, \; \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\rangle_\rho = \Big\langle \pi_1^T(L_K) f_\rho, \; \eta \sum_{t=1}^T \pi_{t+1}^T(L_K)\, \mathbb{E}_{t-1} \mathbb{E}_{z_t} \xi(f_t, z_t) \Big\rangle_\rho = 0. \qquad (22)$$
Plugging (21) and (22) into (19), we get
$$\mathbb{E}_T \Big\| -\pi_1^T(L_K) f_\rho + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 \le \|\pi_1^T(L_K) f_\rho\|_\rho^2 + \eta^2 \sum_{t=1}^T \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)} \mathbb{E}_{t-1} \mathcal{E}(f_t).$$
This together with (18) yields the desired conclusion (14).
Now we turn to bounding $f_{T+1} - f_\rho$ in the $\mathcal{H}_K$-norm. By (17) again, we have
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_K^2 \le 2\, \mathbb{E}_T \Big\| -\pi_1^T(L_K) f_\rho + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_K^2 + 2\, \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_K^2.$$
Following a procedure similar to the estimate of (14), we also get
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_K^2 \le 2 \|\pi_1^T(L_K) f_\rho\|_K^2 + 2\eta^2 \sum_{t=1}^T \|\pi_{t+1}^T(L_K)\|^2 \, \mathbb{E}_{t-1} \mathcal{E}(f_t) + 2\, \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_K^2.$$
Noticing that $\|\pi_{t+1}^T(L_K)\| \le 1$, the bound (15) is obtained. □
Based on the error bounds for $f_{T+1} - f_\rho$ in Proposition 1, we next need to estimate the generalization error $\mathcal{E}(f_t)$.
Lemma 2.
Define $\{f_t\}_{t=1}^{T+1}$ by (5). If
$$0 < \eta \le \min\left\{ 1, \; \Big[ 8\,((1/2e)^{1/2} + 1)^2 (\log(et) + 1) \Big]^{-1} \right\}, \qquad (24)$$
then for $t \ge 2$,
$$\mathbb{E}_{t-1} \mathcal{E}(f_t) \le 2 \mathcal{E}(f_\rho) + 4 \|f_\rho\|_\rho^2 + 64\, \eta^2 \sigma^{-4} (t-1)^2 \sup_{1 \le k \le t-1} \max\{\|f_k\|_K, M\}^6. \qquad (25)$$
Proof. 
We prove (25) by induction. Obviously, (25) holds for $t = 2$. Suppose (25) holds up to some $t \ge 2$. Applying (14) with $T = t$, then
$$\mathbb{E}_t \|f_{t+1} - f_\rho\|_\rho^2 \le 2 \|\pi_1^t(L_K) f_\rho\|_\rho^2 + 2\eta^2 \sum_{k=1}^t \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(t - k)} \mathbb{E}_{k-1} \mathcal{E}(f_k) + 2\, \mathbb{E}_t \Big\| \eta \sum_{k=1}^t \pi_{k+1}^t(L_K) \Delta_k \Big\|_\rho^2 \le 2 \|\pi_1^t(L_K) f_\rho\|_\rho^2 + 2\eta^2 \sum_{k=1}^t \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(t - k)} \mathbb{E}_{k-1} \mathcal{E}(f_k) + 2\eta^2 \Big( \sum_{k=1}^t \big\| \pi_{k+1}^t(L_K) \Delta_k \big\|_K \Big)^2. \qquad (26)$$
Since $G$ is Lipschitz continuous with constant $1$ on $(-\infty, 0]$, we have, for each $1 \le k \le t$,
$$\|\Delta_k\|_K \le \Big| G(0) - G\big(-\tfrac{(f_k(x_k) - y_k)^2}{2\sigma^2}\big) \Big| \, |f_k(x_k) - y_k| \, \|K_{x_k}\|_K \le \frac{(f_k(x_k) - y_k)^2}{2\sigma^2}\, |f_k(x_k) - y_k| \le \frac{(\|f_k\|_\infty + M)^3}{2\sigma^2} \le \frac{(\|f_k\|_K + M)^3}{2\sigma^2}, \qquad (27)$$
where the last inequality is derived from (3). Notice that, by $0 < \eta \le 1$, there holds $\|\pi_k^t(L_K)\| \le \prod_{j=k}^t \|I - \eta L_K\| \le 1$ for each $1 \le k \le t \le T$. Then the last term on the right-hand side of (26) is bounded as
$$2\eta^2 \Big( \sum_{k=1}^t \big\| \pi_{k+1}^t(L_K) \Delta_k \big\|_K \Big)^2 \le 32\, \eta^2 \sigma^{-4} t^2 \sup_{1 \le k \le t} \max\{\|f_k\|_K, M\}^6.$$
For the first term, it is easy to see that $2\|\pi_1^t(L_K) f_\rho\|_\rho^2 \le 2 \|f_\rho\|_\rho^2$.
Putting these estimates into (26) and using the relation (13) with $f = f_{t+1}$, we have
$$\mathbb{E}_t \mathcal{E}(f_{t+1}) = \mathbb{E}_t \|f_{t+1} - f_\rho\|_\rho^2 + \mathcal{E}(f_\rho) \le \mathcal{E}(f_\rho) + 2\|f_\rho\|_\rho^2 + 32\, \eta^2 \sigma^{-4} t^2 \sup_{1 \le k \le t} \max\{\|f_k\|_K, M\}^6 + 2\eta^2 \sum_{k=1}^t \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(t - k)} \mathbb{E}_{k-1} \mathcal{E}(f_k)$$
$$\le \mathcal{E}(f_\rho) + 2\|f_\rho\|_\rho^2 + 32\, \eta^2 \sigma^{-4} t^2 \sup_{1 \le k \le t} \max\{\|f_k\|_K, M\}^6 + 2\eta^2 \sum_{k=1}^t \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(t - k)} \Big( 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64\, \eta^2 \sigma^{-4} (t-1)^2 \sup_{1 \le k \le t-1} \max\{\|f_k\|_K, M\}^6 \Big), \qquad (28)$$
where the induction hypothesis was used in the last step. By the restriction (24) on $\eta$ and Lemma A3, we know that
$$2\eta^2 \sum_{k=1}^t \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(t - k)} \le 4\eta\, ((1/2e)^{1/2} + 1)^2 (\log(et) + 1) \le \frac{1}{2}.$$
Plugging this into (28), we have
$$\mathbb{E}_t \mathcal{E}(f_{t+1}) \le \mathcal{E}(f_\rho) + 2\|f_\rho\|_\rho^2 + 32\, \eta^2 \sigma^{-4} t^2 \sup_{1 \le k \le t} \max\{\|f_k\|_K, M\}^6 + \frac{1}{2}\Big( 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64\, \eta^2 \sigma^{-4} (t-1)^2 \sup_{1 \le k \le t-1} \max\{\|f_k\|_K, M\}^6 \Big) \le 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64\, \eta^2 \sigma^{-4} t^2 \sup_{1 \le k \le t} \max\{\|f_k\|_K, M\}^6.$$
Then the proof is completed. □
With these preliminaries in place, we shall prove our main results.
Proof of Theorem 1.
We prove Theorem 1 via Proposition 1. First, we use (14) to estimate the error rate for (5) in the $L^2_{\rho_X}$-norm. For the first term on the right-hand side of (14), applying Lemma A2 with $f = f_\rho$ and $\eta = T^{-\frac{2r}{2r+1}}$, we have
$$\|\pi_1^T(L_K) f_\rho\|_\rho^2 \le 4 \big( (r/e)^r + 1 \big)^2 \|L_K^{-r} f_\rho\|_\rho^2 \, T^{-\frac{2r}{2r+1}} = 4 \big( (r/e)^r + 1 \big)^2 \|g\|_\rho^2 \, T^{-\frac{2r}{2r+1}}.$$
For the second term on the right-hand side of (14), the choice of $\eta$ and $T$ in Theorem 1 implies that the restriction (24) holds. Then we can put the bound (12) into (25) and get, for $t \ge 2$,
$$\mathbb{E}_{t-1} \mathcal{E}(f_t) \le 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64 M^6 \sigma^{-4} \eta^5 (t-1)^5 \le 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64 M^6 \big( 1 + \sigma^{-4} \eta^5 T^5 \big) \le \big( 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64 M^6 \big) \big( 1 + \sigma^{-4} T^{\frac{5}{2r+1}} \big) =: c_{M,\rho} \big( 1 + \sigma^{-4} T^{\frac{5}{2r+1}} \big).$$
This together with Lemma A3 yields
$$\eta^2 \sum_{t=1}^T \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)} \mathbb{E}_{t-1} \mathcal{E}(f_t) \le 2((1/2e)^{1/2} + 1)^2 c_{M,\rho}\, \eta\, (\log(eT) + 1) \big( 1 + \sigma^{-4} T^{\frac{5}{2r+1}} \big) \le 4\, ((1/2e)^{1/2} + 1)^2 c_{M,\rho} \log(T) \big( T^{-\frac{2r}{2r+1}} + \sigma^{-4} T^{\frac{5 - 2r}{2r+1}} \big).$$
Finally, we bound the last term on the right-hand side of (14). Notice that
$$\Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_\rho \le \eta \sum_{t=1}^T \big\| \pi_{t+1}^T(L_K) \Delta_t \big\|_K.$$
Then, using the estimate (27) and the bound (12) on $\{f_t\}$, we have
$$\mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_\rho^2 \le \eta^2 \Big( \sum_{t=1}^T \big\| \pi_{t+1}^T(L_K) \Delta_t \big\|_K \Big)^2 \le 16\, \eta^2 \sigma^{-4} T^2 \sup_{1 \le t \le T} \max\{\|f_t\|_K, M\}^6 \le 16 M^6 \sigma^{-4} T^{\frac{5}{2r+1}}.$$
Based on the above analysis, the conclusion (10) is obtained by taking
$$C = 8 \big( (r/e)^r + 1 \big)^2 \|g\|_\rho^2 + 16 \big( (1/2e)^{1/2} + 1 \big)^2 c_{M,\rho} + 32 M^6.$$
Similarly, the conclusion (11) follows by taking
$$C' = 8 \Big( \big( (2r-1)/2e \big)^{r - \frac{1}{2}} + 1 \Big)^2 \|g\|_\rho^2 + 8\, c_{M,\rho} + 32 M^6.$$
 □

Author Contributions

B.W. conceived of the presented idea. T.H. developed the theory and performed the computations. All authors discussed the results and contributed to the final manuscript.

Funding

The work described in this paper is partially supported by National Natural Science Foundation of China [Nos. 11671307 and 11571078], Natural Science Foundation of Hubei Province in China [No. 2017CFB523] and the Fundamental Research Funds for the Central Universities, South-Central University for Nationalities [No. CZY18033].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Useful Lemmas

The following two lemmas are slightly modified forms of Lemmas 3 and 7 in [24], respectively.
Lemma A1.
Let $\beta > 0$ and $0 < \eta \le 1$. Then, for any $1 \le l \le k$, there holds
$$\big\| \pi_l^k(L_K) L_K^\beta \big\|^2 \le \frac{2 \big( (\beta/e)^\beta + 1 \big)^2}{1 + \eta^{2\beta} (k - l + 1)^{2\beta}}.$$
Lemma A2.
If $f \in L_K^r(L^2_{\rho_X})$ for some $r > 0$, then
$$\|\pi_1^T(L_K) f\|_\rho \le 2 \big( (r/e)^r + 1 \big) \|L_K^{-r} f\|_\rho \, (\eta T)^{-r}.$$
In addition, if $r > \frac{1}{2}$, then
$$\|\pi_1^T(L_K) f\|_K \le 2 \Big( \big( (2r-1)/2e \big)^{r - \frac{1}{2}} + 1 \Big) \|L_K^{-r} f\|_\rho \, (\eta T)^{-r + \frac{1}{2}}.$$
Lemma A3.
For any $0 < \eta \le 1$ and $t \ge 2$, there holds
$$\sum_{k=1}^t \frac{1}{1 + \eta(t - k)} \le \eta^{-1} \big( \log(et) + 1 \big).$$
Proof. 
By the elementary inequality $\sum_{k=1}^t k^{-1} \le \log(et)$, we know that for $t \ge 2$,
$$\sum_{k=1}^t \frac{1}{1 + \eta(t - k)} = \sum_{k=1}^{t-1} \frac{1}{1 + \eta(t - k)} + 1 \le \eta^{-1} \sum_{k=1}^{t-1} (t - k)^{-1} + 1 = \eta^{-1} \sum_{k=1}^{t-1} \frac{1}{k} + 1 \le \eta^{-1} \log(et) + 1 \le \eta^{-1} \big( \log(et) + 1 \big).$$
Then the proof is completed. □
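The bound in Lemma A3 can also be spot-checked numerically; the values of $t$ and $\eta$ below are arbitrary:

```python
import math

def lhs(t, eta):
    """Left-hand side: sum_{k=1}^t 1 / (1 + eta * (t - k))."""
    return sum(1.0 / (1.0 + eta * (t - k)) for k in range(1, t + 1))

def rhs(t, eta):
    """Right-hand side: eta^{-1} * (log(e * t) + 1)."""
    return (math.log(math.e * t) + 1.0) / eta

checks = [(t, eta, lhs(t, eta) <= rhs(t, eta))
          for t in (2, 10, 1000) for eta in (0.1, 0.5, 1.0)]
print(all(ok for _, _, ok in checks))
```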

References

  1. Santamaria, I.; Pokharel, P.P.; Principe, J.C. Generalized correlation function: Definition, properties, and application to blind equalization. IEEE Trans. Signal Process. 2006, 54, 2187–2197. [Google Scholar] [CrossRef]
  2. Feng, Y.L.; Huang, X.L.; Shi, L.; Yang, Y.N.; Suykens, J.A.K. Learning with the Maximum Correntropy Criterion Induced Losses for Regression. J. Mach. Learn. Res. 2015, 16, 993–1034. [Google Scholar]
  3. He, R.; Zheng, W.S.; Hu, B.G. Maximum Correntropy Criterion for Robust Face Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1561–1576. [Google Scholar] [PubMed]
  4. Liu, W.F.; Pokharel, P.P.; Principe, J.C. Correntropy: Properties and applications in non-Gaussian signal processing. IEEE Trans. Signal Process. 2007, 55, 5286–5298. [Google Scholar] [CrossRef]
  5. Principe, J.C. Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010. [Google Scholar]
  6. He, R.; Zheng, W.S.; Hu, B.G.; Kong, X.W. A regularized correntropy framework for robust pattern recognition. Neural Comput. 2011, 23, 2074–2100. [Google Scholar] [CrossRef]
  7. Bessa, R.J.; Miranda, V.; Gama, J. Entropy and correntropy against minimum square error in offline and online three-day ahead wind power forecasting. IEEE Trans. Power Syst. 2009, 24, 1657–1666. [Google Scholar] [CrossRef]
  8. He, R.; Hu, B.G.; Zheng, W.S.; Kong, X.W. Robust Principal Component Analysis Based on Maximum Correntropy Criterion. IEEE Trans. Image Process. 2011, 20, 1485–1494. [Google Scholar] [PubMed]
  9. Chen, B.; Xing, L.; Liang, J.; Zheng, N.; Principe, J.C. Steady-State Mean-Square Error Analysis for Adaptive Filtering under the Maximum Correntropy Criterion. IEEE Signal Process. Lett. 2014, 21, 880–883. [Google Scholar]
  10. Wu, Z.; Peng, S.; Chen, B.; Zhao, H. Robust Hammerstein Adaptive Filtering under Maximum Correntropy Criterion. Entropy 2015, 17, 7149–7166. [Google Scholar] [CrossRef] [Green Version]
  11. Liu, W.; Pokharel, P.P.; Principe, J.C. Error Entropy, Correntropy and M-Estimation. In Proceedings of the 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, Arlington, VA, USA, 6–8 September 2006. [Google Scholar]
  12. Syed, M.N.; Pardalos, P.M.; Principe, J.C. Invexity of the minimum error entropy criterion. IEEE Signal Process. Lett. 2013, 20, 1159–1162. [Google Scholar] [CrossRef]
  13. Syed, M.N.; Pardalos, P.M.; Principe, J.C. On the optimization properties of the correntropic loss function in data analysis. Optim. Lett. 2014, 8, 823–839. [Google Scholar] [CrossRef]
  14. Marques de Sá, J.P.; Silva, L.M.A.; Santos, J.M.F.; Alexandre, L.A. Minimum Error Entropy Classification; Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  15. Cucker, F.; Zhou, D.X. Learning Theory: An Approximation Theory Viewpoint; Cambridge University Press: Cambridge, UK, 2007. [Google Scholar]
  16. Singh, A.; Pokharel, R.; Principe, J.C. The C-loss function for pattern classification. Pattern Recognit. 2014, 47, 441–453. [Google Scholar] [CrossRef]
  17. Guo, Z.C.; Hu, T.; Shi, L. Gradient descent for robust kernel-based regression. Inverse Prob. 2018, 34. [Google Scholar] [CrossRef]
  18. Smale, S.; Zhou, D.X. Learning theory estimates via integral operators and their approximations. Constr. Approx. 2007, 26, 153–172. [Google Scholar] [CrossRef]
  19. Lu, S.; Pereverzev, S.V. Regularization Theory for Ill-Posed Problems: Selected Topics; Walter de Gruyter: Berlin, Germany, 2013. [Google Scholar]
  20. Caponnetto, A.; Vito, E.D. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 2007, 7, 331–368. [Google Scholar] [CrossRef]
  21. Steinwart, I.; Christmann, A. Support Vector Machines; Springer: New York, NY, USA, 2008. [Google Scholar]
  22. Bauer, F.; Pereverzev, S.V.; Rosasco, L. On regularization algorithms in learning theory. J. Complexity 2007, 23, 52–72. [Google Scholar] [CrossRef] [Green Version]
  23. Feng, Y.L.; Fan, J.; Suykens, J.A. A statistical learning approach to modal regression. arXiv 2017, arXiv:1702.05960. [Google Scholar]
  24. Ying, Y.; Pontil, M. Online gradient descent learning algorithms. Found. Comput. Math. 2008, 8, 561–596. [Google Scholar] [CrossRef]

Wang, B.; Hu, T. Online Gradient Descent for Kernel-Based Maximum Correntropy Criterion. Entropy 2019, 21, 644. https://doi.org/10.3390/e21070644