Article

Online Gradient Descent for Kernel-Based Maximum Correntropy Criterion

1 School of Mathematics and Statistics, South-Central University for Nationalities, Wuhan 430074, China
2 School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Entropy 2019, 21(7), 644; https://doi.org/10.3390/e21070644
Submission received: 15 May 2019 / Revised: 14 June 2019 / Accepted: 24 June 2019 / Published: 29 June 2019
(This article belongs to the Special Issue Entropy Based Inference and Optimization in Machine Learning)

Abstract
In the framework of statistical learning, we study the online gradient descent algorithm generated by correntropy-induced losses in reproducing kernel Hilbert spaces (RKHS). As a generalized correlation measure, correntropy has been widely applied in practice owing to its merits in robustness. Although the online gradient descent method is an efficient way to implement the maximum correntropy criterion (MCC) in non-parametric estimation, no consistency analysis or rigorous error bounds have been available for it. We provide a theoretical understanding of the online algorithm for MCC and show that, with a suitably chosen scaling parameter, its convergence rate can be min–max optimal (up to a logarithmic factor) in the regression analysis. Our results show that the scaling parameter plays an essential role in both robustness and consistency.

1. Introduction

Regression analysis is an important problem in many fields of science. The traditional least squares method may be the most used algorithm for regression in practice. However, it relies only on the mean squared error and belongs to second-order statistics, so its optimality depends heavily on the assumption of Gaussian noise; it usually performs poorly when the noise is not normally distributed. Alternative approaches have been proposed to deal with outliers or heavy-tailed distributions. A generalized correlation function named correntropy [1] was introduced as a substitute for the least squares loss, and the maximum correntropy criterion (MCC) [2,3,4,5] is used to improve robustness in situations of non-Gaussian and heavy-tailed error distributions. Recently, MCC has succeeded in many real applications, e.g., wind power forecasting and pattern recognition [6,7].
In the standard framework of statistical learning, let $X$ be an explanatory variable taking values in a compact metric space $(\mathcal{X}, d)$ with $\mathcal{X} \subset \mathbb{R}^n$, and let $Y$ be a real response variable with values in $\mathcal{Y} \subset \mathbb{R}$. Here we investigate the application of MCC in the following regression model
$$Y = f_\rho(X) + \epsilon, \qquad \mathbb{E}(\epsilon \mid X = x) = 0, \qquad (1)$$
where $\epsilon$ is the noise and $f_\rho(x)$ is the regression function, defined as the conditional mean $\mathbb{E}(Y \mid X = x)$ at each $x \in \mathcal{X}$. The purpose of regression is to estimate the unknown target function $f_\rho$ from the sample $\mathbf{z} = \{z_i = (x_i, y_i)\}_{i=1}^T$, drawn independently from the underlying unknown probability distribution $\rho$ on $Z := \mathcal{X} \times \mathcal{Y}$. For a hypothesis function $f: \mathcal{X} \to \mathcal{Y}$ and a scaling parameter $\sigma > 0$, the correntropy between $f(X)$ and $Y$ is defined by
$$V_\sigma(f) := \mathbb{E}\, G\!\left(-\frac{(f(X) - Y)^2}{2\sigma^2}\right),$$
where $G(u)$ is the exponential function $\exp(u)$, $u \in \mathbb{R}$. For the given sample $\mathbf{z}$, the empirical form of $V_\sigma$ is
$$\hat{V}_\sigma(f) := \frac{1}{T} \sum_{i=1}^T G\!\left(-\frac{(f(x_i) - y_i)^2}{2\sigma^2}\right).$$
When applied to regression problems, MCC maximizes the empirical correntropy $\hat{V}_\sigma$ over a certain hypothesis space $\mathcal{H}$, that is,
$$f_{\mathbf{z}, \mathcal{H}} := \arg\max_{f \in \mathcal{H}} \hat{V}_\sigma(f).$$
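As a toy numerical illustration (not from the paper; the responses and bandwidth below are invented), the empirical correntropy rewards predictors that stay close to the bulk of the data rather than chasing outliers:

```python
import numpy as np

def empirical_correntropy(f_vals, y, sigma):
    """V_hat_sigma(f) = (1/T) * sum_i exp(-(f(x_i) - y_i)^2 / (2 * sigma^2))."""
    r = np.asarray(f_vals) - np.asarray(y)
    return float(np.mean(np.exp(-r**2 / (2.0 * sigma**2))))

# hypothetical toy responses: three inliers near 0 and one gross outlier
y = np.array([0.1, -0.2, 0.0, 5.0])
c_robust = np.zeros(4)          # constant predictor near the inliers
c_mean = np.full(4, y.mean())   # least squares solution, pulled toward the outlier
v_robust = empirical_correntropy(c_robust, y, sigma=1.0)
v_mean = empirical_correntropy(c_mean, y, sigma=1.0)
print(v_robust, v_mean)  # the robust constant attains the higher correntropy
```

Maximizing $\hat{V}_\sigma$ thus prefers the predictor that fits the inliers, unlike the least squares fit.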
MCC in regression problems has shown its efficiency for cases where the noise is non-Gaussian or contains large outliers; see [8,9,10]. It has also drawn much attention in the signal processing, machine learning and optimization communities [2,5,11,12,13,14].
Let $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel, i.e., a continuous, symmetric and positive semi-definite function. Here $K$ is positive semi-definite if, for any finite set $\{u_1, \dots, u_m\} \subset \mathcal{X}$ and any $m \in \mathbb{N}$, the matrix $\big(K(u_i, u_j)\big)_{i,j=1}^m$ is positive semi-definite. The RKHS $(\mathcal{H}_K, \|\cdot\|_K)$ associated with the Mercer kernel $K$ is defined as the completion of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in \mathcal{X}\}$. It has the reproducing property
$$f(x) = \langle f, K_x \rangle_K \qquad (2)$$
for any $f \in \mathcal{H}_K$ and $x \in \mathcal{X}$. Since $\mathcal{X}$ is compact, the RKHS $\mathcal{H}_K$ is contained in $C(\mathcal{X})$, the space of continuous functions on $\mathcal{X}$ with the norm $\|f\|_\infty := \sup_{x \in \mathcal{X}} |f(x)|$. Moreover, if $\mathcal{X}$ is a Euclidean ball in $\mathbb{R}^n$, then for any $\alpha > \frac{n}{2}$ the Sobolev space $H^\alpha(\mathcal{X})$ is an RKHS. For more families of RKHS in statistical learning, one can refer to [15]. Denote $\kappa := \sup_{x \in \mathcal{X}} \sqrt{K(x, x)}$; then, by the reproducing property (2), there holds
$$\|f\|_\infty \le \kappa \|f\|_K \quad \text{for any } f \in \mathcal{H}_K. \qquad (3)$$
Denote by $\ell_\sigma: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ the correntropy-induced regression loss, given by
$$\ell_\sigma(u, v) := \sigma^2 \left(1 - G\!\left(-\frac{(u - v)^2}{2\sigma^2}\right)\right) = \sigma^2 \left(1 - \exp\!\left(-\frac{(u - v)^2}{2\sigma^2}\right)\right).$$
Associated with this regression loss $\ell_\sigma$ and the RKHS $\mathcal{H}_K$, MCC for the regression model (1) is reformulated in the context of learning theory as
$$f_{\mathbf{z}} := \arg\min_{f \in \mathcal{H}_K} \frac{1}{T} \sum_{i=1}^T \ell_\sigma(f(x_i), y_i). \qquad (4)$$
Since $\ell_\sigma$ is not convex, MCC algorithms are usually implemented by various gradient descent methods [14,16,17]. In this paper, we use the following online gradient descent method to solve the optimization scheme (4), since it is scalable to large datasets and applicable to situations where the samples arrive in sequence.
Definition 1.
Given the sample $\mathbf{z} = \{z_i = (x_i, y_i)\}_{i=1}^T$, the online gradient descent method for MCC is defined by $f_1 = 0$ and
$$f_{t+1} = f_t - \eta\, \ell'_\sigma(f_t(x_t), y_t)\, K_{x_t}, \qquad t \in \mathbb{N}, \qquad (5)$$
where $\eta > 0$ is the step size and $\ell'_\sigma$ denotes the derivative of $\ell_\sigma$ with respect to its first variable.
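A minimal sketch of iteration (5) with a Gaussian kernel, storing the iterate as a kernel expansion over the inputs seen so far; the synthetic data, kernel bandwidth, step size, and scaling parameter below are illustrative choices, not values prescribed by the paper:

```python
import numpy as np

def online_mcc(xs, ys, sigma=2.0, eta=0.2, gamma=10.0):
    """One pass of f_{t+1} = f_t - eta * l'_sigma(f_t(x_t), y_t) * K_{x_t}.

    The iterate f_t = sum_i a_i K(x_i, .) is kept as centers and coefficients;
    l'_sigma(u, v) = exp(-(u - v)^2 / (2 sigma^2)) * (u - v).
    """
    k = lambda x, xp: np.exp(-gamma * (x - xp) ** 2)  # Gaussian (Mercer) kernel
    centers, coefs = [], []
    f = lambda x: sum(a * k(c, x) for a, c in zip(coefs, centers))
    for xt, yt in zip(xs, ys):
        r = f(xt) - yt
        centers.append(xt)
        coefs.append(-eta * np.exp(-r**2 / (2 * sigma**2)) * r)
    return f

rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 400)
ys = np.sin(np.pi * xs) + 0.1 * rng.standard_normal(400)
f_hat = online_mcc(xs, ys)
grid = np.linspace(-0.9, 0.9, 50)
err = np.mean([(f_hat(x) - np.sin(np.pi * x)) ** 2 for x in grid])
print(err)  # one-pass estimate tracks sin(pi * x) far better than f = 0
```

Note that each step adds one kernel center, so a single pass costs $O(T^2)$ kernel evaluations in this naive form.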
In the literature, most MCC algorithms have been implemented for linear models and cannot be applied to analysis of data with nonlinear structures. Kernel methods provide efficient non-parametric learning algorithms for dealing with nonlinear features. So, RKHS are used in this work as hypothesis spaces in the design of learning algorithms.
An online algorithm for MCC has been used in practical applications for more than a decade, but a theoretical guarantee or rigorous analysis of its asymptotic convergence is still lacking. Because the optimization problem arising from MCC is not convex, global convergence of the online algorithm (5) for MCC is not unconditionally guaranteed, which makes the theoretical analysis essentially difficult. In fact, extensive numerical studies show that MCC can yield robust estimators while retaining convenient convergence properties. Our goal is thus to fill the gap between the theoretical analysis and the optimization process, so that the output of the online algorithm (5) converges to a global minimizer, which existing work cannot ensure. To this end, we study the ability of $f_{T+1}$, generated by (5) at the $T$-th iteration, to approximate the regression function $f_\rho$. We derive an explicit error rate for (5) with a suitable choice of step sizes, which is competitive with the known rates in regression analysis. In this work, we show that the scaling parameter $\sigma$ plays an important role in providing both robustness and a fast convergence rate.

2. Preliminaries and Main Results

We begin with some preliminaries and notation. Throughout the paper, we assume that the unknown distribution $\rho$ on $Z = \mathcal{X} \times \mathcal{Y}$ can be decomposed into the marginal distribution $\rho_X$ on $\mathcal{X}$ and the conditional distribution $\rho(\cdot \mid x)$ at each $x \in \mathcal{X}$. We also require that $|Y| \le M$ almost surely for some $M > 1$. In regression analysis, the approximation power of $f_{T+1}$ given by (5) is usually measured by the mean squared error in the $L^2_{\rho_X}$-metric $\|f_{T+1} - f_\rho\|_\rho$, where
$$\|\cdot\|_\rho = \|\cdot\|_{L^2_{\rho_X}} := \left(\int_{\mathcal{X}} |\cdot|^2 \, d\rho_X\right)^{1/2}.$$
To present our main result on the error bound for $f_{T+1} - f_\rho$, we state the assumption on the target function $f_\rho$. Define the integral operator $L_K: L^2_{\rho_X} \to L^2_{\rho_X}$ associated with the kernel $K$ by
$$L_K(f) := \int_{\mathcal{X}} f(x) K_x \, d\rho_X, \qquad f \in L^2_{\rho_X}. \qquad (6)$$
By the reproducing property (2) of $\mathcal{H}_K$, for any $f \in \mathcal{H}_K$ it can be expressed as
$$L_K(f) = \int_{\mathcal{X}} \langle f, K_x \rangle_K \, K_x \, d\rho_X.$$
Since $K$ is a Mercer kernel, $L_K$ is compact and positive. Denote by $L_K^r$ the $r$-th power of $L_K$; it is well defined for any $r > 0$ by the spectral theorem. Let $\{\lambda_i\}_{i \ge 1}$ be the eigenvalues of $L_K$, arranged in decreasing order. The corresponding eigenfunctions $\{\phi_i\}_{i \ge 1}$ form an orthonormal basis of $L^2_{\rho_X}$. Hence, the regularity space $L_K^r(L^2_{\rho_X})$ is expressed as [18]
$$L_K^r(L^2_{\rho_X}) := \left\{ f = \sum_{i=1}^\infty \lambda_i^r a_i \phi_i \; : \; \|L_K^{-r} f\|_\rho^2 = \sum_{i=1}^\infty a_i^2 < \infty \right\}.$$
It implies that for any $r_1 > r_2 > 0$, there holds $L_K^{r_1}(L^2_{\rho_X}) \subset L_K^{r_2}(L^2_{\rho_X})$. In particular, $L_K^r(L^2_{\rho_X}) \subset \mathcal{H}_K$ for any $r \ge \frac{1}{2}$, and $L_K^{1/2}(L^2_{\rho_X}) = \mathcal{H}_K$ with
$$\|f\|_K = \|L_K^{-1/2} f\|_\rho, \qquad f \in \mathcal{H}_K. \qquad (7)$$
Throughout the paper, the following regularity assumption on $f_\rho$ holds:
$$f_\rho = L_K^r(g) \quad \text{for some } r > 0 \text{ and } g \in L^2_{\rho_X}, \qquad (8)$$
so that $\|L_K^{-r} f_\rho\|_\rho = \|g\|_\rho$.
This assumption is called the source condition [19] in inverse problems, and it characterizes the smoothness of the target function $f_\rho$: the larger the parameter $r$, the higher the regularity of $f_\rho$. The general source conditions considered in inverse problems usually take the form
$$f_\rho = \psi(L_K)\, h \quad \text{for some } h \in \mathcal{H}_K, \qquad (9)$$
where $\psi$ is non-decreasing with $\psi(0) = 0$, called the index function. Clearly, when $r > \frac{1}{2}$, the assumption (8) is a special case of (9) with $\psi(L_K) = L_K^{r - 1/2}$ and $h = L_K^{1/2} g$. It should be pointed out that our analysis in this work also applies to the more general source conditions (9).
We are now in a position to state our convergence rates for (5) in the $L^2_{\rho_X}$-norm as well as in $\mathcal{H}_K$, obtained by choosing the step size $\eta := \eta(T)$. For brevity, let $\kappa = 1$ without loss of generality and denote the expectation $\mathbb{E}_{z_1, \dots, z_t}$ by $\mathbb{E}_t$ for each $t \in \mathbb{N}$.
Theorem 1.
Define $\{f_t\}_{t=1}^{T+1}$ by (5). Suppose that assumption (8) holds for some $r > 0$. Take $\eta = T^{-\frac{2r}{2r+1}}$ and $T > \big(24\,((1/2e)^{1/2} + 1)^2 \log T\big)^{\frac{2r+1}{2r}}$; then
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_\rho^2 \le C \max\left\{ T^{-\frac{2r}{2r+1}} \log T, \; T^{\frac{5}{2r+1}} \sigma^{-4} \right\} \qquad (10)$$
and, if $r > \frac{1}{2}$,
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_K^2 \le C' \max\left\{ T^{-\frac{2r-1}{2r+1}}, \; T^{\frac{5}{2r+1}} \sigma^{-4} \right\} \qquad (11)$$
where the constants $C, C'$ are independent of $T$ and $\sigma$ and will be given in the proof.
Remark 1.
Besides the error $\|f_{T+1} - f_\rho\|_\rho$, the error bound (11) in the $\mathcal{H}_K$-norm is also given when $r > \frac{1}{2}$, i.e., when $f_\rho \in \mathcal{H}_K$. By (3), it leads to the pointwise convergence of $f_{T+1}$ to $f_\rho$, since for each $u \in \mathcal{X}$, $|f_{T+1}(u) - f_\rho(u)| \le \|f_{T+1} - f_\rho\|_K$. Compared with the global error $\|f_{T+1} - f_\rho\|_\rho$, the error rate in $\mathcal{H}_K$ characterizes the local performance of (5) and is much stronger. Furthermore [18], when the kernel $K$ lies in $C^\alpha(\mathcal{X} \times \mathcal{X})$ for some $\alpha > 0$, its associated RKHS $\mathcal{H}_K$ can be embedded into $C^{\alpha/2}(\mathcal{X})$, the space of functions whose partial derivatives up to order $\alpha/2$ are continuous, with $\|f\|_{C^{\alpha/2}(\mathcal{X})} = \sum_{|s| \le \alpha/2} \|D^s f\|_\infty$. So, convergence in $\mathcal{H}_K$ implies that $f_{T+1}$ converges to $f_\rho$ in $C^{\alpha/2}$, which ensures the convergence of the derivatives of $f_{T+1}$ to those of $f_\rho$.
Remark 2.
It has been proved in [20] that the min–max optimal rate for regression problems is of order $O\big(T^{-\frac{2r}{2r+s}}\big)$ when there exist constants $C_s > 0$ and $0 < s \le 1$ such that the following effective dimension condition holds:
$$\mathrm{Trace}\big((L_K + \lambda I)^{-1} L_K\big) \le C_s \lambda^{-s} \quad \text{for any } \lambda > 0,$$
where $\mathrm{Trace}(\cdot)$ denotes the trace of the operator. This condition measures the complexity [15,20,21] of $\mathcal{H}_K$ with respect to the marginal distribution $\rho_X$. It is always satisfied with $s = 1$ by taking the constant $C_s = \mathrm{Trace}(L_K)$. Hence, the min–max optimal rate for capacity-independent cases is of order $O\big(T^{-\frac{2r}{2r+1}}\big)$, obtained by taking the universal parameter $s = 1$.
When $\sigma \ge T^{\frac{2r+5}{4(2r+1)}}$, our convergence rate in the $L^2_{\rho_X}$-norm is of order $O\big(T^{-\frac{2r}{2r+1}} \log T\big)$. Thus, it is nearly optimal in the capacity-independent sense: up to a logarithmic factor, it matches the min–max optimal rate above. We also find that the convergence rates (10) and (11) keep decreasing as the regularity parameter $r$ increases. Hence, the online algorithm (5) does not suffer from the saturation phenomenon of Tikhonov regularization schemes [22], where the error rate of the estimators does not improve once $r$ is outside the range $(0, 1]$. This again shows the advantage of the online algorithm (5).
Remark 3.
The recent paper [2] investigated the approximation ability of the empirical scheme (4) over general hypothesis spaces $\mathcal{H}$. That work shows that, with a complexity parameter $0 < \beta \le 2$, the error rate is of order $O(T^{-\frac{2}{2+\beta}})$ if the scaling parameter is $\sigma = T^{\frac{1}{2+\beta}}$. For a fair comparison, we ignore the capacity of $\mathcal{H}$ by taking $\beta = 2$. Then their rate reduces to $O(T^{-\frac{1}{2}})$, which is far from the capacity-independent optimal rate and inferior to ours.
In [17], iterative regularization techniques (also called early stopping) are used to solve the optimization problems associated with general robust losses, including the correntropy-induced loss $\ell_\sigma$, where the whole sample $\mathbf{z}$ is presented at each iteration. In their analysis, under a polynomial decay of the eigenvalues $\{\lambda_i\}$, namely that there exist constants $c_b > 0$ and $b \ge 1$ such that
$$\lambda_i \le c_b\, i^{-b}, \qquad i \ge 1,$$
the obtained rate is $O(T^{-\frac{2br}{2br+1}})$ if $r \ge \frac{1}{2}$; otherwise, it is $O(T^{-\frac{2br}{b+1}})$. This decay is also a measure of the complexity of $\mathcal{H}_K$; please refer to [21]. Recall that the compactness of $\mathcal{X}$ implies that $\sum_i \lambda_i < \infty$ and $\lambda_i \le c\, i^{-1}$ for some $c > 0$. So, their rate in the capacity-independent case is $O(T^{-\frac{2r}{2r+1}})$ if $r \ge \frac{1}{2}$; otherwise, it is $O(T^{-r})$. We can see that our results in (10) are superior in the case $0 < r < \frac{1}{2}$. This shows in theory that the online algorithm (5) for MCC can achieve a better approximation rate when $f_\rho$ is not in $\mathcal{H}_K$.
Remark 4.
It is easy to check that the roots of the second derivative of $\ell_\sigma$ are $\pm\sigma$; i.e., when $|f(x) - y| < \sigma$, the loss is convex and behaves like the least squares loss, while when $|f(x) - y| \ge \sigma$, the loss becomes concave and rapidly flattens as $|f(x) - y|$ goes to infinity. It implies that $\ell_\sigma$ satisfies the redescending property: with a suitably chosen scaling parameter $\sigma$, $\ell_\sigma$ can reject gross outliers while keeping prediction accuracy. In Theorem 1, we observe that $\sigma$ should be large enough to guarantee good convergence, which coincides with the work in [2]; they also pointed out that too small a $\sigma$ may prevent the estimator from converging to $f_\rho$. In a recent paper [23], correntropy with small $\sigma$ is interpreted as modal regression. According to the above discussion and empirical studies [2,14,17], we conclude that the value of $\sigma$ determines the learning target, and a moderate $\sigma$ may be more appropriate for balancing convergence and robustness in practice.
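The convex/concave switch at $|u| = \sigma$ and the vanishing gradient for gross outliers can be checked directly in the residual $u = f(x) - y$; the choice $\sigma = 1$ below is arbitrary:

```python
import numpy as np

sigma = 1.0
# correntropy-induced loss and its derivatives as functions of the residual u
loss   = lambda u: sigma**2 * (1 - np.exp(-u**2 / (2 * sigma**2)))
dloss  = lambda u: u * np.exp(-u**2 / (2 * sigma**2))
d2loss = lambda u: (1 - u**2 / sigma**2) * np.exp(-u**2 / (2 * sigma**2))

c_inner = d2loss(0.5)   # > 0: convex, least-squares-like for |u| < sigma
c_outer = d2loss(2.0)   # < 0: concave beyond |u| = sigma
g_out = dloss(100.0)    # ~ 0: a gross outlier contributes almost no gradient
print(c_inner, c_outer, g_out)
```

The loss itself is bounded by $\sigma^2$, so a single corrupted sample can perturb the empirical objective by at most $\sigma^2 / T$.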
Based on the above remarks, the convergence rate of online kernel-based MCC is comparable to that of least squares in the literature [24]. Meanwhile, MCC's redescending property provides robustness to various noise distributions, including sub-Gaussian, Student's t, and Cauchy distributions. This all shows the superiority of MCC in a variety of applications, such as clustering, classification and feature selection [14]. At the end of this section, we point out that although our work is carried out under the boundedness condition on $Y$, it can be extended to more general situations such as moment conditions [20].

3. Proofs of Main Result

In this section, we prove our main results in Theorem 1. First, we derive the uniform bound for the iteration sequence { f t } t = 1 T + 1 by (5).
Lemma 1.
Define $\{f_t\}_{t=1}^{T+1}$ by (5). If $0 < \eta \le 1$, then
$$\|f_t\|_K \le M \eta^{1/2} (t-1)^{1/2}, \qquad t \in \mathbb{N}. \qquad (12)$$
Proof. 
We prove (12) by induction. It is trivial that (12) holds for $t = 1$. Suppose (12) holds for some $t \ge 1$. Notice that $\ell'_\sigma(f_t(x_t), y_t) = G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big)(f_t(x_t) - y_t)$. Write (5) as $f_{t+1} = f_t - \eta H_t$, where $H_t = G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big)(f_t(x_t) - y_t) K_{x_t}$. Then, by (2),
$$\|f_{t+1}\|_K^2 = \|f_t\|_K^2 - 2\eta \langle f_t, H_t \rangle_K + \eta^2 \|H_t\|_K^2 = \|f_t\|_K^2 - 2\eta\, G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big)(f_t(x_t) - y_t) f_t(x_t) + \eta^2 \|H_t\|_K^2$$
and
$$\|H_t\|_K^2 = G\big(-\tfrac{(f_t(x_t) - y_t)^2}{\sigma^2}\big)(f_t(x_t) - y_t)^2 K(x_t, x_t) \le G\big(-\tfrac{(f_t(x_t) - y_t)^2}{\sigma^2}\big)(f_t(x_t) - y_t)^2.$$
Then, writing $G_t := G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big)$, we have
$$\|f_{t+1}\|_K^2 \le \|f_t\|_K^2 + \eta \Big[ \eta\, G_t (f_t(x_t) - y_t)^2 - 2 (f_t(x_t) - y_t) f_t(x_t) \Big] G_t.$$
For the bracketed part of the above inequality, we have
$$\eta G_t (f_t(x_t) - y_t)^2 - 2 (f_t(x_t) - y_t) f_t(x_t) = (\eta G_t - 2)(f_t(x_t) - y_t)^2 - 2 (f_t(x_t) - y_t) y_t = -(2 - \eta G_t)\left(f_t(x_t) - y_t + \frac{y_t}{2 - \eta G_t}\right)^2 + \frac{y_t^2}{2 - \eta G_t}.$$
Since $\eta \le 1$ and $G_t \le 1$, it follows that $\eta G_t - 2 < 0$ and $2 - \eta G_t \ge 1$. Recall that $|y| \le M$ for all $y \in \mathcal{Y}$; then
$$\eta G_t (f_t(x_t) - y_t)^2 - 2 (f_t(x_t) - y_t) f_t(x_t) \le \frac{y_t^2}{2 - \eta G_t} \le M^2.$$
Based on the above analysis,
$$\|f_{t+1}\|_K^2 \le \|f_t\|_K^2 + \eta M^2 G_t \le \|f_t\|_K^2 + \eta M^2 \le M^2 \eta (t-1) + \eta M^2 = M^2 \eta t.$$
Then the proof is completed. □
Next, we establish a proposition that is crucial to proving the convergence rates in Theorem 1; it is closely related to the generalization error of $f_t$. Define the generalization error $\mathcal{E}(f)$ of a measurable function $f: \mathcal{X} \to \mathbb{R}$ by
$$\mathcal{E}(f) = \int_Z (f(x) - y)^2 \, d\rho.$$
The regression function $f_\rho$ that we want to learn or approximate is a minimizer of $\mathcal{E}(f)$; that is,
$$f_\rho = \arg\min \{ \mathcal{E}(f) : f \text{ is a measurable function from } \mathcal{X} \text{ to } \mathcal{Y} \}.$$
A simple computation yields the relation, for $f: \mathcal{X} \to \mathbb{R}$,
$$\|f - f_\rho\|_\rho^2 = \mathcal{E}(f) - \mathcal{E}(f_\rho). \qquad (13)$$
For brevity, set the operators $\pi_k^t(L_K) := \prod_{j=k}^t (I - \eta L_K)$ for $k, t \in \mathbb{N}$ and $\pi_{t+1}^t(L_K) := I$.
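As a sanity check (with an invented discrete distribution), the relation $\|f - f_\rho\|_\rho^2 = \mathcal{E}(f) - \mathcal{E}(f_\rho)$ above can be verified by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.integers(0, 3, size=200_000)                 # uniform rho_X on {0, 1, 2}
f_rho = np.array([0.5, -1.0, 2.0])                    # regression function values
ys = f_rho[xs] + 0.3 * rng.standard_normal(xs.size)   # Y = f_rho(X) + eps

def gen_error(f_vals):
    """Monte Carlo estimate of E(f) = int_Z (f(x) - y)^2 d rho."""
    return float(np.mean((f_vals[xs] - ys) ** 2))

f = np.array([1.0, 0.0, 1.5])
excess = gen_error(f) - gen_error(f_rho)              # E(f) - E(f_rho)
norm_sq = float(np.mean((f - f_rho)[xs] ** 2))        # ||f - f_rho||_rho^2
print(excess, norm_sq)  # the two quantities agree up to sampling noise
```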
Proposition 1.
Define $\{f_t\}_{t=1}^{T+1}$ by (5). If the step size satisfies $0 < \eta < 1$, then we have
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_\rho^2 \le 2 \|\pi_1^T(L_K) f_\rho\|_\rho^2 + 2\eta^2 \sum_{t=1}^T \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)} \mathbb{E}_{t-1} \mathcal{E}(f_t) + 2\, \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_\rho^2; \qquad (14)$$
furthermore, if $f_\rho \in \mathcal{H}_K$,
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_K^2 \le 2 \|\pi_1^T(L_K) f_\rho\|_K^2 + 2\eta^2 \sum_{t=1}^T \mathbb{E}_{t-1} \mathcal{E}(f_t) + 2\, \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_K^2, \qquad (15)$$
where $\Delta_t$ is defined in the proof.
Proof. 
Denote
$$\Delta_t = \Big( G(0) - G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big) \Big)(f_t(x_t) - y_t) K_{x_t} = \Big( 1 - G\big(-\tfrac{(f_t(x_t) - y_t)^2}{2\sigma^2}\big) \Big)(f_t(x_t) - y_t) K_{x_t}$$
and define the random variable $\xi(f_t, z_t) := L_K(f_t - f_\rho) - (f_t(x_t) - y_t) K_{x_t}$.
By (5), we have that for any $t \in \mathbb{N}$,
$$f_{t+1} - f_\rho = f_t - f_\rho - \eta (f_t(x_t) - y_t) K_{x_t} + \eta \Delta_t = (I - \eta L_K)(f_t - f_\rho) + \eta\, \xi(f_t, z_t) + \eta \Delta_t. \qquad (16)$$
Applying the above equality iteratively from $t = T$ down to $t = 1$, we get, by $f_1 = 0$,
$$f_{T+1} - f_\rho = -\pi_1^T(L_K) f_\rho + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K)\, \xi(f_t, z_t) + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t. \qquad (17)$$
It follows from the elementary inequality $\|g_1 + g_2\|_\rho^2 \le 2\|g_1\|_\rho^2 + 2\|g_2\|_\rho^2$, valid for any $g_1, g_2 \in L^2_{\rho_X}$, that
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_\rho^2 \le 2\, \mathbb{E}_T \Big\| -\pi_1^T(L_K) f_\rho + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 + 2\, \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_\rho^2. \qquad (18)$$
To prove (14), we consider the first term on the right-hand side of (18):
$$\mathbb{E}_T \Big\| -\pi_1^T(L_K) f_\rho + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 = \|\pi_1^T(L_K) f_\rho\|_\rho^2 + \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 - 2\, \mathbb{E}_T \Big\langle \pi_1^T(L_K) f_\rho, \; \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\rangle_\rho. \qquad (19)$$
Observe that $f_t$ depends only on $\{z_1, \dots, z_{t-1}\}$ and not on $z_t$. Thus, by the fact that $\int_{\mathcal{Y}} y \, d\rho(y \mid x) = f_\rho(x)$, we have
$$\mathbb{E}_{z_t}\, \xi(f_t, z_t) = 0, \qquad t = 1, \dots, T. \qquad (20)$$
We consider the second term on the right-hand side of (19). It can be rewritten as
$$\mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 = \eta^2 \sum_{t=1}^T \sum_{l=1}^T \mathbb{E}_T \big\langle \pi_{t+1}^T(L_K) \xi(f_t, z_t), \; \pi_{l+1}^T(L_K) \xi(f_l, z_l) \big\rangle_\rho.$$
When $t < l \le T$, by (20),
$$\mathbb{E}_T \big\langle \pi_{t+1}^T(L_K) \xi(f_t, z_t), \pi_{l+1}^T(L_K) \xi(f_l, z_l) \big\rangle_\rho = \mathbb{E}_{l-1} \mathbb{E}_{z_l} \big\langle \pi_{t+1}^T(L_K) \xi(f_t, z_t), \pi_{l+1}^T(L_K) \xi(f_l, z_l) \big\rangle_\rho = \mathbb{E}_{l-1} \big\langle \pi_{t+1}^T(L_K) \xi(f_t, z_t), \pi_{l+1}^T(L_K)\, \mathbb{E}_{z_l} \xi(f_l, z_l) \big\rangle_\rho = 0.$$
Obviously, the same holds for $l < t \le T$. So, with (7), we get
$$\eta^2 \sum_{t=1}^T \sum_{l=1}^T \mathbb{E}_T \big\langle \pi_{t+1}^T(L_K) \xi(f_t, z_t), \pi_{l+1}^T(L_K) \xi(f_l, z_l) \big\rangle_\rho = \eta^2 \sum_{t=1}^T \mathbb{E}_t \big\| \pi_{t+1}^T(L_K) \xi(f_t, z_t) \big\|_\rho^2 \le \eta^2 \sum_{t=1}^T \big\| \pi_{t+1}^T(L_K) L_K^{1/2} \big\|^2 \, \mathbb{E}_t \big\| L_K^{-1/2} \xi(f_t, z_t) \big\|_\rho^2 = \eta^2 \sum_{t=1}^T \big\| \pi_{t+1}^T(L_K) L_K^{1/2} \big\|^2 \, \mathbb{E}_t \|\xi(f_t, z_t)\|_K^2.$$
To bound $\mathbb{E}_t \|\xi(f_t, z_t)\|_K^2$, note that $\xi(f_t, z_t) = \mathbb{E}_{z_t}\big[(f_t(x_t) - y_t) K_{x_t}\big] - (f_t(x_t) - y_t) K_{x_t}$, so
$$\mathbb{E}_t \|\xi(f_t, z_t)\|_K^2 \le \mathbb{E}_{t-1} \mathbb{E}_{z_t} \big\| (f_t(x_t) - y_t) K_{x_t} \big\|_K^2 \le \mathbb{E}_{t-1} \mathbb{E}_{z_t} (f_t(x_t) - y_t)^2 = \mathbb{E}_{t-1} \mathcal{E}(f_t),$$
where the last inequality is derived from (3) and $\kappa = 1$. Applying Lemma A1 with $\beta = \frac{1}{2}$, $l = t + 1$ and $k = T$, we have
$$\sum_{t=1}^T \big\| \pi_{t+1}^T(L_K) L_K^{1/2} \big\|^2 = \sum_{t=1}^{T-1} \big\| \pi_{t+1}^T(L_K) L_K^{1/2} \big\|^2 + \big\| \pi_{T+1}^T(L_K) L_K^{1/2} \big\|^2 \le \sum_{t=1}^{T-1} \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)} + 1 \le \sum_{t=1}^T \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)}.$$
Based on the above analysis, we have
$$\mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 \le \eta^2 \sum_{t=1}^T \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)} \mathbb{E}_{t-1} \mathcal{E}(f_t). \qquad (21)$$
Now we estimate the last term on the right-hand side of (19). Using (20) again, we have
$$\mathbb{E}_T \Big\langle \pi_1^T(L_K) f_\rho, \; \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\rangle_\rho = \Big\langle \pi_1^T(L_K) f_\rho, \; \eta \sum_{t=1}^T \pi_{t+1}^T(L_K)\, \mathbb{E}_{t-1} \mathbb{E}_{z_t} \xi(f_t, z_t) \Big\rangle_\rho = 0. \qquad (22)$$
Plugging (21) and (22) into (19), we get
$$\mathbb{E}_T \Big\| -\pi_1^T(L_K) f_\rho + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_\rho^2 \le \|\pi_1^T(L_K) f_\rho\|_\rho^2 + \eta^2 \sum_{t=1}^T \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)} \mathbb{E}_{t-1} \mathcal{E}(f_t).$$
This together with (18) yields the desired conclusion (14).
Now we turn to bounding $f_{T+1} - f_\rho$ in the $\mathcal{H}_K$-norm. By (17) again, we have
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_K^2 \le 2\, \mathbb{E}_T \Big\| -\pi_1^T(L_K) f_\rho + \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \xi(f_t, z_t) \Big\|_K^2 + 2\, \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_K^2.$$
Following a procedure similar to the estimate of (14), we also get
$$\mathbb{E}_T \|f_{T+1} - f_\rho\|_K^2 \le 2 \|\pi_1^T(L_K) f_\rho\|_K^2 + 2\eta^2 \sum_{t=1}^T \|\pi_{t+1}^T(L_K)\|^2 \, \mathbb{E}_{t-1} \mathcal{E}(f_t) + 2\, \mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_K^2.$$
Noticing that $\|\pi_{t+1}^T(L_K)\| \le 1$, the bound (15) is obtained. □
Based on the error bounds for $f_{T+1} - f_\rho$ in Proposition 1, we next need to estimate the generalization error $\mathcal{E}(f_t)$.
Lemma 2.
Define $\{f_t\}_{t=1}^{T+1}$ by (5). If
$$0 < \eta \le \min\left\{ 1, \; \Big[ 8\,((1/2e)^{1/2} + 1)^2 (\log(et) + 1) \Big]^{-1} \right\}, \qquad (24)$$
then for $t \ge 2$,
$$\mathbb{E}_{t-1} \mathcal{E}(f_t) \le 2 \mathcal{E}(f_\rho) + 4 \|f_\rho\|_\rho^2 + 64\, \eta^2 \sigma^{-4} (t-1)^2 \sup_{1 \le k \le t-1} \max\{\|f_k\|_K, M\}^6. \qquad (25)$$
Proof. 
We prove (25) by induction. Obviously, (25) holds for $t = 2$. Suppose (25) holds up to some $t \ge 2$. Applying (14) with $T = t$, then
$$\mathbb{E}_t \|f_{t+1} - f_\rho\|_\rho^2 \le 2 \|\pi_1^t(L_K) f_\rho\|_\rho^2 + 2\eta^2 \sum_{k=1}^t \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(t - k)} \mathbb{E}_{k-1} \mathcal{E}(f_k) + 2\, \mathbb{E}_t \Big\| \eta \sum_{k=1}^t \pi_{k+1}^t(L_K) \Delta_k \Big\|_\rho^2 \le 2 \|\pi_1^t(L_K) f_\rho\|_\rho^2 + 2\eta^2 \sum_{k=1}^t \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(t - k)} \mathbb{E}_{k-1} \mathcal{E}(f_k) + 2\eta^2 \Big( \sum_{k=1}^t \big\| \pi_{k+1}^t(L_K) \Delta_k \big\|_K \Big)^2. \qquad (26)$$
Since $G$ is Lipschitz continuous with constant $1$ on $(-\infty, 0]$, we have, for each $1 \le k \le t$,
$$\|\Delta_k\|_K \le \Big| G(0) - G\big(-\tfrac{(f_k(x_k) - y_k)^2}{2\sigma^2}\big) \Big| \, |f_k(x_k) - y_k| \, \|K_{x_k}\|_K \le \frac{(f_k(x_k) - y_k)^2}{2\sigma^2}\, |f_k(x_k) - y_k| \le \frac{(\|f_k\|_\infty + M)^3}{2\sigma^2} \le \frac{(\|f_k\|_K + M)^3}{2\sigma^2}, \qquad (27)$$
where the last inequality is derived from (3). Notice that, by $0 < \eta \le 1$, there holds $\|\pi_k^t(L_K)\| \le \prod_{j=k}^t \|I - \eta L_K\| \le 1$ for each $1 \le k \le t \le T$. Then the last term on the right-hand side of (26) is bounded as
$$2\eta^2 \Big( \sum_{k=1}^t \big\| \pi_{k+1}^t(L_K) \Delta_k \big\|_K \Big)^2 \le 32\, \eta^2 \sigma^{-4} t^2 \sup_{1 \le k \le t} \max\{\|f_k\|_K, M\}^6.$$
For the first term, it is easy to see that $2\|\pi_1^t(L_K) f_\rho\|_\rho^2 \le 2 \|f_\rho\|_\rho^2$.
Putting these estimates into (26) and using the relation (13) with $f = f_{t+1}$, we have
$$\mathbb{E}_t \mathcal{E}(f_{t+1}) = \mathbb{E}_t \|f_{t+1} - f_\rho\|_\rho^2 + \mathcal{E}(f_\rho) \le \mathcal{E}(f_\rho) + 2\|f_\rho\|_\rho^2 + 32\, \eta^2 \sigma^{-4} t^2 \sup_{1 \le k \le t} \max\{\|f_k\|_K, M\}^6 + 2\eta^2 \sum_{k=1}^t \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(t - k)} \mathbb{E}_{k-1} \mathcal{E}(f_k)$$
$$\le \mathcal{E}(f_\rho) + 2\|f_\rho\|_\rho^2 + 32\, \eta^2 \sigma^{-4} t^2 \sup_{1 \le k \le t} \max\{\|f_k\|_K, M\}^6 + 2\eta^2 \sum_{k=1}^t \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(t - k)} \Big( 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64\, \eta^2 \sigma^{-4} (t-1)^2 \sup_{1 \le k \le t-1} \max\{\|f_k\|_K, M\}^6 \Big), \qquad (28)$$
where the induction hypothesis was used in the last step. By the restriction (24) on $\eta$ and Lemma A3, we know that
$$2\eta^2 \sum_{k=1}^t \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(t - k)} \le 4\eta\, ((1/2e)^{1/2} + 1)^2 (\log(et) + 1) \le \frac{1}{2}.$$
Plugging this into (28), we have
$$\mathbb{E}_t \mathcal{E}(f_{t+1}) \le \mathcal{E}(f_\rho) + 2\|f_\rho\|_\rho^2 + 32\, \eta^2 \sigma^{-4} t^2 \sup_{1 \le k \le t} \max\{\|f_k\|_K, M\}^6 + \frac{1}{2}\Big( 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64\, \eta^2 \sigma^{-4} (t-1)^2 \sup_{1 \le k \le t-1} \max\{\|f_k\|_K, M\}^6 \Big) \le 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64\, \eta^2 \sigma^{-4} t^2 \sup_{1 \le k \le t} \max\{\|f_k\|_K, M\}^6.$$
Then the proof is completed. □
With these preliminaries in place, we shall prove our main results.
Proof of Theorem 1.
We prove Theorem 1 via Proposition 1. First, we use (14) to estimate the error rate for (5) in the $L^2_{\rho_X}$-norm. For the first term on the right-hand side of (14), applying Lemma A2 with $f = f_\rho$ and $\eta = T^{-\frac{2r}{2r+1}}$, we have
$$\|\pi_1^T(L_K) f_\rho\|_\rho^2 \le 4 \big( (r/e)^r + 1 \big)^2 \|L_K^{-r} f_\rho\|_\rho^2 \, T^{-\frac{2r}{2r+1}} = 4 \big( (r/e)^r + 1 \big)^2 \|g\|_\rho^2 \, T^{-\frac{2r}{2r+1}}.$$
For the second term on the right-hand side of (14), the choice of $\eta$ and $T$ in Theorem 1 implies that the restriction (24) holds. Then we can put the bound (12) into (25) and get, for $t \ge 2$,
$$\mathbb{E}_{t-1} \mathcal{E}(f_t) \le 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64 M^6 \sigma^{-4} \eta^5 (t-1)^5 \le 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64 M^6 \big( 1 + \sigma^{-4} \eta^5 T^5 \big) \le \big( 2\mathcal{E}(f_\rho) + 4\|f_\rho\|_\rho^2 + 64 M^6 \big) \big( 1 + \sigma^{-4} T^{\frac{5}{2r+1}} \big) =: c_{M,\rho} \big( 1 + \sigma^{-4} T^{\frac{5}{2r+1}} \big).$$
This together with Lemma A3 yields
$$\eta^2 \sum_{t=1}^T \frac{2((1/2e)^{1/2} + 1)^2}{1 + \eta(T - t)} \mathbb{E}_{t-1} \mathcal{E}(f_t) \le 2((1/2e)^{1/2} + 1)^2 c_{M,\rho}\, \eta\, (\log(eT) + 1) \big( 1 + \sigma^{-4} T^{\frac{5}{2r+1}} \big) \le 4\, ((1/2e)^{1/2} + 1)^2 c_{M,\rho} \log(T) \big( T^{-\frac{2r}{2r+1}} + \sigma^{-4} T^{\frac{5 - 2r}{2r+1}} \big).$$
Finally, we bound the last term on the right-hand side of (14). Notice that
$$\Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_\rho \le \eta \sum_{t=1}^T \big\| \pi_{t+1}^T(L_K) \Delta_t \big\|_K.$$
Then, using the estimate (27) and the bound (12) on $\{f_t\}$, we have
$$\mathbb{E}_T \Big\| \eta \sum_{t=1}^T \pi_{t+1}^T(L_K) \Delta_t \Big\|_\rho^2 \le \eta^2 \Big( \sum_{t=1}^T \big\| \pi_{t+1}^T(L_K) \Delta_t \big\|_K \Big)^2 \le 16\, \eta^2 \sigma^{-4} T^2 \sup_{1 \le t \le T} \max\{\|f_t\|_K, M\}^6 \le 16 M^6 \sigma^{-4} T^{\frac{5}{2r+1}}.$$
Based on the above analysis, the conclusion (10) is obtained by taking
$$C = 8 \big( (r/e)^r + 1 \big)^2 \|g\|_\rho^2 + 16 \big( (1/2e)^{1/2} + 1 \big)^2 c_{M,\rho} + 32 M^6.$$
Similarly, the conclusion (11) follows by taking
$$C' = 8 \Big( \big( (2r-1)/2e \big)^{r - \frac{1}{2}} + 1 \Big)^2 \|g\|_\rho^2 + 8\, c_{M,\rho} + 32 M^6.$$
 □

Author Contributions

B.W. conceived of the presented idea. T.H. developed the theory and performed the computations. All authors discussed the results and contributed to the final manuscript.

Funding

The work described in this paper is partially supported by National Natural Science Foundation of China [Nos. 11671307 and 11571078], Natural Science Foundation of Hubei Province in China [No. 2017CFB523] and the Fundamental Research Funds for the Central Universities, South-Central University for Nationalities [No. CZY18033].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Useful Lemmas

The following two lemmas are slightly modified forms of Lemmas 3 and 7 in [24], respectively.
Lemma A1.
Let $\beta > 0$ and $0 < \eta \le 1$. Then, for any $1 \le l \le k$, there holds
$$\big\| \pi_l^k(L_K) L_K^\beta \big\|^2 \le \frac{2 \big( (\beta/e)^\beta + 1 \big)^2}{1 + \eta^{2\beta} (k - l + 1)^{2\beta}}.$$
Lemma A2.
If $f \in L_K^r(L^2_{\rho_X})$ for some $r > 0$, then
$$\|\pi_1^T(L_K) f\|_\rho \le 2 \big( (r/e)^r + 1 \big) \|L_K^{-r} f\|_\rho \, (\eta T)^{-r}.$$
In addition, if $r > \frac{1}{2}$, then
$$\|\pi_1^T(L_K) f\|_K \le 2 \Big( \big( (2r-1)/2e \big)^{r - \frac{1}{2}} + 1 \Big) \|L_K^{-r} f\|_\rho \, (\eta T)^{-r + \frac{1}{2}}.$$
Lemma A3.
For any $0 < \eta \le 1$ and $t \ge 2$, there holds
$$\sum_{k=1}^t \frac{1}{1 + \eta(t - k)} \le \eta^{-1} \big( \log(et) + 1 \big).$$
Proof. 
By the elementary inequality $\sum_{k=1}^t k^{-1} \le \log(et)$, we know that for $t \ge 2$,
$$\sum_{k=1}^t \frac{1}{1 + \eta(t - k)} = \sum_{k=1}^{t-1} \frac{1}{1 + \eta(t - k)} + 1 \le \eta^{-1} \sum_{k=1}^{t-1} (t - k)^{-1} + 1 = \eta^{-1} \sum_{k=1}^{t-1} \frac{1}{k} + 1 \le \eta^{-1} \log(et) + 1 \le \eta^{-1} \big( \log(et) + 1 \big).$$
Then the proof is completed. □
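The bound in Lemma A3 can also be spot-checked numerically; the values of $t$ and $\eta$ below are arbitrary:

```python
import math

def lhs(t, eta):
    """Left-hand side: sum_{k=1}^t 1 / (1 + eta * (t - k))."""
    return sum(1.0 / (1.0 + eta * (t - k)) for k in range(1, t + 1))

def rhs(t, eta):
    """Right-hand side: eta^{-1} * (log(e * t) + 1)."""
    return (math.log(math.e * t) + 1.0) / eta

checks = [(t, eta, lhs(t, eta) <= rhs(t, eta))
          for t in (2, 10, 1000) for eta in (0.1, 0.5, 1.0)]
print(all(ok for _, _, ok in checks))
```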

References

  1. Santamaria, I.; Pokharel, P.P.; Principe, J.C. Generalized correlation function: Definition, properties, and application to blind equalization. IEEE Trans. Signal Process. 2006, 54, 2187–2197. [Google Scholar] [CrossRef]
  2. Feng, Y.L.; Huang, X.L.; Shi, L.; Yang, Y.N.; Suykens, J.A.K. Learning with the Maximum Correntropy Criterion Induced Losses for Regression. J. Mach. Learn. Res. 2015, 16, 993–1034. [Google Scholar]
  3. He, R.; Zheng, W.S.; Hu, B.G. Maximum Correntropy Criterion for Robust Face Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1561–1576. [Google Scholar] [PubMed]
  4. Liu, W.F.; Pokharel, P.P.; Principe, J.C. Correntropy: Properties and applications in non-Gaussian signal processing. IEEE Trans. Signal Process. 2007, 55, 5286–5298. [Google Scholar] [CrossRef]
  5. Principe, J.C. Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010. [Google Scholar]
  6. He, R.; Zheng, W.S.; Hu, B.G.; Kong, X.W. A regularized correntropy framework for robust pattern recognition. Neural Comput. 2011, 23, 2074–2100. [Google Scholar] [CrossRef]
  7. Bessa, R.J.; Miranda, V.; Gama, J. Entropy and correntropy against minimum square error in offline and online three-day ahead wind power forecasting. IEEE Trans. Power Syst. 2009, 24, 1657–1666. [Google Scholar] [CrossRef]
  8. He, R.; Hu, B.G.; Zheng, W.S.; Kong, X.W. Robust Principal Component Analysis Based on Maximum Correntropy Criterion. IEEE Trans. Image Process. 2011, 20, 1485–1494. [Google Scholar] [PubMed]
  9. Chen, B.; Xing, L.; Liang, J.; Zheng, N.; Principe, J.C. Steady-State Mean-Square Error Analysis for Adaptive Filtering under the Maximum Correntropy Criterion. IEEE Signal Process. Lett. 2014, 21, 880–883. [Google Scholar]
  10. Wu, Z.; Peng, S.; Chen, B.; Zhao, H. Robust Hammerstein Adaptive Filtering under Maximum Correntropy Criterion. Entropy 2015, 17, 7149–7166. [Google Scholar] [CrossRef] [Green Version]
  11. Liu, W.; Pokharel, P.P.; Principe, J.C. Error Entropy, Correntropy and M-Estimation. In Proceedings of the 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, Arlington, VA, USA, 6–8 September 2006. [Google Scholar]
  12. Syed, M.N.; Pardalos, P.M.; Principe, J.C. Invexity of the minimum error entropy criterion. IEEE Signal Process. Lett. 2013, 20, 1159–1162. [Google Scholar] [CrossRef]
  13. Syed, M.N.; Pardalos, P.M.; Principe, J.C. On the optimization properties of the correntropic loss function in data analysis. Optim. Lett. 2014, 8, 823–839. [Google Scholar] [CrossRef]
  14. Marques de Sá, J.P.; Silva, L.M.A.; Santos, J.M.F.; Alexandre, L.A. Minimum Error Entropy Classification; Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  15. Cucker, F.; Zhou, D.X. Learning Theory: An Approximation Theory Viewpoint; Cambridge University Press: Cambridge, UK, 2007. [Google Scholar]
  16. Singh, A.; Pokharel, R.; Principe, J.C. The C-loss function for pattern classification. Pattern Recognit. 2014, 47, 441–453. [Google Scholar] [CrossRef]
  17. Guo, Z.C.; Hu, T.; Shi, L. Gradient descent for robust kernel-based regression. Inverse Prob. 2018, 34. [Google Scholar] [CrossRef]
  18. Smale, S.; Zhou, D.X. Learning theory estimates via integral operators and their approximations. Constr. Approx. 2007, 26, 153–172. [Google Scholar] [CrossRef]
  19. Lu, S.; Pereverzev, S.V. Regularization Theory for Ill-Posed Problems: Selected Topics; Walter de Gruyter: Berlin, Germany, 2013. [Google Scholar]
  20. Caponnetto, A.; Vito, E.D. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 2007, 7, 331–368. [Google Scholar] [CrossRef]
  21. Steinwart, I.; Christmann, A. Support Vector Machines; Springer: New York, NY, USA, 2008. [Google Scholar]
  22. Bauer, F.; Pereverzev, S.V.; Rosasco, L. On regularization algorithms in learning theory. J. Complexity 2007, 23, 52–72. [Google Scholar] [CrossRef] [Green Version]
  23. Feng, Y.L.; Fan, J.; Suykens, J.A. A statistical learning approach to modal regression. arXiv 2017, arXiv:1702.05960. [Google Scholar]
  24. Ying, Y.; Pontil, M. Online gradient descent learning algorithms. Found. Comput. Math. 2008, 8, 561–596. [Google Scholar] [CrossRef]

Wang, B.; Hu, T. Online Gradient Descent for Kernel-Based Maximum Correntropy Criterion. Entropy 2019, 21, 644. https://doi.org/10.3390/e21070644