Stochastic Gradient Descent for Kernel-Based Maximum Correntropy Criterion
Abstract
1. Introduction
- (a) This work establishes the theoretical foundations of SGD applied to the MCC. Several important convergence properties are provided, highlighting the role of the robustness parameter.
- (b) We introduce the Polyak–Łojasiewicz (PL) condition to derive explicit convergence rates for algorithm (4). A global linear convergence rate is achieved when the step size is chosen appropriately (the relevant loss, update, and PL condition are written out after this list).
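For concreteness, the standard forms of the objects referenced above, as they appear in the correntropy and kernel-SGD literature, are written out below. This is a minimal sketch, assuming the usual correntropy-induced loss with robustness parameter σ and a factor-1/2 normalization; the paper's exact scaling and constants may differ.

```latex
% Correntropy-induced loss with robustness (scale) parameter sigma > 0,
% normalized so the gradient step below carries no stray factor of 2
\ell_\sigma(y, u) = \frac{\sigma^2}{2}\left(1 - \exp\Big(-\frac{(y-u)^2}{\sigma^2}\Big)\right)

% Kernel SGD update in the RKHS H_K induced by kernel K, step size eta_t:
% a gradient step on ell_sigma at the sample (x_t, y_t)
f_{t+1} = f_t + \eta_t\,\big(y_t - f_t(x_t)\big)\,
          \exp\Big(-\frac{(y_t - f_t(x_t))^2}{\sigma^2}\Big)\, K(x_t,\cdot)

% Polyak-Lojasiewicz (PL) condition with constant mu > 0 for an objective F
\frac{1}{2}\,\lVert \nabla F(f) \rVert^2 \;\ge\; \mu\,\big(F(f) - F^\ast\big)

% Under the PL condition and L-smoothness, a constant step size eta = 1/L
% yields the classical global linear rate (Karimi, Nutini, and Schmidt, 2016)
F(f_{t+1}) - F^\ast \;\le\; \Big(1 - \frac{\mu}{L}\Big)^{t}\,\big(F(f_1) - F^\ast\big)
```

The exponential weight in the update is the mechanism behind robustness: as σ → ∞ the loss approaches least squares, while small σ suppresses the contribution of large residuals.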
2. Main Results
The results take the following form; a minimal sketch of the update they analyze follows this list.
- (a) The generalization error is uniformly bounded. More precisely, there is some constant C such that …
- (b) There is some constant … such that …
- (a) If … for any …, then …
- (b) If …, then …
- (a) If … and …, then …
- (b) If … and …, then …
- (c) If there exists some … such that … for any …, then …
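To make the object of these theorems concrete, here is a minimal, hypothetical Python sketch of one pass of kernel SGD under the correntropy-induced loss. The Gaussian kernel, bandwidth h, robustness parameter sigma, and the polynomially decaying step-size schedule eta1 * t^(-theta) are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def gaussian_kernel(x, z, h=1.0):
    """Gaussian (RBF) kernel K(x, z) with bandwidth h."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * h ** 2))

def kernel_sgd_mcc(X, y, sigma=1.0, eta1=0.5, theta=0.5, h=1.0):
    """One pass of kernel SGD under the correntropy-induced loss.

    The iterate f_t lives in the RKHS and is stored via its expansion
    f_t = sum_j coef[j] * K(x_j, .); each sample adds one coefficient.
    """
    n = len(y)
    coef = np.zeros(n)
    for t in range(n):
        # evaluate the current iterate at the new sample x_t
        f_xt = sum(coef[j] * gaussian_kernel(X[j], X[t], h) for j in range(t))
        r = y[t] - f_xt                    # residual
        w = np.exp(-r ** 2 / sigma ** 2)   # correntropy weight: ~1 for small
                                           # residuals, ~0 for gross outliers
        eta = eta1 * (t + 1) ** (-theta)   # polynomially decaying step size
        coef[t] = eta * r * w              # f_{t+1} = f_t + eta*r*w*K(x_t, .)
    return coef
```

The weight w = exp(-r²/σ²) is what produces robustness: a gross outlier yields a near-zero step, whereas least-squares SGD (w ≡ 1) would move by the full residual.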
3. Discussions
4. Simulation Validation
- Gaussian noise: noise ∼ …
- Outlier noise: noise ∼ …, noise ∼ …
- Skewed noise: …, where …
- GMM noise: noise ∼ …, noise generated by the following probability density: … (a generation sketch for these noise models follows this list)
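As a companion to the list above, here is a minimal sketch of how noise of these four types is commonly generated; every numeric parameter (variances, mixture weight, degrees of freedom, component means) is an illustrative placeholder rather than the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Gaussian noise (illustrative standard deviation)
gauss = rng.normal(0.0, 0.5, n)

# Outlier-contaminated noise: mostly small Gaussian, occasionally drawn
# from a large-variance component (mixture weight 0.1 is an assumption)
mask = rng.random(n) < 0.1
outlier = np.where(mask, rng.normal(0.0, 5.0, n), rng.normal(0.0, 0.5, n))

# Skewed noise, e.g. a chi-square draw shifted to have mean zero
skewed = rng.chisquare(df=3, size=n) - 3

# Gaussian-mixture (GMM) noise: two components with distinct means
comp = rng.random(n) < 0.5
gmm = np.where(comp, rng.normal(-1.0, 0.3, n), rng.normal(1.0, 0.3, n))
```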
5. Proofs
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest