Abstract
The Maximum Correntropy Criterion (MCC) has recently triggered enormous research activity in the engineering and machine learning communities because it is robust to the heavy-tailed noise and outliers encountered in practice. This work is interested in distributed MCC algorithms, based on a divide-and-conquer strategy, which can deal with big data efficiently. By establishing minimax optimal error bounds, our results show that the averaging output function of this distributed algorithm can achieve convergence rates comparable to those of the algorithm processing the total data on a single machine.
1. Introduction
In the big data era, the rapid expansion of data generation brings data of prohibitive size and complexity, which poses challenges to many traditional learning algorithms that require access to the whole data set. Distributed learning algorithms, based on the divide-and-conquer strategy, provide a simple and efficient way to address this issue and have therefore received increasing attention. Such a strategy starts by partitioning the big data set into multiple subsets that are distributed to local machines, then obtains a local estimator on each subset using a base algorithm, and finally pools the local estimators together by simple averaging. It can substantially cut the time and memory costs of the algorithm implementation, and in many practical applications its learning performance has been shown to be as good as that of a single big machine that can use all the data. This scheme has been developed in various learning contexts, including spectral algorithms [1,2], kernel ridge regression [3,4,5], gradient descent [6,7], a semi-supervised approach [8], minimum error entropy [9] and bias correction [10].
Regression estimation and inference play an important role in the fields of data mining and statistics. The traditional ordinary least squares (OLS) method provides an efficient estimator if the regression model error is normally distributed. However, heavy-tailed noise and outliers are common in the real world, which limits the application of OLS in practice. Various robust losses have been proposed to deal with this problem in place of the least squares loss. The commonly used robust losses include the adaptive Huber loss [11], gain functions [12], minimum error entropy [13], the exponential squared loss [14], etc. Among them, the Maximum Correntropy Criterion (MCC) is widely employed as an efficient alternative to the ordinary least squares method, which is suboptimal in non-Gaussian and non-linear signal processing settings [15,16,17,18,19]. Recently, MCC has been studied extensively in the literature and is widely adopted for many learning tasks, e.g., wind power forecasting [20] and pattern recognition [19]. In this paper, we are interested in the implementation of MCC by a distributed gradient descent method in a big data setting. Note that the MCC loss function is non-convex, so its analysis is essentially different from that of the least squares method. A rigorous analysis of distributed MCC is necessary to derive its consistency and learning rates.
Given a hypothesis function $f$ and a scaling parameter $\sigma>0$, the correntropy between $f(X)$ and $Y$ is defined by
$$V_\sigma(f) = \mathbb{E}\big[G_\sigma\big(f(X) - Y\big)\big],$$
where $G_\sigma$ is the Gaussian function $G_\sigma(t) = \exp\!\left(-\frac{t^2}{2\sigma^2}\right)$. Given the sample set $D = \{(x_i, y_i)\}_{i=1}^N$, the empirical form of $V_\sigma(f)$ is
$$\widehat{V}_\sigma(f) = \frac{1}{N}\sum_{i=1}^N G_\sigma\big(f(x_i) - y_i\big).$$
The purpose of MCC is to maximize the empirical correntropy over a hypothesis space $\mathcal{H}$, that is,
$$\max_{f \in \mathcal{H}} \frac{1}{N}\sum_{i=1}^N G_\sigma\big(f(x_i) - y_i\big). \qquad (1)$$
In the statistical learning context, the loss induced by correntropy is defined as
$$\ell_\sigma(t) = \sigma^2\left(1 - \exp\!\left(-\frac{t^2}{2\sigma^2}\right)\right), \qquad t \in \mathbb{R},$$
where $\sigma>0$ is the scaling parameter. The loss function $\ell_\sigma$ can be viewed as a variant of the Welsch function [21], and the estimator of (1) is also the minimizer of the empirical risk minimization scheme over $\mathcal{H}$, that is,
$$\min_{f \in \mathcal{H}} \frac{1}{N}\sum_{i=1}^N \ell_\sigma\big(y_i - f(x_i)\big). \qquad (2)$$
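To make the robustness of the correntropy-induced loss concrete, the following short sketch compares it with the squared loss on a large residual, assuming the Welsch-type form $\ell_\sigma(t)=\sigma^2(1-\exp(-t^2/(2\sigma^2)))$ written above (the normalization is our reading of the definition): the squared loss grows without bound while $\ell_\sigma$ saturates at $\sigma^2$.

```python
import numpy as np

def correntropy_loss(t, sigma):
    """Correntropy-induced (Welsch-type) loss; bounded above by sigma**2."""
    return sigma**2 * (1.0 - np.exp(-t**2 / (2.0 * sigma**2)))

residuals = np.array([0.1, 1.0, 5.0, 50.0])   # the last value mimics an outlier
sigma = 1.0
print("squared loss    :", 0.5 * residuals**2)
print("correntropy loss:", correntropy_loss(residuals, sigma))
# The squared loss on the outlier is 1250, while the correntropy loss
# stays below sigma**2 = 1, so a single outlier cannot dominate the risk.
```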
This paper aims at a rigorous analysis of distributed gradient descent MCC within the framework of reproducing kernel Hilbert spaces (RKHSs). Let $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel [22], i.e., a continuous, symmetric and positive semi-definite function. A kernel K is said to be positive semi-definite if the matrix $\big(K(x_i, x_j)\big)_{i,j=1}^{\ell}$ is positive semi-definite for any finite set $\{x_1, \ldots, x_\ell\} \subset \mathcal{X}$ and any $\ell \in \mathbb{N}$. The RKHS $(\mathcal{H}_K, \|\cdot\|_K)$ associated with the Mercer kernel K is defined to be the completion of the linear span of the set of functions $\{K_x := K(x, \cdot): x \in \mathcal{X}\}$ with the inner product given by $\langle K_x, K_{x'} \rangle_K = K(x, x')$. It has the reproducing property
$$f(x) = \langle f, K_x \rangle_K \qquad (3)$$
for any $f \in \mathcal{H}_K$ and $x \in \mathcal{X}$. Denote $\kappa := \sup_{x \in \mathcal{X}} \sqrt{K(x,x)}$. By the property (3), we get that
$$\|f\|_\infty \le \kappa \|f\|_K, \qquad \forall f \in \mathcal{H}_K. \qquad (4)$$
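As a minimal numerical illustration of these definitions (the Gaussian kernel on $[0,1]$ and the sample points below are illustrative choices, not the paper's setting), the Gram matrix of a Mercer kernel on any finite point set is positive semi-definite, and the squared RKHS norm of a function in the span of kernel sections is the quadratic form of its coefficient vector.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=5.0):
    """Gaussian (RBF) Mercer kernel K(x, y) = exp(-gamma * (x - y)^2)."""
    return np.exp(-gamma * (x - y) ** 2)

# Positive semi-definiteness of the Gram matrix on a finite point set.
x = np.linspace(0.0, 1.0, 8)
G = gaussian_kernel(x[:, None], x[None, :])
print("smallest Gram eigenvalue:", np.linalg.eigvalsh(G).min())  # >= 0 up to rounding

# RKHS inner product on the span of kernel sections:
# for f = sum_i c_i K_{x_i}, the squared norm is ||f||_K^2 = c^T G c >= 0.
c = np.random.default_rng(0).normal(size=x.shape)
print("||f||_K^2 =", float(c @ G @ c))
```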
Definition 1.
Given the sample set $D = \{(x_i, y_i)\}_{i=1}^N$, the kernel gradient descent algorithm for solving (2) can be stated iteratively, with $f_{1,D} = 0$, as
$$f_{t+1,D} = f_{t,D} + \frac{\eta_t}{N}\sum_{i=1}^N \ell_\sigma'\big(y_i - f_{t,D}(x_i)\big) K_{x_i}, \qquad t = 1, 2, \ldots, T, \qquad (5)$$
where $\eta_t > 0$ is the step size and $\ell_\sigma'(u) = u \exp\!\left(-\frac{u^2}{2\sigma^2}\right)$ is the derivative of the correntropy-induced loss $\ell_\sigma$.
The divide-and-conquer algorithm for the kernel gradient descent MCC (5) is easy to describe. Rather than operating on all N examples at once, the distributed algorithm executes the following three steps (an illustrative sketch is given after the list):
- Partition the data set D evenly and uniformly into m disjoint subsets $D_1, \ldots, D_m$.
- Perform algorithm (5) on each data set $D_j$, and obtain the local estimator $f_{T+1,D_j}$ after the T-th iteration.
- Take the average $\bar{f}_{T+1,D} = \frac{1}{m}\sum_{j=1}^m f_{T+1,D_j}$ as the final output.
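The following minimal Python sketch illustrates these three steps. It assumes the Welsch-type loss written in (2), a Gaussian kernel, a constant step size instead of a decaying schedule, and hypothetical function names (gaussian_kernel, local_gd_mcc, distributed_mcc); it is an illustration of the divide-and-conquer scheme, not the exact algorithm analyzed in the paper.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=10.0):
    """RBF kernel matrix between two 1-D sample arrays."""
    return np.exp(-gamma * (X1[:, None] - X2[None, :]) ** 2)

def local_gd_mcc(x, y, T=200, eta=0.5, sigma=1.0, gamma=10.0):
    """Kernel gradient descent for MCC on one local subset.

    The estimator is kept in the form f(u) = sum_i alpha_i K(x_i, u), and each
    step moves along the gradient of the empirical correntropy objective
    (the constant 1/sigma^2 factor is absorbed into the step size eta).
    """
    n = len(x)
    K = gaussian_kernel(x, x, gamma)
    alpha = np.zeros(n)                      # f_1 = 0
    for _ in range(T):
        residual = y - K @ alpha             # y_i - f_t(x_i)
        weight = np.exp(-residual**2 / (2.0 * sigma**2))
        alpha += (eta / n) * weight * residual
    return alpha

def distributed_mcc(x, y, m=4, gamma=10.0, **gd_kwargs):
    """Divide-and-conquer: split the data, run local GD-MCC, average the estimators."""
    idx = np.random.default_rng(1).permutation(len(x))
    local_models = []
    for block in np.array_split(idx, m):
        xb, yb = x[block], y[block]
        local_models.append((xb, local_gd_mcc(xb, yb, gamma=gamma, **gd_kwargs)))

    def f_bar(u):
        u = np.atleast_1d(np.asarray(u, dtype=float))
        preds = [gaussian_kernel(u, xb, gamma) @ ab for xb, ab in local_models]
        return np.mean(preds, axis=0)

    return f_bar

# Toy data with heavy-tailed (Student-t) noise to mimic the robust setting.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 2000)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_t(df=2, size=x.size)

f_bar = distributed_mcc(x, y, m=8)
grid = np.linspace(0.0, 1.0, 5)
print("estimate:", np.round(f_bar(grid), 2))
print("truth   :", np.round(np.sin(2 * np.pi * grid), 2))
```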
In the next section, we study the asymptotic behavior of the final estimator $\bar{f}_{T+1,D}$ and show that it can attain the minimax optimal rates over all estimators using the total data set of N samples, provided that the scaling parameter σ is chosen suitably.
2. Assumptions and Main Results
In the setting of non-parametric estimation, we denote by X the explanatory variable taking values in a compact domain $\mathcal{X}$ and by $Y \in \mathbb{R}$ a real-valued response variable. Let ρ be the underlying distribution on $\mathcal{Z} := \mathcal{X} \times \mathbb{R}$. Moreover, let $\rho_{\mathcal{X}}$ be the marginal distribution of ρ on $\mathcal{X}$ and $\rho(\cdot \mid x)$ be the conditional distribution on $\mathbb{R}$ for given $x \in \mathcal{X}$.
This work focuses on the application of MCC to regression problems, which is linked to the additive noise model
$$Y = f_\rho(X) + e,$$
where e is the noise and $f_\rho$ is the regression function, which is the conditional mean $f_\rho(x) = \mathbb{E}[Y \mid X = x]$ for $x \in \mathcal{X}$. The goal of this paper is to estimate the error between $\bar{f}_{T+1,D}$ and $f_\rho$ in the $L^2_{\rho_{\mathcal{X}}}$-metric, which is defined by $\|f\|_\rho = \left(\int_{\mathcal{X}} |f(x)|^2 \, d\rho_{\mathcal{X}}\right)^{1/2}$. For simplicity, we will use $\|\cdot\|$ to denote this norm when the meaning is clear from the context.
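To make the error metric concrete, a small sketch follows; the marginal distribution, regression function, and estimator below are purely illustrative assumptions. It approximates $\|\cdot\|_\rho^2$ by Monte Carlo sampling from the marginal distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

f_rho = lambda x: np.sin(2 * np.pi * x)              # illustrative regression function
f_hat = lambda x: np.sin(2 * np.pi * x) + 0.05 * x   # some estimator of it

# ||f_hat - f_rho||_rho^2 = E_{X ~ rho_X} |f_hat(X) - f_rho(X)|^2,
# approximated here by Monte Carlo with rho_X = Uniform[0, 1].
x = rng.uniform(0.0, 1.0, 100_000)
mse = np.mean((f_hat(x) - f_rho(x)) ** 2)
print("Monte Carlo estimate of the squared L^2_{rho_X} error:", mse)
# Exact value: integral of (0.05 x)^2 over [0, 1] = 0.0025 / 3 ≈ 0.000833.
```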
Below, we present two important assumptions, which play a vital role in carrying out the analysis. The first assumption is about the regularity of the target function $f_\rho$. Define the integral operator $L_K$ on $L^2_{\rho_{\mathcal{X}}}$ associated with K by
$$L_K f = \int_{\mathcal{X}} f(x) K_x \, d\rho_{\mathcal{X}}(x).$$
As K is a Mercer kernel on the compact domain $\mathcal{X}$, the operator $L_K$ is compact and positive. So the r-th power $L_K^r$ of $L_K$ is well defined for any $r > 0$. Our error bounds are stated in terms of the regularity of the target function $f_\rho$, given by [3,23]
$$f_\rho = L_K^r(g_\rho) \quad \text{for some } g_\rho \in L^2_{\rho_{\mathcal{X}}}. \qquad (6)$$
The condition (6) measures the regularity of $f_\rho$ and is closely related to the smoothness of $f_\rho$ when $\mathcal{H}_K$ is a Sobolev space. If (6) holds with $r \ge 1/2$, then $f_\rho$ lies in the space $\mathcal{H}_K$.
The second assumption (7) is about the capacity of $\mathcal{H}_K$, measured by the effective dimension [24,25]
$$\mathcal{N}(\lambda) = \operatorname{Tr}\big((L_K + \lambda I)^{-1} L_K\big), \qquad \lambda > 0,$$
where I is the identity operator on $L^2_{\rho_{\mathcal{X}}}$. In this paper, we assume that, for some $0 < s \le 1$ and a constant $C_0 > 0$,
$$\mathcal{N}(\lambda) \le C_0 \lambda^{-s}, \qquad \forall\, \lambda > 0. \qquad (7)$$
Note that (7) always holds with $s = 1$. For $0 < s < 1$, it is almost equivalent to the statement that the eigenvalues of $L_K$ decay at a rate $O(i^{-1/s})$. The smoother the kernel function K is, the smaller s is and the smaller the function space $\mathcal{H}_K$ becomes. In particular, if K is a Gaussian kernel, then s can be arbitrarily close to 0, as the eigenvalues of $L_K$ decay exponentially fast.
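The effective dimension has a simple empirical surrogate based on the kernel Gram matrix. The sketch below (a Gaussian kernel and a uniform design are illustrative assumptions, not the paper's setting) evaluates $\operatorname{Tr}\big(L(L + \lambda I)^{-1}\big)$ for the normalized Gram matrix $L = K/n$ over several λ, giving a feel for the decay postulated in (7).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.uniform(0.0, 1.0, n)
K = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)   # Gaussian Gram matrix
L = K / n                                            # empirical integral operator

for lam in [1e-1, 1e-2, 1e-3, 1e-4]:
    # empirical effective dimension N(lambda) = Tr(L (L + lambda I)^{-1})
    eff_dim = np.trace(L @ np.linalg.inv(L + lam * np.eye(n)))
    print(f"lambda = {lam:g}:  N(lambda) ≈ {eff_dim:.2f}")
# A fast eigenvalue decay (smooth kernel) keeps N(lambda) small, which
# corresponds to a small exponent s in assumption (7).
```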
Throughout the paper, we assume that the response is bounded, that is, $|y| \le M$ almost surely for some constant $M > 0$. We denote by $\lceil a \rceil$ the smallest integer not less than a.
Theorem 1.
Assume that (6) and (7) hold for some r and s. Take the step size $\eta_t = \eta_1 t^{-\theta}$ with suitable $\eta_1$ and θ and the iteration number T accordingly. If σ is large enough and the number of partitions m of the data set D satisfies
then, for any $\delta \in (0,1)$, with confidence at least $1-\delta$,
where the constant factor depends on
Remark 1.
The above theorem, to be proved in Section 4, exhibits concrete learning rates for the distributed estimator $\bar{f}_{T+1,D}$ (and hence for the standard estimator of (5) with $m = 1$). It implies that kernel gradient descent for MCC achieves the same learning rate on a single data set and on the distributed data sets when σ is large enough, and this rate matches the minimax optimal rate in the regression setting [24,26]. This theorem suggests that the distributed MCC does not sacrifice the convergence rate provided that the partition number m satisfies the constraint (8). Thus, the distributed MCC estimator enjoys both computational efficiency and statistical optimality.
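For orientation, a hedged illustration of the rate being referenced: under the source condition (6) with index r and the capacity condition (7) with index s, the minimax rate for the squared $L^2_{\rho_{\mathcal{X}}}$ error is commonly written in the cited literature [24,26] in the following standard form (this is the usual form from those references, not a verbatim restatement of Theorem 1).

```latex
% Standard minimax rate under (6) and (7), as commonly stated in [24,26]:
% infimum over estimators, supremum over distributions satisfying the assumptions.
\[
  \inf_{\widehat{f}}\ \sup_{\rho}\ \mathbb{E}\,\bigl\|\widehat{f} - f_\rho\bigr\|_\rho^{2}
  \;\asymp\; N^{-\frac{2r}{2r+s}},
  \qquad \text{e.g., } r = \tfrac12,\ s = \tfrac12
  \ \Longrightarrow\ N^{-\frac{2}{3}}.
\]
```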
With the help of Theorem 1, we can easily deduce the following optimal learning rate in expectation.
Corollary 1.
Assume that (6) and (7) hold for some r and s, and take $\eta_t = \eta_1 t^{-\theta}$ with suitable $\eta_1$ and θ. If σ is large enough, m satisfies (8), and T is chosen as in Theorem 1, then we have
By the confidence-based error estimate in Theorem 1, we can obtain the following almost sure convergence of the distributed gradient descent algorithm for MCC.
Corollary 2.
Assume that (6) and (7) hold for some r and s, and take $\eta_t = \eta_1 t^{-\theta}$ with suitable $\eta_1$ and θ. If σ is large enough, m satisfies (8), and T is chosen as in Theorem 1, then for arbitrary $\varepsilon > 0$, we have
3. Discussion and Conclusions
In this work, we have studied the theoretical properties and convergence behavior of a distributed kernel gradient descent MCC algorithm. As shown in Theorem 1, we derived minimax optimal error bounds for the distributed learning algorithm under the regularity condition on the regression function and the capacity condition on the RKHS. In the standard kernel gradient descent MCC algorithm ($m = 1$), the aggregate time complexity is $O(N^2 t)$ after t iterations. However, in the distributed case ($m > 1$), the aggregate time complexity reduces to $O(N^2 t / m)$ after t iterations. In conclusion, the kernel gradient descent MCC algorithm (5) with the distributed method can achieve fast convergence rates while successfully reducing algorithmic costs.
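Under the accounting used above (one pass of kernel gradient descent over n samples costs on the order of $n^2$ kernel operations, which is our reading of the complexity statement), a quick back-of-the-envelope comparison illustrates the $1/m$ reduction; the concrete numbers are only an example.

```python
# Illustrative operation counts for t iterations of kernel gradient descent,
# assuming one iteration over n samples costs on the order of n^2 operations.
N, t = 100_000, 200

for m in [1, 10, 100]:
    n = N // m                      # samples per local machine
    per_machine = n**2 * t          # cost on one machine
    aggregate = m * per_machine     # total over all m machines = N^2 t / m
    print(f"m = {m:>3}:  aggregate cost ≈ {aggregate:.2e}")
```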
When the optimization problem arises from a non-convex loss, the iteration sequence generated by gradient descent may only converge to a stationary point or a local minimizer. Note that the loss induced by correntropy is not convex, so the convergence of the gradient descent method (5) to the global minimizer is not unconditionally guaranteed, which brings difficulties to the mathematical analysis of convergence. Our analysis in Theorem 1 addresses this issue: it shows that, under the stated conditions, the iterations of the algorithm still attain the optimal learning rates, so the non-convexity of the loss does not spoil the global behavior of the iterations.
For regression problems, the distributed method has been introduced to iteration algorithms in various learning paradigms, and the minimax optimal rate has been obtained under different constraints on the partition number m. For distributed spectral algorithms [1], the upper bound on m that ensures the optimal rates is
We see from (9) that the restriction on m suffers from a saturation phenomenon, in the sense that the maximal m guaranteeing the optimal learning rate does not improve once r grows beyond a certain level. Our restriction in (8) is worse than (9) for small r but better for large r, since the upper bound in (8) increases with respect to r, which overcomes the saturation effect in (9). For the distributed kernel gradient descent algorithms with the least squares method [6] and the minimum error entropy (MEE) principle [9], the restrictions on m are improved to
and
respectively. Our bound (8) for MCC differs with (10) for least squares only up to a logarithmic term, which has little impact on the upper bound of m ensuring optimal rates, but numerical experiments show that the distributed kernel gradient descent algorithm for least squares method is inferior to that for MCC in non-Gaussian noise models [15,27,28]. Our bound (8) is the same as (11) that is applied to the MEE principle. As we know, MEE also performs well in dealing with non-Gaussian noise or heavy-tail distribution [13,29]. However, MEE belongs to pairwise learning problems that work with pairs of samples rather than single sample in MCC. Hence, the distributed kernel gradient descent algorithm for MCC has an advantage over MEE in algorithmic complexity.
Several related questions are worthwhile for future research. First, our distributed result provides the optimal rates by requiring a large scaling parameter σ. In practice, a moderate σ may be enough to ensure good learning performance in robust estimation, as shown by [17]. It is therefore of interest to investigate the convergence properties of the distributed version of algorithm (5) when σ is chosen as a constant or varies with N as N approaches infinity.
Secondly, our algorithm is carried out in the framework of supervised learning; however, in numerous real-world applications only few labeled data are available while a large amount of unlabeled data is given, since the cost of labeling data (in time and money) is high. Thus, we shall investigate how to enhance the learning performance of the MCC algorithm by the distributed method together with the additional information given by unlabeled data.
Thirdly, as stated in Theorem 1, the choice of the last iteration T and the partition number m depends on the parameters r and s, which are usually unknown in advance. In practice, cross-validation is usually used to tune T and m adaptively. It would be interesting to know whether the kernel gradient descent MCC (5) with the distributed method can achieve the optimal convergence rate with adaptively chosen T and m.
Last but not least, we should note here that all the data are assumed to be drawn independently from the same distribution. In the distributed method, we partition D evenly and uniformly into m disjoint subsets. This means that the subsets have equal size and each sample is assigned to each subset with the same probability. In the context of uniform random sampling, such a random splitting strategy is reasonable and practical, so our theoretical analysis is based on the uniform random splitting mechanism. However, the theoretical analysis of other random or non-random splitting mechanisms requires new mathematical tools to achieve optimal performance. This is beyond the scope of this paper and is left for future work.
4. Proofs of Main Results
This section is devoted to proving the main results in Section 2. Here and in the following, let the sample size of each local subset be n; that is, $n = N/m$ and $|D_j| = n$ for $j = 1, \ldots, m$. Define the empirical operator $L_{K,D}$ on $\mathcal{H}_K$ as
$$L_{K,D} f = \frac{1}{N}\sum_{(x,y) \in D} f(x) K_x, \qquad f \in \mathcal{H}_K,$$
where the sum runs over the samples in D. Similarly, we can define the operator $L_{K,D_j}$ on $\mathcal{H}_K$ for each subset $D_j$ as
$$L_{K,D_j} f = \frac{1}{n}\sum_{(x,y) \in D_j} f(x) K_x, \qquad f \in \mathcal{H}_K.$$
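For a function represented on the sample points, $f = \sum_j c_j K_{x_j}$, the action of the empirical operator reduces to a Gram-matrix product on the coefficient vector. A minimal numerical sketch follows; the Gaussian kernel and the coefficient representation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
N = len(x)
K = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)   # Gram matrix K(x_i, x_j)

# A function in the span of kernel sections: f = sum_j c_j K_{x_j}.
c = rng.normal(size=N)

# Empirical operator: L_{K,D} f = (1/N) sum_i f(x_i) K_{x_i}.
# Since f(x_i) = (K c)_i, the new coefficient vector is (1/N) K c.
c_new = (K @ c) / N

# Evaluate both representations at a test point to confirm the identification.
u = 0.3
k_u = np.exp(-10.0 * (x - u) ** 2)                   # vector (K(x_i, u))_i
lhs = np.sum((K @ c) * k_u) / N                      # (1/N) sum_i f(x_i) K(x_i, u)
rhs = k_u @ c_new                                    # (sum_i c_new_i K_{x_i})(u)
print(lhs, rhs)                                      # identical up to floating point
```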
4.1. Preliminaries
We first introduce some necessary lemmas in the proofs, which can be found in [3,6,9].
Lemma 1.
Let be a measurable function defined on with almost surely for some . Then each of the following estimates holds with confidence at least ,
and
where .
Let denote the polynomial defined by if and, for notational simplicity, let be the identity function. In our proof, we need to deal with the polynomial operators and . For this purpose, we introduce the conventional notation and the following preliminary lemmas.
Lemma 2.
If , then for
where is a constant depending only on θ and α, whose value is given in the proof. In particular, if , we have
Lemma 3.
If with and then for ,
Define a data-free gradient descent sequence for the least squares method in by and
It is well established in the literature [30] that, under the assumption (6) with , we have
and
where .
Lemma 4.
If with and , then there is a constant such that
and
Lemma 5.
If with and , then there is a constant such that
Recall the isomorphism between and , which yields in
4.2. Bound for the Learning Sequence
We will need the following bound for the learning sequence in the proof.
Theorem 2.
If the step size sequence $\eta_t = \eta_1 t^{-\theta}$ with suitable $\eta_1$ and θ, then we have the following bound for the learning sequence generated by (5):
Proof.
We prove the statement by induction. First, note that the conclusion holds trivially for $t = 1$. Next, suppose that the bound holds for some $t \ge 1$. By the updating rule (5) and the reproducing property, we have
where
The restriction implies . By the property of the quadratic function, we have
Plugging it into (28), we obtain
This completes the proof. □
4.3. Error Decomposition and Estimation of Error Bounds
Now we are in a position to bound the error of the distributed kernel gradient descent MCC. For this purpose, we decompose the error into two parts as
As mentioned in the previous subsection, the first term can be bounded by (21) under the assumption (6) with . The key step is to analyze the second term, which can be bounded with the help of the following proposition.
Proposition 1.
Assume that (6) holds for some Let with and For , there holds
and
where
and is given in the proof, depending on .
Proof.
For by (26), Lemmas 4 and 5,
For , by (26), Lemma 3, and the fact , we have
Similarly, we can bound as
For first note that by the bound (27) of , we see
This implies that
This, together with the estimate , gives
Following a similar process, we can obtain the bound in (31). □
The following theorem provides a bound for the second term in (29).
Theorem 3.
Take There is a constant such that
Proof.
For each subset and each , we have
This implies that
and therefore
We first estimate By (26), Lemma 3, and the choice , we obtain
For by (39) we have
The estimation of is much more complicated. We decompose it into three parts,
By Lemmas 4 and 5 and the fact , we obtain
For , by (19) we have
Now we turn to We have
By Theorem 2 and the choice , for , there holds that and
Plugging it into (43), we obtain
From Lemma 2, we see that
So, we have
Combining the estimations for and , we obtain
Now the desired bound for in (40) follows by combining the estimations for , , and and the constant is given by
This proves the theorem. □
4.4. Proofs
Now we can prove Theorem 1.
Proof.
Therefore,
and
By applying Lemma 1, for any we have with confidence at least ,
Consequently, these bounds hold simultaneously with confidence at least . This implies that, with confidence at least , there holds
and
By Lemma 1, we have with confidence at least ,
and
This, together with the bound
leads to the desired conclusion with . □
Proof of Corollary 1.
When , by Theorem 1, we have that with confidence at least . Replacing by t, then
Using the probability-to-expectation formula $\mathbb{E}[\xi] = \int_0^\infty \mathbb{P}(\xi > t)\, dt$, valid for any nonnegative random variable ξ,
with we have
where $\Gamma(\cdot)$ is the Gamma function defined for $x > 0$ by $\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du$.
The proof is complete. □
To prove Corollary 2, we need the following Borel–Cantelli lemma, which is provided in [31].
Lemma 6.
Let be a sequence of random variables in some probability space and be a sequence of positive numbers satisfying . If
then converges to a almost surely.
Proof of Corollary 2.
Let in Theorem 1; then we have
Thus, for any ,
Applying Lemma 6 with , and , we can obtain the conclusion of Corollary 2 by noting and
The proof is finished. □
Author Contributions
Validation, F.X. and S.W.; Writing (original draft), B.W.; Writing (review and editing), T.H. All authors have read and agreed to the published version of the manuscript.
Funding
The work is supported partially by the National Key Research and Development Program of China (Grant No. 2021YFA1000600) and the National Natural Science Foundation of China (Grant No.12071356).
Conflicts of Interest
The authors declare no conflict of interest.
References
- Guo, Z.C.; Lin, S.B.; Zhou, D.X. Learning theory of distributed spectral algorithms. Inverse Probl. 2017, 33, 074009.
- Mücke, N.; Blanchard, G. Parallelizing spectrally regularized kernel algorithms. J. Mach. Learn. Res. 2018, 19, 1069–1097.
- Lin, S.B.; Guo, X.; Zhou, D.X. Distributed learning with regularized least squares. J. Mach. Learn. Res. 2017, 18, 3202–3232.
- Hu, T.; Zhou, D.X. Distributed regularized least squares with flexible Gaussian kernels. Appl. Comput. Harmon. Anal. 2021, 53, 349–377.
- Zhang, Y.; Duchi, J.; Wainwright, M. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res. 2015, 16, 3299–3340.
- Lin, S.B.; Zhou, D.X. Distributed kernel-based gradient descent algorithms. Constr. Approx. 2018, 47, 249–276.
- Shamir, O.; Srebro, N. Distributed stochastic optimization and learning. In Proceedings of the 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 30 September–3 October 2014; pp. 850–857.
- Chang, X.; Lin, S.B.; Zhou, D.X. Distributed semi-supervised learning with kernel ridge regression. J. Mach. Learn. Res. 2017, 18, 1493–1514.
- Hu, T.; Wu, Q.; Zhou, D.X. Distributed kernel gradient descent algorithm for minimum error entropy principle. Appl. Comput. Harmon. Anal. 2020, 49, 229–256.
- Sun, H.; Wu, Q. Optimal rates of distributed regression with imperfect kernels. J. Mach. Learn. Res. 2021, 22, 1–34.
- Sun, Q.; Zhou, W.X.; Fan, J. Adaptive Huber regression. J. Am. Stat. Assoc. 2020, 115, 254–265.
- Feng, Y.; Wu, Q. A framework of learning through empirical gain maximization. Neural Comput. 2021, 33, 1656–1697.
- Erdogmus, D.; Principe, J.C. Comparison of entropy and mean square error criteria in adaptive system training using higher order statistics. Proc. ICA 2000, 5, 6.
- Song, Y.; Liang, X.; Zhu, Y.; Lin, L. Robust variable selection with exponential squared loss for the spatial autoregressive model. Comput. Stat. Data Anal. 2021, 155, 107094.
- Feng, Y.; Fan, J.; Suykens, J.A. A statistical learning approach to modal regression. J. Mach. Learn. Res. 2020, 21, 1–35.
- Feng, Y.; Huang, X.; Shi, L.; Yang, Y.; Suykens, J.A. Learning with the maximum correntropy criterion induced losses for regression. J. Mach. Learn. Res. 2015, 16, 993–1034.
- Feng, Y.; Ying, Y. Learning with correntropy-induced losses for regression with mixture of symmetric stable noise. Appl. Comput. Harmon. Anal. 2020, 48, 795–810.
- Gunduz, A.; Principe, J.C. Correntropy as a novel measure for nonlinearity tests. Signal Process. 2009, 89, 14–23.
- He, R.; Zheng, W.S.; Hu, B.G.; Kong, X.W. A regularized correntropy framework for robust pattern recognition. Neural Comput. 2011, 23, 2074–2100.
- Bessa, R.J.; Miranda, V.; Gama, J. Entropy and correntropy against minimum square error in offline and online three-day ahead wind power forecasting. IEEE Trans. Power Syst. 2009, 24, 1657–1666.
- Holland, P.W.; Welsch, R.E. Robust regression using iteratively reweighted least-squares. Commun. Stat.-Theory Methods 1977, 6, 813–827.
- Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337–404.
- Smale, S.; Zhou, D.X. Learning theory estimates via integral operators and their approximations. Constr. Approx. 2007, 26, 153–172.
- Caponnetto, A.; De Vito, E. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 2007, 7, 331–368.
- Steinwart, I.; Christmann, A. Support Vector Machines; Springer Science & Business Media: Berlin, Germany, 2008.
- Blanchard, G.; Mücke, N. Optimal rates for regularization of statistical inverse learning problems. Found. Comput. Math. 2018, 18, 971–1013.
- Santamaría, I.; Pokharel, P.P.; Principe, J.C. Generalized correlation function: Definition, properties, and application to blind equalization. IEEE Trans. Signal Process. 2006, 54, 2187–2197.
- Liu, W.; Pokharel, P.P.; Principe, J.C. Correntropy: Properties and applications in non-Gaussian signal processing. IEEE Trans. Signal Process. 2007, 55, 5286–5298.
- Principe, J.C. Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010.
- Yao, Y.; Rosasco, L.; Caponnetto, A. On early stopping in gradient descent learning. Constr. Approx. 2007, 26, 289–315.
- Durrett, R. Probability: Theory and Examples; Cambridge University Press: Cambridge, UK, 2017.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).