Distributed Robust Algorithms with Dependent Sampling

Baobin Wang; Ting Hu; Liangzhen Lei

doi:10.3390/math13233813

,

and

¹

Hubei Province Key Laboratory of Systems Science in Metallurgical Process, Wuhan University of Science and Technology, Wuhan 430081, China

²

School of Mathematics and Statistics, South-Central MinZu University, Wuhan 430074, China

³

School of Management, Xi’an Jiaotong University, Xi’an 710049, China

⁴

School of Mathematical Sciences, Capital Normal Univeristy, Beijing 100048, China

Mathematics2025, 13(23), 3813;https://doi.org/10.3390/math13233813

This article belongs to the Special Issue Computational Statistics with Applications

Version Notes

Order Reprints

Abstract

Robust algorithms have been widely used and intensively studied in the communities of engineering, statistics, and machine learning since such algorithms are less sensitive to outliers and effective in addressing the issue of non-Gaussian noise during the learning process. In this paper we study the learning performance of a distributed robust algorithm with mixing dependent samples, where big data are collected distributively and have a dependence structure. Learning rates are derived by means of an integral operator decomposition technique and probability inequalities in Hilbert spaces. The results show that with a suitable robustification parameter, the performance of the distributed robust algorithm is comparable with that of its non-distributed counterpart, even if the dependent feature restricts the availability and the effective amount of data.

Keywords:

robustness; dependent samples; distributed learning; integral operator; learning rates

MSC:

62J02

1. Introduction

Regression analysis plays an important role in the framework of supervised learning, which aims at estimating the function relations between input data and output data. The ordinary least-squares (OLS) method is widely employed for regression learning in real applications. However, when the sampling data are contaminated by high levels of noise, OLS behaves poorly since it belongs to two-moment statistical methods and cannot capture the high-moment information contained in the data. To tackle this problem, new learning methods are increasingly developed that are robust to heavy-tailed data or other potential forms of contamination. Among them, various robust loss functions are proposed to replace the least squares loss function, which can overcome the limitation of OLS caused by the characteristic of second-order statistics (see [,,,,,,,,]). In this paper we focus on the robust loss functions of the form

\begin{matrix} V_{h} (z) = G (\frac{z^{2}}{h^{2}}) \end{matrix}

(1)

induced by a windowing function

G : R \to R_{+}

and the robustification parameter

h .

Here the windowing function G is required to satisfy the following two conditions:

For any $s \geq 0, G^{'} (s) > 0$ and $G (s) \geq G (0)$ . In addition, $C_{G} : = {sup}_{s > 0} | G^{'} (s) | < \infty$ .
There exists some $p > 0$ such that for some constant $c_{p} > 0$ ,

$\begin{matrix} | G^{'} (s) - G^{'} (0) | \leq c_{p} {| s |}^{p}, s > 0 . \end{matrix}$

(2)

Some commonly used robust loss functions fall into this category and are listed below.

Fair loss function: $V_{h} (z) = \frac{| z |}{h} - log (1 + \frac{| z |}{h}), G (s) = \sqrt{s} = log (1 + \sqrt{s}), p = \frac{1}{2}$ , $c_{p} = \frac{1}{2}$ ;
Cauchy loss function: $V_{h} (z) = log (1 + \frac{z^{2}}{2 h^{2}}), G (s) = log (1 + \frac{s}{2}), p = 1, c_{p} = \frac{1}{4}$ ;
Correntropy loss function: $V_{h} (z) = 1 - exp \{- \frac{z^{2}}{2 h^{2}}\}, G (s) = 1 - exp {- \frac{s}{2}}, p = 1$ , $c_{p} = \frac{1}{4}$ ;
Huber loss function: $V_{h} (z) = I_{| z | \leq h} \frac{z^{2}}{2 h^{2}} + I_{| z | > h} (\frac{| z |}{h} - \frac{1}{2}), G (s) = I_{| z | \leq h} \frac{s}{2} + I_{| z | > h} (s^{\frac{1}{2}} - \frac{1}{2}),$ $p = 0, c_{p} = \frac{1}{2}$ .

We can easily check that the windowing function G has the redescending property []. That is, when

| s | \leq h

, the loss is convex and behaves like the least-squares loss; when

| s | \geq h

the loss function tends to be concave and rapidly decreases to be flat as

| s |

goes far from zero. Therefore, with a suitable chosen robustification parameter h, robust loss can completely reject gross outliers while maintaining a prediction accuracy similar to that of least-squares losses. In this work, we are interested in the application of robust loss in regression problems, linked to the data generation model given as

\begin{matrix} Y = f^{*} (X) + ε, E [ε | X = x] = 0 . \end{matrix}

(3)

Here X is the input variable that takes values in a separable metric space

X

and

Y \in Y \subset R

stands for the output variable;

ε

is the noise of the model, having a conditional mean of zero with respect to given

X = x

. The main purpose of the regression problem is to estimate

f^{*}

according to a set of sampling data

D = {(x_{i}, y_{i})}_{i = 1}^{m} \subset Z : = X \times Y

generated by model (3).

Distributed learning has received considerable attention for the rapid expansion of computing capacities in the era of big data. Due to privacy concerns and communication cost, this paper studies a distributed robust algorithm with one communication round, based on the divide-and-conquer principle (DC). DC starts by using a massive data set that is stored distributively in local machines or dividing the whole data set into multiple subsets that are distributed to local machines, then takes a base algorithm to analyze the local data set, and finally averages all local estimators by only one communication to the master machine. It is thus computationally efficient by enabling parallel computing in the local learning process and can preserve data security and privacy without mutual information communications. This scheme has been developed for many classical learning algorithms, including kernel ridge regression, bias correction, minimum error entropy, and spectral algorithms, see [,,,,,,,,].

During the data collection or assignment, the sampling data is generated from some non-trivial stochastic process with memory. In particular, when big data are collected in a temporal order, they display a dependence feature between samples. Typical examples include Monte Carlo estimation in the Markov chain, covariance matrix estimation for Markov-dependent samples, and multiarm bandit problems [,,,]. In the literature the mixing process is frequently used to model a dependence structure between samples due to its ubiquitousness in stationary stochastic processes [,], which will be described in the next section.

With the development of modern science and technology, data collection becomes easier and thus more big data have been obtained and stored. Besides the property of large scale, such data usually have temporal dependence characteristics, such as in economy, finance, biology, medicine, industry, agriculture, transportation, and other fields. It is thus worthy to investigate the learning ability of distributed algorithms in the mixing sampling process. Robust learning is widely employed for regression learning when the sampling data are contaminated by high levels of noise. Therefore, we shall investigate the interplay between the level of robustness, degree of dependence, partition number, and generalization ability. Our works will demonstrate that in robust learning, the distributed method with dependent sampling can obtain statistical optimality while reducing computation cost substantially.

The aim of this paper is to study the learning performance of distributed robust algorithms for mixing sequences of sampling data. Our theoretical results will be the capacity-dependent error bounds obtained by using a recently developed integral operator technique and probability inequalities in Hilbert spaces for mixing sequences. We prove that such a distributed learning scheme can obtain optimal learning rates in the regression setting if the robustification parameter is suitably chosen.

2. Main Results

In this section, we state our main results and discuss their relations to the existing works. For this purpose, we first introduce some notations and concepts.

2.1. Preliminaries and Problem Setup

Given the sequence of random variables

{x_{i}}_{i = 1}^{\infty}

, let

M_{j}^{t} : = σ (x_{i} : j \leq i \leq t), t \in N

be the sigma-algebra generated by the random variables

x_{j}, \dots, x_{t}

. The strong mixing condition and uniform mixing condition are defined as follows.

Definition 1.

For two σ-fields

E

and

F

, define the α-coefficient as

\begin{matrix} α (E, F) = sup_{A \in E, B \in F} | P (A ⋂ B) - P (A) P (B) | \end{matrix}

and ϕ-coefficient as

\begin{matrix} ϕ (E, F) = sup_{A \in E, B \in F} | P (A | B) - P (A) P (B) | . \end{matrix}

A set of random sequences

{x_{i}}, i \geq 1,

is said to satisfy a strongly mixing condition (or α-mixing condition) if

\begin{matrix} α_{i} = sup_{k \geq 1} α (M_{1}^{k}, M_{k + i}^{\infty}) \to 0, as i \to \infty . \end{matrix}

(4)

It satisfies a uniformly mixing condition (or ϕ-mixing condition) if

\begin{matrix} ϕ_{i} = sup_{k \geq 1} ϕ (M_{1}^{k}, M_{k + i}^{\infty}) \to 0, as i \to \infty . \end{matrix}

(5)

Remark 1.

When the sampling data

{x_{i}}_{i \geq 1}

are drawn independently, both mixing conditions hold with

α_{i} = ϕ_{i} = 0 .

According to the definitions, the strongly mixing condition is weaker than the ϕ-mixing condition. Many random processes satisfy the strongly mixing condition, for example, the stationary Markov process, which is uniformly purely non-deterministic, the stationary Gaussian sequence with a continuous spectral density that is bounded away from 0, certain ARMA processes, and some aperiodic Harris recurrent Markov processes; see [,].

In this work, we study the robust learning algorithm under the framework of reproducing kernel Hilbert space (RKHS)

H_{K}

[]. Let

K : X \times X \to

be a Mercer kernel, that is, a continuous, symmetric, and positive semi-definite function. The RKHS

H_{K}

associated with K is defined to be the completion of the linear span of the set of functions

{K_{x} : = K (x, \cdot), x \in X}

equipped with the inner product

{⟨ \cdot, \cdot ⟩}_{K}

satisfying the reproducing property

\begin{matrix} {⟨ g, K_{x} ⟩}_{K} = g (x), f o r a n y x \in X, g \in H_{K} . \end{matrix}

(6)

Denote

κ : = {sup}_{x \in X} \sqrt{k (x, x)}

[]. This property implies that

{∥ g ∥}_{\infty} \leq κ {∥ g ∥}_{K}, \forall g \in H_{K} .

Kernel methods provide efficient non-parametric learning algorithms for dealing with nonlinear features and RKHSs are used here as hypothesis spaces in the design of robust algorithms. Define the empirical robust risk over

D = {(x_{i}, y_{i})}_{i = 1}^{m} \subset Z

,

\begin{matrix} E_{D} (f) = \frac{h^{2}}{2 m} \sum_{(x, y) \in D} G (\frac{{[(f (x) - y)]}^{2}}{h^{2}}) . \end{matrix}

Definition 2.

Given a sample set

D = {(x_{i}, y_{i})}_{i = 1}^{m} \subset Z

, the regularized robust algorithm with an RKHS

H_{K}

in supervised learning is defined by

\begin{matrix} f_{D, λ} = arg min_{f \in H_{K}} E_{D} (f) + \frac{λ}{2} {∥ f ∥}_{K}^{2} \end{matrix}

(7)

where λ is a regularization parameter and h is a robustification parameter.

Distributed learning considered in the paper is applied to two real situations: 1. Data are collected and stored distributively and cannot be shared with each other due to privacy or communication cost. The typical example is continuous monitoring data from hospitals. 2. The collections of data have the time-serial property and the partitioning of data is performed sequentially. The typical example is spot price data. In the following, we describe the implementation of distributed learning in the two situations.

Based on DC, the implementation of (7) with distributed learning is described as follows:

Decompose the data sets $D$ into k disjoint subsets ${D_{ℓ}}_{ℓ = 1}^{k}$ of equal size so that each subset $D_{ℓ}$ has the sample size $| D_{ℓ} | = \frac{m}{k} = n$ .
Assign $D_{ℓ}$ to the ℓ-th local machine and produce the local estimator $f_{D_{ℓ}, λ}$ using the base algorithm (7) performing on $D_{ℓ} .$
Obtain the final estimator ${\bar{f}}_{D, λ}$ by averaging the local estimators

$\begin{matrix} {\bar{f}}_{D, λ} = \frac{1}{k} \sum_{ℓ = 1}^{k} f_{D_{ℓ}, λ} \end{matrix}$

(8)

When a sample set

D = {(x_{i}, y_{i})}_{i = 1}^{m} \subset Z

is drawn independently according to the identical distribution, it has been proven that for a broad class of algorithms, distributed learning has the same learning rates as its non-distributed counterpart as long as the local machines do not have too little training data [,,,]. When data have dependent structures, distributed regularized least-squares methods were shown in [] to perform as well as the standard least-squares methods via attaining optimal learning rates. Their theoretical analysis is based on the characteristics of the squared losses, which cannot apply to the robust loss directly since it is not necessarily convex. Therefore, the learning performance of distributed robust algorithms under the dependency condition on the sampling data is still unknown.

2.2. Learning Rates

Throughout this paper, we assume that

ρ

is a Borel probability measure on

Z = X \times Y .

The joint measure

ρ

can be decomposed into

\begin{matrix} ρ (x, y) = ρ (y | x) P_{X} (x), z = (x, y) \in Z \end{matrix}

where

ρ (y | x)

is the conditional distribution on

Y

for given

X = x

and

ρ_{X}

is the marginal distribution of

ρ

on

X

that describes the input data set

D_{X} = {x_{i}}_{i = 1}^{m}

. We assume the sample sequence

{z_{i} = (x_{i}, y_{i})}_{i}

comes from a strictly stationary process, and the dependence will be measured by the strongly mixing condition and uniformly mixing condition.

The goal of this paper is to estimate the learning error between

{\bar{f}}_{D, λ}

and the target function

f^{*}

in the

L_{P_{X}}^{2}

-space, that is

∥ {\bar{f}}_{D, λ} - f^{*} ∥_{L_{P_{X}}^{2}}^{2}

,

\begin{matrix} L_{P_{X}}^{2} : = \{f : X \to {R, ∥ f ∥}_{L_{P_{X}}^{2}} = \int_{X} {| f (x) |}^{2} d P_{X} (x) < \infty\} . \end{matrix}

We now provide some necessary assumptions with respect to an integral operator

L_{K} : L_{P_{X}}^{2} ⟶ L_{P_{X}}^{2}

associated with the kernel K by

\begin{matrix} L_{K} (f) : = \int_{X} f (x) K_{x} d P_{X}, \forall f \in L_{P_{X}}^{2} . \end{matrix}

By the reproducing property of

H_{K},

for any

f \in H_{K},

it can be expressed as

\begin{matrix} L_{K} (f) = \int_{X} {⟨ f, K_{x} ⟩}_{K} K_{x} d P_{X} (x), \forall f \in L_{P_{X}}^{2} . \end{matrix}

(9)

Since K is continuous, symmetric, and positive semi-definite,

L_{K}

is a compact positive operator of trace class and

L_{K} + λ I

is invertible for all

λ > 0

.

Our first assumption is related to the complexity of the RKHS

H_{K},

which can be measured by the concept effective dimension [], that is, the trace of the operator

{(L_{K} + λ I)}^{- 1} L_{K}

defined as

\begin{matrix} N (λ) = T r ({(L_{K} + λ I)}^{- 1} L_{K}), λ > 0 . \end{matrix}

Assumption 1.

With a parameter

0 < s \leq 1,

there exists a constant

C_{0} > 0

such that

N (λ) \leq C_{0} λ^{- s} .

Let

{μ_{i}}_{i = 1}^{\infty}

be the eigenvalues of the operator

L_{K}

; then the eigenvalues of the operator

{(L_{K} + λ I)}^{- 1} L_{K}

are

{\{\frac{μ_{i}}{μ_{i} + λ}\}}_{i = 1}^{\infty}

and the trace

T r ({(L_{K} + λ I)}^{- 1} L_{K})

is equal to

\sum_{i = 1}^{\infty} \frac{μ_{i}}{μ_{i} + λ} .

By the compactness of

X

, we know that

T r (L_{K}) < \infty .

Hence, the above assumption is always satisfied with

s = 1

by taking the constant

C_{0} = T r (L_{K}) .

When

H_{K}

is a finite rank space, for example, the linear space, the parameter s tends to 0. Furthermore, we assume that

{μ_{i}}_{i = 1}^{\infty}

decays as

μ_{i} = O (i^{- \frac{1}{b}})

for some

0 < b < 1

; then it is easy to check that

N (λ) = O (λ^{- b}) .

In fact, the effective dimension

N (λ)

is a common tool in leaning theory and spectral algorithms. It reflects the structure of the hypothesis space

H_{K}

and establishes a connection between integral operators and spectral methods. For more details, see [].

The second assumption is stated in terms of the regularity of the target function

f^{*} .

Assumption 2.

For some

r > 0

and

h_{ρ} \in L_{P_{X}}^{2},

f^{*} = L_{K}^{r} (h_{ρ}) .

Here

L_{K}^{r}

denotes the r-th power of

L_{K}

, and it is well defined since

L_{K}

is compact and positive. By the definition of

H_{K},

we know that for any

f \in L_{P_{X}}^{2}

,

{∥ f ∥}_{L_{P_{X}}^{2}} = {∥ L_{K}^{\frac{1}{2}} f ∥}_{K} .

Then if

r \geq \frac{1}{2},

it implies that

f^{*}

lies in

H_{K} .

This assumption is called the Holder source condition [] in inverse problems and characterizes the smoothness of the target function

f^{*} .

In the sequel, let

{∥ \cdot ∥}_{ρ} : = {∥ \cdot ∥}_{L_{P_{X}}^{2}}

for simplicity and

| y | \leq M

almost surely for some constant

M > 0 .

Without loss of generality, let

G^{'} (0) = 1

and

κ : = {sup}_{x \in X} \sqrt{k (x, x)} = 1

. Our main results can be stated as follows. Here we assume that each sub-data set has the same size

| D_{ℓ} | = \frac{m}{k} = n

for

1 \leq ℓ \leq k .

Theorem 1.

Define

{\bar{f}}_{D, λ}

by (8). Under Assumptions 1 and 2 with

0 < s < 1

and

\frac{1}{2} < r \leq 1,

if the sample data

D = {(x_{i}, y_{i})}_{i = 1}^{m}

satisfies the α-mixing condition, let

λ = m^{- \frac{1}{2 r + s}}, δ > 0

,

\begin{matrix} E [∥ {\bar{f}}_{D, λ} - f^{*} ∥_{ρ}] \\ \leq C_{1} k m^{- \frac{2 r + s - 1}{2 r + s}} {(1 + \sum_{ℓ = 1}^{n} α_{ℓ})}^{\frac{1}{2}} (m^{\frac{s}{4 r + 2 s}} + m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)}} {(\sum_{ℓ = 1}^{n} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}} + m^{\frac{p + \frac{1}{2}}{2 r + s}} h^{- 2 p}) \\ + C_{1} (m^{- \frac{r}{2 r + s}} + m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{m} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}} + h^{- 2 p} m^{\frac{p + 1}{2 r + s}}) . \end{matrix}

(10)

where

C_{1}

is a constant depending on the constants

C_{0}, r, s, M

(appearing in the previous necessary assumptions), independent of

m, n, k, h

(will be given in the proof).

A consequence of Theorem 1 is that the error bound for the distributed robust algorithm (8) depends on the partition number

k,

robustification parameter h, and dependence coefficients

α_{ℓ}

. In particular, we have the following learning rates.

Corollary 1.

Under the same conditions as Theorem 1 with

2 r + s > 2

, if the α-mixing coefficients satisfy

α_{ℓ} = O (ℓ^{- j})

with

j > 1

and

\begin{matrix} k < m^{\frac{2 r + s - 2}{4 r + 2 s}}, \end{matrix}

(11)

then for any arbitarily small

ϵ > 0,

\begin{matrix} E [∥ {\bar{f}}_{D, λ} - f^{*} ∥_{ρ}] \leq C_{2} (m^{- \frac{r + ϵ}{2 r + s}} + h^{- 2 p} m^{\frac{p + 1}{2 r + s}}) \end{matrix}

(12)

where

C_{2}

is a constant depending on the constants

C_{0}, r, s, M, j

(appearing in the previous necessary assumptions), independent of

m, n, k, h

(will be given in the proof).

Remark 2.

It was shown in [] that the distributed least-squares method maintains the nearly optimal learning performance

O (m^{- \frac{r + ϵ}{2 r + s}})

for small enough

ϵ > 0

provided the number of local machines k is not so large and samples are drawn according to the α-mixing process. This corollary suggests that as the robustification parameter h increases, the learning rates can improve to the best order

O (m^{- \frac{r + ϵ}{2 r + s}})

, coinciding with the results for the least-squares methods. However, further increasing of h beyond

m^{\frac{p + 1 + r}{2 p (2 r + s)}}

may not help to improve the learning rate. This shows that h should keep a balance between the prediction accuracy and the degree of robustness during the training process.

For example, when

H_{K}

is a linear space and

f^{*}

lies in

H_{K}

, then s goes to 0 and

r = \frac{1}{2} .

By Corollary 1, we deduce that the generalization error bound for the robust losses (including Huber loss) is on the order of

O (m^{- \frac{1}{2} + ϵ^{'}})

(

ϵ^{'}

is arbitrarily small) by taking h large enough and ϵ small enough.

Next, Theorem 2 concerns the error bound for the distributed robust algorithm (8) under the

ϕ

-mixing process.

Theorem 2.

Define

{\bar{f}}_{D, λ}

by (8). Under Assumptions 1,2 with

0 < s < 1

and

\frac{1}{2} < r \leq 1,

if the sample data

D = {(x_{i}, y_{i})}_{i = 1}^{m}

satisfies the ϕ-mixing condition, let

λ = m^{- \frac{1}{2 r + s}},

\begin{matrix} E [∥ {\bar{f}}_{D, λ} - f^{*} ∥_{ρ}] \\ \leq C_{3} k m^{- \frac{2 r + s - 1}{2 r + s}} {(1 + \sum_{ℓ = 1}^{n} ϕ_{ℓ})}^{\frac{1}{2}} (m^{\frac{s}{4 r + 2 s}} + m^{\frac{s}{4 r + 2 s}} {(\sum_{ℓ = 1}^{n} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}} + m^{\frac{p + \frac{1}{2}}{2 r + s}} h^{- 2 p}) \\ + C_{3} (m^{- \frac{r}{2 r + s}} + m^{- \frac{r}{2 r + s}} {(\sum_{ℓ = 1}^{m} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}} + h^{- 2 p} m^{\frac{p + 1}{2 r + s}}) . \end{matrix}

(13)

where

C_{3}

is a constant depending on the constants

C_{0}, r, s, M

(appearing in the previous necessary assumptions), independent of

m, n, k, h

(will be given in the proof).

Corollary 2.

Under the same conditions as Theorem 2 with

2 r + s > 2

, if the ϕ-mixing coefficients satisfy

ϕ_{ℓ} = O (ℓ^{- j})

with some

j > 2

and

\begin{matrix} k < m^{\frac{2 r + s - 2}{4 r + 2 s}}, \end{matrix}

(14)

then

\begin{matrix} E [∥ {\bar{f}}_{D, λ} - f^{*} ∥_{ρ}] \leq C_{4} (m^{- \frac{r}{2 r + s}} + h^{- 2 p} m^{\frac{p + 1}{2 r + s}}) \end{matrix}

(15)

where

C_{4}

is a constant depending on the constants

C_{0}, r, s, M, j

(appearing in the previous necessary assumptions), independent of

m, n, k, h

(will be given in the proof).

Remark 3.

Corollaries 1 and 2 tell us that distributed robust algorithms can achieve the same learning rate as that for independent samples if the mixing coefficients are summable, which was proved in []. Therefore, such algorithms have advantages in dealing with the large sample size and noise models.

3. Proofs

Now we are in a position of proving the consistency results stated in Section 2. First, we will give the error decomposition of

f_{D, λ}

defined in algorithm (7).

3.1. Expression of $f_{D, λ}$

Define the empirical operator

L_{K, D} : H_{K} \to H_{K}

by

\begin{matrix} L_{K, D} : = \frac{1}{m} \sum_{x \in D_{X}} {⟨ \cdot, K_{x} ⟩}_{K} K_{x}, \end{matrix}

So for any

f \in H_{K}

,

L_{K, D} f = \frac{1}{m} \sum_{x \in D_{X}} f (x) K_{x}

. Then we have the following representation for

f_{D, λ}

.

Lemma 1.

Define

f_{D, λ}

by (7). Then it satisfies

\begin{matrix} f_{D, λ} = {(L_{K, D} + λ I)}^{- 1} {\hat{f}}_{ρ, D} + {(L_{K, D} + λ I)}^{- 1} E_{D, λ} \end{matrix}

(16)

where

\begin{matrix} {\hat{f}}_{ρ, D} = \frac{1}{m} \sum_{(x, y) \in D} y K_{x} \end{matrix}

and

\begin{matrix} E_{D, λ} = \frac{1}{m} \sum_{(x, y) \in D} [G^{'} (\frac{{(f_{D, λ} (x) - y)}^{2}}{h^{2}}) - G^{'} (0)] (f_{D, λ} (x) - y) K_{x} . \end{matrix}

Proof.

Since

f_{D, λ}

is the minimizer of algorithm (7), we take the gradient of the regularized functional on

H_{K}

in (7) to give

\begin{matrix} \frac{1}{m} \sum_{(x, y) \in D} G^{'} (\frac{{(f_{D, λ} (x) - y)}^{2}}{h^{2}}) (f_{D, λ} (x) - y) K_{x} + λ f_{D, λ} = 0, \end{matrix}

or equivalently (recall the assumption

G^{'} (0) = 1

),

\begin{matrix} \frac{1}{m} \sum_{(x, y) \in D} (f_{D, λ} (x) - y) K_{x} + λ f_{D, λ} - E_{D, λ} = 0, \end{matrix}

which is

(L_{K, D} + λ I) f_{D, λ} - {\hat{f}}_{ρ, D} - E_{D, λ} = 0

.

The proof is completed. □

The quantity

E_{D, λ}

is referred to as the robust error, which is determined by the degree of robustness in the training process and can be estimated as follows. By the definition of

f_{D, λ}

in (7), we have that

E_{D} (f_{D, λ}) + λ {∥ f_{D, λ} ∥}_{K}^{2} \leq E_{D} (0)

. Recall that

C_{G} = {sup}_{s} | G^{'} (s) |

. With the fact that

G (0) < G (a)

for all

a > 0

and considering Taylor expansion,

\begin{matrix} λ ∥ f_{D, λ} ∥_{K}^{2} \leq & E_{D} (0) - E_{D} (f_{D, λ}) \leq \frac{h^{2}}{2 m} \sum_{(x, y) \in D} G (\frac{y^{2}}{h^{2}}) - \frac{h^{2} G (0)}{2} \\ \leq & \frac{C_{G}}{2 m} \sum_{(x, y) \in D} y^{2} \leq C_{G} M^{2} . \end{matrix}

It follows that

\begin{matrix} ∥ f_{D, λ} ∥_{K} \leq \sqrt{C_{G}} λ^{- \frac{1}{2}} M . \end{matrix}

(17)

By (2), we see that

\begin{matrix} ∥ E_{D, λ} ∥_{K} & \leq c_{p} h^{- 2 p} \frac{1}{m} \sum_{(x, y) \in D} {(∥ f_{D, λ} ∥_{K} + | y |)}^{2 p + 1} \\ \leq 2^{2 p} c_{p} h^{- 2 p} \frac{1}{m} \sum_{(x, y) \in D} (∥ f_{D, λ} ∥_{K}^{2 p + 1} + {| y |}^{2 p + 1}) \\ \leq 2^{2 p} c_{p} h^{- 2 p} (∥ f_{D, λ} ∥_{K}^{2 p + 1} + M^{2 p + 1}) . \end{matrix}

(18)

This in combination with the bounds (17) gives that

\begin{matrix} ∥ E_{D, λ} ∥_{K} \leq c_{p, M} (λ^{- (p + \frac{1}{2})} + 1) h^{- 2 p} \end{matrix}

(19)

where

c_{p, M} = 2^{2 p} c_{p} (C_{G}^{p + \frac{1}{2}} + 1) M^{2 p + 1}

.

3.2. Error Decomposition

To derive the explicit learning rate of the distributed algorithm (8), we define the regularization function

f_{λ}

in

H_{K}

,

\begin{matrix} f_{λ} : = arg min_{f \in H_{K}} E_{ls} (f) + λ {∥ f ∥}_{K}^{2} \end{matrix}

where

E_{ls} (f) = \int_{Z} {(f (x) - y)}^{2} d ρ

is the expected risk associated with the least-squares loss. It was proved in [] that

\begin{matrix} f_{λ} = {(L_{K} + λ I)}^{- 1} L_{K} f^{*}, \end{matrix}

(20)

so

f_{λ} - f^{*} = - λ {(L_{K} + λ I)}^{- 1} f^{*}

. Under Assumption 2 with

r > \frac{1}{2}

,

\begin{matrix} ∥ f_{λ} - f^{*} ∥_{ρ} \leq {∥ h_{ρ} ∥}_{ρ} λ^{r}, \end{matrix}

(21)

and

∥ f_{λ} ∥_{K} \leq {∥ h_{ρ} ∥}_{ρ} λ^{r - \frac{1}{2}} .

(22)

Now we state two error decompositions for

f_{D, λ} - f_{λ}

. By (20),

- L_{K, D} f_{λ} - λ f_{λ} = - L_{K, D} f_{λ} + L_{K} f_{λ} - L_{K} f^{*}

, so

\begin{matrix} - f_{λ} = {(L_{K, D} + λ I)}^{- 1} [(L_{K} - L_{K, D}) f_{λ} - L_{K} f^{*}], \end{matrix}

(23)

and we obtain the first decomposition

\begin{matrix} f_{D, λ} - f_{λ} = & {(L_{K, D} + λ I)}^{- 1} (L_{K} - L_{K, D}) f_{λ} + {(L_{K, D} + λ I)}^{- 1} ({\hat{f}}_{ρ, D} - L_{K} f^{*}) \\ + {(L_{K, D} + λ I)}^{- 1} E_{D, λ} . \end{matrix}

(24)

Recall that

λ f_{D, λ} = - L_{K, D} f_{D, λ} + {\hat{f}}_{ρ, D} + E_{D, λ}

, so

\begin{matrix} (L_{K} + λ I) f_{D, λ} = (L_{K} - L_{K, D}) (f_{D, λ} - f_{λ}) + (L_{K} - L_{K, D}) f_{λ} + {\hat{f}}_{ρ, D} + E_{D, λ}, \end{matrix}

and we obtain the second decomposition

\begin{matrix} f_{D, λ} - f_{λ} = f_{D, λ} - {(L_{K} + λ I)}^{- 1} L_{K} f^{*} = {(L_{K} + λ I)}^{- 1} [(L_{K} + λ I) f_{D, λ} - L_{K} f^{*}] \\ = {(L_{K} + λ I)}^{- 1} (L_{K} - L_{K, D}) (f_{D, λ} - f_{λ}) + {(L_{K} + λ I)}^{- 1} (L_{K} - L_{K, D}) f_{λ} \\ + {(L_{K} + λ I)}^{- 1} ({\hat{f}}_{ρ, D} - L_{K} f^{*}) + {(L_{K} + λ I)}^{- 1} E_{D, λ} . \end{matrix}

(25)

3.3. Estimates in Distributed Learning

In order to deal with the mixing sequences, we use the following two lemmas, which can be found in []. For a random variable

ξ

with values in a Hilbert space

H

and

1 \leq u \leq \infty

, denote

{∥ ξ ∥}_{u} = {(E ∥ ξ ∥}_{H}^{u})^{1 / u}

if

1 \leq u < \infty

and

{∥ ξ ∥}_{\infty} = sup {∥ ξ ∥}_{H} .

Lemma 2.

Let ξ and η be random variables with values in separable Hilbert space

H

measurable σ-fields

E

and

F

with finite u-th and v-th moments, respectively. If

1 < u, v, t < \infty

with

u^{- 1} + v^{- 1} + t^{- 1} = 1,

then

\begin{matrix} | E (ξ, η) - (E ξ, E η) | \leq 15 α^{\frac{1}{t}} (E, F) {∥ ξ ∥}_{u} {∥ η ∥}_{v} . \end{matrix}

(26)

Lemma 3.

Let ξ and η be random variables with values in separable Hilbert space

H

measurable σ-fields

E

and

F

with finite p-th and q-th moments, respectively. If

1 \leq p, q < \infty

with

p^{- 1} + q^{- 1} = 1,

then

\begin{matrix} | E (ξ, η) - (E ξ, E η) | \leq 2 ϕ^{\frac{1}{p}} (E, F) {∥ ξ ∥}_{p} {∥ η ∥}_{q} . \end{matrix}

(27)

Next, we define some notations used in this paper.

\begin{matrix} A (D, λ) : = E [∥ {(L_{k} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D}) ∥^{2}], \\ B (D, λ) : = E [∥ {(L_{k} + λ I)}^{- 1} (L_{K} + λ I) ∥], \\ C (D, λ) : = E [∥ {(L_{k} + λ I)}^{- \frac{1}{2}} (L_{K} f^{*} - {\hat{f}}_{ρ, D}) ∥_{K}^{2}], \\ I (D) : = E [∥ L_{K} - L_{K, D} ∥^{2}], \\ J (D) : = E [∥ L_{K} f^{*} - {\hat{f}}_{ρ, D} ∥_{K}^{2}] . \end{matrix}

Lemma 4.

If the sample sequence

D = {(x_{i}, y_{i})}_{i = 1}^{m}

satisfies the α-mixing condition, then for any

δ > 0,

\begin{matrix} A {(D, λ)}^{\frac{1}{2}} \leq \sqrt{\frac{N (λ)}{m}} + 4 \frac{{(N (λ))}^{\frac{1}{2 + δ}}}{λ^{\frac{δ}{2 δ + 4}}} (\sqrt{\frac{1}{m} \sum_{ℓ}^{m} {(α_{ℓ})}^{\frac{δ}{2 + δ}}}); \end{matrix}

(28)

\begin{matrix} B {(D, λ)}^{\frac{1}{2}} \leq \sqrt{\frac{2 N (λ)}{m λ}} + 6 \frac{{(N (λ))}^{\frac{1}{2 + δ}}}{λ^{\frac{δ + 1}{δ + 2}}} (\sqrt{\frac{1}{m} \sum_{ℓ}^{m} {(α_{ℓ})}^{\frac{δ}{2 + δ}}}) + \sqrt{3}; \end{matrix}

(29)

\begin{matrix} C {(D, λ)}^{\frac{1}{2}} \leq M \sqrt{\frac{N (λ)}{m}} + 4 M \frac{{(N (λ))}^{\frac{1}{2 + δ}}}{λ^{\frac{δ}{2 δ + 4}}} (\sqrt{\frac{1}{m} \sum_{ℓ}^{m} {(α_{ℓ})}^{\frac{δ}{2 + δ}}}); \end{matrix}

(30)

\begin{matrix} I {(D)}^{\frac{1}{2}} \leq \frac{1}{\sqrt{m}} {(1 + 30 \sum_{ℓ = 1}^{m} α_{ℓ})}^{\frac{1}{2}}; \end{matrix}

(31)

\begin{matrix} J {(D)}^{\frac{1}{2}} \leq \frac{M}{\sqrt{m}} {(1 + 30 \sum_{ℓ = 1}^{m} α_{ℓ})}^{\frac{1}{2}} . \end{matrix}

(32)

If the sample sequence

D = {(x_{i}, y_{i})}_{i = 1}^{m}

satisfies the ϕ-mixing condition, then

\begin{matrix} A {(D, λ)}^{\frac{1}{2}} \leq \sqrt{\frac{N (λ)}{m}} + \sqrt{\frac{2 N (λ)}{m} \sum_{ℓ = 1}^{m} ϕ_{ℓ}^{\frac{1}{2}}}; \end{matrix}

(33)

\begin{matrix} B {(D, λ)}^{\frac{1}{2}} \leq \sqrt{\frac{2 N (λ)}{m λ}} + 2 \sqrt{\frac{N (λ)}{m λ} \sum_{ℓ = 1}^{m} ϕ_{ℓ}^{\frac{1}{2}}} + \sqrt{2}; \end{matrix}

(34)

\begin{matrix} C {(D, λ)}^{\frac{1}{2}} \leq M \sqrt{\frac{N (λ)}{m}} + M \sqrt{\frac{2 N (λ)}{m} \sum_{ℓ = 1}^{m} ϕ_{ℓ}^{\frac{1}{2}}}; \end{matrix}

(35)

\begin{matrix} I {(D)}^{\frac{1}{2}} \leq \frac{1}{\sqrt{m}} {(1 + 4 \sum_{ℓ = 1}^{m} ϕ_{ℓ})}^{\frac{1}{2}}; \end{matrix}

(36)

\begin{matrix} J {(D)}^{\frac{1}{2}} \leq \frac{M}{\sqrt{m}} {(1 + 4 \sum_{ℓ = 1}^{m} ϕ_{ℓ})}^{\frac{1}{2}} . \end{matrix}

(37)

Proof.

When

D = {(x_{i}, y_{i})}_{i = 1}^{m}

satisfies the

α

-mixing, the estimates (28)–(32) can be found in []. We only consider the situation of

ϕ

-mixing. Define

η_{1} (x) = {(L_{K} + λ I)}^{- \frac{1}{2}} {⟨ \cdot, K_{x} ⟩}_{K} K_{x}, \forall x \in X .

It takes values on

H S (H_{K})

, which is the Hilbert space of Hilbert–Schmidt operators on

H_{K}

with the inner product

{⟨ A, B ⟩}_{H S} = T r (A^{T} B)

. The norm is given by

{∥ A ∥}_{H S}^{2} = \sum_{i} {∥ A e_{i} ∥}_{K}^{2}

, where

{e_{i}}_{i}

is an orthonormal basis on

H_{K}

.

H S (H_{K})

is a subspace of the space of bounded linear operators on

H_{K}

with the norm relations

\begin{matrix} ∥ A ∥ \leq {∥ A ∥}_{H S}, {∥ A B ∥}_{H S} \leq ∥ A ∥ {∥ B ∥}_{H S} . \end{matrix}

It was proved in [] that

\begin{matrix} E [∥ η_{1} {(x) ∥}_{H S}^{2}] \leq N (λ) . \end{matrix}

Thus, for any

i,

we have

\begin{matrix} E_{x_{i}} {⟨ η_{1} (x_{i}), η_{1} (x_{i}) ⟩}_{H S} = E [∥ η_{1} {(x) ∥}_{H S}^{2}] \leq N (λ) . \end{matrix}

(38)

For any

i < j

, by Lemma 3 with

p = q = 2,

\begin{matrix} |E {⟨ η_{1} (x_{i}), η_{1} (x_{j}) ⟩}_{H S} - {∥ {(L_{K} + λ I)}^{- \frac{1}{2}} L_{K} ∥}_{H S}^{2}| \leq 2 ϕ_{j - i}^{\frac{1}{2}} E [∥ η_{1} (x) ∥_{H S}^{2}] \leq 2 ϕ_{j - i}^{\frac{1}{2}} N (λ) . \end{matrix}

(39)

In the rest, denote

{\hat{E}}_{D} [η_{1}] = \sum_{i = 1}^{m} {(L_{K} + λ I)}^{- \frac{1}{2}} {⟨ \cdot, K_{x_{i}} ⟩}_{K} K_{x_{i}}

. Then

\begin{matrix} E [∥ {(L_{k} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D}) ∥^{2}] = E [{⟨ E [η_{1}] - {\hat{E}}_{D} [η_{1}], E [η_{1}] - {\hat{E}}_{D} [η_{1}] ⟩}_{H S}] \\ = E [∥ {\hat{E}}_{D} [η_{1}] ∥_{H S}^{2}] - {∥ {(L_{K} + λ I)}^{- \frac{1}{2}} L_{K} ∥}_{H S}^{2} . \end{matrix}

(40)

Note that by (38) and (39),

\begin{matrix} E [∥ {\hat{E}}_{D} [η_{1}] ∥_{H S}^{2}] = \frac{1}{m^{2}} \sum_{i = 1}^{m} E [∥ η_{1} (x_{i}) ∥_{H S}^{2}] + \frac{2}{m^{2}} \sum_{i < j} E [⟨ η_{1} (x_{i}), η_{1} (x_{j}) ⟩] \\ \leq \frac{N (λ)}{m} + {∥ {(L_{K} + λ I)}^{- \frac{1}{2}} L_{K} ∥}_{H S}^{2} + \frac{2 N (λ)}{m^{2}} \sum_{1 \leq i < j \leq m} ϕ_{j - i}^{\frac{1}{2}} . \end{matrix}

Putting the estimate above into (40) yields

\begin{matrix} E [∥ {(L_{K} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D}) ∥^{2}] \leq \frac{N (λ)}{m} + + \frac{2 N (λ)}{m} \sum_{1 \leq ℓ \leq m} ϕ_{ℓ}^{\frac{1}{2}} . \end{matrix}

(41)

Then the estimate (33) holds.

Now we turn to estimating (34).

\begin{matrix} E [{(L_{K, D} + λ I)}^{- 1} (L_{K} + λ I)] \\ \leq \frac{1}{λ} E [∥ {(L_{k} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D}) ∥^{2}] + \frac{1}{\sqrt{λ}} {(E [∥ {(L_{K} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D}) ∥^{2}])}^{\frac{1}{2}} + 1 \\ \leq \frac{2}{λ} E [∥ {(L_{k} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D}) ∥^{2}] + 2 \\ \leq \frac{2 N (λ)}{m λ} + + \frac{4 N (λ)}{m λ} \sum_{1 \leq ℓ \leq m} ϕ_{ℓ}^{\frac{1}{2}} + 2 . \end{matrix}

where the last inequality is obtained by (41). Formula (34) can be obtained directly.

We proceed to estimating (35). Define the random variable

η_{3} (z) = {(L_{K} + λ)}^{- \frac{1}{2}} y K_{x}, z = (x, y) \in Z .

It takes values on

H_{K}

, and

\begin{matrix} E [η_{3} (z)] = {(L_{K} + λ I)}^{- \frac{1}{2}} L_{K} f^{*} \end{matrix}

Denote the empirical form of

E [η_{3}]

as

{\hat{E}}_{D} [η_{3}] = \sum_{i = 1}^{m} {(L_{K} + λ I)}^{- \frac{1}{2}} {\hat{f}}_{ρ, D}

. It is easy to check that

∥ η_{s} {(z) ∥}_{K} \leq \frac{M}{\sqrt{λ}}

and

E [∥ η_{3} (z) ∥_{K}^{2} \leq M^{2} N (λ)

. For any

i < j

, by Lemma 3 with

p = q = 2,

\begin{matrix} |E {⟨ η_{3} (x_{i}), η_{3} (x_{j}) ⟩}_{K} - {∥ {(L_{K} + λ I)}^{- \frac{1}{2}} L_{K} f^{*} ∥}_{H S}^{2}| \leq 2 ϕ_{j - i}^{\frac{1}{2}} E [∥ η_{3} (x) ∥_{K}^{2}] \leq 2 M^{2} ϕ_{j - i}^{\frac{1}{2}} N (λ) . \end{matrix}

(42)

Following similar procedures in estimating (33) again, we can get (35).

We are in the position of estimating (36). Define

η_{2} (x) = {⟨ \cdot, K_{x} ⟩}_{K} K_{x}

. Then for any

i,

we have

\begin{matrix} E_{x_{i}} {⟨ η_{2} (x_{i}), η_{2} (x_{i}) ⟩}_{H S} = E [∥ η_{2} {(x) ∥}_{H S}^{2}] \leq 1 \end{matrix}

(43)

due to

κ = 1 .

For any

i < j

, by Lemma 3 with

p = 1, q = \infty,

\begin{matrix} |E {⟨ η_{2} (x_{i}), η_{2} (x_{j}) ⟩}_{H S} - {∥ L_{K} ∥}_{H S}^{2}| \leq 2 ϕ_{j - i} sup_{x} [∥ η_{2} (x) ∥_{H S}] \leq 2 ϕ_{j - i} . \end{matrix}

(44)

Following similar procedures in estimating (33), we can get estimate (36).

For estimating (37), we define

η_{4} (z) = y K_{x}, z = (x, y) \in D .

Then for any

i,

we have

\begin{matrix} E_{z_{i}} {⟨ η_{4} (z_{i}), η_{4} (z_{i}) ⟩}_{K} = E [y_{i}^{2} K (x_{i}, x_{i})] \leq M^{2} . \end{matrix}

(45)

For any

i < j

, by Lemma 3 with

p = 1, q = \infty,

\begin{matrix} |E {⟨ η_{4} (z_{i}), η_{4} (z_{j}) ⟩}_{K} - {∥ L_{K} f^{*} ∥}_{K}^{2}| \leq 2 ϕ_{j - i} sup_{x} [∥ η_{4} (z) ∥_{K}] \leq 2 M ϕ_{j - i} . \end{matrix}

(46)

Following similar procedures in estimating (36), we can get estimate (37).

The proof is finished. □

Proposition 1.

Under Assumptions 1,2 with

0 < s < 1

and

\frac{1}{2} < r \leq 1,

we have

\begin{matrix} E [∥ {\bar{f}}_{D, λ} - f^{*} ∥_{ρ}] \\ \leq 2 λ^{- 1} sup_{ℓ} A {(D_{l}, λ)}^{\frac{1}{2}} ((I {(D_{ℓ})}^{\frac{1}{2}} {∥ h_{ρ} ∥}_{ρ} + J {(D_{ℓ})}^{\frac{1}{2}} + c_{p, M} (λ^{- (p + \frac{1}{2})} + 1) h^{- 2 p})) \\ + A {(D, λ)}^{\frac{1}{2}} ∥ h_{ρ} ∥_{ρ} + C {(D, λ)}^{\frac{1}{2}} + c_{p, M} (λ^{- (p + \frac{1}{2})} + 1) h^{- 2 p} + {∥ h_{ρ} ∥}_{ρ} λ^{r} . \end{matrix}

(47)

Proof.

We split

∥ {\bar{f}}_{D, λ} - f^{*} ∥_{ρ}

into

∥ {\bar{f}}_{D, λ} - f_{λ} ∥_{ρ}

and

∥ f^{*} - f_{λ} ∥_{ρ} .

First, we use the decomposition (24) and get

\begin{matrix} E [∥ f_{D, λ} - f_{λ} ∥_{K}^{2}] \leq 3 E [∥ {(L_{K, D} + λ I)}^{- 1} (L_{K} - L_{K, D}) f_{λ} ∥^{2}] \\ + 3 E [∥ {(L_{K, D} + λ I)}^{- 1} ({\hat{f}}_{ρ, D} - L_{K} f^{*}) ∥^{2}] + 3 E [∥ {(L_{K, D} + λ I)}^{- 1} E_{D, λ} ∥^{2}] \\ \leq 3 λ^{- 2} (I (D) ∥ f_{λ} ∥_{K}^{2} + J (D) + {∥ E_{D, λ} ∥}_{K}^{2}) . \end{matrix}

(48)

By the definition of

H_{k}

, we know that for any

f \in L_{ρ_{X}}^{2}

,

∥ L_{K}^{\frac{1}{2}} {f ∥}_{K} = {∥ f ∥}_{ρ}

; then

\begin{matrix} ∥ {\bar{f}}_{D, λ} - f_{λ} ∥_{ρ} = ∥ L_{K}^{\frac{1}{2}} ({\bar{f}}_{D, λ} - f_{λ}) ∥_{K} \leq {∥ {(L_{K} + λ I)}^{\frac{1}{2}} ({\bar{f}}_{D, λ} - f_{λ}) ∥}_{K} . \end{matrix}

Then by decomposition (25), we have

\begin{matrix} ∥ {\bar{f}}_{D, λ} - f_{λ} ∥_{ρ} \leq \frac{1}{k} \sum_{l = 1}^{k} {∥ {(L_{K} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D_{l}}) (f_{D_{l}, λ} - f_{λ}) ∥}_{K} \\ + {∥\frac{1}{k} \sum_{l = 1}^{k} {(L_{K} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D_{l}}) f_{λ}∥}_{K} \\ + {∥\frac{1}{k} \sum_{l = 1}^{k} {(L_{K} + λ I)}^{- \frac{1}{2}} ({\hat{f}}_{ρ, D_{l}} - L_{K} f_{ρ})∥}_{K} + λ^{- \frac{1}{2}} \frac{1}{k} \sum_{l = 1}^{k} {∥E_{D_{l}, λ}∥}_{K} \\ = \frac{1}{k} \sum_{l = 1}^{k} {∥ {(L_{K} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D_{l}}) (f_{D_{l}, λ} - f_{λ}) ∥}_{K} \\ + {∥{(L_{K} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D}) f_{λ}∥}_{K} \\ + {∥{(L_{K} + λ I)}^{- \frac{1}{2}} ({\hat{f}}_{ρ, D} - L_{K} f_{ρ})∥}_{K} + λ^{- \frac{1}{2}} \frac{1}{k} \sum_{l = 1}^{k} {∥E_{D_{l}, λ}∥}_{K} . \end{matrix}

Therefore, we have

\begin{matrix} E [∥ {\bar{f}}_{D, λ} - f_{λ} ∥_{ρ}] & \leq E [\frac{1}{k} \sum_{l = 1}^{k} {∥ {(L_{K} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D_{l}}) (f_{D_{l}, λ} - f_{λ}) ∥}_{K}] \\ + E [{∥{(L_{K} + λ I)}^{- \frac{1}{2}} (L_{K} - L_{K, D}) f_{λ}∥}_{K}] \\ + E [{∥{(L_{K} + λ I)}^{- \frac{1}{2}} ({\hat{f}}_{ρ, D} - L_{K} f_{ρ})∥}_{K}] + λ^{- \frac{1}{2}} E [\frac{1}{k} \sum_{l = 1}^{k} {∥E_{D_{l}, λ}∥}_{K}] \\ \leq sup_{ℓ} A {(D_{l}, λ)}^{\frac{1}{2}} {(E [∥ f_{D_{l}, λ} - f_{λ} ∥_{K}^{2}])}^{\frac{1}{2}} + A {(D, λ)}^{\frac{1}{2}} {∥ f_{λ} ∥}_{K} \\ + C {(D, λ)}^{\frac{1}{2}} + λ^{- \frac{1}{2}} sup_{ℓ} {∥E_{D_{l}, λ}∥}_{K} \end{matrix}

(49)

where the last inequality is obtained by the Cauchy inequality. Using (48) with

D = D_{ℓ}

and putting it into the estimate above, we have

\begin{matrix} E [∥ {\bar{f}}_{D, λ} - f^{*} ∥_{ρ}] \leq E [∥ {\bar{f}}_{D, λ} - f_{λ} ∥_{ρ}] + ∥ f^{*} - f_{λ} ∥_{ρ}] \\ \leq 2 λ^{- 1} sup_{ℓ} A {(D_{l}, λ)}^{\frac{1}{2}} {((I ({\underline{D}}_{ℓ}) ∥ f_{λ} ∥_{K}^{2} + J (D_{ℓ}) + {∥ E_{D_{ℓ}, λ} ∥}_{K}^{2}))}^{\frac{1}{2}} \\ + A {(\underline{D}, λ)}^{\frac{1}{2}} ∥ f_{λ} ∥_{K} + C {(D, λ)}^{\frac{1}{2}} + λ^{- \frac{1}{2}} sup_{ℓ} {∥E_{D_{l}, λ}∥}_{K} + {∥ f^{*} - f_{λ} ∥}_{ρ} . \end{matrix}

This together with (19) and (22) gives the desired conclusion. □

3.4. Proofs of the Main Results

Proof of Theorem 1.

We apply Proposition 1 to prove our main results. With the choice of

λ = m^{- \frac{1}{2 r + s}}

, by Assumption 1, we get

\begin{matrix} λ^{- (p + \frac{1}{2})} = m^{\frac{p + \frac{1}{2}}{2 r + s}}, \end{matrix}

\begin{matrix} A {(D, λ)}^{\frac{1}{2}} & \leq \sqrt{C_{0}} λ^{- \frac{s}{2}} m^{- \frac{1}{2}} + 4 C_{0}^{\frac{1}{δ + 2}} λ^{- \frac{δ + 2 s}{2 δ + 4}} m^{- \frac{1}{2}} {(\sum_{ℓ = 1}^{m} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}} \\ = \sqrt{C_{0}} m^{- \frac{r}{2 r + s}} + 4 C_{0}^{\frac{1}{δ + 2}} m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{m} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}} . \end{matrix}

Similarly, we obtain

\begin{matrix} C {(D, λ)}^{\frac{1}{2}} = M (\sqrt{C_{0}} m^{- \frac{r}{2 r + s}} + 6 C_{0}^{\frac{1}{δ + 2}} m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{m} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}}) . \end{matrix}

Next, by

| D_{ℓ} | = n = \frac{m}{k},

we get

\begin{matrix} A {(D_{ℓ}, λ)}^{\frac{1}{2}} & \leq \sqrt{C_{0}} λ^{- \frac{s}{2}} n^{- \frac{1}{2}} + 4 C_{0}^{\frac{1}{δ + 2}} λ^{- \frac{δ + 2 s}{2 δ + 4}} n^{- \frac{1}{2}} {(\sum_{ℓ = 1}^{n} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}} \\ = k^{\frac{1}{2}} (\sqrt{C_{0}} m^{- \frac{r}{2 r + s}} + 4 C_{0}^{\frac{1}{δ + 2}} m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{n} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}}) . \end{matrix}

Similarly, we have

\begin{matrix} I {(D_{ℓ})}^{\frac{1}{2}} = k^{\frac{1}{2}} \frac{1}{\sqrt{m}} {(1 + 30 \sum_{ℓ = 1}^{n} α_{ℓ})}^{\frac{1}{2}}; \\ J {(D_{ℓ})}^{\frac{1}{2}} = k^{\frac{1}{2}} \frac{M}{\sqrt{m}} {(1 + 30 \sum_{ℓ = 1}^{n} α_{ℓ})}^{\frac{1}{2}} . \end{matrix}

We find that

\begin{matrix} λ^{- 1} sup_{ℓ} A {(D_{l}, λ)}^{\frac{1}{2}} ((I {(D_{ℓ})}^{\frac{1}{2}} {∥ h_{ρ} ∥}_{ρ} + J {(D_{ℓ})}^{\frac{1}{2}} + c_{p, M} (λ^{- (p + \frac{1}{2})} + 1) h^{- 2 p})) \\ \leq C_{1} k m^{- \frac{2 r + s - 1}{2 r + s}} {(1 + \sum_{ℓ = 1}^{n} α_{ℓ})}^{\frac{1}{2}} (m^{\frac{s}{4 r + 2 s}} + m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)}} {(\sum_{ℓ = 1}^{n} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}} + m^{\frac{p + \frac{1}{2}}{2 r + s}} h^{- 2 p}) \end{matrix}

with

C_{1} = 30 {(\sqrt{C_{0}} + 6 C_{0}^{\frac{1}{δ + 2}})}^{2} (∥ h_{ρ} ∥_{ρ} + 1 + 2 c_{p, M})

and

\begin{matrix} A {(D, λ)}^{\frac{1}{2}} ∥ h_{ρ} ∥_{ρ} + C {(D, λ)}^{\frac{1}{2}} + c_{p, M} (λ^{- (p + \frac{1}{2})} + 1) h^{- 2 p} + {∥ h_{ρ} ∥}_{ρ} λ^{r} \\ \leq C_{1} (m^{- \frac{r}{2 r + s}} + m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{m} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}} + h^{- 2 p} m^{\frac{p + 1}{2 r + s}}) . \end{matrix}

Putting the estimates into (47) yields the desired conclusion (10).

□

Proof of Corollary 1.

We shall prove it by Theorem 1. Note that

α_{ℓ} = O (ℓ^{- j})

with

j > 1

; then there exists a constant

c_{j}

such that

\sum_{ℓ = 1}^{n} α_{ℓ}^{\frac{1}{2}} \leq c_{j}

. In addition, with arbitrarily small

ε > 0,

we choose large enough

δ = (\frac{(2 + s) ε}{2 r + s} + 4) / (j - 1)

. By simple calculation,

\begin{matrix} m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{n} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}} \leq c_{j, ε} m^{\frac{s + ε}{4 r + 2 s}} \end{matrix}

where

c_{j, ε}

is a constant depending only on

j, ε, r, s

, independent of m. Note that

m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{m} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}} \leq m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{n} α_{ℓ}^{\frac{δ}{2 + δ}})}^{\frac{1}{2}} \leq c_{j, ε} m^{\frac{s + 2 ε}{4 r + 2 s}} .

Putting the estimates above into (10), we get

\begin{matrix} E [∥ {\bar{f}}_{D, λ} - f^{*} ∥_{ρ}] & \leq C_{1} (1 + c_{j}) (1 + c_{j, ε}) k (m^{- \frac{4 r + s - 2 - 2 ε}{4 r + 2 s}} + h^{- 2 p} m^{\frac{p + \frac{1}{2}}{2 r + s}}) \\ + C_{1} (1 + c_{j, ε}) (m^{- \frac{r + ε}{2 r + s}} + h^{- 2 p} m^{\frac{p + 1}{2 r + s}}) . \end{matrix}

This together with (14) yields the desired conclusion by letting

C_{2} = C_{1} (1 + c_{j}) (1 + c_{j, ε})

. □

Proof of Theorem 2.

We also apply Proposition 1 to prove our Theorem 2. With the choice of

λ = m^{- \frac{1}{2 r + s}}

, by Assumption 1, we get

\begin{matrix} λ^{- (p + \frac{1}{2})} = m^{\frac{p + \frac{1}{2}}{2 r + s}}, \end{matrix}

\begin{matrix} A {(D, λ)}^{\frac{1}{2}} & \leq \sqrt{C_{0}} λ^{- \frac{s}{2}} m^{- \frac{1}{2}} + 2 C_{0}^{\frac{1}{δ + 2}} λ^{- \frac{δ + 2 s}{2 δ + 4}} m^{- \frac{1}{2}} {(\sum_{ℓ = 1}^{m} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}} \\ = \sqrt{C_{0}} m^{- \frac{r}{2 r + s}} + 2 C_{0}^{\frac{1}{δ + 2}} m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{m} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}} . \end{matrix}

Similarly, we obtain

\begin{matrix} C {(D, λ)}^{\frac{1}{2}} = M (\sqrt{C_{0}} m^{- \frac{r}{2 r + s}} + 2 C_{0}^{\frac{1}{δ + 2}} m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{m} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}}) . \end{matrix}

Next, by

| D_{ℓ} | = n = \frac{m}{k},

we get

\begin{matrix} A {(D_{ℓ}, λ)}^{\frac{1}{2}} & \leq \sqrt{C_{0}} λ^{- \frac{s}{2}} n^{- \frac{1}{2}} + 2 C_{0}^{\frac{1}{δ + 2}} λ^{- \frac{δ + 2 s}{2 δ + 4}} n^{- \frac{1}{2}} {(\sum_{ℓ = 1}^{n} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}} \\ = k^{\frac{1}{2}} (\sqrt{C_{0}} m^{- \frac{r}{2 r + s}} + 2 C_{0}^{\frac{1}{δ + 2}} m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{n} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}}) . \end{matrix}

Similarly, we have

\begin{matrix} I {(D_{ℓ})}^{\frac{1}{2}} = k^{\frac{1}{2}} \frac{1}{\sqrt{m}} {(1 + 4 \sum_{ℓ = 1}^{n} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}}; \\ J {(D_{ℓ})}^{\frac{1}{2}} = k^{\frac{1}{2}} \frac{M}{\sqrt{m}} {(1 + 4 \sum_{ℓ = 1}^{n} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}} . \end{matrix}

Then,

\begin{matrix} λ^{- 1} sup_{ℓ} A {(D_{l}, λ)}^{\frac{1}{2}} ((I {(D_{ℓ})}^{\frac{1}{2}} {∥ h_{ρ} ∥}_{ρ} + J {(D_{ℓ})}^{\frac{1}{2}} + c_{p, M} (λ^{- (p + \frac{1}{2})} + 1) h^{- 2 p})) \\ \leq C_{3} k m^{- \frac{2 r + s - 1}{2 r + s}} {(1 + \sum_{ℓ = 1}^{n} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}} (m^{\frac{s}{4 r + 2 s}} + m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)}} {(\sum_{ℓ = 1}^{n} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}} + m^{\frac{p + \frac{1}{2}}{2 r + s}} h^{- 2 p}) \end{matrix}

with

C_{3} = 16 {(\sqrt{C_{0}} + 4 C_{0}^{\frac{1}{δ + 2}})}^{2} (∥ h_{ρ} ∥_{ρ} + 1 + 2 c_{p, M})

and

\begin{matrix} A {(D, λ)}^{\frac{1}{2}} ∥ h_{ρ} ∥_{ρ} + C {(D, λ)}^{\frac{1}{2}} + c_{p, M} (λ^{- (p + \frac{1}{2})} + 1) h^{- 2 p} + {∥ h_{ρ} ∥}_{ρ} λ^{r} \\ \leq C_{3} (m^{- \frac{r}{2 r + s}} + m^{\frac{2 s + δ}{(2 r + s) (2 δ + 4)} - \frac{1}{2}} {(\sum_{ℓ = 1}^{m} ϕ_{ℓ}^{\frac{1}{2}})}^{\frac{1}{2}} + h^{- 2 p} m^{\frac{p + 1}{2 r + s}}) . \end{matrix}

Putting the estimates into (47) yields the desired conclusion (13).

□

Proof of Corollary 2.

Note that

\sum_{ℓ = 1}^{n} ϕ_{ℓ}^{\frac{1}{2}} < c

for some

c > 0

due to

ϕ_{ℓ} = O (ℓ^{- j})

for

j > 2

. Then following similar procedures to those in the proof of Corollary 1, we get

\begin{matrix} E [∥ {\bar{f}}_{D, λ} - f^{*} ∥_{ρ}] & \leq C_{3} {(1 + c^{\frac{1}{2}})}^{2} k (m^{- \frac{4 r + s - 2}{4 r + 2 s}} + h^{- 2 p} m^{\frac{p + \frac{1}{2}}{2 r + s}}) \\ + C_{3} (1 + c^{\frac{1}{2}}) (m^{- \frac{r}{2 r + s}} + h^{- 2 p} m^{\frac{p + 1}{2 r + s}}) . \end{matrix}

The restriction of k implies that

k m^{- \frac{4 r + s - 2}{4 r + 2 s}} \leq m^{- \frac{r}{2 r + s}}

. Then the desired conclusion holds with

C_{4} = 2 C_{3} {(1 + c^{\frac{1}{2}})}^{2}

.

□

4. Experimental Setup

We consider the non-parametric regression model

Y_{t} = f^{*} (X_{t}) + ε_{t},

where the regression function is chosen as

f^{*} (x) = sin (2 π x) + 0.3 cos (4 π x),

which is infinitely smooth and therefore satisfies the regularity assumptions of our theory for any

r > 0

. The noise variables are generated independently as

ε_{t} \sim N (0, 0 . 1^{2})

and are independent of the covariates

X_{t}

.

To instantiate the theoretical rates in a concrete manner, we fix the smoothness, capacity, and robustness indices as

r = 1, s = \frac{1}{2} .

These choices correspond to a moderately smooth regression function, a kernel with effective dimension exponent

s = 1 / 2

, and a light-tailed noise regime compatible with the Huber loss used in the algorithm.

Given

(r, s, p)

, the theoretically guided tuning parameters for distributed robust regression are set as

λ_{m} = 0.1 m^{- 1 / (2 r + s)} and h_{m} = m^{γ_{h}}, γ_{h} = \frac{p + 1 + r}{2 p (2 r + s)} = \frac{3}{5},

which agree with the convergence theory developed in this paper. All experiments employ the Gaussian kernel with bandwidth parameter

γ = 1

and the Huber loss.

For each sample size m, we draw a training sample

{(X_{t}, Y_{t})}_{t = 1}^{m}

under the specified dependence structure and compute the distributed estimator using k machines, as described earlier. Each configuration is repeated 100 independent times. For evaluation, we generate an independent test set of size 500 and report the mean test MSE together with

\pm 1

standard deviation. The only quantity that varies in Experiment 1 is the dependence structure of the covariates

(X_{t})

; all other aspects of the data generating process and all tuning parameters are kept fixed to ensure a fair comparison.

The covariates follow the stationary Gaussian AR(1) model

X_{t} = ϕ X_{t - 1} + η_{t}, η_{t} \sim N (0, 0 . 5^{2}),

where the dependence strength is varied by

ϕ \in {0.3, 0.6, 0.9}

. For

| ϕ | < 1

, the process is geometrically

α

-mixing, and larger

ϕ

corresponds to slower mixing.

To obtain a process that is provably

Φ

-mixing, we generate

X_{t}

from a finite-state, irreducible, aperiodic Markov chain. Let the state space be five equally spaced points

Z = {z_{1}, \dots, z_{5}} \subset [- 2, 2] .

Given a parameter

ρ \in [0, 1)

, the transition matrix is defined as

P = (1 - ρ) \frac{1}{5} m_{1} m_{1}^{⊤} + ρ I_{5} .

That is, with probability

ρ

the chain stays in its current state, and with probability

1 - ρ

it jumps uniformly to any state in

Z

. This construction is well known to yield a uniformly ergodic (hence geometrically

Φ

-mixing) Markov chain, and the parameter

ρ

directly controls the dependence strength: larger

ρ

implies slower mixing. We consider

ρ \in {0.1, 0.5, 0.9}

.

4.1. Experiment 1: Effect of Dependence Strength

The sample sizes are

m \in {1000, 2000, 5000, 10000, 20000},

and all aspects of the data generating process and estimator are held fixed, except for the temporal dependence strength of the covariates

(X_{t})

. For

α

-mixing, the strength is controlled by the AR(1) coefficient

ϕ \in {0.3, 0.6, 0.9}

; for

Φ

-mixing, the strength is controlled by the Markov chain persistence parameter

ρ \in {0.1, 0.5, 0.9}

. Each configuration is repeated 100 times, and the test set contains 500 samples.

Figure 1 shows that the test MSE is monotone in the dependence parameter: smaller

ϕ

or

ρ

produces lower error. This is consistent with the fact that stronger dependence corresponds to slower decay of the mixing coefficients, which enlarges the variance component of the risk.

Figure 1. Experiment 1: Effect of dependence strength under

α

-mixing (left) and

Φ

-mixing (right). Curves show the mean test MSE over 100 repetitions, and the shaded region shows

\pm 1

standard deviation. Stronger dependence leads to larger MSE, while the nearly parallel slopes confirm the theoretical rate

m^{- r / (2 r + s)}

.

When plotted against

log m

, all curves decay almost linearly, indicating a power-law rate close to the theoretical

m^{- r / (2 r + s)}

. That is, dependence modifies only the multiplicative constant while the convergence rate remains unchanged. Although AR(1) processes and finite-state Markov chains generate different types of temporal dependence (

α

-mixing versus

Φ

-mixing), both exhibit identical qualitative behavior: stronger dependence increases the finite-sample error but does not alter the asymptotic decay rate. This supports the generality of our theory across multiple dependence frameworks.

The tuning parameters

(λ_{m}, h_{m})

depend solely on

(r, s, p)

and do not use any information about the mixing strength. The smooth decay of all curves, despite very different dependence scenarios, demonstrates the practical robustness of this theoretically motivated choice. The parallel slopes in all settings show that the proposed estimator remains stable across a wide range of temporal dependence strengths. Thus, even in moderately or strongly dependent environments, the method retains its theoretical convergence behavior and delivers reliable empirical performance.

4.2. Experiment 2: Effect of the Number of Machines

In this experiment we investigate the scalability of the proposed distributed robust regression estimator as the number of machines increases. We fix the total sample size at

m = 10000,

and the data follow the AR(1) model with dependence level

ϕ = 0.6

. The number of machines is varied over

k \in {1, 2, 3, 4, 5, 6, 10, 15, 20} .

According to Theorem 1, the distributed estimator remains minimax optimal only if k grows no faster than

k_{max} ≍ m^{(2 r + s - 2) / (4 r + 2 s)},

which equals approximately

2.5

under our choice

r = 1

and

s = 1 / 2

. When k exceeds this threshold, the local sample size

m / k

becomes too small to guarantee the stability of the bias and variance terms.

Figure 2 exhibits a clear phase transition: the MSE remains nearly flat for

k = 1

and

k = 2

, both of which lie below the predicted upper bound. Once k exceeds

k_{max}

, the error begins to increase and becomes unstable. This is consistent with the theoretical constraint that guarantees statistical optimality only when each local machine receives sufficiently many samples. For large values of k, each local machine receives only

m / k

observations (e.g., only 500 samples when

k = 20

). In this regime, the local estimators become significantly more variable, and the averaging step cannot adequately correct for the inflated variance. The resulting oscillatory behavior in the curve reflects precisely this high-variance effect.

Figure 2. Experiment 2: Test MSE as a function of the number of machines k. The red dashed line marks the theoretical upper limit

k_{max} \approx 2.5

. Test error remains stable up to

k = 2

and begins to fluctuate and increase once

k > k_{max}

, in accordance with the theoretical prediction.

The experiment provides a direct empirical confirmation of the theoretical scalability bound: distributing the data across too many machines degrades performance even when the total number of samples is fixed. The fact that the degradation occurs almost exactly beyond the predicted threshold demonstrates that the bound

k_{max}

is not merely an artifact of the analysis but accurately characterizes the practical limitation of distributed learning under dependence. The results highlight an important guideline for real-world implementations: to preserve minimax optimality, the number of computational workers must be chosen in accordance with the problem’s smoothness

(r)

and marginal dimension

(s)

. The experiment confirms that over-parallelization can harm statistical efficiency, especially in the presence of temporal dependence.

5. Conclusions

Non-Gaussian noise and dependent samples are two stylized features of big data. In this paper we studied the learning performance of distributed robust regression with

α

-mixing and

ϕ

-mixing inputs. Our analysis provides very useful information on the application of distributed kernel-sounding algorithms. It tells us that the number of local machines and robustification parameter should be selected according to the big sample size and mixing coefficients. It is worth mentioning that the selection of the robustification parameter should balance between the learning rates and the degree of robustness.

We should point out that integral operator decompositions are mostly used to analyze ordinary least-squares methods, where the loss functions are convex and derivatives of losses have a linear structure. However, the robust losses in this work are generally not convex and their derivatives do not have linear structures, which brings about essential difficulties in the technical analysis. Therefore, we introduce the robust error term to deal with the technical difficulty caused by non-convexity features of robust algorithms. We notice that the robust error term is vital in analyzing the learning performance of robust algorithms, as shown in Theorems 1 and 2. It helps us understand the difference between robust learning and OLS and explains that robust loss can completely reject gross outliers while keeping a prediction accuracy similar to that of least-squares losses.

Author Contributions

Formal analysis, L.L.; writing—original draft, T.H.; writing—review and editing, B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by Hubei Province Key Laboratory of Systems Science in Metallurgical Process (Wuhan University of Science and Technology) (Grant No. Y202303) and the National Natural Science Foundation of China (Grant No. 12471095), the Natural Science Foundation of Hubei Province in China (Grant No. 2024AFC020), and the Fundamental Research Funds for the Central Universities, South-Central MinZu University (Grant No. CZY23010).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Feng, Y.; Wu, Q. Learning under (1+ϵ)-moment conditions. Appl. Comput. Harmon. Anal. 2020, 49, 495–520. [Google Scholar] [CrossRef]
Ganan, S.; McClure, D. Bayesian image analysis: An application to single photon emission tomography. Amer. Statist. Assoc. 1985, 12–18. [Google Scholar]
Hampel, F. A general definition of qualitative robustness. Ann. Math. Stat 1971, 42, 1887–1896. [Google Scholar] [CrossRef]
Huber, P.J. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2004; Volume 523. [Google Scholar]
Mizera, I.; Müller, C.H. Breakdown points of Cauchy regression-scale estimators. Stat. Probab. Lett. 2002, 57, 79–89. [Google Scholar] [CrossRef]
Principe, J.C. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
Sun, D.; Roth, S.; Black, M.J. Secrets of optical flow estimation and their principles. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2432–2439. [Google Scholar]
Tukey, J.W. A survey of sampling from contaminated distributions. Contrib. Probab. Stat. 1960, 448–485. [Google Scholar]
Wang, X.; Jiang, Y.; Huang, M.; Zhang, H. Robust variable selection with exponential squared loss. J. Am. Stat. Assoc. 2013, 108, 632–643. [Google Scholar] [CrossRef]
Lin, S.B.; Guo, X.; Zhou, D.X. Distributed learning with regularized least squares. J. Mach. Learn. Res. 2017, 18, 1–31. [Google Scholar]
Muckec, N.; Blanchard, G. Parallelizing spectrally regularized kernel algorithms. J. Mach. Learn. Res. 2018, 19, 1–29. [Google Scholar]
Hu, T.; Wu, Q.; Zhou, D.X. Distributed kernel gradient descent algorithm for minimum error entropy principle. Appl. Comput. Harmon. Anal. 2020, 49, 229–256. [Google Scholar] [CrossRef]
Guo, Z.; Shi, L.; Wu, Q. Learning Theory of Distributed Regression with Bias Corrected Regularization Kernel Network. J. Mach. Learn. Res. 2017, 18, 1–25. [Google Scholar]
Shi, L. Distributed learning with indefinite kernels. Anal. Appl. 2020, 17, 947–975. [Google Scholar] [CrossRef]
Ouakrim, Y.; Boutaayamou, I.; Yazidi, Y.; Zafrar, A. Convergence analysis of an alternating direction method of multipliers for the identification of nonsmooth diffusion parameters with total variation. Inverse Probl. 2023, 39. [Google Scholar] [CrossRef]
Liu, J.; Shi, L. Distributed learning with discretely observed functional data. Inverse Probl. 2025, 41. [Google Scholar] [CrossRef]
Jiao, Y.; Yang, K.; Song, D. Distributed Distributionally Robust Optimization with Non-Convex Objectives. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
Allouah, Y.; Guerraoui, R.; Gupta, N.; Pinot, R.; Rizk, G. Robust Distributed Learning: Tight Error Bounds and Breakdown Point under Data Heterogeneity. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
Fan, J.; Liao, Y.; Liu, H. An overview of the estimation of large covariance and precision matrices. Econom. J. 2016, 19, C1–C32. [Google Scholar] [CrossRef]
Sun, Q.; Zhou, W.X.; Fan, J. Adaptive Huber Regression. J. Am. Stat. Assoc. 2017, 115, 254–265. [Google Scholar] [CrossRef]
Sun, H.; Wu, Q. Regularized least square regression with dependent samples. Adv. Comput. Math. 2010, 32, 175–189. [Google Scholar] [CrossRef]
Sun, Z.; Lin, S.B. Distributed Learning with Dependent Samples. Inf. Theory IEEE Trans. (T-IT) 2022, 68, 6003–6020. [Google Scholar] [CrossRef]
Li, L.; Wan, C. Support Vector Machines with Beta-Mixing Input Sequences. In International Symposium on Neural Networks; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
Mirzaei, E.; Maurer, A.; Kostic, V.R.; Pontil, M. An Empirical Bernstein Inequality for Dependent Data in Hilbert Spaces and Applications. In Proceedings of the 28th International Conference on Artificial Intelligence and Statistics, Mai Khao, Thailand, 3–5 May 2025. [Google Scholar]
Agarwal, A.; Duchi, J.C. The Generalization Ability of Online Algorithms for Dependent Data. IEEE Trans. Inf. Theory 2011, 59, 573–587. [Google Scholar] [CrossRef]
Modha, D.S.; Masry, E. Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. Inf. Theory 2002, 42, 2133–2145. [Google Scholar] [CrossRef]
Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337–404. [Google Scholar] [CrossRef]
Blanchard, G.; Mücke, N. Optimal Rates for Regularization of Statistical Inverse Learning Problems. Found. Comput. Math. 2016, 18, 971–1013. [Google Scholar] [CrossRef]
Smale, S.; Zhou, D.X. Learning Theory Estimates via Integral Operators and Their Approximations. Constr. Approx. 2007, 26, 153–172. [Google Scholar] [CrossRef]

Figure 1. Experiment 1: Effect of dependence strength under

α

-mixing (left) and

Φ

-mixing (right). Curves show the mean test MSE over 100 repetitions, and the shaded region shows

\pm 1

standard deviation. Stronger dependence leads to larger MSE, while the nearly parallel slopes confirm the theoretical rate

m^{- r / (2 r + s)}

.

Figure 2. Experiment 2: Test MSE as a function of the number of machines k. The red dashed line marks the theoretical upper limit

k_{max} \approx 2.5

. Test error remains stable up to

k = 2

and begins to fluctuate and increase once

k > k_{max}

, in accordance with the theoretical prediction.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Distributed Robust Algorithms with Dependent Sampling

Abstract

1. Introduction

2. Main Results

2.1. Preliminaries and Problem Setup

2.2. Learning Rates

3. Proofs

3.1. Expression of $f_{D, λ}$

3.2. Error Decomposition

3.3. Estimates in Distributed Learning

3.4. Proofs of the Main Results

4. Experimental Setup

4.1. Experiment 1: Effect of Dependence Strength

4.2. Experiment 2: Effect of the Number of Machines

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

Distributed Robust Algorithms with Dependent Sampling

Abstract

1. Introduction

2. Main Results

2.1. Preliminaries and Problem Setup

2.2. Learning Rates

3. Proofs

3.1. Expression of f D , λ

3.2. Error Decomposition

3.3. Estimates in Distributed Learning

3.4. Proofs of the Main Results

4. Experimental Setup

4.1. Experiment 1: Effect of Dependence Strength

4.2. Experiment 2: Effect of the Number of Machines

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

3.1. Expression of $f_{D, λ}$