1. Introduction
Regression analysis plays a central role in supervised learning, which aims to estimate the functional relation between input and output data. The ordinary least-squares (OLS) method is widely employed for regression in real applications. However, when the sampling data are contaminated by high levels of noise, OLS performs poorly: as a two-moment statistical method, it cannot capture the higher-moment information contained in the data. To tackle this problem, new learning methods that are robust to heavy-tailed data or other potential forms of contamination are being actively developed. Among them, various robust loss functions have been proposed to replace the least-squares loss, overcoming the limitation of OLS caused by its second-order-statistics character (see [1,2,3,4,5,6,7,8,9]). In this paper we focus on the robust loss functions of the form
induced by a windowing function G and the robustification parameter h. Here the windowing function G is required to satisfy the following two conditions:
Some commonly used robust loss functions fall into this category and are listed below.
Fair loss function;
Cauchy loss function;
Correntropy loss function;
Huber loss function.
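As a concrete reference, the standard forms of these four losses can be sketched as follows; the constant factors and the scaling by h follow the common conventions in the robust-statistics literature and may differ from the normalization used in this paper. Each loss behaves like the least-squares loss u^2/2 for residuals u that are small relative to h.

```python
import numpy as np

def fair_loss(u, h):
    # Fair loss: h^2 * (|u|/h - log(1 + |u|/h)); convex and smooth everywhere.
    a = np.abs(u) / h
    return h**2 * (a - np.log1p(a))

def cauchy_loss(u, h):
    # Cauchy loss: (h^2/2) * log(1 + u^2/h^2); grows only logarithmically.
    return 0.5 * h**2 * np.log1p((u / h) ** 2)

def correntropy_loss(u, h):
    # Correntropy (Welsch) loss: (h^2/2) * (1 - exp(-u^2/h^2)); bounded by h^2/2,
    # so gross outliers contribute at most a constant.
    return 0.5 * h**2 * (1.0 - np.exp(-((u / h) ** 2)))

def huber_loss(u, h):
    # Huber loss: quadratic for |u| <= h, linear beyond; convex.
    return np.where(np.abs(u) <= h, 0.5 * u**2, h * np.abs(u) - 0.5 * h**2)
```

For small residuals all four agree with u^2/2 up to higher-order terms, while for large residuals they grow linearly (Fair, Huber), logarithmically (Cauchy), or stay bounded (correntropy), which is exactly the redescending behavior discussed below.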
We can easily check that the windowing function G has the redescending property [4]. That is, when the residual is small relative to h, the loss is convex and behaves like the least-squares loss; when the residual is large, the loss tends to be concave and rapidly flattens as the residual moves far from zero. Therefore, with a suitably chosen robustification parameter h, a robust loss can completely reject gross outliers while maintaining a prediction accuracy similar to that of the least-squares loss. In this work, we are interested in the application of robust losses to regression problems, linked to the data generation model given as
Here X is the input variable taking values in a separable metric space, and Y stands for the output variable; the noise of the model has conditional mean zero given X. The main purpose of the regression problem is to estimate the target function from a set of sampling data generated by model (3).
Distributed learning has received considerable attention owing to the rapid expansion of computing capacities in the era of big data. Motivated by privacy concerns and communication costs, this paper studies a distributed robust algorithm with one communication round, based on the divide-and-conquer (DC) principle. DC starts either from a massive data set that is already stored distributively on local machines, or by dividing the whole data set into multiple subsets that are distributed to local machines; it then applies a base algorithm to each local data set, and finally averages all local estimators with only one communication to the master machine. It is thus computationally efficient, enabling parallel computing in the local learning process, and it preserves data security and privacy since no mutual information is communicated. This scheme has been developed for many classical learning algorithms, including kernel ridge regression, bias correction, minimum error entropy, and spectral algorithms; see [10,11,12,13,14,15,16,17,18].
During data collection or assignment, the sampling data may be generated from a non-trivial stochastic process with memory. In particular, when big data are collected in temporal order, they display dependence between samples. Typical examples include Markov chain Monte Carlo estimation, covariance matrix estimation for Markov-dependent samples, and multi-armed bandit problems [19,20,21,22]. In the literature, the mixing process is frequently used to model the dependence structure between samples due to its ubiquity in stationary stochastic processes [23,24]; it will be described in the next section.
With the development of modern science and technology, data collection has become easier, and ever larger data sets are being obtained and stored. Besides their large scale, such data usually exhibit temporal dependence, as in economics, finance, biology, medicine, industry, agriculture, transportation, and other fields. It is thus worth investigating the learning ability of distributed algorithms under mixing sampling processes. Robust learning is widely employed for regression when the sampling data are contaminated by high levels of noise. Therefore, we shall investigate the interplay between the level of robustness, the degree of dependence, the partition number, and the generalization ability. Our work demonstrates that in robust learning, the distributed method with dependent sampling can attain statistical optimality while reducing the computational cost substantially.
The aim of this paper is to study the learning performance of distributed robust algorithms for mixing sequences of sampling data. Our theoretical results are capacity-dependent error bounds obtained via a recently developed integral operator technique together with probability inequalities in Hilbert spaces for mixing sequences. We prove that such a distributed learning scheme attains optimal learning rates in the regression setting if the robustification parameter is suitably chosen.
2. Main Results
In this section, we state our main results and discuss their relations to the existing works. For this purpose, we first introduce some notations and concepts.
2.1. Preliminaries and Problem Setup
Given a sequence of random variables, let σ_i^j denote the sigma-algebra generated by the variables with indices from i to j. The strong mixing condition and uniform mixing condition are defined as follows.
Definition 1. For two σ-fields F_1 and F_2, define the α-coefficient as

α(F_1, F_2) = sup_{A ∈ F_1, B ∈ F_2} |P(A ∩ B) - P(A)P(B)|,

and the ϕ-coefficient as

ϕ(F_1, F_2) = sup_{A ∈ F_1, P(A) > 0, B ∈ F_2} |P(B | A) - P(B)|.

A random sequence is said to satisfy a strongly mixing condition (or α-mixing condition) if the α-coefficient between the σ-field of the past up to time j and that of the future from time j + n onward, maximized over j, tends to 0 as n → ∞. It satisfies a uniformly mixing condition (or ϕ-mixing condition) if the analogous ϕ-coefficient tends to 0 as n → ∞.

Remark 1. When the sampling data are drawn independently, both mixing conditions hold with coefficients identically zero. According to the definitions, the strongly mixing condition is weaker than the ϕ-mixing condition. Many random processes satisfy the strongly mixing condition, for example, the stationary Markov process that is uniformly purely non-deterministic, the stationary Gaussian sequence with a continuous spectral density bounded away from 0, certain ARMA processes, and some aperiodic Harris recurrent Markov processes; see [25,26].

In this work, we study the robust learning algorithm under the framework of reproducing kernel Hilbert space (RKHS)
[27]. Let K be a Mercer kernel, that is, a continuous, symmetric, and positive semi-definite function. The RKHS associated with K is defined to be the completion of the linear span of the functions K(x, ·) for x in the input space, equipped with the inner product ⟨·, ·⟩_K satisfying the reproducing property f(x) = ⟨f, K(x, ·)⟩_K. Denote κ = sup_x K(x, x)^{1/2} [10]. This property implies that ‖f‖_∞ ≤ κ‖f‖_K for every f in the RKHS. Kernel methods provide efficient non-parametric learning algorithms for dealing with nonlinear features, and RKHSs are used here as hypothesis spaces in the design of robust algorithms. Define the empirical robust risk over the RKHS as
Definition 2. Given a sample set, the regularized robust algorithm with an RKHS in supervised learning is defined by

where λ is a regularization parameter and h is a robustification parameter.

Distributed learning considered in this paper applies in two practical situations: 1. Data are collected and stored distributively and cannot be shared due to privacy concerns or communication costs; a typical example is continuous monitoring data from hospitals. 2. The data have a time-serial property and the partitioning is performed sequentially; a typical example is spot price data. In the following, we describe the implementation of distributed learning in these two situations.
Based on DC, the implementation of (7) with distributed learning is described as follows:
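As an illustration, a minimal one-dimensional sketch of the DC scheme is given below, assuming a Gaussian kernel and the Huber loss; each local machine solves the regularized robust problem by iteratively reweighted least squares (a standard solver for such losses, not necessarily the one analyzed here), and the master machine averages the k local predictors.

```python
import numpy as np

def gauss_kernel(a, b, sigma):
    # Gaussian kernel matrix K[i, j] = exp(-(a_i - b_j)^2 / (2 sigma^2)).
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma**2))

def local_robust_fit(x, y, lam, h, sigma, iters=30):
    # Kernelized Huber regression solved by iteratively reweighted least
    # squares; each step solves a weighted kernel ridge system in alpha.
    K = gauss_kernel(x, x, sigma)
    m = len(x)
    alpha = np.linalg.solve(K + lam * m * np.eye(m), y)  # least-squares start
    for _ in range(iters):
        r = y - K @ alpha                                      # residuals
        w = np.minimum(1.0, h / np.maximum(np.abs(r), 1e-12))  # Huber weights
        alpha = np.linalg.solve(w[:, None] * K + lam * m * np.eye(m), w * y)
    return alpha

def distributed_estimator(x, y, k, lam, h, sigma):
    # Divide-and-conquer: partition the sample into k blocks in the given
    # (temporal) order, fit locally, and average the k local predictors.
    blocks = np.array_split(np.arange(len(x)), k)
    fits = [(x[b], local_robust_fit(x[b], y[b], lam, h, sigma)) for b in blocks]
    return lambda t: np.mean(
        [gauss_kernel(t, xb, sigma) @ a for xb, a in fits], axis=0
    )
```

With clean data the Huber weights are all 1 and each local step reduces to kernel ridge regression; residuals larger than h receive weight h/|r| and are down-weighted, which is how gross outliers are suppressed.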
When a sample set is drawn independently from an identical distribution, it has been proven for a broad class of algorithms that distributed learning attains the same learning rates as its non-distributed counterpart as long as the local machines do not have too little training data [10,11,12,13]. When the data have dependent structures, distributed regularized least-squares methods were shown in [22] to perform as well as standard least-squares methods by attaining optimal learning rates. Their analysis relies on characteristics of the squared loss and cannot be applied directly to robust losses, which are not necessarily convex. Therefore, the learning performance of distributed robust algorithms under dependence conditions on the sampling data has remained unknown.
2.2. Learning Rates
Throughout this paper, we assume that the sampling distribution ρ is a Borel probability measure on the product of the input and output spaces. The joint measure can be decomposed into the conditional distribution of the output for a given input and the marginal distribution on the input space that describes the input data. We assume the sample sequence comes from a strictly stationary process, and the dependence will be measured by the strongly mixing condition and the uniformly mixing condition.
The goal of this paper is to estimate the learning error between the distributed estimator and the target function in the L²-space with respect to the marginal distribution, that is,
We now provide some necessary assumptions with respect to the integral operator L_K associated with the kernel K, defined by

By the reproducing property, for any function in the RKHS it can be expressed as

Since K is continuous, symmetric, and positive semi-definite, L_K is a compact positive operator of trace class, and L_K + λI is invertible for all λ > 0.
Our first assumption is related to the complexity of the RKHS, which can be measured by the concept of effective dimension [10], that is, the trace of the operator (L_K + λI)^{-1} L_K, defined as

N(λ) = Tr((L_K + λI)^{-1} L_K).
Assumption 1. With a parameter 0 < s ≤ 1, there exists a constant C_0 > 0 such that N(λ) ≤ C_0 λ^{-s} for all λ > 0.
Let {σ_i}_{i≥1} be the eigenvalues of the operator L_K; then the eigenvalues of the operator (L_K + λI)^{-1} L_K are σ_i/(σ_i + λ), and the trace N(λ) is equal to Σ_i σ_i/(σ_i + λ). By the compactness of L_K, we know that Σ_i σ_i = Tr(L_K) < ∞. Hence, the above assumption is always satisfied with s = 1 by taking the constant C_0 = Tr(L_K). When the RKHS is a finite-rank space, for example a linear space, the parameter s tends to 0. Furthermore, if we assume that σ_i decays as i^{-1/s} for some 0 < s < 1, then it is easy to check that N(λ) = O(λ^{-s}).
In fact, the effective dimension is a common tool in learning theory and spectral algorithms. It reflects the structure of the hypothesis space and establishes a connection between integral operators and spectral methods. For more details, see [21].
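The relation between eigenvalue decay and Assumption 1 can be checked numerically. The sketch below (purely illustrative, with s = 1/2) computes N(λ) from a truncated polynomially decaying spectrum and verifies that the product N(λ)λ^s stays bounded as λ decreases; the truncation at 10^6 terms is an assumption made for computation.

```python
import numpy as np

def effective_dimension(sigmas, lam):
    # N(lam) = trace((L_K + lam I)^{-1} L_K) = sum_i sigma_i / (sigma_i + lam).
    return np.sum(sigmas / (sigmas + lam))

# Polynomially decaying spectrum sigma_i = i^{-1/s} with s = 1/2, i.e. i^{-2},
# truncated at 10^6 terms (the truncation error is negligible for these lam).
s = 0.5
sigmas = np.arange(1, 10**6 + 1, dtype=float) ** (-1.0 / s)
for lam in (1e-2, 1e-3, 1e-4):
    # For sigma_i = i^{-2}, N(lam) ~ (pi/2) * lam^{-1/2}, so this product
    # approaches pi/2 as lam decreases.
    print(lam, effective_dimension(sigmas, lam) * lam**s)
```

The bounded product confirms N(λ) = O(λ^{-s}) for this spectrum, matching the discussion above.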
The second assumption is stated in terms of the regularity of the target function.

Assumption 2. For some r > 0 and R > 0, the target function can be written as L_K^r applied to a function of norm at most R.

Here L_K^r denotes the r-th power of L_K, and it is well defined since L_K is compact and positive. By the definition of L_K^r, if r ≥ 1/2, the assumption implies that the target function lies in the RKHS. This assumption is called the Hölder source condition [28] in inverse problems and characterizes the smoothness of the target function.
In the sequel, we assume for simplicity that the output variable is bounded almost surely by some constant. Without loss of generality, the relevant constants are normalized to 1. Our main results can be stated as follows. Here we assume that each sub-data set has the same size.
Theorem 1. Define the distributed estimator by (8). Under Assumptions 1 and 2, if the sample data satisfy the α-mixing condition and the regularization parameter is chosen appropriately, then

where the constant depends only on the quantities appearing in the above assumptions and will be given explicitly in the proof.

A consequence of Theorem 1 is that the error bound for the distributed robust algorithm (8) depends on the partition number, the robustification parameter h, and the dependence coefficients. In particular, we have the following learning rates.
Corollary 1. Under the same conditions as Theorem 1, if the α-mixing coefficients decay sufficiently fast and the parameters are chosen as in the statement, then for any arbitrarily small ϵ > 0,

where the constant depends only on the quantities appearing in the above assumptions and will be given in the proof.

Remark 2. It was shown in [22] that the distributed least-squares method maintains nearly optimal learning performance, provided the number of local machines k is not too large and the samples are drawn from an α-mixing process. This corollary suggests that as the robustification parameter h increases, the learning rates improve up to the best order attained by least-squares methods. However, increasing h beyond that point may not further improve the learning rate. This shows that h should balance the prediction accuracy against the degree of robustness during the training process. For example, when the hypothesis space is a linear space and the target function lies in it, then s goes to 0, and by Corollary 1 we deduce that the generalization error bound for the robust losses (including the Huber loss) is of nearly optimal order, up to an arbitrarily small exponent, by taking h large enough and ϵ small enough.
Next, Theorem 2 concerns the error bound for the distributed robust algorithm (8) under the ϕ-mixing process.
Theorem 2. Define the distributed estimator by (8). Under Assumptions 1 and 2, if the sample data satisfy the ϕ-mixing condition and the regularization parameter is chosen appropriately, then

where the constant depends only on the quantities appearing in the above assumptions and will be given in the proof.

Corollary 2. Under the same conditions as Theorem 2, if the ϕ-mixing coefficients decay sufficiently fast and the parameters are chosen as in the statement, then

where the constant depends only on the quantities appearing in the above assumptions and will be given in the proof.

Remark 3. Corollaries 1 and 2 tell us that distributed robust algorithms can achieve the same learning rate as that for independent samples, proved in [10], provided the mixing coefficients are summable. Therefore, such algorithms have advantages in dealing with large sample sizes and noisy models.

4. Experimental Setup
We consider a non-parametric regression model in which the regression function is infinitely smooth and therefore satisfies the regularity assumptions of our theory. The noise variables are generated independently and are independent of the covariates.
To instantiate the theoretical rates concretely, we fix the smoothness, capacity, and robustness indices. These choices correspond to a moderately smooth regression function, a kernel whose effective dimension has a prescribed exponent, and a light-tailed noise regime compatible with the Huber loss used in the algorithm. Given these indices, the theoretically guided tuning parameters for distributed robust regression are set in agreement with the convergence theory developed in this paper. All experiments employ the Gaussian kernel with a fixed bandwidth parameter and the Huber loss.
For each sample size m, we draw a training sample under the specified dependence structure and compute the distributed estimator using k machines, as described earlier. Each configuration is repeated 100 independent times. For evaluation, we generate an independent test set of size 500 and report the mean test MSE together with its standard deviation. The only quantity that varies in Experiment 1 is the dependence structure of the covariates; all other aspects of the data generating process and all tuning parameters are kept fixed to ensure a fair comparison.
The covariates follow the stationary Gaussian AR(1) model

where the dependence strength is varied through the autoregressive coefficient ρ. For |ρ| < 1, the process is geometrically α-mixing, and larger ρ corresponds to slower mixing.
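A minimal generator for such covariates can be sketched as follows; the unit-variance parameterization below is an assumption made for illustration.

```python
import numpy as np

def ar1_sample(m, rho, rng):
    # Stationary Gaussian AR(1): x_t = rho * x_{t-1} + sqrt(1 - rho^2) * e_t
    # with e_t ~ N(0, 1), so every x_t is marginally N(0, 1) and the lag-1
    # autocorrelation equals rho.
    x = np.empty(m)
    x[0] = rng.standard_normal()
    for t in range(1, m):
        x[t] = rho * x[t - 1] + np.sqrt(1.0 - rho**2) * rng.standard_normal()
    return x
```

Scaling the innovations by sqrt(1 - rho^2) keeps the marginal variance equal to 1 for every ρ, so changing ρ alters only the dependence strength, not the marginal law.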
To obtain a process that is provably ϕ-mixing, we generate the covariates from a finite-state, irreducible, aperiodic Markov chain. Let the state space consist of five equally spaced points. Given a persistence parameter p, the transition matrix is defined so that, with probability p, the chain stays in its current state, and with probability 1 - p it jumps uniformly to any state in the state space. This construction is well known to yield a uniformly ergodic (hence geometrically ϕ-mixing) Markov chain, and the parameter p directly controls the dependence strength: larger p implies slower mixing. We consider a range of values of p.
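The chain can be simulated directly; the placement of the five states on [-1, 1] below is an illustrative assumption, since only equal spacing is specified above.

```python
import numpy as np

def markov_chain_sample(m, p, rng):
    # With probability p stay in the current state; with probability 1 - p
    # draw the next state uniformly from all five states. The uniform law
    # is stationary because the transition matrix
    # P = p * I + (1 - p) * (1/5) * J is doubly stochastic.
    states = np.linspace(-1.0, 1.0, 5)   # five equally spaced points
    idx = np.empty(m, dtype=int)
    idx[0] = rng.integers(5)             # start from the uniform law
    for t in range(1, m):
        idx[t] = idx[t - 1] if rng.random() < p else rng.integers(5)
    return states[idx]
```

The second-largest eigenvalue of the transition matrix equals p, so the chain mixes geometrically at rate p, which makes p a direct knob for the dependence strength.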
4.1. Experiment 1: Effect of Dependence Strength
The sample sizes vary over a grid, and all aspects of the data generating process and estimator are held fixed except for the temporal dependence strength of the covariates. For α-mixing, the strength is controlled by the AR(1) coefficient ρ; for ϕ-mixing, it is controlled by the Markov chain persistence parameter p. Each configuration is repeated 100 times, and the test set contains 500 samples.
Figure 1 shows that the test MSE is monotone in the dependence parameter: smaller ρ or p produces lower error. This is consistent with the fact that stronger dependence corresponds to slower decay of the mixing coefficients, which enlarges the variance component of the risk.
When plotted on a logarithmic scale against the sample size, all curves decay almost linearly, indicating a power-law rate close to the theoretical one. That is, dependence modifies only the multiplicative constant while the convergence rate remains unchanged. Although AR(1) processes and finite-state Markov chains generate different types of temporal dependence (α-mixing versus ϕ-mixing), both exhibit identical qualitative behavior: stronger dependence increases the finite-sample error but does not alter the asymptotic decay rate. This supports the generality of our theory across multiple dependence frameworks.
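The rate check described above amounts to fitting a straight line to log-error versus log-sample-size. A sketch with hypothetical error values follows; the exponent -2/3 is purely illustrative, since the exponent predicted by the theory depends on the smoothness and capacity indices.

```python
import numpy as np

# Hypothetical test errors decaying as C * m^(-2/3) (illustrative values only).
m = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])
err = 3.0 * m ** (-2.0 / 3.0)

# The least-squares slope in log-log coordinates estimates the power-law
# exponent; the intercept recovers log of the multiplicative constant.
slope, intercept = np.polyfit(np.log(m), np.log(err), 1)
print(round(slope, 4))   # recovers the exponent: -0.6667
```

In the experiments, dependence changes only the intercept (the constant C), while the fitted slope, i.e. the convergence rate, stays the same across dependence levels.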
The tuning parameters depend solely on the sample size and the fixed smoothness and capacity indices, and they use no information about the mixing strength. The smooth decay of all curves, despite very different dependence scenarios, demonstrates the practical robustness of this theoretically motivated choice. The parallel slopes in all settings show that the proposed estimator remains stable across a wide range of temporal dependence strengths. Thus, even in moderately or strongly dependent environments, the method retains its theoretical convergence behavior and delivers reliable empirical performance.
4.2. Experiment 2: Effect of the Number of Machines
In this experiment we investigate the scalability of the proposed distributed robust regression estimator as the number of machines increases. We fix the total sample size, let the data follow the AR(1) model with a fixed dependence level, and vary the number of machines k over a grid. According to Theorem 1, the distributed estimator remains minimax optimal only if k grows no faster than a threshold determined by the smoothness and capacity indices. When k exceeds this threshold, the local sample size becomes too small to guarantee the stability of the bias and variance terms.
Figure 2 exhibits a clear phase transition: the MSE remains nearly flat for the smaller values of k, which lie below the predicted upper bound. Once k exceeds the threshold, the error begins to increase and becomes unstable. This is consistent with the theoretical constraint, which guarantees statistical optimality only when each local machine receives sufficiently many samples. For large values of k, each local machine receives only a small fraction of the data (as few as 500 samples in the most extreme configuration). In this regime, the local estimators become significantly more variable, and the averaging step cannot adequately correct for the inflated variance. The resulting oscillatory behavior in the curve reflects precisely this high-variance effect.
The experiment provides a direct empirical confirmation of the theoretical scalability bound: distributing the data across too many machines degrades performance even when the total number of samples is fixed. The fact that the degradation occurs almost exactly beyond the predicted threshold demonstrates that the bound is not merely an artifact of the analysis but accurately characterizes the practical limitation of distributed learning under dependence. The results highlight an important guideline for real-world implementations: to preserve minimax optimality, the number of computational workers must be chosen in accordance with the problem's smoothness and capacity indices. The experiment confirms that over-parallelization can harm statistical efficiency, especially in the presence of temporal dependence.
5. Conclusions
Non-Gaussian noise and dependent samples are two stylized features of big data. In this paper we studied the learning performance of distributed robust regression with α-mixing and ϕ-mixing inputs. Our analysis provides useful guidance for the application of distributed kernel-based algorithms: the number of local machines and the robustification parameter should be selected according to the sample size and the mixing coefficients. It is worth mentioning that the selection of the robustification parameter should balance the learning rates against the degree of robustness.
We should point out that integral operator decompositions are mostly used to analyze ordinary least-squares methods, where the loss functions are convex and derivatives of losses have a linear structure. However, the robust losses in this work are generally not convex and their derivatives do not have linear structures, which brings about essential difficulties in the technical analysis. Therefore, we introduce the robust error term to deal with the technical difficulty caused by non-convexity features of robust algorithms. We notice that the robust error term is vital in analyzing the learning performance of robust algorithms, as shown in Theorems 1 and 2. It helps us understand the difference between robust learning and OLS and explains that robust loss can completely reject gross outliers while keeping a prediction accuracy similar to that of least-squares losses.