Maximum Correntropy Criterion with Distributed Method

Abstract: The Maximum Correntropy Criterion (MCC) has recently triggered enormous research activity in the engineering and machine learning communities, since it is robust in the face of heavy-tailed noise and outliers in practice. This work studies distributed MCC algorithms based on a divide-and-conquer strategy, which can deal with big data efficiently. By establishing minimax optimal error bounds, we show that the averaged output function of this distributed algorithm achieves convergence rates comparable to those of the algorithm processing the total data on a single machine.


Introduction
In the big data era, the rapid expansion of data generation produces data sets of prohibitive size and complexity, which challenges many traditional learning algorithms that require access to the whole data set. Distributed learning algorithms, based on the divide-and-conquer strategy, provide a simple and efficient way to address this issue and have therefore received increasing attention. Such a strategy starts by partitioning the big data set into multiple subsets that are distributed to local machines, then obtains a local estimator on each subset by a base algorithm, and finally pools the local estimators together by simple averaging. It can substantially cut the time and memory costs of the algorithm implementation, and in many practical applications its learning performance has been shown to be as good as that of a single big machine that can use all the data. This scheme has been developed in various learning contexts, including spectral algorithms [1,2], kernel ridge regression [3][4][5], gradient descent [6,7], a semi-supervised approach [8], minimum error entropy [9] and bias correction [10].
Regression estimation and inference play an important role in the fields of data mining and statistics. The traditional ordinary least squares (OLS) method provides an efficient estimator when the regression model error is normally distributed. However, heavy-tailed noise and outliers are common in the real world, which limits the application of OLS in practice. Various robust losses have been proposed in place of the least squares loss to deal with this problem. Commonly used robust losses include the adaptive Huber loss [11], the gain function [12], minimum error entropy [13], the exponential squared loss [14], etc. Among them, the Maximum Correntropy Criterion (MCC) is widely employed as an efficient alternative to the ordinary least squares method, which is suboptimal in non-Gaussian and non-linear signal processing situations [15][16][17][18][19]. Recently, MCC has been studied extensively in the literature and is widely adopted for many learning tasks, e.g., wind power forecasting [20] and pattern recognition [19]. In this paper, we are interested in the implementation of MCC by a distributed gradient descent method in a big data setting. Note that the MCC loss function is non-convex, so its analysis is essentially different from that of the least squares method, and a rigorous analysis of distributed MCC is necessary to derive consistency and learning rates.
Given a hypothesis function f : X → Y and a scaling parameter σ > 0, the correntropy between f(X) and Y is defined by

V_σ(f) = E[G_σ(f(X) − Y)], where G_σ(u) = exp(−u²/(2σ²)) is the Gaussian kernel.

The purpose of MCC is to maximize the empirical correntropy V̂_σ over a hypothesis space H, that is,

f_{z,H} = arg max_{f ∈ H} V̂_σ(f), with V̂_σ(f) = (1/N) Σ_{i=1}^N G_σ(f(x_i) − y_i). (1)

In the statistical learning context, the loss induced by correntropy, φ_σ : R → R_+, is defined as

φ_σ(u) = σ²(1 − exp(−u²/(2σ²))),

where σ > 0 is the scaling parameter. This loss function can be viewed as a variant of the Welsch function [21], and the estimator f_{z,H} of (1) is also the minimizer of the empirical risk minimization scheme over H, that is,

f_{z,H} = arg min_{f ∈ H} (1/N) Σ_{i=1}^N φ_σ(f(x_i) − y_i). (2)

This paper aims at a rigorous analysis of distributed gradient descent MCC within the framework of reproducing kernel Hilbert spaces (RKHSs). Let K : X × X → R be a Mercer kernel [22], i.e., a continuous, symmetric and positive semi-definite function. A kernel K is said to be positive semi-definite if the matrix (K(u_i, u_j))_{i,j=1}^m is positive semi-definite for any finite set {u_1, · · · , u_m} ⊂ X and m ∈ N. The RKHS H_K associated with the Mercer kernel K is defined to be the completion of the linear span of the set of functions {K_x := K(x, ·) : x ∈ X} with the inner product ⟨·, ·⟩_K given by ⟨K_x, K_u⟩_K = K(x, u). It has the reproducing property

f(x) = ⟨f, K_x⟩_K for any f ∈ H_K and x ∈ X. (3)

Denote κ := sup_{x∈X} √(K(x, x)). By the property (3), we get that

‖f‖_∞ ≤ κ‖f‖_K for any f ∈ H_K. (4)

Definition 1. Given the sample set D = {(x_i, y_i)}_{i=1}^N ⊂ Z := X × Y, the kernel gradient descent algorithm for solving (2) can be stated iteratively with f_{1,D} = 0 as

f_{t+1,D} = f_{t,D} − (η_t / N) Σ_{i=1}^N φ′_σ(f_{t,D}(x_i) − y_i) K_{x_i}, (5)

where η_t is the step size and φ′_σ(u) = u exp(−u²/(2σ²)) is the derivative of the correntropy-induced loss.

The divide-and-conquer algorithm for the kernel gradient descent MCC (5) is easy to describe. Rather than operating on the whole set of N examples, the distributed algorithm executes the following three steps:

1.
Partition the data set D evenly and uniformly into m disjoint subsets D j , 1 ≤ j ≤ m.

2.
Perform algorithm (5) on each data set D_j, and obtain the local estimate f_{T+1,D_j} after the T-th iteration.

3.
Average the local estimators by simple averaging to produce the final estimator f̄_{T+1,D} = (1/m) Σ_{j=1}^m f_{T+1,D_j}.
In the next section, we study the asymptotic behavior of the final estimator f̄_{T+1,D} and show that f̄_{T+1,D} can attain the minimax optimal rates over all estimators using the total data set of N samples, provided that the scaling parameter σ is chosen suitably.
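For concreteness, the three steps above can be sketched in NumPy. This is a minimal illustrative sketch, not the authors' implementation: it assumes a Gaussian (RBF) Mercer kernel and the correntropy-induced loss φ_σ(u) = σ²(1 − exp(−u²/(2σ²))), whose derivative u·exp(−u²/(2σ²)) drives the update (5); all function names and parameter values are hypothetical choices.

```python
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """Mercer kernel K(x, u); an RBF kernel is used here for illustration."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width**2))

def mcc_gradient_descent(x, y, T=200, eta=0.5, theta=0.25, sigma=5.0, width=1.0):
    """Kernel gradient descent for MCC on one subset, in the spirit of (5).

    Works on the coefficient vector a with f_t = sum_i a_i K_{x_i}:
        f_{t+1} = f_t - (eta_t / n) * sum_i phi'_sigma(f_t(x_i) - y_i) K_{x_i},
    where phi'_sigma(u) = u * exp(-u^2 / (2 sigma^2)).
    """
    n = len(y)
    K = gaussian_kernel(x, x, width)
    a = np.zeros(n)                      # f_{1,D} = 0
    for t in range(1, T + 1):
        eta_t = eta * t ** (-theta)      # polynomially decaying step size
        u = K @ a - y                    # residuals f_t(x_i) - y_i
        grad = u * np.exp(-u**2 / (2 * sigma**2))  # bounded MCC gradient
        a -= (eta_t / n) * grad
    return a

def distributed_mcc(x, y, m=4, **kw):
    """Divide-and-conquer: split evenly at random, run locally, average."""
    idx = np.array_split(np.random.permutation(len(y)), m)
    models = [(x[i], mcc_gradient_descent(x[i], y[i], **kw)) for i in idx]
    def f_bar(x_new):                    # averaged output function
        return np.mean([gaussian_kernel(x_new, xj, kw.get("width", 1.0)) @ aj
                        for xj, aj in models], axis=0)
    return f_bar
```

Because the gradient φ′_σ is bounded, samples with very large residuals (outliers) contribute a vanishing update, which is the source of MCC's robustness compared with the unbounded least squares gradient.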

Assumptions and Main Results
In the setting of non-parametric estimation, we denote X as the explanatory variable that takes values in a compact domain X , Y ∈ Y ⊂ R as a real-valued response variable. Let ρ be the underlying distribution on Z := X × Y. Moreover, let ρ X be the marginal distribution of ρ on X and ρ(·|x) be the conditional distribution on Y for given x ∈ X .
This work focuses on the application of MCC to regression problems, which are linked to the additive noise model

Y = f_ρ(X) + e,

where e is the noise and f_ρ(x) is the regression function, i.e., the conditional mean E(Y|X = x) for x ∈ X. The goal of this paper is to estimate the mean square error between f̄_{T+1,D} and f_ρ in the L²_{ρ_X}-metric, defined by ‖·‖_{L²_{ρ_X}} := (∫_X |·|² dρ_X)^{1/2}. For simplicity, we will use ‖·‖ to denote the norm ‖·‖_{L²_{ρ_X}} when the meaning is clear from the context.

Below, we present two important assumptions, which play a vital role in carrying out the analysis. The first assumption concerns the regularity of the target function f_ρ. Define the integral operator L_K : L²_{ρ_X} → L²_{ρ_X} associated with K by

L_K f = ∫_X f(x) K_x dρ_X(x).

As K is a Mercer kernel on the compact domain X, the operator L_K is compact and positive, so L_K^r, the r-th power of L_K for r > 0, is well defined. Our error bounds are stated in terms of the regularity of the target function f_ρ, given by [3,23]

f_ρ = L_K^r g_ρ for some g_ρ ∈ L²_{ρ_X}. (6)

The condition (6) measures the regularity of f_ρ and is closely related to the smoothness of f_ρ when H_K is a Sobolev space. If (6) holds with r ≥ 1/2, then f_ρ lies in the space H_K. The second assumption is about the capacity of H_K, measured by the effective dimension [24,25]

N(λ) := Tr((L_K + λI)^{−1} L_K), λ > 0,

where I is the identity operator on H_K. In this paper, we assume that

N(λ) ≤ Cλ^{−s} for some C > 0 and 0 < s ≤ 1. (7)
Note that (7) always holds with s = 1. For 0 < s < 1, it is almost equivalent to requiring that the eigenvalues σ_i of L_K decay at the rate i^{−1/s}. The smoother the kernel function K is, the smaller s is and the smaller the function space H_K becomes. In particular, if K is a Gaussian kernel, then s can be arbitrarily close to 0, as K ∈ C^∞. Throughout the paper, we assume that κ := sup_{x∈X} √(K(x, x)) ≤ 1 and |y| ≤ M for some M > 0. We denote by ⌈a⌉ the smallest integer not less than a.

Theorem 1. Assume that (6) and (7) hold for some r > 1/2 and 0 < s ≤ 1.
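For intuition, N(λ) can be approximated from a finite sample by replacing L_K with the normalized kernel matrix K/n, whose eigenvalues approximate those of the integral operator. The sketch below is a hedged illustration under that assumption; the function name and the use of an eigendecomposition are choices of this sketch, not of the paper.

```python
import numpy as np

def effective_dimension(K, lam):
    """Empirical effective dimension N(lam) = Tr(L (L + lam I)^{-1}),
    with L approximated by the normalized kernel matrix K/n."""
    n = K.shape[0]
    eig = np.linalg.eigvalsh(K / n)   # eigenvalues of the empirical operator
    eig = np.clip(eig, 0.0, None)     # guard tiny negatives from roundoff
    return float(np.sum(eig / (eig + lam)))
```

Each summand σ_i/(σ_i + λ) lies in [0, 1), so N(λ) is always below the sample size and shrinks as λ grows; the rate of this shrinkage reflects the eigenvalue decay exponent s.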
then, with confidence at least 1 − δ, the error bound holds, where C̃ is a constant depending on θ.
Remark 1. The above theorem, proved in Section 3, exhibits the concrete learning rates of the distributed estimator f̄_{T+1,D} (and hence of the standard estimator of (5) with m = 1). It implies that the kernel gradient descent for MCC achieves the same learning rate on a single data set and on the distributed data sets when σ is large enough, and this rate equals the minimax optimal rate in the regression setting [24,26] in the case r > 1/2. This theorem suggests that distributed MCC does not sacrifice the convergence rate provided that the partition number m satisfies the constraint (8). Thus, the distributed MCC estimator f̄_{T+1,D} enjoys both computational efficiency and statistical optimality.
With the help of Theorem 1, we can easily deduce the following optimal learning rate in expectation.

Corollary 1.
Assume that (6) and (7) hold for some r > 1/2 and 0 < s ≤ 1, and take η_t = ηt^{−θ} with 0 < η ≤ 1 and 0 ≤ θ < 1. If T is chosen as in Theorem 1, then the optimal learning rate holds in expectation.

By the confidence-based error estimate in Theorem 1, we can also obtain the following almost sure convergence of the distributed gradient descent algorithm for MCC.

Discussion and Conclusions
In this work, we have studied the theoretical properties and convergence behavior of a distributed kernel gradient descent MCC algorithm. As shown in Theorem 1, we derived minimax optimal error bounds for the distributed learning algorithm under a regularity condition on the regression function and a capacity condition on the RKHS. In the standard kernel gradient descent MCC algorithm (m = 1), the aggregate time complexity after t iterations is O(tN²). In the distributed case (m > 1), the aggregate time complexity reduces to O(tN²/m). In conclusion, the kernel gradient descent MCC algorithm (5) with the distributed method achieves fast convergence rates while substantially reducing algorithmic costs.
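The accounting behind these complexity figures is elementary: one iteration of (5) evaluates f_{t,D} at every sample point, costing O(n²) kernel operations for n samples, so

```latex
\underbrace{t \cdot O(N^2)}_{\text{single machine, } m=1}
\qquad\text{versus}\qquad
\underbrace{m \cdot t \cdot O\!\big((N/m)^2\big) \;=\; O\!\big(tN^2/m\big)}_{\text{aggregate over } m \text{ machines}} .
```

When the m local machines additionally run in parallel, the wall-clock time of the local stage is the per-machine cost t·O((N/m)²) = O(tN²/m²).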
When the optimization problem arises from a non-convex loss, the iterates generated by gradient descent may converge only to a stationary point or a local minimizer. Note that the loss induced by correntropy, φ_σ, is not convex, so convergence of the gradient descent method (5) to the global minimizer is not unconditionally guaranteed, which complicates the mathematical analysis of convergence. Our work on Theorem 1 addresses this issue by showing that, in the theoretical analysis, the iterations of the algorithm nevertheless attain the statistical performance associated with globally optimal solutions.
For regression problems, the distributed method has been introduced into iterative algorithms in various learning paradigms, and the minimax optimal rate has been obtained under different constraints on the partition number m. For distributed spectral algorithms [1], the bound on m that ensures the optimal rates is given by (9). We see from (9) that the restriction on m suffers from a saturation phenomenon when r ≥ 3/2, in the sense that the maximal m guaranteeing the optimal learning rate does not improve as r grows beyond 3/2. Our restriction in (8) is worse than (9) when r < 5/2 but better when r > 5/2, since the upper bound in (8) increases with respect to r and thereby overcomes the saturation effect in (9). For distributed kernel gradient descent algorithms with the least squares method [6] and the minimum error entropy (MEE) principle [9], the restrictions on m are improved to (10) and (11), respectively. Our bound (8) for MCC differs from (10) for least squares only up to a logarithmic term, which has little impact on the upper bound of m ensuring optimal rates; moreover, numerical experiments show that the distributed kernel gradient descent algorithm for the least squares method is inferior to that for MCC in non-Gaussian noise models [15,27,28]. Our bound (8) is the same as (11) for the MEE principle. As is well known, MEE also performs well in dealing with non-Gaussian noise or heavy-tailed distributions [13,29]. However, MEE belongs to the class of pairwise learning problems, which work with pairs of samples rather than single samples as in MCC. Hence, the distributed kernel gradient descent algorithm for MCC has an advantage over MEE in algorithmic complexity.

Several related questions are worthwhile for future research. First, our distributed result provides the optimal rates by requiring a large robust parameter σ. In practice, a moderate σ may be enough to ensure good learning performance in robust estimation, as shown by [17].
It is therefore of interest to investigate the convergence properties of the distributed version of algorithm (5) when σ is chosen as a constant or σ(N) → 0 as N approaches ∞.
Secondly, our algorithm is carried out in the framework of supervised learning; however, in numerous real-world applications few labeled data are available while a large amount of unlabeled data is given, since the cost of labeling data, in time and money, is high. Thus, we shall investigate how to enhance the learning performance of the MCC algorithm through the distributed method and the additional information provided by unlabeled data.
Thirdly, as stated in Theorem 1, the choice of the last iteration T and the partition number m depends on the parameters r, s, which are usually unknown in advance. In practice, cross-validation is usually used to tune T and m adaptively. It would be interesting to know whether the kernel gradient descent MCC (5) with the distributed method can achieve the optimal convergence rate with adaptive T and m.
Last but not least, we should note here that all the data D = {(x_i, y_i)}_{i=1}^N are drawn independently from the same distribution. In the distributed method, we partition D evenly and uniformly into m disjoint subsets. This means that |D_1| = · · · = |D_m| = N/m and each sample (x_i, y_i) is assigned to a subset D_j (1 ≤ j ≤ m) with the same probability. In the context of uniform random sampling, such a random splitting strategy is reasonable and practical, so our theoretical analysis is based on the uniform random splitting mechanism. For the theoretical analysis of other random or non-random splitting mechanisms, however, it is necessary to develop new mathematical tools to obtain optimal performance. This is beyond the scope of this paper and is left for future work.
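The uniform, even splitting mechanism described above is straightforward to realize; a minimal sketch (illustrative names, assuming m divides N as in the setting |D_1| = · · · = |D_m| = N/m):

```python
import numpy as np

def uniform_even_split(N, m, rng):
    """Uniformly random, even partition: shuffle the indices, then cut them
    into m contiguous chunks, so |D_1| = ... = |D_m| = N // m."""
    assert N % m == 0, "this illustration assumes m divides N"
    perm = rng.permutation(N)
    size = N // m
    return [perm[j * size:(j + 1) * size] for j in range(m)]
```

Because the permutation is uniform, every sample lands in each subset D_j with the same probability, matching the random splitting mechanism analyzed in the paper.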

Proofs of Main Results
This section is devoted to proving the main results in Section 2. Here and in the following, let the sample size of each subset D_1, · · · , D_m be n; that is, D = D_1 ∪ · · · ∪ D_m and N = mn. Define the empirical operator L_{K,D} on H_K (and similarly L_{K,D_j} for each subset) as

L_{K,D_j} f = (1/n) Σ_{i=1}^n f(x_i) K_{x_i},

where x_1, · · · , x_n ∈ {x : (x, y) ∈ D_j with some y ∈ Y}.

Preliminaries
We first introduce some necessary lemmas in the proofs, which can be found in [3,6,9].

Lemma 1.
Let g(z) be a measurable function defined on Z with ‖g‖_∞ ≤ M almost surely for some M > 0. Let 0 < δ < 1; then each of the following estimates holds with confidence at least 1 − δ.

Let π_i^t denote the polynomial defined by π_i^t(s) = ∏_{j=i}^t (1 − η_j s) if i ≤ t and, for notational simplicity, let π_{t+1}^t(s) = 1 be the constant function equal to 1. In our proof, we need to deal with the polynomial operators π_i^t(L_K) and π_i^t(L_{K,D}). For this purpose, we adopt the convention that an empty product equals 1 and introduce the following preliminary lemmas.
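The contraction behavior that makes these polynomial operators tractable can be checked numerically: with η_j = ηj^{−θ}, 0 < η ≤ 1, and a spectral value s ∈ [0, 1], every factor 1 − η_j s lies in [0, 1], so the product never exceeds 1. A small sketch under these assumptions (the function name is illustrative):

```python
import numpy as np

def pi_poly(i, t, s, eta, theta):
    """pi_i^t(s) = prod_{j=i}^{t} (1 - eta_j * s) with eta_j = eta * j**(-theta);
    for i > t the empty product is 1, matching the convention pi_{t+1}^t = 1."""
    js = np.arange(i, t + 1)
    return float(np.prod(1.0 - eta * js ** (-theta) * s))
```

Applied to an eigenvalue s of L_K or L_{K,D} (both bounded by 1 when κ ≤ 1), this explains the operator-norm bound ‖π_{i+1}^t(L_{K,D})‖ ≤ 1 used later in the proofs.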

Lemma 2.
If 0 ≤ α < 1 and 0 ≤ θ < 1, then for T ≥ 3 the stated estimate holds, where C_{θ,α} is a constant depending only on θ and α, whose value is given in the proof. In particular, if α = 0, we have the corresponding simplified bound.

Define a data-free gradient descent sequence for the least squares method in H_K by f_1 = 0 and

f_{t+1} = f_t − η_t L_K(f_t − f_ρ), t ∈ N. (20)

It is well established in the literature [30] that, under the assumption (6) with r > 1/2, the bounds (21) and (22) hold with

h_ρ = max{‖g_ρ‖ (2r/e)^r, ‖g_ρ‖ [(2r − 1)/e]^{r − 1/2}}.

Bound for the Learning Sequence
We will need the following bound for the learning sequence in the proof.
Theorem 2. If the step size sequence is η_t = ηt^{−θ} with 0 < η ≤ 1 and 0 ≤ θ < 1, then the following bound holds for the learning sequence {f_{t,D}} generated by (5).

Proof. We prove the statement by induction. First note that the conclusion holds trivially for t = 1, since f_{1,D} = 0. By the updating rule (5) and the reproducing property, we have

where

By the properties of the quadratic function, we have

Plugging it into (28), we obtain

This completes the proof.

Error Decomposition and Estimation of Error Bounds
Now we are in a position to bound the error of the distributed kernel gradient descent MCC. For this purpose, we decompose the error into two parts as

‖f̄_{T+1,D} − f_ρ‖ ≤ ‖f_{T+1} − f_ρ‖ + ‖f̄_{T+1,D} − f_{T+1}‖. (29)

As mentioned in the previous subsection, the first term can be bounded by (21) under the assumption (6) with r > 1/2. Our key analysis concerns the second term, which can be bounded with the help of the following proposition.

Proposition 1. Assume that (6) holds for some r > 1/2. Let η_t = ηt^{−θ} with 0 < η ≤ 1 and 0 ≤ θ < 1. For λ > 0, the bounds (30) and (31) hold, where C_{r,θ} is a constant given in the proof, depending on r and θ.
Proof. By the definition of f_{t,D} in (5) and the definition of f_t in (20), we have

where f_{ρ,D} is defined in (32) and

Applying (33) iteratively from t = 1 to T, we obtain

where

For I_1, by (26) and Lemmas 4 and 5,

For I_2, by (26), Lemma 3, and the fact that ‖f_ρ‖_∞ ≤ M, we have

Similarly, we can bound I_3 as

For I_4, first note that by the bound (27),

This implies that

This, together with the estimate ‖π_{i+1}^t(L_{K,D})‖ ≤ 1, gives

Combining the estimates in (36), (37), (39) and (35), we obtain that (30) holds.
Following a similar process we can obtain the bound in (31).
The following theorem provides a bound for the second term in (29).
Theorem 3. There is a constant C̃_{r,θ} such that (40) holds.

Proof. For each subset D_l and each 1 ≤ t ≤ T, we have

This implies that

We first estimate J_2. By (26), Lemma 3, and the choice λ = T^{−(1−θ)}, we obtain

For J_3, by (39) we have

The estimation of J_1 is more involved. We decompose it into three parts,

By Lemmas 4 and 5 and the fact that λT^{1−θ} = 1, we obtain

For J_13, by (19) we have

Now we turn to J_11. We have

By Theorem 2 and the choice λ = T^{−(1−θ)}, for 1 ≤ i ≤ T there holds λi^{1−θ} ≤ 1 and

Plugging it into (43), we obtain

From Lemma 2, we see that

So, we have

Combining the estimates for J_11, J_12 and J_13, we obtain the bound on J_1.

Now the desired bound for f̄_{T+1,D} − f_{T+1} in (40) follows by combining the estimates for J_1, J_2 and J_3, and the constant is given by

C̃_{r,θ} := 2Mθ/(1 − θ) + C_{ρ,θ,r} + D_{ρ,θ,r} + 3C_{r,θ}.

This proves the theorem.

Proofs
Now we can prove Theorem 1.