Abstract
The Maximum Correntropy Criterion (MCC) has recently triggered enormous research activity in the engineering and machine learning communities because it is robust to the heavy-tailed noise and outliers encountered in practice. This work is interested in distributed MCC algorithms, based on a divide-and-conquer strategy, which can deal with big data efficiently. By establishing minimax optimal error bounds, our results show that the averaging output function of this distributed algorithm can achieve convergence rates comparable to those of the algorithm processing the total data on a single machine.
1. Introduction
In the big data era, the rapid expansion of data generation brings data of prohibitive size and complexity, which poses challenges to many traditional learning algorithms that require access to the whole data set. Distributed learning algorithms, based on the divide-and-conquer strategy, provide a simple and efficient way to address this issue and have therefore received increasing attention. Such a strategy starts by partitioning the big data set into multiple subsets that are distributed to local machines, then obtains a local estimator on each subset using a base algorithm, and finally pools the local estimators together by simple averaging. It can substantially cut the time and memory costs of the algorithm implementation, and in many practical applications its learning performance has been shown to be as good as that of a single big machine that can use all the data. This scheme has been developed in various learning contexts, including spectral algorithms [1,2], kernel ridge regression [3,4,5], gradient descent [6,7], a semi-supervised approach [8], minimum error entropy [9] and bias correction [10].
Regression estimation and inference play an important role in the fields of data mining and statistics. The traditional ordinary least squares (OLS) method provides an efficient estimator if the regression model error is normally distributed. However, heavy-tailed noise and outliers are common in the real world, which limits the application of OLS in practice. Various robust losses have been proposed to deal with this problem in place of the least squares loss. The commonly used robust losses include the adaptive Huber loss [11], gain functions [12], minimum error entropy [13], the exponential squared loss [14], etc. Among them, the Maximum Correntropy Criterion (MCC) is widely employed as an efficient alternative to the ordinary least squares method, which is suboptimal in non-Gaussian and non-linear signal processing settings [15,16,17,18,19]. Recently, MCC has been studied extensively in the literature and is widely adopted for many learning tasks, e.g., wind power forecasting [20] and pattern recognition [19]. In this paper, we are interested in the implementation of MCC by a distributed gradient descent method in a big data setting. Note that the MCC loss function is non-convex, so its analysis is essentially different from that of the least squares method. A rigorous analysis of distributed MCC is necessary to derive its consistency and learning rates.
Given a hypothesis function $f$ and a scaling parameter $\sigma>0$, the correntropy between $f(X)$ and $Y$ is defined by
$$V_\sigma(f) = \mathbb{E}\big[G_\sigma\big(f(X) - Y\big)\big],$$
where $G_\sigma$ is the Gaussian function $G_\sigma(t) = \exp\!\left(-\frac{t^2}{2\sigma^2}\right)$. Given the sample set $D = \{(x_i, y_i)\}_{i=1}^N$, the empirical form of $V_\sigma(f)$ is
$$\widehat{V}_\sigma(f) = \frac{1}{N}\sum_{i=1}^N G_\sigma\big(f(x_i) - y_i\big).$$
The purpose of MCC is to maximize the empirical correntropy over a hypothesis space $\mathcal{H}$, that is,
$$\max_{f \in \mathcal{H}} \frac{1}{N}\sum_{i=1}^N G_\sigma\big(f(x_i) - y_i\big). \qquad (1)$$
In the statistical learning context, the loss induced by correntropy is defined as
$$\ell_\sigma(t) = \sigma^2\left(1 - \exp\!\left(-\frac{t^2}{2\sigma^2}\right)\right), \qquad t \in \mathbb{R},$$
where $\sigma>0$ is the scaling parameter. The loss function $\ell_\sigma$ can be viewed as a variant of the Welsch function [21], and the estimator of (1) is also the minimizer of the empirical risk minimization scheme over $\mathcal{H}$, that is,
$$\min_{f \in \mathcal{H}} \frac{1}{N}\sum_{i=1}^N \ell_\sigma\big(y_i - f(x_i)\big). \qquad (2)$$
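To make the robustness of the correntropy-induced loss concrete, the following short sketch compares it with the squared loss on a large residual, assuming the Welsch-type form $\ell_\sigma(t)=\sigma^2(1-\exp(-t^2/(2\sigma^2)))$ written above (the normalization is our reading of the definition): the squared loss grows without bound while $\ell_\sigma$ saturates at $\sigma^2$.

```python
import numpy as np

def correntropy_loss(t, sigma):
    """Correntropy-induced (Welsch-type) loss; bounded above by sigma**2."""
    return sigma**2 * (1.0 - np.exp(-t**2 / (2.0 * sigma**2)))

residuals = np.array([0.1, 1.0, 5.0, 50.0])   # the last value mimics an outlier
sigma = 1.0
print("squared loss    :", 0.5 * residuals**2)
print("correntropy loss:", correntropy_loss(residuals, sigma))
# The squared loss on the outlier is 1250, while the correntropy loss
# stays below sigma**2 = 1, so a single outlier cannot dominate the risk.
```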
This paper aims at a rigorous analysis of distributed gradient descent MCC within the framework of reproducing kernel Hilbert spaces (RKHSs). Let $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel [22], i.e., a continuous, symmetric and positive semi-definite function. A kernel K is said to be positive semi-definite if the matrix $\big(K(x_i, x_j)\big)_{i,j=1}^{\ell}$ is positive semi-definite for any finite set $\{x_1, \ldots, x_\ell\} \subset \mathcal{X}$ and any $\ell \in \mathbb{N}$. The RKHS $(\mathcal{H}_K, \|\cdot\|_K)$ associated with the Mercer kernel K is defined to be the completion of the linear span of the set of functions $\{K_x := K(x, \cdot): x \in \mathcal{X}\}$ with the inner product given by $\langle K_x, K_{x'} \rangle_K = K(x, x')$. It has the reproducing property
$$f(x) = \langle f, K_x \rangle_K \qquad (3)$$
for any $f \in \mathcal{H}_K$ and $x \in \mathcal{X}$. Denote $\kappa := \sup_{x \in \mathcal{X}} \sqrt{K(x,x)}$. By the property (3), we get that
$$\|f\|_\infty \le \kappa \|f\|_K, \qquad \forall f \in \mathcal{H}_K. \qquad (4)$$
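As a minimal numerical illustration of these definitions (the Gaussian kernel on $[0,1]$ and the sample points below are illustrative choices, not the paper's setting), the Gram matrix of a Mercer kernel on any finite point set is positive semi-definite, and the squared RKHS norm of a function in the span of kernel sections is the quadratic form of its coefficient vector.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=5.0):
    """Gaussian (RBF) Mercer kernel K(x, y) = exp(-gamma * (x - y)^2)."""
    return np.exp(-gamma * (x - y) ** 2)

# Positive semi-definiteness of the Gram matrix on a finite point set.
x = np.linspace(0.0, 1.0, 8)
G = gaussian_kernel(x[:, None], x[None, :])
print("smallest Gram eigenvalue:", np.linalg.eigvalsh(G).min())  # >= 0 up to rounding

# RKHS inner product on the span of kernel sections:
# for f = sum_i c_i K_{x_i}, the squared norm is ||f||_K^2 = c^T G c >= 0.
c = np.random.default_rng(0).normal(size=x.shape)
print("||f||_K^2 =", float(c @ G @ c))
```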
Definition 1.
Given the sample set $D = \{(x_i, y_i)\}_{i=1}^N$, the kernel gradient descent algorithm for solving (2) can be stated iteratively, with $f_{1,D} = 0$, as
$$f_{t+1,D} = f_{t,D} + \frac{\eta_t}{N}\sum_{i=1}^N \ell_\sigma'\big(y_i - f_{t,D}(x_i)\big) K_{x_i}, \qquad t = 1, 2, \ldots, T, \qquad (5)$$
where $\eta_t > 0$ is the step size and $\ell_\sigma'(u) = u \exp\!\left(-\frac{u^2}{2\sigma^2}\right)$ is the derivative of the correntropy-induced loss $\ell_\sigma$.
The divide-and-conquer algorithm for the kernel gradient descent MCC (5) is easy to describe. Rather than operating on all N examples at once, the distributed algorithm executes the following three steps (an illustrative sketch is given after the list):
- Partition the data set D evenly and uniformly into m disjoint subsets $D_1, \ldots, D_m$.
- Perform algorithm (5) on each data set $D_j$, and obtain the local estimator $f_{T+1,D_j}$ after the T-th iteration.
- Take the average $\bar{f}_{T+1,D} = \frac{1}{m}\sum_{j=1}^m f_{T+1,D_j}$ as the final output.
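The following minimal Python sketch illustrates these three steps. It assumes the Welsch-type loss written in (2), a Gaussian kernel, a constant step size instead of a decaying schedule, and hypothetical function names (gaussian_kernel, local_gd_mcc, distributed_mcc); it is an illustration of the divide-and-conquer scheme, not the exact algorithm analyzed in the paper.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=10.0):
    """RBF kernel matrix between two 1-D sample arrays."""
    return np.exp(-gamma * (X1[:, None] - X2[None, :]) ** 2)

def local_gd_mcc(x, y, T=200, eta=0.5, sigma=1.0, gamma=10.0):
    """Kernel gradient descent for MCC on one local subset.

    The estimator is kept in the form f(u) = sum_i alpha_i K(x_i, u), and each
    step moves along the gradient of the empirical correntropy objective
    (the constant 1/sigma^2 factor is absorbed into the step size eta).
    """
    n = len(x)
    K = gaussian_kernel(x, x, gamma)
    alpha = np.zeros(n)                      # f_1 = 0
    for _ in range(T):
        residual = y - K @ alpha             # y_i - f_t(x_i)
        weight = np.exp(-residual**2 / (2.0 * sigma**2))
        alpha += (eta / n) * weight * residual
    return alpha

def distributed_mcc(x, y, m=4, gamma=10.0, **gd_kwargs):
    """Divide-and-conquer: split the data, run local GD-MCC, average the estimators."""
    idx = np.random.default_rng(1).permutation(len(x))
    local_models = []
    for block in np.array_split(idx, m):
        xb, yb = x[block], y[block]
        local_models.append((xb, local_gd_mcc(xb, yb, gamma=gamma, **gd_kwargs)))

    def f_bar(u):
        u = np.atleast_1d(np.asarray(u, dtype=float))
        preds = [gaussian_kernel(u, xb, gamma) @ ab for xb, ab in local_models]
        return np.mean(preds, axis=0)

    return f_bar

# Toy data with heavy-tailed (Student-t) noise to mimic the robust setting.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 2000)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_t(df=2, size=x.size)

f_bar = distributed_mcc(x, y, m=8)
grid = np.linspace(0.0, 1.0, 5)
print("estimate:", np.round(f_bar(grid), 2))
print("truth   :", np.round(np.sin(2 * np.pi * grid), 2))
```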
In the next section, we study the asymptotic behavior of the final estimator $\bar{f}_{T+1,D}$ and show that it can attain the minimax optimal rates over all estimators using the total data set of N samples, provided that the scaling parameter σ is chosen suitably.
2. Assumptions and Main Results
In the setting of non-parametric estimation, we denote by X the explanatory variable taking values in a compact domain $\mathcal{X}$ and by $Y \in \mathbb{R}$ a real-valued response variable. Let ρ be the underlying distribution on $\mathcal{Z} := \mathcal{X} \times \mathbb{R}$. Moreover, let $\rho_{\mathcal{X}}$ be the marginal distribution of ρ on $\mathcal{X}$ and $\rho(\cdot \mid x)$ be the conditional distribution on $\mathbb{R}$ for given $x \in \mathcal{X}$.
This work focuses on the application of MCC to regression problems, which is linked to the additive noise model
$$Y = f_\rho(X) + e,$$
where e is the noise and $f_\rho$ is the regression function, which is the conditional mean $f_\rho(x) = \mathbb{E}[Y \mid X = x]$ for $x \in \mathcal{X}$. The goal of this paper is to estimate the error between $\bar{f}_{T+1,D}$ and $f_\rho$ in the $L^2_{\rho_{\mathcal{X}}}$-metric, which is defined by $\|f\|_\rho = \left(\int_{\mathcal{X}} |f(x)|^2 \, d\rho_{\mathcal{X}}\right)^{1/2}$. For simplicity, we will use $\|\cdot\|$ to denote this norm when the meaning is clear from the context.
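To make the error metric concrete, a small sketch follows; the marginal distribution, regression function, and estimator below are purely illustrative assumptions. It approximates $\|\cdot\|_\rho^2$ by Monte Carlo sampling from the marginal distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

f_rho = lambda x: np.sin(2 * np.pi * x)              # illustrative regression function
f_hat = lambda x: np.sin(2 * np.pi * x) + 0.05 * x   # some estimator of it

# ||f_hat - f_rho||_rho^2 = E_{X ~ rho_X} |f_hat(X) - f_rho(X)|^2,
# approximated here by Monte Carlo with rho_X = Uniform[0, 1].
x = rng.uniform(0.0, 1.0, 100_000)
mse = np.mean((f_hat(x) - f_rho(x)) ** 2)
print("Monte Carlo estimate of the squared L^2_{rho_X} error:", mse)
# Exact value: integral of (0.05 x)^2 over [0, 1] = 0.0025 / 3 ≈ 0.000833.
```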
Below, we present two important assumptions, which play a vital role in carrying out the analysis. The first assumption is about the regularity of the target function $f_\rho$. Define the integral operator $L_K$ on $L^2_{\rho_{\mathcal{X}}}$ associated with K by
$$L_K f = \int_{\mathcal{X}} f(x) K_x \, d\rho_{\mathcal{X}}(x).$$
As K is a Mercer kernel on the compact domain $\mathcal{X}$, the operator $L_K$ is compact and positive. So the r-th power $L_K^r$ of $L_K$ is well defined for any $r > 0$. Our error bounds are stated in terms of the regularity of the target function $f_\rho$, given by [3,23]
$$f_\rho = L_K^r(g_\rho) \quad \text{for some } g_\rho \in L^2_{\rho_{\mathcal{X}}}. \qquad (6)$$
The condition (6) measures the regularity of $f_\rho$ and is closely related to the smoothness of $f_\rho$ when $\mathcal{H}_K$ is a Sobolev space. If (6) holds with $r \ge 1/2$, then $f_\rho$ lies in the space $\mathcal{H}_K$.
The second assumption (7) is about the capacity of $\mathcal{H}_K$, measured by the effective dimension [24,25]
$$\mathcal{N}(\lambda) = \operatorname{Tr}\big((L_K + \lambda I)^{-1} L_K\big), \qquad \lambda > 0,$$
where I is the identity operator on $L^2_{\rho_{\mathcal{X}}}$. In this paper, we assume that, for some $0 < s \le 1$ and a constant $C_0 > 0$,
$$\mathcal{N}(\lambda) \le C_0 \lambda^{-s}, \qquad \forall\, \lambda > 0. \qquad (7)$$
Note that (7) always holds with $s = 1$. For $0 < s < 1$, it is almost equivalent to the statement that the eigenvalues of $L_K$ decay at a rate $O(i^{-1/s})$. The smoother the kernel function K is, the smaller s is and the smaller the function space $\mathcal{H}_K$ becomes. In particular, if K is a Gaussian kernel, then s can be arbitrarily close to 0, as the eigenvalues of $L_K$ decay exponentially fast.
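The effective dimension has a simple empirical surrogate based on the kernel Gram matrix. The sketch below (a Gaussian kernel and a uniform design are illustrative assumptions, not the paper's setting) evaluates $\operatorname{Tr}\big(L(L + \lambda I)^{-1}\big)$ for the normalized Gram matrix $L = K/n$ over several λ, giving a feel for the decay postulated in (7).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.uniform(0.0, 1.0, n)
K = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)   # Gaussian Gram matrix
L = K / n                                            # empirical integral operator

for lam in [1e-1, 1e-2, 1e-3, 1e-4]:
    # empirical effective dimension N(lambda) = Tr(L (L + lambda I)^{-1})
    eff_dim = np.trace(L @ np.linalg.inv(L + lam * np.eye(n)))
    print(f"lambda = {lam:g}:  N(lambda) ≈ {eff_dim:.2f}")
# A fast eigenvalue decay (smooth kernel) keeps N(lambda) small, which
# corresponds to a small exponent s in assumption (7).
```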
Throughout the paper, we assume that the response is bounded, that is, $|y| \le M$ almost surely for some constant $M > 0$. We denote by $\lceil a \rceil$ the smallest integer not less than a.
Theorem 1.
Assume that (6) and (7) hold for some r and s. Take the step size $\eta_t = \eta_1 t^{-\theta}$ with suitable $\eta_1$ and θ and the iteration number T accordingly. If σ is large enough and the number of partitions m of the data set D satisfies
then, for any $\delta \in (0,1)$, with confidence at least $1-\delta$,
where the constant factor depends on
Remark 1.
The above theorem, to be proved in Section 4, exhibits concrete learning rates for the distributed estimator $\bar{f}_{T+1,D}$ (and hence for the standard estimator of (5) with $m = 1$). It implies that kernel gradient descent for MCC achieves the same learning rate on a single data set and on the distributed data sets when σ is large enough, and this rate matches the minimax optimal rate in the regression setting [24,26]. This theorem suggests that the distributed MCC does not sacrifice the convergence rate provided that the partition number m satisfies the constraint (8). Thus, the distributed MCC estimator enjoys both computational efficiency and statistical optimality.
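For orientation, a hedged illustration of the rate being referenced: under the source condition (6) with index r and the capacity condition (7) with index s, the minimax rate for the squared $L^2_{\rho_{\mathcal{X}}}$ error is commonly written in the cited literature [24,26] in the following standard form (this is the usual form from those references, not a verbatim restatement of Theorem 1).

```latex
% Standard minimax rate under (6) and (7), as commonly stated in [24,26]:
% infimum over estimators, supremum over distributions satisfying the assumptions.
\[
  \inf_{\widehat{f}}\ \sup_{\rho}\ \mathbb{E}\,\bigl\|\widehat{f} - f_\rho\bigr\|_\rho^{2}
  \;\asymp\; N^{-\frac{2r}{2r+s}},
  \qquad \text{e.g., } r = \tfrac12,\ s = \tfrac12
  \ \Longrightarrow\ N^{-\frac{2}{3}}.
\]
```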
With the help of Theorem 1, we can easily deduce the following optimal learning rate in expectation.
Corollary 1.
Assume that (6) and (7) hold for some r and s, and take $\eta_t = \eta_1 t^{-\theta}$ with suitable $\eta_1$ and θ. If σ is large enough, m satisfies (8), and T is chosen as in Theorem 1, then we have
By the confidence-based error estimate in Theorem 1, we can obtain the following almost sure convergence of the distributed gradient descent algorithm for MCC.
Corollary 2.
Assume that (6) and (7) hold for some r and s, and take $\eta_t = \eta_1 t^{-\theta}$ with suitable $\eta_1$ and θ. If σ is large enough, m satisfies (8), and T is chosen as in Theorem 1, then for arbitrary $\varepsilon > 0$, we have
3. Discussion and Conclusions
In this work, we have studied the theoretical properties and convergence behavior of a distributed kernel gradient descent MCC algorithm. As shown in Theorem 1, we derived minimax optimal error bounds for the distributed learning algorithm under the regularity condition on the regression function and the capacity condition on the RKHS. In the standard kernel gradient descent MCC algorithm ($m = 1$), the aggregate time complexity is $O(N^2 t)$ after t iterations. However, in the distributed case ($m > 1$), the aggregate time complexity reduces to $O(N^2 t / m)$ after t iterations. In conclusion, the kernel gradient descent MCC algorithm (5) with the distributed method can achieve fast convergence rates while successfully reducing algorithmic costs.
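Under the accounting used above (one pass of kernel gradient descent over n samples costs on the order of $n^2$ kernel operations, which is our reading of the complexity statement), a quick back-of-the-envelope comparison illustrates the $1/m$ reduction; the concrete numbers are only an example.

```python
# Illustrative operation counts for t iterations of kernel gradient descent,
# assuming one iteration over n samples costs on the order of n^2 operations.
N, t = 100_000, 200

for m in [1, 10, 100]:
    n = N // m                      # samples per local machine
    per_machine = n**2 * t          # cost on one machine
    aggregate = m * per_machine     # total over all m machines = N^2 t / m
    print(f"m = {m:>3}:  aggregate cost ≈ {aggregate:.2e}")
```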
When the optimization problem arises from a non-convex loss, the iteration sequence generated by gradient descent may only converge to a stationary point or a local minimizer. Note that the loss induced by correntropy is not convex, so the convergence of the gradient descent method (5) to the global minimizer is not unconditionally guaranteed, which brings difficulties to the mathematical analysis of convergence. Our analysis in Theorem 1 addresses this issue: it shows that, under the stated conditions, the iterations of the algorithm still attain the optimal learning rates, so the non-convexity of the loss does not spoil the global behavior of the iterations.
For regression problems, the distributed method has been introduced to iteration algorithms in various learning paradigms, and the minimax optimal rate has been obtained under different constraints on the partition number m. For distributed spectral algorithms [1], the upper bound on m that ensures the optimal rates is
We see from (9) that the restriction on m suffers from a saturation phenomenon, in the sense that the maximal m guaranteeing the optimal learning rate does not improve once r grows beyond a certain level. Our restriction in (8) is worse than (9) for small r but better for large r, since the upper bound in (8) increases with respect to r, which overcomes the saturation effect in (9). For the distributed kernel gradient descent algorithms with the least squares method [6] and the minimum error entropy (MEE) principle [9], the restrictions on m are improved to
and
respectively. Our bound (8) for MCC differs with (10) for least squares only up to a logarithmic term, which has little impact on the upper bound of m ensuring optimal rates, but numerical experiments show that the distributed kernel gradient descent algorithm for least squares method is inferior to that for MCC in non-Gaussian noise models [15,27,28]. Our bound (8) is the same as (11) that is applied to the MEE principle. As we know, MEE also performs well in dealing with non-Gaussian noise or heavy-tail distribution [13,29]. However, MEE belongs to pairwise learning problems that work with pairs of samples rather than single sample in MCC. Hence, the distributed kernel gradient descent algorithm for MCC has an advantage over MEE in algorithmic complexity.
Several related questions are worthwhile for future research. First, our distributed result provides the optimal rates by requiring a large scaling parameter σ. In practice, a moderate σ may be enough to ensure good learning performance in robust estimation, as shown by [17]. It is therefore of interest to investigate the convergence properties of the distributed version of algorithm (5) when σ is chosen as a constant or varies with N as N approaches infinity.
Secondly, our algorithm is carried out in the framework of supervised learning; however, in numerous real-world applications only few labeled data are available while a large amount of unlabeled data is given, since the cost of labeling data (in time and money) is high. Thus, we shall investigate how to enhance the learning performance of the MCC algorithm by the distributed method together with the additional information given by unlabeled data.
Thirdly, as stated in Theorem 1, the choice of the last iteration T and the partition number m depends on the parameters r and s, which are usually unknown in advance. In practice, cross-validation is usually used to tune T and m adaptively. It would be interesting to know whether the kernel gradient descent MCC (5) with the distributed method can achieve the optimal convergence rate with adaptively chosen T and m.
Last but not least, we should note here that all the data are assumed to be drawn independently from the same distribution. In the distributed method, we partition D evenly and uniformly into m disjoint subsets. This means that the subsets have equal size and each sample is assigned to each subset with the same probability. In the context of uniform random sampling, such a random splitting strategy is reasonable and practical, so our theoretical analysis is based on the uniform random splitting mechanism. However, the theoretical analysis of other random or non-random splitting mechanisms requires new mathematical tools to achieve optimal performance. This is beyond the scope of this paper and is left for future work.
4. Proofs of Main Results
This section is devoted to proving the main results in Section 2. Here and in the following, let the sample size of each local subset be n; that is, $n = N/m$ and $|D_j| = n$ for $j = 1, \ldots, m$. Define the empirical operator $L_{K,D}$ on $\mathcal{H}_K$ as
$$L_{K,D} f = \frac{1}{N}\sum_{(x,y) \in D} f(x) K_x, \qquad f \in \mathcal{H}_K,$$
where the sum runs over the samples in D. Similarly, we can define the operator $L_{K,D_j}$ on $\mathcal{H}_K$ for each subset $D_j$ as
$$L_{K,D_j} f = \frac{1}{n}\sum_{(x,y) \in D_j} f(x) K_x, \qquad f \in \mathcal{H}_K.$$
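For a function represented on the sample points, $f = \sum_j c_j K_{x_j}$, the action of the empirical operator reduces to a Gram-matrix product on the coefficient vector. A minimal numerical sketch follows; the Gaussian kernel and the coefficient representation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
N = len(x)
K = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)   # Gram matrix K(x_i, x_j)

# A function in the span of kernel sections: f = sum_j c_j K_{x_j}.
c = rng.normal(size=N)

# Empirical operator: L_{K,D} f = (1/N) sum_i f(x_i) K_{x_i}.
# Since f(x_i) = (K c)_i, the new coefficient vector is (1/N) K c.
c_new = (K @ c) / N

# Evaluate both representations at a test point to confirm the identification.
u = 0.3
k_u = np.exp(-10.0 * (x - u) ** 2)                   # vector (K(x_i, u))_i
lhs = np.sum((K @ c) * k_u) / N                      # (1/N) sum_i f(x_i) K(x_i, u)
rhs = k_u @ c_new                                    # (sum_i c_new_i K_{x_i})(u)
print(lhs, rhs)                                      # identical up to floating point
```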
4.1. Preliminaries
We first introduce some necessary lemmas in the proofs, which can be found in [3,6,9].
Lemma 1.
Let be a measurable function defined on with almost surely for some . Then each of the following estimates holds with confidence at least ,
and
where .
Let denote the polynomial defined by if and, for notational simplicity, let be the identity function. In our proof, we need to deal with the polynomial operators and . For this purpose, we introduce the conventional notation and the following preliminary lemmas.
Lemma 2.
If , then for
where is a constant depending only on θ and α, whose value is given in the proof. In particular, if , we have
Lemma 3.
If with and then for ,
Define a data-free gradient descent sequence for the least squares method in by and
It is well established in the literature [30] that, under the assumption (6) with , we have
and
where .
Lemma 4.
If with and , then there is a constant such that
and
Lemma 5.
If with and , then there is a constant such that
Recall the isomorphism between and , which yields in
4.2. Bound for the Learning Sequence
We will need the following bound for the learning sequence in the proof.
Theorem 2.
If the step size sequence $\eta_t = \eta_1 t^{-\theta}$ with suitable $\eta_1$ and θ, then we have the following bound for the learning sequence generated by (5):
Proof.
We prove the statement by induction. First, note that the conclusion holds trivially for $t = 1$. Next, suppose that the bound holds for some $t \ge 1$. By the updating rule (5) and the reproducing property, we have
where
The restriction implies . By the property of the quadratic function, we have
Plugging it into (28), we obtain
This completes the proof. □
4.3. Error Decomposition and Estimation of Error Bounds
Now we are in a position to bound the error of the distributed kernel gradient descent MCC. For this purpose, we decompose the error into two parts as
As mentioned in the previous subsection, the first term can be bounded by (21) under the assumption (6) with . The key step is to analyze the second term, which can be bounded with the help of the following proposition.
Proposition 1.
Assume that (6) holds for some Let with and For , there holds
and
where
and is given in the proof, depending on .
Proof.
For by (26), Lemmas 4 and 5,
For , by (26), Lemma 3, and the fact , we have
Similarly, we can bound as
For first note that by the bound (27) of , we see
This implies that
This, together with the estimate , gives
Following a similar process, we can obtain the bound in (31). □
The following theorem provides a bound for the second term in (29).
Theorem 3.
Take There is a constant such that
Proof.
For each subset and each , we have
This implies that
and therefore
We first estimate By (26), Lemma 3, and the choice , we obtain
For by (39) we have
The estimation of is much more complicated. We decompose it into three parts,
By Lemmas 4 and 5 and the fact , we obtain
For , by (19) we have
Now we turn to We have
By Theorem 2 and the choice , for , there holds that and
Plugging it into (43), we obtain
From Lemma 2, we see that
So, we have
Combining the estimations for and , we obtain
Now the desired bound for in (40) follows by combining the estimations for , , and and the constant is given by
This proves the theorem. □
4.4. Proofs
Now we can prove Theorem 1.
Proof.
Therefore,
and
By applying Lemma 1, for any we have with confidence at least ,
Consequently, these bounds hold simultaneously with confidence at least . This implies that, with confidence at least , there holds
and
By Lemma 1, we have with confidence at least ,
and
This, together with the bound
leads to the desired conclusion with . □
Proof of Corollary 1.
When , by Theorem 1, we have that with confidence at least . Replacing by t, then
Using the probability-to-expectation formula $\mathbb{E}[\xi] = \int_0^\infty \mathbb{P}(\xi > t)\, dt$, valid for any nonnegative random variable ξ,
with we have
where $\Gamma(\cdot)$ is the Gamma function defined for $x > 0$ by $\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du$.
The proof is complete. □
To prove Corollary 2, we need the following Borel–Cantelli lemma, which is provided in [31].
Lemma 6.
Let be a sequence of random variables in some probability space and be a sequence of positive numbers satisfying . If
then converges to a almost surely.
Proof of Corollary 2.
Let in Theorem 1; then we have
Thus, for any ,
Applying Lemma 6 with , and , we can obtain the conclusion of Corollary 2 by noting and
The proof is finished. □
Author Contributions
Validation, F.X. and S.W.; Writing (original draft), B.W.; Writing (review and editing), T.H. All authors have read and agreed to the published version of the manuscript.
Funding
The work is supported partially by the National Key Research and Development Program of China (Grant No. 2021YFA1000600) and the National Natural Science Foundation of China (Grant No.12071356).
Conflicts of Interest
The authors declare no conflict of interest.
References
- Guo, Z.C.; Lin, S.B.; Zhou, D.X. Learning theory of distributed spectral algorithms. Inverse Probl. 2017, 33, 074009.
- Mücke, N.; Blanchard, G. Parallelizing spectrally regularized kernel algorithms. J. Mach. Learn. Res. 2018, 19, 1069–1097.
- Lin, S.B.; Guo, X.; Zhou, D.X. Distributed learning with regularized least squares. J. Mach. Learn. Res. 2017, 18, 3202–3232.
- Hu, T.; Zhou, D.X. Distributed regularized least squares with flexible Gaussian kernels. Appl. Comput. Harmon. Anal. 2021, 53, 349–377.
- Zhang, Y.; Duchi, J.; Wainwright, M. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res. 2015, 16, 3299–3340.
- Lin, S.B.; Zhou, D.X. Distributed kernel-based gradient descent algorithms. Constr. Approx. 2018, 47, 249–276.
- Shamir, O.; Srebro, N. Distributed stochastic optimization and learning. In Proceedings of the 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 30 September–3 October 2014; pp. 850–857.
- Chang, X.; Lin, S.B.; Zhou, D.X. Distributed semi-supervised learning with kernel ridge regression. J. Mach. Learn. Res. 2017, 18, 1493–1514.
- Hu, T.; Wu, Q.; Zhou, D.X. Distributed kernel gradient descent algorithm for minimum error entropy principle. Appl. Comput. Harmon. Anal. 2020, 49, 229–256.
- Sun, H.; Wu, Q. Optimal rates of distributed regression with imperfect kernels. J. Mach. Learn. Res. 2021, 22, 1–34.
- Sun, Q.; Zhou, W.X.; Fan, J. Adaptive Huber regression. J. Am. Stat. Assoc. 2020, 115, 254–265.
- Feng, Y.; Wu, Q. A framework of learning through empirical gain maximization. Neural Comput. 2021, 33, 1656–1697.
- Erdogmus, D.; Principe, J.C. Comparison of entropy and mean square error criteria in adaptive system training using higher order statistics. Proc. ICA 2000, 5, 6.
- Song, Y.; Liang, X.; Zhu, Y.; Lin, L. Robust variable selection with exponential squared loss for the spatial autoregressive model. Comput. Stat. Data Anal. 2021, 155, 107094.
- Feng, Y.; Fan, J.; Suykens, J.A. A statistical learning approach to modal regression. J. Mach. Learn. Res. 2020, 21, 1–35.
- Feng, Y.; Huang, X.; Shi, L.; Yang, Y.; Suykens, J.A. Learning with the maximum correntropy criterion induced losses for regression. J. Mach. Learn. Res. 2015, 16, 993–1034.
- Feng, Y.; Ying, Y. Learning with correntropy-induced losses for regression with mixture of symmetric stable noise. Appl. Comput. Harmon. Anal. 2020, 48, 795–810.
- Gunduz, A.; Principe, J.C. Correntropy as a novel measure for nonlinearity tests. Signal Process. 2009, 89, 14–23.
- He, R.; Zheng, W.S.; Hu, B.G.; Kong, X.W. A regularized correntropy framework for robust pattern recognition. Neural Comput. 2011, 23, 2074–2100.
- Bessa, R.J.; Miranda, V.; Gama, J. Entropy and correntropy against minimum square error in offline and online three-day ahead wind power forecasting. IEEE Trans. Power Syst. 2009, 24, 1657–1666.
- Holland, P.W.; Welsch, R.E. Robust regression using iteratively reweighted least-squares. Commun. Stat.-Theory Methods 1977, 6, 813–827.
- Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337–404.
- Smale, S.; Zhou, D.X. Learning theory estimates via integral operators and their approximations. Constr. Approx. 2007, 26, 153–172.
- Caponnetto, A.; De Vito, E. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 2007, 7, 331–368.
- Steinwart, I.; Christmann, A. Support Vector Machines; Springer Science & Business Media: Berlin, Germany, 2008.
- Blanchard, G.; Mücke, N. Optimal rates for regularization of statistical inverse learning problems. Found. Comput. Math. 2018, 18, 971–1013.
- Santamaría, I.; Pokharel, P.P.; Principe, J.C. Generalized correlation function: Definition, properties, and application to blind equalization. IEEE Trans. Signal Process. 2006, 54, 2187–2197.
- Liu, W.; Pokharel, P.P.; Principe, J.C. Correntropy: Properties and applications in non-Gaussian signal processing. IEEE Trans. Signal Process. 2007, 55, 5286–5298.
- Principe, J.C. Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010.
- Yao, Y.; Rosasco, L.; Caponnetto, A. On early stopping in gradient descent learning. Constr. Approx. 2007, 26, 289–315.
- Durrett, R. Probability: Theory and Examples; Cambridge University Press: Cambridge, UK, 2017.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).