Semi-Supervised Minimum Error Entropy Principle with Distributed Method

The minimum error entropy (MEE) principle is an alternative to the classical least squares criterion, owing to its robustness to non-Gaussian noise. This paper studies the gradient descent algorithm for MEE in a semi-supervised setting with a distributed (divide-and-conquer) method, and shows that using the additional information carried by unlabeled data can enhance the learning ability of the distributed MEE algorithm. Our result proves that the mean squared error of the distributed gradient descent MEE algorithm can be minimax optimal for regression, provided the number of local machines grows at most polynomially with the total sample size.


Introduction
The minimum error entropy (MEE) principle is an important criterion proposed in information theoretic learning (ITL) [1] and was first introduced for adaptive system training by Erdogmus and Principe [2]. It has been applied to blind source separation, maximally informative subspace projections, clustering, feature selection, blind deconvolution, minimum cross-entropy model selection, and other topics [3][4][5][6][7][8]. Taking entropy as a measure of the error, the MEE principle can fully exploit the information contained in the data and yields robustness to outliers in the implementation of algorithms.
Let X be an explanatory variable taking values in a compact metric space (X, d), let Y be a real response variable with Y ∈ Y ⊂ R, and let g : X → Y be a prediction function. For a given set of labeled examples D = {(x_i, y_i)}_{i=1}^N ⊂ X × Y (N denotes the sample size) and a windowing function G : R → R_+, the MEE principle finds a minimizer of the empirical quadratic entropy of the error, computed with the windowing function G at a scaling parameter h > 0. Its goal is to solve the regression problem y = g_ρ(x) + ε, where ε is the noise and g_ρ is the target function. By introducing the pairwise function f(x_i, x_j) := g(x_i) − g(x_j), MEE belongs to the class of pairwise learning problems, which involve interactions between example pairs. Since the logarithm is monotonic, we only need to consider the empirical information error of MEE. The number of example pairs grows quadratically with the sample size N, which brings a heavy computational burden in the MEE implementation. Thus, it is necessary to reduce the algorithmic complexity by a distributed method based on a divide-and-conquer strategy [10].

Semi-supervised learning (SSL) [11] has attracted extensive attention as an emerging field in machine learning research and data mining. In many practical problems only few labeled data are given, whereas a large number of unlabeled data are available, since labeling data requires a lot of time, effort, or money.
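For orientation, the empirical quadratic (Rényi) entropy of the error e_i := y_i − g(x_i) and the associated empirical information error take, in the standard MEE literature (e.g., [1,2]) and up to normalizing constants, the following form; this is a sketch of the usual definitions, and the exact normalization used in this paper may differ:

\[
\widehat{H}_D(g) = -\log\Big(\frac{1}{N^2 h}\sum_{i=1}^{N}\sum_{j=1}^{N} G\Big(\frac{(e_i - e_j)^2}{2h^2}\Big)\Big),
\qquad
\mathcal{E}_D(g) = -\frac{h^2}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G\Big(\frac{(e_i - e_j)^2}{2h^2}\Big).
\]

Since the negative logarithm is monotone, minimizing \(\widehat{H}_D\) over a hypothesis space is equivalent to minimizing \(\mathcal{E}_D\); moreover, e_i − e_j = (y_i − y_j) − f(x_i, x_j) with f(x_i, x_j) := g(x_i) − g(x_j), which is the pairwise form used below.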
In this paper, we study a distributed MEE algorithm in the framework of SSL and show that the learning ability of the MEE algorithm can be enhanced by the distributed method and the combination of labeled data with unlabeled data. There are mainly three contributions in this paper. The first one is that we derive the explicit learning rate of the gradient descent method for distributed MEE in the context of SSL, which is comparable to the minimax optimal rate of the least squares in regression. This implies that the MEE algorithm can be an alternative of the least squares in SSL in the sense that both of them have the same prediction power. The second one is that we provide the theoretical upper bound for the number of local machines guaranteeing the optimal rate in the distributed computation. The last one is that we extend the range of the target function allowed in the distributed MEE algorithm.
In Table 1, we summarize the notation used in this paper.

Table 1. List of notations used throughout the paper.

Notation        Meaning of the Notation
N               the size of the labeled data set D
⌊N/4⌋           the largest integer not exceeding N/4
|D|             the cardinality of D, |D| = N
D*              the unlabeled data set D* = {x_1, . . . , x_S}
S               the size of the unlabeled data set D*
|D*|            the cardinality of D*, |D*| = S
D̄_l             the lth subset of D̄, 1 ≤ l ≤ m
G               the loss (windowing) function of the MEE algorithm
L_K             the integral operator associated with K
L_{K,D̄}         the empirical operator of L_K on D̄
f_{t+1,D̄}       the function output by the kernel gradient descent MEE algorithm with data D̄ and kernel K after t iterations
f_{t+1,D̄_l}     the function output by the kernel gradient descent MEE algorithm with data D̄_l and kernel K after t iterations
f̄_{t+1,D̄}       the global output averaging over the local outputs f_{t+1,D̄_l}, l = 1, . . . , m

Algorithms and Main Results
We consider MEE for the regression problem. To allow noise in the sampling process, we assume that a Borel probability measure ρ is defined on the product space X × Y. Let ρ(y|x) be the conditional distribution of y ∈ Y given x ∈ X, and ρ_X the marginal distribution on X. For the semi-supervised MEE algorithm, our goal is to estimate the regression function g_ρ from labeled examples D = {(x_i, y_i)}_{i=1}^N and unlabeled examples D* = {x_j}_{j=1}^S drawn from the distributions ρ and ρ_X, respectively. Based on the divide-and-conquer strategy, both D and D* are partitioned equally into m subsets, and each labeled subset D_l is combined with the corresponding unlabeled subset D*_l to form a new data set D̄ = ∪_{l=1}^m D̄_l. Here, we denote the subset sizes by |D_l| = n and |D*_l| = s, 1 ≤ l ≤ m, i.e., N = mn and S = ms. Based on the gradient descent algorithm (Equation (2)), we obtain a set of local estimators {f_{t,D̄_l}} for the subsets D̄_l, 1 ≤ l ≤ m; the global estimator f̄_{t,D̄} is then the average of these local estimators (Equation (3)).

In the pairwise setting, our target function is f_ρ(x, x') = g_ρ(x) − g_ρ(x'), x, x' ∈ X, the pairwise difference of the regression function g_ρ. Denote by L²_{ρ_{X²}} the space of square integrable functions on the product space X². The goodness of f̄_{t,D̄} is usually measured by the mean squared error ‖f̄_{t,D̄} − f_ρ‖²_{L²}. Throughout the paper, we assume that sup_{(x,x')∈X²} K((x, x'), (x, x')) ≤ 1 and that |y| ≤ M almost surely for some constant M > 0. Without loss of generality, the windowing function G is assumed to be differentiable and to satisfy G'(0) = −1, G'(u) < 0 for u > 0, C_G := sup_{u∈(0,∞)} |G'(u)| < ∞, and a decay condition of order p with some constant c_p > 0. It is easy to check that the Gaussian windowing function G(u) = exp{−u} satisfies the assumptions above with p = 1.
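The following Python sketch illustrates the kernel gradient descent MEE iteration (Equation (2)) on one subset and the divide-and-conquer averaging (Equation (3)). The product pairwise kernel K((u, u'), (x, x')) = k(u, x)k(u', x') built from a Gaussian base kernel k, and the merging rule for D̄_l (zero pseudo-labels for unlabeled points, labeled responses rescaled by |D̄_l|/|D_l|) are illustrative assumptions and need not coincide with the paper's displayed formulas; the windowing function is the Gaussian choice G(u) = exp(−u) mentioned above.

```python
import numpy as np

def G_prime(u):
    """Derivative of the Gaussian windowing function G(u) = exp(-u); note G'(0) = -1."""
    return -np.exp(-u)

def base_gram(a, b, sigma=0.5):
    """Gaussian base kernel matrix k(a_i, b_j) on the input space (hypothetical choice)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

def local_mee_gd(x, y, eta=0.5, T=100, h=1.0, sigma=0.5):
    """Kernel gradient descent for the empirical MEE information error on one subset.

    The pairwise estimator f_t(u, u') = sum_{i,j} C[i, j] k(u, x_i) k(u', x_j) uses the
    (assumed) product kernel K((u, u'), (x_i, x_j)) = k(u, x_i) k(u', x_j); only its
    coefficient matrix C is stored.
    """
    n = len(x)
    Kx = base_gram(x, x, sigma)            # n x n Gram matrix of the base kernel
    DY = y[:, None] - y[None, :]           # pairwise responses y_i - y_j
    C = np.zeros((n, n))                   # f_1 = 0
    for _ in range(T):
        F = Kx @ C @ Kx                    # f_t evaluated at all example pairs
        E = DY - F                         # pairwise residuals
        W = -G_prime(E ** 2 / (2 * h ** 2)) * E   # MEE reweighting of each pair
        C += eta * W / (n * n)             # gradient step in H_K
    return x, C

def distributed_mee(x, y, x_unlab, m, **kw):
    """Divide-and-conquer: merge labeled and unlabeled points in each subset, run the
    local MEE iteration, and average the local estimators at prediction time."""
    parts = np.array_split(np.arange(len(x)), m)
    uparts = np.array_split(np.arange(len(x_unlab)), m)
    models = []
    for p, q in zip(parts, uparts):
        xl, yl, xu = x[p], y[p], x_unlab[q]
        scale = (len(xl) + len(xu)) / len(xl)     # assumed rescaling |D_bar_l| / |D_l|
        xbar = np.concatenate([xl, xu])
        ybar = np.concatenate([scale * yl, np.zeros(len(xu))])  # zero pseudo-labels (assumption)
        models.append(local_mee_gd(xbar, ybar, **kw))
    return models

def predict_avg(models, u1, u2, sigma=0.5):
    """Evaluate the averaged pairwise estimator at test pairs (u1[a], u2[a])."""
    preds = []
    for x, C in models:
        K1, K2 = base_gram(u1, x, sigma), base_gram(u2, x, sigma)
        preds.append(np.einsum("ai,ij,aj->a", K1, C, K2))
    return np.mean(preds, axis=0)
```

Because the assumed product kernel factorizes, the estimator is stored as an n × n coefficient matrix C and evaluated via Kx C Kx, which avoids ever forming the n² × n² pairwise Gram matrix.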
Before presenting our main results, we define the integral operator L_K : L²_{ρ_{X²}} → L²_{ρ_{X²}} associated with the kernel K. Our error analysis for the distributed MEE algorithm (Equation (3)) is stated in terms of a regularity (source) condition on f_ρ of index r > 0 (Equation (5)), where L_K^r denotes the r-th power of L_K on L²_{ρ_{X²}}; this power is well defined, since the operator L_K is positive and compact for the Mercer kernel K. We use the effective dimension [12,13] N(λ), defined as the trace of the operator (λI + L_K)^{-1}L_K, to measure the complexity of H_K with respect to ρ_X. To obtain optimal learning rates, we need to quantify N(λ); a suitable assumption is a polynomial bound of order β (Equation (6)). This kind of eigenvalue (capacity) assumption is typical in the analysis of the performance of kernel-based estimators and was recently used in References [13,15,16] to establish optimal learning rates for least squares problems.
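As a sketch of how these conditions usually read in the kernel learning literature (the exact constants and indices in the paper's Equations (5) and (6) may differ):

\[
L_K f = \int_{\mathcal{X}^2} f(x, x')\, K\big(\cdot, (x, x')\big)\, d\rho_{X^2}(x, x'), \qquad f \in L^2_{\rho_{X^2}},
\]
\[
\text{regularity:}\quad f_\rho = L_K^{\,r}(u_\rho) \ \text{ for some } u_\rho \in L^2_{\rho_{X^2}} \text{ with } \|u_\rho\|_{L^2_{\rho_{X^2}}} \le R, \qquad r > 0,
\]
\[
\text{capacity:}\quad \mathcal{N}(\lambda) = \mathrm{Tr}\big((\lambda I + L_K)^{-1} L_K\big) \le C_0\, \lambda^{-\beta}, \qquad 0 < \beta \le 1,\ \lambda > 0.
\]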
The following theorem shows that the distributed gradient descent algorithm (Equation (3)) can achieve the optimal rate when the number of iterations T and the maximal number of local machines are chosen appropriately; its proof can be found in Section 3.

Theorem 1 (Main Result). Assume that Equations (5) and (6) hold. If the iteration number T and the number of local machines m satisfy the restriction in Equation (7), then for any 0 < δ < 1, with confidence at least 1 − δ, the error bound in Equation (8) holds, where C is a constant independent of N, S, δ, and h, and ⌊N/4⌋ denotes the largest integer not exceeding N/4.

Corollary 1.
Under the same conditions as in Theorem 1, if the scaling parameter h is chosen appropriately, then for any 0 < δ < 1, with confidence at least 1 − δ, the learning rate in Equation (9) holds.

The learning rate in Equation (9) is optimal in the minimax sense for kernel regression problems [13]. When m = 1, the result in Equation (9) shows that the kernel gradient descent MEE algorithm (Equation (2)) on a single big data set can achieve the minimax optimal rate for regression; thus, MEE is a sound alternative to the classical least squares. Meanwhile, the upper bound (Equation (7)) on the number of local machines implies that the performance of the distributed MEE algorithm (Equation (3)) can be as good as that of the standard MEE algorithm (Equation (2)) acting on the whole data set D̄, provided that the size n + s of each subset D̄_l is not too small.
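For orientation, in analyses of this kind the rate in Equation (9) is of the order (a sketch; the constant depends on δ, and the exact exponent in the paper's display is not reproduced here):

\[
\big\|\bar{f}_{T+1,\bar{D}} - f_\rho\big\|_{L^2_{\rho_{X^2}}}^2 = O\Big(N^{-\frac{2r}{2r+\beta}}\Big),
\]

which matches the minimax lower bound for kernel-based regression under the regularity and capacity conditions above [13].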

Remark 3.
If no unlabeled data are used in the algorithm (Equation (3)), then S = 0 and the upper bound (Equation (7)) on the number of local machines m that ensures the optimal rate is about O(N^{(r−1/2)/(2r+β)}). So, when the regularity parameter r in Equation (5) approaches 1/2, this upper bound reduces to a constant, and the distributed algorithm (Equation (3)) is then not feasible in real applications. A similar phenomenon is observed in various distributed algorithms [15][16][17][18]. When the size of unlabeled data S > 0, we see from Equation (7) that the upper bound on m keeps growing as S increases while the size of labeled data N is fixed. For example, let β > 1/2 and S = N^{1/(2r+β)}; then the upper bound in Equation (7) is O(N^{r/(2r+β)}), which does not reduce to a constant as r → 1/2. Hence, with sufficient unlabeled data D*, the distributed algorithm (Equation (3)) allows more local machines in the distributed method.

Remark 4.
A series of distributed works [15][16][17][18][19] were carried out under the assumption that the target function f_ρ lies in the space H_K, i.e., that the regularity parameter satisfies r ≥ 1/2. As a byproduct, our analysis in Theorem 1 does not impose the restriction r ≥ 1/2 on the distributed algorithm (Equation (3)).

Proof of Main Result
In this section we prove our main result, Theorem 1. To this end, we introduce the data-free (population) gradient descent method in H_K for least squares, defined by f_1 = 0 and a gradient step along L_K; recalling the definition of L_K, the iterates admit a closed-form expression in terms of L_K and f_ρ. Following the standard decomposition technique in learning theory, we split the error f̄_{t+1,D̄} − f_ρ into the sample error f̄_{t+1,D̄} − f_{t+1} and the approximation error f_{t+1} − f_ρ.
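In its usual form (a sketch consistent with the step size η of Equation (2); the indexing in the paper's display may differ), the data-free iteration and its closed form are:

\[
f_{t+1} = f_t - \eta\, L_K (f_t - f_\rho), \qquad f_1 = 0,
\]
\[
f_{t+1} = \big(I - (I - \eta L_K)^{t}\big) f_\rho = \eta \sum_{k=0}^{t-1} (I - \eta L_K)^{k} L_K f_\rho .
\]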

Sample Error
Define the empirical operator L_{K,D̄} : H_K → H_K associated with the data set D̄ and the kernel K; its action on any f ∈ H_K is the empirical analogue of L_K f. With this operator, the MEE gradient descent algorithm (Equation (2)) on D̄ can be written in operator form, with explicit remainder terms, and several auxiliary quantities are introduced for the subsequent estimates. With these preliminaries in place, we now turn to the estimate of the sample error f̄_{t+1,D̄} − f_{t+1}, presented in the following lemma, whose proof can be found in the Appendix.
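A natural empirical counterpart of L_K in this pairwise setting is the following (a sketch; the paper's definition on D̄ may carry an additional weighting):

\[
L_{K,\bar{D}}\, f = \frac{1}{|\bar{D}|^2} \sum_{x_i \in \bar{D}}\ \sum_{x_j \in \bar{D}} f(x_i, x_j)\, K\big(\cdot, (x_i, x_j)\big), \qquad f \in H_K,
\]

that is, the integral with respect to ρ_{X²} = ρ_X × ρ_X is replaced by the average over all pairs of sample points in D̄.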

Lemma 3.
Let λ > 0 and 0 < η < min{C_G^{-1}, 1}. Then, for any f* ∈ H_K, the sample error admits a bound in terms of the quantities defined below, where the constant c_{p,M} = 2^{4p+2} c_p C_G^{2p+1} M^{2p+1}.

With the help of the lemma above, to bound the sample error ‖f̄_{T+1,D̄} − f_{T+1}‖_{L²}, we first need to estimate the quantities B_{D̄,λ}, C_{D̄,λ}, D_{D̄,λ}, F_{D̄,λ}, and G_{D̄,λ}. Denote by A_{D,λ} the standard quantity determined by ⌊|D|/4⌋ (|D| is the cardinality of D) and λ. In previous work [19,21,22,23], it has been shown that each of the corresponding inequalities holds with confidence at least 1 − δ. By Lemma 3, we also see that the function f* is crucial for determining f̄_{T+1,D̄} − f_{T+1}. To obtain a tight bound on the learning error, we choose an appropriate f* ∈ H_K according to the regularity of the target function: when r ≥ 1/2, f_ρ ∈ H_K and we take f* = f_ρ; when 0 < r < 1/2, f_ρ lies outside the space H_K and we let f* = 0. We now give the first main result for the case in which the target function f_ρ is outside H_K, i.e., 0 < r < 1/2.

Theorem 2. Assume Equation (5) holds for 0 < r < 1/2. Let 0 < η < min{1, C_G^{-1}}, T ∈ N, and λ = T^{-1}. Then, for any 0 < δ < 1, with probability at least 1 − δ, there holds a bound on ‖f̄_{T+1,D̄} − f_ρ‖_{L²}, where C* is a constant given in the proof and J_{D,D̄,λ} is a supremum-type quantity defined there.

The estimate of ‖f_{T+1} − f_ρ‖_{L²} is presented in Lemma 1; we only need to handle ‖f̄_{T+1,D̄} − f_{T+1}‖_{L²} by Lemma 3.
Next, we give the result for the case in which the target function f_ρ lies in H_K, with r ≥ 1/2.
Proof of Theorem 1. We first prove Equation (8) for the case 0 < r < 1/2, taking λ = T^{-1}. Notice that |D| = N, |D̄| = N + S, and m|D_l| = |D|, m|D̄_l| = |D̄| for 1 ≤ l ≤ m. Since r + β > 1/2 and Equation (7) holds, we can estimate the relevant quantities for D, for D̄, and for each local subset D̄_l, l = 1, ..., m. Putting these estimates into Theorem 2, we obtain the desired conclusion (Equation (8)) with an explicit constant C. When r ≥ 1/2, we apply Theorem 3 and follow the same proof procedure as above; the conclusion (Equation (8)) is then obtained in the same way. The proof is complete.

Simulation and Conclusions
In this section, we provide a simulation to verify our theoretical statements. We assume that the inputs {x_i} are independently drawn from the uniform distribution on [0, 1]. Consider the regression model y_i = g_ρ(x_i) + ε_i, i = 1, ..., N, where the ε_i are independent Gaussian noises N(0, 0.1²) and g_ρ is a given regression function. We define a pairwise Mercer kernel K on X² and apply it to the distributed algorithm (Equation (3)). In Figure 1, we plot the mean squared error of Equation (3) for N = 600 and S = 0, 300, 600 as the number of local machines m varies. When S = 0, the method is the standard distributed MEE algorithm without unlabeled data; as m becomes large, the corresponding (red) error curve increases dramatically. However, when we add 300 or 600 unlabeled data points, the error curves increase only very slowly. This coincides with our theory that using unlabeled data enlarges the admissible range of m in the distributed method.

This paper studied the convergence rate of the distributed gradient descent MEE algorithm in a semi-supervised setting. Our results demonstrate that using additional unlabeled data can improve the learning performance of the distributed MEE algorithm, in particular by enlarging the range of m for which the optimal learning rate is guaranteed. As is well known, there are still gaps between theory and empirical studies; we regard this paper as mainly a theoretical contribution and expect the analysis to give some guidance for real applications.
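As an illustration of this experiment, the following sketch reuses the functions distributed_mee and predict_avg from the earlier sketch. The regression function g_ρ, the kernel, and the algorithmic parameters are hypothetical choices (the paper's Figure 1 settings are not restated here); only the mean squared error of the averaged pairwise estimator against f_ρ(x, x') = g_ρ(x) − g_ρ(x') is reported.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 600                                   # labeled sample size, as in the experiment
g_rho = lambda x: np.sin(2 * np.pi * x)   # hypothetical regression function
x_lab = rng.uniform(0.0, 1.0, N)          # inputs uniform on [0, 1]
y_lab = g_rho(x_lab) + rng.normal(0.0, 0.1, N)   # Gaussian noise N(0, 0.1^2)

u1, u2 = rng.uniform(0, 1, 2000), rng.uniform(0, 1, 2000)   # test pairs
f_rho_test = g_rho(u1) - g_rho(u2)        # pairwise target f_rho(u, u')

for S in (0, 300, 600):                   # sizes of the unlabeled data set D*
    x_unlab = rng.uniform(0.0, 1.0, S)
    for m in (1, 5, 10, 20, 30):          # number of local machines
        models = distributed_mee(x_lab, y_lab, x_unlab, m, T=100, eta=0.5, h=1.0)
        mse = np.mean((predict_avg(models, u1, u2) - f_rho_test) ** 2)
        print(f"S={S:3d}  m={m:2d}  MSE={mse:.4f}")
```

With these (assumed) settings, one expects the error for S = 0 to deteriorate as m grows, while the runs with S = 300 or S = 600 unlabeled points degrade much more slowly, mirroring the behavior described for Figure 1.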

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix

The proofs of the lemmas rely on two auxiliary bounds, stated as Equations (A1) and (A2). A key intermediate estimate has been proven in Reference [22]; with it, we can follow the proof procedure in Proposition 1 of Reference [24] to prove Equations (A2) and (A1).
With the help of the lemmas above, we can prove Lemma 3.
Firstly, we bound I_1, which is the most difficult term to handle. It can be decomposed into several parts, each of which is then easily bounded by Lemma A1 together with the fact that f_{1,D̄_l} = f_1 = 0. Combining this with the bounds in Equations (A6), (A7), and (A8), we obtain the desired conclusion (Equation (15)).