Huber Regression Analysis with a Semi-Supervised Method

Abstract: In this paper, we study the regularized Huber regression algorithm in a reproducing kernel Hilbert space (RKHS), which is applicable to both fully supervised and semi-supervised learning schemes. Our focus in this work is two-fold: first, we provide the convergence properties of the algorithm with fully supervised data. We establish optimal convergence rates in the minimax sense when the regression function lies in the RKHS. Second, we improve the learning performance of the Huber regression algorithm by a semi-supervised method. We show that, with sufficient unlabeled data, the minimax optimal rates can be retained even if the regression function lies outside the RKHS.

MSC: 62J02


Introduction
Ordinary least squares (OLS) is an important statistical tool in regression analysis. However, OLS does not perform well when the data are contaminated by outliers or heavy-tailed noise. Thus, OLS is suboptimal for robust regression analysis, and a variety of robust loss functions that are less easily affected by such noise have been developed. Among them, the Huber loss function is a popular choice in statistics, machine learning, and optimization, since it is less sensitive to outliers and can handle heavy-tailed errors effectively. Huber regression was initiated by Peter Huber in his seminal work [1,2]. Statistical bounds and convergence properties for Huber estimation and inference have been further investigated in subsequent works; see, e.g., [3][4][5][6][7][8][9].
Semi-supervised learning has been gaining increased attention as an active research area in science and engineering. The original idea of the semi-supervised method dates back to self-learning in the context of classification [10] and was later developed in decision-directed learning, co-training in text classification, and manifold learning [11][12][13]. Most existing research on Huber regression is carried out in the supervised framework, where unlabeled data have been deemed useless and thus discarded in the design of algorithms. Recently, a vast literature has shown that utilizing the additional information in unlabeled data can effectively improve the learning performance of algorithms; see, e.g., [14][15][16][17][18]. In this paper, we focus on the performance of the Huber regression algorithm with unlabeled data. By the semi-supervised method, we find that optimal learning rates are attainable if sufficient unlabeled data are added to the Huber regression analysis.
In the standard framework of statistical learning, we let the explanatory variable $X$ take values in a compact domain $\mathcal{X}$ in a Euclidean space, and the response variable $Y$ take values in the output space $\mathcal{Y} \subset \mathbb{R}$. This work investigates the application of the Huber loss in connection with the regression model
$$Y = f^*(X) + \epsilon,$$
where $f^*$ is the regression function and $\epsilon$ is the noise in the regression model. Let $\rho$ be a Borel probability measure on the product space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Let $\rho_{\mathcal{X}}$ and $\rho(y|x)$ denote the marginal distribution of $\rho$ on $\mathcal{X}$ and the conditional distribution on $\mathcal{Y}$ given $x \in \mathcal{X}$, respectively. In the supervised learning setting, $\rho$ is assumed to be unknown and the purpose of regression is to estimate $f^*(X)$ from a sample $D = \{(x_i, y_i)\}_{i=1}^N$ drawn independently from $\rho$, where $N$ is the sample size, the cardinality of $D$. The Huber loss function $\ell_\sigma(\cdot)$ is defined as
$$\ell_\sigma(u) = \begin{cases} u^2, & |u| \le \sigma, \\ 2\sigma|u| - \sigma^2, & |u| > \sigma, \end{cases}$$
where $\sigma > 0$ is a robustification parameter. Given a prediction function $f: \mathcal{X} \to \mathcal{Y}$, Huber regression searches for a good approximation of $f^*(X)$ by minimizing the empirical prediction error with the Huber loss,
$$\mathcal{E}_D^\sigma(f) = \frac{1}{N} \sum_{i=1}^N \ell_\sigma\big(y_i - f(x_i)\big), \tag{1}$$
over a suitable hypothesis space.
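For concreteness, the following short Python sketch evaluates the Huber loss in the scaling used above (quadratic up to $\sigma$, linear beyond). Note that other references divide this expression by 2; the function name and scaling here are illustrative choices, not a prescribed implementation.

```python
import numpy as np

def huber_loss(u, sigma):
    """Huber loss: quadratic for |u| <= sigma, linear beyond.

    Scaling matches the definition above (u^2 vs. 2*sigma*|u| - sigma^2);
    other references divide the whole expression by 2.
    """
    u = np.asarray(u, dtype=float)
    quadratic = u ** 2
    linear = 2.0 * sigma * np.abs(u) - sigma ** 2
    return np.where(np.abs(u) <= sigma, quadratic, linear)

# Small residuals are penalized quadratically, large ones only linearly,
# which is what makes the estimator insensitive to occasional outliers.
print(huber_loss([0.1, 5.0], sigma=1.0))  # [0.01, 9.0]
```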
In this work, we study the kernel-based Huber regression algorithm, in which the minimization of (1) is performed in a reproducing kernel Hilbert space (RKHS) [19]. Recall that $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a Mercer kernel if it is continuous, symmetric, and positive semidefinite. The RKHS $\mathcal{H}_K$ is the completion of the linear span of the function set $\{K_x = K(x, \cdot), x \in \mathcal{X}\}$ with the inner product induced by $\langle K_x, K_y \rangle_K = K(x, y)$. The reproducing property is given by $f(x) = \langle f, K_x \rangle_K$. Note that, by the Cauchy-Schwarz inequality and [19],
$$\|f\|_\infty \le \sup_{x \in \mathcal{X}} \sqrt{K(x, x)}\, \|f\|_K, \qquad f \in \mathcal{H}_K.$$
To avoid overfitting, the regularized Huber regression algorithm in the RKHS $\mathcal{H}_K$ is given as
$$f_{D,\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{N} \sum_{i=1}^N \ell_\sigma\big(y_i - f(x_i)\big) + \lambda \|f\|_K^2 \right\}, \tag{2}$$
where $\lambda > 0$ is a regularization parameter.
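Algorithm (2) does not prescribe a particular solver. One common way to compute $f_{D,\lambda}$ in practice is to expand $f = \sum_i \alpha_i K_{x_i}$ and run iteratively reweighted least squares (IRLS). The sketch below assumes a Gaussian kernel and the Huber scaling above; the helper names (`gaussian_kernel`, `kernel_huber_fit`) are hypothetical and the solver is one option, not the paper's method.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=0.5):
    # K(x, u) = exp(-gamma * ||x - u||^2); X and Z are (n, d) arrays
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_huber_fit(X, y, sigma=1.0, lam=0.1, n_iter=50):
    """IRLS sketch for minimizing (1/N) sum ell_sigma(y_i - f(x_i)) + lam * ||f||_K^2
    over f = sum_i alpha_i K(x_i, .), i.e., one way to compute f_{D,lambda} of (2)."""
    N = len(y)
    K = gaussian_kernel(X, X)
    alpha = np.zeros(N)
    for _ in range(n_iter):
        r = y - K @ alpha                                   # current residuals
        # Huber weights: 1 on the quadratic branch, sigma/|r| on the linear branch
        w = np.minimum(1.0, sigma / np.maximum(np.abs(r), 1e-12))
        # each step is a weighted kernel ridge solve: (W K + N*lam*I) alpha = W y
        alpha = np.linalg.solve(w[:, None] * K + N * lam * np.eye(N), w * y)
    return alpha

# Predictions on new points X_new: gaussian_kernel(X_new, X) @ alpha
```

Each step is a weighted kernel ridge regression, so a fixed point of the iteration satisfies the same first-order condition as the minimizer of (2) for this loss scaling.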
In this paper, we derive the explicit learning rate of Algorithm (2) in the supervised setting, which is comparable to the minimax optimal rate of OLS. By a semi-supervised method, we show that utilizing unlabeled data can overcome the bottleneck that optimal learning rates for Algorithm (2) are only achievable when $f^*$ lies in $\mathcal{H}_K$.

Assumptions and Main Results
To present our main results, we introduce some necessary assumptions. In this section, we study the convergence of $f_{D,\lambda}$ to $f^*$ in the square integrable space $(L^2_{\rho_{\mathcal{X}}}, \|\cdot\|_\rho)$. Below, we elaborate on three important assumptions needed to carry out the analysis. The first assumption (3) concerns the regularity of the regression function $f^*$. Define the integral operator $L_K: L^2_{\rho_{\mathcal{X}}} \to L^2_{\rho_{\mathcal{X}}}$ associated with the kernel $K$ by
$$L_K f = \int_{\mathcal{X}} K(\cdot, x) f(x) \, d\rho_{\mathcal{X}}(x).$$
Since $K$ is a Mercer kernel on the compact domain $\mathcal{X}$, $L_K$ is compact and positive. Thus, the $r$-th power $L_K^r$ of $L_K$ is well defined for $r > 0$ [20]. Our error bounds are stated in terms of the regularity of $f^*$, given by
$$f^* = L_K^r g_0 \quad \text{for some } g_0 \in L^2_{\rho_{\mathcal{X}}} \text{ and } r > 0. \tag{3}$$
The condition (3) characterizes the regularity of $f^*$ and is directly related to the smoothness of $f^*$ when $\mathcal{H}_K$ is a Sobolev space. If (3) holds with $r \ge \frac{1}{2}$, then $f^*$ lies in the space $\mathcal{H}_K$ [21].
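To make the meaning of $L_K^r$ concrete, the following LaTeX fragment records its spectral form, assuming (as Mercer's theorem guarantees on a compact domain) an eigen-decomposition of $L_K$ with eigenpairs $(\mu_k, \phi_k)$; this is the standard reading of (3), not an additional assumption of the paper.

```latex
% Spectral reading of the source condition (3), f^* = L_K^r g_0.
% (mu_k, phi_k) are the eigenvalues/eigenfunctions of L_K on L^2_{rho_X},
% and f^* is assumed to lie in the closure of the range of L_K.
\[
  L_K^{r} f \;=\; \sum_{k \ge 1} \mu_k^{\,r}\, \langle f, \phi_k \rangle_{\rho}\, \phi_k ,
  \qquad
  f^* = L_K^{r} g_0,\ g_0 \in L^2_{\rho_{\mathcal X}}
  \;\Longleftrightarrow\;
  \sum_{k \ge 1} \mu_k^{-2r}\, \langle f^*, \phi_k \rangle_{\rho}^{2} \;<\; \infty .
\]
% Larger r forces faster decay of the coefficients of f^*, i.e., a smoother target.
```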
The second assumption (4) concerns the capacity of $\mathcal{H}_K$, measured by the effective dimension [22][23][24]
$$\mathcal{N}(\lambda) = \mathrm{Trace}\big((L_K + \lambda I)^{-1} L_K\big), \qquad \lambda > 0,$$
where $I$ is the identity operator on $\mathcal{H}_K$. In this paper, we assume that
$$\mathcal{N}(\lambda) \le C \lambda^{-s}, \qquad 0 < s \le 1, \tag{4}$$
for some constant $C > 0$. This condition measures the complexity of $\mathcal{H}_K$ with respect to the marginal distribution $\rho_{\mathcal{X}}$ and is typical in the analysis of the performance of kernel estimators. It is always satisfied with $s = 1$ by taking the constant $C = \mathrm{Trace}(L_K)$. When $\mathcal{H}_K$ is a Sobolev space $W^\alpha(\mathcal{X})$, $\mathcal{X} \subset \mathbb{R}^n$, consisting of functions with all derivatives of order up to $\alpha > \frac{n}{2}$, then (4) is satisfied with $s = \frac{n}{2\alpha}$ [25]. When $0 < s < 1$, (4) is weaker than the eigenvalue decay assumption in the literature [17,23].
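As a quick numerical illustration of (4), the effective dimension can be approximated from a sample by replacing $L_K$ with the normalized kernel matrix $K/N$. The sketch below is such an empirical proxy; the sampled design and kernel bandwidth are arbitrary choices made only for illustration.

```python
import numpy as np

def effective_dimension(K, lam):
    """Empirical proxy for N(lambda) = Trace(L_K (L_K + lambda I)^{-1}),
    using the eigenvalues of the normalized kernel matrix K/N as stand-ins
    for the eigenvalues of the integral operator L_K."""
    mu = np.linalg.eigvalsh(K) / K.shape[0]   # eigenvalues of K/N
    mu = np.clip(mu, 0.0, None)               # guard against round-off negatives
    return float(np.sum(mu / (mu + lam)))

# Illustration with an arbitrary design and Gaussian kernel:
rng = np.random.default_rng(0)
X = rng.uniform(size=200)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
for lam in (1e-1, 1e-2, 1e-3):
    print(lam, effective_dimension(K, lam))   # grows as lambda shrinks, roughly like C * lam**(-s)
```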
The third assumption concerns the conditional probability distribution $\rho(y|x)$ on the output space $\mathcal{Y}$. We assume that the output variable $Y$ satisfies the moment condition that there exist two positive constants $t, M > 0$ such that, for any integer $q \ge 2$,
$$\int_{\mathcal{Y}} |y|^q \, d\rho(y|x) \le \frac{q!}{2} M^2 t^{q-2} \quad \text{for } \rho_{\mathcal{X}}\text{-almost every } x \in \mathcal{X}. \tag{5}$$
The assumption (5) covers many common distributions, for example, Gaussian, sub-Gaussian, and distributions with compact support [26]. Now, we are ready to present the main results of this paper. Without loss of generality, we assume $\sup_{x \in \mathcal{X}} K(x, x) = 1$.

Convergence in the Supervised Learning
The following error estimate for Algorithm (2) is the first result of this section, which presents the convergence of Huber regression with fully supervised data and will be proved in Section 3.
The above theorem shows that the parameter $\sigma$ in the Huber loss $\ell_\sigma$ balances the robustness of Algorithm (2) against its convergence rate. We can see that, when the Huber loss function is employed in nonparametric regression problems, the enhancement of robustness comes at the expense of the convergence rate of Algorithm (2). Thus, one needs to strike a trade-off between the two. It is then straightforward to obtain the following corollary, which provides explicit learning rates for (2) under a suitable choice of $\sigma$.
Remark 1. The above corollary tells us that, when $\frac{1}{2} \le r \le 1$, Algorithm (2) achieves the error rate $O\big(N^{-\frac{r}{2r+s}}\big)$, which coincides with the minimax lower bound proved in [23,25] and is therefore optimal. We also notice that the convergence rate cannot improve when $r > 1$. This is referred to as the saturation phenomenon, which has been observed in a vast amount of literature [20,22,25].

Convergence in the Semi-Supervised Learning
Although optimal convergence rates of Algorithm (2) were deduced when $f^*$ lies in $\mathcal{H}_K$ ($r \ge \frac{1}{2}$) in the previous subsection, the error rate for the case $0 < r < \frac{1}{2}$ needs improvement. In this subsection, we study the influence of unlabeled data on the convergence of (2) by using semi-supervised data.
Let an unlabeled data set $\tilde{D}(x) = \{\tilde{x}_i\}_{i=1}^{\tilde{N}}$ be drawn independently according to the marginal distribution $\rho_{\mathcal{X}}$, where $\tilde{N}$ is the cardinality of $\tilde{D}(x)$. With the fully supervised data $D$, we then introduce the semi-supervised data set associated with the Huber regression problem as
$$D^* = \{(x_i^*, y_i^*)\}_{i=1}^{N+\tilde{N}}, \qquad (x_i^*, y_i^*) = \begin{cases} \big(x_i, \tfrac{N+\tilde{N}}{N}\, y_i\big), & 1 \le i \le N, \\ (\tilde{x}_{i-N}, 0), & N+1 \le i \le N+\tilde{N}. \end{cases} \tag{9}$$
By replacing $D$ with $D^*$ in Algorithm (2), we obtain the output function $f_{D^*,\lambda}$ with the semi-supervised data $D^*$. The enhanced convergence results are as follows, where $C_2$ is a constant independent of $N$, $\tilde{N}$, $\sigma$, and $\delta$.
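A minimal sketch of the construction (9): the labeled responses are rescaled by $(N+\tilde{N})/N$ and the unlabeled inputs receive the label 0, so the label average over $D^*$ matches that over $D$ while the extra inputs enrich the empirical operator. The function name below is hypothetical.

```python
import numpy as np

def make_semisupervised(X, y, X_unlabeled):
    """Build the data set D* as in (9): rescale the labeled responses by
    (N + N_tilde)/N and append the unlabeled inputs with label 0."""
    N, Nt = len(y), len(X_unlabeled)
    X_star = np.concatenate([X, X_unlabeled])
    y_star = np.concatenate([(N + Nt) / N * np.asarray(y, dtype=float), np.zeros(Nt)])
    return X_star, y_star
```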
Based on the theorem above, we can obtain the improved convergence rate as follows.

Corollary 2. Under the same conditions of Theorem 2, if
$$\tilde{N} \ge \max\Big\{N^{\frac{s+1}{2r+s}} - N + 1, \, 1\Big\},$$
then, with probability $1 - \delta$,
$$\|f_{D^*,\lambda} - f^*\|_\rho = O\big(N^{-\frac{r}{2r+s}}\big).$$

Remark 2. Corollary 1 shows that, provided no unlabeled data are involved, the minimax optimal convergence rate for (2) is obtained only in the situation $r > \frac{1}{2}$. When $0 < r \le \frac{1}{2}$, the rate reduces to $O\big(N^{-\frac{r}{s+1}}\big)$. This means that the regression function $f^*$ must be assumed to belong to $\mathcal{H}_K$ in order to achieve the optimal rate, which is difficult to verify in practice. In contrast, Corollary 2 tells us that, with sufficient unlabeled data $\tilde{D}(x)$ engaged in Algorithm (2), the minimax optimal rate $O\big(N^{-\frac{r}{2r+s}}\big)$ is retained for all $0 < r \le 1$. This removes the strict regularity condition on $f^*$.

Proofs
Now, we are in a position to prove the results stated in Section 2.

Useful Estimates
First, we estimate a bound for $f_{D,\lambda}$ defined by (2). In the sequel, for notational simplicity, let $z = (x, y)$ and define the empirical operator $L_{K,D}: \mathcal{H}_K \to \mathcal{H}_K$ by
$$L_{K,D} f = \frac{1}{|D|} \sum_{(x,y) \in D} \langle f, K_x \rangle_K \, K_x = \frac{1}{|D|} \sum_{(x,y) \in D} f(x) K_x.$$
Then, we have the following representation for $f_{D,\lambda}$.
Lemma 1. Define $f_{D,\lambda}$ by (2). Then, it satisfies
$$f_{D,\lambda} = (L_{K,D} + \lambda I)^{-1} \left( \frac{1}{|D|} \sum_{(x,y) \in D} y\, K_x + \tilde{f}_{D,\lambda} \right),$$
where
$$\tilde{f}_{D,\lambda} = \frac{1}{|D|} \sum_{(x,y) \in D} \Big(1 - G'\big(\tfrac{(y - f_{D,\lambda}(x))^2}{\sigma^2}\big)\Big)\big(f_{D,\lambda}(x) - y\big) K_x.$$
Proof. Note that $\ell_\sigma(u) = \sigma^2 G\big(\tfrac{u^2}{\sigma^2}\big)$. Since $f_{D,\lambda}$ is the minimizer of Algorithm (2), taking the gradient of the regularized functional on $\mathcal{H}_K$ and setting it to zero gives
$$\frac{1}{|D|} \sum_{(x,y) \in D} G'\big(\tfrac{(y - f_{D,\lambda}(x))^2}{\sigma^2}\big)\big(f_{D,\lambda}(x) - y\big) K_x + \lambda f_{D,\lambda} = 0.$$
With the fact $G'_+(0) = 1$, it yields
$$(L_{K,D} + \lambda I) f_{D,\lambda} = \frac{1}{|D|} \sum_{(x,y) \in D} y\, K_x + \tilde{f}_{D,\lambda}.$$
The proof is complete.
Based on the above lemma, we can obtain the bound of f D,λ .

Lemma 2.
Under the moment condition (5), with probability at least $1 - \delta$, there holds
$$\|f_{D,\lambda}\|_K \le \frac{(4M + 5t) \log \frac{N}{\delta}}{\sqrt{\lambda}}. \tag{16}$$
Proof. Under the moment condition (5), it has been proven in [27] that, with probability at least $1 - \delta$, there holds
$$\max\{|y| : \text{there exists } x \in \mathcal{X} \text{ such that } (x, y) \in D\} \le (4M + 5t) \log \frac{N}{\delta}. \tag{15}$$
By the definition of $f_{D,\lambda}$ as the minimizer in (2), comparing its objective value with that of the zero function and using $\ell_\sigma(u) \le u^2$, we have that
$$\lambda \|f_{D,\lambda}\|_K^2 \le \frac{1}{N} \sum_{i=1}^N \ell_\sigma\big(y_i - f_{D,\lambda}(x_i)\big) + \lambda \|f_{D,\lambda}\|_K^2 \le \frac{1}{N} \sum_{i=1}^N \ell_\sigma(y_i) \le \max_{1 \le i \le N} |y_i|^2.$$
It follows that
$$\|f_{D,\lambda}\|_K \le \frac{\max_{1 \le i \le N} |y_i|}{\sqrt{\lambda}}.$$
This together with (15) yields the desired conclusion.
Furthermore, we see that This in combination with the bounds (15) and (16) provides that, with probability at least 1 − δ,

Error Decomposition
To derive the explicit convergence rate of Algorithm (2), we introduce the regularization function $f_\lambda$ in $\mathcal{H}_K$, defined by
$$f_\lambda = \arg\min_{f \in \mathcal{H}_K} \big\{ \mathcal{E}(f) + \lambda \|f\|_K^2 \big\}, \qquad \text{where } \mathcal{E}(f) = \int_{\mathcal{Z}} (f(x) - y)^2 \, d\rho$$
is the expected risk associated with the least squares loss. It is direct to verify that
$$f_\lambda = (L_K + \lambda I)^{-1} L_K f^*, \tag{19}$$
so $f_\lambda - f^* = -\lambda (L_K + \lambda I)^{-1} f^*$. By the work in [20], we know that, under the regularity assumption (3) with $r > 0$,
$$\|f_\lambda - f^*\|_\rho \le \lambda^r \|g_0\|_\rho \quad (0 < r \le 1), \qquad \text{and} \qquad \|f_\lambda - f^*\|_K \le \lambda^{r - \frac{1}{2}} \|g_0\|_\rho \quad \big(r \ge \tfrac{1}{2}\big).$$
Now, we state two error decompositions for $f_{D,\lambda} - f_\lambda$. By (19), we have that $f_\lambda$ satisfies $(L_K + \lambda I) f_\lambda = L_K f^*$. It implies (22), which, combined with (13), leads to the decomposition (23) of $f_{D,\lambda} - f_\lambda$ into three terms. In the sequel, we denote by $B_{D,\lambda}$, $C_{D,\lambda}$, and $G_{D,\lambda}$ the quantities appearing in these bounds. Noting that, for any $f \in \mathcal{H}_K$, $\|f\|_\rho = \|L_K^{\frac{1}{2}} f\|_K$ [21], one obtains a bound for the sample error $\|f_{D,\lambda} - f_\lambda\|_\rho$ from the decomposition (23) above.
Proposition 1. Define $f_{D,\lambda}$ by (2). Then, there holds the following bound. Proof. Let $I_1$, $I_2$, and $I_3$ denote the three terms on the right-hand side of (23), respectively. Consider the $\mathcal{H}_K$ norm of each term in turn: bounding $I_1$, and similarly $I_2$ and $I_3$, and combining these bounds with (24), we obtain the statement. The proof is finished.

Deriving Main Results
To prove our main results, we need to bound the quantities $B_{D,\lambda}$, $C_{D,\lambda}$, and $G_{D,\lambda}$ by the following probability estimates.
These inequalities are well studied in the literature and can be found in [17,18].
By the restriction $\tilde{N} \ge \max\big\{N^{\frac{s+1}{2r+s}} - N + 1, 1\big\}$, we conclude that $N + \tilde{N} \ge N^{\frac{s+1}{2r+s}}$. Putting the estimates above into (33) yields the desired conclusion (10). The proof is finished.

Numerical Simulation
In this part, we carry out simulations to verify our theoretical statements. We employ the mean squared error on a testing set for the comparison. We generate $N = 500$ labeled data $\{x_i, y_i\}_{i=1}^{500}$ by the regression model $y_i = f^*(x_i) + \epsilon$, where $f^*(x) = x(1 - x)$, the random inputs $x_i$ are independently drawn according to the normal distribution $\mathcal{N}(0, 1)$, and $\epsilon$ is independent Gaussian noise $\mathcal{N}(0, 0.005)$. We also generate $\tilde{N} = 200$ unlabeled data $\{\tilde{x}_i\}_{i=1}^{200}$ with the $\tilde{x}_i$ drawn independently according to the uniform distribution on $[0, 1]$. We choose the Gaussian kernel $K(x, u) = \exp\{-|x - u|^2/2\}$, $h = 5$, and regularization parameter $\lambda = 0.7$. In Figure 1, the curve labeled Algorithm (1) shows the mean squared error obtained with the fully supervised training set $D = \{x_i, y_i\}_{i=1}^{500}$, and the curve labeled Algorithm (2) shows the mean squared error obtained with the semi-supervised data set $D^*$ given by (9). The error of Algorithm (2) is clearly smaller than that of Algorithm (1) once 20 unlabeled data are added to the training set, and the curve of Algorithm (2) continues to decrease as the number of unlabeled data grows from 20 to 200. These experimental results are consistent with our theoretical analysis, as illustrated in Figure 1.
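The following self-contained Python sketch mirrors the setting described above (the same regression function, noise level, Gaussian kernel, and $\lambda$), fitting the kernel Huber estimator by the IRLS solver sketched earlier on both $D$ and $D^*$ and comparing test mean squared errors. The robustification parameter `sigma_h`, the test design, and the random seed are illustrative assumptions not specified in the text, so the numbers will not reproduce Figure 1 exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: x * (1.0 - x)

N, N_tilde, sigma_h, lam = 500, 200, 1.0, 0.7
x = rng.normal(0.0, 1.0, size=N)                      # labeled inputs ~ N(0, 1)
y = f_star(x) + rng.normal(0.0, np.sqrt(0.005), N)    # noise variance 0.005
x_u = rng.uniform(0.0, 1.0, size=N_tilde)             # unlabeled inputs ~ U[0, 1]
x_test = rng.uniform(0.0, 1.0, size=1000)
y_test = f_star(x_test)

kern = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)  # Gaussian kernel

def fit_predict(x_tr, y_tr, x_te, sigma, lam, n_iter=50):
    """Kernel Huber regression by IRLS (see the earlier sketch), then prediction."""
    n = len(y_tr)
    K = kern(x_tr, x_tr)
    alpha = np.zeros(n)
    for _ in range(n_iter):
        r = y_tr - K @ alpha
        w = np.minimum(1.0, sigma / np.maximum(np.abs(r), 1e-12))
        alpha = np.linalg.solve(w[:, None] * K + n * lam * np.eye(n), w * y_tr)
    return kern(x_te, x_tr) @ alpha

# supervised fit on D
pred_sup = fit_predict(x, y, x_test, sigma_h, lam)

# semi-supervised fit on D*: rescaled labeled responses plus zero-labeled inputs, cf. (9)
x_star = np.concatenate([x, x_u])
y_star = np.concatenate([(N + N_tilde) / N * y, np.zeros(N_tilde)])
pred_semi = fit_predict(x_star, y_star, x_test, sigma_h, lam)

print("supervised test MSE:     ", np.mean((pred_sup - y_test) ** 2))
print("semi-supervised test MSE:", np.mean((pred_semi - y_test) ** 2))
```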

Discussion
Unlabeled data are ubiquitous in a variety of fields, including signal processing, privacy-sensitive applications, feature selection, and data clustering. Motivated by the robustness of Huber regression, we applied a semi-supervised learning method to the regularized Huber regression algorithm. We derived the explicit learning rate of Algorithm (2) in the supervised setting, which is comparable to the minimax optimal rate of OLS. By the semi-supervised method, we showed that an influx of unlabeled data can improve the learning performance of Huber regression analysis. This suggests that using the additional information in unlabeled data can extend the applicability of Huber regression.