Sub-Graph Regularization on Kernel Regression for Robust Semi-Supervised Dimensionality Reduction

Dimensionality reduction has always been a major problem when handling high-dimensional datasets. Due to their utilization of labeled data, supervised dimensionality reduction methods such as Linear Discriminant Analysis tend to achieve better classification performance than unsupervised methods. However, supervised methods need sufficient labeled data to achieve satisfying results. Semi-supervised learning (SSL) methods, which exploit unlabeled data in addition to labeled data, are therefore a practical alternative. In this paper, we develop a novel SSL method by extending anchor graph regularization (AGR) for dimensionality reduction. In detail, AGR is an accelerated semi-supervised learning method that propagates class labels to unlabeled data; however, it cannot handle new incoming samples. We therefore improve AGR by adding a kernel regression term to its basic objective function, so that the proposed method can not only estimate the class labels of unlabeled data but also achieve dimensionality reduction. Extensive simulations on several benchmark datasets are conducted, and the results verify the effectiveness of the proposed work.


Introduction
Dimensionality reduction is an important issue when handling high-dimensional data in many real-world applications, such as image classification, text recognition, etc. In general, dimensionality reduction is achieved by finding a linear or nonlinear projection matrix that casts the original high-dimensional data into a low-dimensional subspace, so that the computational complexity is reduced and the key intrinsic information is preserved [1][2][3][4][5][6][7][8][9][10]. Principal component analysis (PCA) and linear discriminant analysis (LDA) [11] are two of the most widely used methods for dimensionality reduction. PCA finds a projection matrix along the directions of maximum variance of the dataset, yielding the best reconstruction, while LDA searches for the optimal directions such that, in the reduced subspace, the between-class scatter is maximized while the within-class scatter is minimized. As LDA is a supervised approach, it generally outperforms PCA given sufficient labeled information.
A key problem is that obtaining a large amount of labeled data is time-consuming and expensive. On the other hand, unlabeled data may be abundant in some real-world applications. Therefore, semi-supervised learning (SSL) approaches have become increasingly important in pattern recognition and machine learning [1,2,4,[12][13][14]. Over the past decades, building on the manifold or clustering assumptions (i.e., nearby data likely have the same labels [1,2,4]), graph-based SSL has become one of the most popular families of SSL methods, including manifold regularization (MR) [3], learning with local and global consistency (LGC) [2], and Gaussian fields and harmonic functions (GFHF) [1]. All of these utilize the labeled and unlabeled sets to formulate a graph approximating the geometry of the data manifold [5].
The above graph-based SSL methods can usually be divided into two categories: inductive learning methods and transductive learning methods. Transductive learning methods aim to propagate the labeled information via a graph [1,2,4], so that the labels of the unlabeled set are estimated. However, a key problem of transductive learning methods is that they cannot estimate the class labels of new incoming data, and therefore suffer from the out-of-sample problem. In contrast, inductive learning methods, such as MR [3] and Semi-supervised Discriminant Analysis (SDA) [5], aim to learn a decision function for classification on the original data space, so that they can reduce the dimensionality as well as naturally solve out-of-sample problems.
It can be noted that the graph in SSL tends to be a k-nearest-neighbor (kNN) graph, which first finds the k neighborhoods of each data point [15][16][17] and then defines a weight matrix measuring the similarity between any pair of data points [1,2,4,[18][19][20][21]. However, the kNN graph has a key limitation in that it is not scalable to large-scale datasets, as the computational complexity of searching the k neighborhoods is O(kn²), which is not linear in n. To solve this problem, Liu et al. [22,23] proposed an efficient anchor graph regularization (AGR), where each data point first finds its k nearest anchor points, and the graph is then constructed from the inner products of the coefficients between data and anchors, through which the class labels can be inferred from the anchors to the whole dataset. As a result, the computational complexity is greatly reduced. While there are different ways to build the adjacency matrix S in AGR [24][25][26], we argue that most of them are developed intuitively and lack a probabilistic explanation. In addition, AGR cannot directly infer the class labels of incoming data.
In this paper, we aim to enhance AGR by solving the above problems. From the basic idea of AGR, we point out that the anchors should have the same probability distribution as the data points, since the anchors are the data that roughly approximate the distribution of the data points. Based on this assumption, we analyze S from a stochastic view and further extend it to be doubly-stochastic. As a result, the distribution of the anchors matches that of the data points, and the updated S can be treated as a transition matrix, where each entry of S can be viewed as a transition probability between a data point and an anchor point. Benefiting from S, we then develop a sub-graph regularized framework for SSL. The new sub-graph is constructed from S in an efficient way and preserves the geometry of the data structure. Accordingly, an SSL strategy based on this sub-graph is also developed, which first infers the labels of the anchors and then calculates those of the training data. This is quite different from conventional graph-based SSL, which directly infers the class labels of the dataset on the whole graph and may incur a huge computational cost if the dataset is large-scale. In contrast, this SSL strategy is efficient and suitable for handling large-scale datasets. Experiments on extensive benchmark datasets show the effectiveness and efficiency of the proposed SSL method.
The main contributions of this paper are as follows: (1) We develop a doubly-stochastic S that measures the similarity between data points and anchors. The updated S has a probabilistic meaning and can be viewed as transition probabilities between data points and anchors; it is also a stochastic extension of the weight matrices used in AGR. (2) We develop a sub-graph regularized framework for SSL. The sub-graph is constructed from S in an efficient way and preserves the geometry of the data manifold. (3) We adopt a linear predictor for inferring the class labels of new incoming data, which handles out-of-sample problems. The computational complexity of this linear predictor is linear in the number of anchors, and hence it is efficient.
The organization of the paper is as follows: Section 2 provides basic notations and a review of SSL; Section 3 develops the proposed models for graph construction and SSL; Section 4 presents extensive simulations; and Section 5 gives our final conclusions.

Notations
Let X = [X_l, X_u] ∈ R^(d×(l+u)) be the data matrix, where d denotes the feature number and l and u are the numbers of labeled and unlabeled samples, respectively, so that X_l and X_u are the labeled and unlabeled sets. Let Y = [y_1, y_2, ..., y_(l+u)] ∈ R^(c×(l+u)) be the one-hot label matrix of the data, and let F = [f_1, f_2, ..., f_(l+u)] ∈ R^(c×(l+u)) be the predicted label matrix satisfying 0 ≤ F_ij ≤ 1.

Review of Graph Based Semi-Supervised Learning
We will briefly review prior graph-based SSL methods. Two well-known methods for SSL are LGC [2] and GFHF [1]. Their objectives can be written in the unified form

min_F (1/2) Σ_{i,j} W_ij || f_i/√(D_ii) − f_j/√(D_jj) ||² + λ Σ_i || f_i − y_i ||²,

where W is the adjacency matrix of the graph, D is the diagonal degree matrix with D_ii = Σ_j W_ij, and λ is a balancing parameter that controls the trade-off between the label fitness and the manifold smoothness (GFHF uses the unnormalized counterpart of the first term). A finite λ yields LGC, while GFHF corresponds to λ → ∞ on the labeled points, i.e., λ is a value large enough that the estimated labels of the labeled points are clamped to the given ones, f_i = y_i for i ≤ l.
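As a concrete illustration, the LGC-style propagation can be sketched in a few lines of NumPy. The function name and the simple iterative scheme below are illustrative assumptions, not the reference code of either paper.

```python
import numpy as np

def lgc_propagate(W, Y, alpha=0.9, n_iter=200):
    """LGC-style label propagation (illustrative sketch).

    W : (n, n) symmetric affinity matrix of the graph
    Y : (c, n) one-hot labels, with all-zero columns for unlabeled points
    alpha trades manifold smoothness against label fitness.
    """
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt          # symmetrically normalized graph
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * F @ S + (1 - alpha) * Y  # propagate, then pull toward Y
    return F
```

Taking the argmax over each column of the returned F gives the estimated class of each point.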

Anchor Graph Regularization
Anchor graph regularization (AGR) is an efficient graph-based learning method for large-scale SSL. In detail, let A = {a_1, a_2, ..., a_m} ∈ R^(d×m) be the anchor point set, G = [g_1, g_2, ..., g_m] ∈ R^(c×m) be the label matrix of A, and Z ∈ R^(m×n) be the weight matrix measuring the similarity between each x_j and a_i, with the constraints Z_ij ≥ 0 and Σ_{i=1}^m Z_ij = 1. Z is usually formulated by kernel weights or by a local reconstruction strategy, and the computational complexity of both strategies is linear in the data number. The label matrix F can then be estimated as

F = GZ,

so that AGR minimizes the following objective function:

min_G || G Z_l − Y_l ||_F² + γ tr(G L_r G^T),

where the first term is the loss function and the second term is the manifold regularization term with balancing parameter γ, W_a = Z^T Δ^{-1} Z ∈ R^(n×n) is the anchor graph, and Δ ∈ R^(m×m) is a diagonal matrix with each element satisfying Δ_ii = Σ_{j=1}^n Z_ij. It can easily be proven that W_a is doubly-stochastic and hence has a probabilistic meaning. In addition, given two data points x_i and x_j that share common anchor points, it follows that (W_a)_ij > 0; otherwise (W_a)_ij = 0. This indicates that data points with common anchor points have similar semantic concepts, hence W_a can characterize the semantic structure of the dataset. L_r = Z(I − W_a)Z^T ∈ R^(m×m) is the reduced Laplacian matrix, and Z_l ∈ R^(m×l) is formed by the first l columns of Z. Here we can see that although AGR imposes a regularization term over all data points, it is equivalent to regularizing the anchor points with the reduced Laplacian matrix L_r. Finally, the labels of the data points can be inferred from those of the anchor points, and the computational complexity is reduced to O(n). Therefore, both the graph construction and the regularization procedure in AGR are efficient and scalable to large-scale datasets.
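Under the notation above, the closed-form AGR inference can be sketched as follows. The function name, the regularization weight `lam`, and the assumption that the labeled points occupy the first l columns of Z are illustrative choices, not the authors' code.

```python
import numpy as np

def agr_solve(Z, Y_l, lam=0.01):
    """Sketch of AGR label inference on anchors (illustrative).

    Z   : (m, n) data-to-anchor weights, columns summing to 1
    Y_l : (c, l) one-hot labels of the first l data points
    Returns F = G Z, the soft labels of all n points.
    """
    l = Y_l.shape[1]
    Z_l = Z[:, :l]                                   # weights of labeled points
    ZZt = Z @ Z.T
    Delta_inv = np.diag(1.0 / np.maximum(Z.sum(axis=1), 1e-12))
    L_r = ZZt - ZZt @ Delta_inv @ ZZt                # reduced Laplacian Z(I - W_a)Z^T
    G = Y_l @ Z_l.T @ np.linalg.inv(Z_l @ Z_l.T + lam * L_r)
    return G @ Z
```

Note that L_r is computed without ever forming the n×n anchor graph W_a, which is the source of AGR's scalability.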

Analysis of Anchor Graph Construction
The key point of anchor graph construction is to define the weight matrix measuring the similarity between each data point and the anchors. A typical way is to use kernel regression [22]:

S_ij = exp(−||x_j − b_i||² / (2δ²)) / Σ_{i'∈⟨j⟩} exp(−||x_j − b_{i'}||² / (2δ²)) for i ∈ ⟨j⟩, and S_ij = 0 otherwise,       (4)

where δ is the bandwidth of the Gaussian function and ⟨j⟩ denotes the indices of the k nearest anchors of x_j. Obviously, we have S^T 1_q = 1_n, where 1_n ∈ R^(n×1) and 1_q ∈ R^(q×1) are column vectors of n and q ones, respectively, so that the sum of each column of S equals 1. This means S_ij can be viewed as a probability value P(b_i | x_j), which represents the transition probability from x_j to b_i. Then, following the Bayes rule, we have

P(x_j | b_i) = P(b_i | x_j) P(x_j) / P(b_i),       (5)

where P(x_j) ≈ 1/n follows a uniform distribution based on the strong law of large numbers as n → ∞.
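A minimal sketch of these kernel weights (Gaussian weights restricted to the k nearest anchors, with each column normalized to sum to 1); the function and parameter names are assumptions for illustration.

```python
import numpy as np

def anchor_weights(X, B, k=2, delta=1.0):
    """Gaussian anchor weights in the style of Equation (4) (sketch).

    X : (d, n) data, B : (d, q) anchors.
    Returns S of shape (q, n); each column has k nonzeros summing to 1.
    """
    # squared distances between every anchor and every data point: (q, n)
    d2 = ((X[:, None, :] - B[:, :, None]) ** 2).sum(axis=0)
    S = np.zeros_like(d2)
    for j in range(d2.shape[1]):
        idx = np.argsort(d2[:, j])[:k]               # k nearest anchors of x_j
        w = np.exp(-d2[idx, j] / (2.0 * delta ** 2))
        S[idx, j] = w / w.sum()                      # column normalization
    return S
```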
In addition, since the anchors are also sampled from the dataset, we can further assume that P(b_i) follows a uniform distribution, i.e., P(b_i) = 1/q. With these assumptions, we have

Σ_{j=1}^n P(x_j | b_i) = (q/n) Σ_{j=1}^n S_ij = 1, i.e., S_i 1_n = n/q = σ,       (6)

where S_i is the i-th row of S and σ = n/q is a fixed value, so that S 1_n = (n/q) 1_q = σ 1_q. We thereby have two constraints on S, i.e., S^T 1_q = 1_n and S 1_n = σ 1_q (the advantages will be shown in the next subsection). Our goal is to calculate a weight matrix S that satisfies the above constraints so that S has a clear stochastic meaning. Fortunately, this can simply be achieved by iteratively normalizing S in both row and column, i.e.,

S ← P_c(S), S ← σ P_r(S),       (7)

where P_c(S) = S Δ_c^{-1} and P_r(S) = Δ_r^{-1} S, with Δ_c = diag(1S) ∈ R^((l+u)×(l+u)) and Δ_r = diag(S1) ∈ R^(q×q). Actually, the above iterative procedure is equivalent to solving the following optimization problem:

min_S || S − S_0 ||_F² s.t. S ≥ 0, S^T 1_q = 1_n, S 1_n = σ 1_q,       (8)

where S_0 is the initial S as calculated in Equation (4). Equation (8) is an instance of quadratic programming (QP), which can be divided into two convex sub-problems:

min_S || S − S_0 ||_F² s.t. S^T 1_q = 1_n,       (9)

min_S || S − S_0 ||_F² s.t. S ≥ 0, S 1_n = σ 1_q.       (10)

By the above derivation, the initial QP problem in Equation (8) is tackled by successively alternating between the two sub-problems in Equations (9) and (10), with each sub-problem initialized at the current solution of the other. By Von Neumann's lemma [27,28], this alternating optimization procedure is theoretically guaranteed to converge to the global optimum of Equation (8).
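The alternating normalization can be sketched as a Sinkhorn-style loop; the function name and iteration count are illustrative, and the scaling σ = n/q is folded into the row step.

```python
import numpy as np

def balance(S0, n_iter=200):
    """Alternate the column and row normalizations of Equations (9)-(10) (sketch).

    Drives S toward S^T 1_q = 1_n (columns sum to 1) and
    S 1_n = sigma 1_q with sigma = n / q.
    """
    q, n = S0.shape
    sigma = n / q
    S = S0.astype(float)
    for _ in range(n_iter):
        S = S / S.sum(axis=0, keepdims=True)           # columns sum to 1
        S = sigma * S / S.sum(axis=1, keepdims=True)   # rows sum to sigma
    return S
```

For a strictly positive S0 the loop converges quickly; in practice a few dozen iterations suffice.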

Sub-Graph Construction
We have now obtained q anchors and the coefficient vector s_j of each data point x_j. The weight matrix S reflects the affinities between data points and anchors, i.e., X ≈ BS. If we further assume that these affinities in the original high-dimensional space are preserved in the low-dimensional class labels, then we have F ≈ ZS, where Z = [z_1, z_2, ..., z_q] ∈ R^(c×q) represents the class labels of the anchors B. This indicates that the class labels of the dataset can easily be obtained by F = ZS once the class labels of the anchors have been inferred. Since the number of anchors is much smaller than that of the dataset, the computational cost of calculating Z can be much lower than that of directly calculating F as in conventional graph-based SSL methods. We thereby present an efficient method for semi-supervised learning, in which we develop a sub-graph regularized (SGR) framework that utilizes the information of the anchors.
Here, in order to develop our proposed sub-graph SSL method, we first need to construct a sub-graph on the set of anchors and define an adjacency matrix measuring the similarity between any two anchors. There are many approaches to constructing a graph over the anchors, such as the conventional kNN graph [1,18,20,21]. Instead, we design the adjacency matrix W_d ∈ R^(q×q) directly from S as follows:

W_d = (1/σ) S S^T.       (11)

It can easily be proven that W_d 1_q = (1/σ) S S^T 1_q = (1/σ) S 1_n = 1_q, and W_d is symmetric, which indicates that W_d is a doubly-stochastic matrix. Therefore, the above graph construction can be theoretically derived in a probabilistic manner. More straightforwardly, W_d in Equation (11) is a scaled inner product of the rows of S, with each element (W_d)_ij = (1/σ) s_i^r (s_j^r)^T, where s_i^r and s_j^r are the i-th and j-th rows of S = [s_1^r; s_2^r; ...; s_q^r]. This indicates that the rows of S serve as the representations of the anchors. In addition, if b_i and b_j share more common data points choosing them as anchors, their corresponding s_i^r and s_j^r will be similar and (W_d)_ij will be large; in contrast, (W_d)_ij will equal 0 if b_i and b_j do not share any data points. Therefore, W_d derived in Equation (11) can be viewed as an adjacency matrix measuring the similarity between any two anchors.
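The sub-graph of Equation (11) then follows in one line; `sub_graph` is an assumed name for illustration.

```python
import numpy as np

def sub_graph(S):
    """W_d = (1/sigma) S S^T from Equation (11) (sketch).

    Assumes S is (q, n) with columns summing to 1 and rows summing
    to sigma = n / q, so that W_d comes out doubly-stochastic.
    """
    q, n = S.shape
    return (S @ S.T) * (q / n)          # 1 / sigma = q / n
```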

Efficient Semi-Supervised Learning via Sub-Graph Construction
With the above graph construction, we then develop our sub-graph model for efficient semi-supervised learning. Since the number of anchors is much smaller than that of the dataset, our goal is to first estimate the labels Z of the anchors from the labeled data via the sub-graph model, and then to calculate those of the unlabeled samples through the weight matrix. We first give the objective function of the proposed sub-graph regularized framework for calculating the class labels of the anchors:

J(Z) = η_I tr(Z L_d Z^T) + || (Z S − Y) U ||_F² + η_A || Z ||_F².       (12)

The first term in Equation (12) measures the smoothness of the estimated labels on the graph, the second term measures how consistent the estimated labels are with the original labels, and the third is a Tikhonov regularization term that avoids singular solutions. η_A and η_I are parameters balancing the trade-off among the three terms. By setting the derivative of J(Z) with respect to Z to zero, we can calculate the class labels of the anchors as

Z* = Y U S^T (S U S^T + η_I L_d + η_A I)^{-1},       (13)

where U is a diagonal matrix whose first l diagonal elements are 1 and whose remaining u elements are 0, and L_d is the graph Laplacian matrix of W_d. Following Equation (13), the key computation for Z* is the inverse of S U S^T + η_I L_d + η_A I = S_l S_l^T + η_I L_d + η_A I, whose complexity is O(q³). Note that since q ≪ l + u, calculating Z is much cheaper than directly calculating F as in LGC and GFHF. Finally, the class labels of the dataset can be calculated by

F = Z* S.       (14)

The basic steps of the proposed SGR are summarized in Algorithm 1.
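The closed-form solution of Equation (13) and the final inference F = Z*S can be sketched as follows; the function name and default parameter values are illustrative assumptions.

```python
import numpy as np

def sgr_labels(S, Y, l, eta_I=0.1, eta_A=0.01):
    """Anchor labels via Equation (13), then F = Z* S (sketch).

    S : (q, n) doubly-stochastic anchor weights
    Y : (c, n) one-hot labels with all-zero columns beyond the first l
    l : number of labeled points
    """
    q, n = S.shape
    u_diag = np.zeros(n)
    u_diag[:l] = 1.0                       # diagonal of U
    US = S * u_diag                        # S U: zero out unlabeled columns
    W_d = (S @ S.T) * (q / n)              # sub-graph of Equation (11)
    L_d = np.diag(W_d.sum(axis=1)) - W_d   # graph Laplacian of W_d
    M = US @ S.T + eta_I * L_d + eta_A * np.eye(q)
    Z = Y @ US.T @ np.linalg.inv(M)        # Z* = Y U S^T (S U S^T + ...)^{-1}
    return Z, Z @ S                        # anchor labels and F = Z* S
```

The only matrix inverted is q×q, which matches the O(q³) cost discussed above.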

Algorithm 1:
The proposed SGR
1 Input: data X ∈ R^(d×(l+u)), label matrix Y ∈ R^(c×(l+u)), the number of anchors q, and other parameters.
2 Form S as in Equation (8).
3 Form the sub-graph weight matrix W_d = (1/σ) S S^T as in Equation (11).
4 Estimate the label matrix of the anchors Z* = Y U S^T (S U S^T + η_I L_d + η_A I)^{-1} as in Equation (13).
5 Estimate the label matrix of the dataset by F = Z* S.
6 Output: the predicted label matrices of the anchors and the dataset, Z ∈ R^(c×q) and F ∈ R^(c×(l+u)), respectively.

Out-of-Sample Extension via Kernel Regression
The proposed SGR can estimate the labels of unlabeled data, but it cannot directly infer the labels of new data. One way to handle this problem is to find a linear projection model by regressing the anchors B on Z, i.e.,

min_{V,b} || Z − V^T B − b 1_q^T ||² + η_R || V ||²,

where V ∈ R^(d×c) is the projection matrix and b is the bias term. Although this linearization assumption, Z = V^T B + b 1_q^T, provides an effective and efficient solution to the out-of-sample problem, it cannot fit nonlinear distributions. Therefore, we address the problem in two ways: (1) we combine the objective function of SGR with the regression term to form a unified framework, so that the class labels Z, the projection V, and the bias b can be calculated simultaneously; (2) we utilize the kernel trick to search for a nonlinear projection. Specifically, the objective function is given as

min_{Z,V,b} || (Z S − Y) U ||² + η_I tr(Z L_d Z^T) + η_R ( || Z − V^T ϕ(B) − b 1_q^T ||² + || V ||² ).       (16)

It should be noted that ϕ(B) is only implicit and not available. To calculate the optimal V, we have to impose some restrictions. In detail, let V be a linear combination of ϕ(B), i.e., V = ϕ(B) A, where A ∈ R^(q×c) is the coefficient matrix of V; then Z is regressed as Z = A^T K + b 1_q^T, where K represents the kernel matrix, for which we select the Gaussian kernel. By setting the derivatives of Equation (16) with respect to A and b to zero and substituting the solutions back, the regression term reduces to a regularizer on Z involving L_r, where η = η_I / η_R, L_c = I − (1/q) 1_q 1_q^T is the centering matrix that subtracts the mean of all data, and L_r = L_c − L_c K^T (K L_c K^T + η I)^{-1} K L_c. Now denote a newly coming sample by x and its kernel representation by x_k; its projected point is given by t = V^T x_k + b, and the label of x is estimated as

y(x) = arg max_{1 ≤ i ≤ c} t_i.

A toy example verifying the out-of-sample extension is given in Figure 1. In this toy example, we annotate two data points in each class as the labeled set. We then infer the labels over the region {(x, y) | x ∈ [−2, 2], y ∈ [−2, 2]} by the out-of-sample extension in both the linear version and the kernel version. The experimental results show that the decision boundary learned by the kernel version is satisfactory, since it is consistent with the data manifold.
In contrast, the linear version fails to handle the task, because the two-circle dataset follows a nonlinear distribution. Note that the proposed method includes three training stages: (1) initialize the anchors by k-means; (2) construct the sub-graph W_d; (3) perform SSL. The computational cost of k-means in the first stage is O(q(l + u)), while the costs of the sub-graph construction and the SSL strategy in the second and third stages are O(q(l + u)) and O(q³ + (l + u)q), respectively. The computational complexity is summarized in Table 1, from which we can see that with a fixed number q (q ≪ l + u) of anchors for a large-scale dataset, the computational complexity of the proposed SGR scales linearly with l + u, indicating that the proposed SGR is suitable for handling large-scale data.
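As a condensed sketch of the out-of-sample step, one can kernel-ridge-regress the anchor labels Z on the anchors B and classify new points by argmax. This is a simplified stand-in for the unified objective of Equation (16), with assumed function names and a plain ridge penalty in place of the coupled formulation.

```python
import numpy as np

def gaussian_kernel(A, B, delta=1.0):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 delta^2)) for columns of A and B."""
    d2 = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-d2 / (2.0 * delta ** 2))

def fit_kernel_predictor(B, Z, eta=0.1, delta=1.0):
    """Kernel ridge regression of anchor labels Z (c, q) on anchors B (d, q)."""
    q = B.shape[1]
    K = gaussian_kernel(B, B, delta)
    return np.linalg.solve(K + eta * np.eye(q), Z.T)   # (q, c) dual coefficients

def predict(A, B, X_new, delta=1.0):
    """Estimate class labels of new points as argmax of the regressed labels."""
    K_new = gaussian_kernel(B, X_new, delta)           # (q, n_new)
    T = A.T @ K_new                                    # (c, n_new) soft labels
    return T.argmax(axis=0)
```

Since only the q anchors enter the kernel matrix, fitting costs O(q³) and each prediction is linear in q, consistent with the complexity discussion above.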
It should be noted that a recent work [29] proposed another SSL method based on coupled graph Laplacian regularization, which is similar to our proposed work. The main advantages of our work compared to [29] are as follows: (1) The proposed graph is doubly-stochastic, so that the constructed graph Laplacian is normalized in each row and column, while the graph constructed by the coupled graph Laplacian regularization may not be doubly-stochastic; (2) the proposed work can directly handle out-of-sample problems by projecting newly coming data with the projection matrix, so that their class membership can be inferred, while the coupled graph Laplacian regularization does not consider this point.

Toy Examples for Synthetic Datasets
We first show that the iterative approach of the proposed method can adaptively reduce the bias of a data manifold, using a generated two-class dataset with noise that follows a half-moon distribution in each class. Here, we use the kernel version of the proposed method to learn the classification model in order to handle this nonlinear distribution. Figure 2 shows the decision surfaces and boundaries obtained by the proposed method during the iterations. From Figure 2, we can observe that for the two-moon dataset the results converge quickly, requiring only four iterations. We can also observe that when each local regression term is initially treated equally, the boundary learned by the proposed method cannot separate the two classes well, as many data points are misclassified. However, during the iterative reweighting process the boundary becomes increasingly accurate and discriminative, and the converged boundary in Figure 2 after four iterations shows that the biases caused by the noisy data are substantially reduced.

Description of Dataset
In this section, we utilize six real-world datasets for verification: the Extended Yale-B, Carnegie Mellon University Pose, Illumination and Expression (CMU-PIE), Columbia Object Image Library 100 (COIL-100), Eidgenössische Technische Hochschule 80 (ETH80), United States Postal Service (USPS) digit image, and Chinese Academy of Sciences, Institute of Automation, Hand-Written Digit Base (CASIA-HWDB) datasets. For each dataset, we randomly select 5%, 10%, 15%, and 20% of the data points to formulate the labeled set, 20% of the data to formulate the test set, and the remaining ones to formulate the unlabeled set. The information of the datasets and sampled images can be found in Table 2 and Figure 3, respectively.

Image Classification
We now show the effectiveness of the proposed SGR for image classification. The experimental settings are as follows [36,37]: For most SSL methods, e.g., LGC, Special Label Propagation (SLP), Linear Neighborhood Propagation (LNP), AGR, Efficient Anchor Graph Regularization (EAGR), and MR, the parameter k for constructing the kNN graph is determined by five-fold cross validation, chosen from 6 to 20. For LGC, LNP, AGR, and EAGR, the regularization parameter needs to be set, and it is determined from {10^−6, 10^−3, 10^−1, 1, 10, 10^3, 10^6}. The average accuracies over 50 random splits with varying numbers of labeled data are shown in Tables 3-8. From the classification results, we observe the following: (1) For almost all methods, the classification results improve as the number of labeled data increases. For instance, the results of SGR increase by about 15% as the proportion of labeled data is increased from 5% to 20% in most cases, and by almost 17% on the CASIA-HWDB dataset. In addition, the classification results stop increasing once the number of labeled samples is sufficient, especially on the COIL-100, USPS, and ETH80 datasets; (2) The proposed SGR outperforms the other methods in all cases. For instance, SGR achieves a 5%-9% advantage over SLP, LNP, and MR in almost all cases; on the CASIA-HWDB dataset, this improvement even reaches 9%. AGR and EAGR can obtain results competitive with SGR by tuning their parameters, whereas the proposed SGR adjusts them automatically while achieving satisfying results; (3) The accuracies on the unlabeled set outperform those on the test set. This is because the test data are not utilized for training. However, the accuracies on the test set are still good, showing that SGR is able to handle new incoming data.

Parameter Analysis with Different Numbers of Anchors
In this subsection, we verify the accuracy of SGR against different numbers of anchors. In this study, we selected 5% of the data to formulate the labeled set and the remaining ones to formulate the unlabeled set. Figure 4 shows the accuracy curves of SGR under different numbers of anchors, where the candidate set is chosen from √n to 10√n. From Figure 4, we can see that on the ETH80 dataset the classification results improve as the number of anchors increases. However, the accuracy stops increasing once the number of anchors is sufficient, e.g., 10√n, which is still much smaller than the size of the original dataset. For the other datasets, the classification accuracies barely change and are less sensitive to the number of anchors.

Image Visualization
In this subsection, we demonstrate visualization results of the proposed method to show its superiority. In this study, we choose the digit and letter images of the first five classes of the CASIA-HWDB dataset, where we randomly select 20 and 80 data points per class to formulate the labeled and unlabeled sets, respectively; the rest are used as test data. We then project the test set onto a 2D subspace for visualization. Since the out-of-sample extensions of the proposed SGR and of MR are derived from a regression problem, we apply a PCA operator to the projected data V^T X to reduce its dimensionality to two, in order to handle the sub-manifold visualization problem. The test data can then be visualized in the 2D subspace. The experimental results are shown in Figures 5 and 6, from which we can observe that SGR obtains the better performance, especially on the CASIA-HWDB digit image data.

Conclusions
In this paper, we proposed a sub-graph-based SSL method for image classification. The main contributions of the proposed work are as follows: (1) We developed a doubly-stochastic S that measures the similarity between data points and anchors. The updated S has a probabilistic meaning and can be viewed as transition probabilities between data points and anchors. In addition, the sub-graph is constructed from S in an efficient way and preserves the geometry of the data manifold; simulation results verify the superiority of the proposed SGR. (2) We also adopted a linear predictor for inferring the labels of new incoming data, which handles out-of-sample problems. The computational complexity of this linear predictor is linear in the number of anchors and hence efficient, which shows that SGR can handle large-scale datasets in practice. From the above analysis, the main advantages of the proposed work are its effectiveness in handling classification problems and its low computational complexity for both graph construction and SSL; it can also handle out-of-sample problems through kernel regression on the anchors. However, it suffers from the drawback that its parameters are not adaptive, and the graph construction and SSL inference are performed in two separate stages. Our future work will be to develop a unified optimization framework with adaptively adjusted parameters.
While the proposed work mainly focuses on image classification, our future work can also lie in handling other state-of-the-art applications, such as image retagging [38], and context classification in the natural language processing field [39,40].