Semi-Supervised Classification Based on Low Rank Representation

Abstract: Graph-based semi-supervised classification uses a graph to capture the relationship between samples and exploits label propagation techniques on the graph to predict the labels of unlabeled samples. However, it is difficult to construct a graph that faithfully describes the relationship between high-dimensional samples. Recently, low-rank representation has been introduced to construct a graph, which can preserve the global structure of high-dimensional samples and help to train accurate transductive classifiers. In this paper, we take advantage of low-rank representation for graph construction and propose an inductive semi-supervised classifier called Semi-Supervised Classification based on Low-Rank Representation (SSC-LRR). SSC-LRR first utilizes a linearized alternating direction method with adaptive penalty to compute the coefficient matrix of the low-rank representation of the samples. Then, the coefficient matrix is adopted to define a graph. Finally, SSC-LRR incorporates this graph into a graph-based semi-supervised linear classifier to classify unlabeled samples. Experiments are conducted on four widely used facial datasets to validate the effectiveness of the proposed SSC-LRR, and the results demonstrate that SSC-LRR achieves higher accuracy than other related methods.


Introduction
In many machine learning tasks, labeled samples are scarce because they are usually difficult or expensive to collect, whereas unlabeled samples are easy to obtain. To train an accurate classifier, it is necessary to develop techniques that can leverage a limited number of labeled samples together with many unlabeled samples. Semi-supervised learning is one such technique: it takes advantage of both labeled and unlabeled samples, and it yields better learning results than using the scarce labeled data alone [1].
In this paper, we focus on graph-based semi-supervised classification (GSSC). GSSC uses a graph G = (V, E, W) to represent the data structure, where V is a set of vertices and each vertex represents a sample, E ⊆ V × V is a set of edges connecting samples, and W is an adjacency matrix recording the weights of the edges (or similarities) between samples. GSSC often exploits a graph-based regularization framework to classify unlabeled samples [1]. Zhu et al. [2] proposed an approach called Gaussian random fields and harmonic functions (GFHF). GFHF predicts the labels of unlabeled samples by propagating the labels of labeled samples on a k-nearest-neighbor (kNN) graph. It is based on the consistency assumption that nearby points are likely to have similar outputs and that points on the same structure (typically referred to as a cluster or a manifold) are likely to have the same label. Zhou et al. [3] introduced a local and global consistency (LGC) method to classify unlabeled samples on a kNN graph. Nevertheless, most graph-based semi-supervised classification algorithms are in essence transductive; they cannot be directly extended to new samples outside of the graph [1]. To perform inductive classification, Zhu et al. [4] suggested first predicting the labels of the unlabeled samples in the training set, and then labeling a new sample based on the labels of its nearest neighbors in the training set. That suggestion is only sensible when the number of unlabeled samples is sufficiently large and the predicted labels of the unlabeled samples in the training set are correct.
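The label propagation idea behind GFHF can be illustrated with a minimal NumPy sketch. It assumes binary kNN edge weights and uses the standard harmonic solution F_u = (D_uu − W_uu)^{-1} W_ul Y_l; the function names `knn_graph` and `harmonic_propagation` are hypothetical, and the sketch is not taken from the implementation of [2].

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric kNN adjacency with binary weights; samples are the columns of X."""
    N = X.shape[1]
    sq = np.sum(X**2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)                      # exclude self-neighbors
    nn = np.argsort(dist, axis=1)[:, :k]                # k nearest neighbors per sample
    W = np.zeros((N, N))
    W[np.repeat(np.arange(N), k), nn.ravel()] = 1.0
    return np.maximum(W, W.T)                           # make the graph undirected

def harmonic_propagation(W, Y_l, labeled):
    """Propagate one-hot labels Y_l of the labeled vertices to the other vertices."""
    N = W.shape[0]
    labeled = np.asarray(labeled)
    unlabeled = np.setdiff1d(np.arange(N), labeled)
    L = np.diag(W.sum(axis=1)) - W                      # graph Laplacian D - W
    L_uu = L[np.ix_(unlabeled, unlabeled)]
    W_ul = W[np.ix_(unlabeled, labeled)]
    # Assumes every unlabeled vertex is connected to some labeled vertex,
    # otherwise L_uu is singular.
    F_u = np.linalg.solve(L_uu, W_ul @ Y_l)
    return unlabeled, F_u
```

The sketch returns soft label scores only for vertices already in the graph, which is exactly the transductive limitation discussed above.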
Researchers have recognized that the graph determines the performance of GSSC [1,5,6]. However, how to construct a graph that correctly reflects the underlying distribution of the samples remains an open problem [1]. This is principally because the distances between samples tend to become nearly equal as the dimensionality of the samples increases [7]. Furthermore, many traditional similarity metrics are distorted by noisy or redundant features of high-dimensional data. For these reasons, researchers have turned to semi-supervised classification based on graph optimization. Wang et al. [5] introduced a linear neighborhood propagation (LNP) method. LNP optimizes the edge weights of a predefined kNN graph by minimizing the error of reconstructing each sample from its k nearest neighbors, using an objective function similar to that of locally linear embedding [8]. Zhao et al. [9] proposed a method called compact graph based semi-supervised learning (CGSSL). CGSSL infers the labels of unlabeled samples by using an $l_2$-graph, which is constructed from the neighbors of a sample and the neighbors of its reciprocal neighbors. Cheng et al. [10] proposed an $l_1$-graph based transductive classifier that uses sparse representation regularized with the $l_1$-norm; the $l_1$-graph is constructed from the sparse representation coefficients. Fan et al. [11] proposed sparse representation regularized least square classification (S-RLSC). To speed up the solution of sparse representation, Yu et al. [12] proposed semi-supervised classification based on subspace sparse representation (SSC-SSR). SSC-SSR solves the $l_1$-norm regularized sparse representation problem in several random subspaces, trains a semi-supervised linear classifier on the $l_1$-graph defined by the sparse representation coefficients in each subspace, and then combines these classifiers into an ensemble classifier. Sparse representation forces the coefficients to be sparse. Low-rank representation was recently introduced to GSSC [13]. Yang et al. [13] constructed a graph by using the low-rank representation coefficients of both labeled and unlabeled samples as the graph weights, and incorporated that graph into GFHF for transductive classification. Low-rank representation forces the coefficient matrix to be low rank and optimizes the matrix as a whole, whereas sparse representation often optimizes the coefficients of each sample separately. It is recognized that low-rank representation is an appropriate approach for capturing the global structure of the data and the global mixture of subspace structures [14,15]. Peng et al. [16] proposed a structure-preserving low-rank representation technique that enforces the local affinity property to be preserved without distorting the distant repulsion property and that utilizes the label information. Yang et al. [17] integrated the kernel trick with low-rank representation and introduced a kernel low-rank representation graph for GSSC. However, these approaches focus on transductive classification. Similar to LGC and GFHF, they cannot be directly applied to samples outside of the graph.
In this paper, we introduce an inductive semi-supervised classifier based on low-rank representation (SSC-LRR). SSC-LRR first constructs a graph based on the low-rank representation coefficients. Next, it incorporates this graph into a graph-based semi-supervised linear classifier to classify unlabeled samples that are outside of the graph. Experimental results on four high-dimensional facial image datasets demonstrate that SSC-LRR performs better than other related GSSC methods. In addition, SSC-LRR is robust to noisy features and input parameters.
The remainder of this paper is organized as follows. In Section 2, we give the details of how to construct a graph under low-rank constraints and introduce inductive semi-supervised classification based on low-rank representation. Section 3 provides the experimental results and analysis, followed by conclusions in Section 4.

Methodology
In this section, we present the proposed semi-supervised classifier based on low-rank representation (SSC-LRR).

Low-Rank Representation for Graph Construction
Let $X = [x_1, x_2, x_3, \ldots, x_N] \in \mathbb{R}^{D \times N}$ be a set of samples, where each column $x_i \in \mathbb{R}^{D}$ represents a sample. Each sample can be viewed as a linear combination of bases from a dictionary $A = [a_1, a_2, a_3, \ldots, a_N] \in \mathbb{R}^{D \times N}$. Similar to the work in [13,15], we set $A = X$ in this paper. Low-rank representation represents each sample by a linear combination of the bases in $A$ as follows:
$$X = AZ \tag{1}$$
where $Z = [z_1, z_2, \ldots, z_N] \in \mathbb{R}^{N \times N}$ is the coefficient matrix of the low-rank representation. Each element of $z_i$ can be viewed as the contribution of the corresponding basis to the reconstruction of $x_i$ with $A$ as the dictionary. $A$ is often overcomplete, so Equation (1) may admit many solutions. To overcome this problem, we enforce $Z$ to be low rank and solve the following optimization problem:
$$\min_{Z} \operatorname{rank}(Z) \quad \text{s.t.} \quad X = AZ, \; Z \geq 0 \tag{2}$$
Here, $X$ is reconstructed by the low-rank constrained matrix $Z$ and the dictionary matrix $A$. Equation (2) is known as low-rank representation (LRR) [15]. The constraint $Z \geq 0$ is added because Zhuang et al. [18] observed that a non-negative $Z$ often leads to improved performance for data representation and graph construction. However, Equation (2) is NP-hard. Fortunately, it can be relaxed to the following problem:
$$\min_{Z} \|Z\|_{*} \quad \text{s.t.} \quad X = AZ, \; Z \geq 0 \tag{3}$$
where $\|Z\|_{*}$ is the nuclear norm of $Z$, i.e., the sum of the singular values of $Z$ [19]. Equation (3) can be solved by matrix completion methods [20]. Equation (3) can be further relaxed to take noisy features into account as follows:
$$\min_{Z, E} \|Z\|_{*} + \lambda \|E\|_{2,1} \quad \text{s.t.} \quad X = AZ + E, \; Z \geq 0 \tag{4}$$
where $E \in \mathbb{R}^{D \times N}$ models the noise, $\|E\|_{2,1} = \sum_{j=1}^{N} \sqrt{\sum_{i=1}^{D} E_{ij}^{2}}$ is the $l_{2,1}$-norm [21], and $\lambda > 0$ is used to balance the effect of noise. The $l_{2,1}$-norm encourages the columns of $E$ to be zero. To solve Equation (4), Lin et al. [22] suggested the Linearized Alternating Direction Method with Adaptive Penalty (LADMAP), which iteratively optimizes $Z$ and $E$, keeping one of them fixed while optimizing the other. Next, each column of the solution $Z^{*}$ is normalized via $z_i^{*} = z_i^{*} / \|z_i^{*}\|_2$, and then each negative entry of $Z^{*}$ is set to zero. Low-rank representation jointly finds the low-rank coefficient matrix $Z$ for all samples in $X$.
$Z$ can then be used to define an undirected graph based on low-rank representation, whose weighted adjacency matrix is $S = (Z + Z^{T})/2$.
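The post-processing that turns a coefficient matrix into the graph S, together with the two norms in the LRR objective, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the LADMAP solver of [22]: the function names are hypothetical, Z is assumed to have been computed already, and `singular_value_threshold` only shows the standard proximal step for the nuclear norm on which such solvers are typically built.

```python
import numpy as np

def nuclear_norm(Z):
    """Sum of the singular values of Z."""
    return np.linalg.svd(Z, compute_uv=False).sum()

def l21_norm(E):
    """Sum of the Euclidean norms of the columns of E."""
    return np.sqrt((E**2).sum(axis=0)).sum()

def singular_value_threshold(M, tau):
    """Proximal operator of tau * nuclear norm: shrink singular values by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def lrr_graph(Z, eps=1e-12):
    """Normalize columns of Z, zero out negative entries, then symmetrize."""
    col_norms = np.linalg.norm(Z, axis=0)
    Zn = Z / np.maximum(col_norms, eps)   # eps (an assumption) guards all-zero columns
    Zn = np.maximum(Zn, 0.0)              # set negative entries to zero
    return (Zn + Zn.T) / 2.0              # S = (Z + Z^T) / 2
```

Symmetrizing in the last step is what makes the graph undirected, so that the Laplacian built from S later in the method is symmetric.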

Semi-Supervised Classification Based on Low Rank Representation
Suppose there are $N = l + u$ samples $X = [x_1, x_2, \ldots, x_l, x_{l+1}, \ldots, x_N] \in \mathbb{R}^{D \times N}$, where the first $l$ samples are labeled and the remaining $u$ samples are unlabeled. $y = [y_1, y_2, \ldots, y_N]^{T}$ is the label vector, and $y_i$ is the label of sample $x_i$. To perform multi-class classification, we extend the label vector $y \in \mathbb{R}^{N}$ into a label matrix $Y \in \mathbb{R}^{N \times C}$ as:
$$Y_{ic} = \begin{cases} 1, & \text{if } x_i \text{ is labeled and } y_i = c \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
For an unlabeled sample $x_j$, the corresponding label vector $Y_j$ (the $j$-th row of $Y$) is a zero vector.
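The construction of the label matrix can be sketched as below. The sketch assumes that classes are encoded as integers 0, ..., C−1 and that a boolean mask marks the labeled samples; both conventions, and the name `label_matrix`, are assumptions made for illustration.

```python
import numpy as np

def label_matrix(y, labeled_mask, C):
    """One-hot rows for labeled samples, all-zero rows for unlabeled samples."""
    y = np.asarray(y)
    labeled_mask = np.asarray(labeled_mask, dtype=bool)
    Y = np.zeros((y.shape[0], C))
    idx = np.flatnonzero(labeled_mask)
    Y[idx, y[idx]] = 1.0      # entries of y at unlabeled positions are ignored
    return Y
```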
The general form of a linear classifier can be defined as:
$$f(x) = W^{T} x + b \tag{6}$$
where $W \in \mathbb{R}^{D \times C}$ is the predictive matrix, $b \in \mathbb{R}^{C}$ is the label bias, and $f(x) \in \mathbb{R}^{C}$ is the predicted likelihood vector for $x$ with respect to the $C$ different labels.
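Here, we consider a graph-regularized semi-supervised linear classifier as follows:
$$\min_{f} \; \sum_{i=1}^{l} \|f(x_i) - Y_i\|_2^2 + \frac{\alpha}{2} \sum_{i,j=1}^{N} S_{ij} \|f(x_i) - f(x_j)\|_2^2 + \beta \|f\|_H^2 \tag{7}$$
where $Y_i$ denotes the $i$-th row of $Y$ written as a column vector, and $\alpha$ and $\beta$ are the weights of the second and last terms. The first term is the empirical loss on labeled samples, the second term takes advantage of the global structure of the samples, and the last term controls the complexity of $f(x)$ to avoid over-fitting.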
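The first term is defined as follows:
$$\sum_{i=1}^{l} \|f(x_i) - Y_i\|_2^2 = \operatorname{tr}\left( (X^{T} W + \mathbf{1} b^{T} - Y)^{T} H (X^{T} W + \mathbf{1} b^{T} - Y) \right) \tag{8}$$
where $\operatorname{tr}(\cdot)$ is the matrix trace operator, $H$ is an $N \times N$ diagonal matrix with $H_{ii} = 1$ if $x_i$ is labeled and $H_{ii} = 0$ otherwise, and $\mathbf{1}$ is an $N$-dimensional vector with all elements set to 1. The second term of Equation (7) can be computed as:
$$\frac{1}{2} \sum_{i,j=1}^{N} S_{ij} \|f(x_i) - f(x_j)\|_2^2 = \operatorname{tr}\left( W^{T} X L X^{T} W \right) \tag{9}$$
where $S_{ij}$ is the weight of the edge between $x_i$ and $x_j$, $S = (Z + Z^{T})/2$, and $Z$ is the low-rank representation coefficient matrix obtained from Equation (4). $D$ is a diagonal matrix with $D_{ii} = \sum_{j} S_{ij}$, and $L = D - S$ is the graph Laplacian matrix [23]. The bias $b$ does not appear in Equation (9) because $L \mathbf{1} = 0$. The last term $\|f\|_H^2$ of Equation (7) is used to control the complexity of $f(x)$; it is computed as follows:
$$\|f\|_H^2 = \operatorname{tr}\left( W^{T} W \right) \tag{10}$$
Based on Equations (8)-(10), $J(W, b)$ can be reformulated as:
$$J(W, b) = \operatorname{tr}\left( (X^{T} W + \mathbf{1} b^{T} - Y)^{T} H (X^{T} W + \mathbf{1} b^{T} - Y) \right) + \alpha \operatorname{tr}\left( W^{T} X L X^{T} W \right) + \beta \operatorname{tr}\left( W^{T} W \right) \tag{11}$$
Equation (11) can be solved by taking the partial derivatives of $J(W, b)$ with respect to $W$ and $b$ as below:
$$\frac{\partial J}{\partial W} = 2 X H (X^{T} W + \mathbf{1} b^{T} - Y) + 2 \alpha X L X^{T} W + 2 \beta W$$
$$\frac{\partial J}{\partial b} = 2 (X^{T} W + \mathbf{1} b^{T} - Y)^{T} H \mathbf{1}$$
Letting $\partial J / \partial W = 0$ and $\partial J / \partial b = 0$, and noting that $\mathbf{1}^{T} H \mathbf{1} = l$, we obtain:
$$W = \left( X U X^{T} + \alpha X L X^{T} + \beta I \right)^{-1} X U Y, \qquad b = \frac{1}{l} \left( Y^{T} H \mathbf{1} - W^{T} X H \mathbf{1} \right)$$
where $I$ is the $D \times D$ identity matrix and $U$ is:
$$U = H - \frac{1}{l} H \mathbf{1} \mathbf{1}^{T} H$$
Since $X$ and $Y$ are given and $L = D - S$, $f(x)$ is mainly determined by $S$. Here, $S$ is the weighted adjacency matrix of the graph constructed from the low-rank representation coefficient matrix $Z$ (see Equation (4)). In contrast to a transductive classifier, the proposed SSC-LRR directly uses $W$ and $b$ to predict the likelihood of $x$ with respect to the $C$ classes. The predicted label of $x$ is:
$$l(x) = \arg\max_{c \in \{1, \ldots, C\}} f(x, c)$$
where $f(x, c)$ indicates the $c$-th entry of $f(x) \in \mathbb{R}^{C}$, and $l(x)$ is the predicted label of $x$. Figure 1 briefly lists the process of the SSC-LRR algorithm.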
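A graph-regularized linear classifier of this kind has a closed-form solution, which can be sketched in NumPy as follows. The sketch assumes a squared loss on the labeled samples, a Laplacian regularizer tr(W^T X L X^T W) weighted by α, and a ridge term weighted by β; the function names `fit_ssc_linear` and `predict` are hypothetical, and the code is an illustration under these assumptions rather than the authors' implementation.

```python
import numpy as np

def fit_ssc_linear(X, Y, labeled_mask, S, alpha=0.05, beta=0.05):
    """Closed-form W (D x C) and b (C,) for samples stored as columns of X (D x N)."""
    D_dim, N = X.shape
    h = np.asarray(labeled_mask, dtype=float)   # diagonal of H, so H @ 1 = h
    l = h.sum()                                 # number of labeled samples
    H = np.diag(h)
    L = np.diag(S.sum(axis=1)) - S              # graph Laplacian L = D - S
    U = H - np.outer(h, h) / l                  # U = H - (1/l) H 1 1^T H
    A = X @ U @ X.T + alpha * (X @ L @ X.T) + beta * np.eye(D_dim)
    W = np.linalg.solve(A, X @ U @ Y)           # solve instead of forming an inverse
    b = (Y.T @ h - W.T @ (X @ h)) / l
    return W, b

def predict(X_new, W, b):
    """Predicted class index (0-based) for each column of X_new."""
    scores = X_new.T @ W + b                    # one row of C scores per new sample
    return scores.argmax(axis=1)
```

Because only W and b are needed at prediction time, new samples can be classified without rebuilding the graph, which is the inductive property the section emphasizes.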

Figure 1. The process of the SSC-LRR algorithm.

Experimental Setup
In this section, we conduct experiments on four facial datasets, AR [24], ORL [25], PIE [26] and YaleB [27], to compare the proposed SSC-LRR with several related and representative graph-based semi-supervised classification methods: GFHF [2], LGC [3], LNP [5], CGSSL [9], LRR-GFHF [18], S-RLSC [11] and LRR-MR. LGC and GFHF use a kNN graph, CGSSL and LNP employ an $l_2$-graph, and S-RLSC uses an $l_1$-graph. LRR-GFHF constructs a graph by utilizing the coefficient matrix Z in Equation (4), and then applies GFHF on this graph to classify unlabeled samples. LRR-MR uses the same graph as LRR-GFHF and SSC-LRR, but it employs a representative semi-supervised nonlinear classifier, manifold regularization (MR) [28], on the graph to predict the labels of unlabeled samples. GFHF, LGC, CGSSL and LRR-GFHF are transductive classifiers that cannot directly classify out-of-sample data, i.e., samples that are not in the graph. We extend them to the out-of-sample setting by assigning a new sample the label of its nearest training sample, where the labels of the unlabeled training samples are predicted by the respective transductive classifier in advance. Notably, S-RLSC, LRR-MR and SSC-LRR directly use all the labeled and unlabeled samples to predict the label of a new sample, without predicting the labels of the unlabeled training samples in advance.
AR contains 2600 images of 100 persons. These images are transformed into grayscale and cropped to 42 × 30 pixels. Thus, each image can be viewed as a point in a 1260-dimensional space. For each person, we choose 13 images as the training set and use the rest as the testing set. YaleB contains 2414 images of 38 people; each image is cropped to 32 × 32 pixels and then transformed into a grayscale image. For each person, we choose 20 images as the training set and use the remaining samples as the testing set. ORL contains 400 face images of 40 people; each image is cropped to 32 × 32 pixels and transformed into a grayscale image. For each person, we select six images as the training set and the remaining four images as the testing set. PIE includes 41,368 face images of 68 individuals. We select the subset Pose27 (including 3329 images) for the experiments and crop these images to 64 × 64 pixels. For each person, we select 20 images as the training set and the rest as the testing set.
For presentation, we introduce several symbols. N: the number of training samples; Nt: the number of testing samples; D: the dimensionality of samples; C: the number of classes; LU: the number of images per person in the training set; m: the number of labeled images per person in the training set; k: the neighborhood size. In the experiments, unless otherwise specified, λ is set to 0.01, α is set to 0.05, β is set to 0.05, and k is set to 5. To reduce random effects, all the experimental results are averages over 20 independent runs. In each run, we randomly select a fixed number of samples from the training set as labeled samples, and the remaining samples of the training set are used as unlabeled samples.
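The out-of-sample extension used for the transductive baselines can be sketched as a nearest-neighbor lookup. The sketch assumes Euclidean distance and assumes that `train_labels` already holds both the given labels and the labels predicted for the unlabeled training samples; the function name is hypothetical.

```python
import numpy as np

def nearest_neighbor_extension(X_train, train_labels, X_new):
    """Give each column of X_new the label of its nearest column in X_train."""
    sq_train = np.sum(X_train**2, axis=0)
    sq_new = np.sum(X_new**2, axis=0)
    # Squared Euclidean distances, shape (number of new samples, number of training samples).
    dist = sq_new[:, None] + sq_train[None, :] - 2.0 * X_new.T @ X_train
    return np.asarray(train_labels)[dist.argmin(axis=1)]
```

Any error in the predicted labels of the unlabeled training samples is passed on to new samples by this lookup.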
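The evaluation protocol described above can be sketched as follows. The sketch assumes that m labeled images are drawn per person without replacement and that a user-supplied callback `evaluate` (hypothetical) trains a classifier for a given labeled mask and returns its test accuracy.

```python
import numpy as np

def sample_labeled_mask(train_labels, m, rng):
    """Boolean mask that marks m randomly chosen training samples per class as labeled."""
    train_labels = np.asarray(train_labels)
    mask = np.zeros(train_labels.shape[0], dtype=bool)
    for c in np.unique(train_labels):
        idx = np.flatnonzero(train_labels == c)
        mask[rng.choice(idx, size=m, replace=False)] = True
    return mask

def mean_accuracy(evaluate, train_labels, m, n_runs=20, seed=0):
    """Average the accuracy returned by evaluate(mask) over independent random runs."""
    rng = np.random.default_rng(seed)
    accs = [evaluate(sample_labeled_mask(train_labels, m, rng)) for _ in range(n_runs)]
    return float(np.mean(accs))
```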

Accuracy with Respect to Different Number of Labeled Samples
To study the influence of the number of labeled samples on the accuracy of semi-supervised classification, we conduct experiments on AR, PIE, and YaleB by varying m from 1 to 10, and on ORL by varying m from 1 to 5. The recorded results are plotted in Figure 2.
From Figure 2, we can observe that the accuracy of all methods increases as the number of labeled samples m rises, and that SSC-LRR always achieves better performance than the other compared methods. GFHF and LGC both use a single kNN graph, yet LGC outperforms GFHF. That is because LGC classifies samples based on the consistency assumption. This fact indicates that the consistency assumption can improve the performance of the algorithm to a limited extent. GFHF and LRR-GFHF use the same classifier, but the accuracy of GFHF is much lower than that of LRR-GFHF. This fact shows that an LRR-based graph can better reflect the relationship between samples than a kNN graph. The accuracy of LGC is lower than that of CGSSL and LNP, which utilize an $l_2$-graph optimized by different techniques. The cause is that LGC uses a single kNN graph, which can easily be corrupted by noisy features. In contrast, the $l_2$-graph is more effective than the kNN graph in capturing the similarity relationship between samples. These observations also corroborate that the graph determines the performance of GSSC methods.
S-RLSC uses a single $l_1$-graph and achieves better performance than CGSSL and LNP on AR, PIE and YaleB. This fact coincides with a previous study showing that $l_1$-graph based semi-supervised classification methods are generally more effective than $l_2$-graph based ones [12]. However, the accuracy of S-RLSC is lower than that of LNP on ORL, since ORL has a relatively small number of training samples and the $l_1$-graph requires a large number of basis samples to optimize the sparse representation coefficients.
Compared with the graphs used by GFHF, LGC, LNP and CGSSL, both the $l_1$-graph and the LRR-based graph demonstrate a better ability to capture the relationship between high-dimensional samples. We remark that SSC-LRR achieves higher accuracy than S-RLSC, although both of them are inductive classifiers. The reason is that the LRR graph has a better capacity for exploiting the global structure of the samples than the $l_1$-graph. The performance margin between S-RLSC and SSC-LRR on AR is more obvious than that on the other facial databases, since the images of AR contain more noise than those of the other datasets. This fact shows that the LRR-based graph used by SSC-LRR is more robust to noise than the $l_1$- and $l_2$-graphs. LRR-GFHF, LRR-MR and SSC-LRR all employ LRR to construct a graph for semi-supervised classification, but SSC-LRR and LRR-MR show higher accuracy than LRR-GFHF. This is principally because LRR-GFHF is extended to an inductive classifier by transferring the labels of labeled samples and the pseudo labels of unlabeled samples to a new sample x, and the pseudo labels of the unlabeled samples in the training set are not always correctly predicted. In contrast, SSC-LRR does not predict the labels of the unlabeled samples in the training set; it directly exploits W and b in Equation (6) to predict the label of x. LRR-MR is a non-linear inductive classifier; it often achieves higher accuracy than the other compared methods, except SSC-LRR, although the adopted image datasets are not explicitly linearly separable. This comparison shows that both linear and non-linear classifiers can achieve good performance by exploiting an LRR-based graph. These results support our motivation to use LRR for inductive semi-supervised classification.

Sensitivity Analysis on Input Parameters
To study the influence of the balance parameter λ (see Equation (4)) of LRR on SSC-LRR, we conduct additional experiments on AR, ORL, PIE, and YaleB by setting λ to 100, 10, 1, 0.1, and 0.01, respectively. The values of m on AR, ORL, PIE, and YaleB are fixed at 10, 5, 10, and 10, respectively. The recorded results are plotted in Figure 3. From Figure 3, we can observe that the value of λ has no obvious influence on the classification accuracy on each dataset. The performance of SSC-LRR remains almost the same within a relatively wide range of λ. Therefore, SSC-LRR is robust to the input parameter λ, which balances the effect of noise against low-rankness.
In addition, we investigate the sensitivity of SSC-LRR with respect to α and β (see Equation (7)). We perform additional experiments on ORL and AR with both α and β increasing from 0.05 to 1 in steps of 0.05. The values of m on AR and ORL are fixed at 10 and 5, respectively. The recorded results for each combination of α and β are shown in Figure 4. From Figure 4, we can also observe that SSC-LRR achieves rather stable performance within a relatively wide range of α and β.
From these results, we can conclude that SSC-LRR is robust to noise and works well under a wide range of parameter values.
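The α and β sweep can be sketched as a simple grid evaluation. The sketch assumes a user-supplied callback `evaluate(alpha, beta)` (hypothetical) that trains the classifier with the given weights and returns its accuracy; the 20 grid values follow the range 0.05 to 1 with a step of 0.05 stated above.

```python
import numpy as np

def sensitivity_grid(evaluate, values=None):
    """Evaluate accuracy for every (alpha, beta) pair on a square grid."""
    if values is None:
        values = np.round(np.linspace(0.05, 1.0, 20), 2)   # 0.05, 0.10, ..., 1.00
    acc = np.zeros((len(values), len(values)))
    for i, alpha in enumerate(values):
        for j, beta in enumerate(values):
            acc[i, j] = evaluate(alpha, beta)
    return values, acc
```

The returned matrix has one row per α value and one column per β value, which matches the layout of a surface or heat-map plot.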

Conclusions
In this paper, we investigate how to boost the performance of graph-based semi-supervised classification by constructing a well-structured graph. We employ low-rank representation to construct a weighted adjacency matrix between samples and incorporate the resulting graph into a semi-supervised inductive classifier, thus introducing a method called Semi-Supervised Classification based on Low-Rank Representation (SSC-LRR). Experimental results on four high-dimensional face datasets show that SSC-LRR not only has higher accuracy than other related methods, but is also robust to the input parameters. We plan to study more effective and efficient techniques to construct a graph and to further enhance the performance of graph-based semi-supervised classification.

Figure 2. Accuracy versus m (number of labeled images per person).