Robust Sparse Representation for Incomplete and Noisy Data

Owing to its robustness to large sparse corruptions and its discrimination of class labels, sparse signal representation has become one of the most advanced techniques in the fields of pattern classification, computer vision, machine learning and so on. This paper investigates the problem of robust face classification when a test sample has missing values. Firstly, we propose a classification method based on incomplete sparse representation. This representation boils down to an l1-minimization problem, and an alternating direction method of multipliers is employed to solve it. Then, we provide a convergence analysis and a model extension for incomplete sparse representation. Finally, we conduct experiments on two real-world face datasets and compare the proposed method with the nearest neighbor classifier and sparse representation-based classification. The experimental results demonstrate the superiority of the proposed method in classification accuracy, completion of missing entries and recovery of noise.

Keywords: sparse representation; robust; face classification; alternating direction method of multipliers; incomplete; l1-minimization


Introduction
As a parsimony method, sparse signal representation means that we desire to represent a signal by a linear combination of a few basis elements in an over-complete dictionary. The emerging theory of sparse representation and compressed sensing [1,2] has made exciting breakthroughs and received a great deal of attention in the past decade. Nowadays, sparse representation has become a powerful technique for efficiently acquiring, compressing and reconstructing a signal. Besides, sparse representation has two further powerful properties: it is robust to large sparse corruptions and discriminative with respect to class labels. These two distinguishing properties have promoted its extensive and successful applications in areas such as pattern classification [3][4][5], computer vision [6] and machine learning [7].
It is worth mentioning that Wright et al. [3] proposed a novel method for robust face classification. They applied the idea of sparse representation to pattern classification and demonstrated that this unorthodox method can obtain significant improvements in classification accuracy over traditional methods. Subsequently, Yin et al. [8] extended the aforementioned classification method to a kernel version. Moreover, Huang et al. [9] and Qiao et al. [4] performed face classification and signal classification, respectively, by combining discriminative methods with sparse representation. In [7], Cheng et al. constructed a robust and datum-adaptive l1-graph on the basis of sparse representation. Compared with the k-nearest-neighbor graph and the ε-ball graph, the l1-graph is more robust to large sparse noise and more discriminative with respect to neighbors. Zhang et al. [10] presented a robust semi-nonnegative graph embedding framework, and Chen et al. [11] applied non-negative sparse coding to facial expression classification. Elhamifar et al. [12] proposed a framework of sparse subspace clustering, which harnesses sparse representation to cluster data drawn from multiple low-dimensional subspaces.
In the communities of pattern classification and machine learning, attention is mainly restricted to the situation in which no sample has missing entries. However, datasets with missing values are ubiquitous in many practical applications such as image in-painting, video encoding and collaborative filtering. A commonly used modeling assumption in data analysis is that the investigated dataset is (approximately) low-rank. Based on this assumption, Candès et al. [13] proposed a technique of matrix completion via convex optimization and showed that most low-rank matrices can be exactly completed under certain conditions. For further clustering analysis on such datasets, Shi et al. [14] proposed the method of incomplete low-rank representation, which is validated to be very robust to missing values.
For the task of pattern classification, we usually stipulate that all samples from the same class lie in a low-dimensional subspace. If the training samples have missing entries, matrix completion or incomplete low-rank representation can be employed to complete or recover all missing values. This paper considers the pattern classification problem in which the test samples have missing values while the training samples are complete. To address it, we propose a method of incomplete sparse representation. This method treats each incomplete test sample as a linear combination of all training samples and searches for the sparsest representation.
The remainder of this paper is organized as follows. Section 2 reviews the classification problem based on sparse representation. In Section 3, we propose a model of incomplete sparse representation and develop an alternating direction method of multipliers (ADMM) [15] to solve it. Convergence analysis and model extension are given in Section 4. In Section 5, we carry out experiments on two well-known face datasets and validate the superiority of the proposed method by comparing it with other techniques. The last section draws the conclusions.

Sparse Representation for Classification
A fundamental problem in pattern classification is how to determine the class label of a test sample according to labeled training samples from distinct classes. Given a training set collected from C classes, we express all samples of the i-th class as a matrix A_i = [a_{i,1}, a_{i,2}, ..., a_{i,N_i}] ∈ R^{d×N_i}, where d is the dimensionality of each sample and N_i is the sample number of the i-th class, i = 1, 2, ..., C. Thus, the entire training set can be concatenated into a d × N matrix A = [A_1, A_2, ..., A_C], where N = N_1 + N_2 + ... + N_C. We assume that the samples from the same class lie in a low-dimensional linear subspace and that there are sufficient training samples for each class. Given a new test sample y ∈ R^d from the i-th class, it can then be represented as a linear combination of the training samples,

y = Aw,   (2)

where the coefficient vector w is expected to be nonzero mainly at the positions associated with the i-th class. In view of the assumption that the training samples are sufficient, the coefficient vector w satisfying Equation (2) is not unique. Moreover, the vector w is also sparse if N_i/N is very small.
If the class label of y is unknown, we still desire to obtain its label on the basis of the linear representation coefficients w. The sparse representation is essentially discriminative for classification and robust to large sparse noise or outliers. Considering these advantages, we construct a sparse representation model to perform classification. For the given dictionary matrix A, the sparse representation of y can be reached by solving an l1-minimization problem

min_w ||w||_1  s.t.  ||y − Aw||_2 ≤ ε,   (3)

where the error bound ε ≥ 0, and ||·||_1 and ||·||_2 denote the l1-norm and the l2-norm of vectors, respectively. When y is corrupted by sparse noise e, the representation can instead be obtained from

min_{w,e} ||w||_1 + λ||e||_1  s.t.  y = Aw + e.   (4)
Sparse representation is a global method and has superiority in determining the class label over local methods such as nearest neighbor (NN) and nearest subspace (NS) [3]. Denote the optimal solution of Problem (3) or (4) by ŵ. Sparse representation-based classification (SRC) [3] utilizes ŵ to judge which class y belongs to, as follows. We first introduce C characteristic functions δ_i : R^N → R^N, where δ_i(w) preserves the entries of w associated with the i-th class and sets all the other entries to zero, for arbitrary w ∈ R^N. Then, we compute the class-wise residuals r_i(y) = ||y − Aδ_i(ŵ)||_2. Finally, the class of y is labeled as arg min_i r_i(y).
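As an illustration of the SRC pipeline just described, the following sketch uses scikit-learn's Lasso as a stand-in l1 solver; the solver choice, the `alpha` value and the helper names are ours, not those of [3]:

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(A, labels, y, alpha=0.001):
    """SRC sketch: solve an l1-regularized representation of y over the
    dictionary A, then assign y to the class with the smallest residual.
    A: (d, N) matrix of l2-normalized training samples (columns).
    labels: length-N array giving the class of each column of A."""
    # Lasso stands in for the l1-minimization Problem (3)/(4).
    w_hat = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(A, y).coef_
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        delta_w = np.where(labels == c, w_hat, 0.0)        # characteristic function
        residuals.append(np.linalg.norm(y - A @ delta_w))  # residual r_c(y)
    return classes[int(np.argmin(residuals))]
```

Note that only the residual computation depends on the class labels; the l1 solve itself is label-agnostic, which is what makes the representation discriminative rather than supervised.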

Incomplete Sparse Representation for Classification
As a generalization of compressed sensing and sparse representation theory to second-order (matrix) data, low-rank matrix completion is a technique that completes all missing or unknown entries of a matrix by means of its low-rank structure. If all training samples are complete and only the test sample has missing entries, however, we cannot effectively recover the missing values through the low-rank property alone; in other words, the available matrix completion methods become invalid. To solve this problem, we propose a method of incomplete sparse representation for classification. The proposed method not only completes the missing entries effectively but also obtains better classification performance.

Model of Incomplete Sparse Representation
Considering the existence of noise, we decompose a given test sample y ∈ R^d into the sum of two terms, y = Aw + e, where e denotes the sparse noise vector. Let Ω ⊆ {1, 2, ..., d} be an index set; then the entry y_k is missing if and only if k ∉ Ω. For the convenience of description, we define an orthogonal projection operator P_Ω : R^d → R^d by (P_Ω(x))_k = x_k if k ∈ Ω and 0 otherwise. Thus, the known entries of y can be written as y_0 = P_Ω(y). For the incomplete test sample y_0, we hope to complete all missing entries and obtain the sparsest linear representation on the basis of A and Ω. To this end, we construct an l1-minimization problem

min_{w,e,y} ||w||_1 + λ||e||_1  s.t.  y = Aw + e,  P_Ω(y) = y_0,   (7)

where the tradeoff factor λ > 0. As a matter of fact, the term ||e||_1 in the objective function can be replaced by ||P_Ω(e)||_1; this means that it is impossible to recover the noise corresponding to the missing positions. Problem (7) is a convex and non-smooth minimization with equality constraints. We will employ the alternating direction method of multipliers (ADMM) to solve this problem.
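The operator P_Ω amounts to zeroing the unobserved entries of a vector; a minimal sketch (the function and variable names are ours):

```python
import numpy as np

def P_Omega(x, omega):
    """Orthogonal projection P_Omega: keep the entries indexed by the
    boolean mask omega and set all other entries to zero."""
    out = np.zeros_like(x)
    out[omega] = x[omega]
    return out

# A 5-dimensional sample whose entries 0, 2 and 4 are observed:
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
omega = np.array([True, False, True, False, True])
y0 = P_Omega(y, omega)  # known part of y; missing positions become 0
```

As expected of an orthogonal projection, applying `P_Omega` twice gives the same result as applying it once.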

Generic Formulation of ADMM
The ADMM [15] is a simple and easily implemented optimization method proposed in the 1970s. It is well suited to distributed convex optimization and, in particular, to large-scale optimization problems with multiple non-smooth terms in the objective function. Hence, the method has received a lot of attention in recent years.
Generally, ADMM solves a constrained optimization problem taking the following generic form:

min_{x,y} f(x) + g(y)  s.t.  Bx + Cy = c,   (8)

where f and g are proper and convex functions. The augmented Lagrange function of Problem (8) is defined as

L_μ(x, y, λ) = f(x) + g(y) + ⟨λ, Bx + Cy − c⟩ + (μ/2)||Bx + Cy − c||_2^2,   (9)

where μ is a positive scalar, λ is the Lagrange multiplier vector and ⟨·,·⟩ is the inner product of vectors.
ADMM minimizes L_μ alternately with respect to x and y, and then updates the multiplier:

x^{k+1} = arg min_x L_μ(x, y^k, λ^k),
y^{k+1} = arg min_y L_μ(x^{k+1}, y, λ^k),   (10)
λ^{k+1} = λ^k + μ(Bx^{k+1} + Cy^{k+1} − c).

Moreover, the value of μ can be increased during the procedure of iterations.
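To make the generic scheme (10) concrete, the following is a minimal two-block ADMM sketched for the standard lasso problem min 0.5||Ax − b||² + γ||y||_1 subject to x = y; the symbols and all parameter values are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise absolute shrinkage operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, b, gamma=0.01, mu=1.0, n_iter=300):
    """Two-block ADMM for min_x 0.5*||A x - b||^2 + gamma*||y||_1 s.t. x = y,
    following the generic scheme (10): x-step, y-step, multiplier step."""
    n = A.shape[1]
    x, y, lam = np.zeros(n), np.zeros(n), np.zeros(n)
    M = np.linalg.inv(A.T @ A + mu * np.eye(n))  # factor once, reuse each step
    Atb = A.T @ b
    for _ in range(n_iter):
        x = M @ (Atb + mu * y - lam)                  # quadratic x-minimization
        y = soft_threshold(x + lam / mu, gamma / mu)  # shrinkage y-minimization
        lam = lam + mu * (x - y)                      # dual ascent on multiplier
    return y
```

The non-smooth l1 term is isolated in the y-block, where its proximal step has the closed-form shrinkage solution; this separation of smooth and non-smooth terms is exactly why ADMM suits the problems considered in this paper.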

Algorithm for Incomplete Sparse Representation
We adopt the method of ADMM with multiple blocks of variables to solve the problem of incomplete sparse representation. By introducing two auxiliary variables u ∈ R^N and z ∈ R^d, Problem (7) can be rewritten as

min ||w||_1 + λ||e||_1  s.t.  y = Au + e,  w = u,  z = y,  P_Ω(z) = y_0.   (11)

Attaching multipliers λ_1, λ_2 and λ_3 to the three equality constraints, with a penalty coefficient μ > 0, the augmented Lagrange function of Problem (11) is

L_μ(w, e, y, z, u, λ_1, λ_2, λ_3) = ||w||_1 + λ||e||_1 + ⟨λ_1, y − Au − e⟩ + ⟨λ_2, w − u⟩ + ⟨λ_3, z − y⟩ + (μ/2)(||y − Au − e||_2^2 + ||w − u||_2^2 + ||z − y||_2^2),   (12)

where the constraint P_Ω(z) = y_0 is maintained explicitly. ADMM updates each block of variables alternately by minimizing or maximizing L_μ. We subsequently give the detailed iterative procedure for Problem (11).

Computing w. When w is unknown and the other blocks of variables are fixed, the calculation formulation of w is

w := arg min_w ||w||_1 + (μ/2)||w − (u − λ_2/μ)||_2^2 = shrink_{1/μ}(u − λ_2/μ),

where shrink_t(·) is the absolute shrinkage operator [16] defined component-wise by (shrink_t(x))_k = sign(x_k) · max(|x_k| − t, 0) for arbitrary x ∈ R^N.

Computing e. If e is unknown and the other variables are given, e is updated by minimizing L_μ:

e := shrink_{λ/μ}(y − Au + λ_1/μ).

Computing u. The update formulation of u is calculated as u := arg min_u f(u), where f(u) collects the terms of L_μ that depend on u. By setting the derivative of f(u) to zero, we have

(I_N + AᵀA) u = Aᵀ(y − e + λ_1/μ) + w + λ_2/μ,

or, equivalently, u := (I_N + AᵀA)^{−1} [Aᵀ(y − e + λ_1/μ) + w + λ_2/μ], where I_N is an N-order identity matrix.

Computing y. Fix w, e, z, u and the multipliers, and minimize L_μ with respect to y:

y := (1/2)(Au + e + z + (λ_3 − λ_1)/μ).

Computing z. Fix w, e, y, u and the multipliers, and minimize L_μ with respect to z subject to P_Ω(z) = y_0:

P_Ω(z) := y_0,  P_Ω̄(z) := P_Ω̄(y − λ_3/μ),

where Ω̄ is the complementary set of Ω.

Computing λ_1, λ_2, λ_3. Given w, e, y, z, u, we update the multipliers as follows:

λ_1 := λ_1 + μ(y − Au − e),  λ_2 := λ_2 + μ(w − u),  λ_3 := λ_3 + μ(z − y).

The whole iterative process for solving Problem (11) is outlined in Algorithm 1. In the initialization step, the blocks of variables can be chosen as y = z = y_0, with w, e, u and the multipliers set to zero. We set the stopping condition of Algorithm 1 to be that the constraint residuals fall below ε, where ε is a sufficiently small positive number. The inverse matrix (I_N + AᵀA)^{−1} is computed only once, and the corresponding computational complexity is O(dN² + N³).
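The iterative procedure above can be sketched in NumPy. Since the printed update formulas are only partially legible in this copy, the sketch follows one plausible variable splitting (y = Au + e, w = u, z = y with P_Ω(z) = y_0); the function names and all parameter values are ours and illustrative:

```python
import numpy as np

def shrink(v, t):
    """Absolute shrinkage operator shrink_t(v) [16]."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def isrc_admm(A, y0, omega, lam=1.0, mu=1.0, n_iter=500):
    """ADMM sketch for incomplete sparse representation.
    A: (d, N) dictionary; y0: observed sample (zeros at missing entries);
    omega: boolean mask of observed entries. Returns (y_hat, w_hat, e_hat).
    Splitting assumed: y = A u + e, w = u, z = y, P_Omega(z) = y0."""
    d, N = A.shape
    w, u, l2 = np.zeros(N), np.zeros(N), np.zeros(N)
    e, l1, l3 = np.zeros(d), np.zeros(d), np.zeros(d)
    y, z = y0.copy(), y0.copy()
    M = np.linalg.inv(np.eye(N) + A.T @ A)  # inverse computed only once
    for _ in range(n_iter):
        w = shrink(u - l2 / mu, 1.0 / mu)                # w-step: l1 shrinkage
        e = shrink(y - A @ u + l1 / mu, lam / mu)        # e-step: sparse noise
        u = M @ (A.T @ (y - e + l1 / mu) + w + l2 / mu)  # u-step: least squares
        y = 0.5 * (A @ u + e + z + (l3 - l1) / mu)       # y-step: completion
        z = np.where(omega, y0, y - l3 / mu)             # z-step: keep known part
        l1 += mu * (y - A @ u - e)                       # multiplier updates
        l2 += mu * (w - u)
        l3 += mu * (z - y)
    return y, w, e
```

At convergence the known entries of the completed sample agree with y_0, and the output approximately satisfies ŷ = Aŵ + ê.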

Let ŷ, ŵ and ê be the output variables of Algorithm 1. The vector ŷ denotes the completed version of y_0, and ŵ indicates the sparse representation of ŷ over the basis matrix A. In view of the discriminative performance of ŵ, this sparse vector can be employed to obtain the class label of y_0.
More specifically, we first compute the C residuals r_i(y_0) = ||ŷ − Aδ_i(ŵ)||_2 and then label y_0 with the class that attains the minimal residual.

Convergence Analysis and Model Extension
Although the minimization Problem (11) is convex and continuous, it is still difficult to directly prove the convergence of Algorithm 1. The main reason for this difficulty is that the number of blocks of variables is more than two. If there is no missing value, however, we can design an exact ADMM for solving Problem (11). The following theorem shows the convergence of this modified method.

Theorem 1. Assume that Ω = {1, 2, ..., d} and that the augmented Lagrange function L_μ(w, e, z, u, λ_1, λ_2) has a saddle point. Then the sequence generated by the exact ADMM converges to the optimal value of Problem (11).


Here, O is a zero matrix of size d × N.
In consideration of the characteristics of the objective function, the sequence L_μ(w, e, z, u, λ_1, λ_2) converges to the optimal value [15]. This completes the proof. □

For the aforementioned ISRC, we only considered one incomplete test sample. In the following, we extend it to the case of a batch of test samples with missing values. Given a set of m incomplete test samples Y_0 = [y_0^1, y_0^2, ..., y_0^m] with corresponding index sets Ω_1, Ω_2, ..., Ω_m, we establish the batch learning model of incomplete sparse representation:

min_{W,E,Y} ||W||_1 + λ||E||_1  s.t.  Y = AW + E,  P_{Ω_j}(y^j) = y_0^j, j = 1, 2, ..., m,   (26)

where ||·||_1 and ||·||_F are the component-wise l1-norm and the Frobenius norm of matrices, respectively. By introducing auxiliary matrix variables in the same manner as for Problem (11), Problem (26) can be transformed into a separable form (27) with positive penalty coefficients. When solving Problem (27), we adopt a similar iterative procedure to that of Problem (11) by minimizing or maximizing the corresponding augmented Lagrange function alternately.

Experiments
This section demonstrates the effectiveness and efficiency of ISRC by conducting experiments on the Olivetti Research Laboratory (ORL) and Yale face datasets. We compare the results of the proposed method with those of NN and SRC.

Datasets Description and Experimental Setting
The ORL dataset contains 10 different face images of each of 40 individuals [17]. These 400 images were captured at different times with different illuminations and varying facial details. The Yale face dataset consists of 165 images from 15 persons, with 11 images per person [18]. The images were taken under different illuminations and with varying facial expressions and details. All images in both datasets are grayscale and resized to 64 × 64 for computational convenience; hence, the dimensionality of each sample is d = 4096. Moreover, each sample is normalized to a unit vector in the sense of the l2-norm due to the existence of variable illumination conditions and poses.
In the ORL dataset, five images per person are randomly selected for training and the remaining five images for testing. In the Yale dataset, we randomly choose six images per person as the training samples and the other images as the testing samples. For any sample y from the testing set, we randomly generate an index set Ω according to the Bernoulli distribution, i.e., the probability of i ∈ Ω is stipulated as p for arbitrary i ∈ {1, 2, ..., d}, where p ∈ (0, 1]. The probability p is also named the sampling probability, and p = 1 means that no entry is missing. Thus, an incomplete sample of y is expressed as y_0 = P_Ω(y). The generating manner of Ω indicates that the number of known entries is approximately pd, and hence the number of missing entries is approximately (1 − p)d.
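The Bernoulli masking described above can be sketched as follows (the helper name and the random generator are ours); on average, a fraction p of the d entries is retained:

```python
import numpy as np

def make_incomplete(y, p, rng):
    """Draw an index set Omega by keeping each entry independently with the
    sampling probability p, and return y0 = P_Omega(y) together with the mask."""
    omega = rng.random(y.shape[0]) < p  # i is in Omega with probability p
    return np.where(omega, y, 0.0), omega
```

Usage: for each test sample y, `y0, omega = make_incomplete(y, p, rng)` produces the incomplete observation and the index set fed to the classifier.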
In Algorithm 1, the parameters are set as

Experimental Analysis
We first compare the sparsity of the coefficient vectors obtained by SRC and ISRC. Two different sampling probabilities are considered, namely p = 0.1 and p = 0.3. The comparison results are partially shown in Figure 1. From this figure, we can see that each linear representation vector of ISRC has only a few components with relatively large absolute values, while the other components are close to zero. Compared with ISRC, SRC yields worse sparsity: its coefficients have relatively small amplitudes and are less concentrated. These observations show that ISRC has superiority over SRC in obtaining sparse representations.

Then, we compare the classification performance of ISRC with that of SRC and NN on the two given datasets. To this end, we vary the value of p from 0.1 to 1 with an interval of 0.1. When p = 1, ISRC reduces to SRC. Figure 2 shows the comparison results of classification accuracy, where (a) and (b) represent the results on ORL and Yale, respectively. It can be seen from this figure that ISRC achieves the best classification accuracy compared with SRC and NN, and that it is relatively stable for different values of p. SRC is very sensitive to the choice of p, and its classification accuracy degenerates steeply as p decreases. In addition, NN has lower classification accuracy although it is stable. To sum up, ISRC is the most robust method and has the best classification performance.

For a test sample with missing values, both SRC and ISRC can recover all missing entries and noise to some extent. Finally, we compare their performance in completing missing entries and recovering the sparse noise. Here, we only consider two sampling probabilities, i.e., p = 0.1 and p = 0.3. For these two given probabilities, we compare the completed images and the recovered noise images obtained by SRC and ISRC, as shown partially in Figures 3 and 4.
In the above two figures, the sampling probability is set to 0.1 in the first two rows of images and 0.3 in the last two rows. For each figure, the first two columns of images display the original and the incomplete images, respectively, where the positions with missing entries are shown in white. The third and the fifth columns give the images completed by SRC and ISRC, respectively. The fourth and the last columns show the noise recovered by SRC and ISRC, respectively. By comparing the completed images with the original images, we can see that ISRC not only has better completion performance, but also automatically corrects the corruptions to a certain extent. Moreover, ISRC is more efficient than SRC in recovering noise. In summary, ISRC has better recovery performance than SRC.

Conclusions
This paper studies the problem of robust face classification with incomplete test samples. To address this problem, we propose a classification model based on incomplete sparse representation, which can be regarded as a generalization of sparse representation-based classification. Firstly, the incomplete sparse representation is described as an l1-minimization, and the alternating direction method of multipliers is employed to solve this optimization problem. Then, we analyze the convergence of the proposed algorithm and extend the model to the case of a batch of test samples. Finally, the experimental results on two well-known face datasets demonstrate that the proposed classification method is very effective in improving classification performance and in recovering the missing entries and noise. The model and algorithm of incomplete sparse representation still require further research. In the future, we will consider the sparse representation-based classification problem in which both training and testing samples have missing values.

Algorithm 1. Solving Problem (11) via ADMM.
Input: the dictionary matrix A constructed by all training samples, an incomplete test sample y_0 and the sampling index set Ω.
Output: ŷ, ŵ and ê.
Initialize y = z = y_0 and set w, e, u and the multipliers to zero.
While the stopping condition is not satisfied, update w, e, u, y, z and λ_1, λ_2, λ_3 in turn.
End while.
The above method is called incomplete SRC (ISRC), a variant of SRC.
Proof of Theorem 1. The objective function of Problem (11) can be rewritten as f(z, w) + g(u, e), and it is obvious that f(z, w) and g(u, e) are two closed, proper and convex functions. Since Ω = {1, 2, ..., d}, we have y = y_0, which means it is not necessary to consider the update of y. Under this circumstance, the constraints in Problem (11) reduce to linear equality constraints on the two blocks (z, w) and (u, e). Combining the augmented Lagrange function L_μ(w, e, z, u, λ_1, λ_2) and the constraints (26), we obtain the iterative formulations (25) of an exact two-block ADMM, whose convergence to the optimal value follows from the standard theory [15].

In the experiments, we consider different values of p. For fixed p, the experiments are repeated 10 times and the average classification accuracies are reported. When carrying out NN, we compute the distance between y_0 and the training samples restricted to the observed entries; all missing values are replaced with zeros in implementing SRC.

Figure 1. Sparsity comparisons of the linear representations between SRC and ISRC.

Figure 3. Completion and recovery performance comparisons on ORL.
(a) Original images; (b) incomplete images; (c) images completed by SRC; (d) noise recovered by SRC; (e) images completed by ISRC; (f) noise recovered by ISRC.

Figure 4. Completion and recovery performance comparisons on Yale.