Robust Facial Expression Recognition via Compressive Sensing

Recently, compressive sensing (CS) has attracted increasing attention in the areas of signal processing, computer vision and pattern recognition. In this paper, a new method based on the CS theory is presented for robust facial expression recognition. The CS theory is used to construct a sparse representation classifier (SRC). The effectiveness and robustness of the SRC method are investigated on clean and occluded facial expression images. Three typical facial features, i.e., the raw pixels, Gabor wavelets representation and local binary patterns (LBP), are extracted to evaluate the performance of the SRC method. Compared with the nearest neighbor (NN), linear support vector machines (SVM) and the nearest subspace (NS) methods, experimental results on the popular Cohn-Kanade facial expression database demonstrate that the SRC method obtains better performance and stronger robustness to corruption and occlusion on facial expression recognition tasks.


Introduction
The traditional human computer interaction (HCI) system, in which a single user faces a computer and interacts with it via a mouse or a keyboard, was developed to emphasize the transmission of explicit messages while ignoring implicit information about the user, such as changes in the user's affective states. Such interactions are thus frequently perceived as incompetent, cold, and socially inept.

This fact has inspired an active research field of "affective computing" [1], which aims at enabling computers to recognize, express, model, communicate and respond to a user's emotion information. One of the most important applications of affective computing is to make HCI become more human-like, more effective, and more efficient. Specifically, such computers with the ability of affective computing could detect and track a user's affective states and initiate communications based on this information, rather than simply responding to a user's commands.
Affective arousal modulates all nonverbal communication cues such as facial expression, vocal intonations, body gesture and movement. Facial expression is the most natural and efficient means for humans to communicate their emotions and intentions, as communication is primarily carried out face to face. During the past decade, facial expression recognition has attracted a significant interest in the scientific community, as it plays a vital role in the field of HCI.
Generally, a basic automatic facial expression recognition system consists of three steps [2]: face acquisition, facial feature extraction and representation, and facial expression recognition. Face acquisition is a preprocessing stage that automatically finds the face regions in input images or sequences. Many face detection methods have been proposed to detect faces in an arbitrary scene. Viola and Jones [3] presented a robust real-time face detector based on a set of rectangle features. Heisele et al. [4] developed a component-based framework to detect frontal and near-frontal views of faces in still gray images. El-Bakry [5] proposed a principal component analysis (PCA) based real-time face detection method by performing cross-correlation in the frequency domain between the input images and eigenvectors. The detected face regions are usually aligned based on the eye positions that can be detected in the face regions.
Facial feature extraction and representation aims to extract facial features that represent the facial changes caused by facial expressions. Two types of features, i.e., geometric features and appearance features, are usually used for facial representation [2]. Geometric features describe the shape and locations of facial components such as the mouth, eyes, brows, and nose. The facial components or facial feature points are extracted to form a feature vector that represents the face geometry. Fiducial facial feature points have been widely adopted as geometric features for facial representation. For instance, the geometric positions of 34 fiducial points on a face are usually used to represent facial images [6,7]. In contrast to geometric features, appearance features encode changes in skin texture such as wrinkles, bulges and furrows. Representative appearance features include the raw pixels of facial images, Gabor wavelets representation [8,9], Eigenfaces [10], and Fisherfaces [11]. In recent years, a new face descriptor called local binary patterns (LBP) [12] has been widely used as an appearance feature for facial representation [13][14][15][16] due to its tolerance against illumination changes and its computational simplicity.
Facial expression recognition uses the extracted facial features to recognize different expressions. Depending on whether the temporal information is considered, facial expression recognition approaches can be categorized as frame-based or sequence-based. The frame-based method does not take the temporal information of input images into account, and uses the features extracted from a single image to recognize the expression of that image. In contrast, the sequence-based method attempts to capture the temporal pattern in a sequence to recognize the expression for one or more images. So far, various classifiers, including artificial neural networks (ANN) [17], the nearest neighbor (NN) or K-nearest neighbor (KNN) [18,19], and support vector machines (SVM) [20], have been applied to frame-based expression recognition. For sequence-based expression recognition, the widely used techniques are hidden Markov models (HMM) [21], dynamic Bayesian networks [22], and SVM [23].
Among the three steps mentioned above, facial expression recognition is the most critical for any successful facial expression recognition system, since the performance of the system is mainly decided by its classifier. Therefore, designing a good classifier is a crucial step in facial expression recognition tasks.
The recently-emerged compressive sensing (CS) (also called compressive sampling) theory [24][25][26], which originally aimed to address signal sensing and coding problems, has shown tremendous potential for other problems such as pattern recognition [27,28]. Recently, Nagesh and Li [29] successfully employed the CS theory to develop a promising technique for expression-invariant face recognition. Nevertheless, they did not explore the performance of the CS theory on the robust classification of occluded facial expression images. Note that, in real-world scenarios, facial images are usually corrupted by noise or outliers, that is, some depicted pixels do not belong to the facial images. Therefore, a study on robust facial expression recognition is more practical and meaningful. In addition, most existing facial expression recognition systems [13][14][15][16][17][18][19][20][21][22][23] focus on expression classification on clean facial images without any corruption. Motivated by the lack of studies on the robust classification of facial expressions with the CS theory, in this paper a new method of robust facial expression recognition based on the CS theory is presented.
The remainder of this paper is organized as follows: Section 2 gives the background and related work. In Section 3, facial feature extraction, including Gabor wavelets representation and local binary pattern (LBP), is reviewed briefly. The experiment verification is presented in detail in Section 4. Section 5 gives the conclusions.

Background and Related Work
In this section, we briefly review the CS theory, and then present the details of the recently-emerged sparse representation classifier (SRC) based on the CS theory.

Compressive Sensing (CS)
Given an under-determined system of linear equations:
$$y = Ax, \quad A \in \mathbb{R}^{m \times n},\ m < n \tag{1}$$
It is known that Equation (1) has no unique solution, since the number of variables is larger than the number of equations. In signal processing terms, the length of the signal ($n$) is larger than the number of samples ($m$). However, according to the CS theory, if the signal is sufficiently sparse, the solution is unique and can be reconstructed by practical algorithms. A signal is said to be $k$-sparse if it is a linear combination of only $k$ basis vectors, that is, there are only $k$ non-zero values in $x$ and the remainder are all zeros. In this case, it is possible to find the solution to Equation (1) by a brute-force enumeration of all possible $k$-sparse vectors of length $n$. Mathematically, this problem can be expressed as:
$$\min_{x} \|x\|_0 \quad \text{subject to} \quad y = Ax \tag{2}$$
where $\|\cdot\|_0$ is the $\ell_0$-norm and denotes the number of non-zero elements in the vector. Equation (2) is known to be an NP-hard (non-deterministic polynomial-time hard) problem, and is thus not a practical route to solving Equation (1). The CS literature [24][25][26] indicates that under a certain condition on the projection matrix $A$, i.e., the restricted isometry property (RIP), the sparsest solution to Equation (1) can be obtained by replacing the $\ell_0$-norm in Equation (2) by its closest convex surrogate, the $\ell_1$-norm ($\|\cdot\|_1$). Therefore, the solution to Equation (2) is equivalent to that of the following $\ell_1$-norm minimization problem:
$$\min_{x} \|x\|_1 \quad \text{subject to} \quad y = Ax \tag{3}$$
where the $\ell_1$-norm, $\|\cdot\|_1$, denotes the sum of the absolute values of the elements in the vector, and serves as an approximation of the $\ell_0$-norm. In practice, the equality $y = Ax$ is often relaxed to take into account measurement error in the sensing process due to a small amount of noise. Suppose the measurements are inaccurate and consider the noisy model:
$$y = Ax + e \tag{4}$$
where $e$ is a stochastic or deterministic error term. In particular, if the error term $e$ is assumed to be white noise such that $\|e\|_2 < \varepsilon$, where $\varepsilon$ is a small constant, a noise-robust version of Equation (3) is defined as follows:
$$\min_{x} \|x\|_1 \quad \text{subject to} \quad \|y - Ax\|_2 \le \varepsilon \tag{5}$$
To solve the $\ell_1$-minimization problems of Equations (3) and (5), various efficient algorithms have been developed. Two typical algorithms based on the interior-point idea are l1-magic [30] and l1-ls [31]. The l1-magic algorithm [30] recasts the $\ell_1$-minimization problem as a second-order cone program and then applies the primal log-barrier approach. The l1-ls algorithm [31] is a specialized interior-point method for solving large-scale $\ell_1$-regularized least-squares programs that uses the preconditioned conjugate gradients algorithm to compute the search direction.
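As a rough illustration of the noise-robust problem in Equation (5), the following sketch recovers a sparse vector from fewer noisy measurements than unknowns. It does not use l1-magic or l1-ls. Instead, it assumes the closely related unconstrained form min 0.5·||y − Ax||² + λ||x||₁ and solves it with plain iterative soft-thresholding (ISTA). The function names, the value of λ, the problem sizes and the random Gaussian matrix are all illustrative assumptions, not settings from this paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam=0.05, n_iter=5000):
    """Minimize 0.5 * ||y - A x||_2^2 + lam * ||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)           # gradient of the least-squares term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Toy problem (assumed sizes): m = 50 measurements of a 5-sparse signal of length n = 100.
rng = np.random.default_rng(0)
m, n, k = 50, 100, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=k) * (1.0 + rng.random(k))
y = A @ x_true + 0.01 * rng.standard_normal(m)   # noisy model y = A x + e

x_hat = ista(A, y)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative reconstruction error: {rel_err:.3f}")
```

ISTA is chosen here only because it fits in a few lines; interior-point solvers such as those cited above typically converge in far fewer iterations on problems of realistic size.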

Sparse Representation Classifier (SRC)
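Recently, a sparse representation classifier (SRC) has been developed based on the CS theory [27,28]. In the SRC algorithm, it is assumed that the whole set of training samples forms a dictionary, and the recognition problem is then cast as one of discriminatively finding a sparse representation of the test image as a linear combination of training images by solving the optimization problem in Equation (3) or (5). Formally, for the training samples of a single class $k$, this assumption can be expressed as:
$$y_{test} = \sum_{j=1}^{n_k} \alpha_{k,j}\, y_{k,j} + e_k \tag{6}$$
where $y_{test}$ is the test sample, $y_{k,j}$ is the $j$-th of the $n_k$ training samples of class $k$, $\alpha_{k,j}$ is the corresponding weight, and $e_k$ is the approximation error. Taking the training samples of all $c$ classes together gives:
$$y_{test} = \sum_{i=1}^{c} \sum_{j=1}^{n_i} \alpha_{i,j}\, y_{i,j} + e \tag{7}$$
In matrix-vector notation, Equation (7) can be rewritten as:
$$y_{test} = Y\alpha + e \tag{8}$$
where $Y = [y_{1,1}, \ldots, y_{1,n_1}, \ldots, y_{c,1}, \ldots, y_{c,n_c}]$ and $\alpha = [\alpha_{1,1}, \ldots, \alpha_{1,n_1}, \ldots, \alpha_{c,1}, \ldots, \alpha_{c,n_c}]^T$. The linearity assumption in the SRC algorithm coupled with Equation (8) implies that the entries of the weight vector $\alpha$ should be zero except for those associated with the correct class of the test sample. To obtain the weight vector $\alpha$, the following $\ell_0$-norm minimization problem should be solved:
$$\min_{\alpha} \|\alpha\|_0 \quad \text{subject to} \quad \|y_{test} - Y\alpha\|_2 \le \varepsilon \tag{9}$$
It is known that Equation (9) is an NP-hard problem. The $\ell_0$-norm can be replaced by its closest convex surrogate, the $\ell_1$-norm. Therefore, the solution of Equation (9) is equivalent to that of the following $\ell_1$-norm minimization problem:
$$\min_{\alpha} \|\alpha\|_1 \quad \text{subject to} \quad \|y_{test} - Y\alpha\|_2 \le \varepsilon \tag{10}$$
This is a convex optimization problem and can be solved by quadratic programming. Once a sparse solution $\alpha$ is obtained, the classification procedure of SRC is summarized as follows:
Step 1: Solve the $\ell_1$-norm minimization problem in Equation (10).
Step 2: For each class $i$, compute the residual between the reconstructed sample $y_{recons}(i) = \sum_{j=1}^{n_i} \alpha_{i,j}\, y_{i,j}$ and the given test sample: $r(y_{test}, i) = \|y_{test} - y_{recons}(i)\|_2$.
Step 3: The class of the given test sample is determined by $identify(y_{test}) = \arg\min_{i} r(y_{test}, i)$.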
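The three classification steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the ℓ1 step is approximated by an ISTA solver for the Lagrangian form of the problem, and the names `l1_solve` and `src_classify`, the value of `lam`, and the synthetic three-class data (each class drawn from its own random 3-dimensional subspace) are all hypothetical.

```python
import numpy as np

def l1_solve(Y, y, lam=0.01, n_iter=3000):
    """ISTA for min 0.5*||y - Y a||_2^2 + lam*||a||_1 (a stand-in for Step 1)."""
    L = np.linalg.norm(Y, 2) ** 2
    a = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        z = a - Y.T @ (Y @ a - y) / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return a

def src_classify(Y, labels, y_test, lam=0.01):
    """Return (predicted class, per-class residuals) for one test sample."""
    Y = Y / np.linalg.norm(Y, axis=0, keepdims=True)   # unit l2-norm dictionary columns
    y_test = y_test / np.linalg.norm(y_test)
    alpha = l1_solve(Y, y_test, lam)                   # Step 1: sparse weights
    classes = np.unique(labels)
    residuals = []
    for c in classes:                                  # Step 2: class-wise residuals
        alpha_c = np.where(labels == c, alpha, 0.0)    # keep only the weights of class c
        residuals.append(np.linalg.norm(y_test - Y @ alpha_c))
    residuals = np.array(residuals)
    return classes[int(np.argmin(residuals))], residuals   # Step 3: smallest residual

# Synthetic data (assumption): 3 classes, 8 training samples each, 30-dimensional features.
rng = np.random.default_rng(1)
d, n_per_class, n_classes = 30, 8, 3
bases = [rng.standard_normal((d, 3)) for _ in range(n_classes)]
Y = np.hstack([B @ rng.standard_normal((3, n_per_class)) for B in bases])
labels = np.repeat(np.arange(n_classes), n_per_class)
y_test = bases[1] @ rng.standard_normal(3) + 0.01 * rng.standard_normal(d)

predicted, residuals = src_classify(Y, labels, y_test)
print("predicted class:", predicted, "residuals:", np.round(residuals, 3))
```

Because the test sample is generated from the subspace of class 1, the residual of that class should be by far the smallest, which is the behaviour that Steps 2 and 3 rely on.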

Facial Feature Extraction
In this section, two types of facial features, the Gabor wavelets representation and local binary patterns (LBP), are briefly introduced.

Gabor Wavelets Representation
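Gabor wavelets model the receptive field properties of cells in the primary visual cortex quite well [8,9]. The Gabor wavelet kernels exhibit strong characteristics of spatial locality and orientation selectivity, making them a suitable choice for image feature extraction when the goal is to derive local and discriminating features for facial expression classification. The Gabor wavelet kernels can be defined as:
$$\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2} \exp\left(-\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2}\right) \left[\exp\left(i\, k_{\mu,\nu} \cdot z\right) - \exp\left(-\frac{\sigma^2}{2}\right)\right]$$
where $\mu$ and $\nu$ denote the orientation and scale of the kernels, $z = (x, y)$, and the wave vector is $k_{\mu,\nu} = k_\nu e^{i\phi_\mu}$ with $k_\nu = k_{max}/f^{\nu}$ and $\phi_\mu = \pi\mu/8$. Here $k_{max}$ is the maximum frequency and $f$ is the spacing factor between kernels in the frequency domain. With five scales $\nu \in \{0, \ldots, 4\}$ and eight orientations $\mu \in \{0, \ldots, 7\}$, the usual parameter choices are $\sigma = 2\pi$, $k_{max} = \pi/2$, and $f = \sqrt{2}$. Figure 1 shows the real part of the Gabor wavelet kernels at five scales and eight orientations, and their magnitudes. The Gabor wavelets representation is essentially the concatenated pixels of the 40 modulus-of-convolution images obtained by convolving the input image with these 40 Gabor kernels. In practice, the magnitude of the Gabor wavelets representation is used for facial expression recognition. As suggested in [33], before concatenation each output image is down-sampled by a factor of 16 and normalized to zero mean and unit variance.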
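The construction of the 40-kernel representation can be sketched as follows. This is an illustrative sketch only: the kernel formula and the values σ = 2π and k_max = π/2 follow the widely used formulation of Gabor kernels and are assumptions here, as are the 31 × 31 kernel size, the FFT-based convolution, and the reading of "down-sampled by a factor of 16" as keeping every fourth pixel along each axis. The function names are hypothetical.

```python
import numpy as np

def gabor_kernel(mu, nu, size=31, sigma=2 * np.pi, k_max=np.pi / 2, f=np.sqrt(2)):
    """Complex Gabor kernel for orientation mu (0..7) and scale nu (0..4)."""
    k = (k_max / f ** nu) * np.exp(1j * np.pi * mu / 8)   # wave vector as a complex number
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    k2 = np.abs(k) ** 2
    z2 = xs ** 2 + ys ** 2
    envelope = (k2 / sigma ** 2) * np.exp(-k2 * z2 / (2 * sigma ** 2))
    carrier = np.exp(1j * (k.real * xs + k.imag * ys)) - np.exp(-sigma ** 2 / 2)
    return envelope * carrier

def conv2_same(image, kernel):
    """Linear 2-D convolution via zero-padded FFT, cropped to the image size."""
    H, W = image.shape
    kh, kw = kernel.shape
    shape = (H + kh - 1, W + kw - 1)
    full = np.fft.ifft2(np.fft.fft2(image, shape) * np.fft.fft2(kernel, shape))
    return full[kh // 2:kh // 2 + H, kw // 2:kw // 2 + W]

def gabor_features(image, step=4):
    """Concatenate down-sampled, normalized magnitude responses of the 40 kernels."""
    feats = []
    for nu in range(5):
        for mu in range(8):
            mag = np.abs(conv2_same(image, gabor_kernel(mu, nu)))
            v = mag[::step, ::step].ravel()            # 4 x 4 = 16-fold pixel reduction
            feats.append((v - v.mean()) / v.std())     # zero mean, unit variance
    return np.concatenate(feats)

image = np.random.default_rng(0).random((32, 32))      # stand-in for a face image
feats = gabor_features(image)
print(feats.shape)
```

Even for a small 32 × 32 input the concatenated vector is long, which is why a dimensionality reduction step is applied to this representation before classification.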

Local Binary Patterns
The local binary pattern (LBP) operator [12] is a gray-scale invariant texture primitive statistic, which has shown excellent performance in the classification of various kinds of textures. For each pixel in an image, a binary code is produced by thresholding its neighborhood with the value of the center pixel. The LBP code of the center pixel in the neighborhood is obtained by converting the binary code into a decimal one. Based on the LBP operator, each pixel of an image is labeled with an LBP code. The 256-bin histogram of the labels contains the density of each label and can be used as a texture descriptor of the considered region.
The process of LBP feature extraction is summarized as follows: firstly, a facial image is divided into several non-overlapping blocks. Secondly, LBP histograms are computed for each block. Finally, the block LBP histograms are concatenated into a single vector. As a result, the facial image is represented by the concatenated LBP histogram. Figure 2 presents the process of LBP feature extraction.
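The thresholding and block-histogram steps described above can be sketched as follows. The sketch assumes the basic 3 × 3 operator (eight neighbors at radius 1) with 256-bin histograms; the bit ordering of the neighbors, the `grid` parameter and the function names are illustrative choices.

```python
import numpy as np

def lbp_image(gray):
    """Basic 3x3 LBP code for every interior pixel of a grayscale image."""
    g = np.asarray(gray, dtype=np.int32)
    H, W = g.shape
    center = g[1:-1, 1:-1]
    # Eight neighbors, clockwise from the top-left corner (ordering is a convention).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        codes |= (neighbor >= center).astype(np.int32) << bit   # threshold against center
    return codes

def lbp_block_histogram(gray, grid=(2, 2)):
    """Split the LBP code map into grid blocks and concatenate their 256-bin histograms."""
    codes = lbp_image(gray)
    row_groups = np.array_split(np.arange(codes.shape[0]), grid[0])
    col_groups = np.array_split(np.arange(codes.shape[1]), grid[1])
    hists = []
    for r in row_groups:
        for c in col_groups:
            block = codes[r[0]:r[-1] + 1, c[0]:c[-1] + 1]
            hists.append(np.bincount(block.ravel(), minlength=256))
    return np.concatenate(hists)

patch = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
print(lbp_image(patch))            # neighbors 6, 9, 8, 7 are >= 5, giving 8+16+32+64 = 120

flat = np.full((6, 6), 7)
descriptor = lbp_block_histogram(flat, grid=(2, 2))
print(descriptor.shape)            # 4 blocks x 256 bins
```

On a constant image every neighbor equals the center, so every code is 255; this makes a convenient sanity check for the thresholding rule.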

Experiment Verification
To verify the effectiveness and robustness of SRC on facial expression recognition tasks, the popular Cohn-Kanade database [34] is used for experiments. Three typical facial features, including the raw pixels, Gabor wavelets representation and local binary patterns (LBP), are extracted to evaluate the performance of SRC on facial expression recognition tasks. To reduce the feature length of the Gabor wavelets representation, principal component analysis (PCA) [35] is used for dimensionality reduction. The reduced feature dimension is varied within the range (0, 100) in steps of 10. The performance of SRC is compared with the nearest neighbor (NN), linear SVM, and the recently developed non-parametric nearest subspace (NS) method [36]. Note that, for the SRC method, it is necessary to normalize the training and testing data to unit $\ell_2$-norm. The experiment platform is an Intel 2.10 GHz CPU with 1 GB of RAM, running MATLAB 7.0.1 (R14).
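The two preprocessing operations mentioned here, PCA-based dimensionality reduction and normalization to unit ℓ2-norm, can be sketched as below. The experiments in this paper were run in MATLAB; this NumPy version is only an illustration, and the function names, the SVD-based PCA and the random stand-in data are assumptions.

```python
import numpy as np

def pca_fit(X_train, n_components):
    """Return the training mean and the top principal directions (rows of Vt)."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project samples (rows of X) onto the principal directions."""
    return (X - mean) @ components.T

def unit_l2(X):
    """Scale every sample (row) to unit l2-norm, as required before applying SRC."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 200))          # stand-in for 60 feature vectors of length 200
mean, components = pca_fit(X[:50], n_components=20)   # fit on the training part only
Z = unit_l2(pca_transform(X, mean, components))
print(Z.shape)
```

Fitting the projection on the training part only, and then applying it to the test part, keeps the test data out of the model; whether the original experiments did exactly this is not stated in the text.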
A 10-fold cross validation scheme is employed in the 7-class facial expression recognition experiments, and the average recognition results are reported. In detail, each classification model is trained on nine tenths of the total data and tested on the remaining tenth. This process is repeated ten times, each with a different partitioning seed, in order to account for variance between the partitions. We provide facial expression recognition results and analysis in two aspects. On the one hand, facial expression recognition experiments are performed directly on the original clean images without any occlusion. On the other hand, facial expression recognition experiments are conducted when random pixel corruption and random block occlusion occur in the test images resized to 32 × 32 pixels.
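A minimal sketch of one pass of 10-fold cross validation is given below. It uses a 1-nearest-neighbor classifier as a placeholder for any of the compared methods, and the function names, the seed handling and the two-cluster toy data are assumptions for illustration only.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle the sample indices and split them into n_folds nearly equal parts."""
    perm = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(perm, n_folds)

def nn_classify(X_train, y_train, X_test):
    """1-nearest-neighbor classifier with squared Euclidean distance."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(d, axis=1)]

def cross_validate(X, labels, classify, n_folds=10, seed=0):
    """Train on n_folds - 1 parts, test on the held-out part, and average the accuracy."""
    folds = kfold_indices(len(labels), n_folds, seed)
    accs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = classify(X[train_idx], labels[train_idx], X[test_idx])
        accs.append(np.mean(pred == labels[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))

# Toy data (assumption): two well-separated clusters of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(100, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
mean_acc, std_acc = cross_validate(X, labels, nn_classify)
print(mean_acc, std_acc)

folds = kfold_indices(470)   # with 470 samples, each of the ten folds holds 47
```

Repeating `cross_validate` with different `seed` values and averaging the results corresponds to the repeated partitioning described above.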

Database and Pre-Processing
The Cohn-Kanade database [34] contains image sequences from 100 university students aged from 18 to 30 years, of whom 65% were female, 15% were African-American and 3% were Asian or Latino. Subjects were instructed to perform a series of 23 facial displays, six of which were based on descriptions of prototypic emotions. Image sequences from neutral to target display were digitized into 640 × 490 pixels with 8-bit precision for grayscale values. Figure 3 shows some sample images from the Cohn-Kanade database. In this work, 320 image sequences were selected from the Cohn-Kanade database. The selected sequences, each of which could be labeled as one of the six basic emotions, come from 96 subjects, with 1 to 6 emotions per subject. For each sequence, the neutral face and one peak frame were used for prototypic expression recognition. Finally, 470 images (32 anger, 100 joy, 55 sadness, 75 surprise, 47 fear, 45 disgust and 116 neutral) were obtained for experiments. For the raw pixel features, the original facial images are directly down-sampled to 32 × 32 pixels. The only reason for resizing the images to 32 × 32 pixels is that all the experiments can then be performed within the memory available to MATLAB on a typical PC.
For the Gabor wavelets representation and LBP feature extraction, our pre-processing is similar to that used in [13,15]. We normalized the eye distance of facial images to a fixed distance of 55 pixels once the centers of the two eyes were located. Generally, it is observed that the width of a face is roughly twice the eye distance, and the height is roughly three times the eye distance. Therefore, based on the normalized value of the eye distance, a resized image of 110 × 150 pixels was cropped from each original image.
The cropped facial images of 110 × 150 pixels contain the main facial components such as the mouth, eyes, brows and nose. The Gabor wavelets representation is obtained by convolving the whole region of the cropped facial image with the Gabor kernels. Likewise, the LBP features are obtained by applying the LBP operator to the whole region of the cropped facial images. Similar to the settings in [13][14][15][38], we selected the 59-bin operator $LBP_{P,R}^{u2}$, where the notation $(P, R)$ denotes a neighborhood of $P$ equally spaced sampling points on a circle of radius $R$ that form a circularly symmetric neighbor set, and the superscript $u2$ indicates using only uniform patterns and labeling all remaining patterns with a single label. We then divided the 110 × 150 pixel facial images into regions of 18 × 21 pixels, giving a good trade-off between recognition performance and feature vector length. Thus facial images were divided into 42 (6 × 7) regions, and represented by LBP histograms with a length of 2,478 (59 × 42).
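The bin count and feature length quoted above can be checked with a short calculation. The sketch assumes P = 8 sampling points, which is the setting that yields 59 bins: a pattern is uniform when its circular bit string has at most two 0/1 transitions, and all non-uniform patterns share one extra bin.

```python
def transitions(code, bits=8):
    """Number of 0/1 transitions in the circular bit pattern of an LBP code."""
    pattern = [(code >> i) & 1 for i in range(bits)]
    return sum(pattern[i] != pattern[(i + 1) % bits] for i in range(bits))

# Uniform patterns have at most two circular transitions (assuming P = 8 neighbors).
uniform_codes = [c for c in range(256) if transitions(c) <= 2]
n_bins = len(uniform_codes) + 1            # one shared bin for all non-uniform patterns

regions = (110 // 18) * (150 // 21)        # 6 x 7 regions of 18 x 21 pixels
feature_length = n_bins * regions

print(len(uniform_codes), n_bins, regions, feature_length)   # 58 59 42 2478
```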

Experimental Results without Occlusion
When using the raw pixels (i.e., the resized images of 32 × 32 pixels) and LBP features for experiments, the corresponding recognition results and standard deviations (std) of the different methods, including NN, SVM, NS, and SRC, are given in Table 1. The recognition results of the different methods as a function of the reduced dimension of the Gabor wavelets representation are presented in Figure 4. Table 2 shows the best accuracy of each method together with the corresponding reduced dimension of the Gabor wavelets representation. The results in Tables 1-2 and Figure 4 reveal that SRC achieves an accuracy of 94.76% with the raw pixels, 97.14% with LBP features, and 98.1% at best with a reduced dimension of 50 for the Gabor wavelets representation, outperforming the other methods. This confirms the validity and high performance of SRC for facial expression recognition.
Tables 3-5 display the confusion matrices of the recognition results of SRC with the raw pixels, LBP features, and Gabor wavelets representation, respectively. From the results in Tables 3-5, we can see that most of the seven expressions are identified very well, with an accuracy of 100%.
The recognition accuracy obtained by SRC (i.e., 97.14% with LBP features, and 98.1% with the Gabor wavelets representation) on 7-class facial expression recognition tasks is highly competitive compared to previously reported results on the Cohn-Kanade database. In [14], an LBP-based SVM gave a best accuracy of 88.4% on 7-class facial expression recognition tasks. In [13], LBP features with SVM yielded a 7-class recognition accuracy of 91.4% at best. In [37], the highest accuracy of 93.4% on 7-class tasks was obtained with SVM, but using an improved LBP feature called the local directional pattern (LDP).
Table 3. Confusion matrix of recognition results of SRC with the raw pixels.

Experimental Results with Occlusion
In this section, we used the resized images of 32 × 32 pixels from the Cohn-Kanade database to verify the robustness of SRC to two kinds of occlusion, i.e., random pixel corruption and random block occlusion. Figure 5 gives an example of a 50% corrupted face image on the resized image of 32 × 32 pixels. As shown in Figure 5, beyond 50% corruption, the corrupted images can scarcely be identified as facial images. Figure 6 plots the recognition accuracy of all methods, i.e., NN, SVM, NS and SRC, for corruption percentages from 0% to 90%. It can be observed that the performance of all methods decreases as the percentage of corrupted pixels increases. Nevertheless, SRC still dramatically outperforms the other methods at every level of corruption. This indicates that SRC is more robust to random pixel corruption than the other methods.
We next investigate the robustness of SRC to random block occlusion. We simulate this situation for occlusion percentages from 0% to 50% by replacing a randomly located square block of each test image with an unrelated image of a baboon, as shown in Figure 7(a). Note that the location of the occlusion is randomly chosen for each image and is unknown to the algorithm. Figure 7 shows an example of a 30% occluded face image. To the human eye, beyond 30% occlusion, the facial region is almost completely occluded. In this case, recognition is a difficult task even for humans. Figure 8 gives the recognition performance of SRC and its three competitors as a function of the percentage occluded, from 0% to 50%. As illustrated in Figure 8, the recognition accuracy of SRC significantly exceeds that of the other methods at every level of occlusion. This demonstrates that SRC achieves a higher level of robustness to random block occlusion than the other methods.
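The two degradations can be simulated along the following lines. This is a sketch under assumptions rather than the exact protocol of the experiments: corrupted pixels are replaced by independent uniform random values in [0, 255], the occluding block is a square whose area matches the requested fraction, and a constant patch stands in for the unrelated (baboon) image. The function names are hypothetical.

```python
import numpy as np

def corrupt_pixels(image, fraction, rng):
    """Replace a given fraction of pixels, at random locations, with random values."""
    out = image.copy()
    n = int(round(fraction * out.size))
    idx = rng.choice(out.size, size=n, replace=False)
    out.flat[idx] = rng.integers(0, 256, size=n).astype(out.dtype)
    return out

def occlude_block(image, fraction, patch, rng):
    """Paste a square block of `patch` covering about `fraction` of the image area."""
    out = image.copy()
    H, W = out.shape
    side = min(int(round(np.sqrt(fraction * H * W))), H, W)
    top = int(rng.integers(0, H - side + 1))      # location is random and not recorded
    left = int(rng.integers(0, W - side + 1))
    out[top:top + side, left:left + side] = patch[:side, :side]
    return out

rng = np.random.default_rng(0)
face = np.full((32, 32), 128, dtype=np.uint8)     # stand-in for a 32 x 32 test image
patch = np.full((32, 32), 255, dtype=np.uint8)    # stand-in for the unrelated image

corrupted = corrupt_pixels(face, 0.5, rng)
occluded = occlude_block(face, 0.3, patch, rng)
print((corrupted != face).sum(), (occluded == 255).sum())
```

With a 32 × 32 image, a 30% block has a side of 18 pixels (the square root of 0.3 × 1024, rounded), so 324 pixels are overwritten.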

Conclusions
In this paper, we present a new technique for robust facial expression recognition via a sparse representation classifier (SRC) based on the CS theory. Experimental results on the popular Cohn-Kanade facial expression database demonstrate that SRC obtains promising performance on facial expression recognition without occlusion, and exhibits strong robustness to the random pixel corruption and random block occlusion occurring in facial expression images. It is worth pointing out that in this work, for simplicity, we focus only on static 2D facial expression recognition. In recent years, 3D facial expression recognition has been considered a major solution to the unsolved issues of reliable 2D facial expression recognition, i.e., illumination and pose changes. Therefore, it is also an interesting task to investigate the performance of the presented method for 3D facial expression recognition.