Active AU Based Patch Weighting for Facial Expression Recognition

Facial expression recognition has many applications in human-computer interaction. Although feature extraction and selection have been well studied, the specificity of each expression variation is not fully explored in state-of-the-art works. In this work, the problem of multiclass expression recognition is converted into triplet-wise expression recognition. For each expression triplet, a new feature optimization model based on action unit (AU) weighting and patch weight optimization is proposed to represent the specificity of the expression triplet. A sparse representation-based approach is then proposed to detect the active AUs of the testing sample for better generalization. The algorithm achieved competitive accuracies of 89.67% and 94.09% on the Jaffe and Cohn–Kanade (CK+) databases, respectively. Better cross-database performance has also been observed.


Introduction
With the help of facial expression recognition, human-computer interaction can automatically obtain the information of the human face and infer the psychological status of the user, which can be applied to driver monitoring, face paralysis expression recognition, intelligent access control, and so on.
Recognition of the six basic expressions, i.e., happy (Ha), angry (An), surprise (Su), fear (Fe), disgust (Di) and sad (Sa), plus the neutral (Ne) expression, can be categorized into 3D-based and 2D-based approaches. 3D-based expression recognition is an active research topic [1], which often employs geometry features, like differential curvature [2,3], based on an aligned face mesh [4]. The 2D-based approaches are currently prevailing due to the easy accessibility of training samples. The facial action coding system (FACS) [5] is one of the important 2D approaches, i.e., expressions are described and tracked according to some basic action units (AUs). The FACS was defined by Ekman [6] to reflect the deformation status of the corresponding facial part; it was developed based on a set of discrete emotions and initially applied to measure specific facial muscle movements named AUs. While AUs are often used as an intermediate step for recognizing the basic expressions, image-based feature representation is often used to recognize the expression directly. Dynamic recognition based on expression image sequences is important in face animation and multimedia analysis [7][8][9][10]. As less information about the considered expression is available, static image-based recognition is more challenging.
Deep learning with convolutional neural networks (CNNs), such as the multiscale feature-based CNN [11], the hierarchical committee-based CNN [12] and the architecture-improved CNN [13], has also been applied to static expression recognition. Pramerdorfer and Kampel [14] gave a detailed survey of these algorithms. However, current AU-based algorithms model the AU relations without considering the weights of the patches contained in each AU. The AU feature is suitable for encoding the macro-scale information of each expression, since it integrates the large-scale information of face parts. However, it is less suited to encoding micro-scale features, since the combination space of AUs is limited when they are not carefully organized. Moreover, most of these algorithms learn the features from the training expression samples, while the active features implied in each testing expression sample are not fully exploited. A model learned from training samples might produce a wrong classification when the testing sample is significantly different, yet only a few works detect and use the active features of the testing sample for recognition.
Although much research has been conducted on facial expression recognition, the following problems remain to be addressed. First, current algorithms recognize the seven expressions with a single uniform feature weight, which may overlook feature specificity and the multi-scale property. In this work, the seven-expression recognition problem is divided into multiple sub-problems over appropriate subsets, i.e., expression triplets. Moreover, the weight vector w.r.t. each expression triplet is fine-tuned individually to fully account for its specificity. Second, the patch-based and AU-based features are often encoded separately without considering the influence of their deficiencies. In this work, the weights of the patches contained in each AU are finely optimized to represent the characteristics of different expressions. In this way, the advantages of large-scale (AU-based) and small-scale (patch-wise) features are both exploited. Third, few current works make use of the specificity of each testing sample before recognition; wrong classification can occur when the testing sample is significantly different. Thus, this work exploits the active features of each testing expression.
The main novelties of this work lie in three aspects. First, a two-stage expression recognition scheme based on expression triplet weighting is introduced to represent the diversity among different expressions. Second, a new offline weight optimization for the patches contained in each AU is proposed to increase the discrimination abilities of both the patch and AU features. Third, online detection of active AUs for each testing sample is proposed to fully exploit its specificity for feature encoding.
This paper is structured as follows. Section 2 gives a description of the proposed algorithm step by step. The experimental results of the proposed algorithm on public databases are presented in Section 3. Finally, discussions and some conclusions are addressed in Section 4.

Framework of the Algorithm
The sketch of the expression recognition pipeline is illustrated in Figure 1. In offline training, faces are divided into a number of non-overlapping patches, regions and facial parts, like eyes, nose and mouth, from which Gabor surface features are extracted. AUs for each face part are defined as well. For each expression triplet combination, the weights of the AUs and patches are optimized. In testing, standard seven-class expression recognition is conducted at the first stage, and the top three expression candidates are proposed. Based on the suggested expression triplet, the active AUs of the testing sample are detected and weighted using the learned weight vector. A weighted SVM is finally applied to give the expression label. The entire algorithm is elaborated in the following sections.

Definition of Patch, Region, Part and AU
The face image was first aligned with reference to the feature points located using the approach presented in [50,51] and then resized to 84 × 68. Only the central region with the size of 80 × 64 is cropped out for the following processing. Illumination was normalized by the method proposed in [52].
To extract features representing local face variations caused by expression, the face image is divided into 10 × 8 patches. Based on these 80 patches, 12 regions with relatively fixed shapes (RFSs) are defined to represent the expression-sensitive face parts (PT). Figure 2c shows the 12 regions, and Table 1 lists the landmarks involved in each region; for example, points {P_3, P_8} define the length and width d_1 of region R_1, with the region sizes given in Figure 2d. While patches represent the local texture, regions encode the important variations correlated with expression changes. Figure 3 presents the regions used to define each of the seven face parts, i.e., eyes, brows, mouth, forehead, nose root, nasolabial regions and chin (PT_1 − PT_7); it lists the AUs of the seven facial parts with the corresponding RFSs and expression labels. The first part (eye, PT_1) consists of two regions, R_4 and R_5, which are classified as AU_1 − AU_3 according to the eye status, and the listed e_1 − e_7 are the expressions related with each defined AU, i.e., neutral, angry, disgust, fear, happy, sad and surprise. The darker region on the mouth part denotes the patches having a nonempty intersection with R_6, R_8 and R_9 in Figure 2c, which illustrates the relation between patch, region and part.
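To make the patch grid concrete, the following is a minimal sketch: dividing the 80 × 64 crop into a 10 × 8 grid yields 8 × 8 non-overlapping patches (the function name is illustrative):

```python
import numpy as np

# Illustrative sketch: divide an aligned 80x64 face crop into the 10x8 grid
# of non-overlapping patches described in the text (each patch is 8x8).
def split_into_patches(face, rows=10, cols=8):
    h, w = face.shape
    ph, pw = h // rows, w // cols          # 80/10 = 8, 64/8 = 8
    return [face[r*ph:(r+1)*ph, c*pw:(c+1)*pw]
            for r in range(rows) for c in range(cols)]

face = np.zeros((80, 64))
patches = split_into_patches(face)
assert len(patches) == 80 and patches[0].shape == (8, 8)
```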
For encoding the face part variations related to expression [53], AUs have been widely used in the literature. We define 20 AUs to encode the seven expressions, named e_1 − e_7, i.e., neutral, angry, disgust, fear, happy, sad and surprise. Different from the AU labeling in [53], the part regions of each AU are manually labeled on the training samples in this work. See Figure 3 for the definition of the 20 AUs and their relationship with the 12 regions and the seven parts. For example, AU_3 encodes wide-open eyes, which is usually correlated with e_7, i.e., surprise. As both AU_1 and AU_2 encode the status of the two eyes, they mainly consist of regions around the eyes, i.e., R_4 and R_5. Figure 2d labels the involved patches of an example mouth part (AU_8 − AU_12) in dark grey, i.e., each AU of a part may involve multiple patches.
To encode the texture magnitude map, the Gabor surface feature (GSF) [18] is employed, since it can depict the curvature information of the wrinkles and the direction of the expression texture. To extract the GSF, the n_s × n_a Gabor magnitude images are first extracted and then encoded with a local binary pattern (LBP)-style binarization to reduce the feature sensitivity to misalignment. More precisely, the GSF of the k-th patch p_k is formulated as

pf_{p_k}(i, j) = B(i, j) + 2·B_x(i, j) + 4·B_y(i, j) + 8·B_2(i, j),

where B, B_x, B_y, B_2 are the binarizations of the Gabor magnitude image I and of its first-order and second-order gradient images I_x, I_y, I_xx + I_yy corresponding to patch p_k. For each pixel (i, j) of p_k, its binary value is defined as

B(i, j) = 1 if I(i, j) > ThresMed_{i,j}, and B(i, j) = 0 otherwise,

where the threshold ThresMed_{i,j} is the median of the pixel values of patch p_k. Thus, pf_{p_k} is an output feature map with values ranging from zero to 15, which is further transformed into a histogram for feature representation. For each face patch or region, the corresponding GSF is then vectorized as a 16 × n_s × n_a-dimensional vector, where n_s, n_a are defined in Equation (1). Finally, the feature of the i-th expression sample is represented as

f_i = [h_1, h_2, ..., h_n],

where n is the number of patches or regions and h_k is the GSF histogram of the k-th patch or region. For convenience in the following illustration, the feature f_i can also be grouped according to the seven face parts presented in Figure 3.
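As an illustration of the GSF encoding described above, the sketch below binarizes four Gabor-derived maps of a patch against the patch median and combines them into a 4-bit code in [0, 15], consistent with the stated value range; the exact bit ordering and histogram normalization are assumptions:

```python
import numpy as np

# Hedged sketch of the 4-bit GSF encoding for one patch: each of the four
# Gabor-derived maps (magnitude I, gradients I_x, I_y, and I_xx + I_yy) is
# binarized against the patch median, then the four bits are combined into
# a code in [0, 15] and histogrammed. The bit ordering is an assumption.
def binarize(patch):
    return (patch > np.median(patch)).astype(np.uint8)

def gsf_patch_histogram(I, Ix, Iy, I2):
    code = (binarize(I)
            + 2 * binarize(Ix)
            + 4 * binarize(Iy)
            + 8 * binarize(I2))              # values in 0..15
    hist, _ = np.histogram(code, bins=16, range=(0, 16))
    return hist / code.size                  # normalized 16-bin histogram

rng = np.random.default_rng(0)
maps = [rng.standard_normal((8, 8)) for _ in range(4)]
h = gsf_patch_histogram(*maps)
assert h.shape == (16,)
```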

Feature Optimization
Based on the feature representation, the weights of the AUs and patches for each expression triplet are optimized. First, the original AUs are weighted with the conditional transition probability matrix; based on this, weight optimization is performed in the second step to weigh the patches involved in each AU. These two steps are conducted on the training samples and performed offline. The third step selects the active AUs for each testing sample online. The entire procedure of the feature optimization is presented in Algorithm 1.
Algorithm 1 AU weighting, patch weight optimization and active AU detection.
1: Offline Training:
2: AU weighting (W_AU) using the conditional probability matrix presented in Section 2.3.1;
3: Patch-wise weight optimization of weight vectors ({W_i}) by multi-task sparse learning presented in Section 2.3.2;
4: Online Testing:
5: Active AU (AU_A) detection for testing samples by sparse representation presented in Section 2.3.3.

AU Weighting
Motivated by the causal AU pair extraction with a large transition probability using the Bayesian network (BN) [46], the representative abilities of all of the AUs for each expression are weighted in this work to decrease the influence of weakly-related AUs and to provide a constraint for the subsequent patch-wise weight optimization.
With the labeled AUs of all of the training samples, the causal relation network between the AUs is obtained via the conditional probability matrix. The probability p^e_{i|j} of the i-th AU conditioned on the j-th AU w.r.t. the e-th expression is defined as the product of the co-occurrence and co-absence probabilities:

p^e_{i|j} = P(i ∈ AU_e | j ∈ AU_e) · P(i ∉ AU_e | j ∉ AU_e),

where AU_e denotes all of the action units of the e-th expression; the first factor is the co-occurrence probability of the i-th AU conditioned on the j-th AU, and the second factor is the co-absence probability of action units i, j conditioned on j ∉ AU_e. Then, the degree of the causal relation of the AU pair (i, j) is defined as

CR^e(i, j) = min(p^e_{i|j}, p^e_{j|i}).   (8)

The 'min' of the two conditional probabilities is adopted to avoid abnormal probabilities resulting from the imbalance of the expressions related with each AU. For example (see Figure 3), AU_2 is related to a large number of expressions, which may result in a significantly larger arrival transition probability.
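A hedged sketch of this causal relation computation, assuming binary AU occurrence labels per training sample; the conditional-probability estimator and the 'min' symmetrization follow the text, while variable names are illustrative:

```python
import numpy as np

# Hedged sketch of the AU causal-relation weighting: from binary AU labels of
# the training samples of one expression, estimate the co-occurrence
# probability P(AU_i | AU_j), the co-absence probability P(not AU_i | not AU_j),
# take their product, and symmetrize with 'min' as described in the text.
def causal_relation(au, eps=1e-9):
    # au: (n_samples, n_aus) binary matrix for one expression
    n, m = au.shape
    p = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            occ = (au[:, i] * au[:, j]).sum() / (au[:, j].sum() + eps)
            absent = (((1 - au[:, i]) * (1 - au[:, j])).sum()
                      / ((1 - au[:, j]).sum() + eps))
            p[i, j] = occ * absent
    return np.minimum(p, p.T)                 # CR(i, j) = min(p_ij, p_ji)

rng = np.random.default_rng(1)
au = (rng.random((50, 20)) > 0.5).astype(float)
cr = causal_relation(au)
assert cr.shape == (20, 20)
```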
For each expression, a 20 × 20 relation probability matrix is obtained with Equation (8). With the causal relation matrices for all of the expressions, the representative ability of each AU for each expression can be found by simultaneously maximizing the representative ability for the considered expression and minimizing the representative abilities for the other expressions. That is, the representative ability of the AU pair (i, j) for the e-th expression is obtained as

RA^e_{i,j} = CR^e(i, j) − (1 / |{l ≠ e}|) Σ_{l ≠ e} CR^l(i, j),

where |{l ≠ e}| represents the number of elements in the set {l ≠ e}. Then, the representative ability of the i-th AU for the e-th expression is obtained as

RA^e_i = max_j RA^e_{i,j}.

To apply {RA^e_i, 1 ≤ i ≤ 20, 1 ≤ e ≤ 7} for recognition, the maximum AU representative ability of each expression and face part is collected, which is denoted RAP^e_i:

RAP^e_i = max_{k ∈ I_{A_i}} RA^e_k,

where the set I_{A_i} denotes the AU indices corresponding to the i-th part. Given the correspondence between AUs and face parts (PT), the maximum representative abilities of the parts are used to weigh the corresponding AUs, which are denoted as W_AU in Algorithm 1. Due to the limited number of AUs and training samples, the representation space of the AUs is limited when they are simply organized. To expand the representation space of the AUs, the contribution of the patches contained in each AU is weighted by the following weight optimization model.

Patch Weight Optimization
Based on the weights of AUs for each expression, a weight optimization model is proposed to weigh the patches of each AU for the considered expression triplet G = {e 1 , e 2 , e 3 }.
The objective is to minimize a loss function with weight sparseness and regularization constraints:

min_w (1/N) Σ_j Σ_{k ∈ ID_j} log(1 + exp(−α · g(f_j, f_k, w))) + λ ||w||_1,  s.t. ||w_{PT_i}||_1 = RAP^G_i, 1 ≤ i ≤ 7,   (12)

where f_j, f_k are the features of the j-th and k-th training samples of the triplet G, whose patch features are defined in Equation (2) and reduced to two dimensions using PCA and LDA [30]; N = Σ_j |ID_j|, and ID_j records all of the training sample indices. Vector w records the weights of all of the patches; w_{PT_i} records the weights of the patches related with the i-th part PT_i; parameter α is fixed to one. RAP^G_i denotes the normalized representative ability w.r.t. the considered expression triplet G:

RAP^G_i = (Σ_{e ∈ G} RAP^e_i) / (Σ_{l=1}^{7} Σ_{e ∈ G} RAP^e_l).   (13)

The loss function g(f_j, f_k, w) reflects the similarity loss of the feature vectors f_j, f_k under the weight vector w and is constructed to minimize the intra-class variance and maximize the inter-class variance:

g(f_j, f_k, w) = δ_{j,k} · ||w ∘ (f_j − f_k)||_2^2,  with δ_{j,k} = 1 if L(f_j) ≠ L(f_k) and δ_{j,k} = −1 otherwise,   (14)

where L(f_k) is the expression label of the training feature vector f_k and ∘ denotes the element-wise product. To solve the optimization problem (12), the gradient of the first term of the objective function is formulated as

∇_w = −(α/N) Σ_j Σ_{k ∈ ID_j} (1 − 1/e_{j,k}(w)) · ∇_w g(f_j, f_k, w),   (15)

where e_{j,k}(w) = 1 + exp(−α · g(f_j, f_k, w)), and ∇_w g is computed from Equation (14) as

∇_w g(f_j, f_k, w) = 2 δ_{j,k} · w ∘ (f_j − f_k) ∘ (f_j − f_k).   (16)

The sparseness term of Model (12) adopts the L_1 norm, which is a special case of the L_1/L_2 mixed norm employed in [45,55]. Thus, the optimization model (12) is solved with the modified multi-task sparse learning algorithm of [45,55], with the following differences. The overall optimization algorithm is elaborated in Algorithm 2.
• The weight of each patch is initialized as a ratio of the corresponding AU representative ability:

w_{P_j,0} = RAP^G_i / |PT_i|,   (17)

where i is the index of the part including the j-th patch, |PT_i| denotes the number of patches in the part PT_i and w_{P_j,0} records the initial weight of the j-th patch P_j. The initialization procedure is presented in Step 3 of Algorithm 2;
• The weight vector w_{s+1} and auxiliary vector v_{s+1} in the (s + 1)-th iteration of Algorithm 2 are normalized to satisfy the constraint defined in Equation (12):

w_{P_j,s+1} ← w_{P_j,s+1} · RAP^G_i / ||w_{PT_i,s+1}||_1,   (18)

where i is the index of the part including the j-th patch and w_{PT_i} denotes the weights of the part PT_i. The normalization is employed in Steps 11 and 19;
• Compared with [45], optimization Model (12) minimizes the feature similarity bias of different expression classes in Equation (14), which uses the information of mutual feature differences and contains more information than the expression label matching in [45]. The corresponding objective function and gradient vector are changed according to Equations (12) and (15), as revealed in Step 5.
With the weight optimization model (12) and Algorithm 2, the weights of the patches for each expression triplet are obtained. The number of optimized weight vectors {W_i} equals C_7^3 = 35, i.e., the number of expression triplets.
Algorithm 2 The modified multi-task sparse optimization.
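The per-iteration structure of the patch weight update can be sketched as follows. This is a simplified stand-in, not the paper's exact Algorithm 2: a gradient step on the loss, a soft-threshold step for the L1 sparseness term, and renormalization so that each part's weights sum to its (triplet-normalized) representative ability:

```python
import numpy as np

# Hedged, simplified sketch of one iteration of the patch-weight optimization.
# The learning rate, the L1 proximal (soft-threshold) step and the toy gradient
# are stand-ins; only the per-part L1 renormalization mirrors the constraint
# ||w_PT_i||_1 = RAP_i described in the text.
def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def iterate(w, grad, rap_part, part_index, lr=0.1, lam=0.01):
    w = soft_threshold(w - lr * grad, lr * lam)   # gradient + sparseness steps
    for i, rap in enumerate(rap_part):            # renormalize each part
        idx = part_index == i
        s = np.abs(w[idx]).sum()
        if s > 0:
            w[idx] = w[idx] * (rap / s)
    return w

rng = np.random.default_rng(2)
n_patches = 80
part_index = np.arange(n_patches) % 7             # part id of each patch
rap = np.full(7, 1.0 / 7)                         # normalized abilities (toy)
w = np.full(n_patches, 1.0 / n_patches)           # uniform initialization
w = iterate(w, 0.01 * rng.standard_normal(n_patches), rap, part_index)
assert abs(np.abs(w).sum() - 1.0) < 1e-6          # part constraints preserved
```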

Active AU Detection
Though each expression is related to several AUs, these AUs may not all be present at the same time. For example, while the AUs involving the brows and eyes are present for the surprise expression shown in Figure 4e,j, the AU involving the mouth is less active. In this case, errors may occur if the features extracted from the AU involving the mouth are included for expression recognition. To address this issue, we propose a sparse representation-based approach to identify, for each testing sample, the parts whose corresponding AUs are active before expression recognition.
For the k-th part of the i-th testing sample in Figure 3, the sparse representation is obtained by solving

min_{c^(k)} || f^(k)_i − D^(k) c^(k) ||_2^2 + λ ||c^(k)||_1,   (19)

where D^(k) = [f^tr(k)_1, ..., f^tr(k)_n] collects the patch features corresponding to the k-th face part (PT_k) of all of the training samples of the candidate expression triplet and the neutral expression. Vector c^(k) records the n-dimensional sparse representation coefficients, and λ is the regularization parameter, set to 1e−3 in this work [56,57].
With the part-based sparse representation, the coefficients w.r.t. the AUs related with the neutral and non-neutral expressions are obtained, where the AUs related with the neutral expression (AU_NE) are {AU_1, AU_4, AU_8, AU_13, AU_15, AU_17, AU_19} and the others are the AUs related with the non-neutral expressions (AU_EX), as presented in Figure 3. More precisely, for the k-th part, the components of the weight vector c^(k) are split into c_NE and c_EX according to whether the corresponding training sample is associated with a neutral or non-neutral AU. Finally, the activeness ActV^(k)_i of PT_k of the i-th testing sample is computed from the n = 10 largest coefficients, comparing the coefficient mass on the non-neutral components against that on the neutral components, where t_j denotes the index of the weight with the j-th largest value and the cap n = 10 is set to reduce the influence of abnormal weight components produced by the sparse representation (19).
To judge whether PT_k or the corresponding AU is active, we treat each training sample f_i with a non-neutral label as a testing sample and obtain its activeness value TrActV^(k)_i, where f_i is removed from the dictionary D^(k) when computing the sparse coefficients c^(k) in Equation (19). The considered part of the testing image is then decided to be active if ActV^(k)_i is larger than the average of the training activeness values:

ActV^(k)_i > (1/n_tr) Σ_{j=1}^{n_tr} TrActV^(k)_j,

where n_tr is the number of training samples with non-neutral labels. When the number of selected active AUs for the i-th testing sample is less than two, which is likely to happen for neutral expression samples, the AUs with the top two largest activeness values are selected instead. Figure 5a,b show that the number of non-zero coefficients corresponding to non-neutral expression samples for the active 'Brow' part is significantly larger than that for the non-active 'NoseRoot' part. Figure 5c shows the activeness values of the seven parts of the same example, which clearly suggests that 'Brow' is the most active part and 'NoseRoot' the least active. While 'Brow', 'Eye' and 'Forehead' are decided to be active and included for feature representation, non-active parts, like the 'Mouth', 'NoseRoot', 'Nasolabial' and 'Chin' regions, are not involved in the following expression recognition.
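The part-wise sparse representation and activeness scoring can be sketched as below, using ISTA as a simple solver for the LASSO problem (19); the activeness score shown is an illustrative ratio of non-neutral to neutral coefficient mass, not the paper's exact formula:

```python
import numpy as np

# Hedged sketch of the part-wise sparse representation: represent the test
# part feature over a dictionary of training part features with ISTA (a basic
# LASSO solver), then score activeness as the coefficient mass on non-neutral
# training samples relative to neutral ones. The score is illustrative only.
def ista_lasso(D, f, lam=1e-3, n_iter=500):
    L = np.linalg.norm(D, 2) ** 2             # Lipschitz constant of gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ c - f)                  # gradient of 0.5*||f - Dc||^2
        c = np.sign(c - g / L) * np.maximum(np.abs(c - g / L) - lam / L, 0.0)
    return c

rng = np.random.default_rng(3)
D = rng.standard_normal((30, 40))
D /= np.linalg.norm(D, axis=0)                 # unit-norm dictionary atoms
is_neutral = np.arange(40) % 2 == 0            # toy neutral/non-neutral split
f = D[:, 1] + 0.01 * rng.standard_normal(30)   # test feature close to atom 1
c = ista_lasso(D, f)
activeness = np.abs(c[~is_neutral]).sum() / (np.abs(c[is_neutral]).sum() + 1e-9)
assert np.argmax(np.abs(c)) == 1               # atom 1 dominates, as expected
```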
After the active AU detection for each testing sample, the optimized patch weights for the corresponding candidate expression triplet with Algorithm 2 are used to weigh the selected active AUs (AU A ) and the involved patches for the following recognition.

Weighted SVM for Classification
After the feature weight optimization and AU activeness detection, a support vector machine (SVM) with a slightly modified kernel function is employed for the classification [58]. Rather than treating the feature weights as variables in the SVM and obtaining them with mutual information [59], the optimized feature weights learned in Section 2.3.2 are directly applied to the patches involved in the active AUs detected in Section 2.3.3. That is, the new inner product ⟨f_i, f_j⟩_w of two features f_i, f_j with weight vector w is defined as

⟨f_i, f_j⟩_w = ⟨w · f_i, w · f_j⟩,   (23)

where ⟨x, y⟩ = x^T y is the inner product of two vectors and x · y = (x_1 y_1, ..., x_n y_n) is the element-wise product. Finally, with the newly defined inner product (23) as the kernel function, the SVM is used for the recognition.
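The weighted inner product of Equation (23) amounts to scaling each feature dimension by its learned weight before a standard dot product; a minimal sketch:

```python
import numpy as np

# Sketch of the weighted inner product used as the SVM kernel:
# <f_i, f_j>_w = <w . f_i, w . f_j>, i.e., each feature dimension is scaled
# by its learned patch weight before the standard inner product.
def weighted_kernel(F1, F2, w):
    # F1: (n1, d), F2: (n2, d), w: (d,) learned weights
    return (F1 * w) @ (F2 * w).T

rng = np.random.default_rng(4)
F = rng.standard_normal((5, 12))
w = rng.random(12)
K = weighted_kernel(F, F, w)
assert K.shape == (5, 5)
```

Such a precomputed Gram matrix can be passed to an SVM implementation that accepts precomputed kernels (e.g., scikit-learn's `SVC(kernel='precomputed')`).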

Experimental Results
We performed the experiments using MATLAB 2014b on a PC with a 4-GHz core processor and 32 GB of RAM. For the experimental testing, the Jaffe [60], Cohn-Kanade (CK+) [61] and SFEW2 [62] databases are employed for the performance and feature optimization study. Another three databases, i.e., the Taiwanese Facial Expression Image Database (TFEID) [63], the Yale-A database (YALE) [64] and EURECOM [65], are employed for the generalization testing. Among them, the SFEW2 database was collected in real life, with faces captured under uncontrolled head poses and lighting conditions. Since the appearance of the same expression differs from person to person, the collected expressions were labeled by two independent labellers to guarantee that each image really represents a specific expression [62]. The remaining databases were recorded under controlled lighting conditions with all-frontal faces; the corresponding participants were instructed by an experimenter to perform a series of facial displays for each expression [61].
The Jaffe database consists of 213 expression images of 10 Japanese female models, which are categorized into the six basic and the neutral expressions, i.e., angry (An), disgust (Di), fear (Fe), happy (Ha), sad (Sa) and surprise (Su). The CK+ database consists of 593 expression sequences from 123 subjects, where 327 sequences are labeled with one of seven expressions (angry, disgust, fear, happy, sad, surprise and contempt). Each sequence contains a set of frames captured as the subject changes his/her expression; 1033 expression images, i.e., the neutral image and three non-neutral images sampled from each expression sequence, are used for testing. The SFEW2 database is derived from the static expression recognition sub-challenge of the Third Emotion Recognition in the Wild Challenge [62] and includes 958, 436 and 372 training, validation and testing samples of the seven basic expressions. As the labels of the testing set are not publicly available, the validation set was used in this paper for testing. The images were recorded under uncontrolled conditions with different lighting, head poses, profiles, resolutions and face colors. Five landmark points were located with [50,66] for face alignment. Example images are shown in Figure 6. For the following experiments, the person-independent strategy with a ten-fold setting is employed for testing and comparison. More precisely, the considered database is divided into ten groups with approximately equal numbers of person IDs. While nine of them are used for training, the remaining group is used for testing. The process is repeated ten times, and the average accuracy is recorded as the final result.
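The person-independent protocol can be sketched as follows: subjects, not images, are partitioned into ten groups, so no person appears on both the training and testing sides of a fold (function and variable names are illustrative):

```python
import numpy as np

# Sketch of the person-independent ten-fold protocol: partition the unique
# subject IDs into ten groups and use each group's images as one test fold.
def person_independent_folds(person_ids, n_folds=10, seed=0):
    rng = np.random.default_rng(seed)
    subjects = rng.permutation(np.unique(person_ids))
    for group in np.array_split(subjects, n_folds):
        test = np.isin(person_ids, group)
        yield np.where(~test)[0], np.where(test)[0]

ids = np.repeat(np.arange(20), 3)              # 20 toy subjects, 3 images each
for train_idx, test_idx in person_independent_folds(ids):
    # no subject may appear in both the training and testing sets
    assert set(ids[train_idx]).isdisjoint(ids[test_idx])
```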

Number of Candidate Expressions Suggested by the First Stage Classifier
Taking the Jaffe database as an example, some expressions in the dataset are quite difficult to discriminate, even for the human eye. For example, the angry and sad expressions shown in Figure 4a,d,f,i are very similar. It is therefore more plausible to develop a hierarchical system, which discriminates the easy categories at the first stage and then differentiates the difficult categories at the second stage.
To decide the number of candidate expressions proposed by the first-stage classifier, Figure 7 shows the variation of the accuracy with the value of k when the top-k strategy is adopted for expression recognition. A classification is said to be correct if one of the top-k labels returned by the system matches the true label of the sample. The accuracy generally increases with the rank k. While an accuracy of 91.5% was achieved for k = 2, the accuracy reached 96% for k = 3. To reach a trade-off between accuracy and efficiency, we set k = 3 for the first-stage classification, i.e., the top three expression labels are assigned to the testing sample at the first stage. Based on the three candidates, the final label is given by a different model trained with finer features at the second stage.
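The top-k criterion used here can be sketched as:

```python
import numpy as np

# Sketch of top-k accuracy: a prediction counts as correct if the true label
# is among the k classes with the highest classifier scores.
def topk_accuracy(scores, labels, k=3):
    topk = np.argsort(scores, axis=1)[:, ::-1][:, :k]
    return np.mean([labels[i] in topk[i] for i in range(len(labels))])

scores = np.array([[0.1, 0.5, 0.2, 0.2],
                   [0.6, 0.1, 0.2, 0.1]])
labels = np.array([2, 3])
assert topk_accuracy(scores, labels, k=3) == 1.0
assert topk_accuracy(scores, labels, k=1) == 0.0
```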

Recognition Performance Analysis
To evaluate the effects of the different models, i.e., AU weighting, patch weight optimization and active AU detection, we tested the performance of the recognition system with/without these models on the Jaffe, CK+ and SFEW2 databases. For traditional one-stage recognition, the features extracted from each patch were concatenated (see Equation (4)) and input to SVM for classification. The feature was further optimized using the proposed AU weighting, patch weight optimization and active AU detection. The recognition performance of the system with the different models is tabulated in Table 2. One can observe from the table that the proposed models significantly boost the performance. For example, when all three models were used, the recognition performance increased from 82.63% to 89.67%, from 89.06% to 94.09% and from 42.2% to 46.1% for the Jaffe, CK+ and SFEW2 datasets, respectively.

Figure 8 shows the top three most representative AUs of each expression. One can observe from the figure that the most representative parts of the surprise expression (g) are the brows, eyes and mouth, while the most representative regions of the laugh expression (e) are the brow, mouth and nasolabial parts. To analyze the performance of the patch weight optimization, Figure 9 depicts the optimized weight vectors of five expression triplets of the Jaffe database. It can be seen from the figure that the weights of the patches of each AU are further optimized. With the proposed weight optimization, the discrimination ability of the weighted patches for expressions with small variation is increased, and performance improvements on the Jaffe and CK+ databases are observed in the fifth column of Table 2. Due to the limited number of training samples, the weight optimization is not always beneficial to the recognition rate. Figure 10 demonstrates the variation of the objective function values in Equation (12) and the testing accuracy of an example expression triplet (angry, fear and sad) w.r.t. the number of iterations on the Jaffe database, when active AU detection was not applied. It can be seen that the recognition rate does not always increase as the objective function value decreases, due to the difference between the testing and training samples. Thus, active AUs should be detected to represent the specific features of each testing expression sample. To study the effect of the active AU detection on recognition, Figure 11 presents the top two active AUs of six example testing expressions, where Figure 11c,i show that the brow and eye parts are more active than the other parts for the expression sample presented in Figure 4e. When these active parts are used for the feature encoding, the expression samples are correctly recognized. To analyze the overall algorithm performance, the confusion matrices of the final recognition results on the Jaffe and CK+ databases are presented in Tables 3 and 4, respectively. Both tables show that the angry, fear and sad expressions are relatively difficult to recognize correctly. The difficulty is verified by the expressions presented in Figure 4, where faces present similar features, not only in appearance, but also in face part deformation. Table 4 suggests that the sad expression is mostly misclassified as the neutral expression (error rate 14.28%).

Feature Optimization Comparison
This section compares the performance of the proposed weight optimization algorithm with other related algorithms, such as AdaBoost [19,20,26,35], linear discriminant analysis (LDA) [30,43], the chi-square statistic (CSS) [48], multi-task salient patch selection (MTSPS) [45] and a uniform weights (UWs) setting. For the AdaBoost feature selection [35], the strong classifier for the final recognition is linearly composed of a number of patch-based weak classifiers. The expression recognition in [48] employs only the chi-square statistic for weight assignment. In the feature selection of [43], the patch saliency score is related to the classification accuracy on the training expression samples, where PCA and LDA are employed to reduce the feature dimension. The salient feature selection in [45] trains a set of active common and specific expression patches. The same triplet-mode strategy and GSF feature are employed for a fair comparison. The recognition rates obtained by these algorithms on the Jaffe and CK+ databases are presented in Table 5.

Table 5 shows that the recognition rates of AdaBoost and LDA are lower than that of UWs, while CSS achieves slightly better performance than UWs on the Jaffe database. In these models, the specificity of each expression and the causal relation information among the AUs are not sufficiently exploited. To reduce the effect of personal ID information, the salient feature selection in [45] integrates the common and specific expression features, and higher recognition rates are achieved.
Different from the other feature selection algorithms, the AU-based feature optimization in the proposed algorithm weighs the AUs and the corresponding patches with the conditional transition probability matrix. The discrimination information contained in both large-scale AUs and small-scale patches is considered. Moreover, the active AUs of each testing expression sample are also detected for the feature encoding. The best recognition rates in Table 5 justify the advantages of the proposed feature optimization.

Comparison with the State-Of-The-Art
In this section, a comparison of the overall recognition rates with a number of state-of-the-art algorithms is conducted. To make the comparison fair, the competing algorithms were all tuned for their best performance. The comparison results on the Jaffe, CK+ and SFEW2 databases are presented in Tables 6 to 8, respectively, where the algorithm description, the category, the number of subjects, the testing protocol and the final recognition rates are considered. For example, Table 6 lists the multiscale feature-based CNN [11] (deep learning-based, 10 subjects, 10-fold, 88.6%) and the deep belief network [41] (deep learning-based, 10 subjects, 10-fold, 91.8%). For the Jaffe database, our proposed algorithm achieves a competitive recognition rate among all of the algorithms in Table 6. The algorithm [41] using the deep belief network yields the highest recognition rate of 91.8%; however, its feature selection and classifier training are time consuming, requiring several days for each database. Rather than using a well-designed deep feature representation, the proposed algorithm achieves the same best accuracy of 89.67% as the radial feature-based algorithm [22] among the traditional algorithms. For the CK+ database, the proposed algorithm achieves the highest accuracy of 94.09%. As we focus on seven-class expression recognition, works developed for six expressions, like [21,37,41,43-45,49], are not included in this comparison.
The features and classifier adopted in the proposed algorithm differ significantly from those of convolutional neural network (CNN)-based algorithms. In the following, the SFEW2 database, collected in real-life conditions, is used to compare the overall performance of the CNN-based and the proposed algorithms. As SFEW2 was used in the Emotion Recognition in the Wild Challenge for performance evaluation, we directly take the accuracies of the participants for comparison. All of the top three participants adopted CNNs, and their results are listed in Table 8, together with that of our approach.
While our approach achieves the top performance on the CK+ database, the CNN-based methods perform much better on the real-life dataset, i.e., SFEW2. As the CNN-based algorithms employ randomly-cropped face regions for dataset augmentation, they are less sensitive to face misalignment than the traditional algorithms. However, when a large training dataset is not available and the images are mostly frontal faces, e.g., Jaffe and CK+, the traditional approaches can perform better than the CNN-based ones. Furthermore, the network architecture and parameters of a CNN need to be finely tuned, which is much more time consuming than for the traditional algorithms.

Cross-Database Performance Study
To study the generalization ability of the proposed model, cross-database experiments are conducted, and the corresponding recognition rates are presented in Table 9. In this setting, one database serves as the training set while the other is used as the testing set. Table 9 shows that the radial feature encoding [22] with the probability projection achieves the highest accuracy when Jaffe and CK+ are used for testing and training, respectively. The proposed algorithm achieves a competitive recognition rate of 46.01%, which is better than the 32.86% achieved by [22] when its probability projection is replaced with the Borda count strategy. When Jaffe is used for training and CK+ for testing, the proposed algorithm also achieves a competitive accuracy.
To further study the generalization ability of the proposed model, the CK+ and Jaffe databases are used together as the training set, while one of the other three databases is chosen for testing. The accuracies are presented in the last three columns of Table 9, which show that the proposed algorithm achieves a much better recognition rate than the algorithm of [22] on the TFEID database and a competitive recognition rate on the YALE database.
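The cross-database protocol itself is simple to express: train on one database (or a union of databases) and score on another. A minimal harness is sketched below, with hypothetical `fit`/`predict` callables standing in for the actual model.

```python
def cross_database_accuracy(train_db, test_db, fit, predict):
    """Train on one database, evaluate on another.

    train_db, test_db: (features, labels) pairs for the two databases.
    fit(features, labels) returns a trained model; predict(model, x)
    returns a label. Both are hypothetical interfaces for illustration.
    """
    X_tr, y_tr = train_db
    X_te, y_te = test_db
    model = fit(X_tr, y_tr)
    correct = sum(predict(model, x) == y for x, y in zip(X_te, y_te))
    return correct / len(y_te)
```

Because the two databases differ in subjects, lighting and capture conditions, this accuracy is typically much lower than the within-database 10-fold figure, which is the gap the generalization study measures.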

Discussion and Conclusions
In this work, a two-stage expression recognition model based on triplet-wise feature optimization is proposed; the novelty of this work is concentrated in three aspects. First, overall facial expression recognition is transformed into the triplet-wise mode to sufficiently exploit the specificity of each expression. Second, AU weighting and patch weight optimization are proposed for each expression triplet. Lastly, the online detection of active AUs is proposed for each testing expression sample to reduce the influence of the non-active features in recognition. Experimental results and a comparison with the related state-of-the-art algorithms verify the effectiveness and competitiveness of the proposed algorithm.
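The triplet-wise mode can be illustrated with a simple voting scheme: every 3-class subset of the expression set casts one vote, and the label with the most votes wins. The classifier interface below is a hypothetical stand-in for the paper's triplet-wise models.

```python
from itertools import combinations
from collections import Counter

def triplet_vote(sample, classes, triplet_classifiers):
    """Aggregate triplet-wise decisions into a multi-class label.

    triplet_classifiers: dict mapping each 3-class tuple (as produced by
    combinations(classes, 3)) to a function that returns one of the three
    labels for the sample (hypothetical interface).
    """
    votes = Counter()
    for triplet in combinations(classes, 3):
        votes[triplet_classifiers[triplet](sample)] += 1
    return votes.most_common(1)[0][0]
```

For the seven basic expressions this yields C(7,3) = 35 triplet classifiers, each of which can be optimized with its own AU and patch weights before the votes are aggregated.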
Although competitive results are obtained with the proposed model, there is still room for further improvement. First, feature optimization with more than two stages can be explored to improve performance. Second, more efficient features should be devised and integrated into the feature optimization model. Third, the cross-database recognition rates are still not high enough for real applications, which will be addressed in our future work. Lastly, the ideas of AU weighting, feature sparseness optimization and active AU detection can be combined with CNN-based algorithms to improve the feature encoding based on face frontalization [71].