Facial Expression Recognition Based on Auxiliary Models

In recent years, with the development of artificial intelligence and human–computer interaction, increasing attention has been paid to the recognition and analysis of facial expressions. Despite great success, many problems remain unsolved, because facial expressions are subtle and complex. Hence, facial expression recognition is still a challenging problem. In most papers, the entire face image is chosen as the input. In daily life, people can perceive others' current emotions from only a few facial components (such as the eyes, mouth and nose), while other areas of the face (such as hair, skin tone, and ears) play a smaller role in determining one's emotion. If the entire face image is used as the only input, the system will produce some unnecessary information and miss some important information during feature extraction. To solve this problem, this paper proposes a method that combines multiple sub-regions and the entire face image by weighting, which can capture more of the important feature information and is thus conducive to improving recognition accuracy. Our proposed method was evaluated on four well-known publicly available facial expression databases: JAFFE, CK+, FER2013 and SFEW. The new method showed better performance than most state-of-the-art methods.


Introduction
Facial expression recognition plays an important role in human-computer interaction. In the course of human communication, 55% of the information is conveyed by facial expressions, voice constitutes 38% of a communicated message, and language only constitutes 7% [1]. Facial expression recognition has therefore attracted much attention in recent years [2,3], and has many important applications in, e.g., remote education, safety, medicine, psychology and human-robot interaction systems. Although great progress has been made [4], it is difficult to build a facial expression recognition system with a satisfactory accuracy rate due to a variety of complex external conditions such as head pose, image resolution, deformations, and illumination variations. Hence, facial expression analysis is still a challenging task.
Generally, facial expression recognition is composed of three steps: preprocessing, feature extraction and classification [5]. Image preprocessing plays two roles. Firstly, the original image acquired by the system is generally imperfect in practical applications, suffering from noise, illumination and contrast effects, so image enhancement is necessary to meet quality requirements. Secondly, the acquired image may not meet the specific requirements of subsequent operations, such as size and angle, so geometric normalization is also necessary. Image preprocessing thus serves as a transition step that needs to be considered comprehensively. Feature extraction is the key step of the whole recognition pipeline [6]: the desired features should minimize within-class variations of expression while maximizing between-class variations. If the features are inadequate, even the best classifier will fail to achieve good performance. In traditional machine learning approaches, features are extracted by hand, such as local binary patterns (LBP) [7], Gabor filters [8], local Gabor binary patterns (LGBP) [9], scale-invariant feature transform (SIFT) [10], and histograms of oriented gradients (HOG) [11]. Handcrafted features such as LBP, HOG, and SIFT have been widely used owing to their proven performance under specific circumstances and their low computational cost in the feature extraction process. After feature extraction, a classification method such as SVM [12], random forest [13], sparse coding [14], or a neural network [15] is applied to perform facial expression recognition. Although these methods have achieved great success in specific fields, handcrafted features [16] have inherent drawbacks.
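As an illustration of one of the handcrafted descriptors above, the following is a minimal pure-Python sketch of the basic 8-neighbour LBP operator and its histogram feature (function names are illustrative; practical systems use optimized library implementations and compute the histogram per block):

```python
def lbp_code(img, y, x):
    """Basic 8-neighbour LBP: compare each neighbour of pixel (y, x)
    with the centre pixel and pack the results into an 8-bit code."""
    center = img[y][x]
    # Neighbours visited clockwise starting from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels,
    usable as a simple texture feature vector."""
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1
    return hist
```

In block-based variants such as Boosted-LBP, one such histogram is computed per image block and the block histograms are concatenated into the final feature vector.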
When handcrafted features are used, unintended features that have no effect on classification may be included, or important features that have a great influence on classification may be omitted. This is because the features are "crafted" by human experts, and the experts cannot consider all possible cases. Meanwhile, it is difficult to achieve good recognition results on large datasets with large inter-personal differences in facial expression appearance.
To cope with the above disadvantages, deep learning methods [17,18] have been considered, especially convolutional neural networks. The convolutional neural network (CNN) [19] is a very effective method for recognizing facial emotions. CNNs perform feature extraction and classification simultaneously, and can automatically discover multiple levels of representation in data. This is why they have broken most records in recognition tasks.
The structure of early convolutional neural networks was relatively simple. With subsequent research, the structure of the convolutional neural network has been continuously optimized and its application fields have been extended. In recent years, research on network architectures has remained very active, and several structures with excellent performance have been proposed. These results across various fields have made the convolutional neural network one of the most prominent research topics.
In the 1980s and 1990s, researchers published early work on CNNs and achieved good recognition results in several pattern recognition fields. LeCun et al. proposed a CNN model called LeNet-5, whose success in handwritten character recognition drew the attention of academia to convolutional neural networks. However, at that time CNNs were only suitable for recognizing small images; for large-scale data, the recognition performance was poor. Meanwhile, convolutional neural networks gradually developed in many fields such as speech recognition, object detection, and face recognition. In 2012, Krizhevsky et al. used an extended CNN model (AlexNet) to win the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), beating the second-place entry by roughly 11% in accuracy, which made the convolutional neural network the focus of academia. Since AlexNet, many new convolutional models have been proposed, such as the VGG (Visual Geometry Group) network from Oxford University, Google's GoogLeNet, and Microsoft's ResNet, and these models have repeatedly broken AlexNet's ImageNet record.
Since the convolutional neural network (CNN) has already proved its excellence in many image recognition tasks, we expect that it can outperform existing methods on facial expression prediction problems. Most CNN-based facial recognition tasks use the entire face image as the input, but our observation is that the judgment of a facial expression is usually based on the information of several sensitive components in certain areas of the face, such as the eyes, nose, and mouth. Other areas of the face contribute very little to the main features of an expression. If we use the entire face image to extract features, the extracted feature vectors may lose some important information because they fail to focus on the salient facial regions. If such feature information is used at test time, the results become unreliable, because the extracted features differ greatly from the true discriminative features. This is often caused by two factors: (1) There are too few data. Unlike large-scale visual object recognition databases such as ImageNet [17], most existing facial expression recognition databases do not have sufficient training data, which leads to overfitting. (2) The learning mechanism of a single-task CNN has inherent limitations, and the problems cannot be solved by one CNN model alone.
Aiming at the above problems, some improvements are proposed in this paper. To address the scarcity of data, we crop from the raw images the regions of the organs that contribute most to facial expression recognition, which not only increases the quantity of the datasets but also improves the quality of the extracted information. Meanwhile, since it is difficult for a single-task CNN model to improve the overall accuracy of the recognition task, this paper proposes a multi-task learning-based recognition model, which corrects the expression features extracted from the raw images with the help of auxiliary models, so that the final extracted features are closer to the ideal expression features.
The paper is arranged as follows. After this introduction, Related Work (Section 2) reviews recent approaches with strong performance. Section 3 focuses on the main components of the proposed architecture. Section 4 presents the experiments and their results. Finally, Section 5 summarizes and concludes this paper.

Related Work
A detailed overview of expression recognition was given by Shan [20] and Cohen [21]. This section discusses some recent methods that achieve high accuracy in facial expression recognition using a comparable experimental methodology.
Shan et al. [20] proposed an approach called Boosted-LBP to extract the most discriminative LBP features; the best recognition performance was obtained by using Support Vector Machine classifiers with Boosted-LBP features. They conducted experiments on the Cohn-Kanade, MMI and JAFFE databases, and showed that LBP-based SVMs perform slightly better than Gabor-wavelet-based SVMs under 10-fold cross-validation on each dataset.
S L et al. [22] proposed a method which uses Haar classifier for face detection purpose and Local Binary Pattern (LBP) histograms of different block sizes of a face image as feature vectors, and classifies various facial expressions using Principal Component Analysis (PCA). They used grayscale frontal face images of a person to classify six basic emotions, namely happiness, sadness, disgust, fear, surprise and anger.
Zhang et al. [23] proposed a novel facial expression recognition method using the local binary pattern (LBP) and local phase quantization (LPQ) based on Gabor face images. Firstly, Gabor wavelets capture prominent visual attributes by extracting multi-scale and multi-direction spatial frequency features from the face images, which are separable and robust to illumination changes. Then, the LBP and LPQ features based on the Gabor wavelet transform are fused for face representation. Because the dimension of the fused feature is too large, the PCA-LDA algorithm is used to reduce it. The method is finally tested and verified with multi-class SVM classifiers on the JAFFE database. Two validation methods were used. The first was "leave one out": all expression images of one subject were selected as testing samples and the remaining images as training samples, which achieved a recognition accuracy of 81.82%. In the second, two samples of each facial expression for each person formed the training set, and the remaining samples were used for testing; the proposed method achieved a recognition rate of 98.57%.
Lisai et al. [24] proposed a novel algorithm for facial expression recognition (FER) based on the fusion of Gabor texture features and local phase quantization (LPQ). Firstly, the LPQ feature and the Gabor texture feature are extracted from every expression image. The image is transformed by LPQ and then divided into 3 × 5 blocks; LPQ histograms are calculated from each block, and the histograms of the 15 blocks are concatenated into a single long vector. Then, Gabor wavelet filters with five scales and eight orientations (40 filters in total) are used to extract Gabor texture features, and the AdaBoost algorithm is used to select the 100 most effective features from each Gabor feature image. Finally, the 4000 selected features from the 40 Gabor feature images are concatenated and used as facial expression features. Two expression recognition results are obtained on the two feature sets with the Sparse Representation based Classification (SRC) method, and the final expression recognition is performed by fusing the residuals of the two SRC classifiers. Experiments on the Japanese Female Facial Expression (JAFFE) database demonstrated that the new algorithm was better than the original two algorithms, with a recognition rate of 73.33%.
Minchul et al. [25] used a convolutional neural network model for facial expression recognition. They cropped faces from each dataset and aligned them with respect to the eye landmark positions, and the original 48 × 48 facial images were cropped to a size of 42 × 42. The training data were augmented tenfold by flipping. Five types of data input (raw, histogram equalization, isotropic smoothing, diffusion-based normalization, and difference of Gaussians) were tested on four different network structures, and the combination with the highest accuracy was selected as the target structure for fine parameter tuning. For the performance evaluation, five datasets were chosen: FER-2013, SFEW 2.0, CK+ (extended Cohn-Kanade), KDEF (Karolinska Directed Emotional Faces), and JAFFE. Finally, Tang's simple network with histogram-equalized images was chosen as the baseline CNN model for further research.
Yu et al. [26] proposed a method that contains a face detection module based on the ensemble of three state-of-the-art face detectors, followed by a classification module with the ensemble of multiple deep convolutional neural networks (CNN). Each CNN model is initialized randomly and pre-trained on a larger dataset provided by the Facial Expression Recognition (FER) Challenge 2013. The pre-trained models are then fine-tuned on the training set of SFEW 2.0. To combine multiple CNN models, they presented two schemes for learning the ensemble weights of the network responses: by minimizing the log-likelihood loss, and by minimizing the hinge loss. Their proposed method achieved 55.96% and 61.29%, respectively, on the validation and test set of SFEW 2.0.
Heechul Jung et al. [27] proposed a new CNN method based on two different models. The first deep network extracts temporal appearance features from image sequences, while the other extracts temporal geometry features from temporal facial landmark points. The faces in the input image sequences are detected, cropped, and rescaled to 64 × 64. The IntraFace algorithm is used to extract 49 accurate facial landmark points covering the two eyes, nose, mouth, and two eyebrows. Finally, the two models are combined using a new integration method. Through several experiments on the CK+ and Oulu-CASIA databases, with additional data generated by various augmentation techniques, it was shown that the two models complement each other.
Most of the previous methods process the entire facial region as the input and pay little attention to the sub-regions of the face, which leads to a large difference between the extracted features and the expected features. If the information extracted from the entire face image is not ideal, the final recognition result will be affected. Because the judgment of a facial expression is usually based on several sensitive components in certain areas of the face, such as the eyes, nose, and mouth, this paper proposes a new method that combines several important sub-regions (i.e., eyes, nose, and mouth) with the entire image, which not only corrects the feature information extracted from the entire image, but also further improves the overall recognition rate of the system.

Data Pre-Processing
Images in most face databases include not only faces but also a large amount of background; therefore, removing the background is an important preprocessing step in facial expression recognition. Although many face databases are available online, the face regions in most of them are not cropped and cannot be used directly in facial expression recognition experiments. If an uncropped image is used directly as input, it will not only incur a huge amount of computation, but also affect the final recognition results. Therefore, whether the face can be detected correctly has a great impact on recognition accuracy. To improve the accuracy of face detection, this paper introduces eye positioning to refine the face detection result. The eyes are the most prominent facial feature, and their position within the face is relatively fixed, so once they are located, the useful face region can be obtained easily. Because the distance between the eyes bears a fixed relation to the size of the face, many face detection algorithms depend strongly on eye localization and treat it as an important step in the recognition process. Other prominent facial features, such as the mouth, nose and eyebrows, can be obtained easily from fixed geometric relations after the eyes are positioned, so accurate eye localization is helpful for face localization. The algorithm for obtaining the cropped face region image is shown in Figure 1, and experimental results are shown in Figure 2.
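As an illustrative sketch of such eye-based cropping, the function below derives a face bounding box from the two eye centres using fixed geometric relations. The paper does not specify its exact proportions, so the ratios used here (`k_side`, `k_top`, `k_bottom`) are hypothetical:

```python
def face_box_from_eyes(left_eye, right_eye, k_side=0.6, k_top=0.7, k_bottom=1.9):
    """Estimate a face bounding box from the two eye centres using fixed
    geometric relations. The proportions k_* are illustrative, not the
    paper's values. Eyes are given as (x, y) with an upright face assumed;
    returns (x0, y0, x1, y1)."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    d = rx - lx                       # inter-ocular distance
    cx, cy = (lx + rx) / 2, (ly + ry) / 2
    x0 = cx - d * (0.5 + k_side)      # extend sideways beyond the eyes
    x1 = cx + d * (0.5 + k_side)
    y0 = cy - d * k_top               # forehead above the eye line
    y1 = cy + d * k_bottom            # chin below the eye line
    return x0, y0, x1, y1
```

For example, eyes at (40, 50) and (80, 50) give a box roughly spanning (16, 22) to (104, 126); the crop would then be resized to the network input size.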

Convolutional Neural Networks
The convolutional neural network is a non-fully connected multi-layer neural network, generally composed of convolutional layers (Conv), down-sampling (or pooling) layers, and fully connected layers (FC). Firstly, the raw image is convolved with several filters in the convolutional layer, which yields several feature maps. Then, the features are blurred by the down-sampling layer. Finally, a set of feature vectors is obtained through a fully connected layer. The architecture of a convolutional neural network is represented in Figure 3.
Convolutional Layer: In the convolutional layer, multiple convolution kernels f_k with kernel size n × m are applied to the input x to compute a richer and more diverse representation of the input. A single convolution kernel is not sufficient for feature extraction, so multiple kernels are used; if there are 50 convolution kernels, 50 feature maps are learned correspondingly. No matter how many channels the input image has, the number of channels of the output equals the number of convolution kernels. Figure 4 shows the process of computing the convolution region.
Pooling Layer: The main function of the pooling layer is down-sampling, which further reduces the number of parameters by removing unimportant samples from the feature map. A large image can be shrunk by the pooling layer while retaining the most important information. There are several pooling operations, such as max pooling and mean pooling; max pooling is the most commonly used, and it takes the maximum value of each n × n window as the sample value. Figure 5 shows the computation process of 2 × 2 max pooling.
Activation function: In facial expression classification based on convolutional neural networks, the choice of activation function plays a great role in the whole system; it is mainly used to introduce nonlinearity. The sigmoid, tanh and ReLU functions are commonly used. The ReLU function is more efficient than most other activation functions: it is relatively cheap to compute, because no exponential has to be evaluated, and it also mitigates the vanishing-gradient problem, since its gradient is either one or zero. Figure 6 shows the curve of this activation function.
Fully Connected Layer: The fully connected layer connects all neurons of the prior layer to every neuron of its own layer.
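A minimal pure-Python sketch of the 2 × 2 max-pooling and ReLU operations described above (deep learning frameworks provide optimized versions of both):

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return x if x > 0 else 0

def max_pool_2x2(fmap):
    """2 x 2 max pooling with stride 2 on a 2-D feature map,
    given as a list of lists with even dimensions."""
    out = []
    for i in range(0, len(fmap), 2):
        row = []
        for j in range(0, len(fmap[0]), 2):
            row.append(max(fmap[i][j], fmap[i][j + 1],
                           fmap[i + 1][j], fmap[i + 1][j + 1]))
        out.append(row)
    return out
```

For example, pooling the 4 × 4 map [[1, 3, 2, 1], [4, 6, 5, 0], [7, 2, 9, 8], [3, 1, 4, 2]] yields the 2 × 2 map [[6, 5], [7, 9]], halving each spatial dimension while keeping the strongest responses.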

The Acquisition of Some Important Components of the Face
When all datasets are ready, we align and crop the regions of the two eyes, nose, mouth, and whole face. The four images are then all resized to 96 × 96 pixels. Figure 7 shows some sample images of the sub-regions.

New Structure
In daily life, the judgment of a facial expression is mostly based on several sensitive organs in certain areas of the face, such as the eyes, nose and mouth. Considering the advantages of ensemble learning and the importance of the sensitive components' features in facial expression classification, this paper designs a new recognition system based on an auxiliary model. The structure of the model is shown in Figure 8.
There is no specific formula for building a convolutional neural network that is guaranteed to work for all scenarios. Different problems require different architectures to reach the desired validation accuracy. Therefore, this paper designs a CNN structure for facial expression recognition according to the requirements of the research task. Figure 9 shows the four different architectures of the designed CNNs used in this task. The detailed algorithm of the fuser in Figure 8 is as follows. Let CNN_i (i = 1, 2, 3, 4) denote the four CNN models used in this work, and let p_i be a vector of seven rows and one column holding the probabilities that the CNN_i classifier assigns to the seven classes; for example, p_2^1 stands for the probability that a test sample belongs to the first emotion class under the CNN_2 model. Each p_i is normalized as p_i = p_i / max(p_i). The final recognition result is then determined by the weighted combination s = m_1 · p_1 + m_2 · (p_2 + p_3 + p_4), where m_1 = 1 and the initial value of m_2 is 0.01. Seven values are obtained from this equation, and the final recognition label is given by the position of the largest of the seven values.
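A sketch of this fusion rule in Python. The weighted-sum form of the combination is reconstructed from the surrounding description (main model plus auxiliary models weighted by m1 and m2), so treat it as an assumption rather than the authors' exact code:

```python
def fuse(probs, m1=1.0, m2=0.01):
    """Fuse the 7-class probability vectors of the four CNNs.
    probs[0] comes from the whole-face model; probs[1:] come from
    the eye, nose and mouth auxiliary models.  Each vector is first
    normalized by its maximum, then the main model is combined with
    the auxiliaries using the weights m1 and m2.
    Returns the index of the predicted emotion class."""
    norm = [[v / max(p) for v in p] for p in probs]
    scores = [m1 * norm[0][j] + m2 * sum(norm[i][j] for i in range(1, len(norm)))
              for j in range(7)]
    return scores.index(max(scores))
```

With the small initial m2 = 0.01, the whole-face model dominates and the sub-region models only nudge the decision; the conclusion reports larger tuned values of m2 per dataset.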
As shown in Figure 8, this new structure exploits both the strong abstract feature extraction ability of CNNs for face images and the strong expressive power of the important components for facial expression. Meanwhile, the probability-based ensemble learning further improves the performance of the system.

Database
The JAFFE database was published in 1998 [28] and is relatively small. It includes 213 images produced by 10 Japanese women, and each person has images for seven emotions: disgust, anger, fear, happiness, sadness, surprise and neutral. Figure 10 shows some samples from JAFFE. The CK+ database, introduced by Lucey et al. [29] in 2010, extends the Cohn-Kanade database. It contains 593 image sequences from 123 subjects, of which 327 sequences have facial expression labels. This database is one of the most widely used in the field of facial expression recognition. The 123 subjects are university students ranging from 18 to 30 years old; 65% are female, 15% are African-American and 3% are Asian or South American. The emotions consist of anger, disgust, fear, happiness, sadness, surprise, and contempt. In our experiments, we used the first frame of each sequence as the neutral category and the last four frames as one of the seven emotional categories for training the network as a frame-based classifier. Some examples of the CK+ database images are shown in Figure 11. The SFEW database [31] is a static subset of the temporal Acted Facial Expressions in the Wild database, with frames extracted from movies, and its illumination conditions are close to the real world. There are 958 images in the training set and 436 images in the validation set. Some examples of the SFEW database images are shown in Figure 13.

Data Augmentation
In deep learning, data augmentation is generally applied to the database in order to enrich the training set and extract facial expression features better. The more training data there are, the higher the accuracy and generalization ability of the trained model will be; therefore, data augmentation is very important, especially for datasets with an uneven distribution. A good training dataset is a prerequisite for training an advanced model, and effort spent preparing the data pays off during model training. However, data annotation is time-consuming, and it is hard to collect enough data. Common augmentation methods include rotating the image, cropping the image, changing the color balance of the image, distorting image features, resizing the image and adding image noise. In this study, these methods were used to augment the original datasets. Finally, 33,885 images were produced for JAFFE, with about 4840 sample images in each of the seven expression folders, and CK+ eventually yielded 53,506 images. The sizes of the experimental databases are shown in Table 1. Figure 14 shows the confusion matrices of the new model. Furthermore, we compared the results for five different inputs (i.e., the whole-face input, the combination of face and eyes, the combination of face and nose, the combination of face and mouth, and the whole face combined with all sub-regions), as shown in Figure 15. Figure 15 shows that the combined input is superior to the same model of Figure 8 with only the entire face region as input. Meanwhile, the new method proposed in this paper is still better than the current state of the art in emotion recognition on the JAFFE, CK+, FER2013 and SFEW datasets, as can be seen in Table 2.
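Two of the geometric augmentation operations above (horizontal flipping and rotation) can be sketched in pure Python on an image stored as a nested list of pixel values; the helper names are illustrative, and real pipelines would use an image library:

```python
def hflip(img):
    """Horizontal flip: mirror every row of the image."""
    return [row[::-1] for row in img]

def rotate90(img):
    """Rotate the image 90 degrees clockwise: reverse the row order,
    then transpose rows into columns."""
    return [list(col) for col in zip(*img[::-1])]
```

For example, hflip([[1, 2], [3, 4]]) gives [[2, 1], [4, 3]] and rotate90([[1, 2], [3, 4]]) gives [[3, 1], [4, 2]]; applying several such transforms to every training image multiplies the effective dataset size.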
Table 2 also shows that the nose makes a small contribution to the final accuracy, while the mouth makes the biggest contribution.

Conclusions
In many cases, human beings communicate their emotions and intentions through their facial expressions, which are one of the most powerful, natural and immediate means. Facial expression analysis is an interesting and challenging task, and it has been applied in many fields such as human-computer interaction and remote education. Although much progress has been made in the expression recognition field, the task is not yet easily performed by computers or intelligent robots. In most research, the whole face is used as the input. In daily life, when one person judges the expression of another, they usually capture the characteristics of several key parts of the face. The eyes, nose and mouth are sensitive parts that play a decisive role in determining one's expression, while the other parts play a small role in the final result. To address this, we propose a novel CNN framework based on a sub-region auxiliary model, which takes full advantage of three important regions and modifies the learning results of the main task by setting different weights to improve the final accuracy. In the experimental verification, the weights were m_1 = 1 and m_2 = 0.68 on JAFFE, m_1 = 1 and m_2 = 0.59 on CK+, m_1 = 1 and m_2 = 0.72 on FER2013, and m_1 = 1 and m_2 = 0.63 on SFEW.
Future Work: The recognition accuracy of the system has been improved through the auxiliary role of the sub-region model. In fact, many factors affect expression; in addition to the few special regions of the facial image studied here, many other key factors remain to be studied.