Facial Expression Recognition Based on Multi-Features Cooperative Deep Convolutional Network

Abstract: This paper addresses the problem of Facial Expression Recognition (FER), focusing on unobvious facial movements. Traditional methods often suffer from overfitting or incomplete information due to insufficient data and manual selection of features. Instead, our proposed network, called the Multi-features Cooperative Deep Convolutional Network (MC-DCN), maintains focus on both the overall features of the face and the trend of key parts. The processing of video data is the first stage. The ensemble of regression trees (ERT) method is used to obtain the overall contour of the face. Then, an attention model is used to pick out the parts of the face that are more susceptible to expressions. The combined effect of these two methods yields an image that can be called a local feature map. After that, the video data are sent to MC-DCN, which contains parallel sub-networks. While the overall spatiotemporal characteristics of facial expressions are obtained from the sequence of images, the selection of key parts helps the network learn the changes in facial expressions brought about by subtle facial movements. By combining local features and global features, the proposed method can acquire more information, leading to better performance. The experimental results show that MC-DCN can achieve recognition rates of 95%, 78.6% and 78.3% on the three datasets SAVEE, MMI, and edited GEMEP, respectively.


Introduction
The face contains a large amount of information such as identity, age, expression and ethnicity. The amount of information conveyed by facial expressions in the communication process is second only to language [1]. Facial expressions are subtle signals of the communication process. Ekman et al. identified six facial expressions (happiness, sadness, disgust, fear, anger and surprise) as basic facial expressions that are universal among human beings, while other researchers added neutral, which, together with the previous six emotions, constitutes seven basic emotions [2][3][4][5].

Related Work
Part of the research into this problem has focused on recognizing facial expressions in static images [6][7][8]. These methods use image features such as texture analysis. Although these approaches are effective methods in extracting spatial information, they fail to capture morphological and contextual variations in the expression process. Recent methods aim to solve this problem by using massive datasets to obtain more efficient features of FER [9][10][11][12][13][14][15]. Some researchers use multimodal fusion to recognize emotions, such as voices, expressions, and actions [16].
Recently, with the flourishing of neural networks, a few attempts have been made to replace manual feature extraction with deep neural networks [17]. Inferring a dynamic facial appearance from a single 2D photo is arduous and ill-posed, since the expression formation process blends multiple facial features (mouth, eyes) as well as the environment (voice, lighting) into an expression for each moment. To better handle the transformation, one must rely on independent changes in multiple parts, such as a smile causing the corners of the mouth to rise.
As others show [18], FER can be solved using temporal image sequences and utilizing both spatial and temporal variations.

Motivations and Contributions
In this paper, we present a new algorithm that performs facial expression recognition on video and achieves satisfactory accuracy on standard datasets. Our work is inspired by an ensemble of regression trees [19] and 3-Dimensional Convolutional Neural Networks (3DCNN) [20]. Some of the above algorithms have shown promising performance in facial expression recognition. Nevertheless, for videos with tiny movements, as well as databases with few videos or pictures, facial expression recognition is still a challenge. Therefore, exploiting a limited dataset more effectively to improve recognition accuracy is a problem worth exploring.
This paper proposes using an ensemble of regression trees to annotate the corresponding facial features. The network, called the Multi-features Cooperative Deep Convolutional Network (MC-DCN), captures global and local features by using two small 3DCNNs. These parallel networks provide a better balance between global and local features. Batch normalization is placed after each convolutional layer to suit the characteristics of a small-sample dataset.
The algorithm uses a cascaded network to address the FER problem. The main contributions of this work can be summarized as follows:

• Firstly, the ensemble of regression trees is applied to achieve facial location. Furthermore, an alternative to the attention mechanism is added. The influence of different facial organs on expressions was analyzed, and the features of facial organs were selected. This net can extract the contours of the face accurately. The application of facial features allows the network to be fully trained under tiny movements. Meanwhile, the weight analysis of different organs and the entire face can effectively improve the recognition of expressions that are not obvious;
• Secondly, a new network called the Multi-features Cooperative Deep Convolutional Network (MC-DCN) is proposed, which can dynamically obtain expression features from image sequences. The network combines the face part and the local feature map, and can sense the deformation process and trends of important expression features. Meanwhile, a part called the CNN block is used. The CNN block is an improvement based on ResNet, which gives the network as a whole a stronger generalization ability, enhances the compatibility of the algorithm in different scenarios, and improves recognition accuracy.
The rest of this paper is organized as follows: In Section 2, the source of the datasets is given. Section 3 details the entire framework of the algorithm. Qualitative and quantitative experimental results, obtained from three public datasets, are shown in Section 4.

SAVEE Database
The database was captured in the Centre for Vision, Speech and Signal Processing (CVSSP) 3D vision laboratory over several months, at different times of the year. It contains a total of 480 short videos recorded by four actors showing seven different emotions: anger, disgust, fear, happiness, neutral, sadness, and surprise. The length of these videos varies from 3 to 5 s. The classification accuracy achieved by evaluators on visual and audio-visual data for the seven emotion classes over the four actors is given in Table 1. KL, JK, JE and DC in the table are abbreviations of the actors' names.

MMI Database
MMI has more than 2900 videos, of which a total of 236 carry emotion labels. Each video contains a complete change process: it begins with a neutral face, goes through a series of onset, apex, and offset phases, and returns to a neutral face. MMI contains six basic emotions (all except neutral) and many other action descriptors coded with the Facial Action Coding System (FACS). Expert assessment was used to add neutral videos to the database. Some of the videos in MMI were recorded by two cameras at the same time; only one of the two recordings was chosen for each such video.

GEMEP Database
As a database for FERA Challenge, GEneva Multimodal Emotion Portrayals (GEMEP) contains more than 7000 audio-video portrayals recorded by 10 different actors, representing 18 emotions with the help of professional theater directors. GEMEP was restructured to unify standards; details will be introduced below.

Data Augmentation
In order to enable the network to obtain sufficient training and parameter adjustment, the training set is horizontally flipped, rotated by small angles (specifically ±9°, ±6°, ±3°), and processed with other data augmentation methods.
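The flip-and-rotate augmentation can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline: the nearest-neighbour rotation, the zero fill for pixels rotated out of view, the choice to rotate the flipped clip as well, and the names `rotate_nearest` and `augment_clip` are all assumptions made for illustration.

```python
import numpy as np

# Rotation angles in degrees, as listed in the text.
ANGLES_DEG = (-9, -6, -3, 3, 6, 9)

def rotate_nearest(frame, angle_deg):
    """Rotate a 2D frame about its centre with nearest-neighbour sampling.
    Pixels whose source falls outside the frame are set to 0 (assumption)."""
    h, w = frame.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, locate its source pixel.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.rint(src_x).astype(int)
    src_y = np.rint(src_y).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(frame)
    out[valid] = frame[src_y[valid], src_x[valid]]
    return out

def augment_clip(clip):
    """clip has shape (frames, height, width).
    Returns the original clip, its horizontal flip, and rotated copies of both.
    The same transform is applied to every frame of a clip."""
    variants = [clip, clip[:, :, ::-1]]
    augmented = list(variants)
    for angle in ANGLES_DEG:
        for v in variants:
            augmented.append(np.stack([rotate_nearest(f, angle) for f in v]))
    return augmented
```

Under these assumptions each training clip yields 14 clips (2 unrotated plus 6 angles × 2). Applying one transform to a whole clip keeps the motion between frames consistent.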

Methodology
This paper proposes a video-based facial expression recognition approach that employs the 3D convolutional network (C3D) framework and the ensemble of regression trees (ERT) framework. A detailed description of the algorithm is given in this section. Figure 1 is a flowchart of the proposed algorithm. To obtain features from the dynamic video, we extracted the images in the video to generate an image sequence. Features were then extracted from this sequence by two 3DCNNs, each consisting of five convolutional layers with a ReLU activation function. Every convolutional layer was followed by batch normalization. Further, we combined the two networks by an element-wise average of the outputs of the fully connected layers, which was then connected to a final softmax layer for classification. The forecast result is represented by (1), where $p_i$ is an input sample and $y_i$ is the ith type of emotion. Local parts were extracted from the face in each frame, and each of the shallow subnetworks was trained on global-local features. The subnetworks used the CNN block. Finally, all the fully trained subnetworks were integrated for fine-tuning. The network can thus comprehensively learn dynamic changes in time and the global-local features in space.
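The fusion step (element-wise average of the two fully connected outputs, followed by a softmax classifier) can be sketched as follows. This is a hedged illustration only: the output projection `w_out`, `b_out` and the function names are hypothetical, and the text does not state whether a linear layer sits between the average and the softmax.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def fuse_and_classify(fc_global, fc_local, w_out, b_out):
    """fc_global, fc_local: (batch, units) outputs of the two sub-networks.
    w_out, b_out: hypothetical projection onto the emotion classes."""
    fused = 0.5 * (fc_global + fc_local)   # element-wise average
    logits = fused @ w_out + b_out
    probs = softmax(logits)
    return probs, np.argmax(probs, axis=-1)

# Example with random stand-in features: 4096 units, 7 emotion classes.
rng = np.random.default_rng(0)
probs, labels = fuse_and_classify(
    rng.normal(size=(3, 4096)),
    rng.normal(size=(3, 4096)),
    rng.normal(size=(4096, 7)) * 0.01,
    np.zeros(7),
)
```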

Face Alignment with an Ensemble of Regression Trees
Generally, people's complete (or partial) body and surrounding environment were shown in the original video, especially for the database that emphasizes body movements. Therefore, to eliminate body and environment as much as possible, while ensuring the integrity of the face, this paper adopts the ensemble of regression trees to preprocess the database in order to obtain information about critical parts of the face [17]. Each part of the face makes a different contribution to FER, and we hope that the part that contributes more to FER can be used to train the network [26].
Tree-based local binary features (LBF) learn and combine the features of each critical part and use linear regression to detect them. Unlike LBF, an ensemble of regression trees (ERT) stores the shape update directly in the leaf nodes during learning. Starting from the initial position, the final facial key point positions are obtained by adding the updates of all leaf nodes passed through to the mean shape once all the trees have been learned, as shown in (2):
$$S^{(t+1)} = S^{(t)} + r_t\left(I, S^{(t)}\right) \quad (2)$$
where $t$ indexes the cascade layers, and $r_t(\cdot)$ denotes the current regressor. The inputs of the regressor are the image $I$ and the shape updated by the previous regressor. The features used can be grayscale or other [27].
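The cascade in (2) can be sketched as a loop that starts from the mean shape and adds each regressor's update in turn, assuming the additive form used by the cited ERT method [19]. The toy regressors are stand-ins, not trained tree ensembles, and `run_cascade` and `make_toy_regressor` are hypothetical names.

```python
import numpy as np

def run_cascade(image, mean_shape, regressors):
    """Apply the regressors in order; each returns a shape update
    computed from the image and the current shape estimate."""
    shape = mean_shape.copy()
    for r_t in regressors:
        shape = shape + r_t(image, shape)
    return shape

def make_toy_regressor(target, rate):
    """Stand-in for a trained tree ensemble: moves the estimate a fixed
    fraction of the way toward a known target shape (illustration only)."""
    def r_t(image, shape):
        return rate * (target - shape)
    return r_t

# Five hypothetical landmarks with (x, y) coordinates.
mean_shape = np.zeros((5, 2))
target = np.ones((5, 2))
image = np.zeros((32, 32))  # unused by the toy regressors
cascade = [make_toy_regressor(target, 0.5) for _ in range(10)]
final_shape = run_cascade(image, mean_shape, cascade)
```

In a real ERT each `r_t` would sum the leaf values reached by pixel-difference features in its trees; the toy version only shows how the estimate is refined level by level.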
To train all the $r_t(\cdot)$, the gradient tree boosting algorithm is used to reduce the sum of squared errors between the initial shape and the ground truth. Each regressor consists of many trees. The parameters of these trees are determined from randomly selected pixel pairs and the coordinate difference between the current shape and the ground truth. As the regressors are updated, the initial estimated shape $S^{(0)}$ is eventually updated to the true shape $S^{(t+n)}$. Algorithm 1 [19] is the update algorithm of the regressor. It is assumed that the input image is $I$ and the learning rate is $v \in (0, 1)$.
The ERT method pays more attention to the contour of the face, ignoring the distinctive information of texture and wrinkles in facial expressions. To improve the extraction of facial features, the proposed algorithm combines the detection of texture and wrinkles with ERT. Figure 2, which was generated by style Generative Adversarial Networks (styleGAN) [28], shows that the wrinkles on the eyebrows and forehead, which clearly reflect the characteristics of expression, vary with age. Thus, detecting brow and forehead wrinkles deteriorates the generalization ability of the model. To acquire better generalization ability, the proposed scheme detects wrinkles at the corners of the mouth and the nasolabial folds rather than on the eyebrows and forehead. Facial features are shown in Table 2.
Recently, dynamic recognition has become the backbone of many FER methods due to the spatiotemporal characteristics of this task. At present, methods for the task are mainly concentrated in two directions. The first uses the optical flow method to find the dynamic trend of the target. The second, exemplified by the 3D net in the C3D network, uses a three-dimensional convolution kernel to convolve the spatiotemporal blocks formed by the video and capture the dynamic features of expressions. This paper draws on both methods.
Two parallel 3D networks were used to extract global and local features, respectively. The algorithm first aligns the faces in the videos and normalizes the size of each face. Secondly, every 16 frames of face images and local feature images were used to generate spatiotemporal blocks of the same size. Thirdly, two 3D networks with the same structure performed feature extraction on the different spatiotemporal blocks. Finally, an average layer was used to merge the features generated by the two networks. The framework is shown in Figure 3, where it can be observed that the input data were preprocessed into two channels as input for the 3D networks. Here, X is the input image sequence, and n is the total number of labeled image sequences.
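Grouping every 16 aligned frames into a spatiotemporal block can be sketched as a reshape. The sketch assumes non-overlapping blocks and drops any trailing frames that do not fill a block; the text does not specify either choice, and `make_blocks` is a hypothetical helper.

```python
import numpy as np

def make_blocks(frames, block_len=16):
    """frames: array of shape (T, height, width) of size-normalized images.
    Returns an array of shape (T // block_len, block_len, height, width)."""
    n_blocks = len(frames) // block_len
    usable = frames[: n_blocks * block_len]   # drop the incomplete tail
    return usable.reshape(n_blocks, block_len, *frames.shape[1:])

# The same call is applied to the face images and to the local feature maps,
# so that both sub-networks receive blocks of identical size.
face_frames = np.random.default_rng(0).random((40, 32, 32))
local_frames = np.random.default_rng(1).random((40, 32, 32))
face_blocks = make_blocks(face_frames)
local_blocks = make_blocks(local_frames)
```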
Considering the 3D convolutional layer, batch normalization, ReLU, and a pooling layer as one convolutional block, (1) can be represented by (4), where $f(\cdot)$ and $g(\cdot)$ are the convolution block calculation processes, $m$ is the number of convolution blocks, and $h$ is the fusion calculation process of the two networks. Without considering batch normalization, $f(\cdot)$ and $g(\cdot)$ give similar results. Taking $f(\cdot)$ as an example, $P$ is the output of one of its layers, $W$ and $b$ are the weight coefficients of the layer, $R$ is the calculation of the activation function and pooling layer, and $l$ is the number of layers.
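The symbol definitions above suggest a block-wise composition for the two sub-networks and a layer-wise recursion inside each block. The following is one plausible reading under standard notation, not the equations as originally printed; the input names $X_{\mathrm{global}}$ and $X_{\mathrm{local}}$, the bracketed superscripts, and the use of $*$ for 3D convolution are assumptions.

```latex
% Assumed form: m stacked blocks per sub-network, fused by h.
\hat{y} = h\Bigl( f_m\bigl(\cdots f_1(X_{\mathrm{global}})\bigr),\;
                  g_m\bigl(\cdots g_1(X_{\mathrm{local}})\bigr) \Bigr)

% Assumed form of one layer of f, ignoring batch normalization:
% convolution with weights W and bias b, followed by activation and pooling R.
P^{(l)} = R\bigl( W^{(l)} * P^{(l-1)} + b^{(l)} \bigr)
```

Here $P^{(0)}$ would be the input spatiotemporal block, and $g(\cdot)$ would follow the same recursion with its own weights.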
The network is formed by stacking five convolutional blocks. The kernel of each convolution layer is 3 × 3 × 3 with a stride of 1 × 1 × 1. The first pooling kernel is 1 × 2 × 2 with a stride of 1 × 2 × 2, and the subsequent pooling kernels are 2 × 2 × 2 with a stride of 2 × 2 × 2. A total of 4096 output units were set in the fully connected layer. The number of output channels is 64, 128, 256 and 512, respectively. Because the structure of the two subnetworks has already been fixed, the number of channels is constant from the fully connected layer to the average layer, as shown in Figure 4. Since we are using a parallel network, the number of channels needs to be multiplied by two.
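To make the layer arithmetic concrete, the sketch below traces the feature-map shape through the five blocks. Several values are assumptions not given in the text: the 16 × 112 × 112 input size (borrowed from the C3D convention), "same" padding for the convolutions, floor division for pooling, and a fifth block that repeats 512 channels (the text lists only four channel counts).

```python
def block_output_shapes(depth, height, width):
    """Return (channels, depth, height, width) after each convolutional block.
    Convolutions (3x3x3, stride 1) are assumed to preserve size via padding,
    so only the pooling layers change the spatial and temporal extent."""
    pools = [(1, 2, 2)] + [(2, 2, 2)] * 4     # first pool keeps all frames
    channels = [64, 128, 256, 512, 512]       # fifth value is an assumption
    shapes = []
    d, h, w = depth, height, width
    for c, (pd, ph, pw) in zip(channels, pools):
        d, h, w = d // pd, h // ph, w // pw   # stride equals kernel size
        shapes.append((c, d, h, w))
    return shapes

shapes = block_output_shapes(16, 112, 112)    # hypothetical input size
for i, s in enumerate(shapes, start=1):
    print(f"block {i}: channels={s[0]}, frames={s[1]}, size={s[2]}x{s[3]}")
```

The 1 × 2 × 2 first pooling layer leaves the 16 frames intact, so temporal information is only merged from the second block onward.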

CNN Block
The preprocessed video produces face images and contour images. To adapt to these two types of images and improve the generalization ability of the network, a module called the CNN block was used, which is similar to the Residual Block [30]. As can be observed in Figure 5, the difference from ResNet is that the simplest path of information propagation serves as the main path. Such a structure has stronger generalization ability and can better avoid vanishing gradients. Because the shortcut path is kept clear, information can be transmitted smoothly in both forward and backward propagation. BN and ReLU are placed before the weight layers as pre-activation, which eases optimization and reduces overfitting on the residual path.
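The pre-activation ordering (BN and ReLU before each weight layer, with an untouched identity shortcut) can be shown in a small NumPy sketch. It is a simplification under stated assumptions: dense matrices `w1` and `w2` stand in for the 3D convolutions, batch normalization has no learnable scale or shift, and the exact layout of the authors' CNN block in Figure 5 may differ.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch axis (no learnable parameters)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def preact_block(x, w1, w2):
    """Residual path: BN -> ReLU -> weight, applied twice.
    Shortcut path: the input x is added back unchanged."""
    out = relu(batch_norm(x)) @ w1
    out = relu(batch_norm(out)) @ w2
    return x + out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
y = preact_block(x, rng.normal(size=(8, 8)) * 0.1, rng.normal(size=(8, 8)) * 0.1)
```

Because nothing is applied on the shortcut, the block reduces to the identity when the residual weights are zero, which is what lets gradients pass through unchanged.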

The Objective Function
The cross-entropy loss function converges faster because its gradient with respect to the weights of the last layer does not depend on the derivative of the activation function and is only proportional to the difference between the output label and the true label. Since backpropagation is a chain of multiplications, the update of the entire weight matrix is accelerated. The derivative of the multi-class cross-entropy loss is simple, and the loss depends only on the probability of the correct class. The derivative of the loss with respect to the input of the softmax activation layer is also simple, as shown in Equation (7), where $y_i$ is the ith label, and n equals 7 or 6 in this paper depending on the database. When different weights $\lambda_1$ and $\lambda_2$ are assigned to the losses of the two parallel networks, the total loss is given by Equation (8).
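The property described here, that the gradient of softmax cross-entropy with respect to the softmax input is simply the predicted probability minus the true label, can be checked numerically. The sketch uses the standard textbook formulas rather than the paper's Equations (7) and (8); the default weights `lam1 = lam2 = 0.5` are hypothetical, since the values of λ₁ and λ₂ are not given.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(logits, onehot):
    """Multi-class cross-entropy; only the correct class contributes."""
    return -np.sum(onehot * np.log(softmax(logits)), axis=-1)

def cross_entropy_grad(logits, onehot):
    """Gradient w.r.t. the softmax input: prediction minus one-hot label."""
    return softmax(logits) - onehot

def combined_loss(logits_global, logits_local, onehot, lam1=0.5, lam2=0.5):
    """Weighted sum of the losses of the two parallel sub-networks."""
    return (lam1 * cross_entropy(logits_global, onehot)
            + lam2 * cross_entropy(logits_local, onehot))

# Central-difference check of the analytic gradient on a 3-class toy example.
z = np.array([1.0, -0.5, 2.0])
y = np.array([0.0, 0.0, 1.0])
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], y)
     - cross_entropy(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])
analytic = cross_entropy_grad(z, y)
```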

Implementation Details
In this paper, the training and inference of the proposed algorithm were implemented with a TensorFlow backend. The equipment was as follows: an Intel® Core™ i7 3.00 GHz processor, 32 GB of RAM, and one NVIDIA GeForce RTX 2080 SUPER Graphics Processing Unit (GPU).
The initial learning rate was set between 0.01 and 0.0001; depending on the database, the network reached a satisfactory loss after 8 to 12 h.

Results on Different Databases
The input image sequence consisted of 16 frames. When a video was shorter than 16 frames, neighboring frames were used as a supplement. Using a fixed time span rather than a fixed number of frames has the advantage of being more suitable for practical applications [31]. The ratio of training set to test set is 8 to 2.
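Padding short clips and splitting the data 8 to 2 can be sketched as below. The padding rule is one possible reading of "neighboring frames were used as a supplement": each frame is repeated next to itself until the clip has 16 frames. The random shuffle before splitting and the names `pad_to_length` and `split_indices` are also assumptions.

```python
import numpy as np

def pad_to_length(frames, length=16):
    """frames: array of shape (T, height, width).
    Clips with T >= length are truncated; shorter clips are stretched by
    duplicating frames in place, so each copy sits beside its original."""
    t = len(frames)
    if t >= length:
        return frames[:length]
    idx = (np.arange(length) * t) // length   # e.g. t=4 -> 0,0,0,0,1,1,1,1,...
    return frames[idx]

def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and test sets."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

short_clip = np.arange(4)[:, None, None] * np.ones((4, 2, 2))
padded = pad_to_length(short_clip)
train_idx, test_idx = split_indices(480)      # 480 = number of SAVEE videos
```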
The convergence curves of the proposed method are shown in Figure 6. The blue and yellow curves are the results of the original 3DCNN; the red and black curves are the results of MC-DCN on SAVEE and MMI. The run-times on the different databases can be observed in Table 3.

Results on SAVEE
To evaluate the performance of the proposed MC-DCN, we compared it with four FER methods, including Gaussian (PCA) [22] and FAMNN [32]. Results are shown in Table 3. Details of the experimental results of our method on the SAVEE database are given in Figure 7.

Results on MMI
The total accuracy of our model on the MMI dataset is shown in Table 3. Detailed experimental results are given in Figure 8. The proposed algorithm achieves better performance on SAVEE than on MMI. We infer two main reasons for this. One is that expressions in the MMI dataset develop gradually, while the input data are taken from the video over a fixed length of time, so a sequence may be selected in which the emotional change has not reached its peak. The other is the sample imbalance problem, especially for the "neutral" class. Some training samples were added manually to address the insufficient sample size.

Results on GEMEP
The GEneva Multimodal Emotion Portrayals (GEMEP) are a collection of audio and video recordings featuring 10 actors portraying 18 affective states, with different verbal contents and different modes of expression. A total of 105 videos with expert estimates were used. To unify the experiment with the other databases, these videos were relabeled with six new labels: happiness, sadness, disgust, fear, anger and surprise. The recognition rate was 78.3%.
The database has an uneven sample size because of this reconstruction. Videos with different numbers of frames were tested to deal with the small sample size. The confusion matrix for GEMEP is shown in Figure 9.

Conclusions
This paper presents a parallel network, MC-DCN, for facial expression recognition. Considering the different contributions of different facial organs to facial expression recognition, the ensemble of regression trees (ERT) is first used to locate facial features. Secondly, the data are classified by a network containing CNN blocks. Because facial features are simultaneously affected by expression, age, identity and other aspects, this paper introduces an attention mechanism at the back end of ERT. This makes the network pay more attention to the corners of the mouth, nasolabial folds and other parts that change markedly under the influence of expression. At the same time, it ignores crow's feet, which are more susceptible to age. In addition, we have added CNN blocks to the network to improve the generalization ability of the overall network in different scenarios. The parallel structure allows the network to perceive mutual information while highlighting the learning ability of motions that are not obvious in facial expressions, effectively improving recognition accuracy. Experimental results show that our proposed algorithm achieves an accuracy of 95.5% on SAVEE (exaggerated expression), and accuracies of 78.6% and 78.3% on MMI (about half of the time in a state of no expression) and GEMEP (no obvious expression, not concentrated), respectively. Compared with other methods, including 3DCNN, our method improves the recognition accuracy. In particular, the results for GEMEP show that the choice of facial features increases the network's ability to learn unobvious movements. In the future, we plan to explore other important parts of expressions, such as the state of muscles. In this way, the accuracy of expression recognition can be improved through more precise and accurate features.