Facial Expression Recognition Based on Random Forest and Convolutional Neural Network

Abstract: As an important part of emotion research, facial expression recognition is a necessary requirement in human–machine interfaces. Generally, a facial expression recognition system includes face detection, feature extraction, and feature classification. Although traditional machine learning methods have achieved great success, most of them are computationally complex and lack the ability to extract comprehensive and abstract features. Deep learning-based methods can achieve a higher recognition rate for facial expressions, but they need a large number of training samples and tuning parameters, and their hardware requirements are very high. To address these problems, this paper proposes a method that combines features extracted by a convolutional neural network (CNN) with the C4.5 classifier to recognize facial expressions, which not only addresses the incompleteness of handcrafted features but also avoids the high hardware configuration required by a deep learning model. Because a single classifier is prone to overfitting and has weak generalization ability, random forest is applied in this paper. In addition, this paper makes some improvements to the C4.5 classifier and the traditional random forest in the course of the experiments.


Introduction
Facial expressions carry rich emotional information and play a very important role in interpersonal communication. Facial expression recognition has become one of the most promising biometric recognition technologies because it is natural, intuitive, contactless, safe, and fast. According to a famous expression theory [1], of the total emotional information conveyed by facial expression, voice, and language, facial expression makes up 55 percent, voice 38 percent, and language only 7 percent, which shows the importance of facial expressions in interpersonal communication. In order to realize more intelligent and natural human-machine interaction, facial expression recognition has been widely studied in recent decades [2,3], and it has attracted the attention of more and more researchers.
The study of human expression extends to many other disciplines, such as behavioral science, psychology, and machine intelligence. A mature expression recognition system benefits human life. For example: (1) A monitoring system that can recognize facial expressions can be used to identify abnormal expressions (hate, irritability, insecurity, etc.) in large public places, such as supermarkets, train stations, airports, and crowded shopping streets, which is very effective for crime prevention. (2) If a facial expression recognition system is combined with driver safety monitoring, it can analyze the driver's facial expression at any time to determine whether the driver is ... in the process of training and tuning, a new expression recognition system model that combines the features extracted from a deep neural network model with a traditional learning classifier is proposed in this paper, which can address the incompleteness of handcrafted features as well as avoid the long training time of the deep learning model.
The design of the classifier is an important part of facial expression recognition and greatly affects its accuracy; therefore, the selection and application of the classifier largely determine the final result. A facial expression classifier should have high computational efficiency and be able to handle large data sets.
The decision tree [13], a popular method in pattern recognition and data mining, has been widely used in various fields because it is simple to operate. In addition, the algorithm places no requirements on the samples. These advantages suit facial expression classification. Among the many decision tree algorithms, the C4.5 classifier [14] has been widely used in image recognition and is able to classify and recognize facial expressions. Therefore, the C4.5 classifier is selected as the base classifier for expression recognition in this paper. Because a single classifier is prone to overfitting and has weak generalization ability, ensemble learning is applied to the decision tree algorithm in order to improve the classification accuracy. Random forest is the most representative ensemble learning algorithm; it overcomes the bottleneck of the single decision tree and offers good scalability and parallelism when classifying high-dimensional data. Therefore, the random forest algorithm is selected as the facial expression classifier in this paper.
The remainder of this paper is organized as follows. Section 2 describes previous works which are related to our work. Section 3 describes the proposed method in detail. Section 4 presents experiment results and analysis. Section 5 presents conclusions and future work.

Related Work
Generally, a facial expression recognition system has three steps: face detection, feature extraction, and feature classification (Figure 1), of which feature extraction is the most crucial. The expression classification accuracy depends largely on the effectiveness of the extracted features. If the extracted features make the within-class distance as small as possible and the between-class distance as large as possible, less is demanded of the classifier. Luo et al. [15] proposed a hybrid method of principal component analysis (PCA) and local binary pattern (LBP). Principal component analysis was used to extract the global grayscale features of the whole image, and LBP was used to extract the local features. A support vector machine (SVM) was used for facial expression recognition. In that work, the preprocessing methods included geometry normalization, brightness normalization, histogram equalization, image filtering, and segmentation of the effective facial area. Each image of the training set was normalized to a small size (24 × 24). The recognition rate was 93.75%.
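To make the local feature mentioned above concrete, the following is a minimal sketch of the basic 3 × 3 LBP operator in plain Python. It is not the implementation of Luo et al. [15]; the neighbor ordering, the `>=` comparison, and the names `lbp_code` and `lbp_histogram` are assumptions chosen for illustration.

```python
# Basic 3x3 local binary pattern (LBP), standard library only.
# Neighbor order: clockwise from the top-left pixel (a convention assumed here).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, r, c):
    """Return the 8-bit LBP code of the pixel at (r, c)."""
    center = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        # A neighbor at least as bright as the center sets its bit.
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels (256 bins)."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


# Toy grayscale patch with a single interior pixel (value 6).
patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
code = lbp_code(patch, 1, 1)
hist = lbp_histogram(patch)
```

The histogram of such codes over an image region is what is typically used as the local texture feature.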
Chen et al. [16] applied the histogram of oriented gradients (HOG) to encode facial components as features. A linear SVM was then trained to perform the facial expression classification. They evaluated the proposed method on the JAFFE dataset and the extended Cohn-Kanade (CK+) dataset. In the JAFFE experiment, the size of the image was 256 × 256. After the face region was extracted from the face image, it was resized to 156 × 156. The leave-one-sample-out strategy was used to test this method and compare it with other methods. The average classification rate on this dataset reached 94.3%. In the CK+ experiment, they divided the images into a training set and a test set. About one-fifth of the images of each group were randomly selected for the test set, and the remaining images formed the training set. This method achieved an average classification rate of 88.7% with a variation of ±2.3%.
Although the above methods have achieved good recognition results, the data sets used in these experiments are small. Moreover, handcrafted features are not comprehensive. The emergence of deep learning breaks the traditional pattern (feature extraction followed by facial expression classification) by handling feature extraction and classification simultaneously. The convolutional neural network is the most widely used model in image classification; it can capture deeper information, which can further improve the accuracy rate.
Mollahosseini et al. [17] proposed a deep neural network architecture to address the FER problem across multiple standard face datasets, viz. MultiPIE, MMI, CK+, DISFA, FERA, SFEW, and FER2013. The network included two elements (two traditional CNN modules and two "Inception" style modules), where the inception style modules were made up of 1 × 1, 3 × 3, and 5 × 5 convolution layers (using ReLU) in parallel. All the images were resized to 48 × 48 pixels for analysis. To augment the existing data, the authors extracted five crops of 40 × 40 from the four corners and the center of each image. They evaluated the accuracy of the proposed architecture in two different experiments, viz. subject-independent and cross-database evaluation. In the subject-independent experiment, the databases were split into training, validation, and test sets, and 5-fold cross validation was used to evaluate the results. For FERA and SFEW, the training and test sets were defined in the database release, and the results were evaluated on the database-defined test set without K-fold cross validation. The accuracy rates were 94.7%, 77.9%, 55.0%, 76.7%, 47.7%, 93.2%, and 66.4% on the seven data sets, respectively.
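The five-crop augmentation described above (four corners plus the center of a 48 × 48 image, each crop 40 × 40) can be sketched as follows. This is an illustrative sketch, not the code of [17]; the function name `five_crops` and the list-of-rows image representation are assumptions.

```python
def five_crops(image, crop=40):
    """Return the four corner crops and the center crop of a 2-D image.

    `image` is a list of equally long rows; each crop is `crop` x `crop`.
    """
    height, width = len(image), len(image[0])
    if crop > height or crop > width:
        raise ValueError("crop size exceeds image size")
    center_top = (height - crop) // 2
    center_left = (width - crop) // 2
    # (top, left) origin of each crop: TL, TR, BL, BR, center.
    origins = [
        (0, 0),
        (0, width - crop),
        (height - crop, 0),
        (height - crop, width - crop),
        (center_top, center_left),
    ]
    return [
        [row[left:left + crop] for row in image[top:top + crop]]
        for top, left in origins
    ]


# Toy 48x48 image whose pixel value encodes its (row, column) position.
image = [[r * 48 + c for c in range(48)] for r in range(48)]
crops = five_crops(image)
```

Each original image thus yields five training samples that differ by a shift of a few pixels.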
Wen et al. [18] proposed a method that integrated many convolutional neural networks with probability-based fusion for facial expression recognition. In all of the designed CNN models, the softmax classifier was used in the last layer to estimate the probability that a test sample belongs to each class. Once many CNNs had been generated as base classifiers, their probability outputs were merged using the probability-based fusion method. Because diversity among the base classifiers is regarded as a key factor in the performance of any ensemble learning method, the authors applied the implicit method to generate CNNs with rich diversity. Four databases were used in their experiments, viz. JAFFE, CK+, Emotiw2015, and FER2013. In the experiments, they obtained 100 CNNs whose accuracies ranged from 65% to 68% on FER2013-VAL. Their work illustrated that ensemble learning could be applied to further improve performance. It was also seen that both approaches did not obtain very good performance across databases. The main reason is that the training database was not large enough to contain all kinds of samples with rich diversity. Therefore, as many samples as possible should be included in the training database to achieve richer diversity. Furthermore, when the base classifiers were weak, the ensemble method failed to further improve performance.
However, in the above literature, the feature information extracted by traditional machine learning algorithms is not comprehensive, while the CNN models need a lot of training time and have very high hardware requirements. Considering these problems, this paper proposes a method that combines CNN features with a machine learning classifier for facial expression recognition, which can not only increase the coverage of the extracted feature information but also reduce the training time of the model.
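As a rough illustration of fusing the softmax outputs of several base classifiers, the sketch below averages the class probabilities and picks the class with the highest mean. The text does not specify the exact fusion rule of [18], so plain averaging is an assumption here, as are the names `fuse_by_mean` and `predict` and the toy scores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_by_mean(prob_vectors):
    """Average the class probabilities of several base classifiers (assumed rule)."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


def predict(prob_vectors):
    """Index of the class with the highest fused probability."""
    fused = fuse_by_mean(prob_vectors)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical raw scores of three base CNNs for a 3-class problem.
base_outputs = [
    softmax([2.0, 1.0, 0.1]),
    softmax([0.5, 2.5, 0.3]),
    softmax([1.8, 1.7, 0.2]),
]
fused = fuse_by_mean(base_outputs)
label = predict(base_outputs)
```

In this toy case two of the three networks individually prefer class 0, yet the averaged probabilities favor class 1, because the second network is much more confident. This is the practical difference between probability-based fusion and simple majority voting.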

The Acquisition of Features Based on Convolutional Neural Network
A convolutional neural network (CNN) [19] is a supervised learning method that performs feature extraction and classification simultaneously and automatically discovers multiple levels of representation in data; it has been widely used in the field of computer vision. The general structure of the basic CNN model is shown in Figure 2, and a detailed introduction to the CNN model can be found in [20,21]. Choosing a CNN model for a task and building an appropriate model for feature learning require a complexity analysis based on the task at hand. The main factors that affect a convolutional neural network include the network's depth, the selection of convolution kernels, and the choice of activation function. Considering the research task of this paper, a simple CNN framework is built, as shown in Table 1. This paper focuses on expression recognition by combining the random forest with features extracted from the CNN model and identifies the most suitable fusion method for the actual experimental environment. The original data lose some information at every layer of the CNN model. By the time the data reach the fully connected layer, the nature of the raw data has become distorted; therefore, the random forest cannot be attached directly to the end of the CNN structure. The most reasonable approach is to attach the random forest to the last pooling layer, as shown in Figure 3.
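The idea of taking the output of the last pooling layer as the feature vector for a tree-based classifier can be sketched as follows. This is only a toy illustration under stated assumptions: 2 × 2 max pooling with stride 2, tiny hand-written feature maps in place of real convolution outputs, and the hypothetical names `max_pool_2x2` and `pooled_feature_vector`. The actual layer sizes are those of Table 1 and are not reproduced here.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D feature map (list of rows)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def pooled_feature_vector(feature_maps):
    """Pool every feature map and flatten the results into one vector.

    The flat vector is what would be handed to the decision trees.
    """
    vector = []
    for fmap in feature_maps:
        for row in max_pool_2x2(fmap):
            vector.extend(row)
    return vector


# Two toy 4x4 feature maps standing in for the last convolution layer's output.
maps = [
    [[1, 2, 0, 1], [3, 4, 1, 0], [0, 1, 5, 6], [2, 0, 7, 8]],
    [[0, 0, 1, 1], [0, 9, 1, 1], [2, 2, 0, 3], [2, 2, 3, 0]],
]
features = pooled_feature_vector(maps)
```

Each entry of `features` then plays the role of one continuous attribute for the C4.5 trees.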

Introduction and Improvement of C4.5 Decision Tree
Among decision tree algorithms, the C4.5 classifier has been widely used in the field of image recognition because of its high computational efficiency, its ability to process large data sets, and its simplicity. Therefore, C4.5 is selected as the classifier for facial expression recognition in this paper.
When classifying continuous values, the core computation of the C4.5 classifier is the information gain rate, which can be seen in Equation (1). The information gain rate not only affects the classification efficiency of the decision tree but also determines the choice of nodes in the process of decision tree generation. Let us introduce some notation. $[a_1, a_2, a_3, \ldots, a_v, \ldots, a_V]$ stands for the $V$ attribute sets. $|y|$ represents the total number of categories. $IV(a_v)$ stands for the fixed value of the attribute $a_v$. $|D| = \sum_{i=1}^{|y|} m_i$ stands for the sample number in the whole dataset. $D^v$ stands for the sample number in attribute value $a_v$. Suppose there are $n$ different values for attribute $a_v$, then sort these values from smallest to largest (i.e., $\{a_v^1, a_v^2, \ldots, a_v^n\}$). $D$ can be divided into two different subsets ($D_t^-$ and $D_t^+$) based on partition point $t$.
is used as the division. Then, we can treat these points as discrete attribute values. Suppose $l_{t_i}^{\lambda}$ stands for the number of seven expression labels in these two datasets $D_t^-$ and $D_t^+$, respectively. We find that the logarithm operation appears very frequently in the process of computing the information gain rate [22] and exists in almost every equation. A large number of logarithm operations affects the computational speed of the system. This paper improves the formula of information gain by introducing a Taylor series expansion, which reduces the operation time and improves the real-time response of the system. The new equation of information gain rate can be seen as follows.
The new equation of information gain rate can be determined by applying Equation (3) to Equation (2).
Compared with Equation (1), the complicated logarithm calculation is replaced in Equation (3) by the four basic arithmetic operations, which greatly improves the operation efficiency and the real-time performance of the system.
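The effect of replacing the logarithm can be illustrated with a small sketch. The exact expansion used in Equation (3) is not reproduced in this text, so the code below assumes the first-order Taylor expansion $\ln p \approx p - 1$ around $p = 1$, which turns each term $-p \log_2 p$ into $p(1 - p)/\ln 2$. The gain rate itself follows the standard C4.5 definition (information gain divided by the split information). The function names and the toy class counts are hypothetical.

```python
import math

INV_LN2 = 1.0 / math.log(2.0)  # constant, computed once


def entropy_exact(counts):
    """Shannon entropy in bits, using the logarithm."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_taylor(counts):
    """Log-free approximation, assuming ln(p) ~= p - 1 (first-order Taylor)."""
    total = sum(counts)
    return INV_LN2 * sum((c / total) * (1.0 - c / total) for c in counts)


def gain_ratio(parent_counts, child_counts, entropy):
    """Standard C4.5 gain ratio for a split of the parent into children.

    `parent_counts` holds per-class counts before the split and
    `child_counts` holds one per-class count list per subset.
    """
    total = sum(parent_counts)
    weights = [sum(child) / total for child in child_counts]
    gain = entropy(parent_counts) - sum(
        w * entropy(child) for w, child in zip(weights, child_counts) if w > 0
    )
    split_info = entropy([sum(child) for child in child_counts])
    return gain / split_info if split_info > 0 else 0.0


# Two candidate binary splits of 10 samples from 2 classes (toy numbers).
parent = [5, 5]
split_a = [[4, 1], [1, 4]]  # fairly pure subsets
split_b = [[3, 2], [2, 3]]  # nearly uninformative subsets

exact_a = gain_ratio(parent, split_a, entropy_exact)
exact_b = gain_ratio(parent, split_b, entropy_exact)
approx_a = gain_ratio(parent, split_a, entropy_taylor)
approx_b = gain_ratio(parent, split_b, entropy_taylor)
```

The approximate values differ from the exact ones, but in this toy case both versions rank split A above split B, which is what matters when choosing the partition attribute. Whether the ranking is preserved in general depends on the expansion actually used.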

Generation of the New Random Forest
Because a single classifier is prone to overfitting and has weak generalization ability, the ensemble learning method is introduced to improve the classification accuracy.
Random forest (RF) [23], which was proposed in 2001, is composed of multiple decision trees. According to the idea of ensemble learning [24], the base learners should be "good but different"; that is, each individual learner should have a relatively good recognition rate and should differ from the others. In the traditional approach, however, the number of decision trees is set in advance, the trees are built randomly, and the final result is determined by voting over all of them. The trees built this way may not differ much from each other, or the recognition rate of individual trees may be low, which affects the final result. This paper proposes a probability selection-based method to choose among all the generated decision trees, which meets the requirements of both "good but different" and diversity. The specific algorithm is shown in Algorithm 1.
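As a rough sketch of the selection idea only (not of tree construction), the code below picks m trees from a pool of candidates: with some probability the best remaining tree is taken, otherwise a random remaining tree is taken, and the selected trees then vote. The probability `p_best`, selection without replacement, the use of a validation accuracy as the quality measure, and all names are assumptions made for illustration; the exact selection rule is the one given in Algorithm 1.

```python
import random
from collections import Counter


def select_trees(candidates, m, p_best=0.7, seed=0):
    """Pick m trees from (tree_id, accuracy) candidates.

    With probability p_best the most accurate remaining tree is taken
    ("good"); otherwise a random remaining tree is taken ("different").
    """
    rng = random.Random(seed)
    pool = list(candidates)
    chosen = []
    while pool and len(chosen) < m:
        if rng.random() < p_best:
            pick = max(pool, key=lambda item: item[1])
        else:
            pick = rng.choice(pool)
        pool.remove(pick)
        chosen.append(pick)
    return chosen


def majority_vote(predictions):
    """Label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical candidate trees with their validation accuracies.
candidates = [("t0", 0.61), ("t1", 0.72), ("t2", 0.68), ("t3", 0.55), ("t4", 0.70)]
forest = select_trees(candidates, m=3)
greedy = select_trees(candidates, m=3, p_best=1.0)
vote = majority_vote(["happy", "sad", "happy"])
```

Setting `p_best` to 1.0 keeps only the most accurate trees, while lower values trade some individual accuracy for diversity.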

Algorithm 1. Generate new random forest
Input: training set D; attribute set A
Output: multiple expression classification decision trees
1: Count=0; number=0;
2: Create the root node node;
3: If all samples in D belong to the same category C, then
4: Mark node as a class C leaf node, return
5: end if
6: If A=∅, OR the sample values on A are the same, then
7: Mark node as a leaf node and its category as the class with the largest number of samples, return
8: end if
9: For each attribute, the information gain rate is calculated by Equation (2).
10: Select the optimal partition attribute from A, and assume that the test attribute a* has the highest information gain rate during the experiment.
11: Find the segmentation point of the attribute;
12: A new leaf node is separated from node a*;
13: If the sample subset corresponding to this leaf node is empty, then this leaf node is divided to generate a new leaf node, which is marked as the expression with the highest number.
14: Else,
15: continue to split this leaf node;
16: end if;
17: One decision tree is created.
18: ... Select the optimal decision tree from all the currently established decision trees as the alternative decision tree. number=number+1;
32: else
33: The decision tree is randomly selected from all the currently established decision trees as an alternative decision tree. number=number+1;
34: if number<m,
35: repeat step(29)-step(33)
36: if number=m,
37: break
38: end if
39: All the selected decision trees are combined to form a random forest
40: The test samples are put into the random forest, and the classification results of each decision tree are collected.
41: The results with the most votes will be used as the prediction classification of the current sample.

Database
Figure 4 shows the four databases that are used in this paper.
JAFFE database [25]: The JAFFE facial expression database was published in 1998. It contains images of 10 Japanese women who were asked to make a variety of facial expressions against a given background and were then photographed. This is a relatively small database, with a total of 213 facial expression images of the 10 women. There are seven expressions: disgust, anger, fear, happy, sad, surprised, and neutral. Each person has three to four images for each expression label.
CK+ database [26]: The CK+ database is an extension of the Cohn-Kanade database and was released in 2010. The CK+ database has more data than the JAFFE database; it includes 123 subfolders, with a total of 593 expression image sequences. The last image of each sequence carries the classification label, and 327 sequences have expression classification labels. This database is one of the most widely used in the field of facial expression recognition.
FER-2013 database [27]:
RAF-DB database [28]: The Real-world Affective Faces Database is a large-scale facial expression database with about 30,000 facial images. The images in this database were downloaded from the Internet and vary greatly in subjects' age, gender and ethnicity, head poses, lighting conditions, occlusions (e.g., glasses, facial hair, or self-occlusion), post-processing operations (e.g., various filters and special effects), etc. The RAF-DB database has two folders: basic and compound. The basic folder is used in this paper.

Data Augmentation
When a CNN is chosen as the recognition model, the larger the amount of original data, the higher the accuracy and the generalization ability of the trained model. Therefore, data augmentation is very important, especially for data sets with uneven class distribution. A large training database is an advantage when training a good model. Common data augmentation methods include rotating the image, cropping the image, changing the color of the image, distorting the image features, changing the size of the image, and adding noise to the image. The numbers of samples in the four databases are shown in Table 2.
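A few of the augmentation operations listed above can be sketched in plain Python as follows. The parameters used in the experiments are not given in the text, so the crop size, the noise level, and the use of a 90-degree rotation (instead of a small-angle rotation, which would need interpolation) are assumptions, as are the function names.

```python
import random


def horizontal_flip(image):
    """Mirror a 2-D grayscale image (list of rows) left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def random_crop(image, size, rng):
    """Cut a random size x size window out of the image."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def add_noise(image, sigma, rng):
    """Add Gaussian noise and clip the result to the 0..255 range."""
    return [
        [min(255, max(0, round(pixel + rng.gauss(0.0, sigma)))) for pixel in row]
        for row in image
    ]


def augment(image, rng, crop_size=40, sigma=5.0):
    """Return several augmented variants of one image."""
    return [
        horizontal_flip(image),
        rotate_90(image),
        random_crop(image, crop_size, rng),
        add_noise(image, sigma, rng),
    ]


# Toy 48x48 grayscale image with values in 0..255.
image = [[(r * 5 + c * 3) % 256 for c in range(48)] for r in range(48)]
variants = augment(image, random.Random(0))
```

Applying such a function to every training image multiplies the size of the training set. For classes with few samples, more variants can be generated to reduce the imbalance.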

Results
For the JAFFE and CK+ databases, 70% of the augmented data is taken as the training set and 30% as the test set. For the FER2013 and RAF-DB databases, the existing training and test sets are used. Figure 5 shows some decision tree models that were generated in our experiments. Table 3 shows the experimental results obtained by different feature extraction methods and classification methods in this paper. Figure 6 shows the obvious differences between these different methods.
Figure 5. Some decision tree models that were generated in experiments. (d) RAF-DB database.
As can be seen from the results in Figure 6, the expression classification ability based on "handcrafted" features is generally lower than that based on CNN features under the same classifier. The recognition performance of the random forest based on probability selection is higher than that of the traditional random forest. The experimental curves can be seen in Figure 7. Figure 8 compares the classification performance of the new method proposed in this paper with that of other methods on the JAFFE, CK+, FER2013, and RAF-DB databases, which verifies that this method has certain advantages over other methods.
Figure 8. The comparison between the new method and some state-of-the-art methods in papers on four experimental databases. (a) JAFFE database. From left to right, results come from [17,18,26,28,29] and the new method proposed in this paper; (b) CK+ database. From left to right, results come from [16,18,27,30,31] and the new method proposed in this paper; (c) FER2013 database. From left to right, results come from [32–35] and the new method proposed in this paper; (d) RAF-DB database. From left to right, results come from [36–38] and the new method proposed in this paper.

Conclusions
Human beings often communicate emotions through rich and varied expressions. People are highly sensitive to expressions: we can quickly perceive other people's situation and inner state by observing their expressions, and observing changes in expression is necessary in interpersonal communication. Only in this way can we better understand each other and quickly grasp other people's emotions and thoughts.
In an era of rapid development in science and Internet technology, the demand for human-computer interaction in daily life has increased quickly. If researchers can make a breakthrough in recognizing human emotions, it will be a leap forward for intelligent human-computer interaction.
Considering the complexity of the system and other issues, this paper does not conduct experiments under special conditions such as make-up and occlusion. Further research is needed on how to recognize facial expressions under these extreme conditions. In addition, for the convolutional neural network, it is necessary to collect as many samples as possible so that the trained network generalizes well.