Facial expressions convey rich emotional information and play a very important role in interpersonal communication. Facial expression recognition has become one of the most promising biometric recognition technologies due to its natural, intuitive, non-contact, safe, and rapid characteristics. According to a famous theory of emotional expression [1], of the total emotional information conveyed through facial expression, voice, and language, facial expression accounts for 55 percent, voice for 38 percent, and language for only 7 percent, which shows the importance of facial expressions in interpersonal communication. In order to realize more intelligent and natural human–machine interaction, facial expression recognition has been widely studied in recent decades [2], and it has attracted more and more researchers' attention.
The study of human expressions extends to many other disciplines, such as behavioral science, psychology, and machine intelligence. A mature expression recognition system can benefit human life in many ways. For example: (1) A monitoring system capable of recognizing facial expressions can be used to identify abnormal expressions (hatred, irritability, insecurity, etc.) in large public places such as supermarkets, train stations, airports, and crowded shopping streets, which is very effective for crime prevention. (2) If a facial expression recognition system is combined with driver safety monitoring, it can analyze the driver's facial expression at any time to determine whether the driver is tired, avoiding the potential danger of fatigued driving. (3) Facial expressions can also promote the development of the service industry, since positive customer feedback can be captured in time by a facial expression recognition system. (4) If a facial expression recognition system is deployed in a hospital, the patient's expression can be monitored and analyzed in real time, ensuring timely treatment.
According to Ekman and Friesen [4], six basic emotions are universal: happiness, surprise, sadness, fear, anger, and disgust. These emotions can be found in all cultures. Generally, facial expression recognition includes three steps: preprocessing, feature extraction, and classification [5]. Feature extraction is a key step in the whole recognition task [6], and the extracted features should minimize within-class variations of an expression while maximizing between-class variations [7]. For feature extraction, there are two common approaches: appearance-based methods [8] and geometric feature-based methods [10]. Geometric features are extracted to form a feature vector that represents the shape and locations of facial components. This approach reduces computation time and makes the system respond faster, but it requires highly accurate feature point selection, and accurate positioning of feature points is difficult to achieve with low image quality or a complex background. Moreover, geometric feature-based methods can only outline changes in overall face shape; they ignore other facial information such as skin texture changes, which leads to low accuracy in recognizing subtle changes in facial expressions. Appearance-based methods describe the surface properties of the image. Because texture makes full use of image information, it can serve as an important basis for image description and recognition, both in theory and in practice. However, texture is only a property of an object's surface and cannot completely reflect the object's essential properties, so high-level image content cannot be obtained from texture features alone. Facial expression features extracted by appearance-based methods have strong anti-interference ability, but most such methods are computationally complex and lack the ability to extract comprehensive, abstract features.
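As an illustration of an appearance-based texture descriptor, the sketch below implements the basic 3 × 3 LBP operator in NumPy: each interior pixel's eight neighbors are thresholded against the center and packed into an 8-bit code, and the normalized code histogram serves as the texture feature. This is a minimal sketch for illustration, not the exact operator used in any of the cited works.

```python
import numpy as np

def lbp_image(gray):
    """Basic 3x3 LBP: threshold each interior pixel's 8 neighbors against
    the center pixel and pack the comparison bits into an 8-bit code."""
    g = np.asarray(gray, dtype=np.int32)
    H, W = g.shape
    center = g[1:-1, 1:-1]
    # Neighbor offsets, clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        codes |= (neighbor >= center).astype(np.int32) << bit
    return codes

def lbp_histogram(gray):
    """Normalized 256-bin histogram of LBP codes: the texture descriptor."""
    hist, _ = np.histogram(lbp_image(gray), bins=256, range=(0, 256))
    return hist / hist.sum()
```

In practice the histogram is usually computed per facial sub-region and the regional histograms are concatenated, so that the descriptor keeps some spatial information.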
Both geometric feature-based and appearance-based methods are single feature-based methods. A single feature can only reflect a local or partial aspect of an expression and cannot fully capture its overall characteristics. For the expression classifier, the comprehensiveness of the feature attributes is very important and greatly affects recognition accuracy. Hence, in order to minimize the final error rate, many fusion feature-based methods have been proposed in recent years. A fusion feature-based method can fully and effectively reflect the overall features of an expression image. Although fusion feature-based methods perform better than single feature-based methods, handcrafted features are often poorly designed; if the features are inadequate, even the best classifier will fail to achieve good performance [12]. Meanwhile, the time and labor cost of devising new ways to extract handcrafted features is very high. Studies have shown that neural networks have strong expressive ability: a suitably sized three-layer neural network can approximate any complex nonlinear mapping. In a deep network, each hidden layer performs a nonlinear transformation on the output of the previous layer, so each hidden layer synthesizes and abstracts the data of the layer before it, and the features learned layer by layer better capture the essence of the object being described. Therefore, a deep network has better expressive and learning ability than a shallow network: more hidden features can be acquired automatically, yielding more useful features for improving prediction or classification accuracy. Deep learning has thus become an inevitable trend in many machine recognition tasks. However, deep learning-based methods require a large number of training samples and tuning parameters, and their hardware requirements are also high.
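As a minimal sketch of the feature fusion idea (hypothetical, not the exact scheme of any cited method), two feature blocks, e.g. a geometric vector and an appearance vector, can be normalized independently and then concatenated, so that neither block dominates the fused descriptor merely through its scale:

```python
import numpy as np

def fuse_features(geometric, appearance, eps=1e-8):
    """Z-score each feature block independently, then concatenate.

    Independent normalization keeps a block with large raw magnitudes
    (e.g. pixel-based texture values) from drowning out the other block.
    """
    blocks = []
    for v in (geometric, appearance):
        v = np.asarray(v, dtype=np.float64)
        blocks.append((v - v.mean()) / (v.std() + eps))
    return np.concatenate(blocks)
```

More elaborate fusion schemes weight the blocks or apply dimensionality reduction after concatenation, but the scale-balancing step above is the common starting point.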
Expression feature extraction is the most important part of a facial expression recognition (FER) system, and effective extraction of facial features greatly improves recognition performance. Considering the disadvantages of handcrafted features and the complexity of training and tuning deep learning models, this paper proposes a new expression recognition model that combines features extracted by a deep neural network with a traditional learning classifier, which addresses the incompleteness of handcrafted features while avoiding the long training time of a full deep learning pipeline.
As an important part of facial expression recognition, the design of the expression classifier greatly affects recognition accuracy; therefore, the selection and application of the classifier is important in determining the final result. The classifier should have high computational efficiency and the ability to handle large data sets.
The decision tree [13], a popular method in pattern recognition and data mining, has been used extensively in various fields due to its operational simplicity, and the algorithm places no special requirements on the samples; these advantages make it suitable for facial expression classification. Among the many decision tree algorithms, the C4.5 classifier [14] has been widely used in image recognition and is able to classify facial expressions, so it is adopted as a base classifier for expression recognition in this paper. Because a single classifier suffers from problems such as overfitting and weak generalization, ensemble learning is applied to the decision tree algorithm to improve classification accuracy. Random forest is the most representative ensemble learning algorithm; it overcomes the bottlenecks of a single decision tree and offers good scalability and parallelism for high-dimensional data in classification. Therefore, the random forest algorithm is selected as the facial expression classifier in this paper.
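The random forest's aggregation step can be sketched as a simple majority vote over the labels predicted by the individual trees; the example below uses hypothetical per-tree outputs for one test image:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the labels predicted by an ensemble of base classifiers
    (e.g. the trees of a random forest) by majority vote."""
    votes = Counter(predictions)
    # most_common(1) returns the single (label, count) pair with the most votes
    return votes.most_common(1)[0][0]

# Hypothetical labels from seven trees for one test image
tree_outputs = ["happy", "happy", "surprise", "happy", "sad", "happy", "surprise"]
print(majority_vote(tree_outputs))  # -> happy
```

In a full random forest, each tree is additionally trained on a bootstrap sample of the data with a random subset of features considered at each split, which is what gives the ensemble its diversity.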
The remainder of this paper is organized as follows. Section 2 describes previous work related to ours. Section 3 describes the proposed method in detail. Section 4 presents experimental results and analysis. Section 5 presents conclusions and future work.
2. Related Work
Generally, a facial expression recognition system has three steps: face detection, feature extraction, and feature classification (Figure 1), of which feature extraction is the most crucial. Expression classification accuracy depends largely on the effectiveness of the extracted features. If the extracted features make the within-class distance as small as possible and the between-class distance as large as possible, less is demanded of the classifier.
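The within-class/between-class criterion can be made concrete with a simple Fisher-style scatter ratio (an illustrative sketch, not a metric used in the cited works): larger values indicate features whose classes are easier to separate.

```python
import numpy as np

def scatter_ratio(features, labels):
    """Ratio of between-class to within-class scatter.

    A large ratio means class means are far apart relative to the spread
    inside each class, i.e. the features ease the classifier's job.
    """
    X = np.asarray(features, dtype=np.float64)
    y = np.asarray(labels)
    mu = X.mean(axis=0)          # global mean
    within = 0.0
    between = 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)   # class mean
        within += ((Xc - mu_c) ** 2).sum()
        between += len(Xc) * ((mu_c - mu) ** 2).sum()
    return between / within
```

For example, two tight, well-separated clusters score far higher than two overlapping ones.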
Luo et al. [15] proposed a hybrid method combining principal component analysis (PCA) and local binary patterns (LBP): PCA was used to extract global grayscale features of the whole image, LBP was used to extract local features, and a support vector machine (SVM) performed the facial expression recognition. Their preprocessing included geometry normalization, brightness normalization, histogram equalization, image filtering, and segmentation of the effective facial area. Each image in the training set was normalized to a small size (24 × 24). The recognition rate was 93.75%.
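The global PCA features used by such hybrid methods can be sketched in a few lines of NumPy: flattened images are centered and projected onto the top-k principal directions obtained from an SVD. This is a generic sketch, not Luo et al.'s exact implementation.

```python
import numpy as np

def pca_project(images, k):
    """Project flattened images onto the top-k principal components,
    giving a compact global (holistic) feature vector per image."""
    X = np.asarray(images, dtype=np.float64)   # shape (n_samples, n_pixels)
    mean = X.mean(axis=0)
    Xc = X - mean                              # center the data
    # SVD of the centered data: rows of Vt are the principal directions,
    # ordered by decreasing explained variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T                   # shape (n_samples, k)
```

The k-dimensional projections (rather than the raw pixels) are then fed to the classifier, typically alongside the local LBP histograms in a fused vector.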
Chen et al. [16] applied HOG to encode facial components as features, and a linear SVM was then trained to perform facial expression classification. They evaluated the method on the JAFFE dataset and the extended Cohn–Kanade (CK+) dataset. In the JAFFE experiment, the image size was 256 × 256; after the face region was acquired from each image, it was resized to 156 × 156. A leave-one-sample-out strategy was used to test the method and compare it with others, and the average classification rate on this dataset reached 94.3%. In the CK+ experiment, the images were divided into a training set and a test set: about one-fifth of the images in each group were randomly selected for the test set, and the remaining images formed the training set. This method ultimately achieved an average classification rate of 88.7% with a variance of ±2.3%.
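The core of HOG can be sketched as an orientation histogram for a single cell: gradient magnitudes are accumulated into unsigned-orientation bins over [0°, 180°). A full HOG descriptor additionally tiles the image into cells and normalizes over blocks of cells; the simplified sketch below omits those steps and is not Chen et al.'s exact implementation.

```python
import numpy as np

def orientation_histogram(gray, bins=9):
    """Gradient-orientation histogram for one HOG cell.

    Each pixel votes its gradient magnitude into the bin of its unsigned
    gradient orientation (angles folded into [0, 180) degrees).
    """
    g = np.asarray(gray, dtype=np.float64)
    gy, gx = np.gradient(g)                    # gradients along rows, cols
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist = np.zeros(bins)
    idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    np.add.at(hist, idx.ravel(), mag.ravel())  # magnitude-weighted voting
    return hist / (np.linalg.norm(hist) + 1e-8)
```

A vertical edge, for instance, produces purely horizontal gradients, so all the mass lands in the 0° bin.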
Although the above methods achieved good recognition results, the data sets used in these experiments contain only small samples, and handcrafted features are not comprehensive. The emergence of deep learning breaks the traditional pattern of feature extraction followed by expression classification, handling both simultaneously. The convolutional neural network is the most widely used model in image classification; it can capture deeper information and thereby further improve accuracy.
Mollahosseini et al. [17] proposed a deep neural network architecture to address the FER problem across multiple standard face datasets, viz. MultiPIE, MMI, CK+, DISFA, FERA, SFEW, and FER2013. The network consisted of two traditional CNN modules and two "Inception"-style modules, where each Inception-style module comprised 1 × 1, 3 × 3, and 5 × 5 convolution layers (using ReLU) in parallel. All images were resized to 48 × 48 pixels for analysis, and to augment the existing data, five 40 × 40 crops were extracted from the four corners and the center of each image. They evaluated the accuracy of the proposed architecture in two experiments: subject-independent and cross-database evaluation. In the subject-independent experiment, the databases were split into training, validation, and test sets, and 5-fold cross-validation was used to evaluate the results; for FERA and SFEW, the training and test sets were defined in the database release, so results were evaluated on the predefined test set without K-fold cross-validation. The accuracy rates were 94.7%, 77.9%, 55.0%, 76.7%, 47.7%, 93.2%, and 66.4% on the respective datasets.
Wen et al. [18] proposed a method that integrates many convolutional neural networks with probability-based fusion for facial expression recognition. In all the designed CNN models, a softmax classifier in the last layer estimated the probabilities of a test sample belonging to each class; when many CNNs were generated as base classifiers, their probability outputs were merged using the probability-based fusion method. Because diversity among base classifiers is regarded as key to the performance of any ensemble learning method, the authors applied an implicit method to generate CNNs with rich diversity. Four databases were used in their experiments, viz. JAFFE, CK+, EmotiW2015, and FER2013. They obtained 100 CNNs whose accuracies ranged from 65% to 68% on FER2013-VAL, illustrating that ensemble learning can further improve performance. However, neither approach achieved very good performance across databases, mainly because the training database was not large enough to contain all kinds of samples with rich diversity; therefore, as many samples as possible should be included in the training database. Furthermore, when the base classifiers were weak, the ensemble method failed to further improve performance.
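Probability-based fusion of this kind can be sketched as averaging the softmax output vectors of the base CNNs and taking the class with the highest mean probability (a generic sketch, not necessarily Wen et al.'s exact fusion rule):

```python
import numpy as np

def fuse_probabilities(prob_vectors):
    """Average the softmax outputs of several base classifiers and pick
    the class with the highest mean probability.

    prob_vectors: array-like of shape (n_models, n_classes), each row a
    probability distribution over the expression classes.
    """
    P = np.asarray(prob_vectors, dtype=np.float64)
    mean_prob = P.mean(axis=0)
    return int(mean_prob.argmax()), mean_prob
```

Averaging probabilities (rather than majority-voting hard labels) lets a confident model outweigh several uncertain ones, which is one reason probability-level fusion is popular for CNN ensembles.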
However, in the above literature, the feature information extracted by machine learning algorithms is not comprehensive, CNN models require a lot of training time in experiments, and the hardware requirements are very high. Considering these problems, this paper proposes a method that combines CNN features with a machine learning classifier for facial expression recognition, which not only increases the coverage of the extracted feature information but also reduces the model's training time.
As advanced animals, human beings often communicate emotions through rich and varied expressions. Expression is a very sensitive signal: we can quickly perceive other people's situations and inner activities by observing their expressions, and observing changes in others' expressions is necessary in interpersonal communication. Only in this way can we better understand each other and quickly grasp other people's emotions and thoughts.
In an era of rapid development in science and Internet technology, the demand for human–computer interaction has increased quickly. A breakthrough in the recognition of human emotions would be a leapfrog development for the intelligent era of human–computer interaction.
Considering the complexity of the system and other issues, this paper does not conduct experiments under special conditions such as make-up and occlusion; further research is needed on recognizing facial expressions under these extreme conditions. In addition, for the convolutional neural network, it is necessary to collect as many samples as possible so that the trained network has good generalization performance.