A Brief Review of Facial Emotion Recognition Based on Visual Information

Facial emotion recognition (FER) is an important topic in the fields of computer vision and artificial intelligence owing to its significant academic and commercial potential. Although FER can be conducted using multiple sensors, this review focuses on studies that exclusively use facial images, because visual expressions are one of the main information channels in interpersonal communication. This paper provides a brief review of research in the field of FER conducted over the past decades. First, conventional FER approaches are described along with a summary of the representative categories of FER systems and their main algorithms. Deep-learning-based FER approaches using deep networks enabling “end-to-end” learning are then presented. This review also focuses on an up-to-date hybrid deep-learning approach combining a convolutional neural network (CNN) for the spatial features of an individual frame and long short-term memory (LSTM) for the temporal features of consecutive frames. In the later part of this paper, a brief review of publicly available evaluation metrics is given, and a comparison with benchmark results, which serve as a standard for the quantitative comparison of FER studies, is described. This review can serve as a brief guidebook for newcomers to the field of FER, providing basic knowledge and a general understanding of the latest state-of-the-art studies, as well as for experienced researchers looking for productive directions for future work.


Introduction
Facial emotions are important factors in human communication that help us understand the intentions of others. In general, people infer the emotional states of other people, such as joy, sadness, and anger, using facial expressions and vocal tone. According to different surveys [1,2], verbal components convey one-third of human communication, and nonverbal components convey two-thirds. Among the several nonverbal components, facial expressions carry emotional meaning and are therefore one of the main information channels in interpersonal communication. It is thus natural that research on facial emotion has been gaining a lot of attention over the past decades, with applications not only in the perceptual and cognitive sciences, but also in affective computing and computer animation [2].
Interest in automatic facial emotion recognition (FER) has also been increasing recently with the rapid development of artificial intelligence techniques, including in human-computer interaction (HCI) [3,4], virtual reality (VR) [5], augmented reality (AR) [6], advanced driver assistance systems (ADASs) [7], and entertainment [8,9]. (The expanded form of the acronym FER differs between papers, for example facial emotion recognition and facial expression recognition. In this paper, the term FER refers to facial emotion recognition, as this study deals with the general aspects of recognizing facial emotion expression.) Although various sensors, such as an electromyograph (EMG) and electrocardiogram, can be used for FER, a camera is the most promising type of sensor because it provides the most informative clues for FER and does not need to be worn.
This paper first divides research on automatic FER into two groups according to whether the features are handcrafted or generated through the output of a deep neural network.
In conventional FER approaches, FER is composed of three major steps, as shown in Figure 1: (1) face and facial component detection, (2) feature extraction, and (3) expression classification. First, a face image is detected from an input image, and facial components (e.g., eyes and nose) or landmarks are detected from the face region. Second, various spatial and temporal features are extracted from the facial components. Third, pre-trained facial expression (FE) classifiers, such as a support vector machine (SVM), AdaBoost, and random forest, produce the recognition results using the extracted features. In contrast to traditional approaches using handcrafted features, deep learning has emerged as a general approach to machine learning, yielding state-of-the-art results in many computer vision studies with the availability of big data [11].
Deep-learning-based FER approaches greatly reduce the dependence on face-physics-based models and other pre-processing techniques by enabling "end-to-end" learning in the pipeline directly from the input images [12]. Among the several deep-learning models available, the convolutional neural network (CNN), a particular type of deep-learning model, is the most popular. In CNN-based approaches, the input image is convolved with a collection of filters in the convolution layers to produce feature maps. The feature maps are then fed into fully connected layers, and the facial expression is recognized as belonging to a particular class based on the output of the softmax algorithm. Figure 2 shows the procedure used by CNN-based FER approaches.
FER can also be divided into two groups according to whether it uses frame or video images [13]. First, static (frame-based) FER relies solely on static facial features obtained by extracting handcrafted features from selected peak expression frames of image sequences. Second, dynamic (video-based) FER utilizes spatio-temporal features to capture the expression dynamics in facial expression sequences. Although dynamic FER is known to have a higher recognition rate than static FER because it provides additional temporal information, it does suffer from a few drawbacks. For example, the extracted dynamic features have different transition durations and different feature characteristics of the facial expression depending on the particular faces. Moreover, temporal normalization used to obtain expression sequences with a fixed number of frames may result in a loss of temporal scale information.
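The loss of temporal scale information caused by temporal normalization can be illustrated with a small sketch. It assumes the simplest form of normalization, nearest-index sampling at evenly spaced positions; the function name `resample_to_fixed_length` and the two intensity sequences are hypothetical and are not taken from any of the cited studies.

```python
def resample_to_fixed_length(frames, target_len):
    """Pick `target_len` frames at evenly spaced positions (nearest index)."""
    if target_len <= 0 or not frames:
        raise ValueError("need a non-empty sequence and a positive target length")
    if target_len == 1:
        return [frames[0]]
    last = len(frames) - 1
    return [frames[round(i * last / (target_len - 1))] for i in range(target_len)]


# Hypothetical expression intensities: same shape, different speeds.
fast = [0.0, 0.5, 1.0, 0.5, 0.0]                          # 5 frames
slow = [0.0, 0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25, 0.0]  # 9 frames

fast_fixed = resample_to_fixed_length(fast, 5)
slow_fixed = resample_to_fixed_length(slow, 5)
```

After normalization to five frames, the fast and slow expressions become indistinguishable, so a classifier that only sees the normalized sequences can no longer use the duration of the expression as a cue.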

Terminology
Before reviewing research related to FER, special terminology that plays an important role in FER research is listed below:

• The facial action coding system (FACS) is a system based on facial muscle changes that can characterize the facial actions expressing individual human emotions, as defined by Ekman and Friesen [14] in 1978. FACS encodes the movements of specific facial muscles called action units (AUs), which reflect distinct momentary changes in facial appearance [15].
• Facial landmarks (FLs) are visually salient points in facial regions such as the end of the nose, the ends of the eyebrows, and the mouth, as described in Figure 1b. The pairwise positions of two landmark points, or the local texture of a landmark, are used as a feature vector for FER. In general, FL detection approaches can be categorized into three types according to how the model is generated: active shape-based models (ASMs) and appearance-based models (AAMs), regression-based models that combine local and global models, and CNN-based methods. FL models are trained on appearance and shape variations starting from a coarse initialization. The initial shape is then moved to a better position step-by-step until convergence [16].
• Basic emotions (BEs) are seven basic human emotions: happiness, surprise, anger, sadness, fear, disgust, and neutral, as shown in Figure 3a.
• Compound emotions (CEs) are a combination of two basic emotions. Du et al. [17] introduced 22 emotions, including seven basic emotions, 12 compound emotions most typically expressed by humans, and three additional emotions (appall, hate, and awe). Figure 3b shows some examples of CEs.
• Micro expressions (MEs) indicate more spontaneous and subtle facial movements that occur involuntarily. They tend to reveal a person's genuine and underlying emotions within a short period of time. Figure 3c shows some examples of MEs.
• Facial action units (AUs) code the fundamental actions (46 AUs) of individual muscles or groups of muscles typically seen when producing the facial expressions of a particular emotion [17], as shown in Figure 3d. To recognize facial emotions, individual AUs are detected and the system classifies the facial category according to the combination of AUs. For example, if an image has been annotated by an algorithm as having AUs 1, 2, 25, and 26, the system will classify it as expressing an emotion of the "surprised" category, as indicated in Table 1. Table 1 shows the prototypical AUs observed in each basic and compound emotion category.
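The AU-based classification described in the last item can be sketched as a simple table lookup. This is a minimal illustration rather than the method of any cited study: only the "surprised" entry (AUs 1, 2, 25, and 26) comes from the text, and the names `PROTOTYPICAL_AUS` and `classify_emotion`, as well as the rule of preferring the largest matching AU set, are assumptions made for the example.

```python
# Only the "surprised" entry is taken from the text; a full system would
# load every row of Table 1 into this mapping.
PROTOTYPICAL_AUS = {
    "surprised": frozenset({1, 2, 25, 26}),
}


def classify_emotion(detected_aus, table=PROTOTYPICAL_AUS):
    """Return the category whose prototypical AUs are all present, else None."""
    detected = set(detected_aus)
    best_label, best_size = None, 0
    for label, required in table.items():
        # Prefer the most specific match (largest set of required AUs).
        if required <= detected and len(required) > best_size:
            best_label, best_size = label, len(required)
    return best_label
```

With this table, `classify_emotion([1, 2, 25, 26])` returns `"surprised"`, whereas a partial annotation such as `[1, 2]` matches no category and returns `None`.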

Contributions of this Review
Despite the long history of FER research, there are no comprehensive literature reviews on the topic. Some review papers [19,20] have focused solely on conventional research without introducing deep-learning-based approaches. Recently, Ghayoumi [21] presented a quick review of deep learning in FER. However, only a review of simple differences between conventional approaches and deep-learning-based approaches was provided. Therefore, this paper is dedicated to a brief literature review, from conventional FER to recent advanced FER. The main contributions of this review are as follows:

• The focus is on providing a general understanding of the state-of-the-art FER approaches, and helping new researchers understand the essential components and trends in the FER field.
• Various standard databases that include still images and video sequences for FER use are introduced, along with their purposes and characteristics.
• Key aspects are compared between conventional FER and deep-learning-based FER in terms of accuracy and resource requirements. Although deep-learning-based FER generally produces better FER accuracy than conventional FER, it also requires a large amount of processing capacity, such as a graphics processing unit (GPU) and central processing unit (CPU). Therefore, many conventional FER algorithms are still being used in embedded systems, including smartphones.
• A new direction and application for future FER studies are presented.

Organization of this Review
The remainder of this paper is organized as follows. In Section 2, conventional FER approaches are described along with a summary of the representative categories of FER systems and their main algorithms. In Section 3, advanced FER approaches using deep-learning algorithms are presented. In Sections 4 and 5, a brief review of publicly available FER databases and evaluation metrics is provided, together with a comparison with benchmark results. Finally, Section 6 offers some concluding remarks and a discussion of future work.

Conventional FER Approaches
For automatic FER systems, various types of conventional approaches have been studied. What these approaches have in common is that they detect the face region and extract geometric features, appearance features, or a hybrid of the two from the target face.
For the geometric features, the relationship between facial components is used to construct a feature vector for training [22,23]. Ghimire and Lee [23] used two types of geometric features based on the position and angle of 52 facial landmark points. First, the angle and Euclidean distance between each pair of landmarks within a frame are calculated; second, these distances and angles are subtracted from the corresponding distances and angles in the first frame of the video sequence. For the classifier, two methods are presented: multi-class AdaBoost with dynamic time warping, or an SVM on the boosted feature vectors.
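The two kinds of geometric features described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Ghimire and Lee [23]: landmarks are taken to be (x, y) tuples, the function names are hypothetical, the sign convention (current frame minus first frame) is a choice made here, and angle wrap-around is ignored.

```python
import math
from itertools import combinations


def pairwise_geometry(landmarks):
    """Distance and angle for every pair of (x, y) landmarks in one frame."""
    features = []
    for (x1, y1), (x2, y2) in combinations(landmarks, 2):
        features.append(math.hypot(x2 - x1, y2 - y1))  # Euclidean distance
        features.append(math.atan2(y2 - y1, x2 - x1))  # angle in radians
    return features


def displacement_from_first_frame(sequence):
    """Per-frame features relative to the first frame (current minus first)."""
    reference = pairwise_geometry(sequence[0])
    return [
        [value - ref for value, ref in zip(pairwise_geometry(frame), reference)]
        for frame in sequence
    ]
```

For 52 landmarks there are 52 × 51 / 2 = 1326 pairs, so each frame yields 2652 values (one distance and one angle per pair), which gives a sense of why boosting-based feature selection is applied before the SVM.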
The appearance features are usually extracted from the global face region [24] or from different face regions containing different types of information [25,26]. As an example of using global features, Happy et al. [24] utilized a local binary pattern (LBP) histogram of different block sizes from a global face region as the feature vectors, and classified various facial expressions using a principal component analysis (PCA). Although this method runs in real time, its recognition accuracy tends to be degraded because the feature vector cannot reflect local variations of the facial components. In contrast to global-feature-based approaches, region-based approaches exploit the fact that different face regions have different levels of importance. For example, the eyes and mouth contain more information than the forehead and cheeks. Ghimire et al. [27] extracted region-specific appearance features by dividing the entire face region into domain-specific local regions. Important local regions are determined using an incremental search approach, which results in a reduction of the feature dimensions and an improvement in the recognition accuracy.
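To make the LBP histogram feature concrete, the following sketch computes the basic 3 × 3 LBP code and a 256-bin histogram for a grayscale image stored as a list of rows. It shows the basic operator only and is not the block-based variant used in [24]; the neighbor ordering and the "greater than or equal" comparison are common conventions assumed here, and the function names are hypothetical.

```python
def lbp_code(image, row, col):
    """8-bit LBP code of an interior pixel: neighbour >= centre sets a bit."""
    centre = image[row][col]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[row + dr][col + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist
```

A region-based variant would call `lbp_histogram` on each local block (for example, the eye and mouth regions) and concatenate the resulting histograms into one feature vector.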
For hybrid features, some approaches [18,27] have combined geometric and appearance features to complement the weaknesses of the two approaches and provide even better results in certain cases.
In video sequences, many systems [18,22,23,28] measure the geometrical displacement of facial landmarks between the current frame and the previous frame as temporal features, and extract appearance features as the spatial features. The main difference between FER for still images and for video sequences is that the landmarks in the latter are tracked frame-by-frame, and the system generates new dynamic features from the displacement between the previous and current frames. Similar classification algorithms are then used for the video sequences, as described in Figure 1. To recognize micro-expressions, a high-speed camera is used to capture video sequences of the face. Polikovsky et al. [29] presented facial micro-expression recognition in video sequences captured by a 200 frames per second (fps) high-speed camera. This study divides the face into specific regions, and a 3D-gradient orientation histogram is then generated from the motion in each region for FER.
Apart from FER on 2D images, 3D and 4D (dynamic 3D) recordings are increasingly used in expression analysis research because of the problems in 2D images caused by inherent variations in pose and illumination. 3D facial expression recognition generally consists of feature extraction and classification. One thing to note in 3D is that dynamic and static systems are very different because of the nature of the data. Static systems extract features using statistical models such as a deformable model, an active shape model, analysis of 2D representations, and distance-based features. In contrast, dynamic systems utilize 3D image sequences for the analysis of facial expressions, for example through 3D motion-based features. For FER, 3D images are classified with similar conventional classification algorithms [29,30]. Although 3D-based FER has shown higher performance than 2D-based FER, 3D- and 4D-based FER also has certain problems, such as a high computational cost owing to the high resolution and frame rate, as well as the amount of 3D information involved.
Some researchers [31][32][33][34][35] have tried to recognize facial emotions using infrared images instead of visible light spectrum (VIS) images, because VIS images vary with the illumination conditions. Zhao et al. [31] used near-infrared (NIR) video sequences and LBP-TOP (local binary patterns from three orthogonal planes) feature descriptors. This study uses component-based facial features to combine the geometric and appearance information of the face. For FER, an SVM and sparse representation classifiers are used. Shen et al. [32] used infrared thermal videos, extracting horizontal and vertical temperature differences from different facial sub-regions. For FER, the AdaBoost algorithm with k-nearest neighbor weak classifiers is used. Szwoch and Pieniążek [33] recognized facial expression and emotion based only on the depth channel of the Microsoft Kinect sensor, without using a camera. This study uses local movements within the face area as features and recognizes facial expressions using relations between particular emotions. Sujono and Gunawan [34] used the Kinect motion sensor to detect the face region based on depth information, and an active appearance model (AAM) to track the detected face. The role of the AAM is to adjust the shape and texture model to a new face when its shape and texture vary from the training result. To recognize facial emotion, the change of key features in the AAM and fuzzy logic based on prior knowledge derived from FACS are used. Wei et al. [35] proposed FER using the color and depth information from a Kinect sensor together. This study extracts a facial feature point vector with a face tracking algorithm using the captured sensor data, and recognizes six facial emotions using a random forest algorithm.
In conventional approaches, features and classifiers are commonly determined by experts. For feature extraction, many well-known handcrafted features, such as HoG, LBP, and the distance and angle relations between landmarks, are used, and pre-trained classifiers, such as SVM, AdaBoost, and random forest, are used for FE recognition based on the extracted features. Conventional approaches require relatively less computing power and memory than deep-learning-based approaches. Therefore, these approaches are still being studied for use in real-time embedded systems because of their low computational complexity and high degree of accuracy [22]. However, the feature extraction and the classifiers must be designed by the programmer, and they cannot be jointly optimized to improve performance [36,37]. Table 2 summarizes the representative conventional FER approaches and their main advantages.

Deep-Learning Based FER Approaches
In recent decades, there has been a breakthrough in deep-learning algorithms applied to the field of computer vision, including the CNN and the recurrent neural network (RNN). These deep-learning-based algorithms have been used for feature extraction, classification, and recognition tasks. The main advantage of a CNN is that it completely removes or greatly reduces the dependence on physics-based models and/or other pre-processing techniques by enabling "end-to-end" learning directly from input images [44]. For these reasons, CNNs have achieved state-of-the-art results in various fields, including object recognition, face recognition, scene understanding, and FER.
A CNN contains three types of heterogeneous layers: convolution layers, max-pooling layers, and fully connected layers, as shown in Figure 2. First, convolutional layers take images or feature maps as the input, and convolve these inputs with a set of filter banks in a sliding-window manner to output feature maps that represent a spatial arrangement of the facial image. The weights of the convolutional filters within a feature map are shared, and the inputs of the feature map layer are locally connected [45]. Second, subsampling layers lower the spatial resolution of the representation by averaging or max-pooling the given input feature maps to reduce their dimensions and thereby ignore variations in small shifts and geometric distortions [45,46]. Finally, the fully connected layers at the end of a CNN structure compute the class scores over the entire original image. Most deep-learning-based methods [46][47][48][49] have adopted a CNN directly for AU detection.
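The sliding-window convolution and the max-pooling step described above can be sketched in plain Python. This is a single-channel illustration with hypothetical function names; it assumes no padding and stride 1 for the convolution, a non-overlapping 2 × 2 window for pooling, and, as in most deep-learning frameworks, applies the kernel without flipping it.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


# A 5x5 image with a vertical edge and a 2x2 edge-detecting filter.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]
feature_map = convolve2d_valid(image, kernel)   # 4x4 feature map
pooled = max_pool2x2(feature_map)               # 2x2 after pooling
```

The same filter weights are reused at every position, which is the weight sharing mentioned above, and the pooled map still responds to the edge even if it shifts by one pixel within a pooling window.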
Breuer and Kimmel [47] employed CNN visualization techniques to understand a model learned using various FER datasets, and demonstrated the capability of networks trained on emotion detection, across both datasets and various FER-related tasks. Jung et al. [48] used two different types of CNN: the first extracts temporal appearance features from the image sequences, whereas the second extracts temporal geometry features from temporal facial landmark points. These two models are combined using a new integration method to boost the performance of facial expression recognition.
Zhao et al. [49] proposed deep region and multi-label learning (DRML), which is a unified deep network. DRML introduces a region layer that uses feed-forward functions to induce important facial regions, and forces the learned weights to capture structural information of the face. The complete network is end-to-end trainable, and automatically learns representations robust to variations inherent within a local region.
As our review shows, many approaches have adopted a CNN directly for FER use. However, because CNN-based methods cannot reflect temporal variations in the facial components, a recent hybrid approach was developed that combines a CNN for the spatial features of individual frames with long short-term memory (LSTM) for the temporal features of consecutive frames. LSTM is a special type of RNN capable of learning long-term dependencies. LSTMs are explicitly designed to solve the long-term dependency problem using short-term memory. Like all recurrent neural networks, an LSTM has a chain-like form of repeating neural-network modules, although its repeating modules have a different structure, with four interacting layers, as shown in Figure 4 [50]:

• The cell state is a horizontal line running through the top of the diagram, as shown in Figure 4. An LSTM has the ability to remove or add information to the cell state.

The LSTM or RNN model for modeling sequential images has two advantages over standalone approaches. First, LSTM models are straightforward to fine-tune end-to-end when integrated with other models such as a CNN. Second, an LSTM supports both fixed-length and variable-length inputs or outputs [51].
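How an LSTM removes information from and adds information to the cell state can be sketched with a scalar cell. The gate structure used here (forget, input, and output gates plus a candidate value) is the standard textbook LSTM formulation and is an assumption of this example, since the text above describes only the cell state; the names `lstm_step`, `run_lstm`, and the layout of `params` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    `params` maps each of "f", "i", "g", "o" to a tuple (w_x, w_h, bias).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))      # forget gate: how much old cell state to keep
    i = sigmoid(pre("i"))      # input gate: how much new information to add
    g = math.tanh(pre("g"))    # candidate values for the cell state
    o = sigmoid(pre("o"))      # output gate: how much cell state to expose
    c = f * c_prev + i * g     # remove, then add, information
    h = o * math.tanh(c)
    return h, c


def run_lstm(sequence, params):
    """Apply the cell to a sequence of any length; return the final (h, c)."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h, c
```

Because `run_lstm` simply loops over its input, the same parameters handle sequences of any length, which corresponds to the second advantage listed above.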
Representative studies using a combination of a CNN and an LSTM (RNN) include the following. Kahou et al. [11] proposed a hybrid RNN-CNN framework for propagating information over a sequence using a continuously valued hidden-layer representation. In this work, the authors presented a complete system for the 2015 Emotion Recognition in the Wild (EmotiW) Challenge [52], and showed that a hybrid CNN-RNN architecture for facial expression analysis can outperform a previously applied CNN approach using temporal averaging for aggregation.
Kim et al. [13] utilized representative expression-states (e.g., the onset, apex, and offset of expressions), which can be specified in facial sequences regardless of the expression intensity. In the first part, the spatial image characteristics of the representative expression-state frames are learned using a CNN. In the second part, the temporal characteristics of the spatial feature representation from the first part are learned using an LSTM.
Chu et al. [53] proposed a multi-level facial AU detection algorithm combining spatial and temporal features. First, the spatial representations are extracted using a CNN, which is able to reduce person-specific biases caused by handcrafted descriptors (e.g., HoG and Gabor). To model the temporal dependencies, LSTMs are stacked on top of these representations, regardless of the lengths of the input video sequences. The outputs of CNNs and LSTMs are further aggregated into a fusion network to produce a per-frame prediction of 12 AUs.
Hasani and Mahoor [54] proposed the 3D Inception-ResNet architecture followed by an LSTM unit, which together extract the spatial relations within the facial images and the temporal relations between different frames in a video sequence. Facial landmark points are also used as inputs of this network, emphasizing the importance of facial components rather than facial regions, which may not contribute significantly to generating facial expressions.
Graves et al. [55] used a recurrent network to consider the temporal dependencies present in the image sequences during classification. In experimental results using two types of LSTM (bidirectional LSTM and unidirectional LSTM), this study proved that a bidirectional network provides a significantly better performance than a unidirectional LSTM.
Jain et al. [56] proposed a multi-angle optimal pattern-based deep learning (MAOP-DL) method to rectify the problem of sudden changes in illumination, and find the proper alignment of the feature set by using multi-angle-based optimal configurations. Initially, this approach subtracts the background and isolates the foreground from the images, and then extracts the texture patterns and the relevant key features of the facial points. The relevant features are then selectively extracted, and an LSTM-CNN is employed to predict the required label for the facial expressions.
Unlike conventional approaches, in which features and classifiers are designed by experts, deep-learning-based approaches learn both with deep neural networks, extracting optimal features with the desired characteristics directly from the data using deep convolutional neural networks. However, it is not easy to collect enough training data of facial emotions under different conditions to train deep neural networks. Moreover, deep-learning-based approaches require higher-end and larger-scale computing devices than conventional approaches for training and testing [35]. Therefore, it is necessary to reduce the computational burden of deep-learning algorithms at inference time.
Among the many approaches based on a standalone CNN or combination of LSTM and CNN, some representative works are shown in Table 3. As determined through our review conducted thus far, the general frameworks of the hybrid CNN-LSTM and CNN-RNN-based FER approaches have similar structures, as shown in Figure 5. In summary, the basic framework of CNN-LSTM (RNN) is to combine an LSTM with a deep hierarchical visual feature extractor such as a CNN model. Therefore, this hybrid model can learn to recognize and synthesize temporal dynamics for tasks involving sequential images. As shown in Figure 5, each visual feature determined through a CNN is passed to the corresponding LSTM, and produces a fixed or variable-length vector representation. The outputs are then passed into a recurrent sequence-learning module. Finally, the predicted distribution is computed by applying softmax [51,53].
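The data flow of this hybrid framework can be illustrated with a minimal sketch in plain Python. It is not the implementation of any of the reviewed works: the weights are random and untrained, `frame_features` is a hypothetical stand-in for a trained CNN, and the choice of seven output classes (six basic expressions plus neutral) and of classifying from the last hidden state are assumptions made only for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, vec):
    # weights is a list of rows; returns weights @ vec + bias
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class LSTMCell:
    """Standard LSTM cell with input (i), forget (f), output (o) gates and candidate (g)."""
    def __init__(self, rng, input_size, hidden_size):
        self.hidden_size = hidden_size
        self.W = {g: rand_matrix(rng, hidden_size, input_size + hidden_size) for g in "ifog"}
        self.b = {g: [0.0] * hidden_size for g in "ifog"}

    def step(self, x, h, c):
        z = x + h                        # concatenate the frame feature and previous hidden state
        i = [sigmoid(v) for v in linear(self.W["i"], self.b["i"], z)]
        f = [sigmoid(v) for v in linear(self.W["f"], self.b["f"], z)]
        o = [sigmoid(v) for v in linear(self.W["o"], self.b["o"], z)]
        g = [math.tanh(v) for v in linear(self.W["g"], self.b["g"], z)]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

def frame_features(frame):
    # Hypothetical stand-in for a CNN: one value per image row (mean intensity).
    return [sum(row) / len(row) for row in frame]

def classify_sequence(frames, cell, W_out, b_out):
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for frame in frames:                 # spatial feature per frame, temporal modelling across frames
        h, c = cell.step(frame_features(frame), h, c)
    return softmax(linear(W_out, b_out, h))

rng = random.Random(0)
num_classes, hidden, height, width = 7, 8, 4, 4
cell = LSTMCell(rng, input_size=height, hidden_size=hidden)
W_out = rand_matrix(rng, num_classes, hidden)
b_out = [0.0] * num_classes
frames = [[[rng.random() for _ in range(width)] for _ in range(height)] for _ in range(5)]
probs = classify_sequence(frames, cell, W_out, b_out)
```

Here `probs` is a distribution over the assumed seven classes for a five-frame toy sequence. A per-frame prediction, as in the AU detection setting described above, would instead apply the softmax layer to the hidden state at every step.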

Brief Introduction to FER Database
In the field of FER, numerous databases have been used for comparative and extensive experiments. Traditionally, human facial emotions have been studied using either 2D static images or 2D video sequences. A 2D-based analysis has difficulty handling large pose variations and subtle facial behaviors. The analysis of 3D facial emotions will facilitate an examination of the fine structural changes inherent in spontaneous expressions [40]. Therefore, this sub-section briefly introduces some popular databases related to FER consisting of 2D and 3D video sequences and still images:

Table 4 shows a summary of these publicly available databases. Figure 6 shows examples of the nine databases for FER with 2D and 3D images and video sequences.
Figure 6. Examples of nine representative databases related to FER. Databases (a) through (g) support 2D still images and 2D video sequences, and databases (h) through (i) support 3D video sequences.
Unlike the databases described above, the MPI facial expression database [60] collects a large variety of natural emotional and conversational expressions under the assumption that people understand emotions by analyzing both the conversational expressions and the emotional expressions. This database consists of more than 18,800 samples of video sequences from 10 female and 9 male models displaying various facial expressions recorded from one frontal and two lateral views.
Recently, other sensors, such as NIR cameras, thermal cameras, and Kinect sensors, have attracted interest in FER research because visible-light images change easily with the environmental illumination conditions. As a database captured with an NIR camera, the Oulu-CASIA NIR&VIS facial expression database [31] consists of six expressions from 80 people between 23 and 58 years old; 73.8% of the subjects are male. The natural visible and infrared facial expression (USTC-NVIE) database [32] collected both spontaneous and posed expressions of more than 100 subjects simultaneously using a visible and an infrared thermal camera. The facial expressions and emotions database (FEEDB) is a multimodal database of facial expressions and emotions recorded using a Microsoft Kinect sensor. It contains 1650 recordings of 50 persons posing for 33 different facial expressions and emotions [33].
As described here, various sensors other than the camera sensor are used for FER, but there is a limit to how far the recognition performance can be improved with only one sensor. Therefore, attempts to increase FER performance by combining various sensors are expected to continue in the future.

Performance Evaluation of FER
Evaluation metrics are crucial for FER approaches because they provide a standard for a quantitative comparison. In this section, a brief review of publicly available evaluation metrics and a comparison with the benchmark results are provided.

Subject-Independent and Cross-Database Tasks
Many approaches evaluate accuracy using two different experimental protocols: subject-independent and cross-database tasks [55]. First, a subject-independent task splits each database into training and validation sets in a strict subject-independent manner. This task is also called K-fold cross-validation. The purpose of K-fold cross-validation is to limit problems such as overfitting and to provide insight into how the model will generalize to an independent unknown dataset [61]. With the K-fold cross-validation technique, each dataset is evenly partitioned into K folds with exclusive subjects. Then, a model is iteratively trained using K-1 folds and evaluated on the remaining fold, until all subjects are tested. Validation is generally conducted using less than 20% of the training subjects. The accuracy is estimated by averaging the recognition rate over the K folds. For example, in ten-fold cross-validation, nine folds are used for training, and one fold is used for testing. After this process is performed ten times, each time with a different test fold, the accuracies of the ten results are averaged and defined as the classifier performance.
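The subject-independent protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code of any reviewed study: subjects are assigned to folds round-robin, and the function names, the toy labels, and the majority-class "classifier" are all hypothetical placeholders.

```python
from collections import defaultdict

def subject_independent_folds(subject_ids, k):
    """Yield (train_idx, test_idx) pairs in which no subject appears on both sides."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    subjects = sorted(by_subject)
    if not 2 <= k <= len(subjects):
        raise ValueError("k must be between 2 and the number of subjects")
    # Round-robin assignment keeps the number of subjects per fold roughly even.
    folds = [subjects[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        test_idx = [i for s in held_out for i in by_subject[s]]
        train_idx = [i for s in subjects if s not in held for i in by_subject[s]]
        yield train_idx, test_idx

def cross_validate(samples, labels, subject_ids, k, train_fn, predict_fn):
    """Train on K-1 folds, test on the remaining fold, and average the recognition rate."""
    rates = []
    for train_idx, test_idx in subject_independent_folds(subject_ids, k):
        model = train_fn([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        rates.append(correct / len(test_idx))
    return sum(rates) / len(rates)

# Toy data: integers stand in for face images; two samples per subject.
subject_ids = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]
samples = list(range(len(subject_ids)))
labels = ["happy", "sad", "happy", "happy", "sad", "happy", "happy", "sad"]

def train_majority(xs, ys):
    # Placeholder model: remember the most frequent training label.
    return max(sorted(set(ys)), key=ys.count)

def predict_majority(model, x):
    return model

folds = list(subject_independent_folds(subject_ids, k=4))
mean_rate = cross_validate(samples, labels, subject_ids, 4, train_majority, predict_majority)
```

Splitting by subject rather than by sample is the point of the protocol: if frames of the same person appeared in both the training and test folds, the measured recognition rate could reflect identity cues rather than expression.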
The second protocol is a cross-database task. In this task, one dataset is used entirely for testing the model, and the remaining datasets listed in Table 4 are used to train the model. The model is iteratively trained using K-1 datasets and evaluated on the remaining dataset repeatedly until all datasets have been tested. The accuracy is estimated by averaging the recognition rate over K datasets in a manner similar to K-fold cross-validation.
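The cross-database protocol follows the same leave-one-out pattern, with whole datasets in place of folds. The sketch below is illustrative only: the dataset names "A", "B", and "C", the scalar features, and the nearest-class-mean classifier are hypothetical placeholders, not data or methods from the reviewed works.

```python
def cross_database_eval(datasets, train_fn, predict_fn):
    """datasets maps a name to (samples, labels); each dataset is held out once for testing."""
    rates = {}
    for held_out in datasets:
        train_x, train_y = [], []
        for name, (xs, ys) in datasets.items():
            if name != held_out:          # train on all remaining datasets
                train_x.extend(xs)
                train_y.extend(ys)
        model = train_fn(train_x, train_y)
        test_x, test_y = datasets[held_out]
        correct = sum(predict_fn(model, x) == y for x, y in zip(test_x, test_y))
        rates[held_out] = correct / len(test_y)
    mean_rate = sum(rates.values()) / len(rates)
    return rates, mean_rate

def train_nearest_mean(xs, ys):
    # Placeholder model: mean feature value per class.
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_nearest_mean(model, x):
    return min(model, key=lambda y: abs(model[y] - x))

# Toy datasets with a single scalar feature per sample.
datasets = {
    "A": ([0.1, 0.9, 0.2, 0.8], ["neutral", "happy", "neutral", "happy"]),
    "B": ([0.0, 1.0, 0.3], ["neutral", "happy", "neutral"]),
    "C": ([0.7, 0.15], ["happy", "neutral"]),
}
rates, mean_rate = cross_database_eval(datasets, train_nearest_mean, predict_nearest_mean)
```

Reporting the per-dataset rates alongside the mean is a design choice of this sketch; it makes it visible when one held-out dataset is much harder than the others.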

Evaluation Metrics
FER approaches are commonly evaluated using four metrics: precision, recall, accuracy, and F1-score.
The precision (P) is defined as TP/(TP + FP), and the recall (R) is defined as TP/(TP + FN), where TP is the number of true positives in the dataset, FN is the number of false negatives, and FP is the number of false positives. The precision is the fraction of automatic annotations of emotion i that are correctly recognized. The recall is the number of correct recognitions of emotion i over the actual number of images with emotion i [18]. The accuracy is the ratio of true outcomes (both true positives and true negatives, TN) to the total number of cases examined:
$$\mathrm{ACC} = \frac{TP + TN}{\text{Total population}} \quad (1)$$
Another metric, the F1-score, is divided into two metrics depending on whether they use spatial or temporal data: frame-based F1-score (F1-frame) and event-based F1-score (F1-event). Each metric captures different properties of the results. This means that a frame-based F1-score has predictive power in terms of spatial consistency, whereas an event-based F1-score has predictive power in terms of temporal consistency [62]. A frame-based F1-score is defined as
$$\text{F1-frame} = \frac{2RP}{R + P} \quad (2)$$
An event-based F1-score is used to measure the emotion recognition performance at the segment level because emotions occur as a temporal signal:
$$\text{F1-event} = \frac{2 \cdot ER \cdot EP}{ER + EP} \quad (3)$$
where ER and EP are the event-based recall and precision. ER is the ratio of correctly detected events over the true events, while EP is the ratio of correctly detected events over the detected events. F1-event considers that there is an event agreement if the overlap is above a certain threshold [63].
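The metrics above can be computed as in the following sketch. The frame-level part follows the definitions of P, R, and accuracy directly. For the event-level part, the text does not specify how the overlap is measured, so the intersection-over-union of frame intervals and the default threshold of 0.5 are assumptions of this example, as are the function names and the toy labels.

```python
def frame_metrics(y_true, y_pred, emotion):
    """Precision, recall, accuracy, and F1-frame for one emotion, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == emotion and p == emotion for t, p in pairs)
    fp = sum(t != emotion and p == emotion for t, p in pairs)
    fn = sum(t == emotion and p != emotion for t, p in pairs)
    tn = sum(t != emotion and p != emotion for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1_frame": f1}

def interval_iou(a, b):
    """Overlap of two events given as inclusive (start_frame, end_frame) pairs (assumed measure)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def f1_event(true_events, detected_events, threshold=0.5):
    """Event-based F1 from event recall (ER) and event precision (EP)."""
    def matched(events, others):
        return sum(any(interval_iou(e, o) > threshold for o in others) for e in events)
    er = matched(true_events, detected_events) / len(true_events) if true_events else 0.0
    ep = matched(detected_events, true_events) / len(detected_events) if detected_events else 0.0
    return 2 * er * ep / (er + ep) if er + ep else 0.0

y_true = ["happy", "happy", "sad", "sad", "happy", "neutral"]
y_pred = ["happy", "sad", "sad", "sad", "happy", "happy"]
happy = frame_metrics(y_true, y_pred, "happy")

true_events = [(0, 9), (20, 29)]
detected_events = [(2, 9), (40, 45)]
event_score = f1_event(true_events, detected_events)
```

In the toy example, one of the two true events is detected and one of the two detections is correct, so ER and EP are both one half.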

Evaluation Results
To show a direct comparison between conventional handcrafted-feature-based approaches and deep-learning-based approaches, this review lists public results on the MMI dataset. Table 5 shows the comparative recognition rate of six conventional approaches and six deep-learning-based approaches. As shown in Table 5, deep-learning-based approaches outperform conventional approaches with an average of 72.65% versus 63.2%. Among the conventional FER approaches, the method in [68] achieves the highest performance. This study computed the difference information between the peak expression face and its intra-class variation in order to reduce the effect of facial identity on the feature extraction. Because the feature extraction is robust to face rotation and misalignment, this study achieves more accurate FER than the other conventional methods. Among the deep-learning-based approaches, two have a relatively higher performance compared to several state-of-the-art methods. The complex CNN proposed in [72] consists of two convolutional layers, each followed by max pooling, and four Inception layers. This network has a single-component architecture that takes registered facial images as the input and classifies them into one of the six basic expressions or the neutral expression. The highest-performing approach [13] consists of two parts. In the first part, the spatial image characteristics of the representative expression-state frames are learned using a CNN. In the second part, the temporal characteristics of the facial expression are learned by applying an LSTM to the spatial feature representation obtained in the first part. The accuracy of this complex hybrid approach using spatio-temporal feature representation learning indicates that FER performance is largely affected not only by the spatial changes but also by the temporal changes.
Although deep-learning-based FER approaches have achieved great success in experimental evaluations, a number of issues remain that deserve further investigation:
• A large-scale dataset and massive computing power are required for training as the structure becomes increasingly deep.
• Large numbers of manually collected and labeled datasets are needed.
• Large memory is demanded, and the training and testing are both time consuming. These memory demands and computational complexities make deep learning ill-suited for deployment on mobile platforms with limited resources [73].
• Considerable skill and experience are required to select suitable hyper-parameters, such as the learning rate, kernel sizes of the convolutional filters, and the number of layers. These hyper-parameters have internal dependencies that make them particularly expensive to tune.
• Although they work quite well for various applications, a solid theory of CNNs is still lacking, and thus users essentially do not know why or how they work.

Conclusions
This paper presented a brief review of FER approaches. As we described, such approaches can be divided into two main streams. Conventional FER approaches consist of three steps, namely, face and facial component detection, feature extraction, and expression classification; the classification algorithms used in conventional FER include SVM, Adaboost, and random forest. By contrast, deep-learning-based FER approaches greatly reduce the dependence on face-physics-based models and other pre-processing techniques by enabling "end-to-end" learning in the pipeline directly from the input images. As a particular type of deep learning, a CNN visualizes the input images to help understand the model learned through various FER datasets, and demonstrates the capability of networks trained on emotion detection, across both the datasets and various FER-related tasks. However, because CNN-based FER methods cannot reflect the temporal variations in the facial components, hybrid approaches have been proposed that combine a CNN for the spatial features of individual frames with an LSTM for the temporal features of consecutive frames. A few recent studies have provided an analysis of a hybrid CNN-LSTM (RNN) architecture for facial expressions that can outperform previously applied CNN approaches using temporal averaging for aggregation. However, deep-learning-based FER approaches still have a number of limitations, including the need for large-scale datasets, massive computing power, and large amounts of memory, as well as long training and testing times. Moreover, although a hybrid architecture has shown a superior performance, micro-expressions remain a challenging task to solve because they are more spontaneous and subtle facial movements that occur involuntarily. This paper also briefly introduced some popular databases related to FER consisting of both video sequences and still images.
In traditional datasets, human facial expressions have been studied using either static 2D images or 2D video sequences. However, because a 2D-based analysis has difficulty handling large variations in pose and subtle facial behaviors, recent datasets have considered 3D facial expressions to better facilitate an examination of the fine structural changes inherent to spontaneous expressions.
Furthermore, evaluation metrics of FER-based approaches were introduced to provide standard metrics for comparison. Evaluation metrics have been widely studied in the field of recognition, and precision and recall are mainly used. However, a new evaluation method for recognizing consecutive facial expressions, or for applying micro-expression recognition to moving images, should be proposed.
Although studies on FER have been conducted over the past decades, in recent years the performance of FER has been significantly improved through the combination of deep-learning algorithms. Because FER is an important way to infuse emotion into machines, it is encouraging that various studies on its future application are being conducted. If emotion-oriented deep-learning algorithms can be developed and combined with additional Internet-of-Things sensors in the future, it is expected that FER can improve its current recognition rate, including even spontaneous micro-expressions, to the same level as human beings.