Facial Expression Recognition Using Computer Vision: A Systematic Review

Abstract: Emotion recognition has attracted major attention in numerous fields because of its relevant applications in the contemporary world: marketing, psychology, surveillance, and entertainment are some examples. An emotion can be recognized in several ways; however, this paper focuses on facial expressions and presents a systematic review on the matter. A total of 112 papers published in ACM, IEEE, BASE, and Springer between January 2006 and April 2019 regarding this topic were extensively reviewed. Their most used methods and algorithms, such as face detection, smoothing, Principal Component Analysis (PCA), Local Binary Patterns (LBP), Optical Flow (OF), and Gabor filters, among others, are first introduced and summarized for a better understanding. This review identified a clear difficulty in translating the high facial expression recognition (FER) accuracy in controlled environments to uncontrolled and pose-variant environments. Future efforts in the FER field should be put into multimodal systems that are robust enough to face the adversities of real-world scenarios. A thorough analysis of the research done on FER in Computer Vision based on the selected papers is presented. This review aims not only to become a reference for future research on emotion recognition, but also to provide an overview of the work done in this topic for potential readers.


Introduction
Emotion recognition is being actively explored in Computer Vision research. With the recent rise and popularization of Machine Learning [1] and Deep Learning [2] techniques, the potential to build intelligent systems that accurately recognize emotions has come closer to reality. However, this problem has proven increasingly complex with the progress of fields that are directly linked with emotion recognition, such as psychology and neurology. Micro-expressions, electroencephalography (EEG) signals, gestures, tone of voice, facial expressions, and surrounding context are some of the factors that have a powerful impact when identifying emotions in a human [3]. When all of these variables are pieced together with the limitations and problems of current Computer Vision algorithms, emotion recognition can get highly complex.
Facial expressions are the main focus of this systematic review. Generally, an FER system consists of the following steps: image acquisition, pre-processing, feature extraction, and classification or regression, as shown in Figure 1. To obtain a proper facial expression classification, it is highly desirable to provide the most relevant data to the classifier, in the best possible conditions. To do that, a conventional FER system first pre-processes the input image. One pre-processing step that is common among most reviewed papers is face detection.
Face detection techniques are able to create bounding boxes that delimit detected faces, which are the desired regions of interest (ROIs) for a conventional FER system. This task is still challenging, and it is not guaranteed that all faces are going to be detected in a given input image. This is especially true when acquiring images from an uncontrolled environment, where there may be movement, harsh lighting conditions, different poses, and great distances, among other factors [4].
When the faces are properly detected, a conventional FER system will process the retrieved ROIs in order to prepare the data that will be fed into the classifier. Normally, this pre-processing step is divided into several substeps, such as intensity normalization for illumination changes, noise filters for image smoothing, data augmentation (DA) [5] to increase the training data, rotation correction for rotated faces, image resizing for different ROI sizes, and image cropping for better background filtering, among others. After the pre-processing, one can retrieve relevant features from the pre-processed ROIs. There are numerous features that can be selected, such as Action Units (AUs) [6], motion of certain facial landmarks, distance between facial landmarks, facial texture, gradient features, and so forth. Then, these features are fed into a classifier. Generally, the classifiers used in an FER system are Support Vector Machines (SVMs) [7] or Convolutional Neural Networks (CNNs) [8].
This systematic review is organized as follows: Section 2 presents the paper selection criteria for this systematic review; Section 3 presents the most popular FER databases; Section 4 presents the most popular pre-processing methods in FER; Section 5 presents the most popular feature extraction methods in FER; Section 6 presents the most popular classifiers in FER; Section 7 presents the most relevant results obtained by the selected works, as well as a discussion; Section 8 presents insights on the Emotion Recognition in the Wild Challenge (EmotiW); Section 9 presents the conclusion and last remarks of this systematic review.

Selection Criteria
It was decided to search the literature from January 2006 (when publications on machine learning in emotion recognition started emerging) to April 2019. The databases used were ACM [9], IEEE [10], BASE [11], and Springer [12]. These databases were chosen because of their relevance in Computer Vision.
The search strategy covered Open Access articles, journals, conference objects, and manuscripts. There are several keywords and variants that cover the FER field; therefore, it was decided to limit the searched keywords to the two most popular within this topic: "Emotion Recognition" or "Facial Expression Recognition". The full text of the works was considered during the search process. All the retrieved papers were assessed, resulting in a total of 783 non-duplicate papers. The selection criteria for this systematic review only include works that explore FER using Computer Vision. First, all the titles and abstracts were quickly appraised in order to exclude studies that were not relevant to the scope of this systematic review or did not meet the selection criteria, resulting in 185 papers. Second, the full text of the remaining papers was carefully evaluated in order to filter out the remaining studies that did not meet the selection criteria, resulting in 112 papers. The following categories of papers were excluded: 1. Theoretical studies, 2. Studies that are not related to Computer Vision, 3. Surveys and theses, 4. Dataset publications, 5. Older iterations of the same studies.
Figure 2 sums up the search and exclusion steps of this systematic review. It is important to mention that some included papers did not present results, were not conclusive, or followed procedures and/or reported results that deviate from the mainstream scope; therefore, they were not included in Section 7. However, those papers still contributed to the remaining sections.

FER Databases
In order to build FER systems that are able to obtain results which can further be compared with related works, researchers working in the FER field have numerous databases at their disposal. Most databases are built on 2D static images or 2D video sequences; however, there are some databases that contain 3D images. An FER system built on a 2D approach has the limitation of handling different poses poorly, since most 2D databases only contain frontal faces. A 3D approach is potentially capable of handling the pose variation problem. Most FER databases are labeled with the six basic emotions (anger, disgust, fear, happiness, sadness, and surprise), plus the neutral expression. Some FER databases are built on controlled environments (generally inside a laboratory with controlled lighting conditions), while others are built on uncontrolled or wild environments. Furthermore, the subjects of some FER databases were asked to pose certain emotions towards a reference, while others tried to stimulate spontaneous and genuine facial expressions. This section introduces the most popular databases used in the reviewed works:

•
The Extended Cohn-Kanade database (CK+) [50]: contains 593 image sequences of posed and non-posed emotions. The 123 participants were 18 to 50 years of age; 69% were female, 81% Euro-American, 13% Afro-American, and 6% from other groups. The images were digitized into either 640 × 490 or 640 × 480 resolution and are mostly grayscale. Each sequence was built on frontal views and 30-degree views, starting with a neutral expression and ending at the peak emotion (last frame of the sequence). Most sequences are labeled with eight emotions: anger, disgust, contempt, fear, neutral, happiness, sadness, and surprise.

•
The Japanese Female Facial Expression database (JAFFE) [51]: contains 213 images of the six basic emotions, plus the neutral expression, posed by 10 Japanese female models. Each image has been labeled by 60 Japanese subjects.

•
Binghamton University 3D Facial Expression database (BU-3DFE) [52]: contains 606 3D facial expression sequences captured from 101 subjects. The texture video has a resolution of about 1040 × 1329 pixels per frame. The resulting database consists of 58 female and 43 male subjects, with a large variety of ethnic/racial ancestries. This database was built on the six basic emotions, plus the neutral expression.

•
Facial Expression Recognition 2013 database (FER-2013) [53]: was created using the Google image search API to search for images of faces that match a set of 184 emotion-related keywords like "blissful", "enraged", etc. These keywords were combined with words related to gender, age, or ethnicity, leading to 35,887 grayscale images with a 48 × 48 resolution, mapped into the six basic emotions, plus the neutral expression.

•
Emotion Recognition in the Wild database (EmotiW) [54]: contains two sub-databases, Acted Facial Expression in the Wild (AFEW) and Static Facial Expression in the Wild (SFEW). AFEW contains videos (image sequences including audio) and SFEW contains static images. This database was built on the six basic emotions, plus the neutral expression, and the image size is 128 × 128.

•
MMI database [55]: contains over 2900 videos and high-resolution still images of 75 subjects. It is fully annotated for the presence of AUs in videos, and partially coded on frame level, indicating for each frame whether an AU is in the neutral, onset, apex, or offset phase. This database was built on six emotions: anger, disgust, fear, happiness, sadness, and surprise.

•
eNTERFACE'05 Audiovisual Emotion database [56]: contains 42 subjects from 14 different nationalities. Among the 42 subjects, 81% were men and the remaining 19% were women. In addition, 31% of the subjects wore glasses, while 17% had a beard. The database consists of video sequences (including audio) with a 720 × 576 resolution. This database was built on six emotions: anger, disgust, fear, happiness, sadness, and surprise.

•
Karolinska Directed Emotional Faces database (KDEF) [57]: contains a set of 4900 pictures of human facial expressions. The set contains 70 individuals (35 females and 35 males) displaying the six basic emotions, plus the neutral expression. Each expression is viewed from five different angles and was photographed in two sessions.

•
Radboud Faces Database (RaFD) [58]: contains a set of pictures of 67 models (including Caucasian males and females and Moroccan Dutch males) displaying eight emotional expressions (anger, disgust, contempt, fear, neutral, happiness, sadness, and surprise), amounting to 120 images per model. Each emotion is shown with three different gaze directions and all pictures were taken from five camera angles simultaneously. The image size is 1024 × 681.
Figure 4 shows some examples of the introduced FER databases. Table 1 sums up these databases and provides their websites.

Pre-Processing
Pre-processing is one of the most important phases, not only in FER systems, but in any Machine Learning based system. Analyzing raw data without any kind of screening can lead to unwanted results. This is why assuring the quality of data before extracting its relevant features is vital. The following subsections present the most popular and effective pre-processing techniques in the reviewed works.

Face Detection
Generally, face detection is the very beginning of an FER system. This technique is responsible for selecting the ROI of an input image that will be fed to the next steps of the FER system. Most reviewed papers used the classic Viola-Jones face detector [68] from 2004.
The Viola-Jones face detector is a Machine Learning based approach where a cascade function is trained from a large number of positive images (images with faces) and negative images (images without faces). Haar features are used in this algorithm, and they are applied to all training images to find the best threshold for classifying a region as a positive or negative detection. Figure 5 shows how these features are selected by AdaBoost (a learning algorithm that selects a small number of critical visual features from a very large set of potential features). Two examples of features are shown in the top row and then overlaid on a training face in the bottom row. Basically, these features measure the difference in intensity between the white region and the black region. Typically, one region is darker than the other, and that is how the features are selected to detect faces.
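To make the intensity difference measured by a Haar feature concrete, the following is a minimal sketch in Python with NumPy. It evaluates one two-rectangle feature with an integral image (summed-area table), the structure Viola-Jones uses to sum a rectangle in constant time. The function names and the toy 4 × 4 patch are assumptions made for illustration; the sketch does not include AdaBoost selection or the cascade.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle feature: lower half sum minus upper half sum."""
    half = height // 2
    upper = rect_sum(ii, top, left, half, width)
    lower = rect_sum(ii, top + half, left, half, width)
    return lower - upper

# Toy patch (assumed values): a dark band, such as the eye region,
# above a brighter band, such as the cheeks.
patch = np.array([[10, 10, 10, 10],
                  [10, 10, 10, 10],
                  [200, 200, 200, 200],
                  [200, 200, 200, 200]], dtype=np.float64)
ii = integral_image(patch)
feature_value = two_rect_feature(ii, 0, 0, 4, 4)
print(feature_value)
```

A large absolute feature value indicates a strong contrast between the two regions; in the full detector, each selected feature is compared with a learned threshold.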
Very few works used other face detectors, such as the Dlib library [69] and Multi-task Cascaded Convolutional Networks (MTCNN) [70]. The Dlib face detector uses an ensemble of regression trees to regress the location of 68 facial landmarks from a sparse subset of intensity values extracted from an input image and, consequently, detect where the faces are. Figure 6 shows a detected face from the CK+ database and its 68 facial landmarks using the Dlib library. As for MTCNN, it goes through three stages to output a proper face detection, and, in each stage, the face that is being analyzed goes through a CNN. The first stage obtains the candidate windows and their bounding box regression vectors, merging highly overlapped candidates [71]. The second stage feeds those candidates to another CNN, which rejects a large number of false candidates. The third stage is similar to the second one, but it also outputs the positions of five facial landmarks. Figure 7 shows a detected face from the CK+ database and its five facial landmarks using MTCNN. Despite Viola-Jones being the most used face detector in the reviewed papers, MTCNN outperforms it across several challenging benchmarks for face detection and face alignment, while keeping real-time performance. The main reason Viola-Jones was the most used face detector is that most reviewed works tested their FER systems on controlled environment databases, where Viola-Jones can detect the faces without difficulty. Works that tested their FER systems on uncontrolled environment databases generally used MTCNN.

Geometric Transformations
Even if the faces are detected in an input image, it does not mean that they are in proper conditions to be analyzed. Some problems that can arise from these detected ROIs are rotation, scale, and noise [72]. One has to guarantee that the face to be classified is as geometrically similar as possible to the faces used when training the classifier. That way, the classifier is more likely to produce trustworthy results.
Some reviewed works applied geometric transformations to faces that were not detected in the best conditions. Starting with rotation, the most popular way to correct it is by using the facial landmarks outputted by the face detector. Typically, the reviewed works considered two facial landmarks that form an angle of zero with the horizontal axis when a face is aligned. To align a rotated face, a rotation transformation is applied until the angle formed by those two facial landmarks and the horizontal axis is zero. Figure 8 shows a rotation correction on a face from the CK+ database. As for the scale problem, it is caused by the different distances at which faces can be detected: closer faces have bigger ROIs than farther ones. Since it is crucial to feed the next stages of the FER system with ROIs of the same size, the technique used in the reviewed works is resizing every ROI to a predetermined size (spatial normalization).
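The rotation correction described above can be sketched as follows. This is only an illustration under stated assumptions: the two landmarks are taken to be the eye centers, the coordinates are invented, and only the landmark coordinates are rotated (a full system would apply the same rotation to the image pixels).

```python
import math
import numpy as np

def eye_line_angle(left_eye, right_eye):
    """Angle in degrees between the eye line and the horizontal axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def rotate_points(points, angle_deg, center):
    """Rotate (x, y) points by angle_deg around center."""
    theta = math.radians(angle_deg)
    rotation = np.array([[math.cos(theta), -math.sin(theta)],
                         [math.sin(theta), math.cos(theta)]])
    return (np.asarray(points, dtype=float) - center) @ rotation.T + center

# Hypothetical eye centers of a tilted face, in (x, y) pixel coordinates.
left_eye, right_eye = (30.0, 40.0), (70.0, 60.0)
angle = eye_line_angle(left_eye, right_eye)

# Rotate by the opposite angle around the midpoint between the eyes.
center = np.array([(left_eye[0] + right_eye[0]) / 2.0,
                   (left_eye[1] + right_eye[1]) / 2.0])
aligned = rotate_points([left_eye, right_eye], -angle, center)
print(angle, aligned)
```

After the correction, both landmarks share the same vertical coordinate, so the angle they form with the horizontal axis is zero, while the distance between them is unchanged.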
The noise problem is mainly the background in the detected ROIs. It is important to remove the background from the original ROIs since it can decrease the accuracy of the classifier by adding one more variable to the problem: distinguishing between foreground and background. Most works seemed to overlook this kind of noise, but a few tried to crop the ROI even further in order to filter out the background. The most popular approach to do this is to use the facial landmarks and the bounding box outputted by the face detector. By calculating the distances between relevant facial landmarks, it is possible to reduce the ROI dimension obtained from the face detector and filter the background noise. Figure 9 shows an approach to background removal on a face from the CK+ database.

Image Processing
Having a properly transformed ROI might not be enough to prepare the image data. There are several image processing techniques to accentuate relevant features that are going to be used in the classifier, but most FER systems of the reviewed papers used the following ones.

Smoothing
Smoothing in image processing is often necessary. By smoothing an image, one can capture relevant patterns while filtering the noise. That way, smoothing can provide robustness to the data that is going to be analyzed. There are several ways to smooth an image, but the most popular ones in the reviewed papers are through a bilateral filter [73] or a Gaussian filter [74].
A bilateral filter is effective in noise removal while keeping edges sharp, since it uses a Gaussian function of space to smooth only the nearby pixels and a Gaussian function of intensity to smooth only pixels that have a similar intensity to the central pixel. That way, it can preserve the edges, since they usually have high intensity variations.
A Gaussian filter is effective in removing Gaussian noise from an image. It takes the neighborhood around the central pixel and finds its Gaussian weighted average. This Gaussian filter is a function of space alone; therefore, it will also smooth the edges.
In a conventional FER system, a bilateral filter has the upper hand when smoothing a face, since it keeps the edges from being blurred. Edges in a face normally delimit facial features such as eyes, nose, eyebrows, and mouth. Smoothing these critical features might be undesirable for FER.
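The difference between the two filters can be illustrated with a small NumPy sketch. It is a direct, unoptimized implementation written for this illustration; the function names, the window radius, and the sigma values are assumptions, not parameters taken from the reviewed works.

```python
import numpy as np

def spatial_kernel(radius, sigma_space):
    """Unnormalized 2D Gaussian weights as a function of distance only."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma_space ** 2))

def smooth(img, radius=2, sigma_space=1.5, sigma_intensity=None):
    """Gaussian filter if sigma_intensity is None, otherwise bilateral filter."""
    img = img.astype(np.float64)
    padded = np.pad(img, radius, mode="edge")
    spatial = spatial_kernel(radius, sigma_space)
    size = 2 * radius + 1
    out = np.empty_like(img)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            window = padded[r:r + size, c:c + size]
            weights = spatial
            if sigma_intensity is not None:
                # Down-weight neighbors whose intensity differs from the center.
                diff = window - img[r, c]
                weights = spatial * np.exp(-(diff ** 2) / (2.0 * sigma_intensity ** 2))
            out[r, c] = np.sum(weights * window) / np.sum(weights)
    return out

# Synthetic vertical edge: left half dark, right half bright.
step = np.zeros((8, 8))
step[:, 4:] = 100.0
gaussian_out = smooth(step)
bilateral_out = smooth(step, sigma_intensity=10.0)
print(gaussian_out[4, 2:6])
print(bilateral_out[4, 2:6])
```

On this synthetic edge, the Gaussian output leaks bright values into the dark side, whereas the bilateral output stays essentially unchanged next to the edge, which is the behavior that makes it attractive for facial features.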

Histogram Equalization
Histograms plot the intensity distribution of an image. Because of the different lighting conditions that extracted faces may present, the recognition accuracy will likely suffer. There are histogram based algorithms in Computer Vision that are able to stretch out this distribution, meaning that overexposed or underexposed regions of the face will have their intensity made more uniform. This method has the advantage of improving the contrast in an image, highlighting facial features and reducing the interference caused by different lighting conditions. However, it may also increase the contrast of background noise [75]. Reviewed works that used this pre-processing approach tended to explore Histogram Equalization (HE) [76]. Figure 10 shows a few results on the CK+ database using HE.
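As an illustration of how HE stretches the intensity distribution, the following sketch implements the standard cumulative-histogram mapping for an 8-bit grayscale image with NumPy. The function name and the 2 × 2 test image are assumptions made for this example.

```python
import numpy as np

def equalize_histogram(img):
    """Histogram equalization for an 8-bit grayscale image (dtype uint8)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = np.cumsum(hist)
    cdf_min = cdf[cdf > 0][0]          # first non-zero value of the CDF
    if cdf[-1] == cdf_min:             # constant image: nothing to stretch
        return img.copy()
    # Map each gray level through the normalized cumulative distribution.
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img]

# Low-contrast toy image: all values packed between 100 and 130.
face = np.array([[100, 110],
                 [120, 130]], dtype=np.uint8)
equalized = equalize_histogram(face)
print(equalized)
```

The four narrow input levels are spread over the full 0 to 255 range, which is the contrast improvement described above.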

Data Augmentation
Emotion recognition databases are generally small, which is undesirable for Machine Learning classifiers. Training on small data can lead to overfitting [77], which is a common problem in Machine Learning models. A model overfits when it accurately classifies data used for training, but its accuracy drops considerably when classifying data outside the training set (poor generalization). Therefore, overfitting can be spotted when training the model: the accuracy on the training data is high, but the accuracy on the validation data is significantly lower. One way to deal with this problem, which is often a consequence of using small databases for training, is through DA.
DA is a technique that increases the amount of data by modifying it reasonably. These modifications can be cropping, flipping, rotating, zooming, rescaling, brightness changing, and shifting, among others. It is important to keep track of these modifications; otherwise, this method might change the proper meaning of the training samples and confuse the classifier. Figure 11 shows an example of DA.
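A few of these modifications can be sketched with NumPy as follows. The helper names, the shift of one pixel, and the brightness offset of 30 are assumptions chosen for the example; each augmented sample is returned together with a label describing the modification, so that the applied changes can be tracked.

```python
import numpy as np

def horizontal_flip(face):
    """Mirror the image left to right."""
    return face[:, ::-1]

def shift_right(face, pixels):
    """Shift right by `pixels` columns, repeating the left edge column."""
    padded = np.pad(face, ((0, 0), (pixels, 0)), mode="edge")
    return padded[:, :face.shape[1]]

def change_brightness(face, delta):
    """Add delta to every pixel and clip to the 8-bit range."""
    return np.clip(face.astype(np.int16) + delta, 0, 255).astype(np.uint8)

def augment(face):
    """Return (description, image) pairs so each modification is tracked."""
    return [("original", face),
            ("flip", horizontal_flip(face)),
            ("shift_right_1", shift_right(face, 1)),
            ("brighter_30", change_brightness(face, 30))]

# Tiny stand-in for a grayscale face ROI.
face = np.array([[0, 50, 250],
                 [10, 60, 240]], dtype=np.uint8)
samples = dict(augment(face))
for name, image in samples.items():
    print(name, image.tolist())
```

All augmented samples keep the label of the original face, which is only valid as long as the modification does not alter the expression itself.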

Principal Component Analysis
PCA [78] is a method used to reduce the dimension of a large number of features while keeping most of their information. At the potential expense of a small amount of accuracy, one can simplify data by reducing the number of variables. In an FER system, this algorithm can be used to reduce redundant facial features, leading to an increase in computational efficiency.
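The dimensionality reduction can be sketched with a singular value decomposition in NumPy. The function names and the toy data (two perfectly correlated features) are assumptions for illustration; real facial feature vectors would have far more dimensions.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the mean, principal directions, and explained variance ratios."""
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions, sorted by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return mean, vt[:n_components], explained[:n_components]

def pca_transform(X, mean, components):
    """Project samples onto the retained principal directions."""
    return (X - mean) @ components.T

# Toy data: the second feature is exactly twice the first, so one
# component carries all of the variance.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
mean, components, explained = pca_fit(X, n_components=1)
reduced = pca_transform(X, mean, components)
print(explained, reduced.ravel())
```

Here two features are reduced to one without losing information; with real data, the explained variance ratio indicates how much information is discarded.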

Feature Extraction
Once the pre-processing phase is over, one can extract the relevant highlighted features. In a conventional FER system, the relevant features are facial features. The quality of these features plays a huge role in system accuracy; therefore, several techniques to extract features were developed in Computer Vision. The most popular feature extraction techniques in the reviewed works are the following:

Local Binary Patterns
LBP [79] is known as one of the best methods for texture processing. This algorithm compares a center pixel with the eight neighbors in its 3 × 3 square neighborhood. If the neighbor pixel value is greater than or equal to the center pixel value, then it takes the value "1"; otherwise, it takes the value "0". Afterwards, one can take the resulting binary code from this operation and set the center pixel to its decimal value. Figure 12 shows an example of this process. In FER systems, LBP has the potential to highlight relevant facial features for emotion recognition, such as eyebrows, eyes, nose, and mouth. However, it has the disadvantage of being sensitive to noise: since it is a method based on intensity differences, it is affected by image noise when processing a region that has a nearly uniform intensity. Figure 13 shows a feature extraction example using LBP.
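The thresholding and encoding steps can be sketched as follows. The order in which the eight neighbors are read (clockwise, starting at the top-left pixel) is a convention assumed for this example; other orderings give different but equally valid codes. The patch values are also invented.

```python
import numpy as np

# Neighbor offsets (row, col), read clockwise from the top-left pixel.
# The first neighbor becomes the most significant bit (assumed convention).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(patch):
    """LBP code of the center pixel of a 3 x 3 patch."""
    center = patch[1, 1]
    code = 0
    for dr, dc in OFFSETS:
        bit = 1 if patch[1 + dr, 1 + dc] >= center else 0
        code = (code << 1) | bit
    return code

def lbp_image(img):
    """LBP codes for every pixel that has a full 3 x 3 neighborhood."""
    rows, cols = img.shape
    out = np.zeros((rows - 2, cols - 2), dtype=np.uint8)
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r - 1, c - 1] = lbp_code(img[r - 1:r + 2, c - 1:c + 2])
    return out

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
code = lbp_code(patch)   # bits 1,0,0,0,1,1,1,1 -> binary 10001111
print(code)
```

In practice, the histogram of these codes over a face region, rather than the code image itself, is typically used as the texture descriptor.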

Optical Flow
OF [80] is a method that can only be used in a sequence of frames (video) since it aims to assess the magnitude and direction of motion. Basically, this technique calculates the motion between two frames at the pixel level. Therefore, it outputs a vector containing the movement of pixels in an image from the first frame to the second. However, this method depends on how well the initial tracked features are chosen and on how well they evolve through time, being highly sensitive to noise and occlusions [81].
This method can be effectively implemented in an FER system since, whenever one goes from a neutral expression to a peak facial expression, there is an obvious motion in the face that can be estimated. Figure 14 illustrates the usage of this method in an FER system using a sequence from the CK+ database.
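As a rough illustration of motion estimation between two frames, the sketch below solves the brightness constancy equation in the least-squares sense over a single window, in the spirit of the Lucas-Kanade method. This choice of method, the synthetic Gaussian blob, and the half-pixel displacement are assumptions made for the example; the reviewed works may rely on other OF formulations.

```python
import numpy as np

def gaussian_blob(size, cx, cy, sigma=5.0):
    """Smooth synthetic pattern centered at (cx, cy)."""
    y, x = np.mgrid[0:size, 0:size]
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))

def estimate_flow(frame1, frame2):
    """Least-squares (u, v) displacement, in pixels, for the whole window."""
    iy, ix = np.gradient(frame1)       # derivatives along rows, then columns
    it = frame2 - frame1               # temporal derivative
    # Brightness constancy: ix * u + iy * v + it = 0 at every pixel.
    A = np.stack([ix.ravel(), iy.ravel()], axis=1)
    b = -it.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(u), float(v)

# The blob moves half a pixel to the right between the two frames.
frame1 = gaussian_blob(32, 16.0, 16.0)
frame2 = gaussian_blob(32, 16.5, 16.0)
u, v = estimate_flow(frame1, frame2)
print(u, v)
```

The estimate is only reliable for small displacements and textured regions; dense OF methods repeat a similar computation per pixel or per local window to obtain a motion field over the face.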

Active Appearance Model
Active appearance model (AAM) [82] is a Computer Vision algorithm that matches a statistical model of the object shape and appearance to a new image. In an FER system, this means matching only the face to a new image; everything else besides the face is discarded. The shape information extracted with the AAM is used to compute a set of suitable parameters that highlights the appearance of the facial features. However, this method is highly sensitive to input images with differences in pose, expression, and illumination that were not included in the training set [83]. Figure 15 shows an example of this process.

Action Units
AUs are individual muscle movements that constitute facial expressions. AUs were inspired by physiological, psychological, and sociological theories that claim that different facial expressions trigger different facial muscles. The Facial Action Coding System (FACS) [84] is a system that describes facial expressions by breaking them down into single AUs. Using AUs as features to classify emotions is a popular approach in the reviewed works, although some found difficulty in coding the dynamics of movements with precision, as well as in measuring the AU intensity. Figure 16 shows some examples of AUs.

Facial Animation Parameters
Facial Animation Parameters (FAPs) [85] represent 66 displacements and rotations of the feature points from the neutral face position. FAPs are based on facial motion and are related to muscle actions. They represent a complete set of basic facial actions, allowing the representation of facial expressions. FAPs can also be defined as relevant distances between facial features. However, FAP extraction is sensitive to various noise sources, such as different lighting conditions, which might lead to subtle failures in facial area segmentation, compromising the FER system [86]. Figure 17 illustrates an example of FAP extraction (eyebrows and mouth) using the CK+ database.

Gabor Filter
The Gabor filter [87] is used to represent texture information. It is selective in orientation and scale, and it is robust to harsh illumination conditions. It is able to capture spatial information of frequency, position, and orientation from an image, and to extract subtle local transformations effectively. However, the drawback of this method is the high dimensionality of the Gabor feature space, leading to a high computational cost, which is impractical for real-time applications. In order to achieve real-time performance, one needs to use simplified Gabor features; however, these are sensitive to lighting variations [88]. Figure 18 illustrates a Gabor feature map for a face from the CK+ database.
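The orientation selectivity of a Gabor filter can be illustrated with the real part of the standard Gabor function, a Gaussian envelope multiplied by a cosine carrier. The parameter names and values (kernel size 9, sigma 2, wavelength 4, aspect ratio 0.5) and the striped test patch are assumptions chosen for this sketch.

```python
import numpy as np

def gabor_kernel(size, sigma, theta, wavelength, gamma=0.5, psi=0.0):
    """Real part of a Gabor kernel; theta is the carrier direction in radians."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Coordinates rotated so that x_t runs along the carrier direction.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength + psi)
    return envelope * carrier

# Test patch: vertical stripes whose intensity varies along x with period 4.
xs = np.arange(-4, 5)
stripes = np.tile(np.cos(2.0 * np.pi * xs / 4.0), (9, 1))

kernel_0 = gabor_kernel(9, sigma=2.0, theta=0.0, wavelength=4.0)
kernel_90 = gabor_kernel(9, sigma=2.0, theta=np.pi / 2.0, wavelength=4.0)

# Filter response at the patch center (one step of a convolution).
response_0 = float(np.sum(stripes * kernel_0))
response_90 = float(np.sum(stripes * kernel_90))
print(response_0, response_90)
```

The kernel aligned with the stripes responds far more strongly than the orthogonal one. A filter bank repeats this over several orientations and wavelengths, which is where the high-dimensional feature space mentioned above comes from.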

Scale-Invariant Feature Transform
Scale-Invariant Feature Transform (SIFT) is a Computer Vision algorithm for detecting and describing local features in an image [89]. SIFT features are invariant to uniform scaling, orientation, and illumination changes; however, they suffer from blur and affine changes [90]. This robustness comes from the transformation of an image into a large collection of feature vectors, each of which is invariant to the conditions mentioned above. In FER systems, the SIFT algorithm can be used to detect facial features such as eyebrows, eyes, nose, and mouth. Some works managed to combine this feature extraction algorithm with OF to calculate the motion of facial features. Figure 19 illustrates an example of extracting local features from a face of the CK+ database.

Histogram of Oriented Gradients
Histogram of Oriented Gradients (HOG) [91] describes local object appearance and shape within an image by the distribution of intensity gradients or edge directions. The image is divided into cells with a determined number of pixels and a histogram of gradient directions is built for each cell. Since it operates on local cells, it is invariant to geometric transformations, except for object orientation. Figure 20 illustrates an example of HOG extraction using the CK+ database. These HOG features differ between facial expressions and are easily distinguished, which makes them appealing for FER systems.
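The per-cell histogram can be sketched as follows. This is a simplified illustration: it uses nine unsigned orientation bins and hard bin assignment, and it omits the block normalization and bin interpolation of the full HOG descriptor. The cell size and gradient ramps are invented for the example.

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    gy, gx = np.gradient(cell.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees.
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    bin_width = 180.0 / n_bins
    # The minimum guards against a rounded angle landing exactly on 180.
    bins = np.minimum((angle / bin_width).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=magnitude.ravel(),
                       minlength=n_bins)

# 8 x 8 cell whose intensity increases from left to right.
ramp_x = np.tile(np.arange(8) * 10.0, (8, 1))
hist_x = cell_histogram(ramp_x)      # gradient points along x (0 degrees)
hist_y = cell_histogram(ramp_x.T)    # gradient points along y (90 degrees)
print(hist_x)
print(hist_y)
```

The two cells produce histograms that peak in different bins, which shows why the descriptor is sensitive to orientation while remaining a compact summary of local shape.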

Classification/Regression
A classification model is responsible for predicting a certain label given an input image or features. A regression model is responsible for determining the relationship between a dependent variable and independent variables. Both methods were used in the reviewed works, classification being the most popular one. The most used classification and regression algorithms are presented as follows.

Convolutional Neural Network
A CNN is a type of neural network that is mainly used in Computer Vision (Deep Learning) because of its ability to solve multiple image classification problems. CNNs can even beat humans in some of these problems since they are able to detect and identify underlying patterns that are too complex for the human eye. An input image is run through several hidden layers of the CNN that will decompose it into features. Those features are then used for classification, generally through a Softmax function that converts the network outputs into a probability distribution over the classes; the class with the highest probability is taken as the predicted class.
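The final classification step can be illustrated with a small Softmax sketch. The raw scores (logits) below are invented values standing in for the output of the last layer of a hypothetical CNN, and the class order is an assumption.

```python
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral"]

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    shifted = logits - np.max(logits)   # subtract the maximum for stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical outputs of the last layer for one face image.
logits = np.array([0.5, 0.1, 0.2, 3.0, 0.3, 1.0, 0.4])
probabilities = softmax(logits)
predicted = EMOTIONS[int(np.argmax(probabilities))]
print(predicted, probabilities.round(3))
```

The probabilities sum to one, and the predicted class is simply the one with the highest probability.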
It is important to mention that different problems require different CNN models and different tuning to maintain a high classification accuracy. This is mainly caused by the overfitting/underfitting problem. Overfitting, as mentioned in Section 4.3.3, is when the model generalizes poorly; in other words, its classification accuracy drops considerably when classifying data that was not in the training set. Underfitting is when the model makes poor predictions on both the training data and data it has not seen before. There are several ways to address this problem:
1. Increasing the complexity of the model (by adding more layers).
2. Adding dropout layers [93], which randomly disable a determined number of nodes during training to prevent the model from memorizing patterns instead of learning them.
3. Tuning the parameters of the model during training, such as epochs, batch size, learning rate, and class weight, among others.
4. Increasing the training data by adding more samples or through DA, as mentioned in Section 4.3.3.
5. Whenever the database is too small (a common problem with the publicly available databases for emotion recognition), Transfer Learning (TL) can be applied. TL uses a model that has already been trained on a large database, which one can then fine-tune using a smaller database for the classification problem at hand.
A great portion of the reviewed works used this approach for classification, showing promising results for Deep Learning based classifiers.

Support Vector Machine
SVM is a Machine Learning algorithm mostly used for classification/regression problems. An SVM model is the representation of features in space, mapped so that the features belonging to each class are divided by a clear gap that is as wide as possible. Input features are then mapped into that same space and predicted to belong to a class based on which side of the gap they fall. The training phase creates this map, which is afterwards used for predictions. The strengths of this classifier lie in handling complex nonlinear data and in being robust to overfitting. However, SVMs are computationally expensive, hard to tune due to the importance of selecting the proper kernel function, and do not perform well with large databases. Figure 21 shows a potential map of an SVM model trained for an FER system.

K-Nearest Neighbor
K-nearest neighbor (KNN) [94] is an instance-based learning algorithm that utilizes a non-parametric technique when making its classification or regression. The training data consist of vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing these feature vectors and their corresponding classes. In the classification step, an input feature vector is assigned the class that is most common among its K nearest training vectors. Common distance metrics used for calculating which features are closer to the input are the Euclidean distance (ED) and the Hamming distance. The strengths of this classifier lie in the simple implementation and in the fast training step. However, it requires large storage space, its testing is slow, it is sensitive to noise, and it performs poorly for high-dimensional data. One more problem of this classification/regression approach is that unbalanced classes can lead to inaccurate predictions (classes with a larger number of samples usually dominate the predictions, even if wrongly). One way to overcome this problem is by setting class weights.
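A minimal KNN classifier with the ED as the distance metric can be sketched as follows. The two-dimensional feature vectors and their labels are invented for the example and do not come from any FER database.

```python
from collections import Counter

import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k training vectors closest to the query."""
    distances = np.linalg.norm(train_x - query, axis=1)   # Euclidean distance
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# "Training" only stores the feature vectors and their labels.
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = ["neutral", "neutral", "neutral",
           "surprise", "surprise", "surprise"]

print(knn_predict(train_x, train_y, np.array([0.5, 0.5])))
print(knn_predict(train_x, train_y, np.array([5.5, 5.5])))
```

Every prediction computes the distance to all stored samples, which is why training is fast while testing is slow and storage-hungry.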

Naive Bayes
Naive Bayes classifiers [95] are a family of probabilistic Machine Learning classifiers based on the Bayes theorem, which assume strong independence between features. The Bayes theorem is defined through the following equation:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Using this equation, it is possible to calculate the probability of event A occurring given that B is true. However, these classifiers assume that features are independent, which means that, in an FER system, the Naive Bayes classifier will not correlate features when making a prediction. This is undesirable since there are obviously correlated features when making facial expressions; for instance, when one is surprised, there is an obvious correlation between the mouth and the eyes (both usually open wider). However, the advantages of this classifier lie in the simple implementation and in scaling well to large databases. These types of classifiers are very popular in text classification problems.
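The independence assumption can be seen in a small Gaussian Naive Bayes sketch, where the likelihood of each feature is evaluated separately and the log-likelihoods are simply summed. The Gaussian model per feature, the two features (mouth and eye opening), and all values are assumptions made for this illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per class: prior, per-feature mean, and per-feature variance."""
    model = {}
    for label in np.unique(y):
        rows = X[y == label]
        # A small constant keeps the variance strictly positive.
        model[str(label)] = (len(rows) / len(X),
                             rows.mean(axis=0),
                             rows.var(axis=0) + 1e-9)
    return model

def predict_gaussian_nb(model, x):
    """Class with the highest log prior plus summed log likelihoods."""
    best_label, best_score = None, -np.inf
    for label, (prior, mean, var) in model.items():
        # Features are treated as independent, so their terms are summed.
        log_likelihood = -0.5 * np.sum(np.log(2.0 * np.pi * var)
                                       + (x - mean) ** 2 / var)
        score = np.log(prior) + log_likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical features: (mouth opening, eye opening), scaled to [0, 1].
X = np.array([[0.80, 0.90], [0.90, 0.80], [0.85, 0.95],
              [0.10, 0.40], [0.20, 0.50], [0.15, 0.45]])
y = np.array(["surprise"] * 3 + ["neutral"] * 3)
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, np.array([0.80, 0.85])))
```

Because each feature contributes its own term, the model cannot express that a wide-open mouth and wide-open eyes tend to occur together.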

Hidden Markov Model
Hidden Markov Model (HMM) [96] is a probabilistic model that is able to predict a sequence of unknown (hidden) variables from a sequence of observed variables. For instance, in an FER system that would mean predicting happiness (hidden variable) based on a smile (observed variable). The strengths of this classifier lie in the potential to model arbitrary features from the observations, in the potential to merge various HMMs to classify more data, and in the incorporation of previous knowledge into the model. However, this classifier is computationally expensive and prone to overfitting.
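One common way to recover the hidden sequence of a discrete HMM is the Viterbi algorithm. The sketch below is a generic textbook version, not a method taken from the reviewed works; the state names, observation symbols, and all probabilities are hypothetical.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for a discrete HMM."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][r][0] * trans_p[r][s] * emit_p[s][obs], r) for r in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Hypothetical two-state model: the emotion is hidden, the mouth shape is observed.
states = ["happy", "neutral"]
start_p = {"happy": 0.5, "neutral": 0.5}
trans_p = {"happy": {"happy": 0.8, "neutral": 0.2},
           "neutral": {"happy": 0.2, "neutral": 0.8}}
emit_p = {"happy": {"smile": 0.9, "flat": 0.1},
          "neutral": {"smile": 0.2, "flat": 0.8}}

path = viterbi(["flat", "smile", "smile"], states, start_p, trans_p, emit_p)
print(path)
```

For long sequences the products of probabilities underflow, so practical implementations usually work with log-probabilities instead.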

Decision Tree
A Decision Tree (DT) [97] classifier is essentially a flowchart represented as a tree model. A DT splits the database into smaller sets of data until no more splits can be made, and the resulting leaves are the classes used for classification. The strengths of this classifier lie in the potential to learn nonlinear relationships in the data, in handling high-dimensional data, and in its simple implementation. However, the main disadvantage of this classifier is overfitting, since it can keep branching until it memorizes the data during the training step.
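The splitting step can be illustrated with a single node. The sketch below assumes a CART-style criterion (Gini impurity), which the text above does not specify; the helper names `gini` and `best_split` and the toy feature values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature.

    The threshold is None when no split improves on the unsplit node.
    """
    best = (None, gini(labels))
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical one-dimensional feature (e.g., mouth curvature) with two classes.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = ["neutral", "neutral", "neutral", "happy", "happy", "happy"]

threshold, score = best_split(values, labels)
print(threshold, score)
```

A full tree applies this search recursively to each resulting subset; without a depth limit or pruning, the recursion continues until every leaf is pure, which is the memorization behaviour mentioned above.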

Random Forest
Random Forest (RF) [98] is an ensemble classifier consisting of a group of DTs. Each DT outputs a prediction, and the final prediction is based on majority voting, meaning that the most predicted class becomes the final prediction. It has the advantage of reducing overfitting compared with a single DT, since averaging the predictions of the ensemble reduces the variance. However, it has the disadvantage of becoming slower as its complexity increases (e.g., by adding more DTs to the ensemble).
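The voting step can be sketched independently of how the trees are trained. In the example below each "tree" is a hand-written one-rule function standing in for a trained DT; the feature names and thresholds are hypothetical.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Majority vote over the predictions of the individual trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Stand-ins for trained DTs (hypothetical features and thresholds).
trees = [
    lambda s: "happy" if s["mouth_curve"] > 0.5 else "neutral",
    lambda s: "happy" if s["cheek_raise"] > 0.4 else "neutral",
    lambda s: "happy" if s["eye_squint"] > 0.6 else "neutral",
]

sample = {"mouth_curve": 0.7, "cheek_raise": 0.5, "eye_squint": 0.3}
print(forest_predict(trees, sample))
```

Here two of the three trees vote "happy", so the disagreement of a single tree does not change the final prediction. Prediction time grows linearly with the number of trees.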

Euclidean Distance
ED is the straight-line distance between two points in Euclidean space. Some reviewed works used this metric for classification by calculating the distance from the facial features of a given facial expression to the mean vector of facial features for each emotion. The emotion with the smallest distance is then assigned to the input face. The ED between two points x and y in an n-dimensional space is defined through the following equation:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

The advantage of this classifier lies in its simple implementation for detecting latent clusters. However, this simplicity is also its main drawback, especially for high-dimensional data.
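The nearest-mean scheme described above can be sketched as follows. The feature values and the helper names `mean_vector` and `nearest_mean_emotion` are assumptions made for illustration only.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of equally sized feature vectors."""
    return [sum(column) / len(column) for column in zip(*vectors)]

def nearest_mean_emotion(features, class_means):
    """Assign the emotion whose mean feature vector has the smallest ED to the input."""
    return min(class_means, key=lambda emotion: math.dist(features, class_means[emotion]))

# Hypothetical 2-D training features per emotion.
training = {
    "happy": [(0.8, 0.2), (0.9, 0.3)],
    "sad": [(0.2, 0.7), (0.1, 0.9)],
}
class_means = {emotion: mean_vector(vectors) for emotion, vectors in training.items()}

print(nearest_mean_emotion((0.7, 0.3), class_means))
```

Each class is summarized by a single point, which keeps the method simple but ignores the spread and shape of each class in feature space.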

Results and Discussion
In this section, some relevant results from the reviewed papers are presented. This aims to provide insights on what has been done in the FER field based on the reviewed works. It is difficult to make a fair comparison between works, since most of them used different databases to train and test their FER systems, as well as different ratios of training/test data and different procedures. However, works that followed similar procedures and used the same databases can be compared. There are a few things to be aware of when analyzing the results. There are mainly two testing procedures used in the reviewed works:

• Hold-out (HO) is when the database is split up into two groups: one for training and one for testing. Generally, the training set has more data than the testing set (e.g., 70%/30%). This method has the advantage of being the fastest to compute; however, it may produce high-variance evaluations, since it heavily depends on which data end up being used for training and for testing.

• Cross-validation (CV), which can be divided into:
1. K-fold cross-validation [99], which is when the database is randomly split up into k groups. One group is used for testing and the remaining ones for training. The process is repeated until every group has been used for testing. This method has the advantage of being more robust to the division of the database, since every piece of data ends up being used for training k-1 times and for testing once. Therefore, the evaluation variance can be reduced as k is increased, although the computation time also increases.
2. Leave-p-out cross-validation [100], which is when all possible sets of p data are left out from the training and used for validation. Although this method provides more robust evaluations than k-fold cross-validation, it can become computationally infeasible depending on p.
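The k-fold procedure can be sketched with the standard library alone. The function name `k_fold_splits` and the fixed `seed` are assumptions made for this illustration, not part of any reviewed work.

```python
import random

def k_fold_splits(items, k, seed=0):
    """Yield (train, test) lists so that every item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal groups
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Ten sample indices split into five folds.
folds = list(k_fold_splits(range(10), k=5))
for train, test in folds:
    print(sorted(test), "tested;", len(train), "used for training")
```

With k = 5, each index appears in the test set once and in the training set four times. Passing subject identifiers instead of individual sample indices is one way to keep all images of a person inside a single fold.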
Within these two procedures, there is still the factor of being person-independent (PI): when the same person does not appear in both the training set and the testing set. It is relevant to mention that testing procedures that followed a CV and PI approach are the most rigorous ones (meaning they usually yield lower accuracy than the other testing procedures). In the "Testing procedure" column, a procedure that is not marked as PI was potentially carried out with the same people in the training set and the testing set. Another important remark is that some authors had common complaints about FER databases:

• The posed facial expressions made by actors when building the databases are too artificial. This means that, even if the works present a high accuracy on the benchmarks (using databases that are also built on posed facial expressions), it might not translate into a high accuracy when the same system faces a real world scenario.

• Some databases are poorly annotated or have an ambiguous annotation. Authors who noticed this problem tried to overcome it by making their own annotations or by excluding those samples.

• Emotion databases are generally small (mainly because of how hard it is to set up the image acquisition system and how hard it is to get several actors to do different facial expressions).
It is also worth noting that some classifiers, pre-processing techniques, features, and databases presented in the tables were not covered in the previous sections. Since there is a huge variety of methods and databases used in the reviewed works, it was decided to mention only the most relevant ones. If the reader wants to know more about a specific work that used techniques other than the ones mentioned in the previous sections, it is recommended to consult that work directly. The reviewed works are referenced in an orderly manner to facilitate potential searches. The results are compacted into tables and separated by approach: static image (Table 2), video (Table 3), audiovisual (Table 4), video/static (Table 5), and circumplex model (Table 6). In the "Pre-processing" column, face detection is not listed, since every work performed this ROI extraction step.

In Tables 2-5, the last column shows the accuracy obtained by each work in predicting emotions. Most works considered the six basic emotions plus the neutral expression. However, not every work used exactly this set: some used five of the six basic emotions, some added one more (contempt, usually seen in the CK+ database), and others went further and combined them (e.g., happily surprised). In Table 6, the last column shows the accuracy obtained by each work in predicting emotions using the circumplex model. This model has a wide variety of metrics, such as valence, arousal, power, dominance, expectation, and activation. Depending on the values of these metrics, one can predict which emotion a certain subject is feeling. Figure 22 illustrates this concept using the valence and arousal metrics. Tables 2-6 present the results of the works reviewed during this systematic review, sorted by accuracy. Some works tested their algorithms on various databases, but only the best classification result is presented in the tables.
Table 2 sums up the results of works that approached this problem by analyzing static images. It is interesting to notice that the top results in this table mostly use CNNs with or without variants or CNNs combined with SVMs, and that the top two used TL from pre-trained models. The works that attained the best results generally used a CNN pre-trained on large databases and then fine-tuned it on smaller databases. This shows why CNNs have been dominating the FER field (as well as numerous Computer Vision fields): they jointly optimize feature extraction and classification. Some used an end-to-end CNN approach, which is fed only with ROIs (faces) that may or may not have been pre-processed. In other words, this approach does not require choosing which features to use for training the CNN: the CNN selects the features to learn by itself during the training step. Other works used conventional classifiers like DTs, RFs, or SVMs instead of trying to overcome the problems of using CNN classifiers in FER (such as overfitting because of the overall small databases). These works showed that, with a good understanding of the problem and of the databases used, as well as a proper pre-processing step (such as DA and intensity/spatial normalization), feature selection (such as AUs, LBP, HOG, or Gabor features), and fine-tuning, it is possible to achieve competitive results.
However, there is a common discrepancy in accuracy between testing on controlled environment databases and on wild environment databases. This shows a clear difficulty in translating the good results in controlled environments (such as CK+ and JAFFE) to uncontrolled environments (such as FER-2013 and SFEW). Every work that tested its algorithms on various databases obtained a significantly worse result on the uncontrolled environment ones. One example from this table is the work in [102], which, despite obtaining 98.90% accuracy when testing on the CK+ database, only obtained 55.27% accuracy on the SFEW database. This is mainly caused by the head pose variation and the different lighting conditions to which a real world scenario is susceptible.
Table 3 sums up the results of works that approached this problem by analyzing videos. It is possible to see the introduction of temporal features like OF and Motion History Histograms (MHHs). The pre-processing step revolves around intensity/spatial normalization and PCA. As for the classification step, there is a wide variety of classifiers, mainly EDs, Gaussian Mixture Models (GMMs), CNNs, SVMs, and HMMs. Most works tried to analyze facial expressions in a dynamic manner by using the positional information of the detected facial features. In order to improve results in the video approach, the reviewed works tended to process spatial and temporal information separately. The spatial information can characterize different positions of facial features, while temporal information can capture the flow of these facial features through time (normally from a neutral expression up until a peak facial expression). Afterwards, they aggregated these two types of features and fed them into a single classifier or into a hybrid classifier. Normally, these classifiers need to have the capacity to retain temporal information, for example by using Long Short-Term Memory Neural Networks (LSTMs) or other Recurrent Neural Networks (RNNs).
The main conclusion to draw from video approaches is that retaining the temporal information usually leads to better results than analyzing each frame separately. However, as in static image approaches, the works with the lowest accuracy rates used pose-variant databases, which means that retaining the temporal information is still not enough to perform well in uncontrolled environments.
Table 4 sums up the results of works that approached this problem by analyzing the video and its audio individually and combining them for the final classification. The columns "Classifiers", "Pre-processing", and "Features" represent solely the visual analysis. However, the "Accuracy" column presents the final result of their multimodal system: the fusion between video and audio. The remarks made for the video approach also apply to this approach. The main differences are how the reviewed works processed the audio modality and how they combined the audiovisual features for the final model. The fusion methods that attained the best results were the Probability-Based Product Rule and the Bayes sum rule. As in static image and video approaches, the works with the lowest accuracy rates mostly used uncontrolled environment databases (FER-2013 and AFEW). However, their results showed that a multimodal approach is generally better than a unimodal approach.
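Decision-level fusion with a product rule or a sum rule can be sketched generically as follows. This is the textbook form of the two rules and may differ in detail from the exact formulations used in the reviewed works; the function name `fuse` and the class probabilities are hypothetical.

```python
import math

def fuse(per_modality_probs, rule="product"):
    """Combine per-class probabilities from several modalities and renormalize."""
    classes = list(per_modality_probs[0])
    if rule == "product":
        scores = {c: math.prod(p[c] for p in per_modality_probs) for c in classes}
    elif rule == "sum":
        scores = {c: sum(p[c] for p in per_modality_probs) for c in classes}
    else:
        raise ValueError("rule must be 'product' or 'sum'")
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical classifier outputs for one clip.
video = {"happy": 0.6, "sad": 0.3, "angry": 0.1}
audio = {"happy": 0.3, "sad": 0.5, "angry": 0.2}

fused_product = fuse([video, audio], rule="product")
fused_sum = fuse([video, audio], rule="sum")
print(max(fused_product, key=fused_product.get), max(fused_sum, key=fused_sum.get))
```

The product rule strongly penalizes a class that any single modality considers unlikely, whereas the sum rule is more tolerant of one poorly performing modality.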
Table 5 sums up the results of a work that approached this problem by analyzing the video and its frames individually and combining them for the final classification. The main remark made by the authors is that this combined approach leads to better results than analyzing either part alone.

Table 6 sums up the results of works that approached this problem through the circumplex model. What is presented in the "Accuracy" column is an average of the calculated metrics in this model. One thing that one can immediately notice is that every work used a video approach. The main features used for this approach revolve around facial landmarks and LBP features. As for the classification step, the main classifiers are LSTMs, SVMs, HMMs, KNNs, and Support Vector Regressions (SVRs). As in the video approach, in order to improve the results, the reviewed works tried to track facial features over time, processing their spatial and temporal information separately. By using a hybrid classifier with the capacity to retain temporal information (LSTM), together with a proper pre-processing step (e.g., intensity normalization, which was neglected in the reviewed works of this table), it is possible to achieve competitive results when determining the metrics of the circumplex model. Some remarks made by the authors are that the regions close to the eyes, mouth, and eyebrows have a dominant influence on FER, and that people tend to avoid eye contact when they do not want to talk about some topics.

Insights on Emotion Recognition in the Wild Challenge
Since the main problems with FER systems observed in this systematic review are pose-variant faces and wild environments, this section explores a yearly FER challenge in the wild (EmotiW). The goal of most FER systems is to face real world scenarios; therefore, it is important to analyze this challenge and how the participants are trying to tackle these environment adversities in order to pinpoint future directions. The EmotiW Challenge is divided into two different competitions: one is based on a static approach, using the SFEW database, and the other is based on an audiovisual approach, using the AFEW database. In Section 7, it is possible to observe the overall superiority of multimodal approaches over unimodal approaches; therefore, this section explores the audiovisual competition of the EmotiW Challenge.
Kahou et al. [169] proposed the combination of multiple emotion classifiers, each based on a different data source. They used a CNN for frame-based classification of facial expressions from aligned faces. To classify whole video clips, they implemented a video frame aggregation strategy based on SVMs. They also employed a shallow network architecture that focuses on features extracted from the mouth of the primary human subject in the scene and used these features as input to an SVM. Finally, they presented a novel technique to aggregate models based on random hyper-parameter search, using low-complexity aggregation techniques consisting of simple weighted averages to combine the visual model with the audio model. As for the pre-processing step, they performed intensity normalization and isotropic smoothing. This work won the EmotiW 2013 challenge, with the best submission achieving 41.03% accuracy.
Liu et al. [177] proposed to represent the AFEW video clips using three kinds of image set models: linear subspace, covariance matrix, and Gaussian distribution. Then, different kernels were employed on these set models for distance measurement. Three classifiers were used: SVM, logistic regression, and partial least squares. Finally, a score-level fusion of classifiers based on different kernel methods and different modalities was conducted to further improve the performance. As for the feature extraction step, they extracted HOG and SIFT features, and used a CNN to exploit the strong spatially local correlations present in the faces. This work won the EmotiW 2014 challenge, with the best submission achieving 50.40% accuracy.
Yao et al. [161] proposed a pair-wise learning strategy to automatically seek a set of facial image patches that are important for discriminating two particular emotion categories, called AU-aware facial features. In each pair-wise task, they used an undirected graph structure, which takes learnt facial patches as individual vertices, to encode feature relations between any two learnt facial patches. Finally, they constructed the emotion representation by concatenating all facial feature relations. As for the pre-processing step, they used a face frontalization method to remove the influence of head pose variation by normalizing the faces geometrically, and implemented a Discrete Cosine Transform (DCT) based method to compensate for illumination variations in the logarithm domain. In the feature extraction step, they extracted AUs and used the Supervised Descent Method (SDM) to track them. This work won the EmotiW 2015 challenge, with the best submission achieving 53.80% accuracy.
Fan et al. [178] proposed a hybrid network that combines LSTM and 3D convolutional networks (C3D). LSTM takes appearance features extracted by the CNN over individual video frames as input and encodes motion later, while C3D models the appearance and motion of the video simultaneously. This work emphasized the importance of pre-processing the data by testing their system with the original video frames, obtaining an average accuracy of just 20%. Therefore, as for the pre-processing step, they performed face alignment. This work won the EmotiW 2016 challenge, with the best submission achieving 59.02% accuracy.
Hu et al. [179] proposed a new learning method named Supervised Scoring Ensemble (SSE) with deep CNNs. They added supervision not only to deep layers but also to intermediate and shallow layers to ease the training. They also presented a new fusion structure in which class-wise scoring activations at diverse complementary feature layers are concatenated and further used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. As for the pre-processing step, they applied SDM to track facial features, face frontalization, and rescaling, and applied a DCT based method for intensity normalization. As for the feature extraction step, they combined the grayscale face image with its corresponding basic LBP and mean LBP feature maps to form a three-channel input. This work won the EmotiW 2017 challenge, with the best submission achieving 60.34% accuracy.
Liu et al. [158] proposed a hybrid net containing three main parts for the visual features: Landmark ED, CNN, and LSTM. These parts were then combined with weights for the final classification. For the Landmark ED part, 34 EDs were calculated and summarized by their mean, maximum, and variance, resulting in 102 total features for each video. This part alone achieved 39.95% accuracy on the AFEW database. For the CNN part, four CNNs were fine-tuned to predict single static images. Features extracted from these four CNNs were used to train a linear SVM. The VGG-Face model [180] was fine-tuned using the FER-2013 database for feature extraction. Those features were then used to train an LSTM, and this part alone was able to achieve 46.21% accuracy. Finally, they combined these three visual parts with the audio part as a final decision step. As for the pre-processing step, DA and face alignment were performed. This work won the EmotiW 2018 challenge, with the best submission achieving 61.87% accuracy.
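The arithmetic behind the Landmark ED descriptor (34 distances, each summarized by three statistics, giving 102 features) can be sketched as follows. This is only an illustration of the idea: which landmark pairs were used, and whether the variance was the population or the sample variance, are not stated above, so the `pairs` argument and the use of `pvariance` are assumptions.

```python
import math
from statistics import mean, pvariance

def landmark_distances(landmarks, pairs):
    """EDs between the selected landmark pairs of one frame."""
    return [math.dist(landmarks[i], landmarks[j]) for i, j in pairs]

def video_descriptor(frames, pairs):
    """Concatenate the per-pair mean, maximum, and variance over all frames."""
    per_frame = [landmark_distances(frame, pairs) for frame in frames]
    per_pair = list(zip(*per_frame))  # values of each distance over time
    return ([mean(v) for v in per_pair]
            + [max(v) for v in per_pair]
            + [pvariance(v) for v in per_pair])

# Toy clip: two frames, two landmarks, one (hypothetical) landmark pair.
frames = [
    [(0.0, 0.0), (3.0, 4.0)],
    [(0.0, 0.0), (6.0, 8.0)],
]
descriptor = video_descriptor(frames, pairs=[(0, 1)])
print(descriptor)
```

With 34 landmark pairs instead of one, the same function returns 34 × 3 = 102 values per video, matching the feature count reported above.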
Figure 23 illustrates the EmotiW Challenge winners' accuracy over time. It is possible to see that the EmotiW Challenge is stimulating the research on emotion recognition in the wild, correlating with better results over the years. Recent winners used face frontalization to overcome the pose-variant faces problem. Since the AFEW database is 2D, it is the most reasonable solution. Overall, the winners do not seem to overlook the pre-processing step: most apply geometric transformations and illumination corrections in order to normalize the data. Concerning the feature extraction step, at least two winners applied SDM to track facial features. The main facial features explored were AUs, HOG, SIFT, and LBP features. As for the classification step, they mainly used SVMs and fine-tuned CNNs in a combined way. Some winners also used LSTMs to make predictions based on the temporal data of the AFEW database.
Based on the EmotiW Challenge results, the future direction for FER in uncontrolled environments seems to be converging on:

• Pre-processing techniques that normalize pose-variant faces as well as the image intensity.

• The exploration of AUs, HOG, SIFT, and LBP features.

• The use of hybrid classifiers based on SVMs, fine-tuned CNNs, and LSTMs.

Conclusions
The interest in FER is growing gradually and, with it, new algorithms and approaches are being developed. The recent popularization of Machine Learning made an obvious breakthrough in the research field. The research in FER is definitely on the right path, walking together with important fields like psychology, sociology, and physiology. As a result, more and more accurate FER systems are emerging every year. However, despite this obvious progress, pose-variant faces in the wild are still a big challenge for FER systems. Even so, there are emotion recognition challenges every year that explore this problem and, with them, FER systems are becoming more robust to pose-variant scenarios. This is especially true after a major breakthrough by a CNN called AlexNet [181], which achieved a top-5 error of 15.31% in the ImageNet 2012 competition, more than 10.8 percentage points lower than that of the runner-up. After this, researchers became aware of the potential of CNNs for solving Computer Vision problems, and more FER systems using CNNs emerged, correlating with overall better results. The only potential negative aspect to point out from the reviewed works is that none considered the environment context. Although most works are taking the right steps towards multimodal systems, the environment context seems to be ignored. For instance, in an image of a birthday party, the happy context has a huge weight in the mood of the people participating in it, which cannot be ignored even if a certain participant is not explicitly smiling. Nevertheless, FER systems are being stimulated by yearly challenges and by the overall interest in numerous fields, achieving better results year by year.

Figure 2. Flowchart of the paper selection for this systematic review.

Figure 3 shows how many papers were found for each year in the searched time frame, displaying a clear increase of interest in emotion recognition. It only appears to decrease in 2019 because, as mentioned above, the searched time frame goes from January 2006 to April 2019.

Figure 3. Number of papers found for each year during the paper selection stage.

Figure 5. Features selected by the AdaBoost learning algorithm in a face from the CK+ database.

Figure 6. Detected face from the CK+ database using the Dlib library.

Figure 7. Detected face from the CK+ database using MTCNN.

Figure 8. Rotation correction on a face from the CK+ database.

Figure 9. Background removal using the distance between eyes of a face from the CK+ database.

Figure 10. Results in some faces of the CK+ database using HE.

Figure 11. DA results on a face from the CK+ database.

Figure 13. Feature extraction using LBP on a face from the CK+ database.

Figure 14. OF of a sequence from the CK+ database.

Figure 15. AAM shape estimation in a face from the CK+ database.

Figure 16 illustrates a few examples of facial AUs.

Figure 16. Some relevant examples of AUs for facial expression discrimination.

Figure 18. Feature map of Gabor for a face from the CK+ database.

Figure 19. SIFT features of a face from the CK+ database.

Figure 20. HOG descriptors of the mouth of two subjects from the CK+ database [92].

Figure 21. Example of a feature space in an FER system.

Table 2. Results of reviewed works for static image approaches.

Table 3. Results of reviewed works for video approaches.

Table 4. Results of reviewed works for audiovisual approaches.

Table 5. Results of a reviewed work with a video/static multimodal approach.

Table 6. Results of reviewed works based on the circumplex model.