Article

Context-Aware Emotion Recognition in the Wild Using Spatio-Temporal and Temporal-Pyramid Models

1 Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Korea
2 School of Technology, Environment and Design, University of Tasmania, Hobart, TAS 7001, Australia
* Author to whom correspondence should be addressed.
Sensors 2021, 21(7), 2344; https://doi.org/10.3390/s21072344
Submission received: 10 March 2021 / Revised: 24 March 2021 / Accepted: 25 March 2021 / Published: 27 March 2021
(This article belongs to the Special Issue Sensor Based Multi-Modal Emotion Recognition)

Abstract
Emotion recognition plays an important role in human–computer interactions. Recent studies have focused on video emotion recognition in the wild and have run into difficulties related to occlusion, illumination, complex behavior over time, and auditory cues. State-of-the-art methods use multiple modalities, such as frame-level, spatiotemporal, and audio approaches. However, such methods have difficulties in exploiting long-term dependencies in temporal information, capturing contextual information, and integrating multi-modal information. In this paper, we introduce a multi-modal flexible system for video-based emotion recognition in the wild. Our system tracks and votes on significant faces corresponding to persons of interest in a video to classify seven basic emotions. The key contribution of this study is that it proposes the use of face feature extraction with context-aware and statistical information for emotion recognition. We also build two model architectures to effectively exploit long-term dependencies in temporal information with a temporal-pyramid model and a spatiotemporal model with “Conv2D+LSTM+3DCNN+Classify” architecture. Finally, we propose the best selection ensemble to improve the accuracy of multi-modal fusion. The best selection ensemble selects the best combination from spatiotemporal and temporal-pyramid models to achieve the best accuracy for classifying the seven basic emotions. In our experiments, we benchmarked our system on the AFEW dataset and achieved high accuracy.

1. Introduction

Emotional cues provide universal signals that enable human beings to communicate during the course of daily activities and are a significant component of social interactions. For example, people will use facial expressions such as a big smile to signal their happiness to others when they feel joyful. People also receive emotional cues (facial expressions, body gestures, tone of voice, etc.) from their social partners and combine them with their experiences to perceive emotions and make suitable decisions. In addition, emotion recognition, especially facial emotion recognition, has long been crucial in the human–computer interaction (HCI) field, as it helps computers efficiently interact with humans. Recently, several scientific studies have been conducted on facial emotion recognition (FER) in an attempt to develop methods based on new technologies in the computer vision and pattern recognition fields. This type of research has a wide range of applications, such as advertising, health monitoring, smart video surveillance, and development of intelligent robotic interfaces [1].
Emotion recognition on the basis of behavioral expressions presents numerous challenges due to the complex and dynamic properties of human emotional expressions. Human emotions change over time, are inherently multi-modal in nature, and differ in terms of such factors as physiology and language [2]. In addition, use of facial cues, which are considered the key aspect of emotional cues, still presents challenges owing to variations in such factors as head poses and lighting conditions [3]. Several factors, such as body expressions and tone of voice, are also affected by noise in the environment and occlusion. In some cases, emotions cannot be interpreted without context [4]. In video-based emotion recognition, facial expression representation often includes three periods: onset, apex, and offset [5,6], as shown in Figure 1. The lengths of the periods differ; the onset and offset periods tend to be shorter than the apex period. Challenges remain regarding the unclear temporal border between periods, and spontaneous expressions lead to multiple apexes.
To address the above-mentioned challenges, both traditional and deep learning methods often focus on facial expressions that present changes in facial organs in response to emotional states, underlying intentions, and social interactions. Such methods attempt to determine facial regions of interest, represent changes in facial expressions, and divide emotions into six basic categories, namely, anger, disgust, fear, happiness, sadness, and surprise, as proposed by Ekman et al. [7].
In 2D image-based facial emotion recognition (2D FER), the main tasks focus on robust facial representation followed by classification. There are two approaches to feature representation: geometric- and appearance-based approaches. Geometric-based approaches represent facial expressions using geometric features of facial components (mouth, eyes, nose, etc.) in terms of shape, location, distance, and curvature [8,9,10]. Appearance-based approaches use local descriptors and image filters, such as LBP [11], Gabor filters [12], and PHOG [13], to extract hand-crafted features for facial expression representation in traditional methods. In deep learning methods, feature representation is automatically extracted by convolutional neural networks (CNNs) [14] that are trained on large-scale emotion recognition datasets such as RAF-DB [15] and AffectNet [16]. Geometric-based methods are often affected by noise and have difficulty showing small changes in facial details, while appearance-based methods are robust to noise and retain facial details. Deep learning models such as VGG-16 [17] and Resnet [18] demonstrate improved 2D FER performance [10,19].
In video-based emotion recognition, the main task focuses on efficiently exploiting spatiotemporal coherence to classify human emotion as well as integrating multiple modalities to improve overall performance. In the spatiotemporal approach, extensions of hand-crafted traditional features such as HOG, LBP, and BoW have also been proposed and applied in video-based emotion recognition methods such as 3D HOG [20], LBP-TOP [21], and Bag-of-Words [22]. In addition, temporal models such as conditional random fields [23] and the interval temporal Bayesian network [24] are used to exploit spatiotemporal relationships between different features. For deep learning-based methods, many works use CNNs for feature extraction followed by LSTM for exploiting spatiotemporal relations [25,26,27]. For the frame-level approach, every frame in a video clip is subjected to facial feature extraction, the features are concatenated by a statistical operator (min, mean, and std) using pre-determined time steps, and the result is finally classified by deep learning models or traditional classification methods such as SVM [28,29].
Recently, many works have focused on video-based emotion recognition to address challenges in emotion recognition using the deep learning approach. Zhu et al. [30] used a hybrid attention cascade network to classify emotion recognition with a hybrid attention module for the fusion features of facial expressions. Shi et al. [31] proposed a self-attention module integrated with the spatial-temporal graph convolutional network for skeleton-based emotion recognition. Anvarjon et al. [32] proposed deep frequency features for speech emotion recognition.
However, video-based emotion recognition also presents some challenges under in-the-wild conditions, such as problems involving head pose, lighting conditions, and the complexity of facial expression representation due to spontaneous expression. Context is key in emotion recognition. For instance, in a dark environment or when the face of interest is tiny, it is possible to recognize emotions based on our experiences with related elements such as parts of the scene, body gestures, objects, and other people in the scene. In addition, a hierarchical structure in the emotion feature representation is necessary to deal with unclear emotion temporal borders.
In this study, we propose an overall system with face tracking and voting to select the main face for emotion recognition using two models based on spatiotemporal and temporal-pyramid architecture to efficiently improve emotion recognition. For face tracking and voting, we use a tracking-and-detection template with robust appearance features as well as motion features to suggest faces and people. Then, through a voting scheme based on probabilities, occurrences, and sizes, we choose the face and person of interest in the video clip.
In video-based emotion recognition, we first deal with in-the-wild conditions by integrating contextual features, facial emotion probability, and facial emotion features to construct a robust set of facial emotion features. For unclear temporal border and spontaneous expression problems, we propose a temporal-pyramid architecture to integrate face-context features by time steps based on statistical information. The hierarchical structure of facial-context feature integration improves the emotion evaluation results of our system. Moreover, we also propose a spatiotemporal model using “Conv2D+LSTM+3DCNN+Classify” architecture to exploit spatiotemporal coherence among face-context emotion features in 3D and 2D+T strategies. Finally, we suggest the best ensemble method to choose the best combination among models. Our experiment was conducted on the AFEW dataset [33] which is the dataset of the EmotiW Challenge 2019 [34]. We achieved good performance on the validation set and test set.
The contributions of this paper are as follows: (1) We integrate facial emotion features with scene context features to improve performance. (2) We propose spatiotemporal models to exploit spatiotemporal coherence among face-context features using 3D and 2D+T temporal strategies. In addition, we build a temporal-pyramid model to exploit the hierarchical structure of overall face-context emotion features by statistical operator. (3) Our proposed system achieved good performance on a validation set taken from the AFEW dataset [33].
This paper is organized into seven sections. In Section 2, we briefly summarize related works. We describe our proposed idea in Section 3. We discuss the network architectures in Section 4 and the best selection ensemble method in Section 5. Our experiments are shown in Section 6. Finally, the conclusions are outlined in Section 7.

2. Related Works

2.1. Image-Based Facial Expression Recognition

Emotion recognition plays a fundamental role in human–computer interactions (HCIs). It is used to automatically recognize emotions for a wide range of applications, such as customer marketing, health monitoring, and emotionally intelligent robotic interfaces. Emotion recognition remains a challenging task due to the complex and dynamic properties of emotions, their tendency to change over time, the fact that they are often mixed with other factors, and their inherently multi-modal nature in terms of behavior, physiology, and language.
To recognize emotional expression, the face is one of the most important visual cues. Facial expression recognition (FER) exploits the facial feature representation of static images [11] in the spatial domain. Traditional methods use handcrafted features, such as local binary patterns (LBPs), speeded-up robust features (SURF), and the scale-invariant feature transform (SIFT), to classify emotions. Recently, with the success of deep learning in computer vision tasks, FER has turned to the new challenge of classifying emotions in in-the-wild environments despite occlusions, illumination differences, etc. Many 2D FER image datasets, such as AffectNet [16] and RAF-DB [15], have been published to promote technological development and fulfill the requirement for large-scale and real-world datasets.

2.2. Video-Based Emotion Recognition

From still images to video, emotion recognition presents many serious challenges; these involve, for example, behavioral complexities, environmental effects, and temporal changes in the video channel, as well as acoustic and language differences in the audio channel. To provide a baseline for video emotion recognition in the wild, the AFEW dataset [33] was built from many movies and TV shows. Emotions are classified into seven categories (anger, disgust, fear, happiness, neutrality, sadness, and surprise) under uncontrolled environments such as outdoor/indoor scenes, illumination changes, occlusions, and spontaneous expression. From 2013 to 2018, the emotion recognition research community made great strides through the EmotiW Challenge [34] on the basis of the AFEW dataset [33].
Because human emotions are almost always displayed on the face by movements of facial muscles, many studies have focused on facial representations in attempts to exploit the spatial and temporal information contained in a video. There are three main approaches to this problem: geometry, video-level, and frame-level approaches.
For the geometry approach, Liu et al. [26] computed 3D landmarks, normalized these landmarks and extracted features using Euclidean distances. They proposed the Landmark Euclidean Distance network. Kim et al. [27] proposed the CNN-LSTM network to classify emotions through sequential 2D landmark features.
For the spatiotemporal approach, Liu et al. [26] used the VGG Face network to extract facial features and then used these facial features to classify emotions. They showed an accuracy of 43.07% on the validation set. Lu et al. [25] proposed VGG-Face+BLSTM [35] for the spatiotemporal network using the VGG-Face network fine-tuned on facial expression images from video clips. This model showed an accuracy of 53.91%.
Finally, the main idea of the frame-level approach is to merge emotion features in every frame using an aggregation function (min, max, std, etc.). It makes the representation invariant to the number of video frames. Bargal et al. [29] used facial emotion recognition networks to extract facial features and concatenated the results. For all frames, they used the statistical encoding module (STAT) to merge all frame-level features by min, max, variance, and average. They showed a high accuracy of 58.9% on the validation set. Knyazev et al. [28] later updated the STAT* module with scaling and normalization.
We realize that weak points exist in the above works that make use of the spatiotemporal, frame-level, and audio modalities. For instance, the spatiotemporal networks do not integrate 3DCNN [36] and BiLSTM [35] to find strong correlations between the spatial information in the data cube. Moreover, it would be better to use online fine-tuning in the video training process instead of offline feature extraction.
For the frame-level approach, STAT encoding does not utilize temporal information between the frame-level features. In addition, the frame-level features need to add more contextual information such as action information and scene information. The audio approach only uses one type of acoustic feature for emotion classification.

3. Proposed Idea

In this section, we define the problem that we wish to address and give a brief overview of our video emotion recognition system. Next, we explain our proposed method in detail, including the tracking and voting modules and method of face context feature extraction. The details of the model are discussed in the next section.

3.1. Problem Definition

In this study, the input is a video clip $V = (S, A)$ lasting 5 min or less, consisting of a scene sequence $S$ and an audio stream $A$. Certain cues play an important role in human emotion recognition, such as facial expressions, body gestures, and tone of voice. In the scope of our work, we mainly focus on visual cues that are important to the perception of human feelings. The face tracking $F$, along with the corresponding person tracking $P$ comprising body and scene information, are the most important cues to solve this problem. Our objective is to effectively locate the significant face $\bar{F}$ and corresponding person $\bar{P}$ from the scene sequence $S$. From there, we use the face and person image sequences $S_{\bar{F}}$ and $S_{\bar{P}}$ to classify the emotion $c \in C = \{0, \dots, 6\}$ as one of the seven basic emotions, namely, anger, disgust, fear, happiness, neutrality, sadness, and surprise.
Let $c_{f_j}^{t} = (x_{f_j}^{t}, y_{f_j}^{t}, w_{f_j}^{t}, h_{f_j}^{t}) \in \mathbb{R}^4$ and $c_{p_j}^{t} = (x_{p_j}^{t}, y_{p_j}^{t}, w_{p_j}^{t}, h_{p_j}^{t}) \in \mathbb{R}^4$ be the locations of the $j$th face and person at time $t$ in a scene sequence $S \in \mathbb{R}^{W \times H \times T}$, respectively, and let $g_j^{t} \in \mathbb{R}$ be the tracking indices calculated using the tracking module shown in Figure 2, where $(x_{f_j}^{t}, y_{f_j}^{t})$ / $(x_{p_j}^{t}, y_{p_j}^{t})$ is the face/person center, $(w_{f_j}^{t}, h_{f_j}^{t})$ / $(w_{p_j}^{t}, h_{p_j}^{t})$ is the face/person size, $W \times H$ is the size of a scene, and $T$ is the length of the scene sequence $S$. The scene sequence $S$ then contains the face tracking $F$ and person tracking $P$ information, defined as follows:
$$F = \{f_i\}_{i=1}^{N_f} \ \text{and} \ P = \{p_i\}_{i=1}^{N_f}, \qquad f_i = \{c_{f_{j_k}}^{t_k} \mid t_k < t_{k+1},\ g_{j_k}^{t_k} = i\}_{k=1}^{M_i} \ \text{and} \ p_i = \{c_{p_{j_k}}^{t_k} \mid t_k < t_{k+1},\ g_{j_k}^{t_k} = i\}_{k=1}^{M_i} \tag{1}$$
where $N_f$ is the number of tracked faces and persons, and $f_i$ / $p_i$ is the $i$th tracked face/person, which contains the locations of the face/person in chronological order ($t_k < t_{k+1}$), has a length of $M_i$, and shares the same tracking index ($g_{j_k}^{t_k} = i$).
We also denote $S_{f_i}$ and $S_{p_i}$ as, respectively, the image sequences of a tracked face and person, $f_i$ and $p_i$, extracted from the scene sequence $S$. The emotional expression in the video $V$ is mostly affected by the most significant face $\bar{F}$, which appears more often and is larger than the other faces, and the corresponding person $\bar{P}$, defined as follows:
$$\bar{F} = \mathrm{mode}(F), \qquad \bar{P} = \mathrm{select}_{g_{\bar{P}} = g_{\bar{F}}}(P) \tag{2}$$
where $g_{\bar{P}}$ and $g_{\bar{F}}$ are the tracking indices of $\bar{P}$ and $\bar{F}$, respectively.
The goal of our method is to classify the image sequences $S_{\bar{F}}$ and $S_{\bar{P}}$ of the dominant tracked face $\bar{F}$ and corresponding person $\bar{P}$ to determine which emotion is present in the video $V$. The classification result is denoted by a classification label $c \in \{0, \dots, 6\}$ corresponding to the seven basic emotions: anger, disgust, fear, happiness, neutrality, sadness, and surprise.

3.2. Proposed System

An overview of our proposed system is shown in Figure 2. The system attempts to classify a video clip in the wild according to seven categorical emotions, namely anger, disgust, fear, happiness, neutrality, sadness, and surprise.
The key to this study is context-aware emotion recognition in video clips. The expression of the key face in a video clip signifies the emotion that the system will apply to that clip. The contextual features from the person region are used to improve the performance of the system when the key face is small and/or occluded. Our proposed model exploits the context-aware feature map to classify emotions into seven basic categories.
First, from an input video clip, our system effectively locates the most important tracked face $\bar{F}$ and corresponding tracked person $\bar{P}$ using the Tracking and FaceVoting modules. These are considered the most significant characteristics to help our system classify emotional expression.
Second, the face context feature map is extracted from the significant face $\bar{F}$ and person $\bar{P}$ using the face feature extraction and context feature extraction models. The face feature extraction model is based on conventional models and uses pre-trained weights based on the AffectNet [16] and RAF-DB [15] datasets. The context feature extraction model is VGG16 [17], with pre-trained weights from ImageNet.
The context spatiotemporal LSTM-3DCNN model uses LSTM [37] and 3DCNN [36] to exploit the spatiotemporal correlation of the face context feature map and fine-tune the face feature extraction model. Its scheme is “FaceContext+LSTM+Conv3D+Classification”, and it helps our system learn the feature map more deeply.
Moreover, we propose the context temporal-pyramid model based on the temporal-pyramid scheme instead of LSTM and 3DCNN. The face context feature map can be enhanced by the temporal-pyramid scheme as well as statistical operators (mean, max, and min). It exploits the long-term dependencies in all time-steps from the face context feature map. Our system applies categorical cross-entropy loss for training on the seven basic emotion classes for every video emotion model.
Finally, we fuse the classification features from all models to achieve the best accuracy in emotion classification. We propose the best selection ensemble and compare it to average fusion and joint fine-tuning fusion [10]. The best selection ensemble finds the best combination of models heuristically, given a specific first model. It attempts to find an unused model that helps the current combination achieve the best accuracy with a smaller number of models to prevent over-fitting.

3.3. Face and Person Tracking

For the tracking module, we propose a tracking algorithm based on a tracking-by-detection scheme [38] and Hungarian matching method [39] to return the tracked faces F along with the corresponding tracked persons P from the scene sequence S .

3.3.1. Tracking Database of Tracked Faces and Persons

It is assumed that there are tracked faces $F = \{f_i\}_{i=1}^{N_t}$ and corresponding tracked persons $P = \{p_i\}_{i=1}^{N_t}$ at time $t$, where $N_t$ is the number of tracked faces and persons, and $f_i$ (or $p_i$) is the location sequence of a tracked face (or person), as defined in Equation (1). Let $D = \{d_i\}_{i=1}^{N_t}$ be the tracking database containing appearance and motion observations.
Our algorithm uses the HSV color histogram and the face features to record appearance observations. The last face size and location of a tracked face record the motion observations. Each element $d_i = (d_i^{hsv}, d_i^{enc}, d_i^{pos}, d_i^{size}) \in D$ is calculated as follows:
$$d_i^{hsv} = \{H_{hsv}(S_{f_i}^{j})\}_{j=M_i-k+1}^{M_i}, \quad d_i^{enc} = \{G_{vggface2}(S_{f_i}^{j})\}_{j=M_i-k+1}^{M_i}, \quad d_i^{pos} = (x, y)_{f_i}^{M_i}, \quad d_i^{size} = (w, h)_{f_i}^{M_i} \tag{3}$$
where $d_i^{hsv}$ comprises the HSV color histograms of the last $k$ face images $S_{f_i}^{j}$ of the tracked face $f_i$; $M_i$ is the number of faces in $f_i$; and the operator $H(\cdot)$ is used to generate 100 bin values of a 2D histogram using the H and S channels for color, and 20 bin values of a 1D histogram using the V channel for brightness, as mentioned in [40]. $d_i^{enc}$ comprises the face encoding features of the last $k$ face images, which are extracted from the model $G$ that uses pre-trained weights from VGGFace2 [41]. $d_i^{pos}$ and $d_i^{size}$ are, respectively, the last position and size of the tracked face $f_i$.
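The appearance observation in Equation (3) can be reproduced with standard tools. Below is a minimal sketch, assuming OpenCV for the histograms and a generic `face_encoder` callable standing in for the VGGFace2 model $G$; the 10 × 10 split of the 100 H–S bins and the history length k are our assumptions.

```python
import cv2
import numpy as np

def hsv_observation(face_bgr):
    """H-S 2D histogram (10 x 10 = 100 bins) plus V 1D histogram (20 bins)
    as the appearance observation of one face crop (bin layout assumed)."""
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    hist_hs = cv2.calcHist([hsv], [0, 1], None, [10, 10], [0, 180, 0, 256])  # hue-saturation
    hist_v = cv2.calcHist([hsv], [2], None, [20], [0, 256])                  # brightness
    hist = np.concatenate([hist_hs.flatten(), hist_v.flatten()])
    return hist / (hist.sum() + 1e-8)  # normalize so histograms are comparable

def update_track_observations(entry, face_bgr, face_encoder, k=5):
    """Append the newest appearance observation of a track, keeping the last k faces."""
    entry.setdefault("hsv", []).append(hsv_observation(face_bgr))
    entry.setdefault("enc", []).append(face_encoder(face_bgr))  # stand-in for G_vggface2
    entry["hsv"], entry["enc"] = entry["hsv"][-k:], entry["enc"][-k:]
    return entry
```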

3.3.2. Face and Person Candidates

For every scene $s^t \in S$, our algorithm uses the Tiny Face Detector [42] to extract face candidates. This is a robust detector that finds small faces with high efficiency. We also use SSD detection [43] trained on the VOC dataset to detect person candidates. For every face candidate, we find the person candidate that yields the smallest intersection-over-union (IoU) score. If this is not possible, the whole scene is used as the person region.
Let $c_{f_j}^{t} = (x_{f_j}^{t}, y_{f_j}^{t}, w_{f_j}^{t}, h_{f_j}^{t}) \in C_F$ and $c_{p_j}^{t} = (x_{p_j}^{t}, y_{p_j}^{t}, w_{p_j}^{t}, h_{p_j}^{t}) \in C_P$ be, respectively, the face and person candidates in the scene $s^t$, where $(x_{f_j}^{t}, y_{f_j}^{t})$ / $(x_{p_j}^{t}, y_{p_j}^{t})$ is the face/person center and $(w_{f_j}^{t}, h_{f_j}^{t})$ / $(w_{p_j}^{t}, h_{p_j}^{t})$ is the face/person size. We need to extract the appearance and motion observations of the face candidates. Let $O_F = \{o_j\}$ be the appearance and motion observations of the face candidates $C_F$; then, every element $o_j = (o_j^{hsv}, o_j^{enc}, o_j^{pos}, o_j^{size})$ is computed as follows:
$$o_j^{hsv} = H_{hsv}(S_{c_{f_j}^{t}}), \quad o_j^{enc} = G_{vggface2}(S_{c_{f_j}^{t}}), \quad o_j^{pos} = (x, y)_{c_{f_j}^{t}}, \quad o_j^{size} = (w, h)_{c_{f_j}^{t}} \tag{4}$$
where $S_{c_{f_j}^{t}}$ is the corresponding image of $c_{f_j}^{t}$, the operator $H(\cdot)$ is used to extract the HSV color histogram, and the pre-trained VGGFace2 model $G$ is used to compute the face-encoding features.

3.3.3. Face and Person Matching

Let $M^v$ be the cost matrix of the observation $v \in \{hsv, enc, pos, size\}$ between the face candidates $C_F$ and the tracked faces $F$. We use a Euclidean distance operator $E(\cdot)$ to calculate every element $M_{ij}^{v} \in M^v$ as follows:
$$M_{ij}^{v} = \begin{cases} E(\bar{d}_i^{v}, o_j^{v}) & \text{if } E(\bar{d}_i^{v}, o_j^{v}) \le T^v \\ \infty & \text{otherwise} \end{cases} \tag{5}$$
where $\bar{d}_i^{v}$ is the mean of $d_i^{v}$, $T^v$ is the valid threshold of observation $v$ (determined experimentally), $i$ is the face index in $F$, and $j$ is the candidate index in $C_F$ and $C_P$.
The total cost matrix $M$ is the weighted sum of the $M^v$, with every element $M_{ij}$ calculated as follows:
$$M_{ij} = \sum_{v} w^v M_{ij}^{v}, \qquad M_{ij}^{v} \in M^v \tag{6}$$
where $w^v$ is the weighting term of the observation $v$ and the sum is taken over the elements other than $\infty$.
Our algorithm uses the Hungarian matching method [39] to find the optimal solution in which each tracking candidate $c_{f_j}^{t}$ (or $c_{p_j}^{t}$) is assigned to at most one tracked object $f_i$ (or $p_i$) and each tracked object $f_i$ (or $p_i$) is assigned to at most one tracking candidate $c_{f_j}^{t}$ (or $c_{p_j}^{t}$), as follows:
$$\sum_{i}\sum_{j} M_{ij} X_{ij} \rightarrow \min \tag{7}$$
where $X$ is a Boolean matrix with $X_{ij} = 1$ if the tracking candidate $c_{f_j}^{t}$ (or $c_{p_j}^{t}$) is assigned to the tracked object $f_i$ (or $p_i$).
Then, we compute the tracking indices $g_j^{t}$ to assign the $j$th tracking candidate $c_{f_j}^{t}$ (or $c_{p_j}^{t}$) to the tracked objects $F$ (or $P$) as follows:
$$g_j^{t} = \begin{cases} i & \text{if } X_{ij} = 1 \text{ and } \forall v,\ M_{ij}^{v} \neq \infty \\ \emptyset & \text{otherwise} \end{cases} \tag{8}$$
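A minimal sketch of the assignment step in Equations (5)–(8), using `scipy.optimize.linear_sum_assignment` (an implementation of the Hungarian method). The weights and thresholds passed in are placeholders, not the values used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_candidates(tracked_obs, candidate_obs, weights, thresholds):
    """tracked_obs / candidate_obs: dicts mapping 'hsv', 'enc', 'pos', 'size'
    to arrays of shape (N, d) / (M, d); tracked_obs holds the per-track means.
    Returns g[j]: the matched track index for candidate j, or None for a new track."""
    N = next(iter(tracked_obs.values())).shape[0]
    M = next(iter(candidate_obs.values())).shape[0]
    cost = np.zeros((N, M))
    valid = np.ones((N, M), dtype=bool)
    for v in ("hsv", "enc", "pos", "size"):
        d = np.linalg.norm(tracked_obs[v][:, None, :] - candidate_obs[v][None, :, :], axis=2)
        valid &= d <= thresholds[v]     # Equation (5): pairs above the threshold are invalid
        cost += weights[v] * d          # Equation (6): weighted sum over observations
    cost[~valid] = 1e9                  # finite stand-in for the infinite cost

    rows, cols = linear_sum_assignment(cost)  # Equation (7): optimal one-to-one assignment
    g = [None] * M
    for i, j in zip(rows, cols):
        if valid[i, j]:                 # Equation (8): reject assignments with invalid costs
            g[j] = i
    return g
```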

3.3.4. Face and Person Update

For $g_j^{t} = i$, the tracking candidate $c_{f_j}^{t}$ (or $c_{p_j}^{t}$) is assigned to the tracked object $f_i$ (or $p_i$) as follows:
$$f_i = f_i \oplus c_j, \qquad d_i^{v} = d_i^{v} \oplus o_j^{v} \ \text{for } v \in \{hsv, enc\}, \qquad d_i^{v} = o_j^{v} \ \text{for } v \in \{pos, size\} \tag{9}$$
where the operator $\oplus$ is used to insert an element into the last position of an array.
Otherwise, for $g_j^{t} = \emptyset$, the candidate $c_{f_j}^{t}$ (or $c_{p_j}^{t}$) is a new tracking object to be inserted into the set of tracked objects $F$ (or $P$) as follows:
$$F = F \cup \{c_j\}, \qquad D = D \cup \{o_j^{v}\}, \ v \in \{hsv, enc, pos, size\} \tag{10}$$

3.4. Face Voting

For the FaceVoting module, the system votes on the most significant face, i.e., the one that has the largest influence on human emotional perception. Therefore, the inputs are the tracked faces $F$ and tracked persons $P$. The outputs are the most significant tracked face $\bar{F}$ and the corresponding person $\bar{P}$, which are used in the emotion classification.
The most important tracked face is the face that occurs more often and more clearly than the other tracked faces. It is assessed through the frequency of occurrence, face size, and face probability. Given the tracked faces $F = \{f_i\}_{i=1}^{N_f}$ and persons $P = \{p_i\}_{i=1}^{N_f}$, the weighted terms of the frequency of occurrence, face size, and face probability of each tracked face $f_i$ and tracked person $p_i$ are computed as follows:
$$w_{freq}^{i} = \frac{M_i}{T}, \qquad w_{size}^{i} = \frac{\sum_{j=1}^{M_i} w_j^i \times h_j^i}{M_i \times W \times H}, \qquad w_{prob}^{i} = \frac{\sum_{j=1}^{M_i} p_j^i}{M_i} \tag{11}$$
where $W$, $H$, and $T$ are, respectively, the size and length of the scene sequence $S$, and $(w_j^i, h_j^i)$ and $p_j^i$ are, respectively, the size and detection probability of the $j$th face in the tracked face $f_i$.
The overall weight of each tracked face $f_i$ and tracked person $p_i$ is calculated as follows:
$$w^i = c_{freq} w_{freq}^{i} + c_{size} w_{size}^{i} + c_{prob} w_{prob}^{i} \tag{12}$$
where $c_x$, $x \in \{freq, size, prob\}$, is a constant term used to adjust the priority of the frequency of occurrence, face size, and face probability features in the face voting process.
The significant tracked face $\bar{F}$ and corresponding tracked person $\bar{P}$ are those whose weight reaches the maximum value:
$$i_{max} = \arg\max_{i} \ w^i, \qquad \bar{F} = f_{i_{max}}, \qquad \bar{P} = p_{i_{max}} \tag{13}$$
From there, we extract the face images $S_{\bar{F}}$ and person images $S_{\bar{P}}$ based on the tracked face $\bar{F}$ and tracked person $\bar{P}$, respectively.
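A minimal sketch of the voting in Equations (11)–(13); the priority constants `c_freq`, `c_size`, and `c_prob` are illustrative defaults, not the values used in the paper.

```python
import numpy as np

def vote_main_face(tracks, scene_w, scene_h, scene_len,
                   c_freq=1.0, c_size=1.0, c_prob=1.0):
    """tracks: one dict per tracked face with 'sizes' = [(w, h), ...] and
    'probs' = [detection probability, ...]. Returns the index of the main track."""
    scores = []
    for trk in tracks:
        m = len(trk["probs"])
        w_freq = m / scene_len                                              # Equation (11)
        w_size = sum(w * h for w, h in trk["sizes"]) / (m * scene_w * scene_h)
        w_prob = sum(trk["probs"]) / m
        scores.append(c_freq * w_freq + c_size * w_size + c_prob * w_prob)  # Equation (12)
    return int(np.argmax(scores))                                           # Equation (13)
```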

3.5. Face and Context Feature Extraction

The Face and Context Feature Extraction module produces face and context features and probabilities for each of the seven emotions from the face and person regions using the face and context feature extraction models shown in Figure 3.
Let $M_{face}$ be the face feature model, which is built on conventional base networks such as Resnet [18], SEnet [44], Xception [45], Nasnet mobile [46], Densenet [47], Inception Resnet [48], VGG Face 1 [49], VGG Face 2 [41], and ImageNet [50]. The model receives a face image $X_{face}$ and returns the predicted emotion probabilities $\hat{Y}_{face}^{p} \in \mathbb{R}^7$ and the feature vector $\hat{Y}_{face}^{f} \in \mathbb{R}^K$ as follows:
$$\hat{Y}_{face}^{p}, \hat{Y}_{face}^{f} = M_{face}(X_{face}) \tag{14}$$
where $K$ is the feature size and $\hat{Y}_{face}^{p}$ is the one-hot encoding vector used to determine the emotion label $c$ by $c = \arg\max \hat{Y}_{face}^{p}$.
In this study, we trained $M_{face}$ on the AffectNet dataset [16] and fine-tuned it on the RAF-DB dataset [15] with the categorical cross-entropy (CCE) loss as follows:
$$L_{CCE} = -\sum_{c \in C} Y_{face}^{p} \log \hat{Y}_{face}^{p} \tag{15}$$
where $c$ is the emotion label in the set of seven basic emotions $C$.
Similarly, let $M_{ctx}$ be the context feature model, which extracts the context feature vector $Y_{ctx}$ from the person image $X_{person}$ as follows:
$$Y_{ctx} = M_{ctx}(X_{person}) \tag{16}$$
where the context feature extraction model $M_{ctx}$ is built on the VGG16 model [17] with weights pre-trained on ImageNet.
Formally, we want the face feature model $M_{face}$ and the context model $M_{ctx}$ to follow the distribution below:
$$p(c \mid X_{face}, X_{person}) = p(c \mid \hat{Y}_{face}^{f}, \hat{Y}_{face}^{p}, Y_{ctx}) \tag{17}$$
The context around a person’s region is used to improve the performance of our model when the tracked face is very small or occluded. By extracting the feature vector with a model trained on ImageNet, we exploit the image diversity in ImageNet, and integrate this information into the face feature vector to identify correlations among the face and context characteristics and the emotion probability vector.
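The two extraction models can be assembled in Keras. The sketch below uses a ResNet50 backbone for $M_{face}$ and VGG16 for $M_{ctx}$; the ImageNet weights are stand-ins, since the paper pre-trains the face model on AffectNet and RAF-DB, and the pooled VGG16 output here is 512-dimensional rather than the 2048-dimensional context feature reported in Section 6.5.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50, VGG16

def build_face_model(num_emotions=7):
    """M_face: face crop -> (emotion probabilities, feature vector), Equation (14)."""
    backbone = ResNet50(include_top=False, weights="imagenet",
                        input_shape=(224, 224, 3), pooling="avg")
    feat = backbone.output                                           # Y_face^f (2048-d)
    prob = layers.Dense(num_emotions, activation="softmax")(feat)    # Y_face^p, trained with CCE
    return Model(backbone.input, [prob, feat], name="M_face")

def build_context_model():
    """M_ctx: frozen VGG16 feature extractor for the person region, Equation (16)."""
    backbone = VGG16(include_top=False, weights="imagenet",
                     input_shape=(224, 224, 3), pooling="avg")
    backbone.trainable = False
    return Model(backbone.input, backbone.output, name="M_ctx")      # Y_ctx (512-d here)
```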

4. Network Architectures

4.1. Context Spatiotemporal LSTM-3DCNN Model

Overview. The context spatiotemporal LSTM-3DCNN model shown in Figure 4 incorporates the face and context feature blocks $M_{face}$ and $M_{ctx}$, the LSTM block $M_{LSTM}$, the 3DCNN block $M_{3dcnn}$, and the classification block $M_{clas}$. Our proposed model uses the face and context feature blocks $M_{face}$ and $M_{ctx}$ to extract the face and context feature vectors. Use of the context feature vector helps improve the accuracy of our model in difficult cases such as those with occluded or small faces. Next, the LSTM block $M_{LSTM}$ exploits the temporal correlation among the feature vectors and normalizes the information to a fixed-length spatiotemporal feature map, where the first axis is the temporal dimension and the second and third axes are the spatial dimensions. The 3DCNN block $M_{3dcnn}$ learns spatiotemporal information from the spatiotemporal feature map to produce the high-level emotional features. From there, the classification block $M_{clas}$ classifies the emotion as one of the seven basic categories.
The context feature vectors play an important role in performance improvement. They deal with the difficulties in emotion recognition when faces are occluded and small. Moreover, they integrate contextual features such as body posture, the visual scene, and social situations to explain human emotion instead of relying only on facial cues.
Implementation Details. Given the significant tracked face $S_{\bar{F}}$ and corresponding person $S_{\bar{P}}$ in the input image sequences, the module applies random temporal sampling to transform the input image sequences into sequences with a fixed length of $K$ as follows:
$$\{X_{face}^{t}\}_{t=1}^{K} = \mathrm{TemporalSampling}(S_{\bar{F}}), \qquad \{X_{person}^{t}\}_{t=1}^{K} = \mathrm{TemporalSampling}(S_{\bar{P}}) \tag{18}$$
where $K$ is the size of the sampling operator, with a value of 32.
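A small sketch of the random temporal sampling in Equation (18), assuming the clip is held as an array of frames; sampling with replacement when the clip has fewer than K frames is our assumption.

```python
import numpy as np

def temporal_sampling(frames, k=32, rng=None):
    """Randomly pick k frames in temporal order from a clip of arbitrary length.
    frames: array of shape (T, H, W, C). Returns an array of shape (k, H, W, C)."""
    rng = rng or np.random.default_rng()
    t = len(frames)
    idx = np.sort(rng.choice(t, size=k, replace=t < k))  # keep chronological order
    return np.asarray(frames)[idx]
```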
The network uses the face and context feature blocks $M_{face}$ and $M_{ctx}$ to transform every input face image $X_{face}^{t}$ and person image $X_{person}^{t}$ at time step $t = \overline{1, K}$ in the input sequences. The outputs are the face probability vector $Y_{face}^{p,t}$, the face feature vector $Y_{face}^{f,t}$, and the context feature vector $Y_{ctx}^{t}$:
$$Y_{face}^{p,t}, Y_{face}^{f,t} = M_{face}(X_{face}^{t}), \qquad Y_{ctx}^{t} = M_{ctx}(X_{person}^{t}) \tag{19}$$
Finally, they are combined to form the overall face context feature vector $Y_{face\_ctx}^{t}$ as follows:
$$Y_{face\_ctx}^{t} = \mathrm{concat}(Y_{face}^{p,t}, Y_{face}^{f,t}, Y_{ctx}^{t}) = \mathrm{concat}(M_{face}(X_{face}^{t}), M_{ctx}(X_{person}^{t})) \tag{20}$$
where the concat operator is used to combine feature vectors.
We freeze the first layers of $M_{face}$ with the exception of the end layers, which have roles in feature extraction and emotion classification. This helps the face feature model $M_{face}$ not only transfer knowledge from the model pre-trained on large-scale image emotion recognition datasets [15,16] but also be fine-tuned again at the frame level on the video emotion dataset [33]. For $M_{ctx}$, we freeze all layers and only extract the context feature learned from the model pre-trained on the large-scale ImageNet dataset [50].
To exploit the long-term dependencies, the LSTM block $M_{LSTM}$ consists of stacked LSTM layers, where each LSTM memory cell at layer $i$ computes the hidden and cell state vectors $h_i^t, c_i^t$ from the current face context feature $Y_{face\_ctx}^{t}$ (for layer 0) or the hidden vector $h_{i-1}^{t}$ (for layer $i > 0$), together with the hidden and cell states of the previous LSTM memory cell $h_i^{t-1}, c_i^{t-1}$:
$$h_i^t, c_i^t = \begin{cases} \mathrm{LSTM}(Y_{face\_ctx}^{t}, h_0^{t-1}, c_0^{t-1}), & i = 0 \\ \mathrm{LSTM}(h_{i-1}^{t}, h_i^{t-1}, c_i^{t-1}), & 0 < i < L \end{cases} \tag{21}$$
where $L$ is the number of LSTM layers in $M_{LSTM}$. In this study, we chose $L = 2$ by experiment.
Next, we use Dense and Reshape layers to normalize every hidden state vector $h_{L-1}^{t}$ at the last LSTM layer to a specific length and produce the spatiotemporal feature map $Y_{lstm} \in \mathbb{R}^{K \times S \times S}$ of the face and context feature vectors $Y_{face\_ctx}^{t}$ as follows:
$$Y_{lstm} = \mathrm{Reshape}_{S \times S}(\mathrm{Dense}_{L'}(h_{L-1}^{t})) \tag{22}$$
where $L' = S \times S$ is the fixed length used to normalize the hidden state vector, $S \times S$ is the (width, height) used to reshape it, and $K$ is the number of time steps.
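A Keras sketch of the LSTM block described by Equations (21) and (22): two stacked LSTM layers over the per-frame face context features, a time-distributed Dense of size S × S, and a reshape into the K × S × S map. The hidden size, S = 16, and the per-frame feature length (taken from Section 6.5) are assumptions.

```python
from tensorflow.keras import layers, Model

def build_lstm_block(k=32, feat_dim=4117, s=16, hidden=512):
    """M_LSTM: (K, feat_dim) face-context sequence -> (K, S, S, 1) spatiotemporal map."""
    inp = layers.Input(shape=(k, feat_dim))
    x = layers.LSTM(hidden, return_sequences=True)(inp)    # layer 0 of Equation (21)
    x = layers.LSTM(hidden, return_sequences=True)(x)      # layer 1 (L = 2)
    x = layers.TimeDistributed(layers.Dense(s * s))(x)     # normalize each h_t, Equation (22)
    out = layers.Reshape((k, s, s, 1))(x)                  # add a channel axis for Conv3D
    return Model(inp, out, name="M_LSTM")
```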
To perform a deeper analysis of the spatiotemporal feature map $Y_{lstm}$ in the temporal domain and ensure spatial coherence in the feature domain, the 3DCNN block $M_{3dcnn}$ is used to produce the high-level emotional feature $Y_{3dcnn}$ from $Y_{lstm}$ as follows:
$$Y_{3dcnn} = M_{3dcnn}(Y_{lstm}) \tag{23}$$
where $M_{3dcnn}$ consists of four 3D convolutional blocks and a global average pooling layer. Every 3D convolutional block has 3D convolutional layers, followed by a batch normalization layer and a rectified linear unit (ReLU), with a 3D max pooling layer at the end. The number of 3D convolutional layers and the number of filters in each block are, respectively, (2, 64), (2, 128), (3, 256), and (4, 512). All 3D convolutional layers use $3 \times 3 \times 3$ filters and a padding of 1. The 3D max pooling layers have a size of $2 \times 2 \times 2$.
Lastly, $M_{clas}$ receives the emotion feature $Y_{3dcnn}$ and classifies it into the seven basic emotions. $M_{clas}$ comprises two fully connected layers followed by ReLU layers and dropout layers. At the end of the block, a softmax layer is used to output the emotion probability vector $Y_{emotion}$ as follows:
$$Y_{emotion} = M_{clas}(Y_{3dcnn}) \tag{24}$$
Finally, we use the categorical cross-entropy loss for emotion classification as follows:
$$\mathrm{CCE}(Y_{gt\_emotion}, Y_{emotion}) = -\sum_{i=1}^{C} Y_{i,gt\_emotion} \log Y_{i,emotion} \tag{25}$$
where $Y_{gt\_emotion}$ is the ground truth, $Y_{emotion}$ is the prediction result of the model, and $C$ is the number of emotion labels.
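A Keras sketch of $M_{3dcnn}$ and $M_{clas}$ following the block description above: four Conv3D stages with (layers, filters) of (2, 64), (2, 128), (3, 256), and (4, 512), 3 × 3 × 3 kernels, batch normalization, ReLU, and 2 × 2 × 2 max pooling, then global average pooling and two fully connected layers with dropout and a softmax. The classifier widths and the dropout rate are assumptions.

```python
from tensorflow.keras import layers, Model

def build_3dcnn_classifier(k=32, s=16, num_emotions=7):
    """M_3dcnn + M_clas: (K, S, S, 1) spatiotemporal map -> emotion probabilities."""
    inp = layers.Input(shape=(k, s, s, 1))
    x = inp
    for n_layers, filters in [(2, 64), (2, 128), (3, 256), (4, 512)]:
        for _ in range(n_layers):
            x = layers.Conv3D(filters, 3, padding="same")(x)  # 3x3x3 kernels, padding 1
            x = layers.BatchNormalization()(x)
            x = layers.ReLU()(x)
        x = layers.MaxPooling3D(pool_size=2)(x)               # 2x2x2 max pooling
    x = layers.GlobalAveragePooling3D()(x)                    # Y_3dcnn, Equation (23)
    for units in (512, 256):                                  # assumed layer widths
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(0.5)(x)
    out = layers.Dense(num_emotions, activation="softmax")(x) # Equation (24)
    return Model(inp, out, name="M_3dcnn_clas")
```

Training this model with Keras's categorical cross-entropy loss corresponds to Equation (25).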

4.2. Context Temporal-Pyramid Model

Overview. The context temporal-pyramid model illustrated in Figure 5 comprises the face and context blocks $M_{face}$ and $M_{ctx}$, the temporal-pyramid block $M_{stp}$, and the classification block $M_{clas}$. The model has some similarities to the context spatiotemporal model in that it uses $M_{face}$ and $M_{ctx}$ for face context feature extraction and $M_{clas}$ for emotion classification. However, the model exploits the face context features during all time steps to capture long-term temporal dependencies. The temporal-pyramid block $M_{stp}$ feeds all face context features from the feature extraction block to the statistical aggregation models $M_{stat}^{k}$, $k \in \{l_1, l_2, \dots, l_P\}$, where $P$ is the number of statistical aggregation models. Each $M_{stat}^{k}$ builds the temporal pyramid features at level $k$: it divides the time steps into $2^k$ feature sub-sequences and aggregates the face and context features using the mean operator and the face probabilities using the max, mean, and min operators. From there, the temporal pyramid features at all pyramid levels are combined into the context temporal-pyramid feature to exploit the long-term dependencies of the face context features over all time steps. Finally, emotion classification is performed by $M_{clas}$.
Implementation Details. The context temporal-pyramid model uses all faces and persons in $S_{\bar{F}}$ and $S_{\bar{P}}$ to extract the face context features $\{Y_{face\_ctx}^{t} = (Y_{face}^{p,t}, Y_{face}^{f,t}, Y_{ctx}^{t})\}_{t=1}^{M_{\bar{F}}}$ using Equation (19), where $M_{\bar{F}}$ is the number of elements in $S_{\bar{F}}$ and $S_{\bar{P}}$.
Then, the temporal-pyramid block $M_{stp}$ exploits the long-term temporal dependencies over all time steps through a temporal pyramid scheme. It consists of the statistical aggregation models $M_{stat}^{k}$, $k \in \{l_1, l_2, \dots, l_P\}$, where every statistical aggregation $M_{stat}^{k}$ transforms the face context features $Y_{face\_ctx}^{t}$ into temporal pyramid features at level $k$, as shown in Figure 6.
The $M_{stat}^{k}$ model divides all time steps into $2^k$ time-step sub-sequences $TS_j^k = [(j-1) n_k, \; j \cdot n_k]$, $j = 1, \dots, 2^k$, where $n_k = n / 2^k$ and $n = |S_{\bar{F}}|$ is the number of faces and persons in $S_{\bar{F}}$ and $S_{\bar{P}}$. The face context features $\{Y_{face\_ctx}^{i}\}_{i \in TS_j^k}$ in every time-step sub-sequence $j$ are transformed into the temporal pyramid feature $V_k^j$ using the mean operator for the face and context features and the min, mean, and max operators for the face probabilities, as follows:
$$V_{k,face^f}^{j} = \operatorname*{mean}_{i \in TS_j^k} Y_{face}^{f,i}, \qquad V_{k,ctx}^{j} = \operatorname*{mean}_{i \in TS_j^k} Y_{ctx}^{i}, \qquad V_{k,face^p}^{j} = \mathrm{concat}\Big(\min_{i \in TS_j^k} Y_{face}^{p,i}, \operatorname*{mean}_{i \in TS_j^k} Y_{face}^{p,i}, \max_{i \in TS_j^k} Y_{face}^{p,i}\Big) \tag{26}$$
where the mean, max, and min operators aggregate the mean, max, and min of the vector values and the concat operator combines all values into a vector. The correlation between $V_{k,face^f}^{j}$, $V_{k,ctx}^{j}$, and $V_{k,face^p}^{j}$ is exploited at every time-step sub-sequence $j$ in pyramid level $k$, which helps our model learn the long-term temporal dependencies.
The temporal pyramid feature $V_k^j$ is a combination of $V_{k,face^f}^{j}$, $V_{k,ctx}^{j}$, and $V_{k,face^p}^{j}$ as follows:
$$V_k^j = \mathrm{concat}(V_{k,face^f}^{j}, V_{k,ctx}^{j}, V_{k,face^p}^{j}) \tag{27}$$
Finally, the temporal-pyramid block $M_{stp}$ incorporates the $M_{stat}^{k}$ models at pyramid levels $k \in \{l_1, l_2, \dots, l_P\}$, $l_i \in [0, 3]$, to produce the context temporal-pyramid feature $Y_{stp}$ as follows:
$$Y_{stp} = \operatorname*{concat}_{k \in \{l_1, l_2, \dots, l_P\}} M_{stat}^{k}\big(\{Y_{face}^{p,t}, Y_{face}^{f,t}, Y_{ctx}^{t}\}_{t=1}^{M_{\bar{F}}}\big) \tag{28}$$
From there, we use the classification block $M_{clas}$, whose architecture is similar to that of $M_{clas}$ in the context spatiotemporal model, for emotion classification to produce the emotion probabilities, as shown in Equation (24). We also apply the categorical cross-entropy loss to train the model, as shown in Equation (25).
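A NumPy sketch of the aggregation in Equations (26)–(28), assuming the per-frame outputs are stored as arrays; at level k the frame axis is split into 2^k roughly equal chunks.

```python
import numpy as np

def stat_aggregate(face_feat, ctx_feat, face_prob, level):
    """One M_stat^k: split the frame axis into 2**level chunks and aggregate each chunk.
    face_feat: (T, Df), ctx_feat: (T, Dc), face_prob: (T, 7)."""
    parts = []
    for cf, cc, cp in zip(np.array_split(face_feat, 2 ** level),
                          np.array_split(ctx_feat, 2 ** level),
                          np.array_split(face_prob, 2 ** level)):
        parts.append(np.concatenate([
            cf.mean(axis=0),                                  # mean of face features
            cc.mean(axis=0),                                  # mean of context features
            cp.min(axis=0), cp.mean(axis=0), cp.max(axis=0),  # min/mean/max of probabilities
        ]))                                                   # Equation (27)
    return np.concatenate(parts)

def temporal_pyramid(face_feat, ctx_feat, face_prob, levels=(0, 1, 2, 3)):
    """M_stp: concatenate the aggregated features of all pyramid levels, Equation (28)."""
    return np.concatenate([stat_aggregate(face_feat, ctx_feat, face_prob, k)
                           for k in levels])
```

With the feature sizes reported in Section 6.5 (a 2048-dimensional face feature, a 2048-dimensional context feature, and 3 × 7 probability statistics per chunk), levels {0,1,2,3} produce 15 sub-sequences and a 15 × 4117 = 61,755-dimensional vector, matching the figure given there.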

5. Best Selection Ensemble

The main idea of an ensemble method is to identify the best combination of the given models to solve the same tasks. The main advantage of ensemble methods is that they effectively use the large margin classifiers to reduce variance error and bias error [51].
We propose a best selection ensemble method to combine multi-modality information to address the bias error problem. Our method applies the heuristic principle to find the best combination of the given models at every selection step. We search all model combinations with the given first model and keep the shortest combination to prevent over-fitting.
First, it is assumed that the outputs of the models $\{M_k\}_{k=1}^{K}$ are the predicted emotion probability vectors $\{\hat{Y}_k\}_{k=1}^{K}$, defined as follows:
$$\hat{Y}_k = \{\hat{y}_{k,i}\}_{i \in [1, N_E]}, \qquad \sum_{i \in [1, N_E]} \hat{y}_{k,i} = 1 \tag{29}$$
where $K$ and $N_E = 7$ are the numbers of models and emotion labels, respectively. The average fusion $F_{avg}$ of $\{M_k\}_{k=1}^{K}$ is calculated as follows:
$$F_{avg}(\{M_k\}) = \Big\{\frac{\sum_{k=1}^{K} \hat{y}_{k,i}}{K}\Big\}_{i \in [1, N_E]} \tag{30}$$
The multi-modal score is calculated based on the accuracy metric between the fusion result and the ground truth, as follows:
$$\mathrm{Score}_{acc}(\{M_k\}) = \mathrm{acc}(F_{avg}(\{M_k\}), Y_{gt}) \tag{31}$$
where the acc operator is used to calculate the accuracy of the prediction compared to the ground truth.
Without loss of generality, we assume that $\{M_k\}_{k=1}^{K}$ is sorted in descending order of accuracy, i.e., $\mathrm{Score}_{acc}(M_i) > \mathrm{Score}_{acc}(M_j)$ if $i < j$.
Let $Select$ be the model-combination set. Initially, $Select$ is empty. We sequentially choose the first model $M_{s_1}$ from left to right in $\{M_k\}_{k=1}^{K}$ and attempt to find the optimal list of model selections corresponding to the given model $M_{s_1}$.
Let $Open = \{M_k\}_{k=1}^{K} \setminus \{M_{s_1}\}$ be the open list of models that can be selected for processing. $Closed = \{M_{s_1}\}$ is then the closed list of the selected models.
At step $l$, it is assumed that $Open = \{M_k\}_{k=1}^{K} \setminus \{M_{s_j}\}_{j=1}^{l}$ and $Closed = \{M_{s_j}\}_{j=1}^{l}$. We select the first model $M_v$ from left to right in $Open$ such that the following is satisfied:
$$\mathrm{acc}(F_{avg}(Closed \cup \{M_v\}), Y_{gt}) > \mathrm{acc}(F_{avg}(Closed), Y_{gt}) \quad \text{and} \quad |Closed \cup \{M_v\}| \le T \tag{32}$$
where $T = 5$ is the threshold on the number of models in $Closed$ (determined experimentally).
If such a model $M_v$ cannot be found, we stop at this step and update the $Select$ list as follows:
$$Select = Select \cup \{Closed\} \tag{33}$$
We then repeat the process, selecting a different first model at the next position. Finally, we choose the model combination in $Select$ with the highest accuracy and the smallest number of models.
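A sketch of the greedy procedure; `probs` holds each model's validation probability outputs (assumed sorted by descending single-model accuracy) and `y_true` the ground-truth labels. The helper names are ours, but the control flow follows Equations (29)–(33).

```python
import numpy as np

def fused_accuracy(prob_list, y_true):
    """Score_acc: accuracy of the average fusion of the given probability outputs."""
    fused = np.mean(prob_list, axis=0)                    # Equation (30)
    return np.mean(np.argmax(fused, axis=1) == y_true)    # Equation (31)

def best_selection_ensemble(probs, y_true, max_models=5):
    """probs: list of (N, 7) arrays, one per model. Returns the indices of the best combination."""
    select = []
    for first in range(len(probs)):                       # try every model as the first element
        closed = [first]
        improved = True
        while improved and len(closed) < max_models:      # |Closed| <= T, Equation (32)
            improved = False
            base = fused_accuracy([probs[i] for i in closed], y_true)
            for v in range(len(probs)):
                if v in closed:
                    continue
                if fused_accuracy([probs[i] for i in closed + [v]], y_true) > base:
                    closed.append(v)                      # first improving model, left to right
                    improved = True
                    break
        select.append(closed)                             # Equation (33)
    # Highest accuracy first, then the smallest number of models.
    return max(select, key=lambda c: (fused_accuracy([probs[i] for i in c], y_true), -len(c)))
```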

6. Experiments and Discussion

6.1. Datasets

6.1.1. Image-Based Emotion Recognition in the Wild

In this work, we chose suitable datasets for training the face feature extraction model. The datasets must cover in-the-wild environments with many unconstrained conditions, such as occlusion, poses, and illumination. AffectNet [16] and RAF-DB [15] are by far the largest datasets satisfying these criteria. The images in the datasets are collected from the Internet based on emotion-related keywords. Emotion labels are annotated by experts to guarantee reliability.
AffectNet [16] contains two data groups, a manual and an automatic group, with more than 1,000,000 images that are labeled with 10 emotion categories as well as dimensional emotion (valence and arousal). We used only images in the manual group belonging to the seven basic emotion categories (anger, disgust, fear, happiness, neutrality, sadness, and surprise). Thus, we used 283,901 images for training and 3500 images for validation. The data distributions of the training and validation sets are shown in Figure 7.
The RAF-DB dataset [15] consists of about 30,000 facial images in the basic and compound emotion groups, which were taken under in-the-wild conditions with illumination changes, uncontrolled poses, and occlusion. In this study, we chose 12,271 images for training and 3068 images for validation, all of which were from the basic emotion group. The data distributions of the training and validation sets are shown in Figure 8.

6.1.2. Video-Based Emotion Recognition in the Wild

For facial emotion recognition in video clips, we used the AFEW dataset [33] to evaluate our study. The video clips in the dataset are collected from movies and TV shows under uncontrolled environments in terms of occlusion, illumination, and head poses. Each video clip was chosen based on its label, which contains emotion-related keywords corresponding to the emotion illustrated by the main subject. Use of this dataset helped us to address the problem of temporal facial expressions in the wild.
From the AFEW dataset, we used 773 video clips for training and 383 video clips for validation with labels corresponding to the seven basic emotion categories (anger, disgust, fear, happiness, neutrality, sadness, and surprise). The distribution of this dataset is shown in Figure 9.
Table 1 shows the datasets used for image and video emotion recognition in this study:

6.2. Environmental Setup, Evaluation Metrics, and Experimental Setup

Environment. We used Python 3.7 with TensorFlow 2.1 and Keras to develop our program. Our experiments were conducted on a desktop PC with an Intel Core i7 8700 CPU, 64 GB of RAM, and two Nvidia GeForce GTX 1080 Ti graphics cards with 11 GB of memory each.
Evaluation Metrics. We used the accuracy (Acc.) and F1 score as the quantitative measurements in this study. We also used the average (MeanAcc.) and standard deviation (StdAcc.) of the accuracy values on the main diagonal of the normalized confusion matrix $M_{norm}$ to evaluate the performance results, as in [15]. These metrics are calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \quad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{MeanAcc.} = \frac{\sum_{i=1}^{n} g_{i,i}}{n}, \quad \mathrm{StdAcc.} = \sqrt{\frac{\sum_{i=1}^{n} \big(g_{i,i} - \mathrm{MeanAcc.}\big)^2}{n}} \tag{34}$$
where $g_{i,i} \in \mathrm{diag}(M_{norm})$ is the $i$th diagonal value of the normalized confusion matrix $M_{norm}$; $n$ is the size of $M_{norm}$; and $TP$, $TN$, $FP$, and $FN$ are, respectively, the numbers of true positives, true negatives, false positives, and false negatives. Precision is the ratio of correctly predicted positive samples to all predicted positive samples. Recall is the ratio of correctly predicted positive samples to all true positive samples. They are calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{35}$$
The accuracy metric measures the ratio of correctly predicted samples to all samples; it ranges from 0 (worst) to 1 (best). It allows us to assess the performance of our model given that the data distribution is almost symmetric.
The F1 score can be used to more precisely evaluate the model in the case of an uneven class distribution, as it takes both FP and FN into account. The F1 score is a weighted average of precision and recall and ranges from 0 (worst) to 1 (best). In this study, due to the multi-class classification problem, we report the F1 score as the weighted average of the F1 scores of the individual emotion labels, with weighting based on the number of labels.
Moreover, we also used MeanAcc. and StdAcc. to consider emotion evaluation under in-the-wild conditions with an imbalanced class distribution. These can be used in place of the accuracy metric, which is sensitive to bias under an uneven class distribution.
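The four metrics can be computed from predictions with scikit-learn and NumPy; a minimal sketch (the weighted F1 follows the per-label weighting described above):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def evaluate(y_true, y_pred, num_classes=7):
    """Return Acc., weighted F1, MeanAcc., and StdAcc. computed from the diagonal
    of the row-normalized confusion matrix."""
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="weighted")     # weighted by label support
    cm = confusion_matrix(y_true, y_pred, labels=range(num_classes)).astype(float)
    cm_norm = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)  # normalize each true-label row
    diag = np.diag(cm_norm)
    return acc, f1, diag.mean(), diag.std()
```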
Experimental Setup. In this study, we conducted four experiments corresponding to: (1) the face and context feature extraction models; (2) the context spatiotemporal models; (3) the context temporal-pyramid model; and (4) the ensemble methods. Finally, we compared our results to related works on the AFEW dataset for video emotion recognition.

6.3. Experiments on Face and Context Feature Extraction Models

Overview. We used six conventional architectures to build a face feature extraction model to integrate into the facial emotion recognition models for video clips shown in Table 2. They consisted of Resnet 50 [18], Senet 50 [44], Densenet 201 [47], Nasnet mobile [46], Xception [45], and Inception Resnet [48]. Besides training from scratch, weights pre-trained on VGG-Face 2 [41], VGG-Face 1 [49], and ImageNet [50] were also used for transfer learning to leverage the knowledge from these huge facial and visual object datasets. For the context feature extraction model, we used the VGG16 model [17] with weights pre-trained on ImageNet [50] to extract the context feature around the person region.
Training Details. We first trained the models on the AffectNet dataset. We then fine-tuned the models on the RAF-DB dataset. Because the training and testing distributions differed, we applied a sampling technique to ensure that every emotion label in every batch had the same number of elements. Every image was resized to 224 × 224, and data augmentation was applied with random rotation, flip, center crop, and translation. The batch size was 8. The optimizer was Adam [52] with a learning rate of 0.001 and plateau reduction when training on the AffectNet dataset. For fine-tuning on RAF-DB, we used SGD [53] with a learning rate within the range of 0.0004 to 0.0001 using a cosine annealing schedule.
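The cosine annealing schedule used for fine-tuning can be reproduced with a Keras callback; the sketch below decays the learning rate from 0.0004 to 0.0001, with the cycle length of 40 epochs being an assumption.

```python
import math
import tensorflow as tf

def cosine_annealing(epoch, lr, lr_max=4e-4, lr_min=1e-4, period=40):
    """Cosine decay from lr_max to lr_min over `period` epochs (the current lr is unused)."""
    t = (epoch % period) / period
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# Usage with model.fit (model and datasets assumed to be defined elsewhere):
lr_callback = tf.keras.callbacks.LearningRateScheduler(cosine_annealing)
# model.fit(train_ds, validation_data=val_ds, epochs=40, callbacks=[lr_callback])
```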
Results and Discussion. Table 2 shows the performance measurements of the face feature extraction models on the validation sets of the AffectNet and RAF-DB datasets.
As shown in Table 2, the performance results on AffectNet could be separated into three distinct groups, which are, in descending order: Group 1 (Inception Resnet, ResNet 50, and Senet 50), Group 2 (Densenet 201 and Nasnet mobile), and Group 3 (Xception). Group 1 had all three metrics greater than 61%, with the highest accuracy of 62.51%, F1 score of 62.41%, and MeanAcc. of 62.51% achieved by the Inception Resnet model.
After fine-tuning on the RAF-DB dataset using the weights from pre-training on the AffectNet dataset, the ResNet 50 model achieved the best performance, with an accuracy of 87.22%, F1 score of 87.38%, and MeanAcc. of 82.45%. This MeanAcc. was greater than that of the DLP-CNN baseline on the RAF-DB dataset (74.20%) [15]. Therefore, we chose this model as the face feature extraction model for video emotion recognition.
Figure 10 shows the confusion matrices of the ResNet 50 model on the validation sets of the AffectNet and RAF-DB datasets. For the results of the ResNet 50 model on AffectNet, the happiness emotion label achieved the highest accuracy of 85%, while the remaining emotion labels showed similar accuracies, ranging from 53.6% to 63%. After fine-tuning on the RAF-DB dataset, the accuracies of the images labeled neutrality, sadness, surprise, and anger were significantly enhanced, ranging from 83.9% to 88.3% and nearly reaching the accuracy of 91.8% for the happiness label. The disgust and fear categories showed the lowest accuracy. In addition, the values of MeanAcc. ± Std on AffectNet and RAF-DB were 61.57% ± 10.78% and 82.44% ± 9.20%, respectively.

6.4. Experiments on Spatiotemporal Models

Overview. The spatiotemporal models receive input from the face and person sequences and consist of four blocks, namely the feature extraction, LSTM, 3DCNN, and classification blocks. In this experiment, we built three different models from the spatiotemporal approach, as shown in Table 3.
Model 1, “Spatiotemporal Model + Fix-Feature,” used only the face sequence with the ResNet 50 face feature extraction model. The ResNet 50 model used weights that were pre-trained on the AffectNet and RAF-DB datasets, as discussed above. Moreover, all layers of the ResNet 50 model were frozen. Thus, the face feature extraction model was not fine-tuned during video-based emotion recognition training. Model 2, “Spatiotemporal Model + NonFix-Feature,” was different from the first model in that only three blocks of the ResNet 50 model were frozen, and the feature block of the ResNet 50 model was fine-tuned. Model 3, “Spatiotemporal Model + NonFix-Feature + Context,” expanded the context feature of Model 2 using input from both face and person sequences and used the pre-trained weights from the VGG16 model on ImageNet for context feature extraction.
Training Details. We trained our models on the AFEW dataset. During video batch sampling, every emotion label appeared with the same frequency to overcome the uneven class distribution and the differences in distribution between the training and validation sets. We randomly extracted 32 frames per video clip in the training phase. For the validation phase, we averaged five predictions per clip, each obtained by randomly extracting 32 frames. For data augmentation, we transformed the whole face and person sequence by resizing to 224 × 224 and applying a random horizontal flip, spatial rotation of ±15°, and scaling of ±20%. Training was done using the SGD optimizer with early stopping at 40 epochs, an initial learning rate of 0.0004, and a reduction in the learning rate on the plateau.
Results and Discussion. Table 3 illustrates the performance results of the spatiotemporal models on the validation set of the AFEW dataset.
Model 1, with fixed face features due to the frozen face feature extraction, obtained an accuracy of 51.70%, F1 score of 54.17%, and MeanAcc. of 46.51%. Through fine-tuning of the feature block of the ResNet 50 model, Model 2 showed an enhancement in accuracy of 0.52%, F1 score of 2.09%, and MeanAcc. of 0.82%. Due to the use of context from the person region, Model 3 showed further significant increases of 1.82%, 2.52%, and 1.65% in accuracy, F1 score, and MeanAcc., respectively. Model 3 also showed the highest accuracy of 54.05%, F1 score of 50.78%, and MeanAcc. ± Std of 48.98% ± 32.28% among all the spatiotemporal models.
Figure 11 shows the confusion matrices of the three models using the spatiotemporal approach. By fine-tuning the feature block of the face feature extraction model, Model 2 obtained an accuracy of 73% for the neutrality emotion label, compared to 58.7% for Model 1. Furthermore, Model 3, which took context into account, showed an enhancement in the accuracy of the sadness and surprise emotion labels, with accuracies of 62.3% and 32.6%, respectively. These figures represent increases of 13.1% and 17.2% for the two emotion labels compared to the second model. Moreover, Model 3 showed a MeanAcc. ± Std of 48.98% ± 32.28%, which is greater than the 47.33% ± 31.73% of Model 2 and the 46.51% ± 34.38% of Model 1.

6.5. Experiments on Temporal-Pyramid Models

Overview. For the temporal-pyramid model, we performed an ablation study on the context and scale factors, as shown in Table 4. For the context factor, Models 4–6 without context used only the ResNet 50 face feature extraction model, while Models 7–9 with context combined the face and context features from the face and person sequences. Given a face frame, a model without context produced one vector with a length of 2048 for the face feature and 21 probability outputs corresponding to the seven emotion labels and three statistical operators (min, mean, and max). The context feature vector from the VGG 16 model using pre-trained weights from ImageNet had a length of 2048. Therefore, the models without/with context had feature lengths of 2069/4117 per frame.
For the level factor, we conducted experiments on three groups of levels, {3}, {4}, and {0,1,2,3}. At level $k$, all processing frames are divided into $2^k$ sub-sequences, and each sub-sequence is aggregated over its interval by the mean operator for the face and context features and by three operators (min, mean, and max) for the emotion probability outputs. For example, for level group {0,1,2,3}, we divided all face and context frames in a video clip into 1, 2, 4, and 8 sub-sequences at Levels 0, 1, 2, and 3, respectively. In total, 15 sub-sequences were used to capture the emotion based on statistical information from whole frames or small chunks of frames with various lengths. Therefore, the lengths of the temporal-pyramid features without and with context are 15 × 2069 = 31,035 and 15 × 4117 = 61,755, respectively.
Training Details. In the training phase, we created temporal-pyramid features at level groups {3}, {4}, and {0,1,2,3}, with and without context, using the face feature model and context feature model: the Resnet 50 model with pre-trained weights from AffectNet and RAF-DB, and the VGG-16 model with pre-trained weights from ImageNet. For every level group, we used data augmentation to produce 10 instances of every video clip. Data augmentation was applied to all frames with the same transformations: resizing to 224 × 224, random horizontal flip, scaling, and rotation. When sampling to obtain a minibatch, we randomly chose eight video clips, each with one of its ten augmented instances, such that the result satisfied the balance between emotion labels in a minibatch. We used the same training configuration as in the training phase of the spatiotemporal models, with the SGD optimizer, an initial learning rate of 0.0004, and learning rate reduction on the plateau.
Results and Discussion. Table 4 depicts the experimental results of the temporal-pyramid models with adjustment of context and level factors.
For the level factor, Models 4–6 were set to level groups {3}, {4}, and {0,1,2,3}, respectively, without context. The three models had the same accuracy of 55.87%. However, Model 6, which uses multiple levels, gave better results in terms of F1 score and MeanAcc. (54.06% and 51.85%, compared to 52.76% and 51.21% for Model 4 and 52.51% and 51.23% for Model 5). Similarly, Model 9, which also uses multiple levels, showed an F1 score and MeanAcc. of 56.50% and 54.25%, superior to the results of Models 7 and 8. Therefore, the level factor mainly affected the F1 score and MeanAcc.
For the context factor, Models 7–9, respectively, increased accuracy, F1 score, and MeanAcc. by 0.26%, 1.86%, and 1.15%; 0.52%, 1.49%, and 0.94%; and 0.78%, 2.44%, and 2.41% over the corresponding values of Models 4–6. Within the same level group, the context factor thus helped Models 7–9 outperform Models 4–6. Moreover, Model 9, with multiple levels and context, showed the largest gains in F1 score and MeanAcc., achieving the highest values of 56.50% and 54.25%, respectively.
Figure 12 shows the confusion matrices of Models 6, 9, and 8. For the same level group {0,1,2,3}, Model 9, which uses context, improved the accuracy on the difficult emotion labels disgust, fear, and surprise to 30.0%, 43.5%, and 43.5%, respectively, compared to 20.0%, 32.6%, and 23.9% for Model 6. The MeanAcc. ± Std of Model 9 was 54.25% ± 16.63%, which is greater than the 51.85% ± 25.98% of Model 6 (without context) and the 52.18% ± 27.47% of Model 8 (with only the single level {4}).

6.6. Experiments on Best Selection Ensemble

Overview. We conducted ensemble experiments with three approaches to exploit the complementarity and redundancy among the models, as shown in Table 5. The first was the average fusion method, which combines the seven emotion probability outputs of all models with the average operator. The second was the multi-modal joint late-fusion method [10]. In this approach, we divided the models into two groups, a spatiotemporal group (Models 1–3) and a temporal-pyramid group (Models 4–9). The average operator merges the probability outputs of the models within each group into a probability-merged layer, which is followed by a dense layer and a softmax layer for classification into the seven emotion categories; supervising each group's output separately helps preserve the accuracy of each branch. In addition, the model has a joint branch that concatenates the probability-merged layers of the two groups to produce the final emotion outputs.
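A minimal sketch of this joint late-fusion head is given below (PyTorch-style). Layer sizes, tensor shapes, and class names are illustrative assumptions, and each output would be trained with its own softmax/cross-entropy loss as in the joint fine-tuning scheme of [10].

```python
import torch
import torch.nn as nn

class JointLateFusion(nn.Module):
    """Merge per-group emotion probabilities and jointly classify (illustrative sketch)."""

    def __init__(self, n_classes=7):
        super().__init__()
        self.spat_head = nn.Linear(n_classes, n_classes)       # spatiotemporal branch
        self.pyr_head = nn.Linear(n_classes, n_classes)        # temporal-pyramid branch
        self.joint_head = nn.Linear(2 * n_classes, n_classes)  # joint branch on concatenation

    def forward(self, spat_probs, pyr_probs):
        # spat_probs: (B, n_spat_models, 7), pyr_probs: (B, n_pyr_models, 7)
        spat_merged = spat_probs.mean(dim=1)    # probability-merged layer (average operator)
        pyr_merged = pyr_probs.mean(dim=1)
        spat_out = self.spat_head(spat_merged)  # per-branch outputs keep each group accurate
        pyr_out = self.pyr_head(pyr_merged)
        joint = self.joint_head(torch.cat([spat_merged, pyr_merged], dim=1))
        return spat_out, pyr_out, joint         # each followed by softmax / cross-entropy
```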
The last approach was the best selection ensemble method. It chooses one of the models as the first element and then repeatedly adds one of the remaining models, averaging its probability outputs with those of the current combination, whenever doing so increases the combination's accuracy. The process ends when no unused model improves the accuracy of the combination or all models have been selected.
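One way to realize this greedy search is sketched below (NumPy). The names `probs` and `labels` are hypothetical placeholders for the models' validation probabilities and the ground-truth labels, and starting from the single best model is one reasonable reading of the procedure rather than a confirmed detail.

```python
import numpy as np

def best_selection(probs, labels):
    """Greedy best selection ensemble over model probability outputs (illustrative sketch).

    probs  : (n_models, n_clips, 7) validation probabilities per model
    labels : (n_clips,) ground-truth emotion indices
    Returns the indices of the selected models.
    """
    acc = lambda p: (p.argmax(axis=1) == labels).mean()
    remaining = list(range(len(probs)))
    selected = [max(remaining, key=lambda m: acc(probs[m]))]   # start from the best single model
    remaining.remove(selected[0])
    best_acc = acc(probs[selected[0]])
    while remaining:
        # try adding each unused model; keep the one whose averaged output improves accuracy most
        cand = max(remaining, key=lambda m: acc(probs[selected + [m]].mean(axis=0)))
        cand_acc = acc(probs[selected + [cand]].mean(axis=0))
        if cand_acc <= best_acc:
            break                                              # no remaining model helps; stop
        selected.append(cand)
        remaining.remove(cand)
        best_acc = cand_acc
    return selected
```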
Results and Discussion. The results of our experiments on the average fusion, multi-modal joint late-fusion, and best selection ensembles are shown in Table 5.
The best selection method showed the highest accuracy and F1 score, 59.79% and 58.48%, respectively, representing increases of 2.09% in accuracy and 3.48% in F1 score over the average fusion method and of 1.30% and 1.08% over the multi-modal joint late-fusion method. The combination that gave the best scores in the best selection method consisted of Models 3, 6, 7, and 9.
As shown by the confusion matrices in Figure 13, the best selection method gave the highest MeanAcc. of 56.24% with the smallest Std of 23.26%, compared to the average fusion and multi-modal joint late-fusion methods. Moreover, this method improved performance on the more difficult emotion labels: disgust, 25.0%; fear, 39.1%; and sadness, 37.0%.

6.7. Discussion and Comparison with Related Works

Discussion. Figure 14 presents the results of the three experiments on the AFEW validation set. First, the context factor played an important role in enhancing the performance of spatiotemporal Model 3 over Models 1 and 2 using the same approach, as well as of temporal-pyramid Models 7–9 over the corresponding Models 4–6. This finding confirms that context is key to interpreting facial expressions and assessing the emotional state of a person [54], especially in cases in which the facial region is small or blurry.
Second, the use of multiple levels {0,1,2,3} in the temporal-pyramid models provided more robust features than a single level ({3} or {4}). For instance, Model 6 gave better results than Models 4 and 5, and the performance of Model 9 was better than that of Models 7 and 8. This shows that dividing the time course of a facial expression hierarchically creates robust features for capturing human emotions under in-the-wild conditions, such as unclear temporal borders and multiple apexes arising from spontaneous expressions.
Finally, when integrating multiple modalities, the best selection ensemble method achieved better results than the average fusion and multi-modal joint late-fusion methods.
The main advantage of our ensemble method is that it identifies the best combination from a large number of models obtained through the multi-modal approach as well as from instances produced by multiple training runs. The average operator can also be replaced or supplemented by other operators, such as skew, min, max, and median, or by combinations of operators. In this study, the average and median operators were more useful than the others.
Comparison with related works. The accuracy measurements of our proposed methods and related methods on the AFEW validation set are shown in Table 6.
Our spatiotemporal method outperforms recently reported methods using the same approach, by around 0.14% compared with Li et al. [63]. Recently, Kumar et al. [66] used multi-level attention with an unsupervised approach based on iterative training between student and teacher models. Their method achieved an accuracy of 55.17%, which is lower than the 56.66% of our temporal-pyramid method. To compare fusion and ensemble methods, we considered related studies that use multiple modalities based on visual and geometric information of facial expressions. Our ensemble method achieved the highest accuracy of 59.79%, better than the best related result of 57.43% reported by Fan et al. [61].

7. Conclusions

In this study, we built an emotion recognition system that tracks the main face in a video clip and recognizes its facial expression. We propose a face-person tracking and voting module that helps our system detect the main face and person in a video clip for emotion recognition. The tracking algorithm is based on a tracking-by-detection scheme with robust appearance observations to propose facial and human regions, while the voting module uses the frequency of occurrence, size, and face detection probability to determine the main human and face sequences. Next, our emotion recognition models detect facial expressions through two main approaches, the spatiotemporal approach and the temporal-pyramid approach. Finally, the best selection ensemble method selects the best combination of models from among the many trained models to predict the facial expression in a video clip. Compared with previous results on the AFEW dataset, our work shows improvements in each of these approaches.
In the spatiotemporal models, we use 2D CNN facial and context blocks followed by an LSTM block and a 3D CNN block to exploit the spatiotemporal coherence of the facial and context features and the facial emotion probabilities. The context factor is a significant key that increases the accuracy of our model from 52.22% to 54.05%. Moreover, we achieved an accuracy better than that reported by related studies on the AFEW validation set.
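As a rough illustration only, the arrangement of these blocks might be sketched as follows (PyTorch-style). The actual layer sizes and the tensor reshaping between the LSTM and 3D CNN blocks are not specified in this section, so every dimension and the reshaping step below are assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalSketch(nn.Module):
    """Simplified 2D-CNN features -> LSTM -> 3D CNN -> classifier pipeline (illustrative sizes)."""

    def __init__(self, feat_dim=4096, hidden=512, n_classes=7):
        super().__init__()
        # per-frame face + context features are assumed to come from 2D CNN backbones
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(16, n_classes)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) concatenated per-frame face and context features
        seq, _ = self.lstm(frame_feats)          # (B, T, hidden)
        vol = seq.unsqueeze(1).unsqueeze(-1)     # (B, 1, T, hidden, 1): illustrative reshaping
        pooled = self.conv3d(vol).flatten(1)     # (B, 16)
        return self.fc(pooled)                   # (B, 7) emotion logits
```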
For the temporal-pyramid models, we applied data augmentation to the facial and context regions and extracted facial and person features and face emotion probabilities from every frame of the video clip. Using temporal-pyramid strategies, we created robust hierarchical features that feed a simple neural network for facial expression classification. Our method exploits the high correlation of features in the temporal domain. Owing to the improvements mentioned above, we achieved an accuracy of 56.66% on the validation set, better than the accuracies of related studies using a single model with the same approach.
Finally, we propose the best selection ensemble to select a suitable combination of models from a large number of model instances produced during training with different hyper-parameters, level groups, and context configurations. Our ensemble method achieved an accuracy of 59.79%, which is better than the average fusion and multi-modal joint late-fusion methods as well as related studies on the AFEW validation set.
In future work, we will apply a multi-level attention mechanism to highlight the spatiotemporal correlations between emotion features over time. In addition, we will use a graph convolutional network to express the movement of facial action units, which should help our system better classify human expressions.

Author Contributions

Conceptualization, N.-T.D. and S.-H.K.; Funding acquisition, S.-H.K., G.-S.L. and H.-J.Y.; Investigation, N.-T.D.; Methodology, N.-T.D.; Project administration, S.-H.K., G.-S.L., H.-J.Y. and S.Y.; Supervision, S.-H.K.; Validation, N.-T.D.; Writing—original draft, N.-T.D.; and Writing—review and editing, N.-T.D. and S.-H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2020R1A4A1019191) and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1D1A3A03000947).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Corneanu, C.A.; Simon, M.O.; Cohn, J.F.; Guerrero, S.E. Survey on RGB, 3D, Thermal, and Multimodal Approaches for Facial Expression Recognition: History, Trends, and Affect-Related Applications. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1548–1568. [Google Scholar] [CrossRef] [Green Version]
  2. Bänziger, T.; Grandjean, D.; Scherer, K.R. Emotion recognition from expressions in face, voice, and body: The Multimodal Emotion Recognition Test (MERT). Emotion 2009, 9, 691–704. [Google Scholar] [CrossRef] [Green Version]
  3. Martinez, B.; Valstar, M.F. Advances, Challenges, and Opportunities in Automatic Facial Expression Recognition. In Advances in Face Detection and Facial Image Analysis; Springer International Publishing: Cham, Switzerland, 2016; pp. 63–100. [Google Scholar] [CrossRef]
  4. Wieser, M.J.; Brosch, T. Faces in Context: A Review and Systematization of Contextual Influences on Affective Face Processing. Front. Psychol. 2012, 3. [Google Scholar] [CrossRef] [Green Version]
  5. Koelstra, S.; Pantic, M.; Patras, I. A Dynamic Texture-Based Approach to Recognition of Facial Actions and Their Temporal Models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1940–1954. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Bernin, A.; Müller, L.; Ghose, S.; Grecos, C.; Wang, Q.; Jettke, R.; von Luck, K.; Vogt, F. Automatic Classification and Shift Detection of Facial Expressions in Event-Aware Smart Environments. In Proceedings of the 11th PErvasive Technologies Related to Assistive Environments Conference, Corfu, Greece, 26–29 June 2018; pp. 194–201. [Google Scholar] [CrossRef]
  7. Ekman, P.; Friesen, W.V. Facial Action Coding System: A Technique for the Measurement of Facial Movement; Consulting Psychologists Press: Palo Alto, CA, USA, 1978. [Google Scholar]
  8. Kotsia, I.; Pitas, I. Facial Expression Recognition in Image Sequences Using Geometric Deformation Features and Support Vector Machines. IEEE Trans. Image Process. 2007, 16, 172–187. [Google Scholar] [CrossRef] [PubMed]
  9. Pantic, M.; Patras, I. Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2006, 36, 433–449. [Google Scholar] [CrossRef] [Green Version]
  10. Jung, H.; Lee, S.; Yim, J.; Park, S.; Kim, J. Joint Fine-Tuning in Deep Neural Networks for Facial Expression Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2983–2991. [Google Scholar] [CrossRef]
  11. Shan, C.; Gong, S.; McOwan, P.W. Facial expression recognition based on Local Binary Patterns: A comprehensive study. Image Vis. Comput. 2009, 27, 803–816. [Google Scholar] [CrossRef] [Green Version]
  12. Liu, W.; Song, C.; Wang, Y.; Jia, L. Facial expression recognition based on Gabor features and sparse representation. In Proceedings of the 2012 12th International Conference on Control, Automation, Robotics and Vision, ICARCV 2012, Guangzhou, China, 5–7 December 2012; pp. 1402–1406. [Google Scholar] [CrossRef]
  13. Dhall, A.; Asthana, A.; Goecke, R.; Gedeon, T. Emotion recognition using PHOG and LPQ features. In Proceedings of the 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, FG 2011, Santa Barbara, CA, USA, 21–25 March 2011; pp. 878–883. [Google Scholar] [CrossRef]
  14. Li, S.; Deng, W. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2020. [Google Scholar] [CrossRef] [Green Version]
  15. Li, S.; Deng, W.; Du, J. Reliable Crowdsourcing and Deep Locality-Preserving Learning for Expression Recognition in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2584–2593. [Google Scholar] [CrossRef]
  16. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Trans. Affect. Comput. 2019, 10, 18–31. [Google Scholar] [CrossRef] [Green Version]
  17. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015—Conference Track Proceedings, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef] [Green Version]
  19. Ding, H.; Zhou, S.K.; Chellappa, R. FaceNet2ExpNet: Regularizing a Deep Face Recognition Net for Expression Recognition. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017. [Google Scholar] [CrossRef] [Green Version]
  20. Klaeser, A.; Marszalek, M.; Schmid, C. A Spatio-Temporal Descriptor Based on 3D-Gradients. In Proceedings of the British Machine Vision Conference 2008, Leeds, UK, 1–4 September 2008; pp. 99.1–99.10. [Google Scholar] [CrossRef] [Green Version]
  21. Zhao, G.; Pietikäinen, M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2007. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  22. Sikka, K.; Wu, T.; Susskind, J.; Bartlett, M. Exploring bag of words architectures in the facial expression domain. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012. [Google Scholar] [CrossRef]
  23. Jain, S.; Hu, C.; Aggarwal, J.K. Facial expression recognition with temporal modeling of shapes. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 1642–1649. [Google Scholar] [CrossRef] [Green Version]
  24. Wang, Z.; Wang, S.; Ji, Q. Capturing Complex Spatio-temporal Relations among Facial Muscles for Facial Expression Recognition. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3422–3429. [Google Scholar] [CrossRef] [Green Version]
  25. Lu, C.; Zheng, W.; Li, C.; Tang, C.; Liu, S.; Yan, S.; Zong, Y. Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA, 16–20 October 2018; ACM: New York, NY, USA, 2018; Volume III, pp. 646–652. [Google Scholar] [CrossRef]
  26. Liu, C.; Tang, T.; Lv, K.; Wang, M. Multi-Feature Based Emotion Recognition for Video Clips. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA, 16–20 October 2018; ACM: New York, NY, USA, 2018; pp. 630–634. [Google Scholar] [CrossRef] [Green Version]
  27. Kim, D.H.; Lee, M.K.; Choi, D.Y.; Song, B.C. Multi-modal emotion recognition using semi-supervised learning and multiple neural networks in the wild. In Proceedings of the 19th ACM International Conference on Multimodal Interaction—ICMI 2017, Glasgow, UK, 13–17 November 2017; ACM Press: New York, New York, USA, 2017; pp. 529–535. [Google Scholar] [CrossRef]
  28. Knyazev, B.; Shvetsov, R.; Efremova, N.; Kuharenko, A. Leveraging Large Face Recognition Data for Emotion Classification. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 692–696. [Google Scholar] [CrossRef]
  29. Bargal, S.A.; Barsoum, E.; Ferrer, C.C.; Zhang, C. Emotion recognition in the wild from videos using images. In Proceedings of the 18th ACM International Conference on Multimodal Interaction—ICMI 2016, Tokyo Japan, 12–16 November 2016; ACM Press: New York, NY, USA, 2016; pp. 433–436. [Google Scholar] [CrossRef]
  30. Zhu, X.; Ye, S.; Zhao, L.; Dai, Z. Hybrid attention cascade network for facial expression recognition. Sensors 2021, 21, 2003. [Google Scholar] [CrossRef]
  31. Shi, J.; Liu, C.; Ishi, C.T.; Ishiguro, H. Skeleton-based emotion recognition based on two-stream self-attention enhanced spatial-temporal graph convolutional network. Sensors 2021, 21, 205. [Google Scholar] [CrossRef] [PubMed]
  32. Anvarjon, T.; Mustaqeem; Kwon, S. Deep-net: A lightweight cnn-based speech emotion recognition system using deep frequency features. Sensors 2020, 20, 5212. [Google Scholar] [CrossRef] [PubMed]
  33. Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Collecting large, richly annotated facial-expression databases from movies. IEEE Multimed. 2012, 19, 34–41. [Google Scholar] [CrossRef] [Green Version]
  34. Dhall, A.; Roland Goecke, S.G.; Gedeon, T. EmotiW 2019: Automatic Emotion, Engagement and Cohesion Prediction Tasks. In Proceedings of the ACM International Conference on Mutimodal Interaction, Suzhou, China, 14–18 October 2019. [Google Scholar]
  35. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 2005, 18, 602–610. [Google Scholar] [CrossRef]
  36. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; Volume 2015, pp. 4489–4497. [Google Scholar] [CrossRef] [Green Version]
  37. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  38. Kalal, Z.; Mikolajczyk, K.; Matas, J. Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1409–1422. [Google Scholar] [CrossRef] [Green Version]
  39. Kuhn, H.W. The Hungarian method for the assignment problem. In 50 Years of Integer Programming 1958–2008: From the Early Years to the State-of-the-Art; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar] [CrossRef] [Green Version]
  40. Pérez, P.; Hue, C.; Vermaak, J.; Gangnet, M. Color-Based Probabilistic Tracking. In Proceedings of the European Conference on Computer Vision, Copenhagen, Denmark, 28–31 May 2002; pp. 661–675. [Google Scholar] [CrossRef]
  41. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A Dataset for Recognising Faces across Pose and Age. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (FG), Xi’an, China, 15–19 May 2018; pp. 67–74. [Google Scholar] [CrossRef] [Green Version]
  42. Hu, P.; Ramanan, D. Finding tiny faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 951–959. [Google Scholar] [CrossRef]
  43. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Leibe, B., Matas, J., Welling, M., Sebe, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar] [CrossRef] [Green Version]
  44. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 2011–2023. [Google Scholar] [CrossRef] [Green Version]
  45. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef] [Green Version]
  46. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710. [Google Scholar] [CrossRef] [Green Version]
  47. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef] [Green Version]
  48. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI 2017, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  49. Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep Face Recognition. In Proceedings of the British Machine Vision Conference, Swansea, UK, 7–10 September 2015; pp. 41.1–41.12. [Google Scholar] [CrossRef] [Green Version]
  50. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012. [Google Scholar] [CrossRef]
  51. Rokach, L. Ensemble Methods for Classifiers. In Data Mining and Knowledge Discovery Handbook; Springer: New York, NY, USA, 2005; pp. 957–980. [Google Scholar] [CrossRef]
  52. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  53. Schaul, T.; Zhang, S.; LeCun, Y. No more pesky learning rates. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013. [Google Scholar]
  54. Barrett, L.F.; Mesquita, B.; Gendron, M. Context in emotion perception. Curr. Dir. Psychol. Sci. 2011. [Google Scholar] [CrossRef]
  55. Fan, Y.; Lu, X.; Li, D.; Liu, Y. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; ACM: New York, NY, USA, 2016; pp. 445–450. [Google Scholar] [CrossRef]
  56. Yan, J.; Zheng, W.; Cui, Z.; Tang, C.; Zhang, T.; Zong, Y.; Sun, N. Multi-clue fusion for emotion recognition in the wild. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; ACM: New York, NY, USA, 2016; pp. 458–463. [Google Scholar] [CrossRef]
  57. Vielzeuf, V.; Pateux, S.; Jurie, F. Temporal multimodal fusion for video emotion classification in the wild. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 13–17 November 2017; pp. 569–576. [Google Scholar] [CrossRef] [Green Version]
  58. Hu, P.; Cai, D.; Wang, S.; Yao, A.; Chen, Y. Learning supervised scoring ensemble for emotion recognition in the wild. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 13–17 November 2017; ACM: New York, NY, USA, 2017; pp. 553–560. [Google Scholar] [CrossRef]
  59. Kaya, H.; Gürpınar, F.; Salah, A.A. Video-based emotion recognition in the wild using deep transfer learning and score fusion. Image Vis. Comput. 2017, 65, 66–75. [Google Scholar] [CrossRef]
  60. Vielzeuf, V.; Kervadec, C.; Pateux, S.; Lechervy, A.; Jurie, F. An Occam’s Razor View on Learning Audiovisual Emotion Recognition with Small Training Sets. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA, 16–20 October 2018; ACM: New York, NY, USA, 2018; pp. 589–593. [Google Scholar] [CrossRef] [Green Version]
  61. Fan, Y.; Lam, J.C.K.; Li, V.O.K. Video-based Emotion Recognition Using Deeply-Supervised Neural Networks. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA, 16–20 October 2018; ACM: New York, NY, USA, 2018; pp. 584–588. [Google Scholar] [CrossRef]
  62. Nguyen, D.H.; Kim, S.; Lee, G.S.; Yang, H.J.; Na, I.S.; Kim, S.H. Facial Expression Recognition Using a Temporal Ensemble of Multi-level Convolutional Neural Networks. IEEE Trans. Affect. Comput. 2019, 33, 1. [Google Scholar] [CrossRef]
  63. Li, S.; Zheng, W.; Zong, Y.; Lu, C.; Tang, C.; Jiang, X.; Liu, J.; Xia, W. Bi-modality Fusion for Emotion Recognition in the Wild. In Proceedings of the 2019 International Conference on Multimodal Interaction, Jiangsu, China, 14–18 October 2019; ACM: New York, NY, USA, 2019; pp. 589–594. [Google Scholar] [CrossRef]
  64. Meng, D.; Peng, X.; Wang, K.; Qiao, Y. Frame Attention Networks for Facial Expression Recognition in Videos. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3866–3870. [Google Scholar] [CrossRef] [Green Version]
  65. Lee, J.; Kim, S.; Kim, S.; Park, J.; Sohn, K. Context-aware emotion recognition networks. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 10142–10151. [Google Scholar] [CrossRef] [Green Version]
  66. Kumar, V.; Rao, S.; Yu, L. Noisy Student Training Using Body Language Dataset Improves Facial Expression Recognition. In Computer Vision—ECCV 2020 Workshops; Bartoli, A., Fusiello, A., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 756–773. [Google Scholar] [CrossRef]
Figure 1. The three periods of facial expression representation are onset, apex, and offset. The duration of each varies, leading to unclear temporal borders. In addition, the appearance of spontaneous expressions leads to the presence of multiple apexes [6].
Figure 2. Overview of the proposed system for video emotion recognition in the wild.
Figure 3. Face and Context Feature Extraction.
Figure 4. Context Spatio-Temporal LSTM-3DCNN Model.
Figure 5. Context Temporal-Pyramid Model.
Figure 6. Context Temporal-Pyramid Features.
Figure 7. Data distribution for training and validation on AffectNet dataset [16].
Figure 8. Data distribution for training and validation on RAF-DB dataset [15].
Figure 9. Data distribution for training and validation on AFEW dataset [33].
Figure 10. Confusion matrix of ResNet 50 model on the AffectNet and RAF-DB validation sets.
Figure 11. Confusion matrix of Models 1–3 (spatiotemporal approach) on the AFEW validation set.
Figure 12. Confusion matrices of Models 6, 9, and 8 (from left to right), which use the temporal-pyramid approach on the AFEW validation set.
Figure 13. Confusion matrices of the average fusion method, multi-modal joint late-fusion method, and best selection ensemble method on the AFEW validation set.
Figure 14. Results of our proposed models on the AFEW validation set. The rectangle data points represent spatiotemporal approaches, with context (Model 3) and without context (Models 1 and 2). The circular data points represent models based on temporal-pyramid approaches, which consist of two groups: without context (Models 4–6) and with context (Models 7–9), with level groups of {3}, {4}, and {0,1,2,3}, respectively, in each group. Finally, the "plus sign" data points represent the average fusion method (Model 10), multi-modal joint late-fusion method (Model 11), and best selection method (Model 12).
Table 1. Image and video emotion recognition datasets.

| Emotion | AffectNet [16] Training | AffectNet [16] Validation | RAF-DB [15] Training | RAF-DB [15] Validation | AFEW [33] Training | AFEW [33] Validation |
|---|---|---|---|---|---|---|
| Angry | 24,882 | 500 | 705 | 162 | 133 | 64 |
| Disgust | 3,803 | 500 | 717 | 160 | 74 | 40 |
| Fear | 6,378 | 500 | 281 | 74 | 81 | 46 |
| Happy | 134,415 | 500 | 4,772 | 1,185 | 150 | 63 |
| Neutral | 74,874 | 500 | 2,524 | 680 | 144 | 63 |
| Sad | 25,459 | 500 | 1,982 | 478 | 117 | 61 |
| Surprise | 14,090 | 500 | 1,290 | 329 | 74 | 46 |
| Total | 283,901 | 3,500 | 12,271 | 3,068 | 773 | 383 |
Table 2. Performance of face feature extraction models on the AffectNet and RAF-DB validation sets.

| No | Model | Pre-Train Weight | AffectNet [16] Acc. | AffectNet [16] F1 | AffectNet [16] MeanAcc ± Std | RAF-DB [15] Acc. | RAF-DB [15] F1 | RAF-DB [15] MeanAcc ± Std |
|---|---|---|---|---|---|---|---|---|
| 1 | ResNet 50 [18] | VGGFace2 [41] | 61.57% | 61.46% | 61.57% ± 10.79% | 87.22% | 87.38% | 82.45% ± 9.20% |
| 2 | SENet 50 [44] | VGGFace 1 [49] | 61.51% | 61.50% | 61.51% ± 10.40% | 83.64% | 83.81% | 76.96% ± 11.12% |
| 3 | NASNet Mobile [46] | ImageNet [50] | 59.20% | 58.88% | 59.20% ± 13.95% | 80.74% | 81.01% | 74.05% ± 12.44% |
| 4 | DenseNet 201 [47] | ImageNet [50] | 59.31% | 58.91% | 59.31% ± 14.12% | 83.08% | 83.23% | 76.94% ± 11.31% |
| 5 | Inception-ResNet [48] | Scratch | 62.51% | 62.41% | 62.51% ± 9.63% | 81.23% | 81.79% | 77.08% ± 8.10% |
| 6 | Xception [45] | Scratch | 56.26% | 56.38% | 56.26% ± 11.18% | 80.90% | 81.03% | 74.71% ± 14.28% |
Table 3. Performance results of the spatiotemporal models on the AFEW validation set.

| No | Method | Context | Feature | Acc. | F1 | MeanAcc. ± Std |
|---|---|---|---|---|---|---|
| 1 | Spatiotemporal Model + Fix-Feature | – | Fix | 51.70% | 46.17% | 46.51% ± 34.38% |
| 2 | Spatiotemporal Model + Nonfix-Feature | – | Nonfix | 52.22% | 48.26% | 47.33% ± 31.73% |
| 3 | Spatiotemporal Model + Nonfix-Feature + Context | ✓ | Nonfix | 54.05% | 50.78% | 48.98% ± 32.28% |
Table 4. Performance results of the temporal-pyramid models on the AFEW validation set.

| No | Method | Context | Level | Acc. | F1 | MeanAcc. ± Std |
|---|---|---|---|---|---|---|
| 4 | Temporal-Pyramid Model + Level 3 | – | {3} | 55.87% | 52.76% | 51.21% ± 29.87% |
| 5 | Temporal-Pyramid Model + Level 4 | – | {4} | 55.87% | 52.51% | 51.23% ± 30.12% |
| 6 | Temporal-Pyramid Model + Level 0,1,2,3 | – | {0,1,2,3} | 55.87% | 54.06% | 51.85% ± 25.98% |
| 7 | Temporal-Pyramid Model + Level 3 + Context | ✓ | {3} | 56.14% | 54.61% | 52.35% ± 25.53% |
| 8 | Temporal-Pyramid Model + Level 4 + Context | ✓ | {4} | 56.40% | 53.99% | 52.18% ± 27.47% |
| 9 | Temporal-Pyramid Model + Level 0,1,2,3 + Context | ✓ | {0,1,2,3} | 56.66% | 56.50% | 54.25% ± 16.63% |
Table 5. Validation results of the ensemble experiments on the AFEW validation set.

| No | Method | Acc. | F1 | MeanAcc. ± Std |
|---|---|---|---|---|
| 10 | Average Fusion | 57.70% | 55.00% | 53.18% ± 29.62% |
| 11 | Multi-modal Joint Late-Fusion [10] | 58.49% | 57.40% | 54.92% ± 23.50% |
| 12 | Best Selection Ensemble | 59.79% | 58.48% | 56.24% ± 23.26% |
Table 6. Performance comparison with related studies on the AFEW validation set.

| Authors | Method | Approach | Modality | Year | Accuracy |
|---|---|---|---|---|---|
| Fan et al. [55] | CNN + LSTM | Spatiotemporal (2D+T) | Visual | 2016 | 45.43% |
| | C3D | Spatiotemporal (3D) | Visual | | 39.69% |
| Yan et al. [56] | Trajectory Features + SVM | Geometry | Geometry | 2016 | 37.37% |
| | CNN Features + Bi-directional RNN | Spatiotemporal (2D+T) | Visual | | 44.46% |
| | Fusion | Fusion | Visual + Geometry | | 49.22% |
| Vielzeuf et al. [57] | VGG-LSTM | Spatiotemporal (2D+T) | Visual | 2017 | 48.60% |
| | LSTM C3D | Spatiotemporal (3D) | Visual | | 43.20% |
| | ModDrop Fusion | Fusion | Visual | | 52.20% |
| Hu et al. [58] | Face Features + Supervised Scoring Ensemble | Frame-Level | Visual | 2017 | 44.67% |
| Knyazev et al. [28] | Face Features + STAT (min, std, mean) + SVM | Frame-Level | Visual | 2017 | 53.00% |
| | Weighted Average Score | Fusion | Visual | | 55.10% |
| Kaya et al. [59] | CNN-FUN Features + Kernel ELMPLS | Spatiotemporal (3D) | Visual | 2017 | 51.60% |
| Lu et al. [25] | VGG-Face + BLSTM | Spatiotemporal (2D+T) | Visual | 2018 | 53.91% |
| | C3D | Spatiotemporal (3D) | Visual | | 39.36% |
| | Weighted Average Fusion | Fusion | Visual | | 56.05% |
| Liu et al. [26] | VGG16 FER2013 + LSTM | Spatiotemporal (2D+T) | Visual | 2018 | 46.21% |
| | Face Features + STAT (min, std, mean) + SVM | Frame-Level | Visual | | 51.44% |
| | Landmark Euclidean Distance | Geometry | Geometry | | 39.95% |
| | Weighted Average Fusion | Fusion | Visual + Geometry | | 56.13% |
| Vielzeuf et al. [60] | Max Score Selection + Temporal Pooling | Frame-Level | Visual | 2018 | 52.20% |
| Fan et al. [61] | Deeply-Supervised CNN (DSN) | Frame-Level | Visual | 2018 | 48.04% |
| | Weighted Average Fusion | Fusion | Visual | | 57.43% |
| Duong et al. [62] | CNN Features + LSTM | Spatiotemporal (2D+T) | Visual | 2019 | 49.30% |
| Li et al. [63] | VGG-Face Features + Bi-LSTM | Spatiotemporal (2D+T) | Visual | 2019 | 53.91% |
| Meng et al. [64] | Frame Attention Networks (FAN) | Frame-Level | Visual | 2019 | 51.18% |
| Lee et al. [65] | CAER-Net | Spatiotemporal (2D+T) | Visual | 2019 | 51.68% |
| Kumar et al. [66] | Noisy Student Training + Multi-level Attention | Frame-Level | Visual | 2020 | 55.17% |
| Our method | Spatiotemporal model | Spatiotemporal (2D+T) | Visual | | 54.05% |
| | Temporal-pyramid model | Frame-Level | Visual | | 56.66% |
| | Best Selection Ensemble | Fusion | Visual | | 59.79% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

