6.2. Environmental Setup, Evaluation Metrics, and Experimental Setup
Environment. We used Python 3.7 with TensorFlow 2.1 and Keras to develop our program. Our experiments were conducted on a desktop PC with an Intel Core i7-8700 CPU, 64 GB of RAM, and two Nvidia GeForce GTX 1080 Ti graphics cards with 11 GB of memory each.
Evaluation Metrics. We used accuracy (ACC) and F1 score as the quantitative measurements in this study. We also used the average μ and standard deviation σ of the accuracy values on the main diagonal of the normalized confusion matrix M to evaluate the performance results, as in [15]. These metrics are calculated as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F_1 = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

$$\mu = \frac{1}{n} \sum_{i=1}^{n} M_{ii}, \qquad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( M_{ii} - \mu \right)^2},$$

where $M_{ii}$ is the $i$th diagonal value of the normalized confusion matrix $M$, $n$ is the size of $M$, and $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives, respectively. The precision is the ratio of correctly predicted positive samples to all predicted positive samples. The recall is the ratio of correctly predicted positive samples to all actual positive samples. They are calculated as follows:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}.$$
The accuracy metric measures the ratio of correctly predicted samples to all samples; it ranges from 0 (worst) to 1 (best). It allows us to assess the performance of our model given that the data distribution is almost symmetric.
The F1 score can be used to more precisely evaluate the model in the case of an uneven class distribution, as it takes both precision and recall into account. The F1 score is a weighted average of precision and recall and ranges from 0 (worst) to 1 (best). In this study, because of the multi-class classification problem, we report the F1 score as the weighted average of the per-label F1 scores, with weights proportional to the number of samples of each label.
Moreover, we also used μ and σ to evaluate emotion recognition under in-the-wild conditions with an imbalanced class distribution. They can be used in place of the accuracy metric, which is sensitive to bias under an uneven class distribution.
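For concreteness, the following minimal sketch (not taken from the authors' code) shows how these four measurements can be computed with NumPy and scikit-learn; y_true and y_pred are assumed to be integer label arrays for the seven emotion classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def evaluate(y_true, y_pred, n_classes=7):
    """Return accuracy, weighted F1, and the mean/std of the diagonal
    of the row-normalized confusion matrix."""
    acc = accuracy_score(y_true, y_pred)
    # Weighted F1: per-label F1 averaged with weights given by label support.
    f1 = f1_score(y_true, y_pred, average='weighted')
    # Row-normalize so that M[i, i] is the accuracy (recall) of class i.
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    m = cm / cm.sum(axis=1, keepdims=True)
    diag = np.diag(m)
    return acc, f1, diag.mean(), diag.std()
```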
Experimental Setup. In this study, we conducted four experiments corresponding to: (1) the face and context feature extraction models; (2) the context spatiotemporal models; (3) the context temporal-pyramid model; and (4) the ensemble methods. Finally, we compared our results to related works on the AFEW dataset for video emotion recognition.
6.3. Experiments on Face and Context Feature Extraction Models
Overview. We used six conventional architectures to build the face feature extraction model integrated into the facial emotion recognition models for video clips shown in Table 2: ResNet 50 [18], SENet 50 [44], DenseNet 201 [47], NASNet mobile [46], Xception [45], and Inception ResNet [48]. Besides training from scratch, weights pre-trained on VGG-Face 2 [41], VGG-Face 1 [49], and ImageNet [50] were also used for transfer learning to leverage the knowledge in these huge facial and visual object datasets. For the context feature extraction model, we used the VGG16 model [17] with weights pre-trained on ImageNet [50] to extract the context features around the person region.
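As an illustration of this setup, the sketch below builds a ResNet 50-based face feature extraction model and a frozen VGG16 context extractor with Keras. It is a simplified stand-in, not the authors' implementation: tf.keras.applications only ships ImageNet weights, so the VGG-Face 1/2 transfer described above would require loading externally released weights, and the 224 × 224 input size and dropout rate are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_EMOTIONS = 7  # basic emotion categories used in this study

def build_face_model(input_shape=(224, 224, 3)):
    # Backbone without its classification head; global average pooling
    # yields a 2048-d face feature vector per image.
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights='imagenet',
        input_shape=input_shape, pooling='avg')
    x = layers.Dropout(0.5)(backbone.output)
    out = layers.Dense(NUM_EMOTIONS, activation='softmax')(x)
    return Model(backbone.input, out, name='face_emotion_model')

# Context feature extractor: VGG16 pre-trained on ImageNet, kept frozen.
# (The pooling that produces the context vector used later in the paper
# is not reproduced here.)
context_extractor = tf.keras.applications.VGG16(
    include_top=False, weights='imagenet',
    input_shape=(224, 224, 3), pooling='avg')
context_extractor.trainable = False
```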
Training Details. We first trained the models on the AffectNet dataset and then fine-tuned them on the RAF-DB dataset. Because the training and testing distributions differed, we applied a sampling technique to ensure that every emotion label in every batch had the same number of elements. Every image was resized to a fixed input size, and data augmentation was applied with random rotation, flip, center crop, and translation. The batch size was 8. When training on the AffectNet dataset, the optimizer was Adam [52] with a learning rate of 0.001 and learning-rate reduction on plateau. For fine-tuning on RAF-DB, we used SGD [53] with a learning rate in the range of 0.0004 to 0.0001 following a cosine annealing schedule.
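A sketch of this two-stage schedule with standard Keras components is shown below; the patience, momentum, epoch count, and monitored quantity are assumptions, and the balanced sampler is only one possible way to equalize labels per batch.

```python
import math
import numpy as np
import tensorflow as tf

# Stage 1 (AffectNet): Adam with learning-rate reduction on plateau.
adam = tf.keras.optimizers.Adam(learning_rate=1e-3)
plateau = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=3, min_lr=1e-5)

# Stage 2 (RAF-DB fine-tuning): SGD with cosine annealing from 4e-4 to 1e-4.
TOTAL_EPOCHS, LR_MAX, LR_MIN = 30, 4e-4, 1e-4

def cosine_annealing(epoch, lr):
    # Cosine decay of the learning rate from LR_MAX to LR_MIN.
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (
        1 + math.cos(math.pi * epoch / TOTAL_EPOCHS))

sgd = tf.keras.optimizers.SGD(learning_rate=LR_MAX, momentum=0.9)
cosine = tf.keras.callbacks.LearningRateScheduler(cosine_annealing)

def balanced_batch_indices(labels, n_classes=7, per_class=1):
    """Indices of a batch containing the same number of images per label."""
    picks = [np.random.choice(np.where(labels == c)[0], per_class)
             for c in range(n_classes)]
    return np.concatenate(picks)
```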
Results and Discussion.
Table 2 shows the performance measurements of the face feature extraction models on the validation sets of the AffectNet and RAF-DB datasets.
As shown in Table 2, the performance results on AffectNet could be separated into three distinct groups, in descending order: Group 1 (Inception ResNet, ResNet 50, and SENet 50), Group 2 (DenseNet 201 and NASNet mobile), and Group 3 (Xception). Group 1 had all three metrics above 61%, with the Inception ResNet model achieving the highest accuracy of 62.51%, F1 score of 62.41%, and μ of 62.51%.
After fine-tuning on the RAF-DB dataset using the weights from pre-training on the AffectNet dataset, the ResNet 50 model achieved the best performance, with an accuracy of 87.22%, F1 score of 87.38%, and μ of 82.45%. This μ was greater than that of the DLP-CNN baseline on the RAF-DB dataset (74.20%) [15]. Therefore, we chose this model as the face feature extraction model for video emotion recognition.
Figure 10 shows the confusion matrices of the ResNet 50 model on the validation sets of the AffectNet and RAF-DB datasets. On AffectNet, the happiness label achieved the highest accuracy of 85%, while the remaining emotion labels showed similar accuracies, ranging from 53.6% to 63%. After fine-tuning on the RAF-DB dataset, the accuracies of the neutrality, sadness, surprise, and anger labels were significantly enhanced, reaching 83.9% to 88.3% and approaching the 91.8% accuracy of the happiness label. The disgust and fear categories showed the lowest accuracies. In addition, the values of σ on AffectNet and RAF-DB were … and …, respectively.
6.4. Experiments on Spatiotemporal Models
Overview. The spatiotemporal models consist of four blocks, namely a feature extraction block, an LSTM block, a 3DCNN block, and a classification block, and receive the face and person sequences as input. In this experiment, we built three different models from the spatiotemporal approach, as shown in Table 3.
Model 1, “Spatiotemporal Model + Fix-Feature,” used only the face sequence with the ResNet 50 face feature extraction model. The ResNet 50 model used the weights pre-trained on the AffectNet and RAF-DB datasets, as discussed above, and all of its layers were frozen; thus, the face feature extraction model was not fine-tuned during video-based emotion recognition training. Model 2, “Spatiotemporal Model + NonFix-Feature,” differed from the first model in that only three blocks of the ResNet 50 model were frozen, and the feature block of the ResNet 50 model was fine-tuned. Model 3, “Spatiotemporal Model + NonFix-Feature + Context,” extended Model 2 with the context feature, using input from both the face and person sequences and the weights of the VGG16 model pre-trained on ImageNet for context feature extraction.
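The sketch below shows one way the four blocks could be wired together in Keras, with the face branch fed through the per-frame feature extractor and an LSTM, and the person branch through a small 3D CNN. It is a structural illustration only: the layer widths, person-crop resolution, 3D CNN depth, and the exact routing of the two sequences through the blocks in Models 1–3 are assumptions, not the configuration reported in Table 3.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN, NUM_EMOTIONS = 32, 7

def build_spatiotemporal_model(face_extractor, freeze_features=True):
    # Feature extraction block: a per-frame CNN (e.g., the face model with
    # its classification head removed). Frozen for Model 1; its last feature
    # block would be made trainable for Models 2-3.
    face_extractor.trainable = not freeze_features

    face_seq = layers.Input((SEQ_LEN, 224, 224, 3), name='face_sequence')
    person_seq = layers.Input((SEQ_LEN, 112, 112, 3), name='person_sequence')

    # LSTM block over the per-frame face feature vectors.
    f = layers.TimeDistributed(face_extractor)(face_seq)
    f = layers.LSTM(256)(f)

    # 3D CNN block over the person (context) sequence.
    c = layers.Conv3D(32, 3, activation='relu', padding='same')(person_seq)
    c = layers.MaxPooling3D(2)(c)
    c = layers.Conv3D(64, 3, activation='relu', padding='same')(c)
    c = layers.GlobalAveragePooling3D()(c)

    # Classification block.
    x = layers.Concatenate()([f, c])
    x = layers.Dense(256, activation='relu')(x)
    out = layers.Dense(NUM_EMOTIONS, activation='softmax')(x)
    return Model([face_seq, person_seq], out, name='spatiotemporal_model')
```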
Training Details. We trained our models on the AFEW dataset. During video batch sampling, every emotion label appeared with the same frequency to overcome the uneven class distribution and the difference in distribution between the training and validation sets. We randomly extracted 32 frames per video clip in the training phase. In the validation phase, we averaged five predictions per clip, each obtained by randomly extracting 32 frames. For data augmentation, we transformed the whole face and person sequence by resizing to 224 × 224 and applying a random horizontal flip, spatial rotation, and scaling. Training used the SGD optimizer with early stopping at 40 epochs, an initial learning rate of 0.0004, and learning-rate reduction on plateau.
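The clip-level frame sampling and validation-time averaging can be sketched as follows (illustrative only; the frame arrays and the two-input model signature are assumptions).

```python
import numpy as np

SEQ_LEN = 32

def sample_indices(n_frames, seq_len=SEQ_LEN):
    """Randomly pick seq_len frame indices from a clip, keeping temporal order."""
    idx = np.random.choice(n_frames, seq_len, replace=n_frames < seq_len)
    return np.sort(idx)

def predict_clip(model, face_frames, person_frames, n_draws=5):
    """Average the predictions of n_draws random 32-frame extractions."""
    probs = []
    for _ in range(n_draws):
        idx = sample_indices(len(face_frames))
        probs.append(model.predict([face_frames[idx][None],
                                    person_frames[idx][None]])[0])
    return np.mean(probs, axis=0)
```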
Results and Discussion.
Table 3 illustrates the performance results of the spatiotemporal models on the validation set of the AFEW dataset.
Model 1, with fixed face features due to the frozen face feature extraction, obtained an accuracy of 51.70%, F1 score of 54.17%, and μ of 46.51%. Through fine-tuning of the feature block of the ResNet 50 model, Model 2 improved the accuracy by 0.52%, the F1 score by 2.09%, and μ by 0.82%. By using the context from the person region, Model 3 showed further significant increases of 1.82%, 2.52%, and 1.65% in the accuracy, F1 score, and μ, respectively. Model 3 also showed the highest accuracy of 54.05%, F1 score of 50.78%, and μ among all the spatiotemporal models.
Figure 11 shows the confusion matrices of the three models using the spatiotemporal approach. By fine-tuning the feature block of the face feature extraction model, Model 2 obtained an accuracy of 73% for the neutrality emotion label, compared to 58.7% for Model 1. Furthermore, Model 3, which took context into account, showed enhanced accuracies for the sadness and surprise emotion labels of 62.3% and 32.6%, respectively, increases of 13.1% and 17.2% over the second model. Moreover, the μ of Model 3 was greater than that of Model 2 and that of Model 1.
6.5. Experiments on Temporal-Pyramid Models
Overview. For the temporal-pyramid model, we performed an ablation study on the context and level factors, as shown in Table 4. For the context factor, Models 4–6 without context used only the ResNet 50 face feature extraction model, while Models 7–9 with context combined the face and context features from the face and person sequences. For each face frame, a model without context produced one vector of length 2048 for the face feature and 21 probability outputs corresponding to the seven emotion labels and the three statistical operators (min, mean, and max). The context feature vector from the VGG16 model, using weights pre-trained on ImageNet, had a length of 2048. Therefore, the per-frame feature vectors without and with context had lengths of 2069 and 4117, respectively.
For the level factor, we conducted experiments on three level groups, {3}, {4}, and {0,1,2,3}. At level k, all processed frames are divided into 2^k sub-sequences of equal length, and the frames within each sub-sequence are combined by the mean operator for the face and context features and by the three operators (min, mean, and max) for the emotion probability outputs. For example, for level group {0,1,2,3}, we divided all face and context frames in a video clip into 1, 2, 4, and 8 sub-sequences at Levels 0, 1, 2, and 3, respectively. In total, 15 sub-sequences were used to capture the emotion based on statistical information from the whole clip and from chunks of frames of various lengths. Therefore, the lengths of the temporal-pyramid features without and with context are 15 × 2069 = 31,035 and 15 × 4117 = 61,755, respectively.
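Under one plausible reading of this construction (our sketch, not the released code), the pyramid descriptor of a clip can be assembled from per-frame face features, context features, and emotion probabilities as follows; the array shapes follow the dimensions quoted above.

```python
import numpy as np

def pyramid_features(face_feats, probs, ctx_feats=None, levels=(0, 1, 2, 3)):
    """Build the temporal-pyramid descriptor of one clip.

    face_feats: (T, 2048) per-frame face features
    probs:      (T, 7)    per-frame emotion probabilities
    ctx_feats:  (T, 2048) per-frame context features, or None
    At level k the clip is split into 2**k sub-sequences; each contributes
    mean-pooled features plus min/mean/max of the probabilities
    (2069-d without context, 4117-d with context).
    """
    descriptors = []
    T = len(face_feats)
    for k in levels:
        for chunk in np.array_split(np.arange(T), 2 ** k):
            parts = [face_feats[chunk].mean(axis=0)]
            if ctx_feats is not None:
                parts.append(ctx_feats[chunk].mean(axis=0))
            p = probs[chunk]
            parts += [p.min(axis=0), p.mean(axis=0), p.max(axis=0)]
            descriptors.append(np.concatenate(parts))
    # For levels {0,1,2,3}: 15 sub-sequences, e.g. 15 * 4117 = 61,755 values.
    return np.concatenate(descriptors)
```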
Training Details. In the training phase, we created temporal-pyramid features for level groups {3}, {4}, and {0,1,2,3}, with and without context, using the face feature model with the ResNet 50 weights pre-trained on AffectNet and RAF-DB and the context feature model with the VGG16 weights pre-trained on ImageNet. For every level group, we used data augmentation to generate 10 instances of every video clip. Data augmentation was applied to all frames of a clip with the same transformations: resizing to 224 × 224, random horizontal flip, scaling, and rotation. When sampling a minibatch, we randomly chose eight video clips, each represented by one of its ten augmented instances, such that the emotion labels within the minibatch were balanced. We used the same training configuration as in the training phase of the spatiotemporal models: the SGD optimizer, an initial learning rate of 0.0004, and learning-rate reduction on plateau.
Results and Discussion.
Table 4 depicts the experimental results of the temporal-pyramid models with adjustment of context and level factors.
For the level factor, Models 4–6 were set to level groups {3}, {4}, and {0,1,2,3}, respectively, without context. The accuracies of the three models were the same, at 55.87%. However, Model 6, with multiple levels, gave better results in terms of F1 score and μ (54.06% and 51.85%, respectively, compared to 52.76% and 51.21% for Model 4 and 52.51% and 51.23% for Model 5). Similarly, Model 9, using multiple levels, showed an F1 score and μ of 56.50% and 54.25%, respectively, which were superior to the results of Models 7 and 8. Therefore, the level factor mainly affected the F1 score and μ.
For the context factor, Models 7–9 increased the accuracy, F1 score, and μ by 0.26%, 1.86%, and 1.15%; 0.52%, 1.49%, and 0.94%; and 0.78%, 2.44%, and 2.41%, respectively, over the corresponding values of Models 4–6. Within the same level group, the context factor therefore helped Models 7–9 provide better results than Models 4–6, respectively. Moreover, Model 9, with multiple levels, showed a significant increase in F1 score and μ, having the highest values of 56.50% and 54.25%, respectively.
Figure 12 shows the confusion matrices of Models 6, 9, and 8. For the same level group {0,1,2,3}, Model 9, with context, showed enhanced accuracies for the difficult emotion labels disgust, fear, and surprise of 30.0%, 43.5%, and 43.5%, respectively, compared to 20.0%, 32.6%, and 23.9% for Model 6. The μ of Model 9 was greater than that of Model 6 (without context) and that of Model 8 (with only the single level {4}).
6.6. Experiments on Best Selection Ensemble
Overview. We conducted ensemble experiments with three approaches to exploit the complementarity and redundancy among the models, as shown in Table 5. The first was the average fusion method, which combines the seven emotion probability outputs of all models with an average operator. The second approach was the multi-modal joint late-fusion method [10]. In this approach, we divided all models into two groups: the spatiotemporal group (Models 1–3) and the temporal-pyramid group (Models 4–9). This method uses an average operator to merge all probability outputs of the emotion models in the same group, called the probability-merged layer, followed by a dense layer and a softmax layer for classification into the seven emotion categories. Supervising each group's output in this way guarantees the accuracy of each branch. In addition, the model has a joint branch that merges the probability-merged layers of the two groups with a concatenation operator to give the final emotion outputs.
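A minimal Keras sketch of this two-branch design is given below, assuming the probability outputs of the individual models are fed in as inputs; the use of a single dense layer per branch and the absence of explicit loss weights are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_EMOTIONS = 7

def build_joint_late_fusion(n_spatio=3, n_pyramid=6):
    # Inputs: the 7-way probability outputs of each trained model.
    spatio_in = [layers.Input((NUM_EMOTIONS,)) for _ in range(n_spatio)]
    pyramid_in = [layers.Input((NUM_EMOTIONS,)) for _ in range(n_pyramid)]

    # Probability-merged layer of each group: element-wise average.
    spatio_merged = layers.Average()(spatio_in)
    pyramid_merged = layers.Average()(pyramid_in)

    # Per-group branches, each supervised with its own softmax output.
    spatio_out = layers.Dense(NUM_EMOTIONS, activation='softmax',
                              name='spatio_branch')(spatio_merged)
    pyramid_out = layers.Dense(NUM_EMOTIONS, activation='softmax',
                               name='pyramid_branch')(pyramid_merged)

    # Joint branch: concatenate the two merged layers for the final output.
    joint = layers.Concatenate()([spatio_merged, pyramid_merged])
    joint_out = layers.Dense(NUM_EMOTIONS, activation='softmax',
                             name='joint_branch')(joint)

    return Model(spatio_in + pyramid_in,
                 [joint_out, spatio_out, pyramid_out])
```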
The last approach was the best selection ensemble method. It chooses one of the models as the first element and then repeatedly adds one of the remaining models, averaging its probability outputs with those of the current combination, whenever doing so increases the accuracy of the combination. The process ends when no unused model can further increase the accuracy of the combination or when all models have been selected.
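The greedy search can be sketched as follows, assuming the validation probability outputs of all models have been precomputed (function and variable names here are illustrative, not from the original code).

```python
import numpy as np

def best_selection_ensemble(model_probs, y_true):
    """Greedy forward selection of models by averaged probabilities.

    model_probs: dict mapping model name -> (N, 7) validation probabilities
    y_true:      (N,) integer ground-truth labels
    """
    def acc(prob_list):
        avg = np.mean(prob_list, axis=0)
        return np.mean(np.argmax(avg, axis=1) == y_true)

    remaining = dict(model_probs)
    # Seed the combination with the individually best model.
    first = max(remaining, key=lambda name: acc([remaining[name]]))
    selected, probs = [first], [remaining.pop(first)]
    best_acc = acc(probs)

    # Keep adding the model that most improves accuracy, if any.
    while remaining:
        name, gain = max(((m, acc(probs + [p]) - best_acc)
                          for m, p in remaining.items()),
                         key=lambda item: item[1])
        if gain <= 0:
            break
        selected.append(name)
        probs.append(remaining.pop(name))
        best_acc += gain
    return selected, best_acc
```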
Results and Discussion. The results of our experiments on the average fusion, multi-modal joint late-fusion, and best selection ensembles are shown in Table 5.
The best selection method showed the highest accuracy and F1 score, of 59.79% and 58.48%, respectively, representing increases in accuracy and F1 score of 2.09% and 3.48% over the average fusion method and of 1.3% and 1.08% over the multi-modal joint late-fusion method. The model combination that gave the best scores in the best selection method consisted of Models 3, 6, 7, and 9.
As shown in Figure 13, the confusion matrix of the best selection method gave the highest μ of 56.24% with the smallest σ of 23.26%, compared to the average fusion and multi-modal joint late-fusion methods. Moreover, this method showed improved performance for the more difficult emotion labels: disgust, 25.0%; fear, 39.1%; and sadness, 37.0%.
6.7. Discussion and Comparison with Related Works
Discussion.
Figure 14 presents the results of the three experiments on the AFEW validation set. First, the context factor played an important role in enhancing the performance of spatiotemporal Model 3 compared to Models 1 and 2 using the same approach, as well as of temporal-pyramid Models 7–9 compared to the corresponding Models 4–6. This finding confirms that context is key to interpreting facial expressions and assessing the emotional state of a person [54], especially in cases in which the facial region is small and blurry.
Second, use of the multi-level group {0,1,2,3} in the temporal-pyramid models provided more robust features than the models using only a single level ({3} or {4}). For instance, Model 6 gave better results than Models 4 and 5, and the performance of Model 9 was better than that of Models 7 and 8. This shows that dividing the time span of the facial expression representation in a hierarchical structure creates robust features for capturing human emotions under in-the-wild conditions, such as unclear temporal borders and multiple apexes of spontaneous expressions.
Finally, when integrating multiple modalities, the best selection ensemble method achieved better results than the average fusion and multi-modal joint late-fusion methods.
The main advantage of our ensemble method is that it can identify the best combination from a large number of models built with a multi-modal approach, as well as from instances derived from multiple training runs. The average operator can also be replaced or complemented by other operators, such as skew, min, max, and median, or by combinations of several operators. In this study, the average and median operators were more useful than the others.
Comparison with Related Works. The accuracy measurements of our proposed methods and related methods on the AFEW validation set are shown in Table 6.
Our spatiotemporal method outperforms other recently reported methods using the same approach, by around 0.14% compared with Li et al. [63]. Recently, Kumar et al. [66] used multi-level attention with an unsupervised approach based on iterative training between student and teacher models. Their method achieved an accuracy of 55.17%, which is lower than that of our temporal-pyramid method (56.66%). To compare the fusion and ensemble methods, we considered related studies that combined multiple modalities using visual and geometric information of facial expressions. Our ensemble method achieved the highest accuracy of 59.79%, better than the best previously reported accuracy of 57.43% by Fan et al. [61].