4.1. CNN Testing
This section presents the confusion matrices obtained for the six proposed cases.
The best classification for Case 1 (Figure 5) is for the emotion Happy, at 95%, with 1738 correctly predicted images. This is partly due to the large number of images for this emotion in the training set. The worst performance is seen for the emotion Disgust, with a predictability percentage of 86%, corresponding to 96 correctly predicted images; it is also the class with the fewest images in the training set.
The model’s performance improves for Case 2 (Figure 6): the Happy emotion increases to 96%, while the Sad, Fear, Angry, and Disgust emotions each reach 88%. The prediction percentages are more homogeneous.
In Case 3 (Figure 7), the prediction for the Fear class improves to 90%. The Disgust class remains the one with the lowest predictability, at 86%, and the Happy class remains the one with the highest, at 94%.
In Case 4 (Figure 8), 850 instances of Angry were correctly classified as Angry, representing 0.89, or 89%, correct classifications. The best classification is again for the emotion Happy, at 94%, with 1720 correctly predicted images, which is partly due to the large number of images for this emotion in the training set. The weakest performance is seen for the emotions Fear and Sad, at 88%, with 897 and 1000 correctly predicted images, respectively. One explanation is that the facial expressions for Angry, Fear, and Sad can be similar in some circumstances, as seen in Figure 9.
For example, Fear images were misclassified as Sad in 37 cases, and Angry images were misclassified as Sad in 36 cases.
Although the Disgust emotion had the fewest images in the training set and was expected to rank last in terms of prediction values, the similarity among the Fear, Angry, and Sad emotions places it in the second-to-last position, with a predictability of 89%, meaning that 99 images were correctly predicted.
In Case 5 (Figure 10), the emotion Happy again has the best predictability percentage, at 94%, while the emotion Sad has the lowest, at 86%. The explanation in this case is also the similarity between the Fear, Angry, and Sad emotions.
Case 6 (Figure 11) shows improvements for most classes and a general tendency to classify the emotions Angry and Disgust better.
Table 2 presents, for all six proposed cases, the precision and accuracy values obtained during training and testing, as well as the number of training epochs.
In the first five cases, the model achieved a precision between 91% and 92%, indicating good performance in correctly identifying the positive classes. These results suggest that the model can effectively recognize facial expressions. The training accuracy was also quite high in these cases, demonstrating the model’s ability to learn from the available data.
In Case 6, the number of images in the validation set was higher, but the set was also more unbalanced than in the other cases. This is reflected in the accuracy value, which indicates a decrease in the model’s ability to generalize to the test set. This difference shows the importance of balancing and preparing the dataset in order to obtain the expected results.
In particular, Case 2 stands out through its use of data augmentation functions, namely shuffle and vertical flip, achieving the best overall performance. The vertical flip augmentation addresses the asymmetry of the human face.
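A minimal sketch of the Case 2 augmentation setup is shown below, assuming the Keras ImageDataGenerator API; only the flip and shuffle settings come from the text, and the remaining parameters (directory layout, rescaling, batch size) are illustrative. Note that in Keras, a left-to-right mirror about the vertical axis is horizontal_flip, whereas vertical_flip flips top-to-bottom; the sketch follows the text literally.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Vertical flip is the augmentation described for Case 2; pixel rescaling
# to [0, 1] is an illustrative preprocessing choice.
train_datagen = ImageDataGenerator(rescale=1.0 / 255, vertical_flip=True)

train_generator = train_datagen.flow_from_directory(
    "fer2013/train",            # hypothetical directory layout
    target_size=(48, 48),
    color_mode="grayscale",
    class_mode="categorical",
    batch_size=64,
    shuffle=True,               # the "shuffle" function mentioned for Case 2
)
```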
The test-set accuracies for Cases 1–5 ranged between 90.51% and 91.16%, while Case 6 achieved the lowest test accuracy, at 89.25%.
The number of epochs varied between 25 and 50 in the analyses performed. Case 1, trained for 25 epochs, achieved good performance, demonstrating that the developed model adapted efficiently to its data. This suggests that a smaller number of epochs is not necessarily a disadvantage but may indicate efficient learning in the particular context of the data used. On the other hand, Case 2, with 46 epochs, achieved the best precision, of 92%, and a training accuracy of 93.94%. This suggests that an intermediate number of epochs can help optimize performance, although there is no universal rule, as models trained for a large number of epochs can also suffer from overfitting.
The metrics obtained for the emotion classes are presented in Table 3.
Precision analysis: Case 2 demonstrates a remarkable precision of 0.99 for the Disgust emotion class, suggesting that this model is exceptionally efficient in correctly identifying this emotion. Such performance is significant, as precision scores close to 1.0 reflect the model’s ability to make clear distinctions between adjacent emotions and to minimize false positive classifications. On the other hand, Case 4, with a precision of 0.84, shows a significantly poorer performance in identifying the Disgust emotion. This discrepancy may suggest that the model in this case fails to capture the specific details of the expressions or of the context in which Disgust is expressed.
When focusing on the emotion Angry, we observe a precision ranging from 0.88 to 0.90. Case 6 stands out with the best performance, at 0.90. This suggests a robust ability to identify an emotion that, although it may often seem easy to recognize, can vary considerably depending on the specific context and the associated facial expressions. The stability of the precision across cases suggests that the models were trained on diverse and well-labeled data, although there is still room for improvement before optimal performance is reached.
Regarding the emotion Surprised, Cases 2 and 4 demonstrate their efficiency with a precision of 0.95, which reflects an accurate identification of this emotion. However, Case 6, with a precision of 0.87, shows an inferior performance. This situation highlights the complexity of emotions and the challenges they pose for automatic classification.
Recall analysis: For the emotion Angry, the model achieves a recall of 0.92, which indicates remarkable performance in identifying instances of Anger. This high value shows that the model detects the majority of cases in which Anger is present, demonstrating high sensitivity for this class.
In contrast, the emotion Fear presents a lower recall, of 0.82. This value suggests difficulties in correctly identifying Fear, indicating an area where the model needs improvement. Fear can often be subtle and can vary significantly depending on the context; it can be expressed through a wide range of behaviors and facial expressions that are not always obvious. This diversity means that the model struggles to distinguish Fear from similar emotions, such as Sadness or Anger. A deeper analysis of the training data and of the social context is therefore required to improve the identification of this emotion.
Regarding the emotion Happy, the results show consistently good performance, with a recall ranging from 0.94 to 0.96 across all analyzed cases. These high values suggest that the model is effective in recognizing Happy, which is expected, as this emotion is usually easier to identify.
F1 score analysis: For the emotion Disgust, Case 2 achieves an F1 score of 0.93, distinguished by a remarkable balance between precision and recall. This result indicates that the model identified this emotion efficiently, both by avoiding false positive classifications and by correctly capturing the positive examples. Case 4, on the other hand, records an F1 score of 0.92, which, although lower, still indicates high performance. This difference may reflect the distinctions among the datasets used and the complexity of the context in which the emotions are manifested.
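The per-class precision, recall, and F1 values in Table 3 can be reproduced from the same predictions with scikit-learn; a brief sketch follows, reusing the y_true, y_pred, and EMOTIONS names assumed in the confusion matrix sketch above.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Per-class precision, recall and F1 score, as reported in Table 3.
print(classification_report(y_true, y_pred, target_names=EMOTIONS, digits=2))

# The same values as arrays, convenient for comparing the six cases programmatically.
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
```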
In conclusion, the models in Cases 1–5 perform well on most emotions, with high precision and recall. Case 6, although it achieves some good values, shows signs of weakness in identifying the emotions Fear, Sad, and Surprised.
Case 2 is the best optimized, having the best precision and good accuracy on the training set, which suggests that it could be a good reference model.
To assess whether the observed performance improvements are statistically meaningful, a Wilcoxon signed-rank test was conducted to compare accuracy and precision values across multiple runs. This non-parametric test was chosen due to its robustness in handling small sample sizes and non-normally distributed data. The results indicated no statistically significant difference between the accuracy and precision values, confirming the stability and consistency of the model’s performance. This shows that the improvements observed across different experimental conditions are not due to random variations but rather reflect the effectiveness of the proposed approach in facial emotion recognition.
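A minimal sketch of the Wilcoxon signed-rank comparison described above is given below, using SciPy; the paired accuracy and precision values per run are placeholders, not the actual experimental measurements.

```python
from scipy.stats import wilcoxon

# Paired accuracy and precision values over multiple runs
# (placeholder numbers, not the measurements from Table 2).
accuracy_runs  = [0.9051, 0.9116, 0.9087, 0.9102, 0.9095]
precision_runs = [0.91,   0.92,   0.91,   0.92,   0.91]

statistic, p_value = wilcoxon(accuracy_runs, precision_runs)
if p_value >= 0.05:
    print(f"No statistically significant difference (p = {p_value:.3f})")
else:
    print(f"Statistically significant difference (p = {p_value:.3f})")
```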
Comparing the results obtained in the literature on the FER2013 dataset (Table 4 and [52]) with the results obtained in this study, we can conclude that data augmentation with the shuffle and vertical flip functions offers the best optimization, achieving the best accuracy, namely, 92%.
Comparing the validation accuracy results obtained in the literature for the FER2013 dataset using complex models (Table 5) with the results obtained in this study, we can conclude that the proposed CNN model with data augmentation offers the best accuracy, of 91.16%.
When analyzing the experimental results, beyond the overall success rates of the model, it is important to examine the cases where misclassifications occurred and their potential causes. As observed in the confusion matrices, the highest classification accuracy is consistently achieved for the Happy emotion, with predictability percentages reaching up to 96%, largely due to the high number of training samples for this category. Conversely, the Disgust emotion exhibits the lowest predictability, ranging between 86% and 89%, which can be attributed to its underrepresentation in the training set.
Another recurring challenge is the misclassification of emotions with similar facial expressions, particularly among Angry, Fear, and Sad, as evidenced by instances where Fear was confused with Sad (37 cases) and Angry was misclassified as Sad (36 cases). This suggests that the model struggles with subtle variations in facial features that differentiate these emotions, a known limitation in FER systems, especially when trained on datasets like FER2013, where intra-class variations can be significant. Additionally, while applying data augmentation and class weighting strategies improved overall predictability, certain cases—such as Sad being the least accurately classified in Case 5 (86%)—demonstrate that these techniques alone may not fully resolve the inherent similarity-induced confusion among some emotions.
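The class weighting mentioned above can be implemented by weighting the loss inversely to class frequency; a short sketch follows, assuming Keras training with integer class labels. The names y_train_labels and train_generator are illustrative, and the epoch count is not taken from any specific case.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y_train_labels is assumed to hold the integer emotion label of each training image.
classes = np.unique(y_train_labels)
weights = compute_class_weight(class_weight="balanced",
                               classes=classes, y=y_train_labels)
class_weight = dict(zip(classes, weights))  # the rarest class (Disgust) gets the largest weight

# Passed to Keras so that under-represented classes contribute more to the loss.
model.fit(train_generator, epochs=50, class_weight=class_weight)  # epoch count illustrative
```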
A further observation is that increasing the number of images in the test set (Case 6) led to improvements in classifying Angry and Disgust, suggesting that a more balanced dataset distribution during training and testing can enhance generalization. These findings indicate that while the model effectively captures distinct emotional patterns, future improvements could explore more sophisticated augmentation techniques, feature refinement strategies, or multimodal approaches incorporating additional cues like temporal dynamics or physiological signals to further enhance recognition accuracy in challenging cases.
4.2. Testing on Completely New Data
To validate the results obtained, testing was carried out on completely new data by developing an application.
After training and testing the network, one file was created to save the trained model and another to save its structure with all the parameters; both files are loaded when the application is run.
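A brief sketch of this two-file scheme is given below, assuming a Keras model; the file names are illustrative.

```python
from tensorflow.keras.models import model_from_json

# After training: the network structure goes to one file,
# the learned parameters (weights) to another.
with open("emotion_model.json", "w") as f:
    f.write(model.to_json())
model.save_weights("emotion_model.weights.h5")

# In the application: rebuild the structure, then load the parameters.
with open("emotion_model.json") as f:
    loaded_model = model_from_json(f.read())
loaded_model.load_weights("emotion_model.weights.h5")
```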
For face detection, the Haar cascade classifier was used. The images captured by the camera were resized to 1280 × 720 pixels so that they fit well on the laptop screen.
Each detected face was localized through its position, width, and height coordinates, and the face detected in the images taken by the webcam in real time was framed with a rectangle. Regardless of the input image, it was converted to grayscale, because the model was trained on grayscale images, which improves the accuracy of emotion detection. The region of interest was extracted from the grayscale image, and all the face images were cropped. After cropping the frame, each face image was resized to 48 × 48 pixels in grayscale and sent to the trained model for facial expression recognition. The list of emotions was mapped directly to labels: index 0 corresponds to Anger; 1 to Disgust; 2 to Fear; 3 to Happiness; 4 to Neutral; 5 to Sadness; and 6 to Surprise.
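A condensed sketch of this detection and preprocessing loop is shown below, assuming OpenCV’s frontal-face Haar cascade and the Keras model loaded earlier (loaded_model in the sketch above); the window name, detector thresholds, and exit key are illustrative.

```python
import cv2
import numpy as np

# Index order as described in the text: 0 Anger, 1 Disgust, ..., 6 Surprise.
EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprised"]
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (1280, 720))            # fit the laptop screen
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)    # model was trained on grayscale

    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
        roi = cv2.resize(gray[y:y + h, x:x + w], (48, 48))      # region of interest
        roi = roi.astype("float32").reshape(1, 48, 48, 1) / 255.0
        probs = loaded_model.predict(roi, verbose=0)[0]
        label = EMOTIONS[int(np.argmax(probs))]        # emotion with highest confidence
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)

    cv2.imshow("Emotion detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```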
Five subjects were asked to reproduce the seven emotions, with each emotion performed four times while being held for 5 s and captured using a webcam at 20 fps. A total of 2000 raw images were thus collected for each emotion (5 subjects × 4 repetitions × 5 s × 20 fps). The obtained accuracies are presented in Table 6.
An example of detection is shown in Figure 12, where the emotion with the highest confidence percentage is displayed as the detected emotion.
For the 30 fps camera used in our study, the interval between consecutive frames is 0.033 s. The maximum processing time per frame is 0.0238 s, which remains within this interval, ensuring that no frames are skipped during real-time processing. However, the first frame experiences a processing delay of approximately 0.11 s due to the initial camera startup overhead. This delay is only present at the beginning and is eliminated in subsequent frames, allowing smooth and uninterrupted real-time execution.
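The per-frame budget can be verified directly inside the capture loop; a small sketch is shown below, assuming the loop sketched earlier, with the 0.033 s budget corresponding to the 30 fps camera.

```python
import time

FRAME_BUDGET = 1.0 / 30.0   # ≈ 0.033 s between consecutive frames at 30 fps

start = time.perf_counter()
# ... face detection, preprocessing and prediction for one frame ...
elapsed = time.perf_counter() - start

# Processing must stay below the budget (0.0238 s in our measurements) so that
# no frames are skipped; only the first frame also pays the camera startup
# overhead of roughly 0.11 s.
if elapsed > FRAME_BUDGET:
    print(f"Frame over budget: {elapsed:.4f} s > {FRAME_BUDGET:.4f} s")
```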