1. Introduction
As the study of emotions has advanced, their significance has become increasingly evident, leading to widespread applications of emotional research in real-life scenarios. Facial expressions serve as pivotal cues for capturing human emotions. Facial expression recognition (FER) tasks are extensively employed in educational, medical, and human–computer interaction settings to enhance human assistance and learning experiences. Consequently, research in FER has transitioned from controlled laboratory environments to practical field applications, expanding from static to dynamic facial expression recognition (DFER) and facing a broader array of challenges.
Firstly, as Izard highlighted, facial expressions occur within fractions of a second [
1], necessitating that FER systems operate at high speed to capture emotional cues accurately and realize the practical value of FER applications. In current FER systems, recognition is performed by loading a pretrained model, so the model's inference speed is equivalent to the recognition speed. However, with the continuous development of artificial intelligence, end-to-end training is accelerating. In the future, application scenarios will no longer rely on fixed pretrained models to adapt to user inputs; instead, user input data will also become part of the training data, and iterative training will be deployed at the application end so that the model can continuously improve. Training speed will then become essential to ensure that downstream tasks can be trained in real time, even under limited computing power. Large-scale deployment of artificial intelligence therefore poses new performance challenges for FER: while pursuing recognition accuracy, FER methods should also focus on improving training speed. Moreover, current FER techniques often neglect the central region of the face. Effective FER should prioritize the central facial area of a photograph while minimizing attention to the periphery; although standard cropping methods emphasize facial features, the variability inherent in facial photographs means that these methods often fail to focus on the central position consistently. Furthermore, noisy labels and low image quality continue to pose essential challenges for FER tasks [
2].
Many studies have proposed lightweight network architectures or reduced floating-point operations (FLOPs) to enhance model processing speed. However, reducing FLOPs does not always translate into reduced latency and can sometimes exacerbate delays, because memory access frequency, in addition to the raw operation count, substantially impacts model processing speed [
3,
4]. For instance, as shown in
Figure 1, the feature information extracted by CNNs is often redundant, with many channels contributing little to the final prediction. Addressing redundant channel features therefore offers a viable way to decrease computational load, reduce memory access frequency, and enhance model processing speed.
To address these challenges, we adopted FasterNet [
5] as our backbone model, characterized by channel filtering and central attention weighting. This choice effectively minimizes redundant channel features while emphasizing the central face position. Our model also employs a dual-branch structure: one branch handles normal samples while the other handles horizontally flipped samples. The outputs from both branches leverage flipped attention consistency loss to mitigate face anisotropy and address issues related to noisy label annotations.
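FasterNet's core operator is a partial convolution (PConv) that applies a regular convolution to only a fraction of the input channels and passes the remaining channels through unchanged, which cuts both computation and memory accesses. The following minimal PyTorch sketch illustrates the idea; the 1/4 channel ratio and layer configuration are illustrative rather than the exact settings used in our model.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Convolve only the first `1/div` fraction of channels; pass the rest through.

    A minimal sketch of FasterNet-style partial convolution (PConv):
    fewer convolved channels means fewer FLOPs and fewer memory accesses.
    """
    def __init__(self, channels: int, div: int = 4, kernel_size: int = 3):
        super().__init__()
        self.conv_ch = channels // div          # channels that are actually convolved
        self.pass_ch = channels - self.conv_ch  # channels copied unchanged
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch,
                              kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_conv, x_pass = torch.split(x, [self.conv_ch, self.pass_ch], dim=1)
        return torch.cat([self.conv(x_conv), x_pass], dim=1)

# Example: a 64-channel feature map; only 16 channels go through the 3x3 convolution.
features = torch.randn(2, 64, 56, 56)
out = PartialConv(64)(features)
print(out.shape)  # torch.Size([2, 64, 56, 56])
```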
In summary, the contributions of the FCCA method we proposed are as follows:
Innovative Backbone Selection: We introduce FasterNet as the backbone for FER tasks, incorporating a channel filtering mechanism to reduce memory access and enhance overall model performance. Additionally, our method utilizes a channel weighting mechanism to prioritize the central facial position, mitigating the adverse effects of edge positions on key feature extraction.
Enhanced Model Training: By integrating flipped attention consistency loss with FasterNet, our approach trains the model with both flipped and cross-entropy loss. This method leverages high-quality feature extraction to bolster model robustness and effectively mitigate the effects of label noise.
Experimental Validation: We conducted comprehensive experiments to evaluate our method's generalization capability, runtime efficiency, and recognition accuracy on publicly available datasets, including RAF-DB, AffectNet, and DFEW. Our results demonstrate that our approach generalizes well across datasets, trains rapidly, and attains recognition accuracy only slightly below that of state-of-the-art methods.
By addressing these key aspects, our FCCA method advances the field of FER, offering a robust solution that balances accuracy and efficiency across diverse application scenarios.
The remainder of this paper is organized as follows.
Section 2 (“Related Work”) critically reviews the advancements and limitations of three mainstream backbone networks for facial expression recognition (FER) tasks, namely Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and lightweight network architectures.
Section 3 (“Proposed Method”) comprehensively describes the proposed FCCA model, detailing its architectural design, operational principles, and specific implementation.
Section 4 (“Experiments”) outlines the experimental framework, including dataset specifications, evaluation metrics, and configuration details.
Section 5 (“Results”) presents a quantitative analysis of the FCCA model’s recognition performance across experimental datasets.
Section 6 (“Discussion”) offers a comparative evaluation and visualization of the model’s performance, focusing on recognition accuracy and computational efficiency, while uncovering deeper insights into the FCCA methodology. Finally,
Section 7 (“Conclusions”) summarizes this study’s key contributions and limitations, and suggests potential directions for future research. This structured approach ensures a systematic research presentation, from theoretical foundations to empirical validation and critical analysis.
2. Related Works
Scholars have continuously refined backbone models to improve their performance and robustness, aiming to better extract discriminative facial expression features and address the various challenges of the FER task. Common backbones for FER include CNNs, ViTs, and lightweight models.
Given CNNs' strong ability to extract local features, ResNet-50 and ResNet-18 long served as backbones for extracting facial expression features. Zhang et al. used ResNet-18 to extract facial expression features and combined it with a branch that estimates the uncertainty of samples [
6]. Wang et al. used ResNet-18 as a facial expression feature extractor and designed a region attention mechanism to suppress the negative impact of occlusion on the model’s learning of useful features [
7]. Farzaneh et al. also used ResNet-18 to extract features and deep center attention weighted features to highlight important features [
8]. Zhang et al. used ResNet-50 to extract key facial features and used erasing consistency loss to weaken the negative impact of noisy features on correct classification [
9]. These methods effectively alleviated difficulties in the FER field. However, they were limited by CNNs' lack of a global receptive field, and there remained considerable room for improvement in FER performance.
Given the powerful ability of ViT to capture long-range feature relationships, Xue et al. introduced ViT to increase the global receptive field, using two pooling modules to solve the defect of ViT lacking inductive bias and the infiltration of noisy features [
10]. Liu et al. proposed a novel facial muscle movement-aware representation learning method that captures the semantic relationships of facial muscle movements in expression images [
11]. Mao et al. improved ViT in three directions: cross-fusion, dual-stream, and multiscale feature extraction, reducing the computational complexity of POSTER and extracting multiple features [
12]. Although ViT enlarges the global receptive field, its self-attention cost grows quadratically with the number of tokens, so its computational complexity and running time are considerably higher than those of CNNs, which hampers the deployment and application of FER systems.
Since CNN and ViT methods face challenges balancing computational complexity and recognition accuracy, many researchers are exploring new network architectures to seek breakthroughs. Mao et al. argued that existing FER methods neglected the dynamic size changes of tensors and proposed a hierarchical attention network with progressive feature fusion to avoid the computational complexity of ViT and achieve recognition accuracy similar to that of ViT [
13]. The Dual-Direction Attention Mixed Feature Network (DDAMFN) combines robustness and lightweight characteristics [
14]. Wang et al. use graph convolutional networks (GCNs) and k-nearest-neighbor graphs to identify facial expressions: a GCN processes low-uncertainty images to extract geometric cues for emotional label prediction, while high-uncertainty images leverage k-nearest-neighbor graphs to find similar low-uncertainty images and fuse their emotional label distributions; a convolutional neural network is then trained on these distributions to identify facial expressions [
15].
In summary, backbone design for the FER task has mostly weighed spatial feature extraction capability against computational complexity. From CNNs to ViTs to lightweight networks, scholars have pursued faster and more accurate models. However, because the impact of memory access efficiency on computational speed has been neglected, actual running speed has often failed to improve even as computational complexity was reduced. As noted in the Introduction, reducing FLOPs does not always translate into lower latency [3,4]; the redundant channel features illustrated in Figure 1 thus remain a promising target for decreasing computational load, reducing memory access frequency, and enhancing model processing speed.
5. Results
This section presents a thorough comparison of results obtained through the FCCA method against benchmarks or state-of-the-art methods. Detailed analysis of recognition accuracy metrics and runtime efficiency highlights the method’s strengths. The FCCA method notably competes with established techniques across multiple datasets, such as RAF-DB, AffectNet, and DFEW, showcasing its generalization capability and adaptability to diverse data conditions.
The RAF-DB dataset is imbalanced: the number of samples per category in the validation set is unevenly distributed. We therefore report the average recognition accuracy of the FCCA method on RAF-DB to evaluate the model's generalization across sample distributions. "Avg.Acc (%)" denotes the average recognition accuracy, computed as the unweighted mean of the per-class accuracies over the seven categories (surprise, disgust, fear, happy, sad, angry, and neutral). As depicted in
Table 1, FCCA achieves excellent recognition performance on the challenging, imbalanced RAF-DB dataset, with an overall recognition accuracy of 91.30% and an average recognition accuracy of 85.29%. In a direct comparison, FCCA outperforms the EAC method with a ResNet-50 backbone by 0.95% and the ViT-based TransFER method by 0.39%. Notably, the POSTER method, which leverages a pyramid structure, achieves the highest recognition accuracy to date, reaching 92.05%.
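For concreteness, the average accuracy defined above can be computed from a confusion matrix as the unweighted mean of the per-class accuracies. The short NumPy sketch below illustrates the calculation on a toy matrix; the values are illustrative, not our experimental results.

```python
import numpy as np

def macro_accuracy(confusion: np.ndarray) -> float:
    """Mean of per-class accuracies (diagonal / row sum), ignoring class imbalance."""
    per_class = np.diag(confusion) / confusion.sum(axis=1)
    return per_class.mean()

# Toy 7x7 confusion matrix for (surprise, disgust, fear, happy, sad, angry, neutral).
cm = np.array([
    [100,  2,  1,   3,   2,   1,   5],
    [  4, 60,  2,   1,   3,   5,   3],
    [  3,  2, 50,   1,   4,   2,   2],
    [  2,  1,  1, 300,   3,   1,   6],
    [  3,  2,  3,   4, 150,   2,  10],
    [  2,  4,  2,   2,   3, 120,   3],
    [  5,  2,  1,   8,   9,   3, 200],
])
print(f"Avg.Acc: {100 * macro_accuracy(cm):.2f}%")
```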
As depicted in
Table 2, FCCA demonstrates a compelling recognition accuracy of 65.51% on the AffectNet dataset, representing a notable improvement of 0.19% over the EAC method. The POSTER method remains state of the art, with a recognition accuracy of 67.31%.
As depicted in
Table 3, FCCA demonstrates strong performance in the DFER task on the DFEW dataset, achieving a WAR of 69.66% and a UAR of 56.61%. These results match the state-of-the-art recognition performance, solidifying FCCA’s efficacy in this domain.
However, the POSTER method hinges on a sophisticated pyramidal architecture and a resource-intensive ViT, which demand substantial processing power and prolong processing times. Considering the concurrent demands for fast recognition and high accuracy in the FER task, a dual SOTA standard that benchmarks systems on speed and accuracy simultaneously offers a more comprehensive and accurate evaluation framework for FER performance.
6. Discussion
Our method surpasses the benchmark in recognition accuracy but still falls short of the POSTER method. To investigate further, we conducted training-time experiments on the RAF-DB and AffectNet datasets, using a single training epoch as the unit of measurement. We recorded the training time of the FCA, FCCA, and POSTER methods on the training and test sets, allowing us to compare their runtimes.
As depicted in
Table 4 and
Table 5, taking the training set as an example, the FCA method demonstrated the shortest runtime, clocking in at 68 s and 137 s, respectively, while the POSTER method took the longest, requiring 188 s and 415 s. The FCCA method fell in between, with runtimes of 111 s and 230 s. Notably, POSTER required nearly twice the runtime of FCCA and roughly three times that of FCA. The overall runtime ordering is thus FCA, then FCCA, then POSTER.
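For reference, per-epoch training time can be measured by synchronizing the GPU before and after the pass and reading a wall clock. The sketch below illustrates this measurement protocol; the model, data loader, loss, and optimizer are placeholders rather than our actual training code.

```python
import time
import torch

def time_one_epoch(model, loader, criterion, optimizer, device="cuda"):
    """Return wall-clock seconds for a single training pass over `loader`."""
    model.train()
    if device == "cuda":
        torch.cuda.synchronize()          # make sure pending GPU work is done
    start = time.perf_counter()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize()          # wait for the last batch to finish
    return time.perf_counter() - start
```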
By combining recognition accuracy and runtime, we found that the FCCA method achieved recognition accuracy slightly below POSTER's while consuming only about half the runtime. This indicates that the FCCA method balances runtime and recognition accuracy well, making it an attractive option for FER tasks in real-world applications and deployment.
To better understand the FCCA method’s performance in various emotional classification tasks, we presented confusion matrices for the recognition accuracy of various emotions in the RAF-DB and AffectNet datasets. Additionally, we included recognition performance confusion matrices for the CNN (ResNet-50) and FCA methods to enable a more direct comparison. As shown in
Figure 5, on the RAF-DB dataset, the FCA method demonstrated superior performance in identifying the fear, happy, sad, and angry emotions, with improvements of 2.7%, 1.35%, 5.02%, and 1.86%, respectively. Notably, CNN (ResNet-50) continued to excel at recognizing surprise and disgust, while FCA performed better on fear and the other emotions. The FCCA method, in turn, showed striking improvements for most emotional categories, including surprise, disgust, happy, sad, angry, and neutral, with gains of 2.74%, 3.75%, 1.61%, 4.39%, 6.79%, and 2.20%, respectively. On the AffectNet dataset, the FCA method exhibited remarkable stability, demonstrating its stronger adaptability to in-the-wild data, while the FCCA method showed notable improvements in most emotional categories on both datasets, indicating its ability to further improve model performance. However, the FCCA method exhibited a decline in performance on the sad category in both datasets, which may be attributed to despairing facial expressions that are difficult to recognize and are often misclassified as neutral. Enhancing the model's fine-grained classification ability is therefore necessary to address this issue.
An examination of the RAF-DB and AffectNet datasets reveals that CNN (ResNet-50) can distinguish various features, but the inter-class distances are insufficient and the intra-class distributions are not compact enough, so the feature vectors exhibit relatively small differences between classes and relatively low similarity within classes. In contrast, as shown in
Figure 6, the FCA method produces larger distances between classes, considerably enhancing the inter-class differences and intra-class similarity of the feature vectors and indicating an improvement in FCA's ability to distinguish between classes.
In the FCCA method, the inter-class distances are larger still and the intra-class feature vectors are very tight, clearly indicating that FCCA not only enhances the inter-class differences between feature vectors but also improves intra-class similarity. This shows that the FCCA method effectively addresses the issues mentioned above and constitutes a fundamental advantage over the baseline FCA method.
The Grad-CAM [
40] method provides a straightforward visual representation of the attention regions of the model during correct classification, whereas SHAP offers a more comprehensive view by showcasing the attention regions of all classification results, including incorrect ones, and ranking them according to their predicted probabilities from highest to lowest. As shown in
Figure 7, the first column shows seven emotion samples, while the second column highlights the relevant feature regions under correct classification. The attention regions for correct classification are notably broader and more focused than those for the incorrect classifications. For instance, for a sample correctly classified as “disgust”, the attention region encompasses the eye area, the nasal wings, and the mouth area, with a particularly high concentration of attention around the mouth. In stark contrast, the attention region for the same sample under the incorrect “angry” hypothesis is limited to the eye area and appears scattered and unfocused. This pattern is also observed in the other emotion samples, reflecting FCCA's dual-focus design, which simultaneously attends to global and local key features. By employing this mechanism, FCCA performs the classification task meticulously, which is a crucial factor in its performance.
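Heatmaps such as those in Figure 7 can be generated with a generic Grad-CAM procedure: hook a late convolutional layer, weight its activations by the gradients of the target class score, and average over channels. The sketch below is a standard, self-contained implementation of this procedure, not the exact visualization code used in our experiments.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Return an [H, W] Grad-CAM heatmap for `class_idx` (predicted class if None)."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, __, grad_output):
        gradients["value"] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        logits = model(image)                         # image: [1, 3, H, W]
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()
        weights = gradients["value"].mean(dim=(2, 3), keepdim=True)  # [1, C, 1, 1]
        cam = F.relu((weights * activations["value"]).sum(dim=1))    # [1, h, w]
        cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                            mode="bilinear", align_corners=False)[0, 0]
        cam -= cam.min()
        return (cam / (cam.max() + 1e-8)).detach().cpu()
    finally:
        h1.remove()
        h2.remove()
```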
To dissect the specific contributions of each component within the FCCA method and clarify how individual design choices impact overall performance, we conducted a series of ablation experiments. As depicted in
Table 6, compared to ResNet-50 on the RAF-DB dataset, the FCA method improved the recognition accuracy from 88.69% to 89.70%, a notable 1.01% gain. By integrating FACL with the FCA approach, the FCCA method achieved a further 1.60% improvement. On the AffectNet dataset, the FCA method demonstrated a substantial advantage over the traditional CNN (ResNet-50) baseline, raising the recognition accuracy from 62.57% to 64.8%, a 2.23% increase, and the addition of FACL yielded a further 0.8% gain. Our analysis suggests that although FCA and FACL vary in effectiveness across datasets, both make meaningful contributions to model performance.
To gain a deeper understanding of the feature extraction capabilities of FCA and CNN (ResNet-50), we employed Grad-CAM to visualize the feature outputs of the two networks' final layers. As shown in
Figure 8, in the realm of facial texture and detailed feature extraction, the FCA method stands out for its clarity. Unlike CNN (ResNet-50), which relies mainly on facial contour features, FCA not only delineates the facial features but also captures the intricate skin texture surrounding them and the subtle movements of the muscles. Through meticulous and comprehensive facial feature extraction, FCA enables the model to grasp detailed features and sidesteps the decline in classification performance that results from missing crucial features. This observation also implies that certain key cues in facial expressions may be concealed within facial texture and muscle movements, reinforcing their significance in FER. Upon closer examination, it becomes apparent that CNN (ResNet-50) is prone to extracting superfluous information from sample edges. In contrast, FCA focuses on extracting vital features from the central facial region, effectively mitigating the adverse effects of irrelevant features on classification. FCA's feature extraction ability enables it to capture more of the information crucial for the FER task, while its center attention mechanism suppresses the influence of unnecessary information. Both the FCCA and FCA methods use FasterNet for feature extraction; the difference is that FCA is a simple combination of a feature extractor and a classification head trained with cross-entropy loss alone, whereas FCCA employs a dual-branch structure over original and horizontally flipped samples, supervised jointly by cross-entropy loss and flipped attention consistency loss. Because their feature extraction performance does not differ, we did not include a separate feature extraction diagram for the FCCA method.
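To give a rough picture of how a center prior could be imposed on spatial features, the sketch below multiplies a feature map by a Gaussian mask peaked at the center, down-weighting peripheral positions. The Gaussian form, the sigma value, and the function name are illustrative assumptions, not the exact center attention mechanism used in FCA.

```python
import torch

def center_weight(features: torch.Tensor, sigma: float = 0.5) -> torch.Tensor:
    """Down-weight peripheral positions of a [B, C, H, W] feature map with a Gaussian prior."""
    _, _, h, w = features.shape
    ys = torch.linspace(-1.0, 1.0, h, device=features.device)
    xs = torch.linspace(-1.0, 1.0, w, device=features.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    mask = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))  # peak at the center
    return features * mask                                      # broadcast over B and C

# Example: emphasize the center of a 14x14 feature map.
weighted = center_weight(torch.randn(2, 128, 14, 14))
```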
To facilitate a deeper understanding of the working mechanism of the FCCA method in FER, we employed Grad-CAM to visualize the attention regions of the various approaches. As shown in
Figure 9, the CNN (ResNet-50) method primarily focuses on key regions in most instances but occasionally allocates attention to areas such as the cheeks, which are not crucial for the FER task. In contrast, FCA directs all of its attention towards facial features or other essential facial regions, suggesting that FCA extracts more key facial information and channels it into the model, thereby enhancing focus on pivotal regions. Furthermore, the attention regions of FCCA reveal the influence of facial symmetry on the FER task. When the left and right halves of the face are relatively symmetrical, the disparity between the attention regions of FCA and FCCA is minimal; when the facial halves are asymmetrical, the difference becomes more pronounced, implying that FACL contributes more prominently to the analysis of asymmetrical samples by paying more attention to global key information, which benefits accurate FER classification. We hypothesize that FCCA's performance on asymmetrical faces stems from the incorporation of FACL, which enables the model to learn the characteristic of facial asymmetry and to exploit global information for the accurate classification of faces with asymmetrical features.
Beyond the main experiments, we further examined the robustness of the FCCA method under varying conditions. Noisy labels pose a pervasive challenge for FER tasks: their widespread occurrence injects noisy features that can disrupt model performance, and the model often learns and retains these features, compromising its ability to classify emotions accurately. To mitigate this issue, we introduced the flipped attention consistency loss (FACL), a novel approach that leverages flipped attention adaptation to neutralize the detrimental impact of noisy features on the model. As shown in
Table 7, we evaluated the robustness of the FCCA method to noisy labels on the RAF-DB dataset. Adding noisy labels caused a considerable decline in recognition performance for all methods, but different methods exhibited varying degrees of robustness, with FCCA emerging as the most resilient and surpassing the CNN (ResNet-50) method. FCCA achieved recognition accuracies of 88.62%, 87.48%, and 85.92% under 10%, 20%, and 30% label noise, respectively, matching or outperforming the EAC method in every scenario. Moreover, FCCA's advantage grows as the proportion of noisy labels increases: with 10% noise, FCCA and EAC were on par at 88.62%; at 20% noise, FCCA outperformed EAC (87.48% vs. 87.35%); and at 30% noise, FCCA was clearly superior (85.92% vs. 85.27%). These findings demonstrate FCCA's robustness to noisy labels and underscore its potential for tackling FER tasks in their presence.
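The noisy-label settings in Table 7 follow the common protocol of randomly reassigning a fixed fraction of training labels to other classes. The sketch below illustrates such symmetric label-noise injection; the function and variable names are ours, for illustration only.

```python
import numpy as np

def inject_label_noise(labels: np.ndarray, noise_ratio: float,
                       num_classes: int = 7, seed: int = 0) -> np.ndarray:
    """Flip `noise_ratio` of the labels to a uniformly chosen different class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n_noisy = int(len(labels) * noise_ratio)
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    for i in idx:
        choices = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(choices)
    return noisy

# Example: corrupt 20% of the labels of a 7-class training set.
clean = np.random.randint(0, 7, size=12271)
noisy = inject_label_noise(clean, 0.2)
print((clean != noisy).mean())  # ~0.2
```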
Notably, FLOPs alone cannot precisely indicate a model's running speed. We compiled the FLOPs of several classic models, including the FCCA model, and analyzed their accuracy and parameter counts, as depicted in
Table 8. Interestingly, despite having higher FLOPs and more parameters, ResNet-50 runs faster than the POSTER method. Among the approaches considered, the FCA method combines fewer parameters, lower FLOPs, and higher recognition accuracy, demonstrating strong overall performance. When computational resources are limited, the FCA or FCCA methods may be the better choice; when resources are plentiful, the POSTER method can unlock its full potential and deliver superior accuracy.
To explore the effects of FACL on the model, we conducted extensive experiments on the RAF-DB and AffectNet datasets, systematically adjusting the weighting coefficient of the FACL term and analyzing the resulting variations in recognition performance. As shown in
Figure 10, FACL consistently yielded the highest recognition accuracy when the coefficient was set to 1, regardless of the dataset. Recognition accuracy improved as the coefficient approached 1 and declined as it moved away from 1. Furthermore, when the coefficient was excessively large, FACL's recognition accuracy fell below that of the FCA method. Our findings indicate that FACL enhances model performance only when its weighting coefficient is set within a reasonable range; otherwise, it may even degrade the model's behavior.
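To make the role of the weighting coefficient concrete, the sketch below combines cross-entropy on both branches with a flip-consistency term, assumed here to be the mean-squared difference between the attention map of the original image and the horizontally re-flipped attention map of its mirrored copy. This functional form and the names are illustrative assumptions, not the exact definition of FACL.

```python
import torch
import torch.nn.functional as F

def fcca_loss(logits, logits_flip, attn, attn_flip, labels, weight=1.0):
    """Cross-entropy on both branches plus a weighted flip-consistency term.

    attn, attn_flip: [B, H, W] attention maps from the original and the
    horizontally flipped input; flipping attn_flip back aligns the two maps.
    """
    ce = F.cross_entropy(logits, labels) + F.cross_entropy(logits_flip, labels)
    consistency = F.mse_loss(attn, torch.flip(attn_flip, dims=[-1]))
    return ce + weight * consistency

# Illustrative usage with random tensors (batch of 8, 7 emotion classes).
B, H, W = 8, 7, 7
labels = torch.randint(0, 7, (B,))
loss = fcca_loss(torch.randn(B, 7), torch.randn(B, 7),
                 torch.rand(B, H, W), torch.rand(B, H, W), labels, weight=1.0)
```

In this sketch, setting `weight` to 1 corresponds to the best-performing configuration reported in Figure 10, while very large values would let the consistency term dominate the cross-entropy objective.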
To thoroughly assess the performance of the FCCA method, we generated comprehensive training iteration curves for FCCA on the RAF-DB and AffectNet datasets, incorporating metrics for recognition accuracy and loss values. As shown in
Figure 11, on the RAF-DB dataset, FCCA achieved a peak recognition accuracy of 91.30% at the 51st training epoch. Compared to FCA without pretrained weights, which trained slowly and yielded lower recognition accuracy, FCCA with pretrained weights exhibited a marked acceleration in training convergence, albeit accompanied by unstable oscillations. With FACL integrated, the network converged rapidly and consistently with minimal oscillations, achieving the best recognition accuracy. Similarly, on the AffectNet dataset, FCCA achieved its best recognition accuracy of 65.51% at the 10th training epoch, again demonstrating rapid convergence and a stable training process.
FER is a subfield of emotion recognition and sentiment analysis. Emotion recognition aims to identify human emotions and attitudes through multimodal cues such as text, speech, and visual signals, whereas sentiment analysis focuses more on detecting and classifying specific emotional states such as happiness and sadness. Emotion recognition draws on NLP, computer vision, and speech processing, and is applied in human–computer interaction, social media analysis, and healthcare. Because human emotions are expressed through text, speech, and visual cues, it inherently involves multimodal learning, and facial expression recognition is central because the face is the most direct and universal indicator of emotion. Key technologies include feature extraction (e.g., BERT for text, Wav2Vec for speech, CNNs for visual input), modality fusion (early and late fusion with attention mechanisms), cross-modal learning (aligning facial expressions with speech or text), and self-supervised learning (pretraining on large-scale unlabeled data). FER and sentiment analysis are thus closely linked: both aim to identify emotions, but the former focuses on visual cues while the latter usually combines multiple modalities such as speech and text, which are complementary indicators of emotion. Both fields face technical challenges such as occlusion and ambiguity, addressed with techniques like random masking and self-supervised pretraining, and combining multiple modalities can improve the accuracy and robustness of recognition systems, with attention mechanisms and self-supervised learning used to enhance multimodal representations [41,42]. The importance of facial expression recognition in multimodal emotion recognition lies in improving system robustness to noise and occlusion, enhancing generalization across diverse datasets, and enabling applications in healthcare, education, entertainment, and other fields. Supported by multimodal and self-supervised learning, FER can be integrated into more accurate emotion recognition systems, improving performance and advancing emotion understanding in complex scenarios.
Dynamic FER (DFER) aligns better with natural scenarios for real-time video analysis than static FER (SFER). However, DFER faces challenges such as varying expression intensities across video sequences; an expression intensity-aware loss function can address this by evaluating and weighting intensity variations [
39]. Additionally, optimizing the FasterNet architecture for speed and reliability—through lightweight design, attention mechanisms, and distillation techniques—can enhance real-time video analysis capabilities.