SlowR50-SA: A Self-Attention Enhanced Dynamic Facial Expression Recognition Model for Tactile Internet Applications

Abstract: Emotion recognition from facial expressions is a challenging task due to the subtle and nuanced nature of facial expressions. Within the framework of the Tactile Internet (TI), this technology has the capacity to transform real-time user interactions by delivering customized emotional feedback. Its influence is far-reaching, as it may be used in immersive virtual reality interactions and in remote tele-care applications to identify emotional states in patients. In this paper, a novel emotion recognition algorithm is presented that integrates a Self-Attention (SA) module into the SlowR50 backbone (SlowR50-SA). Experiments on the DFEW and FERV39K datasets demonstrate that the proposed model achieves good performance in terms of both Unweighted Average Recall (UAR) and Weighted Average Recall (WAR), with a UAR (WAR) of 57.09% (69.87%) on DFEW and 39.48% (49.34%) on FERV39K. Notably, SlowR50-SA operates with only eight input frames at low temporal resolution, highlighting its efficiency. Furthermore, the algorithm has the potential to be integrated into TI applications, where it can enhance the user experience by providing real-time emotion feedback: in virtual reality, by personalizing haptic feedback according to the user’s emotional state, and in remote tele-care, by detecting signs of stress, anxiety, or depression in patients.


Introduction
The Tactile Internet (TI) is a fundamental aspect of 6G technology that envisions real-time haptic communication and offers revolutionary possibilities for remote work and immersive interactions. TI is a very low-latency, ultra-high-reliability communication system [1]. It is designed to allow remote access, monitoring, manipulation, or control of physical or virtual objects or processes that are perceived as happening in real time, either by humans or automated systems [2]. The development of TI presents obstacles, particularly in the field of tactile cognition, which refers to the understanding and processing of tactile interactions. A comprehensive grasp of context is essential for delivering superior haptic feedback and facilitating immediate performance assessment. Artificial intelligence approaches are needed to compensate for the unavoidable delays in remote work. However, to enhance motion prediction and feedback, it is crucial to gather more information about the context and terminal interactions [3]. Key features of TI include ultra-low latency, reliability, and a human-centric approach through projects such as Tactile Internet with Human-in-the-Loop (TaHiL) [4].
TI incorporates tactile and kinesthetic content to complement the visual and aural aspects of augmented reality (AR) and virtual reality (VR) experiences [5]. It embodies a concept of the Internet that integrates the sensation of touch with conventional communication methods, with the goal of facilitating remote operation of systems without requiring physical proximity. TI applications enable tactile communications, especially in the context of haptic-enabled VR. These applications require extremely low latency, less than 50 ms, making them suitable for remote phobia treatment via VR [6]. A critical challenge in implementing TI is the "1 ms challenge", which requires the system reaction time to be less than 1 ms in order to prevent users from being able to differentiate between local and remote control [7].
Another paradigm emerges as TI enables multisensory experiences in the Metaverse, integrating tactile and kinesthetic components with visual and audio content. As users interact within the Metaverse, the TI enhances the sensation of immersion by adding the senses of touch and pressure. However, it presents challenges related to reliability, low latency, and sensitivity to network jitter, making its integration into the Metaverse a complex but promising endeavor. In addition, it is essential to address issues related to the emotional and mental well-being of users, ethical norms, and the preservation of a safe and healthy environment within this virtual realm [8]. As the Metaverse enhances digital experiences with tactile elements, the integration of semantic compression refines the efficient communication of emotions, thus augmenting the immersive quality of virtual interactions. Semantic compression also addresses high-latency challenges in affective computing [9], playing a pivotal role in the TI's pursuit of diverse experiences, including emotional, behavioral, and cognitive dimensions, in the 6G era [1]. Unlike remote vision and hearing, haptic sensing remains an uncharted area that the TI seeks to explore [10]. In the realm of 6G TI, the convergence of semantic compression and emotions, exemplified by the work of Akinyoade and Eluwole [11], initiates further exploration of applications and research directions, recognizing its transformative role in shaping the era of 6G technology.
The convergence of emotional intelligence with artificial intelligence (AI) and machine learning (ML) enables the development of robotic systems that can accurately perceive and react to human emotions [12]. Emotion, a complex and subjective feeling encompassing physiological, psychological, and behavioral aspects, arises from internal or external stimuli. It can be expressed in various forms, such as joy, sadness, surprise, or fear, and can be observed in facial expressions, heard in a person's voice, or even conveyed through touch. Not only does touch affect emotion, but emotional expressions also affect touch perception. Emphasis is therefore placed on enhancing the Quality of Experience (QoE) through machine learning and QoE models [13,14]. The potential of the Tactile Internet extends to applications like Exergames, where users interact with emotions through speech recognition. Exergames allow users to express feelings verbally during exercise, recording and analyzing emotions to generate tactile vibrations corresponding to the users' feelings. This real-time feedback enhances users' overall experience and satisfaction during physical activities [15].
Facial expression recognition is a crucial component in the larger context of emotion recognition. It entails the analysis of a person's facial expressions, performed either by humans or by computer systems using image processing and AI technologies.
The motivation behind this work stems from the burgeoning field of TI and its potential to revolutionize real-time user experiences through the integration of emotion recognition technology. With the advent of 6G technology, there is a growing emphasis on ultra-low-latency communication systems like TI, which enable remote interactions with physical or virtual objects in real time. However, the full realization of TI's potential hinges on the incorporation of emotional intelligence, particularly in applications such as immersive virtual reality interactions and remote tele-care. Recognizing the crucial role of facial expressions in conveying emotions, the objective of this study is to develop a novel deep learning architecture specifically tailored for Dynamic Facial Expression Recognition (DFER) within the TI framework. By integrating an SA module into the SlowR50 backbone (SlowR50-SA), this research aims to enhance the accuracy and efficiency of emotion recognition, thus paving the way for personalized emotion feedback in TI applications. Through rigorous experimentation and benchmarking against state-of-the-art methods, this work seeks to demonstrate the effectiveness and computational efficiency of the proposed model, thereby addressing a critical need in the evolving landscape of TI-enabled interactions.
The main contributions of the presented work are as follows:
• A novel deep learning architecture for DFER is presented that effectively extracts spatiotemporal features using the SlowR50 (8 × 8) model. This architecture uses a slow pathway with low temporal resolution for capturing long-range temporal information and identifying subtle changes in facial expressions. The inclusion of an SA module further refines the feature vector, dynamically attending to relevant spatial and temporal details and enhancing the representation of nuanced facial expressions.
• The proposed algorithm achieves superior performance on benchmark datasets (DFEW and FERV39K) compared to state-of-the-art methods, demonstrating its effectiveness in Dynamic Facial Expression Recognition. The model outperforms competitors in terms of both UAR and WAR, showcasing its capability to accurately classify emotions.
• The model demonstrates computational efficiency by achieving state-of-the-art results with only eight frames of input. This efficiency, combined with its high performance, positions the algorithm as a promising candidate for real-world applications, especially in TI scenarios, where it can effectively recognize and respond to facial expressions with reduced computational cost.
The rest of this paper is structured as follows. Section 2 provides an overview of related works in the field of DFER. Section 3 details the proposed model and its implementation. In Section 4, we present the datasets used, experimental setups, comparisons with state-of-the-art methods, and an ablation study. This section also includes visualizations of 2D t-SNE features and confusion matrices. Finally, Section 5 concludes the paper, summarizing key findings and contributions.

Related Work
Dynamic facial expression recognition is a complex task in computer vision and affective computing. Its goal is to classify a facial video clip, rather than a still image, into one of the basic emotions. The field of DFER has attracted considerable attention from researchers [16][17][18][19][20][21][22]. These studies share a common goal of addressing challenges that arise in real-world environments, such as occlusion, pose variation, and noisy frames. Despite the progress made by these methods, they still fall short in extracting comprehensive temporal features that encompass both short-term and long-term aspects. Consequently, a prevailing trend in recently published works is the adoption of the transformer architecture for modeling spatiotemporal relationships in facial expressions. This architectural choice, as highlighted by [16][17][18][19][20][21], underscores the importance of capturing complex dependencies within dynamic facial expressions. These DFER methods are evaluated on commonly used datasets such as DFEW, FERV39K, AFEW, and BU-3DFE, which form the basis of the comparative analyses in [16][17][18][19][20][21].
Examining the differences between these DFER methods reveals different approaches to spatiotemporal modeling. One notable study by Liu et al. (2023) [16] proposes an Expression Snippet Transformer (EST) that decomposes videos into expression snippets and predicts the order of shuffled snippets. This approach unifies video lengths through interpolation and clipping and achieves high accuracy across multiple datasets. However, EST focuses on snippet-based analysis, leaving room for improvement in capturing long-range temporal dependencies. Zhao et al. (2021) [17] introduced Former-DFER, which combines a convolutional spatial transformer (CS-Former) and a temporal transformer (T-Former) to learn spatial and temporal features. While Former-DFER effectively captures spatiotemporal relationships, its performance may be limited by the complexity of the transformer architecture and the computational resources required. Lee et al. (2023) [18] present Frame-Level Emotion-Guided Dynamic Facial Expression Recognition, featuring an Affectivity Extraction Network (AEN) trained with frame-level emotion-guided loss functions. These loss functions enhance recognition accuracy, but the method may lack robustness in handling diverse environmental scenarios. Li et al. (2023) [22] address varying expression intensities by integrating a global convolution-attention (GCA) block and an intensity-aware loss (IAL). While effective in addressing intensity variations, this approach may require additional computational overhead. Li et al. (2022) [19] propose NR-DFERNet, which addresses noisy frames using a dynamic-static fusion module (DSF) and a snippet-based filter (SF) to mitigate the impact of neutral frames.
These methodological approaches also differ in their training paradigms. Wang et al. (2023) [20] reimagined the learning paradigm for DFER by treating it as a weakly supervised problem and introduced the multi-3D dynamic facial expression learning (M3DFEL) framework with multi-instance learning (MIL). Variations in loss functions are also investigated: Li et al. (2023) [22] introduce the Intensity-Aware Loss to distinguish samples with low expression intensity. While this loss effectively handles expression intensity variations, it may introduce additional computational overhead during training, potentially limiting scalability to larger datasets. Attention mechanisms are another focal point, as seen in the work of Ma et al. (2022) [21], which proposes a local-global spatiotemporal transformer (LOGO-Former) with attention mechanisms to capture local and global dependencies. However, this work may face challenges in capturing long-range dependencies and subtle temporal changes in dynamic facial expressions, potentially impacting its effectiveness in recognizing nuanced expressions. To handle noisy frames, NR-DFERNet [19] additionally introduces a dynamic class token (DCT) and uses the SF to process noisy frames in the decision stage. However, it may have limited effectiveness in handling complex noise patterns in dynamic video sequences, potentially leading to misclassification of facial expressions in challenging environmental conditions.
The implementations of these methods also differ. The EST method [16] unifies video lengths through interpolation and clipping, applies face detection, and randomly selects frames to create expression snippets. With an implementation in PyTorch, EST achieves an average FER accuracy of 88.17% across datasets such as BU-3DFE, MMI, AFEW, and DFEW, and is noteworthy for its real-time speed and computational efficiency. Because EST relies on random frame selection and interpolation to create expression snippets, it may introduce biases or artifacts in the extracted snippets, potentially leading to inaccuracies in recognition. The Frame-Level Emotion-Guided Dynamic Facial Expression Recognition with Emotion Grouping method [18] introduces the AEN architecture, incorporating temporal transformers and pre-processing that involves face region detection. Trained in PyTorch on an NVIDIA RTX 3090 GPU, it utilizes pre-trained networks and introduces fusion parameters for dynamic emotion grouping, emphasizing the efficacy of the proposed loss functions and fusion parameters. Addressing the nuances of expression intensity, the Intensity-Aware Loss for Dynamic Facial Expression Recognition in the Wild method [22] employs a GCA block, a Dynamic-Static Fusion Module, and a Temporal Transformer for feature extraction. Trained with PyTorch-GPU on Tesla V100 GPUs, it achieves competitive performance on dynamic facial expression recognition tasks. As mentioned, this architecture relies heavily on pre-trained networks and fusion parameters, which may limit its adaptability to novel datasets or dynamic environmental conditions.
In summary, the DFER field has witnessed substantial progress with various methodological approaches, all aiming to address challenges posed by real-world video conditions, such as occlusion, pose variation, and noisy frames. The adoption of the transformer architecture, as evident in several studies, emphasizes the importance of capturing complex spatiotemporal relationships within dynamic facial expressions. Methodological variations include innovative techniques like EST, Former-DFER, Frame-Level Emotion-Guided Dynamic Facial Expression Recognition, NR-DFERNet, and M3DFEL, each proposing unique solutions to the challenges at hand. These methods employ diverse strategies, including attention mechanisms, intensity-aware loss, and novel training paradigms like weakly supervised learning and multi-instance learning. Evaluations on benchmark datasets reveal competitive performance, showcasing advancements in addressing expression intensity, noisy frames, and long-term temporal relationships. While these methodologies exhibit promising results, ongoing research and exploration of new techniques remain crucial for further advancements in the dynamic facial expression recognition domain.

Proposed Model
The proposed video classification model for DFER is illustrated in Figure 1. For the extraction of spatiotemporal features, the slow pathway of the SlowFast architecture is utilized, specifically the SlowR50 (8 × 8) model introduced in [23]. Several reasons motivate the use of a slow pathway with low temporal resolution. First, it can be used to extract long-range temporal information. When recognizing emotions in videos, it is necessary to identify the overall emotional state of the person, taking into account the temporal dynamics of their facial expressions. This can be effectively achieved by processing the video at a low temporal resolution, as it allows the model to focus on the subtle changes in facial expressions that occur over time. The slow pathway's ability to capture long-range temporal information allows it to identify long-range relationships, which is crucial for accurate expression recognition. Second, it can reduce the computational cost of DFER algorithms. Because the slow pathway operates at a lower temporal resolution than the fast pathway, it requires fewer computations and less memory, making it more efficient to train and use. For the proposed model, each video is sampled into C frames of resolution M × N, resulting in an input tensor of shape M × N × C. This approach allows us to efficiently capture the temporal dynamics of facial expressions while still maintaining a manageable input size. The feature vector extracted by the SlowR50 backbone is further refined by the SA module, which consists of multiple blocks: a Multi-Head Attention block, a Summation block, and a Linear (FC) layer followed by a ReLU activation function (see Figure 1). The workflow of the SA module is described in detail below.
Multi-Head Attention: Initially, a Multi-Head Attention mechanism with four heads is applied to the single feature vector, allowing the model to dynamically attend to the most relevant spatial and temporal information encoded within the vector. This attention mechanism enables the model to capture the subtle nuances of facial expressions, even at low temporal resolutions. Because the objective is to capture relationships within the same vector, the query (Q), key (K), and value (V) inputs are all set equal to the feature vector extracted by the backbone, so the operation reduces to self-attention over that vector. In general, the Multi-Head Attention operation is expressed as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$

where $h$ is the number of heads and each attention head ($\mathrm{head}_i$) is computed as:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})$$

where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the weight matrices of the query, key, and value transformations for attention head $i$, and $W^{O}$ is the weight matrix of the output transformation. The attention function (Attention) is defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

The variable $K^{T}$ denotes the transposed matrix of the key vectors, ensuring compatibility with the query vectors for the attention calculation, and $d_k$ represents the dimensionality of the key vectors.
Summation block: After the Multi-Head Attention mechanism has identified and weighted the most relevant spatial and temporal information within the feature vector, the attended features are added back to the original feature vector. This summation operation integrates the attention mechanism's output into the feature representation, weighting the features according to their importance and enriching the representation with additional information. This enhanced feature representation allows the model to better capture the subtle details in facial expressions.

Linear (FC layer) and ReLU activation function: The model further refines the feature representation by passing it through a Fully Connected (FC) layer and a ReLU activation function. This combination applies a learned non-linear projection to the attended features, which increases the expressiveness of the representation and is intended to improve the model's ability to generalize to unseen data. The refined feature representation provides the model with a more informative basis for predicting the underlying emotion in the video frames.
After the SA module, the final Linear (FC) layer is the classification layer, which produces the emotion label. During the training phase, the pre-trained backbone is fine-tuned, while the SA module and the classification layer are trained from scratch.
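The data flow of the SA module described above (attention with Q = K = V, summation with the input, then a Linear layer and ReLU) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it uses a single head with identity projection matrices instead of four heads with learned weights, and the toy input, `w_fc`, and `b_fc` are hypothetical values chosen only to make the example runnable. The toy input has two rows so that the softmax weights are visible.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def sa_block(x, w_fc, b_fc):
    """Self-attention, residual summation, then Linear + ReLU.

    Assumption: one head and identity Q/K/V/output projections, for brevity.
    """
    attended, weights = attention(x, x, x)  # Q = K = V = x
    summed = [[a + b for a, b in zip(ra, rx)] for ra, rx in zip(attended, x)]
    projected = matmul(summed, w_fc)
    out = [[max(0.0, val + b) for val, b in zip(row, b_fc)] for row in projected]
    return out, weights

# Hypothetical toy input: 2 rows of 4 features, projected to 2 features.
x = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
w_fc = [[0.5, -0.5], [0.25, 0.25], [-0.5, 0.5], [0.1, -0.1]]
b_fc = [0.0, 0.1]
out, weights = sa_block(x, w_fc, b_fc)
```

In a PyTorch setting such as the one used in this work, the same flow would normally be built from library layers with learned projections; the sketch only mirrors the order of operations in Figure 1.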

Implementation Details
The algorithm is implemented in PyTorch-GPU (v1.12.1) [24] and trained on an NVIDIA GeForce RTX 2080 Ti GPU (Graphics Processing Unit). It uses a SlowR50 (8 × 8) model [23] for feature extraction, which is fine-tuned during training. The models are trained for 100 epochs with an AdamW optimizer, a learning rate of $5 \times 10^{-4}$, and a weight decay of 0.05. After finding the best model, it is fine-tuned for another 30 epochs with a learning rate of $5 \times 10^{-5}$. Each video is input to the algorithm as eight frames of 196 × 196 pixels each. Horizontal flipping, random cropping, and color jitter are applied to augment the data.
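The text states that each video is reduced to eight frames but does not say how those frames are chosen. As one plausible illustration, the sketch below picks the centre frame of eight equal-length temporal segments. The function name `sample_frame_indices` and the uniform-segment strategy are assumptions made for this example, not details taken from the paper.

```python
def sample_frame_indices(num_frames, clip_len=8):
    """Return clip_len frame indices spread evenly over a video.

    Assumption: uniform segment-centre sampling; clips shorter than
    clip_len simply repeat frames.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    # Take the centre of each of clip_len equal segments, clamped to the last frame.
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / clip_len))
        for i in range(clip_len)
    ]

# A 64-frame clip yields one index from the middle of each 8-frame segment.
indices = sample_frame_indices(64)
```

With a 64-frame clip, consecutive indices are eight frames apart, which matches the temporal stride implied by the 8 × 8 notation.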

Datasets
Dynamic Facial Expression in-the-Wild (DFEW) [25] is a comprehensive dataset captured in real-world settings, introduced in 2020. It comprises over 16,000 video clips featuring dynamic facial expressions, collected from over 1500 movies from around the world and presenting diverse real-world scenarios with challenges such as extreme illumination, self-occlusion, and unpredictable pose changes. Each video clip is carefully annotated by ten well-trained experts under professional guidance. The annotations classify expressions into seven categories: Happy, Sad, Neutral, Angry, Surprise, Disgust, and Fear.
FERV39K [26] encompasses 38,935 video clips sourced from four scenarios, further categorized into 22 fine-grained scenes. Distinguished by its unprecedented scale of 39K clips, scenario-scene division, and cross-domain supportability, FERV39K marks a milestone in DFER datasets. Each video clip within FERV39K undergoes meticulous annotation by 30 professional annotators, ensuring the provision of high-quality labels. These annotations classify expressions into the same seven primary categories as in DFEW.

Experimental Protocol
In this study, UAR and WAR are employed as primary evaluation metrics, aligning with established practices in the field of dynamic facial expression recognition. These metrics are widely used in previous studies for their effectiveness in evaluating model performance across various domains, including facial expression recognition [17][18][19][20][22][27]. UAR, computed as the average recall across all classes, provides an unbiased assessment of the model's ability to accurately classify facial expressions without favoring any specific class. It can be defined as:

$$\mathrm{UAR} = \frac{1}{N} \sum_{i=1}^{N} R_i$$

where $N$ is the number of classes and $R_i$ is the recall for class $i$. WAR extends the evaluation beyond UAR by considering the distribution of samples across different classes. By weighting the recall of each class by its sample size, WAR accounts for the class imbalances commonly encountered in real-world datasets. It can be expressed as:

$$\mathrm{WAR} = \frac{\sum_{i=1}^{N} S_i R_i}{\sum_{i=1}^{N} S_i}$$

where $S_i$ is the number of samples for class $i$. Given the widespread use of UAR and WAR in existing literature, their adoption in this study enables direct comparisons with prior research outcomes. This ensures the consistency and reliability of the findings while facilitating a deeper understanding of the proposed model's performance relative to state-of-the-art approaches.
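As a worked illustration of the two metrics, the following sketch computes UAR and WAR from lists of true and predicted labels. The function names and the toy labels are hypothetical, and classes with no true samples are skipped, which is an assumption of this example rather than something the text specifies.

```python
from collections import Counter

def per_class_recall(y_true, y_pred, classes):
    """Return ({class: recall}, {class: support}) for classes with samples."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {c: hits[c] / support[c] for c in classes if support[c] > 0}
    return recalls, support

def uar(y_true, y_pred, classes):
    """Unweighted Average Recall: mean of the per-class recalls."""
    recalls, _ = per_class_recall(y_true, y_pred, classes)
    return sum(recalls.values()) / len(recalls)

def war(y_true, y_pred, classes):
    """Weighted Average Recall: per-class recalls weighted by class size."""
    recalls, support = per_class_recall(y_true, y_pred, classes)
    total = sum(support[c] for c in recalls)
    return sum(support[c] * recalls[c] for c in recalls) / total

# Hypothetical imbalanced toy example: 6 happy, 3 sad, 1 disgust.
classes = ["happy", "sad", "disgust"]
y_true = ["happy"] * 6 + ["sad"] * 3 + ["disgust"]
y_pred = ["happy"] * 5 + ["sad"] + ["sad", "sad", "happy"] + ["happy"]

toy_uar = uar(y_true, y_pred, classes)  # (5/6 + 2/3 + 0) / 3
toy_war = war(y_true, y_pred, classes)  # (5 + 2 + 0) / 10
```

Because $S_i R_i$ is the number of correctly classified samples of class $i$, WAR under this definition coincides with overall accuracy. The toy example also shows why UAR is the stricter metric on imbalanced data: the single missed disgust sample lowers UAR far more than it lowers WAR.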
To ensure fair and consistent comparisons, we adopted a 5-fold cross-validation setup as suggested by DFEW [25] for evaluating various methods. For the FERV39K dataset, we followed the recommended approach from [26] by partitioning the data into 80% training and 20% testing sets.

Comparison of the Proposed Method with the State-of-the-Art Methods
The comparative analysis of the proposed SlowR50-SA algorithm with other state-of-the-art methods on both the DFEW and FERV39K datasets is presented in Table 1. The research works included in the comparison were chosen because they use the same experimental protocol as this study. The results in the table show that SlowR50-SA outperforms all other approaches in terms of both UAR and WAR. It surpasses the AEN model [18] by 0.43% (0.5%) in UAR (WAR) on DFEW. Additionally, SlowR50-SA outperforms the M3DFEL model [20] by a UAR (WAR) margin of 0.99% (0.62%) on DFEW and 3.54% (1.67%) on FERV39K, despite using only eight input frames compared to M3DFEL's sixteen. On FERV39K, SlowR50-SA also exceeds the second-best model in terms of UAR, ResNet18-ViT, by 0.13%, and the second-best model in terms of WAR, IAL, by 0.8%. These results demonstrate the effectiveness of SlowR50-SA, which achieves superior performance while using fewer frames.

Ablation Study on Self-Attention Module
Adding a Self-Attention module after the SlowR50 backbone resulted, on the DFEW dataset, in an improvement of over 0.3% in the UAR metric and almost 0.5% in the WAR metric (see Table 2). This improvement came with an increase of 17.82M parameters and a slight increase of 40M FLOPs (floating-point operations). Recall from the results in Table 1 that the SlowR50 backbone alone already performs strongly on the DFEW dataset, even outperforming the state-of-the-art AEN method [18]. The integration of the SA module further enhances the model's ability to capture subtle spatiotemporal dependencies within facial expression sequences, so the additional complexity it introduces yields a further gain in DFER performance.
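The reported parameter increase can be sanity-checked with simple arithmetic. The sketch below counts the parameters of a standard multi-head attention layer (four square projections with biases), one FC layer, and a classification layer, and subtracts a baseline classifier attached directly to the backbone. The 2048-dimensional backbone feature, the 512-unit FC layer, and the layer layout are assumptions made for this estimate; the text does not state them.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def mha_params(embed_dim):
    """Query, key, value, and output projections of standard multi-head attention.

    The number of heads does not change the count, because the heads
    split the embedding dimension between them.
    """
    return 4 * linear_params(embed_dim, embed_dim)

def added_params(embed_dim=2048, hidden_dim=512, num_classes=7):
    """Extra parameters of an SA head relative to a plain linear classifier.

    Assumption: embed_dim and hidden_dim are hypothetical values chosen
    for this estimate, not figures taken from the paper.
    """
    baseline_head = linear_params(embed_dim, num_classes)
    sa_head = (
        mha_params(embed_dim)
        + linear_params(embed_dim, hidden_dim)
        + linear_params(hidden_dim, num_classes)
    )
    return sa_head - baseline_head

extra = added_params()
extra_millions = round(extra / 1e6, 2)
```

Under these assumed sizes the estimate comes out at roughly 17.8M additional parameters, in the same range as the reported increase. This is only a consistency check, not a statement of the configuration actually used.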

Detailed Results
In this section, a visual representation of the data using t-SNE [36] is provided. Specifically, two-dimensional t-SNE plots are employed to visualize samples from both the DFEW and FERV39K datasets, aiding in the comprehension of their distribution. Additionally, the confusion matrices of both datasets are presented for further analysis.
Two-dimensional t-SNE feature visualization: Figure 2a,b show the distribution of features in different colors and example image samples for each emotion in the DFEW and FERV39K datasets, respectively. For the DFEW dataset, the features for the neutral, happy, sad, and angry emotions are more clearly separated into clusters, whereas the features for fear and surprise are more dispersed. The samples belonging to the disgust class do not form a cluster, likely due to the low proportion of disgust videos (1.22%) in the dataset. The absence of a distinct cluster for disgust indicates that the model has difficulty classifying this emotion accurately. Regarding the FERV39K dataset (Figure 2b), the clusters exhibit a more diffuse distribution than for DFEW. Similar to the DFEW dataset, clusters representing neutral, happy, sad, and angry emotions appear more tightly grouped, whereas fear, surprise, and disgust exhibit a more dispersed arrangement. The two t-SNE figures visually reinforce the findings presented in Table 1.

Confusion matrices: The effectiveness of the proposed SlowR50-SA algorithm on the DFEW dataset is examined through confusion matrices generated across all five folds (Figure 3). These matrices reveal that the model struggles to accurately predict both the expressions of disgust and fear. The model performs particularly poorly with expressions of disgust, as observed in the earlier t-SNE visualization. While the model performs better with expressions of fear, it still struggles to classify them accurately because videos with this emotion are also rare in the dataset (only 8.14%). This suggests that distinguishing disgust and fear from the other expressions is particularly challenging. Additionally, the model tends to classify samples as neutral expressions in an attempt to minimize the risk of misclassification. Figure 4 depicts the confusion matrix for the FERV39K dataset. It reveals that happy, sad, and neutral emotions are recognized at rates exceeding 50%, while the remaining four emotions exhibit lower recognition rates. The comparison of the confusion matrices between the DFEW and FERV39K datasets indicates notable differences in recognition performance: classification accuracy on DFEW is higher than on FERV39K, and this holds across multiple emotion categories. These findings underscore the importance of dataset selection in training emotion recognition models and suggest the need for further investigation into the factors contributing to the variance in performance between datasets.
GradCAM [37] visualizations: Figure 6 illustrates GradCAM visualizations from the final layers of the SlowR50 backbone across the seven emotions. The activations primarily occur in regions that are characteristic of distinct emotions. This observation holds especially true for the emotions depicted in the first row of Figure 6, including happy, sad, neutral, and angry. Nevertheless, for the emotions of disgust and fear, the model does not focus on the relevant facial regions associated with these emotions. Consequently, the performance is not satisfactory for these emotions, as evidenced by the confusion matrices and t-SNE visualizations above.

Limitations of the Presented Work
The following are considered limitations of the present work:

• While the proposed SlowR50-SA algorithm demonstrates superior performance on the DFEW and FERV39K datasets, its ability to generalize to other datasets or real-world scenarios remains untested. The datasets used may not fully represent the diversity of facial expressions encountered in real-world settings, potentially limiting the algorithm's applicability in practical situations.
• Both the DFEW and FERV39K datasets may suffer from class imbalance issues, which can affect the model's performance, especially for minority classes such as disgust and fear. Imbalanced datasets may lead to biased models that prioritize majority classes, potentially resulting in lower accuracy for minority classes.
• The ablation study focuses solely on the addition of the Self-Attention module to the SlowR50 backbone. Further analyses, such as investigating the impact of different hyperparameters or architectural variations, could provide deeper insights into the algorithm's performance and help optimize its design.
• Although the proposed algorithm achieves good performance with only eight frames of input, its computational efficiency in real-world applications, especially on resource-constrained devices or in real-time systems, remains unclear. Assessing the algorithm's efficiency in practical deployment scenarios is essential for its feasibility in TI applications.
To address the limitations highlighted above, future research efforts could focus on the following areas:

• Generalization to diverse datasets: We acknowledge the importance of evaluating the algorithm's performance on a wider range of datasets, including those with more diverse facial expressions and real-world scenarios. Future work could involve testing the SlowR50-SA algorithm on additional datasets and assessing its robustness across various settings.

• Mitigating class imbalance issues: To mitigate the impact of class imbalance on model performance, future studies could explore techniques such as data augmentation, oversampling of minority classes, or using advanced loss functions tailored to handle imbalanced datasets. Additionally, efforts could be made to collect or curate datasets that better represent the distribution of facial expressions in real-world scenarios.

• Extended scope of ablation study: Further analysis could extend beyond the addition of the Self-Attention module to explore the effects of different hyperparameters, architectural variations, or alternative model components. Conducting comprehensive experiments would provide deeper insights into the algorithm's behavior and aid in optimizing its performance.

• Evaluation of computational efficiency: Future research should prioritize assessing the algorithm's computational efficiency in practical deployment scenarios. This could involve benchmarking the algorithm on resource-constrained devices, evaluating its runtime performance, and optimizing its implementation for real-time applications.
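As one illustration of the loss-based mitigation of class imbalance mentioned in the list above, the sketch below derives inverse-frequency class weights and applies them in a weighted cross-entropy. This is not the training procedure used in this work. The function names, the class counts, and the normalization by the sum of sample weights are all assumptions chosen for illustration.

```python
import math


def inverse_frequency_weights(class_counts):
    """Weight class c by total / (num_classes * count_c).

    A perfectly balanced dataset yields a weight of 1.0 for every class;
    rarer classes receive proportionally larger weights.
    """
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * count) for count in class_counts]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class) over a batch.

    probs[n][c] is the predicted probability of class c for sample n,
    labels[n] is the true class index, and the result is normalized by
    the sum of the weights of the samples in the batch.
    """
    numerator = 0.0
    denominator = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        numerator += -w * math.log(p[y])
        denominator += w
    return numerator / denominator


# Hypothetical class counts: the third class is a minority class.
counts = [500, 400, 100]
weights = inverse_frequency_weights(counts)

# Two samples with the same confidence in the true class; the error on
# the minority-class sample contributes more to the loss.
batch_probs = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
batch_labels = [0, 2]
loss = weighted_cross_entropy(batch_probs, batch_labels, weights)
print(weights, loss)
```

With these counts the minority class receives a weight five times larger than the most frequent class, so its misclassifications are penalized more heavily during training. Whether such weighting improves recall for disgust and fear in this setting would have to be verified experimentally.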

Conclusions
This paper presents SlowR50-SA, a novel emotion recognition algorithm that appends a Self-Attention module to the SlowR50 backbone. The experimental results on two benchmark datasets, DFEW and FERV39K, indicate that SlowR50-SA performs favorably compared to other algorithms, demonstrating comparable or better performance in terms of both UAR and WAR. Additionally, the model uses only eight frames of input, indicating its efficiency. The ablation study in Table 2 further highlights the positive impact of the Self-Attention module, which significantly improves the model's performance. These findings demonstrate the potential of SlowR50-SA as a powerful tool for emotion recognition. Its state-of-the-art performance, computational efficiency, and ability to operate with fewer input frames make it a promising candidate for real-world TI applications. Based on the promising outcomes of this study, future research could explore further enhancements to SlowR50-SA, such as experimenting with different variations of the Self-Attention module, integrating multimodal data sources for more robust emotion recognition, and conducting experiments with different backbone architectures and hyperparameters. Additionally, evaluating SlowR50-SA in real-world TI scenarios and exploring transfer learning techniques could accelerate its deployment and improve its effectiveness across diverse

Figure 1. A pipeline of the proposed model.

Figure 3. The confusion matrices obtained by the proposed SlowR50-SA algorithm on the DFEW dataset.

Figure 4. The confusion matrix obtained by the proposed SlowR50-SA algorithm on the FERV39K dataset.

Table 1. Comparison of the proposed SlowR50-SA model with state-of-the-art methods on the DFEW and FERV39K datasets (bold indicates the best result, while underline indicates the second-best result). The UAR and WAR values for the methods compared with the SlowR50-SA algorithm are taken from the corresponding literature.

Table 2. Ablation study on the effect of the SA module added after the SlowR50 backbone on the DFEW dataset.