1. Introduction
As the study of emotions has advanced, their significance has become increasingly evident, leading to widespread applications of emotional research in real-life scenarios. Facial expressions serve as pivotal cues for capturing human emotions. Facial expression recognition (FER) tasks are extensively employed in educational, medical, and human–computer interaction settings to enhance human assistance and learning experiences. Consequently, research in FER has transitioned from controlled laboratory environments to practical field applications, expanding from static to dynamic facial expression recognition (DFER) and facing a broader array of challenges.
Firstly, as Izard highlighted, facial expressions occur within fractions of a second [
1], necessitating that FER systems operate at high speed to capture emotional cues accurately and realize the practical value of FER applications. In current FER systems, recognition is performed by loading a pretrained model, so the model's inference speed is equivalent to the recognition speed. However, with the continuous development of artificial intelligence, end-to-end training is accelerating. In the future, application scenarios will no longer rely on fixed pretrained models to adapt to user inputs; instead, user input data will also become part of the training data, and iterative training will be deployed at the application end so that the model can continuously improve. Training speed will then become essential to ensure that downstream tasks can be trained in real time, even under limited computing power. Large-scale deployment of artificial intelligence therefore poses new performance challenges for FER: while pursuing recognition accuracy, FER methods should also focus on improving training speed. Moreover, current FER techniques often neglect the central region of the face. Effective FER should prioritize the central facial area of a photograph while minimizing attention to the periphery; although standard cropping methods emphasize facial features, the variability inherent in facial photographs means that these methods often fail to focus on the central position consistently. Furthermore, noisy labels and low image quality continue to pose essential challenges for FER tasks [
2].
Many studies have proposed lightweight network architectures or reduced floating-point operations (FLOPs) to enhance model processing speed. However, reducing FLOPs does not always translate into reduced latency and can sometimes exacerbate delays, because memory access frequency, in addition to the raw operation count, substantially impacts model processing speed [
3,
4]. For instance, as shown in
Figure 1, the feature information extracted by CNNs is often redundant, with many channels contributing little to the final prediction. Addressing redundant channel features therefore offers a viable way to decrease computational load, reduce memory access frequency, and enhance model processing speed.
To address these challenges, we adopted FasterNet [
5] as our backbone model, characterized by channel filtering and central attention weighting. This choice effectively minimizes redundant channel features while emphasizing the central face position. Our model also employs a dual-branch structure: one branch handles normal samples while the other handles horizontally flipped samples. The outputs from both branches leverage flipped attention consistency loss to mitigate face anisotropy and address issues related to noisy label annotations.
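FasterNet's core operator is a partial convolution (PConv) that applies a regular convolution to only a fraction of the input channels and passes the remaining channels through unchanged, which cuts both computation and memory accesses. The following minimal PyTorch sketch illustrates the idea; the 1/4 channel ratio and layer configuration are illustrative rather than the exact settings used in our model.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Convolve only the first `1/div` fraction of channels; pass the rest through.

    A minimal sketch of FasterNet-style partial convolution (PConv):
    fewer convolved channels means fewer FLOPs and fewer memory accesses.
    """
    def __init__(self, channels: int, div: int = 4, kernel_size: int = 3):
        super().__init__()
        self.conv_ch = channels // div          # channels that are actually convolved
        self.pass_ch = channels - self.conv_ch  # channels copied unchanged
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch,
                              kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_conv, x_pass = torch.split(x, [self.conv_ch, self.pass_ch], dim=1)
        return torch.cat([self.conv(x_conv), x_pass], dim=1)

# Example: a 64-channel feature map; only 16 channels go through the 3x3 convolution.
features = torch.randn(2, 64, 56, 56)
out = PartialConv(64)(features)
print(out.shape)  # torch.Size([2, 64, 56, 56])
```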
In summary, the contributions of the FCCA method we proposed are as follows:
Innovative Backbone Selection: We introduce FasterNet as the backbone for FER tasks, incorporating a channel filtering mechanism to reduce memory access and enhance overall model performance. Additionally, our method utilizes a channel weighting mechanism to prioritize the central facial position, mitigating the adverse effects of edge positions on key feature extraction.
Enhanced Model Training: By integrating flipped attention consistency loss with FasterNet, our approach trains the model with both flipped and cross-entropy loss. This method leverages high-quality feature extraction to bolster model robustness and effectively mitigate the effects of label noise.
Experimental Validation: We conducted comprehensive experiments to evaluate our method's generalization capability, runtime efficiency, and recognition accuracy on publicly available datasets, including RAF-DB, AffectNet, and DFEW. Our results demonstrate that our approach generalizes well across datasets, trains rapidly, and attains recognition accuracy only slightly below that of state-of-the-art methods.
By addressing these key aspects, our FCCA method advances the field of FER, offering a robust solution that balances accuracy and efficiency across diverse application scenarios.
The remainder of this paper is organized as follows.
Section 2 (“Related Work”) critically reviews the advancements and limitations of three mainstream backbone networks for facial expression recognition (FER) tasks, namely Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and lightweight network architectures.
Section 3 (“Proposed Method”) comprehensively describes the proposed FCCA model, detailing its architectural design, operational principles, and specific implementation.
Section 4 (“Experiments”) outlines the experimental framework, including dataset specifications, evaluation metrics, and configuration details.
Section 5 (“Results”) presents a quantitative analysis of the FCCA model’s recognition performance across experimental datasets.
Section 6 (“Discussion”) offers a comparative evaluation and visualization of the model’s performance, focusing on recognition accuracy and computational efficiency, while uncovering deeper insights into the FCCA methodology. Finally,
Section 7 (“Conclusions”) summarizes this study’s key contributions and limitations, and suggests potential directions for future research. This structured approach ensures a systematic research presentation, from theoretical foundations to empirical validation and critical analysis.
2. Related Works
Scholars have continuously refined backbone models to improve their performance and robustness, aiming to better extract discriminative facial expression features and address the various challenges of the FER task. Common backbones for FER include CNNs, ViTs, and lightweight models.
Given CNNs' strong ability to extract local features, ResNet-50 and ResNet-18 long served as backbones for extracting facial expression features. Zhang et al. used ResNet-18 to extract facial expression features and combined it with a branch that estimates the uncertainty of samples [
6]. Wang et al. used ResNet-18 as a facial expression feature extractor and designed a region attention mechanism to suppress the negative impact of occlusion on the model’s learning of useful features [
7]. Farzaneh et al. also used ResNet-18 to extract features and deep center attention weighted features to highlight important features [
8]. Zhang et al. used ResNet-50 to extract key facial features and used erasing consistency loss to weaken the negative impact of noisy features on correct classification [
9]. These methods effectively alleviated difficulties in the FER field. However, they were limited by CNNs' lack of a global receptive field, and there remained considerable room for improvement in FER performance.
Given the powerful ability of ViT to capture long-range feature relationships, Xue et al. introduced ViT to increase the global receptive field, using two pooling modules to solve the defect of ViT lacking inductive bias and the infiltration of noisy features [
10]. Liu et al. proposed a novel facial muscle movement-aware representation learning method that captures the semantic relationships of facial muscle movements in expression images [
11]. Mao et al. improved ViT in three directions: cross-fusion, dual-stream, and multiscale feature extraction, reducing the computational complexity of POSTER and extracting multiple features [
12]. Although ViT enlarges the global receptive field, its self-attention cost grows quadratically with the number of tokens, so its computational complexity and running time are considerably higher than those of CNNs, which hampers the deployment and application of FER systems.
Since CNN and ViT methods face challenges balancing computational complexity and recognition accuracy, many researchers are exploring new network architectures to seek breakthroughs. Mao et al. argued that existing FER methods neglected the dynamic size changes of tensors and proposed a hierarchical attention network with progressive feature fusion to avoid the computational complexity of ViT and achieve recognition accuracy similar to that of ViT [
13]. The Dual-Direction Attention Mixed Feature Network (DDAMFN) combines robustness and lightweight characteristics [
14]. Wang et al. use graph convolutional networks (GCNs) and k-nearest-neighbor graphs to identify facial expressions: a GCN processes low-uncertainty images to extract geometric cues for emotional label prediction, while high-uncertainty images leverage k-nearest-neighbor graphs to find similar low-uncertainty images and fuse their emotional label distributions; a convolutional neural network is then trained on these distributions to identify facial expressions [
15].
In summary, backbone design for the FER task has mostly weighed spatial feature extraction capability against computational complexity. From CNNs to ViTs to lightweight networks, scholars have pursued faster and more accurate models. However, because the impact of memory access efficiency on computational speed has been neglected, actual running speed has often failed to improve even as computational complexity was reduced. As noted in the Introduction, reducing FLOPs does not always translate into lower latency [3,4]; the redundant channel features illustrated in Figure 1 thus remain a promising target for decreasing computational load, reducing memory access frequency, and enhancing model processing speed.
5. Results
This section presents a thorough comparison of results obtained through the FCCA method against benchmarks or state-of-the-art methods. Detailed analysis of recognition accuracy metrics and runtime efficiency highlights the method’s strengths. The FCCA method notably competes with established techniques across multiple datasets, such as RAF-DB, AffectNet, and DFEW, showcasing its generalization capability and adaptability to diverse data conditions.
The RAF-DB dataset is imbalanced: the number of samples per category in the validation set is unevenly distributed. We therefore report the average recognition accuracy of the FCCA method on RAF-DB to evaluate the model's generalization across sample distributions. "Avg.Acc (%)" denotes the average recognition accuracy, computed as the unweighted mean of the per-class accuracies over the seven categories (surprise, disgust, fear, happy, sad, angry, and neutral). As depicted in
Table 1, FCCA achieves excellent recognition performance on the challenging, imbalanced RAF-DB dataset, with an overall recognition accuracy of 91.30% and an average recognition accuracy of 85.29%. In a direct comparison, FCCA outperforms the EAC method with a ResNet-50 backbone by 0.95% and the ViT-based TransFER method by 0.39%. Notably, the POSTER method, which leverages a pyramid structure, achieves the highest recognition accuracy to date, reaching 92.05%.
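For concreteness, the average accuracy defined above can be computed from a confusion matrix as the unweighted mean of the per-class accuracies. The short NumPy sketch below illustrates the calculation on a toy matrix; the values are illustrative, not our experimental results.

```python
import numpy as np

def macro_accuracy(confusion: np.ndarray) -> float:
    """Mean of per-class accuracies (diagonal / row sum), ignoring class imbalance."""
    per_class = np.diag(confusion) / confusion.sum(axis=1)
    return per_class.mean()

# Toy 7x7 confusion matrix for (surprise, disgust, fear, happy, sad, angry, neutral).
cm = np.array([
    [100,  2,  1,   3,   2,   1,   5],
    [  4, 60,  2,   1,   3,   5,   3],
    [  3,  2, 50,   1,   4,   2,   2],
    [  2,  1,  1, 300,   3,   1,   6],
    [  3,  2,  3,   4, 150,   2,  10],
    [  2,  4,  2,   2,   3, 120,   3],
    [  5,  2,  1,   8,   9,   3, 200],
])
print(f"Avg.Acc: {100 * macro_accuracy(cm):.2f}%")
```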
As depicted in
Table 2, FCCA demonstrates a compelling recognition accuracy of 65.51% on the AffectNet dataset, representing a notable improvement of 0.19% over the EAC method. The POSTER method remains state of the art, with a recognition accuracy of 67.31%.
As depicted in
Table 3, FCCA demonstrates strong performance in the DFER task on the DFEW dataset, achieving a WAR of 69.66% and a UAR of 56.61%. These results match the state-of-the-art recognition performance, solidifying FCCA’s efficacy in this domain.
However, the POSTER method hinges on a sophisticated pyramidal architecture and a resource-intensive ViT, which demand substantial processing power and prolong processing times. Considering the concurrent demands for fast recognition and high accuracy in the FER task, a dual SOTA standard that benchmarks systems on speed and accuracy simultaneously offers a more comprehensive and accurate evaluation framework for FER performance.
6. Discussion
Our method surpasses the benchmark in recognition accuracy but still falls short of the POSTER method. To investigate further, we conducted training-time experiments on the RAF-DB and AffectNet datasets, using a single training epoch as the unit of measurement. We recorded the training time of the FCA, FCCA, and POSTER methods on the training and test sets, allowing us to compare their runtimes.
As depicted in
Table 4 and
Table 5, taking the training set as an example, the FCA method demonstrated the shortest runtime, clocking in at 68 s and 137 s, respectively, while the POSTER method took the longest, requiring 188 s and 415 s. The FCCA method fell in between, with runtimes of 111 s and 230 s. Notably, POSTER required nearly twice the runtime of FCCA and roughly three times that of FCA. The overall runtime ordering is thus FCA, then FCCA, then POSTER.
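For reference, per-epoch training time can be measured by synchronizing the GPU before and after the pass and reading a wall clock. The sketch below illustrates this measurement protocol; the model, data loader, loss, and optimizer are placeholders rather than our actual training code.

```python
import time
import torch

def time_one_epoch(model, loader, criterion, optimizer, device="cuda"):
    """Return wall-clock seconds for a single training pass over `loader`."""
    model.train()
    if device == "cuda":
        torch.cuda.synchronize()          # make sure pending GPU work is done
    start = time.perf_counter()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize()          # wait for the last batch to finish
    return time.perf_counter() - start
```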
By combining recognition accuracy and runtime, we found that the FCCA method achieved recognition accuracy slightly below POSTER's while consuming only about half the runtime. This indicates that the FCCA method balances runtime and recognition accuracy well, making it an attractive option for FER tasks in real-world applications and deployment.
To better understand the FCCA method’s performance in various emotional classification tasks, we presented confusion matrices for the recognition accuracy of various emotions in the RAF-DB and AffectNet datasets. Additionally, we included recognition performance confusion matrices for the CNN (ResNet-50) and FCA methods to enable a more direct comparison. As shown in
Figure 5, on the RAF-DB dataset, the FCA method demonstrated superior performance in identifying the fear, happy, sad, and angry emotions, with improvements of 2.7%, 1.35%, 5.02%, and 1.86%, respectively. Notably, CNN (ResNet-50) continued to excel at recognizing surprise and disgust, while FCA performed better on fear and the other emotions. The FCCA method, in turn, showed striking improvements for most emotional categories, including surprise, disgust, happy, sad, angry, and neutral, with gains of 2.74%, 3.75%, 1.61%, 4.39%, 6.79%, and 2.20%, respectively. On the AffectNet dataset, the FCA method exhibited remarkable stability, demonstrating its stronger adaptability to in-the-wild data, while the FCCA method showed notable improvements in most emotional categories on both datasets, indicating its ability to further improve model performance. However, the FCCA method exhibited a decline in performance on the sad category in both datasets, which may be attributed to despairing facial expressions that are difficult to recognize and are often misclassified as neutral. Enhancing the model's fine-grained classification ability is therefore necessary to address this issue.
An examination of the RAF-DB and AffectNet datasets reveals that CNN (ResNet-50) can distinguish various features, but the inter-class distances are insufficient and the intra-class distributions are not compact enough, so the feature vectors exhibit relatively small differences between classes and relatively low similarity within classes. In contrast, as shown in
Figure 6, the FCA method produces larger distances between classes, considerably enhancing the inter-class differences and intra-class similarity of the feature vectors and indicating an improvement in FCA's ability to distinguish between classes.
In the FCCA method, the inter-class distances are larger still and the intra-class feature vectors are very tight, clearly indicating that FCCA not only enhances the inter-class differences between feature vectors but also improves intra-class similarity. This shows that the FCCA method effectively addresses the issues mentioned above and constitutes a fundamental advantage over the baseline FCA method.
The Grad-CAM [
40] method provides a straightforward visual representation of the attention regions of the model during correct classification, whereas SHAP offers a more comprehensive view by showcasing the attention regions of all classification results, including incorrect ones, and ranking them according to their predicted probabilities from highest to lowest. As shown in
Figure 7, the first column shows seven emotion samples, while the second column highlights the relevant feature regions under correct classification. The attention regions for correct classification are notably broader and more focused than those for the incorrect classifications. For instance, for a sample correctly classified as “disgust”, the attention region encompasses the eye area, the nasal wings, and the mouth area, with a particularly high concentration of attention around the mouth. In stark contrast, the attention region for the same sample under the incorrect “angry” hypothesis is limited to the eye area and appears scattered and unfocused. This pattern is also observed in the other emotion samples, reflecting FCCA's dual-focus design, which simultaneously attends to global and local key features. By employing this mechanism, FCCA performs the classification task meticulously, which is a crucial factor in its performance.
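Heatmaps such as those in Figure 7 can be generated with a generic Grad-CAM procedure: hook a late convolutional layer, weight its activations by the gradients of the target class score, and average over channels. The sketch below is a standard, self-contained implementation of this procedure, not the exact visualization code used in our experiments.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Return an [H, W] Grad-CAM heatmap for `class_idx` (predicted class if None)."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, __, grad_output):
        gradients["value"] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        logits = model(image)                         # image: [1, 3, H, W]
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()
        weights = gradients["value"].mean(dim=(2, 3), keepdim=True)  # [1, C, 1, 1]
        cam = F.relu((weights * activations["value"]).sum(dim=1))    # [1, h, w]
        cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                            mode="bilinear", align_corners=False)[0, 0]
        cam -= cam.min()
        return (cam / (cam.max() + 1e-8)).detach().cpu()
    finally:
        h1.remove()
        h2.remove()
```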
To dissect the specific contributions of each component within the FCCA method and clarify how individual design choices impact overall performance, we conducted a series of ablation experiments. As depicted in
Table 6, compared to ResNet-50 on the RAF-DB dataset, the FCA method improved the recognition accuracy from 88.69% to 89.70%, a notable 1.01% gain. By integrating FACL with the FCA approach, the FCCA method achieved a further 1.60% improvement. On the AffectNet dataset, the FCA method demonstrated a substantial advantage over the traditional CNN (ResNet-50) baseline, raising the recognition accuracy from 62.57% to 64.8%, a 2.23% increase, and the addition of FACL yielded a further 0.8% gain. Our analysis suggests that although FCA and FACL vary in effectiveness across datasets, both make meaningful contributions to model performance.
To gain a deeper understanding of the feature extraction capabilities of FCA and CNN (ResNet-50), we employed Grad-CAM to visualize the feature outputs of the two networks' final layers. As shown in
Figure 8, in the realm of facial texture and detailed feature extraction, the FCA method stands out for its clarity. Unlike CNN (ResNet-50), which relies mainly on facial contour features, FCA not only delineates the facial features but also captures the intricate skin texture surrounding them and the subtle movements of the muscles. Through meticulous and comprehensive facial feature extraction, FCA enables the model to grasp detailed features and sidesteps the decline in classification performance that results from missing crucial features. This observation also implies that certain key cues in facial expressions may be concealed within facial texture and muscle movements, reinforcing their significance in FER. Upon closer examination, it becomes apparent that CNN (ResNet-50) is prone to extracting superfluous information from sample edges. In contrast, FCA focuses on extracting vital features from the central facial region, effectively mitigating the adverse effects of irrelevant features on classification. FCA's feature extraction ability enables it to capture more of the information crucial for the FER task, while its center attention mechanism suppresses the influence of unnecessary information. Both the FCCA and FCA methods use FasterNet for feature extraction; the difference is that FCA is a simple combination of a feature extractor and a classification head trained with cross-entropy loss alone, whereas FCCA employs a dual-branch structure over original and horizontally flipped samples, supervised jointly by cross-entropy loss and flipped attention consistency loss. Because their feature extraction performance does not differ, we did not include a separate feature extraction diagram for the FCCA method.
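To give a rough picture of how a center prior could be imposed on spatial features, the sketch below multiplies a feature map by a Gaussian mask peaked at the center, down-weighting peripheral positions. The Gaussian form, the sigma value, and the function name are illustrative assumptions, not the exact center attention mechanism used in FCA.

```python
import torch

def center_weight(features: torch.Tensor, sigma: float = 0.5) -> torch.Tensor:
    """Down-weight peripheral positions of a [B, C, H, W] feature map with a Gaussian prior."""
    _, _, h, w = features.shape
    ys = torch.linspace(-1.0, 1.0, h, device=features.device)
    xs = torch.linspace(-1.0, 1.0, w, device=features.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    mask = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))  # peak at the center
    return features * mask                                      # broadcast over B and C

# Example: emphasize the center of a 14x14 feature map.
weighted = center_weight(torch.randn(2, 128, 14, 14))
```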
To facilitate a deeper understanding of the working mechanism of the FCCA method in FER, we employed Grad-CAM to visualize the attention regions of the various approaches. As shown in
Figure 9, the CNN (ResNet-50) method primarily focuses on key regions in most instances but occasionally allocates attention to areas such as the cheeks, which are not crucial for the FER task. In contrast, FCA directs all of its attention towards facial features or other essential facial regions, suggesting that FCA extracts more key facial information and channels it into the model, thereby enhancing focus on pivotal regions. Furthermore, the attention regions of FCCA reveal the influence of facial symmetry on the FER task. When the left and right halves of the face are relatively symmetrical, the disparity between the attention regions of FCA and FCCA is minimal; when the facial halves are asymmetrical, the difference becomes more pronounced, implying that FACL contributes more prominently to the analysis of asymmetrical samples by paying more attention to global key information, which benefits accurate FER classification. We hypothesize that FCCA's performance on asymmetrical faces stems from the incorporation of FACL, which enables the model to learn the characteristic of facial asymmetry and to exploit global information for the accurate classification of faces with asymmetrical features.
Beyond the main experiments, we further examined the robustness of the FCCA method under varying conditions. Noisy labels pose a pervasive challenge for FER tasks: their widespread occurrence injects noisy features that can disrupt model performance, and the model often learns and retains these features, compromising its ability to classify emotions accurately. To mitigate this issue, we introduced the flipped attention consistency loss (FACL), a novel approach that leverages flipped attention adaptation to neutralize the detrimental impact of noisy features on the model. As shown in
Table 7, we evaluated the robustness of the FCCA method to noisy labels on the RAF-DB dataset. Adding noisy labels caused a considerable decline in recognition performance for all methods, but different methods exhibited varying degrees of robustness, with FCCA emerging as the most resilient and surpassing the CNN (ResNet-50) method. FCCA achieved recognition accuracies of 88.62%, 87.48%, and 85.92% under 10%, 20%, and 30% label noise, respectively, matching or outperforming the EAC method in every scenario. Moreover, FCCA's advantage grows as the proportion of noisy labels increases: with 10% noise, FCCA and EAC were on par at 88.62%; at 20% noise, FCCA outperformed EAC (87.48% vs. 87.35%); and at 30% noise, FCCA was clearly superior (85.92% vs. 85.27%). These findings demonstrate FCCA's robustness to noisy labels and underscore its potential for tackling FER tasks in their presence.
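The noisy-label settings in Table 7 follow the common protocol of randomly reassigning a fixed fraction of training labels to other classes. The sketch below illustrates such symmetric label-noise injection; the function and variable names are ours, for illustration only.

```python
import numpy as np

def inject_label_noise(labels: np.ndarray, noise_ratio: float,
                       num_classes: int = 7, seed: int = 0) -> np.ndarray:
    """Flip `noise_ratio` of the labels to a uniformly chosen different class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n_noisy = int(len(labels) * noise_ratio)
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    for i in idx:
        choices = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(choices)
    return noisy

# Example: corrupt 20% of the labels of a 7-class training set.
clean = np.random.randint(0, 7, size=12271)
noisy = inject_label_noise(clean, 0.2)
print((clean != noisy).mean())  # ~0.2
```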
Notably, FLOPs alone cannot precisely indicate a model's running speed. We compiled the FLOPs of several classic models, including the FCCA model, and analyzed their accuracy and parameter counts, as depicted in
Table 8. Interestingly, despite having higher FLOPs and more parameters, ResNet-50 runs faster than the POSTER method. Among the approaches considered, the FCA method combines fewer parameters, lower FLOPs, and higher recognition accuracy, demonstrating strong overall performance. When computational resources are limited, the FCA or FCCA methods may be the better choice; when resources are plentiful, the POSTER method can unlock its full potential and deliver superior accuracy.
To explore the effects of FACL on the model, we conducted extensive experiments on the RAF-DB and AffectNet datasets, systematically adjusting the weighting coefficient of the FACL term and analyzing the resulting variations in recognition performance. As shown in
Figure 10, FACL consistently yielded the highest recognition accuracy when the coefficient was set to 1, regardless of the dataset. Recognition accuracy improved as the coefficient approached 1 and declined as it moved away from 1. Furthermore, when the coefficient was excessively large, FACL's recognition accuracy fell below that of the FCA method. Our findings indicate that FACL enhances model performance only when its weighting coefficient is set within a reasonable range; otherwise, it may even degrade the model's behavior.
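To make the role of the weighting coefficient concrete, the sketch below combines cross-entropy on both branches with a flip-consistency term, assumed here to be the mean-squared difference between the attention map of the original image and the horizontally re-flipped attention map of its mirrored copy. This functional form and the names are illustrative assumptions, not the exact definition of FACL.

```python
import torch
import torch.nn.functional as F

def fcca_loss(logits, logits_flip, attn, attn_flip, labels, weight=1.0):
    """Cross-entropy on both branches plus a weighted flip-consistency term.

    attn, attn_flip: [B, H, W] attention maps from the original and the
    horizontally flipped input; flipping attn_flip back aligns the two maps.
    """
    ce = F.cross_entropy(logits, labels) + F.cross_entropy(logits_flip, labels)
    consistency = F.mse_loss(attn, torch.flip(attn_flip, dims=[-1]))
    return ce + weight * consistency

# Illustrative usage with random tensors (batch of 8, 7 emotion classes).
B, H, W = 8, 7, 7
labels = torch.randint(0, 7, (B,))
loss = fcca_loss(torch.randn(B, 7), torch.randn(B, 7),
                 torch.rand(B, H, W), torch.rand(B, H, W), labels, weight=1.0)
```

In this sketch, setting `weight` to 1 corresponds to the best-performing configuration reported in Figure 10, while very large values would let the consistency term dominate the cross-entropy objective.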
To thoroughly assess the performance of the FCCA method, we generated comprehensive training iteration curves for FCCA on the RAF-DB and AffectNet datasets, incorporating metrics for recognition accuracy and loss values. As shown in
Figure 11, on the RAF-DB dataset, FCCA achieved a peak recognition accuracy of 91.30% at the 51st training epoch. Compared to FCA without pretrained weights, which trained slowly and yielded lower recognition accuracy, FCCA with pretrained weights exhibited a marked acceleration in training convergence, albeit accompanied by unstable oscillations. With FACL integrated, the network converged rapidly and consistently with minimal oscillations, achieving the best recognition accuracy. Similarly, on the AffectNet dataset, FCCA achieved its best recognition accuracy of 65.51% at the 10th training epoch, again demonstrating rapid convergence and a stable training process.
FER is a subfield of emotion recognition and sentiment analysis. Emotion recognition aims to identify human emotions and attitudes through multimodal cues such as text, speech, and visual signals, whereas sentiment analysis focuses more on detecting and classifying specific emotional states such as happiness and sadness. Emotion recognition draws on NLP, computer vision, and speech processing, and is applied in human–computer interaction, social media analysis, and healthcare. Because human emotions are expressed through text, speech, and visual cues, it inherently involves multimodal learning, and facial expression recognition is central because the face is the most direct and universal indicator of emotion. Key technologies include feature extraction (e.g., BERT for text, Wav2Vec for speech, CNNs for visual input), modality fusion (early and late fusion with attention mechanisms), cross-modal learning (aligning facial expressions with speech or text), and self-supervised learning (pretraining on large-scale unlabeled data). FER and sentiment analysis are thus closely linked: both aim to identify emotions, but the former focuses on visual cues while the latter usually combines multiple modalities such as speech and text, which are complementary indicators of emotion. Both fields face technical challenges such as occlusion and ambiguity, addressed with techniques like random masking and self-supervised pretraining, and combining multiple modalities can improve the accuracy and robustness of recognition systems, with attention mechanisms and self-supervised learning used to enhance multimodal representations [41,42]. The importance of facial expression recognition in multimodal emotion recognition lies in improving system robustness to noise and occlusion, enhancing generalization across diverse datasets, and enabling applications in healthcare, education, entertainment, and other fields. Supported by multimodal and self-supervised learning, FER can be integrated into more accurate emotion recognition systems, improving performance and advancing emotion understanding in complex scenarios.
Dynamic FER (DFER) aligns better with natural scenarios for real-time video analysis than static FER (SFER). However, DFER faces challenges such as varying expression intensities across video sequences; an expression intensity-aware loss function can address this by evaluating and weighting intensity variations [
39]. Additionally, optimizing the FasterNet architecture for speed and reliability—through lightweight design, attention mechanisms, and distillation techniques—can enhance real-time video analysis capabilities.