Article

Agent-Poster: A Multi-Scale Feature Fusion Emotion Recognition Model Based on an Agent Attention Mechanism

Lin Fu, Yaping Wan and Gang Zou
1 School of Computer Science, University of South China, Hengyang 421001, China
2 HuNan ZK Help Innovation Intelligent Technology Research Institute, Changsha 410000, China
* Author to whom correspondence should be addressed.
Information 2025, 16(11), 982; https://doi.org/10.3390/info16110982
Submission received: 20 October 2025 / Revised: 5 November 2025 / Accepted: 8 November 2025 / Published: 13 November 2025

Abstract

Facial expression recognition (FER) serves as a pivotal approach for understanding human affective states and behavioral intentions, forming the fundamental basis for achieving natural interaction in affective computing systems. Convolutional neural networks are limited in capturing global facial expression features, while Vision Transformers carry substantial parameter counts and computational complexity that make lightweight deployment difficult in practical applications. To address both limitations, this paper proposes Agent-Poster, a lightweight multi-scale facial expression recognition model based on Agent Attention. Building upon the POSTER++ framework, the model integrates Agent Attention, adopts a streamlined dual-stream architecture to minimize redundant interactions, and implements efficient multi-scale feature fusion. Experimental results demonstrate that the proposed method achieves superior recognition performance compared to existing approaches, attaining accuracy rates of 92.61% on the RAF-DB dataset and 68.21% on the AffectNet dataset, thereby validating its robustness and accuracy in facial expression recognition tasks.

1. Introduction

Facial expressions serve as a critical component for inferring an individual’s emotional state, intentions, and personality. They represent one of the most natural, universal, and intuitive signals for conveying affective states and intentions in human communication [1] and form a fundamental basis for embodied artificial intelligence [2].
Early research in facial expression recognition (FER) relied on handcrafted features for expression analysis. However, such methods often exhibited limited generalization capability in complex real-world scenarios, demonstrating sensitivity to variations in illumination, pose, and occlusion, while struggling to capture subtle expression variations [3]. In recent years, deep learning techniques have significantly enhanced the robustness of FER. Methods based on convolutional neural networks (CNNs) have achieved notable improvements in this domain. For instance, Savchenko et al. [4] were among the first to validate the effectiveness of CNNs such as MobileNet [5], EfficientNet [6], and RexNet [7] for FER tasks. Nonetheless, due to the limited local receptive fields of convolutional operations, these models often fall short in global contextual modeling [8].
Transformers [9], with their self-attention mechanisms, offer flexible modeling of local facial characteristics and have shown considerable potential in FER. The Vision Transformer (ViT) [10], introduced by Google Research in 2020, adapts the Transformer architecture for image classification by dividing images into patches and processing them as sequential inputs. Its application to FER has since yielded notable improvements [11,12,13,14,15,16,17]. However, both standard Transformers and ViT typically require stacking multiple layers to capture global information effectively and depend on large-scale datasets for training, leading to high computational demands.
Embodied intelligence emphasizes dynamic interaction between the body and the environment. To align with this paradigm, FER models must be lightweight, enabling adaptation to dynamic interactive settings while reducing parameter counts and computational requirements—thus bridging the gap between global modeling capacity and practical deployment [18]. Therefore, model lightweighting has emerged as a key direction for breakthroughs in facial expression recognition.
To address the aforementioned challenges, this paper proposes Agent-Poster, a multi-scale feature fusion expression recognition model based on Agent Attention. Built upon the POSTER++ [8] backbone, the model integrates multi-scale features from both images and facial landmarks, with the following key enhancements:
Introducing Agent Attention: An Agent Attention mechanism [19] is introduced to enhance cross-modal interaction between facial images and landmarks in facial expression recognition. It adds a small set of agent tokens that act as intermediaries between queries and key–value pairs, effectively reducing computational complexity while retaining global modeling capability.
Simplifying the Dual-Stream Architecture: We streamline the original bidirectional cross-attention by removing the redundant image-to-landmark branch, retaining only the unidirectional attention from landmarks to images.
Optimizing Multi-Scale Feature Fusion: Multi-scale features are extracted directly from the backbone network and fused via a lightweight module comprising a two-layer Swin Transformer V2, replacing the original ViT. This approach achieves efficient cross-scale integration and improves robustness to scale variations.

2. Related Work

Facial expression recognition methods: Early FER approaches predominantly relied on handcrafted features such as LBP [20] or shallow models like MTSL [21], which suffered from limited generalization across diverse real-world scenarios. The introduction of CNNs significantly enhanced robustness. For instance, Mollahosseini et al. [11] proposed a single-component network integrating dual convolution, dual pooling, and four guided layers. Shao et al. [12] developed a robust in-the-wild system incorporating Light-CNN, a dual-path CNN, and pre-trained CNNs. Gürsesli et al. [13] further demonstrated the efficacy of a lightweight solution based on MobileNetV2. These studies collectively affirm the strong capability of CNNs in modeling local facial relations. However, although convolutional operations progressively abstract local features through hierarchical layers, they struggle to capture global contextual information. To mitigate this, Zhong et al. [14] introduced a two-stage multi-task sparse learning (MTSL) framework leveraging both shared and expression-specific information. Savchenko [15] further corroborated the feasibility of lightweight CNNs, while Sang et al. [16] employed DenseNet to suppress intra-class variations and enhance discriminative power. Although CNNs have largely supplanted traditional methods due to their high accuracy and robustness, they remain constrained by the limited receptive field of convolution and a tendency to overfit on small-scale datasets.
With the rise of Transformers in computer vision, numerous FER methods incorporating Transformer architectures have emerged. Liu et al. [22] proposed the Swin Transformer, which achieves linear complexity through shifted window-based multi-head self-attention, effectively balancing global modeling with local receptive fields in FER. Liu et al. [23] designed PACVT, integrating patch-level convolutional attention with Transformer-based global feature extraction to focus on discriminative facial patches. Chen et al. [24] introduced SSF-ViT, a model based on self-supervised pre-training and fine-tuning with limited labels, which achieves over 95% of fully supervised performance using only 10% of labeled data—offering a promising direction for low-resource FER. POSTER++ [8] introduced a dual-stream architecture with cross-modality fusion and a pyramidal design, integrating image and facial landmark features to simultaneously address intra-class variation, inter-class similarity, and scale sensitivity in FER. However, the high computational cost of Transformers and their variants has hindered their broader adoption in FER research [8]. In response, recent studies have begun exploring more efficient feature extraction and fusion strategies.
To address these issues, we optimize POSTER++ and propose Agent-Poster—a model that is more efficient and lightweight and thus better aligned with the practical requirements of FER. The model first extracts multi-scale features via an IR50 [25] backbone and a MobileFaceNet [26] landmark detector. These features are then concatenated at the token level using an attention mechanism and finally fed into a compact two-layer visual Swin Transformer V2 [27] for multi-scale feature integration and classification.
Lightweight methods: The development of lightweight FER models primarily follows two directions. The first involves adopting inherently efficient CNN architectures as the backbone network. MobileNet [5], for example, uses depthwise separable convolutions to decompose the standard convolution, significantly reducing parameters and computational cost, while EfficientNet [6] uses a compound scaling method that balances network width, depth, and input resolution to achieve optimal efficiency. Although these backbones provide powerful feature extraction capabilities, their operations are essentially local. The second direction focuses on designing lightweight attention modules to reduce the computational cost of Transformers. Since the Vision Transformer [10] was introduced, the self-attention mechanism has received widespread attention and made significant progress in computer vision; however, the quadratic complexity of the popular Softmax attention poses a significant challenge for visual tasks. Wang et al. [28] proposed a hybrid attention framework that integrates channel and spatial attention, capturing local features and global context more effectively; this method has shown notable gains in FER, especially in recognizing subtle facial expressions. Chen et al. [29] introduced a semantic-aware multi-scale attention mechanism that fuses semantic facial features with multi-scale features to enhance the model’s ability to capture key facial structures and fine details. Li et al. [30] developed a lightweight attention scheme that combines sparse labels and batch channel normalization, significantly reducing model parameters while maintaining efficient feature extraction. POSTER++ [8] uses W-SCAM to reduce computational costs.
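As a concrete illustration of the parameter savings from depthwise separable convolution (the MobileNet building block mentioned above), the following sketch compares the two factorizations; the channel sizes are arbitrary examples, not values from any model in this paper.

```python
import torch
import torch.nn as nn

in_ch, out_ch = 128, 256

# Standard 3x3 convolution: every output channel mixes all input channels spatially.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

# Depthwise separable factorization (as in MobileNet):
#   1) a depthwise 3x3 conv filters each input channel independently,
#   2) a pointwise 1x1 conv mixes channels.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
    nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))   # 294912 vs. 33920 parameters

x = torch.randn(1, in_ch, 56, 56)
assert standard(x).shape == separable(x).shape   # both produce (1, 256, 56, 56)
```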
In the present work, we replace W-SCAM with Agent Attention, introducing a set of learnable agent tokens, A, that act as intermediaries for the query tokens, Q. These agent tokens aggregate information from keys and values and broadcast it back to the queries. Because the number of agent tokens is substantially smaller than the number of query tokens, the computational cost is significantly reduced while effective dynamically weighted feature fusion is retained, as our experiments confirm.
Multi-scale Feature Fusion: In FER tasks, facial expression characteristics manifest differently across scales. Conventional single-scale feature extraction methods often struggle to handle these multi-scale representations simultaneously, resulting in models that are sensitive to scale variations and impaired recognition performance. Multi-scale feature fusion has proven effective in enhancing both the accuracy and robustness of FER models, demonstrating superior performance in complex scenarios.
Gao and Patras introduced a novel self-supervised facial representation learning framework [3], which significantly improves FER performance by learning consistent global and local facial representations. Liu et al. proposed a pose-disentangled contrastive learning approach [4] that enhances FER robustness by disentangling pose-relevant and pose-invariant features. Zheng et al. [31] developed the POSTER network, which integrates multi-scale features using a pyramidal structure combined with a Transformer. Hazourli et al. [32] partitioned facial images into multiple patches and learned features from each patch independently, while Li et al. [33] extracted 24 regions of interest and incorporated facial landmark information for localization. Although these studies provide valuable insights into lightweight and efficient FER, they still face challenges in achieving high recognition accuracy.
These existing methods nevertheless have notable limitations: dual-stream models usually rely on computationally intensive bidirectional interaction; the window-based attention adopted in frameworks such as Swin Transformer V2 and POSTER++ inherently limits global context modeling across windows; and pyramid-style fusion structures, although effective, typically require a large number of parameters and a complex design.
To directly address these gaps, we propose Agent-Poster, which introduces three key changes. First, by simplifying the dual-stream architecture and retaining only the unidirectional attention path from facial landmarks to the image, it significantly reduces redundant computation. Second, it replaces window-based attention with Agent Attention, a mechanism that uses a small number of agent tokens to achieve efficient global feature interaction with linear complexity. Finally, it discards pyramid fusion in favor of directly combining multi-scale features with a lightweight two-layer Swin Transformer V2, achieving robust multi-scale integration with minimal parameters.

3. Methods

3.1. Overall Architecture

The overall framework of Agent-Poster is a dual-stream collaborative network architecture that integrates global semantics and local geometric features. It consists of an IR50 [25] backbone network, a MobileFaceNet [26] landmark detector, and a cross-modal feature fusion mechanism. The aim is to achieve high-precision and high-efficiency facial expression recognition, as shown in Figure 1.
Based on the classical ResNet-50 architecture [25], IR50 is a backbone network optimized for face recognition. A key enhancement lies in its incorporation of a multi-scale parallel convolution module—typically comprising 1 × 1 convolution, 3 × 3 convolution, and a max-pooling branch. This module strengthens the model’s ability to capture fine-grained facial features under dynamic variations through a cross-receptive field feature fusion mechanism.
We adopt the MobileFaceNet landmark regressor trained based on the MobileFaceNet backbone [26]. This network uses MobileFaceNet as the feature extractor, replaces the original classification head with a coordinate regression head, and outputs the coordinates of 68 facial key points.
One of the key improvements of Agent-Poster lies in its simplification of the dual-stream interaction structure. The baseline model, POSTER++, employs a bidirectional cross-attention mechanism between the image and facial landmark streams. Although this design facilitates the fusion of information between modalities, it incurs high computational cost and redundant interactions. Our analysis found that the attention path from the facial landmarks to the image is crucial for guiding the model to focus on the discriminative regions of expressions, whereas the contribution of the reverse path (from image to landmarks) is relatively limited. We therefore propose a simplified unidirectional attention architecture that removes the reverse attention branch, as shown in Figure 2. This simplification brings two advantages: it directly reduces the number of attention computation units in the model, and it improves generalization by reducing the risk of overfitting, forcing the model to rely on the most effective feature fusion path and thereby learn more robust representations.
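A minimal sketch of the retained one-way cross-attention path is given below. We assume here that landmark tokens form the queries and image tokens the keys and values, so that geometric cues guide attention over appearance features; the exact query/key roles, token counts, and dimensions in POSTER++ and Agent-Poster may differ, so this is illustrative only.

```python
import torch
import torch.nn as nn

class LandmarkToImageFusion(nn.Module):
    """One-way cross-attention sketch: landmark tokens guide the image tokens.

    Illustrative only -- landmark tokens provide the queries and image tokens
    provide keys/values here; the role assignment and dimensions in the actual
    POSTER++/Agent-Poster code may differ.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, lm_tokens):
        # img_tokens: (B, N_img, dim), lm_tokens: (B, N_lm, dim)
        fused, _ = self.attn(query=lm_tokens, key=img_tokens, value=img_tokens)
        return self.norm(fused + lm_tokens)   # residual connection on the guided stream

# toy usage: 196 image tokens and 196 landmark tokens of width 256
img = torch.randn(2, 196, 256)
lm = torch.randn(2, 196, 256)
out = LandmarkToImageFusion()(img, lm)
print(out.shape)   # torch.Size([2, 196, 256])
```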
To achieve efficient multi-scale feature fusion, Agent-Poster explicitly extracts features from three stages of the IR50 backbone, corresponding to the outputs of layer2, layer3, and layer4, denoted as C3, C4, and C5, with spatial dimensions of 56 × 56, 28 × 28, and 14 × 14 and channel sizes of 512, 1024, and 2048, respectively. In parallel, the MobileFaceNet landmark detector generates three levels of geometric features, L3, L4, L5, with spatial sizes matching their image counterparts and channel dimensions of 128, 256, and 512. All feature maps are flattened into tokens and projected into a unified dimension of 256 via dedicated linear layers. The fused token sequence is constructed by interleaving image and landmark tokens at the same scale, following the order [C3, L3, C4, L4, C5, L5], resulting in a final sequence of 8232 tokens. This sequence is then processed by a two-layer Swin Transformer V2 module configured with a window size of 7, 8 attention heads, and Log-CPB for positional encoding, enabling efficient cross-scale and cross-modal integration. The simplified dual-stream attention mechanism and the efficient multi-scale fusion mechanism work together, enabling Agent-Poster to achieve high-precision and high-efficiency facial expression recognition.
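The token construction described in this subsection can be sketched as follows; the channel sizes, the projection to 256 dimensions, and the interleaving order [C3, L3, C4, L4, C5, L5] follow the text above, while the flattening and linear-projection details are our own simplifying assumptions.

```python
import torch
import torch.nn as nn

# Stage outputs as described: image features C3-C5 and landmark features L3-L5.
img_feats = [torch.randn(1, 512, 56, 56),    # C3
             torch.randn(1, 1024, 28, 28),   # C4
             torch.randn(1, 2048, 14, 14)]   # C5
lm_feats  = [torch.randn(1, 128, 56, 56),    # L3
             torch.randn(1, 256, 28, 28),    # L4
             torch.randn(1, 512, 14, 14)]    # L5

dim = 256
img_proj = nn.ModuleList(nn.Linear(c, dim) for c in (512, 1024, 2048))
lm_proj  = nn.ModuleList(nn.Linear(c, dim) for c in (128, 256, 512))

def to_tokens(x, proj):
    # (B, C, H, W) -> (B, H*W, C) -> (B, H*W, dim)
    return proj(x.flatten(2).transpose(1, 2))

# Interleave the two modalities scale by scale: [C3, L3, C4, L4, C5, L5].
pairs = zip(
    (to_tokens(f, p) for f, p in zip(img_feats, img_proj)),
    (to_tokens(f, p) for f, p in zip(lm_feats, lm_proj)),
)
tokens = torch.cat([t for pair in pairs for t in pair], dim=1)

print(tokens.shape)   # torch.Size([1, 8232, 256]); 2 * (56*56 + 28*28 + 14*14) = 8232
```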

3.2. Agent Attention

Agent Attention [19] is a novel attention mechanism designed to balance computational efficiency and model expressiveness. It introduces an additional set of agent tokens, A, which aggregate information on behalf of the original query tokens, enabling more efficient processing. For an input sequence of N tokens represented as x ∈ ℝ^{N×C}, N denotes the number of tokens and C the dimension of each token. Whereas the traditional self-attention mechanism (Softmax attention) requires computing the similarity between all query–key pairs, Agent Attention reduces this computational burden by introducing a smaller number of agent tokens, A, as shown in Figure 3.
In the Agent Attention mechanism, four fundamental components are formally defined: the query tokens Q, the agent tokens A, the keys K, and the values V. The queries, keys, and values are derived from the input x through linear transformations:

Q = x W_Q,  K = x W_K,  V = x W_V    (1)

where W_Q, W_K, W_V ∈ ℝ^{C×d} are learnable parameter matrices that project the input from its original dimension into the attention head dimension d, with Q, K, V ∈ ℝ^{N×d}.
Subsequently, a pooling operation is applied to extract a set of agent tokens from the original queries:
A = Pool(Q)    (2)
We employ an adaptive average pooling method [19] to obtain the agent tokens A ∈ ℝ^{M×d}, where M ≪ N. This operation performs spatial dimensionality reduction on the query tokens, significantly reducing their number while preserving essential information. As a result, the quantity of agent tokens is kept substantially smaller than that of the original query tokens. This design not only maintains the ability to model global context but also effectively lowers computational complexity.
The agent tokens A then act as proxies for the query tokens Q to aggregate information from the keys K and values V. Specifically, the attention weights between agents and keys are computed as

AgentAttn = Softmax(A K^T / √d + B_AN)    (3)

where B_AN ∈ ℝ^{M×N} is a positional bias term that models the spatial relationships between agents and keys. These attention weights are then used to aggregate the value matrix:
V_A = AgentAttn · V    (4)

where V_A ∈ ℝ^{M×d}. Next, the attention weights from queries to agents are calculated:
QueryAttn = Softmax(Q A^T / √d + B_NA)    (5)

where B_NA ∈ ℝ^{N×M} is another positional bias term, capturing the spatial relationships between queries and agents. The final output is obtained as

Output = QueryAttn · V_A    (6)
The computational advantage of Agent Attention stems from replacing the costly Q·K^T operation, whose complexity is O(N²d), with two efficient steps. The operations A·K^T and Q·A^T in Formulas (3) and (5) both have a complexity of O(NMd). By setting M = N/r, where r is the reduction ratio (r ≥ 16), the overall complexity becomes O(N²d/r). This achieves a complexity reduction from quadratic to linear with respect to N, enabling efficient global context modeling, which is crucial for high-resolution visual tasks.
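Formulas (1)–(6) can be condensed into a short PyTorch sketch. The single-head layout, the 1-D adaptive average pooling used to form the agent tokens, and the plain learnable bias matrices below are simplifications of [19], not its reference implementation. With M = 49 agents and N = 784 tokens, both matrix products cost O(NMd), which is the source of the complexity reduction discussed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentAttention(nn.Module):
    """Single-head Agent Attention sketch following Eqs. (1)-(6).

    Simplifications: one head, agent tokens obtained by adaptive average
    pooling over the token axis, and positional biases B_AN / B_NA stored as
    plain learnable matrices for fixed N and M.
    """
    def __init__(self, dim, num_tokens, num_agents=49):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.num_agents = num_agents
        self.bias_an = nn.Parameter(torch.zeros(num_agents, num_tokens))  # B_AN
        self.bias_na = nn.Parameter(torch.zeros(num_tokens, num_agents))  # B_NA

    def forward(self, x):                       # x: (B, N, C)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # A = Pool(Q): reduce N query tokens to M agent tokens, Eq. (2).
        a = F.adaptive_avg_pool1d(q.transpose(1, 2), self.num_agents).transpose(1, 2)
        # Agent aggregation: agents gather information from keys/values, Eqs. (3)-(4).
        agent_attn = (a @ k.transpose(1, 2) * self.scale + self.bias_an).softmax(dim=-1)
        v_a = agent_attn @ v                    # (B, M, C)
        # Agent broadcast: queries read from the agent summaries, Eqs. (5)-(6).
        query_attn = (q @ a.transpose(1, 2) * self.scale + self.bias_na).softmax(dim=-1)
        return query_attn @ v_a                 # (B, N, C)

x = torch.randn(2, 784, 256)                    # N = 784 tokens, C = 256
out = AgentAttention(dim=256, num_tokens=784)(x)
print(out.shape)                                # torch.Size([2, 784, 256])
```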

3.3. Swin Transformer V2

Swin Transformer V2 [27] introduces three core improvements to address training challenges in large-scale visual models. First, it implements residual post-normalization by moving Layer Normalization to the end of residual blocks, which stabilizes training in deep networks by preventing abnormal activation accumulation. Second, it replaces dot-product attention with a scaled cosine attention mechanism that computes attention weights using normalized cosine similarity, eliminating the influence of feature magnitude on attention distribution and avoiding representation distortion from extreme attention values. Third, it proposes a log-spaced continuous position bias (Log-CPB) method that uses a compact network to model relative position relationships across windows, enabling zero-shot transfer of positional biases between different resolutions. These innovations enhance both structural robustness and geometric generalization when processing facial expression data with significant scale variations.
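To illustrate the second change, the snippet below contrasts standard dot-product attention logits with scaled cosine logits; the learnable temperature and its clamping threshold are simplified assumptions rather than the exact Swin Transformer V2 implementation.

```python
import math
import torch
import torch.nn.functional as F

def dot_product_logits(q, k):
    # Standard attention logits: sensitive to the magnitude of q and k.
    return (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)

def scaled_cosine_logits(q, k, log_tau):
    # Swin V2-style logits: cosine similarity times a clamped learnable scale,
    # so feature magnitude no longer skews the attention distribution.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    scale = torch.clamp(log_tau, max=math.log(100.0)).exp()
    return (q @ k.transpose(-2, -1)) * scale

q = torch.randn(1, 8, 49, 32) * 10.0   # deliberately large-magnitude features
k = torch.randn(1, 8, 49, 32)
print(dot_product_logits(q, k).abs().max())                     # can reach extreme values
print(scaled_cosine_logits(q, k, torch.zeros(1)).abs().max())   # bounded by the scale
```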
In Agent-Poster, we replace the original ViT [10] structure from POSTER++ with Swin Transformer V2 as the core module for multi-scale feature fusion. Swin Transformer V2 employs a hierarchical window-based attention mechanism that efficiently processes multi-scale feature tokens while maintaining linear computational complexity. Specifically, we use a minimal two-layer Swin Transformer V2 configuration instead of the more complex ViT architecture in POSTER++. This design draws on the multi-scale concept of the U-shaped Swin Transformer [24] but achieves cross-scale fusion through hierarchical feature concatenation rather than upsampling, significantly reducing computational requirements. The residual post-normalization ensures training stability in deep networks, while scaled cosine attention eliminates feature magnitude effects on weight distribution, enabling better focus on expression-relevant discriminative regions.
By integrating Agent Attention with Swin Transformer V2, our model orchestrates fine-grained local feature processing within windows, complemented by the agent-based mechanism for efficient global context modeling. This fusion enables robust multi-scale expression feature fusion at linear complexity.

4. Experiments

To comprehensively evaluate Agent-Poster, we designed systematic experiments addressing three key aspects. (1) Overall performance and robustness: the model’s recognition accuracy was evaluated on public benchmark datasets, with cross-dataset tests conducted to examine its generalization capability in complex scenarios, alongside comparisons with state-of-the-art methods. (2) Ablation analysis: through ablation studies, we isolated and quantified the individual contributions of key components, including multi-scale feature extraction, Agent Attention, the lightweight Swin Transformer V2 fusion module, and one-way landmark-to-image cross-attention. (3) Model efficiency: we quantitatively assessed the model’s lightweight characteristics and deployment feasibility across multiple dimensions, including parameter count, computational complexity, and training time. This experimental section is organized into six subsections to present and discuss these aspects.

4.1. Datasets

To verify the validity of the model presented in this paper, two public facial expression recognition datasets, RAF-DB [34] and AffectNet [35], were selected as the experimental datasets. The selected datasets include facial expression data from natural environments as well as standard facial expression data from laboratory environments. Figure 4 shows example images from the datasets, and Table 1 summarizes their sizes.
The Real-world Affective Faces Database (RAF-DB) constitutes a large-scale facial expression dataset annotated by 315 human annotators. It comprises seven expression categories, encompassing six basic emotions along with the neutral expression. The dataset is partitioned into 12,271 training images and 3068 test images designated for expression recognition tasks.
AffectNet stands as the largest publicly available dataset in the field of facial expression recognition. It contains over 280,000 facial images retrieved through emotion-related keyword queries. The dataset encompasses eight fundamental emotion categories (neutral, happy, angry, sad, fear, surprise, disgust, and contempt). In accordance with common practice, seven categories (excluding contempt) are adopted in our experiments. The dataset is organized into approximately 280,000 training samples and 3500 validation images.

4.2. Experiment Setup

To ensure the reproducibility of our experimental results, we provide implementation details. We used the IR50 network, similar to POSTER, pre-trained on the MS-Celeb-1M dataset as the image backbone, and employed MobileFaceNet with frozen weights as our facial landmark detector, with faces detected and aligned using MTCNN and resized to 224 × 224. For data augmentation, we applied random rotation within ±10 degrees, random contrast adjustment with a scale factor between 0.8 and 1.2, random horizontal flipping with a probability of 0.5, and random erasing with a probability of 0.25 and an area ratio between 0.02 and 0.2. The Adam optimizer [36] was employed for model training, chosen for its adaptive moment estimation that computes individual learning rates for each parameter. This capability is particularly critical for our dual-stream, multi-scale Transformer architecture, which operates in a non-convex, high-dimensional parameter space prone to varying gradient scales across layers. The optimization was conducted with a base learning rate of 3.5 × 10⁻⁴, weight decay of 1 × 10⁻⁴, and a batch size of 144. A cosine annealing scheduler with a five-epoch warm-up phase was adopted to guide the learning rate decay. Furthermore, label smoothing (ε = 0.1) was applied to mitigate overfitting, and gradient clipping with a maximum norm of 1.0 was utilized to stabilize the training dynamics.
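The configuration above can be sketched as a short training script. The model and dataset below are stand-ins (a linear classifier and torchvision's FakeData) so that the snippet runs on its own, the total number of epochs is an assumption, and the label_smoothing argument of CrossEntropyLoss requires a newer PyTorch release than the 1.8.1 used in our experiments; only the hyperparameter values mirror this subsection.

```python
import math
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Data augmentation as described: rotation, contrast jitter, horizontal flip, random erasing.
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(10),
    transforms.ColorJitter(contrast=(0.8, 1.2)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25, scale=(0.02, 0.2)),
])

# Stand-ins so the sketch is self-contained (replace with Agent-Poster and RAF-DB/AffectNet).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 7))
train_dataset = datasets.FakeData(size=1024, image_size=(3, 224, 224),
                                  num_classes=7, transform=train_tf)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4, weight_decay=1e-4)

epochs, warmup_epochs = 100, 5   # total epochs is an assumption; warm-up as stated

def lr_lambda(epoch):
    # Linear warm-up for five epochs, then cosine annealing.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
loader = DataLoader(train_dataset, batch_size=144, shuffle=True)

for epoch in range(epochs):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
    scheduler.step()
```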
Crucially, to address uncertainty and repeatability, all experiments were conducted with three different random seeds (42, 123, 456). All reported values for the accuracy and macro-F1 performance metrics represent the mean ± standard deviation across these three independent runs. Furthermore, 95% confidence intervals were computed using stratified bootstrap resampling with 1000 iterations, where the stratification was performed by the expression class to account for potential label imbalance. The models were implemented using PyTorch 1.8.1 on a single NVIDIA RTX 3080Ti GPU. We employed early stopping with a patience of 15 epochs based on validation accuracy, and the best-performing checkpoint on the validation set was selected for final evaluation.
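The class-stratified bootstrap used for the confidence intervals can be reproduced with a short NumPy routine such as the one below; resampling is performed within each expression class so every replicate preserves the label distribution, and the synthetic labels in the usage example are for illustration only.

```python
import numpy as np

def stratified_bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """95% CI for accuracy via class-stratified bootstrap resampling."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    class_indices = [np.where(y_true == c)[0] for c in np.unique(y_true)]
    accs = []
    for _ in range(n_boot):
        # Resample indices within each class, then pool them.
        sample = np.concatenate([rng.choice(idx, size=len(idx), replace=True)
                                 for idx in class_indices])
        accs.append((y_true[sample] == y_pred[sample]).mean())
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(accs)), (float(lo), float(hi))

# toy usage on synthetic predictions over 7 classes
rng = np.random.default_rng(42)
y_true = rng.integers(0, 7, size=3068)
y_pred = np.where(rng.random(3068) < 0.9, y_true, rng.integers(0, 7, size=3068))
print(stratified_bootstrap_ci(y_true, y_pred))
```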

4.3. Ablation Study

To evaluate the effectiveness of each component in Agent-Poster, we conducted a comprehensive ablation study on both RAF-DB and AffectNet datasets. The experiments followed a progressive architecture design, starting from the POSTER++ baseline and incrementally adding key components to demonstrate their individual contributions. All experiments were conducted with three different random seeds (42, 123, 456), and the results represent mean performance metrics with standard deviations consistently below 0.3%. The specific experimental results are presented in Table 2.
The progressive ablation study systematically validated the contribution of each architectural component in Agent-Poster. The integration of multi-scale feature extraction (A1 → A2) demonstrated effectiveness, improving accuracy by 0.77% on RAF-DB and 0.03% on AffectNet. This improvement stems from the module’s capability to capture facial expression features across different spatial hierarchies, effectively addressing scale sensitivity challenges in facial expression recognition.
Subsequently, the incorporation of Agent Attention (A2 → A3) yielded additional performance gains of 0.35% on RAF-DB and 0.26% on AffectNet. These improvements demonstrate the mechanism’s efficient feature selection through learnable agent tokens, which reduces computational complexity while maintaining global contextual modeling capacity.
The integration of lightweight Swin Transformer V2 (A3 → A4) further enhanced model performance, with improvements of 0.62% on RAF-DB and 0.27% on AffectNet. This advancement is primarily driven by the module’s residual post-normalization and scaled cosine attention mechanisms, which significantly improve the stability and discriminative power of multi-scale feature fusion.
Finally, the complete Agent-Poster architecture with unidirectional landmark-to-image (L → I) fusion achieved optimal performance, delivering the maximum single-step improvement of 0.75% on RAF-DB and 0.16% on AffectNet. The progressive integration culminated in total accuracy improvements of 2.49% on RAF-DB and 0.72% on AffectNet over the baseline, with all incremental enhancements demonstrating statistical significance (p < 0.001, Stuart–Maxwell test), thereby validating the synergistic combination of all architectural components.

4.4. Cross-Dataset Experiment

To comprehensively evaluate the robustness and accuracy of Agent-Poster under domain shift, we conducted two cross-dataset generalization experiments: first, training on RAF-DB and testing on the validation set of AffectNet; then, training on AffectNet and testing on the test set of RAF-DB. All experimental settings were consistent with those in Section 4.2. This assessment simulates real-world deployment, where the target environment does not match the distribution of the training data. The experimental results are shown in Table 3.
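A sketch of the evaluation loop for this protocol is shown below; the label remapping between RAF-DB and AffectNet class indices is a hypothetical placeholder, since the two datasets order the seven shared emotions differently, and the model and data loader are assumed to be provided by the surrounding training code.

```python
import torch

# Hypothetical mapping from RAF-DB class indices to AffectNet class indices;
# the real correspondence must be taken from the datasets' documentation.
RAF_TO_AFFECTNET = {0: 3, 1: 4, 2: 5, 3: 1, 4: 2, 5: 6, 6: 0}

@torch.no_grad()
def cross_dataset_accuracy(model, loader, label_map):
    """Evaluate a model trained on one dataset on another dataset's split."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        preds = torch.tensor([label_map[int(p)] for p in preds])
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```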
The results indicate that in both settings, the cross-dataset recognition accuracy of Agent-Poster was higher than that of POSTER++, demonstrating its stronger generalization ability. Specifically, when training on RAF-DB and testing on AffectNet, Agent-Poster achieved 63.45%. Similarly, when training on AffectNet and testing on RAF-DB, Agent-Poster reached 89.17%, both higher than POSTER++.
This improvement in cross-dataset performance is mainly attributed to the two key designs of Agent-Poster: The Agent Attention mechanism, which dynamically weights information-rich tokens and reduces redundancy, enhances the model’s ability to focus on domain-invariant expression-related features. The efficient multi-scale feature fusion implemented through the lightweight Swin Transformer V2 can capture local and global facial patterns, making the model less sensitive to domain-specific changes such as lighting and background in the dataset.
Although there was a slight decrease in performance due to dataset bias compared to internal dataset evaluation, the performance decline of Agent-Poster was smaller, confirming its advantage in learning more generalized representations. These results are in line with the latest trends in FER research, emphasizing the importance of domain alignment and robust feature learning for practical applications.

4.5. Comparison

To comprehensively evaluate the performance of the Agent-Poster model, we conducted systematic comparative experiments on RAF-DB and AffectNet datasets, comparing against various current advanced methods. In accordance with reproducibility requirements, all results are based on three independent experimental runs with random seeds {42, 123, 456}, as detailed in Section 4.2.
The Agent-Poster model achieved a mean accuracy of 92.61% with a standard deviation of 0.15% on RAF-DB, and 68.21% ± 0.18% on AffectNet. The corresponding macro-F1 scores were 87.12% ± 0.16% and 65.23% ± 0.21%, respectively. The low variance observed across runs (standard deviations < 0.3%) ensures the reliability of the reported performance metrics. Furthermore, stratified bootstrap resampling with 1000 iterations confirmed the statistical significance of our improvements, yielding 95% confidence intervals of [92.38%, 92.84%] for RAF-DB accuracy and [67.95%, 68.47%] for AffectNet accuracy. All comparative results discussed below are derived from this rigorous statistical protocol, ensuring the reliability of our findings. Table 4 presents the class-wise accuracy of Agent-Poster and Table 5 presents the comparison of the experimental results of Agent-Poster with those of multiple models on RAF-DB and AffectNet. Figure 5 shows the confusion matrices for RAF-DB and AffectNet.
Results on RAF-DB. On the RAF-DB dataset, Agent-Poster achieved an overall recognition accuracy of 92.61%, significantly outperforming all comparative methods (Table 5). This performance demonstrates the effective integration of the proposed Agent Attention mechanism and the multi-scale feature fusion strategy. Compared to Transformer-based approaches such as TransFER and EAC-Net, Agent-Poster demonstrated improvements of approximately 2%. This indicates that replacing traditional dense attention computation with learnable agent tokens not only substantially reduces computational complexity but also maintains strong global contextual modeling capability.
Relative to POSTER and its lightweight variant POSTER++, Agent-Poster exhibited considerable performance gains of 6.58% and 2.49% on RAF-DB, respectively, and outperformed POSTER++ [8] by 0.72% on AffectNet. This improvement directly validates the efficacy of the Agent Attention mechanism introduced in Section 3, which delegates token interactions to a smaller set of agent tokens, thereby reducing redundant computations while enhancing focus on discriminative features.
Analysis of per-class accuracy (Table 4) reveals that Agent-Poster surpassed POSTER++ in most emotion categories, with particularly outstanding performance in “Neutral,” “Happy,” “Sad,” and “Anger.” Notably, in the challenging “Disgust” category, Agent-Poster attained an accuracy of 72.84%, exceeding POSTER++’s 71.88% by 0.96%. As further evidenced by the confusion matrix in Figure 5, our model reduced the confusion between ‘Disgust’ and ‘Anger’ compared to POSTER++. This result highlights the advantage of the Agent Attention in fine-grained feature discrimination—through its dynamic weighted fusion mechanism, the model more effectively captures subtle expression variations, thereby improving differentiation among easily confusable categories. We posit that this is because the agent tokens act as learnable proxies that summarize global context, allowing the model to dynamically weight features from different facial regions, which is particularly beneficial for distinguishing subtle inter-class differences like those between ‘Disgust’ and ‘Anger’. Ultimately, Agent-Poster achieved an average class accuracy of 86.54% on RAF-DB, outperforming POSTER++’s 85.97% and demonstrating its comprehensive superiority in multi-category emotion recognition.
Results on AffectNet. On the more challenging AffectNet dataset, Agent-Poster similarly demonstrated leading performance, achieving an overall accuracy of 68.21% (Table 5) and significantly outperforming the other models. This outcome validates the strong generalization capability of Agent-Poster in complex real-world scenarios.
Per-class accuracy analysis (Table 4) indicates that Agent-Poster outperformed POSTER++ in five emotion categories: “Neutral,” “Happy,” “Sad,” “Disgust,” and “Anger.” Performance in the “Happy” category was particularly outstanding, reaching 90.14% compared to POSTER++’s 89.40%. By extracting multi-scale features directly from the backbone network and integrating them via a lightweight Swin Transformer V2 module, the model more effectively captures coarse-to-fine semantic information, thereby improving recognition accuracy for pronounced expressions such as “Happy.” Although minor fluctuations were observed in the “Surprise” and “Fear” categories, Agent-Poster still achieved an average class accuracy of 67.84%, overall surpassing POSTER++’s 67.45%. This result demonstrates that Agent-Poster maintains well-balanced performance across different emotion categories, avoiding significant performance drops in certain classes at the expense of others, and reflects the robustness and generalizability of its design.
The consistency of results across three independent runs with different random seeds demonstrates the robustness of our findings. The maximum standard deviation observed was 0.28% for accuracy metrics and 0.32% for macro-F1 scores, well within acceptable limits for experimental reproducibility. Stratified bootstrap analysis with 1000 iterations confirmed that the performance advantages of Agent-Poster were statistically significant at the p < 0.05 level for all comparative evaluations.

4.6. FLOPs and Param Comparison

To thoroughly assess the overall performance of the proposed Agent-Poster model, a comparative analysis with current mainstream methods was conducted, considering three dimensions: parameter count (Param), computational complexity (FLOPs), and recognition accuracy. The results are summarized in Table 6. With only 27.1 M parameters and 7.9 G FLOPs, Agent-Poster achieved recognition accuracies of 92.61% on RAF-DB and 68.21% on AffectNet, respectively. These results demonstrate a comprehensive improvement in recognition accuracy while significantly reducing both parameter count and computational complexity, reflecting its excellent overall performance.
Compared to the Transformer-based method TransFER [37], Agent-Poster reduced parameters by approximately 58.4% and computational cost by about 48.4%, while simultaneously improving recognition accuracy by 1.7% on RAF-DB and 1.98% on AffectNet. This convincingly validates the effectiveness of the proposed lightweight design. When compared to its baseline model POSTER++ [8], Agent-Poster achieved a 38.0% reduction in parameters with comparable computational cost, while improving accuracy by 2.49% on RAF-DB and 0.72% on AffectNet. This accomplishment aligns with the optimization objective of achieving higher accuracy with fewer parameters.
Relative to the similarly lightweight design FRA [4], Agent-Poster achieved significant accuracy improvements of 2.66% on RAF-DB and 2.05% on AffectNet, while utilizing fewer parameters and comparable computational overhead. This demonstrates the superiority of the Agent Attention mechanism in feature extraction and fusion. The experimental results collectively indicate that Agent-Poster achieves a more favorable trade-off between model efficiency (parameter count and computational complexity) and recognition accuracy. It not only significantly outperforms original large-scale models but also surpasses numerous contemporary lightweight models with lower computational costs, thereby better satisfying the dual requirements of high accuracy and low overhead in practical applications.
To quantitatively evaluate the improvement in training efficiency, we compared the average time per training cycle for Agent-Poster and the POSTER++ baseline. Agent-Poster required only approximately 812 s per cycle, about 18% faster than POSTER++. This reduction in training time is attributed to the linear complexity of Agent Attention, which replaces the quadratic-cost window self-attention used in POSTER++; the simplified dual-stream architecture and the efficient multi-scale fusion with a two-layer Swin Transformer V2 further enhance efficiency. The improvement in training speed, combined with the reduction in FLOPs and parameters, highlights the practical advantages of our model for rapid prototype verification and deployment.
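The parameter counts and per-cycle times reported in Table 6 can be measured with standard tooling; the helper functions below are a generic sketch rather than our exact measurement scripts, and the thop profiler mentioned in the comment is an assumed third-party tool, not part of our pipeline.

```python
import time
import torch
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Trainable parameters in millions (the 'Param (M)' column in Table 6)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def cycle_time(model, loader, optimizer, criterion):
    """Wall-clock seconds for one training pass (the 'Time (s)' column in Table 6)."""
    model.train()
    start = time.time()
    for images, labels in loader:
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()
    return time.time() - start

# FLOPs can be estimated with a third-party profiler such as thop (assumed tooling):
#   from thop import profile
#   macs, _ = profile(model, inputs=(torch.randn(1, 3, 224, 224),))
```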

5. Conclusions

We introduce Agent-Poster, an improved model based on POSTER++ that features a novel attention mechanism and an optimized multi-scale feature fusion strategy. First, we embed an Agent Attention module in the middle layers of the model and optimize it using a transfer learning strategy. This module reduces the computational complexity from quadratic to linear by using learnable agent tokens while maintaining global context modeling ability, enabling the model to focus more effectively on the key regions of expressions and suppress interference from irrelevant features. Second, we introduce a lightweight Swin Transformer V2 module, which enhances the stability and discriminative power of multi-scale feature fusion through residual post-normalization and scaled cosine attention. The experimental results show that Agent-Poster achieves strong performance in the FER task, reaching 92.61% accuracy on the RAF-DB dataset and 68.21% accuracy on the AffectNet dataset. Despite this performance, the work has limitations. For instance, the model's performance may be influenced by the inherent label noise and class imbalance in datasets such as AffectNet, and it relies on accurate face detection and landmark localization. Future work will focus on exploring model compression for mobile deployment, investigating temperature scaling for better calibration on AffectNet, and developing more robust cross-modal fusion strategies to handle occlusions and extreme illumination.

Author Contributions

Conceptualization, L.F. and Y.W.; methodology, L.F.; software, L.F.; validation, L.F., Y.W. and G.Z.; formal analysis, L.F.; investigation, L.F.; resources, Y.W. and G.Z.; data curation, L.F.; writing—original draft preparation, L.F.; writing—review and editing, L.F., Y.W. and G.Z.; visualization, L.F.; supervision, Y.W. and G.Z.; project administration, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Project 2024JJ7428 of the Hunan Provincial Natural Science Foundation of China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in the RAF-DB dataset at http://whdeng.cn/RAF/model1.html (accessed on 19 October 2025). The AffectNet dataset is available at http://mohammadmahoor.com/pages/databases/affectnet/ (accessed on 19 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, K.; Peng, X.; Yang, J.; Meng, D.; Qiao, Y. Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition. IEEE Trans. Image Process. 2020, 29, 4057–4069.
2. Yang, Y.; Jia, B.; Zhi, P.; Huang, S. PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 1–18.
3. Gao, Z.; Patras, I. Self-Supervised Facial Representation Learning with Facial Region Awareness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2081–2088.
4. Liu, Y.; Wang, W.; Zhan, Y.; Feng, S.; Liu, K.; Chen, Z. Pose-Disentangled Contrastive Learning for Self-supervised Facial Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9717–9728.
5. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
6. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
7. Han, D.; Yun, S.; Heo, B.; Yoo, Y.J. REXNet: Diminishing Representational Bottleneck on Convolutional Neural Network. arXiv 2020, arXiv:2007.00992.
8. Mao, J.; Xu, R.; Yin, X.; Chang, Y.; Nie, B.; Huang, A. POSTER++: A Simpler and Stronger Facial Expression Recognition Network. Pattern Recognit. 2025, 157, 110951.
9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30; Curran Associates, Inc.: New York, NY, USA, 2017; pp. 5998–6008.
10. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
11. Mollahosseini, A.; Chan, D.; Mahoor, M.H. Going Deeper in Facial Expression Recognition Using Deep Neural Networks. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision, Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10.
12. Shao, J.; Qian, Y. Three Convolutional Neural Network Models for Facial Expression Recognition in the Wild. Neurocomputing 2019, 355, 82–92.
13. Gürsesli, M.C.; Lombardi, S.; Duradoni, M.; Guazzini, A. Facial Emotion Recognition (FER) Through Custom Lightweight CNN Model: Performance Evaluation in Public Datasets. IEEE Access 2024, 12, 45543–45559.
14. Zhong, L.; Liu, Q.; Yang, P.; Liu, B.; Huang, J.; Metaxas, D.N. Learning Active Facial Patches for Expression Analysis. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2562–2569.
15. Savchenko, A.V. Facial Expression and Attributes Recognition Based on Multi-Task Learning of Lightweight Neural Networks. In Proceedings of the 2021 IEEE 19th International Symposium on Intelligent Systems and Informatics, Subotica, Serbia, 16–18 September 2021; pp. 119–124.
16. Sang, D.V.; Ha, P.T. Discriminative Deep Feature Learning for Facial Emotion Recognition. In Proceedings of the 2018 1st International Conference on Multimedia Analysis and Pattern Recognition, Hanoi, Vietnam, 16–17 April 2018; pp. 1–6.
17. Kim, S.; Nam, J.; Ko, B.C. Facial Expression Recognition Based on Squeeze Vision Transformer. Sensors 2022, 22, 3729.
18. Li, H.; Sui, M.; Zhu, Z.; Zhao, F. MFEViT: A Robust Lightweight Transformer-Based Network for Multimodal 2D+3D Facial Expression Recognition. arXiv 2021, arXiv:2109.13086.
19. Han, D.; Ye, T.; Han, Y.; Xia, Z.; Song, S.; Huang, G. Agent Attention: On the Integration of Softmax and Linear Attention. arXiv 2023, arXiv:2312.08874.
20. Zhao, G.; Pietikäinen, M. Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 915–928.
21. Ojala, T.; Pietikäinen, M.; Mäenpää, T. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
22. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 9992–10002.
23. Liu, C.; Hirota, K.; Dai, Y. Patch Attention Convolutional Vision Transformer for Facial Expression Recognition with Occlusion. Inf. Sci. 2023, 619, 781–794.
24. Chen, X.; Zheng, X.; Sun, K.; Liu, W.; Zhang, Y. Self-Supervised Vision Transformer-Based Few-Shot Learning for Facial Expression Recognition. Inf. Sci. 2023, 634, 206–226.
25. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699.
26. Chen, C.J. PyTorch Face Landmark: A Fast and Accurate Facial Landmark Detector. 2021. Available online: https://github.com/cunjian/pytorch_face_landmark (accessed on 19 October 2025).
27. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11999–12009.
28. Wang, Z.; Yan, C.; Hu, Z. Lightweight Multi-Scale Network with Attention for Facial Expression Recognition. In Proceedings of the 4th International Conference on Advanced Electronic Materials, Computers and Software Engineering, Zhuhai, China, 26–28 March 2021; pp. 695–698.
29. Chen, Y.; Wu, L.; Wang, C. A Micro-Expression Recognition Method Based on Multi-Level Information Fusion Network. Acta Autom. Sin. 2024, 50, 1445–1457.
30. Li, Y.; Li, S.; Sun, G.; Han, X.; Liu, Y. Lightweight Swin Transformer Combined with Multi-Scale Feature Fusion for Face Expression Recognition. Opt.-Electron. Eng. 2025, 52, 240234.
31. Zheng, C.; Mendieta, M.; Chen, C. POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Paris, France, 2–3 October 2023; pp. 3138–3147.
32. Hazirbulan, I.; Zafeiriou, S.; Pantic, M. Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1234–1243.
33. Li, H.; Wang, N.; Yu, Y.; Wang, X. Facial Expression Recognition with Grid-Wise Attention and Visual Transformer. Inf. Sci. 2021, 580, 35–54.
34. Li, S.; Deng, W.; Du, J. Reliable Crowdsourcing and Deep Locality-Preserving Learning for Expression Recognition in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2584–2593.
35. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Trans. Affect. Comput. 2019, 10, 18–31.
36. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
37. Xue, F.; Wang, Q.; Guo, G. TransFER: Learning Relation-Aware Facial Expression Representations with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 3601–3610.
38. Zhang, Y.; Wang, C.; Ling, X.; Deng, W. Learn from All: Erasing Attention Consistency for Noisy Label Facial Expression Recognition. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 418–434.
39. Zhang, S.; Zhang, Y.; Zhang, Y.; Wang, Y.; Song, Z. A Dual-Direction Attention Mixed Feature Network for Facial Expression Recognition. Electronics 2023, 12, 3595.
40. Lee, I.; Lee, E.; Yoo, S.B. Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 1536–1546.
41. Chen, Y.; Li, J.; Shan, S.; Wang, M.; Hong, R. From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos. IEEE Trans. Affect. Comput. 2024, 16, 624–638.
42. Colares, W.G.; Costa, M.G.F.; Costa Filho, C.F.F. Enhancing Emotion Recognition: A Dual-Input Model for Facial Expression Recognition Using Images and Facial Landmarks. In Proceedings of the 2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Orlando, FL, USA, 15–18 July 2024; pp. 1–5.
43. Huang, Y. FERMixNet: An Occlusion Robust Facial Expression Recognition Model With Facial Mixing Augmentation and Mid-Level Representation Learning. IEEE Trans. Affect. Comput. 2025, 16, 639–654.
Figure 1. Agent-Poster structure diagram.
Figure 2. Dual-stream simplification diagram. The blue dotted line marks the image-to-landmark branch of POSTER++ that is removed.
Figure 3. Agent Attention.
Figure 4. Dataset examples.
Figure 5. Confusion matrices for RAF-DB and AffectNet.
Table 1. Detailed sizes of the experimental datasets.

Dataset   | Train Size | Test Size | Classes
RAF-DB    | 12,271     | 3068      | 7
AffectNet | 280,401    | 3500      | 7
Table 2. Ablation study. Ablation components are built upon the POSTER++ baseline. L → I fusion refers to one-way landmark-to-image cross-attention. All results represent mean values from three independent runs with random seeds {42, 123, 456}. Standard deviations were consistently below 0.3%, confirming experimental stability.

Method              | Components                        | RAF-DB (%)   | AffectNet (%) | Statistical Significance
A1 (Baseline)       | -                                 | 90.12 ± 0.12 | 67.49 ± 0.15  | -
A2 = A1 +           | Multi-scale feature extraction    | 90.89 ± 0.14 | 67.52 ± 0.13  | p < 0.05
A3 = A2 +           | Agent Attention                   | 91.24 ± 0.10 | 67.78 ± 0.16  | p < 0.01
A4 = A3 +           | Lightweight Swin Transformer V2   | 91.86 ± 0.13 | 68.05 ± 0.14  | p < 0.001
Agent-Poster (Ours) | All components with L → I fusion  | 92.61 ± 0.09 | 68.21 ± 0.11  | p < 0.001
Table 3. Cross-dataset facial expression recognition results. Experiment RA: training on RAF-DB and testing on the validation set of AffectNet. Experiment AR: training on AffectNet and testing on the test set of RAF-DB.

Method              | RA (%) | AR (%)
POSTER++ [8]        | 60.13  | 87.62
Agent-Poster (Ours) | 63.45  | 89.17
Table 4. Class-wise accuracy of Agent-Poster.

Dataset   | Method       | Neutral (%) | Happy (%) | Sad (%) | Surprise (%) | Fear (%) | Disgust (%) | Anger (%) | Mean Acc. (%)
RAF-DB    | POSTER++     | 92.06       | 97.22     | 92.89   | 90.58        | 68.92    | 71.88       | 88.27     | 85.97
RAF-DB    | Agent-Poster | 93.01       | 97.37     | 93.81   | 90.32        | 69.26    | 72.84       | 89.20     | 86.54
AffectNet | POSTER++     | 65.40       | 89.40     | 68.00   | 66.00        | 64.20    | 54.40       | 65.00     | 67.45
AffectNet | Agent-Poster | 66.19       | 90.14     | 68.70   | 65.46        | 63.42    | 55.17       | 65.78     | 67.84
Table 5. Comparison of experimental results on RAF-DB and AffectNet. This table compares the accuracy rates of multiple models.

Method                 | Reference        | RAF-DB (%) | AffectNet (%)
TransFER [37]          | ICCV 2021        | 90.91      | 66.23
EAC-Net [38]           | ECCV 2022        | 90.35      | 65.32
POSTER [31]            | ICCVW 2023       | 86.03      | 67.31
PCL [3]                | CVPR 2023        | 85.92      | 66.16
DDAMFN [39]            | Electronics 2023 | 91.35      | 67.03
Latent-OFER [40]       | ICCV 2023        | 89.60      | 63.90
S2D [41]               | TAC 2024         | 92.57      | 67.62
FRA [4]                | CVPR 2024        | 89.95      | 66.16
1D-CNN + DenseNet [42] | EMBC 2024        | -          | 60.17
FERMixNet [43]         | TAFFC 2024       | 91.62      | 66.40
POSTER++ [8]           | PR 2025          | 90.12      | 67.49
Agent-Poster (Ours)    | -                | 92.61      | 68.21
Table 6. Comparison of Param and FLOPs. Time refers to the average time required for each training cycle.

Method              | Param (M) | FLOPs (G) | Time (s) | RAF-DB (%) | AffectNet (%)
TransFER [37]       | 65.2      | 15.3      | -        | 90.91      | 66.23
POSTER++ [8]        | 43.7      | 8.4       | 989      | 90.12      | 67.49
FRA [4]             | 29.6      | 6.8       | -        | 89.95      | 66.16
Agent-Poster (Ours) | 27.1      | 7.9       | 812      | 92.61      | 68.21