1. Introduction
Micro-expressions refer to involuntary facial movements that reveal concealed emotions, characterized by their brief duration and low intensity. Due to their spontaneous nature, micro-expressions serve as a critical cue for inferring true emotions [1,2,3]. Since their introduction, they have been extensively studied across disciplines including psychology and sociology. However, accurately detecting micro-expressions with the naked eye is challenging for non-experts, which has motivated the development of computer-based automated micro-expression analysis [4,5,6]. As a key technology in human affective computing, micro-expression recognition holds significant promise in critical fields such as suicide prevention [7], criminal interrogation [8], and national security [9].
The advancement of micro-expression recognition research relies fundamentally on the support of well-annotated datasets. Despite the considerable challenges posed by the spontaneous and subtle nature of micro-expressions, persistent research efforts have yielded several valuable open-source datasets, including SMIC [10], CASME [11], CASME II [12], SAMM [13], and MMEW [14]. These datasets provide an essential foundation for model training and evaluation. However, existing datasets still face critical limitations: the number of publicly available samples is small (fewer than 2000 in total), which restricts model complexity and generalizability, and significant class imbalance across emotion categories further complicates the development of robust recognition systems. These issues pose challenges to building reliable micro-expression recognition models.
Although many effective micro-expression recognition models have been developed using existing datasets, the prevailing methodology remains largely dependent on an intra-dataset evaluation protocol, wherein models are trained and tested on data from the same source [5,8]. This approach ensures highly consistent feature distributions between training and testing splits, which facilitates learning but also causes models to overfit to dataset-specific statistical biases. In real-world applications, however, models are required to generalize to data collected under different conditions with divergent quality and feature distributions. Under such circumstances, conventional models developed under the intra-dataset paradigm typically suffer a significant performance degradation, revealing a critical lack of generalization capability [15].
To address the generalization challenge in micro-expression recognition, the Micro-Expression Grand Challenge 2018 (MEGC2018) and 2019 (MEGC2019) introduced two protocols: Composite Dataset Evaluation (CDE) [16], which merges multiple datasets for training and testing to learn general features, and Holdout Dataset Evaluation (HDE) [17], which trains on one dataset and tests on another to simulate cross-dataset scenarios. For CDE, Liu et al. integrate optical flow motion analysis, structural feature pooling, and cross-domain knowledge transfer to align feature distributions across source datasets [18]. For HDE, Zong et al. propose a Target Sample Re-Generator on the CASME II and SMIC datasets; this method generates target samples with a feature distribution similar to the source domain to minimize distribution discrepancies [19]. Although these methods have achieved promising results, as shown in the left part of Figure 1, existing evaluation protocols suffer from notable limitations in assessing model generalizability. The CDE protocol, despite its ability to learn distribution discrepancies across multiple source datasets, does not evaluate models on truly unseen data. Meanwhile, conventional HDE setups typically utilize only two datasets, which prevents models from being exposed to sufficiently diverse sample distributions. Consequently, the generalizability of models trained under these paradigms is often limited [20].
To overcome these limitations and better simulate real-world conditions, we propose a comprehensive cross-dataset micro-expression recognition framework based on a leave-one-dataset-out strategy across five datasets. For instance, Datasets 2–5 are combined for training, while Dataset 1 is held out for testing. As shown in the right part of Figure 1, this approach maximizes data utilization while ensuring rigorous evaluation on completely unseen data distributions, thereby closely mimicking real-world scenarios where models are required to generalize to data from entirely new sources.
While this framework significantly enhances evaluation rigor, it simultaneously amplifies inherent challenges in cross-dataset recognition. The first challenge stems from feature distribution inconsistency. Disparities in acquisition hardware, illumination environments, and subject populations create substantial distribution shifts across datasets. When multiple datasets are merged for training, these discrepancies not only intensify internal feature variations within the training set but also magnify the distribution gap with the held-out test set. The second challenge is dataset imbalance. Each dataset exhibits unevenness in both quantity and class distribution, and combining multiple datasets further intensifies this imbalance. This composite imbalance may lead the model to over-rely on datasets or classes with larger sample sizes during training, thereby reducing its generalization capability across different data sources. We provide a detailed visual analysis of these cross-dataset challenges in Section 3.
To address feature distribution inconsistency in cross-dataset recognition, we propose a distribution-balanced batch regularization learning (BRL) approach. Implemented as a specialized loss component through a self-attention mechanism, this method establishes dataset grouping constraints in the feature space and minimizes inter-group attention weight differences to force balanced feature learning from all source domains. The BRL module acts as an information flow regularizer, minimizing the entropy disparity in the cross-domain attention distribution and encouraging the extraction of domain-invariant features with higher mutual information with respect to emotion labels. Experiments demonstrate that the BRL module effectively prevents overfitting to individual datasets and significantly improves cross-dataset generalization. To address data imbalance, we propose a data augmentation method based on Action Unit (AU) intensity clustering. By analyzing the AU intensity distribution of the same emotion category across different datasets, we extract representative cluster centroids as average AU weights to guide the ULME-GAN network [21] in generating semantically consistent samples that align with the real data distribution. By effectively expanding the sample pool, this strategy enhances the information diversity of the training set while preserving the essential statistical properties and semantic plausibility of micro-expressions. Comprehensive experiments using three mainstream networks (CNN, ResNet, and PoolFormer) on five spontaneous datasets (SMIC, CASME, CASME II, SAMM, and MMEW) demonstrate substantial performance improvements of our approach over state-of-the-art methods under the rigorous cross-dataset evaluation protocol.
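To make the mechanism concrete, the following is a minimal PyTorch sketch of one way such a batch-level regularizer could be realized. The paper does not publish an implementation, so the class name `BRLLoss`, the single linear attention scorer, and the variance penalty over per-dataset attention means are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class BRLLoss(nn.Module):
    """Sketch of a batch regularization learning (BRL) term.

    Scores each sample with a learned attention head, averages the
    attention mass per source dataset, and penalizes the spread of
    those per-dataset means so that no source domain dominates the batch.
    """

    def __init__(self, feat_dim: int):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)  # per-sample attention score

    def forward(self, feats: torch.Tensor, dataset_ids: torch.Tensor) -> torch.Tensor:
        # feats: (B, D) penultimate-layer features; dataset_ids: (B,) integer IDs
        weights = torch.softmax(self.attn(feats).squeeze(-1), dim=0)  # (B,)
        group_means = torch.stack(
            [weights[dataset_ids == d].mean() for d in dataset_ids.unique()]
        )
        # Variance of per-dataset attention means -> 0 when attention is balanced
        return group_means.var(unbiased=False)
```

Driving the variance of per-dataset attention means toward zero is one straightforward way to encode the "balanced attention across source domains" constraint described above.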
The main contributions of this study are summarized as follows:
(1) A rigorous cross-dataset evaluation protocol is established for micro-expression recognition, systematically addressing key challenges through targeted solutions while achieving state-of-the-art performance under this realistic setting.
(2) A batch regularization learning strategy is proposed to address feature distribution inconsistency across datasets, implemented as a loss term that explicitly balances the model’s attention to different datasets at the representation learning level.
(3) An AU-guided data generation strategy is developed, which not only effectively mitigates class imbalance but also maintains the authenticity of data distributions, contributing to improved model performance.
The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 provides a systematic analysis of cross-dataset challenges through visualizations and elaborates the proposed targeted solutions. Section 4 presents the experimental results and discussions. Section 5 concludes this study and suggests future research directions.
2. Related Work
With the advancement of micro-expression recognition research, particularly since the launch of the Micro-Expression Grand Challenge, the research focus has progressively shifted from single-dataset evaluation to CDE and HDE validation paradigms that emphasize generalization capability [22,23,24,25].
Under the CDE paradigm, the main goal is to learn highly generalizable feature representations from the mixed distribution of multiple data sources. Research primarily follows three technical paths: transfer learning, domain adaptation through feature distribution alignment, and data augmentation to address scarcity. A representative example is the three-stage transfer learning framework proposed by Peng et al., which fine-tunes a network first on ImageNet, then on macro-expression datasets, and finally on micro-expression datasets, significantly improving model adaptation to composite data [26]. Furthermore, Yu et al. develop ICE-GAN, which uses a generative adversarial network to synthesize micro-expression samples with controllable attributes, offering a new approach to mitigate data scarcity [27]. Zhang et al. propose a Global–Local Feature Fusion Network (GLFNet) that integrates global attention and local block modules with an adaptive feature fusion mechanism, and employs a class-balanced loss to effectively address the challenges of subtle motion and class imbalance in micro-expression recognition [28]. Gan et al. propose a network called MAG, which aligns macro-expressions with micro-expressions based on action similarity [29]; by integrating nonlinear amplification and guidance mechanisms to enhance feature saliency, MAG improves recognition performance under CDE. Zhang et al. propose a Hierarchical Feature Aggregation Network (HFA-Net), which further enhances micro-expression recognition performance through multi-level feature aggregation and adaptive attention feature fusion [30].
Under the HDE paradigm, the research focus is on effective knowledge transfer from known source domains to completely unseen target domains. In addition to domain-adaptive distribution alignment methods, research in this area also encompasses targeted feature design and selection, as well as enhancement techniques for local discriminative features. For example, Peng et al. propose the Apex-Time Network (ATNet), a novel framework that leverages spatial information from the apex frame and temporal cues from adjacent frames, systematically validating the effectiveness of spatiotemporal fusion and demonstrating the critical role of temporal features in improving model generalization [31]. Mao et al. propose a Region-Relational Reasoning Network (RRRN) for occluded micro-expression recognition; this network enhances model robustness by modeling inter-region relationships and employing an attention mechanism to mitigate occlusion effects [32].
Recently, fully cross-dataset micro-expression recognition has emerged to better simulate real-world application scenarios. This paradigm requires models trained on multiple source domains to perform well directly on completely unseen target datasets. Researchers explore various advanced technical routes to address this challenge, including stability feature design based on facial regions of interest, data augmentation, meta-learning for rapid domain adaptation, and many innovative network architectures. For example, Zhang et al. develop the Region-Selective Transfer Regression (RSTR) method to significantly improve cross-dataset recognition performance by concentrating on facial local regions that exhibit high cross-dataset consistency [33]. To mitigate the feature distribution discrepancy across databases, Zong et al. develop a domain regeneration approach capable of synthesizing new micro-expression samples, thereby narrowing the domain shift between source and target datasets [34]. Addressing the issue of intra-class variation in micro-expressions, Wang et al. introduce MCNet, a meta-clustering learning network designed to enhance recognition performance [35].
However, existing studies show that current cross-dataset methods predominantly adopt general strategies from CDE and HDE paradigms, lacking specialized optimization for the distinctive requirements of cross-dataset scenarios. In particular, the inherent feature distribution discrepancies in multi-source domain training and the intrinsic data imbalance in composite training sets still lack systematic visualization analysis and targeted solutions. The prevailing reliance on either transferring generic approaches or constructing overly sophisticated models fails to address these fundamental issues, compromising practical stability and scalability in real-world applications. Therefore, by clearly identifying the key challenges in cross-dataset micro-expression recognition, this paper proposes a batch group regularization constraint and an action unit-guided data balancing method to provide more targeted technical solutions.
4. Experimental Results and Analysis
In this section, we present a comprehensive evaluation of the proposed method to validate its effectiveness. We begin by describing the experimental setup, including descriptions of the datasets employed, the data preprocessing pipeline, and the implementation details. Subsequently, the evaluation metrics used for quantitative analysis are defined. A thorough analysis of the results is then provided, encompassing an ablation study to dissect the contribution of each proposed component, followed by a comparative analysis with the state-of-the-art methods.
4.1. Datasets
This study employs five publicly available spontaneous micro-expression datasets (SMIC [10], CASME [11], CASME II [12], SAMM [13], and MMEW [14]) to establish a comprehensive benchmark for cross-dataset evaluation. Except for SMIC, all datasets provide apex frame annotations and corresponding AU annotations. For SMIC, we follow established research conventions and use the frame with the largest detected motion magnitude as the apex frame. Each micro-expression dataset contains 3 to 8 emotion categories. To maintain categorical consistency across datasets with different original emotion labels, all samples are mapped into three emotion categories: Positive, Negative, and Surprise.
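For illustration, a typical three-class consolidation in this style (the exact per-dataset label vocabularies are not enumerated here, so this mapping is a hypothetical sketch following the common MEGC convention) could be expressed as:

```python
# Hypothetical label consolidation; per-dataset label vocabularies vary.
EMOTION_MAP = {
    "happiness": "Positive",
    "disgust": "Negative",
    "repression": "Negative",
    "anger": "Negative",
    "fear": "Negative",
    "sadness": "Negative",
    "contempt": "Negative",
    "surprise": "Surprise",
}
```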
As illustrated in Figure 1 and detailed in Table 3, the cross-dataset evaluation follows a leave-one-dataset-out protocol. For each test round, one dataset is held out as the test set, while the remaining four are combined to form a composite training set. This process is repeated five times, ensuring each dataset serves as the test set. It is important to note that all augmented samples are used exclusively for training and do not participate in the testing phase. The original sample sizes for each test dataset are provided in Table 1. Table 3 details the sample composition of the composite training sets corresponding to each testing scenario, comparing the original and augmented sample sizes. The data augmentation strategy effectively balances the training distribution, with each composite set reaching 936 samples after augmentation, significantly enhancing the model's exposure to diverse sample distributions.
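The split logic itself is straightforward; a schematic version of the leave-one-dataset-out protocol (function and variable names are ours, not from the paper) is:

```python
DATASETS = ["SMIC", "CASME", "CASME II", "SAMM", "MMEW"]

def leave_one_dataset_out(datasets=DATASETS):
    """Yield (training datasets, held-out test dataset) pairs for the
    five-round cross-dataset protocol. Augmented samples would be added
    to the training side only, never to the test side."""
    for held_out in datasets:
        train_sets = [d for d in datasets if d != held_out]
        yield train_sets, held_out

for train_sets, test_set in leave_one_dataset_out():
    print(f"train on {train_sets}, test on {test_set}")
```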
4.2. Data Preprocessing and Implementation Details
All samples from the five micro-expression datasets undergo the same data preprocessing procedure, with facial regions cropped to a uniform pixel resolution. In this study, we employ two different input modalities to comprehensively evaluate model performance: RGB apex frames and optical flow images. The RGB apex frames capture spatial texture features at the peak expression intensity, while the optical flow images, computed between onset and apex frames using the Recurrent All-Pairs Field Transforms (RAFT) algorithm [41], characterize temporal motion patterns of micro-expressions. Since the AU-guided augmentation strategy specifically operates on single-frame AU features, data augmentation is exclusively applied to RGB apex frame inputs.
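The paper does not detail its optical flow pipeline beyond naming RAFT [41]; as a hedged sketch, onset-to-apex flow could be computed with torchvision's pretrained RAFT implementation as follows (tensor shapes and normalization are assumptions of this example):

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
preprocess = weights.transforms()  # resizes/normalizes image pairs for RAFT

@torch.no_grad()
def onset_apex_flow(onset: torch.Tensor, apex: torch.Tensor) -> torch.Tensor:
    # onset, apex: (B, 3, H, W) RGB tensors in [0, 1], H and W divisible by 8
    onset, apex = preprocess(onset, apex)
    flow_predictions = model(onset, apex)  # list of iterative refinements
    return flow_predictions[-1]            # (B, 2, H, W) final flow field
```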
All experiments are implemented using the PyTorch framework (version 2.3.1) and evaluated on three backbone networks: CNN, ResNet18, and PoolFormer-S12. The models are trained for 100 epochs with an initial learning rate of 0.0002 and weight decay regularization. To ensure stable optimization, the BRL loss is incorporated after the 50th epoch, and the best performance after this point is recorded. To systematically evaluate the effect of the proposed BRL module, we compare five weighting coefficients $\lambda \in \{0, 0.3, 0.5, 0.7, 0.9\}$ in the total loss function $\mathcal{L}_{total} = (1-\lambda)\,\mathcal{L}_{CE} + \lambda\,\mathcal{L}_{BRL}$, where $\lambda = 0$ serves as the baseline without BRL regularization. All hyperparameters are tuned to achieve optimal performance across different experimental settings.
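A sketch of this training schedule follows, assuming `backbone`, `classifier`, `train_loader`, and `optimizer` are already constructed (with the regularizer's attention head registered with the optimizer) and `brl_loss` is an instance of the `BRLLoss` sketch from Section 1:

```python
import torch.nn.functional as F

LAMBDA = 0.5  # best-performing weight from the ablation in Section 4.4.1

for epoch in range(100):
    for images, labels, dataset_ids in train_loader:
        feats = backbone(images)      # (B, D) penultimate features
        logits = classifier(feats)    # (B, 3) emotion scores
        loss = F.cross_entropy(logits, labels)
        if epoch >= 50:               # BRL term enabled after the 50th epoch
            loss = (1 - LAMBDA) * loss + LAMBDA * brl_loss(feats, dataset_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```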
4.3. Evaluation Metrics
To comprehensively evaluate model performance in cross-dataset micro-expression recognition, we employ accuracy (Acc) alongside two additional metrics that account for class imbalance: unweighted average recall (UAR) and unweighted F1-score (UF1). These metrics are defined as follows:

$$\mathrm{Acc} = \frac{1}{N}\sum_{c=1}^{C} TP_c, \qquad \mathrm{UAR} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{N_c}, \qquad \mathrm{UF1} = \frac{1}{C}\sum_{c=1}^{C}\frac{2\,TP_c}{2\,TP_c + FP_c + FN_c},$$

where $TP_c$, $FP_c$, and $FN_c$ represent true positives, false positives, and false negatives for class $c$, respectively; $N_c$ denotes the total number of samples in class $c$; $N$ is the total number of samples; and $C$ is the number of classes. While Acc provides an overall performance measure, UAR and UF1 offer more balanced assessments by giving equal weight to each emotion category, thus mitigating the bias toward majority classes that commonly exists in imbalanced micro-expression datasets.
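These definitions translate directly into code; a small NumPy helper (ours, for illustration) that computes all three metrics from integer label arrays:

```python
import numpy as np

def acc_uar_uf1(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    """Return (Acc, UAR, UF1) following the definitions above."""
    acc = float(np.mean(y_true == y_pred))
    recalls, f1s = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        recalls.append(tp / max(tp + fn, 1))           # per-class recall
        f1s.append(2 * tp / max(2 * tp + fp + fn, 1))  # per-class F1
    return acc, float(np.mean(recalls)), float(np.mean(f1s))
```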
4.4. Ablation Experiments
4.4.1. Ablation of the Weighting Coefficient $\lambda$
An ablation study is conducted to analyze the influence of the weighting coefficient $\lambda$ in the combined loss function $\mathcal{L}_{total} = (1-\lambda)\,\mathcal{L}_{CE} + \lambda\,\mathcal{L}_{BRL}$. The experiment uses apex frame inputs on the CNN backbone and is evaluated on the CASME II dataset. The parameter is tested with values of $\lambda \in \{0, 0.3, 0.5, 0.7, 0.9\}$, where larger values indicate greater emphasis on the batch regularization loss component.
As shown in Table 4, the optimal performance is achieved at $\lambda = 0.5$, where the model attains 56.59% Acc, 43.07% UAR, and 41.89% UF1. Compared to the baseline without the BRL module ($\lambda = 0$), this configuration yields improvements of 8.53% in Acc, 2.73% in UAR, and 2.70% in UF1. This balanced weighting allows the model to maintain strong classification capability while effectively utilizing the dataset-balancing regularization provided by the BRL module. When $\lambda$ decreases to 0.3, the regularization effect is insufficient to adequately address feature distribution discrepancies across source domains, resulting in suboptimal performance. Conversely, when $\lambda$ increases to 0.7 or 0.9, the excessive emphasis on dataset alignment compromises the model's discriminative power for micro-expression classification, leading to significant performance degradation, with the largest setting ($\lambda = 0.9$) performing even worse than the baseline.
The performance trend across different $\lambda$ values, rising to an optimum at $\lambda = 0.5$ and then declining sharply at $\lambda = 0.7$ and $\lambda = 0.9$, reveals a critical trade-off in domain generalization. A low $\lambda$ provides insufficient regularization to align features across domains, limiting generalization, while a high $\lambda$ overly suppresses discriminative power for emotion classification, causing performance to fall below the baseline. These results confirm that an appropriate balance between classification learning and domain-invariant feature learning is essential for effective cross-dataset micro-expression recognition. The $\lambda = 0.5$ configuration achieves this optimal trade-off, and its equal weighting (1:1) between the cross-entropy loss and the BRL loss is adopted in all subsequent experiments.
4.4.2. Analysis with Apex Frame Inputs
This section presents the experimental results under the apex frame input setting, evaluating cross-dataset recognition performance across different datasets and backbone architectures, as well as the effect of the BRL module. Table 5 summarizes the comprehensive comparison, where bold values indicate the best performance for each dataset under each evaluation metric.
At the dataset level, CASME achieves the highest recognition Acc among all test configurations, reaching 69.88% with both CNN+BRL and ResNet+BRL. This superior performance can be attributed to the exclusion of CASME during training, which results in the largest composite training set among all test scenarios due to CASME’s relatively small sample size. In contrast, SMIC consistently exhibits the lowest recognition accuracy across all configurations, a trend that persists throughout the subsequent experimental results. This performance gap stems from two primary factors. First, SMIC has a relatively small training sample size, which limits the amount of representative data available for model optimization. Second, and more critically, SMIC does not provide explicit apex frame annotations. The use of detected apex frames—defined as the frame with the largest motion magnitude within the sequence—introduces potential temporal misalignment, particularly affecting optical flow features that rely on precise onset–apex timing. These limitations collectively hinder the model’s ability to learn robust and well-aligned features from this dataset.
Regarding backbone architectures, CNN demonstrates the most substantial improvements when integrated with the BRL module, with average performance increases of 6.95% in Acc, 6.62% in UAR, and 8.70% in UF1. Both CNN and ResNet significantly outperform PoolFormer across most evaluation metrics. Overall, the CNN architecture combined with the BRL module delivers the best performance across most datasets (except CASME), demonstrating its superior capability in handling cross-dataset micro-expression recognition tasks when using apex frames as input.
The incorporation of the BRL module effectively enhances model performance across all backbone networks. The most notable improvement is observed in the CNN architecture, where the addition of BRL boosts the average accuracy from 52.28% to 59.23%. This demonstrates the effectiveness of the proposed batch regularization learning in addressing feature distribution discrepancies across datasets, particularly for architectures with moderate complexity that are well-suited for the scale of available micro-expression data.
4.4.3. Analysis with Optical Flow Inputs
This section evaluates the effectiveness of temporal motion features for cross-dataset micro-expression recognition, using optical flow sequences computed between onset and apex frames via the RAFT algorithm [41]. Table 6 presents the comprehensive results, with bold values indicating the best performance for each dataset under each evaluation metric.
Optical flow inputs significantly outperform the apex frame inputs analyzed in Section 4.4.2 (Table 5) across all evaluation metrics and backbone architectures. This performance advantage demonstrates that temporal motion patterns capture more robust and transferable characteristics for cross-dataset recognition than spatial appearance features from single frames. Among the architectures, PoolFormer exhibits the most substantial improvement with optical flow inputs, achieving performance increases of over 10% on each metric. The incorporation of the BRL module further enhances model performance with optical flow inputs across all backbone networks. The CNN architecture achieves the most consistent improvements, with performance gains of 3.76% in Acc, 4.26% in UAR, and 3.59% in UF1 after BRL integration. While the PoolFormer+BRL configuration delivers particularly competitive results on the SMIC and SAMM datasets, CNN+BRL achieves superior performance in the remaining cases.
These findings collectively demonstrate that optical flow features provide more discriminative temporal representations than single apex frames for cross-dataset micro-expression recognition. The superior generalizability of optical flow stems from its inherent robustness to cross-domain variations. Unlike appearance features from single apex frames, which are sensitive to illumination and camera differences, optical flow encodes relative motion, thereby normalizing dataset-specific biases. Micro-expressions are transient motion events, and optical flow directly captures the dynamics of facial muscle activations over time, which are more consistent across data collection setups. Consequently, models learn transferable motion patterns rather than appearance artifacts, leading to stronger generalization to unseen domains. Moreover, the BRL module effectively enhances model performance with different input types and architectural designs, confirming its robustness and general applicability across varied experimental conditions.
4.4.4. Ablation of Data Augmentation
This section evaluates the proposed AU-guided data augmentation strategy, which generates synthetic samples based on clustered AU intensity patterns for each micro-expression category. As the augmentation operates on single-frame AU characteristics, we validate its effectiveness using apex frame inputs across the three backbone architectures with BRL module integration. Table 7 presents the results with data augmentation, where bold values indicate the best performance for each dataset under each evaluation metric.
When compared with the non-augmented results in Table 5, the integration of AU-guided data augmentation yields significant performance gains, with particularly notable improvements in the UAR and UF1 metrics. This trend appears across all three backbone architectures. The CNN+BRL configuration with augmentation demonstrates remarkable gains on the MMEW dataset, achieving 6.40% higher UAR and 11.02% better UF1 compared to the non-augmented case. At the architecture level, ResNet+BRL+Augmentation shows the most substantial average improvements with 3.95% higher UAR and 5.28% better UF1, followed by PoolFormer and CNN. The pronounced enhancement in UAR and UF1, metrics specifically designed to evaluate performance on imbalanced datasets, confirms that the augmentation strategy effectively mitigates inter-dataset sample quantity disparity by generating representative samples that better capture characteristic AU patterns for each emotion category.
These results demonstrate that the AU-guided data augmentation effectively addresses dataset imbalance in cross-dataset micro-expression recognition. The method generates semantically consistent samples by leveraging statistically derived AU intensity centroids from real data distributions. This approach maintains computational efficiency through single-pass generation via the ULME-GAN network while specifically targeting class imbalance at the AU level. By preserving essential facial action characteristics of each emotion category, the augmentation diversifies training data while ensuring feature authenticity.
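The centroid-extraction step can be sketched with an off-the-shelf clustering routine; the cluster count and the scikit-learn choice below are assumptions of this illustration, not values reported by the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def au_weight_centroids(au_intensities: np.ndarray, n_clusters: int = 3) -> np.ndarray:
    """Cluster per-sample AU intensity vectors of one emotion category
    (pooled across source datasets) and return the centroids that serve
    as average AU weights for conditioning the generator.

    au_intensities: (num_samples, num_aus) array of AU intensities.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(au_intensities)
    return km.cluster_centers_  # (n_clusters, num_aus) AU weight vectors
```

The returned centroids would then be passed to ULME-GAN [21] as conditioning signals; the GAN's conditioning interface is not reproduced here.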
4.4.5. Ablation Comparison and Visualization
To comprehensively evaluate the overall effectiveness of the proposed modules, we analyze the average cross-dataset recognition performance across all five datasets using the CNN backbone, which has demonstrated superior performance in previous experiments. Table 8 presents the ablation results of different module combinations, revealing that both the BRL module and data augmentation contribute significantly to performance improvement.
The results in Table 8 indicate a clear performance hierarchy among different configurations. The baseline model using apex frames without any proposed modules achieves 52.28% Acc, 40.13% UAR, and 36.46% UF1. Incorporating the BRL module brings substantial improvements, increasing these metrics to 59.23%, 46.75%, and 45.16%, respectively. The addition of data augmentation further enhances performance, particularly on UAR and UF1, which rise to 49.10% and 47.83%. Notably, optical flow features outperform apex frames across all configurations, with the combination of optical flow inputs and the BRL module achieving the best overall performance at 63.50% Acc, 53.63% UAR, and 53.07% UF1.
We further investigate the impact of BRL through feature visualization. For each trained model, test samples are processed to extract features, which are then reduced to a 2D space using Principal Component Analysis (PCA). A logistic regression classifier is trained on these reduced-dimension features to identify and plot decision boundaries. Figure 7 presents the visualization results under four different configurations.
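The visualization procedure maps onto standard tooling; a sketch using scikit-learn and matplotlib (plot styling is ours) is:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def plot_feature_boundaries(feats: np.ndarray, labels: np.ndarray):
    """Project features to 2D with PCA and draw the decision regions of a
    logistic regression classifier fit on the projected points."""
    z = PCA(n_components=2).fit_transform(feats)
    clf = LogisticRegression(max_iter=1000).fit(z, labels)
    xx, yy = np.meshgrid(
        np.linspace(z[:, 0].min() - 1, z[:, 0].max() + 1, 300),
        np.linspace(z[:, 1].min() - 1, z[:, 1].max() + 1, 300),
    )
    regions = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, regions, alpha=0.2)      # decision regions
    plt.scatter(z[:, 0], z[:, 1], c=labels, s=8)  # projected test features
    plt.show()
```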
The visualization reveals distinct patterns in feature learning. For apex frame inputs (Figure 7a), the feature distribution appears chaotic without BRL, making it difficult to establish clear decision boundaries. Although BRL integration (Figure 7b) brings a moderate improvement in feature clustering, significant inter-class mixing persists, indicating the limited discriminative capacity of spatial features alone in cross-dataset scenarios. In contrast, optical flow inputs (Figure 7c,d) demonstrate better inherent separability, with the BRL module further enhancing class discrimination. Particularly in Figure 7d, the feature clusters become more compact and well-separated, as highlighted by the red circles, indicating that the combination of optical flow features and BRL regularization effectively learns discriminative and dataset-invariant representations. This advantage stems from the fundamental characteristic that optical flow features directly encode the temporal dynamics of facial muscle movements, which captures the essential nature of micro-expressions more effectively than static texture features from apex frames. Furthermore, motion patterns exhibit greater robustness to inter-dataset appearance variations such as illumination and subject demographics.
These experimental results demonstrate that the proposed methods effectively address feature distribution inconsistency and data imbalance in cross-dataset micro-expression recognition, with optical flow features providing superior temporal representations and the BRL module enhancing feature discriminability across diverse dataset distributions.
4.5. Comparison with Other Methods
Due to the relatively limited research on cross-dataset micro-expression recognition and the inconsistent use of evaluation datasets across existing studies, we compare our method with available state-of-the-art approaches on the same tested datasets. Since most existing methods primarily report Acc and UAR with one decimal place of precision, Table 9 presents our best results rounded to one decimal place for comparison. Bold values indicate the best performance for each metric across different datasets.
As shown in Table 9, our method achieves superior performance across most datasets and evaluation metrics. On the CASME II dataset, our approach attains 73.6% Acc and 68.9% UAR, surpassing the best existing methods by 7.4% in Acc compared to DR [34] and by 4.6% in UAR relative to ATNet [31]. Similarly, on the SAMM dataset, our method reaches 67.8% Acc and 53.2% UAR, representing an 11.9% improvement in Acc over RNMA [42] and a 7.4% gain in UAR compared to ATNet [31]. However, on the SMIC dataset, our method does not achieve the best performance, obtaining 51.2% Acc and 46.6% UAR. This limitation can be attributed to the absence of precise apex frame annotations in SMIC: optical flow features must be extracted using detected rather than precisely annotated apex frames, potentially introducing temporal misalignment that compromises feature quality.
Table 9. Comparison with state-of-the-art methods (Acc and UAR). Bold values indicate the best results.
| Method | SMIC Acc | SMIC UAR | CASME II Acc | CASME II UAR | SAMM Acc | SAMM UAR |
|---|---|---|---|---|---|---|
| LBP-TOP [17] | - | - | 23.2 | 31.6 | 33.8 | 32.7 |
| 3DHOG [17] | - | - | 37.3 | 18.7 | 35.3 | 26.9 |
| HOOF [17] | - | - | 26.5 | 34.6 | 44.4 | 34.9 |
| DR [34] | **54.9** | **54.7** | 66.2 | 49.6 | - | - |
| D3DCNN [43] | - | - | 44.7 | - | 36.9 | - |
| TFMVN [44] | - | - | 45.5 | 36.7 | - | - |
| ELRCN [45] | - | - | 38.4 | 32.2 | 48.5 | 38.2 |
| RN [26] | - | - | 57.8 | 33.7 | 54.4 | 44.0 |
| RNMA [42] | - | - | 58.4 | 34.1 | 55.9 | 42.7 |
| RSTR [33] | 45.1 | - | 56.2 | - | - | - |
| ATNet [31] | - | 52.3 | - | 64.3 | - | 45.8 |
| Ours | 51.2 | 46.6 | **73.6** | **68.9** | **67.8** | **53.2** |
These results demonstrate that the proposed BRL module and AU-guided data augmentation effectively address the core challenges in cross-dataset micro-expression recognition: feature distribution inconsistency and data imbalance. Notably, these improvements are achieved through targeted methodological innovations rather than increased architectural complexity. The performance superiority across datasets confirms that our approach learns discriminative and dataset-invariant feature representations, demonstrating both effectiveness and strong generalizability for micro-expression recognition.
5. Conclusions and Discussion
Cross-dataset micro-expression recognition presents a critical challenge for real-world applications, where models must maintain performance when encountering data from previously unseen sources. To address this challenge systematically, this study establishes a rigorous evaluation paradigm using five publicly available spontaneous micro-expression datasets under a leave-one-dataset-out protocol. This framework reveals two fundamental difficulties in cross-dataset scenarios: feature distribution inconsistency across source domains and inherent dataset imbalance. We propose targeted solutions for each challenge and demonstrate state-of-the-art performance across multiple evaluation metrics.
Through comprehensive experimental analysis and visualization, we identify that feature distribution shifts across datasets cause models to learn dataset-specific biases rather than generalizable micro-expression characteristics. Additionally, the natural imbalance in sample quantities among different datasets introduces training biases that further degrade cross-dataset generalization capability. These interrelated issues significantly impact model performance in practical cross-dataset applications. The core of the problem lies in the model’s inefficient information acquisition strategy. It tends to overfit to high-entropy but domain-specific signals, while underutilizing the essential, low-entropy stable emotional cues that are consistent across domains.
To address feature distribution inconsistency, we propose BRL as a plug-and-play learning strategy that enhances model generalization by explicitly balancing its attention across multiple source domains. This approach adaptively adjusts feature importance during training to encourage domain-invariant representation learning. The BRL module functions as an information flow regularizer, minimizing the entropy in attention distribution across domains and promoting a more equitable extraction of information from all available sources. The modular design enables integration into various backbone architectures and facilitates straightforward transfer to similar cross-domain recognition tasks. For the dataset imbalance problem, we introduce an AU-guided data augmentation strategy that generates semantically consistent samples based on clustered AU intensity patterns. Importantly, these solutions achieve significant performance improvements on three conventional backbone networks (CNN, ResNet, and PoolFormer), demonstrating that our methods effectively address the core challenges without relying on architectural complexity.
Despite the promising results, several limitations remain. Performance variations across different test datasets indicate that more universal micro-expression representations need to be developed. Future work will focus on advanced temporal feature learning techniques and domain generalization methods to create more robust cross-dataset recognition systems. The exploration of self-supervised learning paradigms and the integration of physiological prior knowledge also present promising directions for enhancing cross-dataset micro-expression recognition.