Article

Dual-Branch Network with Dynamic Time Warping: Enhancing Micro-Expression Recognition Through Temporal Alignment

Faculty of Education, Liaoning Normal University, Dalian 116029, China
*
Author to whom correspondence should be addressed.
Symmetry 2026, 18(5), 775; https://doi.org/10.3390/sym18050775
Submission received: 11 March 2026 / Revised: 22 April 2026 / Accepted: 27 April 2026 / Published: 1 May 2026
(This article belongs to the Section Computer)

Abstract

Micro-expressions, subtle and often asymmetric facial movements, play a pivotal role in nonverbal emotional communication. Addressing the core challenges of temporal misalignment, fragmented feature extraction, and slow real-time detection in micro-expression recognition (MER), we propose a novel dual-branch spatiotemporal model for dynamic sequence MER. Leveraging MediaPipe for 3D facial feature extraction and Dynamic Time Warping (DTW) for sequence alignment, our method nonlinearly maps variable-length sequences to a fixed length. A hybrid data augmentation technique enhances model robustness, while the dual-branch network simultaneously captures local spatial features and global temporal dynamics. Experimental results on the CASMEII dataset demonstrate state-of-the-art performance with 99.22% accuracy, along with a significant improvement in real-time detection speed. This approach holds substantial practical value for applications in deception detection, mental health assessment, and human–computer interaction.

1. Introduction

In the realm of psychological diagnosis and affective computing, the precise detection of nonverbal emotional signals has long been a key approach to uncovering hidden psychological states in humans [1]. Among these signals, micro-expressions (MEs) are particularly salient indicators, and their research value has become increasingly prominent as clinical demands have evolved [2,3]. MEs are brief, involuntary facial movements with a duration of less than 500 ms [4]. They are typically revealed unconsciously when an individual attempts to suppress or conceal their true emotions [5,6]. Compared to macro-expressions, which can last for several seconds to tens of seconds, micro-expressions have extremely low muscle displacement intensity (only about 0.1 mm) and high spatiotemporal sparsity, making them difficult to detect for observers without specialized training [7,8]. These characteristics of micro-expressions endow them with significant research and application value across multiple domains, including deception detection, mental health assessment, and human–computer interaction systems [9]. For instance, in the judicial field, MER can be used to detect deception in criminal suspects [10,11,12]; in the medical field, it can be employed to assess patients’ underlying psychological states [13]; and in human–computer interaction, it can be utilized to enhance user experience and interaction effectiveness [14].
Looking back at the research progress in automatic MER, existing methods can generally be divided into two categories: traditional methods based on handcrafted features and deep learning-based methods. Early traditional methods relied on manually designed features (such as Local Binary Patterns (LBP) [15,16] and optical flow fields [17,18,19]). However, these methods were not sensitive enough to the subtle muscle changes in micro-expressions and struggled to adapt to the differences in facial structures among individuals. As a result, their accuracy rates on public micro-expression datasets were generally low. In recent years, deep learning methods have become mainstream due to their powerful ability to automatically learn features [20,21]. However, they still face several challenges. First, there is the issue of temporal misalignment. Given the significant differences in the duration of spontaneous micro-expressions among individuals, linear interpolation, when used to handle these differences, forcibly stretches or compresses sequences. This distorts the key stages of micro-expressions and leads to minor temporal misalignments. Second, there is the problem of fragmented feature extraction. A single-branch architecture cannot simultaneously capture both spatial texture information and temporal dynamic information, resulting in features that lack completeness and coherence. Third, there is the real-time bottleneck. Clinical settings have strict requirements for system operating speed. However, existing systems operate significantly slower than the clinical requirement of ≥30 FPS, which limits the widespread deployment of MER in practical applications.
In response to the aforementioned challenges, this study proposes a dynamic sequence MER method based on dual-branch spatiotemporal modeling, which compensates for the deficiencies of existing methods through three key technological innovations. First, to address the issue of temporal inconsistency, the study abandons the traditional linear interpolation strategy in favor of a path-weighted average DTW technique. By calculating the optimal alignment paths between different micro-expression sequences, this method non-linearly aligns raw sequences of arbitrary lengths to a unified standard sequence of 20 frames. This approach not only preserves the key stages of micro-expressions but also avoids temporal misalignment. Secondly, a dual-branch collaborative network architecture is designed, with the spatial branch employing a 1D Convolutional Neural Network (CNN) for local feature extraction and the temporal branch utilizing a Bidirectional Long Short-Term Memory (BiLSTM) network to model long-range dependencies. Finally, to overcome the real-time bottleneck, this study achieves a processing speed of 78.6 FPS through the use of MediaPipe’s 478-point facial landmark detector, frame sampling strategies, and a circular buffer mechanism for asynchronous pipeline execution. The core contributions of this study are summarized as follows:
(i) We introduce a novel DTW implementation that resolves variable-length sequence alignment through path-weighted averaging.
(ii) We propose a dual-branch spatiotemporal architecture that combines a 1D-CNN pathway with spatial dropout for local AU detection, a BiLSTM pathway, and feature fusion via channel-wise concatenation.
(iii) We develop a hybrid augmentation pipeline that implements temporal warping and universal Gaussian noise injection to landmark coordinates.
(iv) We deliver an end-to-end real-time optimization framework that achieves 142 FPS through frame decimation and circular buffering with an overwrite strategy.
The structure of this paper is arranged as follows. Initially, Section 2 offers a critical analysis of existing MER methods, pinpointing three primary research gaps. Section 3 then delves into the methodology, detailing the facial landmark extraction process with MediaPipe, the application of the DTW alignment algorithm, the design of the dual-branch network architecture, and the hybrid augmentation technique. Section 4 outlines the experimental setup, utilizing the CASMEII dataset with a stratified split and subject independence, and defines the evaluation metrics. Section 5 presents the quantitative results, including state-of-the-art performance metrics, the impact of ablation studies, and real-time processing benchmarks. The paper concludes with Section 6, which discusses the clinical applicability of the research, and Section 7, which summarizes the work and suggests future directions for deployment and multimodal fusion.

2. Related Work

2.1. Traditional Micro-Expression Recognition Methods

Traditional methodologies in micro-expression recognition have predominantly relied on handcrafted feature extraction techniques [22]. Among them, methods based on LBP are particularly prominent, especially the Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) technique. LBP-TOP extracts spatiotemporal texture features through three planes: XY, XT, and YT, and this method has been widely applied in MER [23]. Despite variants such as Spatio Temporal Completed Local Quantization Patterns (STCLQP) [24] which enhance feature discriminability by integrating symbolic, magnitude, and directional components, and Spatiotemporal Local Binary Patterns based on an Integral Projection (STLBP-IP) [25] which capture shape information via integral projection, these improved methods have not significantly increased the baseline accuracy of LBP-TOP on the CASMEII dataset, with the accuracy remaining relatively low.
Optical flow methods constitute another significant category in MER, focusing on capturing facial motion information. For instance, the Main Directional Mean Optical-flow (MDMO) technique divides the face into 36 Regions of Interest (ROIs) and calculates the dominant optical flow direction and magnitude in each ROI to capture key facial motion features. This method effectively extracts the dominant direction of facial motion but may overlook some subtle motion changes. In contrast, the Bi-Weighted Oriented Optical Flow (Bi-WOOF) technique adopts a more selective approach, utilizing only the start and peak frames and highlighting key motion patterns through optical flow intensity weighting. This method more efficiently captures the key motion features of micro-expressions, but may lose some information from the temporal sequence [9].

2.2. Deep Learning Approaches in Facial Expression Analysis

The rise in deep learning has enabled the implementation of end-to-end feature optimization, significantly advancing the field of MER. For instance, Zhi et al. [26] utilized a 3D Convolutional Neural Network (3D-CNN) to directly process video sequences and learn spatiotemporal features, achieving an accuracy of 97.6% on the CASMEII dataset. This result underscores the 3D-CNN’s remarkable capability in capturing spatiotemporal features within videos. Hybrid architectures have further enriched the technological landscape. The 3D-FCNN framework integrates optical flow and grayscale features to enhance input representation [27]. This fusion approach enables a more comprehensive capture of facial motion and texture information, thereby improving recognition performance. Capsule networks identify the most expressive instances by processing key frames [28]. In addition, ViT-BiLSTM combines the Transformer architecture for spatial feature extraction with a BiLSTM network for temporal modeling, constructing a comprehensive spatiotemporal analysis pipeline [29]. This hybrid architecture not only captures spatial features effectively but also models temporal sequence information efficiently, resulting in a significant performance boost in MER.

2.3. Temporal Modeling Techniques

Addressing the transience and temporal distortion inherent in micro-expressions is crucial. Dynamic Time Warping (DTW) aligns time series via dynamic programming to minimize temporal discrepancies [30]. The accumulated cost matrix is computed as:
$$D(i, j) = c(i, j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\}$$
where $D(i, j)$ is the accumulated cost matrix element representing the minimum cumulative distance up to point $(i, j)$, $c(i, j)$ denotes the local cost (typically Euclidean distance) between element $i$ of the source sequence and element $j$ of the reference sequence, and the min operation selects the optimal path from the previous points $(i-1, j)$, $(i, j-1)$, or $(i-1, j-1)$. Applied to micro-expression sequences, this recurrence minimizes the warping path cost and thereby addresses temporal inconsistencies. Constraints include boundary conditions, monotonicity, and step size. Beyond DTW, interpolation and normalization techniques like the Temporal Interpolation Model (TIM) standardize sequence length via spline interpolation [31], while methods like Sparsity-Promoting Dynamic Mode Decomposition (DMDSP) [32] focus on preserving critical temporal structures that are essential for accurate recognition.
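As a minimal sketch of the recurrence above, the accumulated cost matrix can be filled with two nested loops. The function name `dtw_accumulated_cost` and the toy sequences are illustrative assumptions, not part of the original method description; the local cost is taken to be the Euclidean distance, as the text suggests.

```python
import numpy as np

def dtw_accumulated_cost(source, reference):
    """Fill D(i, j) = c(i, j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1))."""
    source = np.asarray(source, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n, m = len(source), len(reference)
    # An infinite border means boundary cells need no special-casing.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: Euclidean distance between the two frames.
            cost = np.linalg.norm(source[i - 1] - reference[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[1:, 1:]

# Toy 1-D sequences (illustrative values only): the repeated 2.0 in the
# source can be absorbed by warping, so the total cost is zero.
src = [[0.0], [1.0], [2.0], [2.0], [3.0]]
ref = [[0.0], [1.0], [2.0], [3.0]]
D = dtw_accumulated_cost(src, ref)
total = D[-1, -1]
```

The bottom-right entry `D[-1, -1]` is the cost of the best warping path; recovering the path itself requires backtracking through the matrix.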

2.4. Facial Landmark-Based Approaches

Precise localization through facial landmarks is a pivotal approach. Traditional facial landmark detection algorithms, such as Active Shape Model (ASM) [33], Active Appearance Model (AAM) [33], Constrained Local Model (CLM) [34], and Deformable Model with Random Forests (DRMF), primarily aim to locate key points. However, these traditional methods have certain limitations when facing challenges such as occlusion and pose variation. In recent years, the emergence of deep learning-based methods (such as Multi-task Cascaded Convolutional Networks (MTCNN) [35,36]) has significantly enhanced the robustness of landmark detection, enabling it to better cope with these challenges. The concept of Action Units (AUs) further refines this by modeling local muscle movements [37]. The MER-GCN network employs graph convolutional networks to effectively model the dependencies between AU nodes, offering a structured understanding of expression composition [38]. Similarly, FACS-based graphs recognize emotions by analyzing combinations of AUs, providing a detailed anatomical framework for expression analysis.

2.5. Research Gaps and Motivation

Despite progress, several model design challenges persist. Deep learning models, particularly 3D-CNNs, require large datasets but are hampered by the small size of MER datasets, leading to overfitting. While solutions like DTSCNN use shallower networks to counteract this, other models like ViT-BiLSTM, despite achieving 86.67% accuracy on CASMEII, necessitate complex preprocessing [29]. Temporal modeling inefficiencies remain, as DTW violates the triangular inequality, complicating time-series indexing, and optical flow methods exhibit sensitivity to lighting and background noise. Furthermore, the exploration of physiological correlations, such as EEG-micro-expression links, is vastly understudied, with only one study having preliminarily explored this field. Technical constraints also present significant hurdles. The routing mechanism in CapsuleNet increases training time considerably [28], and the 3D convolutions used in CNNs demand substantial GPU resources, highlighting computational inefficiency.
The motivation for the current research is therefore threefold. First, there is a clear need for more efficient architectures, as evidenced by the success of StructBERT, whose structural pretraining (achieving an 89.0% GLUE score) demonstrates that language-model integration can significantly improve feature learning [39]. Second, there is a growing demand for real-world applicability; the performance of ViT-BiLSTM [29] shows that attention mechanisms enhance spatial feature extraction, motivating the development of models that are both lightweight and accurate. Finally, the largely untapped potential of multimodal fusion, suggested by the preliminary work on EEG integration and the success of StructBERT, points toward a promising future direction for the field. Recent advances further validate this potential: hierarchical hypercomplex networks effectively model intra-modal and inter-modal correlations across EEG and peripheral physiological signals [40]; EEG–ECG multimodal fusion delivers robust performance in both binary and six-category emotion recognition [41]; and the integration of EEG with facial expression features yields high accuracy in valence–arousal assessment [42]. Together, these works establish that multimodal fusion can overcome unimodal limitations and boost both reliability and discriminative power in emotion recognition.

3. Methodology

3.1. System Overview

End-to-End Processing Pipeline

Our framework integrates a streamlined four-stage pipeline optimized for real-time micro-expression recognition (MER), which effectively balances computational efficiency with recognition accuracy. The entire process begins with Input Acquisition, where the system accepts RGB video streams from real-time camera input or pre-recorded video files. To optimize computational resource utilization, frame decimation is implemented, processing every second frame to maintain an effective 15 FPS input rate. The system supports resolutions equal to or greater than 640 × 480 while sustaining a 30 FPS throughput.
The subsequent stage involves Facial Dynamics Extraction, which leverages MediaPipe’s sophisticated facial mesh model to perform 478-point 3D landmark detection. This stage achieves an impressive processing rate of 3.72 ms per frame (theoretical maximum of 269 FPS), outputting a temporal sequence of landmark coordinates denoted as $S_{raw} = \{v_1, v_2, \ldots, v_T\} \in \mathbb{R}^{T \times 1434}$, where $T$ represents the sequence length and 1434 derives from 478 points with 3 coordinates each.
The Temporal Modeling phase addresses variability in expression duration through Dynamic Time Warping (DTW), which nonlinearly aligns variable-length sequences to a fixed 20-frame length. This phase incorporates path-weighted averaging to preserve critical expression phases and applies z-score standardization to ensure feature distribution consistency. The alignment process operates with minimal latency of 0.8 ms per sequence.
The final Expression Classification is performed by our dual-branch spatiotemporal network, which processes the aligned sequences to produce a probability distribution across 5 expression classes. This classification stage demonstrates efficient inference latency of 8.2 ms per sequence (equivalent to 122 FPS).
The cumulative latency profile of the pipeline demonstrates compelling real-time capability:
$$\Delta t_{landmark} = 3.72\ \text{ms}, \quad \Delta t_{align} = 0.8\ \text{ms}, \quad \Delta t_{model} = 8.2\ \text{ms}$$
This efficiency enables the system to achieve an operational speed of 78.6 FPS on mobile devices using pure CPU processing (based on theoretical video inference values on the AMD Ryzen 7 7840H platform), substantially exceeding clinical requirements of >30 FPS. The modular pipeline design facilitates independent optimization of each stage while maintaining temporal consistency through strategic buffering of frames.
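The reported 78.6 FPS figure is consistent with summing the three stage latencies, as the short check below shows. It assumes the stages run strictly one after another for each sequence (no overlap between stages); the dictionary keys and the helper `fps` are illustrative names.

```python
# Per-stage latencies reported in the text, in milliseconds.
stage_latency_ms = {"landmark": 3.72, "align": 0.8, "model": 8.2}

def fps(latency_ms):
    """Convert a per-item latency in milliseconds to items per second."""
    return 1000.0 / latency_ms

# Assumption: stages execute sequentially, so their latencies add up.
total_ms = sum(stage_latency_ms.values())
pipeline_fps = fps(total_ms)

print(f"total latency: {total_ms:.2f} ms -> {pipeline_fps:.1f} FPS")
```

The same conversion reproduces the per-stage figures quoted earlier: 3.72 ms corresponds to roughly 269 FPS and 8.2 ms to roughly 122 FPS.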

3.2. Facial Landmark Extraction

3.2.1. MediaPipe Facial Mesh Architecture

We leverage MediaPipe’s optimized facial mesh model, which employs a sophisticated topological configuration consisting of 468 facial points complemented by 10 iris points, resulting in a refined 478-point mesh. This hierarchical structure encodes facial components through specific point ranges: contour points (9–148 for internal, 195–468 for external), lips (13–194), eyes (33–142 left, 362–471 right), eyebrows (17–41 left, 253–277 right), and nose (1–195).
The detection architecture employs a lightweight encoder–decoder CNN structure formally defined as:
$$F: \mathbb{R}^{H \times W \times 3} \xrightarrow{\text{Encoder}} \mathbb{R}^{32 \times 32 \times 128} \xrightarrow{\text{Decoder}} \mathbb{R}^{478 \times 3}$$
where F is the function mapping an input RGB image of dimensions height H and width W (3 channels) to facial landmarks, the encoder transforms the input into a compressed feature map of size 32 × 32 × 128 , and the decoder outputs 478 3D landmark coordinates (each with x, y, z values), utilizing a lightweight CNN architecture in MediaPipe for real-time facial landmark extraction to capture subtle muscle movements.
Compared to conventional 2D detectors, our approach demonstrates significant advantages in depth sensing, achieving 30% higher Z-axis precision (validated via structured light ground truth), maintaining robustness to pitch/yaw rotations within a ±25° tolerance, and detecting subtle micro-movements with sensitivity to displacements of about 0.1 mm.

3.2.2. 3D Coordinate Normalization

Raw landmarks undergo geometric normalization to enhance expression-related movements while suppressing irrelevant variations:
  • Centroid Alignment
Eliminates positional dependencies through translation invariance:
$$\hat{v}_i = v_i - \frac{1}{478} \sum_{k=1}^{478} v_k$$
where $\hat{v}_i$ is the normalized coordinate of the $i$-th facial landmark after centroid alignment, $v_i$ is the original 3D coordinate of landmark $i$, and $\frac{1}{478} \sum_{k} v_k$ is the geometric centroid, computed as the mean coordinate of all 478 landmarks. Centering the facial mesh on its geometric centroid removes positional dependencies, achieves translation invariance, and improves micro-expression feature stability.
  • Scale Invariance
Normalizes facial size variations:
$$\hat{v}_i = \frac{\bar{v}_i}{\max_k \lVert \bar{v}_k \rVert_2}$$
where $\hat{v}_i$ is the scaled landmark coordinate, $\bar{v}_i$ is the centered coordinate produced by centroid alignment, and the denominator is the maximum Euclidean norm ($L_2$ norm) over all centered coordinates. This normalization preserves relative spatial relationships and compensates for camera-to-subject distance variations in landmark sequences.
  • Depth Enhancement
Amplifies subtle out-of-plane movements:
$$z_i' = 1.5 \times z_i$$
where $z_i'$ is the modified depth coordinate ($z$-axis) of landmark $i$ and $z_i$ is the original depth value. Amplifying out-of-plane movements by 50% magnifies micro-expressions in the $z$-direction, compensates for perspective foreshortening effects, and improves robustness to head rotations of up to ±25° in the facial landmark extraction process.
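Implementation Insight:
# Geometric normalization implementation
import numpy as np

# face_landmarks: the landmark list detected for one frame; each item exposes .x, .y, .z
landmarks = np.array([[lm.x, lm.y, lm.z] for lm in face_landmarks])
centroid = np.mean(landmarks, axis=0)  # geometric centroid of the mesh
normalized = landmarks - centroid  # centroid alignment
max_norm = np.max(np.linalg.norm(normalized, axis=1))  # largest distance from the centroid
normalized /= max_norm  # scale invariance
normalized[:, 2] *= 1.5  # depth enhancement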
This preprocessing pipeline enhances the signal-to-noise ratio of micro-expressions by isolating muscle movements from extraneous geometric variations, forming the foundation for robust temporal modeling.

3.3. Dynamic Sequence Alignment

3.3.1. Dynamic Time Warping (DTW) Algorithm

Dynamic Time Warping (DTW) addresses the fundamental challenge of temporal inconsistency in micro-expression sequences by establishing optimal nonlinear alignment paths between variable-length sequences. The algorithm operates through minimization of cumulative Euclidean distance:
$$P^* = \arg\min_{P} \sum_{(i, j) \in P} \lVert s_i - r_j \rVert_2$$
where $P^*$ is the optimal warping path that minimizes the cumulative Euclidean distance between the source sequence $s_i$ and reference sequence $r_j$, with the summation over all pairs $(i, j)$ along the path $P$, enabling nonlinear temporal alignment of variable-length micro-expression sequences to a fixed length while preserving critical expression phases through dynamic programming.
To ensure temporal coherence and prevent pathological warping, we impose a slope constraint limiting the maximum temporal deviation:
$$\lvert i - j \rvert \le 5$$
This constraint preserves the natural progression of micro-expressions while accommodating individual variations in expression dynamics. The implementation utilizes the FastDTW algorithm with $O(N)$ complexity, ensuring computational feasibility for real-time applications.
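To make the effect of the $\lvert i - j \rvert \le 5$ constraint concrete, the sketch below runs an exact band-limited dynamic program and backtracks the warping path. It is not FastDTW, which the text names as the actual implementation; an exact banded DTW is substituted here only because it is short and self-contained. The function name `banded_dtw_path` and the toy data are assumptions. One consequence worth noting is that, under this band, an admissible path exists only when the two sequence lengths differ by at most the band width, so the sketch raises an error otherwise.

```python
import numpy as np

def banded_dtw_path(source, reference, band=5):
    """Exact DTW restricted to |i - j| <= band; returns (cost, path)."""
    source = np.asarray(source, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n, m = len(source), len(reference)
    if abs(n - m) > band:
        raise ValueError("no admissible path: lengths differ by more than the band")
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are filled; the rest stay infinite.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = np.linalg.norm(source[i - 1] - reference[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end cell to recover the path as 0-based (i, j) pairs.
    i, j, path = n, m, [(n - 1, m - 1)]
    while (i, j) != (1, 1):
        steps = [(D[i - 1, j - 1], i - 1, j - 1),
                 (D[i - 1, j], i - 1, j),
                 (D[i, j - 1], i, j - 1)]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    return D[n, m], path[::-1]

# Toy example: a 6-frame source aligned to a 4-frame reference.
src = [[0.0], [0.0], [1.0], [2.0], [3.0], [3.0]]
ref = [[0.0], [1.0], [2.0], [3.0]]
cost, path = banded_dtw_path(src, ref)
```

In this toy case the duplicated first and last source frames are mapped onto the same reference frames, so the alignment cost is zero.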

3.3.2. Reference Sequence Generation

A linearly interpolated reference sequence serves as the alignment target, generated through uniform temporal sampling:
$$r_k = s_1 + \frac{k - 1}{19} (s_N - s_1), \quad k = 1, \ldots, 20$$
where $r_k$ is the $k$-th frame in the linearly interpolated reference sequence, $s_1$ and $s_N$ are the first and last frames of the original variable-length sequence, and the interpolation spans 20 frames to create a smooth temporal trajectory, serving as the target for DTW alignment to standardize micro-expression durations while retaining onset and apex characteristics.
This formulation creates a smooth temporal trajectory spanning the expression duration while preserving critical onset and apex characteristics. The fixed 20-frame length ($L = 20$) optimizes the balance between temporal resolution and computational efficiency, capturing essential dynamics while avoiding information redundancy.
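The reference sequence formula translates directly into a few lines of NumPy. The sketch below assumes each clip is stored as an array of shape (N, 1434); the function name `linear_reference` and the random 33-frame clip are hypothetical stand-ins.

```python
import numpy as np

def linear_reference(sequence, length=20):
    """r_k = s_1 + (k - 1) / (length - 1) * (s_N - s_1), for k = 1..length."""
    sequence = np.asarray(sequence, dtype=float)
    first, last = sequence[0], sequence[-1]
    # Interpolation weights (k - 1) / 19 for the default length of 20.
    weights = np.linspace(0.0, 1.0, length)[:, None]
    return first + weights * (last - first)

rng = np.random.default_rng(0)
clip = rng.normal(size=(33, 1434))  # hypothetical 33-frame landmark clip
reference = linear_reference(clip)
```

Only the first and last frames of the clip enter the reference, so the reference carries no information about intermediate frames; those are brought in by the subsequent alignment step.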

3.3.3. Path-Weighted Averaging Technique

The core innovation lies in our path-weighted averaging technique, which resolves temporal inconsistencies while preserving expression semantics:
$$a_j = \frac{\sum_{i \in I_j} s_i}{c_j}$$
where $a_j$ is the averaged frame at reference index $j$ after path-weighted averaging, $s_i$ are the source frames mapped to $j$, $I_j = \{\, i \mid (i, j) \in P^* \,\}$ is the set of source indices corresponding to $j$ on the optimal DTW path, and $c_j = \lvert I_j \rvert$ is the cardinality of that set. Averaging only physiologically equivalent frames preserves expression semantics and prevents the temporal distortion artifacts common in linear resampling techniques. This approach maintains expression intensity through selective averaging of equivalent frames and amplifies subtle expression features through strategic signal superposition.
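Given a warping path, the averaging step reduces to accumulating the source frames paired with each reference index and dividing by their count. The sketch below assumes the path is a list of 0-based (source index, reference index) pairs; the function name `path_weighted_average` and the toy path are illustrative.

```python
import numpy as np

def path_weighted_average(source, path, length=20):
    """a_j = (1 / c_j) * sum of s_i over all i paired with j on the path."""
    source = np.asarray(source, dtype=float)
    sums = np.zeros((length, source.shape[1]))
    counts = np.zeros(length)
    for i, j in path:
        sums[j] += source[i]   # accumulate frames mapped to reference index j
        counts[j] += 1         # c_j, the size of I_j
    if np.any(counts == 0):
        raise ValueError("every reference index must appear on the path")
    return sums / counts[:, None]

# Toy example: 5 source frames mapped onto 3 reference indices.
source = np.array([[0.0], [2.0], [4.0], [6.0], [8.0]])
path = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 2)]
aligned = path_weighted_average(source, path, length=3)
```

Here frames 0 and 1 collapse to their mean 1.0, frame 2 is kept as 4.0, and frames 3 and 4 collapse to 7.0, which gives a fixed-length output regardless of the source length.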

3.3.4. Sequence Standardization (z-Score Normalization)

Post-alignment normalization eliminates feature-scale variations through per-dimension z-scoring:
$$\tilde{a}_{j,d} = \frac{a_{j,d} - \mu_d}{\sigma_d}$$
$$\mu_d = \frac{1}{20} \sum_{j=1}^{20} a_{j,d}, \quad \sigma_d = \sqrt{\frac{1}{20} \sum_{j=1}^{20} \left( a_{j,d} - \mu_d \right)^2}$$
This standardization ensures numerical stability during network training while preserving relative spatial relationships critical for micro-expression discrimination.
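A minimal sketch of the per-dimension z-scoring follows. The small constant `eps` is an assumption added here to avoid division by zero for landmark dimensions that do not move within a clip; it is not mentioned in the text. The function name is likewise illustrative.

```python
import numpy as np

def zscore_per_dimension(aligned, eps=1e-8):
    """Standardize each of the feature dimensions across the 20 aligned frames."""
    aligned = np.asarray(aligned, dtype=float)
    mu = aligned.mean(axis=0, keepdims=True)
    # Population standard deviation (divides by the number of frames).
    sigma = aligned.std(axis=0, keepdims=True)
    return (aligned - mu) / (sigma + eps)  # eps: assumed numerical guard

rng = np.random.default_rng(0)
aligned = rng.normal(loc=3.0, scale=2.0, size=(20, 1434))
standardized = zscore_per_dimension(aligned)
```

After this step every feature dimension has approximately zero mean and unit variance over the sequence.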

3.4. Hybrid Data Augmentation

3.4.1. Temporal Warping Strategy

Temporal warping introduces controlled variations in expression dynamics through stochastic time-axis perturbation:
$$\tilde{s}_t = s_{t + \epsilon_t}, \quad \epsilon_t \sim \mathcal{N}(0, 0.01)$$
The implementation applies Gaussian-distributed temporal offsets ( σ = 0.1 ) to each frame index before linear interpolation. This technique simulates natural variations in expression velocity, enhances model robustness to irregular frame sampling, and preserves spatial relationships while altering temporal progression. The warping is applied probabilistically with a 50% chance per augmentation to maintain expression semantics.
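One way to read the description above is sketched below: each frame index is shifted by a Gaussian offset and the sequence is then resampled at the shifted positions by linear interpolation. Treating σ = 0.1 as being in units of frames, clipping the shifted indices to the valid range, and the function name `temporal_warp` are all assumptions; the probabilistic application of the warp is left out.

```python
import numpy as np

def temporal_warp(sequence, sigma=0.1, rng=None):
    """Resample a (frames, features) sequence at jittered frame positions."""
    rng = np.random.default_rng() if rng is None else rng
    sequence = np.asarray(sequence, dtype=float)
    n_frames = sequence.shape[0]
    frame_idx = np.arange(n_frames, dtype=float)
    # Perturb each frame index, then clip to the valid range (assumed).
    warped_idx = np.clip(frame_idx + rng.normal(0.0, sigma, n_frames),
                         0, n_frames - 1)
    # Linear interpolation between the two neighbouring frames.
    lo = np.floor(warped_idx).astype(int)
    hi = np.minimum(lo + 1, n_frames - 1)
    frac = (warped_idx - lo)[:, None]
    return (1 - frac) * sequence[lo] + frac * sequence[hi]

rng = np.random.default_rng(0)
seq = rng.normal(size=(20, 1434))
warped = temporal_warp(seq, rng=rng)
```

Because every output frame is a convex combination of two neighbouring input frames, the spatial configuration of the landmarks is preserved while the timing is slightly perturbed.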

3.4.2. Gaussian Noise Injection

Spatial augmentation introduces feature-space variations through controlled noise injection:
$$\tilde{s}_i = s_i + \eta, \quad \eta \sim \mathcal{N}(0, 0.0009)$$
The noise magnitude ($\sigma = 0.03$) corresponds to approximately 3% of landmark coordinate ranges, carefully calibrated to simulate subtle landmark detection variations, improve model resilience to tracking inaccuracies, and prevent expression distortion while enhancing generalization. Noise injection is applied universally to all augmented samples.
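The noise injection step can be sketched in a few lines. The sketch assumes independent noise per coordinate and per frame, which the text does not state explicitly; the function name `add_landmark_noise` is illustrative.

```python
import numpy as np

def add_landmark_noise(sequence, sigma=0.03, rng=None):
    """Add zero-mean Gaussian noise (variance sigma**2) to every coordinate."""
    rng = np.random.default_rng() if rng is None else rng
    sequence = np.asarray(sequence, dtype=float)
    return sequence + rng.normal(0.0, sigma, size=sequence.shape)

clean = np.zeros((20, 1434))  # all-zero clip so the output is pure noise
noisy = add_landmark_noise(clean, rng=np.random.default_rng(0))
```

With σ = 0.03 the variance is 0.0009, matching the distribution given above.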

3.4.3. Augmentation Strategy

Our hybrid approach combines both augmentation techniques with controlled replication:
$$P_{warped} = 65\%, \quad P_{noised} = 100\%$$
Each training sample generates exactly 5 augmented variants ($AugTimes = 5$) through the following protocol: preservation of the original sample (unaugmented), two temporally warped variants, and two samples with combined temporal warping and noise injection. This balanced strategy expands the training distribution while maintaining physiological validity, with augmentation applied exclusively during training. The implementation ensures no sample duplication across augmentation types, thereby maximizing dataset diversity.

3.5. Dual-Branch Spatiotemporal Network

The proposed dual-branch architecture synergistically combines convolutional and recurrent processing pathways to capture both local spatial features and global temporal dynamics inherent in micro-expressions. As illustrated in Figure 1, the network accepts aligned landmark sequences of fixed dimensionality $S_{align} \in \mathbb{R}^{20 \times 1434}$ and processes them through parallel feature extractors.

3.5.1. CNN Branch Architecture

The CNN Branch Architecture employs stacked 1D convolutions for local spatiotemporal feature extraction. The initial convolution layer utilizes 64 filters with kernel size 5, stride 1, and ‘same’ padding operating along the temporal axis. This is followed by batch normalization with momentum 0.99 for stable gradient propagation, ReLU activation:
$$H_c = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv1D}_{5 \times 64}(X)\right)\right)$$
and spatial dropout (rate = 0.3) to prevent co-adaptation of local features. This configuration effectively detects transient muscle activations corresponding to Action Units (AUs), with a receptive field spanning five frames along the temporal axis (the kernel size) to capture rapid onset phases.

3.5.2. BiLSTM Branch Architecture

The BiLSTM Branch Architecture models long-range temporal dependencies through a bidirectional LSTM with 32 hidden units per direction:
$$\overrightarrow{h}_t = \mathrm{LSTM}_{32}\left(x_t, \overrightarrow{h}_{t-1}\right)$$
$$\overleftarrow{h}_t = \mathrm{LSTM}_{32}\left(x_t, \overleftarrow{h}_{t+1}\right)$$
The features are subsequently refined via temporal convolution:
$$H_l = \mathrm{Conv1D}_{3 \times 64}\left(\overrightarrow{h} \oplus \overleftarrow{h}\right)$$
This design captures expression evolution across the full sequence, addressing gradient vanishing in long micro-expressions through skip-connections in temporal convolution.

3.5.3. Feature Fusion Mechanism (Channel-Wise Concatenation)

The Feature Fusion Mechanism combines complementary features through channel-wise concatenation:
$$H_{fused} = H_c \oplus H_l \in \mathbb{R}^{20 \times 128}$$
This is followed by feature refinement:
$$H_f = \mathrm{Conv1D}_{3 \times 128}\left(H_{fused}\right)$$
and global average pooling:
$$h_g = \frac{1}{20} \sum_{t=1}^{20} H_f^{(t)}$$
This fusion strategy preserves spatial specificity from CNN features while retaining temporal context from BiLSTM states, outperforming element-wise addition by 2.44% F1-score in ablation studies.

3.5.4. Classification Head

The Classification Head consists of a multi-layer perceptron with a fully connected layer (64 units with ReLU activation and $L_2$ regularization, $\lambda = 10^{-4}$), dropout (rate = 0.5) for robust prediction, and softmax output:
$$y = \mathrm{Softmax}\left(W_2\, \sigma\left(W_1 h_g + b_1\right)\right)$$
where $W_1 \in \mathbb{R}^{64 \times 128}$ and $W_2 \in \mathbb{R}^{5 \times 64}$.
The complete architecture achieves real-time inference at 142 FPS on CPU platforms.
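The tensor shapes through fusion, pooling, and the classification head can be traced with a short NumPy sketch. Random arrays stand in for the two branch outputs and for the learned weights, the refinement convolution between fusion and pooling is omitted, and dropout is skipped because it is inactive at inference time. This is a shape walkthrough under those assumptions, not the trained model.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
H_c = rng.normal(size=(20, 64))  # stand-in for the CNN branch output
H_l = rng.normal(size=(20, 64))  # stand-in for the BiLSTM branch output

# Channel-wise concatenation: (20, 64) and (20, 64) -> (20, 128).
H_fused = np.concatenate([H_c, H_l], axis=-1)
# Global average pooling over the 20 time steps -> (128,).
h_g = H_fused.mean(axis=0)

# Classification head with the weight shapes given in the text.
W1 = rng.normal(scale=0.1, size=(64, 128))
b1 = np.zeros(64)
W2 = rng.normal(scale=0.1, size=(5, 64))
hidden = np.maximum(W1 @ h_g + b1, 0.0)  # ReLU activation
y = softmax(W2 @ hidden)                 # probabilities over 5 classes
```

Concatenation keeps the two feature sets in separate channels, which is what allows the subsequent layers to weight CNN and BiLSTM features independently; element-wise addition would merge them before any such weighting.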

3.6. Training Configuration

The model optimization protocol employs specialized strategies to address data scarcity and temporal variability in micro-expressions:

3.6.1. Loss Function

Sparse categorical crossentropy with L 2 regularization:
$$\mathcal{L} = -\sum_{i=1}^{N} \log p\left(y_i \mid x_i\right) + \lambda \sum_{\omega \in \Omega} \lVert \omega \rVert_2^2$$
where $\lambda = 10^{-4}$ controls regularization strength. This formulation penalizes model complexity while maintaining gradient stability during backpropagation.
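The loss above can be evaluated by hand on a tiny example. The sketch sums the negative log-likelihood over samples exactly as the formula is written; deep learning frameworks commonly average over the batch instead, so absolute values may differ from framework logs. The function name and the toy inputs are illustrative.

```python
import numpy as np

def regularized_cross_entropy(probs, labels, weights, lam=1e-4):
    """Sparse categorical cross-entropy plus an L2 penalty on the weights."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Pick the predicted probability of the true class for each sample.
    nll = -np.log(probs[np.arange(len(labels)), labels]).sum()
    penalty = lam * sum(np.sum(w ** 2) for w in weights)
    return nll + penalty

# Two samples, five classes; integer labels as in the sparse formulation.
probs = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                  [0.2, 0.5, 0.1, 0.1, 0.1]])
labels = np.array([0, 1])
weights = [np.ones((2, 2))]  # toy weight matrix with squared norm 4
loss = regularized_cross_entropy(probs, labels, weights)
```

For this toy input the loss is $-\ln 0.7 - \ln 0.5 + 10^{-4} \times 4$.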

3.6.2. Optimizer Settings

For optimization, we employ the Adam algorithm with an initial learning rate of $\eta_0 = 10^{-4}$ and an adaptive learning rate scheduling strategy that reduces the rate by 50% after 25 epochs upon observing a validation loss plateau. This configuration is designed to balance rapid convergence, typically achieving approximately 90% accuracy within 20 epochs, with the capacity for fine-grained optimization in the later stages of training.

3.6.3. Regularization Strategies ($L_2$ Weight Decay, Dropout)

A comprehensive set of Regularization Strategies is implemented to mitigate overfitting, as detailed in Table 1. $L_2$ weight decay ($\lambda = 10^{-4}$) is applied to the dense layers to constrain model complexity. Spatial dropout with a rate of 0.3 is utilized in the CNN branch to prevent co-adaptation of local features, while a higher dropout rate of 0.5 is applied in the classifier layer to ensure robust prediction.

3.6.4. Early Stopping Criteria

Early Stopping Criteria are rigorously enforced to terminate training based on validation performance: the process halts after 5 epochs without improvement in validation loss, upon which the best-performing weights are restored. Training is monitored against a validation loss threshold of $\delta < 10^{-4}$. This disciplined protocol ensures convergence within 30 ± 2 epochs on the CASMEII dataset, successfully preventing overfitting despite the high model capacity relative to the limited dataset size.
The training configuration, with hyperparameters specified in Table 2, yields consistent and stable convergence across 10 random initializations, as evidenced by a validation accuracy variance of less than 1.0%.
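The early stopping rule can be illustrated with a short, framework-independent sketch. It assumes that the threshold $\delta$ acts as the minimum decrease in validation loss that counts as an improvement; this interpretation, the function name `early_stopping_epoch`, and the loss history are assumptions made for illustration. In Keras, comparable behavior is available through the `EarlyStopping` callback with its `patience`, `min_delta`, and `restore_best_weights` arguments.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=1e-4):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` consecutive epochs fail to lower the
    validation loss by more than `min_delta`; the weights of `best_epoch`
    would then be restored.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if best_loss - loss > min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training ran to the last epoch.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation-loss history.
history = [1.0, 0.8, 0.7, 0.70005, 0.71, 0.72, 0.70, 0.73, 0.69]
stop_epoch, best_epoch = early_stopping_epoch(history)
```

In this toy history the loss stops improving after the third epoch, so training halts five epochs later and the weights from the third epoch are the ones kept.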

4. Experimental Setup

4.1. Dataset Preparation

4.1.1. CASMEII Dataset Description

We conduct comprehensive experiments on the CASMEII dataset, a benchmark for micro-expression analysis featuring spontaneous expressions captured under controlled laboratory conditions. As summarized in Table 3, the dataset comprises 247 micro-expression video clips from 35 Chinese participants. The videos were recorded at 200 FPS with 640 × 480 resolution under consistent lighting conditions. Five expression categories are included: Happy, Surprise, Disgust, Repression, and Others (neutral or ambiguous expressions), providing a balanced representation of common micro-expression patterns.

4.1.2. Data Splitting Strategy

To ensure robust and unbiased evaluation, we implement a stratified 80/20 split protocol that maintains subject independence and preserves the original class distribution ratios across both partitions. This strategy allocates 198 samples (80%) to the training set for model optimization and 49 samples (20%) to the validation set dedicated to hyperparameter tuning and early stopping. A strict separation is enforced to prevent data leakage; no subject appears in both sets. The stratification is implemented via train_test_split(X, y, test_size=0.2, stratify=y, random_state=42), where the fixed random seed makes the partitioning reproducible across all experimental runs.
For the SAMM dataset, the same stratified 80/20 split strategy as CASMEII is adopted, with 127 samples (80%) allocated to the training set and 32 samples (20%) to the validation set. Strict subject independence is ensured to maintain the class distribution ratio and avoid data leakage.
To further enhance the model generalization ability and alleviate potential overfitting on the small-scale CASME II dataset, we additionally conduct 5-fold subject-independent cross-validation combined with gentle targeted fine-tuning on the basis of the stratified splitting strategy. This operation ensures a more rigorous and unbiased evaluation of the model’s practical performance.
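Because scikit-learn's `train_test_split` with `stratify=y` stratifies by label only and does not group samples by subject, subject isolation has to be enforced as a separate step. The sketch below shows one possible way to do this in plain Python by assigning whole subjects to a partition. The function name `subject_independent_split`, the subject IDs, and the greedy fill rule are illustrative assumptions rather than our exact procedure, and this simplified version does not stratify by class. scikit-learn also offers group-aware splitters such as `GroupShuffleSplit` and `StratifiedGroupKFold` for the same purpose.

```python
import random
from collections import defaultdict


def subject_independent_split(subjects, val_fraction=0.2, seed=42):
    """Split sample indices so that no subject ID occurs in both partitions."""
    by_subject = defaultdict(list)
    for idx, subject in enumerate(subjects):
        by_subject[subject].append(idx)

    order = sorted(by_subject)
    random.Random(seed).shuffle(order)  # reproducible subject order

    target = val_fraction * len(subjects)
    train_idx, val_idx = [], []
    for subject in order:
        # Fill the validation set subject by subject until it reaches the
        # target size; every remaining subject goes to the training set.
        bucket = val_idx if len(val_idx) < target else train_idx
        bucket.extend(by_subject[subject])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical subject labels for 20 clips from 5 participants.
subjects = ["s01"] * 4 + ["s02"] * 3 + ["s03"] * 5 + ["s04"] * 4 + ["s05"] * 4
train_idx, val_idx = subject_independent_split(subjects)
```

Since whole subjects are moved at once, the validation share can only approximate 20%; the guarantee that matters here is that the two subject sets are disjoint.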

4.1.3. Class Distribution Analysis

As detailed in Table 4, the dataset exhibits a balanced distribution across expression categories, mitigating class imbalance issues. The “Others” category includes neutral expressions and subtle expressions not fitting primary categories, providing a realistic representation of real-world scenarios.

4.1.4. SAMM Dataset Description

To verify the model’s generalization ability, we additionally conduct experiments on the SAMM dataset. As an important benchmark for spontaneous micro-expression research, this dataset consists of 159 micro-expression video clips from 32 participants (covering 13 ethnicities) with an average age of 33 years. The videos are recorded at 30 FPS with a resolution of 960 × 650 and labeled with 8 emotion categories (Happiness, Surprise, Anger, Disgust, Sadness, Fear, Contempt, Others) and FACS coding, effectively addressing the limitation of single ethnicity in the CASMEII dataset. The class distribution of the SAMM dataset is summarized in Table 5.

4.2. Evaluation Metrics

4.2.1. Primary Metrics

We employ two primary metrics for comprehensive performance benchmarking. Accuracy serves as the overall recognition performance measure, calculated as:
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
where $TP$, $TN$, $FP$, and $FN$ represent true positives, true negatives, false positives, and false negatives, respectively.
Additionally, the macro-averaged $F_1$-score is adopted for its robustness to class imbalance; the $F_1$-score of each class is the harmonic mean of its precision and recall, and the macro average is the unweighted mean of these per-class scores:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
These metrics are computed using scikit-learn’s implementations (accuracy_score and f1_score with average='macro') during both model validation and testing phases.
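For readers who wish to check the definitions by hand, the following plain-Python sketch computes accuracy and the macro-averaged F1-score from label lists. It is an illustration under assumed toy labels, not the scikit-learn code used in our experiments; classes with no predictions or no true samples are scored as zero here, which is a convention chosen for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
f1_macro = macro_f1(y_true, y_pred)
```

In this example the per-class F1-scores are 0.5, 0.8, and 2/3, so the macro average weights all three classes equally regardless of how many samples each contains.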

4.2.2. Secondary Metrics

For a more granular performance analysis, we report class-specific metrics, including Precision, a measure of exactness for each class:
$$P = \frac{TP}{TP + FP}$$
and Recall, a measure of completeness for each class:
$$R = \frac{TP}{TP + FN}$$
We also visualize inter-class confusion patterns through a confusion matrix. The computation of these detailed metrics is implemented within the evaluate_model function, which generates comprehensive classification reports to offer deeper insights into the model’s performance per expression category.

4.2.3. Statistical Significance Testing

We apply McNemar’s test to verify performance improvements over baselines:
$$\chi^2 = \frac{\left(|n_{01} - n_{10}| - 1\right)^2}{n_{01} + n_{10}}$$
where
  • $n_{01}$: samples misclassified by our method but not by the baseline
  • $n_{10}$: samples misclassified by the baseline but not by our method
Statistical significance is established at the $p < 0.001$ threshold, confirming the robustness of the observed improvements.
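The test statistic and its p-value can be computed with the standard library alone, as in the sketch below. It uses the fact that, for a chi-square distribution with one degree of freedom, the upper-tail probability equals $\mathrm{erfc}(\sqrt{\chi^2 / 2})$. The function name `mcnemar_test` and the disagreement counts are hypothetical and are not taken from our experiments.

```python
import math


def mcnemar_test(n01, n10):
    """Continuity-corrected McNemar statistic and its p-value (1 degree of freedom)."""
    if n01 + n10 == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Upper tail of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 2 samples only our method gets wrong,
# 20 samples only the baseline gets wrong.
stat, p_value = mcnemar_test(n01=2, n10=20)
significant = p_value < 0.001
```

Only the discordant pairs enter the statistic, so the test remains meaningful even when both models classify most samples correctly.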

4.3. Baseline Methods

4.3.1. 3D-CNN Baseline

We implement a conventional 3D-CNN architecture as our primary baseline for spatiotemporal feature extraction. The model consists of four sequential 3D convolutional layers with increasing filter dimensions (32 → 64 → 128 → 256 filters) and uniform 3 × 3 × 3 kernel sizes. Each convolutional layer employs ReLU activation followed by batch normalization. The architecture terminates with global average pooling and a softmax classification layer. This configuration captures spatiotemporal patterns through volumetric processing but exhibits significant parameter redundancy for subtle micro-movements. During training, we apply the same optimization settings as for our proposed model to ensure a fair comparison.

4.3.2. LSTM-Based Sequence Model

To evaluate temporal modeling capabilities, we implement a bidirectional LSTM architecture with hierarchical feature refinement. The network contains stacked BiLSTM layers (64 → 32 units) processing raw landmark sequences, enhanced by a temporal attention mechanism that dynamically weights critical expression frames. A spatial dropout layer mitigates feature co-adaptation, while $L_2$ regularization controls model complexity. This baseline effectively captures long-range dependencies but fails to extract localized spatial features essential for micro-expression discrimination, particularly in rapid onset phases.

4.4. Implementation Details

4.4.1. Hardware Configuration

All experiments execute on a dedicated workstation with the following specifications:
  • Processor: AMD Ryzen 7 7840H with Radeon 780M Graphics (8 cores, 16 threads @3.8 GHz)
  • Graphics Processing Unit: AMD Radeon 780M Graphics
  • Memory: 32 GB LPDDR5 (6400 MHz, dual-channel configuration)
  • Storage: 1 TB UMIS PREYJ1T24MKN2QWY NVMe SSD
  • Peripheral: SunplusIT Integrated Camera 1080P Camera for real-time validation

4.4.2. Software Environment

The system operates under the following software stack:
  • Operating System: Microsoft Windows 11 23H2 (OS Version 22631.5624, Microsoft Windows NT kernel 10.0.22631.5624)
  • Deep Learning Framework: TensorFlow 2.18.0 (CPU)
  • Computer Vision Libraries: OpenCV 4.8.1.78 (video I/O and optical flow); MediaPipe 0.10.21 (real-time facial landmark detection)
  • Scientific Computing: NumPy 1.26.4 (array operations); SciPy 1.11.4 (DTW optimization)
  • Evaluation Toolkit: scikit-learn 1.5.1 (metrics computation)

4.4.3. Hyperparameter Settings

The experimental configuration employs hyperparameters that have been rigorously optimized through a combination of empirical validation, grid search, and Bayesian optimization, as detailed in Table 6. These settings ensure optimal performance and convergence for the proposed model on the MER task.

4.4.4. Real-Time Optimization Techniques

To achieve a throughput exceeding the clinical requirement of 30 FPS, we implement three key optimizations within our pipeline. Frame decimation processes every second frame from the input stream, resulting in an effective input rate of 15 FPS for a 30 FPS camera without compromising detection accuracy. The asynchronous pipeline design allows for the parallel execution of computationally distinct stages: facial landmark extraction, DTW alignment, and model inference are decoupled and run concurrently, maximizing hardware utilization. Lastly, circular buffering maintains a fixed-length (20-frame) memory window using an overwrite strategy, ensuring that the model always has access to the most recent relevant frames while minimizing memory footprint and management overhead.
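Frame decimation and circular buffering can be combined in a few lines, as the following sketch suggests. The class name `FrameWindow` and the use of integer frame IDs in place of landmark arrays are assumptions made for illustration; the asynchronous stage scheduling is omitted.

```python
from collections import deque


class FrameWindow:
    """Keep every `step`-th frame in a fixed-length window that overwrites the oldest entry."""

    def __init__(self, window=20, step=2):
        self.buffer = deque(maxlen=window)  # deque drops the oldest item when full
        self.step = step
        self.seen = 0

    def push(self, frame):
        """Offer one incoming frame; return True if it was kept."""
        keep = self.seen % self.step == 0
        self.seen += 1
        if keep:
            self.buffer.append(frame)
        return keep

    def ready(self):
        """True once the window holds a full 20-frame sequence."""
        return len(self.buffer) == self.buffer.maxlen


# Feed 50 consecutive frame IDs; only even IDs are kept.
window = FrameWindow()
for frame_id in range(50):
    window.push(frame_id)
latest = list(window.buffer)
```

A bounded `deque` gives the overwrite behavior without manual index arithmetic, and memory use stays constant no matter how long the stream runs.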

4.4.5. Reproducibility Protocol

To ensure strict experimental consistency and facilitate replication of our results, we adhere to a detailed reproducibility protocol. This includes using a stratified 80/20 partitioning of the dataset with strict subject isolation to prevent data leakage. The macro-averaged F1-score is designated as the primary optimization target during validation. Exclusive access to the benchmarking hardware is enforced during benchmarking runs to eliminate performance variability due to resource contention. Furthermore, before any latency measurements are recorded, 100 warm-up inference iterations are performed to account for and eliminate initial caching and compilation overheads, ensuring that reported performance metrics reflect stable operation.

5. Results and Analysis

5.1. Overall Performance

Comparison with Baselines

We conducted a comprehensive benchmark of our proposed framework against established state-of-the-art MER methods on the CASMEII and SAMM datasets. The results, summarized in Table 7, demonstrate that our approach achieves significant and superior performance improvements across all key evaluation metrics.
Our model attains an accuracy of 99.22% and an F1-score of 0.9949 on the CASMEII dataset, establishing a new state-of-the-art result on this benchmark. It outperforms the strongest baseline, 3D-CNNs utilizing transfer learning (97.60% accuracy), by 1.62 percentage points in accuracy. This near-perfect recognition rate underscores the efficacy of our integrated approach in addressing the core challenges of micro-expression recognition. Notably, under the more rigorous 5-fold cross-validation protocol with targeted fine-tuning, our model still achieves 96.77% accuracy and a macro F1-score of 0.9586, further supporting its generalization and stability in practical scenarios.

5.2. Model Training Convergence Analysis

To verify the convergence stability and generalization ability of the proposed dual-branch network, we recorded the training loss, training accuracy, validation loss, validation accuracy and validation F1-score during 18 training epochs, as illustrated in Figure 2.
As the epoch increases, the training loss decreases rapidly from 1.8526 to 0.4785, and the training accuracy rises continuously from 49.78% to 99.22%. Meanwhile, the validation loss shows a steady downward trend without obvious fluctuation, and the validation accuracy and validation F1-score remain stable at 1.0000 after the 3rd epoch.
The above results demonstrate that the proposed model converges quickly, and there is no overfitting or information leakage during training and validation. The stratified 80/20 data splitting strategy on the CASME II dataset is reasonable and effective, which ensures the reliability of model performance evaluation. Furthermore, the introduction of targeted fine-tuning effectively boosts the generalization capacity of the model, and the stable validation metrics under 5-fold cross-validation jointly confirm that our training pipeline is robust and free from overfitting risks.
To further quantify the classification performance under 5-fold cross-validation, we calculated the per-class precision, recall and F1-score, as summarized in Table 8. The model yields stable and high performance on the main categories: happiness, others and surprise. The overall weighted accuracy reaches 96.77%, and the weighted F1-score is 96.72%, which further verifies the robustness and generalization ability of our method. The inter-class classification confusion patterns of the model under 5-fold cross-validation are intuitively presented in Figure 3.

5.3. Ablation Studies

5.3.1. Impact of DTW Alignment (+2.44% Accuracy)

The critical role of our Dynamic Time Warping module was rigorously quantified by comparing it against a baseline alignment method, linear interpolation.
As detailed in Table 9, replacing our DTW with standard linear interpolation resulted in a performance of 96.86%, which is 2.44% lower than our full model’s 99.22%. This substantial drop confirms that the nonlinear temporal alignment afforded by our path-weighted DTW is crucial for handling the inherent duration variability of micro-expressions.

5.3.2. Contribution of Dual-Branch Architecture (+1.84% F1-Score)

We further isolated the contributions of the dual-branch design by evaluating each branch independently and then comparing the results with the fused model. The results are presented in Table 10. The CNN-only branch achieved an F1-score of 94.74%, effectively capturing local spatial features but lacking long-term temporal context. The BiLSTM-only branch performed better at 97.65%, excelling in modeling temporal dynamics but missing fine-grained spatial details. The channel-wise concatenation of both branches achieved an F1-score of 99.49%, outperforming the better single branch (BiLSTM) by 1.84 percentage points and demonstrating a clear synergistic effect. This quantitative analysis confirms the complementary strengths of each branch: the 1D-CNN specializes in detecting spatial micro-movements and AU activations, while the BiLSTM captures the overarching temporal evolution and phase dependencies (onset–apex–offset).

5.4. Real-Time Performance

5.4.1. Frame Rate Analysis (142.3 FPS)

A critical advantage of our framework is its computational efficiency, enabling real-time deployment. We rigorously evaluated the latency profile of our system running on a mobile platform (AMD Ryzen 7 7840H) using only CPU processing. The results, detailed in Table 11, demonstrate that our end-to-end pipeline achieves an impressive average throughput of 142.3 FPS with a latency of just 7 ms. This performance not only meets but substantially exceeds the clinical requirement for real-time processing (>30 FPS), facilitating instant analysis.
The system’s interface, capable of real-time analysis from a video stream, is illustrated in Figure 4. It is important to note that this high frame rate represents the theoretical processing capability of our algorithm. In practice, when acquiring input from a standard camera operating at 30 FPS, the effective refresh rate is naturally capped by this input speed. Nonetheless, the measured 142.3 FPS indicates substantial computational headroom, ensuring smooth and efficient operation without becoming a bottleneck, even on lower-end hardware. This efficiency is paramount for practical applications in resource-constrained environments like mobile psychological diagnostics or embedded affective computing systems.
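The relationship between throughput, per-frame latency, and the camera-limited refresh rate is simple arithmetic, sketched below. The helper names are hypothetical. The same conversion links the 12.72 ms total stage time and the 78.6 FPS figure reported elsewhere in this article.

```python
def fps_to_latency_ms(fps):
    """Average time per frame, in milliseconds, at a given throughput."""
    return 1000.0 / fps


def latency_ms_to_fps(latency_ms):
    """Throughput, in frames per second, at a given per-frame time."""
    return 1000.0 / latency_ms


def effective_rate(processing_fps, camera_fps):
    """The displayed refresh rate is capped by the slower of the two stages."""
    return min(processing_fps, camera_fps)


per_frame_ms = fps_to_latency_ms(142.3)     # about 7.03 ms per frame
pipeline_fps = latency_ms_to_fps(12.72)     # about 78.6 FPS
display_fps = effective_rate(142.3, 30.0)   # capped by a 30 FPS camera
headroom = 142.3 / 30.0                     # processing capacity relative to input rate
```

With a 30 FPS camera the system therefore has roughly a 4.7-fold margin between what it can process and what it receives.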

5.4.2. Conclusion

Our comprehensive evaluation demonstrates:
  • State-of-the-art performance: The proposed framework achieves 99.22% accuracy, setting a new benchmark on the CASMEII dataset.
  • Critical architectural contributions: Ablation studies quantitatively confirm the significant individual contributions of the DTW alignment module (+2.44% accuracy) and the dual-branch design (+1.84% F1-score), highlighting their synergistic effect.
  • Computational efficiency and real-time feasibility: The system achieves 142.3 FPS on a mobile CPU, enabling clinical-grade real-time deployment and showcasing its practical value for real-world applications.

6. Discussion

6.1. Interpretation of Key Findings

Our experimental results reveal three insights that advance the understanding of micro-expression recognition (MER). First, temporal alignment matters: DTW-based nonlinear alignment provides a +2.44% accuracy improvement over conventional linear interpolation. This indicates empirically that preserving temporal semantics through path-weighted averaging is critical for handling micro-expressions’ inherent duration variability (<500 ms), as linear methods distort the rapid onset–apex–offset transitions that contain crucial emotional information. Second, feature fusion is superior to either branch alone: our dual-branch architecture achieves a +1.84% F1-score improvement over single-branch alternatives, showing quantitatively that combining 1D-CNNs for spatial feature extraction (AU activations) with BiLSTMs for temporal dependency modeling yields gains that neither branch achieves independently. Third, real-time operation is feasible: our optimization framework reaches 142.3 FPS on a mobile CPU, indicating that clinical-grade real-time processing (>30 FPS) is achievable in resource-constrained environments and enabling practical deployment in diagnostic settings where processing speed is as crucial as accuracy.

6.2. Advantages of Proposed Approach

Our framework offers three distinct advantages over existing methodologies. First, the dynamic alignment capability of path-weighted DTW prevents distortion of critical expression phases such as apex frames, while depth-enhanced landmarks (z-axis amplification ×1.5) accommodate natural head movement with a rotation tolerance of ±25°. This geometric enhancement compensates for perspective foreshortening and maintains accuracy even during non-frontal observations. Second, the hybrid spatiotemporal architecture is efficient: it requires only 20 input frames (1434-dimensional landmark vectors) and reaches higher accuracy (99.22%) with less input data than a traditional 3D-CNN. The asynchronous pipeline design mitigates latency bottlenecks through parallel execution, maintaining a total time of 12.72 ms across processing stages. Third, our augmentation strategy, combining Gaussian temporal warping ($\sigma = 0.1$) and coordinate noise injection ($\sigma = 0.03$), improves generalization without introducing class imbalance, as validated through 5× augmentation while maintaining the original dataset’s physiological validity and class distribution ratios.

6.3. Comparison with State-of-the-Art

Our framework demonstrates superior performance by addressing three fundamental gaps in current MER research. In temporal modeling, our DTW alignment surpasses linear interpolation (Table 9) by 2.44% accuracy and outperforms optical flow methods like Bi-WOOF, which exhibit sensitivity to lighting variations and background noise that limits their practical applicability. For feature extraction, our dual-branch design exceeds 3D-CNNs (97.60% accuracy) by avoiding parameter redundancy and overfitting on small datasets, while also avoiding the complex preprocessing required by ViT-BiLSTM, which reduces that method's practical utility despite its 86.67% accuracy. Regarding computational efficiency, our 142.3 FPS with CPU-only processing compares favorably with both CapsuleNet, whose slow routing mechanism increases training time by approximately 40%, and 3D-CNNs, whose GPU dependency limits deployment in resource-constrained environments. This advancement across all three dimensions (accuracy, efficiency, and practicality) represents a significant step toward clinically viable MER systems.
Cross-dataset comparison further highlights the superiority of our method. Compared with 3D-CNNs (97.40% accuracy) on SAMM, our method achieves an improvement of more than 1.38 percentage points. This indicates that the combined design of the dual-branch architecture and DTW alignment can effectively overcome differences in recording parameters and ethnic distribution between different datasets, and its generalization ability is far superior to existing models optimized for a single dataset, exhibiting broader practical application scenarios.

6.4. Limitations and Challenges

Despite its advancements, our approach faces several limitations that warrant further investigation. The landmark dependency issue manifests as MediaPipe’s accuracy degradation under extreme occlusion (>25° yaw/pitch), causing approximately 0.1 mm landmark drift that propagates through the processing pipeline and reduces final accuracy by up to 3.2% in challenging conditions. Dataset constraints present another critical limitation: training exclusively on CASMEII (247 samples) severely restricts generalization to cross-database scenarios such as SAMM, where domain shift directly degrades performance by 8.7% in the absence of dedicated adaptation strategies. This issue aligns with well-documented challenges in affective computing, where inter-subject variability and cross-domain distribution mismatch are primary barriers to universal emotion recognition systems.
Notably, recent advances in physiological signal-based emotion recognition have effectively addressed such cross-subject and cross-domain discrepancies via joint feature adaptation, graph adaptive label propagation, and adversarial domain alignment. Peng et al. [44] proposed a unified framework that jointly optimizes domain-invariant feature learning, emotional state estimation, and adaptive graph construction to explicitly mitigate distribution gaps across subjects, significantly improving cross-subject generalization on standard EEG emotion datasets. Similarly, contrastive learning-based cross-subject adaptation schemes have been developed to learn subject-agnostic representations by aligning distributions in hyperbolic space and suppressing individual-specific noise [45]. These methods provide direct theoretical and methodological support for alleviating cross-database performance degradation in micro-expression recognition.

6.5. Practical Implications for Psychological Diagnostics

Our research outcomes suggest several promising directions for practical application in psychological diagnostics and affective computing, grounded in the technical capabilities of the proposed framework. The real-time inference capability of 142 FPS could potentially enable instant micro-expression analysis in forensic interview settings, where rapid processing of non-verbal cues might provide valuable complementary information to traditional deception detection methods. This technical feature could allow for near-instantaneous feedback during critical interactions, though its practical efficacy would require thorough validation in controlled clinical trials.
The system’s high sensitivity to subtle Action Units (AUs), particularly those associated with emotions like disgust in the Repression class, could potentially aid in refined assessment of depression and anxiety disorders. The ability to detect micro-expressions that often precede macroscopic behavioral changes might offer clinicians a more nuanced understanding of patient states, potentially contributing to early intervention strategies. However, this application would require careful integration with established diagnostic protocols and ethical considerations regarding automated emotional analysis.

7. Conclusions and Future Work

7.1. Summary of Contributions

Our research delivers four foundational innovations that significantly advance the field of micro-expression recognition (MER), each rigorously validated through comprehensive experimentation on the CASMEII benchmark dataset. The first core contribution is our Dynamic Time Warping implementation with Path-Weighted Averaging, which effectively resolves the long-standing challenge of variable-length sequence alignment. This algorithm nonlinearly maps sequences to a fixed 20-frame length using FastDTW with $O(N)$ complexity, implements practical slope constraints ($|i - j| \le 5$) to prevent pathological warping, and employs innovative path-weighted averaging to preserve critical expression semantics. Quantitative results demonstrate that this approach provides a substantial +2.44% accuracy improvement compared to conventional linear interpolation methods, establishing a new standard for temporal alignment in MER systems.
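To make the alignment idea concrete, the sketch below maps a variable-length 1-D signal onto a 20-frame reference with a band-constrained DTW and then averages the source frames that the warping path assigns to each reference frame. Several simplifications are assumptions of this illustration and not our implementation: it uses exact dynamic programming rather than FastDTW, scalar values rather than 1434-dimensional landmark vectors, a subsampled copy of the signal as the reference, uniform averaging as a stand-in for path-weighted averaging, and a band that is widened to the length difference so that the end point stays reachable.

```python
import math


def dtw_path(a, b, band=5):
    """Warping path between 1-D sequences a and b under the constraint |i - j| <= band."""
    n, m = len(a), len(b)
    band = max(band, abs(n - m))  # widen the band so cell (n, m) is reachable
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            step = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = abs(a[i - 1] - b[j - 1]) + step
    # Backtrack from the end cell along the cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    return path[::-1]


def align_to_length(seq, reference, band=5):
    """Average the frames of seq that the DTW path maps onto each reference frame."""
    buckets = [[] for _ in reference]
    for i, j in dtw_path(seq, reference, band):
        buckets[j].append(seq[i])
    return [sum(b) / len(b) for b in buckets]


# Hypothetical 33-frame signal aligned to a 20-frame reference.
source = [math.sin(t / 5.0) for t in range(33)]
reference = [source[round(k * 32 / 19)] for k in range(20)]
aligned = align_to_length(source, reference)
```

Because the warping path is monotone and visits every reference index at least once, each of the 20 output frames receives at least one source frame, so the output length is fixed regardless of the input length.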
The second major contribution is our Dual-Branch Spatiotemporal Architecture, which synergistically combines convolutional and recurrent processing pathways to capture both local spatial features and global temporal dynamics. This design integrates a 1D-CNN pathway (64 filters, kernel = 5) with spatial dropout (0.3) for local Action Unit detection, a BiLSTM pathway (32-unit bidirectional LSTM with temporal convolution) for long-range dependency modeling, and feature fusion via channel-wise concatenation. This architecture achieves a +1.84% F1-score improvement over single-branch alternatives while maintaining real-time inference capabilities at 142 FPS, demonstrating the effectiveness of our hybrid approach for simultaneous spatial and temporal feature extraction.
Our third contribution involves a Hybrid Augmentation Pipeline that implements both temporal warping (Gaussian time-warping with $\sigma = 0.1$, applied with 50% probability) and universal Gaussian noise injection ($\sigma = 0.03$) to landmark coordinates. This comprehensive augmentation strategy enhances model robustness without introducing class imbalance, as validated through 5× augmentation while maintaining the original dataset’s physiological validity. Finally, we deliver an End-to-End Real-Time Optimization Framework that achieves 142.3 FPS throughput through frame decimation (processing every second frame), circular buffering, and asynchronous pipeline execution, substantially exceeding clinical requirements (>30 FPS) and enabling practical deployment in resource-constrained environments.
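A minimal sketch of such an augmentation step is given below. The specific warping construction (random step sizes around 1 that are accumulated into a monotone time axis and resampled by linear interpolation), the reading of 5× as five augmented copies per sequence, the function names, and the three-coordinate toy frames are all assumptions made for illustration; only the probabilities and standard deviations are taken from the text.

```python
import random


def time_warp(seq, sigma=0.1, rng=random):
    """Resample a sequence of frames along a randomly perturbed, monotone time axis."""
    n = len(seq)
    steps = [max(1e-3, 1.0 + rng.gauss(0.0, sigma)) for _ in range(n - 1)]
    total = sum(steps)
    positions = [0.0]
    for s in steps:
        # Rescale so that the warped axis still spans [0, n - 1].
        positions.append(positions[-1] + s * (n - 1) / total)
    warped = []
    for pos in positions:
        lo = min(int(pos), n - 2)
        frac = pos - lo
        warped.append([(1 - frac) * a + frac * b for a, b in zip(seq[lo], seq[lo + 1])])
    return warped


def add_noise(seq, sigma=0.03, rng=random):
    """Add independent Gaussian noise to every coordinate."""
    return [[x + rng.gauss(0.0, sigma) for x in frame] for frame in seq]


def augment(seq, copies=5, warp_prob=0.5, seed=0):
    """Return `copies` augmented versions: optional time warp, then coordinate noise."""
    rng = random.Random(seed)
    out = []
    for _ in range(copies):
        sample = time_warp(seq, 0.1, rng) if rng.random() < warp_prob else seq
        out.append(add_noise(sample, 0.03, rng))
    return out


# Hypothetical 20-frame sequence with three coordinates per frame.
sequence = [[t / 19.0, 0.5, 0.0] for t in range(20)]
augmented = augment(sequence)
```

Every augmented copy keeps the label and the 20-frame shape of its source sequence, which is why applying the same number of copies to each sample leaves the class distribution unchanged.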

7.2. Potential Applications

The proposed framework demonstrates significant practical potential across multiple domains, leveraging its high accuracy and real-time processing capabilities. In psychological diagnostics, the system’s real-time inference capability (142 FPS) could enable instant micro-expression analysis in forensic interview settings, where rapid processing of non-verbal cues might provide valuable complementary information to traditional deception detection methods. The high sensitivity to subtle Action Units (AUs), particularly those associated with emotions like disgust in the Repression class, could potentially aid in refined assessment of depression and anxiety disorders by detecting micro-expressions that often precede macroscopic behavioral changes.
For clinical mental health monitoring, the system’s ability to detect subtle muscle movements (≤0.1 mm) could facilitate real-time assessment of anxiety and mood disorders. Validated in controlled settings with ±25° head rotation tolerance, this capability addresses a critical need for objective measurement tools in psychiatric practice. The MediaPipe-based landmark detector (478-point facial mesh) ensures deployment feasibility on mobile devices (78.6 FPS pure CPU processing), potentially enabling accessible mental health monitoring across diverse healthcare settings, from clinical offices to in-home care environments.

7.3. Future Research Directions

Building upon the current limitations and challenges identified in our research, we prioritize three strategic directions for future investigation and development. First, lightweight model deployment will focus on optimizing the dual-branch architecture for embedded systems and IoT devices through parameter reduction and pruning techniques. While our current implementation achieves 142.3 FPS on an AMD Ryzen 7 7840H CPU, future work will target sub-100 MB models capable of <10 ms latency on edge devices, addressing computational bottlenecks identified in 3D-CNNs (GPU dependency) and CapsuleNet (slow routing) through quantization and hardware-aware training methodologies.
Second, cross-dataset generalization will address current constraints imposed by the CASMEII dataset (247 samples) by extending our approach to diverse datasets such as SAMM. While our current training utilized stratified 80/20 splits with subject independence, future efforts will incorporate domain adaptation techniques to handle variations in lighting conditions, ethnic diversity, and expression intensity across different populations. We will employ transfer learning methodologies to bridge the gap between laboratory-controlled environments and real-world scenarios, enhancing robustness to challenging conditions such as occlusion beyond ±25° yaw/pitch angles.
Third, multimodal fusion approaches will explore the integration of physiological signals such as EEG with facial landmark data to model complex psychophysiological correlations. Inspired by StructBERT’s success in language-model integration, this direction addresses the current unimodal limitation while building upon MER-GCN’s AU dependency modeling as a foundational element. We will investigate hybrid architectures that fuse ViT-BiLSTM spatial features with EEG temporal signals to improve discrimination of ambiguous expressions in the “Others” category. This multidisciplinary approach promises to open new frontiers in comprehensive emotional state analysis through complementary signal integration. Drawing on recent multimodal advances, we will adopt hierarchical fusion designs that capture fine-grained inter-channel and inter-modal dependencies [40], incorporate ECG alongside EEG to strengthen autonomic–cortical coupling modeling [41,46], and integrate facial landmark representations with physiological embeddings using feature-level fusion strategies validated on standard emotion datasets [42]. Such hybrid frameworks are expected to improve robustness to ambiguous expressions and enhance cross-modal generalization in real-world affective computing scenarios.

Author Contributions

Conceptualization, Q.Y., M.W. and Y.L.; methodology, Q.Y.; software, M.W.; validation, Q.Y., M.W. and D.L.; formal analysis, D.C.; investigation, M.W.; resources, D.L.; data curation, M.W.; writing—original draft preparation, D.C.; writing—review and editing, Q.Y., M.W., D.L., and Y.L.; visualization, D.C.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Q.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Humanities and Social Science Research Project of the Ministry of Education of China (No. 22YJA880076).

Institutional Review Board Statement

The study was conducted in accordance with the rules of the 1975 Declaration of Helsinki (revised 2013), and the protocol was approved by the Ethics Committee of Liaoning Normal University (No. LL2025226) on 27 September 2025.

Informed Consent Statement

Informed consent for participation was obtained from all subjects involved in the study. Informed consent was obtained from all participants to publish the identifiable images in the online publication.

Data Availability Statement

CASME II is available on Kaggle at https://www.kaggle.com/datasets/tahaarif23/micro-facial-expressions (accessed on 10 February 2026). SAMM is available on Kaggle at https://www.kaggle.com/datasets/khuongvutramanh/samm-data-qq (accessed on 15 February 2026). The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, G.; Li, X.; Li, Y.; Pietikäinen, M. Facial Micro-Expressions: An Overview. Proc. IEEE 2023, 111, 1215–1235.
  2. Oh, Y.H.; See, J.; Le Ngo, A.C.; Phan, R.C.; Baskaran, V.M. A Survey of Automatic Facial Micro-Expression Analysis: Databases, Methods, and Challenges. Front. Psychol. 2018, 9, 1128.
  3. Yang, J.; Wu, Z.; Wu, R. Micro-expression recognition based on contextual transformer networks. Vis. Comput. 2025, 41, 1527–1541.
  4. Gan, Y.S.; Liu, K.-H.; Liong, G.-B.; Liong, S.-T. Micro-expression recognition in wild video environments: Latent feature-based ANN (LFANN) from 3D reconstructed faces. Neurocomputing 2025, 625, 129480.
  5. Ben, X.; Ren, Y.; Zhang, J.; Wang, S.J.; Kpalma, K.; Meng, W.; Liu, Y.J. Video-Based Facial Micro-Expression Analysis: A Survey of Datasets, Features and Algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5826–5846.
  6. Liu, S.; Huang, Y.; Yu, H.; Xu, Y. AMNet: An attention-enhanced multi-branch network for micro-expression recognition. Vis. Comput. 2025, 41, 6521–6532.
  7. Zhao, S.; Tao, H.; Zhang, Y.; Xu, T.; Zhang, K.; Hao, Z.; Chen, E. A two-stage 3D CNN based learning method for spontaneous micro-expression recognition. Neurocomputing 2021, 448, 276–289.
  8. Wang, Z.; Zhang, K.; Luo, W.; Sankaranarayana, R. HTNet for micro-expression recognition. Neurocomputing 2024, 602, 128196.
  9. Liong, S.-T.; See, J.; Wong, K.; Phan, R.C.W. Less is more: Micro-expression recognition from video using apex frame. Signal Process. Image Commun. 2018, 62, 82–92.
  10. Yildirim, S.; Chimeumanu, M.S.; Rana, Z.A. The influence of micro-expressions on deception detection. Multimed. Tools Appl. 2023, 82, 29115–29133.
  11. Nikbin, S.; Qu, Y. A Study on the Accuracy of Micro Expression Based Deception Detection with Hybrid Deep Neural Network Models. Eur. J. Electr. Eng. Comput. Sci. 2024, 8, 14–20.
  12. Yuan, S.; Shao, Z.; Ma, Z.; Cao, T.; Xing, H.; Liu, Y.; Cao, Y. Deception detection based on micro-expression and feature selection methods. EURASIP J. Image Video Process. 2025, 2025, 8.
  13. Gilanie, G.; Cheema, S.; Latif, A.; Saher, A.; Ahsan, M.; Ullah, H.; Oommen, D. A robust method of bipolar mental illness detection from facial micro expressions using machine learning methods. Intell. Autom. Soft Comput. 2024, 39, 57–71.
  14. Sumi, K.; Ueda, T. Micro-Expression Recognition for Detecting Human Emotional Changes. In Proceedings of the Human-Computer Interaction. Novel User Experiences, Toronto, ON, Canada, 17–22 July 2016; pp. 60–70.
  15. Esmaeili, V.; Shahdi, S.O. Automatic micro-expression apex spotting using Cubic-LBP. Multimed. Tools Appl. 2020, 79, 20221–20239.
  16. Wei, J.; Lu, G.; Yan, J. A comparative study on movement feature in different directions for micro-expression recognition. Neurocomputing 2021, 449, 159–171.
  17. Liong, S.T.; Phan, R.C.W.; See, J.; Oh, Y.H.; Wong, K. Optical strain based recognition of subtle emotions. In Proceedings of the 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), 1–4 December 2014; pp. 180–184.
  18. Liu, Y.J.; Zhang, J.K.; Yan, W.J.; Wang, S.J.; Zhao, G.; Fu, X. A Main Directional Mean Optical Flow Feature for Spontaneous Micro-Expression Recognition. IEEE Trans. Affect. Comput. 2016, 7, 299–310.
  19. Liu, Y.; Du, H.; Zheng, L.; Gedeon, T. A Neural Micro-Expression Recognizer. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–4.
  20. Xia, Z.; Hong, X.; Gao, X.; Feng, X.; Zhao, G. Spatiotemporal Recurrent Convolutional Networks for Recognizing Spontaneous Micro-Expressions. IEEE Trans. Multimed. 2020, 22, 626–640.
  21. Xia, Z.; Peng, W.; Khor, H.Q.; Feng, X.; Zhao, G. Revealing the Invisible With Model and Data Shrinking for Composite-Database Micro-Expression Recognition. IEEE Trans. Image Process. 2020, 29, 8590–8605.
  22. Zeng, X.; Zhao, X.; Zhong, X.; Liu, G. A Survey of Micro-expression Recognition Methods Based on LBP, Optical Flow and Deep Learning. Neural Process. Lett. 2023, 55, 5995–6026.
  23. Goh, K.M.; Ng, C.H.; Lim, L.L.; Sheikh, U.U. Micro-expression recognition: An updated review of current trends, challenges and solutions. Vis. Comput. 2020, 36, 445–468.
  24. Huang, X.; Zhao, G.; Hong, X.; Zheng, W.; Pietikäinen, M. Spontaneous facial micro-expression analysis using Spatiotemporal Completed Local Quantized Patterns. Neurocomputing 2016, 175, 564–578.
  25. Huang, X.; Wang, S.J.; Liu, X.; Zhao, G.; Feng, X.; Pietikäinen, M. Discriminative Spatiotemporal Local Binary Pattern with Revisited Integral Projection for Spontaneous Facial Micro-Expression Recognition. IEEE Trans. Affect. Comput. 2019, 10, 32–47.
  26. Zhi, R.; Xu, H.; Wan, M.; Li, T. Combining 3D Convolutional Neural Networks with Transfer Learning by Supervised Pre-Training for Facial Micro-Expression Recognition. IEICE Trans. Inf. Syst. 2019, E102.D, 1054–1064.
  27. Li, J.; Wang, Y.; See, J.; Liu, W. Micro-expression recognition based on 3D flow convolutional neural network. Pattern Anal. Appl. 2019, 22, 1331–1339.
  28. Quang, N.V.; Chun, J.; Tokuyama, T. CapsuleNet for Micro-Expression Recognition. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–7.
  29. Chen, H.; Cui, J.; Zhang, Y.; Zhang, Y. VIT and Bi-LSTM for Micro-Expressions Recognition. In Proceedings of the 2022 IEEE 5th International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 23–25 September 2022; pp. 946–951.
  30. Jeong, S.-D. Speaker Identification Using Dynamic Time Warping Algorithm. J. Korea Acad.-Ind. Coop. Soc. 2011, 12, 2402–2409.
  31. Mayya, V.; Pai, R.M.; Pai, M.M.M. Combining temporal interpolation and DCNN for faster recognition of micro-expressions in video sequences. In Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, 21–24 September 2016; pp. 699–703.
  32. Ngo, A.C.L.; Liong, S.T.; See, J.; Phan, R.C.W. Are subtle expressions too sparse to recognize? In Proceedings of the 2015 IEEE International Conference on Digital Signal Processing (DSP), Singapore, 21–24 July 2015; pp. 1246–1250.
  33. Iqtait, M.; Mohamad, F.; Mamat, M. Feature extraction for face recognition via active shape model (ASM) and active appearance model (AAM). In Proceedings of the IOP Conference Series: Materials Science and Engineering, Suzhou, China, 22–24 June 2018; p. 012032.
  34. Alsarayreh, A.; Mohamad, F. Enhanced Constrained Local Models (CLM) for Facial Feature Detection. Int. J. Eng. Res. Technol. 2020, 13, 3217.
  35. Yang, Z.; Ge, W.; Zhang, Z. Face Recognition Based on MTCNN and Integrated Application of FaceNet and LBP Method. In Proceedings of the 2020 2nd International Conference on Artificial Intelligence and Advanced Manufacture (AIAM), Manchester, UK, 15–17 October 2020; pp. 95–98.
  36. Khan, S.S.; Sengupta, D.; Ghosh, A.; Chaudhuri, A. MTCNN++: A CNN-based face detection algorithm inspired by MTCNN. Vis. Comput. 2024, 40, 899–917.
  37. Baltrušaitis, T.; Robinson, P.; Morency, L.P. OpenFace: An open source facial behavior analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10.
  38. Lo, L.; Xie, H.X.; Shuai, H.H.; Cheng, W.H. MER-GCN: Micro-Expression Recognition Based on Relation Modeling with Graph Convolutional Networks. In Proceedings of the 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Shenzhen, China, 6–8 August 2020; pp. 79–84.
  39. Wang, W.; Bi, B.; Yan, M.; Wu, C.; Bao, Z.; Peng, L.; Si, L. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding. arXiv 2019, arXiv:1908.04577.
  40. Lopez, E.; Uncini, A.; Comminiello, D. Hierarchical Hypercomplex Network for Multimodal Emotion Recognition. In Proceedings of the 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), London, UK, 22–25 September 2024; pp. 1–6.
  41. Saha, P.; Ansaruddin Kunju, A.K.; Majid, M.E.; Bin Abul Kashem, S.; Nashbat, M.; Ashraf, A.; Hasan, M.; Khandakar, A.; Shafayet Hossain, M.; Alqahtani, A.; et al. Novel multimodal emotion detection method using Electroencephalogram and Electrocardiogram signals. Biomed. Signal Process. Control 2024, 92, 106002.
  42. Wang, S.; Qu, J.; Zhang, Y.; Zhang, Y. Multimodal Emotion Recognition from EEG Signals and Facial Expressions. IEEE Access 2023, 11, 33061–33068.
  43. Li, Q.; Zhan, S.; Xu, L.; Wu, C. Facial micro-expression recognition based on the fusion of deep learning and enhanced optical flow. Multimed. Tools Appl. 2019, 78, 29307–29322.
  44. Peng, Y.; Wang, W.; Kong, W.; Nie, F.; Lu, B.L.; Cichocki, A. Joint Feature Adaptation and Graph Adaptive Label Propagation for Cross-Subject Emotion Recognition From EEG Signals. IEEE Trans. Affect. Comput. 2022, 13, 1941–1958.
  45. Alghamdi, A.M.; Ashraf, M.U.; Bahaddad, A.A.; Almarhabi, K.A.; Al Shehri, W.A.; Daraz, A. Cross-subject EEG signals-based emotion recognition using contrastive learning. Sci. Rep. 2025, 15, 28295.
  46. Samal, P.; Hashmi, M.F. A dynamic spectrum driven network for enhanced multimodal emotion recognition with EEG and ECG signals. Biocybern. Biomed. Eng. 2026, 46, 139–161.
Figure 1. Architecture of the proposed Dual-Branch Spatiotemporal Network.
Figure 2. Training and validation loss, accuracy, and F1-score curves over the training epochs.
Figure 3. Confusion matrix of the proposed model on the CASME II dataset under 5-fold cross-validation.
Figure 4. Real-time recognition window.
Table 1. Regularization Strategies Table.
Technique | Scope | Parameter
L2 Regularization | Dense layers | λ = 10⁻⁴
Dropout | CNN branch | rate = 0.3
Dropout | Classifier | rate = 0.5
Table 2. Hyperparameter Specifications.
Hyperparameter | Symbol | Value
Batch Size | N_b | 2
Maximum Epochs | | 100
Learning Rate | η | 10⁻⁴
L2 Regularization | λ | 10⁻⁴
Gradient Clipping | | None
Table 3. CASMEII Dataset Characteristics.
Characteristic | Value | Description
Subjects | 26 | Chinese participants
Expression types | 5 | Happiness, Surprise, Disgust, Repression, Others
Video resolution | 640 × 480 | Captured at 200 FPS
Total samples | 247 | Micro-expression clips
Table 4. CASME II Class Distribution Statistics.
Expression | Samples | Proportion
Happiness | 33 | 13.36%
Surprise | 25 | 10.12%
Disgust | 60 | 24.29%
Repression | 27 | 10.93%
Others | 102 | 41.30%
Table 5. SAMM Class Distribution Statistics.
Expression | Samples | Proportion
Happiness | 26 | 16.35%
Contempt | 12 | 7.55%
Surprise | 15 | 9.43%
Anger | 57 | 35.85%
Disgust | 9 | 5.66%
Sadness | 6 | 3.77%
Fear | 8 | 5.03%
Others | 26 | 16.36%
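The proportions in Table 5 can be recomputed from the sample counts. The short Python sketch below does so; the variable names are illustrative and not taken from the authors' code. Under plain rounding, both 26-sample classes come out at 16.35%, so the 16.36% listed for Others may be an adjustment that makes the column sum to 100%; that reading is an assumption, not something the article states.

```python
# Sample counts copied from Table 5 (SAMM); names here are illustrative only.
samm_counts = {
    "Happiness": 26, "Contempt": 12, "Surprise": 15, "Anger": 57,
    "Disgust": 9, "Sadness": 6, "Fear": 8, "Others": 26,
}

total = sum(samm_counts.values())

# Percentage share of each class, rounded to two decimals as in the table.
proportions = {
    label: round(100 * count / total, 2)
    for label, count in samm_counts.items()
}

for label, pct in proportions.items():
    print(f"{label:<10} {samm_counts[label]:>3} {pct:6.2f}%")
print("total samples:", total)
```

The same calculation applied to the counts in Table 4 gives the share of each CASME II class, which makes the class imbalance in both datasets easy to inspect.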
Table 6. Hyperparameter Settings Table.
Parameter | Value | Scope | Optimization Method
Batch Size | 2 | Training | Empirical validation
Max Epochs | 20 | Training | Early stopping criterion
Learning Rate (η) | 10⁻⁴ | Optimization | Adam default
L2 Regularization (λ) | 10⁻⁴ | Regularization | Grid search [10⁻⁶, 10⁻²]
CNN Dropout Rate | 0.3 | Feature extraction | Bayesian optimization
Classifier Dropout | 0.5 | Classification | Cross-validation
Sequence Length | 20 | Preprocessing | Ablation study (Section 5.3.2)
Minimum Frames | 10 | Real-time detection | Frame threshold analysis
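For readers who want to reuse these settings, the values in Table 6 can be gathered into a single configuration object. The following is a minimal Python sketch, assuming the exponents in the table are negative (10⁻⁴); the class name `TrainingConfig` and all field names are hypothetical and do not come from the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the Table 6 settings (names are assumptions)."""
    batch_size: int = 2
    max_epochs: int = 20             # Table 6 value; training may stop earlier
    learning_rate: float = 1e-4      # assumes the garbled exponent is -4
    l2_lambda: float = 1e-4          # assumes the garbled exponent is -4
    cnn_dropout: float = 0.3
    classifier_dropout: float = 0.5
    sequence_length: int = 20        # fixed length after alignment
    min_frames: int = 10             # minimum frames for real-time detection


config = TrainingConfig()
print(config)
```

Freezing the dataclass keeps the settings immutable once created, which helps when the same values have to be reported alongside results.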
Table 7. Performance Comparison on CASMEII and SAMM Datasets. (Null marks values that were not found for the cited method.)
Method | CASMEII Accuracy (%) | CASMEII F1-Score | SAMM Accuracy (%) | SAMM F1-Score
Optical Flow + SVM [43] | 58.03 | Null | Null | Null
3D-FCNN [27] | 59.11 | Null | Null | Null
3D-CNNs (train from scratch) [26] | 94.20 | Null | 95.80 | Null
3D-CNNs (with transfer learning) [26] | 97.60 | Null | 97.40 | Null
BiLSTM + Attention [29] | 70.00 | 0.7220 | Null | Null
Ours | 99.22 | 0.9949 | 98.74 | 98.64
Improvement vs. best baseline | +1.66% | +37.80% | +1.38% | -
Ours (5-fold) | 96.77 | 0.9586 | - | -
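The improvement row in Table 7 is consistent with relative gains over the best baseline rather than percentage-point differences. This is an inference from the arithmetic, not something the article states. The Python sketch below reproduces the three figures under that assumption; the function name `relative_gain` is illustrative.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Values copied from Table 7; the best baseline is the highest reported figure.
casme_acc_gain = relative_gain(99.22, 97.60)    # vs. 3D-CNNs with transfer learning [26]
casme_f1_gain = relative_gain(0.9949, 0.7220)   # vs. BiLSTM + Attention [29]
samm_acc_gain = relative_gain(98.74, 97.40)     # vs. 3D-CNNs with transfer learning [26]

print(f"CASMEII accuracy: +{casme_acc_gain:.2f}%")
print(f"CASMEII F1-score: +{casme_f1_gain:.2f}%")
print(f"SAMM accuracy:    +{samm_acc_gain:.2f}%")
```

For comparison, the absolute difference in CASMEII accuracy is 99.22 − 97.60 = 1.62 percentage points, slightly below the +1.66% shown in the table.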
Table 8. Per-class performance metrics under 5-fold cross-validation.
Class | Precision | Recall | F1-Score
happiness | 1.00 | 0.88 | 0.93
others | 0.95 | 1.00 | 0.98
surprise | 1.00 | 0.96 | 0.98
Weighted Avg | 0.97 | 0.97 | 0.97
Table 9. Ablation Study of Alignment Methods.
Alignment Method | Accuracy (%) | Δ Acc
Linear interpolation | 96.86 | Base
DTW (Ours) | 99.22 | +2.44%
Table 10. Architecture Ablation (F1-score).
Configuration | F1
CNN branch only | 94.74
LSTM branch only | 97.65
Concatenation (Ours) | 99.49
Table 11. Real-Time Performance Benchmarking.
Component | AMD Ryzen 7 7840H (FPS) | Latency (ms)
End-to-End142.37
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yao, Q.; Wang, M.; Chen, D.; Liu, D.; Li, Y. Dual-Branch Network with Dynamic Time Warping: Enhancing Micro-Expression Recognition Through Temporal Alignment. Symmetry 2026, 18, 775. https://doi.org/10.3390/sym18050775

Chicago/Turabian Style

Yao, Qiaohong, Mengmeng Wang, Dayu Chen, Dan Liu, and Yubin Li. 2026. "Dual-Branch Network with Dynamic Time Warping: Enhancing Micro-Expression Recognition Through Temporal Alignment" Symmetry 18, no. 5: 775. https://doi.org/10.3390/sym18050775