Article

Optical Flow Magnification and Cosine Similarity Feature Fusion Network for Micro-Expression Recognition

Heyou Chang, Jiazheng Yang, Kai Huang, Wei Xu, Jian Zhang and Hao Zheng
1 School of Information Engineering, Nanjing Xiaozhuang University, Nanjing 211171, China
2 School of Computer Engineering, Jiangsu Ocean University, Lianyungang 222005, China
3 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(15), 2330; https://doi.org/10.3390/math13152330
Submission received: 18 June 2025 / Revised: 9 July 2025 / Accepted: 21 July 2025 / Published: 22 July 2025
(This article belongs to the Special Issue Representation Learning for Computer Vision and Pattern Recognition)

Abstract

Recent advances in deep learning have substantially improved micro-expression recognition, yet most existing methods process the entire facial region holistically and struggle to capture subtle variations in facial action units, which limits recognition performance. To address this challenge, we propose the Optical Flow Magnification and Cosine Similarity Feature Fusion Network (MCNet). MCNet introduces a multi-facial action optical flow estimation module that integrates global motion-amplified optical flow with localized optical flow from the eye and mouth–nose regions, enabling precise capture of facial expression nuances. Additionally, an enhanced MobileNetV3-based feature extraction module, incorporating Kolmogorov–Arnold networks and convolutional attention mechanisms, effectively captures both global and local features from optical flow images. A novel multi-channel feature fusion module leverages cosine similarity between Query and Key token sequences to optimize feature integration. Extensive evaluations on four public datasets (CASME II, SAMM, SMIC-HS, and MMEW) demonstrate MCNet’s superior performance, achieving state-of-the-art results with 86.30% UF1 and 92.88% UAR on the composite dataset, surpassing the best prior method by 1.77% in UF1 and 6.0% in UAR.

1. Introduction

Facial expressions represent a fundamental form of nonverbal communication, conveying human emotions. These expressions can be divided into two categories: macro-expressions, which are readily observable due to their pronounced facial movements, and micro-expressions, which are subtle and transient, lasting between 1/25 and 1/5 of a second. Their brevity renders micro-expressions challenging to detect visually without specialized tools [1,2,3]. Nevertheless, micro-expressions can reveal authentic emotions, making their study significant for applications in psychology, medicine, and criminal investigation [4,5]. For instance, micro-expression recognition offers valuable support in psychological counseling. In the critical domain of children’s mental health, clinicians often face difficulties in obtaining accurate data due to patients’ subjective suppression of emotions, which can lead to misdiagnosis. Deep learning-based micro-expression recognition serves as an effective auxiliary tool, enhancing diagnostic accuracy by addressing this challenge.
Research on micro-expressions typically encompasses two primary areas: micro-expression spotting [6] and micro-expression recognition [1]. Spotting involves identifying the most informative frame within a micro-expression video sequence; specifically, the frame exhibiting the most significant facial movement, which encapsulates critical micro-expression details. Recognition, conversely, focuses on classifying micro-expression data to determine the associated emotion. In spotting, the initial frame of a video sequence often serves as a baseline, with subsequent frames analyzed to compute the average optical flow magnitude, a measure of motion intensity that pinpoints the frame with the greatest change. By contrast, recognition typically entails pre-processing micro-expression images using techniques such as optical flow estimation or motion magnification, followed by classification via deep learning models. This study primarily concentrates on micro-expression recognition.
Micro-expression recognition techniques are generally categorized into two primary groups: traditional methods and deep learning-based approaches. Traditional methods include dynamic texture analysis, such as the Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) method proposed by Zhao et al. [7], which facilitates action and expression recognition in video sequences. Similarly, Liu et al. [8] introduced the Main Directional Mean Optical Flow (MDMO) technique to extract facial features from micro-expression sequences. In this approach, each pixel’s emotional dynamics in an image sequence are represented as a vector, with its direction and magnitude indicating motion. The MDMO algorithm identifies the predominant direction of these vectors, capturing the overall motion trend within the sequence. However, traditional methods face challenges in feature extraction, particularly in capturing high-dimensional features of micro-expressions and effectively leveraging both temporal and spatial information in the sequence.
Recent advancements in deep learning have significantly enhanced micro-expression recognition across various domains [9,10]. Gan et al. [11] proposed a Convolutional Neural Network (CNN)-based approach, introducing the OFF-ApexNet model to extract optical flow features from apex frames. This method integrates deep learning with optical flow estimation, substantially improving accuracy and generalization compared to traditional techniques. However, CNNs often fail to effectively capture temporal dynamics in video sequences. To address this limitation, Nguyen et al. [12] developed Micron-BERT, a novel end-to-end deep learning framework for unsupervised micro-expression detection. Micron-BERT incorporates Diagonal Micro-Attention and Patch of Interest modules, leveraging a multi-head self-attention mechanism to extract both local and global features, thereby outperforming CNN-based models. More recently, Xie et al. [13] applied the Multi-Head Self-Attention mechanism to fuse micro-expression features by performing separate feature extraction on different types of micro-expression images and subsequently merging them, which yielded promising results. Despite these advances, deep learning approaches still struggle with real-world complexities, particularly in detecting subtle facial movements characteristic of micro-expressions. For instance, while the MFDAN model achieved UF1 and UAR scores of 91.34% and 93.26%, respectively, on the CASME II dataset, its performance declined markedly on the SMIC dataset, with UF1 and UAR scores of 68.15% and 70.43% [14]. This performance drop is attributed to the SMIC dataset’s lower-quality, blurry side-face images, resulting from limitations in recording equipment, which highlights the challenges of applying deep learning models to diverse and suboptimal real-world data.
Despite recent advancements in micro-expression recognition, several critical challenges persist. (1) Collecting micro-expression data remains a complex task, resulting in limited dataset availability. Existing datasets are frequently compromised by issues such as noise and low resolution, stemming from recording constraints. (2) The subtle facial muscle movements characteristic of micro-expressions pose significant recognition difficulties. Figure 1a depicts three distinct micro-expression video sequences from the same individual in the CASME II dataset, with the onset frame in the first column and the apex frame in the second-to-last column. The minimal facial changes across the sequence and the subtle distinctions between micro-expression categories are evident. To further illustrate, we applied the t-SNE algorithm [15] to visualize three micro-expression categories from the CASME II dataset, as shown in Figure 1b. The overlapping distributions indicate that intra-class variation exceeds inter-class variation, complicating classification. In contrast, the SMIC-HS dataset, presented in Figure 1c, exhibits more pronounced differences in clarity, resolution, and viewing angles.
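For readers who wish to reproduce a visualization like the one in Figure 1b, the minimal sketch below embeds per-sample feature vectors with t-SNE using scikit-learn. The feature matrix and labels here are random stand-ins, not the CASME II features used in this paper.

```python
# Minimal t-SNE visualization sketch (hypothetical feature matrix and labels).
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.random.rand(150, 512)          # stand-in for per-sample feature vectors
labels = np.random.randint(0, 3, size=150)   # stand-in for 3 micro-expression classes

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)

for c, name in enumerate(["positive", "negative", "surprise"]):
    pts = embedded[labels == c]
    plt.scatter(pts[:, 0], pts[:, 1], s=10, label=name)
plt.legend()
plt.title("t-SNE of micro-expression features")
plt.show()
```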
To address the challenges outlined above, this study introduces a novel optical flow feature fusion network, termed the Optical Flow Magnification and Cosine Similarity Feature Fusion Network (MCNet), which integrates motion magnification with facial action units (FAUs). The primary contributions of this approach are as follows.
  • To capture subtle motion variations in micro-expression datasets, this work pioneers the application of optical flow estimation to motion-magnified micro-expression images, marking a significant advancement in the field.
  • A novel method is proposed that leverages FAU segmentation to enhance input preprocessing, thereby improving the efficacy of conventional optical flow techniques.
  • Two innovative modules are introduced for feature extraction and fusion: the Mobile Residual KAN CBAM Block Net (MRKCBN) and the Multi-Channel Cosine Fusion Module (MCCFM). The MRKCBN enhances feature learning by substituting traditional weight parameters with learnable univariate functions, facilitating the extraction of highly discriminative features. Concurrently, the MCCFM captures subtle yet critical features often overlooked, thereby elevating micro-expression recognition performance.
The remainder of this paper is structured as follows: Section 2 reviews related literature. Section 3 provides a comprehensive description of the proposed methodology. Section 4 details the experimental setup and analyzes the results. Section 5 concludes the paper.

2. Related Work

2.1. Micro-Expression Recognition

Micro-expressions, characterized by brief and subtle facial muscle movements, pose significant challenges for accurate detection using conventional static image processing techniques. Optical flow methods, which track pixel motion across image sequences, effectively capture dynamic motion information. Figure 2 illustrates the visualization of optical flow features, where hue and brightness in the HSV color space represent the direction and magnitude of motion, respectively.
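The HSV rendering described above is the standard way to visualize dense optical flow; a minimal OpenCV sketch is shown below. This is a generic implementation of that visualization, not the rendering code used by the authors.

```python
# Render a dense optical-flow field as an image: hue = direction, value = magnitude.
import cv2
import numpy as np

def flow_to_rgb(flow: np.ndarray) -> np.ndarray:
    """flow: H x W x 2 array of (dx, dy) displacements."""
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2                        # OpenCV hue range is [0, 180)
    hsv[..., 1] = 255                                          # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
```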
Recently, Liong et al. [16] applied optical flow to micro-expression recognition, introducing a novel weighted feature extraction scheme based on optical strain. This method assigns weights to different image regions by aggregating motion information across temporal and spatial dimensions. Wei et al. [17] demonstrated that applying motion magnification as a preprocessing step enhances the subtle motion features of micro-expressions, improving recognition accuracy when images are input into a network model. This finding underscores the potential of motion magnification to augment micro-expression detection. Nguyen et al. [12] proposed a BERT-based approach incorporating diagonal micro-attention and patch-of-interest modules. This method automatically focuses on facial regions most indicative of micro-expression changes, thereby enhancing detection and recognition performance.
In contrast, Wang et al. [18] divided micro-expression images into four distinct regions—left eye, right eye, left mouth corner, and right mouth corner—for input into the Hierarchical Transformer Network (HTNet). This region-specific approach enables HTNet to prioritize local details critical for micro-expression analysis. Similarly, Cai et al. [14] proposed the Multi-level Flow-Driven Attention Network (MFDAN), which employs a transformer-based model for feature extraction using motion-magnified micro-expression images and optical flow images. To capture spatio-temporal information from optical flow images, conventional approaches [12,19] often utilize the Convolutional Block Attention Module (CBAM). However, CBAM’s spatial attention mechanism relies on standard 2D convolutions to process concatenated feature maps, limiting its ability to capture global contextual information due to its focus on local neighborhoods. This constraint reduces CBAM’s effectiveness in modeling comprehensive spatio-temporal features.

2.2. KAN (Kolmogorov–Arnold Networks)

The Kolmogorov–Arnold network (KAN) [20] is a neural network architecture inspired by the Kolmogorov–Arnold representation theorem. This theorem asserts that any multivariate continuous function $f(\cdot)$ defined on a bounded domain can be expressed as a finite composition of univariate continuous functions and addition operations. The mathematical expression of KAN is as follows:

$$f(\mathbf{x}) = f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$$

where $\mathbf{x} = [x_1, \dots, x_n] \in \mathbb{R}^{1 \times n}$, and $\phi_{q,p}$ and $\phi_q$ are univariate continuous functions with $\phi_{q,p}\colon [0,1] \to \mathbb{R}$ and $\phi_q\colon \mathbb{R} \to \mathbb{R}$. KAN incorporates KANLinear layers, which differ from traditional linear layers in that they utilize learnable univariate functions, typically based on splines, to approximate complex multivariate functions.
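To make the decomposition concrete, the toy sketch below evaluates the Kolmogorov–Arnold form with small learnable univariate functions. Tiny MLPs stand in for the B-spline parameterization that actual KAN layers use, so this is illustrative only.

```python
# Toy sketch of the Kolmogorov-Arnold form: f(x) = sum_q phi_q( sum_p phi_qp(x_p) ).
# Each univariate function is modelled by a tiny MLP as a stand-in for the learnable
# B-spline parameterization used in KAN.
import torch
import torch.nn as nn

class ToyKALayer(nn.Module):
    def __init__(self, n_in: int):
        super().__init__()
        n_out = 2 * n_in + 1
        # phi_{q,p}: one univariate function per (q, p) pair
        self.inner = nn.ModuleList([
            nn.ModuleList([nn.Sequential(nn.Linear(1, 8), nn.SiLU(), nn.Linear(8, 1))
                           for _ in range(n_in)])
            for _ in range(n_out)])
        # phi_q: one outer univariate function per q
        self.outer = nn.ModuleList([nn.Sequential(nn.Linear(1, 8), nn.SiLU(), nn.Linear(8, 1))
                                    for _ in range(n_out)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (batch, n_in)
        outs = []
        for q, phi_q in enumerate(self.outer):
            s = sum(self.inner[q][p](x[:, p:p + 1]) for p in range(x.shape[1]))
            outs.append(phi_q(s))
        return sum(outs)                                        # (batch, 1)

y = ToyKALayer(n_in=3)(torch.randn(4, 3))                       # scalar output per sample
```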
Recently, Bresson et al. [21] introduced KAGNN, in which KAN replaces the Multi-Layer Perceptron (MLP) in Graph Neural Networks (GNNs). Since MLPs are limited in their role as feature transformations within GNNs, KAN offers a more effective means of capturing nonlinear relationships between complex inputs and outputs with fewer parameters. Unlike MLPs, which approximate complex functions through linear combinations and activation functions across multiple layers, KAN can flexibly model intricate relationships, significantly enhancing model capacity. Inspired by KAN and LSTM, Genet et al. [22] proposed a new neural network architecture called TKAN, combining the strengths of both to achieve more accurate and efficient multi-step time series predictions. Despite KAN’s outstanding performance in image processing applications, no previous studies have explored its application to micro-expression recognition.

3. The Optical Flow Magnification and Cosine Similarity Feature Fusion Network

The architecture of the proposed MCNet is depicted in Figure 3. The MCNet model comprises three core components: the optical flow processing module, the Mobile Residual KAN CBAM Block Net (MRKCBN) module, and the Multi-Channel Cosine Fusion Module (MCCFM). The optical flow processing module enhances subtle facial movements, while the MRKCBN and MCCFM modules handle feature extraction and fusion, respectively, to optimize micro-expression recognition performance.

3.1. The Optical Flow Processing Module

In micro-expression recognition, existing optical flow algorithms often fail to effectively capture subtle facial muscle movements due to their minimal amplitude. To enhance the extraction of facial motion information, this study proposes two novel optical flow estimation modules: the Optical Flow Estimation Module based on Motion Amplification (OMAM) and the Optical Flow Estimation Module based on Action Unit Cropping (OAUM).
The workflow of the optical flow processing module, as depicted in Figure 4, begins with preprocessing each micro-expression video sequence. The sequence is first segmented into individual image frames, followed by facial alignment and cropping using the 68 facial landmark detection model from the dlib library (https://pypi.org/project/dlib/). The cropped image sequences are then processed by the OMAM and OAUM modules.
In the OMAM module, the Flowmag model [23] is employed for motion amplification to enhance the magnitude of subtle facial movements. An amplification factor (alpha) of 5 is set to achieve optimal magnification. Subsequently, TVL1 optical flow [24] is computed between the onset and apex frames. Compared to optical flow derived from unamplified sequences, the amplified optical flow images exhibit more pronounced facial motion, improving feature visibility.
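A minimal sketch of the TVL1 step is given below, assuming the onset and apex frames have already been motion-amplified (e.g., by Flowmag) and saved to the placeholder paths shown; it requires the opencv-contrib-python package for the cv2.optflow module.

```python
# TVL1 optical flow between the (motion-amplified) onset and apex frames.
# Requires opencv-contrib-python for cv2.optflow; frame paths are placeholders.
import cv2
import numpy as np

onset = cv2.imread("onset_amplified.png", cv2.IMREAD_GRAYSCALE)
apex = cv2.imread("apex_amplified.png", cv2.IMREAD_GRAYSCALE)

tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
flow = tvl1.calc(onset, apex, None)                  # H x W x 2 (dx, dy) displacement field
np.save("flow_onset_to_apex.npy", flow)              # downstream: render to HSV as in Figure 2
```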
While optical flow effectively captures rich facial expression features, modeling complex flow patterns across the entire face remains challenging. Some methods address this by segmenting facial images into distinct regions—such as the left eye, right eye, left mouth corner, and right mouth corner—to focus on region-specific motion characteristics [18]. However, such approaches limit the integration of collaborative information across facial regions. To address this, the OAUM module leverages Facial Action Units (FAUs), which partition facial expressions into upper and lower regions to capture primary expression features. As illustrated in Figure 4b, the y-axis coordinate of the 30th facial landmark is used as a reference to horizontally crop the optical flow image, isolating the eye and mouth–nose regions. This enables the OAUM module to prioritize critical movement information in these key areas during training while facilitating the exchange of collaborative information between related regions.
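The following sketch illustrates one plausible implementation of the landmark-based split, using dlib's 68-point predictor and the y-coordinate of landmark 30 (the nose tip). The image paths and the predictor model file are placeholders, and the actual OAUM pipeline may differ in detail.

```python
# Split an aligned optical-flow image into eye and mouth-nose regions using
# the y-coordinate of dlib landmark 30 (nose tip); paths are placeholders.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

face_img = cv2.imread("apex_cropped.png")             # aligned face used for landmark detection
flow_img = cv2.imread("flow_onset_to_apex.png")       # optical-flow image to be split
gray = cv2.cvtColor(face_img, cv2.COLOR_BGR2GRAY)

rects = detector(gray, 1)
landmarks = predictor(gray, rects[0])
split_y = landmarks.part(30).y                        # landmark 30: nose tip

eye_region = flow_img[:split_y, :]                    # upper FAU region (eyes)
mouth_nose_region = flow_img[split_y:, :]             # lower FAU region (mouth-nose)
```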

3.2. The Mobile Residual KAN CBAM Block Net

To extract representative and discriminative features from input optical flow images, we propose the Mobile Residual KAN CBAM Block Net (MRKCBN), a novel network architecture built upon the MobileNetV3 framework [25]. The structure of MRKCBN is depicted in Figure 5. The model comprises three primary components: MobileNetV3, KCBAM, and AdaptiveAvgPool.
The fused optical flow image is first processed by MobileNetV3 to extract representative features. This involves an initial convolution (with a kernel size of 3 and 16 output channels), followed by batch normalization and a Hardswish nonlinear activation function. Recent work by Liu et al. [20] demonstrated that the Kolmogorov–Arnold network (KAN) excels at extracting local features through convolution operations, enhancing the representation of spatial relationships among pixels. Inspired by this, we introduce the KCBAM module, which integrates KAN into both the channel and spatial attention mechanisms of CBAM. By leveraging KAN’s adaptive modeling capabilities, KCBAM improves the representation of spatio-temporal information in the extracted features.
The detailed architecture of the KCBAM is illustrated in Figure 6a. The KCAM (KAN Channel Attention Module) optimizes the original linear module in the channel attention mechanism by incorporating the KANLinear module, as shown in Figure 6b. KANLinear introduces B-spline functions to approximate traditional activation functions, creating an adaptive linear layer [20]. This module can dynamically adjust its behavior based on the input data and regulate the model’s complexity via a regularization loss term. Compared to traditional MLPs, KAN provides enhanced adaptability and interpretability, making it particularly well-suited for tasks such as micro-expression recognition, where features are highly subtle and nuanced. The core operation of the KCAM is expressed as follows:
$$M_c(F_1) = \sigma\big(\mathrm{KAN}(\mathrm{AvgPool}(F_1)) + \mathrm{KAN}(\mathrm{MaxPool}(F_1))\big)$$

where $F_1$ denotes the input feature sequence of the KCAM, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ denote the average and max pooling operations, $\sigma$ is the activation function, and $M_c(F_1)$ denotes the output sequence of the KCAM. Additionally, the KCBAM replaces the convolution module in the spatial attention mechanism of CBAM with KANConv [26], which combines convolutional layers with KAN principles. KANConv can approximate traditional activation functions using B-splines and integrates features such as grouped convolution, normalization, and dropout. These features result in a more flexible and powerful convolutional layer implementation, allowing the module to better adapt to the data’s characteristics, thus improving the model’s representational capacity and generalization performance.
Similarly, KAN is integrated into the spatial attention module to form the KSAM (KAN Spatial Attention Module), as illustrated in Figure 6c. The core operations of the KSAM are expressed as follows:
$$M_s(F_2) = \sigma\big(\mathrm{KANConv}([\mathrm{AvgPool}(F_2);\ \mathrm{MaxPool}(F_2)])\big)$$

where $F_2$ denotes the input feature sequence of the KSAM, $\mathrm{KANConv}$ denotes the KAN-based convolution operation, and $M_s(F_2)$ denotes the output sequence of the KSAM.
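The sketch below shows how channel and spatial attention of this form can be wired up in PyTorch. The KANLinearStub and KANConvStub classes are plain stand-ins for the KANLinear and KANConv modules cited above, so the sketch captures the attention structure rather than the KAN internals.

```python
# CBAM-style channel/spatial attention following the M_c and M_s formulas above.
# KANLinearStub / KANConvStub are plain stand-ins for the cited KAN modules.
import torch
import torch.nn as nn

class KANLinearStub(nn.Linear):
    pass

class KANConvStub(nn.Conv2d):
    pass

class KCAM(nn.Module):                       # channel attention (M_c)
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.kan = nn.Sequential(
            KANLinearStub(channels, channels // reduction), nn.ReLU(),
            KANLinearStub(channels // reduction, channels))

    def forward(self, x):                    # x: (B, C, H, W)
        avg = self.kan(x.mean(dim=(2, 3)))   # shared mapping of AvgPool features
        mx = self.kan(x.amax(dim=(2, 3)))    # shared mapping of MaxPool features
        return torch.sigmoid(avg + mx)[:, :, None, None] * x

class KSAM(nn.Module):                       # spatial attention (M_s)
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = KANConvStub(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled)) * x

x = torch.randn(2, 576, 7, 7)
out = KSAM()(KCAM(576)(x))                   # channel attention followed by spatial attention
```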
To mitigate the challenges of vanishing gradients and convergence difficulties commonly encountered in deep neural network architectures, the KCBAM module incorporates a residual block comprising two 3 × 3 convolutional layers, two batch normalization layers, and a ReLU activation function. This configuration enhances training stability and facilitates smoother convergence. Additionally, the AdaptiveAvgPool layer is employed to dynamically perform average pooling on feature maps, reducing their spatial dimensions (e.g., height and width) while extracting more robust and semantically meaningful features. This operation improves computational efficiency and overall model performance. Unlike conventional pooling methods, such as MaxPool or AvgPool, AdaptiveAvgPool dynamically adjusts the pooling operation to align with the target output size, ensuring precise feature aggregation. The detailed configuration of the MRKCBN module is presented in Table 1.

3.3. The Multi-Channel Cosine Fusion Module

We propose the Multi-Channel Cosine Fusion Module (MCCFM), which leverages an optimized Multi-Head Self-Attention (MHSA) mechanism to improve feature integration. In conventional MHSA, attention weights are derived by computing the correlation between Query and Key operations to identify the most relevant Value. In this study, we introduce the Multi-Channel Cosine Fusion Self-Attention (MCCFSA) mechanism, which enhances the standard dot-product attention by incorporating cosine similarity between Query and Key feature sequences. This approach better captures latent correlations among diverse feature sequences, thereby improving feature fusion performance. The architecture of the MCCFM is depicted in Figure 7.
The input features are first normalized using Layer Normalization to stabilize and standardize the feature representations. Subsequently, the normalized features are processed by the Multi-Channel Cosine Fusion Self-Attention (MCCFSA) mechanism. The input tokens—Token1, Token2, and Token3—correspond to the optical flow images from the eye region, the motion-amplified optical flow images, and the optical flow images from the mouth–nose region, respectively. These tokens represent the three feature sequences derived from the Mobile Residual KAN CBAM Block Net (MRKCBN) model. For each token, Query, Key, and Value vectors are extracted, enabling the MCCFSA mechanism to compute attention weights based on cosine similarity, as described in the following formulation:
$$Q_1, K_1, V_1, Q_2, K_2, V_2, Q_3, K_3, V_3 = W_{qkv}(X_1, X_2, X_3)$$

where $Q_i$, $K_i$, and $V_i$ denote the Query, Key, and Value extracted from the $i$-th token sequence, and $W_{qkv}$ denotes the parameters of the linear projections. The cosine similarity between the Queries and Keys of different feature sequences is then computed as follows:

$$\cos(Q, K) = \frac{Q \cdot K}{\|Q\|\,\|K\|}$$

$$\mathrm{AttnM}_i = \frac{\cos(Q_i, K_j)}{d} + b$$

where $\cos(\cdot,\cdot)$ computes the cosine similarity between feature sequences, $d$ is the number of heads in the MHSA and serves as a scaling factor that stabilizes the numerical computation, and the bias term $b$ enhances the model’s adaptability to the data distribution. The resulting attention matrix $\mathrm{AttnM}_i$ corresponds to the $i$-th attention head.
The attention matrix $\mathrm{AttnM}_i$ is then normalized and multiplied by the corresponding Value vector $V_j$ as follows:

$$\mathrm{Softmax}(\mathrm{AttnM}_i) = \frac{\exp(\mathrm{AttnM}_i)}{\sum_{i=1}^{L}\exp(\mathrm{AttnM}_i)}$$

$$\mathrm{Attention}_i(Q, K, V) = \mathrm{Softmax}(\mathrm{AttnM}_i)\, V_j$$

Through this matrix multiplication, the attention feature vector for the $i$-th element is obtained. The three attention feature vectors, derived from Token1, Token2, and Token3, are aggregated to form a unified attention feature vector that effectively integrates multi-scale information. This process is formulated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Attention}_1 + \mathrm{Attention}_2 + \mathrm{Attention}_3$$
The attention-weighted features are processed through a residual connection followed by Layer Normalization to generate the final output features. The proposed Multi-Channel Cosine Fusion Module (MCCFM) effectively integrates diverse feature sequences, enabling motion-amplified optical flow images to comprehensively capture facial movement information during micro-expression events. By computing cosine similarity between these feature sequences and local features from the eye and mouth–nose regions, the model seamlessly combines local and global facial movement characteristics. This approach significantly enhances the robustness and richness of feature extraction, thereby improving overall model performance.
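A simplified, single-head sketch of the cosine-similarity fusion is given below, assuming each Query is paired with the Key/Value of another channel in round-robin fashion; the exact cross-channel pairing and multi-head bookkeeping in MCCFSA may differ from this sketch.

```python
# Simplified single-head sketch of cosine-similarity fusion across three token sequences
# (eye-region flow, motion-amplified flow, and mouth-nose flow features).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)               # shared W_qkv projection
        self.scale = num_heads                           # "d" scaling factor in AttnM_i
        self.bias = nn.Parameter(torch.zeros(1))         # learnable bias term b

    def forward(self, tokens):                           # list of 3 tensors, each (B, N, dim)
        qkv = [self.qkv(t).chunk(3, dim=-1) for t in tokens]   # [(Q_i, K_i, V_i)] * 3
        fused = 0
        for i, (q, _, _) in enumerate(qkv):
            j = (i + 1) % 3                              # assumed pairing of Q_i with K_j, V_j
            _, k, v = qkv[j]
            sim = F.cosine_similarity(q.unsqueeze(2), k.unsqueeze(1), dim=-1)  # (B, N, N)
            attn = torch.softmax(sim / self.scale + self.bias, dim=-1)
            fused = fused + attn @ v                     # Attention_i, summed over channels
        return fused

tok = [torch.randn(2, 49, 128) for _ in range(3)]
out = CosineFusion(dim=128)(tok)                         # (2, 49, 128) fused features
```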

3.4. Loss Function

To address class imbalance in micro-expression recognition, we utilize the focal loss function during model training. The focal loss function augments the standard cross-entropy loss by incorporating a modulating factor that reduces the contribution of easily classified samples, thereby prioritizing difficult or misclassified instances. This approach enhances the model’s ability to handle imbalanced datasets, which is particularly advantageous in micro-expression recognition where certain classes may be underrepresented. The mathematical formulation of the focal loss function is expressed as follows:
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where $\alpha_t$ is a weighting factor that mitigates class imbalance, $(1 - p_t)^{\gamma}$ serves as a modulating factor, and $\gamma \ge 0$ is the focusing parameter that adjusts the emphasis on hard-to-classify samples. By tuning $\gamma$, the model prioritizes challenging instances during training, enhancing its ability to address class imbalance in micro-expression recognition.
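A minimal PyTorch sketch of this multi-class focal loss is shown below; the per-class alpha values in the usage line are placeholders rather than the weights actually derived from the datasets.

```python
# Multi-class focal loss following the formula above; gamma and alpha are supplied
# by the caller (Section 4.2 reports gamma = 0.2 and a class-balanced alpha).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha: torch.Tensor, gamma: float = 0.2):
        super().__init__()
        self.alpha = alpha                    # (C,) per-class weights alpha_t
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        log_pt = F.log_softmax(logits, dim=1).gather(1, targets[:, None]).squeeze(1)
        pt = log_pt.exp()                     # predicted probability of the true class
        alpha_t = self.alpha.to(logits.device)[targets]
        return (-alpha_t * (1 - pt) ** self.gamma * log_pt).mean()

loss = FocalLoss(alpha=torch.tensor([1.2, 0.5, 1.5]))(torch.randn(8, 3),
                                                      torch.randint(0, 3, (8,)))
```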

4. Experimental Results and Analysis

4.1. Experimental Dataset

To evaluate the effectiveness of the proposed Optical Flow Magnification and Cosine Similarity Feature Fusion Network (MCNet), comprehensive experiments were conducted on four widely recognized public micro-expression datasets: CASME II [27], SAMM [28], SMIC-HS [29], and MMEW [30].
The CASME II dataset, developed by the Institute of Psychology, Chinese Academy of Sciences, comprises micro-expression videos from 35 Chinese participants exposed to emotionally stimulating videos. It includes five emotional categories: happiness, surprise, disgust, repression, and other. The SAMM dataset, provided by the University of Manchester, UK, contains micro-expression videos from 32 participants of diverse ethnicities and genders, recorded under emotionally evocative stimuli, capturing emotions such as anger, disgust, fear, happiness, sadness, surprise, and others. The SMIC-HS dataset, released by the University of Oulu, Finland, focuses on three primary emotional states: positive, negative, and surprise. The MMEW dataset, compiled by Professor Ben’s team at Shandong University and released in 2021, includes 300 micro-expression video samples from 36 participants, with facial images standardized at 400 × 400 pixels. It includes seven emotional categories: happiness, anger, surprise, disgust, fear, sadness, and others.
To ensure consistency and comparability across datasets, emotional categories were harmonized into three unified classes: “positive” (happiness), “negative” (sadness, disgust, contempt, fear, anger), and “surprise” (exclusively surprise instances). Additionally, a composite dataset was created by merging these four datasets for validation purposes. The distribution of emotional categories, illustrated in Figure 8, reveals a significant class imbalance, with negative samples substantially outnumbering positive and surprise samples, particularly in the SAMM dataset.

4.2. Experimental Settings

To assess the effectiveness of the proposed Optical Flow Magnification and Cosine Similarity Feature Fusion Network (MCNet), we conducted a comprehensive comparison with conventional approaches, including LBP-TOP [7] and Bi-WOOF [31], as well as state-of-the-art deep learning approaches, such as CapsuleNet [32], FDCN [19], STSTNet [33], OFFApexNet [11], EMR [34], RCN [35], FeatRef [36], and MFDAN [14].
The experiments utilized the leave-one-subject-out cross-validation (LOSOCV) protocol, where each subject in turn serves as the test set while the remaining subjects’ data are used for training. LOSOCV is widely adopted in micro-expression recognition research due to its robust generalization and efficient data utilization. The model was trained with an initial learning rate of 0.0001, a batch size of 128, and 300 epochs. Data augmentation included random horizontal flipping (p = 0.5), random cropping (scale [0.9, 1.0]), and random rotation (±10°). For the KAN modules in MRKCBN, we used a B-spline configuration with five segments, a spline order of 3, and an L1 regularization strength of 1. The focal loss hyperparameter $\alpha_t$ was set as $N/(N_t \times C)$, where $N$ is the total number of samples, $N_t$ is the number of samples in class $t$, and $C$ is the number of classes. The focusing parameter $\gamma$ was set to 0.2. Experiments were conducted on a system running Windows 11, using PyTorch 2.3.0, an Intel Core i7-12700 processor, 64 GB of RAM, and an NVIDIA GeForce RTX 4080 GPU.
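The class-balanced weights and augmentation pipeline described above can be assembled as in the following sketch. The per-class sample counts are hypothetical, and torchvision's RandomResizedCrop and RandomRotation are used as reasonable stand-ins for the reported cropping and rotation settings.

```python
# Class-balanced focal-loss weights alpha_c = N / (N_c * C) and the augmentation pipeline.
import torch
from torchvision import transforms

class_counts = torch.tensor([109.0, 250.0, 83.0])      # hypothetical per-class sample counts
N, C = class_counts.sum(), len(class_counts)
alpha = N / (class_counts * C)                         # one weight per class

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])
```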
Two primary evaluation metrics were employed: the Unweighted F1 score (UF1) and the Unweighted Average Recall (UAR). The UF1, also known as the macro-average F1 score, is particularly effective for multi-class classification tasks with imbalanced datasets, as it equally weights each class’s F1 score, ensuring fair performance evaluation across all emotional categories. The UF1 is calculated as follows:
$$F1_c = \frac{2 \times TP_c}{2 \times TP_c + FP_c + FN_c}$$

$$UF1 = \frac{1}{C}\sum_{c=1}^{C} F1_c$$

where $TP_c$, $FP_c$, and $FN_c$ represent the number of true positives, false positives, and false negatives for class $c$, respectively, and $C$ denotes the total number of classes. This formulation ensures that all classes contribute equally to the metric, making UF1 particularly effective for handling class imbalance.
The Unweighted Average Recall (UAR) is another critical metric for evaluating model performance in the presence of imbalanced class distributions. It computes the average recall across all classes, assigning equal weight to each class regardless of sample size. The UAR is defined as
$$UAR = \frac{1}{C}\sum_{c=1}^{C} \frac{TP_c}{n_c}$$

where $n_c$ is the total number of samples in class $c$. By equally weighting each class’s recall, UAR effectively addresses the challenges of imbalanced datasets in micro-expression recognition.
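Both metrics follow directly from the confusion matrix, as in this short sketch (scikit-learn is used only to build the matrix):

```python
# Compute UF1 and UAR from predictions, per the definitions above.
import numpy as np
from sklearn.metrics import confusion_matrix

def uf1_uar(y_true, y_pred, num_classes=3):
    cm = confusion_matrix(y_true, y_pred, labels=list(range(num_classes)))
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)    # per-class F1
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)       # per-class recall
    return f1.mean(), recall.mean()                       # UF1, UAR

uf1, uar = uf1_uar([0, 1, 2, 1, 1], [0, 1, 2, 0, 1])
```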

4.3. Experimental Results

The performance of the proposed MCNet and the comparison methods on the individual datasets and the composite dataset is presented in Table 2. The results demonstrate that MCNet consistently outperforms competing methods across all datasets. On the CASME II dataset, MCNet achieves a UF1 of 94.34% and a UAR of 96.89%, improving on MFDAN by 3.00% in UF1 and 3.63% in UAR. For the SAMM dataset, MCNet exhibits significant improvements, with a 3.17% increase in UF1 and a 10.28% increase in UAR. The substantial UAR improvement highlights MCNet’s enhanced capability to classify micro-expressions across diverse emotional categories, with the focal loss function effectively mitigating the pronounced class imbalance in SAMM.
The SMIC-HS dataset poses unique challenges due to low-quality images caused by varying shooting conditions and camera angles, introducing significant noise that degrades the performance of many micro-expression recognition methods. While MFDAN, which relies solely on motion magnification, struggles with this noise, MCNet integrates optical flow estimation with motion-amplified images to robustly represent facial micro-expression features. This approach minimizes noise interference, resulting in a 17.9% improvement in UF1 and a 19.7% improvement in UAR compared to MFDAN. On the composite dataset, MCNet achieves a 1.77% increase in UF1 and a 6.0% increase in UAR over MFDAN, further confirming the robustness and generalizability of the proposed method.
To provide a detailed evaluation of the proposed MCNet, we analyzed its performance in recognizing micro-expression categories—positive (label 0), negative (label 1), and surprise (label 2)—using confusion matrices, as depicted in Figure 9. The results demonstrate MCNet’s robust recognition capabilities across the CASME II and SMIC-HS datasets, with some limitations observed on the SAMM dataset.
On the CASME II dataset, MCNet achieves exceptional accuracy across all three emotional categories, with a notable recognition accuracy of 0.97 for the negative category. Challenges such as varying image quality and shooting angles are effectively mitigated through the integration of motion amplification and optical flow processing, enabling superior performance. For the SMIC-HS dataset, MCNet outperforms most prior methods, achieving high Unweighted F1 (UF1) and Unweighted Average Recall (UAR) scores despite low-quality images and noise from diverse shooting conditions. This underscores the model’s robustness in handling noisy data, further validating its effectiveness.
The SAMM dataset, characterized by a significant class imbalance with a high proportion of negative samples, poses a unique challenge. Confusion matrix analysis reveals that the negative category achieves a precision of 0.96, while the positive category exhibits lower accuracy, with approximately 0.33 of positive instances misclassified as negative. This indicates that, despite the use of focal loss and composite dataset integration to mitigate class imbalance, the dominance of negative samples continues to impact the recognition of positive and surprise categories. On the composite dataset, MCNet effectively handles all three emotional categories, though performance varies. The model excels at recognizing negative emotions but faces challenges with positive and surprise categories due to residual class imbalance effects, highlighting areas for further improvement.
To further evaluate the effectiveness of the proposed MCNet, we conducted comprehensive experiments on the MMEW dataset. The results are presented in Table 3. MCNet achieves a recognition accuracy of 76.2%, surpassing the SGCN by 3.2%. This improvement underscores the robustness and efficacy of MCNet in micro-expression recognition.

4.4. Ablation Experiments

To thoroughly assess the contributions of individual components within the proposed MCNet, we conducted comprehensive ablation experiments, with results detailed in Table 4.
In Experiment 1, we removed the OMAM motion magnification step while retaining the other modules, and fed the images processed by OAUM directly into the network. This approach resulted in a UF1 score of 83.77%, showing a decrease of 10.57% compared to the full model. This highlights that relying solely on motion magnification introduces significant image noise, which interferes with the feature extraction process, leading to a performance drop.
In Experiment 2, we removed the FAU-based segmentation module from the OAUM module, and fed the images processed only by the OMAM module into the network for feature extraction. This led to a 5.16% decrease in the UF1 score. This performance drop may be attributed to the reduced focus on key features, as the complete optical flow image is less optimized for capturing essential micro-expression details, resulting in suboptimal outcomes.
In Experiment 3, we conducted an ablation study by removing the KAN component from the MRKCBN model while preserving all other modules. This resulted in a UF1 score of 87.31%, compared to 94.34% for the complete MCNet model, indicating a performance decrease of 7.03%. These findings underscore the significant contribution of the KAN component to the overall efficacy of the proposed model integration.
In Experiment 4, we excluded the RKCB module, keeping all other components and hyperparameters unchanged. This resulted in a UF1 score of 92.70%, a 1.64% decrease compared to the full model. The RKCB module, which integrates spatial and channel attention mechanisms, is vital for capturing nuanced features in optical flow images generated from micro-expressions. Its absence highlights its critical role in enhancing model performance.
In Experiment 5, we bypassed the Multi-Channel Cosine Fusion Module (MCCFM), directly concatenating features for classification. This approach yielded a 2.74% decrease in the UF1 score compared to the full model with MCCFM. The results indicate that MCCFM’s cosine similarity-based feature fusion significantly improves the integration of diverse feature sequences, enhancing overall performance.
Finally, Experiment 6 confirms that all proposed modules—OMAM, OAUM, MRKCBN, and MCCFM—collectively contribute to MCNet’s superior performance, as their combined integration maximizes feature extraction and fusion for micro-expression recognition.
To further validate the contributions of the Optical Flow Estimation Module based on Motion Amplification (OMAM) and the Optical Flow Estimation Module based on Action Unit Cropping (OAUM) within the MCNet, we conducted detailed ablation experiments on the CASME II dataset. The results are presented in Figure 10. In Experiment 1, utilizing both the OMAM and OAUM modules, MCNet achieved the highest recognition accuracy of 95.87%. In Experiment 2, removing the OAUM module and relying solely on the OMAM module for processing motion-magnified images resulted in an accuracy of 94.37%, a 1.50% decrease compared to the full model. This reduction highlights the OAUM module’s critical role in extracting Facial Action Unit (FAU)-based features and facilitating interactions between facial regions.
In Experiment 3, excluding the OMAM module and using only the OAUM module with segmented optical flow images fed into the dual-branch network led to a 4.61% accuracy drop. This indicates that, while FAU-based segmented optical flow images are effective, the OMAM module significantly enhances recognition by amplifying global micro-expression features.
In Experiment 4, omitting FAU-based segmentation features from the OAUM module while retaining the complete OMAM module and an incomplete OAUM module resulted in an accuracy of 90.19%, a 5.68% reduction compared to Experiment 1. This underscores the importance of FAU-based segmentation for improving recognition accuracy.
In Experiment 5, removing the optical flow estimation from the OMAM module and combining motion-amplified images with the complete OAUM module led to a 9.0% accuracy decline compared to the full model. This substantial drop emphasizes the synergistic impact of both OMAM and OAUM modules in enhancing MCNet’s performance in micro-expression recognition.
The intermediate results of the proposed MCNet are visualized in Figure 11. The first two columns present the onset and apex frames of the original micro-expression images. The subsequent two columns compare the TVL1 optical flow derived from the original sequence with the output of our proposed method, which integrates motion amplification followed by optical flow estimation. Our approach clearly captures more pronounced macro-level details, enhancing the representation of micro-expressions. The final two columns illustrate the feature extraction results with and without the application of the Kolmogorov–Arnold Convolutional Block Attention Module (KCBAM). Without KCBAM, the model’s attention is dispersed across the entire image, failing to prioritize key regions associated with micro-expressions. In contrast, the integration of KCBAM concentrates attention on facial regions exhibiting movement, thereby highlighting critical areas for accurate micro-expression recognition.

5. Conclusions

In this study, we introduced the Optical Flow Magnification and Cosine Similarity Feature Fusion Network (MCNet), a novel framework for micro-expression recognition. MCNet enhances subtle facial movements by integrating optical flow estimation with motion-amplified images, effectively mitigating noise introduced during amplification. The model incorporates the Mobile Residual KAN CBAM Block Net (MRKCBN) and the Multi-Channel Cosine Fusion Module (MCCFM) to extract and fuse representative features robustly. Extensive experiments on public datasets, including CASME II, SAMM, SMIC-HS, and MMEW, demonstrate MCNet’s superior recognition performance.
Despite its high accuracy, MCNet’s extended training times indicate a need for optimization to enable real-time applications. Future work will focus on improving computational efficiency and exploring multi-modal feature integration to broaden the applicability of micro-expression recognition in areas such as deception detection, emotion analysis, and mental health evaluation.

Author Contributions

Conceptualization, H.Z.; Methodology, H.C.; Software, J.Y.; Validation, J.Y.; Formal analysis, W.X.; Data curation, K.H.; Writing—original draft, H.C.; Writing—review & editing, J.Z.; Visualization, W.X.; Supervision, H.Z.; Funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (grant no. 61976118), and Open Fund of Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information (grant no. 2023-10).

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zheng, H.; Wang, R.; Ji, W.; Zong, M.; Wong, W.K.; Lai, Z.; Lv, H. Discriminative deep multi-task learning for facial expression recognition. Inf. Sci. 2020, 533, 60–71. [Google Scholar] [CrossRef]
  2. Zheng, H.; Geng, X.; Tao, D.; Jin, Z. A multi-task model for simultaneous face identification and facial expression recognition. Neurocomputing 2016, 171, 515–523. [Google Scholar] [CrossRef]
  3. Zhang, J.; Zhang, H.; Bo, L.L.; Li, H.R.; Xu, S.; Yuan, D.Q. Subspace transform induced robust similarity measure for facial images. Front. Inf. Technol. Electron. Eng. 2020, 21, 1334–1345. [Google Scholar] [CrossRef]
  4. Huang, X.; Wang, S.J.; Liu, X.; Zhao, G.; Feng, X.; Pietikäinen, M. Discriminative spatiotemporal local binary pattern with revisited integral projection for spontaneous facial micro-expression recognition. IEEE Trans. Affect. Comput. 2017, 10, 32–47. [Google Scholar] [CrossRef]
  5. Verma, M.; Vipparthi, S.K.; Singh, G.; Murala, S. LEARNet: Dynamic imaging network for micro expression recognition. IEEE Trans. Image Process. 2019, 29, 1618–1627. [Google Scholar] [CrossRef]
  6. Zhao, S.; Tang, H.; Liu, S.; Zhang, Y.; Wang, H.; Xu, T.; Guan, C. ME-PLAN: A deep prototypical learning with local attention network for dynamic micro-expression recognition. Neural Netw. 2022, 153, 427–443. [Google Scholar] [CrossRef]
  7. Zhao, G.; Pietikainen, M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 915–928. [Google Scholar] [CrossRef]
  8. Liu, Y.J.; Zhang, J.K.; Yan, W.J.; Wang, S.J.; Zhao, G.; Fu, X. A main directional mean optical flow feature for spontaneous micro-expression recognition. IEEE Trans. Affect. Comput. 2015, 7, 299–310. [Google Scholar] [CrossRef]
  9. Li, G.H.; Yuan, Y.F.; Ben, X.Y.; Zhang, J. Spatiotemporal attention network for micro-expression recognition. J. Image Graph. 2020, 25, 2380–2390. [Google Scholar] [CrossRef]
  10. Chang, H.; Zhang, F.; Ma, S.; Gao, G.; Zheng, H.; Chen, Y. Unsupervised domain adaptation based on cluster matching and Fisher criterion for image classification. Comput. Electr. Eng. 2021, 91, 107041. [Google Scholar] [CrossRef]
  11. Gan, Y.S.; Liong, S.T.; Yau, W.C.; Huang, Y.C.; Tan, L.K. OFF-ApexNet on micro-expression recognition system. Signal Process. Image Commun. 2019, 74, 129–139. [Google Scholar] [CrossRef]
  12. Nguyen, X.B.; Duong, C.N.; Li, X.; Gauch, S.; Seo, H.S.; Luu, K. Micron-BERT: BERT-based facial micro-expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1482–1492. [Google Scholar]
  13. Xie, Z.; Zhao, C. Dual-Branch Cross-Attention Network for Micro-Expression Recognition with Transformer Variants. Electronics 2024, 13, 461. [Google Scholar] [CrossRef]
  14. Cai, W.; Zhao, J.; Yi, R.; Yu, M.; Duan, F.; Pan, Z.; Liu, Y.J. MFDAN: Multi-level Flow-Driven Attention Network for Micro-Expression Recognition. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12823–12836. [Google Scholar] [CrossRef]
  15. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  16. Liong, S.T.; See, J.; Phan, R.C.W.; Le Ngo, A.C.; Oh, Y.H.; Wong, K. Subtle expression recognition using optical strain weighted features. In Proceedings of the Computer Vision-ACCV 2014 Workshops, Singapore, 1–2 November 2014; Revised Selected Papers, Part II 12. pp. 644–657. [Google Scholar]
  17. Wei, M.; Zheng, W.; Zong, Y.; Jiang, X.; Lu, C.; Liu, J. A novel micro-expression recognition approach using attention-based magnification-adaptive networks. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 2420–2424. [Google Scholar]
  18. Wang, Z.; Zhang, K.; Luo, W.; Sankaranarayana, R. HTNet for micro-expression recognition. Neurocomputing 2024, 602, 128196. [Google Scholar] [CrossRef]
  19. Tang, J.; Li, L.; Tang, M.; Xie, J. A novel micro-expression recognition algorithm using dual-stream combining optical flow and dynamic image convolutional neural networks. Signal Image Video Process. 2023, 17, 769–776. [Google Scholar] [CrossRef]
  20. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Tegmark, M. Kan: Kolmogorov-arnold networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  21. Bresson, R.; Nikolentzos, G.; Panagopoulos, G.; Chatzianastasis, M.; Pang, J.; Vazirgiannis, M. KAGNNs: Kolmogorov-arnold networks meet graph learning. arXiv 2024, arXiv:2406.18380. [Google Scholar]
  22. Genet, R.; Inzirillo, H. TKAN: Temporal Kolmogorov-arnold networks. arXiv 2024, arXiv:2405.07344. [Google Scholar] [CrossRef]
  23. Pan, Z.; Geng, D.; Owens, A. Self-supervised motion magnification by backpropagating through optical flow. Adv. Neural Inf. Process. Syst. 2024, 36, 253–273. [Google Scholar]
  24. Zach, C.; Pock, T.; Bischof, H. A duality based approach for realtime TV-L1 optical flow. In Proceedings of the Pattern Recognition: 29th DAGM Symposium, Heidelberg, Germany, 12–14 September 2007; pp. 214–223. [Google Scholar]
  25. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  26. Bodner, A.D.; Tepsich, A.S.; Spolski, J.N.; Pourteau, S. Convolutional Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2406.13155. [Google Scholar] [PubMed]
  27. Yan, W.J.; Li, X.; Wang, S.J.; Zhao, G.; Liu, Y.J.; Chen, Y.H.; Fu, X. CASME II: An improved spontaneous micro-expression database and the baseline evaluation. PLoS ONE 2014, 9, e86041. [Google Scholar] [CrossRef] [PubMed]
  28. Davison, A.K.; Lansley, C.; Costen, N.; Tan, K.; Yap, M.H. SAMM: A spontaneous micro-facial movement dataset. IEEE Trans. Affect. Comput. 2016, 9, 116–129. [Google Scholar] [CrossRef]
  29. Davison, A.K.; Li, J.; Yap, M.H.; See, J.; Cheng, W.H.; Li, X.; Wang, S.J. MEGC2023: ACM Multimedia 2023 ME Grand Challenge. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 9625–9629. [Google Scholar]
  30. Ben, X.; Ren, Y.; Zhang, J.; Wang, S.J.; Kpalma, K.; Meng, W.; Liu, Y.J. Video-based facial micro-expression analysis: A survey of datasets, features and algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5826–5846. [Google Scholar]
  31. Liong, S.T.; See, J.; Wong, K.; Phan, R.C.W. Less is more: Micro-expression recognition from video using apex frame. Signal Process. Image Commun. 2018, 62, 82–92. [Google Scholar] [CrossRef]
  32. Van Quang, N.; Chun, J.; Tokuyama, T. CapsuleNet for micro-expression recognition. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–7. [Google Scholar]
  33. Liong, S.T.; Gan, Y.S.; See, J.; Khor, H.Q.; Huang, Y.C. Shallow triple stream three-dimensional CNN (STSTNet) for micro-expression recognition. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–5. [Google Scholar]
  34. Liu, Y.; Du, H.; Zheng, L.; Gedeon, T. A neural micro-expression recognizer. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–4. [Google Scholar]
  35. Xia, Z.; Peng, W.; Khor, H.Q.; Feng, X.; Zhao, G. Revealing the invisible with model and data shrinking for composite-database micro-expression recognition. IEEE Trans. Image Process. 2020, 29, 8590–8605. [Google Scholar] [CrossRef]
  36. Zhou, L.; Mao, Q.; Huang, X.; Zhang, F.; Zhang, Z. Feature refinement: An expression-specific feature learning and fusion method for micro-expression recognition. Pattern Recognit. 2022, 122, 108275. [Google Scholar] [CrossRef]
  37. Xu, F.; Zhang, J.; Wang, J.Z. Microexpression Identification and Categorization Using a Facial Dynamics Map. IEEE Trans. Affect. Comput. 2017, 8, 254–267. [Google Scholar] [CrossRef]
  38. Hu, C.; Jiang, D.; Zou, H.; Zuo, X.; Shu, Y. Multi-task Micro-expression Recognition Combining Deep and Handcrafted Features. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 946–951. [Google Scholar] [CrossRef]
  39. Wang, S.-J.; Li, B.-J.; Liu, Y.-J.; Yan, W.-J.; Ou, X.; Huang, X.; Fu, X. Micro-expression recognition with small sample size by transferring long-term convolutional neural network. Neurocomputing 2018, 312, 251–262. [Google Scholar] [CrossRef]
  40. Tang, H.; Chai, L. Facial micro-expression recognition using stochastic graph convolutional network and dual transferred learning. Neural Netw. 2024, 178, 106421. [Google Scholar] [CrossRef]
Figure 1. The challenges of micro-expression recognition. (a) Comparison of three different micro-expressions in the CASME II dataset. (b) Distribution of three different micro-expressions in the CASME II dataset. (c) Some samples in the SMIC-HS dataset.
Figure 2. A sample optical flow image computed from the onset and apex frames.
Figure 3. The overview of the proposed MCNet.
Figure 4. The workflow of the optical flow processing module. (a) Pre-processing steps applied to the original micro-expression image data; (b) the structure of OMAM, which extracts optical flow features following motion amplification; (c) the structure of OAUM, which crops the optical flow images according to facial action units.
Figure 5. The structure of MRKCBN.
Figure 6. The structure of KCBAM. (a) KAN Convolutional Block Attention Module; (b) KAN Channel Attention Module; (c) KAN Spatial Attention Module.
Figure 7. The structure of MCCFM (Multi-channel cosine fusion module).
Figure 8. The distribution of categories in the micro-expression datasets.
Figure 9. The visualization of confusion matrices of the proposed method on the CASME II, SAMM, SMIC-HS, and the 3-DB Composite dataset.
Figure 10. The verification of the OMAM and OAUM modules. Experiment 1: with both OMAM and OAUM; Experiment 2: only with OMAM; Experiment 3: only with OAUM; Experiment 4: with OMAM and OAUM without the cutting operation; Experiment 5: with OAUM and OMAM without the amplification step.
Figure 11. The visualization of some samples. The first two columns display the onset and apex frames of the original micro-expression images. The third and fourth columns illustrate the optical flow images generated using the TVL1 method and our proposed method. The final two columns depict the feature extraction results with and without the application of the KCBAM.
Table 1. The specific configuration of the MRKCBN module.

Layer | Input_Size | Output_Size | Core
Conv2d | 3 × 224 × 224 | 16 × 112 × 112 | 3
BatchNorm2d | 16 × 112 × 112 | 16 × 112 × 112 | –
Hardswish | 16 × 112 × 112 | 16 × 112 × 112 | –
MobileNetV3 | 16 × 112 × 112 | 96 × 7 × 7 | –
Conv2d | 96 × 7 × 7 | 576 × 7 × 7 | 1
BatchNorm2d | 576 × 7 × 7 | 576 × 7 × 7 | –
Hardswish | 576 × 7 × 7 | 576 × 7 × 7 | –
RKCB | 576 × 7 × 7 | 1152 × 7 × 7 | –
Table 2. Comparison of experimental results on different databases. The highest results are highlighted in bold, and the second-best results are underlined.

Method | CASME II UF1/UAR (%) | SAMM UF1/UAR (%) | SMIC UF1/UAR (%) | Full UF1/UAR (%)
LBP-TOP | 70.26/74.29 | 39.54/41.02 | 20.00/52.80 | 58.82/57.85
Bi-WOOF | 78.05/80.26 | 52.11/51.39 | 57.27/58.29 | 62.96/62.27
CapsuleNet | 70.68/70.18 | 62.09/59.89 | 58.20/58.77 | 65.20/65.06
FDCN | 73.09/72.00 | 58.07/57.00 | –/– | –/–
STSTNet | 83.82/86.86 | 65.88/68.10 | 68.01/70.13 | 73.53/76.05
OFFApexNet | 87.64/86.81 | 54.09/53.92 | 68.17/66.95 | 71.96/70.96
EMR | 82.93/82.09 | 77.54/71.52 | 74.61/75.30 | 78.85/78.24
RCN | 85.12/81.23 | 76.01/67.15 | 63.26/64.41 | 74.32/71.90
FeatRef | 89.15/88.73 | 73.72/71.55 | 70.11/70.83 | 78.38/78.32
MFDAN | 91.34/93.26 | 78.71/81.96 | 68.15/70.83 | 84.53/86.88
MCNet (ours) | 94.34/96.89 | 81.88/92.24 | 86.05/90.55 | 86.30/92.88
Table 3. Recognition rates of different methods on the MMEW dataset.

Methods | Recognition Rate (%)
FDM [37] | 34.6
Handcrafted features + deep learning [38] | 36.6
MDMO [8] | 65.7
TLCNN [39] | 69.4
SGCN [40] | 73.0
MCNet (ours) | 76.2
Table 4. Ablation experimental results on the CASME II dataset (✓: Enabled, ×: Disabled).

Experiments | OMAM | OAUM | KAN | RKCB | MCCFM | UF1 (%)
1 | × | ✓ | ✓ | ✓ | ✓ | 83.77
2 | ✓ | × | ✓ | ✓ | ✓ | 89.18
3 | ✓ | ✓ | × | ✓ | ✓ | 87.31
4 | ✓ | ✓ | ✓ | × | ✓ | 92.70
5 | ✓ | ✓ | ✓ | ✓ | × | 91.60
6 | ✓ | ✓ | ✓ | ✓ | ✓ | 94.34
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
