2.1. Hand-Crafted Methods
Hand-crafted feature extraction methods were widely used in the early stages of MER. Common methods for extracting ME features include the Histogram of Oriented Gradients (HOG), the Histogram of Oriented Optical Flow (HOOF), and the Local Binary Pattern on Three Orthogonal Planes (LBP-TOP). HOG exploits gradient information in local image regions: gradients are computed at each pixel and their orientations are accumulated into histograms over local cells. Polikovsky et al. [
8] adopted 3D HOG features to recognize movements in selected facial regions. This method subdivides the face into specific regions and extracts 3D histograms from each for classification, effectively capturing subtle variations in facial expressions. Davison et al. [
16] proposed a personalized micro-movement detection method based on 3D HOG. This method defines a facial template consisting of 26 regions based on the Facial Action Coding System (FACS) and uses 3D HOG to extract temporal features for each region, providing a detailed description of facial motion changes. HOOF captures the dynamic information of facial movements in MEs, for which recognition largely depends on the motion vectors in facial regions. Chaudhry et al. [
17] proposed the Histogram of Oriented Optical Flow (HOOF), which bins optical flow vectors into orientation histograms weighted by their magnitudes. HOOF is scale-invariant and independent of the horizontal direction of motion, but it is sensitive to illumination changes. Happy et al. [
18] introduced the Fuzzy Histogram of Optical Flow Orientation (FHOFO) to overcome the limitations of HOOF. FHOFO uses histogram fuzzification to construct an appropriate angular histogram from the optical flow vectors, encoding temporal patterns robustly against changes in expression intensity. To address the issue of redundant information in complete ME sequences, Liong et al. [
19] proposed the Bi-Weighted Oriented Optical Flow (Bi-WOOF) method, which computes optical flow between the onset and apex frames and encodes it into a histogram descriptor using local weighting based on magnitude and global weighting based on optical strain, effectively capturing the subtle motion patterns in MEs. LBP-TOP can simultaneously capture the spatial and temporal texture information in ME sequences. Yan et al. [
20] employed it as a baseline method to evaluate the newly constructed CASME II dataset. Wang et al. [
21] proposed LBP with Six Intersection Points (LBP-SIP) and LBP with Mean Orthogonal Planes (LBP-MOP), both of which reduce redundant information and thereby speed up feature extraction. Huang et al. [
22] proposed an effective Spatio-Temporal Completed Local Quantization Pattern (STCLQP) for MER, which significantly improves facial ME analysis compared with LBP-TOP. Additionally, research has shown that color can also provide useful information for face recognition. Wang et al. [
23] extracted LBP-TOP from Tensor-independent Color Space (TICS). To extract sparse information from ME sequences, Wang et al. [
10] combined Robust Principal Component Analysis (RPCA) and Local Spatio-Temporal Directional Features (LSTD) to recognize MEs. RPCA extracts the fine motion information of MEs, while LSTD extracts local texture features from this motion information.
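To make the optical-flow-based descriptors above more concrete, the following is a minimal sketch of a HOOF-style orientation histogram computed between an onset and an apex frame. It is an illustrative simplification (it omits, for example, the direction-folding of the original HOOF formulation and the bi-weighting scheme of Bi-WOOF), not the implementation of [17] or [19]; the frame paths and the number of bins are placeholder choices.

```python
import cv2
import numpy as np

def hoof_descriptor(onset_gray, apex_gray, n_bins=8):
    """Simplified HOOF-style descriptor between two grayscale frames.

    Each optical flow vector votes into an orientation bin, weighted by its
    magnitude; L1-normalizing the histogram makes it invariant to the overall
    scale of the motion.
    """
    # Dense Farneback optical flow from the onset frame to the apex frame.
    flow = cv2.calcOpticalFlowFarneback(onset_gray, apex_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])   # ang in [0, 2*pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0.0, 2 * np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

# Hypothetical usage: "onset.png" and "apex.png" stand in for two frames of an ME clip.
onset = cv2.imread("onset.png", cv2.IMREAD_GRAYSCALE)
apex = cv2.imread("apex.png", cv2.IMREAD_GRAYSCALE)
descriptor = hoof_descriptor(onset, apex)
```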
Hand-crafted feature extraction methods rely on the expertise of domain experts, who design feature extraction algorithms based on prior knowledge of muscle movements and changes in facial regions [
24]. Their advantages lie in the interpretable physical meaning of the extracted features and their low computational resource requirements. However, the manually designed feature representations are limited in both effectiveness and generalization ability.
2.2. Deep Learning Methods
With the growth of ME databases and computational power, deep learning methods, particularly those based on CNNs, have also emerged. Xia et al. [
25] proposed Spatio-Temporal Recurrent Convolutional Networks (STRCN) to model relationships among facial positions and capture facial muscle contractions in different regions. The model combines multiple recurrent convolutional layers with a classification layer to extract visual features for MER. To reduce model complexity and computational costs, Liong et al. [
6] introduced the Shallow Triple-Stream 3D CNN (STSTNet), which performs feature extraction by fusing three types of optical flow (i.e., optical strain, horizontal, and vertical optical flow fields), enabling the network to learn discriminative high-level and detailed features of MEs. Khor et al. [
26] proposed a lightweight Dual-Stream Shallow Network (DSSN), which achieves MER by fusing heterogeneous input features, maintaining high performance while reducing model complexity. To better exploit local features, Chen et al. [
27] proposed a Block Division Convolutional Network (BDCNN) with implicit deep feature augmentation, in which each image is divided into a set of small patches and convolution and pooling operations are applied to each patch. Li et al. [
11] noted that different facial regions contribute unequally to MER and thus proposed a Local-to-Global Collaborative learning model (LGCcon). They extracted six ROIs using a sliding window with a stride of 1/6 of the face height and jointly learned emotional features from core local regions and global facial information. Nie et al. [
28] proposed GEME, which integrates gender features as an auxiliary task into a multi-task learning framework for MER, and combines it with a class-balanced focal loss to effectively improve MER accuracy. Zhao et al. [
29] were the first to exploit offset frames to complement motion details in MEs. They proposed a Channel Self-Attention Residual Network (CSARNet), which introduces a local feature augmentation strategy to highlight subtle facial muscle movements and builds a channel self-attention enhancement module to extract and refine features from motion flow images.
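As a concrete illustration of the optical-flow inputs used by networks such as STSTNet [6], the sketch below derives horizontal flow, vertical flow, and an optical-strain magnitude map from an onset/apex frame pair and stacks them into a three-channel input. It is a simplified approximation under common definitions of optical strain, not the authors' preprocessing code.

```python
import cv2
import numpy as np

def three_channel_flow_input(onset_gray, apex_gray):
    """Stack horizontal flow, vertical flow, and optical strain into (3, H, W)."""
    flow = cv2.calcOpticalFlowFarneback(onset_gray, apex_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    u, v = flow[..., 0], flow[..., 1]

    # Spatial derivatives of the flow field (rows = y, columns = x).
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)

    # Optical strain magnitude: sqrt(e_xx^2 + e_yy^2 + 2 * e_xy^2),
    # where e_xy = 0.5 * (du/dy + dv/dx).
    e_xy = 0.5 * (du_dy + dv_dx)
    strain = np.sqrt(du_dx ** 2 + dv_dy ** 2 + 2.0 * e_xy ** 2)

    return np.stack([u, v, strain], axis=0).astype(np.float32)
```

A shallow multi-stream CNN can then be trained directly on such three-channel maps, which is the design choice that keeps the parameter count small on limited MER datasets.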
Graph Convolutional Network (GCN)-based approaches have also been applied in MER. Wei et al. [
30] proposed SS-GN, the first work to systematically explore the contribution of facial landmarks to MER. This network aggregates both low-order and high-order geometric motion information from facial landmarks, accurately capturing subtle geometric variations in MEs. The success of Vision Transformers (ViTs) in computer vision has also motivated researchers to extend them to MER. Hong et al. [
31] proposed Later, the first work to apply the Transformer to MER, which effectively alleviates the data scarcity issue by incorporating optical flow motion features through a late fusion strategy. To overcome the limitations of CNNs in spatio-temporal feature extraction, Zhang et al. [
32] proposed the SLSTT-LSTM network, which combines Transformer with LSTM and simultaneously models short- and long-term spatial and temporal dependencies through multi-head self-attention and temporal aggregation modules. Lei et al. [
33] proposed AU-GCN, which learns facial graph representations via a Transformer encoder while modeling Action Unit (AU) relationships with a GCN, where the AU relations serve as the adjacency matrix. Pan et al. [
34] designed C3DBed, a 3D CNN embedded with a Transformer, combining the spatiotemporal feature extraction capability of 3D convolutions with the attention mechanism of Transformers. Bao et al. [
35] proposed SRMCL, which innovatively combines supervised prototype-based memory contrastive learning with a self-expressive reconstruction auxiliary task, enhancing both the discriminability and generalization of features. Wang et al. [
36] developed a Multi-scale Multi-modal Transformer Network (MMTNet), which integrates multi-scale feature learning with multi-modal representations from dynamic and optical flow features within a unified Transformer framework, thereby balancing fine-grained motion feature capture and efficient multimodal integration.
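For the landmark-graph methods above, the core operation is a graph convolution that propagates geometric features along a facial-landmark adjacency matrix. The following is a generic, minimal GCN layer in the standard normalized form, not the specific architecture of [30] or [33]; the landmark count, connectivity, and feature dimensions are placeholders.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""

    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        a_hat = adjacency + torch.eye(adjacency.size(0))        # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)                 # D^-1/2
        self.register_buffer(
            "norm_adj",
            d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0))

    def forward(self, x):             # x: (batch, num_landmarks, in_dim)
        return torch.relu(self.norm_adj @ self.linear(x))

# Hypothetical usage: 68 facial landmarks with 2-D motion vectors as node features.
adj = torch.zeros(68, 68)             # placeholder landmark connectivity
layer = GraphConvLayer(in_dim=2, out_dim=32, adjacency=adj)
features = torch.randn(4, 68, 2)      # (batch, landmarks, features)
out = layer(features)                 # (4, 68, 32)
```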
Although prior works [
6,
25] explored feature learning for MER, they failed to effectively capture subtle local features critical for MER. Chen et al. [
27] recognized the importance of local features in extracting discriminative patterns but overlooked global contextual information. Li et al. [
11] jointly considered global and local features but ignored the problem of redundant information caused by region partitioning, which reduced recognition accuracy. GCN-based methods [
30,
33] often rely on domain experts’ prior knowledge for defining facial regions, landmark selection, or AU configurations, which limits their adaptability. Existing ViT-based methods [
35,
36] suffer from two main drawbacks: on the one hand, their self-attention mechanism focuses more on learning global relationships and may overlook the subtle local muscle movements essential for recognition; on the other hand, the complex structure of ViTs makes them less suitable for the limited-scale MER datasets, hindering their performance. Therefore, the objectives of this study are as follows: (1) to address the limited size of MER datasets, we propose GLFNet, which employs a lightweight backbone network to reduce the risk of overfitting; (2) to capture the subtle and weak-intensity features of MEs, we propose the LB module, which partitions input feature maps into non-overlapping blocks and applies independent convolutional operations to each block, thereby precisely extracting local subtle muscle movement information; (3) to compensate for the limitations of CNNs in global feature modeling, we propose the GA module to learn the overall facial structure and contextual information; (4) to fully exploit the complementarity between local and global features, we propose the AFF module, which incorporates a dynamic weight allocation strategy to adaptively fuse the local detail features captured by the LB module with the global relational features learned by the GA module. Experimental results validate the effectiveness and superiority of the proposed method in MER tasks.
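To illustrate the block-partition and adaptive-fusion ideas behind objectives (2) and (4), the sketch below applies an independent convolution to each non-overlapping block of a feature map and then fuses a local and a global feature map with dynamically predicted weights. It is only an illustrative approximation of the LB and AFF concepts, not the GLFNet implementation detailed in the following sections; all layer sizes and the gating design are placeholder choices.

```python
import torch
import torch.nn as nn

class BlockLocalConv(nn.Module):
    """Split the feature map into a grid x grid set of non-overlapping blocks
    and apply an independent convolution to each block (LB-style idea)."""

    def __init__(self, channels, grid=2):
        super().__init__()
        self.grid = grid
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(grid * grid)])

    def forward(self, x):                      # x: (B, C, H, W), H and W divisible by grid
        rows = torch.chunk(x, self.grid, dim=2)
        blocks = [b for r in rows for b in torch.chunk(r, self.grid, dim=3)]
        out = [conv(b) for conv, b in zip(self.convs, blocks)]
        rows_out = [torch.cat(out[i * self.grid:(i + 1) * self.grid], dim=3)
                    for i in range(self.grid)]
        return torch.cat(rows_out, dim=2)

class AdaptiveFusion(nn.Module):
    """Fuse local and global feature maps with dynamically predicted weights."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2, kernel_size=1))

    def forward(self, local_feat, global_feat):
        w = torch.softmax(self.gate(torch.cat([local_feat, global_feat], dim=1)), dim=1)
        return w[:, :1] * local_feat + w[:, 1:] * global_feat

# Hypothetical usage with placeholder shapes.
x = torch.randn(2, 64, 28, 28)
local_feat = BlockLocalConv(64, grid=2)(x)
fused = AdaptiveFusion(64)(local_feat, x)      # x stands in for a global-branch output
```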