Article

Micro-Expression Recognition Using Transformers Neural Networks

by
Rodolfo Romero-Herrera
*,
Franco Tadeo Sánchez García
,
Nathan Arturo Álvarez Peñaloza
,
Billy Yong Le López Lin
and
Edwin Josué Juárez Utrilla
Department of Computer Science and Engineering, Escuela Superior de Cómputo, Instituto Politécnico Nacional, Mexico City 07738, Mexico
*
Author to whom correspondence should be addressed.
Computers 2025, 14(12), 559; https://doi.org/10.3390/computers14120559
Submission received: 27 October 2025 / Revised: 11 December 2025 / Accepted: 11 December 2025 / Published: 16 December 2025
(This article belongs to the Special Issue Multimodal Pattern Recognition of Social Signals in HCI (2nd Edition))

Abstract

A person’s face can reveal their mood, and micro-expressions, although brief and involuntary, are authentic indicators of emotion. People can recognize facial gestures, but their accuracy is inconsistent, which highlights the need for objective computational models. Various artificial intelligence models have classified micro-expressions into three categories: positive, negative, and surprise. However, it remains important to address the basic Ekman micro-expressions (joy, sadness, fear, disgust, anger, and surprise). This study proposes a Transformer-based machine learning model trained on CASME, SAMM, SMIC, and a dataset collected by the authors. When working with seven classes, the model achieves results comparable to those of other studies. It applies several component-based techniques, ranging from ViT to optical flow, from a different perspective, achieving short training times and metrics competitive with other publications while running on a laptop. These results can serve as a basis for future research.

Graphical Abstract

1. Introduction

Facial expressions appear daily in human beings. They involve changes in the face that reflect emotional states; however, individuals may conceal or disguise their true feelings [1]. Despite this, the production of micro-expressions is unavoidable, as they tend to occur involuntarily within a fraction of a second, often becoming imperceptible to the human eye [2].
The recognition of micro-expressions is a research topic with various fields of practical application, such as in [3], where it is employed to identify emotions from a psychological perspective. There are basic emotions: happiness, surprise, contempt, sadness, anger, disgust, and fear, which can be used to analyze an individual’s emotional behavior to assess their actual emotional state [3]. Throughout this work, the focus will be placed on the basic emotions identified by Ekman [4,5,6].
Various techniques have been used for their analysis. For example, in [7], mathematical strategies are employed to address the recognition of micro-expressions and the micro-movements involved; in [8], image analysis is conducted across different domains, providing a comparison of the obtained results; and in [9], the Extreme Learning Machine technique is implemented for the recognition of seven types of basic expressions.
In recent years, one of the technologies with the greatest growth due to its versatility and adaptability has been neural networks. In several of the consulted works, such as [2,7,8,9], some variation in this technology is applied to solve problems related to facial analysis. Throughout this project, a solution for the detection of facial micro-expressions based on neural networks is presented.
This paper offers a fresh approach to automatic micro-expression recognition through a novel hybrid model, DualHybridFace, which redefines the fusion of Transformers and Optical Flow and also incorporates Inception and CBAM modules. Rather than following traditional design approaches, the architecture introduces a relatively unexplored synergy: fusing the spatiotemporal sensitivity of Optical Flow with the global feature modeling capacity of ViT enables the decomposition of very short facial micro-movements (typically lasting up to about 0.5 s) into discriminative signals. Beyond detecting subtle facial variations that would be invisible to the naked eye, this method also removes dependence on related but irrelevant details such as facial morphology, producing representations concerned solely with meaningful facial dynamics.
The model demonstrates its robustness by obtaining competitive results across seven emotional categories on a dataset collected by the authors and on public datasets (CASME, SAMM, and SMIC), with no data augmentation. Training with Stochastic Gradient Descent shortens the training time and allows practical deployment on typical hardware, consuming 132,000 images in about two hours. As a result, efficiency stems from modest computational requirements and a modular design that enhances component reuse, preserves functional integrity, and makes better use of optical flow within the Transformer structure.
The research provides fresh insight through the deliberate combination of Optical Flow with attention mechanisms for micro-expression detection, rather than simply using Optical Flow as another preprocessing step. Combining Optical Flow with a Transformer leverages the Transformer’s ability to recognize small changes in both time and space, providing a reliable, efficient, and scalable framework for future research on dynamic facial analysis.

1.1. Problem Statement

Micro-expressions are brief and spontaneous facial movements, with a maximum duration of up to 0.5 s. Additionally, in some cases, individuals attempt to mask these facial movements to hide or suppress their emotions, which makes them nearly impossible to detect with the naked eye and difficult to capture on video.
For this system, a Transformer model was used, trained with a series of datasets based on human emotional expressions, such as SMIC, CASME, and SAMM, in addition to the creation of a dataset composed of adult participants. To elicit the required emotions for analysis, a set of videos categorized according to the emotional classification presented in [10] was shown to the participants. An RGB video camera was used to capture the expressed emotions. Once the recordings were obtained, they were divided into frames to be used primarily for machine learning based on Transformers.
The objective is to develop a machine learning model capable of detecting facial micro-expressions through frame-based analysis.

1.2. Related Works

The Facial Action Coding System (FACS) is a technique for classifying movements associated with facial muscles, as shown in Figure 1. It allows the measurement of all visible facial movements without being limited to actions related solely to emotions [11]. Instead of naming the muscles individually, Action Units (AUs) are specified, focusing on specific areas of facial movement. By combining these AUs, it is possible to identify different micro-expressions.
Paul Ekman identified a limited set of seven fundamental emotions that are universally recognized through facial expressions [13,14,15]. These have subsequently been generally classified as positive and negative emotions. The task of detecting and classifying micro-expressions is challenging and involves several complexities. For example, in [16], an ensemble of multiple models based on convolutional neural networks is used in order to leverage the advantages of each model while compensating for their respective limitations. In studies [17,18], residual blocks are employed to create shortcut connections between neurons, thereby avoiding the generation of excessively long information chains. In [19], the most relevant information regarding facial features associated with specific micro-expressions is further grouped into memory units, so that these features are retained throughout the detection and recognition process. In [18], a micro-attention unit is utilized, achieving a significant improvement in results despite the use of a smaller data sample [19].
On the other hand, refs. [8,17] employ a model trained for the detection of micro-expressions through image analysis. This process results in an improvement in detection performance with substantial room for further enhancement, outperforming other models. One of the main tasks in micro-expression detection involves isolating short-duration facial features. For instance, in [7], variational models and the RAFT method are used to compute Optical Flow, which emphasizes the desired facial characteristics, thereby facilitating precise detection of details that are decomposed into two domains: spatial and temporal, with the purpose of isolating micro-movements. The details are then magnified using an amplification factor and the “forward warping” method (a technique that involves transforming or mapping a source image to an end image using a deformation field). This methodology highlights the importance of employing Transformers to enhance the detection and emphasis of micro-expressions.
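The “forward warping” step described above can be illustrated with a minimal sketch. This is not the authors’ implementation: it assumes a grayscale image, nearest-integer splatting, and a last-writer-wins rule for collisions, whereas real motion-magnification pipelines use more careful interpolation.

```python
import numpy as np

def forward_warp(src, flow):
    """Forward-warp a grayscale image: source pixel (y, x) moves to
    (y + flow[y, x, 0], x + flow[y, x, 1]) in the output. Pixels that
    land outside the frame are discarded; colliding pixels keep the
    last writer (a simplification of real splatting schemes)."""
    h, w = src.shape
    out = np.zeros_like(src)
    ys, xs = np.mgrid[0:h, 0:w]
    ty = np.round(ys + flow[..., 0]).astype(int)
    tx = np.round(xs + flow[..., 1]).astype(int)
    valid = (ty >= 0) & (ty < h) & (tx >= 0) & (tx < w)
    out[ty[valid], tx[valid]] = src[valid]
    return out

# Amplify a sub-pixel displacement field before warping, as in
# motion magnification: a 0.4 px motion times a factor of 3 becomes
# a visible 1 px shift.
img = np.arange(16, dtype=float).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 1] = 0.4                      # sub-pixel rightward motion
warped = forward_warp(img, flow * 3.0)  # amplification factor of 3
```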
Another study introduced the LGNMNet model, which integrates Lite General Network and MagFace CNN to detect micro-expressions in long video sequences, achieving F1-scores of 0.2474 and 0.2555 on the CAS(ME)2 and SAMM-LV datasets, respectively, and proposing a pair-merge technique to improve micro-expression interval generation [20]. Likewise, an optical-flow–based method combining magnitude and angle information was presented, along with the new SDU2 dataset containing 1602 clips, significantly outperforming previous methods in micro-expression spotting tasks [21]. Additional advances include the lightweight HTNet model with Separable Self-Attention, which achieved a UF1 of 0.8498 with improved inference speed and parameter efficiency [22]; a Graph Transformer with a learnable adjacency matrix designed to capture spatiotemporal relationships across facial regions [23]; and Transformer architectures employing multi-head self-attention to model short- and long-range spatial–temporal dependencies, outperforming state-of-the-art approaches on SMIC-HS, CASME II, and SAMM [24]. Vision Transformer–based approaches have also been proposed, such as HTNet with LAPE and entropy-based attention to improve computational efficiency [25], as well as Transformer Fusion mechanisms that integrate local, global, and full-face information to extract multi-level features from Action Units (AUs) and onset–apex phases [26]. Finally, Visual Transformer and multimodal Swin Transformer models have reported accuracies of up to 81.50%, 82.97%, and 79.99% on the CASME II, MMEW, and SMIC datasets, respectively, establishing new state-of-the-art performance in micro-expression recognition [27].
The final approach considered is presented in [28], which utilizes infrared cameras in combination with visible-light cameras to improve the accuracy of emotion detection, particularly when the cameras are installed at different angles.

2. Materials and Methods

The component-based methodology was selected for the development of the system. This methodology consists of reusing code modules that enable the execution of different tasks [29,30], which leads to the construction of a complex system while ensuring its correct functioning and performance [31].
The diagram in Figure 2 illustrates two data flows. The first corresponds to the system training process: the user’s face is recorded by a camera, and this video is sent to the first component, the computer. The procedure starts with the recording of the video, which is stored for model training and then transferred to the dataset component. This component contains both the custom dataset and the external datasets; additionally, the Optical Flow method is applied to extract features that provide supplementary information to the model. Afterwards, both datasets are sent to the video processing component, where the video is segmented into frames and conditioning procedures are applied. Once the datasets have been processed, they are sent to the AI module, where parameter tuning and model evaluations are carried out to define the final system architecture. After obtaining the evaluation results, model testing is performed.
The second data flow corresponds to the system execution phase. The camera records and stores a video of the user’s face, which is then sent to the first component, the computer. Subsequently, the video is passed to the video processing component, where it is segmented into frames, conditioned, and filtered. These frames are then forwarded to the model prediction component, and once the result is analyzed, it is displayed in the computer component through a graphical user interface.
To induce in the participants the different emotional states required for this work—joy, sadness, surprise, anger, disgust, fear, and neutral—existing micro-expression datasets were used, in which short movie clips and videos were employed to elicit emotional reactions. For the development of the dataset, the results from [10] were used, where a study was conducted to generate a set of film scenes proven to be effective in evoking emotional states, resulting in a total of 57 scenes associated with the seven target emotions. In addition to this set of film scenes, short video clips specifically selected to produce the desired emotions were also used. As a result, eight film compilations were generated, each containing seven different scenes designed to induce one of the following emotions: neutral, sadness, surprise, fear, disgust, anger, and happiness.
A “C922 PRO HD STREAM WEBCAM” was used, capable of recording in Full HD 1080p and 720p at 60 fps, equipped with autofocus, glass lens, diagonal field of view of 78°, and 1.2× digital zoom. The camera was configured to 30 FPS, MP4 format, with a cropping filter to ensure that the recording remained focused and centered on the participant’s face. See Figure 3.

2.1. Data Set

In addition to the CASME, SAMM, and SMIC datasets, an additional dataset was created to validate the system. Each participant was provided with a form to record their emotions. The form included a multiple-choice question to register the facial expression with the highest intensity as the response, with the option to specify in writing the moment at which the emotion was manifested. The resulting dataset is composed of a total of 85 individuals, of whom 76.47% are male and 21.18% are female. The FRANI dataset was collected by the authors under informed consent, following ethical research practices, and all data were anonymized prior to use.

2.2. Segmentation and Frame Extraction

Each video was recorded at a rate of 30 frames per second (fps), but only 20 fps were used for analysis. Based on the form responses, the moment of highest emotional intensity was identified for each of the 7 scenes shown in the compilation, and the time interval in which the micro-expression occurred was determined. This interval was segmented at 20 frames per second; the duration of the intervals ranged from 2 to 4 s for each scene.
Based on the approach proposed by Jin Hyun Cheong and Eshin Jolly [32], a model was developed that enables the identification, measurement, and classification of macro-expressions using a person’s Facial Action Units. To accomplish this, the MP4 video corresponding to the selected recording interval was loaded. The model analyzes each frame, assigns the corresponding Action Units, provides a weighting to determine the associated emotion, and returns an analysis of the selected frames along with a plot representing the entire video.
As a result of the analysis, two types of plots are generated, as shown in Figure 4. Each plot is accompanied by an image of the analyzed frame, in which the participant’s facial features are highlighted using a white outline. A scoring of the most significant Action Units in the participant’s face is performed, and these scores are summed up to determine the emotional intensity expressed. The plot displays the intensity of each of the seven emotional states across the frames. The plot on the right provides an analysis across all frames that make up the segmented video, allowing the visualization of facial expression changes over time. This visualization makes it possible to identify the points of highest emotional intensity and determine their corresponding emotional category.
Inter-subject variability is one of the most challenging factors in micro-expression analysis; therefore, strategies are applied to normalize the data or to learn identity-invariant representations. In this context, the system leverages the sensitivity of optical flow to dynamic changes, while deep architectures such as Transformers and CNNs capture high-level facial patterns that are less dependent on identity, age, or ethnicity. Additionally, landmark-based alignment standardizes facial shape prior to processing, reducing the influence of morphological differences across subjects. These Action Units (AUs) are analyzed at 2 to 4 fps because this makes it feasible to obtain intervals shorter than 0.5 s in which micro-expressions can be located, allowing an even greater number of frames to be analyzed.
Illumination directly affects pixel values, so the solutions focus on stabilizing or learning features that remain robust under variable lighting. Preprocessing normalization reduces shadows and homogenizes tonal variations; gradient-based optical flow is inherently less sensitive to illumination changes; and Transformers extract high-level features that are less affected by lighting fluctuations. Moreover, the use of high-speed cameras under controlled laboratory conditions minimized noise and improved training quality.
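As a concrete instance of the preprocessing normalization mentioned above, per-frame standardization (zero mean, unit variance) cancels global illumination offsets; this is a generic sketch, not necessarily the exact normalization used in the pipeline:

```python
import numpy as np

def normalize_frame(frame, eps=1e-6):
    """Zero-mean, unit-variance normalization per frame; reduces the
    effect of global illumination shifts before feature extraction."""
    f = frame.astype(float)
    return (f - f.mean()) / (f.std() + eps)

bright = np.full((4, 4), 200.0)
bright[0, 0] = 220.0
dark = bright - 150.0          # same facial pattern under dimmer lighting
a, b = normalize_frame(bright), normalize_frame(dark)
# a and b are identical after normalization, despite the offset
```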
Finally, pose variation alters facial geometry and hinders the detection of subtle micro-movements. To mitigate this issue, frames with incomplete face visibility were discarded, stable tracking of key regions and action units was performed using landmarks, and the model was trained with data containing multiple head orientations, enhancing positional invariance.

2.3. Applied Approach

Originally, the neural network classifies images by combining features extracted from two distinct branches: one based on the Vision Transformer (vit_pos) and the other on Inception and CBAM layers (incep_sca). The LSTM plays a fundamental role in this architecture, as it facilitates the capture of temporal and long-term dependencies among the combined features. First, the network extracts features from both branches, with each branch highlighting specific aspects of the image. These features are then merged into a single tensor, enriching the image representation by integrating different perspectives.
In the present work, the DualHybridFace model was designed and implemented. This model incorporates the Vision Transformer architecture to interpret images, the Convolutional Block Attention Module (CBAM) to emphasize salient features, and additional convolutional neural network structures. Meanwhile, the Inception architecture applies convolutional filters—specifically, a Laplacian convolution filter in this case—without imposing high computational demands when scaling the image to different sizes. The DualHybridFace model thus combines two branches: a scale branch (incep_sca) based on the Inception and CBAMs, and a channel-position branch (vit_pos) based on the Vision Transformer (ViT) module. The output of both branches is fused and passed through a fully connected layer to obtain the final prediction.
From the analysis of the B-LiT model, a system is obtained for micro-expression detection based on processing video frames for facial analysis and micro-expression identification.
  • Modules
Considering the two branches of DualHybridFace, the system consists of the following four modules:
  • Data Flow: This module is present in each of the subsequent modules. First, channel attention is computed and multiplied by the input, producing an enhanced representation of the most significant channels. Then, spatial attention is calculated and multiplied by the output of the channel attention step, generating a final representation that emphasizes both relevant channels and spatial regions. The final output is the input image with channel and spatial attention applied, allowing the network to focus on the most relevant features for the task.
  • Transformer DualHybridFace_IncepCBAM: This configuration uses only the Inception and CBAM branch (incep_sca), excluding the vit_pos branch. It includes the following hyperparameters:
  • in_channels: Number of input channels (default: 3 for RGB and 2 for grayscale images).
  • num_classes: Number of classes for the classification task (default: 3 for RGB images and 2 for grayscale images).
  • fc: A sequence of fully connected layers that combines the extracted features and performs classification.
  • Transformer DualHybridFace_ViT: This configuration uses only the vit_pos branch based on the Vision Transformer module; it does not include the incep_sca branch nor the LSTM module. It includes the following hyperparameters:
  • in_channels: Number of input channels (default: 3 for RGB images and 2 for grayscale images).
  • num_classes: Number of classes for the classification task (default: 3 for RGB images and 2 for grayscale images).
  • fc: A sequence of fully connected layers that combines the extracted features and performs classification.
  • Transformer DualHybridFace_LSTM: This module is similar to DualHybridFace but incorporates an LSTM layer after the vit_pos branch. The output of the LSTM is combined with the features from the incep_sca branch and passed through a fully connected layer for classification. It includes the following hyperparameters:
  • in_channels: Number of input channels (default: 3 for RGB images and 2 for grayscale images).
  • num_classes: Number of classes for the classification task (default: 3 for RGB images and 2 for grayscale images).
  • hidden_dim: Dimensionality of the additional feature representation (default: 512 for RGB images).
  • fc: A sequence of fully connected layers that combines the features from both branches and performs classification.
  • Hyperparameters
Only two hyperparameters are required to operate this architecture:
  • in_channels: The number of input channels (default: 3 for RGB images and 2 for grayscale images).
  • num_classes: The number of output classes for each convolution branch (default: 3 for RGB images and 2 for grayscale images).
The LSTM processes the combined sequence of features, leveraging its ability to retain information across extended sequences. In image classification tasks where global context is critical, this allows the network to learn linkages and long-term dependencies within the input data. By capturing patterns over lengthy sequences, the LSTM improves the network’s capacity to generalize to unseen input, leading to increased accuracy and resilience in the final classification.
  • Hybrid Architecture
The combination of the Inception, CBAM, and ViT modules within a single model can be highly effective, addressing several limitations inherent to each component individually:
  • Inception: Enables efficient extraction of high-frequency features such as textures and local details, which are crucial in many vision tasks. While CNNs excel at this, pure Transformer models tend to focus more on low-frequency, global dependencies. Combining Inception with ViT allows the system to leverage the strengths of both approaches.
  • Spatial and Channel Attention with CBAM: By introducing spatial and channel attention methods, the CBAM improves performance in tasks like object detection and semantic segmentation by enabling the model to selectively focus on the most informative regions and channels.
  • Global Dependency Capture with ViT: The primary benefit of ViT is its capacity to use self-attention mechanisms to record long-range dependencies throughout an image. This is especially helpful for tasks that require a deep understanding of the scene.
By integrating these three components, the model can:
  • Efficiently extract both high- and low-frequency features (Inception)
  • Selectively emphasize relevant spatial regions and channels (CBAM)
  • Capture global dependencies across the image (ViT)
This synergy leads to enhanced visual representations for classification, detection, and segmentation tasks. The complementary strengths of each module enable the combined model to surpass the performance of each component when used individually.
Moreover, incorporating LSTM modules alongside Inception, CBAM, and ViT enables robust modeling of sequential and temporal dependencies, complementing the spatial and attention-based capabilities of the architecture. The LSTM enhances ViT’s self-attention by encoding inter-token relationships, processes multi-scale features from Inception and CBAM hierarchically, and models temporal dynamics in sequential video data. Thus, the combined Inception–CBAM–ViT–LSTM architecture can effectively integrate spatial, attention-based, global, and temporal information for richer visual understanding.
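The channel-then-spatial attention flow of CBAM described in the Data Flow module can be sketched in a few lines. This is a deliberately reduced version, assuming average- and max-pooled statistics feed a plain sigmoid (the shared MLP and 7×7 convolution of the full CBAM are omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(x):
    """Minimal CBAM-style pass over a (C, H, W) feature map:
    channel attention first, then spatial attention on its output."""
    # Channel attention: one weight per channel from avg + max pooling
    ch = sigmoid(x.mean(axis=(1, 2)) + x.max(axis=(1, 2)))   # (C,)
    x = x * ch[:, None, None]
    # Spatial attention: one weight per pixel from avg + max over channels
    sp = sigmoid(x.mean(axis=0) + x.max(axis=0))             # (H, W)
    return x * sp[None, :, :]

feat = np.random.randn(8, 4, 4)   # 8 channels, 4x4 spatial map
out = cbam(feat)                  # same shape, attention applied
```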
  • DualHybridFace Transformer Model
This model is inspired by the Dual Attention Network (DualATTNet or DANet), originally developed for segmenting images into regions corresponding to specific visual characteristics [33]. Since it incorporates two attention modules that rely on self-attention mechanisms, it enables improved feature detection. These two modules are positional attention and channel attention, and their interaction is illustrated in Figure 5. The first learns spatial relationships among pixels and updates each pixel’s representation by considering similar features across the entire image. The second analyzes correlations among multiple feature maps (channel maps), each representing responses to specific visual patterns—for example, the “grass” channel map may correlate with the “vegetation” or “tree” channel maps [34].

2.4. Transformer

Transformer networks are deep learning architectures that use self-attention mechanisms to handle sequential data such as text, images, or audio. Self-attention enables the model to give varying weights to each segment of the input depending on its significance and relationships with the remainder of the sequence. In this way, the model can understand the contextual and semantic connections in the data, resulting in more precise and natural predictions [35].
Transformer models are structured around an encoder–decoder architecture, similar to seq2seq models, but without the use of recurrent or convolutional networks [35]. Instead, they rely on attention layers, which can be categorized into three types: scaled dot-product attention, multi-head attention, and encoder–decoder attention. These layers enable the model to capture relationships between elements in both the input and output sequences, producing encoded representations that contain contextual information.
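The scaled dot-product attention mentioned above can be written out directly; this numpy sketch shows the standard formulation, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, for a single head:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

q = np.random.randn(5, 16)   # 5 query tokens, d_k = 16
k = np.random.randn(5, 16)
v = np.random.randn(5, 16)
out, w = scaled_dot_product_attention(q, k, v)
# each row of w is a probability distribution over the 5 tokens
```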
The proposed B-LiT model (By Intelligent Learning for Microexpressions in Visible and Infrared Light) consists of a set of four architectures that use the Transformer model as a foundation, with modules specifically designed for the task of facial micro-expression recognition. The model was developed, trained, and evaluated on a portable ASUS TUF Gaming FX505GM laptop equipped with an Intel(R) Core(TM) i5-8300H CPU @ 2.30 GHz, 16 GB DDR4-SDRAM, an NVIDIA GeForce GTX 1060 graphics card with 6 GB VRAM, running Windows 11 Home 64-bit and Python version 3.9.0.
The prototype for recognizing facial micro-expressions was developed utilizing a Vision Transformer (ViT) model (refer to Figure 6). The process can be outlined in the following way:
  • The model receives an input image to be classified. The ViT partitions this image into small blocks called patches, which are then transformed into numerical vectors through a process known as linear embedding, analogous to describing the colors of a visual scene using descriptive terms.
  • After embedding the patches, the model incorporates positional embeddings, which allow it to retain information about the original spatial arrangement of each patch. This step is critical, as the meaning of visual components may depend on their spatial relationships.
  • Once the patches have been embedded and assigned positional information, they are arranged into a sequence and processed through a Transformer encoder. This encoder functions as a mechanism that learns the relationships between patches, forming a holistic representation of the image.
  • Finally, to enable image classification, a special classification token is appended at the beginning of the sequence. This token is trained jointly with the rest of the model and ultimately contains the information necessary to determine the image category.
Consequently, the ViT can be conceptualized as a puzzle that takes an image, divides it into pieces, represents those pieces in a language the computer can interpret, and then reassembles them to determine the image’s content.
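The patch-embedding steps above can be sketched as follows. The projection and positional embeddings are random stand-ins for the learned parameters, using the image size (14) and patch size (1) listed later in the hyperparameters:

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into N = (H*W)/p^2 flattened patches
    of length p*p*C, in row-major patch order."""
    h, w, c = img.shape
    patches = img.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)

img = np.random.rand(14, 14, 3)
tokens = patchify(img, 1)                  # (196, 3): one token per pixel
# prepend a [CLS] token and add positional embeddings (random stand-ins)
cls = np.zeros((1, tokens.shape[1]))
pos = np.random.randn(tokens.shape[0] + 1, tokens.shape[1]) * 0.01
seq = np.concatenate([cls, tokens]) + pos  # (197, 3) sequence for the encoder
```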
To construct the model, ten primary modules were implemented to ensure the appropriate processing of frames:
  • Patch Embedding: The image is divided into patches and converted into linear embeddings. This is the initial step in preparing the image so that the Transformer can interpret it.
  • Classification Token: A special token added to the sequence of embeddings that, after passing through the Transformer, contains the necessary information for image classification.
  • Positional Embeddings: Incorporated into the patch embeddings to preserve spatial details about the original position of each patch in the image.
  • Transformer Blocks: A series of blocks that sequentially process the embeddings using attention mechanisms to understand relationships among the different patches.
  • Layer Normalization: Applied to stabilize the embedding values before and after passing through the Transformer blocks.
  • Representation Layer or Pre-Logits: An optional layer that may transform the extracted features before final classification, depending on whether a representation size has been defined (patch size).
  • Classification Head: The final component of the model that maps the processed features to the predicted classes.
  • Mask Generation: An additional layer suggesting that the model may also be designed for segmentation tasks by producing a mask for the image.
  • Weight Initialization: Functions that initialize the weights and biases of linear and normalization layers with specific values, providing a suitable starting point for training.
  • Additional Functions: Supplementary functions required to exclude certain parameters from weight decay, manipulate the classification head, and define the data flow throughout the model.
In addition to the modules implemented in the ViT, it is necessary to define the hyperparameters required for the correct operation of the model, as these standardize and regulate the information processing. The following lists the hyperparameters used (number or specification indicated in parentheses):
  • Image Size: Defines the size of the input image and determines how it will be divided into patches (14).
  • Patch Size: Specifies the dimensions of each patch (1).
  • Input Channels: Indicates the number of channels in the input image (3).
  • Number of Classes: Determines the number of output categories for the classification head (1000).
  • Embedding Dimension: The embedding dimension for each patch, representing the feature space in which the Transformer operates (512).
  • Depth: The depth of the Transformer, referring to the number of sequential Transformer blocks in the model (3).
  • Number of Attention Heads: The number of attention heads in each Transformer block, which allows the model to attend to different parts of the image simultaneously (4).
  • MLP Ratio: The ratio between the hidden layer size of the multilayer perceptron (MLP) and the embedding dimension (2).
  • Query-Key-Value Attention Bias: Enables bias terms in the query, key, and value projections when set to true (True).
  • Attention Dropout Rate: The dropout rate applied specifically to the attention mechanism for regularization (0.3).
  • Attention Head Dropout Type: Specifies the dropout strategy applied to the attention heads (e.g., HeadDropOut).
  • Attention Head Dropout Rate: The dropout rate applied to the attention heads (0.3).
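As an illustration, the listed hyperparameters can be collected in a configuration dictionary; the key names below are illustrative rather than the authors' code, and the derived quantities follow directly from the definitions above:

```python
# Hypothetical configuration mirroring the hyperparameters listed above.
# Key names are illustrative, not the authors' implementation.
vit_config = {
    "img_size": 14,
    "patch_size": 1,
    "in_chans": 3,
    "num_classes": 1000,
    "embed_dim": 512,
    "depth": 3,
    "num_heads": 4,
    "mlp_ratio": 2,
    "qkv_bias": True,
    "attn_drop_rate": 0.3,
    "head_drop_rate": 0.3,
}

# Quantities implied by the configuration:
num_patches = (vit_config["img_size"] // vit_config["patch_size"]) ** 2  # patches per image
head_dim = vit_config["embed_dim"] // vit_config["num_heads"]            # per-head dimension
mlp_hidden = vit_config["embed_dim"] * vit_config["mlp_ratio"]           # MLP hidden size
```

With a 14 × 14 image and 1 × 1 patches, the model operates on 196 patch tokens plus the classification token.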

2.5. Mathematical Foundation

The input image has height H, width W, and C channels; as previously mentioned, the image is divided into two-dimensional patches. This results in $N = \frac{HW}{p^2}$ patches, where each patch has a resolution of $(p, p)$ pixels.
Prior to inputting the data into the Transformer, the subsequent processes are performed:
  • Each image patch is flattened into a vector $x_p^n$ of length $p^2 \times C$, where $n = 1, \ldots, N$.
  • A series of embedded image patches is produced by mapping the flattened patches into D dimensions using a trained linear projection E .
  • A learnable class embedding x class is prepended to the sequence of embedded image patches. The value of x class represents the classification output y .
The patch embeddings are then augmented with one-dimensional positional embeddings $E_{\mathrm{pos}}$, thereby injecting positional information into the input; these embeddings are also learned during training. The resulting sequence of embedding vectors after these operations is:
$z_0 = \left[\, x_{\mathrm{class}};\; x_p^1 E;\; \ldots;\; x_p^N E \,\right] + E_{\mathrm{pos}}$
To perform classification, $z_0$ is fed into the input of the Transformer encoder, which consists of a stack of L identical layers. Then, the value of $x_{\mathrm{class}}$ at the output of the Lth encoder layer is taken and passed through a classification head.
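The preparation steps above (patch flattening, linear projection, class token, and positional embeddings) can be sketched as follows; the sizes and random weights are illustrative only, standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, p, D = 4, 4, 3, 2, 8            # illustrative sizes
N = (H * W) // p**2                      # number of patches

img = rng.standard_normal((H, W, C))
# Flatten each (p x p x C) patch into a vector of length p^2 * C.
patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, p * p * C)

E = rng.standard_normal((p * p * C, D))  # linear projection (random stand-in)
x_class = rng.standard_normal((1, D))    # learnable class token
E_pos = rng.standard_normal((N + 1, D))  # 1-D positional embeddings

# z0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos
z0 = np.concatenate([x_class, patches @ E], axis=0) + E_pos
```

The resulting sequence has N + 1 tokens of dimension D, matching the shape expected by the Transformer encoder.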
Now, a Classification Head can be partitioned into Single-Head Attention and Multi-Head Attention. First, Single-Head Attention is defined in order to introduce the latter, which is the one implemented. Each attention mechanism is addressed separately.
  • Single-Head Attention
Every attention module in a Transformer is based on Q (Queries), K (Keys), and V (Values), which form information matrices, as illustrated in Figure 7 [37].
Considering the above, the attention module can be defined as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q \cdot K^T}{\sqrt{d_k}} \right) \cdot V$
where
  • Q is the Query
  • K is the Key
  • V is the value
  • $d_k$ is the key dimension
  • $(\cdot)^T$ denotes the matrix transpose
Then, it can be specified as follows:
$A = \frac{Q \cdot K^T}{\sqrt{d_k}}$
From which it can be obtained:
$qk\_scale = \frac{1}{\sqrt{d_k}} = \mathrm{chan}^{-0.5}$
where qk_scale is one of the hyperparameters required for training, and chan refers to the channel or channels of the model.
Moreover, it is also necessary to compute the matrix multiplication Q·KT, as illustrated in Figure 8 [36].
Finally, the value of the final x is computed. It is important to note that, in some cases, it may also be necessary to reduce the dimensions of this output, as shown in Figure 9:
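A minimal sketch of the single-head attention computation defined above; the sequence length and key dimension are illustrative:

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax along the given axis.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    qk_scale = d_k ** -0.5              # the qk_scale hyperparameter
    A = softmax(Q @ K.T * qk_scale)     # (T, T) attention weights
    return A @ V

rng = np.random.default_rng(1)
T, d_k = 6, 16                          # illustrative sequence length and key dim
Q, K, V = (rng.standard_normal((T, d_k)) for _ in range(3))
x = single_head_attention(Q, K, V)
```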
  • Multi-head Attention
In the context of the multi-head attention mechanism in Transformer neural networks, Queries (Q), Keys (K), and Values (V) are computed following the same procedure as in single-head attention. However, a key difference lies in how the data is structured. Instead of having a single large matrix for each of the components Q, K, and V, these matrices are divided into multiple smaller segments, one for each attention head. This is achieved by separating the channel dimension (chan) by the number of heads (num_heads), resulting in a new token length for each segment.
The total size of the Q, K, and V matrices does not change, but their contents are redistributed across the head dimension. This can be visualized as segmenting the single-head matrices into multiple smaller matrices, one for each attention head (see Figure 10) [37].
The submatrices are denoted as Q_Hi for Query Head i , from which the following can be defined:
$A_{H_i} = \frac{Q_{H_i} \cdot K_{H_i}^T}{\sqrt{d_k}}$
So, for each Query Head we have:
$d_k = \frac{\mathrm{chan}}{num\_heads}$
This yields the structure shown in Figure 11.
The softmax computation over A remains unchanged, and its dimensions do not change either. The output of each head is then:
$x_{H_i} = A_{H_i} \times V_{H_i}$
With this, we finally obtain the structure shown in Figure 12.
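The head-splitting described above can be sketched as follows, using the channel size of 512 and 4 heads from the hyperparameter list; the reshaping convention is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
T, chan, num_heads = 6, 512, 4
d_k = chan // num_heads                    # per-head dimension: chan / num_heads

def split_heads(M):
    # (T, chan) -> (num_heads, T, d_k): redistribute channels across heads.
    return M.reshape(T, num_heads, d_k).transpose(1, 0, 2)

Q = rng.standard_normal((T, chan))
Q_h = split_heads(Q)
# The total size is unchanged; the contents are only redistributed
# across the head dimension.
same_size = (Q_h.size == Q.size)
```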
In Positional Embedding (PE), the order of elements matters (e.g., pixels in an image). Positional embeddings help the model recognize the spatial structure and context of the input data. Typically, positional embeddings are added to the standard embeddings (pixel embeddings in computer vision) to provide the model with information about the position or order of elements in a sequence. This allows the model to consider the relative positions of elements when processing the input, which can be crucial for capturing temporal or spatial dependencies in the data and improving model performance in tasks such as machine translation or object recognition in images (see Figure 13).
And it can be defined as follows:
$\theta_{ij} = \frac{i}{10{,}000^{\, 2j / d}}$
$PE(i,\, 2j) = \sin(\theta_{ij})$
$PE(i,\, 2j + 1) = \cos(\theta_{ij})$
where:
  • i is the Token Number.
  • j is the Projection of the Token Length.
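A sketch of the sinusoidal positional embedding defined by the equations above; the token count and embedding dimension are illustrative:

```python
import numpy as np

def positional_embedding(num_tokens, d):
    """Sinusoidal PE: theta_ij = i / 10000^(2j/d), sin on even and cos on odd dims."""
    i = np.arange(num_tokens)[:, None]      # token index
    j = np.arange(d // 2)[None, :]          # projection index over the token length
    theta = i / 10_000 ** (2 * j / d)
    pe = np.zeros((num_tokens, d))
    pe[:, 0::2] = np.sin(theta)             # PE(i, 2j)   = sin(theta_ij)
    pe[:, 1::2] = np.cos(theta)             # PE(i, 2j+1) = cos(theta_ij)
    return pe

pe = positional_embedding(num_tokens=10, d=8)
```

For token 0 every angle is zero, so the even dimensions are 0 and the odd dimensions are 1, giving each position a unique, smoothly varying signature.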
Finally, the num_tokens parameter that is used to define the order and number of tokens for the Token Transformer is defined visually in Figure 14 [37]:
Mathematically, it may be described in the following way:
$num\_tokens = \left( \frac{h + 2p - (k - 1) - 1}{s} + 1 \right) \times \left( \frac{w + 2p - (k - 1) - 1}{s} + 1 \right)$
where:
  • h is the height of the image.
  • w is the width of the image.
  • s is the stride, defined as $s = \lceil k/2 \rceil$.
  • p is the padding, defined as $p = \lceil k/4 \rceil$.
  • k is the kernel.
ceil refers to the ceiling function.
Reconstructing the image layer considering the channel [37], the batch and the num_tokens gives the representation in Figure 15:
Transforming to its 2D version, it can be represented as in Figure 16:
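The token-count formula above can be sketched as follows, assuming the integer (floor) division used in standard convolution output arithmetic; the input sizes are illustrative:

```python
import math

def num_tokens(h, w, k):
    """Token count with s = ceil(k/2) and p = ceil(k/4), as defined above."""
    s = math.ceil(k / 2)                       # stride
    p = math.ceil(k / 4)                       # padding
    out = lambda d: (d + 2 * p - (k - 1) - 1) // s + 1
    return out(h) * out(w)

# Illustrative example: a 14 x 14 feature map with a 3 x 3 kernel
# gives s = 2, p = 1, i.e. a 7 x 7 token grid.
tokens = num_tokens(14, 14, 3)
```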
  • Convolutional Block Attention Module (CBAM) Model
CBAM is a component used to enhance convolutional neural networks by allowing them to learn to focus on the most important features in both the channel and spatial dimensions of input images. This attention-based approach helps the network concentrate on relevant features while suppressing irrelevant ones, which can lead to improved performance in diverse tasks, such as image classification (see Figure 17). This capability is particularly well-suited for detecting facial features, especially the subtle and imperceptible changes associated with micro-expressions [36].
For this architecture, only three modules are used for its operation:
Channel Attention: This module uses max-pooling and average-pooling operations along the spatial dimension to obtain distinct and complementary channel features. A shared fully connected layer is applied to both the max-pooled and average-pooled features to efficiently capture channel information. The channel features are combined and passed through a sigmoid function to obtain the channel attention map, representing the relative significance of each channel. Finally, the channel attention map is multiplied by the input to emphasize essential channels and suppress less important ones.
Spatial Attention: This module captures spatial features by averaging and taking the maximum across the channel dimension, producing unique and complementary spatial features. These spatial characteristics are concatenated and passed through a 2D convolution followed by a sigmoid function to generate the spatial attention map, which represents the relative importance of each spatial region. The spatial attention map is multiplied by the input to emphasize important regions and suppress less relevant ones.
Data Flow: First, the channel attention is computed and multiplied by the input, producing an enhanced representation of important channels. Then, spatial attention is calculated and multiplied by the output of the channel attention, producing a final representation that emphasizes both important channels and spatial regions. The final output is the input image with both channel and spatial attention applied, enabling the network to focus on the most significant features.
The model has only three hyperparameters necessary to operate this architecture:
  • Channel (channel: 48): The number of input channels.
  • Reduction (reduction: 16): Used to reduce the channel dimension in the fully connected layers, enabling greater computational efficiency.
  • Kernel Size (k_size: 3): The kernel size for the 2D convolution used in spatial attention.
The Spatial Attention Module (SAM) consists of a sequential three-step operation. The initial step is called the Channel Pool, where the input tensor of dimensions $c \times h \times w$ (where c is the channel, h is the height, and w is the width) is decomposed into two channels, i.e., $2 \times h \times w$, where the two channels represent Max Pooling and Average Pooling across the channels. This serves as input to a convolutional layer that produces a single-channel feature map, i.e., the output dimension is $1 \times h \times w$. This convolution preserves the spatial dimensions using padding. In the implementation, the convolution is followed by a Batch Normalization layer to normalize and scale the convolution output. Some approaches offer the alternative of using a ReLU activation after the convolution layer; however, by default, only Convolution + Batch Norm is used. The output is then passed through a Sigmoid Activation layer. The sigmoid function, being a probabilistic activation function, maps all values to a range between 0 and 1. This spatial attention mask is applied to all feature maps in the input tensor by element-wise multiplication.
The mathematical model is:
1. Channel Pooling
$\mathrm{MaxPooling}(X)_{ij} = \max_{k = 1, \ldots, c} X_{ijk}$
$\mathrm{AvgPooling}(X)_{ij} = \frac{1}{c} \sum_{k=1}^{c} X_{ijk}$
where:
  • X is the input tensor of dimensions c × h × w.
  • i and j are the spatial coordinates.
  • c is the number of channels.
  • h, w are the height and width of the image, correspondingly.
2. Convolution Layer
$Y_{ij} = \sum_{m=-k}^{k} \sum_{n=-k}^{k} X_{i+m,\, j+n} \times W_{m,n}$
where:
  • Y is the convolution output.
  • X is the input.
  • W is the convolution kernel.
  • k is the kernel size.
3. Batch Normalization
$\mathrm{BatchNorm}(Y)_{ij} = \gamma\, \frac{Y_{ij} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$
where:
  • μ and σ2 are the mean and variance of Y calculated over the batch.
  • γ and β are the learned scale and bias parameters, respectively.
  • ϵ is a constant for numerical stability.
4. Sigmoid Activation
$\mathrm{Sigmoid}(Z)_{ij} = \frac{1}{1 + e^{-Z_{ij}}}$
where:
  • Z is the input of the sigmoid function, which is the output of the Batch Normalization layer.
5. Spatial Attention Mask
$\mathrm{Output}_{ijk} = X_{ijk} \times \mathrm{Sigmoid}(Z)_{ij}$
where:
  • X is the input tensor.
  • Z is the output of the convolution layer followed by Batch Normalization.
This data flow is represented in Figure 18.
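The SAM data flow above can be sketched as follows; Batch Normalization is omitted for brevity, and the kernel weights are random stand-ins for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(X, W):
    """SAM sketch: channel pool -> 'same' 2-D convolution -> sigmoid -> mask.
    X has shape (c, h, w); W is a (2, k, k) kernel over the pooled channels."""
    c, h, w = X.shape
    pooled = np.stack([X.max(axis=0), X.mean(axis=0)])   # channel pool: (2, h, w)
    k = W.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    Z = np.zeros((h, w))
    for i in range(h):                                   # naive 'same' convolution
        for j in range(w):
            Z[i, j] = np.sum(padded[:, i:i + k, j:j + k] * W)
    return X * sigmoid(Z)                                # broadcast mask over channels

rng = np.random.default_rng(3)
X = rng.standard_normal((48, 7, 7))                      # channel: 48, as configured
out = spatial_attention(X, rng.standard_normal((2, 3, 3)))  # k_size: 3
```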
The Channel Attention Module (CAM) is another sequential operation, more complex than the Spatial Attention Module (SAM). The CAM resembles a Squeeze-and-Excitation (SE) layer.
Specifically:
  • Squeeze Operation: The “squeeze” operation consists of reducing the spatial dimensions of a feature tensor X (with dimensions $c \times h \times w$) to a per-channel descriptor (dimensions $c \times 1 \times 1$). This is achieved through Global Average Pooling:
$S_i = \frac{1}{h \times w} \sum_{j=1}^{h} \sum_{k=1}^{w} X_{ijk} \quad \text{for} \quad i = 1, \ldots, c$
where:
  • Si represents the “squeeze” value for channel i , indicating the importance of the channel relative to the other channels.
  • Excitation Operation: The “excitation” operation uses fully connected layers to model the relationships between channels and to learn attention weights.
$E_i = \sigma\!\left( W_2\, \delta(W_1 S_i) \right) \quad \text{for} \quad i = 1, \ldots, c$
where:
  • δ represents an activation function (in this case, ReLU).
  • W1 and W2 are learned weight matrices.
  • Scale Operation: The “scale” operation scales the original feature channels using the attention weights calculated in the “excitation” stage.
$Y_{ijk} = E_i\, X_{ijk} \quad \text{for} \quad i = 1, \ldots, c$
where:
  • Yijk is the final value of the output tensor Y after applying channel attention, where each value of channel i at spatial position (j,k) is scaled by the excitation weight Ei.
These equations model the Channel Attention Module (CAM). As illustrated in Figure 19, they represent pooling operations, nonlinear transformations, and attention-based scaling, enabling the model to focus on specific channels and enhance feature representations according to their relative importance.
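The squeeze, excitation, and scale operations can be sketched as follows, using the channel and reduction hyperparameters listed earlier; the weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(X, W1, W2):
    """CAM sketch: squeeze -> excitation -> scale, for X of shape (c, h, w)."""
    S = X.mean(axis=(1, 2))                 # squeeze: global average pool, (c,)
    E = sigmoid(W2 @ relu(W1 @ S))          # excitation: FC -> ReLU -> FC -> sigmoid
    return X * E[:, None, None]             # scale each channel by its weight

rng = np.random.default_rng(4)
c, reduction = 48, 16                       # channel: 48, reduction: 16, as configured
X = rng.standard_normal((c, 7, 7))
W1 = rng.standard_normal((c // reduction, c))   # bottleneck reduces channel dim
W2 = rng.standard_normal((c, c // reduction))   # and restores it
out = channel_attention(X, W1, W2)
```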
  • Inception Model Architecture
The Inception module is a convolutional neural network designed to address the problem of representing patterns at different spatial scales. The central component of this architecture is the Inception block, as shown in Figure 20, which permits the network to efficiently learn characteristics from kernels of various sizes. By using convolutional filters of different sizes in parallel, the network can obtain patterns at multiple scales without a significant increase in computational cost. Furthermore, the different branches of the Inception block learn complementary representations of the input data, which can lead to improved performance across various tasks. The Inception block can be stacked and combined with other layers and modules to build more complex and deeper convolutional neural networks, as in this case, its combination with CBAM and ViT [36,38].
Modules
For this architecture, only two modules are considered necessary for its operation:
Inception Block: It consists of four parallel branches that apply different convolution operations. Convolutions of larger sizes could also be applied, but in this case only kernels up to 5 × 5 are considered: a 1 × 1 branch, a 3 × 3 branch, a 5 × 5 branch, and a Max-Pooling branch.
Data Flow: The input is propagated through all branches in parallel. The outputs from all branches are concatenated along the channel dimension. The final output is the concatenation of the features learned by each branch with different kernel sizes, allowing the network to capture patterns at multiple spatial scales.
Hyperparameters
Two hyperparameters are defined for the model of integration with the CBAM, the parameter configuration is set as follows:
  • in_channels: 3 input channels.
  • out_channels: 6 output channels for each convolutional branch.
Equations
To define the equations of the Inception model, the ViT model and the definitions from that architecture must first be reviewed. ViT first divides the input image into a sequence of tokens, and each patch token is projected into a hidden vector through a linear layer, denoted as:
$x_1, x_2, \ldots, x_N, \quad \text{or} \quad X \in \mathbb{R}^{N \times C}$
where:
  • N is the number of patch tokens.
  • C denotes the feature dimension.
All tokens are combined with the Positional Embedding and fed into the ViT layers, which comprise Multi-Head Self-Attention (MSA) and a Feedforward Neural Network (FFN). Within the context of the Inception architecture, three fundamental components are required to model it mathematically.
First, the Inception Token Mixer, which is used to inject the deep capability of Convolutional Neural Networks (CNNs)—for extracting high-frequency representations—into ViTs. Instead of feeding the image tokens directly into the MSA mixer, the Inception mixer first splits the input feature along the channel dimension and then feeds the split modules, respectively, into the High-Frequency Mixer and the Low-Frequency Mixer. Here, the high-frequency mixer consists of a max-pooling operation and a parallel convolution operation, whereas the low-frequency mixer is implemented via self-attention.
Mathematically, this can be expressed as $X \in \mathbb{R}^{N \times C}$, which is factorized as:
$X_h \in \mathbb{R}^{N \times C_h} \quad \text{and} \quad X_l \in \mathbb{R}^{N \times C_l}$
where:
  • X h is the High-Frequency Mixer.
  • X l is the Low-Frequency Mixer.
  • C h denotes the feature dimension of the High-Frequency Mixer.
  • C l denotes the feature dimension of the Low-Frequency Mixer.
All of this is along the channel dimension, where C h + C l = C . Then, X h and X l are transferred to the High-Frequency Mixer and the Low-Frequency Mixer, respectively.
For the High-Frequency Mixer, considering the sharp sensitivity of the max-pooling filter and the detail perception of the convolution operation, a parallel structure is proposed to learn the high-frequency components by splitting the input X h into:
$X_{h1} \in \mathbb{R}^{N \times \frac{C_h}{2}} \quad \text{and} \quad X_{h2} \in \mathbb{R}^{N \times \frac{C_h}{2}}$
Both are split along the channel dimension. $X_{h1}$ is embedded with Max-pooling and a Linear layer, and $X_{h2}$ is fed into a Linear layer and a Depthwise convolution layer:
$Y_{h1} = \mathrm{FC}(\mathrm{MaxPool}(X_{h1}))$
$Y_{h2} = \mathrm{DwConv}(\mathrm{FC}(X_{h2}))$
where:
  • Y h 1 and Y h 2 denote the outputs of the high-frequency mixers.
  • FC is the Fully Connected layer, referring to a linear or dense layer.
  • MaxPool performs max subsampling to reduce resolution and capture invariant features.
  • DwConv refers to the Depthwise Convolution layer (channel-wise separable) and efficiently applies convolutions separately for each channel to capture spatial and channel-wise patterns.
On the other hand, the Low-Frequency Mixer uses standard multi-head self-attention to transmit information across all tokens. Despite the strong capacity of attention to learn macro representations, the high resolution of the feature maps would incur a substantial computational cost in the lower layers. Thus, an average-pooling layer is applied to decrease the spatial scale of $X_l$ before the attention operation, and an upsampling layer is used to restore the original spatial dimension after attention. This alternative reduces computational costs and allows the attention operation to incorporate global information.
This branch can be defined as:
$Y_l = \mathrm{Upsample}(\mathrm{MSA}(\mathrm{AvePooling}(X_l)))$
where:
  • Y l is the output of the Low-Frequency Mixer.
  • Upsample is an operation that improves the spatial resolution of a feature or feature map.
  • MSA (Multi-Head Self-Attention) enables capturing global dependencies among tokens.
  • AvePooling (Average Pooling) performs subsampling by averaging regions to reduce resolution and smooth features.
The outputs of the three branches are then concatenated along the channel dimension:
$Y_c = \mathrm{Concat}(Y_l,\; Y_{h1},\; Y_{h2})$
Therefore, the Transformer Inception block is defined as:
$Y = X + \mathrm{ITM}(\mathrm{LN}(X)) \quad \text{and} \quad H = Y + \mathrm{FFN}(\mathrm{LN}(Y))$
where:
  • ITM is the Inception Token Mixer.
  • FFN is the Feedforward Neural Network.
  • LN denotes Layer Normalization.
These are the main equations that model the Inception Transformer. The model combines the capability of CNNs to capture high-frequency details with the ability of Transformers to capture global dependencies, through a novel design of High- and Low-Frequency Mixers, in addition to a frequency ramp structure.
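The split-and-merge structure of the Inception Token Mixer can be sketched as follows; the linear, convolution, and attention blocks are replaced by simple stand-ins, so this shows only the channel factorization and fusion, not the actual mixers:

```python
import numpy as np

def inception_token_mixer(X, C_h):
    """Reduced sketch of the Inception Token Mixer. The FC/DwConv layers and the
    attention block are replaced by simple stand-ins (max over a token shift,
    a token difference, and a global mean), to show only the split/merge flow."""
    N, C = X.shape
    X_h, X_l = X[:, :C_h], X[:, C_h:]                   # channel split, C_h + C_l = C
    X_h1, X_h2 = X_h[:, :C_h // 2], X_h[:, C_h // 2:]   # split the high-freq branch
    Y_h1 = np.maximum(X_h1, np.roll(X_h1, 1, axis=0))   # max-pooling stand-in
    Y_h2 = X_h2 - np.roll(X_h2, 1, axis=0)              # depthwise-conv stand-in
    Y_l = np.broadcast_to(X_l.mean(axis=0), X_l.shape)  # attention stand-in
    return np.concatenate([Y_l, Y_h1, Y_h2], axis=1)    # Concat(Y_l, Y_h1, Y_h2)

rng = np.random.default_rng(5)
X = rng.standard_normal((16, 64))   # 16 tokens, 64 channels (illustrative)
Y = inception_token_mixer(X, C_h=32)
```

The output keeps the input shape, since the mixer only redistributes and fuses the channel groups.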
  • Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks constitute a distinct variant of recurrent neural networks (RNNs), meticulously engineered to address challenges associated with long-term sequence processing and temporal dependencies. In contrast to conventional RNNs, which are susceptible to the vanishing gradient phenomenon, LSTMs possess the capability to acquire long-term dependencies owing to their distinctive internal configuration. This configuration encompasses a series of gates that modulate the information flow, thereby enabling the network to judiciously retain and discard information.
Architecture of an LSTM Cell
An LSTM cell comprises four fundamental components: the memory cell, the input gate, the forget gate, and the output gate. The operational characteristics of these gates are delineated through a series of mathematical formulations that regulate the information flow within the cell [39]. The architecture is illustrated in Figure 21.
Hyperparameters
In the implementation of an LSTM, the parameters required to configure an LSTM cell are as follows:
  • input_size: Number of input features per time step.
  • hidden_size: Dimensionality of the hidden vector ( h t ) and the cell state ( C t ).
  • batch_first: If True, the LSTM input and output will have the shape ( batch _ size , seq _ length , feature _ size ) .
The parameter configuration for the present case is:
input_size: 14 × 14 × ( 256 + 256 ) .
hidden_size: 512.
batch_first: True.
Cell State: The cell state functions as a long-term memory, transmitting relevant information throughout the sequence. It is modified by both the input gate and the forget gate. The cell is updated at each time step, and the input and forget gates control what information is added or removed.
Forget Gate: The forget gate determines how much of the prior information should be forgotten. A value close to 0 indicates that the information should be discarded, while a value close to 1 indicates that it should be retained. This is mathematically modeled as follows:
$f_t = \sigma\!\left( W_f \cdot [h_{t-1},\, x_t] + b_f \right)$
where
  • ft is the activation vector of the forget gate.
  • σ is the sigmoid function, which maps input values into the range [0, 1].
  • Wf is the weight matrix for the forget gate.
  • $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input.
  • bf is the bias of the forget gate.
Input Gate: The input gate controls the amount of new information added to the cell state. The candidate memory ($\tilde{C}_t$) denotes the new candidate information that may be added, which is represented as:
$i_t = \sigma\!\left( W_i \cdot [h_{t-1},\, x_t] + b_i \right) \quad \text{and} \quad \tilde{C}_t = \tanh\!\left( W_c \cdot [h_{t-1},\, x_t] + b_c \right)$
where
  • i t is the activation vector of the input gate.
  • σ serves as the sigmoid function, which regulates the amount of new information added.
  • C ~ t is the new candidate memory that can be added to the cell state.
  • tanh is the hyperbolic tangent function, which converts input values into the range [−1, 1].
  • Wi, bi are the weights and bias of the input gate, correspondingly.
  • Wc, bc are the weights and bias for the candidate memory, correspondingly.
Cell State Update: The cell state is updated by combining the retained information (modulated by $f_t$) with the new information (modulated by $i_t$):
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
where
  • C t is the updated cell state.
  • ⊙ denotes the element-wise product operation.
  • $f_t \odot C_{t-1}$ represents the information retained from the previous cell state.
  • $i_t \odot \tilde{C}_t$ represents the new information added to the cell state.
Output Gate: Finally, the output gate determines how much information from the current cell state should be emitted as the hidden state ( h t ) and the network output at that time step. Mathematically, this can be defined as:
$o_t = \sigma\!\left( W_o \cdot [h_{t-1},\, x_t] + b_o \right) \quad \text{and} \quad h_t = o_t \odot \tanh(C_t)$
where
  • ot is the activation vector of the output gate.
  • ht is the hidden state and output of the LSTM at the current time step.
  • Wo is the weight matrix for the output gate.
  • h t 1 ,   x t contains the concatenation of the prior hidden state and the current input.
  • bo is the bias of the output gate.
  • σ serves as the sigmoid function, which regulates the amount of information emitted.
  • tanh is the hyperbolic tangent function applied to the cell state.
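The four gate equations can be sketched as a single LSTM step; the input and hidden sizes are reduced for illustration (the configuration above uses much larger dimensions), and the packed weight layout is an assumption, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W packs the four gate weight matrices
    [W_f; W_i; W_c; W_o], each acting on the concatenation [h_{t-1}, x_t]."""
    hx = np.concatenate([h_prev, x_t])
    d = h_prev.size
    z = W @ hx + b
    f_t = sigmoid(z[0:d])                  # forget gate
    i_t = sigmoid(z[d:2 * d])              # input gate
    C_tilde = np.tanh(z[2 * d:3 * d])      # candidate memory
    o_t = sigmoid(z[3 * d:4 * d])          # output gate
    C_t = f_t * C_prev + i_t * C_tilde     # cell state update (element-wise)
    h_t = o_t * np.tanh(C_t)               # hidden state / output
    return h_t, C_t

rng = np.random.default_rng(6)
input_size, hidden_size = 20, 8            # illustrative; far smaller than the paper's
W = rng.standard_normal((4 * hidden_size, hidden_size + input_size)) * 0.1
b = np.zeros(4 * hidden_size)
h, C = np.zeros(hidden_size), np.zeros(hidden_size)
h, C = lstm_step(rng.standard_normal(input_size), h, C, W, b)
```

Because $o_t \in (0, 1)$ and $\tanh(C_t) \in (-1, 1)$, every component of the hidden state is bounded in magnitude by 1.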
  • Optical Flow
Methods for estimating optical flow are based on two principles: brightness constancy and small motion. Brightness constancy assumes that the grayscale intensity of a moving object remains unchanged, while small motion assumes that the velocity vector field changes gradually over a short time interval.
Thus, a pixel can be defined as $I(x, y, t)$ in a video clip, such as those in our datasets, which moves by $\Delta x$, $\Delta y$, $\Delta t$ to the next frame. According to the brightness constancy assumption, the pixel intensity before and after motion remains constant, allowing us to derive:
$I(x, y, t) = I(x + \Delta x,\; y + \Delta y,\; t + \Delta t)$
The right-hand side of Equation (33) can be expanded by a Taylor series, resulting in Equation (34).
$I(x + \Delta x,\; y + \Delta y,\; t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t + \varepsilon$
where ε represents the higher-order term, which can be ignored.
The variables $u$ and $v$ represent the horizontal and vertical components of the optical flow, respectively, as $u = \Delta x / \Delta t$ and $v = \Delta y / \Delta t$. Substituting them into the previous equation and dividing by $\Delta t$, we have:
$I_x u + I_y v + I_t = 0$
where $I_x = \frac{\partial I}{\partial x}$, $I_y = \frac{\partial I}{\partial y}$, and $I_t = \frac{\partial I}{\partial t}$ are the partial derivatives of the pixel intensity with respect to $x$, $y$, and $t$, respectively, and $(u, v)$ is referred to as the optical flow field.
Based on the above, and as an example, Figure 22 illustrates an application of optical flow, showing the result of the module. The figure uses the frame flow from one of the users in our own dataset to visualize the flow vectors corresponding to motion or changes throughout the sequence.
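As a sketch of how the constraint $I_x u + I_y v + I_t = 0$ is used in practice, the following solves it in the least-squares sense over a whole patch (a Lucas–Kanade-style estimate, shown here as a generic illustration rather than the module's actual implementation):

```python
import numpy as np

def flow_over_patch(I1, I2):
    """Least-squares solution of I_x u + I_y v + I_t = 0 over a whole patch,
    assuming brightness constancy and small motion between the two frames."""
    Ix = np.gradient(I1, axis=1)               # spatial derivative in x
    Iy = np.gradient(I1, axis=0)               # spatial derivative in y
    It = I2 - I1                               # temporal derivative
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    uv, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    return uv                                  # (u, v)

# Smooth pattern translated by about (1, 0) pixels between frames.
y, x = np.mgrid[0:32, 0:32].astype(float)
I1 = np.sin(0.3 * x) + np.cos(0.2 * y)
I2 = np.sin(0.3 * (x - 1.0)) + np.cos(0.2 * y)
u, v = flow_over_patch(I1, I2)
```

The recovered flow is close to the true translation (u near 1, v near 0), illustrating how the derivative constraint yields motion vectors from two consecutive frames.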

3. Results and Discussion

The B-Lit Transformer model was initially trained using three microexpression datasets: SMIC, CASME, and SAMM, and validated on a proprietary dataset [18]. See Table 1.
A second stage of preprocessing consisted of the generation of tensors, mathematical objects that allow storing image features across each of their dimensions.
In Figure 23, two frames corresponding to different emotional states are shown: on the left, an image of a neutral state, and on the right, an image of anger. The main difference between the two occurs in the participant’s brow and lip corners. In the left image, the brow is pronounced due to shadows and facial wrinkles in that area, while the shape of the mouth is not associated with any specific emotion. Conversely, in the right image, the brow curves downward, and a slight movement of the mouth is perceptible; both movements are associated with the emotion of anger.
In Figure 24, frames are extracted at 5 s intervals. The changes in the participant’s face are subtle, occurring mainly around the eyes when the participant shifts their gaze to different areas of the screen (minute 1:35) and during blinking (1:55), as well as during slight rotations of the head. From these, the micro-expressions produced in the preceding frames were recorded.
In Figure 25, there is considerably more movement in the participant’s face. Participant 69 exhibited a wide variety of facial movements in a short period, between minutes 0:55 and 1:20. Gestures can be identified around the mouth (minute 1:05) or the eyes (minute 1:20), in addition to changes in head position throughout the entire recording.
  • Stochastic Gradient Descent
It was essential to identify a method to accelerate the training process [40], given that the initial execution required approximately one and a half days. Therefore, the Stochastic Gradient Descent (SGD) method was employed, and the training procedure became considerably more efficient: training on 132,000 images required less than two hours, as described in Table 2, representing a reduction of more than 95% in training time compared to the report in [40]. This considerable decrease demonstrates that the algorithm can effectively handle massive datasets without sacrificing the accuracy or stability of the model.
Impact of Stochastic Gradient Descent on Training Efficiency
The main optimization algorithm used in the study is Stochastic Gradient Descent (SGD), which significantly improves training performance and computational efficiency. In the context of high-dimensional micro-expression recognition, the application of SGD significantly accelerated the parameter-update cycle and enabled faster convergence, outperforming alternative optimization strategies.
Algorithmic Considerations
For optimizing the model under significant computational load, SGD’s stochastic, mini-batch-based update mechanism worked especially well. The system was able to effectively adjust to the intricate spatial-temporal patterns linked to micro-expression features thanks to its capacity for quick, incremental parameter changes. Furthermore, by concentrating on frame-to-frame pixel variations rather than static facial traits, SGD improved predictive performance and accelerated convergence when combined with Optical Flow as a preprocessing technique.
The training pipeline optimization enabled by SGD made possible rapid model convergence, effective large-dataset handling, and a significant decrease in computational time, all crucial benefits for developing micro-expression recognition frameworks.
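The mini-batch update cycle described above can be sketched as an SGD-with-momentum step; the learning rate, momentum, and toy objective below are illustrative, not the paper's settings:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01, momentum=0.9, velocity=None):
    """One SGD-with-momentum parameter update from a mini-batch gradient."""
    if velocity is None:
        velocity = np.zeros_like(w)
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Minimizing a toy objective f(w) = ||w||^2 with noisy (mini-batch) gradients:
rng = np.random.default_rng(7)
w, vel = np.ones(4), None
for _ in range(200):
    grad = 2 * w + 0.01 * rng.standard_normal(4)   # stochastic gradient estimate
    w, vel = sgd_step(w, grad, velocity=vel)
```

Despite the gradient noise, the iterates converge close to the minimum, which is the behavior that makes incremental, mini-batch updates effective on large datasets.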
  • Optical Flow
As a Transformer model trains, attention may be wasted on irrelevant details. Since Optical Flow provides only a representation of the motion in the original image, capturing changes over time, the Transformer can focus exclusively on these pixel variations between frames. In this way, the Transformer avoids processing or learning trivial features, such as face shape, distance between facial features, skin tone, hair color, or eye color.
Consequently, processing times were significantly reduced compared to standard executions: even with 40 fewer epochs, the version of the model without Optical Flow takes approximately five times longer, and although both versions achieve similar accuracy and F1-score performance, there is a substantial improvement in training loss (with an 80% training/20% testing split). See Table 2.
Performance Metrics for Three Classes (CASME, SMIC, SAMM Datasets)
Table 3 summarizes the most relevant characteristics with the aim of presenting the performance of the model when performing the three-class classification task (positive, negative, surprise), for both the training and testing stages.
For the training stage, 70% of the data was used, employing a cross-validation method known as LOSO (Leave-One-Subject-Out), which, according to [41], is considered optimal for micro-expression classification tasks. The training process consisted of a total of 60 epochs, resulting in a training loss of 0.1917 and an accuracy of 0.9474, with a total execution time of 8 min and 38 s. This process required 13.3% of the CPU processing power and 72.1% of the available RAM. Considering that these three datasets together comprise more than 132,000 frames, the low computational resources and short execution time required are noteworthy, especially given that image analysis tasks typically involve a high computational load. See Table 3.
The number of folds in this scheme is strictly determined by the number of participants; in our case, 85 subjects yield 85 folds. In each iteration, the test set consists of all samples from one individual, while the training set is made up of samples from the remaining 84 subjects. To ensure an evaluation free from identity leakage and to provide a solid assessment of the model’s capacity to generalize to unseen individuals, this process is repeated until each subject has been used as the test set once. Although the study considers seven emotional categories, the actual distribution varies across subjects because some contribute only a small number of samples and others do not exhibit certain classes, resulting in folds with underrepresented or even missing classes. This presents significant methodological challenges related to class imbalance. Metrics insensitive to class imbalance were used to reduce this problem and avoid evaluation bias caused by unequal sample distribution. This allowed for a more stable, interpretable, and comparable evaluation across folds by giving each class equal weight regardless of its frequency. See Table 4.
For the testing stage of the model, 20% of the available data from the three-class datasets was used. The metrics employed to evaluate the performance of the model in this stage are summarized in Table 5. In the F1-score column, the value obtained was 0.8574, which indicates that the model maintains a strong balance between correctly predicting each class and minimizing false predictions across classes. The percentage of correctly classified predictions is measured through the accuracy column, where it can be observed that the model correctly classifies 81.27% of all test instances. Meanwhile, the precision value indicates that 85.19% of the model’s predictions are correct. The recall score has a value of 0.8127, suggesting that the model correctly identifies 81.27% of all positive cases. See Table 5.
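As a concrete illustration of how imbalance-insensitive (macro-averaged) metrics are computed, the following is a minimal sketch; the labels are toy values, and the exact averaging convention used in the paper is an assumption here:

```python
def macro_metrics(y_true, y_pred, classes):
    """Macro-averaged precision/recall/F1: every class is weighted equally,
    making the scores insensitive to class imbalance."""
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    prec = sum(precisions) / len(classes)
    rec = sum(recalls) / len(classes)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return prec, rec, f1, accuracy

# Toy labels (0 = positive, 1 = negative, 2 = surprise); not the study's data.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
prec, rec, f1, acc = macro_metrics(y_true, y_pred, [0, 1, 2])
assert abs(acc - 4 / 6) < 1e-9   # 4 of 6 instances classified correctly
```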
Figure 26 presents the training loss per epoch. Although the initial loss is relatively high, it rapidly converges toward lower values, indicating that the model is adapting and learning efficiently. Around epoch 30, the loss ceases to show significant variation and begins to stabilize; by epoch 60, the rate of reduction reaches its lowest point. Increasing the number of epochs beyond this point would risk overfitting the model. Therefore, 60 epochs were determined to be an appropriate number to achieve optimal model performance.
The training accuracy graph in Figure 27 illustrates the model’s adjustment process over each epoch. Initially, the model’s accuracy is low, approximately 0.4, and exhibits considerable fluctuation. This behavior is normal in the early stages of training, as the model is still adjusting its parameters. Starting around epoch 20, the curve demonstrates a more stable upward trend, indicating that the model is learning and improving its accuracy on the training dataset. Subsequently, from approximately epoch 30 onward, accuracy stabilizes around 0.9, with minor fluctuations. This suggests that the model has achieved a high level of accuracy and is converging.
The accuracy curve becomes more stable and approaches a value near 0.75, indicating that the model has learned the fundamental patterns within the data, improving its performance while slowing the rate of further gains. As a result, beyond this point, more epochs are required to observe meaningful improvements. Around epoch 60, the model’s performance reaches its highest level.
The ROC curve represents the relationship between the true positive rate and the false positive rate. When a ROC curve lies close to the diagonal line, it indicates poor model performance. An AUC value of 1 corresponds to a perfect classifier, whereas an AUC value of 0.5 indicates a classifier with no discriminative capability [42].
The plots shown in Figure 28 and Figure 29 illustrate the classifier's performance: for all three classes (0 for positive, 1 for negative, and 2 for surprise), the AUC exceeds 0.8. Since the AUC measures the classifier's ability to distinguish between classes, these values indicate that the model possesses strong discriminative capability, making few errors when determining the correct class to which a given sample belongs.
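The one-vs-rest AUC values reported above can be computed directly from the rank statistic (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one); the scores below are illustrative, not the model's actual outputs:

```python
def auc_ovr(scores, labels, positive):
    """One-vs-rest AUC via the rank statistic.
    AUC = 1.0 is a perfect classifier; 0.5 means no discrimination."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores for the "surprise" class (2); illustrative only.
labels = [2, 2, 0, 1, 0]
scores = [0.9, 0.5, 0.4, 0.6, 0.2]   # model's score for class 2
auc = auc_ovr(scores, labels, positive=2)
assert abs(auc - 5 / 6) < 1e-9       # 5 of 6 positive/negative pairs ranked correctly
```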
  • Testing with Proprietary Data
For the proprietary dataset, a confusion matrix was generated, with the following class encoding: 0: Neutral, 1: Sadness, 2: Surprise, 3: Fear, 4: Disgust, 5: Anger, 6: Happiness. It can be observed that the Neutral class has the highest number of correct predictions (20), whereas the Sadness and Surprise classes are the most frequently misclassified (6 misclassified instances in both cases). The Sadness class is most commonly confused with Neutral (3 confusions), and to a lesser extent with Fear (1 confusion), Disgust (1 confusion), and Anger (1 confusion).
Meanwhile, the Surprise class is most frequently confused with Sadness (2 confusions) and Anger (2 confusions). The first case may be explained by the fact that concern can arise from a combination of surprise and fear, with fear being the predominant expression in those microexpressions. In the second case, the confusion with anger may be attributed to surprise triggering a fight-oriented response, leading the microexpressions to resemble anger. See Figure 30.
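The confusion counts discussed above come from a standard confusion matrix, which can be built as follows (the labels are a tiny illustrative example under the paper's class encoding, not the real test data):

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# Illustrative labels using the paper's encoding
# (0 Neutral, 1 Sadness, 2 Surprise, ..., 5 Anger, 6 Happiness).
y_true = [0, 0, 1, 1, 2, 6]
y_pred = [0, 0, 0, 1, 5, 6]   # one Sadness->Neutral and one Surprise->Anger error
cm = confusion_matrix(y_true, y_pred, 7)
assert cm[0][0] == 2 and cm[1][0] == 1 and cm[2][5] == 1
```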
The frequent confusion among sadness, surprise, and fear—compared to the higher performance observed for neutrality and happiness—can be explained by physiological and computational factors related to how micro-expressions emerge. These emotions often activate overlapping Action Units (AUs): fear and surprise share inner and outer eyebrow raising and widened eyes, while sadness and fear can share eyebrow tension and subtle periocular micro-tension. Such overlap produces highly similar facial patterns, reducing class separability and increasing misclassification even in advanced recognition models, especially because these emotions appear with low intensity, short duration, and high inter-subject variability. Happiness, on the other hand, is characterized by highly distinctive AUs, mainly activation of the orbicularis oculi and pulling of the lip corners, which produce symmetric, high-intensity micromovements that are consistent across individuals. These features enhance discriminability and contribute to the higher accuracy observed in automated recognition systems by generating clearer spatiotemporal signatures.
In the ROC curve shown in Figure 31, it can be observed that, overall, the model performs well in discriminating among the 7 classes. Class 0 (neutral) and class 6 (happiness) achieve the highest scores, both with 0.87. This result is consistent with the confusion matrix, as these are the classes for which the model exhibits the fewest errors and, therefore, the ones it learned to differentiate most effectively from the rest. Although the performance for classes 1, 2, and 3 (sadness, surprise, and fear, respectively) is satisfactory, with values close to 0.80, the model still has a greater margin for improvement in these cases. Specifically, it could be further refined to enhance discrimination for these classes, bringing them to levels comparable to classes 4 and 5 (disgust and anger, respectively), which, with values close to 0.85, the model learned to identify more reliably.
  • Comparison of Machine Learning Models
The results presented in Figure 28 may be considered an average across the three datasets (CASME, SAMM, and SMIC), which demonstrates that the B-LiT Transformer model is, to some extent, superior to those reported in [18]. It is likely that using the three datasets simultaneously, rather than independently, contributed to improving performance, since the model had access to a larger number of training instances and the manner in which the frames are presented is very similar across datasets. However, we consider this to be a relatively low-probability factor, as the average results reported in that study would still not surpass our findings.
Furthermore, the study in [17] corresponds to a competition held in 2018 focused on micro-expression detection, where the highest F1 score achieved was 0.8330. In contrast, our results reached 0.8556 and 0.8453, with and without Optical Flow, respectively. This represents a significant improvement, which is further reinforced by the use of gradient descent techniques that drastically reduced training time, as well as by the inclusion of Optical Flow, whose implementation clearly had a positive impact on the overall performance.
  • Novelty of the B-LiT/DualHybridFace
The novelty of the B-LiT/DualHybridFace model lies in its hybrid architecture, which strategically integrates Vision Transformers (ViT), Inception modules, the CBAM attention mechanism, and LSTM units to improve micro-expression detection. This combination makes use of each component’s advantages: ViT uses self-attention to capture global dependencies while Inception modules extract local, high-frequency features. This allows for a balanced representation of fine facial details and more general contextual patterns.
By adding spatial and channel attention, CBAM improves this process by enabling the model to highlight the most informative facial regions, which is essential for identifying the minute and subtle changes typical of micro-expressions. By capturing sequential dynamics across frames and enhancing ViT’s global attention mechanism to better comprehend temporal micro-movements, the integration of LSTM offers robust temporal modeling.
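The channel-then-spatial attention ordering described above can be illustrated with a minimal CBAM-style sketch in NumPy. This is a simplification: the real module learns shared-MLP and convolutional weights for its gates, whereas here the gates come directly from pooled statistics:

```python
import numpy as np

def cbam_like(feat):
    """Minimal CBAM-style attention sketch: a channel gate followed by a
    spatial gate.  feat: feature map of shape (C, H, W).  The learned
    MLP/7x7-conv weights of the actual module are omitted."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Channel attention: weight each channel by its global avg+max descriptor.
    chan = sigmoid(feat.mean(axis=(1, 2)) + feat.max(axis=(1, 2)))   # (C,)
    feat = feat * chan[:, None, None]
    # Spatial attention: weight each location by its cross-channel avg+max.
    spat = sigmoid(feat.mean(axis=0) + feat.max(axis=0))             # (H, W)
    return feat * spat[None, :, :]

x = np.random.default_rng(0).normal(size=(4, 8, 8))
y = cbam_like(x)
assert y.shape == x.shape   # attention re-weights features, never reshapes them
```

Because both gates lie in (0, 1), attention can only attenuate uninformative channels and locations, which is how the model emphasizes the facial regions carrying micro-movement.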
Prior to multi-head self-attention, the model also includes an Inception Token Mixer that enriches Transformer tokens with high-frequency CNN-based representations. This improves the system’s capacity to identify minute facial differences.
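The idea of enriching tokens with high-frequency local structure can be sketched as follows; the fixed Laplacian filter below is a stand-in for the mixer's learned convolution/pooling branches, which are not reproduced here:

```python
import numpy as np

def inception_token_mixer(tokens, grid):
    """Sketch of an Inception-style token mixer: before self-attention,
    each ViT token is enriched with a high-frequency (Laplacian-like)
    component computed from its spatial neighbours.
    tokens: (N, D) with N = h * w patch tokens."""
    h, w = grid
    d = tokens.shape[1]
    fmap = tokens.reshape(h, w, d)
    # 4-neighbour Laplacian per channel = local high-frequency detail.
    pad = np.pad(fmap, ((1, 1), (1, 1), (0, 0)), mode="edge")
    lap = (4 * fmap - pad[:-2, 1:-1] - pad[2:, 1:-1]
           - pad[1:-1, :-2] - pad[1:-1, 2:])
    return (fmap + 0.5 * lap).reshape(h * w, d)

tokens = np.ones((16, 3))                  # a constant map has no detail...
mixed = inception_token_mixer(tokens, (4, 4))
assert np.allclose(mixed, tokens)          # ...so the tokens pass through unchanged
```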
The architecture is appropriate for portable hardware because it achieves competitive performance at a low computational cost. Stochastic Gradient Descent (SGD) greatly increases training efficiency, cutting the time needed to process 132,000 images to less than two hours. Furthermore, Optical Flow suppresses unimportant information like skin tone or face shape while concentrating the model on significant inter-frame variations.
In complex hybrid architectures such as the one proposed in this study, performing exhaustive ablation experiments for each component (Inception, CBAM, ViT, LSTM, and Optical Flow) is neither methodologically appropriate nor computationally feasible. Due to the highly coupled nature of the model, the individual removal of modules alters the information flow in a non-linear manner and modifies the system's internal dynamics, preventing an independent interpretation of each component's contribution. Moreover, the computational cost of training multiple model variants exceeds reasonable limits in environments without access to high-performance computing infrastructure, particularly given the LOSO validation scheme and the limited size of micro-expression datasets, which increases the risk of overfitting and statistical variability. For these reasons, the experimental approach focused on evaluating the performance of the complete system, supported by comparisons with state-of-the-art methods and robust metrics, thereby validating the effectiveness of the proposed architecture without requiring exhaustive ablation analysis.
  • Ethical Considerations
The study was reviewed and approved by an institutional committee, which evaluated the risks and benefits for participants, the protection of personal and biometric data—including facial information and identity—the anonymization procedures, the secure storage with restricted access, and the recording methodology, such as the use of cameras, sensors, or automated facial analysis systems. All subjects signed an informed consent form that clearly described the purpose of the study focused on scientific analysis of micro-expressions, the capture procedures, the type of data collected, the methods for data storage and protection, the potential future use of the recordings (including their possible availability to the scientific community), their right to withdraw from the study without consequences, and the guarantees of anonymity and confidentiality. The consent form also specified explicit authorization for the use of the data in academic research and in the development of artificial intelligence models. The publicly available datasets used in this study already have institutional approval and informed consent from their participants, so no additional ethical approval was required.

4. Conclusions

The dataset did not require data augmentation strategies such as image rotation or the application of filters. Data augmentation is generally beneficial in machine learning classification tasks; however, in the case of the B-LiT model, the large sample already provides ample diversity and representativeness. The increased number of instances in the dataset directly impacted the model's accuracy and robustness. Furthermore, avoiding data augmentation has a positive effect, as such techniques can introduce unwanted artifacts or biases.
Technologies such as Stochastic Gradient Descent (SGD) and Optical Flow demonstrated their effectiveness in accelerating the training process and substantially improving performance. SGD's ability to update parameters incrementally was crucial for optimizing the model when handling large volumes of information, while Optical Flow, commonly associated with motion analysis in images and videos, helped suppress static, appearance-specific features (such as skin tone or face shape) so that the model could concentrate on inter-frame variations. The concurrent iteration of tests enabled model refinement through parameter adjustment and training-time optimization, constituting a critical stage in the machine learning workflow.
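The per-sample parameter updates that make SGD efficient on large data streams can be illustrated on a toy one-parameter least-squares problem; this is a generic sketch, not the actual B-LiT training loop:

```python
# Generic SGD update sketch; illustrative only, not the B-LiT training loop.
def sgd_step(w, grad, lr):
    """w <- w - lr * grad, applied element-wise."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# Fit y = 2x: loss = (w*x - y)^2, so d(loss)/dw = 2*x*(w*x - y).
w = [0.0]
for x, y in [(1.0, 2.0), (2.0, 4.0)] * 50:   # one sample per update
    grad = [2 * x * (w[0] * x - y)]
    w = sgd_step(w, grad, lr=0.05)
assert abs(w[0] - 2.0) < 1e-3                # converges to the true slope
```

Because each update consumes a single sample, the cost per step is independent of the dataset size, which is what keeps training tractable on 132,000 frames with modest hardware.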

Author Contributions

Conceptualization, N.A.Á.P. and E.J.J.U.; methodology, F.T.S.G. and B.Y.L.L.L.; software, F.T.S.G. and E.J.J.U.; validation, N.A.Á.P., B.Y.L.L.L. and R.R.-H.; formal analysis, R.R.-H.; investigation, F.T.S.G., N.A.Á.P., B.Y.L.L.L. and E.J.J.U.; resources, F.T.S.G., N.A.Á.P., B.Y.L.L.L. and E.J.J.U.; data curation, N.A.Á.P., B.Y.L.L.L. and E.J.J.U.; writing—original draft preparation, R.R.-H., F.T.S.G. and B.Y.L.L.L.; writing—review and editing, R.R.-H., N.A.Á.P. and E.J.J.U.; visualization, F.T.S.G., N.A.Á.P., B.Y.L.L.L. and E.J.J.U.; supervision, R.R.-H.; project administration, R.R.-H.; funding acquisition, R.R.-H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the Instituto Politécnico Nacional (IPN).

Institutional Review Board Statement

Ethical review and approval were waived for this study because it uses facial image datasets (CASME, SAMM, SMIC, and the authors' own FRANI dataset) for which informed consent was obtained from all participants. All data are fully anonymized, contain no personally identifiable information, and were used exclusively for research purposes.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the CASME, SAMM, SMIC, and FRANI datasets.

Data Availability Statement

The data are available from the corresponding author upon reasonable request.

Acknowledgments

The authors thank the IPN (Instituto Politécnico Nacional) and the ESCOM (Escuela Superior de Cómputo) for the support received in carrying out this work.

Conflicts of Interest

The authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

  1. Shen, X.; Sui, H. The time—frequency characteristics of EEG activities while recognizing microexpressions. In Proceedings of the IEEE Biomedical Circuits and Systems Conference (BioCAS), Shanghai, China, 17–19 October 2016; pp. 180–183. [Google Scholar]
  2. Merghani, W.; Davison, A.K.; Yap, M.H. A Review on Facial Micro-Expressions Analysis: Datasets, Features and Metrics. arXiv 2018, arXiv:1805.02397. [Google Scholar] [CrossRef]
  3. Matsumoto, D.; Hwang, H.S.; López, R.M.; Pérez-Nieto, Á.P. Lectura de la expresión facial de las emociones: Investigación básica en la mejora del reconocimiento de emociones. Ansiedad Estrés 2013, 19, 121–129. [Google Scholar]
  4. Revina, I.M.; Emmanuel, W.S. A Survey on Human Face Expression Recognition Techniques. J. King Saud Univ.—Comput. Inf. Sci. 2021, 33, 619–628. [Google Scholar] [CrossRef]
  5. Canal, F.Z.; Müller, T.R.; Matias, J.C.; Scotton, G.G.; de Sa Junior, A.R.; Pozzebon, E.; Sobieranski, A.C. A survey on facial emotion recognition techniques: A state-of-the-art literature review. Inf. Sci. 2022, 582, 593–617. [Google Scholar] [CrossRef]
  6. Ekman, P.; Friessen, W. Unmasking the Face; Consulting Psychologists Press: Redwood City, CA, USA, 1984. [Google Scholar]
  7. Strauss, D.; Steidl, G.; Heiß, C.; Flotho, P. Lagrangian Motion Magnification with Landmark-Prior and Sparse PCA for Facial Microexpressions and Micromovements. In Proceedings of the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Scotland, UK, 8 September 2022; pp. 2215–2218. [Google Scholar]
  8. Zong, Y.; Zheng, W.; Cui, Z.; Zhao, G.; Hu, B. Toward Bridging Microexpressions From Different Domains. IEEE Trans. Cybern. 2020, 50, 5047–5060. [Google Scholar] [CrossRef]
  9. Fibriani, I.; Mardiyanto, R.; Purnomo, M.H. Detection of Kinship through Microexpression Using Colour Features and Extreme Learning Machine. In Proceedings of the 2021 International Seminar on Intelligent Technology and Its Applications (ISITIA), Surabaya, Indonesia, 4 August 2021; pp. 331–336. [Google Scholar]
  10. Megías, C.F.; Mateos, J.C.P.; Ribaudi, J.S.; Fernández-Abascal, E.G. Validación española de una batería de películas para inducir emociones. Psicothema 2011, 23, 778–785. [Google Scholar]
  11. Navarro-Corrales, E. El lenguaje no verbal. de Rev. Comun. 2011, 20, 46–51. [Google Scholar]
  12. Herrera, R.R.; Torres, S.D.; Martínez, R.A.O. Recognition of emotions through HOG. In Proceedings of the 2018 XX Congreso Mexicano de Robótica (COMRob), Ensenada, Mexico, 12–14 September 2018; pp. 1–6. [Google Scholar] [CrossRef]
  13. Ekman, P. An argument for basic emotions. Cogn. Emot. 1992, 6, 169–200. [Google Scholar] [CrossRef]
  14. Ekman, P.; Oster, H. Expresiones faciales de la emoción. Annu. Rev. Psychol. 1979, 30, 527–554. [Google Scholar] [CrossRef]
  15. Ekman, P. Emotions Revealed: Recognizing Faces and Feelings to Improve Communication and Emotional Life; Times Books; Henry Holt and Co.: New York, NY, USA, 2003. [Google Scholar]
  16. Kay, T.; Ringel, Y.; Cohen, K.; Azulay, M.-A.; Mendlovic, D. Person Recognition using Facial Micro-Expressions with Deep Learning. arXiv 2023, arXiv:2306.13907. [Google Scholar] [CrossRef]
  17. Yap, M.H.; See, J.; Hong, X.; Wang, S.-J. Facial Micro-Expressions Grand Challenge 2018 Summary. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 675–678. [Google Scholar]
  18. Wang, C.; Peng, M.; Bi, T.; Chen, T. Micro-Attention for Micro-Expression Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6339–16347. [Google Scholar]
  19. Wang, S.-J.; Li, B.-J.; Liu, Y.-J.; Yan, W.-J.; Ou, X.; Huang, X.; Xu, F.; Fu, X. Micro-expression recognition with small sample size by transferring long-term convolutional neural network. IEEE Int. Conf. Image Process. 2018, 321, 251–262. [Google Scholar] [CrossRef]
  20. Gu, Q.-L.; Yang, S.; Yu, T. Lite general network and MagFace CNN for micro-expression spotting in long videos. Multimed. Syst. 2023, 29, 3521–3530. [Google Scholar] [CrossRef]
  21. Guo, Y.; Li, B.; Ben, X.; Ren, Y.; Zhang, J.; Yan, R.; Li, Y. A Magnitude and Angle Combined Optical Flow Feature for Microexpression Spotting. IEEE Multimed. 2021, 28, 29–39. [Google Scholar] [CrossRef]
  22. Yang, P.; Zhu, T. Micro-Expression Recognition Method Based on Transformer with Separable Self-Attention. In Proceedings of the 2024 4th International Conference on Machine Learning and Intelligent Systems Engineering (MLISE), Zhuhai, China, 28–30 June 2024; pp. 90–95. [Google Scholar] [CrossRef]
  23. Cheng, X.; Shang, L. Decoding Emotions: How Graph Transformer with Adaptive Graph Structure Learning Understands Micro-Expressions. In Proceedings of the 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG), Tampa/Clearwater, FL, USA, 26–30 May 2025; pp. 1–10. [Google Scholar] [CrossRef]
  24. Zhang, L.; Hong, X.; Arandjelovic, O.; Zhao, G. Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition. IEEE Trans. Affect. Comput. 2022, 13, 1973–1985. [Google Scholar] [CrossRef]
  25. Zhang, Y.; Lin, W.; Zhang, Y.; Xu, J.; Xu, Y. Leveraging vision transformers and entropy-based attention for accurate micro-expression recognition. Dent. Sci. Rep. 2025, 15, 13711. [Google Scholar] [CrossRef]
  26. Zhai, Z.; Zhao, J.; Long, C.; Xu, W.; He, S.; Zhao, H. Feature Representation Learning with Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; Volume abs/2304.04420. [Google Scholar] [CrossRef]
  27. Zhao, X.; Lv, Y.; Huang, Z. Multimodal Fusion-based Swin Transformer for Facial Recognition Micro-Expression Recognition. In Proceedings of the 2022 IEEE International Conference on Mechatronics and Automation (ICMA), Guilin, China, 7–10 August 2022; pp. 780–785. [Google Scholar] [CrossRef]
  28. Wang, S.; Lv, S.; Wang, X. Infrared Facial Expression Recognition Using Wavelet Transform. In Proceedings of the 2008 International Symposium on Computer Science and Computational Technology, Shanghai, China, 20–22 December 2008; pp. 327–330. [Google Scholar] [CrossRef]
  29. Modelo Basado en Componentes. Metodologías de Software. 1 December 2017. Available online: https://metodologiassoftware.wordpress.com/2017/12/01/modelo-basado-en-componentes/ (accessed on 9 April 2023).
  30. Jubair, B.; Humera, S.; Rawoof, S.A. An Approach to generate Reusable design from legacy components and Reuse Levels of different environments. Int. J. Curr. Eng. Technol. 2014, 4, 4234–4237. [Google Scholar]
  31. Happe, J.; Koziolek, H. A QoS Driven Development Process Model for Component-Based Software Systems; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4063. [Google Scholar]
  32. Jolly, E.; Cheong, J.H.; Xie, T.; Chang, L.J. Py-Feat. 2022. Available online: https://py-feat.org/basic_tutorials/01_basics.html (accessed on 26 April 2024).
  33. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. IEEE Xplore 2019. [Google Scholar] [CrossRef]
  34. Bertasius, G.; Wang, H.; Torresani, L. Is Space-Time Attention All You Need for Video Understanding? Icml 2021, 2, 4. [Google Scholar]
  35. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  36. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  37. Kong, D.; Zhang, J.; Zhang, S.; Yu, X.; Prodhan, F.A. MHIAIFormer: Multihead Interacted and Adaptive Integrated Transformer With Spatial-Spectral Attention for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 14486–14501. [Google Scholar] [CrossRef]
  38. Si, C.; Yu, W.; Zhou, P.; Zhou, Y.; Wang, X.; Yan, S. Inception Transformer. Adv. Neural Inf. Process. Syst. 2022, 35, 23495–23509. [Google Scholar]
  39. Calzone, O. Medium. An Intuitive Explanation of LSTM. 21 February 2022. Available online: https://medium.com/@ottaviocalzone/an-intuitive-explanation-of-lstm-a035eb6ab42c (accessed on 29 May 2024).
  40. Sánchez García, F.T.; Lin, B.; Romero-Herrera, R. Detection of facial micro-expressions using CNN. J. Theor. Appl. Inf. Technol. 2023, 101, 7592–7601. [Google Scholar]
  41. Tran, T.-K.; Vo, Q.-N.; Hong, X.; Zhao, G. Micro-expression spotting: A new benchmark. Neurocomputing 2021, 443, 356–368. [Google Scholar] [CrossRef]
  42. Torres, L. The Machine Learners. Curva ROC y AUC en Python. Available online: https://www.themachinelearners.com/curva-roc-vs-prec-recall/ (accessed on 31 May 2024).
Figure 1. Facial expressions of a person [12].
Figure 2. Architecture Diagram.
Figure 3. Camera recording view.
Figure 4. Analysis results of the second Python (version 3.9.0) script.
Figure 5. Position attention module [33].
Figure 6. Vision Transformer (ViT) [36].
Figure 7. Generation of queries, keys and values for single-headed attention.
Figure 8. Matrix multiplication Q·KT.
Figure 9. Matrix multiplication A·V.
Figure 10. Multi-Head Attention Segmentation.
Figure 11. Matrix multiplication Q·KT for MSA.
Figure 12. Matrix multiplication A·V for MSA.
Figure 13. Shape of the positional embedding matrix.
Figure 14. Token matrix.
Figure 15. A single batch of tokens.
Figure 16. Reconstructed image.
Figure 17. Convolutional Block Attention Module [36].
Figure 18. Spatial Attention Module [27].
Figure 19. Channel attention model [34,36].
Figure 20. Inception Architecture [36,38].
Figure 21. LSTM architecture.
Figure 22. Example of the Application of Optical Flow.
Figure 23. Person 49 analysis—Facial variation (neutral state on the left, angry state on the right).
Figure 24. Recording person 39 (from minute 1:30 to 1:59).
Figure 25. Person 69 (from minute 0:55 to 1:20).
Figure 26. Training loss graph for 3 classes.
Figure 27. Training Accuracy Plot.
Figure 28. ROC and AUC Plot for Three Classes.
Figure 29. Training Accuracy Curve for Seven Classes.
Figure 30. Confusion matrix for 7 classes during the testing stage.
Figure 31. ROC and AUC curve for 7 classes.
Table 1. Summary of the three datasets (CASME, SAMM, and SMIC).

Feature | CASME II | SAMM | SMIC
Number of Samples | 255 | 159 | 164
Participants | 35 | 32 | 16
Ethnicities | Chinese | Chinese and 12 more | Chinese
Facial Resolution | 280 × 340 | 400 × 400 | 640 × 480
Categories | Happiness (25), Surprise (15), Anger (99), Sadness (20), Fear (1), Others (69) | Happiness (24), Surprise (13), Anger (20), Sadness (3), Fear (7), Others (84) | Positive (51), Negative (70), Surprise (43)
Table 2. Performance comparison when applying Optical Flow across the three main datasets (CASME, SAMM, and SMIC).

Model Version | Epochs | Training Time | Training Loss | Accuracy | F1 Score
Without Optical Flow | 60 | 1 h 45 min 58 s | 0.2108 | 90.28% | 0.8453
With Optical Flow | 100 | 23 min 55 s | 0.0986 | 90% | 0.8556
Table 3. Training results for the three-class model.

Training Loss | Accuracy | Epochs | Training Time | CPU Usage | Memory Usage
0.1917 | 0.9474 | 60 | 8 min 38 s | 13.3% | 72.1%
Table 4. FLOP description, parameter count, and FPS.

Model | FLOPs (approx.) | Parameters | Estimated FPS (GTX 1060, batch)
ResNet-50 | 4.1 GFLOPs | 25.6 M | 35–50 FPS
Vision Transformer (ViT-B/16) | ~17.6 GFLOPs | 86 M | 8–12 FPS
Vision Transformer (ViT-L/16) | ~60 GFLOPs | 307 M | 2–4 FPS
Table 5. Performance metrics for the three-class model—Testing stage.

F1-Score | Accuracy | Precision | Recall | Mean Square Error (MSE) | Mean Absolute Error (MAE) | R²
0.8574 | 0.8127 | 0.8519 | 0.8127 | 0.2278 | 0.2175 | 0.8243
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Romero-Herrera, R.; Sánchez García, F.T.; Álvarez Peñaloza, N.A.; López Lin, B.Y.L.; Utrilla, E.J.J. Micro-Expression Recognition Using Transformers Neural Networks. Computers 2025, 14, 559. https://doi.org/10.3390/computers14120559