Article

Seeing the Unseen: Real-Time Micro-Expression Recognition with Action Units and GPT-Based Reasoning

by Gabriela Laura Sălăgean 1, Monica Leba 2,* and Andreea Cristina Ionica 3
1 Doctoral School, University of Petroșani, 332006 Petrosani, Romania
2 System Control and Computer Engineering Department, University of Petroșani, 332006 Petrosani, Romania
3 Management and Industrial Engineering Department, University of Petroșani, 332006 Petrosani, Romania
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(12), 6417; https://doi.org/10.3390/app15126417
Submission received: 5 May 2025 / Revised: 31 May 2025 / Accepted: 5 June 2025 / Published: 6 June 2025

Featured Application

The proposed system enables the real-time detection of facial micro-expressions and can be applied in high-stakes environments such as security screening and the clinical assessment of emotional states. It also holds potential for enhancing the user experience in adaptive human–computer interaction systems and for analyzing authentic consumer responses in market research.

Abstract

This paper presents a real-time system for the detection and classification of facial micro-expressions, evaluated on the CASME II dataset. Micro-expressions are brief and subtle indicators of genuine emotions, posing significant challenges for automatic recognition due to their low intensity, short duration, and inter-subject variability. To address these challenges, the proposed system integrates advanced computer vision techniques, rule-based classification grounded in the Facial Action Coding System, and artificial intelligence components. The architecture employs MediaPipe for facial landmark tracking and action unit extraction, expert rules to resolve common emotional confusions, and deep learning modules for optimized classification. Experimental validation demonstrated a classification accuracy of 93.30% on CASME II, highlighting the effectiveness of the hybrid design. The system also incorporates mechanisms for amplifying weak signals and adapting to new subjects through continuous knowledge updates. These results confirm the advantages of combining domain expertise with AI-driven reasoning to improve micro-expression recognition. The proposed methodology has practical implications for various fields, including clinical psychology, security, marketing, and human-computer interaction, where the accurate interpretation of emotional micro-signals is essential.

1. Introduction

Micro-facial expressions are brief, involuntary facial movements that occur within 1/25 to 1/15 of a second and reveal genuine emotional states that individuals may attempt to conceal [1]. These subtle cues are of particular interest in emotionally charged contexts such as deception, internal conflict, or suppressed reactions. Despite their brevity and low intensity, micro-expressions are recognized as powerful indicators of emotional authenticity, with significant implications across various fields including clinical psychology, law enforcement, neuropsychiatry, marketing, and human–computer interaction.
In clinical settings, micro-expression analysis can help therapists assess patient responses more objectively. In security applications, these expressions may aid in the detection of suspicious or deceptive behavior. Similarly, in market research and advertising, understanding subtle consumer reactions can refine product design and campaign strategies. Within human–computer interaction, recognizing micro-expressions enables systems to respond more empathetically and effectively to user states. Additionally, in cognitive and affective neuroscience, studying these involuntary signals provides insights into emotional processing and regulation mechanisms [2].
However, the automatic recognition of micro-expressions presents several challenges. These include their extremely short duration, often lasting under 100 milliseconds; the low amplitude of facial muscle movements, typically two to five times smaller than standard expressions; high inter-subject variability due to differences in facial morphology and expression patterns; ambiguity in emotional interpretation; and frequent confusion between visually similar expressions such as fear and surprise or sadness and disgust. Moreover, the scarcity of well-annotated training data and the absence of contextual cues further complicate the accurate detection and classification of micro-expressions [3].
The CASME II dataset [4] was developed with the specific aim of supporting micro-expression research. It comprises 247 high-speed video sequences (200 FPS) from 35 subjects and includes detailed annotations such as emotion categories, facial action units (AUs), and onset-apex-offset frames. The dataset’s design—eliciting genuine but suppressed emotional responses under controlled stimuli—makes it particularly suited for developing systems that simulate real-world micro-expression analysis conditions.
To address the existing limitations in micro-expression recognition, this research introduces a hybrid real-time detection system that integrates psychological models such as the Facial Action Coding System (FACS), computer vision (MediaPipe-based facial landmark detection), and artificial intelligence (AI), specifically large language models (LLMs) such as OpenAI’s GPT-3.5, for emotion classification. This system bridges the interpretability and rule-driven precision of classical methods with the generalization power of data-driven AI, yielding a model that is both robust and adaptable in dynamic environments.
Our approach is grounded in several innovations. First, it introduces a modular architecture that enhances interpretability and facilitates component-level optimization. Second, it leverages expert-defined rules to resolve known emotional confusions such as those between fear and surprise or sadness and disgust. Third, it incorporates weak signal amplification mechanisms to better capture low-intensity expressions, and fourth, it implements mechanisms for continuous learning, enabling the system to update its knowledge base as it encounters new subjects. These contributions are validated through rigorous experiments on the CASME II dataset using leave-one-subject-out and k-fold cross-validation protocols.
In addition to technical contributions, this work explores an emerging interdisciplinary frontier: the application of LLMs in affective computing. While prior studies have demonstrated the effectiveness of machine learning and deep learning frameworks on micro-expression datasets, few have considered how LLMs—traditionally used in natural language processing—can be integrated into emotion recognition systems. Recent evidence suggests that these models, when contextualized with structured rules and domain knowledge, can enhance the emotion classification accuracy and reduce ambiguity, especially when emotions are embedded within complex affective states.
The integration of LLMs with visual processing opens new possibilities for decoding not only facial movements, but also their underlying emotional narratives. This synergy between language and vision may ultimately support more nuanced interpretations of nonverbal communication. In doing so, it also addresses a limitation of conventional systems, which often rely on binary or simplistic emotion categories and fail to capture the complexity of emotional expression.
The primary objectives of this research were as follows:
O1: Develop a real-time system for micro-expression detection and classification, tested on the CASME II dataset;
O2: Implement a hybrid methodology combining expert rules with machine learning and LLMs;
O3: Resolve frequent emotional confusions using rule-based reasoning and signal enhancement;
O4: Conduct a rigorous performance evaluation using standard accuracy and F1-based metrics.
This paper contributes to the scientific community by demonstrating that micro-expression recognition can be significantly improved through a synergistic integration of structured knowledge, visual computation, and large-scale generative AI models. It further lays the foundation for future systems that are capable of understanding nuanced human emotions in real-time and in diverse application settings.
The remainder of this paper is structured as follows. Section 2 reviews the relevant literature in micro-expression recognition, focusing on hybrid methods and LLM integration. Section 3 presents the materials and methods including dataset preprocessing, system architecture, and the hybrid classification pipeline. Section 4 details the experimental design and validation protocols. Section 5 discusses the results, comparative evaluations, and real-time performance. Section 6 concludes the paper by highlighting the contributions, limitations, and directions for future research.

2. Related Work

2.1. Foundations of Micro-Expression Research

The concept of micro-expressions was first introduced by Haggard and Isaacs in 1966 under the term momentary micro-expressions, identified during psychotherapy sessions as brief, involuntary emotional leaks [5]. This foundational work was significantly expanded by Paul Ekman and Wallace Friesen, who formalized the study of micro-expressions as universal indicators of underlying emotions, largely uncontrollable by conscious effort [1]. Their work also led to the development of the FACS [6], which encodes facial expressions into discrete AUs corresponding to muscle movements. FACS remains the gold standard for facial expression analysis including micro-expression recognition [7].

2.2. Traditional Computational Approaches

Initial efforts in automatic micro-expression recognition relied on handcrafted features and classical image processing techniques. Methods such as optical flow were employed to track pixel-level movements between video frames [8]. Feature descriptors like LBP-TOP (Local Binary Patterns from Three Orthogonal Planes), HOG, and their temporal extensions were subsequently developed to capture the dynamic texture of facial regions [9,10,11]. While these approaches were foundational, they often lacked robustness across subjects and environmental conditions, limiting their practical applicability.
Recent advancements in micro-expression recognition highlight the importance of robust spotting techniques, with [12] proposing the MSOF method based on bi-directional optical flow to enhance motion analysis in complex scenarios, and [13] introducing a head pose segmentation approach to improve spotting accuracy under uncontrolled conditions.

2.3. Deep Learning and Spatiotemporal Modeling

The emergence of deep learning significantly advanced the field. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs), particularly LSTM-based models, were adopted to learn hierarchical and temporal representations of facial dynamics [14,15,16]. These methods improved recognition accuracy by capturing both spatial detail and the temporal progression of expressions. More recent explorations have incorporated 3D CNNs, transformers, and generative adversarial networks (GANs) to enhance the extraction of micro-expression features [17,18]. However, many of these models are computationally intensive and often optimized for controlled environments.
Phan et al. [19] provide an extensive review of the application of graph convolutional networks (GCNs) in sentiment analysis, which proves pertinent due to the natural graph-structured representation of facial landmarks and their dynamic relations. Shehu et al. [20] offer a detailed survey on facial expression recognition datasets and methods, highlighting key challenges such as spontaneous expression recognition, class imbalance, and the difficulty of annotating low-intensity emotions. These findings emphasize the growing consensus in the literature that traditional single-model approaches often fall short when confronted with subtle or ambiguous expressions. In this context, the Transformer-Augmented Network with Online Label Correction proposed by Ma et al. [21] illustrates a compelling approach by combining spatial and temporal attention with dynamic label correction. Their work underlines the importance of both architectural adaptiveness and label reliability, especially when working with inherently uncertain or low-intensity facial expressions.

2.4. Language Models in Emotion Recognition

A novel and transformative direction has emerged through the integration of LLMs into micro-expression recognition. Visual-linguistic architectures such as DFER-CLIP demonstrate the ability to use vision-language embeddings for facial expression recognition in natural contexts [22]. Micron-BERT, an adaptation of BERT for unsupervised facial recognition, incorporates Diagonal Micro Attention and Patch of Interest modules to detect fine-grained facial changes without explicit labels [23].
General-purpose models like GPT-4V with Emotion showcase zero-shot emotion classification capabilities across modalities [24], while MicroEmo combines visual, linguistic, and acoustic features to analyze emotions in video conversations [25]. Systems like Exp-CLIP and ExpLLM extend zero-shot recognition and chain-of-thought reasoning to explain facial expressions and their emotional implications [26,27].

2.5. The Role of CASME II and Dataset-Centric Advances

The CASME II dataset remains a cornerstone of micro-expression research. With high-speed, high-resolution recordings and detailed AU annotations, it has enabled the development and benchmarking of a wide range of recognition models. Studies consistently emphasize CASME II’s value in uncovering the nuances of subtle facial cues and validating model performance [28,29].
Nonetheless, current systems trained on CASME II often struggle with generalization. Issues such as inter-subject variability, subtle expression intensity, and class imbalance persist. This highlights the need for systems that combine robust modeling techniques with interpretability and contextual adaptability.

2.6. Challenges and Gaps in Existing Research

Despite the progress seen in Table 1, several limitations continue to hinder the practical deployment of micro-expression systems. These include:
  • Lack of robustness in real-world conditions—most models are sensitive to variations in lighting, occlusion, and camera angles;
  • Dataset scarcity and class imbalance—the small size and uneven emotion distribution in datasets like CASME II reduce generalization;
  • Emotion confusion—visual overlap between pairs like fear and surprise or disgust and sadness leads to misclassification;
  • Cultural and individual variation—facial expression interpretation varies significantly across individuals and social contexts;
  • Limited contextual integration—many models do not account for linguistic, situational, or multimodal cues that influence emotional meaning [30,31].
These challenges underscore the importance of integrating contextual awareness and multimodal learning strategies, particularly those offered by LLMs and vision-language models.
Table 1. Emotion recognition milestones.
Approach | Achievements
Enhancement of micro-expression recognition through a lightweight MFE-Net using multi-scale features and attention [29]. | Demonstrated effectiveness of MFE-Net on CASME II and MEGC2019 datasets, improving feature extraction.
Integration of GANs with CNNs to enhance micro-expression dataset diversity and classification accuracy [32]. | Outperformed existing CNN models with augmented dataset from CASME II, improving classification accuracy.
Development of a multimodal large language model for facial expression and attribute understanding [33]. | Face-LLaVA outperformed existing models on multiple datasets, enhancing performance in facial processing tasks.
Improvement of multi-class facial emotion identification using deep learning techniques [34]. | Achieved high accuracy, showing effectiveness of the combined use of EfficientNetB0 and WKELM.
Utilization of advanced deep learning techniques for emotion recognition from visual and textual data [30]. | Demonstrated the ability to recognize seven primary human emotions effectively.
A systematic review on emotion detection methodologies utilizing deep learning and computer vision [35]. | Identified trends in models used for emotion detection, emphasizing CNNs and their effectiveness.
Proposal of a two-level feature fused two-stream network for improved micro-expression recognition [31]. | Achieved UAR exceeding 0.905 on CASME II and significant improvements on other datasets.
Development of a method for magnifying subtle facial movements for better detection of micro-expressions [36]. | Introduced a technique that successfully enhanced the detection of micro-expressions in facial videos.
Review of advances in facial emotion recognition using neural network models [37]. | Highlighted the dominance of CNN architectures in facial emotion recognition tasks with ongoing challenges.
Evaluation of ChatGPT applications in psychiatry and potential for emotion recognition [38]. | Identified the potential for ChatGPT to enhance psychiatric care through emotion detection and automated systems.
Development of a real-time approach for sign language recognition using pose estimation [39]. | Demonstrated effectiveness in recognizing signs with the aid of facial and body movements.
Emotion detection in Hindi–English code-mixed text using deep learning [40]. | Achieved 83.21% classification accuracy in detecting emotions in social media texts.
Review of multimodal data integration in oncology using deep neural networks [41]. | Emphasized the importance of GNNs and Transformers for improving cancer diagnosis and treatment.
Our proposed system directly addresses the above limitations through a hybrid architecture that integrates:
  • MediaPipe-based facial landmark tracking for efficient AU detection;
  • Expert rule systems grounded in FACS to resolve ambiguous emotion pairs;
  • LLM-based classification for robust and context-aware interpretation;
  • Weak signal amplification to enhance detection of low-intensity expressions;
  • Real-time capabilities validated on CASME II under practical constraints.
By leveraging the strengths of psychology, computer vision, and computational linguistics, our research advances the state-of-the-art in real-time, interpretable, and contextually adaptive micro-expression recognition.

3. Materials and Methods

The detection of micro-expressions remains a significant challenge due to their subtle, transient nature and the frequent overlap between emotional categories. Existing solutions often rely heavily on either deep learning models or handcrafted feature extraction methods, each with limitations in terms of interpretability, adaptability, or sensitivity to weak signals. To address the identified challenges, we propose a modular micro-expression detection system that integrates multiple state-of-the-art technologies and expert knowledge.

3.1. Materials

To develop our proposed system, we integrated multiple cutting-edge technologies, each selected for its specific capabilities and strengths. Python 3.8 served as the primary development language, providing a flexible and powerful foundation for implementation. OpenCV 4.6.0 was employed for image and video processing tasks, while the MediaPipe 0.8.9 framework from Google (Mountain View, CA, USA) enabled accurate facial landmark detection. For efficient data manipulation and handling, we utilized the NumPy 1.21.0 and Pandas 1.3.4 libraries. Scikit-learn 1.0.2 was used to evaluate the system performance through established machine learning metrics. Additionally, the OpenAI (San Francisco, CA, USA) ChatGPT-3.5-Turbo API was incorporated to perform advanced classification tasks. Finally, Matplotlib 1.3.0 and Seaborn 0.13.2 were applied to visualize the results and facilitate comprehensive analysis.
The proposed micro-expression detection system adopts a modular architecture to ensure clarity, flexibility, and robustness across all processing stages, as shown in Figure 1. The workflow begins with the Video Preprocessing Module, where raw video streams are enhanced and prepared for analysis. This is followed by MediaPipe Facial Detection, which identifies and tracks facial landmarks with high precision. The detected features are then processed by the Facial AU System, enabling a structured representation of facial movements. To further refine interpretation, the Expert Rules Module applies specialized decision rules aimed at resolving typical emotional ambiguities. Subsequently, the Amplification of Weak Signals stage enhances subtle facial cues that are characteristic of micro-expressions. The refined data are passed to the Classification Ensemble, where multiple classifiers work collectively to improve recognition accuracy. Finally, the system outputs the Identified Micro-Expression, offering a comprehensive and accurate analysis of subtle emotional expressions. This modular approach, while promoting scalability and easy updates, also allows for targeted improvements in individual system components.
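To make the workflow concrete, the following minimal Python sketch shows how the modules described above could be chained; the class and method names are illustrative placeholders rather than the authors' actual implementation.

```python
# Illustrative skeleton of the modular pipeline shown in Figure 1.
# All module and method names are hypothetical placeholders.

class MicroExpressionPipeline:
    def __init__(self, preprocessor, detector, au_system, expert_rules,
                 amplifier, ensemble):
        self.preprocessor = preprocessor   # Video Preprocessing Module (OpenCV)
        self.detector = detector           # MediaPipe facial detection
        self.au_system = au_system         # blendshape -> AU conversion
        self.expert_rules = expert_rules   # FACS-based decision rules
        self.amplifier = amplifier         # weak-signal amplification
        self.ensemble = ensemble           # classification ensemble

    def process_frame(self, frame):
        frame = self.preprocessor.normalize(frame)             # alignment, filtering
        landmarks, blendshapes = self.detector.detect(frame)   # 468 landmarks + blendshapes
        aus = self.au_system.to_action_units(blendshapes)      # FACS AUs with intensities
        aus = self.amplifier.boost(aus)                        # emphasize diagnostic AUs
        label = self.expert_rules.classify(aus)                # rule-based first pass
        if label is None:                                      # ambiguous case
            label = self.ensemble.classify(aus)                # prototypes / LLM fallback
        return label
```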
Our proposed system for real-time micro-expression detection was tested on the CASME II dataset, and its general organization is presented in Figure 2.
The processing flow begins with extracting relevant frames from videos or web camera video streams, followed by face and facial point detection in each frame. This information is then converted into facial AUs, which are used to infer emotions through multiple complementary methods. The system includes feedback and self-improvement components that allow adaptation to specific subject and context characteristics.
The Video Preprocessing Module uses OpenCV and is responsible for extracting frames from video sequences or real-time web camera streams, performing normalization, facial alignment, and temporal filtering.
The MediaPipe Facial Detector uses the MediaPipe Face Landmarker model to extract facial points and expressions. It provides 468 high-precision facial landmarks, facial blendshape estimations representing specific facial configurations, and stable tracking between frames for continuous analysis. This detector offers significant advantages compared with traditional approaches: high precision even for subtle movements, robustness to lighting and position variations, and computational efficiency for real-time processing.
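As a rough illustration, the snippet below extracts the 468 landmarks and the blendshape scores with the MediaPipe Tasks FaceLandmarker, which exposes blendshape estimation in current releases; the model asset path and option values are assumptions, and the paper's implementation details may differ.

```python
# Minimal sketch: 468 facial landmarks and blendshape scores via MediaPipe's
# FaceLandmarker. The "face_landmarker.task" model file is an assumption and
# must be downloaded separately.
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision

options = vision.FaceLandmarkerOptions(
    base_options=mp_python.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,   # needed for the blendshape -> AU mapping
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

image = mp.Image.create_from_file("frame.png")      # one preprocessed video frame
result = landmarker.detect(image)

landmarks = result.face_landmarks[0]                # 468 normalized (x, y, z) points
blendshapes = {b.category_name: b.score             # e.g. {"browDownLeft": 0.12, ...}
               for b in result.face_blendshapes[0]}
```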
Facial AU System—An important aspect of the system is converting MediaPipe blendshapes to standardized AUs according to the FACS system. This mapping allows for the use of existing scientific knowledge about the relationship between AUs and emotions. The system implements comprehensive mapping, as shown in Table 2. This detailed mapping covers all AUs relevant to micro-expressions.
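A small excerpt of what such a mapping can look like in code is given below; the blendshape names and the aggregation rule are illustrative only, as the complete mapping used by the system is the one summarized in Table 2.

```python
# Illustrative excerpt of a blendshape -> FACS AU mapping (see Table 2 for the
# full mapping used by the system); names and aggregation are assumptions.
BLENDSHAPE_TO_AU = {
    "browInnerUp": "AU1",        # inner brow raiser
    "browOuterUpLeft": "AU2",    # outer brow raiser (left)
    "browOuterUpRight": "AU2",   # outer brow raiser (right)
    "browDownLeft": "AU4",       # brow lowerer
    "browDownRight": "AU4",
    "eyeWideLeft": "AU5",        # upper lid raiser
    "eyeWideRight": "AU5",
    "mouthSmileLeft": "AU12",    # lip corner puller
    "mouthSmileRight": "AU12",
    "mouthFrownLeft": "AU15",    # lip corner depressor
    "mouthFrownRight": "AU15",
    "jawOpen": "AU26",           # jaw drop
}

def blendshapes_to_aus(blendshapes, threshold=0.15):
    """Aggregate blendshape scores into AU intensities (max over contributing shapes)."""
    aus = {}
    for name, score in blendshapes.items():
        au = BLENDSHAPE_TO_AU.get(name)
        if au is not None and score >= threshold:
            aus[au] = max(aus.get(au, 0.0), score)
    return aus
```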
The Expert Rules Module implements rules based on specialized knowledge for micro-expression classification. Classic machine learning and deep learning approaches on the CASME II dataset encounter significant difficulties when directly applied to micro-expressions for the following reasons:
  • Limited dataset: with only 247 examples and an unbalanced distribution, deep learning models cannot efficiently generalize;
  • AU specificity: psychological research has established clear correspondences between certain AUs and specific emotions;
  • Need for interpretability: in critical applications, it is essential to be able to explain why the system classified a particular expression.
For these reasons, the system implements a comprehensive set of expert rules based on specialized literature in emotion psychology and the FACS system. These rules define distinctive AU combinations for each emotion and are essential for resolving ambiguities, as shown in Table 3.
A key contribution of the proposed system is the implementation of specialized decision rules designed to address common emotional confusions, as illustrated in the block diagram presented in Figure 3.
The rules are applied hierarchically, first checking the “must_have” conditions, then “should_have”, and finally “may_have”, ensuring a robust classification even for partial or ambiguous facial micro-expressions.
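A minimal sketch of this hierarchical check, assuming each rule is stored as must_have/should_have/may_have AU sets, is given below; the two example rules are illustrative, and the complete rule set is the one summarized in Table 3.

```python
# Sketch of the hierarchical rule evaluation (must_have, then should_have,
# then may_have). The example rules are illustrative, not the full Table 3.
EXPERT_RULES = {
    "fear":     {"must_have": {"AU1", "AU2", "AU4"},
                 "should_have": {"AU5", "AU20"}, "may_have": {"AU26"}},
    "surprise": {"must_have": {"AU1", "AU2", "AU5"},
                 "should_have": {"AU26"}, "may_have": set()},
}

def rule_score(detected_aus, rule):
    detected = set(detected_aus)
    if not rule["must_have"] <= detected:            # hard requirement first
        return 0.0
    score = 1.0
    if rule["should_have"]:
        score += len(detected & rule["should_have"]) / len(rule["should_have"])
    if rule["may_have"]:
        score += 0.5 * len(detected & rule["may_have"]) / len(rule["may_have"])
    return score

def classify_by_rules(detected_aus):
    scores = {emotion: rule_score(detected_aus, rule)
              for emotion, rule in EXPERT_RULES.items()}
    emotion, best = max(scores.items(), key=lambda kv: kv[1])
    return emotion if best > 0 else None             # None -> fall back to the ensemble
```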

Amplification of Weak Signals

Micro-expressions are characterized by low intensity, which can lead to the incomplete detection of AUs. To compensate for this limitation, the system implements a selective amplification strategy for weak signals.
The per-AU amplification factors, grouped by the emotion or confusion they help resolve, are:
AMPLIFICATION_FACTORS = {
    # Disgust
    'AU9': 3.0,    # massive amplification for "noseWrinkler" (Disgust)
    'AU10': 2.5,   # amplification for "upperLipRaiser" (Disgust)
    # Happiness
    'AU12': 3.0,   # massive amplification for "lipCornerPuller" (Happiness)
    'AU6': 2.0,    # amplification for "cheekRaiser" (Happiness)
    # Repression
    'AU4': 2.0,    # amplification for "browLowerer" (Repression, Fear)
    'AU7': 3.0,    # massive amplification for "lidTightener" (Repression)
    # Fear vs. Surprise
    'AU20': 3.0,   # massive amplification for "lipStretcher" (Fear)
    'AU5': 2.0,    # amplification for "upperLidRaiser" (Surprise, Fear)
    # Sadness (resolving confusion)
    'AU15': 3.0,   # massive amplification for "lipCornerDepressor" (Sadness)
    'AU1': 2.0,    # amplification for "browInnerRaiser" (Sadness, Surprise, Fear)
    'AU17': 2.0,   # amplification for "chinRaiser" (Sadness)
    # Differentiating Surprise vs. Fear
    'AU26': 2.5,   # amplification for "jawOpen" (Surprise)
    'AU2': 1.8,    # amplification for "browOuterUpLeft"/"browOuterUpRight" (Surprise, Fear)
}
This approach is based on the principle that certain AUs are more diagnostic for specific emotions, and therefore deserve a higher weight in the decision-making process.
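As an illustration of how these weights act on the raw measurements, the short sketch below multiplies each detected AU intensity by its factor before thresholding; the clipping to 1.0 is an assumption of the sketch.

```python
# Minimal sketch: applying the selective amplification to raw AU intensities
# before classification. Clipping to 1.0 is an assumption.
def amplify(au_intensities, factors):
    return {au: min(1.0, value * factors.get(au, 1.0))
            for au, value in au_intensities.items()}

# Example: a weak AU9 activation of 0.12 becomes 0.36 after amplification,
# crossing the 0.15 detection threshold used in the experiments.
boosted = amplify({"AU9": 0.12, "AU6": 0.40}, {"AU9": 3.0, "AU6": 2.0})
```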
Classification Ensemble combines multiple methods to improve the micro-expression classification accuracy, in the following order:
  • Rule-based classification—direct application of expert rules for clear cases;
  • Prototype similarity—comparing detected AU configurations with known patterns for each emotion;
  • Contextual analysis—considering subject-specific characteristics;
  • Language model assistance—using OpenAI for difficult cases.
This hybrid approach allows the system to combine the advantages of explicitly coded expert knowledge with the flexibility and adaptability of learning-based methods. For similarity cases, a function was used that calculates the Jaccard score between the observed AU configuration and prototypes for each emotion, thus providing an objective measure of similarity.
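A compact version of such a similarity function is sketched below; the prototype AU sets shown are illustrative examples, not the full prototype library.

```python
# Sketch of the Jaccard similarity between the observed AU set and per-emotion
# prototypes used in the prototype-similarity stage; prototypes are illustrative.
def jaccard(observed_aus, prototype_aus):
    observed, prototype = set(observed_aus), set(prototype_aus)
    union = observed | prototype
    return len(observed & prototype) / len(union) if union else 0.0

PROTOTYPES = {"happiness": {"AU6", "AU12"}, "sadness": {"AU1", "AU15", "AU17"}}
scores = {emotion: jaccard({"AU6", "AU12"}, proto)
          for emotion, proto in PROTOTYPES.items()}
# -> {'happiness': 1.0, 'sadness': 0.0}
```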
The integration of the LLM, specifically OpenAI’s GPT-3.5-turbo, into the proposed micro-expression recognition system is illustrated in Figure 4. This component is activated when the rule-based classifier fails to produce a clear emotion label, typically due to ambiguous or overlapping AU patterns. Upon activation, a structured prompt is dynamically constructed. This prompt includes: (1) explicitly defined classification rules grounded in the FACS; (2) the current AU list detected in the frame; (3) contextual information such as subject-specific emotion trends; and (4) targeted disambiguation guidelines to address known confusions between similar emotions (e.g., fear vs. surprise). The GPT model processes this input using its transformer-based architecture and generates a classification decision.
The LLM’s output undergoes a post-processing validation stage that verifies the consistency with predefined logical constraints. If inconsistencies are detected, correction rules are applied; otherwise, the system either accepts the result or defaults back to deterministic rule-based classification. Additionally, the system includes a continuous learning mechanism that updates its AU-emotion association knowledge base incrementally, allowing for improved adaptation to individual expression patterns over time. The final classification output is further used by an assessment module that computes the system performance metrics and generates interpretive reports for research analysis.
To address ambiguous cases where rule-based classification is inconclusive, we integrated an LLM—specifically, OpenAI’s GPT-3.5-turbo—into our micro-expression recognition pipeline. This module serves as a fallback mechanism when partial or overlapping AU activations prevent a confident decision (e.g., AU1, AU2, and AU5 without AU4, which introduces ambiguity between fear and surprise). Upon activation, a structured prompt is generated that includes the detected AUs, explicit FACS-based AU-emotion rules, and contextual metadata such as the participant’s recent emotion trends. The prompt follows strict logical patterns (e.g., “If AU4 is present and AU5 is not, label as Fear”) to constrain the LLM’s output within the domain-relevant boundaries. The reasoning is performed internally by GPT-3.5’s transformer architecture, which utilizes multi-head self-attention to interpret the prompt and resolve ambiguous AU combinations. While the model’s internal weights are not accessible, its output is validated through a post-processing module that checks for rule compliance and logical coherence with established AU-emotion mappings. If inconsistencies are found, the system defaults back to deterministic classification. This dual-layer strategy ensures that LLM-based reasoning complements but does not override domain constraints, thereby enhancing both adaptability and reliability in edge cases.
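The following sketch shows one possible shape of this fallback call using the OpenAI Python client; the prompt wording, helper names, and validation step are simplified illustrations of the mechanism described above, not the exact prompts used in the experiments.

```python
# Hedged sketch of the LLM fallback: a structured prompt built from the detected
# AUs, FACS rules, and subject context, sent to gpt-3.5-turbo, with the reply
# checked against the expert rules before being accepted.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def llm_classify(detected_aus, rules_text, context, allowed_labels):
    prompt = (
        "You are a FACS expert. Classify the facial micro-expression.\n"
        f"Detected action units: {sorted(detected_aus)}\n"
        f"Classification rules:\n{rules_text}\n"
        f"Subject context: {context}\n"
        "Disambiguation: if AU4 is present and AU5 is not, label as Fear.\n"
        f"Answer with exactly one label from: {', '.join(allowed_labels)}."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    label = reply.choices[0].message.content.strip().lower()
    # Post-processing validation: return None (i.e., defer to the deterministic
    # rule-based classifier) if the output is not an allowed label.
    return label if label in allowed_labels else None
```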

3.2. Experiment

The experiments were conducted using a high-performance computing setup to ensure efficient processing and reliable results. The hardware configuration included a system equipped with an Intel Core i9 processor (Intel Corporation, Santa Clara, CA, USA), 64 GB of RAM, and an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). On the software side, the development environment was based on Python 3.8, incorporating libraries such as OpenCV 4.6.0 for image processing, MediaPipe 0.8.9 for facial detection, NumPy 1.21.0 and Pandas 1.3.4 for data manipulation, and Scikit-learn 1.0.2 for performance evaluation. Additionally, external support for advanced classification tasks was provided through the integration of the OpenAI ChatGPT-3.5-Turbo API. This setup ensured that the system operated efficiently across all stages of the micro-expression detection pipeline.
The system was developed through a series of methodical steps aimed at optimizing performance and robustness, and its effectiveness was evaluated through systematic experimental validation on CASME II. Given the system’s reliance on fine-grained AU-based reasoning and real-time video processing, CASME II was selected as the primary evaluation dataset due to its comprehensive AU annotations, discrete emotion labels, and frame-level temporal segmentation, criteria not fully met by alternative datasets such as SMIC or SAMM. Preprocessing steps were carefully applied to ensure data consistency and alignment with our experimental objectives: metadata were first extracted from the CASME2-ObjectiveClasses.xlsx file, after which the file names were validated and normalized. Information regarding the onset and offset frames for each sample was retrieved, and the objective classes were mapped to standard emotional labels to facilitate accurate classification.
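For orientation, a minimal metadata-loading step of the kind described above might look as follows; the spreadsheet column names and the commented-out operations are assumptions, since the exact file layout is not reproduced here.

```python
# Sketch of the metadata extraction from CASME2-ObjectiveClasses.xlsx; the
# column names referenced in the comments are assumptions.
import pandas as pd

meta = pd.read_excel("CASME2-ObjectiveClasses.xlsx")
meta.columns = [str(c).strip() for c in meta.columns]       # tidy header names

# Hypothetical follow-up steps (column names assumed):
# meta["Filename"] = meta["Filename"].str.strip()           # validate/normalize names
# onset, offset = meta.loc[i, "OnsetFrame"], meta.loc[i, "OffsetFrame"]
# meta["emotion"] = meta["ObjectiveClass"].map(CLASS_TO_EMOTION)
```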
A critical aspect of system tuning involved determining the optimal threshold for AU detection. Threshold values of 0.1, 0.15, 0.2, and 0.25 were systematically evaluated. It was observed that thresholds below 0.1 led to excessive AU detection and considerable noise, while thresholds above 0.25 caused many subtle micro-expressions to be missed. An optimal threshold of 0.15 was identified, offering the best balance between sensitivity and specificity in detecting meaningful facial movements.
Further experimentation focused on prompt formulation strategies for the OpenAI-based classification component. Three different prompt approaches were tested: general prompts, prompts incorporating explicit decision rules, and prompts optimized for known emotional confusions. General prompts yielded a modest classification accuracy of approximately 13%, while the use of explicit rules significantly improved the performance to around 60%. The highest accuracy, 93.3%, was achieved using prompts specifically optimized to address frequent emotional confusions observed during the preliminary evaluations.
To rigorously assess the system’s performance, several validation strategies were employed. Leave-one-subject-out cross-validation was implemented to test the system’s generalization capability to unseen individuals. Additionally, a 5-fold cross-validation protocol was used to evaluate the overall robustness of the model. Performance on specific emotion subsets was also analyzed to gain deeper insights into the system’s behavior across different emotional categories. Notably, the CASME II dataset presented a substantial class imbalance, with emotions such as fear (86 samples) and disgust (80 samples) being well-represented, while others like sadness were underrepresented with only 3 samples, posing additional challenges to the classification task.
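A condensed sketch of how the two protocols can be run with scikit-learn is shown below; classify_sample stands in for the full pipeline, and because the classifier is rule-driven, the splits only determine which samples are scored in each round.

```python
# Sketch of leave-one-subject-out and 5-fold evaluation with scikit-learn;
# classify_sample is a placeholder for the full detection pipeline.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut
from sklearn.metrics import accuracy_score

def evaluate(samples, labels, subjects, classify_sample):
    labels, subjects = np.asarray(labels), np.asarray(subjects)

    loso_scores = []                                   # one held-out subject per round
    for _, test_idx in LeaveOneGroupOut().split(samples, labels, groups=subjects):
        preds = [classify_sample(samples[i]) for i in test_idx]
        loso_scores.append(accuracy_score(labels[test_idx], preds))

    kfold_scores = []                                  # overall robustness check
    for _, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(samples):
        preds = [classify_sample(samples[i]) for i in test_idx]
        kfold_scores.append(accuracy_score(labels[test_idx], preds))

    return float(np.mean(loso_scores)), float(np.mean(kfold_scores))
```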

4. Results

4.1. Performance Metrics

The system’s performance was evaluated using the following metrics: accuracy, the proportion of correct predictions; precision, recall, and F1-score for each emotional class; and a confusion matrix for a detailed analysis of the classification errors.
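These metrics can be obtained directly from scikit-learn, as in the brief sketch below, where y_true and y_pred denote the reference and predicted emotion labels.

```python
# Sketch of the evaluation metrics: accuracy, per-class precision/recall/F1,
# and the confusion matrix, computed with scikit-learn.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def report(y_true, y_pred, labels):
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred, labels=labels,
                                digits=2, zero_division=0))
    print(confusion_matrix(y_true, y_pred, labels=labels))
```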
Our system demonstrates a solid overall performance, with an accuracy of 93% across all emotion categories. The weighted average metrics (precision: 0.96, recall: 0.93, F1 score: 0.94) indicate a robust performance, taking into account the non-uniform distribution of the sample data. However, the macro average (precision: 0.85, recall: 0.94, F1 score: 0.86) revealed some variability in performance across different emotion categories, suggesting that certain emotions are detected more reliably than others.
The detailed performance for each emotional class existing in the CASME II dataset is presented in Table 4.
The system particularly excelled in detecting fear, with almost perfect metrics (precision: 0.99, recall: 1.00, F1 score: 0.99) for 86 samples, making it the most reliably detected emotion. Disgust and repression also showed an impressive performance, with F1-scores of 0.93 and 0.98, respectively. Happiness detection was very precise, with perfect precision (1.00) and high recall (0.94).
The high recall values for most categories (five out of seven categories had a recall ≥0.94) indicate that the system rarely missed instances of these emotions when present, which is the most important aspect for real-time applications that require high sensitivity.
The most significant challenge lies in detecting sadness, which presents a perfect recall but an extremely low precision (0.30). This imbalance suggests that the system frequently classified other emotions as sadness, leading to many false-positive results. However, the very small sample size (only three instances) makes these metrics statistically unreliable and requires larger validation datasets.
The detection of surprise showed moderate precision (0.68) despite a perfect recall, indicating some confusion with other emotions. The “Others” category maintained perfect precision but a lower recall (0.75), suggesting that some non-standard expressions are incorrectly classified into specific emotion categories.

4.2. Confusion Matrix Analysis

The confusion matrix from Figure 5 provides a clear perspective on the performance of the micro-expression detection system using the CASME II dataset.
This analysis highlights the strengths and challenges of the system in correctly identifying the seven emotion categories. The main diagonal of the matrix, representing correct predictions, showed dominant values (72, 84, 15, 16, 30, 3, 17), indicating an overall good system performance. The confusion matrix indicates that the system performed well overall, with most emotions being correctly classified. Notably, disgust (72/80), fear (84/86), repression (30/31), and surprise (17/19) were recognized with high precision, highlighting the system’s strength on well-represented and structurally distinct classes.
However, happiness and the “Others” category showed moderate confusion, with two instances of happiness misclassified as surprise and as “Others”. This may be due to overlapping AU patterns, such as AU12 appearing weakly in other expressions.
The most significant challenge remains sadness, with only three available samples, all misclassified—reflecting the impact of severe class imbalance and the lack of representative AU patterns in training.
These observations confirm that the recognition quality depends not only on AU distinctiveness, but also on sample distribution, supporting the importance of both rule-based and data-driven components.
The real-time performance of the proposed system, as depicted in Figure 6, demonstrates its operational viability for practical micro-expression recognition. At a processing rate of 30 FPS (0.033 s/frame), micro-expressions of 100 ms duration were captured across approximately three frames, while shorter expressions of 40–50 ms were preserved in ~1.5 frames, providing fine-grained temporal resolution consistent with the established benchmarks for high-sensitivity facial analysis. Even when operating at 24 FPS (0.042 s/frame), the system maintained an adequate resolution, capturing 100 ms events in ~2.4 frames and shorter events in at least one frame. This satisfies the minimum functional threshold of 20 FPS, widely cited as the baseline requirement for real-time facial behavior monitoring systems (as noted in [9,44]). The observed drop in frame rate during expression processing was expected due to the additional computational load involved in landmark tracking, AU analysis, and language model inference. However, this drop remained within the acceptable performance envelope (24–30 FPS) for real-time applications in human–computer interaction, psychological assessment, and security contexts. This temporal resolution enabled the system to capture the onset and apex of micro-expressions—critical phases for accurate emotion interpretation—thus preserving detection integrity even under moderate computational constraints.
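The frame-coverage figures quoted above follow directly from the frame interval, as the short check below illustrates.

```python
# Worked check of the temporal-resolution figures: number of frames spanning a
# micro-expression of a given duration at a given frame rate.
def frames_covering(duration_ms, fps):
    return duration_ms / (1000.0 / fps)

print(frames_covering(100, 30))   # 3.0 frames at 30 FPS
print(frames_covering(50, 30))    # 1.5 frames for a 50 ms event
print(frames_covering(100, 24))   # 2.4 frames at 24 FPS
```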
Classification latency using the OpenAI API was recorded as the round-trip duration between the API request submission and receipt of the model’s response. This measurement does not reflect a consistent runtime performance, as it depends on the external server load and network latency. For this reason, only the local processing time (AU extraction and rule-based classification) was used in the timing comparisons.

5. Discussion

The results of our micro-expression detection system demonstrate a strong overall performance, particularly in recognizing high-frequency emotions such as fear, disgust, and repression. The system achieved a high-weighted average across the key evaluation metrics, reflecting its ability to handle class imbalance effectively. This indicates that the model is well-calibrated for real-world applications where emotional expressions are often unevenly distributed. The consistently high recall rates across most categories highlight the system’s sensitivity—an essential feature for detecting fleeting micro-expressions in dynamic environments.
Notably, the system excelled in identifying fear, with near-perfect metrics, suggesting that its associated facial action patterns are both distinct and well-learned by the model. Similarly, the reliable performance in disgust and repression detection reinforces the system’s robustness for these categories. However, a different pattern emerged for less represented or more ambiguous emotions. For example, while sadness showed perfect recall, its low precision indicates a high false-positive rate, largely due to its confusion with disgust. This misclassification reflects underlying similarities in facial muscle activity between these emotions, a challenge compounded by the limited number of sadness samples available for training.
Furthermore, the confusion between happiness and surprise—despite high precision for the former—suggests overlapping expressive features, likely due to shared AU activations around the mouth and eyes. Similar ambiguities were observed in the “Others” category, which, although handled with high precision, suffered from moderate recall, implying that several non-standard expressions were misassigned to more defined emotional classes.
With 93.3% accuracy on CASME II under the class imbalance and overlapping categories, the architecture demonstrated strong robustness—driven by expert-rule integration and AU-based reasoning.
The comparative analysis presented in Table 5 highlights both the diversity of the methodologies used in micro-expression recognition and the relative strengths of our proposed system. Importantly, all listed methods were evaluated on the CASME II dataset, which ensures the comparability of results under consistent experimental conditions. Within this shared benchmark, our modular architecture—combining expert-rule reasoning with GPT-3.5-based prompt engineering—achieved the highest reported accuracy of 93.3%. This performance exceeded that of traditional deep learning models such as 3D CNNs [45], lightweight ResNet-based architectures [46], and video transformers [47] as well as advanced transformer-based methods like µ-BERT [23], which currently represents one of the most effective approaches tailored for micro-expression recognition. The consistent dataset context confirms that the performance gains are attributable to the architecture itself, rather than to differences in data or evaluation protocol.
Compared with µ-BERT, which reports 90.34% accuracy, our method delivered a higher accuracy of 93.3% while offering greater interpretability and reduced computational complexity. µ-BERT’s architecture, although highly accurate, relies on extensive pretraining on millions of unlabeled frames and specialized modules such as Diagonal Micro-Attention and Patch of Interest. In contrast, our use of domain-specific rules and prompt optimization makes the system more transparent and adaptable to practical settings with limited computational resources.
The GPT-4V model evaluated by [24] demonstrates the growing potential of multimodal large language models (MLLMs) in generalized emotion recognition tasks. However, its low accuracy (14.64%) on micro-expression recognition highlights a clear gap between general-purpose zero-shot models and domain-specialized systems. This result emphasizes the importance of incorporating domain knowledge—either through expert rules, task-specific architecture (as in µ-BERT), or hybrid approaches like ours—to handle subtle, low-intensity emotional expressions effectively.
While LLMs enhance the system’s adaptability, domain-informed classification based on action units remains essential for high-precision tasks. The hybrid integration ensures reliable performance in sensitive applications where both accuracy and interpretability are critical.
The findings of this study offer several important advancements in the field of micro-expression recognition, with implications for both scientific understanding and real-world applications. The proposed hybrid architecture—combining MediaPipe-based facial feature extraction, expert rule encoding, and prompt-driven language model classification—represents a novel paradigm that departs from traditional end-to-end deep learning pipelines. This modular integration enhances system interpretability and enables flexible adaptation across datasets and use cases.
One of the most notable contributions lies in the resolution of frequent emotional confusions such as those between fear and surprise, or sadness and disgust. By embedding specialized decision rules into the classification logic, the system effectively addresses a persistent challenge in affective computing: the overlap in facial AUs between certain emotion classes. This targeted disambiguation, supported by high recall and precision scores, underpins the system’s superior accuracy compared with both conventional CNN-based and transformer-based architectures.
Moreover, the inclusion of weak signal amplification techniques addresses the inherent difficulty of detecting micro-expressions—brief, subtle, and low-intensity facial movements often imperceptible to the human eye. This capability positions the system for high sensitivity in security-critical applications, such as lie detection or psychological assessment, where missed signals could be consequential.
The study also introduced a continuous learning framework, allowing the knowledge base to evolve as new data are encountered. This adaptability enhances generalization to new subjects and mitigates the risk of overfitting to limited or imbalanced datasets such as CASME II. Furthermore, the use of rigorous validation protocols, including leave-one-subject-out and k-fold cross-validation, ensures the robustness and reproducibility of the results.
In sum, this research contributes by offering a methodologically transparent, high-performance, and practically deployable system for micro-expression recognition. It bridges the gap between high interpretability and cutting-edge accuracy while also setting a precedent for how hybrid AI-human knowledge systems can outperform purely data-driven approaches in specialized affective computing tasks.
While our system addresses several critical limitations of traditional micro-expression recognition methods, such as extending recognition to seven distinct emotions, handling short-duration and low-intensity expressions, functioning under variable lighting and head pose conditions, and adapting to individual expression differences, there remain important challenges that warrant discussion.
First, despite the system’s demonstrated effectiveness, it is still partially dependent on the quality and diversity of the training dataset. The CASME II dataset, though widely used, presents a known class imbalance, with emotions such as sadness and surprise being significantly underrepresented. This imbalance may have introduced classification bias, especially in rare emotion categories, leading to inflated recall but lower precision, as observed in the case of sadness. The rule-based disambiguation partially mitigated these effects but could not fully compensate for the lack of representative data.
Second, while the integration of expert rules and prompt-based language models significantly improved interpretability and performance, the system’s reliance on handcrafted rules and prompt engineering introduces potential brittleness. The effectiveness of emotion disambiguation heavily depends on the quality and comprehensiveness of these rules. As such, the system may not generalize optimally to unseen emotion variants or spontaneous expressions that fall outside the defined decision boundaries.
Third, although the system includes mechanisms for continuous knowledge updating, these are not fully autonomous and require manual oversight for rule validation and performance assessment. This semi-automated learning may limit scalability in real-time or high-volume deployment scenarios.
Additionally, while robustness to variable lighting and moderate head movements was qualitatively observed, these factors were not explicitly isolated and tested in controlled experimental conditions. Therefore, the claimed generalizability across challenging real-world scenarios, such as occlusion, rapid head motion, or extreme illumination shifts, should be interpreted with caution until further validation is conducted.
Finally, while the integration of the OpenAI GPT model enabled sophisticated emotion classification, it introduced a dependence on external API access, associated usage limits, and potential variability in model responses. This reliance could affect system consistency in time-sensitive or privacy-constrained applications.
In summary, although the proposed system sets a new benchmark in hybrid micro-expression recognition, future work should address dataset diversity, enhance rule adaptability, automate knowledge updating, and improve robustness testing to further solidify its utility in practical, real-world settings.
Building on the promising results and architectural innovations of this study, several directions for future research can be pursued to further advance the field of micro-expression recognition.
First, to mitigate class imbalance and improve generalization across rare emotional categories, future studies should incorporate more diverse and balanced datasets including CASME3, SAMM, and SMIC. Cross-dataset evaluation protocols could also be adopted to assess the transferability of the system under varying recording conditions, cultural expression norms, and emotional labeling schemas.
Second, while the current system leverages handcrafted expert rules to resolve emotion confusion, future work could explore adaptive rule generation via data-driven methods. For example, integrating symbolic learning or reinforcement learning to refine or suggest rule updates based on misclassifications could reduce the reliance on manual tuning and enhance scalability.
Additionally, the prompt optimization process, while effective, remains semi-manual and may benefit from automated techniques. Employing prompt tuning or instruction optimization frameworks could help identify optimal prompt structures for different datasets or emotion subsets, thereby improving consistency and reducing model variability.
Further experiments are also needed to quantify the system’s robustness to environmental variations. Dedicated testing under conditions such as occlusions, rapid head movements, extreme lighting changes, and varied camera angles would provide a clearer understanding of the system’s applicability in real-world, unconstrained environments.
Moreover, expanding the continuous learning component into a fully automated feedback loop—potentially through unsupervised clustering of new emotion data—would allow the system to evolve over time without human intervention, enhancing its long-term adaptability.
Finally, future work could investigate the integration of audio or physiological signals (e.g., heartbeat, skin conductance) to enrich the emotion recognition process. Multimodal fusion, especially when aligned with the modular architecture of the current system, could substantially improve the recognition of complex or ambiguous affective states.
Enhancing dataset diversity, automating rule and prompt adaptation, expanding multimodal capabilities, and validating real-world robustness stand as key directions to strengthen and extend the impact of the proposed system.

6. Conclusions

This study presents a significant advancement in the field of micro-expression recognition by introducing a hybrid system that effectively bridges domain-specific knowledge with state-of-the-art artificial intelligence. Through the integration of facial action coding principles, modern computer vision frameworks, and the contextual reasoning capabilities of large language models, the proposed approach offers a novel solution to many of the challenges traditionally associated with micro-expression detection.
Specifically, we proposed a hybrid AU-LLM architecture that integrates a rule-based AU detection pipeline with a transformer-based LLM, GPT-3.5-turbo, which serves as an intelligent fallback for ambiguous or partial AU configurations. To our knowledge, this represents one of the first practical implementations of an LLM in a micro-expression recognition system. The system incorporates a rule validation layer that enforces domain-specific AU-emotion mappings after LLM classification, ensuring that the outputs remain interpretable and conform to psychological ground truths—a functionality often missing in black-box hybrid models. Additionally, we demonstrated robust performance on the CASME II dataset, achieving 93.3% accuracy despite severe class imbalance, without relying on conventional deep model retraining or synthetic data generation. This was enabled by our AU-based amplification strategy, which enhanced subtle expression patterns in a lightweight, interpretable way. Finally, our use of transparent prompt engineering allowed the LLM to operate within well-defined diagnostic constraints, promoting explainability and domain alignment rather than generic text-based reasoning. These integrated innovations collectively position our system as a novel, scalable, and explainable solution for real-time micro-expression analysis.
Theoretical insights gained from this work reinforce the importance of AU patterns in encoding emotional content, and the system’s confusion matrix analysis revealed empirically grounded trends in misclassification between visually similar emotions. These findings contribute to a deeper understanding of the perceptual and semantic overlap among certain emotional states. Furthermore, the observed variability in expression patterns across subjects underscores the necessity of adaptive mechanisms that account for individual differences in affective display. The study also highlights the role of contextual interpretation, suggesting that purely visual analysis may be insufficient for robust emotion inference, especially in subtle or ambiguous expressions.
From a practical standpoint, the developed system holds considerable promise for real-world applications across several domains. In security and law enforcement, it offers a tool for enhancing deception detection and behavioral analysis. In clinical psychology, it provides an objective layer for evaluating patient affect, potentially supporting diagnostic and therapeutic processes. Marketing and consumer research may benefit from its capacity to discern authentic emotional reactions, while human–computer interaction stands to gain from the system’s potential to enable emotionally responsive interfaces.
By demonstrating the effectiveness of combining symbolic rules, weak signal amplification, and AI-driven classification, this work lays a foundation for more intelligent, interpretable, and adaptable emotion recognition systems. It also opens pathways for further exploration into multimodal emotion analysis, cultural variation in expression, and real-time deployment in diverse environments. In doing so, the study contributes meaningfully to the broader goals of affective computing and emotionally intelligent technologies.
While the current system achieved strong performance and demonstrated real-time feasibility, several avenues remain open for future exploration. Expanding the training and validation across more diverse and larger datasets, such as CASME3, SAMM, and SMIC, would enhance the generalizability of the model to different ethnicities, age groups, and cultural contexts. Additionally, integrating audio and textual inputs could facilitate the development of multimodal systems capable of interpreting emotions in a richer communicative framework. In the current version of the system, we implemented a basic form of synthetic AU injection, in which expected AUs (e.g., AU1 + AU15 for sadness) were artificially added to real samples during evaluation to improve recognition. Future extensions of this work will consider data augmentation techniques to enhance the representation of underrepresented classes and improve the classification balance. Automating the refinement of expert rules and prompt structures using adaptive or reinforcement learning methods may further improve system scalability and reduce manual tuning. Future work should also focus on quantifying robustness under uncontrolled environmental conditions including varying lighting, occlusions, and natural head movements. Finally, investigating user-centered design and ethical implications—particularly in sensitive domains such as surveillance or clinical assessment—will be essential to ensure the responsible and human-aligned deployment of micro-expression recognition technologies.

Author Contributions

Conceptualization, M.L. and A.C.I.; Methodology, A.C.I.; Software, G.L.S.; Validation, M.L., G.L.S. and A.C.I.; Formal analysis, M.L.; Investigation, G.L.S.; Resources, G.L.S.; Data curation, G.L.S.; Writing—original draft preparation, G.L.S.; Writing—review and editing, M.L. and A.C.I.; Visualization, M.L. and A.C.I.; Supervision, M.L.; Project administration, A.C.I.; Funding acquisition, G.L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study in accordance with the policies of the University of Petroșani, which exempt low-risk, non-invasive technical research involving adult volunteers from ethical approval.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The CASME II dataset was used in this study. Access was granted on 7 February 2025 via http://casme.psych.ac.cn/casme/e2 (accessed on 8 February 2025).

Acknowledgments

During the preparation of this manuscript, the authors used Copilot for the purposes of language editing, drafting assistance, and reference formatting. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ekman, P.; Friesen, W.V. The Repertoire of Nonverbal Behavior: Categories, Origins, Usage, and Coding. Semiotica 1969, 1, 49–98. [Google Scholar] [CrossRef]
  2. Yan, W.J.; Wu, Q.; Liang, J.; Chen, Y.H.; Fu, X. How Fast are the Leaked Facial Expressions: The Duration of Micro-Expressions. J. Nonverbal Behav. 2013, 37, 217–230. [Google Scholar] [CrossRef]
  3. Frank, M.G.; Svetieva, E. Microexpressions and Deception. In Understanding Facial Expressions in Communication; Springer: Dordrecht, The Netherlands, 2014; pp. 227–242. [Google Scholar] [CrossRef]
  4. Yan, W.J.; Li, X.; Wang, S.J.; Zhao, G.; Liu, Y.J.; Chen, Y.H.; Fu, X. CASME II: An Improved Spontaneous Micro-Expression Database and the Baseline Evaluation. PLoS ONE 2014, 9, e86041. [Google Scholar] [CrossRef] [PubMed]
  5. Haggard, E.A.; Isaacs, K.S. Micromomentary Facial Expressions as Indicators of Ego Mechanisms in Psychotherapy. In Methods of Research in Psychotherapy; Springer: Boston, MA, USA, 1966. [Google Scholar] [CrossRef]
  6. Ekman, P.; Friesen, W.V. Facial Action Coding System; Consulting Psychologists Press: Palo Alto, CA, USA, 1978. [Google Scholar] [CrossRef]
  7. Cohn, J.F.; Ambadar, Z.; Ekman, P. Observer-Based Measurement of Facial Expression with the Facial Action Coding System. In Handbook of Emotion Elicitation and Assessment; Coan, J.A., Allen, J.J.B., Eds.; Oxford University Press: New York, NY, USA, 2007; pp. 203–221. Available online: https://psycnet.apa.org/record/2007-08864-013 (accessed on 20 February 2025).
  8. Shreve, M.; Godavarthy, S.; Manohar, V.; Goldgof, D.; Sarkar, S. Towards Macro- and Micro-Expression Spotting in Video Using Strain Patterns. In Proceedings of the 2009 Workshop on Applications of Computer Vision (WACV), Snowbird, UT, USA, 7–8 December 2009; pp. 1–6. [Google Scholar] [CrossRef]
  9. Pfister, T.; Li, X.; Zhao, G.; Pietikäinen, M. Recognising Spontaneous Facial Micro-Expressions. In Proceedings of the 2011 International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 1449–1456. [Google Scholar] [CrossRef]
  10. Wang, Y.; See, J.; Phan, R.C.W.; Oh, Y.H. LBP with Six Intersection Points: Reducing Redundant Information in LBP-TOP for Micro-Expression Recognition. In Computer Vision—ACCV 2014; Cremers, D., Reid, I., Saito, H., Yang, M.H., Eds.; Springer: Cham, Switzerland, 2015; Volume 9003, pp. 525–537. [Google Scholar] [CrossRef]
  11. Huang, X.; Zhao, G.; Hong, X.; Zheng, W.; Pietikäinen, M. Spontaneous Facial Micro-Expression Analysis Using Spatiotemporal Completed Local Quantized Patterns. Neurocomputing 2016, 175, 564–578. [Google Scholar] [CrossRef]
  12. Yang, H.; Huang, S.; Li, M. MSOF: A main and secondary bi-directional optical flow feature method for spotting micro-expression. Neurocomputing 2025, 630, 129676. [Google Scholar] [CrossRef]
  13. Yang, X.; Yang, H.; Li, J.; Wang, S. Simple but effective in-the-wild micro-expression spotting based on head pose segmentation. In Proceedings of the 3rd Workshop on Facial Micro-Expression: Advanced Techniques for Multi-Modal Facial Expression Analysis, Ottawa, ON, Canada, 29 October 2023; pp. 9–16. [Google Scholar] [CrossRef]
  14. Patel, D.; Hong, X.; Zhao, G. Selective Deep Features for Micro-Expression Recognition. In Proceedings of the 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 2258–2263. [Google Scholar] [CrossRef]
  15. Zhang, H.; Huang, B.; Tian, G. Facial Expression Recognition Based on Deep Convolution Long Short-Term Memory Networks of Double-Channel Weighted Mixture. Pattern Recognit. Lett. 2020, 131, 128–134. [Google Scholar] [CrossRef]
  16. Sălăgean, G.L.; Leba, M.; Ionica, A.C. Leveraging Symmetry and Addressing Asymmetry Challenges for Improved Convolutional Neural Network-Based Facial Emotion Recognition. Symmetry 2025, 17, 397. [Google Scholar] [CrossRef]
  17. Lei, L.; Li, J.; Chen, T.; Li, S. A Novel Graph-TCN with a Graph Structured Representation for Micro-Expression Recognition. In Proceedings of the 28th ACM International Conference on Multimedia (MM ‘20), Seattle, WA, USA, 12–16 October 2020; pp. 2237–2245. [Google Scholar] [CrossRef]
  18. Kumar, A.J.R.; Bhanu, B. Micro-Expression Classification Based on Landmark Relations with Graph Attention Convolutional Network. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 1511–1520. [Google Scholar] [CrossRef]
  19. Phan, H.T.; Nguyen, N.T.; Hwang, D. Aspect-level sentiment analysis: A survey of graph convolutional network methods. Inf. Fusion 2023, 91, 149–172. [Google Scholar] [CrossRef]
  20. Shehu, H.A.; Browne, W.N.; Eisenbarth, H. Emotion categorization from facial expressions: A review of datasets, methods, and research directions. Neurocomputing 2025, 624, 129367. [Google Scholar] [CrossRef]
  21. Ma, F.; Sun, B.; Li, S. Transformer-Augmented Network with Online Label Correction for Facial Expression Recognition. IEEE Trans. Affect. Comput. 2024, 15, 593–605. [Google Scholar] [CrossRef]
  22. Zhao, Z.; Patras, I. Prompting Visual-Language Models for Dynamic Facial Expression Recognition. arXiv 2024, arXiv:2308.13382. [Google Scholar]
  23. Nguyen, X.-B.; Duong, C.N.; Li, X.; Gauch, S.; Seo, H.-S.; Luu, K. Micron-BERT: BERT-Based Facial Micro-Expression Recognition. arXiv 2023, arXiv:2304.03195. [Google Scholar]
  24. Lian, Z.; Sun, L.; Sun, H.; Chen, K.; Wen, Z.; Gu, H.; Liu, B.; Tao, J. GPT-4V with Emotion: A Zero-shot Benchmark for Generalized Emotion Recognition. arXiv 2024, arXiv:2312.04293. [Google Scholar] [CrossRef]
  25. Zhang, L.; Luo, Z.; Wu, S.; Nakashima, Y. MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Subtle Clue Dynamics in Video Dialogues. In Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing (MRAC ‘24), Melbourne, VIC, Australia, 28 October–1 November 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 110–115. [Google Scholar] [CrossRef]
  26. Zhao, Z.; Cao, Y.; Gong, S.; Patras, I. Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025; pp. 815–824. [Google Scholar] [CrossRef]
  27. Lan, X.; Xue, J.; Qi, J.; Jiang, D.; Lu, K.; Chua, T.S. ExpLLM: Towards Chain of Thought for Facial Expression Recognition. arXiv 2024, arXiv:2409.02828. [Google Scholar] [CrossRef]
  28. Talib, H.K.B.; Xu, K.; Cao, Y.; Xu, Y.P.; Xu, Z.; Zaman, M.; Akhunzada, A. Micro-Expression Recognition Using Convolutional Variational Attention Transformer (ConVAT) with Multihead Attention Mechanism. IEEE Access 2025, 13, 20054–20070. [Google Scholar] [CrossRef]
  29. Wang, Y.; Zhang, Q.; Shu, X. Micro-Expression Recognition Using a Multi-Scale Feature Extraction Network with Attention Mechanisms. Signal Image Video Process. 2024, 18, 5137–5147. [Google Scholar] [CrossRef]
  30. Gupta, D.K.; Agarwal, D.; Perwej, Y.; Vishwakarma, O.; Mishra, P.; Nitya. Sensing Human Emotion Using Emerging Machine Learning Techniques. Int. J. Sci. Res. Sci. Eng. Technol. 2024, 11, 80–91. [Google Scholar] [CrossRef]
  31. Wang, Z.; Yang, M.; Jiao, Q.; Xu, L.; Han, B.; Li, Y.; Tan, X. Two-Level Spatio-Temporal Feature Fused Two-Stream Network for Micro-Expression Recognition. Sensors 2024, 24, 1574. [Google Scholar] [CrossRef] [PubMed]
  32. Naidana, K.S.; Divvela, L.P.; Yarra, Y. Micro-Expression Recognition Using Generative Adversarial Network-Based Convolutional Neural Network. In Proceedings of the 2024 4th International Conference on Pervasive Computing and Social Networking (ICPCSN), Salem, India, 7–8 March 2024. [Google Scholar] [CrossRef]
  33. Chaubey, A.; Guan, X.; Soleymani, M. Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning. arXiv 2025, arXiv:2504.07198. [Google Scholar]
  34. Anand, M.; Babu, S. Multi-Class Facial Emotion Expression Identification Using DL-Based Feature Extraction with Classification Models. Int. J. Comput. Intell. Syst. 2024, 17, 25. [Google Scholar] [CrossRef]
  35. Pereira, R.; Mendes, C.; Ribeiro, J.; Ribeiro, R.; Miragaia, R.; Rodrigues, N.M.M.; Costa, N.M.; Costa, N.; Pereira, A. Systematic Review of Emotion Detection with Computer Vision and Deep Learning. Sensors 2024, 24, 3484. [Google Scholar] [CrossRef] [PubMed]
  36. Flotho, P.; Heiß, C.; Steidl, G.; Strauß, D.J. Lagrangian Motion Magnification with Double Sparse Optical Flow Decomposition. Front. Appl. Math. Stat. 2023, 9, 1164491. [Google Scholar] [CrossRef]
  37. Cîrneanu, A.-L.; Popescu, D.; Iordache, D.D. New Trends in Emotion Recognition Using Image Analysis by Neural Networks: A Systematic Review. Sensors 2023, 23, 7092. [Google Scholar] [CrossRef] [PubMed]
  38. Cheng, S.-W.; Chang, C.-W.; Chang, W.-J.; Wang, H.; Liang, C.-S.; Kishimoto, T.; Chang, J.P.-C.; Kuo, J.S.; Su, K. The Now and Future of ChatGPT and GPT in Psychiatry. Psychiatry Clin. Neurosci. 2023, 77, 592–596. [Google Scholar] [CrossRef]
  39. Amrutha, K.; Prabu, P.; Paulose, J. Human Body Pose Estimation and Applications. In Proceedings of the Innovations in Power and Advanced Computing Technologies (i-PACT), Kuala Lumpur, Malaysia, 27–29 November 2021; pp. 1–6. [Google Scholar] [CrossRef]
  40. Sasidhar, T.T.; Premjith, B.; Soman, K.P. Emotion Detection in Hinglish (Hindi+English) Code-Mixed Social Media Text. Procedia Comput. Sci. 2020, 171, 1346–1352. [Google Scholar] [CrossRef]
  41. Waqas, A.; Tripathi, A.; Ramachandran, R.P.; Stewart, P.A.; Rasool, G. Multimodal Data Integration for Oncology in the Era of Deep Neural Networks: A Review. Front. Artif. Intell. 2024, 7, 1408843. [Google Scholar] [CrossRef]
  42. iMotions. Learning the Facial Action Coding System (FACS). Available online: https://imotions.com/blog/learning/research-fundamentals/facial-action-coding-system/ (accessed on 12 December 2024).
  43. Matsumoto, D.; Hwang, H.S. Evidence for Training the Ability to Read Microexpressions of Emotion. Motiv. Emot. 2011, 35, 181–191. [Google Scholar] [CrossRef]
  44. Davison, A.K.; Lansley, C.; Costen, N.P.; Tan, K.; Yap, M.H. SAMM: A Spontaneous Micro-Facial Movement Dataset. IEEE Trans. Affect. Comput. 2018, 9, 116–129. [Google Scholar] [CrossRef]
  45. Talluri, K.K.; Fiedler, M.-A.; Al-Hamadi, A. Deep 3D Convolutional Neural Network for Facial Micro-Expression Analysis from Video Images. Appl. Sci. 2022, 12, 11078. [Google Scholar] [CrossRef]
  46. Belaiche, R.; Liu, Y.; Migniot, C.; Ginhac, D.; Yang, F. Cost-Effective CNNs for Real-Time Micro-Expression Recognition. Appl. Sci. 2020, 10, 4959. [Google Scholar] [CrossRef]
  47. Hong, J.; Lee, C.; Jung, H. Late Fusion-Based Video Transformer for Facial Micro-Expression Recognition. Appl. Sci. 2022, 12, 1169. [Google Scholar] [CrossRef]
  48. Li, J.; Wang, T.; Wang, S.-J. Facial Micro-Expression Recognition Based on Deep Local-Holistic Network. Appl. Sci. 2022, 12, 4643. [Google Scholar] [CrossRef]
Figure 1. Proposed modular approach in the micro-expression detection system.
Figure 2. System architecture.
Figure 3. Rules-based logic diagram for eliminating confusions.
Figure 4. LLM integration process.
Figure 5. Confusion matrix.
Figure 6. Real-time running results (presented with the written approval of the subject).
Table 2. Action units and their descriptions (Adapted from freely available resource [42]).
Action Unit | Description
AU1 | "browInnerUp", "browInnerUpLeft", "browInnerUpRight", "browInnerRaiser"
AU2 | "browOuterUpLeft", "browOuterUpRight"
AU4 | "browDown", "browDownLeft", "browDownRight", "browLowerer", "browFurrower"
AU5 | "eyeLookUpLeft", "eyeLookUpRight", "eyeWideLeft", "eyeWideRight", "upperLidRaiser"
AU6 | "cheekSquintLeft", "cheekSquintRight", "cheekRaiser"
AU7 (Crucial for Repression) | "lidTightener", "eyeSquintLeft", "eyeSquintRight"
AU9 (Crucial for Disgust) | "noseSneerLeft", "noseSneerRight", "noseWrinkler"
AU10 | "mouthUpperUpLeft", "mouthUpperUpRight", "upperLipRaiser"
AU12 (Crucial for Happiness) | "lipCornerPuller", "lipCornerPullerLeft", "lipCornerPullerRight", "mouthSmileLeft", "mouthSmileRight"
AU14 | "mouthDimpleLeft", "mouthDimpleRight"
AU15 (Crucial for Sadness) | "lipCornerDepressor", "lipCornerDepressorLeft", "lipCornerDepressorRight", "mouthFrownLeft", "mouthFrownRight"
AU16 | "mouthLowerDownLeft", "mouthLowerDownRight"
AU17 | "chinRaiser"
AU20 (Crucial for Fear) | "mouthStretchLeft", "mouthStretchRight", "lipStretcher"
AU23 | "lipsTogether"
AU24 | "lipPressLeft", "lipPressRight"
AU25 | "lipsPart"
AU26 | "jawOpen", "mouthOpen"
AU45 | "eyeBlinkLeft", "eyeBlinkRight"
AU29 | "jawForward"
AU28 | "jawLeft", "jawRight"
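To make the mapping in Table 2 concrete, the short Python sketch below aggregates MediaPipe FaceLandmarker blendshape scores into per-AU intensities. The max-based aggregation, the reduced mapping (covering only a subset of the table), and the variable names are assumptions made for illustration.

```python
# Minimal sketch: collapse per-blendshape scores into per-AU intensities.
AU_BLENDSHAPES = {
    "AU1":  ["browInnerUp"],
    "AU2":  ["browOuterUpLeft", "browOuterUpRight"],
    "AU4":  ["browDownLeft", "browDownRight"],
    "AU6":  ["cheekSquintLeft", "cheekSquintRight"],
    "AU7":  ["eyeSquintLeft", "eyeSquintRight"],
    "AU9":  ["noseSneerLeft", "noseSneerRight"],
    "AU12": ["mouthSmileLeft", "mouthSmileRight"],
    "AU15": ["mouthFrownLeft", "mouthFrownRight"],
    "AU20": ["mouthStretchLeft", "mouthStretchRight"],
    "AU26": ["jawOpen"],
}

def blendshapes_to_aus(blendshape_scores: dict[str, float]) -> dict[str, float]:
    """Return one intensity per action unit, taking the strongest mapped blendshape."""
    return {
        au: max((blendshape_scores.get(name, 0.0) for name in names), default=0.0)
        for au, names in AU_BLENDSHAPES.items()
    }

# Example with a few raw blendshape scores, e.g. taken from the
# face_blendshapes output of the MediaPipe FaceLandmarker task.
scores = {"mouthSmileLeft": 0.42, "mouthSmileRight": 0.38, "cheekSquintLeft": 0.21}
print(blendshapes_to_aus(scores))  # AU12 and AU6 receive non-zero intensities
```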
Table 3. Combination of emotions and AUs [6,43].
Emotion | AU
Happiness | AU12, AU6
Disgust | AU9, AU10, AU15
Repression | AU4, AU7, AU23
Fear | AU1, AU2, AU4, AU20
Surprise | AU1, AU2, AU5, AU26
Sadness | AU1, AU15, AU17
Contempt | AU14, AU24
Anger | AU4, AU5, AU7, AU23
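The AU combinations in Table 3 lend themselves to a simple rule-based scoring scheme. The sketch below is a minimal illustration, assuming a fixed activation threshold and coverage-based scoring; it omits the additional expert rules the system applies to resolve confusions between visually similar emotions.

```python
# Simplified rule-based scoring over the AU combinations in Table 3.
EMOTION_AUS = {
    "happiness":  ["AU12", "AU6"],
    "disgust":    ["AU9", "AU10", "AU15"],
    "repression": ["AU4", "AU7", "AU23"],
    "fear":       ["AU1", "AU2", "AU4", "AU20"],
    "surprise":   ["AU1", "AU2", "AU5", "AU26"],
    "sadness":    ["AU1", "AU15", "AU17"],
    "contempt":   ["AU14", "AU24"],
    "anger":      ["AU4", "AU5", "AU7", "AU23"],
}

def classify_by_rules(au_intensities: dict[str, float], threshold: float = 0.1) -> str:
    """Return the emotion whose expected AUs are most completely activated."""
    active = {au for au, v in au_intensities.items() if v >= threshold}
    scores = {
        emotion: len(active & set(aus)) / len(aus)
        for emotion, aus in EMOTION_AUS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "others"

print(classify_by_rules({"AU12": 0.35, "AU6": 0.20}))   # -> happiness
print(classify_by_rules({"AU1": 0.30, "AU15": 0.25}))   # -> sadness
```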
Table 4. Performance for each emotional class.
Micro-Expression | Precision | Recall | F1-Score | Support (Image Number)
Disgust | 1.00 | 0.86 | 0.93 | 80
Fear | 0.99 | 1.00 | 0.99 | 86
Happiness | 1.00 | 0.94 | 0.97 | 17
Repression | 0.97 | 1.00 | 0.98 | 31
Sadness | 0.30 | 1.00 | 0.46 | 3
Surprise | 0.68 | 1.00 | 0.81 | 17
Others | 1.00 | 0.75 | 0.86 | 20
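The per-class figures reported in Table 4 follow the standard precision, recall, and F1 definitions and can be reproduced from predicted and reference labels with scikit-learn; the label arrays in the sketch below are illustrative placeholders, not the evaluation data.

```python
# Reproducing per-class precision/recall/F1 from labels with scikit-learn.
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["disgust", "fear", "sadness", "surprise", "disgust", "others"]
y_pred = ["disgust", "fear", "sadness", "surprise", "fear",    "others"]

labels = sorted(set(y_true))
print(confusion_matrix(y_true, y_pred, labels=labels))          # misclassification trends
print(classification_report(y_true, y_pred, digits=2))          # precision, recall, F1, support
```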
Table 5. Comparative analysis.
Method Type | Accuracy (%) | Main Advantage | Limitation
3D CNN + Apex Frame Selection [45] | 88.2 | Strong deep learning baseline with optimized video selection | Less interpretable, requires large data
Hybrid CNN (DLHN) + Handcrafted Features [48] | 60.3 | Good balance of learned and handcrafted features | Lower accuracy on fine-grained classes
Video Transformer + Optical Flow [47] | 73.2 | Transformer model for subtle motion capture | Computationally intensive, comparable accuracy only
Shallow CNN (ResNet-optimized) + Optical Flow [46] | 60.2 | Fast, low-resource, real-time compatible | Lower accuracy, limited class resolution
Transformer-based (µ-BERT) with Diagonal Micro-Attention and PoI modules [23] | 90.34 | State-of-the-art micro-expression accuracy using unsupervised pretraining | High computational demand, less interpretability, needs massive pretraining
Multimodal LLM (GPT-4V), zero-shot prompting [24] | 14.64 | Zero-shot capability across multiple GER tasks, strong multimodal reasoning | Low accuracy in micro-expression tasks due to lack of specialized knowledge
Our modular system + OpenAI API + Expert Rules | 93.3 | High accuracy, interpretable, handles emotional confusion | Requires prompt design and external API
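Regarding the prompt-design requirement noted for our system in Table 5, the sketch below indicates one way detected AUs and a rule-based candidate could be passed to a chat-completion model through the OpenAI Python client. The model name, prompt wording, and response handling are assumptions for illustration and do not reproduce the exact prompts used in this work.

```python
# Hedged sketch of LLM-assisted refinement of a rule-based candidate emotion.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_refine(aus: dict[str, float], rule_candidate: str) -> str:
    prompt = (
        "Facial action units detected (intensity 0-1): "
        + ", ".join(f"{au}={v:.2f}" for au, v in aus.items())
        + f". Rule-based candidate emotion: {rule_candidate}. "
        "Using FACS knowledge, return the single most plausible micro-expression "
        "label among: happiness, disgust, repression, fear, surprise, sadness, others."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example call (requires a valid API key):
# print(llm_refine({"AU1": 0.22, "AU15": 0.18}, "sadness"))
```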