Review

A Comparative Study of Emotion Recognition Systems: From Classical Approaches to Multimodal Large Language Models

by Mirela-Magdalena Grosu (Marinescu) 1, Octaviana Datcu 1,*, Ruxandra Tapu 1,2 and Bogdan Mocanu 1,2
1 Faculty of Electronics, Telecommunications and Information Technology, National University of Science and Technology “Politehnica” Bucharest, Bd. Iuliu Maniu 1-3, 061071 Bucharest, Romania
2 Laboratoire SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, 91000 Paris, France
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1289; https://doi.org/10.3390/app16031289
Submission received: 25 December 2025 / Revised: 18 January 2026 / Accepted: 19 January 2026 / Published: 27 January 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Featured Application

Emotion recognition in video (ERV) enables practical deployment in human–AI interaction, assistive systems, and intelligent monitoring applications. The review highlights engineering trade-offs among accuracy, robustness, and computational cost, showing that lightweight deep models are well suited to resource-constrained platforms, whereas multimodal large language models support context-aware interaction in cloud-assisted systems.

Abstract

Emotion recognition in video (ERV) aims to infer human affect from visual, audio, and contextual signals and is increasingly important for interactive and intelligent systems. Over the past decade, ERV has evolved from handcrafted features and task-specific deep learning models toward transformer-based vision–language models and multimodal large language models (MLLMs). This review surveys this evolution, with an emphasis on engineering considerations relevant to real-world deployment. We analyze multimodal fusion strategies, dataset characteristics, and evaluation protocols, highlighting limitations in robustness, bias, and annotation quality under unconstrained conditions. Emerging MLLM-based approaches are examined in terms of performance, reasoning capability, computational cost, and interaction potential. By comparing task-specific models with foundation model approaches, we clarify their respective strengths for resource-constrained versus context-aware applications. Finally, we outline practical research directions toward building robust, efficient, and deployable ERV systems for applied scenarios such as assistive technologies and human–AI interaction.

1. Introduction

Emotion recognition in video (ERV) is a core task in affective computing that draws on computer vision, speech analysis, and psychology. Despite this interdisciplinary foundation, ERV remains only partially integrated across these research communities. In practice, ERV refers to systems that attempt to infer people’s emotional states from video streams, usually by integrating facial expressions, body posture, the surrounding context, and available vocal cues. Although conceptually straightforward, emotion recognition is considerably more complex in practice, a point widely acknowledged in previous work [1,2].
Interest in ERV has steadily increased, driven partly by applications in human–AI interaction and healthcare and partly by the growing expectation that machines should be able to engage with humans in a socially aware manner. Although the extent to which such systems can reliably model human affect remains an open question, these expectations have driven rapid methodological advances. Nevertheless, the development of robust ERV systems remains challenging due to substantial inter- and intra-subject variability, strong dependence on context, annotation subjectivity, and the presence of noise and bias in available datasets. Two people rarely express the same emotion in the same way, and even a single individual does not do so consistently. When cultural differences, lighting variations, background noise, and occasional facial occlusions are introduced, the task becomes significantly more unpredictable than the benchmark evaluations suggest [3,4,5,6].
Early ERV systems relied on handcrafted features, such as facial landmarks, local binary patterns (LBP), and prosodic descriptors, combined with conventional classifiers. While effective under controlled conditions, these approaches exhibited limited robustness to variations in expression, pose, and recording conditions, leading to degraded performance in unconstrained environments [1]. The introduction of deep learning marked a significant advance in ERV, enabling substantial performance improvements. Convolutional neural networks (CNNs) facilitated the automatic learning of discriminative spatial representations, while their integration with temporal modeling mechanisms, such as long short-term memory (LSTM) networks or 3D convolutions, enabled the capture of dynamic patterns underlying emotional expressions [1,2]. This progress addressed several long-standing issues, while simultaneously introducing new challenges related to data requirements and model generalization.
Transformers pushed the field another step forward. Vision transformers (ViTs) [7] and subsequent video and multimodal transformer models demonstrated that long-range attention mechanisms can help stabilize recognition under the types of variation commonly observed in everyday video, as summarized in recent surveys [1,2,5].
A more substantial shift occurred with the introduction of vision–language models (VLMs) and their more powerful successors, multimodal large language models (MLLMs). These models altered the nature of the problem. Emotion recognition is no longer treated as a purely perceptual classification task but instead as a reasoning-oriented process that incorporates contextual cues, inferred intentions, and cross-modal interactions. Earlier VLMs, such as CLIP and ALIGN [8], demonstrated that large-scale image–text alignment exhibits strong transfer properties. BLIP and Flamingo [9,10] further enriched multimodal understanding. More recent MLLMs (e.g., GPT-4V, Gemini, and Kosmos) extend these capabilities to video, enabling the interpretation of emotional cues with a level of flexibility not attainable with earlier model architectures.
Recent survey work has also examined emotion recognition from physiological modalities. In particular, a comprehensive review of EEG-based emotion recognition analyzes methodological trends, challenges, and future directions in neural affective computing [11]. While such studies provide valuable insights into internal emotion modeling using specialized sensing hardware, they address a fundamentally different problem setting. The present review focuses on emotion recognition in video, where affective understanding is inferred from observable behavioral cues in audio–visual streams under unconstrained, real-world conditions. Consequently, this work emphasizes distinct methodological challenges, including multimodal fusion, temporal modeling of dynamic expressions, dataset bias, and deployment-oriented trade-offs, thereby complementing existing EEG-focused surveys rather than overlapping with them.
Despite several existing surveys addressing different parts of this research landscape, such as facial expression recognition [1,2], multimodal fusion [3,4], and broader affective computing [5,6], the literature remains fragmented. What appears to be missing is an integrated account of how the field evolved from feature engineering to deep perceptual models and, more recently, to multimodal reasoning systems. This review seeks to chart that trajectory while acknowledging that the field is still evolving and several open questions remain.
This review aims to consolidate these developments with greater chronological and conceptual coherence, focusing specifically on emotion recognition in video as a temporally grounded, multimodal, and deployment-oriented problem. Rather than surveying individual model families in isolation, the paper emphasizes how successive methodological paradigms have reshaped system design, evaluation practice, and real-world applicability. More specifically, we engage in the following:
  • Trace the evolution of emotion recognition in video from handcrafted feature pipelines to deep perceptual models and, most recently, to multimodal large language models, explicitly identifying shifts in modeling assumptions, supervision strategies, and system capabilities;
  • Analyze how dataset design choices, including recording conditions, annotation paradigms, and modality coverage, have influenced reported performance trends and constrained cross-generation comparisons;
  • Synthesize the empirical strengths and limitations of pre-MLLM and MLLM-based systems in unconstrained, real-world, and culturally diverse scenarios, with attention paid to robustness, generalization, and failure modes;
  • Examine persistent deployment challenges in ERV, including dataset imbalance, annotation subjectivity, cultural and demographic bias, computational cost, and reproducibility;
  • Outline technically grounded research directions that follow from these analyses, such as lightweight multimodal architectures, hybrid task-specific and foundation model designs, personalized affect modeling, and evaluation practices aligned with real-time and interactive applications.
The remainder of this paper is organized as follows. Section 2 describes the review methodology, including the literature search strategy, selection criteria, and scope definition. Section 3 provides a structured methodological overview of emotion recognition in video, covering traditional machine learning approaches, unimodal deep learning methods, multimodal deep learning models, transformer-based architectures, and their respective strengths and limitations. Section 4 focuses on multimodal large language models (MLLMs) in affective computing, analyzing their architectural characteristics, capabilities, and limitations and comparing them with pre-MLLM approaches. Section 5 presents a synthesis of data requirements and evaluation considerations for advanced ERV systems. Section 6 discusses open challenges and future research directions in emotion recognition in video, with particular emphasis on robustness, scalability, explainability, and inclusivity. Finally, Section 7 concludes the paper by summarizing the main findings and reflecting on the evolution of ERV toward reasoning-driven and multimodal frameworks.

2. Review Methodology

To provide a comprehensive and up-to-date analysis of emotion recognition in video (ERV), with particular emphasis on multimodal learning, transformer-based architectures, and multimodal large language models (MLLMs), a structured narrative literature review was conducted. The objective of this review was to identify the most salient and influential works representative of the current state of the art, rather than perform a formal systematic review or meta-analysis. In this context, particular attention was given to studies with demonstrated scientific impact, as reflected by citation counts and by publication in high-impact journals and leading international conferences. This approach ensures that the selected publications are both technically relevant and influential within the research community and provides a solid foundation for the comparative analyses and synthesis presented in the subsequent sections.

2.1. Search Strategy and Data Sources

The primary literature search was conducted using the Web of Science Core Collection, which provides broad coverage of peer-reviewed journals and conference proceedings in engineering and computer science. To ensure adequate coverage of influential conference papers and recently published works, the search was complemented by targeted queries using Google Scholar, particularly for articles from venues such as IEEE, ACM, MDPI, Springer Nature, and Elsevier that are well represented in the reference list of this paper. Full-text versions of the selected publications were accessed whenever available to allow detailed methodological inspection and comparison.
Search terms were defined based on terminology consistently used in the cited literature and on established and emerging research themes in ERV. The keyword groups included combinations of the following terms: emotion recognition in video, affective computing in video, facial expression analysis, unimodal, bimodal, and multimodal emotion recognition, transformer-based models, vision–language models, and multimodal large language models in conjunction with emotion recognition. Boolean operators (AND, OR) and phrase-based searches were used to improve coverage while reducing irrelevant results.
The literature search primarily covered publications from 2016 to 2025, a span selected to capture the major methodological developments in emotion recognition in video, including the transition from deep learning approaches to transformer-based architectures and, more recently, multimodal large language models. Earlier seminal works were selectively included to provide methodological background and historical context. When multiple studies addressed similar methodological ideas, priority was given to works with higher citation impact or published in venues with recognized scientific standing, in order to emphasize representative and influential contributions. To ensure the relevance and maturity of the reviewed literature, the analysis focused on studies published in established peer-reviewed journals and leading international conferences rather than on preliminary or preprint-only work. This selection strategy emphasizes validated and influential contributions while still providing a timely overview of the evolving research landscape at the time of submission.
The review adopted a comparative and impact-driven synthesis strategy, similar in spirit to recent narrative surveys of multimodal and generative AI systems that emphasize conceptual evolution and representative methodological trends rather than exhaustive enumeration [12].

2.2. Study Selection and Scope Definition

Following the initial search, titles and abstracts were screened to exclude publications not directly related to video-based emotion recognition, affective computing, or multimodal perception. Studies focusing exclusively on static image analysis, non-affective tasks, or unrelated application domains were excluded unless they introduced methodological contributions directly relevant to ERV. The remaining publications were examined at the full-text level to ensure sufficient methodological detail and experimental relevance. Additional exclusion criteria included non-English publications, duplicate records, and works lacking empirical validation or technical description. The final set of retained studies was selected to ensure balanced coverage of traditional machine learning methods, deep learning approaches, multimodal fusion strategies, transformer-based architectures, and MLLM-based systems while emphasizing state-of-the-art methods and influential research outputs, as reflected in the comparative tables and analyses presented in Section 3, Section 4, Section 5 and Section 6.
To enhance methodological transparency and provide a compact overview of the review workflow, the study selection procedure is illustrated in Figure 1. Rather than introducing additional criteria, the flowchart visually synthesizes the sequential refinement process described above, explicitly reporting the main inclusion and exclusion criteria applied at each stage together with the corresponding number of studies retained and clarifying how the initial corpus of records was progressively narrowed to the final set of publications analyzed in this review.

3. Background and Theoretical Foundations

3.1. Emotion Theories and Representations

ERV systems typically adopt categorical or dimensional emotion modeling frameworks (or combinations of both), which directly shape the design of annotation protocols and the formulation of prediction tasks based on multimodal inputs such as facial, vocal, and physiological signals. Categorical theories conceptualize emotion as a finite set of discrete states, as exemplified by Ekman’s basic emotions [13] and Plutchik’s extensions incorporating intensity and compositional blends [14]. Subsequent work has further emphasized the role of contextual and cultural factors in shaping categorical emotion representations [15]. This categorical perspective underpins many early emotion recognition datasets and systems, as well as CNN-based classification models [1,2,16,17]. In contrast, dimensional theories model emotion as a continuous affective process, most commonly within the valence–arousal (VA) space (see Figure 2).
The empirical robustness of these dimensions [18] has motivated their adoption in continuous annotation protocols and benchmarking datasets such as AVEC and Aff-Wild2 [19,20]. Categorical and dimensional models address different properties of emotion and are commonly used in parallel. In unconstrained settings, emotions often overlap and vary over time, which limits the effectiveness of single-label descriptions. Studies on emotional complexity report frequent co-occurrence of emotions and identify memory, appraisal, and self-reflection as factors influencing their temporal development [21,22]. In ERV, emotional content may therefore be represented using multiple labels or as a trajectory in a valence–arousal space. Several recent methods combine categorical and dimensional outputs [3,4].
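As a minimal illustration of how the two representations can coexist in one annotation scheme, the sketch below stores a clip's label both as a set of possibly co-occurring categorical emotions and as a valence–arousal trajectory over time (field names and values are hypothetical, not taken from any specific dataset):

```python
from dataclasses import dataclass


@dataclass
class ClipAnnotation:
    """Hypothetical per-clip annotation combining both paradigms."""
    categorical: set      # possibly co-occurring discrete emotion labels
    va_trajectory: list   # (valence, arousal) pairs per frame or window


ann = ClipAnnotation(
    categorical={"surprise", "joy"},
    va_trajectory=[(0.1, 0.3), (0.4, 0.6), (0.6, 0.7)],
)

# Dimensional summaries can be derived from the trajectory.
mean_valence = sum(v for v, _ in ann.va_trajectory) / len(ann.va_trajectory)
mean_arousal = sum(a for _, a in ann.va_trajectory) / len(ann.va_trajectory)
```

A hybrid system can then be supervised on both outputs, e.g., a classification loss over `categorical` and a regression loss over `va_trajectory`.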
Emotion-related cues are expressed through facial activity, speech, body movement, gaze, and physiological responses, as documented in prior studies [23,24,25,26]. These signals are acquired through audio–visual recordings and, in some datasets, wearable sensors. Unimodal systems have shown reduced robustness outside controlled conditions: visual models are affected by partial visibility, pose variation, and illumination changes, while audio-based models are degraded by background noise and recording variability [1,2,27,28]. Multimodal approaches that integrate complementary cues are therefore increasingly adopted [4,5].
Temporal variation further impacts ERV performance, as emotional states evolve over the course of an interaction. Consequently, short-lived phenomena such as micro-expressions or brief prosodic changes, while informative, may be missed by frame-level analysis [29,30,31]. Dimensional models represent affect as continuous valence and arousal signals, whereas categorical models rely on segment or sequence-level annotations. In summary, ERV research is grounded in categorical theories [13,14,15] and dimensional frameworks [18], with additional input from studies on emotional complexity [21,22]. Figure 3 summarizes the effects emotions have across cognitive, physiological, behavioral, and physical domains, offering cues for recognition.

3.2. Datasets and Benchmarks

Progress in ERV has been closely tied to the availability of datasets collected under different levels of control and realism. These datasets vary along three principal dimensions: modality (visual, audio–visual, or multimodal), recording conditions (studio, laboratory, or in the wild), and annotation paradigm (categorical or dimensional). Beyond serving as benchmarks for empirical comparison, dataset characteristics directly shape model architectures, learning objectives, and evaluation protocols [23].
Table 1 summarizes representative datasets grouped according to their recording conditions and degree of realism. Studio datasets emphasize maximal experimental control and reproducibility. Resources such as CREMA-D, RAVDESS, SMIC, and VideoEmotion-8 [17,32,33,34] consist of scripted or posed performances recorded under standardized conditions. While these datasets facilitate controlled benchmarking and model diagnostics, they often fail to capture the variability and contextual complexity of real-world emotional behavior. Canonical posed-expression paradigms based on basic emotion theory, such as Ekman’s emotion categories [35], further exemplify this setting.
Laboratory datasets occupy an intermediate position between control and realism. Datasets such as CK+, CASME II, SMIC, IEMOCAP, EEV, and MELD [16,29,32,36,37,38] are typically recorded in controlled environments, often with acted or elicited emotional expressions. These datasets provide cleaner signals and more reliable annotations, making them suitable for detailed analysis of facial dynamics, speech–emotion alignment, and multimodal fusion, while still exhibiting limited spontaneity relative to in-the-wild data.
Recent studies further demonstrate that dataset composition and annotation paradigms play a decisive role in shaping reported performance trends in emotion recognition in video, often confounding comparisons across generations of methods. For example, datasets such as IEMOCAP and MELD, which are commonly used in multimodal emotion recognition studies, differ substantially in recording conditions, interaction structure, speaker diversity, and labeling schemes, despite both incorporating audio–visual information [39]. These differences introduce systematic biases that affect evaluation outcomes; acted versus spontaneous emotional expression, categorical versus dialogue-level annotation, and varying emotional granularity can all lead to divergent performance estimates even when similar modeling approaches are employed. In addition, modality gaps arising from inconsistent audio quality, partial facial visibility, or imperfect temporal alignment further impact model behavior and reduce the comparability of results across datasets. These factors indicate that apparent performance improvements reported in the literature must be interpreted in relation to dataset bias, annotation strategy, and modality coverage, rather than being attributed solely to methodological progress.
In-the-wild datasets capture affective behavior under unconstrained conditions and provide high ecological validity. Unimodal visual datasets such as AffectNet and FER2013 [20,40] support the learning of robust facial representations from large-scale, diverse imagery, albeit with substantial label noise and class imbalance. Video-based datasets with continuous annotations, including Aff-Wild2 [41], extend this setting by modeling affective dynamics over time.
Audio–visual and multimodal in-the-wild datasets such as AFEW, CMU-MOSEI, CMU-MOSI, SEWA, HEU Emotion, and VEATIC [26,42,43,44,45,46] further incorporate speech, text, or contextual cues, enabling richer affective modeling at the cost of increased annotation subjectivity and noise.
A growing class of multimodal datasets extends beyond audio–visual cues by incorporating physiological signals, wearable sensors, textual content, or contextual metadata. Datasets such as AMIGOS, ASCERTAIN, DEAP, and SEED [25,47,48,49] enable the study of affective processes across multiple modalities and timescales. While these resources support more comprehensive modeling of emotion and interaction, they introduce additional challenges, including higher data collection and annotation costs, limited participant diversity, and increased model complexity.
An important limitation of the current ERV landscape concerns the cultural coverage of commonly used datasets and benchmarks. The majority of large-scale facial and audio–visual emotion datasets referenced in the literature have been collected primarily in Western cultural contexts and reflect culturally specific norms of emotional expression, display rules, and annotation practices. As a result, model performance reported on these benchmarks may implicitly favor Western affective patterns and interaction styles, limiting generalization to non-Western populations. Although cross-dataset evaluation is sometimes employed to assess robustness, systematic validation across culturally diverse cohorts and explicit fairness audits with respect to cultural background remain relatively rare. This gap is largely attributable to the limited availability of standardized, culturally diverse ERV datasets with consistent annotation protocols, rather than a lack of methodological awareness.
Table 1. Representative emotion recognition datasets grouped by realism, data acquisition setting, and modality.
Dataset | Modality and Labels | Characteristics
Studio (Controlled Acting)
CREMA-D [33] | Audio–visual clips (7k), 6–8 emotions | Balanced acted speech and facial expressions
Ekman-6 [35] | Posed facial expressions, basic emotions | Canonical emotion categories; conceptual benchmark rather than a single dataset
RAVDESS [17] | Audio–visual (24 actors), 8 emotions | High-quality recordings; scripted acting
SMIC [32] | High-speed video, micro-expressions | Subtle spontaneous facial motions; limited dataset size
VideoEmotion-8 [34] | Audio–visual (1.1k), 8 coarse emotions | Weak supervision; coarse affect categories
Lab (Acted or Semi-Spontaneous)
CASME II [29] | High-speed video, micro-expressions | High temporal resolution; subtle facial motions; small dataset
CK+ [16] | Video sequences, acted emotions and AUs | Fully controlled expressions; long-standing benchmark
IEMOCAP [36] | AVT (12 h), categorical and VA | Dyadic dialogues; aligned audio, video, and text
EEV [37] | Video, evoked emotion labels | Wide stimulus variety; limited spontaneity
MELD [38] | AVT (13,000), dialogue-level emotions | Scripted TV dialogues; clean multi-speaker transcripts
In the Wild
FER2013 [40] | Images (35,000), 7 categorical emotions | Web-scraped data; noisy labels; widely used FER baseline
AffectNet [20] | Images (1,000,000+), categorical and VA | Large-scale FER; strong class imbalance; noisy web annotations
Aff-Wild2 [41] | Video (600+), continuous VA | High ecological validity; dense annotations; notable subjectivity
AFEW [44] | Audio–visual clips (1800), 7 emotions | Movie-based data; noisy labels; AVEC benchmark
CMU-MOSEI [42] | AVT (23,000 utterances), emotion and sentiment | Large multimodal corpus; label noise; standard fusion benchmark
CMU-MOSI [45] | AVT (2000 utterances), sentiment | Sentiment-focused dataset; single-speaker monologues
SEWA [43] | AVT with context, continuous VA | Rich social interaction context; costly annotation process
VEATIC [46] | Video with context, contextual emotion labels | Strong scene context; varied backgrounds; annotation variability
Multimodal (AV + Text + Physiology + Context)
HEU Emotion [9,26] | AVT (9000), categorical and VA | User-generated content; label imbalance and annotation noise
SEED [48] | EEG + video, VA | Stable EEG emotion patterns; limited participant diversity
DEAP [49] | Physiology + video, VA | Clean physiological signals; constrained experimental stimuli
ASCERTAIN [25] | Wearables + AV, affective states | Longitudinal recordings; high annotation and collection cost
AMIGOS [47] | AV + physiology, VA | Controlled protocol; small-scale dataset
Across all dataset categories, a fundamental trade-off exists between experimental control and ecological realism. Studio and laboratory datasets provide clean, well-aligned signals, whereas in-the-wild datasets better reflect natural affective behavior but suffer from noisier annotations, stronger class imbalance, and greater contextual variability [23,50,51]. These trade-offs have directly influenced recent research directions, motivating work on robustness, domain generalization, and flexible multimodal modeling frameworks.

3.3. Evaluation Metrics

The choice of evaluation metrics in ERV is closely linked to the underlying emotion representation and the properties of the datasets described above. In this subsection, we briefly review the most commonly used metrics for categorical and dimensional tasks and highlight issues that arise in imbalanced or noisy settings. For categorical emotion recognition, standard metrics include the overall accuracy, precision, recall, and F1 score computed at the class level and averaged across classes. Accuracy is intuitive but can be misleading when class distributions are highly skewed, as is common in in-the-wild datasets where neutral or low-arousal states dominate [23]. The macro-averaged F1 score assigns equal weight to each class and is therefore more informative in the presence of imbalances, whereas a weighted F1 score emphasizes performance in frequent categories. In multi-label settings, where samples may contain multiple emotions or attributes, micro-averaged precision–recall and Jaccard similarity scores are also used. For dimensional affect (e.g., valence and arousal trajectories), evaluation is typically formulated as a regression problem.
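To make the categorical metrics above concrete, the following self-contained sketch (toy data, not from any benchmark) shows how accuracy can mask a failure on a minority class that the macro-averaged F1 score exposes:

```python
from collections import Counter


def f1_per_class(y_true, y_pred, labels):
    """Per-class F1 from true/predicted label lists."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Equal weight per class, regardless of frequency."""
    s = f1_per_class(y_true, y_pred, labels)
    return sum(s.values()) / len(labels)


def weighted_f1(y_true, y_pred, labels):
    """Classes weighted by their support in y_true."""
    s = f1_per_class(y_true, y_pred, labels)
    counts = Counter(y_true)
    return sum(s[c] * counts[c] / len(y_true) for c in labels)


# Skewed toy split: 8 "neutral" clips, 2 "fear" clips; the model
# always predicts the majority class.
y_true = ["neutral"] * 8 + ["fear"] * 2
y_pred = ["neutral"] * 10
labels = ["neutral", "fear"]

acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc)                                            # 0.8: looks acceptable
print(round(macro_f1(y_true, y_pred, labels), 3))     # 0.444: exposes the failure
print(round(weighted_f1(y_true, y_pred, labels), 3))  # 0.711: dominated by the frequent class
```

The gap between the three numbers on the same predictions is exactly why imbalanced in-the-wild corpora call for macro-averaged reporting alongside accuracy.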
The concordance correlation coefficient (CCC) jointly reflects the correlation and mean squared deviation between predictions and the ground truth, penalizing both bias and a lack of agreement. The mean squared error (MSE) and Pearson’s correlation coefficient are often reported as complementary metrics, but the CCC is generally preferred when comparing systems across corpora and partitions. When models are evaluated on time-continuous labels, additional design choices arise regarding temporal smoothing, lag compensation, and segment-level aggregation. Challenge protocols such as AVEC specify these details to ensure comparability across systems [19]. For multimodal and cross-corpus studies, it is good practice to report multiple metrics and adopt transparent and consistent reporting guidelines for systematic evaluation [50].
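For reference, the CCC is short to implement; this sketch uses population (divide-by-n) moments, a common convention in challenge evaluations, though some implementations use sample moments instead:

```python
def ccc(x, y):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    computed with population moments.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)


valence_true = [0.1, 0.3, 0.5, 0.4, 0.2]
# Perfectly correlated predictions with a constant bias: Pearson's r is
# still 1, but the CCC penalizes the mean shift.
valence_pred_biased = [v + 0.3 for v in valence_true]

print(ccc(valence_true, valence_true))         # 1.0
print(ccc(valence_true, valence_pred_biased))  # about 0.308
```

This behavior, agreement rather than mere correlation, is the reason the CCC is preferred for comparing valence and arousal predictions across systems.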
It is important to recognize that the metric choice in ERV can systematically bias the conclusions about model performance and progress. Frame-level evaluation, while convenient, often overestimates performance by ignoring temporal coherence and affective continuity, particularly in video-based settings, where emotional states evolve gradually. Sequence-level or segment-level metrics provide a more realistic assessment of affective understanding but typically yield lower scores, complicating direct comparison with frame-based results reported in earlier works.
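One simple way to move from frame-level to segment-level scoring is to aggregate per-frame predictions before computing any metric; a majority vote over fixed-length windows is a minimal illustration (windowed averaging of valence–arousal values would be the dimensional analogue):

```python
from collections import Counter


def majority_vote(frame_preds):
    """Collapse per-frame categorical predictions into one segment label."""
    return Counter(frame_preds).most_common(1)[0][0]


def segment_predictions(frame_preds, segment_len):
    """Split a prediction stream into fixed-length segments and vote per segment."""
    return [
        majority_vote(frame_preds[i:i + segment_len])
        for i in range(0, len(frame_preds), segment_len)
    ]


frames = ["neutral", "neutral", "neutral", "joy", "joy", "joy", "neutral", "joy"]
print(segment_predictions(frames, 4))  # ['neutral', 'joy']
```

Evaluating the voted labels rather than the raw frames rewards temporal coherence and typically yields the lower, more realistic scores discussed above.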
A similar inconsistency arises between categorical and dimensional evaluation paradigms. The accuracy and F1 score implicitly assume discrete and mutually exclusive emotional states, whereas dimensional metrics such as the CCC capture continuous affective variation but are sensitive to temporal alignment, annotation delay, and smoothing strategies. As a result, models optimized for one paradigm may appear to outperform others under specific metrics without offering broader affective robustness. This mismatch complicates cross-study comparison and can obscure genuine methodological advances.
Metric selection therefore has direct implications for applied ERV systems. In real-time or interactive applications, temporal stability and consistency across sequences may be more critical than peak frame-level accuracy, favoring sequence-aware or temporally aggregated metrics. In contrast, monitoring or diagnostic applications may prioritize correlation-based dimensional measures that capture affective trends over time. These considerations suggest that no single metric is universally appropriate; instead, the metric choice should be guided by the task requirements, annotation structure, and deployment constraints. Transparent reporting of evaluation granularity and complementary metrics is essential for fair comparison and for translating benchmark performance into reliable real-world behavior.
Beyond the choice of evaluation metrics, emotion recognition performance is strongly influenced by the evaluation protocol adopted during experimentation. A commonly used setting is within-subject evaluation, where training and testing data are drawn from the same subjects. While this protocol often yields higher performance estimates, it primarily reflects a model’s ability to capture subject-specific patterns and may overestimate the generalization capability. In contrast, cross-subject evaluation enforces subject-independent testing by separating individuals across training and test sets, providing a more realistic assessment of robustness to inter-subject variability, which is particularly relevant for real-world deployment. Another important evaluation paradigm is cross-dataset validation, in which models trained on one dataset are evaluated on a different corpus. This setting exposes sensitivity to dataset bias, annotation differences, and recording conditions, and it is widely regarded as a stringent test of generalization.
Although performance typically degrades under cross-dataset evaluation, such results offer valuable insight into a model’s practical transferability beyond controlled benchmarks. In addition, real-world and deployment-oriented evaluations emphasize performance under unconstrained conditions, including partial occlusion, noise, modality degradation, and streaming or real-time constraints. These scenarios are rarely captured by standard benchmark protocols but are critical for applied ERV systems. As a result, the reported performance across studies should be interpreted in light of the underlying evaluation protocol, as different methodological paradigms, ranging from traditional models to deep learning and MLLM-based approaches, exhibit distinct strengths and limitations depending on the experimental setting.
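The difference between within-subject and cross-subject protocols reduces to how the train/test split is constructed; a minimal pure-Python sketch of a subject-independent split (subject IDs and the helper name are illustrative):

```python
import random

def cross_subject_split(subject_ids, test_fraction=0.25, seed=0):
    """Split sample indices so no subject appears in both partitions.

    subject_ids: per-sample subject labels, e.g. ["s1", "s1", "s2", ...]
    Returns (train_idx, test_idx) with disjoint subject sets.
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx

# Six samples from three subjects: held-out subjects never leak into training.
ids = ["s1", "s1", "s2", "s2", "s3", "s3"]
tr, te = cross_subject_split(ids)
assert not {ids[i] for i in tr} & {ids[i] for i in te}
```

A within-subject protocol, by contrast, would shuffle sample indices directly, so the same person's expressions appear on both sides of the split, which is the source of the optimistic bias noted above.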
Finally, as ERV systems are increasingly considered for deployment in sensitive domains (e.g., mental health monitoring or assistive technologies), evaluation should extend beyond the average performance. Fairness across demographic groups, robustness to distribution shifts, and reliability under sensor degradation are emerging concerns [23,52]. These aspects are often assessed through stratified analyses or tailored robustness tests rather than a single scalar metric.

4. Pre-MLLM Era of Emotion Recognition in Video

Prior to the emergence of multimodal large language models, emotion recognition in video evolved through a sequence of increasingly expressive modeling paradigms. Early approaches were dominated by traditional machine learning methods built on handcrafted features, followed by unimodal deep learning models that enabled data-driven visual representation learning. Subsequent work introduced multimodal deep learning architectures to integrate facial, vocal, and textual cues and later adopted transformer-based models to capture long-range dependencies and cross-modal interactions. This section reviews these pre-MLLM approaches, highlighting their methodological contributions and inherent limitations that ultimately motivated the transition toward MLLM-based frameworks.
To facilitate a high-level understanding of the methodological evolution in emotion recognition in video, Figure 4 provides a visual timeline summarizing the major paradigm shifts in the field. The figure illustrates the progression from early handcrafted feature-based pipelines to deep learning approaches, transformer-based architectures, and, more recently, multimodal large language model (MLLM)-based frameworks. Representative studies are positioned chronologically to highlight changes in feature representation, modality integration, and reasoning capability across methodological generations.

4.1. Traditional Machine Learning Approaches

Before the adoption of deep learning, emotion recognition in video was predominantly formulated as a feature engineering and pattern classification problem, relying on manually designed representations coupled with classical statistical learning models. Typical pipelines extracted low- and mid-level appearance and motion cues using handcrafted descriptors such as local binary patterns (LBPs), the histogram of oriented gradients (HOG), dense or sparse optical flow, and, in audio–visual settings, mel-frequency cepstral coefficients (MFCCs).
These representations were subsequently mapped to discrete emotion categories or continuous affective dimensions using discriminative or generative classifiers, most commonly support vector machines (SVMs), hidden Markov models (HMMs), or related probabilistic frameworks [53].
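As an illustration of how such handcrafted descriptors are built, a minimal, unoptimized local binary pattern sketch over a grayscale patch (the array shapes are illustrative; production LBP implementations use rotation-invariant or uniform-pattern variants):

```python
import numpy as np

def lbp_histogram(img):
    """Basic 8-neighbor LBP: each interior pixel is encoded as an 8-bit
    pattern of neighbor-vs-center comparisons; the histogram of these
    codes serves as a texture descriptor for a face region."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise neighbors
    hist = np.zeros(256, dtype=int)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if img[y + dy, x + dx] >= img[y, x]:
                    code |= 1 << bit
            hist[code] += 1
    return hist

patch = np.random.default_rng(0).integers(0, 256, size=(8, 8))
hist = lbp_histogram(patch)
assert hist.sum() == 6 * 6  # one code per interior pixel
```

In a typical pre-deep-learning pipeline, histograms like this were computed over a grid of face regions, concatenated, and passed to an SVM — exactly the feature-engineering-plus-classifier decomposition described above.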
As summarized in Table 2, these traditional approaches exhibited several practical advantages, including interpretable feature representations, modular pipeline design, and relatively low computational complexity, which made them suitable for early laboratory studies and resource-constrained environments.
Explicit preprocessing stages and structured temporal modeling enabled controlled analysis of affective dynamics, while rule-based or feature-level fusion strategies facilitated the integration of multimodal cues in audio–visual settings. However, the same design choices also imposed fundamental limitations. Empirical evidence shows that handcrafted representations are highly sensitive to domain shift, resulting in pronounced degradation of cross-dataset generalization performance when capture conditions, subjects, or recording environments differ [54]. Furthermore, annotation noise and label uncertainty, which are intrinsic to affective datasets, introduced additional instability during model training and evaluation, limiting robustness even within nominally similar domains [55]. Visual degradations such as pose variation, partial occlusions, and face coverings further compromised the reliability of feature extraction and downstream classification, leading to significant performance drops in unconstrained scenarios [27].
Despite these shortcomings, handcrafted ERV pipelines established several foundational architectural principles that continue to inform modern systems, including explicit preprocessing stages, structured temporal modeling of affective dynamics, and manually designed fusion strategies for integrating multimodal cues. Contemporary deep learning approaches can be interpreted as a generalization of these paradigms, replacing fixed, task-specific feature representations with hierarchical features learned directly from data through end-to-end optimization. This shift has enabled improved robustness to appearance variability, contextual ambiguity, and capture conditions, thereby enhancing generalization in complex real-world emotion recognition scenarios.
Table 2. Traditional machine learning approaches for emotion recognition in video.
| Reference | Feature Representation and Model | Strengths | Limitations |
|---|---|---|---|
| Jiang et al. (2014) [34] | LBP or HOG for appearance, optical flow for motion, MFCCs for audio; SVMs and HMMs | Explicit separation of feature extraction and classification; structured preprocessing and temporal modeling; modular audiovisual fusion pipeline | Strong reliance on handcrafted descriptors; sensitivity to recording conditions and subject variability; limited robustness in unconstrained video settings |
| Long et al. (2015) [54] | Task-specific handcrafted features; classical ML with adaptation objectives | Formal analysis of cross-domain generalization behavior; early identification of dataset bias and distribution mismatch | Significant performance degradation under domain shift; limited transferability without explicit domain adaptation or retraining |
| Wang et al. (2018) [55] | Handcrafted visual features; statistical learning with uncertainty modeling | Explicit modeling of annotation noise; improved training stability under noisy labels | Requires curated datasets or noise estimates; does not fully address intrinsic subjectivity and ambiguity of affective annotations |
| Dagher et al. (2019) [53] | Handcrafted facial appearance descriptors; SVMs | Interpretable feature representations; low computational and memory complexity; stable performance under controlled acquisition protocols | High sensitivity to poses, illumination changes, and partial occlusions; limited robustness to inter-subject and cross-dataset variability |
| Yang et al. (2020) [27] | Handcrafted facial features; classical ML classifiers | Systematic evaluation of occlusion effects; controlled analysis of partial facial visibility scenarios | Severe performance degradation under realistic occlusions (e.g., face coverings); limited applicability to in-the-wild deployment scenarios |

4.2. Unimodal Deep Learning Methods

The emergence of deep learning substantially advanced emotion recognition in video by enabling end-to-end learning of hierarchical spatial and temporal representations directly from raw or minimally processed inputs. Unlike traditional pipelines based on handcrafted descriptors, deep learning architectures jointly optimize feature extraction and classification objectives, thereby reducing reliance on manual feature engineering.
Recent EEG-based deep learning studies provide representative examples of advanced modeling strategies in affective and cognitive state recognition. For instance, the authors of [56] combined normalized mutual information features with a self-optimized Gaussian kernel-based extreme learning machine, while the authors of [57] introduced a graph attention convolutional neural network with mutual information-driven connectivity to model inter-channel EEG dependencies. Although these EEG-based approaches operate under different sensing assumptions than video-based emotion recognition, they exemplify broader methodological trends in affective computing, including the use of information-theoretic measures, structured representations, and end-to-end deep learning. In video-based emotion recognition, these trends are realized through architectures that operate on observable behavioral signals, where affective understanding relies on jointly modeling spatial appearance cues and temporal dynamics. Consequently, deep learning approaches for ERV have primarily centered on convolutional and recurrent neural architectures that enable hierarchical spatial encoding and temporal aggregation of facial and audio–visual data, forming the foundation of the unimodal deep learning methods discussed next.
In this context, convolutional neural networks (CNNs) have been widely adopted to encode spatial structure and local appearance patterns from facial imagery, while recurrent architectures, such as long short-term memory (LSTM) networks and convolutional LSTMs (ConvLSTMs), have been employed to model temporal dependencies across video frames and audio streams [1,2].
In the pre-transformer era, unimodal deep learning approaches for ERV largely converged toward three dominant architectural paradigms. The first and widely explored family consists of CNN-based models coupled with recurrent temporal aggregation mechanisms. In these architectures, CNNs are used to extract frame-level spatial representations, which are subsequently aggregated using LSTM or ConvLSTM units to capture the temporal evolution of affective expressions [58,59]. The primary strength of this paradigm lies in its explicit separation between spatial encoding and temporal modeling, which facilitates architectural modularity and interpretability. However, this decoupled design also introduces limitations, including increased computational complexity, sensitivity to frame sampling strategies, and reduced robustness under long or highly variable temporal sequences.
The second prominent family comprises three-dimensional convolutional neural networks (3D-CNNs), which perform joint spatiotemporal feature learning by applying three-dimensional convolutional kernels over short video clips. By directly encoding both motion and appearance information, 3D-CNNs eliminate the need for explicit recurrent modeling and have demonstrated effectiveness in capturing short-term facial dynamics, including micro-expressions and brief affective events [30,31]. The main advantage of this approach is its ability to learn compact spatiotemporal representations in an end-to-end manner. Nevertheless, 3D-CNN-based models typically exhibit high computational and memory requirements and are constrained by a limited temporal receptive field, which can hinder their ability to model long-range affective dependencies and reduce scalability to longer video sequences.
The third category encompasses attention-augmented CNN architectures and bilinear pooling models, which aim to enhance representational capacity by selectively emphasizing salient spatial regions, channels, or feature interactions. Attention mechanisms enable adaptive weighting of informative facial regions or feature maps, while bilinear pooling captures higher-order correlations between feature representations, leading to more discriminative affective embeddings [59,60,61]. These approaches improve sensitivity to subtle expression cues and reduce the impact of irrelevant background information. However, their increased architectural complexity often leads to higher training instability and greater data dependency, and their performance gains may diminish when applied to limited or noisy datasets.
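Bilinear pooling as used in these models can be illustrated compactly: the outer product of two feature vectors enumerates all pairwise feature interactions, typically followed by signed-square-root and L2 normalization as in bilinear CNN models (dimensions here are illustrative):

```python
import numpy as np

def bilinear_pool(f1, f2):
    """Outer-product pooling: returns a (d1*d2,) vector of pairwise
    feature interactions, with the signed-sqrt and L2 normalization
    commonly applied in bilinear CNN models."""
    z = np.outer(f1, f2).ravel()
    z = np.sign(z) * np.sqrt(np.abs(z))   # signed square-root scaling
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z

rng = np.random.default_rng(0)
f_region, f_channel = rng.standard_normal(16), rng.standard_normal(8)
z = bilinear_pool(f_region, f_channel)
assert z.shape == (128,)  # 16 x 8 pairwise interactions
```

The quadratic growth of the pooled dimension (d1·d2) is the concrete source of the "increased architectural complexity" and data hunger noted above.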
Table 3 provides a comparative synthesis of representative unimodal deep learning architectures, explicitly summarizing their principal strengths and limitations. The analysis highlights that although unimodal deep learning methods substantially improve representation learning over handcrafted pipelines, they remain constrained by modality-specific degradations, data availability requirements, and limited generalization across subjects and recording conditions. These limitations motivate the transition toward bimodal and multimodal architectures, which seek to exploit complementary information sources to improve robustness and generalization.
Table 3. Unimodal deep learning approaches for emotion recognition in video.
| Reference | Architecture and Modality | Strengths | Limitations |
|---|---|---|---|
| Zhang et al. (2019) [61] | CNN with bilinear pooling; visual and facial images | Modeling of higher-order feature interactions; enhanced discriminative capacity for subtle affective cues | Increased architectural and computational complexity; sensitivity to feature noise and facial misalignment |
| Li et al. (2020) [1] | CNN with temporal aggregation; visual and facial video | End-to-end learning of spatial representations; reduced reliance on handcrafted descriptors | Limited temporal modeling capability; sensitivity to frame sampling and facial alignment |
| Zhao et al. (2020) [59] | CNN–LSTM with attention; visual and video sequences | Explicit modeling of temporal affective dynamics with adaptive feature weighting | Higher data dependency; training instability; limited cross-dataset robustness |
| Haddad et al. (2020) [30] | 3D convolutional neural network; visual and spatiotemporal clips | Joint learning of spatial and short-term temporal representations | High computational and memory requirements; restricted temporal receptive field |
| Zhou et al. (2021) [60] | Attention-augmented CNN; visual and facial images | Selective emphasis on informative spatial regions | Limited temporal modeling; performance degradation under severe occlusion |
| Hans et al. (2021) [58] | CNN–LSTM; visual and video sequences | Explicit temporal aggregation of frame-level features | Increased computational complexity; limited cross-subject generalization |
| Talluri et al. (2022) [31] | 3D-CNN with spatiotemporal encoding; visual and video clips | End-to-end spatiotemporal feature learning without explicit recurrence | Strong dependence on large-scale annotated datasets; limited long-range temporal modeling |

4.3. Multimodal Deep Learning Models

Deep multimodal architectures extended unimodal emotion recognition frameworks by jointly modeling heterogeneous information sources, most commonly visual, auditory, and, in some cases, textual or physiological signals. By integrating complementary modalities, these approaches aim to overcome the intrinsic limitations of single-modality systems, particularly their sensitivity to modality-specific degradations. Fusion strategies explored in the literature can be broadly categorized into early fusion schemes, which concatenate modality-specific features at the representation level; late fusion approaches, which combine modality-wise predictions; and intermediate or hybrid fusion mechanisms, which perform integration within the network using attention modules, graph-based representations, or bilinear pooling operations. By exploiting complementary affective cues from facial expressions, speech prosody, and linguistic content, multimodal deep learning systems have consistently demonstrated improved recognition performance compared with unimodal baselines [3,4,5,6].
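The early and late fusion families can be contrasted in a few lines; a schematic sketch in which random vectors stand in for real encoder outputs (all dimensions, weights, and the seven-class setup are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
visual, audio = rng.standard_normal(64), rng.standard_normal(32)
n_classes = 7  # e.g., basic emotion categories

# Early fusion: concatenate modality features, then one shared classifier.
W_early = rng.standard_normal((n_classes, 64 + 32))
p_early = softmax(W_early @ np.concatenate([visual, audio]))

# Late fusion: classify each modality separately, then combine predictions.
W_v = rng.standard_normal((n_classes, 64))
W_a = rng.standard_normal((n_classes, 32))
p_late = 0.5 * softmax(W_v @ visual) + 0.5 * softmax(W_a @ audio)

assert np.isclose(p_early.sum(), 1.0) and np.isclose(p_late.sum(), 1.0)
```

Intermediate (hybrid) fusion replaces the fixed concatenation or averaging above with learned interaction modules — attention blocks, graph layers, or bilinear pooling — inserted between the modality encoders and the classifier.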
Early multimodal ERV approaches relied primarily on statistical correlation analysis and generalized canonical correlation analysis (CCA) to align heterogeneous feature spaces [62]. While these methods provided a principled framework for cross-modal representation alignment, their expressiveness was limited, and their performance depended strongly on the quality of the handcrafted features. Subsequent research introduced more expressive fusion paradigms, including cross-modal graph attention networks [63], double-attention fusion mechanisms [64], and hierarchical or dynamically adaptive fusion strategies that explicitly model inter-modal dependencies and temporal interactions [65,66,67,68,69,70,71,72].
As ERV research progressed beyond unimodal visual analysis, bimodal audio–visual architectures emerged as a particularly prominent instantiation of multimodal learning, explicitly modeling interactions between facial dynamics and speech-related affective cues. These approaches generally improve affect recognition by leveraging complementary information from facial expressions and vocal prosody, thereby increasing robustness to partial modality degradation. Table 4 provides a comparative synthesis of representative pre-MLLM bimodal architectures, summarizing their principal strengths and limitations and highlighting common design trade-offs.
Table 4. Representative bimodal (audio–visual) deep learning approaches for ERV.
| Reference | Fusion Architecture and Modality | Strengths | Limitations |
|---|---|---|---|
| Lan et al. (2020) [62] | Generalized CCA; visual–audio | Principled cross-modal alignment; low model complexity; interpretable fusion | Limited representational capacity; reliance on handcrafted features |
| Peng et al. (2022) [63] | Cross-modal graph attention; visual–audio | Explicit modeling of inter-modal dependencies; improved robustness to partial modality degradation | High computational cost; sensitivity to temporal synchronization |
| Mocanu et al. (2022) [64] | Double-attention fusion; visual–audio | Adaptive emphasis on salient temporal segments and modalities | Performance degrades under noisy or misaligned inputs |
| Chen et al. (2023) [65] | Hierarchical temporal fusion; visual–audio | Multi-level integration of spatial, temporal, and cross-modal cues | Increased architectural complexity; higher data requirements |
| Dutta et al. (2023) [66] | Dynamic self-attention fusion; visual–audio | Flexible temporal weighting of modalities; improved handling of modality dominance | Computationally expensive training; limited scalability |
| Ghaleb et al. (2023) [68] | Bimodal audio–visual fusion | Long-range temporal and cross-modal dependency modeling | High memory and compute cost; limited real-time feasibility |
| Dixit et al. (2024) [69] | Adaptive multimodal attention; visual–audio | Improved robustness under partial modality corruption | Sensitivity to synchronization errors; training instability |
Early fusion-based bimodal models emphasize rich cross-modal feature interactions and can capture fine-grained correlations between visual and acoustic signals; however, they often incur substantial computational overhead and exhibit sensitivity to temporal misalignment between modalities. Attention-based architectures, such as VAANet and double-attention fusion models, introduce adaptive mechanisms to selectively emphasize salient regions, frames, or modalities, thereby improving discriminative capacity under favorable conditions. Nonetheless, their effectiveness is closely tied to the input quality and reliable cross-modal synchronization, and performance can degrade under noisy or asynchronous recordings.
Beyond early audio–visual fusion strategies, subsequent bimodal and multimodal architectures increasingly incorporated attention-based mechanisms and structured representations to address the limitations of uniform feature aggregation. Self-attention and graph-inspired formulations enabled models to adaptively weigh spatial, temporal, and cross-modal cues, allowing emotionally salient regions, time segments, or modality interactions to be emphasized during inference. By explicitly modeling the relationships among facial dynamics, speech prosody, and contextual signals, these approaches moved ERV systems beyond fixed fusion pipelines and toward more flexible and context-sensitive representation learning.
More recent bimodal architectures incorporating self-attention mechanisms enable more expressive temporal and cross-modal representations but typically require large-scale, well-annotated datasets and significant computational resources to train effectively. Collectively, these approaches illustrate a recurring trade-off between representational power, computational complexity, and practical robustness, motivating the exploration of more flexible and scalable multimodal modeling paradigms.

4.4. Transformer-Based Architectures

The introduction of transformer architectures marked a significant turning point in emotion recognition in video. Table 5 provides a structured overview of representative transformer-based architectures for emotion recognition in video and multimodal affective analysis, summarizing their core design choices, input modalities, and principal strengths and limitations.
Earlier deep learning approaches based on CNNs, LSTMs, and 3D-CNNs focused on learning spatiotemporal patterns from visual and auditory streams but offered limited support for global context modeling or semantic alignment across modalities [1,2].
Transformers addressed these limitations by using self-attention to capture long-range dependencies and integrate heterogeneous inputs within a single framework. Vision transformers (ViTs) [7] introduced patch-based self-attention, allowing models to capture fine-grained spatial relations and adapt more effectively to in-the-wild variation in pose, illumination, and occlusion. Adaptations for facial expression recognition, including ViT-FER, TransFER, and multi-scale ViT variants, further demonstrated improved robustness and expressiveness in visual emotion modeling [73,74,75]. Parallel work on video transformers strengthened temporal modeling by leveraging self-attention over frame sequences [76,77].
Recent work on transformer-based temporal modeling in emotion recognition has explored both efficiency and representational richness. The STT-Net architecture simplifies temporal transformers for emotion recognition by introducing a multi-head self- and cross-attention mechanism designed to model temporal interactions with reduced computational overhead, demonstrating competitive performance on benchmark facial expression datasets [78]. In parallel, hybrid transformer approaches have been proposed that emphasize multilevel feature representation of sequential signals, such as combining parallel convolutional and sequential encoders to capture both the global and temporal characteristics of speech for improved recognition accuracy [79].
These developments underscore the ongoing diversification of transformer architectures in ERV research, where design choices balance computational efficiency, depth of representation, and the capacity to model temporal evolution in affective signals. Transformers also played a central role in the development of multimodal ERV. Cross-attention mechanisms and audio–visual transformer blocks made it possible to align speech prosody, facial expressions, and contextual visual cues more effectively [63,67,80,81]. These models began to bridge low-level perception with higher-level semantic interpretation by embedding multiple modalities within a shared representational space.
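The cross-attention operation underlying such audio–visual transformer blocks reduces to scaled dot-product attention in which one modality supplies the queries and the other the keys and values; a minimal single-head sketch with random features standing in for encoder outputs (shapes are illustrative):

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: here, audio time steps (queries)
    attend over visual frames (keys/values)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (Tq, Tk) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ values, weights

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((5, 16))    # 5 audio steps, dim 16
visual_feats = rng.standard_normal((8, 16))   # 8 video frames, dim 16
fused, attn = cross_attention(audio_feats, visual_feats, visual_feats)
assert fused.shape == (5, 16)                 # audio steps enriched by vision
assert np.allclose(attn.sum(axis=-1), 1.0)    # each query's weights sum to 1
```

Because every query attends over every key, the attention map grows with the product of the two sequence lengths — the source of the quadratic-complexity and synchronization-sensitivity limitations noted for these architectures.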
While temporal dynamics are widely acknowledged as central to emotion recognition in video, different architectural paradigms exhibit distinct strengths and limitations in modeling long-term affective evolution. Early CNN–LSTM pipelines capture temporal information through recurrent state propagation but often struggle with long-range dependencies, accumulated error, and sensitivity to noise in unconstrained video streams. These limitations become pronounced in scenarios involving gradual affective transitions, intermittent emotional cues, or long interaction sequences, where frame-level features provide only weak supervisory signals.
Subsequent approaches have emphasized sequence-aware modeling and temporal consistency through explicit tracking, bidirectional temporal aggregation, or optimization over extended video segments. Such strategies have been shown to improve robustness to temporal fragmentation and to better preserve affective continuity over time, particularly in in-the-wild settings, where the facial visibility, poses, and expression intensity vary substantially [82]. However, these methods typically remain dependent on carefully designed temporal objectives and well-aligned annotations, limiting their generalization across datasets and recording conditions.
Table 5. Transformer-based architectures for ERV and multimodal affective analysis.
| Reference | Architecture and Modality | Strengths | Limitations |
|---|---|---|---|
| Dosovitskiy et al. (2021) [7] | Vision transformer; visual | Global spatial context modeling via patch-based self-attention; improved robustness to pose, illumination, and occlusion | High data and computational requirements; limited inductive bias for fine-grained facial dynamics |
| Tian et al. (2024); Chaudhari et al. (2022); Liu et al. (2022) [73,74,75] | ViT-based FER variants; visual | Improved expressiveness and generalization for facial emotion modeling in unconstrained environments | Computationally expensive; sensitive to dataset size and facial alignment quality |
| Neshov et al. (2024); Bertasius et al. (2021) [76,77] | Video transformers; visual | Long-range temporal dependency modeling through self-attention; enhanced sequence-level affect representation | Quadratic complexity with sequence length; limited scalability to long video streams |
| Peng et al. (2022); Dutta et al. (2023); Fu et al. (2021); Venkatraman et al. (2024); Guo et al. (2022) [63,66,67,80,81] | Audio–visual transformers; bimodal | Effective cross-modal alignment via attention; improved robustness through complementary cue integration | Sensitivity to temporal synchronization errors; high computational and memory overhead |
| Lu et al. (2019); Li et al. (2019) [83,84] | VisualBERT and ViLBERT; vision–language | Early joint visual–textual representations via cross-attention; foundational multimodal grounding | Limited scalability and task coverage; indirect affective supervision |
| Radford et al. (2021); Jia et al. (2021) [8,85] | CLIP and ALIGN; vision–language | Large-scale contrastive pretraining; strong transferability to downstream affective tasks | Limited temporal modeling; implicit rather than explicit emotion representation |
| Li et al. (2022); Alayrac et al. (2022) [9,10] | BLIP and Flamingo; vision–language | Improved cross-modal alignment with generative objectives; flexible multimodal reasoning | High training cost; emotion understanding remains indirect |
| Lian et al. (2024); Vaiani et al. (2024); Shou et al. (2025); Huang et al. (2025) [86,87,88,89] | Emotion-adapted VLMs; multimodal | Task-specific supervision improves affect recognition and semantic grounding | Domain adaptation required; limited robustness across datasets |
| Achiam et al. (2023); Qi et al. (2023); Li et al. (2025) [90,91,92] | MLLMs (GPT-4V, Gemini, Claude 3); multimodal | Unified perception, reasoning, and instruction following; contextual and semantic emotion understanding | Very high computational cost; limited transparency and controllability |
Transformer-based architectures address some of these challenges by leveraging self-attention mechanisms to integrate information across long temporal spans without explicit recurrence. This enables more flexible modeling of non-local temporal dependencies but introduces sensitivity to the sequence length, annotation sparsity, and computational constraints. More recently, multimodal large language models extend temporal reasoning by integrating visual dynamics with linguistic and contextual information, allowing affective interpretation to be grounded in higher-level narrative and situational context. Nevertheless, their ability to faithfully model fine-grained affective evolution over long, unconstrained videos remains uneven and strongly dependent on prompt formulation, data coverage, and evaluation protocol design.
One closely related development within transformer-based architectures was the rise of vision–language pretraining (VLP). Early transformer models such as VisualBERT and ViLBERT [83,84] introduced joint representations for images and text through cross-modal attention mechanisms. Large-scale contrastive models, including CLIP and ALIGN [8,85], further demonstrated that pretrained vision–language embeddings exhibit strong transferability across downstream tasks, including affective analysis [4]. Subsequent architectures such as BLIP and Flamingo [9,10] extended these ideas by incorporating generative objectives and improved cross-modal alignment, broadening the applicability of transformer-based vision–language models to emotion-related tasks.
In modern ERV pipelines, modality-specific encoders for text, audio, images, and video are typically coupled with multimodal attention mechanisms that integrate heterogeneous representations into a shared latent space for emotion classification or regression. This architectural paradigm provides both the conceptual and technical foundation for contemporary multimodal large language models (MLLMs).
Early general-purpose vision–language models (VLMs) established robust multimodal grounding through joint visual–textual representation learning [8,9,10,83,84,85]. Subsequent work has adapted these transformer-based architectures to emotion recognition and affective analysis, incorporating task-specific supervision and multimodal attention mechanisms [86,87,88,89]. These developments have directly informed the design of contemporary multimodal large language models (MLLMs) such as GPT-4V, Gemini, and Claude 3 [90,91,93,94], which build upon transformer foundations by integrating large-scale pretraining with instruction following, reasoning capabilities, and multi-turn multimodal interaction.
Beyond high-capacity transformer architectures, recent research has also explored lightweight and resource-efficient transformer variants aimed at deployment under constrained computational budgets. Models such as MobileViT [95] and EfficientFormer [92] integrate convolutional inductive biases with transformer-style attention to reduce the parameter count, memory footprint, and inference latency while preserving competitive representational capacity. These architectures were originally developed for mobile vision tasks but are increasingly relevant for emotion recognition in video, where real-time processing, on-device inference, and energy efficiency are often critical requirements.
In the context of ERV, lightweight transformers have primarily been evaluated as backbone encoders for facial or audio–visual feature extraction rather than as end-to-end affect reasoning systems. Existing studies indicate that compact transformer variants can achieve performance comparable to heavier models on constrained benchmarks when trained on large-scale datasets, although their ability to capture long-range temporal dependencies and subtle affective cues remains more limited. As a result, they are particularly well suited for scenarios such as mobile affect sensing, embedded human–AI interaction, and privacy-sensitive applications, where cloud-based inference is impractical.
Complementary efforts have also emerged toward parameter-efficient adaptations of multimodal large language models for edge deployment. Techniques such as distillation, low-rank adaptation, quantization, and modality-specific pruning have enabled reduced-scale vision–language models, including TinyLLaVA-style architectures [96], to perform multimodal perception and reasoning under limited computational resources. While these compact transformer variants are still largely unexplored in ERV-specific benchmarks, they represent a promising direction for bridging the gap between expressive multimodal reasoning and deployable, resource-aware emotion recognition systems.
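As a rough illustration of the low-rank adaptation idea mentioned above, the sketch below (illustrative dimensions and random weights, not tied to any particular model) adds a trainable update B·A, scaled by alpha/r, to a frozen weight matrix W, and counts how few parameters the adapter introduces relative to fine-tuning the full matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r = 768, 768, 8   # hidden sizes and rank are illustrative

W = rng.normal(size=(d_out, d_in)) * 0.02   # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.02       # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection (zero init)
alpha = 16                                  # LoRA scaling hyperparameter

def adapted_forward(x):
    # Frozen path plus low-rank update: W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y = adapted_forward(x)        # equals W @ x while B is zero-initialized

full_params = W.size          # parameters touched by full fine-tuning
lora_params = A.size + B.size # parameters touched by the adapter
```

With B zero-initialized, the adapted layer initially reproduces the frozen model exactly, so training starts from the pretrained behavior; in this configuration the adapter holds roughly 2% of the parameters of the full matrix, which is the source of the memory and storage savings that make edge deployment plausible.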
Overall, transformer-based architectures have enhanced global context modeling, strengthened cross-modal alignment, and enabled scalable multimodal representation learning. Collectively, these advances have facilitated a methodological shift in ERV from perception-driven pipelines toward MLLM-based frameworks capable of contextual, semantic, and reasoning-aware emotion understanding.

5. The Rise of MLLMs in Emotion Recognition in Video

The emergence of multimodal large language models (MLLMs) marks a significant shift in emotion recognition in video, extending earlier multimodal and transformer-based approaches through large-scale pretraining, unified multimodal representations, and enhanced reasoning capabilities. Unlike pre-MLLM pipelines that relied on task-specific architectures and fixed fusion strategies, MLLMs enable more flexible integration of visual, auditory, textual, and contextual information. This section examines the role of MLLMs in affective computing, provides a comparative analysis with pre-MLLM approaches, and analyzes their performance, robustness, reasoning, explainability, and interactive capabilities while also discussing current limitations and trade-offs.

5.1. Multimodal Large Language Models in Affective Computing

Multimodal large language models (MLLMs) extend transformer-based vision–language models (VLMs) by coupling modality-specific encoders with large-scale autoregressive language decoders trained on web-scale text corpora [90,97,98]. In contrast to earlier multimodal models that primarily produced fixed label predictions or joint embeddings, MLLMs generate free-form natural language outputs and support instruction following. This enables capabilities such as explanation of predictions, clarification-seeking, and adaptive behavior aligned with user objectives [99,100,101,102].
Contemporary general-purpose MLLMs, including GPT-4V, Gemini, and Claude 3, integrate vision and language processing within a unified interactive interface [90,91,93]. These models process images and, increasingly, video and audio streams by aligning non-textual inputs with internal token representations through cross-modal attention and projection mechanisms [10,94]. Visual instruction tuning and parameter-efficient adaptation strategies [99,100,101,102] further allow MLLMs to acquire task-specific competencies, such as detailed visual reasoning, affective description, and contextual interpretation, while using relatively modest amounts of supervised data.
Recent studies have further explored the integration of multimodal fusion strategies with large language models to address the challenges of noisy data, heterogeneous modalities, and high-level affective reasoning in real-world settings. In the context of social media analysis, Maazallahi et al. [103] proposed a hybrid framework that combines multimodal feature fusion with fine-tuned large language models to improve emotion recognition under weak supervision and label inconsistency. Their approach highlights how LLM-based reasoning can complement conventional multimodal pipelines by enforcing semantic coherence and compliance constraints across noisy visual, textual, and contextual inputs.
Beyond classification-oriented settings, LLMs have also been investigated as mechanisms for emotion-aware representation fusion and explainable affective reasoning. Rasool et al. [104] introduced an embedding-level fusion strategy in which emotion-specific representations are integrated within large language models such as Flan-T5, Llama 2, DeepSeek-R1, and ChatGPT 4. By combining lexicon-derived affective cues with attention-based contextual embeddings, their framework enables interpretable emotion reasoning and response generation, illustrating the potential of LLMs to unify perception, affect modeling, and explanation within a single architecture.
More recently, multimodal emotion recognition has been extended beyond facial and vocal cues to include body language and motion dynamics. Lu et al. [105] demonstrated that large language models can be adapted to process structured representations of human body movements, enabling emotion recognition and natural language explanation based on skeletal motion patterns. This work underscores the capacity of MLLM-based systems to incorporate diverse behavioral modalities and to reason over affective signals that are not explicitly tied to facial expressions or speech.
Collectively, these studies illustrate an emerging trend toward hybrid multimodal fusion and LLM-centered affective reasoning, where large language models act not only as classifiers but as integrative reasoning modules capable of handling noisy inputs, diverse expressive channels, and explainable emotion understanding. Such approaches complement vision- and audio-centric ERV pipelines and point toward more flexible, context-aware, and interactive emotion recognition systems.
From an affective computing perspective, these architectural and training advances enable MLLMs to extend beyond perception-centric emotion prediction to more integrated emotion understanding. Rather than outputting a single categorical label or continuous affective score, MLLMs can jointly reason over facial expressions, vocal prosody, linguistic content, and contextual cues and articulate their interpretations in natural language. This shift enables system-level functions, such as explanation generation, uncertainty expression, and interactive clarification, which are difficult to realize within earlier modular ERV pipelines.
Table 6 provides a comparative overview of representative MLLMs and MLLM-based affective systems, summarizing their core architectural characteristics, principal strengths, and current limitations. As highlighted in the table, general-purpose MLLMs offer unprecedented flexibility by unifying multimodal perception, reasoning, and language generation within a single framework, while emotion-specialized variants further improve affect recognition accuracy and explanation quality through task-oriented instruction tuning and hybrid expert designs. At the same time, the table underscores persistent challenges, including high computational cost, limited transparency, dependence on large-scale multimodal data, and sensitivity to noisy or misaligned inputs, which currently constrain large-scale deployment and reproducibility.
Building on these capabilities, recent research has begun to specialize MLLMs for emotion recognition and affective analysis. Lian et al. proposed GPT-4V with Emotion, a zero-shot benchmark designed to probe generalized emotion recognition abilities in multimodal LLMs without task-specific fine-tuning [86]. Emotion-LLaMA further advances this direction by incorporating instruction tuning with emotion-oriented datasets, improving both affective recognition performance and the quality of generated explanations [106]. Additional studies explored MLLM-based pipelines for video emotion recognition and evaluation protocols [87,88,107], alongside the introduction of dedicated benchmarks for emotional intelligence in LLMs and MLLMs, such as EmoBench and EmoBench-M [108,109].
Hybrid expert architectures, exemplified by Emotion-Qwen [89], further investigate the integration of general vision–language understanding with emotion-specific reasoning mechanisms. Beyond benchmarking and model analysis, MLLMs are increasingly explored in applied and assistive affective systems. For instance, Audo-Sight and EmoAssist leverage multimodal reasoning to support blind and visually impaired users through context- and emotion-aware ambient interaction [110,111]. These applications demonstrate how the reasoning and explanation capabilities of MLLMs can bridge emotion recognition with broader goals in human–AI interaction and accessibility [52,110]. Despite their strong representational and reasoning capabilities, VLMs and MLLMs impose specific data requirements that are not yet fully met by existing affective datasets. Table 7 summarizes the key data requirements and persistent gaps for VLM/MLLM-based emotion recognition.
Table 6. Multimodal large language models (MLLMs) for emotion recognition and affective computing.
Reference | Model and Modality | Strengths | Limitations
Achiam et al. (2023); Qi et al. (2023) [90,91] | GPT-4V, Gemini, Claude 3; multimodal | Unified perception, language generation, and reasoning; instruction following; free-form affective explanation and contextual interpretation | Extremely high computational cost; limited transparency and controllability; restricted access and reproducibility
Lian et al. (2024) [86] | GPT-4V with Emotion; multimodal | Zero-shot evaluation of generalized emotion understanding; strong cross-domain transfer without task-specific fine-tuning | Limited insight into internal reasoning; dependence on proprietary model behavior
Cheng et al. (2024) [106] | Emotion-LLaMA; multimodal | Emotion-oriented instruction tuning improves recognition accuracy and explanation quality | Requires curated emotion datasets; limited robustness under domain shift
Vaiani et al. (2024); Shou et al. (2025); Bhattacharyya et al. (2025) [87,88,107] | MLLM-based video ERV pipelines; multimodal | Joint reasoning over visual, audio, and linguistic cues; flexible handling of long temporal context | High data and computing requirements; sensitivity to noisy or misaligned multimodal inputs
Sabour et al. (2024); Hu et al. (2025) [108,109] | EmoBench and EmoBench-M; multimodal benchmarks | Standardized evaluation of emotional intelligence and multimodal affect reasoning | Limited coverage of real-world interaction scenarios; benchmark saturation risk
Huang et al. (2025) [89] | Emotion-Qwen; multimodal | Hybrid expert architecture combining general VLM reasoning with emotion-specific modules | Increased architectural complexity; training and inference overhead
Ainary et al. (2025); Qi et al. (2025) [110,111] | Audo-Sight, EmoAssist; assistive MLLMs | Context- and emotion-aware interaction for accessibility and assistive technologies | Dependence on robust perception modules; limited evaluation in diverse real-world settings
In particular, the lack of large-scale, well-aligned multimodal data with expressive and fine-grained affect annotations constrains the effective adaptation and evaluation of MLLMs for ERV.
Furthermore, demographic imbalance and limited cultural diversity introduce biases that undermine generalization and fairness, especially in socially sensitive applications. Temporal annotation sparsity and synchronization errors further limit the ability of MLLMs to exploit long-range reasoning capabilities. Finally, accessibility considerations are rarely incorporated at the data level, despite growing interest in assistive affective technologies. Addressing these gaps is essential to fully leverage MLLMs for robust, inclusive, and context-aware emotion recognition and remains a key challenge for future research.
Table 7. Data requirements and remaining gaps for VLM- and MLLM-based emotion recognition.
Requirement | Notes
Multimodality and contextual grounding | Joint availability of vision, audio, text, and contextual metadata is required; in-the-wild interactions and long temporal context remain underrepresented.
Granular and expressive labels | Need for combined categorical and dimensional (valence–arousal) annotations, including compound, subtle, and ambiguous affective states.
Scale, diversity, and balance | Large-scale datasets with demographic, cultural, and situational diversity are necessary; current resources suffer from imbalance and limited coverage.
Temporal continuity and alignment | Fine-grained temporal annotations and reliable cross-modal synchronization are critical for modeling affective dynamics yet remain scarce.
Accessibility and inclusive design | Data collection and annotation protocols should account for sensory impairments and support assistive and accessibility-oriented applications.
The requirements summarized in Table 7 highlight a fundamental mismatch between the capabilities of modern VLMs and MLLMs and the limitations of existing affective datasets. While current models can integrate heterogeneous modalities and reason over extended contexts, most emotion recognition datasets remain limited in modality coverage, temporal depth, and contextual richness.

5.2. Comparative Analysis: Pre-MLLMs Versus MLLMs

The transition from pre-MLLM architectures to multimodal large language models (MLLMs) represents a fundamental shift in the design paradigm of emotion recognition in video (ERV), moving from modular, task-specific pipelines toward unified frameworks capable of multimodal reasoning and contextual integration. Earlier deep learning approaches, including convolutional, recurrent, and 3D convolutional networks as well as pre-transformer multimodal architectures, were typically engineered for a fixed set of input modalities and target datasets, relying on separate encoders and explicitly designed fusion mechanisms for each information stream [3,4]. While these models achieved strong performance on specific benchmarks such as AffectNet, Aff-Wild2, and AVEC challenge datasets, their generalization ability was often limited, with performance degrading under domain shift, annotation noise, or previously unseen interaction scenarios [19,20,23].
MLLMs address many of these limitations by leveraging large-scale multimodal pretraining combined with instruction tuning to support integrated perception, reasoning, and explanation across text, audio, image, and video inputs [10,90,91,93,94]. Rather than maintaining modality-specific pipelines and task-dependent fusion strategies, MLLMs map heterogeneous inputs into a shared semantic representation space. A large autoregressive language decoder then processes this representation to support reasoning and generation. This design enables flexible inference through prompting, supports zero- and few-shot emotion recognition, and allows models to generate natural language explanations of their predictions [86,99,100,101,102,106].
A key consequence of this architectural unification is improved generalization. Through exposure to diverse multimodal data during pretraining, MLLMs exhibit enhanced robustness to dataset bias and domain variation, enabling affect recognition in previously unseen settings with minimal or no task-specific supervision [8,85,86,87,90]. At the same time, the use of a single multimodal interface allows emotion recognition, explanation, and downstream decision making to be performed within a coherent conversational framework, reducing the need for task-specific system redesign and facilitating interactive affective analysis [88,89,94].
Another distinguishing characteristic of MLLMs is their capacity for language-based explainability. Unlike conventional classifiers that output discrete labels or continuous scores, MLLMs can articulate the rationale underlying their predictions, reference salient visual or acoustic cues, and express uncertainty in natural language. This capability aligns closely with emerging benchmarks on emotional intelligence and affective reasoning, which emphasize interpretability and contextual understanding in addition to predictive accuracy [107,108,109]. Furthermore, instruction tuning, visual instruction tuning, and parameter-efficient adaptation techniques enable MLLMs to rapidly incorporate emotion-specific knowledge or adapt to new application domains using relatively modest amounts of supervised data [99,100,101,102,106].
These properties make MLLMs particularly well suited for application-oriented ERV systems, including assistive technologies, conversational agents, and decision support tools, where emotion understanding must be tightly integrated with contextual reasoning and user interaction [52,110,111]. However, the increased flexibility and expressive power of MLLMs also introduce new challenges related to dataset bias, annotation reliability, privacy, and ethical considerations, which are not fully addressed by scale alone and motivate the open issues discussed later in this paper [23,50,52,112].
It is important to note that this review does not aim to provide a quantitative meta-analysis or a unified performance ranking across methods. Reported evaluation metrics, such as the accuracy, F1-score, or concordance correlation coefficient (CCC), are highly sensitive to dataset characteristics, annotation protocols, evaluation settings, and training assumptions, all of which vary substantially across studies. This variability is further amplified when comparing pre-MLLM models, typically trained and evaluated in fully supervised, dataset-specific settings, with multimodal large language models, which are often assessed under zero-shot or few-shot configurations using prompt-based inference.
Across the surveyed literature, performance improvements were commonly reported within narrowly defined experimental contexts, such as within-subject or in-dataset evaluation on specific benchmarks (e.g., AffectNet, Aff-Wild2, and AVEC-related datasets). While such results demonstrate the effectiveness of individual models under controlled conditions, they are not directly comparable across studies due to differences in data splits, label definitions, temporal granularity, and evaluation protocols. Similarly, reported gains for MLLM-based approaches are frequently demonstrated through task-specific benchmarks, qualitative reasoning evaluations, or small-scale empirical studies that emphasize flexibility, generalization, or explanatory capabilities rather than optimized performance on standardized ERV datasets. For these reasons, direct numerical comparison of reported scores across datasets, model families, and evaluation paradigms can be misleading and may obscure the underlying trade-offs between methodological approaches. Instead, this review adopts a structured qualitative comparison strategy, synthesizing reported results within their experimental context and focusing on relative strengths, limitations, and deployment considerations.
Table 8 contrasts the representative characteristics of pre-MLLM- and MLLM-based approaches across multiple dimensions relevant to ERV, synthesizing trends reported in recent research papers, benchmark analyses, and empirical studies [4,5,6,86,87,89,107,108,109,113], and distills the key trade-offs between the two families. Pre-MLLM models emphasize efficiency and task specificity, achieving strong performance under well-defined conditions but exhibiting limited robustness and adaptability. In contrast, MLLMs prioritize unified multimodal reasoning and language-based interaction, enabling broader task coverage, improved zero- or few-shot generalization, and enhanced interpretability at the cost of higher computational demands and increased deployment complexity.
It is important to note that the comparative advantages attributed to MLLMs over task-specific ERV models are primarily derived from qualitative analyses, benchmark-level evaluations, and limited empirical studies rather than from standardized, head-to-head comparisons under shared experimental conditions. Unlike classical and deep learning ERV systems, which are typically evaluated using fixed datasets, well-defined metrics, and reproducible training protocols, MLLMs are often assessed through heterogeneous benchmarks, zero-shot or prompt-based settings, and, in many cases, proprietary model implementations.
As a result, reported gains in robustness, generalization, and reasoning should be interpreted with caution. The performance improvements observed for MLLMs frequently depend on prompt formulation, task framing, and access to large-scale pretraining data that are not publicly disclosed. Moreover, many existing evaluations focus on small benchmark subsets, qualitative reasoning tasks, or human-in-the-loop assessments, which limits direct comparability with task-specific ERV models trained and evaluated under controlled conditions.
Consequently, while current evidence suggests that MLLMs offer enhanced flexibility, multimodal reasoning, and explanatory capabilities, their maturity and reliability for emotion recognition in video remain constrained by the absence of standardized benchmarks, shared evaluation protocols, and transparent reporting practices. Establishing fair and reproducible comparison frameworks between task-specific models and MLLMs remains an open challenge and a prerequisite for drawing stronger quantitative conclusions.

5.3. Performance and Robustness

Performance evaluation in ERV has historically relied on challenge-based benchmarks such as AVEC and Aff-Wild2, which established reference baselines for audio–visual and continuous valence–arousal prediction under realistic, in-the-wild conditions. Early task-specific deep learning models typically achieved concordance correlation coefficients (CCCs) and F1 scores in the moderate range, reflecting the intrinsic difficulty of modeling affective behavior in unconstrained environments characterized by annotation noise, subject variability, and contextual ambiguity [19,20].
Subsequent advances in transformer-based architectures and multimodal fusion strategies led to measurable performance gains by improving temporal modeling, long-range dependency capture, and cross-modal alignment. Approaches incorporating self-attention, bilinear fusion, and hybrid convolutional-transformer designs consistently outperformed earlier CNN–RNN pipelines across multiple benchmarks, particularly in settings involving long video sequences or complex audio–visual interactions [61,66,81]. Foundational architectural ideas underlying cross-modal attention and tensor-based fusion were first explored in multimodal sentiment and language analysis [114,115] and later adapted to emotion recognition in video. Nevertheless, these improvements remained largely bounded by dataset-specific supervision and often failed to generalize robustly across domains or annotation schemes.
More recent evaluations of VLMs and MLLMs indicate a further shift in both performance and robustness characteristics. GPT-4V with Emotion [86] demonstrated competitive and in some cases superior zero-shot performance relative to supervised baselines on several emotion recognition benchmarks, while simultaneously providing natural language rationales that enhance interpretability. Vaiani et al. [87] showed that multimodal LLMs can recognize emotions from video with limited task-specific supervision and flexibly adapt to alternative label taxonomies through prompting, reducing reliance on fixed annotation schemas.
Recent empirical analyses, including the evaluation study by Bhattacharyya and Wang [107], suggest that MLLMs can exhibit improved cross-dataset generalization compared with conventional VLMs, particularly under domain shift. These effects are most apparent when transferring across datasets with heterogeneous recording conditions, cultural contexts, or annotation schemes. Beyond standard accuracy metrics, preliminary human-centered evaluations indicate that MLLM-generated emotion interpretations may be perceived as more coherent and contextually grounded, highlighting the potential benefits of integrated multimodal reasoning. However, these findings remain preliminary and warrant further validation through standardized benchmarks and large-scale comparative studies.
Nevertheless, recent benchmark initiatives such as EmoBench and EmoBench-M [108,109] reveal persistent limitations. Current MLLMs continue to struggle with fine-grained affect distinctions, subtle or mixed emotional states, and cross-cultural variability in emotional expression. Performance remains sensitive to dataset composition, prompt formulation, and evaluation protocol, and models may exhibit overconfident or internally inconsistent reasoning when confronted with ambiguous affective cues. These findings suggest that while MLLMs improve robustness and flexibility relative to earlier approaches, their affective competence remains uneven and highly dependent on the data and evaluation design.

5.4. Reasoning, Explainability, and Interaction

A central distinction between multimodal large language models (MLLMs) and pre-MLLM emotion recognition architectures lies in the integration of perceptual inference with language-based reasoning. Conventional emotion recognition systems typically output categorical labels or continuous affective scores, offering limited transparency into the underlying decision process. Interpretability in such models is generally restricted to post hoc analyses, including attention visualizations, saliency maps, or feature ablation studies, which provide indirect and often ambiguous explanations of model behavior [1,2].
In contrast, MLLMs support explicit, language-mediated reasoning over multimodal inputs, enabling them to generate natural language explanations that reference salient visual, acoustic, linguistic, or contextual cues, such as facial muscle tension, vocal prosody, lexical content, or interactional context, while also expressing uncertainty or alternative affective interpretations [86,87,106,107]. This capability shifts explainability from an external diagnostic tool to an intrinsic model output, allowing emotion recognition decisions to be contextualized, interrogated, and refined through interaction.
Beyond interpretability, the reasoning capabilities of MLLMs enable more interactive and adaptive affective systems. Emotion-aware assistants, including early prototypes such as Audo-Sight and EmoAssist [110,111], illustrate how multimodal reasoning can be leveraged to describe perceived emotional states, relate them to situational context, and dynamically adapt feedback or guidance in response to user queries. Such interactional flexibility is particularly relevant in accessibility-oriented applications, mental health support, and educational settings, where affective understanding must be integrated with dialogue, explanation, and user intent [52].
In real deployments, these advances introduce new technical and ethical challenges. Language-based explanations generated by MLLMs may convey a false sense of certainty, obscure underlying model limitations, or rationalize incorrect predictions in a persuasive manner. In emotionally sensitive contexts, this raises concerns regarding transparency, emotional manipulation, user overreliance, and the appropriate calibration of trust [50,52,112]. Addressing these issues requires not only improved evaluation protocols for reasoning and explainability, but also careful system design that explicitly accounts for uncertainty, user agency, and ethical deployment constraints.

5.5. Limitations and Trade-Offs

Despite their representational flexibility and reasoning capabilities, multimodal large language models (MLLMs) are not a universal replacement for pre-MLLM emotion recognition architectures. A primary limitation concerns computational cost; both pretraining and inference typically require specialized hardware, large memory budgets, and careful system-level optimization. In contrast, smaller convolutional- or transformer-based models remain well suited for real-time or on-device ERV deployments, where latency, energy consumption, and deployment constraints are critical [1,81].
A second challenge arises from the reliance of MLLMs on large, opaque, web-scale training corpora. The limited transparency of these data sources complicates reproducibility, bias auditing, and fairness assessment, particularly in socially sensitive affective applications [52,98]. Unlike task-specific emotion recognition datasets with well-documented annotation protocols, web-scale multimodal data introduce heterogeneous labeling practices and cultural biases that are difficult to characterize or mitigate post hoc. Moreover, while MLLMs can generate natural language explanations for affective predictions, these explanations are not guaranteed to be faithful reflections of the underlying decision process [108,109].
Beyond computational cost and data dependency, additional practical drawbacks of current MLLM-based approaches warrant explicit consideration. A key limitation concerns reproducibility; many reported results rely on closed-source or rapidly evolving proprietary models, making it difficult to replicate findings or perform controlled ablation studies. This lack of transparency complicates fair comparison with task-specific ERV systems, which are typically evaluated using fixed architectures, public datasets, and reproducible training pipelines.
Another practical concern is sensitivity to prompt formulation and task framing. MLLM performance can vary substantially depending on prompt wording, contextual cues, or instruction style, introducing an additional source of variability that is largely absent in conventional supervised ERV models. In affective analysis, this sensitivity may lead to inconsistent predictions across semantically similar inputs or evaluation set-ups.
Finally, in emotionally ambiguous or subtle scenarios, MLLM-generated rationales may be fluent yet misleading, reflecting hallucinated explanations or uncertain internal evidence. This can obscure failure modes and complicate the calibration of trust in high-stakes settings, especially when evaluation protocols do not explicitly test for faithfulness, uncertainty communication, or explanation consistency [108,109].
Collectively, these issues highlight that practical deployment challenges for MLLMs extend beyond computational considerations, encompassing reproducibility, reliability, interpretability, and failure transparency. Addressing these limitations will require not only improved model architectures but also standardized evaluation practices, open benchmarks, and clearer reporting guidelines for multimodal affective reasoning systems.
By contrast, conventional deep learning architectures continue to offer favorable trade-offs in settings where task definitions are clear, training data are well curated, and computational resources are limited. Carefully optimized CNN- and transformer-based emotion recognition systems often achieve competitive performance on established benchmarks such as AffectNet, Aff-Wild2, and AVEC while maintaining lower operational cost and reduced governance complexity [4,20].
Taken together, these considerations suggest that MLLMs and pre-MLLM architectures should be viewed as complementary rather than competing solutions. MLLMs are most advantageous in scenarios that require rich multimodal context integration, flexible interaction, and language-based explanation, whereas specialized deep models remain effective for focused, resource-constrained ERV deployments. Building on this comparison, the following sections discuss open challenges related to data availability, evaluation methodology, ethical considerations, and application-specific design choices for emotionally intelligent multimodal systems.

6. Challenges and Future Directions of Research

Despite substantial progress in emotion recognition in video, particularly with the emergence of multimodal large language models, several fundamental challenges remain unresolved. These challenges span data availability and quality, evaluation methodology, robustness, explainability, ethical considerations, and practical deployment constraints. At the same time, recent advances open promising avenues for future research aimed at addressing these limitations and moving toward more reliable, interpretable, and socially responsible affective systems. This section reviews key open issues and outlines potential research directions for the next generation of ERV models.

6.1. Challenges and Open Issues

Although the field has developed quickly in recent years, especially with the introduction of multimodal large language models, emotion recognition in video (ERV) still faces a number of open problems. Several challenges identified in early deep learning systems remain unresolved, and some have become more pronounced now that models attempt to combine multiple modalities and reason about context.
The first challenge is the limited availability of balanced and diverse multimodal datasets. Most existing corpora do not cover sufficient demographic or cultural variation, which makes it difficult for models to generalize reliably. Approaches based on synthetic data and self-supervised learning may help reduce this gap. Another long-standing problem is the subjectivity of emotion labels. Annotations are often noisy or interpreted differently by individual raters, which motivates the use of continuous or probabilistic affect models and methods that explicitly account for annotator variation. Cultural and demographic bias adds another layer of complexity, since people express emotions in different ways, and fairness-aware or cross-cultural adaptation techniques are needed to avoid systematic errors.
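To make the annotator-variation point concrete, one common probabilistic treatment replaces majority-vote labels with per-rater label distributions and trains against them. A minimal numpy sketch follows; the label set, vote pattern, and predicted probabilities are illustrative.

```python
import numpy as np

EMOTIONS = ["anger", "happiness", "neutral", "sadness"]  # illustrative label set

def soft_label(votes, n_classes=len(EMOTIONS), smoothing=0.0):
    """Turn per-annotator votes (class indices) into a probability vector,
    retaining disagreement instead of discarding it via majority vote."""
    counts = np.bincount(votes, minlength=n_classes).astype(float) + smoothing
    return counts / counts.sum()

def soft_cross_entropy(pred_probs, target_dist, eps=1e-12):
    """Cross-entropy of a model's predicted distribution against the
    annotator-derived soft label."""
    return float(-(target_dist * np.log(np.asarray(pred_probs) + eps)).sum())

# Three raters: two choose "happiness" (index 1), one chooses "neutral" (2).
target = soft_label(np.array([1, 1, 2]))
loss = soft_cross_entropy([0.05, 0.6, 0.3, 0.05], target)
```

Training against the full vote distribution lets the model express the same ambiguity the raters did, rather than being penalized for it.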
Privacy and ethical issues also remain central. Emotional behavior constitutes highly sensitive information, and prior work has stressed the importance of consent, data protection, and clear governance frameworks [52,112]. Technical constraints come into play as well: transformer-based systems and MLLMs require significant memory and computational resources, which makes real-time deployment difficult on resource-limited devices. Techniques such as efficient tuning, model distillation, and hardware-aware optimization are therefore important. Robustness remains an open challenge, as ERV models often struggle with occlusion, lighting changes, motion blur, and other in-the-wild effects. Benchmarking studies [23,50] and large-scale evaluations such as AVEC [19] show that issues related to annotation quality and domain shift continue to affect performance.
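Of the efficiency techniques listed above, model distillation admits a compact illustration: a small student model is trained to match the temperature-softened output distribution of a larger teacher. The numpy sketch below shows only the core loss term, under the standard Hinton-style formulation; the logit values are illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions, the core
    objective of knowledge distillation; T > 1 exposes the teacher's
    confidence structure over non-target emotion classes."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(T * T * (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum())

teacher = [4.0, 1.0, 0.5]   # large server-side model's logits (illustrative)
student = [3.5, 1.2, 0.4]   # compact on-device model's logits (illustrative)
loss = distillation_loss(student, teacher)
```

In practice this term is combined with a standard supervised loss on ground-truth labels, with the mixing weight tuned per task.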
Beyond facial and vocal cues, ERV increasingly needs to consider contextual information present directly within the video. Text appearing on screen, such as subtitles, signs, or meme captions, can influence how emotions are interpreted. Detecting and recognizing such text typically relies on detection-based or segmentation-based text spotting methods [116,117,118,119].
These challenges have concrete practical consequences. For instance, mental health monitoring requires privacy-preserving and culturally fair systems; educational or interactive settings depend on models that remain stable under changing conditions; and assistive technologies require context-aware and personalized interpretations. As noted, recent performance gains often come at the cost of higher computational requirements, which reinforces the need for lighter and more energy-efficient models. Overall, ERV research is moving toward systems that are more interpretable, more aware of social context, and better aligned with human expectations.

6.2. Future Directions of Research

Recent advances in emotion recognition in video have substantially expanded the range of modeling approaches and applications while also revealing a set of persistent challenges that motivate continued research. As discussed throughout the paper, the field is progressively shifting from perception-driven, task-specific pipelines toward multimodal frameworks that integrate contextual cues, support higher-level reasoning, and enable flexible interaction. This transition motivates several interrelated directions for future research.
A first direction concerns personalization and user-aware adaptation. Emotional expression exhibits substantial inter-individual, cultural, and neurodiversity-related variability, which limits the effectiveness of population-level models. Future ERV systems will therefore need to adapt to individual users. This includes user-specific calibration, adaptive baselines, and conditioning on personal, social, and interaction history. Such capabilities are particularly critical in healthcare, education, and assistive scenarios, where misinterpretation of affect can have significant consequences. Emerging approaches, including few-shot learning, meta-learning, and prompt-based adaptation in MLLMs, offer promising avenues for achieving personalization without extensive retraining or data collection overhead.
A second research direction focuses on learning under low-resource conditions and improving robustness to domain shift. Despite recent progress, ERV models remain constrained by data scarcity, annotation subjectivity, and demographic imbalance. Consequently, increasing attention is being devoted to self-supervised and weakly supervised representation learning, domain adaptation techniques, uncertainty-aware modeling, and synthetic affect generation. Cross-dataset and cross-cultural evaluation protocols remain essential for assessing generalization, while recent advances in spatiotemporal fusion and emotion-aware pretraining demonstrate the potential for mitigating dataset-specific biases [114,120].
A further line of work centers on multimodal reasoning and interaction-oriented system design. As highlighted in earlier sections, ERV is moving beyond isolated recognition toward explanation, grounding, and conversational interaction. Future systems are expected to jointly model emotion, sentiment, engagement, and intent within unified multimodal reasoning frameworks [72,80,121]. This shift will likely be supported by multimodal knowledge representations, causal affect models, and real-time feedback mechanisms, enabling applications such as socially responsive avatars, intelligent tutoring systems, and human–robot interaction. Group-level dynamics and immersive environments further require context-sensitive and temporally extended affect modeling.
Another important direction involves improving computational efficiency and deployability. While MLLMs offer strong zero- and few-shot capabilities, their computational and energy requirements limit applicability in latency-sensitive or resource-constrained settings. Ongoing research on parameter-efficient adaptation, modular and adapter-based architectures, model compression, and hybrid cloud–edge deployment aims to bridge this gap, particularly for healthcare, assistive technologies, and wearable systems. These efforts reflect the broader need to balance representational power with practical deployment constraints.
Future ERV systems must also generalize across diverse datasets, modalities, and interaction contexts. Broadly trained, cross-domain foundation models [122] are emerging as a potential pathway toward improved transferability. At the same time, deeper modeling of temporal and social dynamics, such as long-term affect trajectories, memory-like mechanisms, and interaction history, will be necessary to capture emotional evolution over extended interactions and complex social settings.
Finally, progress toward trustworthy and ethically grounded ERV remains a critical concern. Privacy-preserving learning paradigms, including federated and decentralized multimodal training, are increasingly important for handling sensitive affective data. Improved interpretability, uncertainty estimation, and transparent reporting can support appropriate user trust, while safeguards against affect manipulation and culturally biased inference are essential for responsible deployment. Evaluation protocols must also evolve to explicitly account for demographic and cultural diversity.
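As an illustration of the privacy-preserving paradigms mentioned above, federated averaging aggregates locally trained model parameters without centralizing raw affective data. A minimal sketch of one aggregation round is shown below; the client parameter vectors and dataset sizes are illustrative.

```python
import numpy as np

def fed_avg(client_params, client_sizes):
    """One round of federated averaging: a data-size-weighted mean of client
    model parameters, so raw affective recordings never leave the devices."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * p for c, p in zip(coeffs, client_params))

# Three clients with locally trained parameter vectors (values illustrative).
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
global_params = fed_avg(clients, client_sizes=[10, 10, 20])
```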
Collectively, these research directions point toward ERV systems that are more context-aware, robust, and socially responsible, with multimodal reasoning capabilities that extend beyond perception alone. Continued progress will depend not only on advances in model architectures but also on improved datasets, evaluation methodologies, and ethical governance frameworks, setting the stage for the concluding remarks of this paper.

6.3. Ethical Considerations, Bias, and Fairness in Emotion Recognition in Video

Bias and ethical considerations have emerged as central challenges in affective computing, particularly as models evolve from laboratory-scale prototypes toward deployment in real-world, interactive, and assistive settings. While bias is frequently acknowledged in the literature, its treatment is often inconsistent, and comprehensive analyses of ethical implications and mitigation strategies remain limited, especially when compared with the rapid methodological advances in deep learning, transformer-based models, and multimodal large language models (MLLMs).
A primary source of bias in ERV systems originates from dataset composition and annotation practices. Widely used emotion datasets are typically collected under constrained recording conditions and exhibit imbalanced distributions with respect to age, gender, ethnicity, cultural background, and situational context. As discussed in earlier sections of this review, such limitations directly affect cross-dataset generalization and contribute to performance degradation under domain shift. Subjective annotation processes further exacerbate this issue, as affect labels often reflect annotator-specific interpretations that vary across cultures and social norms, introducing systematic label noise and latent bias into model training and evaluation.
Recent transformer-based architectures and MLLMs, despite their superior representational capacity and contextual modeling abilities, do not inherently resolve these biases. On the contrary, large-scale pretraining on web-derived data may amplify demographic and cultural imbalances already present in the training corpora. As highlighted in studies reviewed in Section 4 and Section 5, vision–language models and MLLMs inherit biases from both visual and textual modalities, which can manifest as uneven emotion recognition performance across demographic groups or as culturally skewed affect interpretations. The opacity of internal representations and reliance on proprietary or weakly curated pretraining data further complicate bias auditing and accountability.
Existing ERV research has addressed bias-related challenges primarily through implicit mitigation strategies, such as data augmentation, domain adaptation, and robustness-oriented architectural design. While these approaches can improve average performance and stability, they rarely incorporate explicit fairness-aware objectives, demographic stratification in evaluation protocols, or bias-sensitive loss functions. More principled mitigation strategies, such as balanced dataset construction, representation debiasing during training, domain-invariant feature learning, and post hoc calibration, have been explored in isolated studies but remain largely disconnected from mainstream ERV pipelines and benchmark evaluations.
The emergence of MLLMs introduces additional ethical dimensions beyond predictive accuracy, including explainability, transparency, and responsible interaction. Unlike earlier ERV systems that output categorical labels or continuous affective scores, MLLMs generate free-form natural language responses and engage in interactive reasoning. While this capability enables explanation generation and uncertainty expression, it also raises new risks related to misleading rationales, overconfident explanations, and the propagation of latent biases through language generation. As discussed in recent benchmark-oriented works, current evaluation frameworks for MLLMs in affective computing focus predominantly on task performance, with limited consideration of fairness, bias sensitivity, or ethical robustness.
Beyond the identification of ethical risks, the recent ERV literature has begun to explore methodological strategies aimed at mitigating bias, improving fairness, and enhancing robustness. Debiasing approaches commonly focus on reducing the influence of confounding factors such as gender, ethnicity, age, and cultural background, which are often entangled with affective labels in real-world datasets. Representative techniques include dataset rebalancing, sample reweighting, adversarial feature disentanglement, and domain-invariant representation learning, where auxiliary objectives are introduced to suppress demographic or dataset-specific cues while preserving emotion-relevant information.
Fairness-aware training paradigms have also been investigated, particularly in the context of facial emotion recognition. These methods typically incorporate fairness constraints or regularization terms that aim to equalize performance across demographic subgroups or minimize disparities in error rates. Multi-task learning frameworks, adversarial debiasing architectures, and causal-inspired formulations have been proposed to explicitly control for sensitive attributes during training, although their effectiveness remains closely tied to the availability and reliability of demographic annotations.
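A simple instance of such a fairness constraint augments the task loss with a penalty on the worst-case gap between per-group error rates. The numpy sketch below illustrates the idea on toy data; the labels, predictions, and group assignments are illustrative, and real systems would embed this term in a differentiable training objective.

```python
import numpy as np

def fairness_penalized_error(y_true, y_pred, groups, lam=1.0):
    """Overall error rate plus a penalty on the worst-case gap between
    per-group error rates (an equalized-error-rate style objective)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    overall = float((y_true != y_pred).mean())
    rates = [float((y_true[groups == g] != y_pred[groups == g]).mean())
             for g in np.unique(groups)]
    gap = max(rates) - min(rates)
    return overall + lam * gap, overall, gap

# Toy example: all errors fall in group "b", so the gap term adds a penalty.
y_true = np.array([0, 1, 0, 1, 0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 1])
groups = np.array(list("aaaabbbb"))
objective, overall, gap = fairness_penalized_error(y_true, y_pred, groups, lam=0.5)
```

Here a model with identical overall accuracy but evenly distributed errors would incur a lower objective, which is precisely the behavior the fairness constraint encourages.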
In parallel, adversarial robustness and reliability under distribution shift have emerged as important concerns for ERV deployment. Studies have shown that emotion recognition models are vulnerable to adversarial perturbations, occlusions, compression artifacts, and environmental noise, which can disproportionately affect certain populations or interaction contexts. Robust training strategies, including data augmentation, adversarial training, uncertainty modeling, and consistency regularization, have been explored to improve stability under such conditions, though comprehensive evaluation under real-world attack and degradation scenarios remains limited.
Despite these advances, existing debiasing and robustness techniques are often evaluated in controlled experimental settings and on limited benchmarks, making it difficult to assess their generalizability. Moreover, the increasing adoption of large-scale and multimodal pretrained models introduces new ethical challenges related to opaque training data, inherited societal biases, and limited controllability. As a result, the development of standardized fairness benchmarks, transparent reporting practices, and robust evaluation protocols remains an open research direction for building trustworthy and ethically aligned ERV systems.
Overall, the surveyed literature indicates a clear gap between technical advances in ERV and the systematic treatment of ethical and bias-related issues. Addressing this gap requires the development of standardized bias evaluation protocols, fairness-aware training and adaptation strategies, and more inclusive, demographically diverse datasets. These challenges are particularly critical for socially sensitive and assistive applications, where biased emotion recognition can have a disproportionate negative impact. Future research must therefore integrate ethical considerations as a first-class objective alongside accuracy and efficiency to ensure responsible and trustworthy deployment of ERV systems.

6.4. Actionable Research Directions for Advanced ERV Systems

Future work in affective computing should place greater emphasis on methodologically rigorous and implementation-oriented research directions that transform conceptual challenges into well-defined algorithmic and system-level solutions.
A first priority concerns data modeling and supervision strategies. Beyond collecting larger datasets, future work should explore weakly supervised, self-supervised, and cross-modal pretraining objectives that reduce dependence on densely annotated affect labels. Techniques such as contrastive multimodal pretraining, temporal consistency regularization, and pseudo-label refinement across modalities offer promising directions for leveraging uncurated or partially labeled video data while preserving affective discriminability.
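Among these objectives, temporal consistency regularization admits a particularly compact formulation: penalize large changes between embeddings of adjacent frames. The numpy sketch below uses a mean-squared-difference form, which is one of several possible choices, and contrasts a smoothly drifting embedding track with temporally uncorrelated frames.

```python
import numpy as np

def temporal_consistency_loss(frame_embeddings):
    """Mean squared difference between adjacent frame embeddings: a simple
    self-supervised regularizer encouraging smooth affective representations
    over time (one of several possible formulations)."""
    diffs = np.diff(frame_embeddings, axis=0)   # shape (T-1, D)
    return float((diffs ** 2).mean())

rng = np.random.default_rng(0)
# A slowly drifting embedding track vs. temporally uncorrelated frames.
smooth = np.cumsum(rng.normal(scale=0.01, size=(16, 8)), axis=0)
jumpy = rng.normal(size=(16, 8))
loss_smooth = temporal_consistency_loss(smooth)
loss_jumpy = temporal_consistency_loss(jumpy)
```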
From a modeling perspective, explicit temporal abstraction and sparsification remain underexplored in ERV. Transformer-based architectures and MLLMs would benefit from hierarchical temporal representations, adaptive frame or token selection, and sparse attention mechanisms that decouple short-term facial dynamics from long-range contextual reasoning. Such designs are critical to scaling ERV models to long-duration videos and real-time streams without prohibitive computational cost.
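As a crude, non-learned stand-in for such adaptive frame selection, one can score frames by inter-frame change and keep only the top-k in temporal order. The numpy sketch below is illustrative only; learned selectors would replace the motion proxy with task-driven saliency.

```python
import numpy as np

def select_top_k_frames(frames, k):
    """Keep the k frames with the largest inter-frame change (a crude motion
    proxy), returned in temporal order; a non-learned stand-in for adaptive
    frame/token selection."""
    frames = np.asarray(frames, dtype=float)                      # (T, ...)
    diffs = np.abs(np.diff(frames, axis=0))
    scores = np.concatenate([[0.0],
                             diffs.reshape(len(frames) - 1, -1).mean(axis=1)])
    keep = np.sort(np.argsort(scores)[-k:])                       # top-k, in order
    return keep, frames[keep]

# Ten static 4x4 frames except a burst of change around t = 4..7.
frames = np.zeros((10, 4, 4))
frames[4:7] += np.arange(1, 4)[:, None, None]
idx, kept = select_top_k_frames(frames, k=3)
```

Sparsifying the input this way reduces the token budget handed to a transformer or MLLM roughly in proportion to T/k.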
In terms of learning objectives and optimization, future ERV systems should move toward multi-objective training formulations that jointly optimize recognition accuracy, uncertainty calibration, and robustness under distribution shift. For multimodal architectures, this includes modality-aware regularization and dynamic reweighting strategies that explicitly handle modality dominance, missing modalities, or asynchronous inputs during both training and inference.
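The modality-handling strategies described above can be illustrated with a late-fusion scheme that renormalizes fusion weights over the modalities actually present and, at training time, randomly drops modalities. The numpy sketch below is a minimal version of both mechanisms; the feature vectors and fusion weights are illustrative.

```python
import numpy as np

def fuse(features, weights, present):
    """Weighted late fusion that renormalizes over the modalities actually
    present, so a missing stream does not silently shrink the output."""
    w = np.where(np.asarray(present, dtype=bool), weights, 0.0)
    w = w / w.sum()                                   # dynamic reweighting
    return sum(wi * fi for wi, fi in zip(w, features))

def modality_dropout(features, p=0.3, rng=None):
    """Training-time augmentation: drop each modality with probability p,
    always keeping at least one, to build robustness to missing inputs."""
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(len(features)) >= p
    if not keep.any():
        keep[rng.integers(len(features))] = True
    return [f if k else np.zeros_like(f) for f, k in zip(features, keep)], keep

audio, video, text = np.full(4, 1.0), np.full(4, 2.0), np.full(4, 3.0)
weights = np.array([0.2, 0.5, 0.3])
full = fuse([audio, video, text], weights, [True, True, True])
no_audio = fuse([audio, video, text], weights, [False, True, True])
```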
Evaluation protocols also require greater technical rigor and granularity. Rather than relying solely on aggregate accuracy or F1 scores, future benchmarks should incorporate protocol-level stress testing, including controlled perturbations of temporal alignment, modality dropout, and noise injection. For MLLM-based systems, evaluation should further assess temporal reasoning consistency, explanation stability across prompts, and sensitivity of generated affective descriptions to input perturbations.
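Such protocol-level stress testing can be organized as a small harness that re-evaluates a frozen model under named perturbations and reports the signed accuracy change relative to clean inputs. The sketch below uses toy data and a toy "model" purely for illustration; real pipelines would substitute actual clips and predictors.

```python
import numpy as np

def stress_test(predict_fn, clips, labels, perturbations):
    """Evaluate a frozen model on clean and perturbed inputs and report the
    signed accuracy change per perturbation relative to the clean protocol."""
    def acc(xs):
        return float(np.mean([predict_fn(x) == y for x, y in zip(xs, labels)]))
    report = {"clean": acc(clips)}
    for name, perturb in perturbations.items():
        report[name] = acc([perturb(c) for c in clips]) - report["clean"]
    return report

rng = np.random.default_rng(1)
clips = [rng.normal(size=(8, 4)) for _ in range(20)]     # toy (T, D) "videos"
labels = [int(c.mean() > 0) for c in clips]
predict_fn = lambda c: int(c.mean() > 0)                 # toy frozen "model"

perturbations = {
    "noise": lambda c: c + rng.normal(scale=2.0, size=c.shape),
    "temporal_shift": lambda c: np.roll(c, shift=2, axis=0),
    "frame_dropout": lambda c: c * (rng.random((c.shape[0], 1)) > 0.3),
}
report = stress_test(predict_fn, clips, labels, perturbations)
```

Reporting per-perturbation deltas, rather than a single aggregate score, makes robustness regressions visible and attributable.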
Finally, system-level integration and deployment constraints must be addressed more explicitly. Research should consider memory-efficient inference, parameter-efficient adaptation, and on-device or edge deployment strategies, particularly for assistive and interactive ERV applications. These considerations are especially relevant for MLLMs, where practical deployment remains limited by computational and energy requirements.
Collectively, these directions emphasize a shift from incremental architectural refinement toward scalable, robust, and deployable ERV systems, bridging the gap between methodological advances and real-world applicability.

6.5. Practical Applications and Deployment Scenarios

Emotion recognition in video streams has been investigated across a broad range of application domains, with the selection of modeling paradigms largely influenced by deployment constraints, including computational resources, inference latency, robustness requirements, and interaction complexity. Classical machine learning pipelines and early deep learning models have primarily been applied in controlled or resource-constrained environments, such as laboratory-based affect analysis, surveillance-oriented monitoring, and embedded human–AI interaction systems. Their relatively low computational cost, modular processing structure, and stable inference behavior facilitated early deployment, despite limited robustness and generalization under unconstrained conditions.
The adoption of unimodal and multimodal deep learning architectures expanded ERV applications toward interactive and socially aware systems, including affect-aware dialogue interfaces, multimedia content analysis, and driver or operator monitoring. In particular, audio–visual fusion models demonstrated improved resilience to noise, partial occlusion, and modality-specific degradation, enabling deployment in semi-controlled real-world scenarios such as vehicle cabins, smart classrooms, and multimedia analysis pipelines. However, these approaches typically require task-specific training and careful cross-modal synchronization, which can limit scalability and transferability across application domains.
Transformer-based architectures further extended ERV capabilities by supporting longer temporal context modeling and more complex multimodal interactions. These models have been applied to tasks such as continuous affect tracking in naturalistic video, video-based mental health assessment, and social signal processing. While self-attention mechanisms improve the modeling of long-range temporal dependencies and contextual relationships, the associated computational and memory demands often restrict deployment to server-side or offline processing environments.
Multimodal large language models introduce a distinct application paradigm by integrating emotion recognition with contextual reasoning, explanation generation, and interactive dialogue. As discussed in Section 5, MLLMs enable use cases such as assistive technologies for visually impaired users, emotion-aware conversational agents, and decision support systems, where affective understanding must be combined with semantic interpretation and user interaction. Despite their increased flexibility and expressiveness, practical deployment of MLLMs is currently constrained by computational cost, inference latency, and limited controllability, favoring cloud-based or hybrid deployment configurations.
Overall, practical ERV deployment reflects a recurring trade-off between computational efficiency and representational expressiveness. Lightweight classical and deep learning models remain suitable for real-time, embedded, or privacy-sensitive applications, whereas MLLMs are more appropriate for context-rich and interaction-intensive scenarios that require reasoning and explanation. This comparative perspective highlights that effective ERV deployment depends on aligning application requirements with system capabilities, rather than adopting a single dominant modeling approach.

6.6. Practical Applications Beyond Video-Based Emotion Recognition

Building upon the practical deployment scenarios discussed above, recent application-oriented studies in adjacent affective computing domains further illustrate how emotion and affective state recognition systems are deployed under real-world constraints. While these works do not exclusively focus on video-based emotion recognition, they provide complementary methodological and application-level insights that are directly relevant to the design, evaluation, and deployment of ERV systems.
In particular, applications in driver state and fatigue monitoring have explored efficient affective and cognitive state decoding using information-theoretic feature selection strategies, such as normalized mutual information combined with lightweight classifiers including extreme learning machines. These approaches have demonstrated practical advantages in safety-critical and resource-constrained environments, highlighting how information-driven modeling can support robust affective state recognition under strict latency and computational constraints. More advanced deployment scenarios, such as continuous driver fatigue detection, have adopted graph-based deep learning architectures, including graph attention convolutional neural networks with mutual information-driven connectivity. By explicitly modeling relational dependencies between temporal segments, physiological signals, or behavioral cues, these methods enable adaptive representation learning that improves robustness in dynamic operational settings. Conceptually, these architectures parallel recent ERV approaches that leverage attention mechanisms and structured modeling to capture dependencies across facial regions, temporal windows, or multimodal inputs.
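The normalized-mutual-information feature selection mentioned above can be sketched directly: discretize each feature, score it by its NMI with the target, and keep the top-k. The numpy sketch below uses synthetic data with one informative feature among noise; the binning scheme and data are illustrative, and the downstream classifier (e.g., an extreme learning machine) is omitted.

```python
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def normalized_mi(x, y):
    """NMI(X;Y) = I(X;Y) / sqrt(H(X)*H(Y)) for discrete variables, with the
    joint computed by encoding (x, y) pairs as single integers."""
    x, y = np.asarray(x), np.asarray(y)
    joint = entropy(x * (y.max() + 1) + y)
    hx, hy = entropy(x), entropy(y)
    mi = hx + hy - joint
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0

def select_features_nmi(X, y, k, bins=4):
    """Rank features by NMI with the target after equal-width binning and
    return the indices of the k most informative ones (highest first)."""
    scores = np.array([
        normalized_mi(
            np.digitize(X[:, j], np.histogram_bin_edges(X[:, j], bins)[1:-1]), y)
        for j in range(X.shape[1])
    ])
    return np.argsort(scores)[-k:][::-1], scores

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=200)
informative = y + rng.normal(scale=0.1, size=200)   # closely tracks the label
X = np.column_stack([rng.normal(size=(200, 1)),      # irrelevant
                     informative[:, None],
                     rng.normal(size=(200, 2))])     # irrelevant
top, scores = select_features_nmi(X, y, k=1)
```

Filter-style selection of this kind keeps the downstream classifier small, which is the property that makes the approach attractive under latency and compute constraints.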
In parallel, EEG-based affective recognition systems have been widely investigated in applied contexts such as workload assessment, fatigue monitoring, and emotion analysis. Recent application-focused studies and surveys emphasize challenges related to inter-subject variability, annotation subjectivity, and generalization across recording conditions. Although EEG-based systems rely on different sensing modalities, many of the identified deployment challenges, such as robustness, domain adaptation, and explainability, are shared with video-based emotion recognition and multimodal ERV systems.
From an application perspective, these adjacent domains reflect a broader convergence toward information-aware, attention-based, and graph-structured learning paradigms for affective state recognition. These paradigms increasingly inform the development of transformer-based ERV models and multimodal large language models, particularly in applications requiring reliability, interpretability, and real-time interaction. Incorporating insights from affective computing applications beyond video therefore supports the development of more robust, generalizable, and deployable ERV systems across safety-critical, assistive, and human–AI interaction scenarios.

7. Conclusions

Emotion recognition in video (ERV) has evolved through a sequence of methodological paradigms. Early work relied on handcrafted features combined with classical machine learning techniques, followed by the adoption of deep learning architectures based on convolutional and recurrent neural networks. These early systems typically focused on facial cues or short temporal segments and achieved reasonable performance under controlled laboratory conditions, but their robustness often deteriorated when applied to unconstrained, real-world recordings.
Subsequent multimodal approaches incorporating audio alongside visual information demonstrated improved performance on several benchmarks; however, their effectiveness remained strongly dependent on the dataset characteristics, recording conditions, and annotation protocols. The introduction of transformer-based architectures marked a further step forward by enabling the modeling of long-range temporal dependencies and more expressive cross-modal representations, including links between visual content and language. These capabilities addressed several limitations of earlier models, particularly with respect to contextual integration and temporal reasoning.
More recently, multimodal large language models (MLLMs) have emerged as a new paradigm for emotion understanding. Through large-scale multimodal pretraining and instruction following, these models can perform emotion-related inference without task-specific retraining and can generate natural-language explanations grounded in visual, auditory, and textual input. This shift has reframed ERV, moving the field from isolated perceptual classification toward multimodal reasoning.
In practice, several challenges remain unresolved. Many existing datasets lack sufficient cultural and demographic diversity, and emotion annotation remains inherently subjective, with substantial inter-annotator variability. In addition, concerns related to fairness, privacy, and computational demands of large-scale models pose significant barriers to real-time and resource-constrained deployment. These limitations motivate ongoing research into more diverse and representative datasets, standardized and comparable evaluation protocols, and lightweight model architectures suitable for edge or on-device inference.
At the same time, growing interest in applications such as assistive technologies, education, mental health support, and interactive systems highlights the increasing demand for emotion-aware AI. This demand underscores a gap between emerging application needs and the current technological maturity of ERV systems, particularly with respect to robustness, interpretability, and ethical deployment.
Emotion recognition in video has progressed from handcrafted pipelines to deep learning architectures and, more recently, to multimodal large language models. While these advances have expanded modeling capabilities, real-world deployment remains constrained by data quality, robustness, computational cost, and interpretability. This review emphasizes practical engineering trade-offs across modeling paradigms, focusing on multimodal fusion strategies, dataset properties, and evaluation practices relevant to applied systems. Classical and task-specific deep learning models continue to provide efficient and reliable solutions for well-defined scenarios, whereas MLLMs offer added value in context-aware, interactive, and flexible applications. Future ERV systems are therefore likely to adopt hybrid designs that balance computational efficiency and robustness with higher-level multimodal reasoning to meet real-world operational requirements.

Author Contributions

Conceptualization, M.-M.G., O.D. and R.T.; methodology, O.D. and R.T.; investigation, M.-M.G. and O.D.; resources, M.-M.G.; data curation, R.T.; writing—original draft preparation, O.D. and M.-M.G.; writing—review and editing, R.T., M.-M.G. and B.M.; visualization, M.-M.G. and B.M.; supervision, R.T.; project administration, O.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant from the National Research Program of the National Association of Technical Universities—GNAC ARUT 2023 under contract number 172/4.12.2023 and was partially supported by two grants from the Ministry of Research, Innovation and Digitization, CCCDI-UEFISCDI, project numbers PN-IV-P6-6.3-SOL-2024-2-0238 and PN-IV-P6-6.3-SOL-2024-0049, within PNCDI IV. The Open Access publication cost was funded by the PubArt program of the University Politehnica of Bucharest.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, S.; Deng, W.-H. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2022, 13, 1195–1215. [Google Scholar] [CrossRef]
  2. Sajjad, M.; Ullah, F.U.; Ullah, M.; Christodoulou, G.; Alaya Cheikh, F.; Hijji, M.; Muhammad, K.; Rodrigues, J.J.P.C. A Comprehensive Survey on Deep Facial Expression Recognition: Challenges, Applications, and Future Guidelines. Alex. Eng. J. 2023, 68, 817–840. [Google Scholar] [CrossRef]
  3. Ezzameli, K.; Mahersia, H. Emotion Recognition from Unimodal to Multimodal Analysis: A Review. Inf. Fusion 2023, 99, 101847. [Google Scholar] [CrossRef]
  4. Zhang, S.; Yang, Y.; Chen, C.; Zhang, X.; Leng, Q.; Zhao, X. Deep Learning-Based Multimodal Emotion Recognition from Audio, Visual, and Text Modalities: A Systematic Review of Recent Advancements and Future Prospects. Expert Syst. Appl. 2024, 237, 121692. [Google Scholar] [CrossRef]
  5. Ramaswamy, M.P.A.; Palaniswamy, S. Multimodal Emotion Recognition: A Comprehensive Review, Trends, and Challenges. WIREs Data Min. Knowl. Discov. 2024, 14, e1563. [Google Scholar] [CrossRef]
  6. Kalateh, S.; Estrada-Jimenez, L.A.; Nikghadam-Hojjati, S.; Barata, J. A Systematic Review on Multimodal Emotion Recognition: Building Blocks, Current State, Applications, and Challenges. IEEE Access 2024, 12, 103976–104019. [Google Scholar] [CrossRef]
  7. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021. [Google Scholar]
  8. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  9. Li, J.; Li, D.; Xiong, C.; Hoi, S.C.H. BLIP: Bootstrapping Language–Image Pre-Training for Unified Vision–Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
  10. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Bojanowski, P.; Joulin, A.; Simonyan, K.; Zisserman, A. Flamingo: A Visual Language Model for Few-Shot Learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  11. Chen, J.; Cui, Y.; Wei, C.; Polat, K.; Alenezi, F. Advances in EEG-Based Emotion Recognition: Challenges, Methodologies, and Future Directions. Appl. Soft Comput. 2025, 180, 113478. [Google Scholar] [CrossRef]
  12. El Saddik, A.; Ahmad, J.; Khan, M.; Abouzahir, S.; Gueaieb, W. Unleashing Creativity in the Metaverse: Generative AI and Multimodal Content. ACM Trans. Multimed. Comput. Commun. Appl. 2025, 21, 186. [Google Scholar] [CrossRef]
  13. Ekman, P. An Argument for Basic Emotions. Cogn. Emot. 1992, 6, 169–200. [Google Scholar] [CrossRef]
  14. Plutchik, R. The Nature of Emotions. Am. Sci. 2001, 89, 344–350. [Google Scholar] [CrossRef]
  15. Keltner, D.; Sauter, D.; Tracy, J.; Cowen, A. Emotional Expression: Advances in Basic Emotion Theory. J. Nonverbal Behav. 2019, 43, 133–160. [Google Scholar] [CrossRef] [PubMed]
  16. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The Extended Cohn–Kanade Dataset (CK+): A Complete Dataset for Action Unit and Emotion-Specified Expression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar] [CrossRef]
  17. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
  18. Bliss-Moreau, E.; Williams, L.A.; Santistevan, A.C. The Immutability of Valence and Arousal in the Foundation of Emotion. Emotion 2020, 20, 993–1004. [Google Scholar] [CrossRef] [PubMed]
  19. Valstar, M.; Gratch, J.; Schuller, B.; Ringeval, F.; Lalanne, D.; Torres Torres, M.; Scherer, S.; Stratou, G.; Cowie, R.; Pantic, M. AVEC 2016: Depression, Mood, and Emotion Recognition Workshop and Challenge. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge (AVEC ’16), Amsterdam, The Netherlands, 16 October 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 3–10. [Google Scholar] [CrossRef]
  20. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Trans. Affect. Comput. 2019, 10, 18–31. [Google Scholar] [CrossRef]
  21. Berrios, R. What Is Complex/Emotional about Emotional Complexity? Front. Psychol. 2019, 10, 1606. [Google Scholar] [CrossRef]
  22. Cao, S. Emotion and Cognition: On the Cognitive Processing Model of Nostalgia. Front. Psychol. 2024, 15, 1440536. [Google Scholar] [CrossRef]
  23. Costa, W.; Talavera, E.; Oliveira, R.; Figueiredo, L.; Teixeira, J.M.; Lima, J.P.; Teichrieb, V. A Survey on Datasets for Emotion Recognition from Vision: Limitations and In-the-Wild Applicability. Appl. Sci. 2023, 13, 5697. [Google Scholar] [CrossRef]
  24. Miranda-Correa, J.A.; Abadi, M.K.; Sebe, N.; Patras, I. AMIGOS: A Dataset for Affect, Personality and Mood in Individuals and Groups. IEEE Trans. Affect. Comput. 2018, 12, 479–493. [Google Scholar] [CrossRef]
  25. Subramanian, R.; Wache, J.; Abadi, M.K.; Vieriu, R.L.; Winkler, S.; Sebe, N. ASCERTAIN: Emotion and Personality Recognition Using Commercial Sensors. IEEE Trans. Affect. Comput. 2016, 9, 147–160. [Google Scholar] [CrossRef]
  26. Chen, J.; Wang, C.; Wang, K.; Yin, C.; Zhao, C.; Xu, T.; Zhang, X.; Huang, Z.; Liu, M.; Yang, T. HEU Emotion: A Large-Scale Database for Multimodal Emotion Recognition in the Wild. Neural Comput. Appl. 2021, 33, 8669–8685. [Google Scholar] [CrossRef]
  27. Yang, F.; Huang, Y.; Wang, S. Facial Emotion Recognition under Occlusion and Masking. Appl. Intell. 2020, 50, 5937–5950. [Google Scholar] [CrossRef]
  28. Jena, S.; Basak, S.; Agrawal, H.; Saini, B.; Gite, S.; Kotecha, K.; Alfarhood, S. Developing a Negative Speech Emotion Recognition Model for Safety Systems Using Deep Learning. J. Big Data 2025, 12, 54. [Google Scholar] [CrossRef]
  29. Yan, W.; Wu, Q.; Liu, Y.-J.; Wang, S.-J.; Fu, X. CASME II: An Improved Spontaneous Micro-Expression Database and the Baseline Evaluation. PLoS ONE 2014, 9, e86041. [Google Scholar] [CrossRef]
  30. Haddad, J.; Lezoray, O.; Hamel, P. 3D-CNN for facial emotion recognition in videos. In Proceedings of the Advances in Visual Computing (ISVC 2020), San Diego, CA, USA, 5–7 October 2020; pp. 298–309. [Google Scholar] [CrossRef]
  31. Talluri, K.K.; Fiedler, M.-A.; Al-Hamadi, A. Deep 3D convolutional neural network for facial micro-expression analysis from video images. Appl. Sci. 2022, 12, 11078. [Google Scholar] [CrossRef]
  32. Li, X.; Pfister, T.; Huang, X.; Zhao, G.; Pietikäinen, M. A Spontaneous Micro-Expression Database: Inducement, Collection and Baseline. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG), Shanghai, China, 22–26 April 2013; pp. 1–6. [Google Scholar]
  33. Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390. [Google Scholar] [CrossRef]
  34. Jiang, Y.-G.; Xu, B.; Xue, X.; Chang, S.-F.; Ngo, C.-W. Predicting emotions in user-generated videos. Proc. AAAI Conf. Artif. Intell. 2014, 28, 73–79. [Google Scholar] [CrossRef]
  35. Ekman, P.; Friesen, W.V. Pictures of Facial Affect; Consulting Psychologists Press: Palo Alto, CA, USA, 1976. [Google Scholar]
  36. Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar]
  37. Sun, X.; Song, Y.; Wu, Q.; Guo, Z.; Loy, C.C.; Liu, J.; Shen, X. EEV: Emotion Evoked Videos Dataset. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
  38. Poria, S.; Hazarika, D.; Majumder, N.; Mihalcea, R.; Cambria, E. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; pp. 527–536. [Google Scholar]
  39. Khan, M.; Ahmad, J.; Gueaieb, W.; De Masi, G.; Karray, F.; El Saddik, A. Joint Multi-Scale Multimodal Transformer for Emotion Using Consumer Devices. IEEE Trans. Consum. Electron. 2025, 71, 1092–1101. [Google Scholar] [CrossRef]
  40. Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.-H.; et al. Challenges in Representation Learning: A Report on Three Machine Learning Contests. In Proceedings of the ICML Workshop on Challenges in Representation Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1–8. [Google Scholar]
  41. Kollias, D.; Tzirakis, P.; Nicolaou, M.A.; Papaioannou, A.; Zhao, G.; Schuller, B.; Kotsia, I.; Zafeiriou, S. Deep affect prediction in-the-wild: Aff-Wild2 database and evaluation of state-of-the-art methods. IEEE Trans. Affect. Comput. 2019, 12, 704–718. [Google Scholar]
  42. Zadeh, A.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.-P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion. In Proceedings of the ACL 2018 Conference, Melbourne, Australia, 15–20 July 2018; pp. 2236–2246. [Google Scholar]
  43. Kossaifi, J.; Walecki, R.; Panagakis, Y.; Shen, J.; Schmitt, M.; Ringeval, F.; Han, J.; Cowie, R.; Pantic, M. SEWA DB: A Rich Database for Audio–Visual Emotion and Sentiment Research in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1022–1040. [Google Scholar] [CrossRef]
  44. Dhall, A.; Goecke, R.; Joshi, J.; Sikka, K.; Gedeon, T. Collecting Large, Richly Annotated Facial-Expression Databases from Movies. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG), Santa Barbara, CA, USA, 21–25 May 2012; pp. 1–6. [Google Scholar]
  45. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. MOSI: Multimodal Sentiment Analysis Dataset. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI), Tokyo, Japan, 12–16 November 2016; pp. 151–159. [Google Scholar]
  46. Ren, Z.; Ortega, J.; Wang, Y.; Chen, Z.; Guo, Y.; Yu, S.X.; Whitney, D. VEATIC: Video-Based Emotion and Affect Tracking in Context Dataset. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–7 January 2024; pp. 4467–4477. [Google Scholar]
  47. Ross, K.; Hungler, P.; Etemad, A. Unsupervised Multi-Modal Representation Learning for Affective Computing with Multi-Corpus Wearable Data. J. Ambient Intell. Humaniz. Comput. 2023, 14, 3199–3224. [Google Scholar] [CrossRef]
  48. Zheng, W.-L.; Lu, B.-L. Investigating Critical Frequency Bands and Channels for EEG-Based Emotion Recognition with Deep Neural Networks. IEEE Trans. Auton. Ment. Dev. 2015, 7, 162–175. [Google Scholar] [CrossRef]
  49. Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. DEAP: A Database for Emotion Analysis Using Physiological Signals. IEEE Trans. Affect. Comput. 2012, 3, 18–31. [Google Scholar] [CrossRef]
  50. Mocanu, D.; Stanciu, A.; Ionescu, B. Challenges in Annotating Multimodal Emotion Datasets: A Systematic Review. Inf. Fusion 2023, 88, 45–60. [Google Scholar] [CrossRef]
  51. Costache, A.; Popescu, D. Emotion Sketches: Facial Expression Recognition in Diversity Groups. Sci. Bull. 2021, 83, 29–40. [Google Scholar]
  52. Katirai, A. Ethical Considerations in Emotion Recognition Technologies: A Review of the Literature. AI Ethics 2024, 4, 927–948. [Google Scholar] [CrossRef]
  53. Dagher, I.; Dahdah, E.; Al Shakik, M. Facial Expression Recognition Using Three-Stage Support Vector Machines. Vis. Comput. Ind. Biomed. Art 2019, 2, 24. [Google Scholar] [CrossRef]
  54. Long, M.; Cao, Y.; Wang, J.; Jordan, M.I. Learning Transferable Features with Deep Adaptation Networks. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 97–105. [Google Scholar]
  55. Wang, K.; Peng, X.; Yang, J.; Lu, S.; Qiao, Y. Suppressing Uncertainties for Large-Scale Facial Expression Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6897–6906. [Google Scholar]
  56. Chen, J.; Fan, F.; Wei, C.; Polat, K.; Alenezi, F. Decoding driving states based on normalized mutual information features and hyperparameter self-optimized Gaussian kernel-based radial basis function extreme learning machine. Chaos Solitons Fractals 2025, 199, 116751. [Google Scholar] [CrossRef]
  57. Chen, J.; Cui, Y.; Wei, C.; Polat, K.; Alenezi, F. Driver fatigue detection using EEG-based graph attention convolutional neural networks: An end-to-end learning approach with mutual information-driven connectivity. Appl. Soft Comput. 2025, 186, 114097. [Google Scholar] [CrossRef]
  58. Hans, A.S.; Singh, P.; Kaur, A. Facial Emotion Recognition Using Convolutional LSTM Networks. Int. J. Adv. Sci. Innov. Soc. 2021, 7, 11–20. [Google Scholar] [CrossRef]
  59. Zhao, S.; Zhang, Y.; Zhang, Z.; Zhang, J.; Li, Y.; Ji, Q. An End-to-End Visual–Audio Attention Network for Emotion Recognition in User-Generated Videos. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 303–311. [Google Scholar] [CrossRef]
  60. Zhou, H.; Li, K.; Yang, Y.; Hu, Y.; Peng, Y.; Ji, Q. Information Fusion in Attention Networks with Multi-Level Factorized Bilinear Pooling for Audio–Visual Emotion Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2617–2629. [Google Scholar] [CrossRef]
  61. Zhang, Y.; Wang, Z.-R.; Du, J. Deep Fusion: Attention-Guided Factorized Bilinear Pooling for Audio–Video Emotion Recognition. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–8. [Google Scholar] [CrossRef]
  62. Lan, Y.-T.; Liu, W.; Lu, B.-L. Multimodal Emotion Recognition Using Deep Generalized Canonical Correlation Analysis with an Attention Mechanism. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–6. [Google Scholar] [CrossRef]
  63. Peng, Y.; Wang, W.; Kong, W.; Nie, F.; Lu, B.-L.; Cichocki, A. Joint feature adaptation and graph adaptive label propagation for cross-subject emotion recognition from EEG signals. IEEE Trans. Affect. Comput. 2022, 13, 1941–1958. [Google Scholar] [CrossRef]
  64. Mocanu, B.; Ţapu, R. Audio–Video Fusion with Double Attention for Multimodal Emotion Recognition. In Proceedings of the 2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), Nafplio, Greece, 26–28 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5. [Google Scholar] [CrossRef]
  65. Chen, S.; Zhang, H.; Zhang, J.; Liu, T. A Multi-Stage Dynamical Fusion Network for Multimodal Emotion Recognition. Cogn. Neurodynamics 2023, 17, 671–680. [Google Scholar] [CrossRef]
  66. Dutta, S.; Ganapathy, S. HCAM—Hierarchical Cross-Attention Model for Multi-Modal Emotion Recognition. arXiv 2023, arXiv:2304.06910. [Google Scholar]
  67. Fu, Z.; Zhang, L.; Huang, Q.; He, X.; Ji, Q. A Cross-Modal Fusion Network Based on Self-Attention and Residual Structure for Multimodal Emotion Recognition. arXiv 2021, arXiv:2111.02172. [Google Scholar]
  68. Ghaleb, E.; Niehues, J.; Asteriadis, S. Joint Modelling of Audio–Visual Cues Using Attention Mechanisms for Emotion Recognition. Multimed. Tools Appl. 2023, 82, 11239–11264. [Google Scholar] [CrossRef]
  69. Dixit, C.; Satapathy, S.M. Deep CNN with Late Fusion for Real-Time Multimodal Emotion Recognition. Expert Syst. Appl. 2024, 240, 122579. [Google Scholar] [CrossRef]
  70. Li, Q.; Chen, W.; Xu, H.; Sun, Y. Quantum-Inspired Multimodal Fusion for Video Sentiment Analysis. Inf. Fusion 2021, 65, 58–71. [Google Scholar] [CrossRef]
  71. Li, X.; Zhang, Y.; Zhao, L.; Liu, H.; Yu, P. Multimodal Sentiment Analysis Using Hierarchical Fusion with Context Modeling. Inf. Fusion 2021, 68, 77–89. [Google Scholar] [CrossRef]
  72. Cai, Y.; Li, X.; Zhang, Y.; Li, J.; Zhu, F.; Rao, L. Multimodal sentiment analysis based on multi-layer feature fusion and multi-task learning. Sci. Rep. 2025, 15, 2126. [Google Scholar] [CrossRef]
  73. Tian, X.; Zhang, P.; Xu, L. Multi-Scale Vision Transformer for Robust Facial Expression Recognition. Neurocomputing 2024, 584, 127496. [Google Scholar] [CrossRef]
  74. Chaudhari, A.; Bhatt, C.; Krishna, A.; Mazzeo, P.L. ViTFER: Facial Emotion Recognition with Vision Transformers. Appl. Syst. Innov. 2022, 5, 80. [Google Scholar] [CrossRef]
  75. Liu, X.; Zhang, B.; Zhao, J.; Li, S.; Deng, W.; Sun, J. TransFER: Learning Relation-Aware Facial Expression Representations with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 2340–2349. [Google Scholar]
  76. Neshov, N.; Christoff, N.; Sechkova, T.; Tonchev, K.; Manolova, A. SlowR50-SA: A Self-Attention Enhanced Dynamic Facial Expression Recognition Model for Tactile Internet Applications. Electronics 2024, 13, 1606. [Google Scholar] [CrossRef]
  77. Bertasius, G.; Wang, H.; Torresani, L. Is space–time attention all you need for video understanding? In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtually, 18–24 July 2021; Volume 139, pp. 813–824. Available online: https://proceedings.mlr.press/v139/bertasius21a.html (accessed on 18 January 2026).
  78. Khan, M.; El Saddik, A.; Deriche, M.; Gueaieb, W. STT-Net: Simplified Temporal Transformer for Emotion Recognition. IEEE Access 2024, 12, 86220–86231. [Google Scholar] [CrossRef]
  79. Swain, M.; Maji, B.; Khan, M.; El Saddik, A.; Gueaieb, W. Multilevel Feature Representation for Hybrid Transformers-Based Emotion Recognition. In Proceedings of the BioSMART 2023-Proceedings: 5th International Conference on Bio-Engineering for Smart Technologies, Paris, France, 7–9 June 2023; pp. 1–6. [Google Scholar] [CrossRef]
  80. Venkatraman, S.; Lee, Y.; Rao, P. Multimodal Emotion Recognition Using Audio–Visual Transformer Fusion with Cross Attention. arXiv 2024, arXiv:2407.18552. [Google Scholar]
  81. Guo, P.; Chen, Z.; Li, Y.; Liu, H. Audio–Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition. In Proceedings of the CAAI International Conference on Artificial Intelligence, Beijing, China, 27–28 August 2022; Springer: Cham, Switzerland, 2022; pp. 315–326. [Google Scholar] [CrossRef]
  82. Othmani, A.; Khan, M.; El Saddik, A.; Cogo-Moreira, H. ARNet: Enhanced Tracking and Optimization for Video-Based Affect Recognition. In Proceedings of the 2025 7th International Conference on Image, Video and Signal Processing (IVSP 2025), Ikuta, Japan, 4–6 March 2025. [Google Scholar] [CrossRef]
  83. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic VisioLinguistic Representations for Vision-and-Language Tasks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 13–23. [Google Scholar]
  84. Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; Chang, K.-W. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019, arXiv:1908.03557. [Google Scholar] [CrossRef]
  85. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.V.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling Up Visual and Vision–Language Representation Learning with Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
  86. Lian, Z.; Zhou, Y.; Zhao, T.; Wu, X.; Li, C. GPT-4V with Emotion: A Zero-Shot Benchmark for Generalized Emotion Recognition. Inf. Fusion 2024, 108, 102367. [Google Scholar] [CrossRef]
  87. Vaiani, L.; Cagliero, L.; Garza, P. Emotion Recognition from Videos Using Multimodal Large Language Models. Future Internet 2024, 16, 247. [Google Scholar] [CrossRef]
  88. Shou, Y.; Meng, T.; Ai, W.; Li, K. Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey. arXiv 2025, arXiv:2509.24322. [Google Scholar] [CrossRef]
  89. Huang, D.; Li, Z.; Sun, Y.; Chen, H.; Zhao, Y. Emotion-Qwen: Training Hybrid Experts for Unified Emotion and General Vision–Language Understanding. arXiv 2025, arXiv:2505.06685. [Google Scholar]
  90. Achiam, J.; Adler, S.; OpenAI Team. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  91. Qi, Z.; Xu, D.; Luo, R.; Liu, Y.; Zhang, J. Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision–Language Models through Qualitative Cases. arXiv 2023, arXiv:2312.15011. [Google Scholar]
  92. Li, Y.; Yuan, G.; Wen, Y.; Hu, E.; Evangelidis, G.; Tulyakov, S.; Wang, Y.; Ren, J. EfficientFormer: Vision Transformers at MobileNet Speed. Adv. Neural Inf. Process. Syst. 2022, 35, 12934–12949. [Google Scholar] [CrossRef]
  93. Anthropic. Introducing the Next Generation of Claude: The Claude 3 Model Family (Haiku, Sonnet, Opus), 2024. Available online: https://www.anthropic.com/news/claude-3-family (accessed on 8 October 2025).
  94. Li, Y.; Zhang, L.; Liu, P.; Xu, Y.; Zhou, X. Visual Large Language Models for Generalized and Specialized Applications. arXiv 2025, arXiv:2501.02765. [Google Scholar] [CrossRef]
  95. Mehta, S.; Rastegari, M. MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022. [Google Scholar]
  96. Zhu, D.; Chen, J.; Shen, X.; Li, Y. TinyLLaVA: Lightweight Multimodal Large Language Models. arXiv 2024, arXiv:2402.14289. [Google Scholar]
  97. Gan, Z.; Chen, Y.-C.; Li, L.; Wang, L.; Liu, Z.; Cheng, Y.; Gao, J. Vision–Language Pre-Training: Basics, Recent Advances, and Future Trends. Found. Trends Comput. Graph. Vis. 2022, 14, 163–352. [Google Scholar] [CrossRef]
  98. Du, Y.; Chen, L.; Wang, S.; Wang, X.; Zhao, X. A Survey of Vision–Language Pre-Trained Models. arXiv 2022, arXiv:2202.10936. [Google Scholar]
  99. Liu, H.; Li, C.; Zhang, Y.; Zhang, S.; Chen, Y.; Liu, Z.; Li, C.; Huang, J.; Wang, Y.; Lin, Z.; et al. Visual Instruction Tuning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023; pp. 34892–34916. [Google Scholar]
  100. Luo, G.; Zeng, J.; Zhu, J.; Liu, Y.; Zhang, S.; Zhou, Y. Cheap and Quick: Efficient Vision–Language Instruction Tuning for Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023; pp. 29615–29627. [Google Scholar]
  101. Liu, H.; Chen, Z.; Wu, J.; Li, Y.; Wang, X.; Zhang, Y. Parameter-Efficient Transfer Learning for Audio–Visual–Language Tasks. In Proceedings of the 31st ACM International Conference on Multimedia (ACM MM), Ottawa, ON, Canada, 29 October–3 November 2023; pp. 387–396. [Google Scholar]
  102. Huang, S.; Wu, Q.; Zhou, Y. Adapting Pre-Trained Language Models to Vision–Language Tasks via Dynamic Visual Prompting. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–8. [Google Scholar]
  103. Maazallahi, A.; Asadpour, M.; Bazmi, P.; Habibi, H.; Habibi, M. Advancing Emotion Recognition in Social Media: Compliance-Driven Multimodal Training with Fine-Tuned LLMs. Inf. Process. Manag. 2025, 62, 103974. [Google Scholar] [CrossRef]
  104. Rasool, A.; Shahzad, M.I.; Aslam, H.; Chan, V.; Arshad, M.A. Emotion-Aware Embedding Fusion in Large Language Models (Flan-T5, Llama 2, DeepSeek-R1, and ChatGPT 4) for Intelligent Response Generation. AI 2025, 6, 56. [Google Scholar] [CrossRef]
  105. Lu, H.; Chen, J.; Liang, F.; Tan, M.; Zeng, R.; Hu, X. Understanding Emotional Body Expressions via Large Language Models. Proc. AAAI Conf. Artif. Intell. 2025, 39, 1447–1455. [Google Scholar] [CrossRef]
  106. Cheng, Z.; Cheng, Z.-Q.; He, J.-Y.; Sun, J.; Wang, K.; Lin, Y.; Lian, Z.; Peng, X.; Hauptmann, A.G. Emotion-LLaMA: Multimodal emotion recognition and reasoning with instruction tuning. arXiv 2024, arXiv:2406.11161. [Google Scholar] [CrossRef]
  107. Bhattacharyya, S.; Wang, J.Z. Evaluating Vision–Language Models for Emotion Recognition. arXiv 2025, arXiv:2502.05660. [Google Scholar]
  108. Sabour, S.; Riedl, M.; Deriu, J.M.; Coda, A.; Di Gangi, M.A.; Uthus, D. EmoBench: Evaluating the Emotional Intelligence of Large Language Models. arXiv 2024, arXiv:2402.12071. [Google Scholar] [CrossRef]
  109. Hu, H.; Li, X.; Xu, F.; Zhang, T.; Chen, L.; Wang, P. EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models. arXiv 2025, arXiv:2502.04424. [Google Scholar]
  110. Ainary, B. Audo-Sight: Enabling Ambient Interaction for Blind and Visually Impaired Individuals. arXiv 2025, arXiv:2505.00153. [Google Scholar] [CrossRef]
  111. Qi, X.; Chen, L.; Wang, J.; Li, Y.; Zhang, S. EmoAssist: Emotional Assistant for Visual Impairment Community. arXiv 2025, arXiv:2502.09285. [Google Scholar] [CrossRef]
  112. Safdar, N.M.; Banja, J.D.; Meltzer, C.C. Ethical Considerations in Artificial Intelligence. Eur. J. Radiol. 2020, 122, 108768. [Google Scholar] [CrossRef]
  113. Zhang, Y.; Wang, X.; Chen, L. A Comprehensive Review on Multimodal Emotion Recognition. Multimodal Technol. Interact. 2023, 9, 28. [Google Scholar] [CrossRef]
  114. Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; pp. 6558–6569. [Google Scholar] [CrossRef]
  115. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 1103–1114. [Google Scholar] [CrossRef]
  116. Liao, M.; Shi, B.; Bai, X. TextBoxes++: A Single-Shot Oriented Scene Text Detector. IEEE Trans. Image Process. 2018, 27, 3676–3690. [Google Scholar] [CrossRef]
  117. Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character Region Awareness for Text Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9357–9366. [Google Scholar] [CrossRef]
  118. Long, S.; He, X.; Yao, C. Scene Text Detection and Recognition: The Deep Learning Era. Int. J. Comput. Vis. 2021, 129, 161–184. [Google Scholar] [CrossRef]
  119. Liang, M.; Ma, J.-W.; Zhu, X.; Qin, J.; Yin, X.-C. LayoutFormer: Hierarchical Text Detection Towards Scene Text Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 15665–15674. [Google Scholar] [CrossRef]
  120. Ge, M.; Li, M.; Tang, D.; Li, P.; Liu, K.; Deng, S.; Pu, S.; Liu, L.; Song, Y.; Zhang, T. Early Joint Learning of Emotion Information Makes Multimodal Model Understand You Better. In Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing (MRAC ’24), Melbourne, Australia, 28 October–1 November 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 54–61. [Google Scholar] [CrossRef]
  121. He, J. A Multimodal Approach for Emotion Recognition in Conversations Using the MELD Dataset. In Proceedings of the 2025 Asia-Europe Conference on Cybersecurity, Internet of Things and Soft Computing (CITSC), Rimini, Italy, 10–12 January 2025; pp. 54–58. [Google Scholar] [CrossRef]
  122. Abdrakhmanova, M.; Yermekova, A.; Barko, Y.; Ryspayev, V.; Jumadildayev, M.; Varol, H.A. One Model to Rule Them All: A Universal Transformer for Biometric Matching. IEEE Access 2024, 12, 96729–96739. [Google Scholar] [CrossRef]
Figure 1. Flowchart illustrating the literature identification, eligibility assessment, filtering criteria, and final paper selection process adopted in this review.
Figure 2. Valence–arousal model: emotions along positive–negative valence and high–low arousal.
Figure 3. Emotion across cognitive, physiological, behavioral, and physical domains: cues for recognition.
Figure 4. Timeline illustrating the methodological evolution of emotion recognition in video, from traditional machine learning approaches to deep learning, transformer-based architectures, and multimodal large language models (MLLMs).
Table 8. Qualitative comparison between pre-MLLM- and MLLM-based ER approaches.
Aspect | Pre-MLLM Models | MLLM-Based Methods
Architecture | Separate modality-specific encoders; handcrafted fusion (early, late, hybrid) | Unified multimodal encoder–decoder; shared tokenization across modalities
Task scope | Single-task or limited multi-task (e.g., dataset-specific FER) | Broad, prompt-defined tasks: ER, sentiment, explanation, dialogue, reasoning
Training data | Task-specific datasets (AFEW, Aff-Wild2, AVEC) | Web-scale multimodal corpora with targeted emotion fine-tuning
Generalization | Strong in-domain; weak under domain, cultural, or capture shifts | Strong zero- or few-shot transfer; improved out-of-domain robustness
Output format | Fixed categorical and VA outputs (e.g., softmax, CCC) | Labels plus free-form text: explanations, uncertainty, recommendations
Interpretability | Mostly implicit; limited to saliency or attention maps | Native textual rationales; can describe cues, ambiguity, and bias
Adaptation | Requires model retraining or full fine-tuning | Prompting, instruction tuning, and lightweight adapters enable rapid adaptation
Computation | Moderate compute; efficient for targeted tasks | High pretraining and inference cost; often requires accelerators
Ethical or data issues | Bias and annotation noise tied to dataset quality | Similar issues, but amplified by large, opaque web-scale training data
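The fusion contrast in the Architecture row can be made concrete with a minimal sketch. The snippet below is illustrative only: the probability vectors are hypothetical numbers, not outputs of any model surveyed in this review, and a deployed system would obtain them from modality-specific encoders (e.g., a visual CNN and an audio network). Late fusion combines per-modality class probabilities only after each encoder has produced its own prediction.

```python
# Minimal late-fusion sketch for audio-visual emotion recognition.
# All probability values are hypothetical and serve only to illustrate
# the weighted-average fusion step contrasted in Table 8.

EMOTIONS = ["anger", "happiness", "neutral", "sadness"]

def late_fusion(prob_sets, weights=None):
    """Weighted average of per-modality class-probability vectors."""
    n = len(prob_sets)
    weights = weights or [1.0 / n] * n
    fused = [sum(w * p[i] for w, p in zip(weights, prob_sets))
             for i in range(len(prob_sets[0]))]
    total = sum(fused)                     # renormalize to a distribution
    return [f / total for f in fused]

visual = [0.10, 0.70, 0.15, 0.05]          # hypothetical visual softmax output
audio  = [0.20, 0.40, 0.30, 0.10]          # hypothetical audio softmax output

fused = late_fusion([visual, audio], weights=[0.6, 0.4])
label = EMOTIONS[max(range(len(fused)), key=fused.__getitem__)]
print(label)                               # prints "happiness"
```

An MLLM-based pipeline, by contrast, would replace this handcrafted step with a single natural-language prompt (e.g., asking the model to name the expressed emotion and the cues supporting it), returning a label plus a free-form rationale — the Output format and Interpretability contrasts in the table above.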
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Grosu, M.-M.; Datcu, O.; Tapu, R.; Mocanu, B. A Comparative Study of Emotion Recognition Systems: From Classical Approaches to Multimodal Large Language Models. Appl. Sci. 2026, 16, 1289. https://doi.org/10.3390/app16031289


