1. Introduction
Emotion recognition in video (ERV) is a core task in affective computing that draws on computer vision, speech analysis, and psychology. Despite this interdisciplinary foundation, ERV remains only partially integrated across these research communities. Concretely, ERV refers to systems that attempt to infer people’s emotional states from video streams, usually by integrating facial expressions, body posture, the surrounding context, and available vocal cues. Despite its apparent simplicity at a conceptual level, emotion recognition involves considerable complexity in practice, a point that has been widely acknowledged in previous work [
1,
2].
Interest in ERV has steadily increased, driven partly by applications in human–AI interaction and healthcare and partly by the growing expectation that machines should be able to engage with humans in a socially aware manner. Although the extent to which such systems can reliably model human affect remains an open question, these expectations have driven rapid methodological advances. Nevertheless, the development of robust ERV systems remains challenging due to substantial inter- and intra-subject variability, strong dependence on context, annotation subjectivity, and the presence of noise and bias in available datasets. Two people rarely express the same emotion in the same way, and even a single individual does not do so consistently. When cultural differences, lighting variations, background noise, and occasional facial occlusions are introduced, the task becomes significantly more unpredictable than the benchmark evaluations suggest [
3,
4,
5,
6].
Early ERV systems relied on handcrafted features, such as facial landmarks, local binary patterns (LBP), and prosodic descriptors, combined with conventional classifiers. While effective under controlled conditions, these approaches exhibited limited robustness to variations in expression, pose, and recording conditions, leading to degraded performance in unconstrained environments [
1]. The introduction of deep learning marked a significant advance in ERV, enabling substantial performance improvements. Convolutional neural networks (CNNs) facilitated the automatic learning of discriminative spatial representations, while their integration with temporal modeling mechanisms, such as long short-term memory (LSTM) networks or 3D convolutions, enabled the capture of dynamic patterns underlying emotional expressions [
1,
2]. This progress addressed several long-standing issues, while simultaneously introducing new challenges related to data requirements and model generalization.
Transformers pushed the field another step forward. Vision transformers (ViTs) [
7] and subsequent video and multimodal transformer models demonstrated that long-range attention mechanisms can help stabilize recognition under the types of variation commonly observed in everyday video, as summarized in recent surveys [
1,
2,
5].
A more substantial shift occurred with the introduction of vision–language models (VLMs) and their more powerful successors, multimodal large language models (MLLMs). These models altered the nature of the problem. Emotion recognition is no longer treated as a purely perceptual classification task but instead as a reasoning-oriented process that incorporates contextual cues, inferred intentions, and cross-modal interactions. Earlier VLMs, such as CLIP and ALIGN [
8,85], demonstrated that large-scale image–text alignment exhibits strong transfer properties. BLIP and Flamingo [
9,
10] further enriched multimodal understanding. More recent MLLMs (e.g., GPT-4V, Gemini, and Kosmos) extend these capabilities to video, enabling the interpretation of emotional cues with a level of flexibility not attainable with earlier model architectures.
Recent survey work has also examined emotion recognition from physiological modalities. In particular, a comprehensive review of EEG-based emotion recognition analyzes methodological trends, challenges, and future directions in neural affective computing [
11]. While such studies provide valuable insights into internal emotion modeling using specialized sensing hardware, they address a fundamentally different problem setting. The present review focuses on emotion recognition in video, where affective understanding is inferred from observable behavioral cues in audio–visual streams under unconstrained, real-world conditions. Consequently, this work emphasizes distinct methodological challenges, including multimodal fusion, temporal modeling of dynamic expressions, dataset bias, and deployment-oriented trade-offs, thereby complementing existing EEG-focused surveys rather than overlapping with them.
Despite several existing surveys addressing different parts of this research landscape, such as facial expression recognition [
1,
2], multimodal fusion [
3,
4], and broader affective computing [
5,
6], the literature remains fragmented. What appears to be missing is an integrated account of how the field evolved from feature engineering to deep perceptual models and, more recently, to multimodal reasoning systems. This review seeks to chart that trajectory while acknowledging that the field is still evolving and several open questions remain.
This review aims to consolidate these developments with greater chronological and conceptual coherence, focusing specifically on emotion recognition in video as a temporally grounded, multimodal, and deployment-oriented problem. Rather than surveying individual model families in isolation, the paper emphasizes how successive methodological paradigms have reshaped system design, evaluation practice, and real-world applicability. More specifically, we make the following contributions:
Trace the evolution of emotion recognition in video from handcrafted feature pipelines to deep perceptual models and, most recently, to multimodal large language models, explicitly identifying shifts in modeling assumptions, supervision strategies, and system capabilities;
Analyze how dataset design choices, including recording conditions, annotation paradigms, and modality coverage, have influenced reported performance trends and constrained cross-generation comparisons;
Synthesize the empirical strengths and limitations of pre-MLLM and MLLM-based systems in unconstrained, real-world, and culturally diverse scenarios, with attention paid to robustness, generalization, and failure modes;
Examine persistent deployment challenges in ERV, including dataset imbalance, annotation subjectivity, cultural and demographic bias, computational cost, and reproducibility;
Outline technically grounded research directions that follow from these analyses, such as lightweight multimodal architectures, hybrid task-specific and foundation model designs, personalized affect modeling, and evaluation practices aligned with real-time and interactive applications.
The remainder of this paper is organized as follows.
Section 2 describes the review methodology, including the literature search strategy, selection criteria, and scope definition.
Section 3 provides a structured methodological overview of emotion recognition in video, covering traditional machine learning approaches, unimodal deep learning methods, multimodal deep learning models, transformer-based architectures, and their respective strengths and limitations.
Section 4 focuses on multimodal large language models (MLLMs) in affective computing, analyzing their architectural characteristics, capabilities, and limitations and comparing them with pre-MLLM approaches.
Section 5 presents a synthesis of data requirements and evaluation considerations for advanced ERV systems.
Section 6 discusses open challenges and future research directions in emotion recognition in video, with particular emphasis on robustness, scalability, explainability, and inclusivity. Finally,
Section 7 concludes the paper by summarizing the main findings and reflecting on the evolution of ERV toward reasoning-driven and multimodal frameworks.
4. Pre-MLLM Era of Emotion Recognition in Video
Prior to the emergence of multimodal large language models, emotion recognition in video evolved through a sequence of increasingly expressive modeling paradigms. Early approaches were dominated by traditional machine learning methods built on handcrafted features, followed by unimodal deep learning models that enabled data-driven visual representation learning. Subsequent work introduced multimodal deep learning architectures to integrate facial, vocal, and textual cues and later adopted transformer-based models to capture long-range dependencies and cross-modal interactions. This section reviews these pre-MLLM approaches, highlighting their methodological contributions and the inherent limitations that ultimately motivated the transition toward MLLM-based frameworks.
To facilitate a high-level understanding of the methodological evolution in emotion recognition in video,
Figure 4 provides a visual timeline summarizing the major paradigm shifts in the field. The figure illustrates the progression from early handcrafted feature-based pipelines to deep learning approaches, transformer-based architectures, and, more recently, multimodal large language model (MLLM)-based frameworks. Representative studies are positioned chronologically to highlight changes in feature representation, modality integration, and reasoning capability across methodological generations.
4.1. Traditional Machine Learning Approaches
Before the adoption of deep learning, emotion recognition in video was predominantly formulated as a feature engineering and pattern classification problem, relying on manually designed representations coupled with classical statistical learning models. Typical pipelines extracted low- and mid-level appearance and motion cues using handcrafted descriptors such as local binary patterns (LBPs), the histogram of oriented gradients (HOG), dense or sparse optical flow, and, in audio–visual settings, mel-frequency cepstral coefficients (MFCCs).
These representations were subsequently mapped to discrete emotion categories or continuous affective dimensions using discriminative or generative classifiers, most commonly support vector machines (SVMs), hidden Markov models (HMMs), or related probabilistic frameworks [
53].
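As an illustration of the kind of descriptor these pipelines relied on, the snippet below sketches the basic 3×3 LBP operator and its histogram. It is a didactic simplification (practical systems used optimized implementations and uniform-pattern or multi-scale variants), and all function names are ours.

```python
def lbp_code(patch):
    """Basic 3x3 local binary pattern: threshold the 8 neighbors
    against the center pixel and pack the bits clockwise."""
    center = patch[1][1]
    # Neighbor offsets, clockwise from the top-left corner.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2),
               (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(image):
    """256-bin LBP histogram over all interior pixels of a 2D
    grayscale image given as a list of lists."""
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            patch = [row[j - 1:j + 2] for row in image[i - 1:i + 2]]
            hist[lbp_code(patch)] += 1
    return hist
```

In a typical pipeline, the resulting 256-bin histogram would be concatenated with other descriptors (e.g., HOG or MFCC features) and passed to an SVM or HMM for classification.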
As summarized in
Table 2, these traditional approaches exhibited several practical advantages, including interpretable feature representations, modular pipeline design, and relatively low computational complexity, which made them suitable for early laboratory studies and resource-constrained environments.
Explicit preprocessing stages and structured temporal modeling enabled controlled analysis of affective dynamics, while rule-based or feature-level fusion strategies facilitated the integration of multimodal cues in audio–visual settings. However, the same design choices also imposed fundamental limitations. Empirical evidence shows that handcrafted representations are highly sensitive to domain shift, resulting in pronounced degradation of cross-dataset generalization performance when capture conditions, subjects, or recording environments differ [
54]. Furthermore, annotation noise and label uncertainty, which are intrinsic to affective datasets, introduced additional instability during model training and evaluation, limiting robustness even within nominally similar domains [
55]. Visual degradations such as pose variation, partial occlusions, and face coverings further compromised the reliability of feature extraction and downstream classification, leading to significant performance drops in unconstrained scenarios [
27].
Despite these shortcomings, handcrafted ERV pipelines established several foundational architectural principles that continue to inform modern systems, including explicit preprocessing stages, structured temporal modeling of affective dynamics, and manually designed fusion strategies for integrating multimodal cues. Contemporary deep learning approaches can be interpreted as a generalization of these paradigms, replacing fixed, task-specific feature representations with hierarchical features learned directly from data through end-to-end optimization. This shift has enabled improved robustness to appearance variability, contextual ambiguity, and capture conditions, thereby enhancing generalization in complex real-world emotion recognition scenarios.
Table 2.
Traditional machine learning approaches for emotion recognition in video.
| Reference | Feature Representation and Model | Strengths | Limitations |
|---|---|---|---|
| Jiang et al. (2014) [34] | LBP or HOG for appearance, optical flow for motion, MFCCs for audio; SVMs and HMMs | Explicit separation of feature extraction and classification; structured preprocessing and temporal modeling; modular audio–visual fusion pipeline | Strong reliance on handcrafted descriptors; sensitivity to recording conditions and subject variability; limited robustness in unconstrained video settings |
| Long et al. (2015) [54] | Task-specific handcrafted features; classical ML with adaptation objectives | Formal analysis of cross-domain generalization behavior; early identification of dataset bias and distribution mismatch | Significant performance degradation under domain shift; limited transferability without explicit domain adaptation or retraining |
| Wang et al. (2018) [55] | Handcrafted visual features; statistical learning with uncertainty modeling | Explicit modeling of annotation noise; improved training stability under noisy labels | Requires curated datasets or noise estimates; does not fully address intrinsic subjectivity and ambiguity of affective annotations |
| Dagher et al. (2019) [53] | Handcrafted facial appearance descriptors; SVMs | Interpretable feature representations; low computational and memory complexity; stable performance under controlled acquisition protocols | High sensitivity to poses, illumination changes, and partial occlusions; limited robustness to inter-subject and cross-dataset variability |
| Yang et al. (2020) [27] | Handcrafted facial features; classical ML classifiers | Systematic evaluation of occlusion effects; controlled analysis of partial facial visibility scenarios | Severe performance degradation under realistic occlusions (e.g., face coverings); limited applicability to in-the-wild deployment scenarios |
4.2. Unimodal Deep Learning Methods
The emergence of deep learning substantially advanced emotion recognition in video by enabling end-to-end learning of hierarchical spatial and temporal representations directly from raw or minimally processed inputs. Unlike traditional pipelines based on handcrafted descriptors, deep learning architectures jointly optimize feature extraction and classification objectives, thereby reducing reliance on manual feature engineering.
Recent EEG-based deep learning studies provide representative examples of advanced modeling strategies in affective and cognitive state recognition. For instance, the authors of [
56] combined normalized mutual information features with a self-optimized Gaussian kernel-based extreme learning machine, while the authors of [
57] introduced a graph attention convolutional neural network with mutual information-driven connectivity to model inter-channel EEG dependencies. Although these EEG-based approaches operate under different sensing assumptions than video-based emotion recognition, they exemplify broader methodological trends in affective computing, including the use of information-theoretic measures, structured representations, and end-to-end deep learning. In video-based emotion recognition, these trends are realized through architectures that operate on observable behavioral signals, where affective understanding relies on jointly modeling spatial appearance cues and temporal dynamics. Consequently, deep learning approaches for ERV have primarily centered on convolutional and recurrent neural architectures that enable hierarchical spatial encoding and temporal aggregation of facial and audio–visual data, forming the foundation of the unimodal deep learning methods discussed next.
In this context, convolutional neural networks (CNNs) have been widely adopted to encode spatial structure and local appearance patterns from facial imagery, while recurrent architectures, such as long short-term memory (LSTM) networks and convolutional LSTMs (ConvLSTMs), have been employed to model temporal dependencies across video frames and audio streams [
1,
2].
In the pre-transformer era, unimodal deep learning approaches for ERV largely converged toward three dominant architectural paradigms. The first and widely explored family consists of CNN-based models coupled with recurrent temporal aggregation mechanisms. In these architectures, CNNs are used to extract frame-level spatial representations, which are subsequently aggregated using LSTM or ConvLSTM units to capture the temporal evolution of affective expressions [
58,
59]. The primary strength of this paradigm lies in its explicit separation between spatial encoding and temporal modeling, which facilitates architectural modularity and interpretability. However, this decoupled design also introduces limitations, including increased computational complexity, sensitivity to frame sampling strategies, and reduced robustness under long or highly variable temporal sequences.
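The decoupling between spatial encoding and temporal aggregation can be shown schematically. In the sketch below, a trivial per-frame statistic stands in for the CNN encoder and a leaky running state stands in for the LSTM; the function names, the threshold, and the label set are illustrative, not drawn from any cited system.

```python
import math

def frame_features(frame):
    """Stand-in for a CNN frame encoder: mean intensity and
    contrast (standard deviation) of a 2D frame (list of lists)."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [mean, math.sqrt(var)]

def recurrent_aggregate(feature_seq, alpha=0.5):
    """Stand-in for LSTM aggregation: a leaky running state that
    blends each frame's features into the previous state."""
    state = [0.0] * len(feature_seq[0])
    for feats in feature_seq:
        state = [(1 - alpha) * s + alpha * f
                 for s, f in zip(state, feats)]
    return state  # final state summarizes the whole clip

def classify_clip(frames):
    """Spatial encoding per frame, then temporal aggregation,
    then a dummy threshold classifier on the aggregated contrast."""
    feats = [frame_features(f) for f in frames]
    state = recurrent_aggregate(feats)
    return "high-arousal" if state[1] > 1.0 else "low-arousal"
```

The two stages are swappable independently, which is the modularity advantage noted above; the same structure also exposes the paradigm's sensitivity to frame sampling, since the aggregate depends directly on which frames are encoded.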
The second prominent family comprises three-dimensional convolutional neural networks (3D-CNNs), which perform joint spatiotemporal feature learning by applying three-dimensional convolutional kernels over short video clips. By directly encoding both motion and appearance information, 3D-CNNs eliminate the need for explicit recurrent modeling and have demonstrated effectiveness in capturing short-term facial dynamics, including micro-expressions and brief affective events [
30,
31]. The main advantage of this approach is its ability to learn compact spatiotemporal representations in an end-to-end manner. Nevertheless, 3D-CNN-based models typically exhibit high computational and memory requirements and are constrained by a limited temporal receptive field, which can hinder their ability to model long-range affective dependencies and reduce scalability to longer video sequences.
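Because the kernel slides jointly over time and space, a single 3D-convolution response mixes appearance and motion, and its temporal receptive field is bounded by the kernel depth. A minimal dependency-free sketch (valid padding, single channel, single kernel):

```python
def conv3d(clip, kernel):
    """Valid 3D convolution of a T x H x W clip (nested lists) with
    a kt x kh x kw kernel, sliding jointly over time and space."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        frame = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            acc += (clip[t + dt][i + di][j + dj]
                                    * kernel[dt][di][dj])
                row.append(acc)
            frame.append(row)
        out.append(frame)
    return out
```

With a two-frame temporal-difference kernel, for example, the output responds to inter-frame change, which illustrates how 3D-CNNs capture short-term dynamics such as micro-expressions while remaining blind to dependencies longer than the kernel depth.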
The third category encompasses attention-augmented CNN architectures and bilinear pooling models, which aim to enhance representational capacity by selectively emphasizing salient spatial regions, channels, or feature interactions. Attention mechanisms enable adaptive weighting of informative facial regions or feature maps, while bilinear pooling captures higher-order correlations between feature representations, leading to more discriminative affective embeddings [
59,
60,
61]. These approaches improve sensitivity to subtle expression cues and reduce the impact of irrelevant background information. However, their increased architectural complexity often leads to higher training instability and greater data dependency, and their performance gains may diminish when applied to limited or noisy datasets.
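Bilinear pooling, in its simplest form, is the flattened outer product of two feature vectors (possibly two views of the same features), which captures all pairwise feature interactions; signed square root followed by L2 normalization is a commonly used post-processing step. A minimal illustrative sketch:

```python
import math

def bilinear_pool(f, g):
    """Bilinear pooling: flattened outer product of two feature
    vectors, yielding a second-order interaction descriptor."""
    return [fi * gj for fi in f for gj in g]

def signed_sqrt_l2(v, eps=1e-12):
    """Common post-processing: signed square root, then L2 norm."""
    v = [math.copysign(math.sqrt(abs(x)), x) for x in v]
    norm = math.sqrt(sum(x * x for x in v)) or eps
    return [x / norm for x in v]
```

The quadratic growth of the descriptor (len(f) × len(g)) makes the increased capacity and the increased data dependency noted above visible in a single line.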
Table 3 provides a comparative synthesis of representative unimodal deep learning architectures, explicitly summarizing their principal strengths and limitations. The analysis highlights that although unimodal deep learning methods substantially improve representation learning over handcrafted pipelines, they remain constrained by modality-specific degradations, data availability requirements, and limited generalization across subjects and recording conditions. These limitations motivate the transition toward bimodal and multimodal architectures, which seek to exploit complementary information sources to improve robustness and generalization.
Table 3.
Unimodal deep learning approaches for emotion recognition in video.
| Reference | Architecture and Modality | Strengths | Limitations |
|---|---|---|---|
| Zhang et al. (2019) [61] | CNN with bilinear pooling; visual and facial images | Modeling of higher-order feature interactions; enhanced discriminative capacity for subtle affective cues | Increased architectural and computational complexity; sensitivity to feature noise and facial misalignment |
| Li et al. (2020) [1] | CNN with temporal aggregation; visual and facial video | End-to-end learning of spatial representations; reduced reliance on handcrafted descriptors | Limited temporal modeling capability; sensitivity to frame sampling and facial alignment |
| Zhao et al. (2020) [59] | CNN–LSTM with attention; visual and video sequences | Explicit modeling of temporal affective dynamics with adaptive feature weighting | Higher data dependency; training instability; limited cross-dataset robustness |
| Haddad et al. (2020) [30] | 3D convolutional neural network; visual and spatiotemporal clips | Joint learning of spatial and short-term temporal representations | High computational and memory requirements; restricted temporal receptive field |
| Zhou et al. (2021) [60] | Attention-augmented CNN; visual and facial images | Selective emphasis on informative spatial regions | Limited temporal modeling; performance degradation under severe occlusion |
| Hans et al. (2021) [58] | CNN–LSTM; visual and video sequences | Explicit temporal aggregation of frame-level features | Increased computational complexity; limited cross-subject generalization |
| Talluri et al. (2022) [31] | 3D-CNN with spatiotemporal encoding; visual and video clips | End-to-end spatiotemporal feature learning without explicit recurrence | Strong dependence on large-scale annotated datasets; limited long-range temporal modeling |
4.3. Multimodal Deep Learning Models
Deep multimodal architectures extended unimodal emotion recognition frameworks by jointly modeling heterogeneous information sources, most commonly visual, auditory, and, in some cases, textual or physiological signals. By integrating complementary modalities, these approaches aim to overcome the intrinsic limitations of single-modality systems, particularly their sensitivity to modality-specific degradations. Fusion strategies explored in the literature can be broadly categorized into early fusion schemes, which concatenate modality-specific features at the representation level; late fusion approaches, which combine modality-wise predictions; and intermediate or hybrid fusion mechanisms, which perform integration within the network using attention modules, graph-based representations, or bilinear pooling operations. By exploiting complementary affective cues from facial expressions, speech prosody, and linguistic content, multimodal deep learning systems have consistently demonstrated improved recognition performance compared with unimodal baselines [
3,
4,
5,
6].
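The early and late fusion families can be contrasted in a few lines of pseudo-implementation. The sketch below assumes per-modality feature vectors and stub classifiers that return per-class score lists; intermediate (hybrid) fusion is omitted because it depends on architecture-specific attention, graph, or pooling modules.

```python
def early_fusion(features_by_modality, classifier):
    """Early fusion: concatenate modality features at the
    representation level, then classify once."""
    fused = [x for feats in features_by_modality for x in feats]
    return classifier(fused)

def late_fusion(features_by_modality, classifiers, weights=None):
    """Late fusion: classify each modality separately, then average
    the per-class score vectors (optionally weighted)."""
    n = len(features_by_modality)
    weights = weights or [1.0 / n] * n
    scores = [clf(feats)
              for clf, feats in zip(classifiers, features_by_modality)]
    n_classes = len(scores[0])
    return [sum(w * s[c] for w, s in zip(weights, scores))
            for c in range(n_classes)]
```

The trade-off discussed in the literature is visible here: early fusion lets one model exploit cross-modal feature correlations but couples the modalities (and their noise), whereas late fusion degrades gracefully when one modality is corrupted but cannot model feature-level interactions.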
Early multimodal ERV approaches relied primarily on statistical correlation analysis and generalized canonical correlation analysis (CCA) to align heterogeneous feature spaces [
62]. While these methods provided a principled framework for cross-modal representation alignment, their expressiveness was limited, and their performance depended strongly on the quality of the handcrafted input features. Subsequent research introduced more expressive fusion paradigms, including cross-modal graph attention networks [
63], double-attention fusion mechanisms [
64], and hierarchical or dynamically adaptive fusion strategies that explicitly model inter-modal dependencies and temporal interactions [
65,
66,
67,
68,
69,
70,
71,
72].
As ERV research progressed beyond unimodal visual analysis, bimodal audio–visual architectures emerged as a particularly prominent instantiation of multimodal learning, explicitly modeling interactions between facial dynamics and speech-related affective cues. These approaches generally improve affect recognition by leveraging complementary information from facial expressions and vocal prosody, thereby increasing robustness to partial modality degradation.
Table 4 provides a comparative synthesis of representative pre-MLLM bimodal architectures, summarizing their principal strengths and limitations and highlighting common design trade-offs.
Table 4.
Representative bimodal (audio–visual) deep learning approaches for ERV.
| Reference | Fusion Architecture and Modality | Strengths | Limitations |
|---|---|---|---|
| Lan et al. (2020) [62] | Generalized CCA; visual–audio | Principled cross-modal alignment; low model complexity; interpretable fusion | Limited representational capacity; reliance on handcrafted features |
| Peng et al. (2022) [63] | Cross-modal graph attention; visual–audio | Explicit modeling of inter-modal dependencies; improved robustness to partial modality degradation | High computational cost; sensitivity to temporal synchronization |
| Mocanu et al. (2022) [64] | Double-attention fusion; visual–audio | Adaptive emphasis on salient temporal segments and modalities | Performance degrades under noisy or misaligned inputs |
| Chen et al. (2023) [65] | Hierarchical temporal fusion; visual–audio | Multi-level integration of spatial, temporal, and cross-modal cues | Increased architectural complexity; higher data requirements |
| Dutta et al. (2023) [66] | Dynamic self-attention fusion; visual–audio | Flexible temporal weighting of modalities; improved handling of modality dominance | Computationally expensive training; limited scalability |
| Ghaleb et al. (2023) [68] | Bimodal audio–visual fusion | Long-range temporal and cross-modal dependency modeling | High memory and compute cost; limited real-time feasibility |
| Dixit et al. (2024) [69] | Adaptive multimodal attention; visual–audio | Improved robustness under partial modality corruption | Sensitivity to synchronization errors; training instability |
Early fusion-based bimodal models emphasize rich cross-modal feature interactions and can capture fine-grained correlations between visual and acoustic signals; however, they often incur substantial computational overhead and exhibit sensitivity to temporal misalignment between modalities. Attention-based architectures, such as VAANet and double-attention fusion models, introduce adaptive mechanisms to selectively emphasize salient regions, frames, or modalities, thereby improving discriminative capacity under favorable conditions. Nonetheless, their effectiveness is closely tied to the input quality and reliable cross-modal synchronization, and performance can degrade under noisy or asynchronous recordings.
Beyond early audio–visual fusion strategies, subsequent bimodal and multimodal architectures increasingly incorporated attention-based mechanisms and structured representations to address the limitations of uniform feature aggregation. Self-attention and graph-inspired formulations enabled models to adaptively weigh spatial, temporal, and cross-modal cues, allowing emotionally salient regions, time segments, or modality interactions to be emphasized during inference. By explicitly modeling the relationships among facial dynamics, speech prosody, and contextual signals, these approaches moved ERV systems beyond fixed fusion pipelines and toward more flexible and context-sensitive representation learning.
More recent bimodal architectures incorporating self-attention mechanisms enable more expressive temporal and cross-modal representations but typically require large-scale, well-annotated datasets and significant computational resources to train effectively. Collectively, these approaches illustrate a recurring trade-off between representational power, computational complexity, and practical robustness, motivating the exploration of more flexible and scalable multimodal modeling paradigms.
4.4. Transformer-Based Architectures
The introduction of transformer architectures marked a significant turning point in emotion recognition in video.
Table 5 provides a structured overview of representative transformer-based architectures for emotion recognition in video and multimodal affective analysis, summarizing their core design choices, input modalities, and principal strengths and limitations.
Earlier deep learning approaches based on CNNs, LSTMs, and 3D-CNNs focused on learning spatiotemporal patterns from visual and auditory streams but offered limited support for global context modeling or semantic alignment across modalities [
1,
2].
Transformers addressed these limitations by using self-attention to capture long-range dependencies and integrate heterogeneous inputs within a single framework. Vision transformers (ViTs) [
7] introduced patch-based self-attention, allowing models to capture fine-grained spatial relations and adapt more effectively to in-the-wild variation in pose, illumination, and occlusion. Adaptations for facial expression recognition, including ViT-FER, TransFER, and multi-scale ViT variants, further demonstrated improved robustness and expressiveness in visual emotion modeling [
73,
74,
75]. Parallel work on video transformers strengthened temporal modeling by leveraging self-attention over frame sequences [
76,
77].
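The operation underlying all of these models is scaled dot-product self-attention over a sequence of patch or frame embeddings. The sketch below shows a single head with the learned query/key/value projections omitted for brevity; real transformers apply separate linear projections and stack many such heads and layers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity
    Q/K/V projections: every token is re-expressed as a
    similarity-weighted mixture of all tokens, giving each patch or
    frame a global view of the sequence."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out
```

The nested loop over all token pairs also makes the quadratic cost in sequence length explicit, which is the scalability limitation repeatedly noted for long video streams.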
Recent work on transformer-based temporal modeling in emotion recognition has explored both efficiency and representational richness. The STT-Net architecture simplifies temporal transformers for emotion recognition by introducing a multi-head self- and cross-attention mechanism designed to model temporal interactions with reduced computational overhead, demonstrating competitive performance on benchmark facial expression datasets [
78]. In parallel, hybrid transformer approaches have been proposed that emphasize multilevel feature representation of sequential signals, such as combining parallel convolutional and sequential encoders to capture both the global and temporal characteristics of speech for improved recognition accuracy [
79].
These developments underscore the ongoing diversification of transformer architectures in ERV research, where design choices balance computational efficiency, depth of representation, and the capacity to model temporal evolution in affective signals. Transformers also played a central role in the development of multimodal ERV. Cross-attention mechanisms and audio–visual transformer blocks made it possible to align speech prosody, facial expressions, and contextual visual cues more effectively [
63,
67,
80,
81]. These models began to bridge low-level perception with higher-level semantic interpretation by embedding multiple modalities within a shared representational space.
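Cross-modal attention differs from self-attention only in where the queries and the keys/values originate: queries come from one modality's token sequence (e.g., visual frame embeddings) and keys/values from another's (e.g., audio frame embeddings), so each visual token is summarized as an audio-conditioned mixture. A minimal single-head sketch, again with the learned projections omitted:

```python
import math

def cross_attention(queries, context):
    """Each query token (e.g., a visual frame embedding) attends
    over the context tokens (e.g., audio frame embeddings); learned
    Q/K/V projections are omitted for brevity."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in context]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, context))
                    for j in range(d)])
    return out
```

Because the attention weights are recomputed per query, misaligned audio and visual token sequences shift which context tokens each query mixes in, which is one concrete source of the synchronization sensitivity reported for audio–visual transformers.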
While temporal dynamics are widely acknowledged as central to emotion recognition in video, different architectural paradigms exhibit distinct strengths and limitations in modeling long-term affective evolution. Early CNN–LSTM pipelines capture temporal information through recurrent state propagation but often struggle with long-range dependencies, accumulated error, and sensitivity to noise in unconstrained video streams. These limitations become pronounced in scenarios involving gradual affective transitions, intermittent emotional cues, or long interaction sequences, where frame-level features provide only weak supervisory signals.
Subsequent approaches have emphasized sequence-aware modeling and temporal consistency through explicit tracking, bidirectional temporal aggregation, or optimization over extended video segments. Such strategies have been shown to improve robustness to temporal fragmentation and to better preserve affective continuity over time, particularly in in-the-wild settings, where the facial visibility, poses, and expression intensity vary substantially [
82]. However, these methods typically remain dependent on carefully designed temporal objectives and well-aligned annotations, limiting their generalization across datasets and recording conditions.
Table 5.
Transformer-based architectures for ERV and multimodal affective analysis.
| Reference | Architecture and Modality | Strengths | Limitations |
|---|---|---|---|
| Dosovitskiy et al. (2021) [7] | Vision transformer; visual | Global spatial context modeling via patch-based self-attention; improved robustness to pose, illumination, and occlusion | High data and computational requirements; limited inductive bias for fine-grained facial dynamics |
| Tian et al. (2024); Chaudhari et al. (2022); Liu et al. (2022) [73,74,75] | ViT-based FER variants; visual | Improved expressiveness and generalization for facial emotion modeling in unconstrained environments | Computationally expensive; sensitive to dataset size and facial alignment quality |
| Neshov et al. (2024); Bertasius et al. (2021) [76,77] | Video transformers; visual | Long-range temporal dependency modeling through self-attention; enhanced sequence-level affect representation | Quadratic complexity with sequence length; limited scalability to long video streams |
| Peng et al. (2022); Dutta et al. (2023); Fu et al. (2021); Venkatraman et al. (2024); Guo et al. (2022) [63,66,67,80,81] | Audio–visual transformers; bimodal | Effective cross-modal alignment via attention; improved robustness through complementary cue integration | Sensitivity to temporal synchronization errors; high computational and memory overhead |
| Lu et al. (2019); Li et al. (2019) [83,84] | VisualBERT and ViLBERT; vision–language | Early joint visual–textual representations via cross-attention; foundational multimodal grounding | Limited scalability and task coverage; indirect affective supervision |
| Radford et al. (2021); Jia et al. (2021) [8,85] | CLIP and ALIGN; vision–language | Large-scale contrastive pretraining; strong transferability to downstream affective tasks | Limited temporal modeling; implicit rather than explicit emotion representation |
| Li et al. (2022); Alayrac et al. (2022) [9,10] | BLIP and Flamingo; vision–language | Improved cross-modal alignment with generative objectives; flexible multimodal reasoning | High training cost; emotion understanding remains indirect |
| Lian et al. (2024); Vaiani et al. (2024); Shou et al. (2025); Huang et al. (2025) [86,87,88,89] | Emotion-adapted VLMs; multimodal | Task-specific supervision improves affect recognition and semantic grounding | Domain adaptation required; limited robustness across datasets |
| Achiam et al. (2023); Qi et al. (2023); Li et al. (2025) [90,91,92] | MLLMs (GPT-4V, Gemini, Claude 3); multimodal | Unified perception, reasoning, and instruction following; contextual and semantic emotion understanding | Very high computational cost; limited transparency and controllability |
Transformer-based architectures address some of these challenges by leveraging self-attention mechanisms to integrate information across long temporal spans without explicit recurrence. This enables more flexible modeling of non-local temporal dependencies but introduces sensitivity to sequence length, annotation sparsity, and computational constraints. More recently, multimodal large language models have extended temporal reasoning by integrating visual dynamics with linguistic and contextual information, allowing affective interpretation to be grounded in higher-level narrative and situational context. Nevertheless, their ability to faithfully model fine-grained affective evolution over long, unconstrained videos remains uneven and strongly dependent on prompt formulation, data coverage, and evaluation protocol design.
One closely related development within transformer-based architectures was the rise of vision–language pretraining (VLP). Early transformer models such as VisualBERT and ViLBERT [
83,
84] introduced joint representations for images and text through cross-modal attention mechanisms. Large-scale contrastive models, including CLIP and ALIGN [
8,
85], further demonstrated that pretrained vision–language embeddings exhibit strong transferability across downstream tasks, including affective analysis [
4]. Subsequent architectures such as BLIP and Flamingo [
9,
10] extended these ideas by incorporating generative objectives and improved cross-modal alignment, broadening the applicability of transformer-based vision–language models to emotion-related tasks.
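The transferability of contrastive vision–language embeddings rests on a simple scoring rule: cosine similarity between an image embedding and one text embedding per candidate emotion prompt. The sketch below mimics this CLIP-style zero-shot inference with random stand-in vectors; the embedding dimension, temperature, and label set are illustrative assumptions, and a real system would obtain the embeddings from the pretrained encoders.

```python
import numpy as np

def zero_shot_emotion_scores(image_emb, text_embs, temperature=0.07):
    """CLIP-style scoring: cosine similarity between one image embedding
    and one text embedding per emotion prompt, softmax-normalized."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()

labels = ["happiness", "sadness", "anger", "fear"]
rng = np.random.default_rng(2)
text_embs = rng.normal(size=(4, 512))               # stand-in prompt embeddings
image_emb = text_embs[1] + rng.normal(scale=0.1, size=512)  # near "sadness"
probs = zero_shot_emotion_scores(image_emb, text_embs)
print(labels[int(np.argmax(probs))])  # sadness
```

Because the label set lives entirely in the text prompts, swapping taxonomies requires no retraining, which is precisely what makes these embeddings attractive for affective analysis.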
In modern ERV pipelines, modality-specific encoders for text, audio, images, and video are typically coupled with multimodal attention mechanisms that integrate heterogeneous representations into a shared latent space for emotion classification or regression. This architectural paradigm provides both the conceptual and technical foundation for contemporary multimodal large language models (MLLMs).
Early general-purpose vision–language models (VLMs) established robust multimodal grounding through joint visual–textual representation learning [
8,
9,
10,
83,
84,
85]. Subsequent work has adapted these transformer-based architectures to emotion recognition and affective analysis, incorporating task-specific supervision and multimodal attention mechanisms [
86,
87,
88,
89]. These developments have directly informed the design of contemporary multimodal large language models (MLLMs) such as GPT-4V, Gemini, and Claude 3 [
90,
91,
93,
94], which build upon transformer foundations by integrating large-scale pretraining with instruction following, reasoning capabilities, and multi-turn multimodal interaction.
Beyond high-capacity transformer architectures, recent research has also explored lightweight and resource-efficient transformer variants aimed at deployment under constrained computational budgets. Models such as MobileViT [
95] and EfficientFormer [
92] integrate convolutional inductive biases with transformer-style attention to reduce parameter count, memory footprint, and inference latency while preserving competitive representational capacity. These architectures were originally developed for mobile vision tasks but are increasingly relevant for emotion recognition in video, where real-time processing, on-device inference, and energy efficiency are often critical requirements.
In the context of ERV, lightweight transformers have primarily been evaluated as backbone encoders for facial or audio–visual feature extraction rather than as end-to-end affect reasoning systems. Existing studies indicate that compact transformer variants can achieve performance comparable to heavier models on constrained benchmarks when trained on large-scale datasets, although their ability to capture long-range temporal dependencies and subtle affective cues remains more limited. As a result, they are particularly well suited for scenarios such as mobile affect sensing, embedded human–AI interaction, and privacy-sensitive applications, where cloud-based inference is impractical.
Complementary efforts have also emerged toward parameter-efficient adaptations of multimodal large language models for edge deployment. Techniques such as distillation, low-rank adaptation, quantization, and modality-specific pruning have enabled reduced-scale vision–language models, including TinyLLaVA-style architectures [
96], to perform multimodal perception and reasoning under limited computational resources. While these compact transformer variants are still largely unexplored in ERV-specific benchmarks, they represent a promising direction for bridging the gap between expressive multimodal reasoning and deployable, resource-aware emotion recognition systems.
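Of these techniques, low-rank adaptation is the easiest to state precisely: a frozen pretrained weight is augmented by a trainable low-rank update, so only a small fraction of parameters is touched. A minimal NumPy sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
d_out, d_in, rank = 256, 256, 8      # illustrative sizes; rank << d
W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
B = np.zeros((d_out, rank))          # trainable up-projection, zero-init
                                     # so the adapter starts as a no-op
scale = 2.0                          # alpha / rank scaling factor

def lora_forward(x):
    """Low-rank adaptation: frozen base path plus a cheap update path."""
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapted model matches the base model exactly.
print(np.allclose(lora_forward(x), W @ x))  # True
# Trainable parameters: 2 * rank * d versus d * d for full fine-tuning.
print((A.size + B.size) / W.size)  # 0.0625
```

The same arithmetic explains why such adapters combine well with quantization and pruning for edge deployment: the base weights can stay compressed and read-only.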
Overall, transformer-based architectures have enhanced global context modeling, strengthened cross-modal alignment, and enabled scalable multimodal representation learning. Collectively, these advances have facilitated a methodological shift in ERV from perception-driven pipelines toward MLLM-based frameworks capable of contextual, semantic, and reasoning-aware emotion understanding.
5. The Rise of MLLMs in Emotion Recognition in Video
The emergence of multimodal large language models (MLLMs) marks a significant shift in emotion recognition in video, extending earlier multimodal and transformer-based approaches through large-scale pretraining, unified multimodal representations, and enhanced reasoning capabilities. Unlike pre-MLLM pipelines that relied on task-specific architectures and fixed fusion strategies, MLLMs enable more flexible integration of visual, auditory, textual, and contextual information. This section examines the role of MLLMs in affective computing, provides a comparative analysis with pre-MLLM approaches, and analyzes their performance, robustness, reasoning, explainability, and interactive capabilities while also discussing current limitations and trade-offs.
5.1. Multimodal Large Language Models in Affective Computing
Multimodal large language models (MLLMs) extend transformer-based vision–language models (VLMs) by coupling modality-specific encoders with large-scale autoregressive language decoders trained on web-scale text corpora [
90,
97,
98]. In contrast to earlier multimodal models that primarily produced fixed label predictions or joint embeddings, MLLMs generate free-form natural language outputs and support instruction following. This enables capabilities such as explanation of predictions, clarification-seeking, and adaptive behavior aligned with user objectives [
99,
100,
101,
102].
Contemporary general-purpose MLLMs, including GPT-4V, Gemini, and Claude 3, integrate vision and language processing within a unified interactive interface [
90,
91,
93]. These models process images and, increasingly, video and audio streams by aligning non-textual inputs with internal token representations through cross-modal attention and projection mechanisms [
10,
94]. Visual instruction tuning and parameter-efficient adaptation strategies [
99,
100,
101,
102] further allow MLLMs to acquire task-specific competencies, such as detailed visual reasoning, affective description, and contextual interpretation, while using relatively modest amounts of supervised data.
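The projection mechanism mentioned above can be illustrated in its simplest form: a learned linear map that turns vision-encoder patch features into pseudo-tokens in the language model's embedding space, in the style of LLaVA-type connectors. All dimensions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
# Illustrative dimensions: patch features from a frozen vision encoder
# are mapped into the language model's token-embedding space.
d_vision, d_lm, n_patches = 768, 4096, 32
W = rng.normal(scale=0.02, size=(d_vision, d_lm))  # learned projection
patch_features = rng.normal(size=(n_patches, d_vision))
visual_tokens = patch_features @ W                 # (32, 4096) pseudo-tokens
# These pseudo-tokens are interleaved with text token embeddings and
# consumed by the autoregressive decoder like ordinary tokens.
print(visual_tokens.shape)
```

Because only the connector (and optionally a small adapter) is trained, this design is what allows task-specific competencies to be acquired from relatively modest supervised data.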
Recent studies have further explored the integration of multimodal fusion strategies with large language models to address the challenges of noisy data, heterogeneous modalities, and high-level affective reasoning in real-world settings. In the context of social media analysis, Maazallahi et al. [
103] proposed a hybrid framework that combines multimodal feature fusion with fine-tuned large language models to improve emotion recognition under weak supervision and label inconsistency. Their approach highlights how LLM-based reasoning can complement conventional multimodal pipelines by enforcing semantic coherence and compliance constraints across noisy visual, textual, and contextual inputs.
Beyond classification-oriented settings, LLMs have also been investigated as mechanisms for emotion-aware representation fusion and explainable affective reasoning. Rasool et al. [
104] introduced an embedding-level fusion strategy in which emotion-specific representations are integrated within large language models such as Flan-T5, Llama 2, DeepSeek-R1, and ChatGPT 4. By combining lexicon-derived affective cues with attention-based contextual embeddings, their framework enables interpretable emotion reasoning and response generation, illustrating the potential of LLMs to unify perception, affect modeling, and explanation within a single architecture.
More recently, multimodal emotion recognition has been extended beyond facial and vocal cues to include body language and motion dynamics. Lu et al. [
105] demonstrated that large language models can be adapted to process structured representations of human body movements, enabling emotion recognition and natural language explanation based on skeletal motion patterns. This work underscores the capacity of MLLM-based systems to incorporate diverse behavioral modalities and to reason over affective signals that are not explicitly tied to facial expressions or speech.
Collectively, these studies illustrate an emerging trend toward hybrid multimodal fusion and LLM-centered affective reasoning, where large language models act not only as classifiers but as integrative reasoning modules capable of handling noisy inputs, diverse expressive channels, and explainable emotion understanding. Such approaches complement vision- and audio-centric ERV pipelines and point toward more flexible, context-aware, and interactive emotion recognition systems.
From an affective computing perspective, these architectural and training advances enable MLLMs to extend beyond perception-centric emotion prediction to more integrated emotion understanding. Rather than outputting a single categorical label or continuous affective score, MLLMs can jointly reason over facial expressions, vocal prosody, linguistic content, and contextual cues and articulate their interpretations in natural language. This shift enables system-level functions, such as explanation generation, uncertainty expression, and interactive clarification, which are difficult to realize within earlier modular ERV pipelines.
Table 6 provides a comparative overview of representative MLLMs and MLLM-based affective systems, summarizing their core architectural characteristics, principal strengths, and current limitations. As highlighted in the table, general-purpose MLLMs offer unprecedented flexibility by unifying multimodal perception, reasoning, and language generation within a single framework, while emotion-specialized variants further improve affect recognition accuracy and explanation quality through task-oriented instruction tuning and hybrid expert designs. At the same time, the table underscores persistent challenges, including high computational cost, limited transparency, dependence on large-scale multimodal data, and sensitivity to noisy or misaligned inputs, which currently constrain large-scale deployment and reproducibility.
Building on these capabilities, recent research has begun to specialize MLLMs for emotion recognition and affective analysis. Lian et al. proposed GPT-4V with Emotion, a zero-shot benchmark designed to probe generalized emotion recognition abilities in multimodal LLMs without task-specific fine-tuning [
86]. Emotion-LLaMA further advances this direction by incorporating instruction tuning with emotion-oriented datasets, improving both affective recognition performance and the quality of generated explanations [
106]. Additional studies explored MLLM-based pipelines for video emotion recognition and evaluation protocols [
87,
88,
107], alongside the introduction of dedicated benchmarks for emotional intelligence in LLMs and MLLMs, such as EmoBench and EmoBench-M [
108,
109].
Hybrid expert architectures, exemplified by Emotion-Qwen [
89], further investigate the integration of general vision–language understanding with emotion-specific reasoning mechanisms. Beyond benchmarking and model analysis, MLLMs are increasingly explored in applied and assistive affective systems. For instance, Audo-Sight and EmoAssist leverage multimodal reasoning to support blind and visually impaired users through context- and emotion-aware ambient interaction [
110,
111]. These applications demonstrate how the reasoning and explanation capabilities of MLLMs can bridge emotion recognition with broader goals in human–AI interaction and accessibility [
52,
110]. Despite their strong representational and reasoning capabilities, VLMs and MLLMs impose specific data requirements that are not yet fully met by existing affective datasets.
Table 7 summarizes the key data requirements and persistent gaps for VLM/MLLM-based emotion recognition.
Table 6.
Multimodal large language models (MLLMs) for emotion recognition and affective computing.
| Reference | Model and Modality | Strengths | Limitations |
|---|---|---|---|
| Achiam et al. (2023); Qi et al. (2023) [90,91] | GPT-4V, Gemini, Claude 3; multimodal | Unified perception, language generation, and reasoning; instruction following; free-form affective explanation and contextual interpretation | Extremely high computational cost; limited transparency and controllability; restricted access and reproducibility |
| Lian et al. (2024) [86] | GPT-4V with Emotion; multimodal | Zero-shot evaluation of generalized emotion understanding; strong cross-domain transfer without task-specific fine-tuning | Limited insight into internal reasoning; dependence on proprietary model behavior |
| Cheng et al. (2024) [106] | Emotion-LLaMA; multimodal | Emotion-oriented instruction tuning improves recognition accuracy and explanation quality | Requires curated emotion datasets; limited robustness under domain shift |
| Vaiani et al. (2024); Shou et al. (2025); Bhattacharyya et al. (2025) [87,88,107] | MLLM-based video ERV pipelines; multimodal | Joint reasoning over visual, audio, and linguistic cues; flexible handling of long temporal context | High data and computing requirements; sensitivity to noisy or misaligned multimodal inputs |
| Sabour et al. (2024); Hu et al. (2025) [108,109] | EmoBench and EmoBench-M; multimodal benchmarks | Standardized evaluation of emotional intelligence and multimodal affect reasoning | Limited coverage of real-world interaction scenarios; benchmark saturation risk |
| Huang et al. (2025) [89] | Emotion-Qwen; multimodal | Hybrid expert architecture combining general VLM reasoning with emotion-specific modules | Increased architectural complexity; training and inference overhead |
| Ainary et al. (2025); Qi et al. (2025) [110,111] | Audo-Sight, EmoAssist; assistive MLLMs | Context- and emotion-aware interaction for accessibility and assistive technologies | Dependence on robust perception modules; limited evaluation in diverse real-world settings |
In particular, the lack of large-scale, well-aligned multimodal data with expressive and fine-grained affect annotations constrains the effective adaptation and evaluation of MLLMs for ERV.
Furthermore, demographic imbalance and limited cultural diversity introduce biases that undermine generalization and fairness, especially in socially sensitive applications. Temporal annotation sparsity and synchronization errors further limit the ability of MLLMs to exploit long-range reasoning capabilities. Finally, accessibility considerations are rarely incorporated at the data level, despite growing interest in assistive affective technologies. Addressing these gaps is essential to fully leverage MLLMs for robust, inclusive, and context-aware emotion recognition and remains a key challenge for future research.
Table 7.
Data requirements and remaining gaps for VLM- and MLLM-based emotion recognition.
| Requirement | Notes |
|---|---|
| Multimodality and contextual grounding | Joint availability of vision, audio, text, and contextual metadata is required; in-the-wild interactions and long temporal context remain underrepresented. |
| Granular and expressive labels | Need for combined categorical and dimensional (valence–arousal) annotations, including compound, subtle, and ambiguous affective states. |
| Scale, diversity, and balance | Large-scale datasets with demographic, cultural, and situational diversity are necessary; current resources suffer from imbalance and limited coverage. |
| Temporal continuity and alignment | Fine-grained temporal annotations and reliable cross-modal synchronization are critical for modeling affective dynamics yet remain scarce. |
| Accessibility and inclusive design | Data collection and annotation protocols should account for sensory impairments and support assistive and accessibility-oriented applications. |
The requirements summarized in
Table 7 highlight a fundamental mismatch between the capabilities of modern VLMs and MLLMs and the limitations of existing affective datasets. While current models can integrate heterogeneous modalities and reason over extended contexts, most emotion recognition datasets remain limited in modality coverage, temporal depth, and contextual richness.
5.2. Comparative Analysis: Pre-MLLMs Versus MLLMs
The transition from pre-MLLM architectures to multimodal large language models (MLLMs) represents a fundamental shift in the design paradigm of emotion recognition in video (ERV), moving from modular, task-specific pipelines toward unified frameworks capable of multimodal reasoning and contextual integration. Earlier deep learning approaches, including convolutional, recurrent, and 3D convolutional networks as well as pre-transformer multimodal architectures, were typically engineered for a fixed set of input modalities and target datasets, relying on separate encoders and explicitly designed fusion mechanisms for each information stream [
3,
4]. While these models achieved strong performance on specific benchmarks such as AffectNet, Aff-Wild2, and AVEC challenge datasets, their generalization abilities are often limited, with performance degrading under domain shift, annotation noise, or previously unseen interaction scenarios [
19,
20,
23].
MLLMs address many of these limitations by leveraging large-scale multimodal pretraining combined with instruction tuning to support integrated perception, reasoning, and explanation across text, audio, image, and video inputs [
10,
90,
91,
93,
94]. Rather than maintaining modality-specific pipelines and task-dependent fusion strategies, MLLMs map heterogeneous inputs into a shared semantic representation space. A large autoregressive language decoder then processes this representation to support reasoning and generation. This design enables flexible inference through prompting, supports zero- and few-shot emotion recognition, and allows models to generate natural language explanations of their predictions [
86,
99,
100,
101,
102,
106].
A key consequence of this architectural unification is improved generalization. Through exposure to diverse multimodal data during pretraining, MLLMs exhibit enhanced robustness to dataset bias and domain variation, enabling affect recognition in previously unseen settings with minimal or no task-specific supervision [
8,
85,
86,
87,
90]. At the same time, the use of a single multimodal interface allows emotion recognition, explanation, and downstream decision making to be performed within a coherent conversational framework, reducing the need for task-specific system redesign and facilitating interactive affective analysis [
88,
89,
94].
Another distinguishing characteristic of MLLMs is their capacity for language-based explainability. Unlike conventional classifiers that output discrete labels or continuous scores, MLLMs can articulate the rationale underlying their predictions, reference salient visual or acoustic cues, and express uncertainty in natural language. This capability aligns closely with emerging benchmarks on emotional intelligence and affective reasoning, which emphasize interpretability and contextual understanding in addition to predictive accuracy [
107,
108,
109]. Furthermore, instruction tuning, visual instruction tuning, and parameter-efficient adaptation techniques enable MLLMs to rapidly incorporate emotion-specific knowledge or adapt to new application domains using relatively modest amounts of supervised data [
99,
100,
101,
102,
106].
These properties make MLLMs particularly well suited for application-oriented ERV systems, including assistive technologies, conversational agents, and decision support tools, where emotion understanding must be tightly integrated with contextual reasoning and user interaction [
52,
110,
111]. However, the increased flexibility and expressive power of MLLMs also introduce new challenges related to dataset bias, annotation reliability, privacy, and ethical considerations, which are not fully addressed by scale alone and motivate the open issues discussed later in this paper [
23,
50,
52,
112].
It is important to note that this review does not aim to provide a quantitative meta-analysis or a unified performance ranking across methods. Reported evaluation metrics, such as accuracy, F1-score, or the concordance correlation coefficient (CCC), are highly sensitive to dataset characteristics, annotation protocols, evaluation settings, and training assumptions, all of which vary substantially across studies. This variability is further amplified when comparing pre-MLLM models, typically trained and evaluated in fully supervised, dataset-specific settings, with multimodal large language models, which are often assessed under zero-shot or few-shot configurations using prompt-based inference.
Across the surveyed literature, performance improvements were commonly reported within narrowly defined experimental contexts, such as within-subject or in-dataset evaluation on specific benchmarks (e.g., AffectNet, Aff-Wild2, and AVEC-related datasets). While such results demonstrate the effectiveness of individual models under controlled conditions, they are not directly comparable across studies due to differences in data splits, label definitions, temporal granularity, and evaluation protocols. Similarly, reported gains for MLLM-based approaches are frequently demonstrated through task-specific benchmarks, qualitative reasoning evaluations, or small-scale empirical studies that emphasize flexibility, generalization, or explanatory capabilities rather than optimized performance on standardized ERV datasets. For these reasons, direct numerical comparison of reported scores across datasets, model families, and evaluation paradigms can be misleading and may obscure the underlying trade-offs between methodological approaches. Instead, this review adopts a structured qualitative comparison strategy, synthesizing reported results within their experimental context and focusing on relative strengths, limitations, and deployment considerations.
Table 8 contrasts the representative characteristics of pre-MLLM- and MLLM-based approaches across multiple dimensions relevant to ERV, synthesizing trends reported in recent research papers, benchmark analyses, and empirical studies [
4,
5,
6,
86,
87,
89,
107,
108,
109,
113].
Table 8 distills the key trade-offs between pre-MLLM emotion recognition systems and MLLM-based approaches. Pre-MLLM models emphasize efficiency and task specificity, achieving strong performance under well-defined conditions but exhibiting limited robustness and adaptability. In contrast, MLLMs prioritize unified multimodal reasoning and language-based interaction, enabling broader task coverage, improved zero- or few-shot generalization, and enhanced interpretability at the cost of higher computational demands and increased deployment complexity.
It is important to note that the comparative advantages attributed to MLLMs over task-specific ERV models are primarily derived from qualitative analyses, benchmark-level evaluations, and limited empirical studies rather than from standardized, head-to-head comparisons under shared experimental conditions. Unlike classical and deep learning ERV systems, which are typically evaluated using fixed datasets, well-defined metrics, and reproducible training protocols, MLLMs are often assessed through heterogeneous benchmarks, zero-shot or prompt-based settings, and, in many cases, proprietary model implementations.
As a result, reported gains in robustness, generalization, and reasoning should be interpreted with caution. The performance improvements observed for MLLMs frequently depend on prompt formulation, task framing, and access to large-scale pretraining data that are not publicly disclosed. Moreover, many existing evaluations focus on small benchmark subsets, qualitative reasoning tasks, or human-in-the-loop assessments, which limits direct comparability with task-specific ERV models trained and evaluated under controlled conditions.
Consequently, while current evidence suggests that MLLMs offer enhanced flexibility, multimodal reasoning, and explanatory capabilities, their maturity and reliability for emotion recognition in video remain constrained by the absence of standardized benchmarks, shared evaluation protocols, and transparent reporting practices. Establishing fair and reproducible comparison frameworks between task-specific models and MLLMs remains an open challenge and a prerequisite for drawing stronger quantitative conclusions.
5.3. Performance and Robustness
Performance evaluation in ERV has historically relied on challenge-based benchmarks such as AVEC and Aff-Wild2, which established reference baselines for audio–visual and continuous valence–arousal prediction under realistic, in-the-wild conditions. Early task-specific deep learning models typically achieved concordance correlation coefficients (CCCs) and F1 scores in the moderate range, reflecting the intrinsic difficulty of modeling affective behavior in unconstrained environments characterized by annotation noise, subject variability, and contextual ambiguity [
19,
20].
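The concordance correlation coefficient used by these challenges combines Pearson correlation with penalties for mean and variance mismatch, so a well-correlated but biased prediction still scores below 1. A direct NumPy implementation of the standard definition:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient, the reference metric in
    AVEC/Aff-Wild2 for continuous valence-arousal prediction."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

t = np.linspace(-1, 1, 100)                    # ground-truth valence trace
print(round(ccc(t, t), 3))                     # 1.0: perfect agreement
print(round(ccc(t, t + 0.5), 3))               # < 1: mean shift is penalized
print(ccc(t, np.full_like(t, t.mean())))       # ~0: constant prediction gets no credit
```

The mean-shift term in the denominator is what distinguishes the CCC from plain Pearson correlation, which would score the biased prediction above as perfect.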
Subsequent advances in transformer-based architectures and multimodal fusion strategies led to measurable performance gains by improving temporal modeling, long-range dependency capture, and cross-modal alignment. Approaches incorporating self-attention, bilinear fusion, and hybrid convolutional-transformer designs consistently outperformed earlier CNN–RNN pipelines across multiple benchmarks, particularly in settings involving long video sequences or complex audio–visual interactions [
61,
66,
81]. Foundational architectural ideas underlying cross-modal attention and tensor-based fusion were first explored in multimodal sentiment and language analysis [
114,
115] and later adapted to emotion recognition in video. Nevertheless, these improvements remained largely bounded by dataset-specific supervision and often failed to generalize robustly across domains or annotation schemes.
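Tensor-based fusion of the kind pioneered in that multimodal sentiment work forms the outer product of per-modality embeddings, each padded with a constant 1 so that unimodal features survive alongside pairwise interactions. A minimal bimodal sketch, with toy embedding sizes:

```python
import numpy as np

def tensor_fusion(z_a, z_v):
    """Tensor-fusion-style bimodal interaction: outer product of the two
    modality embeddings, each padded with a constant 1 so that unimodal
    terms are retained alongside bimodal interaction terms."""
    za = np.concatenate([z_a, [1.0]])
    zv = np.concatenate([z_v, [1.0]])
    return np.outer(za, zv).ravel()

audio_emb = np.array([0.2, -0.5])
visual_emb = np.array([0.7, 0.1, -0.3])
fused = tensor_fusion(audio_emb, visual_emb)
print(fused.shape)  # (12,): (2+1) x (3+1) interaction features
```

The multiplicative growth of this representation with each added modality is one reason later work moved toward attention-based fusion.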
More recent evaluations of VLMs and MLLMs indicate a further shift in both performance and robustness characteristics. GPT-4V with Emotion [
86] demonstrated competitive and in some cases superior zero-shot performance relative to supervised baselines on several emotion recognition benchmarks, while simultaneously providing natural language rationales that enhance interpretability. Vaiani et al. [
87] showed that multimodal LLMs can recognize emotions from video with limited task-specific supervision and flexibly adapt to alternative label taxonomies through prompting, reducing reliance on fixed annotation schemas.
Recent empirical analyses, including the evaluation study by Bhattacharyya and Wang [
107], suggest that MLLMs can exhibit improved cross-dataset generalization compared with conventional VLMs, particularly under domain shift. These effects are most apparent when transferring across datasets with heterogeneous recording conditions, cultural contexts, or annotation schemes. Beyond standard accuracy metrics, preliminary human-centered evaluations indicate that MLLM-generated emotion interpretations may be perceived as more coherent and contextually grounded, highlighting the potential benefits of integrated multimodal reasoning. However, these findings remain preliminary and warrant further validation through standardized benchmarks and large-scale comparative studies.
Nevertheless, recent benchmark initiatives such as EmoBench and EmoBench-M [
108,
109] reveal persistent limitations. Current MLLMs continue to struggle with fine-grained affect distinctions, subtle or mixed emotional states, and cross-cultural variability in emotional expression. Performance remains sensitive to dataset composition, prompt formulation, and evaluation protocol, and models may exhibit overconfident or internally inconsistent reasoning when confronted with ambiguous affective cues. These findings suggest that while MLLMs improve robustness and flexibility relative to earlier approaches, their affective competence remains uneven and highly dependent on the data and evaluation design.
5.4. Reasoning, Explainability, and Interaction
A central distinction between multimodal large language models (MLLMs) and pre-MLLM emotion recognition architectures lies in the integration of perceptual inference with language-based reasoning. Conventional emotion recognition systems typically output categorical labels or continuous affective scores, offering limited transparency into the underlying decision process. Interpretability in such models is generally restricted to post hoc analyses, including attention visualizations, saliency maps, or feature ablation studies, which provide indirect and often ambiguous explanations of model behavior [
1,
2].
In contrast, MLLMs support explicit, language-mediated reasoning over multimodal inputs, enabling them to generate natural language explanations that reference salient visual, acoustic, linguistic, or contextual cues, such as facial muscle tension, vocal prosody, lexical content, or interactional context, while also expressing uncertainty or alternative affective interpretations [
86,
87,
106,
107]. This capability shifts explainability from an external diagnostic tool to an intrinsic model output, allowing emotion recognition decisions to be contextualized, interrogated, and refined through interaction.
Beyond interpretability, the reasoning capabilities of MLLMs enable more interactive and adaptive affective systems. Emotion-aware assistants, including early prototypes such as Audo-Sight and EmoAssist [
110,
111], illustrate how multimodal reasoning can be leveraged to describe perceived emotional states, relate them to situational context, and dynamically adapt feedback or guidance in response to user queries. Such interactional flexibility is particularly relevant in accessibility-oriented applications, mental health support, and educational settings, where affective understanding must be integrated with dialogue, explanation, and user intent [
52].
In real deployments, these advances introduce new technical and ethical challenges. Language-based explanations generated by MLLMs may convey a false sense of certainty, obscure underlying model limitations, or rationalize incorrect predictions in a persuasive manner. In emotionally sensitive contexts, this raises concerns regarding transparency, emotional manipulation, user overreliance, and the appropriate calibration of trust [
50,
52,
112]. Addressing these issues requires not only improved evaluation protocols for reasoning and explainability, but also careful system design that explicitly accounts for uncertainty, user agency, and ethical deployment constraints.
5.5. Limitations and Trade-Offs
Despite their representational flexibility and reasoning capabilities, multimodal large language models (MLLMs) are not a universal replacement for pre-MLLM emotion recognition architectures. A primary limitation concerns computational cost: both pretraining and inference typically require specialized hardware, large memory budgets, and careful system-level optimization. In contrast, smaller convolutional- or transformer-based models remain well suited for real-time or on-device ERV deployments, where latency, energy consumption, and memory footprint are critical [
1,
81].
A second challenge arises from the reliance of MLLMs on large, opaque, web-scale training corpora. The limited transparency of these data sources complicates reproducibility, bias auditing, and fairness assessment, particularly in socially sensitive affective applications [
52,
98]. Unlike task-specific emotion recognition datasets with well-documented annotation protocols, web-scale multimodal data introduce heterogeneous labeling practices and cultural biases that are difficult to characterize or mitigate post hoc. Moreover, while MLLMs can generate natural language explanations for affective predictions, these explanations are not guaranteed to be faithful reflections of the underlying decision process [
108,
109].
Beyond computational cost and data dependency, additional practical drawbacks of current MLLM-based approaches warrant explicit consideration. A key limitation concerns reproducibility; many reported results rely on closed-source or rapidly evolving proprietary models, making it difficult to replicate findings or perform controlled ablation studies. This lack of transparency complicates fair comparison with task-specific ERV systems, which are typically evaluated using fixed architectures, public datasets, and reproducible training pipelines.
Another practical concern is sensitivity to prompt formulation and task framing. MLLM performance can vary substantially depending on prompt wording, contextual cues, or instruction style, introducing an additional source of variability that is largely absent in conventional supervised ERV models. In affective analysis, this sensitivity may lead to inconsistent predictions across semantically similar inputs or evaluation set-ups.
Finally, in emotionally ambiguous or subtle scenarios, MLLM-generated rationales may be fluent yet misleading, reflecting hallucinated explanations or uncertain internal evidence. This can obscure failure modes and complicate the calibration of trust in high-stakes settings, especially when evaluation protocols do not explicitly test for faithfulness, uncertainty communication, or explanation consistency [
108,
109].
Collectively, these issues highlight that practical deployment challenges for MLLMs extend beyond computational considerations, encompassing reproducibility, reliability, interpretability, and failure transparency. Addressing these limitations will require not only improved model architectures but also standardized evaluation practices, open benchmarks, and clearer reporting guidelines for multimodal affective reasoning systems.
By contrast, conventional deep learning architectures continue to offer favorable trade-offs in settings where task definitions are clear, training data are well curated, and computational resources are limited. Carefully optimized CNN- and transformer-based emotion recognition systems often achieve competitive performance on established benchmarks such as AffectNet, Aff-Wild2, and AVEC while maintaining lower operational cost and reduced governance complexity [
4,
20].
Taken together, these considerations suggest that MLLMs and pre-MLLM architectures should be viewed as complementary rather than competing solutions. MLLMs are most advantageous in scenarios that require rich multimodal context integration, flexible interaction, and language-based explanation, whereas specialized deep models remain effective for focused, resource-constrained ERV deployments. Building on this comparison, the following sections discuss open challenges related to data availability, evaluation methodology, ethical considerations, and application-specific design choices for emotionally intelligent multimodal systems.
6. Challenges and Future Directions of Research
Despite substantial progress in emotion recognition in video, particularly with the emergence of multimodal large language models, several fundamental challenges remain unresolved. These challenges span data availability and quality, evaluation methodology, robustness, explainability, ethical considerations, and practical deployment constraints. Moreover, recent advances open promising avenues for future research aimed at addressing these limitations and advancing toward more reliable, interpretable, and socially responsible affective systems. This section reviews key open issues and outlines potential research directions for the next generation of ERV models.
6.1. Challenges and Open Issues
Although the field has developed quickly in recent years, especially with the introduction of multimodal large language models, emotion recognition in video (ERV) still faces a number of open problems. Several challenges identified in early deep learning systems remain unresolved, and some have become more pronounced now that models attempt to combine multiple modalities and reason about context.
The first challenge is the limited availability of balanced and diverse multimodal datasets. Most existing corpora do not cover sufficient demographic or cultural variation, which makes it difficult for models to generalize reliably. Approaches based on synthetic data and self-supervised learning may help reduce this gap. Another long-standing problem is the subjectivity of emotion labels. Annotations are often noisy or interpreted differently by individual raters, which motivates the use of continuous or probabilistic affect models and methods that explicitly account for annotator variation. Cultural and demographic bias adds another layer of complexity, since people express emotions in different ways, and fairness-aware or cross-cultural adaptation techniques are needed to avoid systematic errors.
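To make the annotator-variation point concrete, the following minimal Python sketch (all labels, vote counts, and function names are illustrative, not drawn from any cited system) shows how disagreement among raters can be preserved as a soft target distribution and scored with cross-entropy, rather than collapsed into a single majority label:

```python
import math

def soft_label(votes, categories):
    """Turn raw annotator votes into a probability distribution
    instead of collapsing them to a majority label."""
    total = len(votes)
    return [votes.count(c) / total for c in categories]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against the
    (soft) annotator target; lower is better."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

CATEGORIES = ["happy", "neutral", "sad"]  # illustrative label set

# Five annotators disagree: a hard label would discard the split.
votes = ["happy", "happy", "neutral", "happy", "neutral"]
target = soft_label(votes, CATEGORIES)   # [0.6, 0.4, 0.0]

# A model that reproduces the disagreement is penalized less than
# one that is overconfident in the majority class.
calibrated    = [0.6, 0.4, 0.0]
overconfident = [0.99, 0.01, 0.0]
assert cross_entropy(target, calibrated) < cross_entropy(target, overconfident)
```

Training against such soft targets is one simple way to penalize models that are more confident than the annotators themselves; continuous and probabilistic affect models generalize this idea.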
Privacy and ethical issues also remain central. Emotional behavior constitutes highly sensitive information, and prior work has stressed the importance of consent, data protection, and clear governance frameworks [
52,
112]. Technical constraints come into play as well; transformer-based systems and MLLMs require significant memory and computational resources, which makes real-time deployment difficult on resource-constrained devices. Techniques such as parameter-efficient tuning, model distillation, and hardware-aware optimization are therefore important. Robustness remains an open challenge, as ERV models often struggle with occlusion, lighting changes, motion blur, and other in-the-wild effects. Benchmarking studies [
23,
50] and large-scale evaluations such as AVEC [
19] show that issues related to annotation quality and domain shift continue to affect performance.
Beyond facial and vocal cues, ERV increasingly needs to consider contextual information present directly within the video. Text appearing on screen, such as subtitles, signs, or meme captions, can influence how emotions are interpreted. Detecting and recognizing such text typically relies on detection-based or segmentation-based text spotting methods [
116,
117,
118,
119].
These challenges have concrete practical consequences. For instance, mental health monitoring requires privacy-preserving and culturally fair systems; educational or interactive settings depend on models that remain stable under changing conditions; and assistive technologies require context-aware and personalized interpretations. As noted, recent performance gains often come at the cost of higher computational requirements, which reinforces the need for lighter and more energy-efficient models. Overall, ERV research is moving toward systems that are more interpretable, more aware of social context, and better aligned with human expectations.
6.2. Future Directions of Research
Recent advances in emotion recognition in video have substantially expanded the range of modeling approaches and applications while also revealing a set of persistent challenges that motivate continued research. As discussed throughout the paper, the field is progressively shifting from perception-driven, task-specific pipelines toward multimodal frameworks that integrate contextual cues, support higher-level reasoning, and enable flexible interaction. This transition motivates several interrelated directions for future research.
A first direction concerns personalization and user-aware adaptation. Emotional expression exhibits substantial inter-individual, cultural, and neurodiversity-related variability, which limits the effectiveness of population-level models. Future ERV systems will therefore need to adapt to individual users. This includes user-specific calibration, adaptive baselines, and conditioning on personal, social, and interaction history. Such capabilities are particularly critical in healthcare, education, and assistive scenarios, where misinterpretation of affect can have significant consequences. Emerging approaches, including few-shot learning, meta-learning, and prompt-based adaptation in MLLMs, offer promising avenues for achieving personalization without extensive retraining or data collection overhead.
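As an illustration of user-specific calibration, the sketch below (a toy abstraction; the single scalar "affect score" and all names are assumptions, not a cited method) maintains a per-user running baseline with Welford's algorithm and interprets new scores relative to that baseline rather than against a population norm:

```python
class UserBaseline:
    """Illustrative per-user calibration: track a running mean and
    variance of a raw affect score (Welford's algorithm) and express
    new observations relative to the user's own baseline."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, score):
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (score - self.mean)

    def calibrated(self, score):
        if self.n < 2:
            return 0.0  # no baseline established yet
        std = (self.m2 / (self.n - 1)) ** 0.5
        return (score - self.mean) / std if std > 0 else 0.0

# A habitually expressive user: raw score 0.7 is near their norm.
user = UserBaseline()
for s in [0.6, 0.8, 0.7, 0.75, 0.65]:
    user.update(s)
print(round(user.calibrated(0.7), 2))   # close to 0: typical for this user
print(round(user.calibrated(0.95), 2))  # clearly above the user's baseline
```

The same reading (0.7) that is neutral for an expressive user could be a strong positive signal for a more reserved one, which is precisely the population-level limitation personalization aims to address.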
A second research direction focuses on learning under low-resource conditions and improving robustness to domain shift. Despite recent progress, ERV models remain constrained by data scarcity, annotation subjectivity, and demographic imbalance. Consequently, increasing attention is being devoted to self-supervised and weakly supervised representation learning, domain adaptation techniques, uncertainty-aware modeling, and synthetic affect generation. Cross-dataset and cross-cultural evaluation protocols remain essential for assessing generalization, while recent advances in spatiotemporal fusion and emotion-aware pretraining demonstrate the potential for mitigating dataset-specific biases [
114,
120].
A further line of work centers on multimodal reasoning and interaction-oriented system design. As highlighted in earlier sections, ERV is moving beyond isolated recognition toward explanation, grounding, and conversational interaction. Future systems are expected to jointly model emotion, sentiment, engagement, and intent within unified multimodal reasoning frameworks [
72,
80,
121]. This shift will likely be supported by multimodal knowledge representations, causal affect models, and real-time feedback mechanisms, enabling applications such as socially responsive avatars, intelligent tutoring systems, and human–robot interaction. Group-level dynamics and immersive environments further require context-sensitive and temporally extended affect modeling.
Another important direction involves improving computational efficiency and deployability. While MLLMs offer strong zero- and few-shot capabilities, their computational and energy requirements limit applicability in latency-sensitive or resource-constrained settings. Ongoing research on parameter-efficient adaptation, modular and adapter-based architectures, model compression, and hybrid cloud–edge deployment aims to bridge this gap, particularly for healthcare, assistive technologies, and wearable systems. These efforts reflect the broader need to balance representational power with practical deployment constraints.
Future ERV systems must also generalize across diverse datasets, modalities, and interaction contexts. Broadly trained, cross-domain foundation models [
122] are emerging as a potential pathway toward improved transferability. At the same time, deeper modeling of temporal and social dynamics, such as long-term affect trajectories, memory-like mechanisms, and interaction history, will be necessary to capture emotional evolution over extended interactions and complex social settings.
Finally, progress toward trustworthy and ethically grounded ERV remains a critical concern. Privacy-preserving learning paradigms, including federated and decentralized multimodal training, are increasingly important for handling sensitive affective data. Improved interpretability, uncertainty estimation, and transparent reporting can support appropriate user trust, while safeguards against affect manipulation and culturally biased inference are essential for responsible deployment. Evaluation protocols must also evolve to explicitly account for demographic and cultural diversity.
Collectively, these research directions point toward ERV systems that are more context-aware, robust, and socially responsible, with multimodal reasoning capabilities that extend beyond perception alone. Continued progress will depend not only on advances in model architectures but also on improved datasets, evaluation methodologies, and ethical governance frameworks, setting the stage for the concluding remarks of this paper.
6.3. Ethical Considerations, Bias, and Fairness in Emotion Recognition in Video
Bias and ethical considerations have emerged as central challenges in affective computing, particularly as models evolve from laboratory-scale prototypes toward deployment in real-world, interactive, and assistive settings. While bias is frequently acknowledged in the literature, its treatment is often inconsistent, and comprehensive analyses of ethical implications and mitigation strategies remain limited, especially when compared with the rapid methodological advances in deep learning, transformer-based models, and multimodal large language models (MLLMs).
A primary source of bias in ERV systems originates from dataset composition and annotation practices. Widely used emotion datasets are typically collected under constrained recording conditions and exhibit imbalanced distributions with respect to age, gender, ethnicity, cultural background, and situational context. As discussed in earlier sections of this review, such limitations directly affect cross-dataset generalization and contribute to performance degradation under domain shift. Subjective annotation processes further exacerbate this issue, as affect labels often reflect annotator-specific interpretations that vary across cultures and social norms, introducing systematic label noise and latent bias into model training and evaluation.
Recent transformer-based architectures and MLLMs, despite their superior representational capacity and contextual modeling abilities, do not inherently resolve these biases. On the contrary, large-scale pretraining on web-derived data may amplify demographic and cultural imbalances already present in the training corpora. As highlighted in studies reviewed in
Section 4 and
Section 5, vision–language models and MLLMs inherit biases from both visual and textual modalities, which can manifest as uneven emotion recognition performance across demographic groups or as culturally skewed affect interpretations. The opacity of internal representations and reliance on proprietary or weakly curated pretraining data further complicate bias auditing and accountability.
Existing ERV research has addressed bias-related challenges primarily through implicit mitigation strategies, such as data augmentation, domain adaptation, and robustness-oriented architectural design. While these approaches can improve average performance and stability, they rarely incorporate explicit fairness-aware objectives, demographic stratification in evaluation protocols, or bias-sensitive loss functions. More principled mitigation strategies, such as balanced dataset construction, representation debiasing during training, domain-invariant feature learning, and post hoc calibration, have been explored in isolated studies but remain largely disconnected from mainstream ERV pipelines and benchmark evaluations.
The emergence of MLLMs introduces additional ethical dimensions beyond predictive accuracy, including explainability, transparency, and responsible interaction. Unlike earlier ERV systems that output categorical labels or continuous affective scores, MLLMs generate free-form natural language responses and engage in interactive reasoning. While this capability enables explanation generation and uncertainty expression, it also raises new risks related to misleading rationales, overconfident explanations, and the propagation of latent biases through language generation. As discussed in recent benchmark-oriented works, current evaluation frameworks for MLLMs in affective computing focus predominantly on task performance, with limited consideration of fairness, bias sensitivity, or ethical robustness.
Beyond the identification of ethical risks, the recent ERV literature has begun to explore methodological strategies aimed at mitigating bias, improving fairness, and enhancing robustness. Debiasing approaches commonly focus on reducing the influence of confounding factors such as gender, ethnicity, age, and cultural background, which are often entangled with affective labels in real-world datasets. Representative techniques include dataset rebalancing, sample reweighting, adversarial feature disentanglement, and domain-invariant representation learning, where auxiliary objectives are introduced to suppress demographic or dataset-specific cues while preserving emotion-relevant information.
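A minimal example of one such strategy, sample reweighting, is sketched below (group labels and counts are invented for illustration): each example is weighted inversely to its demographic group's frequency, so that every group contributes equally to the aggregate training loss:

```python
from collections import Counter

def inverse_frequency_weights(groups):
    """Illustrative sample reweighting: give each example a weight
    inversely proportional to the frequency of its demographic group,
    so each group contributes equally to the training loss."""
    counts = Counter(groups)
    n_groups = len(counts)
    total = len(groups)
    # weight = total / (n_groups * count[g]); weights sum to `total`
    return [total / (n_groups * counts[g]) for g in groups]

# A skewed dataset: group "A" is four times as common as group "B".
groups = ["A"] * 8 + ["B"] * 2
weights = inverse_frequency_weights(groups)

# Each group now carries the same aggregate weight in the loss.
weight_a = sum(w for g, w in zip(groups, weights) if g == "A")
weight_b = sum(w for g, w in zip(groups, weights) if g == "B")
assert abs(weight_a - weight_b) < 1e-9
assert abs(sum(weights) - len(groups)) < 1e-9
```

Such reweighting addresses only representation imbalance; it does not remove demographic cues entangled in the features themselves, which motivates the adversarial and domain-invariant approaches mentioned above.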
Fairness-aware training paradigms have also been investigated, particularly in the context of facial emotion recognition. These methods typically incorporate fairness constraints or regularization terms that aim to equalize performance across demographic subgroups or minimize disparities in error rates. Multi-task learning frameworks, adversarial debiasing architectures, and causal-inspired formulations have been proposed to explicitly control for sensitive attributes during training, although their effectiveness remains closely tied to the availability and reliability of demographic annotations.
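The following toy sketch illustrates the general shape of such a fairness-regularized objective (the two-group setting, loss values, and weighting are assumptions, not a reproduction of any cited method): the mean task loss is augmented with a penalty on the gap between per-group mean losses:

```python
def fairness_regularized_loss(losses, groups, lam=1.0):
    """Illustrative fairness-aware objective: mean task loss plus a
    penalty on the gap between per-group mean losses. The two-group
    setting and lambda value are assumptions for illustration."""
    def group_mean(g):
        vals = [l for l, gr in zip(losses, groups) if gr == g]
        return sum(vals) / len(vals)
    task = sum(losses) / len(losses)
    gap = abs(group_mean("A") - group_mean("B"))
    return task + lam * gap

groups = ["A", "A", "B", "B"]
balanced = fairness_regularized_loss([0.55, 0.55, 0.55, 0.55], groups)
skewed   = fairness_regularized_loss([0.2, 0.3, 0.8, 0.9], groups)
# Same mean task loss, but the skewed model pays the fairness penalty.
assert skewed > balanced
```

As the surrounding text notes, such penalties presuppose reliable demographic annotations, which are often unavailable or themselves noisy in affective datasets.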
In parallel, adversarial robustness and reliability under distribution shift have emerged as important concerns for ERV deployment. Studies have shown that emotion recognition models are vulnerable to adversarial perturbations, occlusions, compression artifacts, and environmental noise, which can disproportionately affect certain populations or interaction contexts. Robust training strategies, including data augmentation, adversarial training, uncertainty modeling, and consistency regularization, have been explored to improve stability under such conditions, though comprehensive evaluation under real-world attack and degradation scenarios remains limited.
Despite these advances, existing debiasing and robustness techniques are often evaluated in controlled experimental settings and on limited benchmarks, making it difficult to assess their generalizability. Moreover, the increasing adoption of large-scale and multimodal pretrained models introduces new ethical challenges related to opaque training data, inherited societal biases, and limited controllability. As a result, the development of standardized fairness benchmarks, transparent reporting practices, and robust evaluation protocols remains an open research direction for building trustworthy and ethically aligned ERV systems.
Overall, the surveyed literature indicates a clear gap between technical advances in ERV and the systematic treatment of ethical and bias-related issues. Addressing this gap requires the development of standardized bias evaluation protocols, fairness-aware training and adaptation strategies, and more inclusive, demographically diverse datasets. These challenges are particularly critical for socially sensitive and assistive applications, where biased emotion recognition can have a disproportionate negative impact. Future research must therefore integrate ethical considerations as a first-class objective alongside accuracy and efficiency to ensure responsible and trustworthy deployment of ERV systems.
6.4. Actionable Research Directions for Advanced ERV Systems
Future work in affective computing should place greater emphasis on methodologically rigorous and implementation-oriented research directions that transform conceptual challenges into well-defined algorithmic and system-level solutions.
A first priority concerns data modeling and supervision strategies. Beyond collecting larger datasets, future work should explore weakly supervised, self-supervised, and cross-modal pretraining objectives that reduce dependence on densely annotated affect labels. Techniques such as contrastive multimodal pretraining, temporal consistency regularization, and pseudo-label refinement across modalities offer promising directions for leveraging uncurated or partially labeled video data while preserving affective discriminability.
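As a concrete instance of contrastive multimodal pretraining, the sketch below implements a minimal InfoNCE objective over toy video and audio embeddings (batch size, dimensionality, and temperature are illustrative assumptions): each clip's video embedding should score highest against its own audio embedding within the batch:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(video_embs, audio_embs, temperature=0.1):
    """Illustrative InfoNCE objective for cross-modal pretraining:
    each video clip embedding should score highest against its own
    audio embedding and lower against the other clips in the batch."""
    loss = 0.0
    for i, v in enumerate(video_embs):
        logits = [dot(v, a) / temperature for a in audio_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)  # softmax cross-entropy at index i
    return loss / len(video_embs)

# Toy batch of 3 clips with (already normalized) 2-D embeddings.
aligned_video  = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
aligned_audio  = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # matched pairs
shuffled_audio = [[0.0, 1.0], [0.7, 0.7], [1.0, 0.0]]   # broken pairing
assert info_nce(aligned_video, aligned_audio) < info_nce(aligned_video, shuffled_audio)
```

The key property for ERV pretraining is that no affect labels are needed: the natural co-occurrence of video and audio within a clip supplies the supervision signal.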
From a modeling perspective, explicit temporal abstraction and sparsification remain underexplored in ERV. Transformer-based architectures and MLLMs would benefit from hierarchical temporal representations, adaptive frame or token selection, and sparse attention mechanisms that decouple short-term facial dynamics from long-range contextual reasoning. Such designs are critical to scaling ERV models to long-duration videos and real-time streams without prohibitive computational cost.
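A minimal sketch of adaptive frame selection (the scores and top-k rule are illustrative assumptions) shows the basic mechanism: rank frames by a saliency score, keep the k highest, and restore temporal order before sequence modeling, so that attention is computed over k tokens rather than the full sequence:

```python
def select_top_k_frames(frame_scores, k):
    """Illustrative adaptive frame selection: keep the k frames with the
    highest saliency scores (e.g., motion or expression change), but
    preserve their temporal order for downstream sequence modeling."""
    ranked = sorted(range(len(frame_scores)),
                    key=lambda i: frame_scores[i], reverse=True)[:k]
    return sorted(ranked)  # restore temporal order

# 8 frames; scores peak around a brief expression change.
scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1, 0.7, 0.2]
kept = select_top_k_frames(scores, k=3)
assert kept == [2, 3, 6]
```

Because self-attention cost grows quadratically with sequence length, reducing T frames to k selected tokens cuts the attention cost from O(T²) to O(k²), which is what makes long-duration and streaming ERV tractable.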
In terms of learning objectives and optimization, future ERV systems should move toward multi-objective training formulations that jointly optimize recognition accuracy, uncertainty calibration, and robustness under distribution shift. For multimodal architectures, this includes modality-aware regularization and dynamic reweighting strategies that explicitly handle modality dominance, missing modalities, or asynchronous inputs during both training and inference.
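The sketch below illustrates one common way to prepare a fusion model for missing or dominant modalities during training, modality dropout (the feature dictionary and drop probability are assumptions): entire streams are randomly zeroed so the model cannot rely on any single modality, while at least one stream is always preserved:

```python
import random

def modality_dropout(features, p_drop=0.3, rng=None):
    """Illustrative modality dropout for training: randomly zero out
    entire modality streams so the fusion model learns not to rely on
    any single one, but never drop every modality at once."""
    rng = rng or random.Random()
    kept = {m: f for m, f in features.items() if rng.random() >= p_drop}
    if not kept:  # guarantee at least one surviving modality
        m = rng.choice(list(features))
        kept[m] = features[m]
    # Dropped modalities are replaced by zero vectors of the same size.
    return {m: features[m] if m in kept else [0.0] * len(features[m])
            for m in features}

rng = random.Random(0)
features = {"video": [0.5, 0.1], "audio": [0.9, 0.2], "text": [0.3, 0.7]}
out = modality_dropout(features, p_drop=0.5, rng=rng)
assert set(out) == {"video", "audio", "text"}               # shapes preserved
assert any(any(v != 0.0 for v in f) for f in out.values())  # >= 1 stream kept
```

At inference time a genuinely missing stream (e.g., muted audio) is then handled by the same zero-vector convention the model saw during training.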
Evaluation protocols also require greater technical rigor and granularity. Rather than relying solely on aggregate accuracy or F1 scores, future benchmarks should incorporate protocol-level stress testing, including controlled perturbations of temporal alignment, modality dropout, and noise injection. For MLLM-based systems, evaluation should further assess temporal reasoning consistency, explanation stability across prompts, and sensitivity of generated affective descriptions to input perturbations.
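The following toy harness sketches such a protocol-level stress test (the threshold "model", samples, and perturbations are all invented for illustration): accuracy is reported separately on the clean set and under each named perturbation, rather than as a single aggregate:

```python
import random

def accuracy(model, samples):
    return sum(model(x) == y for x, y in samples) / len(samples)

def stress_report(model, samples, perturbations):
    """Illustrative protocol-level stress test: report accuracy on the
    clean set and under each named perturbation, so robustness gaps are
    visible instead of being averaged away."""
    report = {"clean": accuracy(model, samples)}
    for name, perturb in perturbations.items():
        report[name] = accuracy(model, [(perturb(x), y) for x, y in samples])
    return report

# Toy setup: inputs are feature lists; the "model" thresholds their mean.
model = lambda x: int(sum(x) / len(x) > 0.5)
samples = [([0.8, 0.9], 1), ([0.2, 0.1], 0), ([0.7, 0.6], 1), ([0.3, 0.4], 0)]

rng = random.Random(42)
perturbations = {
    "noise":   lambda x: [v + rng.gauss(0, 0.4) for v in x],  # noise injection
    "dropout": lambda x: [0.0 for _ in x],                    # all features lost
}
report = stress_report(model, samples, perturbations)
assert report["clean"] == 1.0
assert report["dropout"] <= report["clean"]
```

Reporting the per-perturbation breakdown, rather than a single averaged score, is what exposes which failure modes a given architecture actually has.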
Finally, system-level integration and deployment constraints must be addressed more explicitly. Research should consider memory-efficient inference, parameter-efficient adaptation, and on-device or edge deployment strategies, particularly for assistive and interactive ERV applications. These considerations are especially relevant for MLLMs, where practical deployment remains limited by computational and energy requirements.
Collectively, these directions emphasize a shift from incremental architectural refinement toward scalable, robust, and deployable ERV systems, bridging the gap between methodological advances and real-world applicability.
6.5. Practical Applications and Deployment Scenarios
Emotion recognition in video streams has been investigated across a broad range of application domains, with the selection of modeling paradigms largely influenced by deployment constraints, including computational resources, inference latency, robustness requirements, and interaction complexity. Classical machine learning pipelines and early deep learning models have primarily been applied in controlled or resource-constrained environments, such as laboratory-based affect analysis, surveillance-oriented monitoring, and embedded human–AI interaction systems. Their relatively low computational cost, modular processing structure, and stable inference behavior facilitated early deployment, despite limited robustness and generalization under unconstrained conditions.
The adoption of unimodal and multimodal deep learning architectures expanded ERV applications toward interactive and socially aware systems, including affect-aware dialogue interfaces, multimedia content analysis, and driver or operator monitoring. In particular, audio–visual fusion models demonstrated improved resilience to noise, partial occlusion, and modality-specific degradation, enabling deployment in semi-controlled real-world scenarios such as vehicle cabins, smart classrooms, and multimedia analysis pipelines. However, these approaches typically require task-specific training and careful cross-modal synchronization, which can limit scalability and transferability across application domains.
Transformer-based architectures further extended ERV capabilities by supporting longer temporal context modeling and more complex multimodal interactions. These models have been applied to tasks such as continuous affect tracking in naturalistic video, video-based mental health assessment, and social signal processing. While self-attention mechanisms improve the modeling of long-range temporal dependencies and contextual relationships, the associated computational and memory demands often restrict deployment to server-side or offline processing environments.
Multimodal large language models introduce a distinct application paradigm by integrating emotion recognition with contextual reasoning, explanation generation, and interactive dialogue. As discussed in
Section 5, MLLMs enable use cases such as assistive technologies for visually impaired users, emotion-aware conversational agents, and decision support systems, where affective understanding must be combined with semantic interpretation and user interaction. Despite their increased flexibility and expressiveness, practical deployment of MLLMs is currently constrained by computational cost, inference latency, and limited controllability, favoring cloud-based or hybrid deployment configurations.
Overall, practical ERV deployment reflects a recurring trade-off between computational efficiency and representational expressiveness. Lightweight classical and deep learning models remain suitable for real-time, embedded, or privacy-sensitive applications, whereas MLLMs are more appropriate for context-rich and interaction-intensive scenarios that require reasoning and explanation. This comparative perspective highlights that effective ERV deployment depends on aligning application requirements with system capabilities, rather than adopting a single dominant modeling approach.
6.6. Practical Applications Beyond Video-Based Emotion Recognition
Building upon the practical deployment scenarios discussed above, recent application-oriented studies in adjacent affective computing domains further illustrate how emotion and affective state recognition systems are deployed under real-world constraints. While these works do not exclusively focus on video-based emotion recognition, they provide complementary methodological and application-level insights that are directly relevant to the design, evaluation, and deployment of ERV systems.
In particular, applications in driver state and fatigue monitoring have explored efficient affective and cognitive state decoding using information-theoretic feature selection strategies, such as normalized mutual information combined with lightweight classifiers including extreme learning machines. These approaches have demonstrated practical advantages in safety-critical and resource-constrained environments, highlighting how information-driven modeling can support robust affective state recognition under strict latency and computational constraints. More advanced deployment scenarios, such as continuous driver fatigue detection, have adopted graph-based deep learning architectures, including graph attention convolutional neural networks with mutual information-driven connectivity. By explicitly modeling relational dependencies between temporal segments, physiological signals, or behavioral cues, these methods enable adaptive representation learning that improves robustness in dynamic operational settings. Conceptually, these architectures parallel recent ERV approaches that leverage attention mechanisms and structured modeling to capture dependencies across facial regions, temporal windows, or multimodal inputs.
In parallel, EEG-based affective recognition systems have been widely investigated in applied contexts such as workload assessment, fatigue monitoring, and emotion analysis. Recent application-focused studies and surveys emphasize challenges related to inter-subject variability, annotation subjectivity, and generalization across recording conditions. Although EEG-based systems rely on different sensing modalities, many of the identified deployment challenges, such as robustness, domain adaptation, and explainability, are shared with video-based emotion recognition and multimodal ERV systems.
From an application perspective, these adjacent domains reflect a broader convergence toward information-aware, attention-based, and graph-structured learning paradigms for affective state recognition. These paradigms increasingly inform the development of transformer-based ERV models and multimodal large language models, particularly in applications requiring reliability, interpretability, and real-time interaction. Incorporating insights from affective computing applications beyond video therefore supports the development of more robust, generalizable, and deployable ERV systems across safety-critical, assistive, and human–AI interaction scenarios.
7. Conclusions
Emotion recognition in video (ERV) has evolved through a sequence of methodological paradigms. Early work relied on handcrafted features combined with classical machine learning techniques, followed by the adoption of deep learning architectures based on convolutional and recurrent neural networks. These early systems typically focused on facial cues or short temporal segments and achieved reasonable performance under controlled laboratory conditions, but their robustness often deteriorated when applied to unconstrained, real-world recordings.
Subsequent multimodal approaches incorporating audio alongside visual information demonstrated improved performance on several benchmarks; however, their effectiveness remained strongly dependent on the dataset characteristics, recording conditions, and annotation protocols. The introduction of transformer-based architectures marked a further step forward by enabling the modeling of long-range temporal dependencies and more expressive cross-modal representations, including links between visual content and language. These capabilities addressed several limitations of earlier models, particularly with respect to contextual integration and temporal reasoning.
More recently, multimodal large language models (MLLMs) have emerged as a new paradigm for emotion understanding. Through large-scale multimodal pretraining and instruction following, these models can perform emotion-related inference without task-specific retraining and can generate natural-language explanations grounded in visual, auditory, and textual input. This shift has reframed ERV, moving the field from isolated perceptual classification toward multimodal reasoning.
In practice, several challenges remain unresolved. Many existing datasets lack sufficient cultural and demographic diversity, and emotion annotation remains inherently subjective, with substantial inter-annotator variability. In addition, concerns related to fairness and privacy, together with the computational demands of large-scale models, pose significant barriers to real-time and resource-constrained deployment. These limitations motivate ongoing research into more diverse and representative datasets, standardized and comparable evaluation protocols, and lightweight model architectures suitable for edge or on-device inference.
At the same time, growing interest in applications such as assistive technologies, education, mental health support, and interactive systems highlights the increasing demand for emotion-aware AI. This demand underscores a gap between emerging application needs and the current technological maturity of ERV systems, particularly with respect to robustness, interpretability, and ethical deployment.
Emotion recognition in video has progressed from handcrafted pipelines to deep learning architectures and, more recently, to multimodal large language models. While these advances have expanded modeling capabilities, real-world deployment remains constrained by data quality, robustness, computational cost, and interpretability. This review emphasizes practical engineering trade-offs across modeling paradigms, focusing on multimodal fusion strategies, dataset properties, and evaluation practices relevant to applied systems. Classical and task-specific deep learning models continue to provide efficient and reliable solutions for well-defined scenarios, whereas MLLMs offer added value in context-aware, interactive, and flexible applications. Future ERV systems are therefore likely to adopt hybrid designs that balance computational efficiency and robustness with higher-level multimodal reasoning to meet real-world operational requirements.