Featured Application
Emotion recognition in video (ERV) supports practical deployment in human–AI interaction, assistive systems, and intelligent monitoring. This review highlights engineering trade-offs among accuracy, robustness, and computational cost, showing that lightweight deep models are well suited to resource-constrained platforms, whereas multimodal large language models support context-aware interaction in cloud-assisted systems.
Abstract
Emotion recognition in video (ERV) aims to infer human affect from visual, audio, and contextual signals and is increasingly important for interactive and intelligent systems. Over the past decade, ERV has evolved from handcrafted features and task-specific deep learning models toward transformer-based vision–language models and multimodal large language models (MLLMs). This review traces that evolution, with an emphasis on engineering considerations relevant to real-world deployment. We analyze multimodal fusion strategies, dataset characteristics, and evaluation protocols, highlighting open issues in robustness, bias, and annotation quality under unconstrained conditions. Emerging MLLM-based approaches are examined in terms of performance, reasoning capability, computational cost, and interaction potential. By comparing task-specific models with foundation model approaches, we clarify their respective strengths for resource-constrained versus context-aware applications. Finally, we outline practical research directions toward building robust, efficient, and deployable ERV systems for applied scenarios such as assistive technologies and human–AI interaction.