Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition

Kumar, Himanshu; Aruldoss, Martin; Wynn, Martin

doi:10.3390/mti9120116

This is an early access version, the complete PDF, HTML, and XML versions will be available soon.

Open AccessArticle

Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition

by

Himanshu Kumar

¹,

Martin Aruldoss

¹

and

Martin Wynn

^2,*

¹

School of Mathematics and Computer Sciences, Central University of Tamil Nadu, Thiruvarur 610005, India

²

School of Business, Computing and Social Sciences, University of Gloucestershire, Cheltenham GL502RH, UK

^*

Author to whom correspondence should be addressed.

Multimodal Technol. Interact. 2025, 9(12), 116; https://doi.org/10.3390/mti9120116

Submission received: 11 October 2025 / Revised: 18 November 2025 / Accepted: 21 November 2025 / Published: 24 November 2025

Download Versions Notes

Abstract

Artificial emotional intelligence is a sub-domain of human–computer interaction research that aims to develop deep learning models capable of detecting and interpreting human emotional states through various modalities. A major challenge in this domain is identifying meaningful correlations between heterogeneous modalities—for example, between audio and visual data—due to their distinct temporal and spatial properties. Traditional fusion techniques used in multimodal learning to combine data from different sources often fail to adequately capture meaningful and less computational cross-modal interactions, and struggle to adapt to varying modality reliability. Following a review of the relevant literature, this study adopts an experimental research method to develop and evaluate a mathematical cross-modal fusion model, thereby addressing a gap in the extant research literature. The framework uses the Tucker tensor decomposition to analyse the multi-dimensional array of data into a set of matrices to support the integration of temporal features from audio and spatiotemporal features from visual modalities. A cross-attention mechanism is incorporated to enhance cross-modal interaction, enabling each modality to attend to the relevant information from the other. The efficacy of the model is rigorously evaluated on three publicly available datasets and the results conclusively demonstrate that the proposed fusion technique outperforms conventional fusion methods and several more recent approaches. The findings break new ground in this field of study and will be of interest to researchers and developers in artificial emotional intelligence.

Keywords: artificial emotional intelligence; human–computer interaction; cross-attention mechanism; categorical emotions; spatiotemporal features; Tucker decomposition; cross-modal framework

Share and Cite

MDPI and ACS Style

Kumar, H.; Aruldoss, M.; Wynn, M. Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition. Multimodal Technol. Interact. 2025, 9, 116. https://doi.org/10.3390/mti9120116

AMA Style

Kumar H, Aruldoss M, Wynn M. Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition. Multimodal Technologies and Interaction. 2025; 9(12):116. https://doi.org/10.3390/mti9120116

Chicago/Turabian Style

Kumar, Himanshu, Martin Aruldoss, and Martin Wynn. 2025. "Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition" Multimodal Technologies and Interaction 9, no. 12: 116. https://doi.org/10.3390/mti9120116

APA Style

Kumar, H., Aruldoss, M., & Wynn, M. (2025). Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition. Multimodal Technologies and Interaction, 9(12), 116. https://doi.org/10.3390/mti9120116

Article Menu

Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition

Abstract

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI