Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition
Abstract
1. Introduction
2. Materials and Methods
2.1. Narrative Review
2.2. Experimental Design and Validation
- Temporal alignment between audio and visual modalities is preserved, allowing for one-to-one correspondence across time steps.
- Tensor construction for Tucker decomposition remains consistent, since outer product fusion requires matching time-step alignment, and
- Feature interpretability and spatial integrity are retained, avoiding information loss from dimensional projection.
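The temporal alignment described above can be sketched as resampling each modality's feature sequence to a common length T. The paper does not prescribe a specific resampling method, so the linear interpolation and the dimensions below are illustrative assumptions:

```python
import numpy as np

def align_temporal(seq: np.ndarray, T: int) -> np.ndarray:
    """Resample a (T_orig, d) feature sequence to T steps by linear interpolation."""
    T_orig, d = seq.shape
    src = np.linspace(0.0, T_orig - 1, num=T)   # fractional source positions
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T_orig - 1)
    w = (src - lo)[:, None]                     # interpolation weight toward hi
    return (1.0 - w) * seq[lo] + w * seq[hi]

# Example: align a 60-step audio sequence and a 90-frame visual sequence to T = 75
audio = np.random.rand(60, 128)     # (T_a, d_a)
visual = np.random.rand(90, 512)    # (T_v, d_v)
T = 75
audio_aligned = align_temporal(audio, T)
visual_aligned = align_temporal(visual, T)
```

After this step both sequences share the time axis, so the one-to-one correspondence required by the outer-product fusion holds at every time step.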
3. Results
3.1. RQ1. To What Extent Do Affective Computing and Recent Advancements in Deep Learning Improve Cross-Modal Fusion for Audio and Image-Based Emotion Recognition?
3.2. RQ2. Can a New Model Be Developed and Validated to Improve Correlations Between Heterogeneous Modalities (Such as Audio and Visual Data) to Detect and Interpret Human Emotional States?
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Symbol | Description |
|---|---|
| ∈ | Belongs to, or is an element of |
| ℝ | Set of real numbers |
|  | Standard multiplication for scalars |
|  | Bold uppercase letter for a matrix |
| a | Bold lowercase letter for a vector |
| ×ₙ | n-mode product (multiplication of a tensor by a matrix along mode n) |
| N | Batch size (number of samples in a batch) |
|  | Raw audio feature sequence |
|  | Raw image feature sequence |
|  | Feature dimension of audio at each time step |
|  | Feature dimension of visual (image) at each frame |
|  | Number of time steps/segments in the original audio feature sequence |
|  | Number of frames/time steps in the original visual feature sequence |
|  | Common temporal length after alignment |
|  | Cross-attention-refined image feature sequence |
|  | Cross-attention-refined audio feature sequence |
| X | The 4th-order cross-modal input tensor to the Tucker decomposition layer, representing spatiotemporal interactions |
| G | The core tensor, representing each sample's highly compressed and fused spatiotemporal representation |
|  | Reduced rank for the common temporal mode in the core tensor |
|  | Reduced rank for the audio feature mode in the core tensor |
|  | Reduced rank for the visual feature mode in the core tensor |
|  | Factor matrix for the common temporal mode |
|  | Factor matrix for the audio feature mode |
|  | Factor matrix for the visual feature mode |
Appendix A. Model Architecture and Hyperparameter Configuration
Appendix A.1. Audio Modality Configuration
Appendix A.2. Video Modality Configuration
Appendix A.3. Cross-Modal Attention Mechanism
- Visual attending to Audio: The visual feature sequence acts as the query, seeking relevant information from the audio sequence, which serves as the keys and values. This process generates a visual sequence enriched with audio context.
- Audio attending to Visual: Conversely, the audio feature sequence acts as the query, attending to the visual feature sequence (keys and values). This results in an audio sequence enriched with visual context.
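The two attention directions can be illustrated with a minimal single-head scaled dot-product sketch. The model itself uses four heads with learned projections (Appendix A.4); here, as a simplifying assumption, both modalities are taken to be already projected to a common dimension d:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query: np.ndarray, key: np.ndarray, value: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention: query attends to key/value.
    query: (T_q, d); key, value: (T_kv, d). Returns (T_q, d)."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)   # (T_q, T_kv) cross-modal affinities
    weights = softmax(scores, axis=-1)    # each query row sums to 1
    return weights @ value

T, d = 75, 64
audio = np.random.rand(T, d)
visual = np.random.rand(T, d)

# Visual attending to audio: visual is the query; audio supplies keys and values
visual_refined = cross_attention(visual, audio, audio)
# Audio attending to visual: roles reversed
audio_refined = cross_attention(audio, visual, visual)
```

Because each attention-weight row is a convex combination, the refined sequences stay in the span of the attended modality's features while keeping the query modality's time axis.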
Appendix A.4. The Inputs of the Proposed Architecture
| Audio Modality | Video Modality |
|---|---|
| 2D CNN | ResNet-18 |
| Input size = (Batch size, 128 × 256) | Input size = 224 × 224 × 3 |
| Kernel size = 3 | Kernel (conv layers) = 7 × 7, 3 × 3, 1 × 1 |
| Activation function = ReLU | Activation function = ReLU |
| Max pooling = 3 × 3, stride = 2 | Max pooling = 3 × 3, stride = 2 |
| Batch size =128 | Batch size = 128 |
| Epochs = 50–100 | Epochs = 50–100 |
| Learning rate = 0.0001 to 0.001 | Learning rate = 0.0001 to 0.001 |
| Optimizer = Adam | Optimizer = Adam |
| Loss Function = Categorical Cross-Entropy | Loss Function = Categorical Cross-Entropy |
| Audio Augmentation = Pitch shifting, time stretching, background noise | Video Augmentation = Horizontal flip, rotation, random cropping |
| Synchronization = Temporal alignment of audio and video segments | Synchronization = Temporal alignment of audio and video segments |
| Output dimension = 128 | Output dimension = 512 |
| Latent Space dimension = 60 × 256 | Latent Space dimension = 90 × 256 |
| Random 80:20 Split | |
| Number of Attention_head = 4 | |
| Dropout_rate = 0.5 | |
| Weight_decay = 1 × 10⁻⁴ |
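For reference, the settings in the table above can be gathered into a single configuration sketch. The key names below are illustrative, not taken from the authors' code:

```python
# Hypothetical configuration dictionary mirroring Appendix A.4 (key names are assumptions)
config = {
    "audio": {
        "backbone": "2D CNN",
        "input_size": (128, 256),       # spectrogram input per sample
        "kernel_size": 3,
        "activation": "ReLU",
        "max_pooling": {"size": 3, "stride": 2},
        "output_dim": 128,
        "latent_space": (60, 256),
        "augmentation": ["pitch_shift", "time_stretch", "background_noise"],
    },
    "video": {
        "backbone": "ResNet-18",
        "input_size": (224, 224, 3),
        "conv_kernels": [(7, 7), (3, 3), (1, 1)],
        "activation": "ReLU",
        "max_pooling": {"size": 3, "stride": 2},
        "output_dim": 512,
        "latent_space": (90, 256),
        "augmentation": ["horizontal_flip", "rotation", "random_crop"],
    },
    "training": {
        "batch_size": 128,
        "epochs": (50, 100),            # range explored
        "learning_rate": (1e-4, 1e-3),  # range explored
        "optimizer": "Adam",
        "loss": "categorical_cross_entropy",
        "attention_heads": 4,
        "dropout": 0.5,
        "weight_decay": 1e-4,
        "train_test_split": (0.8, 0.2),
    },
}
```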
Appendix B. The Input Tensor, Tensor Construction, and Tucker Decomposition for Fusion, and Associated Algorithms
Algorithm A1: Tensor construction from audio and visual features

- Input: the attention-weighted audio feature sequence and the attention-weighted visual feature sequence for each sample n (both of common length T after alignment).
- Output: a 4th-order cross-modal tensor of shape N × T × d_a × d_v.
- Procedure:
  1. Initialize TensorList = [ ].
  2. For each sample n = 1 to N:
     a. Initialize an empty per-sample tensor of shape T × d_a × d_v.
     b. For each time step t = 1 to T:
        i. Extract the audio feature vector at time t (length d_a).
        ii. Extract the visual feature vector at time t (length d_v).
        iii. Compute their outer product, a d_a × d_v matrix of cross-modal interactions.
        iv. Store the result as the t-th slice of the per-sample tensor.
     c. Append the per-sample tensor to TensorList.
  3. Stack all per-sample tensors along a new first (batch) dimension to obtain the N × T × d_a × d_v tensor.
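Algorithm A1 amounts to one outer product per (sample, time step), which collapses into a single einsum contraction. The helper below is a sketch with toy dimensions, not the authors' implementation:

```python
import numpy as np

# Toy sizes: N samples, T aligned time steps, audio dim d_a, visual dim d_v
N, T, d_a, d_v = 2, 5, 6, 7
A = np.random.rand(N, T, d_a)   # attention-refined audio features
V = np.random.rand(N, T, d_v)   # attention-refined visual features

# Outer product at every (sample, time step): X[n, t] = a_{n,t} ⊗ v_{n,t}
X = np.einsum('nta,ntv->ntav', A, V)   # shape (N, T, d_a, d_v)
```

The einsum form replaces the two explicit loops of the algorithm while producing the identical 4th-order tensor.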
Algorithm A2: Cross-modal spatiotemporal fusion by Tucker decomposition layer

- Input: the 4th-order cross-modal tensor X (shape N × T × d_a × d_v) and the desired core-tensor ranks (R_t, R_a, R_v).
- Output: the fused low-rank core tensor G (shape N × R_t × R_a × R_v).
- Learnable parameters (end-to-end trainable): a factor matrix for the common temporal dimension (T × R_t), a factor matrix for the audio feature dimension (d_a × R_a), and a factor matrix for the visual feature dimension (d_v × R_v).
- Procedure:
  1. Initialization: initialize the three factor matrices.
  2. Tensor projection via n-mode products (the batch mode is left untouched):
     a. Mode-2 (temporal) projection: multiply X by the temporal factor matrix along mode 2, reducing T to R_t.
     b. Mode-3 (audio) projection: multiply the result by the audio factor matrix along mode 3, reducing d_a to R_a.
     c. Mode-4 (visual) projection: multiply by the visual factor matrix along mode 4, reducing d_v to R_v, yielding the core tensor G.
  3. Return G as the fused representation.
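The three n-mode projections of Algorithm A2 can likewise be written as one contraction. Toy sizes are used, and randomly initialized factor matrices stand in for the learned ones:

```python
import numpy as np

N, T, d_a, d_v = 2, 5, 6, 7   # toy tensor sizes
R_t, R_a, R_v = 3, 4, 4       # reduced core-tensor ranks

X = np.random.rand(N, T, d_a, d_v)   # 4th-order cross-modal tensor
B_t = np.random.rand(T, R_t)         # temporal factor matrix (learnable)
B_a = np.random.rand(d_a, R_a)       # audio factor matrix (learnable)
B_v = np.random.rand(d_v, R_v)       # visual factor matrix (learnable)

# n-mode products along modes 2, 3 and 4 (batch mode untouched), fused into one einsum
G = np.einsum('ntav,tp,aq,vr->npqr', X, B_t, B_a, B_v)   # shape (N, R_t, R_a, R_v)
```

Mode order matters only for bookkeeping: the three contractions commute, so applying them sequentially or in one einsum gives the same core tensor.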
Appendix C. Datasets Used in the Experiments
- IEMOCAP: The IEMOCAP dataset is one of the most challenging and complex benchmark datasets for cross-modal emotion recognition. It provides an extensive multimodal resource, capturing emotive speech, corresponding facial expressions, motion-capture data, and body gestures. The dataset comprises 302 audio-video utterances drawn from dyadic sessions in which actors perform improvisations or scripted scenarios. Each utterance is annotated with one of nine distinct emotion categories, enabling fine-grained emotional analysis and making the dataset highly suitable for supervised learning tasks. It is pivotal for developing and evaluating emotion recognition systems, especially in cross-modal contexts.
- RAVDESS: The experimental validation was extended to the RAVDESS dataset, a challenging benchmark for cross-modal emotion recognition. It consists of video recordings from 24 professional actors (12 male, 12 female), each contributing 60 unique samples across eight distinct emotion states. Each utterance is provided in .mp4 format, making RAVDESS well suited to evaluating cross-modal fusion techniques.
- CREMA-D: This dataset comprises 7442 recordings from 91 different actors and offers a valuable collection of spontaneous and posed emotional expressions. It covers six core emotions, with each utterance captured in both audio (.wav) and video (.flv) formats. The emotion labels were validated through crowd-sourced human ratings, providing reliable ground truth. CREMA-D serves as a strong foundation for training and testing emotion recognition models across diverse speakers, modalities, and expression styles.
| Dataset | Total Utterances | Modality | Emotion Labels |
|---|---|---|---|
| IEMOCAP | 302 | Audio-Video | Neutral, Calm, Happy, Sad, Angry, Fear, Disgust, Excited, Surprised |
| RAVDESS | 2857 | Audio-Video | Neutral, Calm, Happy, Sad, Angry, Fear, Disgust, Surprise |
| CREMA-D | 7442 | Audio-Video | Neutral, Happy, Sad, Angry, Fearful, Disgust |
Appendix D. System Configurations
References
- Cohn, J.F.; Ambadar, Z.; Ekman, P. Observer-based measurement of facial expression with the Facial Action Coding System. In Handbook of Emotion Elicitation and Assessment; Coan, J.A., Allen, J.J.B., Eds.; Oxford University Press: Oxford, UK, 2007; pp. 203–221. [Google Scholar]
- Avital, N.; Egel, I.; Weinstock, I.; Malka, D. Enhancing Real-Time Emotion Recognition in Classroom Environments Using Convolutional Neural Networks: A Step Towards Optical Neural Networks for Advanced Data Processing. Inventions 2024, 9, 113. [Google Scholar] [CrossRef]
- Kalateh, S.; Estrada-Jimenez, L.A.; Nikghadam-Hojjati, S.; Barata, J. A Systematic Review on Multimodal Emotion Recognition: Building Blocks, Current State, Applications, and Challenges. IEEE Access 2024, 12, 103976–104019. [Google Scholar] [CrossRef]
- Hu, P.; Huang, Y.; Mei, J.; Leung, H.; Chen, Z.; Kuang, Z.; You, Z.; Hu, L. Learning from low-rank multimodal representations for predicting disease-drug associations. BMC Med. Inf. Decis. Mak. 2021, 21, 308. [Google Scholar] [CrossRef]
- DeVault, D.; Artstein, R.; Benn, G.; Dey, T.; Fast, E.; Gainer, A.; Georgila, K.; Gratch, J.; Hartholt, A.; Lhommet, M.; et al. SimSensei kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, Paris, France, 5–9 May 2014; Volume 2, pp. 1061–1068. [Google Scholar]
- Khan, W.A.; Qudous, H.; Farhan, A.A. Speech emotion recognition using feature fusion: A hybrid approach to deep learning. Multimed. Tools Appl. 2024, 83, 75557–75584. [Google Scholar] [CrossRef]
- Zhou, F.; Kong, S.; Fowlkes, C.C.; Chen, T.; Lei, B. Fine-grained facial expression analysis using dimensional emotion model. Neurocomputing 2020, 392, 38–49. [Google Scholar] [CrossRef]
- Kuchibhotla, S.; Yalamanchili, B.S.; Vankayalapati, H.D.; Anne, K.R. Speech Emotion Recognition Using Regularized Discriminant Analysis. Adv. Intell. Syst. Comput. 2014, 247, 363–369. [Google Scholar] [CrossRef]
- Ortega, J.D.S.; Cardinal, P.; Koerich, A. Emotion recognition using fusion of audio and video features. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 3847–3852. [Google Scholar] [CrossRef]
- Raj, R.S.; Pratiba, D.; Kumar, R.P. Facial Expression Recognition using Facial Landmarks: A novel approach. Adv. Sci. Technol. Eng. Syst. 2020, 5, 24–28. [Google Scholar] [CrossRef]
- Ristea, N.C.; Dutu, L.C.; Radoi, A. Emotion recognition system from speech and visual information based on convolutional neural networks. In Proceedings of the 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Timisoara, Romania, 10–12 October 2019; pp. 1–6. [Google Scholar] [CrossRef]
- Chaudhari, A.; Bhatt, C.; Nguyen, T.T.; Patel, N.; Chavda, K.; Sarda, K. Emotion Recognition System via Facial Expressions and Speech Using Machine Learning and Deep Learning Techniques. SN Comput. Sci. 2023, 4, 363. [Google Scholar] [CrossRef]
- Rangulov, D.; Fahim, M. Emotion Recognition on large video dataset based on Convolutional Feature Extractor and Recurrent Neural Network. In Proceedings of the International Conference on Image Processing, Applications and Systems, IPAS, Genoa, Italy, 9–11 December 2020; pp. 14–20. [Google Scholar]
- Tholusuri, A.; Anumala, M.; Malapolu, B.; Jaya Lakshmi, G. Sentiment analysis using LSTM. Int. J. Eng. Adv. Technol. 2019, 8, 1338–1340. [Google Scholar] [CrossRef]
- Liu, K.; Feng, Y.; Zhang, L.; Wang, R.; Wang, W.; Yuan, X.; Cui, X.; Li, X.; Li, H. An Effective Personality-Based Model for Short Text Sentiment Classification Using BiLSTM and Self-Attention. Electron 2023, 12, 3274. [Google Scholar] [CrossRef]
- Pan, X.; Guo, W.; Guo, X.; Li, W.; Xu, J.; Wu, J. Deep temporal-spatial aggregation for video-based facial expression recognition. Symmetry 2019, 11, 52. [Google Scholar] [CrossRef]
- Sanku, S.R.; Sandhya, B. Multi-Modal Emotion Recognition Feature Extraction and Data Fusion Methods Evaluation. Int. J. Innov. Technol. Explor. Eng. 2024, 3075, 18–27. [Google Scholar] [CrossRef]
- Zhang, K.; Li, Y.; Wang, J.; Wang, Z.; Li, X. Feature fusion for multimodal emotion recognition based on deep canonical correlation analysis. IEEE Signal Process. Lett. 2021, 28, 1898–1902. [Google Scholar] [CrossRef]
- Hazarika, D.; Gorantla, S.; Poria, S.; Zimmermann, R. Self-Attentive Feature-level Fusion for Multimodal Emotion Detection. In Proceedings of the 2018 IEEE Conference on multimedia information processing and retrieval (MIPR), Miami, FL, USA, 10–12 April 2018; pp. 196–201. [Google Scholar] [CrossRef]
- Dixit, C.; Satapathy, S.M. Deep CNN with late fusion for real time multimodal emotion recognition. Expert Syst. Appl. 2024, 240, 122579. [Google Scholar] [CrossRef]
- Kumar, P.; Malik, S.; Raman, B. Interpretable multimodal emotion recognition using hybrid fusion of speech and image data. Multimed Tools Appl. 2024, 83, 28373–28394. [Google Scholar] [CrossRef]
- Gill, J.; Johnson, P. Research Methods for Managers, 3rd ed.; Sage: London, UK, 2002. [Google Scholar]
- Porter, A.L.; Kongthon, A.; Lu, J.C. Research Profiling: Improving the Literature Review. Scientometrics 2002, 53, 351–370. [Google Scholar] [CrossRef]
- Popay, J.; Roberts, H.; Sowden, A.; Petticrew, M.; Arai, L.; Rodgers, M. Guidance on the Conduct of Narrative Synthesis in Systematic Reviews; Lancaster University: Lancaster, UK, 2006. [Google Scholar]
- Arksey, H.; O’Malley, L. Scoping studies: Towards a methodological framework. Int. J. Soc. Res. Methodol. 2005, 8, 19–32. [Google Scholar] [CrossRef]
- Snyder, H. Literature review as a research methodology: An overview and guidelines. J. Bus. Res. 2019, 104, 333–339. [Google Scholar] [CrossRef]
- Levering, B. Concept Analysis as Empirical Method. Int. J. Qual. Methods 2002, 1, 35–48. [Google Scholar] [CrossRef]
- Jabareen, Y. Building a Conceptual Framework: Philosophy, Definitions, and Procedure. Int. J. Qual. Methods 2009, 8, 49–62. [Google Scholar] [CrossRef]
- Indeed Editorial Team. Experimental Research: Definition, Types and Examples. 2024. Available online: https://www.indeed.com/career-advice/career-development/experimental-research (accessed on 10 October 2025).
- Zadeh, A.; Chen, M.; Cambria, E.; Poria, S.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 1103–1114. [Google Scholar] [CrossRef]
- Tucker, L.R. Some mathematical notes on three-mode factor analysis. Psychometrika 1966, 31, 279–311. [Google Scholar] [CrossRef] [PubMed]
- Ben-younes, H.; Cord, M.; Thome, N. MUTAN: Multimodal Tucker Fusion for Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2612–2620. [Google Scholar] [CrossRef]
- Mai, S.; Hu, H.; Xing, S. Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 164–172. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Fei-fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
- Wang, R.; Zhu, J.; Wang, S.; Wang, T.; Huang, J.; Zhu, X. Multi-modal emotion recognition using tensor decomposition fusion and self-supervised multi-tasking. Int. J. Multimed. Inf. Retr. 2024, 13, 39. [Google Scholar] [CrossRef]
- Ben-younes, H.; Cadene, R.; Thome, N.; Cord, M. BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection. In Proceedings of the Third AAAI Conference on Artificial Intelligence, Washington, DC, USA, 22–26 August 2019; pp. 8102–8109. [Google Scholar] [CrossRef]
- Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.; Lee, S.; Narayanan, S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
- Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
- Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390. [Google Scholar] [CrossRef]
- Murugappan, M.; Mutawa, A. Facial geometric feature extraction based emotional expression classification using machine learning algorithms. PLoS ONE 2021, 16, e0247131. [Google Scholar] [CrossRef]
- Roopa, N.S.; Prabhakaran, M.; Betty, P. Speech emotion recognition using deep learning. Int. J. Recent Technol. Eng. 2019, 7, 247–250. [Google Scholar] [CrossRef]
- Akhand, M.A.H.; Roy, S.; Siddique, N.; Kamal, M.A.S.; Shimamura, T. Facial emotion recognition using transfer learning in the deep CNN. Electronics 2021, 10, 1036. [Google Scholar] [CrossRef]
- Liu, G.; Cai, S.; Wang, C. Speech emotion recognition based on emotion perception. Eurasip J. Audio Speech Music. Process. 2023, 1, 22. [Google Scholar] [CrossRef]
- Mocanu, B.; Tapu, R.; Zaharia, T. Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning. Image Vis. Comput. 2023, 133, 104676. [Google Scholar] [CrossRef]
- Sultana, T.; Jahan, M.; Uddin, K.; Kobayashi, Y.; Smieee, M.H. Multimodal Emotion Recognition through Deep Fusion of Audio-Visual Data. In Proceedings of the 26th International Conference on Computer and Information Technology (ICCIT), Cox’s Bazar, Bangladesh, 13–15 December 2023; IEEE: Bissen, Luxembourg, 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Goncalves, L.; Leem, S.G.; Lin, W.C.; Sisman, B.; Busso, C. Versatile Audio-Visual Learning for Emotion Recognition. IEEE Trans. Affect. Comput. 2024, 16, 306–318. [Google Scholar] [CrossRef]
- John, V.; Kawanishi, Y. Audio and Video-based Emotion Recognition using Multimodal Transformers. In Proceedings of the International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 2582–2588. [Google Scholar] [CrossRef]
- Singh, N.; Khan, R.; Shree, R. MFCC and Prosodic Feature Extraction Techniques: A Comparative Study. Int. J. Comput. Appl. 2012, 54, 9–13. [Google Scholar] [CrossRef]
- Aouani, H.; Ayed, Y.B. Speech Emotion Recognition with deep learning. In Proceedings of the 24th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Verona, Italy, 16–18 September 2020; Elsevier: Amsterdam, The Netherlands, 2020; pp. 251–260. [Google Scholar]
- Sharma, G.; Umapathy, K.; Krishnan, S. Trends in audio signal feature extraction methods. Appl. Acoust. 2020, 158, 20. [Google Scholar] [CrossRef]
- Mehrish, A.; Majumder, N.; Bharadwaj, R.; Mihalcea, R.; Poria, S. A review of deep learning techniques for speech processing. Inf. Fusion 2023, 99, 101869. [Google Scholar] [CrossRef]
- Lim, W.; Jang, D.; Lee, T. Speech Emotion Recognition using Convolutional Recurrent Neural Networks and Spectrograms. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Jeju, Republic of Korea, 13–16 December 2016; IEEE: Bissen, Luxembourg, 2016; pp. 1–5. [Google Scholar] [CrossRef]
- Li, C.; Bao, Z.; Li, L.; Zhao, Z. Exploring temporal representations by leveraging attention-based bidirectional LSTM-RNNs for multi-modal emotion recognition. Inf. Process Manag. 2020, 57, 102185. [Google Scholar] [CrossRef]
- Venkateswarlu, S.C.; Jeevakala, S.R.; Kumar, N.U.; Munaswamy, P.; Pendyala, D. Emotion Recognition From Speech and Text using Long Short-Term Memory. Eng. Technol. Appl. Sci. Res. 2023, 13, 11166–11169. [Google Scholar] [CrossRef]
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Multimodal fusion for audio-image and video action recognition. Neural. Comput. Appl. 2024, 5, 5499–5513. [Google Scholar] [CrossRef]
- Palash, M.; Bhargava, B. EMERSK-Explainable Multimodal Emotion Recognition with Situational Knowledge. IEEE Trans. Multimed. 2023, 26, 2785–2794. [Google Scholar] [CrossRef]
- Zhang, S.; Tao, X.; Chuang, Y.; Zhao, X. Learning deep multimodal affective features for spontaneous speech emotion recognition. Speech Commun. 2021, 127, 73–81. [Google Scholar] [CrossRef]
- Lakshmi, K.L.; Muthulakshmi, P.; Nithya, A.A.; Jeyavathana, R.B.; Usharani, R.; Das, N.S.; Devi, G.N.R. Recognition of emotions in speech using deep CNN and RESNET. Soft. Comput. 2023, 1–16. [Google Scholar] [CrossRef]
- Udeh, C.P.; Chen, L.; Du, S.; Li, M.; Wu, M. Multimodal Facial Emotion Recognition Using Improved Convolution Neural Networks Model. J. Adv. Comput. Intell. Intell. Inform. 2023, 27, 710–719. [Google Scholar] [CrossRef]
- Patil, G.; Suja, P. Emotion Recognition from 3D Videos using Optical Flow Method. In Proceedings of the International Conference on Smart Technology for Smart Nation, (SmartTechCon), Bangalore, India, 17–19 August 2017; IEEE: Bissen, Luxembourg, 2017; pp. 825–829. [Google Scholar]
- Kumari, N.; Bhatia, R. Deep learning based efficient emotion recognition technique for facial images. Int. J. Syst. Assur. Eng. Manag. 2023, 14, 1421–1436. [Google Scholar] [CrossRef]
- Alkawaz, M.H.; Mohamad, D.; Basori, A.H.; Saba, T. Blend Shape Interpolation and FACS for Realistic Avatar. 3D Res. 2015, 6, 6. [Google Scholar] [CrossRef]
- Cho, J.; Hwang, H. Spatio-temporal representation of an electoencephalogram for emotion recognition using a three-dimensional convolutional neural network. Sensors 2020, 20, 3491. [Google Scholar] [CrossRef] [PubMed]
- Adegun, I.P.; Vadapalli, H.B. Facial micro-expression recognition: A machine learning approach. Sci. Afr. 2020, 8, 14. [Google Scholar] [CrossRef]
- Mehta, N.; Jadhav, S. Facial Emotion recognition using Log Gabor filter and PCA Ms Neelum Mehta. In Proceedings of the International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 12–13 August 2016; IEEE: Bissen, Luxembourg, 2016; pp. 1–5. [Google Scholar]
- Shi, Y.; Lv, Z.; Bi, N.; Zhang, C. An improved SIFT algorithm for robust emotion recognition under various face poses and illuminations. Neural. Comput. Appl. 2020, 32, 9267–9281. [Google Scholar] [CrossRef]
- Lakshmi, D.; Ponnusamy, R. Facial emotion recognition using modified HOG and LBP features with deep stacked autoencoders. Microprocess. Microsyst. 2021, 82, 103834. [Google Scholar] [CrossRef]
- Schoneveld, L.; Othmani, A.; Abdelkawy, H. Leveraging recent advances in deep learning for audio-Visual emotion recognition. Pattern Recognit. Lett. 2021, 146, 1–7. [Google Scholar] [CrossRef]
- Yu, C.; Zhao, X.; Zheng, Q.; Zhang, P.; You, X. Hierarchical Bilinear Pooling for Fine-Grained Visual Recognition. In Proceedings of the 15th European Conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018; Volume 11220, pp. 595–610. [Google Scholar] [CrossRef]
- Sahoo, S.; Routray, A. Emotion recognition from audio-visual data using rule based decision level fusion. In Proceedings of the IEEE Students’ Technol Symp TechSym 2016, Kharagpur, India, 2 October 2016; pp. 7–12. [Google Scholar] [CrossRef]
- Shoumy, N.J.; Ang, L.M.; Seng, K.P.; Rahaman, D.M.M.; Zia, T. Multimodal big data affective analytics: A comprehensive survey using text, audio, visual and physiological signals. J. Netw. Comput. Appl. 2020, 149, 102447. [Google Scholar] [CrossRef]
- Ortega, J.D.S.; Senoussaoui, M.; Granger, E.; Pedersoli, M.; Cardinal, P.; Koerich, A.L. Multimodal Fusion with Deep Neural Networks for Audio-Video Emotion Recognition. arXiv 2019, arXiv:190703196. [Google Scholar] [CrossRef]
- Njoku, J.N.; Caliwag, A.C.; Lim, W.; Kim, S.; Hwang, H.J.; Jeong, J.W. Deep Learning Based Data Fusion Methods for Multimodal Emotion Recognition. J. Korean Inst. Commun. Inf. Sci. 2022, 47, 79–87. [Google Scholar] [CrossRef]
- Cimtay, Y.; Ekmekcioglu, E.; Caglar-Ozhan, S. Cross-subject multimodal emotion recognition based on hybrid fusion. IEEE Access 2020, 8, 168865–168878. [Google Scholar] [CrossRef]
- Yoshino, K.; Sakti, S.; Nakamura, S. Hierarchical Tensor Fusion Network for Deception Handling Negotiation Dialog Model. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 1–10. Available online: https://neurips.cc/virtual/2019/workshop/13200 (accessed on 10 October 2025).
- Krishna, D.N.; Patil, A. Multimodal Emotion Recognition using Cross-Modal Attention and 1D Convolutional Neural Networks. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 4243–4247. [Google Scholar]
- Praveen, R.G.; De Melo, W.C.; Ullah, N.; Aslam, H.; Zeeshan, O.; Denorme, T.; Pedersoli, M.; Koerich, A.L.; Bacon, S.; Cardinal, P.; et al. A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 2485–2494. [Google Scholar] [CrossRef]
- Zhou, S.; Wu, X.; Jiang, F.; Huang, Q.; Huang, C. Emotion Recognition from Large-Scale Video Clips with Cross-Attention and Hybrid Feature Weighting Neural Networks. Int. J. Environ. Res. Public Health 2023, 20, 1400. [Google Scholar] [CrossRef]
- Lee, Y.; Yoon, S.; Jung, K. Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; ISCA: Geneva, Switzerland, 2020; pp. 2717–2721. [Google Scholar]
- Liu, F.; Fu, Z.; Wang, Y.; Zheng, Q. TACFN: Transformer-Based Adaptive Cross-Modal Fusion Network for Multimodal Emotion Recognition. CAAI Artif. Intell. Res. 2023, 2, 9150019. [Google Scholar] [CrossRef]
- Fu, Z.; Liu, F.; Wang, H.; Qi, J.; Fu, X.; Zhou, A.; Li, Z. A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition. arXiv 2021, arXiv:2111.02172. [Google Scholar] [CrossRef]
- Luna-Jiménez, C.; Kleinlein, R.; Griol, D.; Callejas, Z.; Montero, J.M.; Fernández-Martínez, F. A proposal for multimodal emotion recognition using aural transformers and action units on ravdess dataset. Appl. Sci. 2022, 12, 327. [Google Scholar] [CrossRef]
- Jin, Z.; Zai, W. Audiovisual emotion recognition based on bi-layer LSTM and multi-head attention mechanism on RAVDESS dataset. J. Supercomput. 2025, 81, 31. [Google Scholar] [CrossRef]
- Moorthy, S.; Moon, Y.K. Hybrid Multi-Attention Network for Audio–Visual Emotion Recognition Through Multimodal Feature Fusion. Mathematics 2025, 13, 1100. [Google Scholar] [CrossRef]
- Feng, J.; Fan, X. Cross-modal Context Fusion and Adaptive Graph Convolutional Network for Multimodal Conversational Emotion Recognition. arXiv 2025, arXiv:2501.15063. [Google Scholar] [CrossRef]
- Hu, D.; Chen, C.; Zhang, P.; Li, J.; Yan, Y.; Zhao, Q. A two-stage attention based modality fusion framework for multi-modal speech emotion recognition. IEICE Trans. Inf. Syst. 2021, E104D, 1391–1394. [Google Scholar] [CrossRef]
- Mengara Mengara, A.G.; Moon, Y.K. CAG-MoE: Multimodal Emotion Recognition with Cross-Attention Gated Mixture of Experts. Mathematics 2025, 13, 1907. [Google Scholar] [CrossRef]
Per-class results of the proposed model on the RAVDESS dataset (%):

| Emotions | Angry | Calm | Disgust | Fearful | Happy | Neutral | Sad | Surprise |
|---|---|---|---|---|---|---|---|---|
| Precision | 92.12 | 92.8 | 91.47 | 93.2 | 93.85 | 91.47 | 91.17 | 93.6 |
| Recall | 92.8 | 93.91 | 91.64 | 91.06 | 93.12 | 93.5 | 91.55 | 91.55 |
| F1-Score | 92.46 | 93.35 | 91.55 | 92.12 | 93.49 | 92.47 | 91.36 | 92.56 |
Per-class results of the proposed model on the CREMA-D dataset (%):

| Emotions | Anger | Disgust | Fear | Happy | Neutral | Sad |
|---|---|---|---|---|---|---|
| Precision | 86.5 | 88.8 | 87.93 | 87.39 | 85.62 | 85.62 |
| Recall | 85.23 | 88.46 | 87.4 | 87.83 | 85.08 | 88.88 |
| F1-Score | 85.86 | 88.63 | 87.67 | 87.61 | 85.35 | 87.22 |
Per-class results of the proposed model on the IEMOCAP dataset (%):

| Emotions | Anger | Happiness | Sadness | Neutral | Excited | Frustration | Surprise | Fear |
|---|---|---|---|---|---|---|---|---|
| Precision | 83.87 | 84.22 | 82.85 | 83.14 | 81.38 | 85.87 | 84.75 | 82.62 |
| Recall | 85.08 | 82.42 | 84.45 | 84.64 | 82.11 | 82.89 | 82.41 | 83.94 |
| F1-Score | 84.47 | 83.31 | 83.64 | 83.88 | 81.74 | 84.36 | 83.56 | 83.27 |
| Dataset | Weighted Accuracy | Unweighted Accuracy |
|---|---|---|
| RAVDESS | 92.46 | 88.74 |
| CREMA-D | 87.31 | 84.27 |
| IEMOCAP | 83.22 | 79.63 |
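Weighted accuracy is the plain fraction of correct predictions, while unweighted accuracy averages the per-class recalls so that minority emotions count equally. A minimal sketch of the two metrics:

```python
import numpy as np

def weighted_accuracy(y_true, y_pred) -> float:
    """Overall fraction of correct predictions (implicitly weighted by class size)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def unweighted_accuracy(y_true, y_pred) -> float:
    """Mean of per-class recalls, so every emotion class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Imbalanced toy labels: class 0 dominates
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
# weighted = 5/6 ≈ 0.833; unweighted = mean(4/4, 1/2) = 0.75
```

On imbalanced emotion sets the two diverge, which is why the table above reports both.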
Comparison with existing fusion approaches on the RAVDESS dataset:

| Source | Fusion | Accuracy | Remarks |
|---|---|---|---|
| [82] | Self-attention | 75.76 | The analysis relies solely on one dataset |
| [83] | Late fusion | 86.70 | The analysis relies solely on one dataset |
| [46] | Concatenation | 66.90 | Simple concatenation, high dimensionality |
| [45] | Cross-Attention | 89.25 | Poor sensitivity to micro-expressions |
| [84] | Cross-Attention | 82.42 | Computationally expensive and performed on a single dataset |
| Proposed Model | Cross-modal attention | 92.46 | Reduces dimensionality while employing three established benchmarks |
Comparison with existing fusion approaches on the CREMA-D dataset:

| Source | Fusion | Accuracy | Remarks |
|---|---|---|---|
| [48] | Multi-branch attention | 72.45 | Speaker-dependent model |
| [45] | Cross Attention | 84.57 | Poor sensitivity to micro-expressions |
| [47] | Conformer encoder | 77.9 | Versatile learning model |
| Proposed Model | Cross-modal attention | 87.31 | Common temporal dynamics along with the GRU layer |
Comparison with existing fusion approaches on the IEMOCAP dataset:

| Source | Fusion | Accuracy | Remarks |
|---|---|---|---|
| [85] | Hybrid Multi-Attention Fusion | 75.39 | Parallel co-attention mechanism |
| Proposed Model | Cross-modal attention | 83.22 | Intra-modal temporal refinement |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kumar, H.; Aruldoss, M.; Wynn, M. Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition. Multimodal Technol. Interact. 2025, 9, 116. https://doi.org/10.3390/mti9120116