Multimodal Emotion Detection in Low-Resource Languages Using Lightweight Transformer Architectures: A Dual-Level Fusion Framework Integrating DistilBERT, CNN-BiGRU, and MobileViT for Efficient Real-Time Urdu Affective Computing
Abstract
1. Introduction
We summarize our contributions as follows:
- We introduce a lightweight multimodal framework for Urdu emotion recognition that fuses the text, audio, and visual modalities at a substantially lower computational cost.
- We propose a fusion method that combines dual-level fusion with missing-modality regularization, which improves robustness to noise and to the loss of one or more modalities.
- We conduct our experiments on the UMED corpus introduced by Majeed and Mujtaba [1], which remains the only publicly available multimodal Urdu emotion dataset and contains 3850 annotated samples across five emotion classes. Our primary contribution is the lightweight multimodal framework itself, not dataset construction.
- We present extensive experiments comparing the proposed method against conventional multimodal transformer models, and we obtain competitive results with significantly fewer parameters and a lower inference cost than these larger models.
2. Related Works
2.1. Unimodal Emotion Recognition
2.2. Multimodal Emotion Recognition and Fusion Strategies
2.3. Emotion Recognition in Low-Resource Languages and Urdu
3. Methodology
3.1. UMED Corpus Dataset Description
3.2. Design Principles for Efficient Multimodal Fusion
3.3. Self-Attention Mechanism
3.4. Data Processing Pipeline
- Text Modality Processing
- Audio Modality Processing
- Visual Modality Processing
3.5. Model Architectures
- UMEDNet (heavyweight/multimodal upper-bound baseline)
- Proposed Fusion Model (our efficiency-based contribution)
- A set of unimodal baselines (used to perform controlled ablation analyses)
- Text Encoder—BERT-Large (340 M parameters)
- Audio Encoder—Wav2Vec 2.0 (95 M parameters)
- Visual Encoder—ViT-Large (307 M parameters)
3.5.1. Proposed Fusion: Efficient Multimodal Model
- Text Encoder: DistilBERT (66 M parameters)
- Audio Encoder: CNN–BiGRU hybrid (8.2 M parameters)
- Visual Encoder: MobileViT-XXS (5.6 M parameters)
- A smaller parameter count gives greater deployment flexibility
- Lower inference latency enables real-time applications
- Knowledge-distillation alignment helps preserve the semantic structure of the teacher
- The limited capacity is expected to improve generalization
3.5.2. Dual-Level Fusion Architecture
- (1) Feature-Level Fusion
- (2) Prediction-Level Fusion
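The two fusion levels named above can be illustrated with a minimal sketch. The outline does not give the exact operators, so everything here is an assumption: feature-level fusion is reduced to plain concatenation of the modality embeddings, prediction-level fusion is taken to be a weighted average of per-modality class probabilities, and the vectors, logits, and weights are hypothetical toy values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def feature_level_fusion(text_vec, audio_vec, visual_vec):
    """Assumed simplification: concatenate the three modality embeddings."""
    return list(text_vec) + list(audio_vec) + list(visual_vec)

def prediction_level_fusion(logits_by_modality, weights):
    """Assumed form: weighted average of per-modality class probabilities."""
    total_weight = sum(weights.values())
    n_classes = len(next(iter(logits_by_modality.values())))
    fused = [0.0] * n_classes
    for name, logits in logits_by_modality.items():
        for c, p in enumerate(softmax(logits)):
            fused[c] += weights[name] * p / total_weight
    return fused

# Hypothetical toy inputs: 2-d embeddings and logits over five emotion classes.
joint = feature_level_fusion([0.1, 0.2], [0.3, 0.4], [0.5, 0.6])
fused_probs = prediction_level_fusion(
    {
        "text": [2.0, 0.1, 0.0, 0.3, 0.2],
        "audio": [0.5, 0.4, 0.1, 0.2, 0.0],
        "visual": [1.0, 0.2, 0.1, 0.0, 0.3],
    },
    weights={"text": 0.5, "audio": 0.2, "visual": 0.3},
)
predicted_class = max(range(len(fused_probs)), key=fused_probs.__getitem__)
```

In a trained model the joint vector would feed a learned classifier and the fusion weights would be learned or tuned; the sketch only shows where each level operates.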
3.6. Robustness Enhancements
- (1) Modality Dropout
- (2) Dynamic Weight Adjustment
- (3) Compression Ratio
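As a rough illustration of the first two enhancements, the sketch below zeroes out whole modalities at random during training and renormalizes fusion weights over the modalities that remain. The drop probability, the keep-at-least-one rule, and the renormalization formula are assumptions made for this example, not the paper's stated settings.

```python
import random

def modality_dropout(features, p_drop=0.3, rng=random):
    """Zero each modality with probability p_drop, always keeping at least one."""
    names = list(features)
    kept = {n for n in names if rng.random() >= p_drop}
    if not kept:
        kept = {rng.choice(names)}  # assumption: never drop every modality
    return {
        n: list(v) if n in kept else [0.0] * len(v)
        for n, v in features.items()
    }

def renormalize_weights(weights, available):
    """Assumed dynamic adjustment: spread fusion weight over available modalities."""
    total = sum(weights[n] for n in available)
    return {n: (weights[n] / total if n in available else 0.0) for n in weights}

# Hypothetical embeddings and fusion weights.
features = {"text": [1.0, 2.0], "audio": [3.0], "visual": [4.0, 5.0]}
dropped = modality_dropout(features, p_drop=0.3, rng=random.Random(0))
adjusted = renormalize_weights(
    {"text": 0.5, "audio": 0.2, "visual": 0.3}, available=["text", "visual"]
)
```

Zero-filling a dropped modality mirrors the zero-input condition used in the missing-modality rows of the robustness analysis.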
3.7. Classification Network
3.8. Unimodal Baseline Models
- (1) Text-Only Model (using DistilBERT)
- (2) Audio-Only Model (using CNN–BiGRU)
- (3) Visual-Only Model (using MobileViT-XXS)
Algorithm 1: Multimodal Emotion Recognition Training Pipeline
- Input: Dataset, model type
- Output: Trained model, optimal hyperparameters, evaluation metrics
3.9. Training Methodology
- (1) Loss Function
- (2) Optimization (AdamW)
- (3) Cosine Annealing Learning Rate Schedule
- (4) Gradient Clipping
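Two of the training components listed above have standard closed forms that are easy to show without a deep learning framework. The sketch below gives the usual cosine annealing formula and clipping by global L2 norm; the learning rates, step counts, and clipping threshold are hypothetical, since the outline does not state the values used.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Standard cosine annealing: decay from lr_max to lr_min over total_steps."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term

def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# Hypothetical settings for illustration only.
schedule = [cosine_annealing_lr(s, 100, lr_max=2e-5, lr_min=1e-6) for s in range(101)]
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

The schedule starts at `lr_max`, reaches the midpoint of the two rates halfway through, and ends at `lr_min`; clipping preserves the gradient direction and only shrinks its length.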
4. Results and Discussion
4.1. Holistic Performance and Efficiency Benchmark
- 76.5% parameter reduction
- 3.4× inference speedup (185 ms vs. 620 ms)
- Real-time capability at 5.4 FPS, compared to UMEDNet’s 1.6 FPS
| Model | Accuracy (%) | F1 (%) | Params (M) | Inf. Time (ms) | FPS | Memory (GB) | Eff. Score |
|---|---|---|---|---|---|---|---|
| UMEDNet | 85.27 | 85.29 | 200 | 620 | 1.6 | 8.2 | 1.00 |
| Text Only | 71.34 | 70.98 | 66 | 452 | 2.2 | 2.7 | 8.92 |
| Audio Only | 65.81 | 64.73 | 8.2 | 92 | 10.9 | 1.1 | 15.73 |
| Visual Only | 68.95 | 68.41 | 5.6 | 78 | 12.8 | 0.9 | 18.45 |
| Proposed Fusion | 83.72 | 83.61 | 47 | 185 | 5.4 | 2.1 | 6.84 |
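The derived efficiency figures can be recomputed from the table values. This is a small worked check, assuming throughput is the reciprocal of per-sample latency and speedup is the ratio of latencies; the function names are illustrative.

```python
# Values copied from the benchmark table (UMEDNet vs. Proposed Fusion).
UMEDNET = {"params_m": 200.0, "latency_ms": 620.0}
PROPOSED = {"params_m": 47.0, "latency_ms": 185.0}

def param_reduction_pct(baseline_m, model_m):
    """Percentage of parameters removed relative to the baseline."""
    return 100.0 * (1.0 - model_m / baseline_m)

def fps_from_latency(latency_ms):
    """Samples per second, assuming one sample per forward pass."""
    return 1000.0 / latency_ms

def latency_speedup(baseline_ms, model_ms):
    """How many times faster the model is than the baseline."""
    return baseline_ms / model_ms

reduction = param_reduction_pct(UMEDNET["params_m"], PROPOSED["params_m"])
proposed_fps = fps_from_latency(PROPOSED["latency_ms"])
umednet_fps = fps_from_latency(UMEDNET["latency_ms"])
speedup = latency_speedup(UMEDNET["latency_ms"], PROPOSED["latency_ms"])
```

With these inputs the parameter reduction is 76.5%, the throughputs are about 5.4 and 1.6 FPS, and the latency ratio 620/185 is about 3.35.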
4.2. Statistical Significance and Effect Size Validation
4.3. Component-Wise Contribution and Ablation Analysis
4.4. Computational Footprint Decomposition
4.5. Qualitative Error Analysis and Confusion Patterns
4.6. Inference Speed and Real-Time Capability Analysis
4.7. Cross-Lingual Generalization Potential
4.8. Additional Comparative Analysis
4.9. Modality Degradation and Robustness Analysis
5. Discussion
5.1. Performance–Efficiency Trade-Off
- The text modality provides the strongest semantic grounding and contributes the largest performance gain in the ablation analysis.
- The audio modality captures prosodic and paralinguistic features of the speech signal (e.g., pitch, intensity).
- The visual modality captures facial expressions and other non-verbal affective cues.
5.2. Limitations
- (1) Modalities are currently processed sequentially; processing them in parallel may reduce latency by an estimated 30–40%.
- (2) DistilBERT accounts for 58.9% of the model's parameters and is therefore the main computational bottleneck. Future work could explore attention-head pruning [38] or alternative efficient transformer architectures to reduce this bottleneck while preserving semantic understanding.
- (3) The five-class emotion taxonomy is constrained by the UMED corpus design. The observed Sad/Neutral (28%) and Love/Happy (31%) confusion rates reflect genuinely overlapping emotional characteristics rather than model failure. Extending to six- or seven-class schemes (e.g., Ekman's basic emotions) would require corpus re-annotation and is planned as future work. Investigating whether finer-grained class separation or hierarchical emotion classification can reduce confusion rates is a promising research direction.
- (4) Individual fold-level standard deviations for all ablation variants were not retained in the experimental logs, which prevents formal confidence interval reporting for each component. Future work will maintain complete per-fold records to enable full statistical validation of each architectural contribution.
- (5) UMED Corpus Limitations: The UMED corpus [1] has several inherent limitations that affect the interpretation of results. First, the 3850 samples were collected from online interviews and publicly available videos, which may not fully represent spontaneous emotional expression in naturalistic Urdu conversation. Second, the class distribution is imbalanced (Love: 13.7%, Anger: 25.0%), which may bias models toward majority classes; although stratified sampling maintains this distribution across splits, performance on Love remains lower than on the other classes. Third, all recordings were made in controlled studio conditions (noise < 30 dB, consistent lighting), which do not reflect real-world deployment conditions with ambient noise and variable illumination. These limitations provide context for interpreting our results and motivate future data collection in more naturalistic settings.
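Limitation (1) concerns how the three encoders are scheduled. The sketch below contrasts a sequential loop with a thread-pool version that submits all encoders at once. The encoder functions are hypothetical stand-ins that return trivial values, so the example shows the control flow only and makes no claim about the 30–40% estimate; in CPython, real gains would also depend on the encoders releasing the interpreter lock or running on separate devices.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the text, audio, and visual encoders.
def encode_text(tokens):
    return [float(len(tokens))]

def encode_audio(waveform):
    return [sum(waveform) / len(waveform)]

def encode_visual(frames):
    return [float(len(frames))]

ENCODERS = {"text": encode_text, "audio": encode_audio, "visual": encode_visual}

def encode_sequential(sample):
    """Run one encoder after another (the current behaviour described above)."""
    return {name: enc(sample[name]) for name, enc in ENCODERS.items()}

def encode_parallel(sample):
    """Submit all encoders at once and collect their results."""
    with ThreadPoolExecutor(max_workers=len(ENCODERS)) as pool:
        futures = {name: pool.submit(enc, sample[name]) for name, enc in ENCODERS.items()}
        return {name: future.result() for name, future in futures.items()}

sample = {"text": ["a", "b", "c"], "audio": [0.1, 0.3], "visual": [object(), object()]}
sequential_out = encode_sequential(sample)
parallel_out = encode_parallel(sample)
```

Both paths return the same embeddings; only the scheduling differs, so the fusion stages downstream would not need to change.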
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Majeed, A.; Mujtaba, H. UMEDNet: A multimodal approach for emotion detection in the Urdu language. PeerJ Comput. Sci. 2025, 11, e2861.
2. Tzirakis, P.; Trigeorgis, G.; Nicolaou, M.A.; Schuller, B.W.; Zafeiriou, S. End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Signal Process. 2017, 11, 1301–1309.
3. Mamieva, D.; Abdusalomov, A.B.; Kutlimuratov, A.; Muminov, B.; Whangbo, T.K. Multimodal emotion detection via attention-based fusion of extracted facial and speech features. Sensors 2023, 23, 5475.
4. Mustaqeem; Kwon, S. 1D-CNN: Speech emotion recognition system using a stacked network with dilated CNN features. Comput. Mater. Contin. 2021, 67, 3959–3977.
5. Bashir, M.F.; Javed, A.R.; Arshad, M.U.; Gadekallu, T.R.; Shahzad, W.; Beg, M.O. Context-aware emotion detection from low-resource Urdu language using deep neural network. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–30.
6. Akhtar, M.Z.; Jahangir, R.; Ain, Q.; Nauman, M.A.; Uddin, M.; Ullah, S.S. UrduSER: A comprehensive dataset for speech emotion recognition in Urdu language. Data Brief 2025, 60, 111627.
7. Soleymani, M.; Pantic, M.; Pun, T. Multimodal emotion recognition in response to videos. IEEE Trans. Affect. Comput. 2011, 3, 211–223.
8. Illendula, A.; Sheth, A. Multimodal emotion classification. In Proceedings of the Companion World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019.
9. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In Proceedings of the NeurIPS Workshop on Energy Efficient Deep Learning, Vancouver, BC, Canada, 8 December 2019.
10. Aliyu, Y.; Sarlan, A.; Danyaro, K.U.; Rahman, A.S.B.A.; Abdullahi, M. Sentiment analysis in low-resource settings: A comprehensive review of approaches, languages, and data sources. IEEE Access 2024, 12, 66883–66909.
11. Zhao, R.; Jiang, X.; Yu, F.R.; Leung, V.C.; Wang, T.; Zhang, S. Leveraging cross-attention transformer and multi-feature fusion for cross-linguistic speech emotion recognition. IEEE Internet Things J. 2025, 12, 50653–50664.
12. Zhang, T.; Tan, Z. Survey of deep emotion recognition in dynamic data using facial, speech and textual cues. Multimed. Tools Appl. 2024, 83, 66223–66262.
13. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the NeurIPS, Virtual, 6–12 December 2020.
14. Bharti, S.K.; Varadhaganapathy, S.; Gupta, R.K.; Shukla, P.K.; Bouye, M.; Hingaa, S.K.; Mahmoud, A. Text-based emotion recognition using deep learning approach. Comput. Intell. Neurosci. 2022, 2022, 2645381.
15. Tang, Y.; Hu, Y.; He, L.; Huang, H. A bimodal network based on audio-text-interactional-attention with ArcFace loss for speech emotion recognition. Speech Commun. 2022, 143, 21–32.
16. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL, Minneapolis, MN, USA, 2–7 June 2019.
17. Makhmudov, F.; Kultimuratov, A.; Cho, Y.I. Enhancing multimodal emotion recognition through attention mechanisms in BERT and CNN architectures. Appl. Sci. 2024, 14, 4199.
18. Boitel, E.; Mohasseb, A.; Haig, E. MIST: Multimodal emotion recognition using DeBERTa for text, Semi-CNN for speech, ResNet-50 for facial, and 3D-CNN for motion analysis. Expert Syst. Appl. 2025, 270, 126236.
19. Lian, H.; Lu, C.; Li, S.; Zhao, Y.; Tang, C.; Zong, Y. A survey of deep learning-based multimodal emotion recognition: Speech, text, and face. Entropy 2023, 25, 1440.
20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the ICLR, Virtual, 3–7 May 2021.
21. Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.
22. Zhu, Z.; Mao, K. Knowledge-based BERT word embedding fine-tuning for emotion recognition. Neurocomputing 2023, 552, 126488.
23. Abdullah, S.M.; Ameen, S.Y.; Sadeeq, M.A.; Zeebaree, S. Multimodal emotion recognition using deep learning. J. Appl. Sci. Technol. Trends 2021, 2, 73–79.
24. Middya, A.I.; Nag, B.; Roy, S. Deep learning-based multimodal emotion recognition using model-level fusion of audio-visual modalities. Knowl. Based Syst. 2022, 244, 108580.
25. Priyasad, D.; Fernando, T.; Denman, S.; Sridharan, S.; Fookes, C. Attention driven fusion for multi-modal emotion recognition. arXiv 2020, arXiv:2009.10991.
26. Pan, Z.; Luo, Z.; Yang, J.; Li, H. Multi-modal attention for speech emotion recognition. arXiv 2020, arXiv:2009.04107.
27. Balaji, R.L.; Thiruvenkataswamy, C.S.; Batumalay, M.; Duraimutharasan, N.; Devadas, A.D.T.; Yingthawornsuk, T. A study of unified framework for extremism classification, ideology detection, propaganda analysis, and flagged data detection using transformers. J. Appl. Data Sci. 2025, 6, 1791–1810.
28. Zaidi, S.A.M.; Latif, S.; Qadir, J. Enhancing cross-language multimodal emotion recognition with dual attention transformers. IEEE Open J. Comput. Soc. 2024, 5, 684–693.
29. Schmitz, M.; Ahmed, R.; Cao, J. Bias and fairness on multimodal emotion detection algorithms. arXiv 2022, arXiv:2205.08383.
30. Caschera, M.C.; Grifoni, P.; Ferri, F. Emotion classification from speech and text in videos using a multimodal approach. Multimodal Technol. Interact. 2022, 6, 28.
31. Raza, M.A.; Fränti, P. A hierarchical gamma mixture model-based method for classification of high-dimensional data. Entropy 2019, 21, 906.
32. Teneva, E.V. Emotionalization of the 2021–2022 global energy crisis coverage: Analyzing the rhetorical appeals as manipulation means in the mainstream media. Journal. Media 2025, 6, 14.
33. Azhar, M.; Amjad, A.; Dewi, D.A.; Kasim, S. A systematic review and experimental evaluation of classical and transformer-based models for Urdu abstractive text summarization. Information 2025, 16, 784.
34. Cevher, D.; Zepf, S.; Klinger, R. Towards multimodal emotion recognition in German speech events in cars using transfer learning. arXiv 2019, arXiv:1909.02764.
35. Azhar, M.; Amjad, A.; Dewi, D.A.; Kasim, S. Efficient transformer-based abstractive Urdu text summarization through selective attention pruning. Information 2025, 16, 991.
36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017.
37. Padi, S.; Sadjadi, S.O.; Manocha, D.; Sriram, R.D. Multimodal emotion recognition using transfer learning from speaker recognition and BERT-based models. arXiv 2022, arXiv:2202.08974.
38. Cheema, A.S.; Azhar, M.; Arif, F.; ul Haq, Q.M.; Sohail, M.; Iqbal, A. EGPT-SPE: Story point effort estimation using improved GPT-2 by removing inefficient attention heads. Appl. Intell. 2025, 55, 994.

| Parameter | Value | Remarks |
|---|---|---|
| Total Duration | 17 h | Continuous recordings from online interviews |
| Number of Samples | 3850 | Each sample: text + audio + video (synchronized) |
| Emotion Classes | 5 | Anger, Happy, Sad, Neutral, Love |
| Number of Speakers | 142 | Native Urdu speakers, diverse backgrounds |
| Male/Female Ratio | 58%/42% | Gender balance maintained across emotion classes |
| Age Range | 18–65 years | Diverse age representation |
| Recording Environment | Controlled studio | Noise < 30 dB, consistent lighting |
| Annotation Agreement (Cohen’s κ) | 0.80 | High inter-annotator reliability |
| Data Splits (Train/Val/Test) | 70%/15%/15% | Stratified sampling across emotion categories |
| Audio Quality (SNR) | >35 dB | All samples meet quality threshold |
| Face Detection Confidence | >0.90 | MTCNN face detection threshold |
| Temporal Alignment | ±50 ms | Modality synchronization tolerance |

| Emotion | Count | Percentage (%) | Duration (h) |
|---|---|---|---|
| Anger | 2068 | 25.0 | 4.25 |
| Happy | 1771 | 21.4 | 3.64 |
| Sad | 1624 | 19.6 | 3.34 |
| Neutral | 1680 | 20.3 | 3.45 |
| Love | 1135 | 13.7 | 2.33 |
| Total | 8278 | 100.0 | 17.01 |
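The corpus description lists a stratified 70%/15%/15% train/validation/test split, which keeps class proportions such as those in the table above roughly equal across splits. A minimal sketch of one way to do this follows; the per-class rounding rule, the seed, and the toy label list are assumptions for illustration rather than the procedure used for UMED.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split sample indices per class so each split keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test split
    return train, val, test

# Hypothetical imbalanced toy labels, not the real corpus.
toy_labels = ["Anger"] * 100 + ["Love"] * 40
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```

Because the split is done class by class, a minority class such as Love keeps the same share in every split that it has in the full set.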

| Comparison | p-Value | Effect Size (Cohen’s d) | Interpretation |
|---|---|---|---|
| Proposed vs. Text Only | <0.001 | 1.82 | Large effect |
| Proposed vs. Audio Only | <0.001 | 2.15 | Large effect |
| Proposed vs. Visual Only | <0.001 | 1.94 | Large effect |
| Proposed vs. UMEDNet | 0.12 | 0.45 | No significant difference |
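For readers unfamiliar with the effect size column, Cohen's d is the difference in means divided by a pooled standard deviation. The sketch below uses the independent-samples form with sample variances; whether the paper used this form or a paired variant is not stated, and the input scores are hypothetical.

```python
import math
import statistics

def cohens_d(scores_a, scores_b):
    """Cohen's d with a pooled standard deviation (independent-samples form)."""
    n_a, n_b = len(scores_a), len(scores_b)
    var_a = statistics.variance(scores_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(scores_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(scores_a) - statistics.mean(scores_b)) / pooled_sd

# Hypothetical scores chosen so the arithmetic is easy to follow.
effect = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

Here both groups have sample variance 4, so the pooled standard deviation is 2 and the one-point difference in means gives d = 0.5.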

| Model Variant | Accuracy (%) | Δ Accuracy | 95% CI | p-Value (vs. Full) |
|---|---|---|---|---|
| Full Proposed Model | 83.72 | — | [82.91, 84.53] | — |
| w/o Text Modality | 75.18 | −8.54 | [74.21, 76.15] | <0.001 |
| w/o Visual Modality | 77.65 | −6.07 | [76.58, 78.72] | <0.001 |
| w/o Audio Modality | 79.41 | −4.31 | [78.33, 80.49] | <0.001 |
| w/o Modality Dropout | 82.95 | −0.77 | [82.02, 83.88] | 0.08 |
| w/o Prediction-Level Fusion | 82.88 | −0.84 | [81.95, 83.81] | 0.06 |
| Concatenation Only | 82.10 | −1.62 | [81.15, 83.05] | 0.04 |
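A 95% interval of the kind reported above can be formed from per-fold accuracies as the mean plus or minus a critical value times the standard error. The sketch below uses a normal approximation because the standard library has no Student's t quantile; with only a handful of folds a t-based interval would be wider. The fold accuracies are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width

# Hypothetical per-fold accuracies (%), not taken from the experimental logs.
fold_accuracies = [83.1, 84.2, 83.5, 84.0, 83.8]
ci_low, ci_high = mean_confidence_interval(fold_accuracies)
```

The interval is symmetric around the fold mean and narrows as the number of folds grows or their spread shrinks.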

| Component | Params (M) | % of Total | Performance Contribution (%) |
|---|---|---|---|
| DistilBERT (Text Encoder) | 66 | 58.9 | 32.1 |
| CNN-BiGRU (Audio Encoder) | 8.2 | 7.3 | 18.7 |
| MobileViT-XXS (Visual Encoder) | 5.6 | 5.0 | 14.5 |
| Fusion + Classifier | 32.2 | 28.8 | 34.7 |
| Total | 112.0 | 100 | 100 |
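The "% of Total" column follows directly from the parameter counts. As a small worked check, assuming each share is the component's count divided by the column total:

```python
# Parameter counts (millions) copied from the table above.
COMPONENT_PARAMS_M = {
    "DistilBERT (Text Encoder)": 66.0,
    "CNN-BiGRU (Audio Encoder)": 8.2,
    "MobileViT-XXS (Visual Encoder)": 5.6,
    "Fusion + Classifier": 32.2,
}

def parameter_shares(params_m):
    """Return each component's share of the total parameter count, in percent."""
    total = sum(params_m.values())
    return {name: 100.0 * value / total for name, value in params_m.items()}

total_params_m = sum(COMPONENT_PARAMS_M.values())
shares = parameter_shares(COMPONENT_PARAMS_M)
```

The counts sum to 112.0 M, and the text encoder's share, 66/112, is about 58.9%, which is the figure behind the observation that DistilBERT dominates the parameter budget.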

| Model | Latency (ms) | FPS | Memory (GB) | Battery (mAh/h) | Deployable? |
|---|---|---|---|---|---|
| UMEDNet | 620 | 1.6 | 8.2 | ~4200 | No |
| Text Only | 452 | 2.2 | 2.7 | ~1800 | Limited |
| Audio Only | 92 | 10.9 | 1.1 | ~650 | Yes |
| Visual Only | 78 | 12.8 | 0.9 | ~520 | Yes |
| Proposed Fusion | 185 | 5.4 | 2.1 | ~1100 | Yes |

| Degradation Condition | Accuracy (%) | Δ from Full Model |
|---|---|---|
| Full Model (all modalities) | 83.72 | — |
| Audio: Gaussian noise (SNR = 10 dB) | 80.15 | −3.57 |
| Audio: Gaussian noise (SNR = 0 dB) | 75.43 | −8.29 |
| Video: Gaussian blur (σ = 2.0) | 79.88 | −3.84 |
| Video: Gaussian blur (σ = 5.0) | 74.21 | −9.51 |
| Text: random word dropout (30%) | 76.94 | −6.78 |
| Text: random word dropout (50%) | 71.23 | −12.49 |
| Missing audio (zero input) | 78.65 | −5.07 |
| Missing video (zero input) | 77.92 | −5.80 |
| Missing text (zero input) | 70.33 | −13.39 |
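Three of the degradation types in the table can be sketched on plain Python lists: additive Gaussian noise scaled to a target SNR, random word dropout, and zeroing a modality. The function names, the toy signal, and the example sentence are illustrative assumptions; the paper's preprocessing code is not shown in this outline.

```python
import math
import random

def add_gaussian_noise(signal, snr_db, rng):
    """Add white Gaussian noise so the signal-to-noise ratio is about snr_db."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

def word_dropout(text, rate, rng):
    """Drop each whitespace-separated word independently with probability `rate`."""
    return " ".join(w for w in text.split() if rng.random() >= rate)

def zero_modality(features):
    """Simulate a missing modality by replacing its input with zeros."""
    return [0.0] * len(features)

rng = random.Random(0)
clean = [math.sin(0.01 * i) for i in range(20000)]  # hypothetical toy waveform
noisy = add_gaussian_noise(clean, snr_db=10.0, rng=rng)
shortened = word_dropout("yeh aik misaal ka jumla hai", rate=0.3, rng=rng)
missing_audio = zero_modality(clean)
```

Lower SNR values mean proportionally more noise power, which matches the larger accuracy drop reported at 0 dB than at 10 dB.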
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Azhar, M.; Amjad, A.; Arman, M.; Dewi, D.A. Multimodal Emotion Detection in Low-Resource Languages Using Lightweight Transformer Architectures: A Dual-Level Fusion Framework Integrating DistilBERT, CNN-BiGRU, and MobileViT for Efficient Real-Time Urdu Affective Computing. Information 2026, 17, 458. https://doi.org/10.3390/info17050458

