Multimodal Emotion Detection in Low-Resource Languages Using Lightweight Transformer Architectures: A Dual-Level Fusion Framework Integrating DistilBERT, CNN-BiGRU, and MobileViT for Efficient Real-Time Urdu Affective Computing

Azhar, Muhammad; Amjad, Adeen; Arman, Muhammad; Dewi, Deshinta Arrova

doi:10.3390/info17050458

¹ Department of Applied Data Science, Hong Kong Shue Yan University, Hong Kong SAR, China
² Department of Computer Science, University of Sahiwal, Sahiwal 57000, Pakistan
³ Faculty of Data Science and Information Technology, INTI International University, Nilai 71800, Negeri Sembilan, Malaysia
* Author to whom correspondence should be addressed.

Information 2026, 17(5), 458; https://doi.org/10.3390/info17050458

Submission received: 11 March 2026 / Revised: 17 April 2026 / Accepted: 25 April 2026 / Published: 8 May 2026

(This article belongs to the Special Issue Advancing Information Systems Through Artificial Intelligence: Innovative Approaches and Applications)

Abstract

This paper addresses emotion recognition in low-resource language settings for healthcare and human-computer interaction (HCI). Most existing multimodal systems rely on resource-intensive transformers or high-resource languages, limiting their applicability to low-resource languages like Urdu. We propose an efficiency-driven, lightweight multimodal framework for Urdu emotion detection integrating facial expressions, speech, and text. We utilize DistilBERT for text, CNN-BiGRU for audio, and MobileViT-XXS for visual processing with a dual-level fusion strategy. We evaluate on the publicly available UMED corpus, the only multimodal Urdu emotion dataset. Our system recognizes expressed emotional signals rather than internal affective states. Experimental results demonstrate competitive performance (83.72% accuracy) while requiring 76.5% fewer parameters and 4.4× faster inference than heavyweight baselines, enabling accessible, real-time emotion recognition in low-resource contexts.

Keywords:

affective computing; multimodal emotion recognition; Urdu language processing; lightweight transformers; edge computing; public health; DistilBERT; MobileViT; CNN-BiGRU

1. Introduction

One of the major goals of emotion-aware computing is to better understand and respond to human emotional expressions. Intelligent systems rely on emotion recognition as a core component, which enables them to perceive and respond to observable emotional signals produced by users. It is important to note that the signals processed by such systems (speech acoustics, facial expressions, and textual content) reflect expressed or displayed emotion rather than internal affective states per se. Emotional expression can be modulated, suppressed, or culturally mediated, and the outputs of emotion recognition models should therefore be interpreted as estimates of expressed emotion rather than direct measures of subjective experience. This distinction is particularly relevant in naturalistic, low-resource settings where spontaneous and posed expressions may coexist in the data [1]. There are many areas where improving an intelligent system’s ability to recognize emotions accurately would be beneficial, including mental health assessment, intelligent tutoring systems, customer sentiment analysis, social robotics, and human-computer interaction. Most previous methods for emotion recognition have used only one modality, such as text, audio, or facial expressions. However, emotional expression is often complex and ambiguous, so a single modality is vulnerable to noise in the data and to contextual variation, and it may lack the cues needed to capture every aspect of an emotional expression.

To overcome the limitations of single-modality approaches, multimodal emotion recognition systems have been developed that use complementary data from multiple modalities (textual semantics, speech prosody, and visual facial expressions) to build a better understanding of the user’s emotional state [2]. By jointly modeling multiple heterogeneous signals, multimodal approaches offer better robustness and generalization in real-world applications where data are collected in unconstrained environments. Deep learning techniques such as transformer-based architectures have greatly improved the performance of multimodal systems by modeling long-range dependencies within each modality and interactions across modalities.

Multimodal emotion recognition systems are typically built for high-resource languages such as English and Mandarin. Many such systems exist, but they require significant computational and memory resources [3], since most rely on large-scale pretrained models and complicated fusion techniques. As a result, current multimodal emotion recognition systems are difficult to apply to low-resource languages or to deploy on resource-constrained devices.

Urdu is a prime example of a low-resource language that has received little research attention in multimodal emotion recognition. It has its own unique characteristics, including a right-to-left writing system, use of the Nastaliq script, and a highly complex morphological structure influenced by Arabic, Persian, and several South Asian languages [4]. In addition, there is very limited access to multimodal datasets that are large enough and sufficiently annotated to support research into multimodal emotion recognition in Urdu.

Research on Urdu language and emotion recognition has traditionally relied on a single modality, either text (e.g., the Urdu Nastalique Emotions Dataset (UNED)) [5] or speech (e.g., the Urdu Speech Emotion Recognition (UrduSER) dataset) [6]. Visual emotion recognition research for Urdu is further limited because many studies rely on models that were developed using large datasets from non-Urdu-speaking environments. Most of these studies produced useful but unimodal systems that help in some situations yet struggle to cover the full range of emotional expression found in real-life settings.

Recently, researchers have begun to study multimodal systems. The development of several large-scale multimodal datasets (e.g., IEMOCAP, CMU-MOSEI) has allowed the construction of more complex, state-of-the-art multimodal fusion models [7,8]. In particular, several transformer-based architectures (e.g., the cross-modal attention model MulT) are well suited to unaligned multimodal sequential data [9]. Unfortunately, because of their computational cost, transformer architectures can be impractical for near real-time inference or for deployment to mobile and other edge devices.

This indicates a lack of published research studies regarding computationally efficient multimodal emotion recognition systems with acceptable levels of recognition performance that can be used for practical, low-resource language applications (real-time interactive tutoring systems, mental health monitoring, or adaptive human-computer interfaces) where low latency and limited computational resources are a primary concern.

To fill the above-mentioned gap, we propose an efficiency-driven, applied multimodal emotion recognition framework specifically designed for deployment in low-resource language settings. Rather than introducing new modeling primitives, this study contributes to the principled combination of lightweight architectures with an effective dual-level fusion strategy optimized for practical deployability in resource-constrained environments. The value of this contribution is in system design and empirical validation rather than theoretical novelty, and we position it as such. To achieve this, we will use highly efficient neural network models along with an optimized fusion strategy in order to provide both acceptable recognition performance and computational efficiency. Our goal is to provide sufficient representational capacity while at the same time reducing overall model complexity; thus enabling interoperable and deployable multimodal emotion recognition systems across low-resource languages.

We summarize our contributions as follows:

We introduce a lightweight multimodal framework for Urdu emotion recognition that fuses the text, audio, and visual modalities with substantially lower computational load.
We provide a new dual-level fusion method with regularization for missing modalities, which improves robustness when inputs are noisy or when one or more modalities are unavailable.
We conduct our experiments on the UMED corpus introduced by Majeed and Mujtaba [1], which remains the only publicly available multimodal Urdu emotion dataset containing 3850 annotated samples across five emotion classes. Our primary contribution is the lightweight multimodal framework itself, not dataset construction.
We present extensive experiments comparing our proposed method against conventional multimodal transformer models, achieving competitive results with significantly fewer parameters and lower inference cost than these larger models.

In the remainder of the paper, we discuss related work on multimodal and low-resource datasets in Section 2. We detail our proposed multimodal framework and fusion method in Section 3. We present experimental results and analysis in Section 4, and we present conclusions and future directions in Section 5.

2. Related Works

Emotion recognition has evolved from early unimodal methods to more sophisticated multimodal models that show how audio, visual, and textual information work together to convey a complete emotional expression. This shift reflects the fact that emotions are communicated through more than one channel, so relying on a single source of information rarely yields the best results. In this review, we cover previous research on unimodal emotion recognition techniques, multimodal fusion strategies, and emotion recognition methods for low-resource languages, especially Urdu.

2.1. Unimodal Emotion Recognition

Early studies of emotion recognition emphasized training on a single type of data (unimodal) using hand-crafted features and conventional machine learning approaches. In Speech Emotion Recognition (SER), for example, acoustic descriptors such as Mel-Frequency Cepstral Coefficients (MFCCs), pitch, and energy were combined with classifiers such as Support Vector Machines (SVMs) and Gaussian Mixture Models (GMMs). While these methods worked well in controlled settings, they are not robust across different speakers, recording environments, or languages [10]. The introduction of deep learning marked a major shift in SER: CNNs learn discriminative spectral representations from speech, and RNNs/LSTMs model the temporal dynamics of emotion in speech [11]. More recently, self-supervised transformer-based speech models such as HuBERT and Wav2Vec 2.0 have obtained state-of-the-art results by learning contextual representations from raw audio and generalizing well to multiple languages with minimal training resources [12]. Wav2Vec 2.0 has also been used to learn speech representations for emotion recognition [13].

Text-based emotion classification has followed a similar trajectory. Early lexicon-based and rule-based methods were limited in handling contextual ambiguity, sarcasm, and emotion that is expressed without explicit emotional vocabulary [14]. Long Short-Term Memory (LSTM) networks and Bidirectional Gated Recurrent Unit (BiGRU) networks improved performance by capturing the semantic and sequential properties of text [15]. Transformer-based language models such as BERT, RoBERTa, and DistilBERT have advanced text classification further. BERT provides contextualized representations for text emotion recognition [16]. These contextual representations, combined with attention mechanisms, reach state-of-the-art accuracy on multiple emotion classification datasets [17]. Distilled models (e.g., DistilBERT) reduce the computational burden while retaining much of the representational strength of their base transformer counterparts, which simplifies deployment [9].

Traditional methods of visual emotion recognition relied on hand-crafted features such as Gabor filters and local binary patterns (LBP), both of which are sensitive to variations in lighting and head pose [10]. CNN architectures have since become dominant because they are effective at learning spatial feature relationships between the pixels of facial images. Deeper models such as VGGNet and ResNet-50 have shown strong performance on many publicly available facial expression recognition datasets [18]. More recently, Vision Transformers (ViTs) and hybrid CNN-ViT methods have been tested as a means of modeling relationships between distant regions of the face [19]. ViTs capture spatial relationships in facial expressions [20]. To address the size of computer vision models, researchers have developed a growing number of lightweight architectures such as MobileNet and MobileViT. These models enable real-time facial emotion recognition on mobile and other low-power devices [21]. Knowledge-based BERT fine-tuning has improved emotion recognition performance [22].

2.2. Multimodal Emotion Recognition and Fusion Strategies

While unimodal methods provide good insight, they are limited by the noise and ambiguity of the single modality they use. By combining information from multiple modalities into one model (multimodal emotion recognition), we can overcome this issue [23]. Fusing modalities can be described as early (feature-level) fusion, late (decision-level) fusion, or hybrid fusion.

Through early fusion, different data types (e.g., audio and video) are combined before classification, allowing a model to learn correlations between modalities directly. Tzirakis et al. [2] provided evidence that end-to-end audio-visual models outperform unimodal baselines; many subsequent studies have shown that early fusion performs well on benchmark datasets including RAVDESS and SAVEE [24]. However, early fusion is often sensitive to synchronization errors and missing modality data. In contrast, late fusion aggregates decisions from unimodal classifiers, improving robustness and modularity. Soleymani et al. [7] found that decision-level fusion between facial expressions and vocal expressions significantly improved emotion recognition performance.

Current research focuses on hybrid fusion techniques. The most advanced hybrid methods use attention mechanisms to dynamically weight modality contributions spatiotemporally, enabling more flexible and context-aware fusion. Numerous studies have concluded that attention-based audio-video fusion outperforms static methods [25]. Multi-modal attention mechanisms have been applied to speech emotion recognition [26]. Transformer-based architectures have further extended these techniques to include self-attention and cross-attention for modeling complex cross-modal dependencies. Transformers have been applied to various text classification tasks [27]. Examples include the Dual Attention Transformer (MDAT) described by Zaidi et al. [28], as well as large-scale fusion models such as MIST, which utilize powerful modality-specific encoders in transformer-based fusion pipelines [18]. Despite their excellent performance, these models require massive computational resources, making them less compatible with real-time deployment on resource-constrained devices.

Research has also investigated fairness and bias in multimodal emotion recognition. Schmitz et al. [29] reported that text-based models generally have less bias compared to models that include visual modalities, which may introduce fairness challenges. The use of audio and text modalities provides better trade-offs between accuracy and fairness than other modality combinations. Additionally, the HMM-based approach proposed by Caschera et al. [30] demonstrates that probabilistic models without deep learning techniques remain applicable and can be successful for certain applications. Probabilistic mixture models have been used for high-dimensional classification [31]. Media emotionalization has been studied as a form of rhetorical manipulation [32].

2.3. Emotion Recognition in Low-Resource Languages and Urdu

Because annotated data are limited and the linguistic systems involved are complex, research on emotion recognition in low-resource languages remains sparse. The Urdu language poses unique challenges for multimodal emotion recognition. Owing to its morphological complexity, shaped by Arabic, Persian, and South Asian influences, standard tokenizers produce many out-of-vocabulary tokens on Urdu text. Our system mitigates this problem using subword WordPiece tokenization. Urdu also differs from well-resourced languages such as English and Mandarin in prosody, intonation patterns, and other paralinguistic features, making it difficult for pre-trained speech models to transfer without adaptation. In addition, speakers in urban Pakistani environments frequently code-switch between Urdu and English, which changes the discourse context in ways that monolingual models cannot represent. Text-specific issues such as Nastaliq rendering and right-to-left directionality have little direct effect on the multimodal emotion recognition system outlined here. Urdu text processing challenges have been reviewed in recent literature [33].

Early efforts in Urdu emotion recognition concentrated on creating unimodal emotion datasets. Bashir et al. [5] developed the Urdu Nastalique Emotions Dataset (UNED), the first of its kind, and showed that deep learning models based on Long Short-Term Memory (LSTM) outperform traditional classifiers for text-only emotion recognition. Akhtar et al. [6] extended this line of work with their Urdu Speech Emotion Recognition (UrduSER) dataset, which established a robust speech resource for recognizing emotion in spoken Urdu.

To date, the most extensive multimodal effort in Urdu emotion recognition is that of Majeed and Mujtaba [1], who created the first multimodal Urdu emotion corpus and a corresponding multimodal emotion recognition framework, UMEDNet. They achieved strong performance using very large pretrained models (e.g., Wav2Vec 2.0 and XLM-RoBERTa), but the design is computationally expensive in both total parameter count and inference time. These factors limit its use in real-time, low-resource applications. Transfer learning has been explored for German multimodal emotion recognition in cars [34]. We therefore explore efficient multimodal designs that use compact backbone architectures for the textual, auditory, and visual modalities. Attention pruning has been used for efficient Urdu text processing [35].

3. Methodology

This section outlines the method used for Urdu multimodal emotion recognition, including the dataset description and the theoretical foundation of the three model architectures used in the implementation. Using the UMED corpus, we develop and evaluate (1) UMEDNet as a heavyweight baseline, (2) Proposed Fusion as our efficient multimodal approach, and (3) unimodal baselines for comparative analysis.

3.1. UMED Corpus Dataset Description

Majeed and Mujtaba [1] introduced the Urdu Multimodal Emotion Detection (UMED) Corpus as the first and only corpus available to researchers working on multimodal emotion recognition in the Urdu language. We used this dataset as our experimental benchmark. We did not create a new dataset; instead, we provide a novel architecture and fusion method evaluated on that corpus. The dataset was built through a structured data collection process with 142 native Urdu speakers from diverse demographic backgrounds. It contains 3850 synchronized multimodal samples with an aggregate duration of 17 h across five emotion categories: Anger, Happy, Sad, Neutral, and Love.

Table 1 shows the specifications of the UMED corpus in detail. The corpus comprises 17 h of synchronized audio-visual recordings from 142 native Urdu speakers (58% male and 42% female) aged 18 to 65 years. It contains 3850 samples aligned across all three data types: text, audio, and video. The emotions are labelled in five classes (Anger, Happy, Sad, Neutral, and Love). Recording conditions were controlled to maintain quality: the noise level during each recording was below 30 dB, and lighting intensity was held constant throughout. Inter-annotator agreement for the emotion labels was very good (Cohen’s κ = 0.80). The corpus was split into training (70%), validation (15%), and test (15%) sets using stratified sampling, so the class distribution is preserved across all splits. For quality assurance, audio signals had a signal-to-noise ratio of at least 35 dB, face detection with MTCNN had 90% confidence, and temporal alignment was within ±50 ms across all three data types. This ensures a synchronized representation across modalities.
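The 70/15/15 stratified split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce the released splits: the function name `stratified_split`, the seed, and the per-class rounding rule are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split sample indices so each class keeps roughly the same share per split."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to the test split
    return train, val, test

# Toy labels only; the real corpus has five emotion classes.
toy_labels = ["Anger"] * 100 + ["Love"] * 40
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```

Splitting within each class, rather than over the pooled data, is what keeps a minority class such as Love represented in the validation and test sets.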

It is important to identify the limitations of the UMED corpus. (1) The 3850 samples come from online interviews and publicly available videos and may not represent all forms of spontaneous emotional expression found in actual Urdu conversation. (2) The class distribution is imbalanced: Love accounts for only 13.7% of the data, whereas Anger accounts for 25.0%. Although stratified sampling lessens these effects during training, the imbalance still affects model performance on under-represented emotion classes. (3) Because the samples were obtained in controlled studio conditions (noise level < 30 dB, consistent illumination), they do not accurately represent deployment conditions with ambient noise and variable illumination. These limitations provide context for the results reported herein and motivate future data collection from more naturalistic sources covering a greater diversity of emotional speech among the Urdu-speaking population.

The emotion-based breakdown of the UMED corpus in Table 2 indicates that there are 8278 inspected and labelled data items distributed among the five emotion categories: Anger is the largest (2068, 25.0%, 4.25 h), followed by Happy (1771, 21.4%, 3.64 h), Neutral (1680, 20.3%, 3.45 h), Sad (1624, 19.6%, 3.34 h), and Love (1135, 13.7%, 2.33 h). The total duration across all five categories is 17.01 h, matching the corpus total in Table 1. Although the distribution is imbalanced, with Love under-represented relative to the other classes, it reflects the natural variability in how often emotions occur in spontaneous conversational data. The stratified random sampling used to divide the data into training, validation, and test sets preserves this distribution in each split, which supports the development of robust classifiers for each emotion category.
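The percentages and total duration quoted for Table 2 can be reproduced from the per-class counts with a few lines of arithmetic. The snippet below uses only the numbers given in the text; the variable names are illustrative.

```python
# Per-class item counts and durations (hours) as quoted in the text for Table 2.
counts = {"Anger": 2068, "Happy": 1771, "Neutral": 1680, "Sad": 1624, "Love": 1135}
hours = {"Anger": 4.25, "Happy": 3.64, "Neutral": 3.45, "Sad": 3.34, "Love": 2.33}

total_items = sum(counts.values())
# Share of each class, in percent, rounded to one decimal as in the text.
shares = {name: round(100.0 * n / total_items, 1) for name, n in counts.items()}
total_hours = round(sum(hours.values()), 2)
# Ratio between the largest and smallest class, a simple measure of imbalance.
imbalance_ratio = counts["Anger"] / counts["Love"]
```

The largest class is about 1.8 times the size of the smallest, which is a moderate imbalance.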

The UMED corpus uses a five-class emotion taxonomy (Anger, Happy, Sad, Neutral, Love) based on the classification system from Ekman’s early research, and it contains some ambiguity. Specifically, many participants had trouble distinguishing between Sad/Neutral and Love/Happy (Section 4.5). This indicates that emotions falling into low-arousal, high-positive-valence categories are expressed in ways that are difficult to interpret, which leads to misclassification. Future research will explore emotion categories at a finer granularity than presented here, without re-annotating all the data in the UMED corpus. Several studies have identified similar ambiguity among basic emotions within larger emotion taxonomies.

3.2. Design Principles for Efficient Multimodal Fusion

Multimodal emotion recognition is founded on the complementary principle: different modalities (text, audio, video) capture partially overlapping yet distinct information about the same emotional expression. By jointly modeling all three modalities, we reduce ambiguity and improve robustness compared to unimodal approaches. Our contribution is engineering-driven rather than theoretically novel. We adopt three practical design principles that directly inform our architecture:

Architectural Compression: Replace heavy encoders (BERT-Large, Wav2Vec 2.0, ViT-Large) with lightweight alternatives (DistilBERT, CNN-BiGRU, MobileViT-XXS). This follows the information bottleneck principle, where we compress input representations while preserving task-relevant features.

Selective Training: Freeze the largest encoder (DistilBERT) to reduce trainable parameters and prevent overfitting on the small UMED corpus. This leverages transfer learning, where knowledge from large-scale pretraining is retained while only task-specific layers are fine-tuned.

Robust Fusion: Combine feature-level concatenation with prediction-level weighting to handle missing or noisy modalities. This draws from ensemble theory, where multiple independent predictors improve generalization and robustness.
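The Robust Fusion principle above can be illustrated with a small NumPy sketch that combines a feature-level prediction (from concatenated features) with a prediction-level weighted average of unimodal outputs. This is not the authors' implementation. The mixing coefficient `alpha`, the modality weights, the zero-filling of missing features, the linear head `W`, and the function names are assumptions chosen for illustration.

```python
import numpy as np

MODALITIES = ("text", "audio", "visual")

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dual_level_fusion(feats, unimodal_logits, available, W, weights, alpha=0.5):
    """Blend feature-level and prediction-level fusion (hypothetical sketch).

    feats / unimodal_logits: dict modality -> 1-D array
    available: dict modality -> bool (False marks a missing modality;
               at least one modality is assumed to be available)
    W: linear classification head applied to the concatenated features
    """
    # Level 1: concatenate features, zero-filling any missing modality.
    concat = np.concatenate([feats[m] * float(available[m]) for m in MODALITIES])
    p_feature = softmax(W @ concat)

    # Level 2: weighted average of unimodal predictions over available modalities.
    w = np.array([weights[m] * float(available[m]) for m in MODALITIES])
    w = w / w.sum()
    p_unimodal = np.stack([softmax(unimodal_logits[m]) for m in MODALITIES])
    p_decision = w @ p_unimodal

    return alpha * p_feature + (1.0 - alpha) * p_decision
```

Because the prediction-level weights are renormalized over the modalities that are present, the output remains a valid probability distribution when a modality is dropped.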

3.3. Self-Attention Mechanism

To model contextual dependencies across modalities and temporal sequences, we employ the standard scaled dot-product attention mechanism [36]. This allows each token or feature vector to dynamically attend to all other positions, capturing semantic emphasis in text, temporal variations in audio, and spatial dependencies in visual frames. The scaled dot-product attention is defined as:

\begin{matrix} \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \end{matrix}

(1)

where $Q$, $K$, and $V$ represent query, key, and value matrices, and $d_{k}$ denotes the dimensionality of the key vectors. The scaling factor $\sqrt{d_{k}}$ stabilizes gradients for high-dimensional representations.
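Equation (1) can be written directly in NumPy. The sketch below handles a single attention head without masking or learned projections; the function name and the toy shapes are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute Softmax(Q K^T / sqrt(d_k)) V for 2-D inputs.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    Returns the attended values and the attention weights.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

When all queries are zero, every score is equal, so each query attends uniformly and the output is the mean of the value rows. This is a convenient sanity check for an implementation.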

3.4. Data Processing Pipeline

Our multimodal framework processes text, audio, and visual inputs through parallel domain-specific pipelines while preserving temporal synchronization. Each dataset instance is represented as:

\begin{matrix} V_{i} = \{X_{T}^{(i)}, X_{A}^{(i)}, X_{V}^{(i)}\}, i = 1, 2, \dots, 3850 \end{matrix}

(2)

Figure 1 illustrates the complete parallel preprocessing pipeline for all three modalities. As shown in the figure, text, audio, and video inputs are processed through three independent branches before temporal alignment. This synchronized representation ensures alignment between spoken content, vocal tone, and facial expressions—an essential requirement for accurate multimodal emotion recognition.

Text Modality Processing

Urdu text preprocessing requires script normalization and diacritic removal to reduce orthographic variability:

\begin{matrix} X^{\mathrm{clean}} = \mathrm{RemoveDiacritics}(\mathrm{NormalizeUrduScript}(X_{T})) \end{matrix}

(3)

\begin{matrix} X^{\mathrm{tokens}} = \mathrm{WordPieceTokenize}(X^{\mathrm{clean}}) \end{matrix}

(4)

\begin{matrix} X_{T}^{\mathrm{proc}} = {\mathrm{[CLS]}, w_{1}, w_{2}, \dots, w_{n}, \mathrm{[SEP]}} \end{matrix}

(5)

Tokenization using subword units improves robustness to morphological richness and out-of-vocabulary terms.
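Equations (3) to (5) can be sketched in plain Python as follows. This is an illustrative approximation, not the authors' pipeline: the two-entry character mapping is a small assumed subset of a full Urdu normalization table, diacritics are approximated as Unicode nonspacing marks, and the vocabulary is a toy example (a real system would use the vocabulary shipped with the pretrained tokenizer).

```python
import unicodedata

# Assumed (partial) mapping from Arabic code points to their usual Urdu forms.
ARABIC_TO_URDU = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

def normalize_urdu_script(text):
    text = unicodedata.normalize("NFC", text)
    return "".join(ARABIC_TO_URDU.get(ch, ch) for ch in text)

def remove_diacritics(text):
    # Drop nonspacing marks (category Mn), which include zabar, zer, and pesh.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split; '##' marks continuation pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]          # no valid split: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

def preprocess(text, vocab):
    clean = remove_diacritics(normalize_urdu_script(text))   # Eq. (3)
    tokens = ["[CLS]"]
    for word in clean.split():
        tokens.extend(wordpiece_tokenize(word, vocab))       # Eq. (4)
    return tokens + ["[SEP]"]                                # Eq. (5)

# Toy example: a word written with a diacritic and an Arabic-style final yeh.
toy_vocab = {"\u062E\u0648\u0634", "##\u06CC"}
toy_tokens = preprocess("\u062E\u064F\u0648\u0634\u064A", toy_vocab)
```

Normalizing code points before tokenization matters because visually identical Urdu and Arabic letters would otherwise map to different vocabulary entries.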

Audio Modality Processing

Audio features are normalized and transformed into time–frequency representations to capture prosodic cues such as pitch, energy, and spectral patterns:

\begin{matrix} X^{\mathrm{norm}} = \frac{X_{A} - \mu_{A}}{\sigma_{A}} \end{matrix}

(6)

\begin{matrix} S(t, f) = {\left| \mathrm{STFT}(X^{\mathrm{norm}}) \right|}^{2} \end{matrix}

(7)

\begin{matrix} M(t, m) = \log(\mathrm{MelFilterBank}(S(t, f))) \end{matrix}

(8)

Log-Mel spectrograms approximate human auditory perception and provide discriminative emotional cues from vocal intonation.
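Equations (6) to (8) can be sketched in NumPy as shown below. The sampling rate, FFT size, hop length, number of Mel bands, the Hann window, and the small constant added before the logarithm are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale, shape (n_mels, n_fft//2 + 1)."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, center, hi = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel(x, sr=16000, n_fft=400, hop=160, n_mels=40, eps=1e-10):
    x = (x - x.mean()) / (x.std() + eps)                   # Eq. (6): z-normalization
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # Eq. (7): |STFT|^2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T       # Mel filter bank
    return np.log(mel + eps)                                # Eq. (8): log compression
```

For a one-second signal at 16 kHz these settings give 98 frames of 40 Mel bands each. The result is a time by frequency array that a CNN front end can treat like a single-channel image.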

Visual Modality Processing

Facial expressions are extracted using a structured pipeline:

\begin{matrix} F^{\mathrm{detected}} = \mathrm{MTCNN}(X_{V}) \end{matrix}

(9)

\begin{matrix} F^{\mathrm{aligned}} = \mathrm{AlignFace}(F^{\mathrm{detected}}) \end{matrix}

(10)

\begin{matrix} F^{\mathrm{proc}} = \mathrm{Resize}(F^{\mathrm{aligned}}, 256 \times 256) \end{matrix}

(11)

Face alignment ensures geometric consistency, while normalization standardizes spatial dimensions for deep feature extraction. The complete pipeline is visualized in Figure 1, which shows the three parallel branches and their alignment strategy.
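The crop and resize steps around Equations (9) and (11) can be sketched as below. Face detection and alignment are not implemented here: the bounding box is assumed to come from a detector such as MTCNN, its `(x0, y0, x1, y1)` pixel format is an assumption, and nearest-neighbour sampling stands in for whatever interpolation the actual pipeline uses.

```python
import numpy as np

def crop_and_resize(frame, box, size=256):
    """Crop a face region and resize it to size x size by nearest-neighbour sampling.

    frame: (H, W, 3) image array
    box:   (x0, y0, x1, y1) pixel coordinates, assumed to come from a face detector
    """
    x0, y0, x1, y1 = box
    face = frame[y0:y1, x0:x1]
    # Map each output row/column to its source row/column in the crop.
    rows = np.arange(size) * face.shape[0] // size
    cols = np.arange(size) * face.shape[1] // size
    return face[rows[:, None], cols[None, :]]
```

Resizing every crop to the same 256 × 256 shape is what allows frames from different videos to be batched for the visual encoder.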

3.5. Model Architectures

We evaluate the performance vs. efficiency trade-off on the UMED corpus using three model architectures:

UMEDNet (heavyweight/multimodal upper-bound baseline)
Proposed Fusion Model (our efficiency-based contribution)
A set of unimodal baselines (used to perform controlled ablation analyses)

These three model architectures allow us to examine, both theoretically and empirically, scalability, parameter efficiency, and cross-modal interaction in terms of the performance vs. efficiency trade-off on the UMED corpus. UMEDNet (the heavyweight multimodal upper-bound baseline) is a large-scale multimodal transformer architecture designed for maximum representational capacity rather than computational efficiency. It uses a transformer-based fusion encoder to model cross-modal interactions among the input modalities, with high-parameter pretrained encoders serving as the modality-specific encoders, as shown below:

Text Encoder—BERT-Large (340 M parameters)
Audio Encoder—Wav2Vec 2.0 (95 M parameters)
Visual Encoder—ViT-Large (307 M parameters)

These encoders generate high-dimensional contextual representations:

F_{T}^{\mathrm{heavy}} = \mathrm{BERT\_Large}\left(X_{T}^{\mathrm{proc}}\right) \in \mathbb{R}^{1024}

(12)

F_{A}^{\mathrm{heavy}} = \mathrm{Wav2Vec2.0}\left(X_{A}^{\mathrm{proc}}\right) \in \mathbb{R}^{512}

(13)

F_{V}^{\mathrm{heavy}} = \mathrm{ViT\_Large}\left(X_{V}^{\mathrm{proc}}\right) \in \mathbb{R}^{768}

(14)

The modality-specific embeddings are concatenated and processed through a transformer fusion encoder:

F^{\mathrm{heavy}} = \mathrm{TransformerEncoder}\left(\left[F_{T}^{\mathrm{heavy}}; F_{A}^{\mathrm{heavy}}; F_{V}^{\mathrm{heavy}}\right]\right)

(15)

From a theoretical standpoint, the computational complexity of the fusion encoder can be approximated as:

O\left(L \cdot \left(n \cdot d_{T} + n \cdot d_{A} + n \cdot d_{V}\right)\right)

(16)

where $L$ denotes the number of layers, $n$ represents the sequence length, and $d_{T}$, $d_{A}$, $d_{V}$ are the modality-specific hidden dimensions. While UMEDNet achieves strong representational expressiveness, its high parameter count (>700 M parameters combined) leads to increased memory footprint, latency, and deployment constraints.

3.5.1. Proposed Fusion: Efficient Multimodal Model

Our primary contribution focuses on optimizing the accuracy–efficiency trade-off through architectural compression and knowledge distillation. Instead of maximizing parameter count, we emphasize lightweight encoders combined with effective multimodal fusion. The proposed model employs:

Text Encoder: DistilBERT (66 M parameters)
Audio Encoder: CNN–BiGRU hybrid (8.2 M parameters)
Visual Encoder: MobileViT-XXS (5.6 M parameters)

The lightweight feature extraction is defined as:

F_{T} = \mathrm{DistilBERT}\left(X_{T}^{\mathrm{proc}}\right) \in \mathbb{R}^{768}

(17)

F_{A} = \mathrm{BiGRU}_{128}\left(\mathrm{CNN}_{[7, 5, 3, 3]}\left(X_{A}^{\mathrm{proc}}\right)\right) \in \mathbb{R}^{128}

(18)

F_{V} = \mathrm{MobileViT\_XXS}\left(X_{V}^{\mathrm{proc}}\right) \in \mathbb{R}^{256}

(19)

As shown in Figure 2 (left panel), the CNN-BiGRU audio encoder applies four convolutional layers with kernel sizes [7, 5, 3, 3] and filter counts [32, 64, 128, 256]. Each convolutional layer is followed by batch normalization, ReLU activation, and max pooling (stride 2). The output then passes through a 128-unit bidirectional GRU, followed by mean pooling over time to produce the final audio feature vector $F_{A} \in \mathbb{R}^{128}$. The visual encoder, shown in Figure 2 (right panel), employs the MobileViT-XXS architecture. Face frames of size $224 \times 224 \times 3$ first pass through a convolutional stem ($3 \times 3$, stride 2, 16 channels). This is followed by three MobileNetV2 blocks (expansion ratio 2), transformer blocks (4 attention heads, hidden dimension 96), two additional MobileNetV2 blocks (expansion ratio 4), global average pooling, and finally a fully connected layer to produce $F_{V} \in \mathbb{R}^{256}$.
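To make the audio encoder's temporal downsampling concrete, the sketch below tracks the channel count and number of time steps through the four convolutional blocks and then applies mean pooling over time. It assumes length-preserving ("same"-padded) convolutions and pooling that floors odd lengths; neither detail is stated in the text, and the helper names are hypothetical.

```python
def conv_stack_output_shape(n_frames, filters=(32, 64, 128, 256), pool_stride=2):
    """Return (channels, time_steps) after each conv block.

    Assumes the convolutions preserve length and only the stride-2
    max pooling shortens the time axis (floor division for odd lengths).
    """
    shapes = []
    length = n_frames
    for channels in filters:
        length = length // pool_stride
        shapes.append((channels, length))
    return shapes

def mean_pool_over_time(sequence):
    """Average a list of per-time-step vectors into a single vector."""
    steps = len(sequence)
    dim = len(sequence[0])
    return [sum(vec[d] for vec in sequence) / steps for d in range(dim)]
```

Under these assumptions, a 300-frame input shrinks to 150, 75, 37, and finally 18 time steps, so the BiGRU would process 18 steps of 256-channel features before mean pooling collapses them into one vector.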

Here, the CNN layers of the proposed architecture capture local spectral features of the audio data, while the bidirectional GRU captures temporal dependencies. The visual backbone uses lightweight transformer layers that incorporate convolutional inductive biases to reduce computational cost.

Theoretical benefits of the proposed model include:

A reduced parameter count, which provides greater deployment flexibility
Lower inference latency, which enables real-time applications
Knowledge-distillation alignment, which helps preserve the semantic structure learned by the teacher model
Limited model capacity, which is expected to improve generalization

The proposed model has 112 M parameters in total. The pretrained DistilBERT encoder accounts for 66 M of these and was frozen during training to reduce computational cost and prevent overfitting on the comparatively small text portion of the UMED corpus. Only the CNN-BiGRU (8.2 M), MobileViT-XXS (5.6 M), and fusion/classifier layers (32.2 M, including projection heads and the MLP classifier) were trained end-to-end, yielding 47 M trainable parameters. All reported efficiency figures (inference time, memory, FPS) were therefore measured on the entire 112 M parameter model at inference, while the 47 M value refers to trainable parameters only.

3.5.2. Dual-Level Fusion Architecture

Our multimodal framework adopts a structured dual-level fusion strategy that integrates information at both representation and decision stages. This design improves cross-modal interaction while maintaining robustness under modality degradation. Figure 3 presents the complete dual-level fusion architecture. The framework integrates information at two complementary levels: feature-level fusion and prediction-level fusion.

(1): Feature-Level Fusion

At the representation level, modality-specific embeddings are concatenated to form a unified multimodal feature vector:

F_{\mathrm{fusion}} = \mathrm{Concat}\left(F_{T}, F_{A}, F_{V}\right) \in \mathbb{R}^{1152}

(20)

This early-fusion mechanism captures complementary correlations among textual, acoustic, and visual representations within a shared embedding space. As illustrated in Figure 3 (Level 1), feature-level fusion concatenates $F_{T}$ (768-dim), $F_{A}$ (128-dim), and $F_{V}$ (256-dim) into a 1152-dimensional vector. This vector then passes through two MLP layers (256 → 128) with ReLU activation and dropout (0.3), followed by Softmax classification.

(2): Prediction-Level Fusion

Simultaneously, modality-specific predictions are computed and combined using adaptive weighting:

P_{\mathrm{fusion}} = \alpha \, \sigma\left(W_{A} F_{A}\right) + (1 - \alpha) \, \sigma\left(W_{V} F_{V}\right)

(21)

The prediction-level fusion path (Figure 3, Level 2) computes modality-specific predictions $P_{T} = \sigma(W_{T} \cdot F_{T})$, $P_{A} = \sigma(W_{A} \cdot F_{A})$, and $P_{V} = \sigma(W_{V} \cdot F_{V})$, where $\sigma(\cdot)$ denotes the Softmax function. Audio-visual predictions are combined using dynamic weighting: $P_{AV} = \alpha \cdot P_{A} + (1 - \alpha) \cdot P_{V}$, with $\alpha = 0.6$ under normal conditions (SNR ≥ 15 dB) and $\alpha = 0.3$ when audio quality is degraded (SNR < 15 dB). This late-fusion pathway preserves modality autonomy and improves reliability when one modality is corrupted. The text prediction $P_{T}$ is always included in the final decision. The final prediction integrates outputs from both fusion levels (feature-level and prediction-level), combining joint feature learning with reliability-aware decision fusion.
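The audio-visual weighting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the logits stand in for $W_{A} F_{A}$ and $W_{V} F_{V}$, the SNR value is assumed to be supplied by an external estimator, and the function names are hypothetical. How $P_{T}$ and the feature-level output enter the final decision is not specified in the text, so that step is left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_alpha(snr_db, threshold_db=15.0):
    """Audio weight: 0.3 when SNR is below 15 dB, otherwise 0.6."""
    return 0.3 if snr_db < threshold_db else 0.6

def fuse_audio_visual(audio_logits, visual_logits, snr_db):
    """P_AV = alpha * P_A + (1 - alpha) * P_V with SNR-dependent alpha."""
    alpha = select_alpha(snr_db)
    p_a = softmax(audio_logits)
    p_v = softmax(visual_logits)
    return [alpha * a + (1.0 - alpha) * v for a, v in zip(p_a, p_v)]
```

Because both inputs are probability vectors and the weights sum to one, the fused output is itself a valid probability distribution over the five classes.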

3.6. Robustness Enhancements

To ensure stable performance in real-world scenarios, we incorporate the following regularization and adaptive mechanisms:

(1): Modality Dropout

During training, modality representations are randomly masked:

\tilde{F}_{m} = F_{m} \odot \mathrm{Bernoulli}(0.8), \quad m \in \{T, A, V\}

(22)

This prevents over-reliance on any single modality and improves generalization under missing inputs.
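A minimal sketch of Equation (22) follows. It assumes one Bernoulli draw per modality, so an entire feature vector is either kept or zeroed; the equation could also be read as an element-wise mask, and the text does not say which is intended. The function name and dictionary layout are hypothetical.

```python
import random

def modality_dropout(features, keep_prob=0.8, rng=random):
    """Zero out whole modalities with probability 1 - keep_prob.

    features: dict mapping a modality name ("T", "A", "V") to a list of floats.
    rng: any object with a random() method, e.g. random.Random(seed).
    """
    masked = {}
    for name, vector in features.items():
        # One Bernoulli(keep_prob) draw per modality (assumption of this sketch)
        keep = 1.0 if rng.random() < keep_prob else 0.0
        masked[name] = [keep * v for v in vector]
    return masked
```

The mask would be applied during training only; at inference all modalities are passed through unchanged.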

(2): Dynamic Weight Adjustment

Fusion weights are adapted based on audio signal-to-noise ratio (SNR):

\alpha = \begin{cases} 0.3, & \text{if } \mathrm{SNR}(X_{A}) < 15\ \mathrm{dB} \\ 0.6, & \text{otherwise} \end{cases}

(23)

Lower audio quality reduces its influence in prediction-level fusion.

(3): Compression Ratio

The efficiency gain relative to the heavyweight baseline is quantified as:

CR = \frac{\mathrm{Params}_{\mathrm{heavy}}}{\mathrm{Params}_{\mathrm{light}}} = \frac{742}{79.8} \approx 9.3

(24)

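Here, 742 M is the combined size of the three heavyweight encoders (340 M + 95 M + 307 M) and 79.8 M is the combined size of the three lightweight encoders (66 M + 8.2 M + 5.6 M). This demonstrates a substantial parameter reduction while predictive performance is largely maintained.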
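The ratio in Equation (24) can be checked directly from the encoder sizes listed in Section 3.5. The snippet below is a small arithmetic sketch; the dictionary names are illustrative, and only the three encoders per model are counted, as in the equation.

```python
# Encoder sizes in millions of parameters, as listed in Section 3.5
heavy_encoders = {"BERT-Large": 340.0, "Wav2Vec 2.0": 95.0, "ViT-Large": 307.0}
light_encoders = {"DistilBERT": 66.0, "CNN-BiGRU": 8.2, "MobileViT-XXS": 5.6}

def compression_ratio(heavy, light):
    """Eq. (24): total heavyweight parameters over total lightweight parameters."""
    return sum(heavy.values()) / sum(light.values())

ratio = compression_ratio(heavy_encoders, light_encoders)
print(f"{ratio:.1f}")  # 742 / 79.8, about 9.3
```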

3.7. Classification Network

The fused feature vector undergoes hierarchical non-linear transformation:

H_{1} = \mathrm{ReLU}\left(W_{1} F_{\mathrm{fusion}} + b_{1}\right), \quad H_{1} \in \mathbb{R}^{256}

(25)

H_{2} = \mathrm{ReLU}\left(W_{2} H_{1} + b_{2}\right), \quad H_{2} \in \mathbb{R}^{128}

(26)

\hat{y}_{i} = \mathrm{Softmax}\left(W_{o} H_{2} + b_{o}\right), \quad \hat{y}_{i} \in \mathbb{R}^{5}

(27)

(27)

These layers enable progressive abstraction before final emotion classification across five categories.
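As a rough illustration of Equations (25)–(27), the forward pass can be written without any deep learning library. The weights here are random placeholders rather than trained values, the initialization scheme is an assumption, and dropout is omitted because it is inactive at inference time.

```python
import math
import random

def dense(x, weights, bias):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Random placeholder weights (uniform in +/- 1/sqrt(n_in)) and zero bias."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def classify(f_fusion, layers):
    """Eqs. (25)-(27): 1152 -> 256 -> 128 -> 5 with ReLU, ReLU, Softmax."""
    (w1, b1), (w2, b2), (wo, bo) = layers
    h1 = relu(dense(f_fusion, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    return softmax(dense(h2, wo, bo))

rng = random.Random(0)
layers = [init_layer(1152, 256, rng), init_layer(256, 128, rng), init_layer(128, 5, rng)]
fused = [rng.gauss(0.0, 1.0) for _ in range(1152)]  # stand-in for F_fusion
probs = classify(fused, layers)
```

The output `probs` is a length-5 probability vector, one entry per emotion class.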

3.8. Unimodal Baseline Models

For ablation analysis, three unimodal models are defined below. The complete training procedure is summarized in Algorithm 1.

(1): Text-Only Model (using DistilBERT)

\hat{y}_{\mathrm{text}} = \mathrm{Softmax}\left(W_{T} \cdot \mathrm{DistilBERT}\left(X_{T}\right) + b_{T}\right)

(28)

(2): Audio-Only Model (CNN–BiGRU)

\hat{y}_{\mathrm{audio}} = \mathrm{Softmax}\left(W_{A} \cdot \mathrm{CNN\text{-}BiGRU}\left(X_{A}^{\mathrm{proc}}\right) + b_{A}\right)

(29)

(3): Visual-Only Model (using MobileViT-XXS)

\hat{y}_{\mathrm{visual}} = \mathrm{Softmax}\left(W_{V} \cdot \mathrm{MobileViT}\left(X_{V}^{\mathrm{proc}}\right) + b_{V}\right)

(30)

These baselines quantify the individual contribution of each modality.

Algorithm 1: Multimodal Emotion Recognition Training Pipeline

Input: Dataset $D$, model type $\in \{\text{UMEDNet}, \text{Proposed Fusion}, \text{Unimodal}\}$

Output: Trained model $\theta^{*}$, optimal hyperparameters, evaluation metrics

Initialize parameters $\theta$; split $D \to (D_{\mathrm{train}}\ 70\%,\ D_{\mathrm{val}}\ 15\%,\ D_{\mathrm{test}}\ 15\%)$.
Define search grids: $\eta \in \{10^{-3}, 3 \times 10^{-4}, 10^{-4}\}$, $p \in \{0.2, 0.3, 0.5\}$, $\alpha \in \{0.3, 0.5, 0.6, 0.7\}$.
Set $\mathit{best\_score} \leftarrow 0$, $\mathit{best\_params} \leftarrow \emptyset$, $\mathit{best\_model} \leftarrow \emptyset$.
For each $\eta$ in learning rate grid do:
    For each $p$ in dropout grid do:
        For each $\alpha$ in fusion weight grid do:
            Initialize model parameters $\theta$; set optimizer $\leftarrow$ AdamW($\eta$).
            For epoch $= 1$ to $E_{\max}$ do:
                Set model to training mode.
                For each batch $(X_{T}, X_{A}, X_{V}, y) \in D_{\mathrm{train}}$ do:
                    $X_{T}^{\mathrm{proc}} \leftarrow \mathrm{Tokenize}(X_{T})$.
                    $X_{A}^{\mathrm{proc}} \leftarrow \mathrm{LogMel}(X_{A})$.
                    $X_{V}^{\mathrm{proc}} \leftarrow \mathrm{FaceDetect}(X_{V})$.
                    If model type = UMEDNet then:
                        $F \leftarrow \mathrm{HeavyFusion}(X_{T}^{\mathrm{proc}}, X_{A}^{\mathrm{proc}}, X_{V}^{\mathrm{proc}})$.
                    Else if model type = Proposed Fusion then:
                        $F \leftarrow \mathrm{LightFusion}(X_{T}^{\mathrm{proc}}, X_{A}^{\mathrm{proc}}, X_{V}^{\mathrm{proc}})$.
                        $\tilde{F} \leftarrow \mathrm{Dropout}(F, p)$.
                        $\alpha \leftarrow \mathrm{AdjustBySNR}(X_{A})$.
                    Else:
                        $F \leftarrow \mathrm{Unimodal}(X_{T}^{\mathrm{proc}} \text{ or } X_{A}^{\mathrm{proc}} \text{ or } X_{V}^{\mathrm{proc}})$.
                    End if.
                    $\hat{y} \leftarrow \mathrm{MLP}(F)$.
                    $L \leftarrow \mathrm{CrossEntropy}(y, \hat{y})$.
                    Update parameters: $\theta \leftarrow \theta - \eta \nabla_{\theta} L$.
                End batch loop.
                $\mathit{val\_score} \leftarrow \mathrm{Evaluate}(D_{\mathrm{val}})$.
                If $\mathit{val\_score}$ improved then save checkpoint.
                If early stopping triggered then break.
            End epoch loop.
            $\mathit{test\_score} \leftarrow \mathrm{Evaluate}(D_{\mathrm{test}})$.
            If $\mathit{test\_score} > \mathit{best\_score}$ then:
                $\mathit{best\_score} \leftarrow \mathit{test\_score}$; $\mathit{best\_params} \leftarrow \{\eta, p, \alpha\}$; $\mathit{best\_model} \leftarrow \theta$.
            End if.
        End fusion weight loop.
    End dropout loop.
End learning rate loop.
Return $\mathit{best\_model}, \mathit{best\_params}, \mathit{best\_score}$.
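The outer search loops of Algorithm 1 can be sketched with `itertools.product`. The grids are taken from the algorithm; `train_and_score` is a hypothetical callback standing in for the full training loop. This sketch assumes the callback returns a validation score, which is a swapped-in choice: Algorithm 1 as written compares test scores when selecting the best configuration.

```python
from itertools import product

LEARNING_RATES = (1e-3, 3e-4, 1e-4)
DROPOUT_RATES = (0.2, 0.3, 0.5)
FUSION_WEIGHTS = (0.3, 0.5, 0.6, 0.7)

def grid_search(train_and_score):
    """Try every (lr, dropout, alpha) combination and keep the best one.

    train_and_score(lr, dropout, alpha) -> float is a hypothetical callback
    that trains one model and returns its score (higher is better).
    """
    best_score, best_params = float("-inf"), None
    for lr, dropout, alpha in product(LEARNING_RATES, DROPOUT_RATES, FUSION_WEIGHTS):
        score = train_and_score(lr, dropout, alpha)
        if score > best_score:
            best_score, best_params = score, (lr, dropout, alpha)
    return best_params, best_score

n_configs = len(LEARNING_RATES) * len(DROPOUT_RATES) * len(FUSION_WEIGHTS)
```

With these grids the search covers 3 × 3 × 4 = 36 configurations, each requiring a full training run.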

3.9. Training Methodology

(1): Loss Function

All models optimize categorical cross-entropy with label smoothing:

L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{k}^{\mathrm{smooth}} \log\left(\hat{y}_{ik}\right)

(31)

where

y^{\mathrm{smooth}} = (1 - \varepsilon) y + \frac{\varepsilon}{K}

(32)

with $\varepsilon = 0.1$ and $K = 5$ emotion classes.
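As a worked example of Equations (31) and (32), the snippet below builds the smoothed target for a single sample and evaluates the loss. With ε = 0.1 and K = 5, the true class receives 0.9 + 0.02 = 0.92 and every other class receives 0.02. The function names are illustrative.

```python
import math

def smooth_labels(true_class, num_classes=5, epsilon=0.1):
    """Eq. (32) applied to a one-hot target vector."""
    off_value = epsilon / num_classes
    return [(1.0 - epsilon) + off_value if k == true_class else off_value
            for k in range(num_classes)]

def smoothed_cross_entropy(probs, true_class, epsilon=0.1):
    """Eq. (31) for a single sample (N = 1)."""
    target = smooth_labels(true_class, len(probs), epsilon)
    return -sum(t * math.log(p) for t, p in zip(target, probs))
```

Because the off-class targets are nonzero, the loss penalizes predictions that push any class probability toward zero, which discourages overconfident outputs.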

(2): Optimization (AdamW)

\theta_{t+1} = \theta_{t} - \eta \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \varepsilon}

(33)

(3): Cosine Annealing Learning Rate Schedule

\eta_{t} = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{t}{T_{\max}} \pi\right)\right)

(34)
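Equation (34) translates directly into a small function. The sketch below is illustrative only; the argument names are hypothetical and the values of $\eta_{\max}$, $\eta_{\min}$, and $T_{\max}$ used in the experiments are not stated here.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Eq. (34): decay from lr_max at step 0 to lr_min at step total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * step / total_steps))
```

The schedule starts at `lr_max`, passes through the midpoint of the two rates halfway through training, and ends at `lr_min`. PyTorch provides a comparable scheduler, `torch.optim.lr_scheduler.CosineAnnealingLR`, parameterized by `T_max` and `eta_min`.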

(4): Gradient Clipping

\mathrm{clip}(g, \theta) = \min\left(1, \frac{\theta}{\lVert g \rVert}\right) g, \quad \theta = 1.0
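The clipping rule above rescales the gradient only when its norm exceeds the threshold. A minimal sketch for a flat gradient vector, assuming the Euclidean norm (the norm type is not stated in the text):

```python
import math

def clip_by_norm(grad, max_norm=1.0):
    """Scale grad so its L2 norm is at most max_norm; leave it unchanged otherwise."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(grad)  # nothing to scale
    scale = min(1.0, max_norm / norm)
    return [scale * g for g in grad]
```

For example, a gradient of `[3.0, 4.0]` has norm 5 and is scaled by 1/5, while `[0.3, 0.4]` has norm 0.5 and passes through unchanged. In PyTorch the corresponding utility is `torch.nn.utils.clip_grad_norm_`.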

4. Results and Discussion

This section presents a detailed assessment of the proposed multimodal lightweight emotion recognition system, which was validated through quantitative benchmarking, statistical analysis, ablation studies, and qualitative evaluation. Experimental Setup: All experiments were conducted on a single NVIDIA Tesla T4 GPU (16 GB VRAM) paired with an Intel Xeon CPU (2.20 GHz) and 12.7 GB of system RAM in an Ubuntu 22.04 environment provided via Google Colaboratory. The models were implemented using PyTorch 2.0 with CUDA Runtime 11.8. All inference time measurements were completed using a batch size of 1 over 500 repeated forward passes, following a 50-step warm-up period to eliminate the initialization overhead. All experiments were performed in full FP32 precision without mixed-precision (FP16) inference. The memory figures represent the peak GPU memory allocation during inference, measured using torch.cuda.max_memory_allocated(). All results were generated using 5-fold stratified cross-validation on the UMED Corpus.
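The timing protocol described above (50 warm-up passes followed by 500 timed passes at batch size 1) can be sketched as a small harness. This is an assumption-laden illustration rather than the authors' script: `forward` is a hypothetical zero-argument callable wrapping one inference pass, and FPS is taken as the reciprocal of mean latency.

```python
import time

def benchmark(forward, warmup=50, runs=500):
    """Return the mean latency in milliseconds of a zero-argument callable."""
    for _ in range(warmup):
        forward()  # warm-up passes are not timed
    start = time.perf_counter()
    for _ in range(runs):
        forward()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs

def fps_from_latency(latency_ms):
    """Frames per second implied by a per-sample latency in milliseconds."""
    return 1000.0 / latency_ms
```

Under this definition, a latency of 185 ms corresponds to about 5.4 FPS and 620 ms to about 1.6 FPS. On a GPU, kernels launch asynchronously, so a call such as `torch.cuda.synchronize()` before reading the timer would typically be needed for accurate measurements; whether the reported figures were collected this way is not stated.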

4.1. Holistic Performance and Efficiency Benchmark

This study aims to demonstrate a Pareto-optimal trade-off between accuracy and efficiency. We present a three-stream fusion model for five-class emotion recognition in Urdu that achieves 83.72% accuracy and an F1-score of 83.61%. The UMEDNet baseline results (85.27% accuracy, 85.29% F1-score) are those reported by the original authors and, as shown in Table 3, serve as an upper-bound reference for the models in the current study. UMEDNet could not be replicated as published because the resources required for a single implementation (742 M parameters total, 8.2 GB of memory) exceeded the hardware available for this experimental analysis. This is noted as a limitation of the comparison. The proposed three-stream fusion model achieves 98.2% of the accuracy of UMEDNet while requiring only 23.5% of the trainable parameters (47 M compared to 200 M). In terms of efficiency, this corresponds to:

76.5% parameter reduction
4.4× inference speedup (185 ms vs. 620 ms)
Real-time capability at 5.4 FPS, compared to UMEDNet’s 1.6 FPS

Table 3. Comprehensive Performance–Efficiency Comparison. Note: Params (M) denotes trainable parameters only. Total model size for the Proposed Fusion model is 112 M parameters, of which 66 M (DistilBERT) were frozen during training.

Model	Accuracy (%)	F1 (%)	Params (M)	Inf. Time (ms)	FPS	Memory (GB)	Eff. Score
UMEDNet	85.27	85.29	200	620	1.6	8.2	1.00
Text Only	71.34	70.98	66	452	2.2	2.7	8.92
Audio Only	65.81	64.73	8.2	92	10.9	1.1	15.73
Visual Only	68.95	68.41	5.6	78	12.8	0.9	18.45
Proposed Fusion	83.72	83.61	47	185	5.4	2.1	6.84

The evaluation metrics are formally defined as:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(35)

\mathrm{F1\text{-}Score} = \frac{2 \cdot (\mathrm{Precision} \cdot \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}}

(36)

\mathrm{Efficiency\ Score} = \frac{\mathrm{Accuracy} \times \mathrm{FPS}}{\mathrm{Parameters}}

(37)

The Efficiency Score indicates that the marginal accuracy gap of 1.55 percentage points is outweighed by the computational gains, with the proposed model scoring 6.84 against UMEDNet’s reference value of 1.00 (Table 3).

The accuracy vs. parameter trade-off is further illustrated in Figure 4, which shows the Pareto frontier. The proposed model achieves high accuracy with low parameter count, while bubble sizes indicate real-time FPS capability.

Figure 5 provides a radar chart comparing the multi-metric performance profile of the proposed fusion model against the UMEDNet baseline across normalized evaluation dimensions. The proposed model shows a balanced profile, matching UMEDNet in accuracy while significantly outperforming it in efficiency metrics such as FPS, Efficiency Score, and Memory Efficiency.

4.2. Statistical Significance and Effect Size Validation

All pairwise model comparisons were validated using paired t-tests over the five folds. The proposed multimodal model significantly outperformed all unimodal baselines (p < 0.001), confirming that multimodal fusion yields significant accuracy improvements over any single modality. As shown in Table 4, the comparison between the proposed multimodal model and UMEDNet yielded a p-value of 0.12, indicating no statistically significant difference in accuracy between the two. This is not a negative result: it indicates that the lightweight framework reaches a statistically comparable average accuracy at a lower computational cost, with 76.5% fewer parameters and 4.4 times faster inference. The goal of this study was not to exceed the accuracy of UMEDNet but to remain statistically indistinguishable from it while reducing the resources needed to run the model, and the p-value of 0.12 is consistent with this goal. The effect size (Cohen’s d) between our model and UMEDNet was d = 0.45, which corresponds to an approximately medium effect. The cross-validation mean and standard deviation are defined as:

\mu_{\mathrm{acc}} = \frac{1}{5} \sum_{k=1}^{5} \mathrm{Acc}_{k}

\sigma_{\mathrm{acc}} = \sqrt{\frac{1}{5} \sum_{k=1}^{5} \left(\mathrm{Acc}_{k} - \mu_{\mathrm{acc}}\right)^{2}}

(38)
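The fold-level mean and standard deviation defined above can be computed as follows. The fold accuracies in the usage line are made-up placeholder values for illustration only, not results from this study.

```python
import math

def fold_mean_std(fold_accuracies):
    """Mean and population standard deviation (divide by the number of folds)."""
    k = len(fold_accuracies)
    mu = sum(fold_accuracies) / k
    sigma = math.sqrt(sum((a - mu) ** 2 for a in fold_accuracies) / k)
    return mu, sigma

# Placeholder accuracies for five folds (hypothetical, not from the paper)
mu, sigma = fold_mean_std([80.0, 82.0, 84.0, 86.0, 88.0])
```

Note that the definition divides by 5 rather than 4, so it matches `statistics.pstdev` rather than the sample estimator `statistics.stdev`.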

4.3. Component-Wise Contribution and Ablation Analysis

The ablation study quantifies the contribution of each architectural component. Removing the text modality results in the largest degradation (−8.54%), confirming its dominant semantic contribution. Dual-level fusion provides a +1.62% improvement over simple concatenation, which we consider a stable architectural contribution rather than experimental noise for the reasons given below.

The dual-level fusion improvement (+1.62% over concatenation-only) has p = 0.04, indicating statistical significance at the α = 0.05 level. While this improvement falls at the upper bound of the full model’s fold-level range (1.62%), the consistent directional improvement across all five folds supports the architectural contribution. By contrast, the text removal degradation (−8.54%) exceeds the fold-level range by a large margin. All ablation variants were assessed using the 5-fold stratified cross-validation protocol described in Section 4. The accuracy values in Table 5 are averages across the five folds. The 95% confidence intervals were computed from the five-fold standard deviations, and p-values were derived from paired t-tests comparing each variant to the full model. Figure 6 visualizes the ablation study results as a bar chart with 95% confidence intervals.

4.4. Computational Footprint Decomposition

The internal parameter and contribution breakdown is given in Table 6. The computational breakdown is visualized in Figure 7, which shows the parameter distribution and performance contribution across model components.

4.5. Qualitative Error Analysis and Confusion Patterns

The normalized confusion matrix in Figure 8 offers insight into where the model’s decision boundaries lie. Anger and Happy are recognized with recalls of 89% and 87%, respectively, because both have distinctive high-intensity expressions in facial action units (AUs) and prosodic features. The most common source of error is confusion between Sadness and Neutral: both have low arousal levels and overlapping profiles, leading to misclassification 28% of the time. The other primary confusion is between Love and Happy, with misclassification rates of 22–31%, because both share positive-valence characteristics that frequently include smiling and upbeat vocal tones.

4.6. Inference Speed and Real-Time Capability Analysis

Figure 9 illustrates the practical advantage of our approach in terms of inference speed. Compared with UMEDNet, our system provides 4.4× faster inference while remaining comparable in accuracy. At 5.4 frames per second (FPS), it supports near-real-time operation in interactive systems and exceeds the minimum threshold of 5 FPS adopted here for real-time operation.

4.7. Cross-Lingual Generalization Potential

The following experiment is presented as a preliminary feasibility test only, not as fully validated generalization. Comprehensive cross-lingual validation across multiple low-resource languages remains planned as future work. To assess cross-lingual transferability, we conducted a preliminary investigation of our framework by fine-tuning only the fusion and classifier layers on 20% of the English IEMOCAP dataset, with an accuracy of 78.3%. We recognize that this is a minimal protocol and that a fully rigorous cross-lingual generalization study would include (1) zero-shot transfer, with no fine-tuning data on the target language; (2) full fine-tuning using 100% of the target dataset; and (3) few-shot experiments with different fractions of data (i.e., 5%, 10%, 20%, 50%, and 100%) to explore the learning curve. The selection of 20% fine-tuning was made to simulate low-resource situations, as the focus of this study is on resource-constrained environments and is not meant to represent comprehensive validation. The 78.3% result, however, provides encouraging preliminary evidence that the lightweight fusion architecture is capable of transferring across languages without requiring retraining of the backbone encoders. A thorough cross-lingual generalization study with multiple low-resource languages (i.e., Arabic, Hindi, and Bengali) is planned as the immediate next step and will be described in a follow-up publication.

4.8. Additional Comparative Analysis

As shown in Table 7, the proposed fusion model achieves 5.4 FPS with 2.1 GB memory footprint, making it suitable for real-time edge deployment. In contrast, UMEDNet requires 8.2 GB memory and achieves only 1.6 FPS, making it unsuitable for resource-constrained environments. Referring to Table 7, the proposed model consumes an estimated ~1100 mAh/h, which is approximately 74% less than UMEDNet (~4200 mAh/h), demonstrating significant energy efficiency for battery-powered edge devices. While unimodal audio-only (10.9 FPS) and visual-only (12.8 FPS) models offer higher frame rates, they achieve significantly lower accuracy (65.81% and 68.95% respectively, see Table 3), justifying the multimodal approach despite moderate efficiency trade-offs.

4.9. Modality Degradation and Robustness Analysis

To evaluate the robustness of our framework under real-world conditions where modalities may be degraded or missing, we conducted systematic degradation experiments. Each modality was independently corrupted with noise or blur at varying intensities, and accuracy was measured.

The model demonstrates graceful degradation: even with complete modality absence, accuracy remains above 70% due to the dual-level fusion design as shown in Table 8. Text modality removal causes the largest drop, confirming its dominant role in emotion recognition. Audio and video degradations show similar sensitivity (3–4% drop at moderate degradation). These results validate the robustness benefits of our fusion strategy and modality dropout regularization.

5. Discussion

The experiments show that the proposed lightweight multimodal architecture offers a good balance of predictive performance and computational efficiency for Urdu emotion recognition. Although UMEDNet has a slight edge in accuracy (85.27% vs. 83.72%), the difference is not statistically significant, and the proposed framework has 76.5% fewer parameters than UMEDNet and 4.4× faster inference, indicating an improved efficiency–accuracy trade-off.

5.1. Performance–Efficiency Trade Off

Combining DistilBERT and MobileViT-XXS enables deployment on resource-constrained or embedded devices with little loss in predictive performance. The absolute accuracy difference of 1.55 percentage points is very small compared to the large reduction in latency, memory footprint, and number of parameters. This finding supports the core idea behind this study: by combining lightweight architectures with robust fusion strategies and regularization techniques, near state-of-the-art performance can be reached while remaining practical to deploy.

The evidence indicates that multimodal signals are complementary to each other [37].

The text modality has the strongest semantic grounding and contributes the largest performance gain in the ablation analysis.
The audio modality captures the prosodic and paralinguistic features of the speech signal (e.g., pitch, intensity).
The visual modality captures facial expressions directly and provides non-verbal affective cues.

The proposed dual-level fusion integrates these complementary modalities by concatenating them at the feature level and combining their outputs at the prediction level. Together with modality dropout, this improves the robustness and stability of the multimodal system under noise and partially missing modalities.

Conceptual Clarification: Our framework recognizes expressed emotional signals (facial expressions, speech prosody, textual content) rather than internal emotional states. This distinction is critical because emotional expression can be voluntarily modulated (e.g., masking true feelings) or culturally mediated (e.g., display rules that discourage certain expressions). Results should be interpreted as estimates of observable emotional expression, not as direct measures of subjective affective experience.

5.2. Limitations

While the proposed framework has several advantages, it also has limitations:

(1): Modalities are processed sequentially; processing them in parallel may reduce latency by an estimated 30–40%.
(2): DistilBERT makes up 58.9% of the overall number of model parameters and is therefore the major source of computational bottleneck for the model. Future work could explore attention head pruning techniques [38] or alternative efficient transformer architectures to further reduce this bottleneck while maintaining semantic understanding capabilities.
(3): The five-class emotion taxonomy is constrained by UMED corpus design. The observed Sad/Neutral (28%) and Love/Happy (31%) confusion rates reflect genuinely overlapping emotional characteristics rather than model failure. Extending to six- or seven-class schemes (e.g., Ekman’s basic emotions) would require corpus re-annotation and is planned as future work. Investigating whether finer-grained class separation or hierarchical emotion classification can reduce confusion rates is a promising research direction.
(4): Individual fold-level standard deviations for all ablation variants were not retained in experimental logs, preventing formal confidence interval reporting for each component. Future work will maintain complete per-fold records to enable full statistical validation of each architectural contribution.
(5): UMED Corpus Limitations: The UMED corpus [1] has several inherent limitations that affect result interpretation. First, the 3850 samples were collected from online interviews and publicly available videos, which may not fully represent spontaneous emotional expressions in naturalistic Urdu conversation. Second, the class distribution is imbalanced (Love: 13.7%, Anger: 25.0%), which may bias models toward majority classes; although stratified sampling maintains this distribution across splits, performance on Love remains lower than other classes. Third, all recordings were made in controlled studio conditions (noise < 30 dB, consistent lighting), which do not reflect real-world deployment conditions with ambient noise and variable illumination. These limitations provide context for interpreting our results and motivate future data collection in more naturalistic settings.

These limitations point to opportunities for architectural improvement and for better overall performance of this modelling approach.

6. Conclusions

This study proposes a lightweight multimodal framework for emotion recognition in the low-resource Urdu language that bridges the gap between high predictive accuracy and practical deployment. The framework substantially reduces the computational resources required for recognition while maintaining comparable performance, achieving 83.72% accuracy using only 47 million trainable parameters, a 76.5% reduction relative to the heavyweight UMEDNet baseline, together with 4.4× faster inference. The dual-level fusion architecture combines feature-level concatenation with prediction-level integration and is trained with modality dropout, providing robust and effective interaction between the modalities. The real-time performance of this system (5.4 FPS), in conjunction with its low memory footprint (2.1 GB), makes it suitable for edge deployment in low-resource environments. The preliminary cross-lingual experiment on IEMOCAP, which reached an accuracy of 78.3% using only 20% of the data for fine-tuning, provides initial evidence of the design’s language-agnostic potential. The next steps are focused on expanding the diversity of datasets, developing adaptive fusion methods with dynamic weight optimization, investigating cross-cultural emotion recognition and transfer learning across domains, implementing parallel modality processing to decrease inference latency, and deploying the framework on a variety of edge platforms for real-world applications in areas such as mental health monitoring and educational technology systems. It should be noted that the framework recognizes expressed emotional signals rather than internal emotional states, and results should be interpreted accordingly.

Author Contributions

Conceptualization, M.A. (Muhammad Azhar) and A.A.; Methodology, M.A. (Muhammad Azhar), A.A. and M.A. (Muhammad Arman); Software, A.A. and M.A. (Muhammad Arman); Validation, M.A. (Muhammad Azhar), A.A., M.A. (Muhammad Arman) and D.A.D.; Formal Analysis, M.A. (Muhammad Azhar) and A.A.; Investigation, A.A. and M.A. (Muhammad Arman); Resources, D.A.D.; Data Curation, M.A. (Muhammad Azhar) and A.A.; Writing—Original Draft Preparation, M.A. (Muhammad Azhar) and A.A.; Writing—Review and Editing, M.A. (Muhammad Arman), D.A.D. and M.A. (Muhammad Azhar); Visualization, A.A. and M.A. (Muhammad Arman); Supervision, M.A. (Muhammad Azhar); Project Administration, D.A.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Hong Kong Shue Yan University, Hong Kong SAR, China under University Research Grant (URG) with the project number URG/24/01.

Institutional Review Board Statement

Ethical review and approval were waived for this study as the data were obtained entirely from the publicly available Zenodo dataset “UMED Corpus”, licensed under CC BY 4.0 and published by the National University of Computer and Emerging Sciences. The data were not collected directly from human participants by the authors, contain no personally identifiable or sensitive private information, and are authorized for redistribution and use in academic research under the dataset’s terms of use.

Informed Consent Statement

Participant consent was waived as the data were obtained from a publicly available dataset titled “UMED Corpus” hosted on Zenodo (https://doi.org/10.5281/zenodo.13988610), published by the National University of Computer and Emerging Sciences, and licensed under CC BY 4.0. The dataset comprises multimodal Urdu data (audio, text, and video) created for academic research in emotion detection, and does not contain any personally identifiable information. It is openly accessible and explicitly made available for academic and non-commercial research purposes. No direct interaction with human participants was conducted by the authors.

Data Availability Statement

The UMED Corpus dataset used in this study is publicly available at https://doi.org/10.5281/zenodo.13988610.

Conflicts of Interest

The authors declare no conflict of interest.

References

Majeed, A.; Mujtaba, H. UMEDNet: A multimodal approach for emotion detection in the Urdu language. PeerJ Comput. Sci. 2025, 11, e2861.
Tzirakis, P.; Trigeorgis, G.; Nicolaou, M.A.; Schuller, B.W.; Zafeiriou, S. End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Signal Process. 2017, 11, 1301–1309.
Mamieva, D.; Abdusalomov, A.B.; Kutlimuratov, A.; Muminov, B.; Whangbo, T.K. Multimodal emotion detection via attention-based fusion of extracted facial and speech features. Sensors 2023, 23, 5475.
Mustaqeem; Kwon, S. 1D-CNN: Speech emotion recognition system using a stacked network with dilated CNN features. Comput. Mater. Contin. 2021, 67, 3959–3977.
Bashir, M.F.; Javed, A.R.; Arshad, M.U.; Gadekallu, T.R.; Shahzad, W.; Beg, M.O. Context-aware emotion detection from low-resource Urdu language using deep neural network. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–30.
Akhtar, M.Z.; Jahangir, R.; Ain, Q.; Nauman, M.A.; Uddin, M.; Ullah, S.S. UrduSER: A comprehensive dataset for speech emotion recognition in Urdu language. Data Brief 2025, 60, 111627.
Soleymani, M.; Pantic, M.; Pun, T. Multimodal emotion recognition in response to videos. IEEE Trans. Affect. Comput. 2011, 3, 211–223.
Illendula, A.; Sheth, A. Multimodal emotion classification. In Proceedings of the Companion World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019.
Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In Proceedings of the NeurIPS Workshop on Energy Efficient Deep Learning, Vancouver, BC, Canada, 8 December 2019.
Aliyu, Y.; Sarlan, A.; Danyaro, K.U.; Rahman, A.S.B.A.; Abdullahi, M. Sentiment analysis in low-resource settings: A comprehensive review of approaches, languages, and data sources. IEEE Access 2024, 12, 66883–66909.
Zhao, R.; Jiang, X.; Yu, F.R.; Leung, V.C.; Wang, T.; Zhang, S. Leveraging cross-attention transformer and multi-feature fusion for cross-linguistic speech emotion recognition. IEEE Internet Things J. 2025, 12, 50653–50664.
Zhang, T.; Tan, Z. Survey of deep emotion recognition in dynamic data using facial, speech and textual cues. Multimed. Tools Appl. 2024, 83, 66223–66262.
Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the NeurIPS, Virtual, 6–12 December 2020.
Bharti, S.K.; Varadhaganapathy, S.; Gupta, R.K.; Shukla, P.K.; Bouye, M.; Hingaa, S.K.; Mahmoud, A. Text-based emotion recognition using deep learning approach. Comput. Intell. Neurosci. 2022, 2022, 2645381.
Tang, Y.; Hu, Y.; He, L.; Huang, H. A bimodal network based on audio-text-interactional-attention with ArcFace loss for speech emotion recognition. Speech Commun. 2022, 143, 21–32.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL, Minneapolis, MN, USA, 2–7 June 2019.
Makhmudov, F.; Kultimuratov, A.; Cho, Y.I. Enhancing multimodal emotion recognition through attention mechanisms in BERT and CNN architectures. Appl. Sci. 2024, 14, 4199.
Boitel, E.; Mohasseb, A.; Haig, E. MIST: Multimodal emotion recognition using DeBERTa for text, Semi-CNN for speech, ResNet-50 for facial, and 3D-CNN for motion analysis. Expert Syst. Appl. 2025, 270, 126236.
Lian, H.; Lu, C.; Li, S.; Zhao, Y.; Tang, C.; Zong, Y. A survey of deep learning-based multimodal emotion recognition: Speech, text, and face. Entropy 2023, 25, 1440.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the ICLR, Virtual, 3–7 May 2021.
Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.
Zhu, Z.; Mao, K. Knowledge-based BERT word embedding fine-tuning for emotion recognition. Neurocomputing 2023, 552, 126488.
Abdullah, S.M.; Ameen, S.Y.; Sadeeq, M.A.; Zeebaree, S. Multimodal emotion recognition using deep learning. J. Appl. Sci. Technol. Trends 2021, 2, 73–79.
Middya, A.I.; Nag, B.; Roy, S. Deep learning-based multimodal emotion recognition using model-level fusion of audio-visual modalities. Knowl. Based Syst. 2022, 244, 108580.
Priyasad, D.; Fernando, T.; Denman, S.; Sridharan, S.; Fookes, C. Attention driven fusion for multi-modal emotion recognition. arXiv 2020, arXiv:2009.10991.
Pan, Z.; Luo, Z.; Yang, J.; Li, H. Multi-modal attention for speech emotion recognition. arXiv 2020, arXiv:2009.04107.
Balaji, R.L.; Thiruvenkataswamy, C.S.; Batumalay, M.; Duraimutharasan, N.; Devadas, A.D.T.; Yingthawornsuk, T. A study of unified framework for extremism classification, ideology detection, propaganda analysis, and flagged data detection using transformers. J. Appl. Data Sci. 2025, 6, 1791–1810.
Zaidi, S.A.M.; Latif, S.; Qadir, J. Enhancing cross-language multimodal emotion recognition with dual attention transformers. IEEE Open J. Comput. Soc. 2024, 5, 684–693.
Schmitz, M.; Ahmed, R.; Cao, J. Bias and fairness on multimodal emotion detection algorithms. arXiv 2022, arXiv:2205.08383.
Caschera, M.C.; Grifoni, P.; Ferri, F. Emotion classification from speech and text in videos using a multimodal approach. Multimodal Technol. Interact. 2022, 6, 28.
Raza, M.A.; Fränti, P. A hierarchical gamma mixture model-based method for classification of high-dimensional data. Entropy 2019, 21, 906.
Teneva, E.V. Emotionalization of the 2021–2022 global energy crisis coverage: Analyzing the rhetorical appeals as manipulation means in the mainstream media. Journal. Media 2025, 6, 14.
Azhar, M.; Amjad, A.; Dewi, D.A.; Kasim, S. A systematic review and experimental evaluation of classical and transformer-based models for Urdu abstractive text summarization. Information 2025, 16, 784.
Cevher, D.; Zepf, S.; Klinger, R. Towards multimodal emotion recognition in German speech events in cars using transfer learning. arXiv 2019, arXiv:1909.02764.
Azhar, M.; Amjad, A.; Dewi, D.A.; Kasim, S. Efficient transformer-based abstractive Urdu text summarization through selective attention pruning. Information 2025, 16, 991.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017.
Padi, S.; Sadjadi, S.O.; Manocha, D.; Sriram, R.D. Multimodal emotion recognition using transfer learning from speaker recognition and BERT-based models. arXiv 2022, arXiv:2202.08974.
Cheema, A.S.; Azhar, M.; Arif, F.; ul Haq, Q.M.; Sohail, M.; Iqbal, A. EGPT-SPE: Story point effort estimation using improved GPT-2 by removing inefficient attention heads. Appl. Intell. 2025, 55, 994.

Figure 1. Parallel preprocessing pipelines for text, audio, and visual modalities. The three branches operate independently on raw Urdu inputs. Text: WordPiece tokenization (30 k vocabulary, max length 128). Audio: 80-band Log-Mel spectrogram with 25 ms window, 10 ms hop. Video: MTCNN face detection, crop, and resize to 224 × 224 × 3 (note: video branch extracts only face frames; no OCR/ASR conversion to text). Temporal alignment synchronizes all modalities at 40 ms resolution (25 fps video → 10 ms audio hop → 1 token per 40 ms). The DistilBERT encoder (66 M parameters) was frozen during training; only 47 M parameters (CNN-BiGRU, MobileViT-XXS, fusion layers, classifier) were trainable. Inference measurements: NVIDIA Tesla T4 GPU (16 GB VRAM), batch size = 1, FP32 precision, 500 forward passes after 50-step warmup.
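
The frame arithmetic behind the audio branch and the 40 ms alignment step can be checked with a short sketch. The 16 kHz sampling rate and the mean-pooling of four audio frames per video frame are assumptions made for illustration; the caption itself specifies only the 25 ms window, the 10 ms hop, the 80 Mel bands, and the 25 fps video rate. The function names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16_000          # assumption: not stated in the caption
WIN_MS, HOP_MS = 25, 10       # from the caption
VIDEO_FPS = 25                # from the caption (one frame every 40 ms)


def num_audio_frames(n_samples: int) -> int:
    """Number of full analysis windows that fit without padding."""
    win = SAMPLE_RATE * WIN_MS // 1000   # 400 samples at 16 kHz
    hop = SAMPLE_RATE * HOP_MS // 1000   # 160 samples at 16 kHz
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


def pool_to_video_rate(audio_feats: np.ndarray) -> np.ndarray:
    """Mean-pool 10 ms audio frames into 40 ms steps (hypothetical alignment)."""
    frames_per_step = (1000 // VIDEO_FPS) // HOP_MS   # 40 ms / 10 ms = 4
    n_steps = audio_feats.shape[0] // frames_per_step
    trimmed = audio_feats[: n_steps * frames_per_step]
    return trimmed.reshape(n_steps, frames_per_step, -1).mean(axis=1)


one_second = num_audio_frames(SAMPLE_RATE)   # frames in 1 s of audio
mel = np.zeros((one_second, 80))             # placeholder 80-band log-Mel features
aligned = pool_to_video_rate(mel)            # one row per 40 ms step
```

Under these assumptions, one second of audio yields 98 unpadded frames, which pool into 24 complete 40 ms steps; a padded spectrogram would give slightly different counts.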

Figure 2. Architecture of (left) CNN-BiGRU audio encoder and (right) MobileViT-XXS visual encoder. The audio encoder uses convolutional layers for spectral feature extraction followed by bidirectional GRU layers for temporal modeling. The visual encoder combines MobileNetV2 blocks with transformer blocks for efficient facial expression analysis. Solid arrows indicate data flow; dashed arrows indicate folding operations.

Figure 3. Dual-level multimodal fusion architecture. The framework integrates two complementary fusion paths. Level 1 (feature-level fusion): Features $F_{T} \in \mathbb{R}^{768}$, $F_{A} \in \mathbb{R}^{128}$, and $F_{V} \in \mathbb{R}^{256}$ are concatenated into an $\mathbb{R}^{1152}$ vector, passed through two MLP layers (256 → 128) with ReLU activation and dropout (0.3), followed by a Softmax layer to produce $P_{feature} \in \mathbb{R}^{5}$. Level 2 (prediction-level fusion): Modality-specific predictions are computed as $P_{T} = \mathrm{Softmax}(W_{T} \cdot F_{T})$, $P_{A} = \mathrm{Softmax}(W_{A} \cdot F_{A})$, and $P_{V} = \mathrm{Softmax}(W_{V} \cdot F_{V})$. Audio-visual predictions are combined via dynamic weighting: $P_{AV} = \alpha \cdot P_{A} + (1 - \alpha) \cdot P_{V}$, where $\alpha = 0.6$ under normal conditions (SNR ≥ 15 dB) and $\alpha = 0.3$ when audio quality is degraded (SNR < 15 dB). The final prediction integrates both levels: $P_{final} = (P_{feature} + P_{prediction}) / 2$. During training, modality dropout ($\tilde{F}_{m} = F_{m} \odot \mathrm{Bernoulli}(0.8)$, for $m \in \{T, A, V\}$) is applied to improve robustness. Total fusion parameters: 33.2 M.
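
The two fusion paths in this caption can be sketched in a few lines of NumPy. This is an illustrative forward pass only, not the authors' implementation: the weights are random placeholders, bias terms are omitted, and, because the caption does not define how $P_{prediction}$ is formed from $P_{T}$ and $P_{AV}$, the sketch assumes a simple mean of the two. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 5
DIMS = {"T": 768, "A": 128, "V": 256}   # feature sizes from the caption


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()


# Randomly initialised weights stand in for trained parameters (hypothetical).
W1 = rng.normal(0, 0.02, (256, sum(DIMS.values())))   # 1152 -> 256
W2 = rng.normal(0, 0.02, (128, 256))                  # 256 -> 128
W_out = rng.normal(0, 0.02, (NUM_CLASSES, 128))       # 128 -> 5
W_mod = {m: rng.normal(0, 0.02, (NUM_CLASSES, d)) for m, d in DIMS.items()}


def select_alpha(snr_db: float) -> float:
    """Audio weight from the caption: 0.6 if SNR >= 15 dB, else 0.3."""
    return 0.6 if snr_db >= 15.0 else 0.3


def dual_level_fusion(feats: dict, snr_db: float) -> np.ndarray:
    # Level 1: concatenate, two ReLU layers, softmax (dropout is off at inference).
    x = np.concatenate([feats["T"], feats["A"], feats["V"]])
    h = np.maximum(W1 @ x, 0.0)
    h = np.maximum(W2 @ h, 0.0)
    p_feature = softmax(W_out @ h)

    # Level 2: per-modality predictions and SNR-dependent audio-visual weighting.
    p = {m: softmax(W_mod[m] @ feats[m]) for m in DIMS}
    alpha = select_alpha(snr_db)
    p_av = alpha * p["A"] + (1.0 - alpha) * p["V"]

    # Assumption: P_prediction is taken as the mean of the text and
    # audio-visual predictions; the caption leaves this step unspecified.
    p_prediction = 0.5 * (p["T"] + p_av)

    # Final prediction: equal-weight average of both fusion levels.
    return 0.5 * (p_feature + p_prediction)


feats = {m: rng.normal(size=d) for m, d in DIMS.items()}
p_final = dual_level_fusion(feats, snr_db=20.0)
```

Because every intermediate term is a probability vector and all combinations are convex, `p_final` is itself a valid distribution over the five emotion classes.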

Figure 4. Pareto frontier analysis: accuracy vs. model parameters. The proposed model occupies the optimal top-left quadrant (high accuracy with low parameter count), demonstrating clear Pareto superiority over the resource-intensive UMEDNet baseline. Bubble sizes represent FPS, highlighting the real-time capability advantage.

Figure 5. Radar chart illustrating multi-metric performance profile across normalized evaluation dimensions (0–1 scale). The proposed fusion model (blue dashed line) shows a well-balanced profile closely matching UMEDNet (red solid line) in accuracy metrics while significantly outperforming it in efficiency metrics (FPS, Efficiency Score, Memory Efficiency).

Figure 6. Comparison of model accuracy across ablation variants. The full proposed model achieves the highest accuracy (83.72%). Removing the text modality yields the lowest accuracy (75.18%), followed by removing the visual modality (77.65%) and the audio modality (79.41%). The variants without modality dropout (82.95%), without prediction-level fusion (82.88%), and with concatenation-only fusion (82.10%) perform close to the full model.

Figure 7. Computational breakdown analysis: (Left) Parameter distribution across model components (total 112 M parameters, 47 M trainable). (Right) Performance contribution percentage of each component. The Fusion Network constitutes 28.8% of parameters but contributes 34.7% to final accuracy, demonstrating its efficiency.

Figure 8. Normalized confusion matrix for the 5-class emotion classifier. High diagonal values indicate strong per-class recognition, with particular strengths in Anger (89%) and Happy (87%). Notable confusion is observed between Sad and Neutral (28%) and between Love and Happy (31%), reflecting inherent emotional ambiguity in low-arousal and positive valence categories.

Figure 9. Inference speed analysis. (a) Inference time comparison showing that the proposed fusion model (185 ms) achieves 4.4× faster inference than UMEDNet (620 ms). The green dashed line indicates the real-time threshold (200 ms). (b) FPS comparison showing that the proposed fusion model achieves 5.4 FPS, exceeding the real-time threshold of 5 FPS. MobileViT (12.8 FPS) and CNN-BiGRU (10.9 FPS) achieve higher frame rates but lower accuracy, justifying the multimodal approach.

Table 1. UMED Corpus specifications and statistical distribution.

Parameter	Value	Remarks
Total Duration	17 h	Continuous recordings from online interviews
Number of Samples	3850	Each sample: text + audio + video (synchronized)
Emotion Classes	5	Anger, Happy, Sad, Neutral, Love
Number of Speakers	142	Native Urdu speakers, diverse backgrounds
Male/Female Ratio	58%/42%	Gender balance maintained across emotion classes
Age Range	18–65 years	Diverse age representation
Recording Environment	Controlled studio	Noise < 30 dB, consistent lighting
Annotation Agreement (Cohen’s κ)	0.80	High inter-annotator reliability
Data Splits (Train/Val/Test)	70%/15%/15%	Stratified sampling across emotion categories
Audio Quality (SNR)	>35 dB	All samples meet quality threshold
Face Detection Confidence	>0.90	MTCNN face detection threshold
Temporal Alignment	±50 ms	Modality synchronization tolerance

Table 2. Emotion-wise distribution in the UMED Corpus.

Emotion	Count	Percentage (%)	Duration (h)
Anger	2068	25.0	4.25
Happy	1771	21.4	3.64
Sad	1624	19.6	3.34
Neutral	1680	20.3	3.45
Love	1135	13.7	2.33
Total	8278	100.0	17.01

Table 4. Statistical Significance Results for Pairwise Comparisons.

Comparison	p-Value	Effect Size (Cohen’s d)	Interpretation
Proposed vs. Text Only	<0.001	1.82	Large effect
Proposed vs. Audio Only	<0.001	2.15	Large effect
Proposed vs. Visual Only	<0.001	1.94	Large effect
Proposed vs. UMEDNet	0.12	0.45	No significant difference
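
The table reports Cohen's d without stating how it was computed. A minimal sketch follows, assuming the common pooled-standard-deviation definition and Cohen's conventional magnitude thresholds (0.2, 0.5, 0.8); the per-run accuracies are made-up placeholders, not values from the study, and the function names are hypothetical.

```python
import numpy as np


def cohens_d(a, b) -> float:
    """Cohen's d with a pooled standard deviation (assumed definition)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


def magnitude(d: float) -> str:
    """Conventional labels for |d| (Cohen's rule-of-thumb thresholds)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Hypothetical per-run accuracies (%), for illustration only.
fusion_runs = [83.1, 83.9, 84.2, 83.5, 83.9]
unimodal_runs = [82.4, 83.0, 83.6, 82.9, 83.1]
d = cohens_d(fusion_runs, unimodal_runs)
label = magnitude(d)
```

Note that the magnitude label describes effect size only; whether a difference is statistically significant is a separate question answered by the p-value column.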

Table 5. Ablation Study Results with 95% Confidence Intervals.

Model Variant	Accuracy (%)	Δ Accuracy	95% CI	p-Value (vs. Full)
Full Proposed Model	83.72	—	[82.91, 84.53]	—
w/o Text Modality	75.18	−8.54	[74.21, 76.15]	<0.001
w/o Visual Modality	77.65	−6.07	[76.58, 78.72]	<0.001
w/o Audio Modality	79.41	−4.31	[78.33, 80.49]	<0.001
w/o Modality Dropout	82.95	−0.77	[82.02, 83.88]	0.08
w/o Prediction-Level Fusion	82.88	−0.84	[81.95, 83.81]	0.06
Concatenation Only	82.10	−1.62	[81.15, 83.05]	0.04

Table 6. Computational Component Analysis.

Component	Params (M)	% of Total	Performance Contribution (%)
DistilBERT (Text Encoder)	66	58.9	32.1
CNN-BiGRU (Audio Encoder)	8.2	7.3	18.7
MobileViT-XXS (Visual Encoder)	5.6	5.0	14.5
Fusion + Classifier	32.2	28.8	34.7
Total	112.0	100	100

Table 7. Edge Deployment Efficiency Comparison for all models.

Model	Latency (ms)	FPS	Memory (GB)	Battery (mAh/h)	Deployable?
UMEDNet	620	1.6	8.2	~4200	No
Text Only	452	2.2	2.7	~1800	Limited
Audio Only	92	10.9	1.1	~650	Yes
Visual Only	78	12.8	0.9	~520	Yes
Proposed Fusion	185	5.4	2.1	~1100	Yes
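
The FPS column follows from the latency column as 1000 / latency in ms. The sketch below recomputes it, applies the 200 ms real-time threshold named in the Figure 9 caption, and derives the latency ratio between UMEDNet and the proposed model. The dictionary and function names are illustrative.

```python
# Latencies (ms) copied from Table 7.
latency_ms = {
    "UMEDNet": 620,
    "Text Only": 452,
    "Audio Only": 92,
    "Visual Only": 78,
    "Proposed Fusion": 185,
}
REAL_TIME_MS = 200  # real-time threshold from the Figure 9 caption (5 FPS)


def fps(ms: float) -> float:
    """Frames per second implied by a per-sample latency in milliseconds."""
    return 1000.0 / ms


def latency_ratio(baseline: str, model: str) -> float:
    """How many times longer the baseline takes than the model."""
    return latency_ms[baseline] / latency_ms[model]


summary = {
    name: {"fps": round(fps(ms), 1), "real_time": ms <= REAL_TIME_MS}
    for name, ms in latency_ms.items()
}
ratio_vs_umednet = latency_ratio("UMEDNet", "Proposed Fusion")
```

Here `ratio_vs_umednet` is the plain quotient of the two tabulated latencies (620 / 185); a speed-up measured under other conditions need not equal it.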

Table 8. Modality Degradation Analysis Results.

Degradation Condition	Accuracy (%)	Δ from Full Model
Full Model (all modalities)	83.72	—
Audio: Gaussian noise (SNR = 10 dB)	80.15	−3.57
Audio: Gaussian noise (SNR = 0 dB)	75.43	−8.29
Video: Gaussian blur (σ = 2.0)	79.88	−3.84
Video: Gaussian blur (σ = 5.0)	74.21	−9.51
Text: random word dropout (30%)	76.94	−6.78
Text: random word dropout (50%)	71.23	−12.49
Missing audio (zero input)	78.65	−5.07
Missing video (zero input)	77.92	−5.80
Missing text (zero input)	70.33	−13.39
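
The audio degradation conditions can be reproduced in outline as follows. This is a sketch of one standard way to add white Gaussian noise at a target SNR and to simulate a missing modality with a zero input; the source does not describe its exact procedure, and the 16 kHz test tone and function names are assumptions for illustration.

```python
import numpy as np


def add_noise_at_snr(signal: np.ndarray, snr_db: float, rng) -> np.ndarray:
    """Add white Gaussian noise scaled so that the expected SNR equals snr_db."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def measured_snr_db(clean: np.ndarray, noisy: np.ndarray) -> float:
    """Empirical SNR in dB of a noisy signal relative to its clean version."""
    noise = noisy - clean
    return float(10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2)))


def zero_modality(x: np.ndarray) -> np.ndarray:
    """Simulate a missing modality by replacing its input with zeros."""
    return np.zeros_like(x)


rng = np.random.default_rng(0)
t = np.arange(16_000) / 16_000              # 1 s at an assumed 16 kHz
clean = np.sin(2 * np.pi * 220 * t)         # placeholder test tone
noisy_10db = add_noise_at_snr(clean, 10.0, rng)
noisy_0db = add_noise_at_snr(clean, 0.0, rng)
missing_audio = zero_modality(clean)
```

The realised SNR of any single draw differs slightly from the target because the noise power is itself a sample estimate; the gap shrinks as the clip gets longer.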

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
