 
 
Article

Dual-Distillation Vision-Language Model for Multimodal Emotion Recognition in Conversation with Quantized Edge Deployment

Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(6), 3103; https://doi.org/10.3390/app16063103
Submission received: 27 February 2026 / Revised: 15 March 2026 / Accepted: 20 March 2026 / Published: 23 March 2026
(This article belongs to the Special Issue Multimodal Emotion Recognition and Affective Computing)

Abstract

Multimodal Emotion Recognition in Conversation (ERC) has attracted attention as a key technology in human–computer interaction, mental healthcare, and intelligent services. However, deploying ERC in real-world settings remains challenging due to reliability gaps across modalities, instability in visual representations, and the high computational cost of large pretrained models. In particular, on resource-constrained edge devices, it is difficult to reduce model size and inference latency while preserving accuracy. To address these challenges, we jointly propose a knowledge-distillation-based multimodal ERC model, called DDVLM, with an edge-optimized Weight-Only Quantization (WOQ) pipeline for efficient edge deployment. DDVLM assigns the textual modality as the teacher and the visual modality as the student, transferring emotion-distribution knowledge to improve non-verbal representations and stabilize multimodal learning. In addition, Exponential Moving Average (EMA)-based self-distillation enhances the consistency and generalization capability of text features. Meanwhile, the proposed WOQ pipeline quantizes linear-layer weights to INT8 while preserving precision-sensitive operations in mixed precision, thereby minimizing accuracy loss and reducing model size, memory usage, and inference latency. Experiments on the MELD dataset demonstrated that the proposed approach achieves state-of-the-art performance while also enabling real-time inference on edge devices such as NVIDIA Jetson. Overall, this work presents a practical ERC framework that jointly considers accuracy and deployability.

1. Introduction

With the advancement of AI technologies, the importance of natural human–computer interaction has become increasingly prominent, and emotion recognition has been widely adopted across various application domains, including conversational AI systems, healthcare, and chatbots [1,2,3,4,5,6,7,8,9,10]. In particular, Emotion Recognition in Conversation (ERC) goes beyond utterance-level emotion classification by inferring emotions while considering conversational continuity and inter-speaker interactions, thereby addressing the limitations of conventional emotion recognition approaches. This enables the effective modeling of emotion dynamics and context-dependent expressions in real-world conversations, making ERC a crucial component in a wide range of intelligent interactive applications. Early studies on ERC primarily employed text-based unimodal approaches, which effectively captured semantic context and emotion-related lexical cues within conversations. However, such approaches fail to fully reflect a speaker’s true emotional state due to the absence of non-verbal cues, and their performance often degrades in utterances with strong linguistic ambiguity, such as sarcasm or irony [1,2,3,4,5].
To overcome these limitations, multimodal approaches have recently been actively explored. Multimodal ERC integrates linguistic information from text with complementary non-verbal cues extracted from audio and visual modalities, enabling more accurate inference of a speaker’s true emotional intent even in situations where textual information alone is ambiguous.
However, existing multimodal ERC approaches still suffer from several limitations. First, fusion learning is challenged by representational gaps across modalities and the unstable quality of non-verbal representations. In particular, the visual modality is highly susceptible to environmental noise, and non-verbal expressions of the same emotion often vary significantly, resulting in lower information reliability compared to text, as illustrated in Figure 1. Consequently, naive fusion of textual and visual modalities may cause visual features to act as noise, ultimately degrading overall performance. To alleviate these issues, this paper proposes a Knowledge Distillation (KD) framework that leverages the textual modality, which is characterized by relatively dense emotion representations and higher classification accuracy, as the teacher model, while treating the visual modality, which exhibits comparatively lower accuracy, as the student model. By transferring the teacher’s emotion distribution knowledge to the student, the proposed approach enhances the representational quality of non-verbal modalities and effectively improves the overall stability of multimodal learning.
Furthermore, we identify a structural limitation in which the textual modality, despite serving as the teacher, does not receive feedback from other modalities or external models. If the representations learned by the textual teacher are unstable or structurally inconsistent, the student model may be guided by noisy supervisory signals, significantly reducing the effectiveness of knowledge transfer. Therefore, prior to distilling knowledge into the student model, we applied Exponential Moving Average (EMA)-based self-distillation to the textual modality to maximize its generalization capability and construct a consistent and robust teacher model. EMA-based self-distillation jointly exploits hard-label supervision and soft-label distribution alignment to improve both training stability and representational expressiveness. Hard-label learning provides a strong decision signal toward a single class, particularly for utterances with ambiguous emotional boundaries; however, it fails to sufficiently capture continuous inter-emotion relationships and may lead to overly sharp decision boundaries. In contrast, soft-label supervision preserves the probabilistic structure among classes, enabling the model to learn relative distances and correlations between similar emotions. This encourages the formation of more consistent cluster structures in the emotion embedding space, allowing the textual modality to represent subtle inter-emotion relationships more effectively.
Second, most existing multimodal ERC studies rely on purely vision-based encoders, such as Convolutional Neural Network (CNN)- or Transformer-based architectures [2,5,6,7,8,9,10], for visual feature extraction. However, in ERC, the textual modality, which is grounded in the semantic context of conversations and speaker utterances, plays a central role in emotion inference. As a result, when visual representations are not sufficiently aligned with textual representations, the effectiveness of multimodal fusion can be substantially limited.
To mitigate this cross-modal representational mismatch, this paper employed Qwen2.5-VL-7B-Instruct [11], a Vision-Language Model (VLM) trained to interpret and align visual information within a language model’s representational space, as the visual feature extractor. Furthermore, the visual descriptions generated by the VLM are embedded using Sentence-BERT [12], ensuring that visual features are represented within a semantically aligned embedding space shared with the textual modality. This design enhances semantic consistency across modalities and improves both the stability and efficiency of multimodal fusion.
Third, recent multimodal ERC models are often built upon high-capacity pretrained backbones such as RoBERTa [13] and Large Language Models (LLMs), which entail substantial computational complexity and memory requirements [2,5,6,7,9]. While these models can be deployed in server-grade GPU environments, their excessive computational cost and power consumption impose significant limitations on real-time applications and resource-constrained edge devices. In particular, embedded edge platforms such as NVIDIA Jetson, despite supporting low-power on-device inference, face practical challenges in directly deploying existing multimodal ERC models due to restricted computational and memory resources. Considering these constraints, model compression is essential for the practical deployment of multimodal ERC systems. However, most existing ERC studies primarily focus on multimodal fusion strategies or sophisticated architectural designs, and only a limited number of works have experimentally investigated actual deployability in edge environments.
To address this gap, this paper trained a multimodal ERC model in a server-grade GPU environment, evaluated both classification performance and inference efficiency based on a full-precision (FP32) multimodal ERC model, and subsequently deployed the model on edge devices such as NVIDIA Jetson to explore model compression and optimization strategies. Based on this, we propose an edge-optimized Weight-Only Quantization (WOQ) pipeline that enables the efficient deployment of multimodal ERC models in resource-constrained edge environments. The proposed pipeline quantizes the weights of linear layers, which dominate both computational cost and memory usage in the model, to INT8, while retaining precision-sensitive operations in FP16 or FP32 through a mixed-precision design. Furthermore, the text encoder, VLM encoder, and fusion module were decoupled and built as independent components, enabling module-level optimization and flexible deployment.
Through this design, the proposed pipeline substantially reduces model size and memory consumption while minimizing inference latency and improving energy efficiency. Moreover, compared to the full-precision multimodal ERC model, it effectively preserves classification accuracy, thereby achieving a favorable trade-off between model compression and performance retention.
The main contributions of this paper are summarized as follows:
  • A KD framework alleviates modality imbalance by employing an EMA-stabilized and consistent textual modality as the teacher and a relatively less accurate visual modality as the student.
  • A VLM with Sentence-BERT embedding mitigates cross-modal representational mismatch by aligning visual features with the textual semantic space.
  • An edge-optimized WOQ pipeline enables efficient multimodal ERC deployment on resource-constrained edge devices while preserving performance.
  • The proposed approach achieved state-of-the-art performance on the MELD benchmark.
The remainder of this paper is organized as follows. Section 2 reviews related work on ERC and model compression techniques. Section 3 describes the proposed multimodal ERC framework, including the edge-optimized WOQ method and its core components. Section 4 presents experimental results conducted in a server-grade GPU environment, followed by deployment and inference evaluations on edge devices, comparing full-precision and quantized models. Finally, Section 5 concludes the paper and discusses future research directions.

2. Related Work

2.1. Emotion Recognition in Conversation (ERC)

ERC can be broadly classified into three methodological paradigms based on modeling strategies.
First, graph-based methods model conversational structures by representing utterances, speaker interactions, and temporal dependencies as nodes and edges in a graph. Graph Neural Networks (GNNs) are then employed to jointly learn these relational representations. This paradigm effectively captures inter-utterance dependencies, speaker-level emotional transition patterns, and interactions among multimodal features, making it widely adopted in ERC research. Representative graph-based ERC models include GA2MIF [1], MMPGCN [7], DER-GCN [9], CBERL [10], D²GNN [14], GCCL [15], SUNET [16], ConxGNN [17], Causal-DAG [18], and MKFM [19].
MMPGCN [7] embeds heterogeneous multimodal information into a relation graph and enhances semantic integration and generalization by applying weighted aggregation based on node homogeneity and neighbor weight sharing. D²GNN [14] decouples emotion-related and emotion-irrelevant features, employs speaker-aware message passing, and learns discriminative emotion representations through multimodal distillation. SUNET [16] constructs a speaker-utterance heterogeneous graph to model speaker-utterance interactions and dynamically updates nodes in speaker order to jointly capture global speaker characteristics and contextual emotional transitions.
Second, sequence-based methods leverage the sequential nature of conversations in ERC, where utterances are organized in temporal order. These approaches employ sequence modeling architectures such as Recurrent Neural Networks (RNNs) and Transformers to learn temporal dependencies and contextual dynamics within a dialogue.
By modeling both short- and long-term inter-utterance dependencies, contextual emotion evolution patterns, and semantic transitions across consecutive utterances, sequence-based methods have been widely adopted in ERC. Representative sequence-based ERC models include TelME [2], UniMSE [3], EASUM [5], SACCMA [6], MRSLN [8], HU-Dialogue [20], PCDS [21], ML-ERC [22], and PeTracker [23].
SACCMA [6] integrates a speaker-aware network with cross-modal attention to capture the complementary relationship between speaker information and multimodal features. MRSLN [8] employs a Speaker-LSTM structure to explicitly model inter- and intra-speaker dependencies and utilizes residual connections to alleviate over-smoothing in LSTM while reinforcing contextual information across utterances. HU-Dialogue [20] incorporates hierarchical uncertainty estimation into the emotion recognition process by modeling contextual uncertainty via source-adaptive noise-based attention normalization and modality uncertainty through CapsNet [24] and Bayesian inference.
Finally, generation-based methods perform emotion recognition by leveraging the reasoning capabilities and latent representations of large-scale generative models to generate or enrich conversational context, speaker states, and emotional cues. These approaches exploit commonsense and conversational knowledge learned by LLMs to infer implicit emotional signals or reconstruct inter-utterance relations, thereby enabling more expressive and semantically enriched emotion representations. Representative generation-based ERC models include COSMIC [25] and CoTCL [26].
COSMIC [25] leverages the commonsense reasoning ability of COMET [27] to infer high-level knowledge such as speaker intent, emotional reactions, and action outcomes, and integrates these in utterance embeddings to better capture contextual and causal emotional cues. CoTCL [26] employs a triple contrastive learning framework consisting of instance-level, sequential, and graph-based components to refine utterance representations, enhancing discriminability, preserving continuous emotion relations, and modeling dialogue emotion transitions with COMET-based commonsense graphs. A comparative overview of recent ERC models is summarized in Table 1.

2.2. Quantization

In recent ERC research, feature extraction based on large-scale pretrained models such as LLMs, RoBERTa, and BERT has become the dominant paradigm. However, these models suffer from inherent limitations when deployed in real-time on edge devices with constrained memory bandwidth and computational resources. In particular, Transformer-based architectures, which involve repeated large-scale matrix operations, incur severe inference latency and high power consumption under low memory bandwidth conditions. Consequently, to ensure real-time performance in edge environments, it is essential to structurally reduce both the computational complexity and memory usage through model compression techniques. Representative model compression techniques include quantization, pruning, knowledge distillation, and low-precision approaches that reduce numerical precision from FP32 to FP16 or lower. Among these, quantization is one of the most widely adopted approaches for edge deployment, as it effectively reduces both computational cost and memory footprint by lowering the representational precision of the model.
Quantization refers to the process of converting neural network weights and activations from high-precision representations, such as 32-bit floating-point (FP32), to low-precision formats, such as 8-bit integers (INT8). By reducing the bit width of arithmetic operations, data are mapped onto discrete integer grids, substantially decreasing model size and memory consumption while enabling more efficient execution on hardware accelerators. Nevertheless, as lower precision inevitably introduces information loss, minimizing such degradation remains one of the central challenges in quantization. Representative quantization techniques include post-training quantization (PTQ) and quantization-aware training (QAT).
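As a concrete illustration of this mapping, the following NumPy sketch performs per-tensor affine quantization, deriving the scale and zero-point from the observed min/max of the tensor (a simplification for illustration; practical pipelines estimate these statistics through calibration):

```python
import numpy as np

def quantize_affine(x, num_bits=8):
    """Map FP32 values onto a discrete INT8 grid via scale and zero-point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Recover an FP32 approximation from the INT8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

np.random.seed(0)
x = np.random.randn(256).astype(np.float32)
q, s, zp = quantize_affine(x)
x_hat = dequantize_affine(q, s, zp)
# rounding error per element is bounded by about half a quantization step
assert np.max(np.abs(x - x_hat)) <= s
```

The information loss mentioned above is visible here as the residual between `x` and `x_hat`, which shrinks as the bit width grows.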
PTQ estimates the dynamic range of each operator using a calibration dataset without additional training and determines scaling factors accordingly to map FP32/FP16 values to low-precision representations such as INT8. Representative PTQ-based techniques include Activation-aware Weight Quantization (AWQ) [28], SmoothQuant [29], and TensorRT INT8 quantization.
AWQ [28] analyzes the scale distribution of activations to selectively protect weight channels with high importance and assigns scaling factors to each channel to minimize quantization error. In particular, by adopting activation-free quantization that preserves activations while quantizing only weights, AWQ enables stable INT8/INT4 inference for LLMs. SmoothQuant [29] mitigates activation outliers during INT8 quantization by rescaling activations and weights, shifting most of the quantization burden to weights and enabling efficient per-channel INT8 inference. TensorRT INT8 combines PTQ scaling with kernel fusion and platform-specific optimizations, making it one of the most practical quantization solutions for embedded GPUs such as NVIDIA Jetson.
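The rescaling idea behind SmoothQuant can be sketched numerically as follows (illustrative NumPy only; the injected outlier channel and the migration strength α = 0.5 are assumptions for the example, not values from the cited work):

```python
import numpy as np

np.random.seed(0)
alpha = 0.5
X = np.random.randn(16, 8)          # activations (tokens x channels)
X[:, 3] *= 50.0                     # inject a hypothetical outlier channel
W = np.random.randn(8, 4)           # linear-layer weights

# per-channel smoothing factor: migrate quantization difficulty
# from activations to weights, controlled by alpha
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
X_smooth = X / s                    # activation outliers are damped
W_smooth = W * s[:, None]           # the burden moves into the weights

# the mathematical output of the linear layer is unchanged
assert np.allclose(X @ W, X_smooth @ W_smooth)
# the worst activation magnitude shrinks, easing per-channel INT8 quantization
assert np.abs(X_smooth).max() < np.abs(X).max()
```

The key property is that the rescaling is an exact algebraic identity, so accuracy is affected only by the subsequent quantization, not by the smoothing itself.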
QAT simulates quantization effects during training through fake quantization of weights and activations, enabling the model to adapt to low-precision representations such as INT8 or INT4. By applying forward operations that incorporate quantization noise throughout training, the model learns to become robust to errors induced by reduced numerical precision. Representative QAT-based quantization techniques include Learned Step Size Quantization (LSQ) [30], LSQ+ [31], and QAT-hybrid approaches that extend low-bit variants of methods such as GPTQ [32] and AWQ.
LSQ [30] treats the quantization step size as a learnable parameter and directly optimizes it via backpropagation to obtain optimal scaling values. This enables effective compensation for nonlinear quantization errors and has been shown to achieve superior accuracy compared to PTQ across both CNN and Transformer-based models. LSQ+ [31] extends LSQ by improving the stability of weight and activation scaling during training. By enhancing scale initialization and training stability, LSQ+ achieves robust performance even under ultra-low-bit settings such as INT4 and INT3. QAT-hybrid approaches mitigate residual quantization errors that PTQ methods (e.g., AWQ, SmoothQuant, GPTQ) cannot fully correct by fine-tuning only selected layers with QAT. This significantly lowers the cost of full QAT while enabling stable low-bit deployment for large LLMs and Transformer models, making it a practical QAT strategy.
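The forward pass of LSQ-style fake quantization can be sketched as follows (NumPy, forward only; in actual LSQ the step size is a learnable parameter updated through a straight-through gradient estimator, which this sketch omits):

```python
import numpy as np

def lsq_fake_quant(w, step, num_bits=8):
    """Fake quantization: round to the step grid, clip to the integer range,
    then dequantize so downstream computation sees the quantization error."""
    qn, qp = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(w / step), qn, qp)
    return q * step  # dequantized values carry the quantization noise

np.random.seed(0)
w = np.random.randn(64)
w_q = lsq_fake_quant(w, step=0.05)
# every fake-quantized value lies on the learned step grid
assert np.allclose(w_q / 0.05, np.round(w_q / 0.05))
```

During QAT, gradients flow through this operation as if it were the identity (within the clipping range), which is what lets the model adapt to the injected quantization noise.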
Overall, PTQ- and QAT-based quantization techniques jointly reduce memory consumption, computational complexity (FLOPs), and memory bandwidth requirements by decreasing bit width, thereby enabling the efficient inference of large-scale models on resource-constrained edge devices.

3. Proposed Methods

3.1. Problem Definition

The objective of multimodal ERC is to predict the emotion label $y_k$ corresponding to the $k$-th utterance $u_k$ within a dialogue. Given a set of speakers $S$, utterances $U$, and emotion labels $Y$, a dialogue consisting of $k$ utterances can be represented as follows:
$\{(s_i, u_1, y_1), (s_j, u_2, y_2), \ldots, (s_i, u_k, y_k)\}$
where $s_i, s_j \in S$ denote the speakers participating in the dialogue; if $i = j$, then $s_i$ and $s_j$ refer to the same speaker. The variable $y_k \in Y$ represents the emotion of the $k$-th utterance in the dialogue and belongs to one of the predefined emotion categories. Similarly, $u_k \in U$ denotes the $k$-th utterance, which consists of multimodal features from the text, audio, and visual modalities.
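For concreteness, such a dialogue can be represented as a sequence of (speaker, utterance, label) triples; the utterances and labels below are toy examples for illustration only:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One dialogue turn (s, u_k, y_k) from the problem definition."""
    speaker: str     # a speaker s_i in S
    utterance: str   # u_k in U (text only here; the full model is multimodal)
    emotion: str     # y_k in Y, one of the predefined emotion categories

# a toy two-speaker dialogue with illustrative labels
dialogue = [
    Turn("s1", "I got the job!", "joy"),
    Turn("s2", "That's wonderful news.", "joy"),
    Turn("s1", "But I have to move abroad...", "sadness"),
]
# ERC predicts dialogue[k].emotion given the turns up to and including u_k
assert dialogue[0].speaker == dialogue[2].speaker  # same speaker when i = j
```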

3.2. Model Overview

The proposed multimodal ERC model consists of four key components: feature extraction, EMA-based self-distillation, cross-modal KD, and residual fusion. The overall model architecture is illustrated in Figure 2, and the following sections provide detailed descriptions of the design and operation of each component. In addition, although not included in Figure 2, the edge-optimized WOQ pipeline for deployment in resource-constrained edge environments (Figure 5) is also described in detail in the subsequent sections.

3.3. Feature Extraction

Text modality: For the text modality, emotion features were extracted at the utterance level using the RoBERTa model. All utterances in a dialogue were concatenated in chronological order, and speaker-specific tokens such as <s1> and <s2> were inserted to distinguish between speakers. In addition, to explicitly guide the model to predict the current speaker’s emotion, an emotion prompt of the form “Now <si> feels <mask>” was appended, allowing the model to incorporate both the conversational context and emotional cues. The input sentence was tokenized and limited to a maximum of 511 tokens; if the number of tokens exceeded this limit, the earlier part of the sequence was truncated. After a <mask> token was appended to the end of each sequence, all sentences were padded, paired with attention masks, and processed into mini-batches. Contextual representations were then extracted through the RoBERTa encoder, and the embedding of the last token in the final hidden state was used as the emotional representation of the entire utterance, yielding a 768-dimensional embedding vector $h_{text} \in \mathbb{R}^{768}$. The embedding extraction process for the text modality is illustrated in Figure 3.
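The input construction described above can be sketched as plain string assembly (illustrative only: the actual pipeline truncates on RoBERTa subword tokens rather than whitespace tokens, and relies on the tokenizer's own mask-token handling):

```python
def build_text_input(turns, current_speaker, max_tokens=511):
    """Concatenate dialogue turns with speaker tokens and append the
    emotion prompt for the current speaker; whitespace tokens stand in
    for the real subword tokenizer."""
    parts = [f"<{spk}> {utt}" for spk, utt in turns]
    prompt = f"Now <{current_speaker}> feels <mask>"
    tokens = " ".join(parts + [prompt]).split()
    # truncate from the front so the most recent context and prompt survive
    return " ".join(tokens[-max_tokens:])

turns = [("s1", "I got the job!"), ("s2", "That's wonderful news.")]
text = build_text_input(turns, "s1")
assert text.endswith("feels <mask>")
```

Truncating from the front mirrors the description above: when the limit is exceeded, the earlier part of the conversation is discarded so that recent context and the emotion prompt are preserved.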
Visual modality: For the visual modality, we utilized the Qwen2.5-VL-7B-Instruct model to extract objective, sentence-level descriptive features of observable facial and bodily states from short video segments corresponding to each utterance. A detailed instructional prompt beginning with “You are an annotator who concisely describes observable actions and facial states” was provided to the model, guiding it to generate present-progressive sentences that describe only visually observable cues without emotional inference. The prompt explicitly required the inclusion of descriptions of facial regions such as the eyes, eyebrows, lips, jaw, head, and gaze, along with at least one bodily behavior involving hand movements, posture, or interaction.
Additionally, emotion-related words such as happy, angry, and sad were strictly prohibited to ensure that the generated content reflected purely visual and factual observations. The resulting sentences were refined by removing redundant or irrelevant outputs, yielding 2–3 concise descriptions that captured facial expressions, gaze behavior, gestures, posture, and interpersonal interactions for each utterance. These sentence-level descriptions were subsequently transformed into embeddings and used as visual emotion features.
The generated visual descriptions were embedded using Sentence-BERT (all-mpnet-base-v2), where all sentences corresponding to each utterance were concatenated into a single textual sequence and encoded into a 768-dimensional embedding vector, denoted as $h_{visual} \in \mathbb{R}^{768}$. The VLM-based visual description generation process is illustrated in Figure 4.

3.4. EMA-Based Self-Distillation

Self-distillation applies temperature scaling to the probability distributions produced by the teacher and student models, both derived from the same network, to smooth (soften) the output probabilities and use them as soft targets. This approach alleviates the issue of overconfidence in a single class and enables more stable learning of relative inter-class relationships. In other words, it establishes a learning paradigm that improves the model internally without relying on an external teacher network.
The Qwen-based visual modality receives both expressive knowledge and soft-label supervision from the text modality through cross-modal KD. However, the text modality itself does not receive any feedback from other modalities. Therefore, to further stabilize text representations and enhance the reliability of knowledge transfer, we applied an EMA-based self-distillation strategy exclusively to the text modality. EMA was chosen as the teacher model because it can generate stable and consistent supervision by temporally averaging the student model parameters during training, without the need for an external fixed teacher. Furthermore, the EMA teacher reduces the student’s reliance on hard labels at a single time step and promotes the learning of averaged soft targets over multiple updates, thereby maintaining structural consistency in the representation space. This allows the model to learn stable emotional representations that reflect long-term trends while mitigating the effects of short-term fluctuations or noise during training.
In the proposed approach, the EMA teacher is initialized at Step 0 by copying the parameters of the student model. Beginning from Step 1, the EMA teacher is updated at every training step according to the EMA update rule, and the accumulated updates progressively refine the teacher model over time. As training proceeds, the previous EMA parameters are retained with a weighting factor of $\lambda$, while the current student parameters are incorporated with a proportion of $(1 - \lambda)$. As updates accumulate, the influence of previous parameters gradually decreases in the form of $\lambda$ raised to successive powers. Consequently, the contribution of older parameters decays exponentially, allowing recent updates to be reflected more prominently. In other words, the EMA functions as a temporal smoothing filter based on exponential weighting, continuously averaging model parameters over time. This mechanism prevents past information from being abruptly discarded, instead allowing it to fade smoothly over time, thereby enabling the teacher model to preserve long-term learning knowledge while flexibly adapting to the latest trends in the student model’s evolution. This update approach mitigates short-term fluctuations, captures consistent long-term patterns, and prevents unstable or noisy gradients from being directly propagated to the teacher predictions, particularly during the early stages of training.
The stabilized text representations obtained through this process serve as reliable guidance signals for the visual modality during the subsequent cross-modal knowledge distillation phase. These representations improve convergence stability in the visually variant modality, enhance the alignment quality of the cross-modal representation space, and ultimately maximize the complementarity among modalities, improving both knowledge transfer efficiency and overall performance.
Finally, during training, the student model is optimized to minimize the cross-entropy loss using the ground-truth labels and the Kullback–Leibler (KL) divergence loss computed from the soft targets generated by the EMA teacher. However, because the EMA teacher is not sufficiently stable during the first epoch, only the cross-entropy loss is applied in Epoch 1 to ensure stable learning. Starting from Epoch 2, the KL divergence loss based on the EMA teacher’s soft targets is additionally incorporated, enabling effective knowledge distillation and guiding the student to learn more stable and fine-grained emotional representations. The EMA update for the teacher model is defined as follows:
$\theta_{EMA}^{(t)} = \lambda \, \theta_{EMA}^{(t-1)} + (1 - \lambda) \, \theta_{student}^{(t)}$
where $\theta_{EMA}^{(t)}$ and $\theta_{student}^{(t)}$ denote the parameters of the EMA teacher and the student model, respectively, at the current training step $t$, and $\lambda$ denotes the EMA decay coefficient for updating the teacher model. The relationship between the output distribution of the EMA teacher model and the predictions of the student model, as well as the self-distillation loss, are defined as follows:
$p_t^{(\tau)} = \mathrm{softmax}(z_t / \tau), \quad p_s^{(\tau)} = \mathrm{softmax}(z_s / \tau)$
where $z_t$ and $z_s$ denote the logits of the EMA teacher and the student model, respectively, and $\tau$ denotes the temperature parameter.
$\mathcal{L}_{KD} = \tau^{2} \, \mathrm{KL}\left(p_t^{(\tau)} \,\|\, p_s^{(\tau)}\right)$
where $\mathcal{L}_{KD}$ represents the KL divergence between the temperature-scaled softmax probability distributions of the EMA teacher model $p_t^{(\tau)}$ and the student model $p_s^{(\tau)}$, guiding the student to acquire knowledge from the output distribution of the teacher model.
$\mathcal{L}_{student}^{self} = \alpha \, \mathcal{L}_{cls}(z_s, y) + (1 - \alpha) \, \mathcal{L}_{KD}$
where $\mathcal{L}_{cls}(z_s, y)$ denotes the classification loss computed as the cross-entropy between the student logits $z_s$ and the ground-truth label $y$, and $\alpha$ is a weighting coefficient that balances the two loss terms.
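The EMA update and self-distillation loss above can be traced numerically (a minimal NumPy sketch; the logits are random stand-ins, and the values of λ, τ, and α are assumptions for illustration rather than the paper's settings):

```python
import numpy as np

def softmax(z, tau=1.0):
    e = np.exp((z - z.max()) / tau)
    return e / e.sum()

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

np.random.seed(0)
theta_student = np.random.randn(10)
theta_ema = theta_student.copy()      # Step 0: teacher copied from student
lam, tau, alpha = 0.999, 2.0, 0.5     # assumed hyperparameters

# one EMA update step (the update rule for theta_EMA at step t)
theta_student = theta_student + 0.01 * np.random.randn(10)  # stand-in update
theta_ema = lam * theta_ema + (1 - lam) * theta_student

# self-distillation loss on one utterance (7 MELD emotion classes)
z_t, z_s = np.random.randn(7), np.random.randn(7)
y = 3                                             # ground-truth class index
p_t, p_s = softmax(z_t, tau), softmax(z_s, tau)
loss_kd = tau ** 2 * kl(p_t, p_s)                 # L_KD
loss_cls = -np.log(softmax(z_s)[y])               # cross-entropy L_cls
loss = alpha * loss_cls + (1 - alpha) * loss_kd   # L_student_self
assert loss_kd >= 0 and loss > 0
```

Note how the EMA teacher moves only a fraction $(1-\lambda)$ of the way toward the current student parameters per step, which is the temporal smoothing behavior described above.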

3.5. Cross-Modal Knowledge Distillation

We propose two strategies for distilling emotion-related knowledge from the text modality. These strategies are designed to transfer the rich linguistic and contextual information inherent in text to the visual modality, thereby compensating for the relatively lower representational accuracy of the visual modality in emotion recognition.
Response-based Distillation: This approach, introduced in TelME [2], guides the learning process through the softmax probability distributions of the teacher and student models. It simultaneously considers both inter-class and intra-class relationships to effectively capture the structural characteristics of these distributions.
The inter-class relation is computed based on the Pearson correlation coefficient between the softmax probability distributions of the two models. By comparing the class-wise probability patterns of each utterance, the student model is aligned with the teacher’s global emotional structure, specifically the semantic similarities and boundary relationships among emotion classes.
The intra-class relation is designed to ensure that within each batch, samples belonging to the same emotion class share consistent probability distributions. This is achieved by transposing the probability matrix to compute sample-wise correlations within each class, thereby enhancing the cohesion and consistency of the intra-class representations. As a result, the student model learns to maintain a stable local structure within the emotion representation space.
Furthermore, instead of the commonly used KL divergence, this work adopted the Pearson correlation coefficient for response-based distillation. Unlike KL divergence, which is sensitive to absolute distributional differences and may lead to unstable learning, the Pearson correlation coefficient measures the linear dependency between distributions, making it less susceptible to modality-specific variations and enabling more stable and robust knowledge transfer. Through this approach, the student model learns both the global structure and the fine-grained representational characteristics embedded in the teacher model’s output distribution. The response-based loss is defined as follows:
$$y_t^{(\tau)} = \mathrm{softmax}\!\left(\frac{z_t}{\tau}\right), \qquad y_s^{(\tau)} = \mathrm{softmax}\!\left(\frac{z_s}{\tau}\right)$$
$$\mathcal{L}_{inter} = 1 - \frac{1}{N}\sum_{i=1}^{N}\mathrm{corr}\!\left(y_t^{i,\tau},\, y_s^{i,\tau}\right)$$
$$\mathcal{L}_{intra} = 1 - \frac{1}{C}\sum_{j=1}^{C}\mathrm{corr}\!\left(y_t^{:j,\tau},\, y_s^{:j,\tau}\right)$$
$$\mathcal{L}_{response} = \tau^{2}\left(\mathcal{L}_{inter} + \mathcal{L}_{intra}\right)$$
where $y_s$ and $y_t$ denote the temperature-scaled softmax probability distributions of the student and teacher models, respectively. In addition, $N$ and $C$ indicate the number of samples in a batch and the number of emotion classes, respectively. Furthermore, $i$ and $j$ represent the indices over the sample and class dimensions, and $\mathrm{corr}(\cdot,\cdot)$ denotes the Pearson correlation coefficient between two vectors.
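As a concrete illustration, the response-based loss defined above can be sketched in PyTorch as follows; the helper names and the temperature value are illustrative, not the settings used in this work.

```python
import torch
import torch.nn.functional as F

def pearson_corr(a, b, dim=-1, eps=1e-8):
    """Pearson correlation coefficient along `dim` for batched vectors."""
    a = a - a.mean(dim=dim, keepdim=True)
    b = b - b.mean(dim=dim, keepdim=True)
    return (a * b).sum(dim=dim) / (a.norm(dim=dim) * b.norm(dim=dim) + eps)

def response_distillation_loss(z_t, z_s, tau=4.0):
    """L_response = tau^2 * (L_inter + L_intra) with correlation-based terms."""
    y_t = F.softmax(z_t / tau, dim=-1)  # teacher distribution, [N, C]
    y_s = F.softmax(z_s / tau, dim=-1)  # student distribution, [N, C]
    # inter-class: correlate class-wise probability patterns per utterance
    l_inter = 1.0 - pearson_corr(y_t, y_s, dim=-1).mean()
    # intra-class: transpose and correlate sample-wise patterns per class
    l_intra = 1.0 - pearson_corr(y_t.t(), y_s.t(), dim=-1).mean()
    return tau ** 2 * (l_inter + l_intra)
```

When the student matches the teacher exactly, both correlation terms reach 1 and the loss vanishes.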
Feature-Based Distillation: To address the limitation that response-based distillation alone is insufficient for the visual modality to learn effective emotional representations, this work additionally applies feature-based distillation. In this approach, both the teacher and student embeddings are $L_2$-normalized, and pairwise similarity matrices are constructed by computing the dot product between all sample pairs within a batch. For example, with a batch size of 4, each modality produces a $4 \times 4$ similarity matrix, where each element represents the relational similarity between two samples.
The teacher model’s similarity matrix is transformed into a teacher–teacher relational distribution through a temperature-scaled softmax function, while the student embeddings undergo the same process to produce a teacher–student relational distribution. The student model is then trained to minimize the discrepancy between these two relational distributions via KL divergence-based alignment, allowing it to learn the relational structure captured by the teacher model.
To further enhance training stability, we introduce an auxiliary JSD term weighted by $\lambda_{JSD}$. While KL divergence provides a strong alignment signal, it is highly sensitive to small differences between distributions and can cause instability, such as gradient explosion, in the early stages of training. In contrast, the Jensen–Shannon divergence (JSD) is symmetric and bounded, providing a more stable and reliable learning signal even when large distributional differences exist. Feature-based distillation combined with this auxiliary JSD term enables more stable relational alignment between the teacher and student models, guiding the student to construct a representation space that captures both the global structure and the fine-grained relational patterns of the teacher model. The feature-based loss is defined as follows:
$$\tilde{H}_t = \mathrm{normalize}(H_t), \qquad \tilde{H}_s = \mathrm{normalize}(H_s)$$
$$p = \mathrm{softmax}\!\left(\frac{\tilde{H}_t \tilde{H}_t^{\top}}{\tau}\right), \qquad q = \mathrm{softmax}\!\left(\frac{\tilde{H}_t \tilde{H}_s^{\top}}{\tau}\right)$$
$$\mathcal{L}_{feature} = \mathrm{KL}(q\,\|\,p) + \lambda_{JSD}\,\mathrm{JSD}(p, q)$$
where $\tilde{H}_t$ and $\tilde{H}_s$ denote the $L_2$-normalized embedding matrices of the teacher and student models, respectively. In addition, $p$ and $q$ denote the teacher–teacher and teacher–student relational (similarity) distributions obtained using the temperature-scaled softmax function, respectively.
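A minimal PyTorch sketch of this feature-based loss, assuming batch-level embedding matrices as inputs (the temperature and the $\lambda_{JSD}$ weight are illustrative values):

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(h_t, h_s, tau=0.1, lambda_jsd=0.5, eps=1e-8):
    """Relational distillation: KL(q || p) plus a bounded, symmetric JSD term."""
    ht = F.normalize(h_t, dim=-1)                 # L2-normalized teacher embeddings
    hs = F.normalize(h_s, dim=-1)                 # L2-normalized student embeddings
    p = F.softmax(ht @ ht.t() / tau, dim=-1)      # teacher-teacher relations
    q = F.softmax(ht @ hs.t() / tau, dim=-1)      # teacher-student relations
    # Note: F.kl_div(log_input, target) computes KL(target || input)
    kl_qp = F.kl_div(p.clamp_min(eps).log(), q, reduction="batchmean")
    m = 0.5 * (p + q)                             # JSD midpoint distribution
    jsd = 0.5 * (F.kl_div(m.clamp_min(eps).log(), p, reduction="batchmean")
                 + F.kl_div(m.clamp_min(eps).log(), q, reduction="batchmean"))
    return kl_qp + lambda_jsd * jsd
```

Because the teacher embeddings anchor both relational matrices, the gradient flows only through the student side of $q$, pulling the student's pairwise structure toward the teacher's.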
Finally, the loss function for the visual student model combines the classification loss, response loss, and feature loss, enabling the student to learn both response-level and feature-level knowledge from the text-based teacher. The visual student loss is defined as follows:
$$\mathcal{L}_{student}^{visual} = \mathcal{L}_{cls} + \alpha\,\mathcal{L}_{response} + \beta\,\mathcal{L}_{feature}$$
Through this process, the visual modality acquires a refined emotional representation distilled from the text modality, reducing noise and enabling more stable and effective multimodal fusion. As a result, the proposed cross-modal KD enhances alignment and knowledge transfer efficiency, leading to significant improvements in overall emotion recognition performance.

3.6. Residual Fusion

We propose residual fusion, a text-centered fusion module designed to refine visual information through gated residual correction rather than simple feature concatenation. This module adaptively integrates visual cues into the text embedding space while preserving the semantic stability of the textual representation.
First, the visual embedding extracted from the VLM is smoothly projected into the text embedding space through a projection network. The projected visual feature and the text feature are combined using element-wise operations such as concatenation, difference, and product. These interactions are fed into a gating network that determines a sample-specific correction weight. To prevent unstable learning in the early training stage, the gate value is constrained within a fixed range of 0.05–0.5.
The correction strength is dynamically modulated not only by the gate $g$ but also by the cosine similarity between the text and visual embeddings and the confidence of the visual logits. Cosine similarity quantitatively measures the semantic alignment between the two modalities, allowing the model to apply stronger corrections when the textual and visual representations are semantically consistent and share an emotional context. The confidence of the visual logits reflects the reliability of the visual modality itself, enabling the model to emphasize visual information when its prediction is stable and well-defined. By jointly considering these two factors, the model adaptively strengthens corrections for well-aligned multimodal pairs and suppresses those from noisy or uncertain visual cues, thereby achieving more reliable and context-aware multimodal interaction.
In addition to the gated confidence weighting, we introduce a residual correction mechanism to more precisely refine textual representations using visual cues. This method is designed to progressively incorporate subtle emotional signals from the visual modality into the text representation. The residual correction term $\delta$ is computed as the normalized difference between the fused and text embeddings, capturing the refined visual contribution modulated by the gate $g$. This residual feature is further stabilized through Layer Normalization and Dropout before being transformed into correction logits via a lightweight linear head. The resulting $\delta$ is then scaled by gated confidence weights, which combine the learnable scalar $\alpha$, the gate value, the cosine similarity between modalities, and the confidence of the visual logits. Finally, the scaled residual correction term is added to the original text logits, allowing the model to selectively inject reliable and semantically aligned visual cues into the textual representation, thereby enhancing emotional richness while preserving the semantic integrity of the text modality.
Through this gated residual design, the model preserves the dominant textual semantics while effectively integrating fine-grained emotional nuances from the visual modality, resulting in stable and text-oriented multimodal fusion. The fusion mechanism is defined as follows:
$$\tilde{h}_v = W_2\,\mathrm{ReLU}\!\left(W_1\,\mathrm{LN}(h_v)\right)$$
where $h_v$ and $h_t$ denote the unimodal representations of the visual and text modalities, respectively. The projected visual embedding $\tilde{h}_v$ represents the feature mapped into the text feature space through Layer Normalization and a two-stage linear projection. $W_1$ and $W_2$ are learnable projection matrices that transform the visual representation into a space semantically aligned with the text modality.
$$g = \sigma\!\left(W_g\left[\,h_t \,\|\, \tilde{h}_v \,\|\, (h_t - \tilde{h}_v) \,\|\, (h_t \odot \tilde{h}_v)\,\right]\right)$$
where $\sigma$ represents the sigmoid activation function. The adaptive gating weight $g$ modulates the fusion process by controlling the contribution of the visual modality based on sample-specific interactions between the text and visual features.
$$\delta = W_c\,\mathrm{Dropout}\!\left(\mathrm{LN}(h_{fused} - h_t)\right)$$
where the residual correction logits $\delta$ are obtained by applying Layer Normalization and Dropout to the residual difference between the fused and text embeddings, and then transforming the stabilized residual feature through a lightweight linear head $W_c$. This process ensures stable optimization and prevents overfitting during training.
$$z = z_t + \alpha \cdot g \cdot s_{\cos} \cdot s_{conf} \cdot \delta$$
where $\alpha$ is a learnable scalar that regulates the overall correction strength, and $s_{\cos}$ and $s_{conf}$ represent the cosine-similarity-based and confidence-based scaling factors, respectively. The scaled residual correction term is then added to the original text logits to yield the final output $z$, allowing the model to selectively inject reliable and semantically aligned visual cues into the textual representation through residual fusion, thereby enhancing emotional expressiveness while preserving the semantic integrity and dominance of the text modality.
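Putting the four equations together, the residual fusion head can be sketched as a PyTorch module. The layer sizes, dropout rate, and the exact form of the fused embedding $h_{fused}$ are illustrative assumptions; only the overall gated residual structure follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualFusion(nn.Module):
    """Sketch of the gated residual fusion head (dimensions are illustrative)."""
    def __init__(self, d_text, d_vis, n_classes, p_drop=0.1):
        super().__init__()
        # projection network: LN + two-stage linear mapping into the text space
        self.proj = nn.Sequential(
            nn.LayerNorm(d_vis), nn.Linear(d_vis, d_text),
            nn.ReLU(), nn.Linear(d_text, d_text))
        self.gate = nn.Linear(4 * d_text, 1)       # W_g over [t || v || t-v || t*v]
        self.norm = nn.LayerNorm(d_text)
        self.drop = nn.Dropout(p_drop)
        self.head = nn.Linear(d_text, n_classes)   # lightweight correction head W_c
        self.alpha = nn.Parameter(torch.tensor(0.1))

    def forward(self, h_t, h_v, z_t, z_v):
        h_vt = self.proj(h_v)                                   # project visual feature
        inter = torch.cat([h_t, h_vt, h_t - h_vt, h_t * h_vt], dim=-1)
        g = torch.sigmoid(self.gate(inter)).clamp(0.05, 0.5)    # bounded gate
        h_fused = h_t + g * h_vt          # assumed form of the fused embedding
        delta = self.head(self.drop(self.norm(h_fused - h_t)))  # correction logits
        s_cos = F.cosine_similarity(h_t, h_vt, dim=-1).unsqueeze(-1)
        s_conf = F.softmax(z_v, dim=-1).max(dim=-1, keepdim=True).values
        return z_t + self.alpha * g * s_cos * s_conf * delta
```

The gate is clamped to [0.05, 0.5], so the visual correction can never overwhelm the text logits, even early in training.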

3.7. Edge-Optimized WOQ Pipeline

The proposed multimodal ERC model is built upon large-scale pretrained models: RoBERTa-based textual representations, VLM-based visual representations, and a fusion network that integrates features from both modalities. The resulting model occupies several hundred megabytes of parameters and is computationally expensive. While such architectures operate efficiently on server-grade GPUs, they face significant challenges on edge devices such as NVIDIA Jetson due to limitations in memory bandwidth, inference latency, power consumption, and thermal constraints, making real-time on-device inference difficult to achieve. In addition, because textual and visual modalities exhibit substantially different activation distributions, even minor numerical perturbations introduced during quantization can critically affect the stability and accuracy of multimodal fusion. To address these challenges and enable real-time multimodal emotion recognition on edge devices, this paper proposes an edge-optimized WOQ pipeline that compresses the original full-precision multimodal ERC model without requiring any architectural modifications. The pipeline converts the weights of major linear layers to INT8 while keeping non-quantized operations, such as activations, bias, and LayerNorm, in FP16 or FP32 precision to preserve numerical stability during multimodal fusion. Because the NVIDIA Jetson AGX Orin used in this study supports FP16 Tensor Core acceleration, the pipeline adopts a mixed-precision strategy that reduces memory usage by quantizing weights to INT8 while retaining activations in FP16 to preserve model accuracy.
Based on this mixed-precision design, the proposed edge-optimized WOQ pipeline is organized into three sequential stages. In the first stage, a trained full-precision checkpoint is loaded, and the major linear layers are replaced with lightweight INT8-based linear modules (Int8Linear) based on an AWQ-lite structure [28]. In the second stage, FP32 precision is retained for stability-critical operations such as LayerNorm, whereas FP16 is used for compute-intensive operations such as activations and bias, enabling Tensor Core acceleration on Jetson devices. In the final stage, the text encoder, VLM encoder, and fusion module are built independently and stored as mixed-precision models that use INT8 weights. During inference, these modules are combined under an FP16 execution setting, enabling real-time multimodal emotion recognition on resource-constrained edge devices.
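The first stage, replacing major linear layers with INT8 weight-only modules while leaving the rest of the model untouched, can be sketched as follows. `Int8Linear` and `quantize_linears` are illustrative names, not the actual AWQ-lite implementation; activations remain floating-point throughout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Int8Linear(nn.Module):
    """Weight-only INT8 linear: INT8 weights plus per-channel FP scales."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        W = linear.weight.data                               # [out, in]
        s = W.abs().amax(dim=1, keepdim=True) / 127.0        # one scale per output channel
        s = torch.where(s == 0, torch.ones_like(s), s)       # guard all-zero rows
        self.register_buffer("w_int8", torch.round(W / s).to(torch.int8))
        self.register_buffer("scale", s)
        self.bias = linear.bias

    def forward(self, x):
        w_hat = self.w_int8.float() * self.scale             # dequantize on the fly
        return F.linear(x, w_hat, self.bias)

def quantize_linears(module: nn.Module) -> nn.Module:
    """Swap every nn.Linear in the module tree for its Int8Linear counterpart."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, Int8Linear(child))
        else:
            quantize_linears(child)
    return module
```

Because the swap keeps the layer interface identical, a trained full-precision checkpoint can be loaded first and quantized in place, with no architectural changes.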
The quantitative formulation of the proposed edge-optimized WOQ pipeline is applied to the linear layers in Transformer architectures. For a weight matrix $W \in \mathbb{R}^{o \times i}$, a per-channel scale is computed for each output channel $j$, where the scale factor $s_j$ is defined as follows:
$$s_j = \frac{\max\left(|W_{:,j}|\right)}{127}$$
Using the computed scale, the INT8 quantization of the weights is defined as follows:
$$W_{\mathrm{int8},:,j} = \mathrm{round}\!\left(\frac{W_{:,j}}{s_j}\right)$$
During inference, the following dequantization operation is applied to recover the weights in the floating-point domain:
$$\hat{W}_{:,j} = s_j \cdot W_{\mathrm{int8},:,j}$$
The activation $x$ is retained in FP16 precision, or FP32 for certain operations, and the final linear transformation is performed as:
$$y = \hat{W} x + b$$
Since this weight-only quantization scheme does not involve activation quantization, it provides high numerical stability. This property is particularly beneficial for multimodal architectures in which heterogeneous modalities, such as text, visual, and fusion representations, are combined, as it minimizes information loss during multimodal integration. Moreover, because the proposed approach does not require any modification to the model architecture, trained full-precision checkpoints can be directly reused, enabling consistent quantization across the text encoder, VLM encoder, and fusion module. Specifically, the overall process of weight-only quantization is shown as follows: the original full-precision weights are first converted into INT8 values through scale-based rounding, and then approximately reconstructed through dequantization using the same scaling factor. A comparison between the original and dequantized weights shows that the reconstructed values remain very close to their full-precision counterparts, indicating that the proposed method can effectively reduce memory usage while preserving the essential numerical characteristics of the model, as illustrated in Figure 5.
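The quantize–dequantize round trip described above can be verified numerically in a short NumPy sketch (function names are illustrative); per element, the reconstruction error is bounded by half a quantization step, $s_j/2$.

```python
import numpy as np

def woq_int8(W):
    """Weight-only INT8 quantization with one scale per channel (column j)."""
    s = np.abs(W).max(axis=0) / 127.0        # s_j = max|W[:, j]| / 127
    s = np.where(s == 0.0, 1.0, s)           # guard all-zero channels
    W_int8 = np.round(W / s).astype(np.int8) # W_int8[:, j] = round(W[:, j] / s_j)
    return W_int8, s

def woq_dequant(W_int8, s):
    """Recover approximate FP weights: W_hat[:, j] = s_j * W_int8[:, j]."""
    return W_int8.astype(np.float32) * s
```

Comparing `W` against `woq_dequant(*woq_int8(W))` reproduces the reconstruction check described for Figure 5: the dequantized values stay within half a step of the originals while the stored weights shrink to one byte each.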

4. Experimental Results

4.1. Dataset and Evaluation Metrics

Dataset: The MELD (Multimodal EmotionLines Dataset) [33] consists of over 1400 multi-party conversations and more than 13,000 utterances extracted from the American sitcom Friends. Each utterance is annotated with one of seven emotion labels: neutral, surprise, fear, sadness, joy, disgust, or anger, and is provided in an aligned multimodal format that includes text, audio, and visual modalities. The MELD captures emotional transitions and contextual dependencies among speakers in real conversational settings, enabling the development of models that can learn complex and realistic emotion patterns.
Evaluation Metrics: In this paper, accuracy and weighted average F1-score (WA-F1) were used as evaluation metrics to measure the overall classification performance of the model. Accuracy represents the proportion of correctly classified predictions among all predictions and is defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
However, as illustrated in Figure 6, the MELD exhibited a clear class imbalance. While the neutral class accounted for approximately half of all utterances, the fear and disgust classes each accounted for less than 3% of the total. Due to this severe imbalance between majority and minority classes, a simple accuracy metric may overestimate the model’s performance by favoring the majority classes.
To address this issue, the WA-F1 was adopted as the primary evaluation metric. WA-F1 calculates the weighted average of the F1-scores for each class, using the class-wise sample ratio w i as the weighting factor, and is defined as follows:
$$\mathrm{WA\text{-}F1} = \sum_{i=1}^{K} w_i\, F1_i, \qquad w_i = \frac{N_i}{\sum_{j=1}^{K} N_j}$$
where $F1_i$ denotes the F1-score for the $i$-th class, $N_i$ is the number of samples in that class, and $K$ represents the total number of classes.
This metric accounts for the class imbalance by considering performance degradation in minority classes and prevents performance overestimation caused by majority classes. Furthermore, it effectively incorporates both precision and recall for each class, providing a more balanced, fair, and reliable evaluation of the model’s overall classification capability.
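For reference, WA-F1 can be computed directly from per-class counts, as in the following self-contained sketch; it is equivalent to scikit-learn's `f1_score` with `average='weighted'`.

```python
def weighted_f1(y_true, y_pred, labels):
    """Weighted-average F1: per-class F1 weighted by class support N_i / N."""
    total = len(y_true)
    score = 0.0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        score += (tp + fn) / total * f1   # weight w_i = N_i / N
    return score
```

Because each class contributes in proportion to its support, a collapse on a rare class such as fear or disgust lowers the score only modestly, yet, unlike plain accuracy, it is still penalized through that class's zero F1.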

4.2. Experimental Settings

In a server-grade GPU environment, the ERC classification experiments were conducted on Ubuntu 22.04 equipped with an NVIDIA RTX A6000 GPU. The full-precision multimodal ERC model was trained and evaluated in this server-grade GPU environment and subsequently used as the reference model for performance comparison with existing ERC models in the emotion classification experiments presented in Section 4.3. AdamW was adopted as the optimizer, decoupling weight decay from the momentum update to mitigate overfitting and improve generalization. In addition, the get_linear_schedule_with_warmup scheduler was employed to control the learning rate, which gradually increases the learning rate during the initial training phase and linearly decays it thereafter to alleviate early-stage instability and promote stable convergence. The training hyperparameters are summarized in Table 2.
Experiments comparing the deployment of the server-trained full-precision multimodal ERC model and its lightweight quantized variants were conducted on an NVIDIA Jetson AGX Orin to assess their practical deployability in resource-constrained edge environments. The edge environment was configured with JetPack 6.3, Ubuntu 22.04, and CUDA 12.6. Under this setting, inference-only evaluations were performed to measure inference efficiency and performance, focusing on real-time feasibility and computational efficiency under edge constraints. Figure 7 illustrates the NVIDIA Jetson AGX Orin device used in the experiments.

4.3. Emotion Classification Results on a Server-Grade GPU

Table 3 summarizes the performance comparison between the proposed model and representative existing ERC models on the MELD. As shown in Table 3, the proposed model achieved the highest classification performance in terms of both overall accuracy and WA-F1. It is important to note that several previous studies, including GA2MIF, MRSLN, D$^2$GNN, GCCL, and PCDS, evaluated their models using only five emotion classes, excluding the fear and disgust categories due to the limited number of training samples.
In contrast, the proposed model was evaluated under a more challenging seven-class setting that includes all emotion categories in the MELD. Despite this more difficult evaluation scenario, the proposed model still achieved superior overall performance. Furthermore, when examining the recognition performance for minority emotion classes, the proposed model demonstrated significantly improved classification capability. In particular, the recognition accuracy for minority classes such as fear and disgust showed improvements ranging from approximately 5% to 20% compared with existing models. These results indicate that the proposed framework is not only effective in improving overall ERC performance, but also more robust in handling minority emotion categories that are often excluded or poorly recognized in prior work.
Figure 8 illustrates the confusion matrix obtained on the MELD. The analysis reveals that emotion classes with a limited number of samples, such as fear and disgust, exhibit lower classification accuracy and a higher rate of confusion with other emotions.
This confusion pattern can be attributed to both the expressive similarity between emotions and inherent data imbalance, which remains one of the fundamental challenges to be addressed in the ERC field.

4.4. Edge Deployment and Inference Evaluation of the Quantized Model on Jetson

The overall multimodal ERC model in this study was structured as a teacher–student architecture consisting of a text-based teacher model, a VLM-based student model, and a fusion model. Within this pipeline, the fusion module is responsible for integrating high-dimensional representations extracted from the text and visual modalities into the final prediction space. Despite having a relatively small number of parameters, the fusion module directly influences the final classification performance through its gating values and fusion logic. Due to these characteristics, the fusion module constitutes the most sensitive component to performance variations induced by quantization, making it a suitable target for analyzing the effectiveness of model compression techniques. Moreover, applying quantization uniformly across the entire pipeline makes it difficult to clearly attribute performance degradation to specific components. Therefore, prior to full-pipeline quantization, we performed an isolated quantization of the fusion module to conduct a sensitivity analysis. The expected advantages of this fusion-only quantization experiment are as follows:
  • Independently analyze the impact of weight-only quantization applied to the fusion stage on the WA-F1.
  • Verify whether model size reduction and inference speed improvement are achieved in practice.
  • Identify quantization techniques that can be applied without quality degradation prior to full-pipeline compression.
  • Provide a baseline reference for subsequent quantization of the text and VLM models.
The quantization models used for comparison with the proposed edge-optimized WOQ pipeline are described as follows:
Dynamic Quantization: Dynamic quantization is a quantization scheme in which the model weights are pre-quantized and stored in INT8 precision, while the activation scales are dynamically computed at inference time. This approach does not require a separate calibration process and is simple to apply, making it widely used in environments where lightweight model deployment is required. First, the weight matrix W R o × i of a linear layer is quantized to INT8 precision as follows:
$$W_{\mathrm{int8}} = Q(W), \qquad s_w = \frac{\max\left(|W|\right)}{127}$$
where $Q(\cdot)$ denotes the quantization function that maps real-valued inputs to the INT8 range, and $s_w$ represents the static (scale-fixed) weight scale. This process is performed only once during the model build stage. For the input activation $x$, the scale is dynamically computed at each inference step. Specifically, based on the distribution of the current input batch, the activation scale is defined as:
$$s_x = \frac{\max\left(|x|\right)}{127}, \qquad x_{\mathrm{int8}} = Q(x)$$
The activation scale $s_x$ is recomputed at every inference step, allowing the quantization process to flexibly adapt to variations in the activation distribution. Before the actual computation, the quantized weights and activations are dequantized back to the floating-point domain:
$$\hat{W} = s_w \cdot W_{\mathrm{int8}}, \qquad \hat{x} = s_x \cdot x_{\mathrm{int8}}$$
Finally, the linear operation is performed in the floating-point domain as:
$$y = \hat{W}\hat{x} + b$$
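The four steps of dynamic quantization can be sketched end-to-end in NumPy as follows (per-tensor scales, as in the equations above; the function name is illustrative). The output error is bounded by the half-step rounding errors of the weights and activations combined.

```python
import numpy as np

def dynamic_int8_linear(W, x, b):
    """Dynamic INT8 linear: static weight scale, activation scale per call."""
    s_w = np.abs(W).max() / 127.0                  # fixed once at build time
    W_i8 = np.round(W / s_w).astype(np.int8)
    s_x = np.abs(x).max() / 127.0                  # recomputed at every inference
    x_i8 = np.round(x / s_x).astype(np.int8)
    W_hat = W_i8.astype(np.float32) * s_w          # dequantize before the GEMM
    x_hat = x_i8.astype(np.float32) * s_x
    return W_hat @ x_hat + b
```

The per-call computation of `s_x` is exactly the scaling overhead that, at full-pipeline scale, made the Dynamic INT8 configuration slow on the Jetson GPU in the later experiments.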
SmoothQuant: SmoothQuant is a co-quantization method that applies channel-wise scaling to mitigate the dynamic range mismatch between activations and weights. In Transformer-based models, activation distributions are often wide and vary significantly across layers, whereas weight distributions tend to be relatively stable. As a result, quantizing activations can cause quantization errors to concentrate on specific channels. SmoothQuant addresses this issue by applying complementary scaling to both activations and weights. For a linear layer, the scaling factor $\alpha_c$ for channel $c$, which considers the distributions of the activation and weight, is defined as:
$$\alpha_c = \left(\frac{\max\left(|x_c|\right)}{\max\left(|W_c|\right)}\right)^{\rho}$$
where $\rho \in [0, 1]$ is a hyperparameter that controls the proportion of scale allocation between activations and weights. When $\rho = 0$, SmoothQuant behaves similarly to weight-only quantization, whereas when $\rho = 1$, activation-dominant scaling is applied. Using the computed scaling factor, the weight and activation are transformed as follows:
$$W_c' = \alpha_c\, W_c, \qquad x_c' = \frac{x_c}{\alpha_c}$$
INT8 quantization is then applied to the scaled weight and activation:
$$W_{\mathrm{int8}} = Q(W_c'), \qquad x_{\mathrm{int8}} = Q(x_c')$$
This scaling process does not affect the actual computation during inference, as the following relationship holds:
$$W_c'\, x_c' = W_c\, x_c$$
Through this mechanism, SmoothQuant balances the dynamic ranges of activations and weights while preserving the original computation results. Although SmoothQuant has been shown to be effective for attention-based architectures with highly non-uniform activation distributions, it involves activation quantization and may therefore introduce numerical instability in multimodal fusion stages.
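The smoothing step can be sketched in NumPy as follows. The power-form scaling factor is a reconstruction of the formulation above, so the exponent convention is an assumption; the key property demonstrated is that the rescaled pair leaves the linear computation unchanged.

```python
import numpy as np

def smooth_scales(x, W, rho=0.5, eps=1e-8):
    """Per-input-channel smoothing: alpha_c = (max|x_c| / max|W_c|)^rho."""
    a_x = np.abs(x).max(axis=0) + eps   # activation range per channel c
    a_w = np.abs(W).max(axis=0) + eps   # weight range per channel c
    return (a_x / a_w) ** rho

# x: [N, C_in] activations, W: [C_out, C_in] weights.
# Scaling W by alpha and x by 1/alpha preserves the product: W' x' = W x.
```

With `rho=0` the factors collapse to 1, recovering weight-only-like behavior; larger `rho` shifts more of the activations' dynamic range into the weights before quantization.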
TensorRT INT8 Quantization: TensorRT INT8 Quantization is a static INT8 quantization approach optimized for NVIDIA GPU architectures, designed to enable high-performance INT8 General Matrix Multiplication (GEMM) for convolutional and linear operations. This method determines fixed scaling factors based on activation distribution ranges measured using pre-collected calibration data. During the calibration stage, the activation scale s x is defined as:
$$s_x = \frac{\max_{x \in D_{\mathrm{cal}}} |x|}{127}$$
Alternatively, when percentile-based calibration is used, the activation scale is computed as:
$$s_x = \frac{\mathrm{Percentile}_p\left(|x|\right)}{127}$$
The weights are also quantized to INT8 using a static scale:
$$W_{\mathrm{int8}} = Q(W), \qquad s_w = \frac{\max\left(|W|\right)}{127}$$
INT8 inference is executed using TensorRT’s internally optimized INT8 GEMM operations, and the final output is computed as:
$$y = \mathrm{INT8\,GEMM}\left(W_{\mathrm{int8}},\, x_{\mathrm{int8}}\right) + b$$
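The two calibration rules, max-based and percentile-based, can be contrasted in a short NumPy sketch (the function name and percentile value are illustrative):

```python
import numpy as np

def calibration_scales(batches, p=99.9):
    """Static activation scales from calibration data D_cal."""
    flat = np.abs(np.concatenate([b.ravel() for b in batches]))
    s_max = flat.max() / 127.0              # max-based calibration
    s_pct = np.percentile(flat, p) / 127.0  # percentile calibration clips outliers
    return s_max, s_pct
```

With outlier-heavy activations, the percentile scale is smaller than the max-based one, preserving resolution for the bulk of the distribution at the cost of clipping rare extremes.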
In the experiments on deploying quantized emotion recognition models in edge environments, not only emotion classification performance (WA-F1) but also practical efficiency metrics, namely model memory footprint, inference latency, and inference throughput, are jointly evaluated.
In the fusion module quantization experiments, the original model, in which the text encoder, the VLM encoder, and the fusion module were all maintained in FP32 precision, was first established as the baseline. Subsequently, four quantization configurations were applied exclusively to the fusion module, and their end-to-end emotion classification performance was compared. In terms of emotion classification accuracy measured by WA-F1 (%), the original FP32 model achieved the highest performance at 67.80%. The TensorRT + SmoothQuant configuration obtained a WA-F1 score of 65.82%, corresponding to a drop of 1.98 percentage points compared to FP32. The TensorRT INT8 approach exhibited the largest performance drop, recording 65.31%, a decrease of 2.49 percentage points. This drop can be attributed to TensorRT's use of static, fixed-scale quantization at the tensor or channel level, which fails to sufficiently preserve subtle distributional variations after LayerNorm.
In contrast, both the proposed edge-optimized WOQ (fusion-only) method and the Dynamic INT8 method achieved a WA-F1 score of 66.05%, a smaller drop of 1.75 percentage points relative to FP32. Among the evaluated quantization methods, these two approaches exhibited the smallest performance drop, indicating that they effectively control the numerical sensitivity of the fusion stage and maintain stable prediction performance even under aggressive INT8 quantization.
The impact of quantization was also clearly observed in terms of model size. The FP32 fusion module occupied approximately 13.5 MB. In contrast, the Dynamic INT8 method and the proposed edge-optimized WOQ achieved the highest compression ratios, with model sizes of 3.4 MB and 3.6 MB, respectively. In comparison, the TensorRT + SmoothQuant configuration and the TensorRT INT8 approach resulted in larger sizes of 7.1 MB and 7.4 MB, respectively, corresponding to roughly a 50% reduction relative to FP32 but remaining noticeably larger than the Dynamic INT8 method and proposed approach. These results indicate that since the fusion module primarily consists of linear weights, weight-only INT8 quantization provides the greatest benefit in terms of model size reduction.
Inference latency improvements were also observed across all INT8 configurations. The FP32 model exhibited an average latency of approximately 200.35 ms per sample. The TensorRT + SmoothQuant configuration and the TensorRT INT8 approach achieved the fastest inference speeds, recording latencies of 139.01 ms and 138.23 ms, respectively, corresponding to an improvement of approximately 31%. The Dynamic INT8 method achieved a latency of 148.10 ms, while the proposed edge-optimized WOQ recorded 165.57 ms, yielding speed improvements of approximately 26% and 17%, respectively.
The improvements in inference latency were also reflected in inference throughput. The FP32 model achieved a throughput of 4.99 samples/s. The TensorRT + SmoothQuant configuration and the TensorRT INT8 approach achieved higher throughputs of 7.19 samples/s and 7.23 samples/s, respectively. The Dynamic INT8 method recorded 6.75 samples/s, while the proposed edge-optimized WOQ achieved 6.04 samples/s. The results of the fusion-only quantization experiments are illustrated in Figure 9.
Overall, the Dynamic INT8 method and the proposed edge-optimized WOQ pipeline maintained accuracy levels comparable to FP32 while providing consistent improvements in both model size and inference latency. In contrast, although the TensorRT INT8 approach delivered superior speed, it incurred relatively larger accuracy losses, and the TensorRT + SmoothQuant configuration showed less stable performance.
This fusion-only quantization experiment serves as an important baseline, demonstrating that the fusion module can be safely converted to INT8 in future full-pipeline quantization. Ultimately, these results support the effectiveness of the proposed edge-optimized WOQ pipeline in achieving both efficiency and stability for deploying large-scale multimodal models in edge environments.
In the full-pipeline quantization experiment, quantization was applied to all submodules, including the text encoder, the VLM encoder, and the fusion module, and the overall system performance was evaluated from an end-to-end perspective. Accordingly, the full-pipeline experiments compared four quantization configurations, including the TensorRT INT8 approach, the Dynamic INT8 method, and the proposed edge-optimized WOQ pipeline.
The proposed edge-optimized WOQ pipeline is distinguished from conventional weight-only quantization methods in that it consistently applies a mixed-precision strategy across the text encoder, VLM encoder, and fusion module. Specifically, linear weights are converted to INT8, while activations and non-quantized operations are retained in FP16/FP32. Accordingly, the objective of this experiment is not merely to compare individual INT8 conversion techniques, but to verify whether a large-scale multimodal pipeline can be constructed for real-time inference by applying a consistent, edge-friendly precision and scaling policy to all components.
For evaluation, the original model, which maintained the text encoder, VLM encoder, and fusion module entirely in FP32, was first established as the baseline, and four quantization configurations were compared in an end-to-end manner. The FP32 baseline achieved the highest accuracy, with a WA-F1 of 67.80%. However, the text encoder alone occupied approximately 1.32 GB of memory, and the end-to-end inference latency was measured at around 200.35 ms per sample, rendering real-time inference impractical on the NVIDIA Jetson AGX Orin platform. This baseline analysis clearly indicates the necessity of full-pipeline INT8 compression.
Experiment 1 applied FP16 TorchScript conversion only to the text encoder, while the VLM encoder and fusion module were converted to TensorRT INT8. In this configuration, the text model underwent only FP32-to-FP16 reduction, whereas the vision-related submodules leveraged TensorRT’s static (symmetric) INT8 scale-based optimization. This configuration achieved a WA-F1 score of 65.81%, indicating a relatively large performance drop compared to the baseline. The model sizes of the text encoder, VLM encoder, and fusion module were reduced to 713.3 MB, 0.88 MB, and 7.4 MB, respectively. Although this configuration achieved a relatively fast inference latency of approximately 43.6 ms per sample, attempts to convert the text encoder into a fully TensorRT INT8 engine resulted in integration issues. Specifically, the sequence length, attention mask, and hidden-state shapes could not be statically fixed, leading to mismatches in fusion input dimensions. Due to these structural constraints, the text encoder could not be converted to INT8, and this configuration exhibited both accuracy degradation and pipeline integration limitations.
Experiment 2 applied ONNX Runtime-based Dynamic INT8 quantization to the text encoder, VLM encoder, and fusion module. This approach is optimized for CPU-based GEMM operations and requires dynamic scale recomputation even on GPU, resulting in substantial tensor-level scaling overhead. The experimental results clearly revealed these limitations. Although the WA-F1 score remained acceptable at 66.11%, memory reduction was limited, with the text model and fusion module occupying 512.9 MB and 3.4 MB, respectively. More critically, the inference latency reached 616.21 ms per sample, making this configuration impractical for real-time deployment. These results demonstrate that while Dynamic INT8 preserves accuracy, it is not a viable option in GPU-based edge environments such as NVIDIA Jetson AGX Orin due to excessive latency.
Experiment 3 applied the proposed edge-optimized WOQ scheme consistently across the entire pipeline. In this configuration, all linear layers in the text encoder, VLM encoder, and fusion module were replaced with AWQ-lite-based INT8 linear modules, while activations, LayerNorm, and bias terms were retained in FP16/FP32. This is not a simple weight-only conversion but an edge-friendly quantization pipeline explicitly designed for numerical stability and real-time performance, and it was applied stably across the text encoder's self-attention blocks, hidden-state flows, and residual connections. As a result, the model sizes of the text encoder, VLM encoder, and fusion module were reduced to 409.2 MB, 0.61 MB, and 3.6 MB, respectively. The end-to-end WA-F1 score reached 66.27%, very close to the FP32 baseline (67.80%). In addition, the inference latency was reduced to approximately 42.51 ms per sample, and the throughput reached 23.52 samples/s, slightly faster than the TensorRT-based configuration. Notably, the successful conversion of a large-scale self-attention-based text encoder without scale-information loss confirms that the proposed edge-optimized WOQ pipeline is a stable and consistent strategy for compressing Transformer-based multimodal models.
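The core arithmetic of the WOQ scheme, per-output-channel symmetric INT8 weights combined with floating-point activations and bias, can be sketched in plain Python as follows (an illustration of the quantization math only, not the AWQ-lite implementation itself):

```python
def quantize_linear_weights(weight_rows):
    """Per-output-channel symmetric INT8 weight quantization:
    each output channel (weight row) gets its own scale, chosen so the
    channel's largest-magnitude weight maps to +/-127. Only weights are
    quantized; activations and bias stay in floating point."""
    q_rows, scales = [], []
    for row in weight_rows:
        max_abs = max((abs(w) for w in row), default=0.0) or 1e-8
        scale = max_abs / 127.0
        q_rows.append([max(-128, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def woq_linear(x, q_rows, scales, bias):
    """Weight-only-quantized linear layer: integer weights are
    dequantized on the fly via the per-channel scale; the FP bias is
    added at full precision."""
    return [
        scale * sum(qw * xi for qw, xi in zip(q_row, x)) + b
        for q_row, scale, b in zip(q_rows, scales, bias)
    ]
```

Because the scales are fixed per channel at conversion time, no runtime calibration is needed, which is what lets this scheme avoid both the static-shape constraints of Experiment 1 and the per-call scaling overhead of Experiment 2.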
Experiment 4 adopted a hybrid configuration informed by the fusion-only compression results. In this setup, the proposed edge-optimized WOQ pipeline was applied to the text encoder and VLM encoder to ensure stable INT8 conversion, while the structurally simple fusion module was quantized using the Dynamic INT8 method. Motivated by the observation that Dynamic INT8 yielded faster fusion-only inference, this configuration employed GPU execution for the text and VLM modules and CPU execution for the fusion module, aiming to preserve accuracy and stability in the text-VLM components while minimizing overhead in the fusion stage. The resulting model sizes for the text encoder, VLM encoder, and fusion module were 409.2 MB, 0.61 MB, and 3.4 MB, respectively. The WA-F1 score also matched that of the proposed pipeline at 66.27%. The inference latency was approximately 44.26 ms per sample, only 1–2 ms slower than the TensorRT-based configuration and still well within real-time requirements for embedded environments. However, the slight increase in latency compared to Experiment 3 reflects the inherent limitations of CPU-based inference. The results of the full-pipeline quantization experiments are illustrated in Figure 10.
In summary, while the FP32 baseline model provided the highest accuracy, its memory footprint and inference latency make it unsuitable for real-time deployment on edge devices. The TensorRT-based configuration in Experiment 1 achieved excellent speed but suffered from accuracy degradation and structural constraints that prevented full INT8 conversion of the text encoder. The Dynamic INT8-based configuration in Experiment 2 maintained reasonable accuracy but incurred prohibitive latency. In contrast, the proposed edge-optimized WOQ pipeline in Experiment 3 preserved accuracy close to FP32 while consistently reducing both the model size and inference latency across the entire pipeline. The hybrid configuration in Experiment 4 demonstrated comparable performance but exhibited slightly inferior inference speed relative to the proposed pipeline. Overall, these results experimentally validate that the proposed edge-optimized WOQ pipeline is an optimal compression strategy for Transformer- and linear-based multimodal pipelines, achieving minimal information loss while enabling real-time deployment in edge environments such as NVIDIA Jetson AGX Orin.

4.5. Ablation Study on a Server-Grade GPU

To investigate the contribution of each component of the proposed model to the overall performance, an ablation study was conducted in a server-grade GPU environment. Through this analysis, we quantitatively evaluated the extent to which the key mechanisms of the proposed DDVLM model contribute to emotion recognition performance. Table 4 summarizes the WA-F1 results of the ablation study. The experimental results show that performance degradation is observed when either self-distillation or cross-modal knowledge distillation is removed, indicating that both mechanisms play a crucial role in improving overall emotion recognition performance. In particular, excluding cross-modal knowledge distillation results in reduced stability in visual representations, leading to a noticeable performance drop. Similarly, removing self-distillation weakens the stability of textual representations, which in turn causes a decline in classification accuracy. These findings confirm that the proposed mechanisms complement each other and jointly contribute to performance enhancement. Removing the residual fusion module results in a decrease in classification performance, demonstrating the importance of the proposed fusion strategy in integrating multimodal representations.
Furthermore, the effectiveness of the proposed techniques is clearly reflected in the unimodal performance analysis. The text modality with self-distillation achieves higher classification accuracy than its counterpart without self-distillation, demonstrating that EMA-based self-distillation effectively enhances the consistency and generalization of textual representations. Likewise, the VLM-based visual modality attains improved performance when cross-modal knowledge distillation is applied compared to standalone training, indicating that emotion distribution knowledge transferred from the text modality enhances the discriminative capability of visual representations. Overall, these results demonstrate that each core component of the proposed DDVLM model provides meaningful performance gains, and that their combination leads to the highest emotion recognition performance.
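The two distillation mechanisms examined in this ablation can be summarized as a temperature-scaled KL loss plus an EMA weight update. The sketch below uses the hyperparameters from Table 2 (temperature 4 for the response-level KD loss, EMA decay 0.99) but is otherwise an illustrative reconstruction, not the authors' code:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, T=4.0):
    """Soft-label distillation loss: KL divergence between the teacher's
    and student's temperature-softened emotion distributions, scaled by
    T^2 as is standard for knowledge distillation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T * T) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def ema_update(teacher_weights, student_weights, decay=0.99):
    """Exponential-moving-average teacher update for self-distillation:
    the teacher tracks a slowly moving average of the student's weights."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]
```

In cross-modal KD the text model's logits play the teacher role and the visual student is pulled toward them; in self-distillation the EMA copy of the text encoder supplies the soft labels for its own online counterpart.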

5. Conclusions

In this paper, we analyzed the limitations of existing conversation-based multimodal ERC systems, focusing on the representational imbalance across modalities, the instability of visual representations, and the high computational cost and limited deployability associated with large-scale pretrained backbones. To address these challenges, we jointly proposed a knowledge-distillation-based multimodal ERC model, termed DDVLM, together with an edge-optimized WOQ pipeline that enables efficient deployment in resource-constrained edge environments.
To alleviate the problem that naive multimodal fusion may even degrade performance due to the reliability gap between textual and visual modalities, we adopted a knowledge distillation strategy in which the textual modality, characterized by higher emotional density and classification accuracy, serves as the teacher, while the visual modality, which is comparatively less reliable, is treated as the student. By transferring the emotion distribution learned by the text modality to the visual modality, the proposed framework improves the representational quality of non-verbal cues and enhances the overall stability of multimodal learning. Furthermore, to strengthen the representational stability of the teacher itself, we applied EMA-based self-distillation to the textual modality. By jointly leveraging hard-label supervision and soft-label alignment, this approach improves the consistency and generalization of text representations, thereby constructing a more reliable teacher and maximizing the effectiveness of cross-modal knowledge transfer. From the perspective of visual feature extraction, we employed a VLM instead of purely vision-based encoders, and projected the generated visual descriptions into a Sentence-BERT embedding space. This design enforces semantic alignment between textual and visual modalities within a shared representation space, simultaneously improving both the stability and efficiency of multimodal fusion.
In addition, we systematically examined the deployability of multimodal ERC models in real edge environments by analyzing computational and resource constraints when deploying a full-precision model trained in a server-grade GPU environment onto edge devices. Our experiments demonstrated that applying a consistent mixed-precision strategy across the text encoder, VLM encoder, and fusion module significantly reduces the model size and memory consumption while effectively lowering inference latency and maintaining performance close to that of the full-precision model. Notably, real-time inference was successfully achieved on resource-constrained edge devices such as NVIDIA Jetson.
Overall, the proposed DDVLM framework mitigates modality imbalance and enhances multimodal learning stability through knowledge distillation and EMA-based self-distillation, while VLM-Sentence-BERT alignment effectively strengthens semantic consistency between text and visual modalities, leading to superior performance on the MELD dataset compared with existing methods. Meanwhile, the edge-optimized WOQ pipeline applies edge-friendly mixed-precision quantization to large-scale multimodal ERC architectures, minimizing accuracy degradation while substantially reducing model size, memory footprint, and inference latency. Taken together, this work presents a practical multimodal ERC approach that balances recognition performance with deployment efficiency.
Nevertheless, several limitations remain. Although the proposed edge-optimized quantization strategy preserves accuracy close to the full-precision model and effectively minimizes performance loss after quantization, a certain level of degradation still persists compared with server-grade full-precision training. Model compression inevitably introduces a trade-off between computational efficiency and classification performance, and stronger compression may lead to further accuracy degradation. Future work will therefore focus on precision-control mechanisms and module-wise adaptive quantization strategies that can better balance model compression and recognition performance while maintaining lightweight deployment in edge environments.
In addition, the proposed method converts visual inputs into textual descriptions through a VLM before feature embedding. Although this design enables semantic alignment between textual and visual modalities, it may introduce potential limitations in preserving fine-grained visual information. In particular, subtle visual cues such as micro-expressions, facial dynamics, and nuanced gestures may not be fully captured during the visual-to-text conversion process. Future research will therefore investigate hybrid approaches that combine language-aligned visual representations with direct visual feature extraction in order to better preserve detailed visual information while maintaining cross-modal semantic alignment.

Author Contributions

Conceptualization, D.K., Y.i.L., D.H.Y. and B.J.K.; Methodology, D.K. and Y.i.L.; Software, D.K. and Y.i.L.; Validation, D.K., Y.i.L., D.H.Y. and B.J.K.; Formal analysis, D.K.; Investigation, D.K. and Y.i.L.; Data curation, D.K. and Y.i.L.; Writing—original draft preparation, D.K.; Writing—review and editing, D.K., Y.i.L., D.H.Y., B.J.K. and D.-H.K.; Visualization, D.K. and Y.i.L.; Supervision, D.-H.K.; Project administration, D.-H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2024-00336286) and in part by an Inha University Research Grant.

Data Availability Statement

Publicly available data were analyzed in this study. The MELD dataset is publicly accessible at https://affective-meld.github.io/ and is described in arXiv:1810.02508.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Li, J.; Wang, X.; Lv, G.; Zeng, Z. GA2MIF: Graph and Attention Based Two-Stage Multi-Source Information Fusion for Conversational Emotion Detection. IEEE Trans. Affect. Comput. 2024, 15, 130–143. [Google Scholar] [CrossRef]
  2. Yun, T.; Lim, H.; Lee, J.; Song, M. TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 82–95. [Google Scholar]
  3. Hu, G.; Lin, T.-E.; Zhao, Y.; Lu, G.; Wu, Y.; Li, Y. UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Mexico City, Mexico, 2022; pp. 7837–7851. [Google Scholar]
  4. Li, J.; Wang, X.; Lv, G.; Zeng, Z. GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition. IEEE Trans. Multimed. 2023, 26, 77–89. [Google Scholar] [CrossRef]
  5. Hwang, Y.; Kim, J.-H. EASUM: Enhancing Affective State Understanding through Joint Sentiment and Emotion Modeling for Multimodal Tasks. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: Waikoloa, HI, USA, 2024; pp. 5668–5678. [Google Scholar]
  6. Guo, L.; Song, Y.; Ding, S. Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation. Knowl.-Based Syst. 2024, 296, 111969. [Google Scholar] [CrossRef]
  7. Meng, T.; Shou, Y.; Ai, W.; Du, J.; Liu, H.; Li, K. A multi-message passing framework based on heterogeneous graphs in conversational emotion recognition. Neurocomputing 2024, 569, 127109. [Google Scholar] [CrossRef]
  8. Lu, N.; Tan, Z.; Qian, J. MRSLN: A Multimodal Residual Speaker-LSTM Network to alleviate the over-smoothing issue for Emotion Recognition in Conversation. Neurocomputing 2024, 580, 127467. [Google Scholar] [CrossRef]
  9. Ai, W.; Shou, Y.; Meng, T.; Yin, N.; Li, K. DERGCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4908–4921. [Google Scholar] [CrossRef]
  10. Meng, T.; Shou, Y.; Ai, W.; Yin, N.; Li, K. Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations. IEEE Trans. Artif. Intell. 2024, 5, 6472–6487. [Google Scholar] [CrossRef]
  11. Qwen Team; Alibaba Group. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
  12. Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.-Y. MPNet: Masked and Permuted Pre-training for Language Understanding. arXiv 2020, arXiv:2004.09297. [Google Scholar] [CrossRef]
  13. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  14. Dai, Y.; Li, Y.; Chen, D.; Li, J.; Lu, G. Multimodal Decoupled Distillation Graph Neural Network for Emotion Recognition in Conversation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9910–9924. [Google Scholar] [CrossRef]
  15. Dai, Y.; Li, J.; Li, Y.; Lu, G. Multimodal graph context extraction and consensus-aware learning for emotion recognition in conversation. Knowl.-Based Syst. 2024, 298, 111954. [Google Scholar] [CrossRef]
  16. Song, R.; Giunchiglia, F.; Shi, L.; Shen, Q.; Xu, H. SUNET: Speaker-utterance interaction Graph Neural Network for Emotion Recognition in Conversations. Eng. Appl. Artif. Intell. 2023, 123, 106315. [Google Scholar] [CrossRef]
  17. Van, C.T.; Tran, T.V.T.; Nguyen, V.; Hy, T.S. Effective Context Modeling Framework for Emotion Recognition in Conversations. In ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Waikoloa, HI, USA, 2025. [Google Scholar]
  18. Su, Y.; Wei, Y.; Nie, W.; Zhao, S.; Liu, A. Dynamic Causal Disentanglement Model for Dialogue Emotion Detection. IEEE Trans. Affect. Comput. 2024, 16, 1–14. [Google Scholar] [CrossRef]
  19. Tu, G.; Liang, B.; Qin, B.; Wong, K.-F.; Xu, R. An Empirical Study on Multiple Knowledge from ChatGPT for Emotion Recognition in Conversations. In Findings of the Association for Computational Linguistics: EMNLP 2023; Association for Computational Linguistics: Mexico City, Mexico, 2023; pp. 12160–12173. [Google Scholar]
  20. Chen, F.; Shao, J.; Zhu, A.; Ouyang, D.; Liu, X.; Shen, H.T. Modeling Hierarchical Uncertainty for Multimodal Emotion Recognition in Conversation. IEEE Trans. Cybern. 2024, 54, 187–198. [Google Scholar] [CrossRef]
  21. Shen, S.; Liu, F.; Wang, H.; Zhou, A. Towards Speaker-Unknown Emotion Recognition in Conversation via Progressive Contrastive Deep Supervision. IEEE Trans. Affect. Comput. 2025, 16, 2261–2273. [Google Scholar] [CrossRef]
  22. Kang, Y.; Cho, Y.-S. Beyond Single Emotion: Multilabel Approach to Conversational Emotion Recognition. Proc. AAAI Conf. Artif. Intell. 2025, 39, 24321–24329. [Google Scholar] [CrossRef]
  23. Cao, Y.; Huang, L.; Tang, Y. PeTracker: Poincaré-based Dual-Strategy Emotion Tracker for Emotion Recognition in Conversation. IEEE Trans. Affect. Comput. 2025, 16, 2020–2032. [Google Scholar] [CrossRef]
  24. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  25. Ghosal, D.; Majumder, N.; Gelbukh, A.; Mihalcea, R.; Poria, S. COSMIC: CommonSense knowledge for eMotion Identification in Conversations. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Mexico City, Mexico, 2020; pp. 2470–2481. [Google Scholar]
  26. Liang, J.; Li, W.; Zhong, Q.; Huang, J.; Jiang, D.; Cambria, E. Learning chain for clause awareness: Triplex-contrastive learning for emotion recognition in conversations. Inf. Sci. 2025, 705, 121969. [Google Scholar] [CrossRef]
  27. Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; Choi, Y. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Mexico City, Mexico, 2019; pp. 4762–4779. [Google Scholar]
  28. Lin, J.; Tang, J.; Tang, H.; Yang, S.; Chen, W.-M.; Wang, W.-C.; Xiao, G.; Dang, X.; Gan, C.; Han, S. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv 2024, arXiv:2306.00978. [Google Scholar] [CrossRef]
  29. Xiao, G.; Lin, J.; Seznec, M.; Wu, H.; Demouth, J.; Han, S. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv 2024, arXiv:2211.10438. [Google Scholar]
  30. Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned Step Size Quantization. arXiv 2020, arXiv:1902.08153. [Google Scholar] [CrossRef]
  31. Bhalgat, Y.; Lee, J.; Nagel, M.; Blankevoort, T.; Kwak, N. LSQ+: Improving low-bit quantization through learnable offsets and better initialization. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: Waikoloa, HI, USA, 2020. [Google Scholar]
  32. Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2023, arXiv:2210.17323. [Google Scholar]
  33. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. arXiv 2019, arXiv:1810.02508. [Google Scholar] [CrossRef]
Figure 1. Unimodal weighted average F1-score (WA-F1) on the MELD dataset.
Figure 2. Proposed method architecture.
Figure 3. Example of text modality feature extraction process.
Figure 4. Example of visual modality feature extraction process.
Figure 5. Architecture of the proposed edge-optimized WOQ pipeline.
Figure 6. Emotion class distribution and imbalance in MELD.
Figure 7. NVIDIA Jetson AGX Orin 64 GB developer kit.
Figure 8. Confusion matrices on MELD.
Figure 9. Fusion-only quantization experiment.
Figure 10. Full-pipeline quantization experiment.
Table 1. Comparison of recent ERC models in terms of core mechanisms and modalities.

Model | Core Mechanism | Modality
GA2MIF [1] | Multi-Head Directed Graph Attention Networks, Multi-Head Pairwise Cross-Modal Attention Networks | Text, Visual, Audio
TelME [2] | Knowledge Distillation, Attention-Based Shifting Fusion | Text, Visual, Audio
UniMSE [3] | Label Formalization, Pre-Trained Modality Fusion, Inter-Modality Contrastive Learning | Text, Visual, Audio
GraphCFC [4] | Graph-Based Cross-Modal Feature Complementation, Pairwise Cross-Modal Complementary, Multi-Subspace Mapping | Text, Visual, Audio
EASUM [5] | Domain General Model, Domain Specific Model, Pseudo Label Learning | Text, Visual, Audio
SACCMA [6] | Speaker-Aware Cognitive Network, Cross-Modal Attention Fusion | Text, Visual, Audio
MMPGCN [7] | Heterogeneous Graph Construction, Multivariate Message Passing Graph Convolutional Network | Text, Visual, Audio
MRSLN [8] | Residual Speaker-LSTM Network, Inter-Speaker Dependency, Intra-Speaker Context | Text, Visual, Audio
DER-GCN [9] | Masked Graph Representation Learning, Multi-Relational Information Aggregation | Text, Visual, Audio
CBERL [10] | Data Augmentation, Intermodal Feature Fusion, Graph Interaction Network | Text, Visual, Audio
D2GNN [14] | Decoupled Representation Learning, Supervised Prototype Contrastive Learning | Text, Visual, Audio
GCCL [15] | Contrastive Learning, Graph Context Extraction, Consensus-Aware Learning | Text, Visual, Audio
SUNET [16] | Speaker-Utterance Heterogeneous Graph Construction, Directed Conversation Graph Modeling | Text
ConxGNN [17] | Inception Graph Module, Hypergraph Module | Text, Visual, Audio
Causal-DAG [18] | Causal Directed Acyclic Graph, Hidden Variable Disentanglement | Text
MKFM [19] | Auxiliary Contextual Knowledge, Auxiliary Label Knowledge, Supervised Contrastive Learning | Text
HU-Dialogue [20] | Context-Level Uncertainty, Modality-Level Uncertainty, Capsule Network | Text, Visual, Audio
PCDS [21] | Progressive Contrastive Deep Supervision, Speaker Contrast and Clustering, Contrastive Learning | Text, Audio
ML-ERC [22] | Pseudo Multi-Label Generation, Multi-Label Weighted Supervised Contrastive Loss, Soft Multi-Labeling | Text
PeTracker [23] | Hyperbolic Space Representation, Geometry Curriculum Learning, Stratification Contrastive Learning | Text
COSMIC [25] | Commonsense Knowledge, COMET | Text
CoTCL [26] | Triplex Contrastive Learning, Pleasure-Arousal-Dominance Space | Text
DDVLM (Ours) | EMA-Based Self-Distillation, Knowledge Distillation, Vision-Language Model, Residual Fusion | Text, Visual
Table 2. Hyperparameter settings.

Hyperparameter | Value
Batch size | 4
Learning rate (lr) | 1 × 10−5
Dropout | 0.2
Epochs | 10
EMA decay | 0.99
Self-distillation α for L_student^self | 0.7
Temperature for L_student^self | 2
Temperature for L_response | 4
Temperature for L_feature | 1
λ_JSD for L_feature | 0.2
Cross-modal KD α for L_student^visual | 1
Cross-modal KD β for L_student^visual | 1
Residual fusion α | 0.3
Table 3. Performance comparison of accuracy and WA-F1 on MELD (7-way).

Models | Year | Neutral | Surprise | Fear | Sadness | Joy | Disgust | Anger | Accuracy | WA-F1
HU-Dialogue [20] | 2024 | - | - | - | - | - | - | - | 61.38 | 58.56
GA2MIF [1] | 2023 | 76.92 | 49.08 | - | 27.18 | 51.87 | - | 48.52 | 61.65 | 58.94
SACCMA [6] | 2024 | - | - | - | - | - | - | - | 62.30 | 59.30
MMPGCN [7] | 2024 | 78.60 | 53.80 | 3.20 | 25.20 | 53.30 | 2.60 | 45.00 | 60.70 | 59.30
MRSLN [8] | 2024 | 77.13 | 50.36 | - | 25.08 | 55.47 | - | 48.31 | 62.11 | 59.41
D2GNN [14] | 2024 | 76.38 | 49.91 | - | 32.18 | 56.86 | - | 47.60 | 61.72 | 59.74
GCCL [15] | 2024 | 76.93 | 50.70 | - | 31.49 | 57.14 | - | 49.05 | 62.82 | 60.28
PCDS [21] | 2025 | 79.03 | 56.79 | - | 32.66 | 57.07 | - | 48.67 | 64.33 | 62.61
ML-ERC [22] | 2025 | - | - | - | - | - | - | - | - | 63.01
SUNET [16] | 2023 | - | - | - | - | - | - | - | - | 64.03
UniMSE [3] | 2022 | - | - | - | - | - | - | - | 65.09 | 65.51
MKFM [19] | 2023 | - | - | - | - | - | - | - | - | 65.66
ConxGNN [17] | 2025 | - | - | - | - | - | - | - | 66.28 | 65.69
EASUM [5] | 2024 | - | - | - | - | - | - | - | 66.70 | 65.93
DER-GCN [9] | 2025 | 80.60 | 51.00 | 10.40 | 41.50 | 64.30 | 10.30 | 57.40 | 66.80 | 66.10
PeTracker [23] | 2025 | - | - | - | - | - | - | - | - | 66.49
CoTCL [26] | 2025 | - | - | - | - | - | - | - | 68.00 | 66.53
CBERL [10] | 2024 | 82.03 | 57.91 | 22.23 | 41.36 | 65.67 | 24.65 | 55.31 | 67.78 | 66.89
TelME [2] | 2024 | 80.22 | 60.33 | 26.97 | 43.45 | 65.67 | 26.42 | 56.70 | - | 67.37
Causal-DAG [18] | 2024 | 82.60 | 67.40 | 10.80 | 38.60 | 65.00 | 16.20 | 50.90 | - | 67.50
DDVLM (Ours) | 2025 | 80.66 | 60.97 | 31.25 | 45.83 | 64.67 | 30.91 | 56.00 | 68.28 | 67.80
Note: The highest accuracy and WA-F1 are shown in bold, while the second-best results are underlined.
Table 4. WA-F1 results of the ablation study.

Model/Modality | MELD (WA-F1)
DDVLM | 67.80
w/o Self-Distillation | 66.88 (0.92↓)
w/o Cross-Modal KD | 67.42 (0.38↓)
w/o Residual Fusion | 67.24 (0.56↓)
Self-Distillation Only Text | 67.13
w/o Self-Distillation Only Text | 66.46 (0.67↓)
Only Visual | 33.85
w/o Cross-Modal Only Visual | 31.66 (2.19↓)

Share and Cite

MDPI and ACS Style

Kim, D.; Lee, Y.i.; Yoon, D.H.; Kim, B.J.; Kim, D.-H. Dual-Distillation Vision-Language Model for Multimodal Emotion Recognition in Conversation with Quantized Edge Deployment. Appl. Sci. 2026, 16, 3103. https://doi.org/10.3390/app16063103
