Dual-Distillation Vision-Language Model for Multimodal Emotion Recognition in Conversation with Quantized Edge Deployment
Abstract
1. Introduction
- A knowledge distillation (KD) framework alleviates modality imbalance by employing the EMA-stabilized, more consistent textual modality as the teacher and the comparatively less accurate visual modality as the student (see the sketch after this list).
- A VLM with Sentence-BERT embeddings mitigates cross-modal representational mismatch by aligning visual features with the textual semantic space.
- An edge-optimized weight-only quantization (WOQ) pipeline enables efficient multimodal ERC deployment on resource-constrained edge devices while preserving performance.
- The proposed approach achieves state-of-the-art performance on the MELD benchmark.
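As a minimal, hypothetical sketch of the dual-distillation idea above, the PyTorch snippet below combines an EMA-updated text teacher (self-distillation) with cross-modal distillation from that teacher to the visual student. The linear classifiers, feature dimensions, and the pairing of temperatures with loss terms are illustrative assumptions rather than the exact DDVLM implementation; only the EMA decay (0.99), the self-distillation weight (0.7), and the temperature values (2 and 4) are taken from the hyperparameter table reported below.

```python
# Hypothetical sketch of the two distillation signals described above:
# (1) an EMA-updated text teacher supervising the text student (self-distillation),
# (2) the text teacher's logits supervising the visual student (cross-modal KD).
# Module names, feature dimensions, and the temperature-to-loss pairing are
# illustrative assumptions, not the authors' exact implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature):
    """Soft-label KL distillation loss with temperature scaling."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.99):
    """Exponential-moving-average update of the teacher from the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)


# Illustrative classifiers over pre-extracted utterance features (7 MELD emotions).
num_classes = 7
text_student = nn.Linear(768, num_classes)       # assumed text feature dim
text_teacher = copy.deepcopy(text_student)       # EMA copy, never updated by SGD
visual_student = nn.Linear(512, num_classes)     # assumed visual feature dim

text_feat = torch.randn(4, 768)                  # batch of text features
visual_feat = torch.randn(4, 512)                # batch of visual features
labels = torch.randint(0, num_classes, (4,))

logits_text = text_student(text_feat)
logits_visual = visual_student(visual_feat)
with torch.no_grad():
    logits_teacher = text_teacher(text_feat)

# Total loss: task CE + EMA self-distillation (text) + cross-modal KD (text -> visual).
loss = (
    F.cross_entropy(logits_text, labels)
    + F.cross_entropy(logits_visual, labels)
    + 0.7 * kd_loss(logits_text, logits_teacher, temperature=2.0)    # self-distillation
    + 1.0 * kd_loss(logits_visual, logits_teacher, temperature=4.0)  # cross-modal KD
)
loss.backward()
ema_update(text_teacher, text_student, decay=0.99)  # teacher trails the student
```

In this pattern the teacher receives no gradient updates; it only trails the student through the EMA step, which is what stabilizes the textual teacher signal used by both distillation losses.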
2. Related Work
2.1. Emotion Recognition in Conversation (ERC)
2.2. Quantization
3. Proposed Methods
3.1. Problem Definition
3.2. Model Overview
3.3. Feature Extraction
3.4. EMA-Based Self-Distillation
3.5. Cross-Modal Knowledge Distillation
3.6. Residual Fusion
3.7. Edge-Optimized WOQ Pipeline
4. Experimental Results
4.1. Dataset and Evaluation Metrics
4.2. Experimental Settings
4.3. Emotion Classification Results on a Server-Grade GPU
4.4. Edge Deployment and Inference Evaluation of the Quantized Model on Jetson
- Independently analyze the impact of weight-only quantization (WOQ) applied to the fusion stage on WA-F1 (a minimal WOQ sketch follows this list).
- Verify whether model-size reduction and inference-speed improvement are achieved in practice.
- Identify quantization techniques that can be applied without quality degradation prior to full-pipeline compression.
- Provide a baseline reference for subsequent quantization of the text and VLM models.
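For concreteness, the sketch below shows one common form of weight-only quantization that a fusion-stage evaluation could apply: the weights of a linear layer are stored as int8 with per-output-channel scales while activations remain in floating point. The class name, layer shape, and the on-the-fly dequantization in the forward pass are illustrative assumptions, not the deployed Jetson pipeline.

```python
# Hypothetical sketch of weight-only quantization (WOQ) for a fusion-stage linear
# layer: int8 weights with per-output-channel scales, floating-point activations.
# Layer shape and names are illustrative, not the authors' deployed pipeline.
import torch
import torch.nn as nn


class WOQLinear(nn.Module):
    """Linear layer with int8 per-channel weight-only quantization."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()                                    # (out, in)
        scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
        self.register_buffer("w_int8",
                             torch.clamp((w / scale).round(), -127, 127).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize on the fly; a real kernel would fuse this with the matmul.
        w = self.w_int8.float() * self.scale
        return nn.functional.linear(x, w, self.bias)


fusion_fp32 = nn.Linear(1024, 7)        # assumed fusion classifier (7 emotions)
fusion_woq = WOQLinear(fusion_fp32)

x = torch.randn(2, 1024)
err = (fusion_fp32(x) - fusion_woq(x)).abs().max().item()
fp32_bytes = fusion_fp32.weight.numel() * 4
int8_bytes = fusion_woq.w_int8.numel() + fusion_woq.scale.numel() * 4
print(f"max abs error: {err:.4f}, weight size: {fp32_bytes} B -> {int8_bytes} B")
```

Because only the weights are quantized, the WA-F1 impact can be checked by swapping such a module in for the original fusion layer and re-running evaluation, while the roughly 4x reduction in weight storage addresses the model-size objective above.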
4.5. Ablation Study on a Server-Grade GPU
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Li, J.; Wang, X.; Lv, G.; Zeng, Z. GA2MIF: Graph and Attention Based Two-Stage Multi-Source Information Fusion for Conversational Emotion Detection. IEEE Trans. Affect. Comput. 2024, 15, 130–143. [Google Scholar] [CrossRef]
- Yun, T.; Lim, H.; Lee, J.; Song, M. TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 82–95. [Google Scholar]
- Hu, G.; Lin, T.-E.; Zhao, Y.; Lu, G.; Wu, Y.; Li, Y. UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 7837–7851. [Google Scholar]
- Li, J.; Wang, X.; Lv, G.; Zeng, Z. GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition. IEEE Trans. Multimed. 2023, 26, 77–89. [Google Scholar] [CrossRef]
- Hwang, Y.; Kim, J.-H. EASUM: Enhancing Affective State Understanding through Joint Sentiment and Emotion Modeling for Multimodal Tasks. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: Waikoloa, HI, USA, 2024; pp. 5668–5678. [Google Scholar]
- Guo, L.; Song, Y.; Ding, S. Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation. Knowl.-Based Syst. 2024, 296, 111969. [Google Scholar] [CrossRef]
- Meng, T.; Shou, Y.; Ai, W.; Du, J.; Liu, H.; Li, K. A multi-message passing framework based on heterogeneous graphs in conversational emotion recognition. Neurocomputing 2024, 569, 127109. [Google Scholar] [CrossRef]
- Lu, N.; Tan, Z.; Qian, J. MRSLN: A Multimodal Residual Speaker-LSTM Network to alleviate the over-smoothing issue for Emotion Recognition in Conversation. Neurocomputing 2024, 580, 127467. [Google Scholar] [CrossRef]
- Ai, W.; Shou, Y.; Meng, T.; Yin, N.; Li, K. DERGCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4908–4921. [Google Scholar] [CrossRef]
- Meng, T.; Shou, Y.; Ai, W.; Yin, N.; Li, K. Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations. IEEE Trans. Artif. Intell. 2024, 5, 6472–6487. [Google Scholar] [CrossRef]
- Qwen Team; Alibaba Group. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
- Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.-Y. MPNet: Masked and Permuted Pre-training for Language Understanding. arXiv 2020, arXiv:2004.09297. [Google Scholar] [CrossRef]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Dai, Y.; Li, Y.; Chen, D.; Li, J.; Lu, G. Multimodal Decoupled Distillation Graph Neural Network for Emotion Recognition in Conversation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9910–9924. [Google Scholar] [CrossRef]
- Dai, Y.; Li, J.; Li, Y.; Lu, G. Multimodal graph context extraction and consensus-aware learning for emotion recognition in conversation. Knowl.-Based Syst. 2024, 298, 111954. [Google Scholar] [CrossRef]
- Song, R.; Giunchiglia, F.; Shi, L.; Shen, Q.; Xu, H. SUNET: Speaker-utterance interaction Graph Neural Network for Emotion Recognition in Conversations. Eng. Appl. Artif. Intell. 2023, 123, 106315. [Google Scholar] [CrossRef]
- Van, C.T.; Tran, T.V.T.; Nguyen, V.; Hy, T.S. Effective Context Modeling Framework for Emotion Recognition in Conversations. In ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Hyderabad, India, 2025. [Google Scholar]
- Su, Y.; Wei, Y.; Nie, W.; Zhao, S.; Liu, A. Dynamic Causal Disentanglement Model for Dialogue Emotion Detection. IEEE Trans. Affect. Comput. 2024, 16, 1–14. [Google Scholar] [CrossRef]
- Tu, G.; Liang, B.; Qin, B.; Wong, K.-F.; Xu, R. An Empirical Study on Multiple Knowledge from ChatGPT for Emotion Recognition in Conversations. In Findings of the Association for Computational Linguistics: EMNLP 2023; Association for Computational Linguistics: Singapore, 2023; pp. 12160–12173. [Google Scholar]
- Chen, F.; Shao, J.; Zhu, A.; Ouyang, D.; Liu, X.; Shen, H.T. Modeling Hierarchical Uncertainty for Multimodal Emotion Recognition in Conversation. IEEE Trans. Cybern. 2024, 54, 187–198. [Google Scholar] [CrossRef]
- Shen, S.; Liu, F.; Wang, H.; Zhou, A. Towards Speaker-Unknown Emotion Recognition in Conversation via Progressive Contrastive Deep Supervision. IEEE Trans. Affect. Comput. 2025, 16, 2261–2273. [Google Scholar] [CrossRef]
- Kang, Y.; Cho, Y.-S. Beyond Single Emotion: Multilabel Approach to Conversational Emotion Recognition. Proc. AAAI Conf. Artif. Intell. 2025, 39, 24321–24329. [Google Scholar] [CrossRef]
- Cao, Y.; Huang, L.; Tang, Y. PeTracker: Poincaré-based Dual-Strategy Emotion Tracker for Emotion Recognition in Conversation. IEEE Trans. Affect. Comput. 2025, 16, 2020–2032. [Google Scholar] [CrossRef]
- Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Ghosal, D.; Majumder, N.; Gelbukh, A.; Mihalcea, R.; Poria, S. COSMIC: COmmonSense knowledge for eMotion Identification in Conversations. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Online, 2020; pp. 2470–2481. [Google Scholar]
- Liang, J.; Li, W.; Zhong, Q.; Huang, J.; Jiang, D.; Cambria, E. Learning chain for clause awareness: Triplex-contrastive learning for emotion recognition in conversations. Inf. Sci. 2025, 705, 121969. [Google Scholar] [CrossRef]
- Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; Choi, Y. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2019; pp. 4762–4779. [Google Scholar]
- Lin, J.; Tang, J.; Tang, H.; Yang, S.; Chen, W.-M.; Wang, W.-C.; Xiao, G.; Dang, X.; Gan, C.; Han, S. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv 2024, arXiv:2306.00978. [Google Scholar] [CrossRef]
- Xiao, G.; Lin, J.; Seznec, M.; Wu, H.; Demouth, J.; Han, S. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv 2024, arXiv:2211.10438. [Google Scholar]
- Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned Step Size Quantization. arXiv 2020, arXiv:1902.08153. [Google Scholar] [CrossRef]
- Bhalgat, Y.; Lee, J.; Nagel, M.; Blankevoort, T.; Kwak, N. LSQ+: Improving low-bit quantization through learnable offsets and better initialization. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: Seattle, WA, USA, 2020. [Google Scholar]
- Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2023, arXiv:2210.17323. [Google Scholar]
- Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. arXiv 2019, arXiv:1810.02508. [Google Scholar] [CrossRef]
| Model | Core Mechanism | Modality |
|---|---|---|
| GA2MIF [1] | Multi-Head Directed Graph Attention Networks, Multi-Head Pairwise Cross-Modal Attention Networks | Text, Visual, Audio |
| TelME [2] | Knowledge Distillation, Attention-Based Shifting Fusion | Text, Visual, Audio |
| UniMSE [3] | Label Formalization, Pre-Trained Modality Fusion, Inter-Modality Contrastive Learning | Text, Visual, Audio |
| GraphCFC [4] | Graph-Based Cross-Modal Feature Complementation, Pairwise Cross-Modal Complementary, Multi-Subspace Mapping | Text, Visual, Audio |
| EASUM [5] | Domain General Model, Domain Specific Model, Pseudo Label Learning | Text, Visual, Audio |
| SACCMA [6] | Speaker-Aware Cognitive Network, Cross-Modal Attention Fusion | Text, Visual, Audio |
| MMPGCN [7] | Heterogeneous Graph Construction, Multivariate Message Passing Graph Convolutional Network | Text, Visual, Audio |
| MRSLN [8] | Residual Speaker-LSTM Network, Inter-Speaker Dependency, Intra-Speaker Context | Text, Visual, Audio |
| DER-GCN [9] | Masked Graph Representation Learning, Multi-Relational Information Aggregation | Text, Visual, Audio |
| CBERL [10] | Data Augmentation, Intermodal Feature Fusion, Graph Interaction Network | Text, Visual, Audio |
| GNN [14] | Decoupled Representation Learning, Supervised Prototype Contrastive Learning | Text, Visual, Audio |
| GCCL [15] | Contrastive Learning, Graph Context Extraction, Consensus-Aware Learning | Text, Visual, Audio |
| SUNET [16] | Speaker-Utterance Heterogeneous Graph Construction, Directed Conversation Graph Modeling | Text |
| ConxGNN [17] | Inception Graph Module, Hypergraph Module | Text, Visual, Audio |
| Causal-DAG [18] | Causal Directed Acyclic Graph, Hidden Variable Disentanglement | Text |
| MKFM [19] | Auxiliary Contextual Knowledge, Auxiliary Label Knowledge, Supervised Contrastive Learning | Text |
| HU-Dialogue [20] | Context-Level Uncertainty, Modality-Level Uncertainty, Capsule Network | Text, Visual, Audio |
| PCDS [21] | Progressive Contrastive Deep Supervision, Speaker Contrast and Clustering, Contrastive Learning | Text, Audio |
| ML-ERC [22] | Pseudo Multi-Label Generation, Multi-Label Weighted Supervised Contrastive Loss, Soft Multi-Labeling | Text |
| PeTracker [23] | Hyperbolic Space Representation, Geometry Curriculum Learning, Stratification Contrastive Learning | Text |
| COSMIC [25] | Commonsense Knowledge, COMET | Text |
| CoTCL [26] | Triplex Contrastive Learning, Pleasure-Arousal-Dominance Space | Text |
| DDVLM (Ours) | EMA-Based Self-Distillation, Knowledge Distillation, Vision-Language Model, Residual Fusion | Text, Visual |
| Hyperparameter | Value |
|---|---|
| Batch size | 4 |
| lr | 1 × 10⁻⁵ |
| Dropout | 0.2 |
| Epoch | 10 |
| EMA decay | 0.99 |
| Self-distillation for | 0.7 |
| Temperature for | 2 |
| Temperature for | 4 |
| Temperature for | 1 |
| | 0.2 |
| Cross-modal KD for | 1 |
| Cross-modal KD for | 1 |
| | 0.3 |
MELD (7-way) emotion classification results:

| Model | Year | Neutral | Surprise | Fear | Sadness | Joy | Disgust | Anger | Accuracy | WA-F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| HU-Dialogue [20] | 2024 | - | - | - | - | - | - | - | 61.38 | 58.56 |
| GA2MIF [1] | 2023 | 76.92 | 49.08 | - | 27.18 | 51.87 | - | 48.52 | 61.65 | 58.94 |
| SACCMA [6] | 2024 | - | - | - | - | - | - | - | 62.30 | 59.30 |
| MMPGCN [7] | 2024 | 78.60 | 53.80 | 3.20 | 25.20 | 53.30 | 2.60 | 45.00 | 60.70 | 59.30 |
| MRSLN [8] | 2024 | 77.13 | 50.36 | - | 25.08 | 55.47 | - | 48.31 | 62.11 | 59.41 |
| GNN [14] | 2024 | 76.38 | 49.91 | - | 32.18 | 56.86 | - | 47.60 | 61.72 | 59.74 |
| GCCL [15] | 2024 | 76.93 | 50.70 | - | 31.49 | 57.14 | - | 49.05 | 62.82 | 60.28 |
| PCDS [21] | 2025 | 79.03 | 56.79 | - | 32.66 | 57.07 | - | 48.67 | 64.33 | 62.61 |
| ML-ERC [22] | 2025 | - | - | - | - | - | - | - | - | 63.01 |
| SUNET [16] | 2023 | - | - | - | - | - | - | - | - | 64.03 |
| UniMSE [3] | 2022 | - | - | - | - | - | - | - | 65.09 | 65.51 |
| MKFM [19] | 2023 | - | - | - | - | - | - | - | - | 65.66 |
| ConxGNN [17] | 2025 | - | - | - | - | - | - | - | 66.28 | 65.69 |
| EASUM [5] | 2024 | - | - | - | - | - | - | - | 66.70 | 65.93 |
| DER-GCN [9] | 2025 | 80.60 | 51.00 | 10.40 | 41.50 | 64.30 | 10.30 | 57.40 | 66.80 | 66.10 |
| PeTracker [23] | 2025 | - | - | - | - | - | - | - | - | 66.49 |
| CoTCL [26] | 2025 | - | - | - | - | - | - | - | 68.00 | 66.53 |
| CBERL [10] | 2024 | 82.03 | 57.91 | 22.23 | 41.36 | 65.67 | 24.65 | 55.31 | 67.78 | 66.89 |
| TelME [2] | 2024 | 80.22 | 60.33 | 26.97 | 43.45 | 65.67 | 26.42 | 56.70 | - | 67.37 |
| Causal-DAG [18] | 2024 | 82.60 | 67.40 | 10.80 | 38.60 | 65.00 | 16.20 | 50.90 | - | 67.50 |
| DDVLM (Ours) | 2025 | 80.66 | 60.97 | 31.25 | 45.83 | 64.67 | 30.91 | 56.00 | 68.28 | 67.80 |
| Model/Modality | MELD (WA-F1) |
|---|---|
| DDVLM | 67.80 |
| w/o Self-Distillation | 66.88 (0.92↓) |
| w/o Cross-Modal KD | 67.42 (0.38↓) |
| w/o Residual Fusion | 67.24 (0.56↓) |
| Only Text (with Self-Distillation) | 67.13 |
| Only Text (w/o Self-Distillation) | 66.46 (0.67↓) |
| Only Visual | 33.85 |
| Only Visual (w/o Cross-Modal KD) | 31.66 (2.19↓) |