A Multimodal Affective Interaction Architecture Integrating BERT-Based Semantic Understanding and VITS-Based Emotional Speech Synthesis
Abstract
1. Introduction
2. Related Work
2.1. Emotion Recognition in Natural Language Processing
2.2. Multimodal Emotion Analysis
2.3. Emotional Speech Synthesis
3. Materials and Methods
3.1. Architecture Design
1. Service Support Layer
2. Intelligent Interaction Layer
3. Speech Processing Layer
4. External Service Layer
5. Storage Layer
6. Application Presentation Layer
3.2. Lightweight Emotion Recognition Model Based on BERT
3.2.1. Model Lightweight Strategy
3.2.2. Model Fine-Tuning Strategy Based on LoRA
3.3. Six-Dimensional Emotion Space Mapping Method
3.4. Emotion-Controllable Speech Synthesis Based on VITS
3.4.1. VITS Base Model
1. Conditional Generation Process
2. Variational Inference Objective (ELBO) (see the bound sketched after this list)
3. Key Technical Innovation
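For reference, the variational objective named in item 2 is the standard conditional-VAE evidence lower bound used by VITS (Kim et al., 2021); writing x for the target audio, c for the text condition, and z for the latent variable, the bound is:

```latex
\log p_\theta(x \mid c) \;\geq\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[
  \log p_\theta(x \mid z)
  - \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid c)}
\right]
```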
3.4.2. VITS Model Fine-Tuning Strategy
3.5. Dynamic Model Collaboration Mechanism
3.5.1. Dynamic Model Scheduling Algorithm
3.5.2. Load Balancing Strategy Design
3.6. Exception Handling Mechanism
3.6.1. Multi-Level Fallback Architecture
3.6.2. Exception Classification and Response Strategies
3.6.3. Service Degradation Strategy
4. Experiments
4.1. Experiment Details
4.2. Datasets
4.3. Results and Analysis
4.3.1. Evaluation of Speech Synthesis Quality
4.3.2. Emotion Recognition Results
1. Confusion Matrix Analysis
2. Radar Chart Analysis
3. Specific Examples of Emotion Recognition Instances
4.3.3. System Response Time Optimization Results
4.3.4. Dynamic Model Scheduling Optimization
4.3.5. Performance Comparison Analysis of the Model Across Different Languages
4.3.6. Advantages Across Diverse Application Scenarios
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
| Exception Type | Triggering Condition | Response Strategy |
|---|---|---|
| Speech Silence | No valid input detected for 5 s | Reset the ASR engine |
| Semantic Conflict | BERT-PERT confidence score below 0.6 | Context rollback using a sliding window (size = 5) |
| Emotional Conflict | Emotion intensity variance exceeds 0.4 | Emotion parameter interpolation (window size = 3) |
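As a concrete reading of the table above, here is a minimal dispatch sketch; the class, constants, and action strings are hypothetical illustrations, not the authors' implementation:

```python
from collections import deque
from statistics import pvariance

SILENCE_TIMEOUT_S = 5.0   # "no valid input detected for 5 s"
CONFIDENCE_FLOOR = 0.6    # BERT-PERT confidence threshold
VARIANCE_CEILING = 0.4    # emotion-intensity variance threshold

class ExceptionHandler:
    def __init__(self):
        self.context = deque(maxlen=5)      # sliding window replayed on context rollback
        self.intensities = deque(maxlen=3)  # window smoothed on emotion interpolation

    def classify(self, silence_s: float, confidence: float, intensity: float) -> str | None:
        self.intensities.append(intensity)
        if silence_s >= SILENCE_TIMEOUT_S:
            return "reset_asr_engine"
        if confidence < CONFIDENCE_FLOOR:
            return "context_rollback"       # replay the last 5 dialogue turns
        if (len(self.intensities) == self.intensities.maxlen
                and pvariance(self.intensities) > VARIANCE_CEILING):
            return "interpolate_emotion"    # interpolate over the last 3 emotion frames
        return None                         # no exception detected

handler = ExceptionHandler()
print(handler.classify(silence_s=6.2, confidence=0.9, intensity=0.5))  # reset_asr_engine
```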
| Model | Naturalness | Emotional Expressiveness | Intelligibility | Overall Score |
|---|---|---|---|---|
| Tacotron2 | 3.88 ± 0.42 | 3.25 ± 0.45 | 4.19 ± 0.38 | 3.77 ± 0.42 |
| FastSpeech2 | 4.15 ± 0.39 | 3.66 ± 0.42 | 4.35 ± 0.35 | 4.05 ± 0.38 |
| Ours | 4.21 ± 0.36 | 4.43 ± 0.34 | 4.42 ± 0.33 | 4.35 ± 0.34 |
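The entries above follow the mean ± standard deviation convention of Mean Opinion Score (MOS) listening tests; a minimal aggregation sketch, with made-up listener ratings rather than the study's data:

```python
import statistics

# Listeners rate each stimulus on a 1-5 scale; the table reports the
# mean and standard deviation over all collected ratings per dimension.
ratings = {"naturalness": [4, 5, 4, 4, 5, 4, 3, 5, 4, 4]}  # placeholder scores

for dimension, scores in ratings.items():
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    print(f"{dimension}: {mean:.2f} ± {sd:.2f}")
```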
| Category | Precision | Recall | F1-Score |
|---|---|---|---|
| Happiness | 0.946 | 0.937 | 0.941 |
| Sadness | 0.849 | 0.925 | 0.886 |
| Anger | 0.971 | 0.890 | 0.929 |
| Curiosity | 0.816 | 0.845 | 0.830 |
| Playfulness | 0.831 | 0.848 | 0.839 |
| Calmness | 0.905 | 0.858 | 0.881 |
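Per-class precision, recall, and F1 like those above are standard classification metrics and can be reproduced with scikit-learn; the toy labels below are illustrative only, assuming the six-class label set from the table:

```python
from sklearn.metrics import classification_report

LABELS = ["happiness", "sadness", "anger", "curiosity", "playfulness", "calmness"]

# Toy ground-truth and predictions, not the study's evaluation data.
y_true = ["happiness", "sadness", "anger", "curiosity", "playfulness", "calmness",
          "happiness", "sadness"]
y_pred = ["happiness", "sadness", "anger", "curiosity", "calmness", "calmness",
          "happiness", "happiness"]

# Prints per-class precision/recall/F1 in the same shape as the table.
print(classification_report(y_true, y_pred, labels=LABELS, digits=3, zero_division=0))
```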
| Sample | Input Text | Ground Truth | Predicted Vector | Predicted Label |
|---|---|---|---|---|
| 1 | First time buying books on Dangdang, delivery was fast, ordered the previous night and arrived by noon the next day, the delivery staff was very polite, really satisfied. | Happiness | [0.52, 0.05, 0.03, 0.13, 0.22, 0.05] | Happiness |
| 2 | The hotel environment is great, the business center is very enthusiastic, the free airport shuttle service is excellent, felt comfortable and happy staying there. | Happiness | [0.50, 0.05, 0.03, 0.15, 0.20, 0.07] | Happiness |
| 3 | The last few pages of the book were double-printed, the text was blurred and unreadable, the quality was so poor it was disheartening. | Sadness | [0.05, 0.70, 0.05, 0.05, 0.05, 0.10] | Sadness |
| 4 | The hotel was too old, the room had a musty smell, barely stayed one night, checked out the next morning, very disappointed. | Sadness | [0.05, 0.68, 0.06, 0.05, 0.05, 0.11] | Sadness |
| 5 | The lousy hotel had terribly slow internet, the breakfast was awful, and the staff was eating while cooking eggs, it made me furious. | Anger | [0.05, 0.10, 0.67, 0.05, 0.03, 0.10] | Anger |
| 6 | No hotel could be worse than this, the staff asked me to change shoes for breakfast, who’s the boss here, the manager should resign. | Anger | [0.05, 0.10, 0.66, 0.06, 0.03, 0.10] | Anger |
| 7 | It seems like a label was torn off the back of the machine, with residue still there, strange, what’s going on? | Curiosity | [0.10, 0.05, 0.05, 0.65, 0.10, 0.05] | Curiosity |
| 8 | Why does this book have two versions, can Dangdang release the sixth volume separately, really want to know what’s going on? | Curiosity | [0.10, 0.05, 0.05, 0.67, 0.08, 0.05] | Curiosity |
| 9 | The language is light and humorous, reading it lifts my mood, the content is practical, haha, feels like chatting. | Playfulness | [0.20, 0.05, 0.03, 0.15, 0.52, 0.05] | Playfulness |
| 10 | The room was fairly clean and tidy, breakfast had limited variety but tasted okay, check-out was fast, overall okay. | Calmness | [0.15, 0.10, 0.05, 0.10, 0.05, 0.55] | Calmness |
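The predicted vectors above appear to be distributions over the six emotion dimensions in the order [happiness, sadness, anger, curiosity, playfulness, calmness]; this ordering is an inference from the examples (every vector's maximum component aligns with its label), not stated by the source. A minimal decoding sketch under that assumption:

```python
# Assumed component ordering of the six-dimensional emotion vector.
EMOTIONS = ["Happiness", "Sadness", "Anger", "Curiosity", "Playfulness", "Calmness"]

def predict_label(vector: list[float]) -> str:
    # Expect a probability distribution over the six emotion dimensions.
    assert len(vector) == len(EMOTIONS) and abs(sum(vector) - 1.0) < 1e-6
    # The predicted label is the arg-max component.
    return EMOTIONS[max(range(len(vector)), key=vector.__getitem__)]

print(predict_label([0.52, 0.05, 0.03, 0.13, 0.22, 0.05]))  # -> Happiness (sample 1)
```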
| Task Type | Language | Inference Time (ms) | GPU Memory (MB) | FLOPs (×10⁹) |
|---|---|---|---|---|
| Emotion Recognition | Chinese | 35 | 530 | 1.76 |
| Emotion Recognition | English | 33 | 500 | 1.72 |
| Speech Synthesis | Chinese | 250 | 1380 | 11.8 |
| Speech Synthesis | English | 230 | 1305 | 10.9 |
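Latency and peak-memory figures of this kind are commonly measured as sketched below; this is a generic PyTorch profiling sketch, not the authors' benchmarking code, and `model`/`inputs` are placeholders for the deployed emotion-recognition or synthesis model:

```python
import time
import torch

def profile(model, inputs: dict, warmup: int = 5, runs: int = 50):
    """Return (mean latency in ms, peak GPU memory in MB); requires a CUDA device."""
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        for _ in range(warmup):        # warm-up iterations excluded from timing
            model(**inputs)
        torch.cuda.synchronize()       # flush pending kernels before timing
        start = time.perf_counter()
        for _ in range(runs):
            model(**inputs)
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / runs * 1000
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return latency_ms, peak_mb
```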