Article

A Multimodal Affective Interaction Architecture Integrating BERT-Based Semantic Understanding and VITS-Based Emotional Speech Synthesis

1 Academic Affairs Office, Hebei North University, Zhangjiakou 075000, China
2 School of Information Science and Engineering, Hebei North University, Zhangjiakou 075000, China
3 Faculty of Applied Sciences, Macao Polytechnic University, Macau SAR, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Algorithms 2025, 18(8), 513; https://doi.org/10.3390/a18080513
Submission received: 10 July 2025 / Revised: 3 August 2025 / Accepted: 12 August 2025 / Published: 14 August 2025
(This article belongs to the Section Algorithms for Multidisciplinary Applications)

Abstract

Addressing the issues of coarse emotional representation, low cross-modal alignment efficiency, and insufficient real-time response capabilities in current human–computer emotional language interaction, this paper proposes an affective interaction framework integrating BERT-based semantic understanding with VITS-based speech synthesis. The framework aims to enhance the naturalness, expressiveness, and response efficiency of human–computer emotional interaction. By introducing a modular layered design, a six-dimensional emotional space, a gated attention mechanism, and a dynamic model scheduling strategy, the system overcomes challenges such as limited emotional representation, modality misalignment, and high-latency responses. Experimental results demonstrate that the framework achieves superior performance in speech synthesis quality (MOS: 4.35), emotion recognition accuracy (91.6%), and response latency (<1.2 s), outperforming baseline models like Tacotron2 and FastSpeech2. Through model lightweighting, GPU parallel inference, and load balancing optimization, the system validates its robustness and generalizability across English and Chinese corpora in cross-linguistic tests. The modular architecture and dynamic scheduling ensure scalability and efficiency, enabling a more humanized and immersive interaction experience in typical application scenarios such as psychological companionship, intelligent education, and high-concurrency customer service. This study provides an effective technical pathway for developing the next generation of personalized and immersive affective intelligent interaction systems.

1. Introduction

As a pivotal subfield of artificial intelligence, affective computing was first introduced by Professor Rosalind Picard of the MIT Media Lab [1,2], and has since evolved into a key technological direction for enhancing the intelligence and human-likeness of human–computer interaction. Its fundamental objective is to enable computational systems to perceive, interpret, and express human emotions by intelligently analyzing multimodal data such as speech, text, and facial expressions—thus endowing machines with basic “emotional intelligence”.
As human–computer interaction (HCI) scenarios evolve from traditional task-oriented command execution to more complex needs such as emotional companionship, psychological counseling, and virtual character engagement, affective computing is transitioning from theoretical exploration to practical deployment. Against this backdrop, multimodal emotion modeling has emerged as a prominent research focus. Key tasks include emotion recognition from speech and text, cross-modal emotional alignment, and the generation of emotionally driven personalized responses [3,4,5]. In recent years, the advent of large-scale language models such as ChatGPT and DeepSeek [6,7] has demonstrated strong capabilities in contextual understanding and emotional reasoning, providing crucial technical support for building more natural and emotionally expressive interactive systems [8].
Despite recent progress, current affective interaction systems still face several critical challenges [9,10,11]. First, the dimensionality of emotional representation remains limited—most models can only recognize basic categories such as joy, anger, sadness, and fear, while struggling to capture more nuanced states like confusion or sarcasm. Second, temporal asynchrony between modalities such as speech and text leads to inefficient feature alignment, undermining the coherence of interaction responses. Third, traditional text-to-speech (TTS) models exhibit shortcomings in rhythm control and emotional prosody, resulting in synthesized speech that lacks naturalness and expressiveness, with average Mean Opinion Scores (MOS) rarely exceeding 3.5—significantly constraining the practical performance of such systems in real-world interactions.
To address the challenges outlined above, this paper proposes a multimodal emotional interaction framework that integrates BERT-based semantic understanding [12] with VITS-based emotional speech synthesis [13]. In this architecture, BERT serves as the core for emotion comprehension, with a lightweight recognition module extracting high-quality emotional semantic vectors. These vectors are then used to drive a VITS-based end-to-end speech synthesis system. The framework incorporates a six-dimensional emotional space modeling strategy and a gated attention mechanism to achieve dynamic fusion of semantic and acoustic modalities. In addition, a dynamic model scheduling mechanism and multi-layer heterogeneous optimization strategy are introduced, allowing the system to adaptively select language models and generation paths based on the intensity of user emotions and current system load—thus achieving a balance between expressive emotional output, low response latency, and efficient resource utilization.
The main contributions of this study are summarized as follows:
(1) We propose a modular and layered affective intelligence architecture that integrates BERT for semantic and emotional understanding and VITS for emotional speech synthesis. This joint modeling of textual and acoustic emotional features significantly enhances the accuracy, naturalness, and immersive quality of multimodal interactions.
(2) We introduce a six-dimensional continuous emotion modeling approach that overcomes the limitations of traditional discrete emotion classification. Combined with a gated attention mechanism, this enables dynamic fusion of semantic and acoustic features, thereby improving the system’s capacity to express complex and blended emotional states.
(3) We develop a multi-model collaboration mechanism that dynamically switches between large language models based on emotion intensity, response latency, and system load. This strategy not only ensures fine-grained emotional expression but also improves system stability and resource efficiency under high-concurrency conditions—effectively addressing the scalability and latency limitations of conventional TTS frameworks.

2. Related Work

Emotion recognition has garnered significant attention across natural language processing, multimodal learning, and speech synthesis. Particularly in the context of social media and complex communication scenarios, the diversity of emotional expression has increased substantially, exposing the limitations of traditional models in uncovering latent affective signals. To address these challenges and enhance model robustness, discriminative power, and interpretability, researchers have proposed a range of innovative approaches—spanning language model optimization, multimodal fusion, external knowledge integration, and task-specific architectural design.

2.1. Emotion Recognition in Natural Language Processing

The advancement of large language models has significantly propelled research in emotion recognition. The GPT series developed by OpenAI demonstrates strong capabilities in semantic abstraction and contextual modeling, offering new avenues for emotional reasoning and affective text generation [14]. BERT (Bidirectional Encoder Representations from Transformers), a widely used pre-trained language representation model based on the Transformer architecture, utilizes bidirectional encoding to capture deep semantic features in text [12]. It has been extensively applied to tasks such as emotion classification, emotion intensity estimation, and dialogue state recognition [15].
To improve the effectiveness of BERT in emotion modeling, Zhu and Mao (2023) [16] incorporated affective lexical knowledge to fine-tune word embeddings, thereby enhancing intra-class similarity and inter-class separability. These enriched embeddings were then fused with BERT’s contextual representations, resulting in significantly improved emotion recognition performance. Wan et al. (2024) [17] integrated the Ortony–Clore–Collins (OCC) emotion model with deep neural networks and proposed ECR-BERT, which employs emotion-cognitive reasoning to generate auxiliary knowledge that is adaptively integrated into the model—enhancing both interpretability and accuracy. Addressing the challenge of “false emotional expressions” prevalent on social media, Hou et al. (2024) [18] developed a two-stage BERT+CRF model that jointly models user comments and their background context through dual-channel encoding, achieving improved sequence labeling accuracy through feature fusion.
These studies collectively illustrate the ongoing evolution of BERT in the field of emotion recognition and validate the effectiveness of incorporating knowledge-based guidance, cognitive reasoning frameworks, and structural enhancements to improve emotional understanding in NLP systems.

2.2. Multimodal Emotion Analysis

Multimodal emotion analysis aims to integrate heterogeneous modalities—such as language, speech, and vision—to achieve more comprehensive and precise emotion understanding. Recent research has progressed from simple modality concatenation to attention-based deep fusion frameworks that enable the more effective alignment of multimodal emotional features [19,20].
Makhmudov et al. (2024) [21] proposed a bimodal speech–text architecture that combines CNNs and BERT. In this framework, Mel spectrogram features are extracted from speech using CNNs, while semantic representations are obtained from text using BERT. The two modalities are then fused through an attention-based weighting mechanism. To address the issue of missing modalities, Liu et al. (2024) [22] introduced the MTMSA model, which translates visual and acoustic modalities into textual representations to construct Missing Joint Features (MJFs), enabling robust modeling via a Transformer encoder. Wang et al. (2025) [23] developed the DLF framework, which enhances performance by applying feature disentanglement and a language-guided cross-attention mechanism to emphasize the dominant role of the linguistic modality. Huang et al. (2024) [24] proposed the TMBL model, which incorporates a modality binding mechanism and similarity loss. Their design augments the Transformer with convolutional components and semantic consistency constraints to improve high-level cross-modal interaction and alignment.

2.3. Emotional Speech Synthesis

Emotional speech synthesis (ESS), as an important extension of text-to-speech (TTS) technology, focuses on generating natural speech that conveys emotion, prosody, and speaker personality. Although modern neural TTS systems have achieved substantial improvements in speech naturalness, challenges remain in controlling emotional diversity, achieving fine-grained modeling, and ensuring real-time synthesis.
Tang et al. (2024) [25] proposed the ED-TTS model, which integrates emotion recognition (SER) and emotion diarization (SED) techniques to extract multi-scale emotional features. These features guide the generation process of a diffusion-based model (DDPM), enabling both local and global emotion control. Building on this, Li et al. (2025) [26] introduced a monotonic alignment mechanism and a self-supervised learning strategy within a parallel TTS framework, allowing the system to learn style and emotion representations from reference speech without requiring explicit labels—achieving zero-shot speaker transfer and highly natural emotional speech. Going further, Luo et al. (2025) [27] proposed the OpenOmni framework from a large-model and cross-modal generation perspective. It achieves the unified modeling of text, image, and speech through a two-stage training process to learn omnimodal representations and employs a non-autoregressive decoder for low-latency, high-fidelity emotional speech synthesis.
These works collectively illustrate the evolutionary trajectory of emotional speech synthesis, progressing from multi-scale emotion modeling and style transfer to cross-modal generative integration. This line of research advances the field toward more expressive and real-time emotional speech generation.
In summary, research on emotion modeling is evolving from traditional text-based recognition toward deeper multimodal integration, encompassing semantic understanding, knowledge fusion, modality interaction, and emotional generation. Despite significant progress, current methods still face limitations in emotion modeling precision, multimodal coordination stability, and real-time generation efficiency. This paper proposes a novel multimodal emotional interaction framework that integrates BERT and VITS. By incorporating six-dimensional emotion modeling, emotional speech synthesis, and dynamic scheduling mechanisms, the proposed approach effectively enhances the naturalness and responsiveness of emotional speech, offering a new technical pathway for advancing affective computing.

3. Materials and Methods

3.1. Architecture Design

The proposed system adopts a modular, decoupled, and layered design architecture, which is composed of the following six core functional layers: the service support layer, intelligent interaction layer, speech processing layer, storage layer, external service layer, and application presentation layer. These components communicate through standardized interfaces, ensuring high maintainability, scalability, and cross-platform deployment capability, as shown in Figure 1. The system is designed to achieve efficient coordination among emotion recognition, semantic understanding, and speech synthesis modules, thereby enabling real-time intelligent responses in multimodal emotional interaction scenarios.
1. Service Support Layer
The service support layer is responsible for system management, resource monitoring, and log recording to ensure stable and reliable system operation. The service manager orchestrates core services and allocates computing resources efficiently. The resource monitor tracks real-time usage of CPU, memory, and GPU to maintain optimal performance. The system checker supervises network connectivity and hardware status while providing fault recovery mechanisms. The log manager records operational logs to support failure tracing and debugging. The configuration manager handles the storage and dynamic adjustment of system parameters.
To enhance availability and scalability, a distributed service framework is employed. The framework adopts containerized elastic deployment using Docker Compose to build a microservices cluster, with replica sets enabling automatic failover. The emotion computing and speech recognition services are horizontally scalable to up to 16 instances, supporting a maximum of 320 concurrent speech input streams. A resource isolation strategy guarantees that critical modules reserve at least 30% of computational resources, preventing inference bottlenecks. Furthermore, a dynamic weight allocation algorithm based on exponential smoothing prediction is implemented. By monitoring key indicators such as connection count, memory load, and response latency, the system intelligently optimizes service routing. As a result, the standard deviation of resource utilization is reduced from 23.4% to 9.1%, and the service rejection rate remains below 0.5% even under peak load conditions.
2. Intelligent Interaction Layer
The intelligent interaction layer serves as the semantic core of the system, handling key tasks such as natural language understanding, emotion recognition, dialogue generation, and policy scheduling. For the speech modality, 3 × 3 depthwise separable convolutions are used to extract spectrogram features, enhancing time-frequency representation while reducing computational complexity. In the textual modality, BERT provides context-aware emotional embeddings. To capture temporal dynamics, a bidirectional LSTM is applied to model emotional state sequences, improving the coherence and accuracy of emotion recognition. The dialogue generation module integrates a large-scale language model, which produces multi-turn responses that are semantically natural and emotionally aligned by leveraging both user inputs and emotional context.
Additionally, a reinforcement learning–based policy selection mechanism is introduced. This mechanism jointly optimizes emotional intensity, response latency, and computational cost, enabling the dynamic selection of optimal language models and speech synthesis paths. This design ensures a balance between emotional expression fidelity and real-time interaction performance.
3. Speech Processing Layer
The speech processing layer is primarily responsible for automatic speech recognition (ASR) and text-to-speech synthesis (TTS). The ASR module performs denoising, framing, and feature extraction on user speech. It employs a non-autoregressive recognition engine based on the Paraformer architecture, combined with a latency-controlled decoding algorithm to efficiently process real-time audio streams. The front-end preprocessing pipeline includes stereo channel separation, noise suppression, and adaptive resampling, enhancing robustness in complex acoustic environments.
The speech synthesis module utilizes an end-to-end VITS-based generative model, integrated with an emotion control mechanism that deeply fuses semantic-emotional features extracted by BERT. A dynamic emotion parameter injection mechanism is designed, leveraging a gated attention network to dynamically fuse emotional text embeddings with acoustic features, enabling semantic-driven emotional expression control. This approach significantly improves the naturalness of prosody, rhythm, and intonation in synthesized speech, while enhancing the accuracy and diversity of emotional expression.
4. External Service Layer
The external service layer manages the intelligent scheduling of large language models and their seamless integration with external services, ensuring efficient responses to user needs across diverse scenarios. By comprehensively evaluating emotional intensity, response latency, and computational resource usage, the system dynamically selects the optimal model for inference. In high emotional intensity scenarios, it prioritizes high-performance heavyweight models to guarantee logical coherence, semantic accuracy, and emotional alignment in responses. For routine conversations or low resource demand scenarios, the system switches to lightweight models, effectively reducing GPU usage and energy consumption while optimizing resource efficiency.
5. Storage Layer
The storage layer is responsible for data management and persistence services throughout system operation, encompassing model parameters, configuration files, runtime logs, and temporary caches. Model files include weights for ASR, TTS, emotion recognition, and large language models, supporting hot updates and version control to facilitate system upgrades and maintenance. The logging system records critical operational statuses and exceptions, providing essential data for system monitoring and performance optimization. The configuration management module enables online adjustment of runtime parameters, allowing the system to adapt dynamically to varying deployment environments. An intermediate caching mechanism temporarily stores speech recognition results, emotion vectors, and semantic responses, significantly improving overall system response efficiency.
6. Application Presentation Layer
The application presentation layer functions as the user interface, intuitively showcasing the system’s core capabilities while supporting multi-platform access and multimodal interaction. This layer consists of a web client and a 3D visualization client built on Unreal Engine 5.1 (UE5.1), designed, respectively, to meet the needs of lightweight access and immersive experiences. The web client provides basic functionalities such as text input, voice interaction, and authentication, making it suitable for low-cost access scenarios on mobile devices and web browsers. The UE5.1 client targets high-fidelity interaction demands, leveraging Unreal Engine’s real-time rendering and physics simulation capabilities to create immersive 3D emotional environments. For communication, the system utilizes a WebSocket-based low-latency full-duplex channel to ensure real-time transmission of voice and text data.
The architecture employs a modular and decoupled design, enabling rapid integration of novel AI models, speech-to-text conversion, and emotion computing modules via APIs, with interfaces that can be updated flexibly as needed. This design significantly reduces resource consumption compared to traditional approaches, while enhancing system scalability and operational efficiency, ensuring high-quality emotional interaction.

3.2. Lightweight Emotion Recognition Model Based on BERT

3.2.1. Model Lightweight Strategy

To achieve efficient and accurate emotion recognition in multimodal emotional interaction systems, this paper develops a lightweight emotion recognition model based on BERT. While retaining the model’s deep semantic modeling capabilities, structural compression and precision control strategies are employed to significantly enhance inference efficiency and resource utilization, enabling the synergistic optimization of semantic modeling and affective computation.
At the input stage, a time-frequency fusion mechanism is introduced to perform cross-modal alignment between the textual word embeddings $E_{\text{text}} \in \mathbb{R}^{L \times 768}$ and the speech spectrogram features $F_{\text{spec}} \in \mathbb{R}^{T \times 256}$. A cross-modal attention module is constructed using scaled dot-product attention, which is formulated as follows:
$\mathrm{CrossAttn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$  (1)
$E_{\text{joint}} = \mathrm{LayerNorm}\left(E_{\text{text}} + \mathrm{CrossAttn}(E_{\text{text}}, F_{\text{spec}})\right)$  (2)
In the encoding layer, the model adopts a dual-stream Transformer architecture. The text stream retains the standard multi-head self-attention mechanism to preserve contextual modeling capabilities. The voice channel employs one-dimensional convolution to extract local dynamic features in the temporal sequence. The representations from both streams are integrated via a gated fusion mechanism.
$G = \sigma\left(W_g\left[E_{\text{text}}; E_{\text{audio}}\right]\right)$  (3)
$E_{\text{fusion}} = G \odot E_{\text{text}} + (1 - G) \odot E_{\text{audio}}$  (4)
Here, $G$ denotes the gating weights that control the fusion ratio between textual and acoustic features, and $\odot$ represents element-wise multiplication.
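For illustration, the following is a minimal PyTorch sketch of Equations (1)–(4): cross-modal scaled dot-product attention between text embeddings and spectrogram features, followed by gated fusion. The projection of 256-dimensional spectrogram features into the 768-dimensional text space and the use of the attention output as the aligned audio stream are assumptions for this sketch, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Sketch of Eqs. (1)-(4): cross-modal attention + gated fusion (dims assumed)."""
    def __init__(self, d_text=768, d_spec=256):
        super().__init__()
        self.proj_spec = nn.Linear(d_spec, d_text)  # project spectrogram features to text space
        self.norm = nn.LayerNorm(d_text)
        self.gate = nn.Linear(2 * d_text, d_text)   # W_g applied to [E_text; E_audio]

    def forward(self, e_text, f_spec):
        # e_text: (B, L, 768), f_spec: (B, T, 256)
        e_audio = self.proj_spec(f_spec)                                        # (B, T, 768)
        attn = torch.softmax(
            e_text @ e_audio.transpose(1, 2) / e_text.size(-1) ** 0.5, dim=-1)  # (B, L, T)
        cross = attn @ e_audio                                                  # CrossAttn(E_text, F_spec)
        e_joint = self.norm(e_text + cross)                                     # Eq. (2)
        e_audio_aligned = cross            # attention-aligned audio features (assumed alignment)
        g = torch.sigmoid(self.gate(torch.cat([e_joint, e_audio_aligned], dim=-1)))  # Eq. (3)
        return g * e_joint + (1 - g) * e_audio_aligned                          # Eq. (4)

fusion = GatedCrossModalFusion()
out = fusion(torch.randn(2, 32, 768), torch.randn(2, 100, 256))
print(out.shape)  # torch.Size([2, 32, 768])
```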
To enhance the inference efficiency and deployment flexibility of the model, this study employs Knowledge Distillation and Structured Pruning techniques to compress the original BERT model. Using BERT as the teacher model, a lightweight emotion recognition model is constructed through knowledge distillation. Subsequently, the student model undergoes channel pruning and 8-bit quantization to reduce computational complexity.
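The distillation step can be summarized by a standard soft-target objective. The sketch below blends a temperature-scaled KL term (teacher to student) with the hard-label cross-entropy over the six emotion classes; the temperature and weighting coefficient are illustrative assumptions rather than reported settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.7):
    """Soft-target KL (teacher -> student) blended with hard-label cross-entropy.
    T and alpha are illustrative hyperparameters, not values reported in the paper."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example with random logits over the six emotion classes
s, t = torch.randn(8, 6), torch.randn(8, 6)
y = torch.randint(0, 6, (8,))
print(distillation_loss(s, t, y).item())
```

Channel pruning and 8-bit quantization (for example, dynamic quantization of linear layers as offered by PyTorch) would then be applied to the distilled student, in line with the compression pipeline described above.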

3.2.2. Model Fine-Tuning Strategy Based on LoRA

To further reduce the parameter overhead of the BERT model while maintaining high emotional recognition accuracy, this study introduces a parameter-efficient fine-tuning strategy based on Low-Rank Adaptation (LoRA) [28], fine-tuning the model on optimized emotional corpora to adapt to target tasks. LoRA achieves rapid adaptation to specific emotional tasks by injecting trainable low-rank matrices into the Transformer’s attention mechanism while keeping most pre-trained parameters frozen. This approach significantly reduces the number of parameters requiring fine-tuning and minimizes GPU memory usage. Despite the reduced parameter count, LoRA achieves performance comparable to traditional full-parameter fine-tuning methods, demonstrating high efficiency and adaptability. The basic principle is illustrated in Equation (5).
$(W + \Delta W)x = Wx + W_A W_B\, x$  (5)
Specifically, for a pre-trained weight matrix $W \in \mathbb{R}^{d \times k}$, the weight update is decomposed into two smaller matrices via low-rank decomposition, i.e., $\Delta W = W_A W_B$, where $W_A$ is a $d \times r$ matrix and $W_B$ is an $r \times k$ matrix. During training, the original weight matrix $W$ remains frozen, and only the newly introduced matrices $W_A$ and $W_B$ are updated. In this model, LoRA is integrated into the Query and Value projection layers of BERT's self-attention mechanism, enabling the model to learn feature distribution shifts in the emotional domain without altering the backbone structure.
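As a concrete sketch of Equation (5), the snippet below wraps the query and value projections of a HuggingFace BERT model with a frozen base layer plus a trainable low-rank update $W_A W_B$. The rank, scaling factor, and initialization are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update W_A @ W_B (Eq. (5))."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # keep pre-trained W frozen
        self.W_A = nn.Parameter(torch.randn(base.in_features, r) * 0.02)  # d x r
        self.W_B = nn.Parameter(torch.zeros(r, base.out_features))        # r x k, zero-init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.W_A @ self.W_B) * self.scale

# Illustrative injection into the Query/Value projections of bert-base-chinese
bert = BertModel.from_pretrained("bert-base-chinese")
for layer in bert.encoder.layer:
    layer.attention.self.query = LoRALinear(layer.attention.self.query)
    layer.attention.self.value = LoRALinear(layer.attention.self.value)
```

With the base weights frozen, only the low-rank factors are optimized during fine-tuning on the emotional corpora, which is what keeps the trainable parameter count and GPU memory footprint small.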

3.3. Six-Dimensional Emotion Space Mapping Method

To overcome the limitations of traditional emotion recognition methods that rely on discrete and coarse-grained category representations, this paper proposes a continuous six-dimensional emotional space mapping approach. Instead of assigning fixed emotion labels, this method represents emotional states as learnable vector distributions, enabling more nuanced expression and higher-dimensional semantic perception of complex emotions.
Inspired by dimensional theories in psychology, the proposed model expands emotion modeling to the following six fundamental dimensions: Happiness, Sadness, Anger, Curiosity, Playfulness, and Calmness. These are represented as a set of six emotion basis vectors $E = \{e_1, e_2, \ldots, e_6\}$, where each $e_i \in \mathbb{R}^d$ denotes the embedding of a basic emotional dimension. The model learns an emotion probability distribution $p \in \mathbb{R}^6$, which is used to compute a weighted combination over the emotional basis space, generating the final continuous emotion vector. This allows for richer, more flexible emotion representations that capture subtle affective variations beyond conventional categorical labeling.
$E_{\text{emo}} = \sum_{i=1}^{6} p_i\, e_i$  (6)
The emotion probability distribution p is jointly computed by a cross-modal fusion network that integrates both textual semantics and speech spectrogram features. This approach preserves the perceptual advantages of multimodal inputs and allows the system to continuously model complex and blended emotional states.
To achieve high-quality multimodal fusion, a confidence-aware gating mechanism is introduced during the fusion stage. This mechanism dynamically allocates modality-specific weights based on the contextual effectiveness of speech and text features. Specifically, the model computes the energy entropy of the speech modality, $H_{\text{audio}}$, and the semantic confidence of the text modality, $C_{\text{text}}$, and uses these values to adjust the gated fusion weights, namely the speech weight $\alpha$ and the text weight $\beta$, such that they satisfy the following constraint:
$\alpha + \beta = 1, \quad \alpha = f(H_{\text{audio}}), \quad \beta = f(C_{\text{text}})$  (7)
This dynamic weighting scheme enables the model to prioritize more reliable modalities under varying contexts, thereby improving the robustness and expressiveness of emotional representation across modalities.
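A minimal sketch of Equations (6)–(7) is given below: a classifier head produces the distribution $p$ over six learnable basis vectors, and the modality weights $\alpha$, $\beta$ are derived from speech energy entropy and text confidence. The specific instantiation of $f$ (a softmax over $[-H_{\text{audio}}, C_{\text{text}}]$) and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SixDimEmotionHead(nn.Module):
    """E_emo = sum_i p_i * e_i over six learnable emotion basis vectors (Eq. (6))."""
    def __init__(self, d_in=768, d_emo=128):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(6, d_emo) * 0.02)  # e_1 .. e_6
        self.classifier = nn.Linear(d_in, 6)

    def forward(self, fused):                            # fused: (B, d_in)
        p = F.softmax(self.classifier(fused), dim=-1)    # emotion distribution p in R^6
        return p, p @ self.basis                         # continuous emotion vector E_emo

def modality_weights(h_audio: float, c_text: float):
    """Confidence-aware gating (Eq. (7)); noisier audio (higher entropy) shifts weight
    toward the text modality. The softmax mapping is an assumed choice of f."""
    w = torch.softmax(torch.tensor([-h_audio, c_text]), dim=0)
    return w[0].item(), w[1].item()                      # alpha + beta == 1

head = SixDimEmotionHead()
p, e_emo = head(torch.randn(4, 768))
print(p.shape, e_emo.shape, modality_weights(0.8, 0.9))
```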

3.4. Emotion-Controllable Speech Synthesis Based on VITS

3.4.1. VITS Base Model

VITS is an end-to-end speech synthesis framework that integrates a Conditional Variational Autoencoder (CVAE) with a Generative Adversarial Network (GAN). Compared to traditional two-stage TTS methods, VITS unifies acoustic modeling and vocoder modules within an end-to-end framework, achieving joint modeling from text to speech.
Building upon the VITS architecture, this study constructs an emotion-aware speech synthesis module by incorporating a semantically guided variational inference mechanism and a gated attention network. Specifically, the six-dimensional emotion vector and emotion intensity parameters extracted by BERT are used as control signals to modulate spectrogram features and prosodic patterns. This enables fine-grained control over emotional expression and allows for flexible, controllable speech generation.
VITS models the conditional speech generation process with a full variational autoencoding framework under text conditions, and its probabilistic modeling includes the following three key components.
1. Conditional Generation Process
$p_\theta(x, z \mid c) = p_\theta(x \mid z, c)\, p_\theta(z \mid c)$  (8)
Here, the condition $c = (c_{\text{text}}, A)$ is composed of the phoneme sequence $c_{\text{text}}$ and the phoneme-to-frame alignment matrix $A$, both generated by the text encoder. The alignment matrix $A$ is computed using the Monotonic Alignment Search (MAS) algorithm based on dynamic programming. This alignment guides the model in synchronizing phonetic and acoustic information during speech synthesis, enabling precise prosody and emotional control.
2. Variational Inference Objective (ELBO)
$\mathcal{L} = \mathbb{E}_{q_\phi(z \mid x, c)}\left[\log p_\theta(x \mid z, c)\right] - \beta\, D_{\mathrm{KL}}\left(q_\phi(z \mid x, c)\,\|\,p_\theta(z \mid c)\right)$  (9)
3. Key Technical Innovation
The normalizing flow-enhanced prior transforms a simple Gaussian prior into a more complex distribution by stacking invertible 1 × 1 convolutional layers.
$p_\theta(z \mid c) = \mathcal{N}\left(f_\theta(z);\, \mu_\theta(c),\, \sigma_\theta^2(c)\right)\left|\det \frac{\partial f_\theta(z)}{\partial z}\right|$  (10)
The phoneme duration distribution is modeled using Neural Spline Flow (NSF) [29], with a stochastic duration predictor introduced to capture variability. The modeling of phoneme duration distribution is formulated as follows:
$\log p_\theta(d \mid c_{\text{text}}) \geq \mathbb{E}_{q_\phi(u, \nu \mid d, c)}\left[\log \frac{p_\theta(d - u, \nu \mid c)}{q_\phi(u, \nu \mid d, c)}\right]$  (11)
Monotonic Alignment Search (MAS) employs a dynamic programming algorithm to optimize phoneme-to-speech alignment, as formulated below.
$\hat{A} = \arg\max_{A}\, \log \mathcal{N}\left(f_\theta(z);\, \mu_\theta(c_{\text{text}}, A),\, \sigma_\theta^2(c_{\text{text}}, A)\right)$  (12)
The improved CVAE architecture adopts a dual-channel feature extraction mechanism as follows: a text encoder based on BERT captures emotion-aware contextual representations, while an acoustic encoder utilizes a dilated convolutional network to extract spectral features. These two types of features are integrated via an emotion intensity-driven gating mechanism, enabling joint modeling of semantic and acoustic information.
For prosody control, the system introduces a semantic-guided stochastic duration prediction algorithm. This algorithm, based on an enhanced neural spline flow model, utilizes extracted emotional vectors as conditional variables for the neural spline flow. Combined with real phoneme durations as training targets, it optimizes the maximum likelihood function.

3.4.2. VITS Model Fine-Tuning Strategy

To enhance the system’s customization capabilities and fine-tuning efficiency, this paper introduces the VITS Fast Fine-tuning (VITS-FFT) [30] technique from an open-source project, built upon the VITS framework. VITS-FFT aims to achieve high-quality voice style transfer and personalized emotional fine-tuning with minimal resource overhead. By caching phoneme alignment information, it eliminates the need for retraining the aligner, enabling rapid and low-resource adaptation for emotional voice transfer. VITS-FFT preserves the backbone structure of the original model, fine-tuning only emotion-related modules, and supports precise control over the emotional expression of synthesized speech by extracting style vectors from reference audio. This strategy significantly reduces training costs while improving the model’s generalization and deployment flexibility across diverse scenarios and emotional conditions, offering an efficient and practical solution for personalized speech synthesis and virtual character customization.
To further optimize inference efficiency in real-time interactive scenarios, the system employs a CUDA-accelerated Mel spectrogram generation strategy, leveraging parallel kernel computation to extract spectrogram features at millisecond-level speeds. Combined with GPU memory reuse and stream parallelism, the system effectively minimizes end-to-end inference latency, ensuring robust responsiveness in high-concurrency speech synthesis tasks.
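The GPU-side Mel spectrogram extraction described above can be realized, for instance, with torchaudio's GPU-resident transform; the sample rate, FFT size, hop length, and Mel-bin count below are illustrative values rather than the system's actual settings.

```python
import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

# GPU-resident Mel spectrogram transform; parameter values are illustrative
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
).to(device)

waveform = torch.randn(1, 22050, device=device)   # 1 s of dummy audio kept on the GPU
with torch.no_grad():
    spec = mel(waveform)                          # (1, 80, frames), computed by parallel CUDA kernels
print(spec.shape)
```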

3.5. Dynamic Model Collaboration Mechanism

To enhance response efficiency and resource utilization in multimodal emotional interaction scenarios, this study proposes a multi-model collaboration framework comprising a dynamic model scheduling algorithm and a hierarchical load balancing strategy. The mechanism adaptively selects the most appropriate language model by jointly considering emotional intensity, system load, and response latency, thereby enabling intelligent switching and efficient coordination across models.

3.5.1. Dynamic Model Scheduling Algorithm

This algorithm serves as the core component of the framework. It autonomously selects suitable large language models based on the complexity of the interaction scenario and the current status of system resources. Built upon a multidimensional perception framework, the algorithm jointly evaluates key indicators such as user emotional intensity, response latency, and computational resource usage to support dynamic scheduling. Specifically, when the system detects that the user is in a highly activated emotional state, such as anger or joy, it prioritizes the invocation of high-capacity models with strong emotional modeling capabilities to ensure semantic coherence and affective consistency. In contrast, during emotionally neutral interactions, the system switches to lightweight models in order to reduce computational overhead. A sliding window mechanism is employed to track the history of dialogue interactions. Additionally, a reinforcement learning strategy with reward-based feedback is incorporated to continuously refine the model selection policy, thereby enabling policy migration across multi-turn dialogues.
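The following is a minimal sketch of this scheduling decision: a sliding window tracks recent emotional intensity, and the heavyweight model is chosen only when emotion is strongly activated and resources permit. The thresholds, window size, and scoring rule are assumptions; the model names follow the ChatGPT-4 / ChatGLM-6B pairing reported in Section 4.3.4.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class SystemStatus:
    gpu_util: float        # 0..1
    avg_latency_ms: float  # recent end-to-end latency

class DynamicModelScheduler:
    """Sketch of the scheduling rule: heavyweight LLM for highly activated emotions,
    lightweight LLM for routine dialogue or constrained resources (thresholds assumed)."""
    HEAVY, LIGHT = "ChatGPT-4", "ChatGLM-6B"

    def __init__(self, window=20, intensity_thr=0.6, gpu_thr=0.9, latency_thr=1200):
        self.history = deque(maxlen=window)   # sliding window of recent emotion intensities
        self.intensity_thr = intensity_thr
        self.gpu_thr = gpu_thr
        self.latency_thr = latency_thr

    def select(self, emotion_intensity: float, status: SystemStatus) -> str:
        self.history.append(emotion_intensity)
        overloaded = (status.gpu_util > self.gpu_thr
                      or status.avg_latency_ms > self.latency_thr)
        sustained = sum(self.history) / len(self.history)
        if not overloaded and max(emotion_intensity, sustained) >= self.intensity_thr:
            return self.HEAVY   # strong or sustained emotion: prioritize expressive model
        return self.LIGHT       # routine dialogue or constrained resources

scheduler = DynamicModelScheduler()
print(scheduler.select(0.85, SystemStatus(gpu_util=0.55, avg_latency_ms=640)))   # ChatGPT-4
print(scheduler.select(0.20, SystemStatus(gpu_util=0.95, avg_latency_ms=1500)))  # ChatGLM-6B
```

In the full system, the reinforcement learning policy mentioned above would replace these fixed thresholds with learned decision boundaries refined from reward feedback.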

3.5.2. Load Balancing Strategy Design

To ensure stability under high-concurrency conditions, a hierarchical load balancing mechanism is implemented, encompassing both the resource scheduling layer and the service distribution layer. At the resource scheduling layer, the system adopts a container-based elastic scaling mechanism using Docker, which allows for automatic adjustment of service nodes in response to real-time request volume. When the number of concurrent users exceeds 50, a horizontal scaling strategy is triggered, enabling the deployment and activation of additional service nodes within three seconds, thus ensuring uninterrupted service. At the service distribution layer, an enhanced weighted round-robin algorithm is utilized. This algorithm dynamically allocates requests by considering multiple factors, including GPU utilization, request queue length, and average response time for each model instance. This approach significantly mitigates load imbalance across service nodes and improves overall system throughput and response stability.
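As a simplified illustration of the service distribution layer, the sketch below scores each node by spare capacity and routes the next request accordingly. It is a weighted selection over node statistics rather than the production smooth weighted round-robin, and the weighting coefficients are assumptions.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    name: str
    gpu_util: float      # 0..1
    queue_len: int       # pending requests
    avg_resp_ms: float   # recent average response time

def pick_node(nodes, w_gpu=0.5, w_queue=0.3, w_resp=0.2):
    """Score nodes by spare capacity (lower load -> higher weight); weights are illustrative."""
    def weight(n: NodeStats) -> float:
        return (w_gpu * (1 - n.gpu_util)
                + w_queue / (1 + n.queue_len)
                + w_resp / (1 + n.avg_resp_ms / 100))
    return max(nodes, key=weight).name

nodes = [
    NodeStats("tts-1", gpu_util=0.82, queue_len=6, avg_resp_ms=410),
    NodeStats("tts-2", gpu_util=0.47, queue_len=2, avg_resp_ms=280),
    NodeStats("tts-3", gpu_util=0.91, queue_len=9, avg_resp_ms=650),
]
print(pick_node(nodes))  # tts-2: lowest GPU load, shortest queue, fastest responses
```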

3.6. Exception Handling Mechanism

To ensure the robustness of the multimodal human–computer interaction system, this study introduces a series of fault-tolerance strategies, incorporating dynamic input switching and service degradation mechanisms. These designs aim to maintain system availability and response stability under abnormal operating conditions.

3.6.1. Multi-Level Fallback Architecture

A three-tier fallback mechanism is implemented to handle input failures.
Voice Input Retry (Level 1): When voice input fails three consecutive times in the Voice Activity Detection (VAD) module with the detection threshold set to −40 dBFS, the system automatically initiates a voice input retry mechanism to address transient recognition interruptions.
Modality Switching (Level 2): In the event of prolonged voice input unavailability, the system automatically switches to a text-based input channel established via WebSocket, while maintaining stable communication through an HTTP long-polling mechanism (polling interval set to 500 ms).
Device Status Monitoring (Level 3): The system proactively verifies the status of the local microphone by broadcasting a UDP command (on port 5060). If the device is not detected, a front-end alert window is triggered to prompt the user to check the device connection.
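A compact sketch of this three-tier fallback logic is shown below. The UDP probe payload, reply handling, and the exact escalation order are illustrative assumptions; only the retry count, VAD threshold, and port number come from the description above.

```python
import socket

VAD_THRESHOLD_DBFS = -40
MAX_VOICE_RETRIES = 3

def check_microphone(udp_port=5060, timeout=0.5) -> bool:
    """Level 3: broadcast a UDP probe and treat any reply as 'device present' (probe format assumed)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.settimeout(timeout)
        s.sendto(b"MIC_STATUS?", ("255.255.255.255", udp_port))
        try:
            s.recvfrom(64)
            return True
        except socket.timeout:
            return False

def handle_voice_failure(failed_attempts: int) -> str:
    """Return the fallback action for the current count of consecutive voice-input failures."""
    if failed_attempts < MAX_VOICE_RETRIES:
        return "retry_voice_input"        # Level 1: transient VAD failure, retry capture
    if check_microphone():
        return "switch_to_text_channel"   # Level 2: mic present but voice input keeps failing
    return "alert_check_device"           # Level 3: device not detected, prompt the user

print(handle_voice_failure(1))  # retry_voice_input
```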

3.6.2. Exception Classification and Response Strategies

To further enhance fault tolerance during the interaction process, the system categorizes exceptions into three types and designs corresponding response mechanisms, as shown in Table 1.

3.6.3. Service Degradation Strategy

When the system detects that GPU utilization exceeds 90%, it automatically activates a lightweight PERT-base model and reduces the concurrency level to 4 in order to alleviate computational pressure. At the same time, an OPUS-based speech compression strategy is adopted, with the bitrate set to 16 kbps, ensuring that the speech synthesis MOS score remains above 4.0 while maintaining acceptable compression quality. To suppress interaction inconsistencies caused by emotional recognition fluctuations, the system freezes the current emotion vector E e m o if the abnormal state persists for more than 10 s, thereby preventing sudden emotional shifts from affecting subsequent outputs. Figure 2 illustrates the timing logic and module invocation path of the system’s exception handling process.

4. Experiments

4.1. Experiment Details

The proposed system is deployed within a hybrid cloud architecture, with the hardware and software configurations carefully selected to balance the demands of real-time interaction and the computational efficiency required for deep model inference.
On the hardware side, the system operates on a GPU server cluster powered by NVIDIA T4 accelerators. The cluster consists of eight nodes, each equipped with a 32-core Intel Xeon Silver processor and 64 GB of RAM. High-efficiency multi-GPU coordination is enabled via NVLink interconnects. Backend data storage relies on a distributed Ceph cluster, which provides a sustained read/write throughput of up to 2.4 GB/s, effectively supporting high-speed I/O operations for large-scale speech data.
On the software side, the system is deployed using a containerized architecture and runs on Ubuntu 22.04 LTS. The inference engine is built on PyTorch 1.12 and ONNX Runtime 1.15, allowing for flexible model loading and accelerated inference. The speech processing module integrates the Librosa 0.10.0 audio processing library and employs a streaming ASR model based on Paraformer-large, which offers both robustness and real-time performance. For emotion recognition, the system utilizes the HuggingFace Transformers 4.28 framework to load the pre-trained bert–base–chinese model, enabling high-accuracy sentiment classification from text. RESTful APIs are implemented using FastAPI, and bidirectional low-latency communication is achieved via the WebSocket protocol, ensuring immediate and stable user interaction.
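For reference, a minimal FastAPI sketch of the WebSocket interaction channel is given below. The route name and the placeholder pipeline call are assumptions; the real endpoint would invoke the ASR, emotion recognition, LLM, and TTS modules described in Section 3.

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def run_pipeline(text: str) -> dict:
    """Placeholder for the ASR/emotion/LLM/TTS pipeline; returns a stub response here."""
    return {"reply": f"echo: {text}", "emotion": "calmness"}

@app.websocket("/ws/interact")
async def interact(ws: WebSocket):
    # Full-duplex, low-latency channel: receive user text, send back the synthesized response
    await ws.accept()
    try:
        while True:
            text = await ws.receive_text()
            result = await run_pipeline(text)
            await ws.send_json(result)
    except WebSocketDisconnect:
        pass

# Run with: uvicorn app_module:app --host 0.0.0.0 --port 8000
```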

4.2. Datasets

To ensure the reliability of experimental results and enhance the generalization capability of the proposed model, multiple publicly available speech datasets are utilized for model training and evaluation. These datasets cover a wide range of languages, speaker characteristics, and emotional dimensions.
1. English Speech Dataset
LJSpeech-1.1 [31] is a high-quality single-speaker English speech dataset that is widely used for training and evaluating speech synthesis models. The dataset consists of 13,100 audio clips, each accompanied by precise text transcriptions. The duration of each clip ranges from 1 to 10 s, with a total recording time of approximately 24 h. The recordings feature clear pronunciation and consistent audio quality, making the dataset well-suited for baseline speech synthesis modeling tasks.
2. Chinese Speech Datasets
To support Chinese speech modeling and emotion recognition, two representative datasets are adopted in this study, as follows:
(1) AISHELL-3 [32] is one of the largest open-source Mandarin speech datasets currently available. It comprises 88,035 high-quality recordings collected from 218 speakers across various dialect regions, with a total duration exceeding 85 h. The dataset is well-suited for training and evaluating multi-speaker Mandarin speech synthesis systems.
(2) ChnSentiCorp (https://github.com/hidadeng/cnsenti (accessed on 10 May 2025)) is a Chinese sentiment analysis corpus containing a large number of sentences annotated with emotion labels, such as positive and negative. This dataset provides accurate emotional feature vectors for the speech synthesis module and supports emotion-driven speech generation. To adapt to the six-dimensional emotional space proposed in this paper, the original binary sentiment labels (positive and negative) were converted into six-dimensional emotional labels through semantic analysis and manual annotation. Positive comments were mapped to happiness, playfulness, and calmness, while negative comments were mapped to sadness and anger. Curiosity was identified through interrogative sentences and exploratory vocabulary. The annotation was performed by three professionals, with annotation consistency evaluated using Cohen’s Kappa coefficient, achieving a score of 0.82, indicating high inter-annotator agreement.
3. Emotional Speech Dataset
EmoV-DB [33] is a high-quality dataset designed for emotional speech synthesis tasks. It includes recordings annotated with the following five basic emotion categories: happiness, sadness, anger, surprise, and calm. This dataset is utilized to train the emotion control module, thereby enhancing the system’s ability to accurately convey emotional content from textual input through synthesized speech.

4.3. Results and Analysis

4.3.1. Evaluation of Speech Synthesis Quality

To comprehensively evaluate the performance of the proposed system in speech synthesis tasks, we adopt the internationally standardized MOS (Mean Opinion Score) [34] as the subjective metric for perceptual evaluation. The MOS test assesses the following three key dimensions: Naturalness, Emotional Expressiveness, and Intelligibility. To ensure the reliability and robustness of the evaluation results, 21 participants were recruited for the experiment. Each emotional dimension was assessed using a 5-point rating scale, and the arithmetic mean was calculated. The standard deviations of the scores for all emotional dimensions were below 0.5, indicating a high level of consistency among the participants’ ratings. The evaluation results are summarized in Table 2.
As shown in Table 2, the proposed system outperforms the baseline models Tacotron2 [35,36] and FastSpeech2 [37] across all three evaluation dimensions, as well as in the overall MOS score, demonstrating superior performance in speech synthesis quality.
In terms of naturalness and intelligibility, our proposed system (Ours) achieved scores of 4.21 and 4.42, respectively, outperforming both Tacotron2 and FastSpeech2. Specifically, the naturalness score showed an improvement of 0.06 over FastSpeech2 and 0.33 over Tacotron2, indicating that the proposed model is capable of generating more realistic and fluent synthetic speech. However, the relatively small gap with FastSpeech2 suggests that there is room for further improvement in the stability of prosody modeling, particularly under specific scenarios.
Regarding emotional expressiveness, our system achieved a MOS of 4.43, significantly surpassing Tacotron2 (3.25) and FastSpeech2 (3.66) by 1.18 and 0.77 points, respectively. This result strongly validates the effectiveness of deep semantic representations from BERT in capturing emotional cues. Furthermore, the emotion-driven gated attention mechanism and the fine-grained control of prosodic parameters in our model enhance the expressiveness and emotional impact of the synthesized speech.
For intelligibility, all models exceeded the “easy to understand” threshold score of 4.0, with our system achieving the highest score of 4.42 ± 0.33. This represents relative improvements of 5.49% and 1.61% over Tacotron2 and FastSpeech2, respectively. These gains can be attributed to the VITS vocoder, which effectively reduces pronunciation ambiguity via adversarial training, and the optimized phoneme duration prediction module, which minimizes the occurrence of skipped phonemes, thereby maintaining high speech clarity.
Additionally, standard deviation analysis revealed consistently low score variability across all three models. Notably, our system exhibited the lowest standard deviations in emotional expressiveness (0.34) and intelligibility (0.33), indicating that listeners’ subjective perceptions of the synthesized speech were more consistent and stable.

4.3.2. Emotion Recognition Results

1. Confusion Matrix Analysis
A confusion matrix was constructed to conduct quantitative analysis, providing both a systematic assessment of the model’s classification accuracy across individual emotion categories and insight into the potential confusability between them, as shown in Figure 3. The experimental results indicate that the proposed system demonstrates a notable advantage in fine-grained emotion recognition tasks, achieving an overall classification accuracy of 91.6%. Specifically, the model shows outstanding performance in recognizing Happiness (94.6%) and Anger (97.1%), which is attributed to the distinctive acoustic features associated with these two emotions.
However, the analysis reveals a 12.0% confusion rate between Curiosity and Playfulness, mainly due to their similarities in both semantic content and acoustic expression. It is worth noting that the model achieves high precision and recall across most emotion categories, indicating that it maintains recognition accuracy while ensuring comprehensive emotional coverage, as shown in Table 3. These findings validate the robustness and generalizability of the proposed method in multi-emotion scenarios and provide an effective solution for complex emotion recognition tasks.
2. Radar Chart Analysis
To further visualize the model’s performance across different emotional categories, a radar chart of F1-scores for the six emotions was plotted, as shown in Figure 4. This visualization reveals a clear performance distribution: the categories Happiness and Anger exhibit strong performance with prominent outer boundaries, while Curiosity and Playfulness show relatively contracted regions, indicating room for further optimization in discriminating between these semantically similar emotions.
From the perspective of F1-score, the model performs best on Happiness (0.941) and Anger (0.929), suggesting that it is highly effective at capturing the distinguishing features of these emotions, achieving both high precision and strong recall. The categories Calmness (0.881) and Sadness (0.886) also demonstrate high stability and predictive reliability, reflecting the model’s balanced performance.
Although Playfulness (0.839) and Curiosity (0.830) score slightly lower, their performance remains at a relatively high level. The confusion between these two categories may stem from their overlapping semantic and acoustic characteristics, indicating that more contextual or multimodal information could be beneficial for improving their separability.
In summary, the proposed model demonstrates strong overall performance, accurate recognition of major emotional categories, and solid fine-grained discriminative capability across multiple emotion classes.
3. Specific Examples of Emotion Recognition Instances
To further illustrate the effectiveness of the proposed method in the emotion recognition task, Table 4 presents the recognition results for 10 representative samples spanning a six-dimensional emotion space as follows: Happiness, Sadness, Anger, Curiosity, Playfulness, and Calmness. Emotional states are represented as probability distribution vectors, with each dimension corresponding, in order, to one of the aforementioned emotion categories. For each sample, the table reports the input text, the ground-truth label, the predicted emotion vector, and the final predicted label. This analysis serves to assess the model’s capability in fine-grained emotion recognition and to examine its robustness and limitations across diverse emotional expression scenarios.
(1) Happiness (Samples 1 and 2): Sample 1 describes a positive experience with prompt delivery and excellent service, with the Happiness dimension achieving the highest score (0.52) in the predicted vector. The Playfulness dimension (0.22) shows a slight increase, reflecting the lighthearted emotions conveyed by phrases such as “extremely satisfied”. Similarly, Sample 2 exhibits a dominant Happiness score (0.50), followed by Playfulness (0.20), indicating the model’s ability to capture the joyful and relaxed emotions inherent in positive feedback.
(2) Sadness (Samples 3 and 4): Sample 3 addresses issues with book printing quality, with the Sadness dimension scoring 0.70, accurately capturing negative emotions expressed through terms like “disheartening”. Sample 4 describes a poor hotel environment, with a Sadness score of 0.68 and a minor contribution from Calmness (0.11), potentially reflecting the resigned sentiment in phrases such as “barely stayed one night”. The model effectively distinguishes the intensity and semantic context of sadness.
(3) Anger (Samples 5 and 6): Both Samples 5 and 6 express strong dissatisfaction with poor service quality, with Anger scores of 0.67 and 0.66, respectively, demonstrating the model’s sensitivity to intense expressions such as “infuriating” and “the manager should be sacked”. Minor elevations in Sadness (0.10) and Calmness (0.10) may reflect underlying disappointment within the anger, showcasing the model’s ability to detect nuanced emotional blends.
(4) Curiosity (Samples 7 and 8): Samples 7 and 8, both interrogative sentences, yield Curiosity scores of 0.65 and 0.67, respectively, indicating the model’s capability to identify exploratory emotions through phrases like “what’s going on” and “is it possible”. Slight contributions from Happiness (0.10) and Playfulness (0.10) may stem from the mildly positive tone often present in questioning sentences.
(5) Playfulness (Sample 9): Sample 9 describes a lighthearted and humorous reading experience, with the Playfulness dimension scoring 0.52, followed by Happiness (0.20). This accurately captures the lively emotions conveyed by terms such as “haha” and “like chatting”, demonstrating the model’s sensitivity to playful linguistic cues.
(6) Calmness (Sample 10): Sample 10, a neutral review, yields a Calmness score of 0.55, with minor contributions from Happiness (0.15) and Sadness (0.10), reflecting the balanced emotional state expressed in phrases like “generally okay”. This highlights the model’s ability to detect neutral and tempered emotional tones.
The predicted vectors, generated through mapping in a six-dimensional continuous emotional space, achieve accurate classification of emotional categories while providing fine-grained distributions of emotional intensity. For example, Sample 5 exhibits a high Anger score (0.67), clearly reflecting the intense emotions in the text, whereas Sample 10, with a dominant Calmness score (0.55), demonstrates precise modeling of neutral emotions. However, the slightly elevated secondary scores for Curiosity and Playfulness (both 0.10) in Samples 7 and 8 suggest minor confusion between semantically similar emotions, such as curiosity and playfulness. This aligns with the 12.0% cross-error rate reported in the confusion matrix, likely due to overlapping semantic and acoustic features in interrogative sentences and lighthearted tones. The model performs particularly well in recognizing high-intensity emotions such as happiness, sadness, and anger, though there remains room for improvement in distinguishing semantically close emotions like curiosity and playfulness. These results further validate the effectiveness of the proposed approach, highlighting the advantages of six-dimensional emotional space modeling in accurately identifying diverse emotional expressions.

4.3.3. System Response Time Optimization Results

To support real-time interaction, a three-tier optimization strategy was implemented, successfully increasing the system’s queries per second (QPS) from 15 to 50, while keeping the end-to-end response latency consistently below 1.2 s.
We first applied model lightweighting by employing knowledge distillation, which reduced the size of the BERT model by 75%. Combined with 8-bit quantization, this optimization shortened single-pass inference time from 58 ms to 32 ms (a 44.8% reduction) and decreased memory usage by 62%. Second, distributed task scheduling was achieved using a Redis-based asynchronous message queue, improving task allocation efficiency during peak periods by 3.8 times. Third, GPU resource pooling was introduced to facilitate parallel inference across multiple models, increasing GPU memory utilization from 45% to 82%. Additionally, a frame-level batch processing algorithm for speech reduced CPU idle time by 35%, and a dynamic feature caching mechanism improved the response speed for repeated requests by 47%.
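The dynamic feature caching mentioned above can be pictured as a small LRU cache keyed by normalized input text, as sketched below; the cache size and keying scheme are assumptions for illustration.

```python
from collections import OrderedDict

class FeatureCache:
    """LRU cache keyed by normalized input text; stores previously computed emotion
    vectors or synthesis features for repeated requests (capacity is illustrative)."""
    def __init__(self, max_items=1024):
        self.max_items = max_items
        self._store = OrderedDict()

    def get(self, text: str):
        key = text.strip().lower()
        if key in self._store:
            self._store.move_to_end(key)      # mark as recently used
            return self._store[key]
        return None

    def put(self, text: str, features):
        key = text.strip().lower()
        self._store[key] = features
        self._store.move_to_end(key)
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)   # evict least recently used entry

cache = FeatureCache()
cache.put("今天天气怎么样？", {"emotion": [0.1, 0.0, 0.0, 0.5, 0.2, 0.2]})
print(cache.get("今天天气怎么样？"))
```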
Stress testing results (Figure 5) demonstrate that the system maintains a processing capacity of 42 QPS under extreme conditions with 200 concurrent users, with 95% of response latencies fluctuating within ±0.3 s at a 95% confidence level. The dynamic resource allocation algorithm ensures service availability by automatically activating degradation modes—such as disabling non-essential emotion visualization—during traffic surges. In terms of resource efficiency, the optimized system reduces hardware consumption by 56% without compromising interaction quality. Specifically, the average daily power consumption of the NVIDIA T4 GPU decreased from 4.2 kWh to 1.8 kWh (p < 0.05). A 120 h continuous stability test revealed no memory leaks or service crashes, confirming the robustness of the improved architecture.

4.3.4. Dynamic Model Scheduling Optimization

The framework employs a dynamic model scheduling algorithm that jointly considers emotional intensity, response latency, and computational resource usage to select the optimal language model at runtime. When the user’s emotional intensity exceeds a predefined threshold (e.g., in highly expressive states such as happiness or anger), the system dynamically invokes a high-capacity model, ChatGPT-4, to enhance response coherence and emotional alignment. For routine conversations, the system switches to a lightweight model, ChatGLM-6B, thereby reducing GPU load and energy consumption, as shown in Figure 6.
The algorithm leverages a sliding window mechanism to continuously monitor historical interactions and integrates a reinforcement learning-based policy to adaptively refine scheduling decisions. In a test involving 1000 consecutive dialogues, the dynamic scheduling strategy achieved a model selection accuracy of 94.2%, significantly outperforming static allocation baselines and demonstrating strong adaptability and resource efficiency.

4.3.5. Performance Comparison Analysis of the Model Across Different Languages

To thoroughly evaluate the operational efficiency and model adaptability of the proposed system across diverse linguistic environments, comparative experiments were conducted using Chinese and English corpora, focusing on the computational complexity and performance differences in emotion recognition and speech synthesis tasks. All experiments were performed under a unified hardware and software configuration to ensure comparability. Table 5 presents the comparative results for Chinese and English emotion tasks across the following three dimensions: inference latency, GPU memory usage, and computational complexity (FLOPs).
The results indicate that English tasks exhibit slightly lower inference latency, GPU memory usage, and computational complexity (FLOPs) compared to Chinese tasks, with statistically significant differences (p < 0.05). Specifically, English emotion recognition achieves an inference time of 33 ms, 5.7% faster than Chinese (35 ms), while speech synthesis inference time is 230 ms, 8.0% faster than Chinese (250 ms). For speech synthesis FLOPs, English (10.9 × 10⁹) is 7.6% lower than Chinese (11.8 × 10⁹). These differences are primarily attributed to the clearer phoneme boundaries and stable phoneme-to-speech mapping rules in English, which simplify the model's generation pathway. In contrast, the Chinese VITS model requires additional modeling of tonal variations, rhythm, and lexical stress, resulting in more complex computational pathways and marginally higher inference latency and FLOPs.
Regarding GPU memory consumption, English emotion recognition (500 MB) and speech synthesis (1305 MB) require 5.7% and 5.4% less memory than their Chinese counterparts (530 MB and 1380 MB, respectively), with the differences remaining below 6%. This suggests that the system’s memory optimization strategies generalize well across languages. In terms of emotion recognition performance, English achieves slightly higher accuracy (92.4%) than Chinese (90.8%), likely due to the more distinguishable prosodic features in English.
Despite inherent differences in linguistic structures and emotional expression between Chinese and English, the proposed multimodal architecture demonstrates efficient inference performance and robust emotion recognition capabilities in both languages. English exhibits a marginal advantage in synthesis efficiency and recognition accuracy, while the computational complexity of the Chinese model can be further reduced through optimized speech preprocessing and prosody modeling. These findings validate the cross-linguistic adaptability and engineering scalability of the proposed system.

4.3.6. Advantages Across Diverse Application Scenarios

The proposed emotion recognition and speech synthesis model demonstrates significant advantages over traditional systems (e.g., Tacotron2, FastSpeech2, and other BERT-based frameworks) in multiple real-world application scenarios. These advantages are detailed as follows.
(1) Psychological Companionship: The system’s ability to handle mixed emotional states (e.g., concurrent sadness and curiosity) and its low response latency (below 1.2 s) make it well-suited to real-time therapeutic interactions, addressing the limited emotional granularity and slow responses of traditional systems. Leveraging six-dimensional emotion modeling and dynamic speech synthesis, it generates natural, emotionally rich dialogues that are particularly effective for emotional companionship of children, the elderly, and individuals with autism. For instance, the system can detect emotional fluctuations in a child and mitigate psychological stress through emotionally responsive actions and speech from virtual characters, strengthening the user’s emotional connection with the system. By combining BERT’s deep semantic understanding with VITS’s real-time synthesis capabilities, it delivers highly humanized emotional support.
(2) Learning Assistant: Integrated with a 3D visualization client based on Unreal Engine 5.1, the system creates immersive learning environments where emotionally expressive virtual avatars enhance engagement in language learning and social skills training. Through BERT-based emotion recognition and VITS-driven speech synthesis, the system delivers personalized learning support. As a virtual learning assistant, it employs a lively and encouraging style to stimulate student interest, dynamically detects emotional states (e.g., curiosity), and provides tailored guidance via emotionally responsive avatar actions and speech. Applicable scenarios include knowledge query resolution, scientific exploration, language learning support, and collaborative team learning. Compared to traditional script-based voice assistants, the proposed system, with its dynamic emotion fusion mechanism and six-dimensional emotion modeling, significantly improves emotional adaptability and personalization of learning outcomes.
(3) Intelligent Customer Service: The system’s dynamic model scheduling and load-balancing mechanisms ensure scalability in high-concurrency scenarios, mitigating the latency spikes commonly encountered by traditional TTS systems during peak usage. By leveraging real-time emotion recognition and speech synthesis, the system provides efficient and emotionally responsive customer support. As a virtual customer service agent, it employs a friendly and engaging persona to detect user emotions (e.g., anger, curiosity, or happiness) and delivers rapid, accurate responses under high-concurrency conditions through dynamic scheduling, significantly enhancing user satisfaction. Application scenarios include gamer support, e-commerce customer service, telecommunications technical support, financial consultation, and travel booking assistance. For example, during in-game interactions, the system can detect a player’s frustration, promptly switch to a consoling tone, and offer solutions. Compared to Tacotron2’s limited emotional output, the proposed system excels in emotional response speed and user satisfaction. Furthermore, dynamic load balancing ensures system stability during peak periods, further optimizing customer experience.
These advantages stem from the system’s comprehensive optimizations in natural language expression, high-precision emotion recognition, vivid virtual character representations, and superior real-time performance, providing efficient and adaptable solutions across diverse application scenarios.

5. Conclusions

This study proposes a multimodal intelligent interaction framework that integrates BERT-based semantic understanding with VITS-driven speech synthesis. Addressing key challenges in human–computer emotional interaction, namely limited emotional representation, weak cross-modal coordination, and insufficient real-time responsiveness, the proposed system achieves high naturalness, robust emotional expressiveness, and low-latency performance. At the system design level, a modular layered architecture and a multi-model coordination mechanism enable efficient integration and flexible scheduling of multimodal information. At the model level, the introduction of a six-dimensional continuous emotional space, emotion-driven gated attention mechanisms, and semantic-guided variational inference strategies significantly enhances the emotional granularity of speech synthesis and the adaptability of interactions.
Experimental validation on diverse datasets demonstrates that the proposed approach outperforms mainstream baseline models in key metrics, including speech synthesis naturalness (MOS), emotion recognition accuracy, and system response latency. Furthermore, the incorporation of dynamic model scheduling and load-balancing mechanisms ensures exceptional stability and resource efficiency in high-concurrency scenarios, with a notable reduction in GPU power consumption.
The system’s multimodal emotion modeling and dynamic scheduling mechanisms confer significant advantages in application scenarios such as psychological companionship, learning assistance, and intelligent customer service. In contexts requiring high emotional expressiveness and real-time interaction, the six-dimensional emotional space and gated attention mechanism enable a more natural and immersive interaction experience. Additionally, a vivid 3D visualization client developed using Unreal Engine 5.1 further enhances the system’s performance across diverse scenarios, surpassing traditional systems based on discrete emotion classification.
This study provides an effective technical solution for multimodal emotion modeling, controllable speech synthesis, and intelligent interaction system integration, laying a solid foundation for the development of next-generation emotion-aware artificial intelligence systems. Future research may focus on fine-grained emotion classification, cross-lingual and cross-cultural adaptability, and the interpretability of generative models to further expand the system’s potential in applications such as human–computer dialogue, virtual human interactions, and assistive healthcare.

Author Contributions

Conceptualization, Y.Y. and S.D.; methodology, Y.Y.; software, Y.Y.; validation, Y.Y., S.D. and X.T.; formal analysis, Y.Y.; investigation, Y.Y.; resources, S.D.; data curation, Y.Y.; writing—original draft preparation, Y.Y.; writing—review and editing, X.T.; visualization, Y.Y.; supervision, X.T.; project administration, X.T.; funding acquisition, Y.W. Y.Y. and S.D. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the 2025 Hebei Province Higher Education Scientific Research Project [Grant numbers: QN2025367, QN2025754], the Zhangjiakou City 2022 Municipal Science and Technology Plan Self-raised Fund Project [Grant number: 221105D], and the 2024 Hebei Province Education Science “14th Five-Year Plan” Project [Grant number: 2404224].

Data Availability Statement

The data used to support the findings of this study are included within the paper.

Conflicts of Interest

The authors declare no competing financial interests.

Figure 1. Modular layered architecture of the proposed system. The system consists of six functional layers: the service support layer, intelligent interaction layer, speech processing layer, storage layer, external service layer, and application presentation layer. It integrates emotion analysis, dialogue generation, and speech services with LLMs such as ChatGPT-4 and ChatGLM-6B, enabling real-time multimodal emotional interaction through standardized inter-layer communication.
Figure 2. Timing diagram of the fault tolerance process for exception handling.
Figure 3. Confusion matrix analysis of emotion recognition results.
Figure 4. F1-score by emotion category.
Figure 5. Comparison of response time before and after optimization.
Figure 6. Accuracy performance evaluation of dynamic and static scheduling strategies.
Table 1. Exception categories and response strategies.
Exception Type | Triggering Condition | Response Strategy
Speech Silence | No valid input detected for 5 s | Reset the ASR engine
Semantic Conflict | BERT-PERT confidence score below 0.6 | Context rollback using a sliding window (size = 5)
Emotional Conflict | Emotion intensity variance exceeds 0.4 | Emotion parameter interpolation (window size = 3)
Table 2. MOS comparison of speech synthesis quality.
Model | Naturalness | Emotional Expressiveness | Intelligibility | Overall Score
Tacotron2 | 3.88 ± 0.42 | 3.25 ± 0.45 | 4.19 ± 0.38 | 3.77 ± 0.42
FastSpeech2 | 4.15 ± 0.39 | 3.66 ± 0.42 | 4.35 ± 0.35 | 4.05 ± 0.38
Ours | 4.21 ± 0.36 | 4.43 ± 0.34 | 4.42 ± 0.33 | 4.35 ± 0.34
Table 3. Precision, recall, and F1-score for each emotion category.
Category | Precision | Recall | F1-Score
Happiness | 0.946 | 0.937 | 0.941
Sadness | 0.849 | 0.925 | 0.886
Anger | 0.971 | 0.890 | 0.929
Curiosity | 0.816 | 0.845 | 0.830
Playfulness | 0.831 | 0.848 | 0.839
Calmness | 0.905 | 0.858 | 0.881
Table 4. Emotion recognition examples.
Sample | Input Text | Ground Truth | Predicted Vector | Predicted Label
1 | First time buying books on Dangdang, delivery was fast, ordered the previous night and arrived by noon the next day, the delivery staff was very polite, really satisfied. | Happiness | [0.52, 0.05, 0.03, 0.13, 0.22, 0.05] | Happiness
2 | The hotel environment is great, the business center is very enthusiastic, the free airport shuttle service is excellent, felt comfortable and happy staying there. | Happiness | [0.50, 0.05, 0.03, 0.15, 0.20, 0.07] | Happiness
3 | The last few pages of the book were double-printed, the text was blurred and unreadable, the quality was so poor it was disheartening. | Sadness | [0.05, 0.70, 0.05, 0.05, 0.05, 0.10] | Sadness
4 | The hotel was too old, the room had a musty smell, barely stayed one night, checked out the next morning, very disappointed. | Sadness | [0.05, 0.68, 0.06, 0.05, 0.05, 0.11] | Sadness
5 | The lousy hotel had terribly slow internet, the breakfast was awful, and the staff was eating while cooking eggs, it made me furious. | Anger | [0.05, 0.10, 0.67, 0.05, 0.03, 0.10] | Anger
6 | No hotel could be worse than this, the staff asked me to change shoes for breakfast, who’s the boss here, the manager should resign. | Anger | [0.05, 0.10, 0.66, 0.06, 0.03, 0.10] | Anger
7 | It seems like a label was torn off the back of the machine, with residue still there, strange, what’s going on? | Curiosity | [0.10, 0.05, 0.05, 0.65, 0.10, 0.05] | Curiosity
8 | Why does this book have two versions, can Dangdang release the sixth volume separately, really want to know what’s going on? | Curiosity | [0.10, 0.05, 0.05, 0.67, 0.08, 0.05] | Curiosity
9 | The language is light and humorous, reading it lifts my mood, the content is practical, haha, feels like chatting. | Playfulness | [0.20, 0.05, 0.03, 0.15, 0.52, 0.05] | Playfulness
10 | The room was fairly clean and tidy, breakfast had limited variety but tasted okay, check-out was fast, overall okay. | Calmness | [0.15, 0.10, 0.05, 0.10, 0.05, 0.55] | Calmness
Table 5. Computational complexity comparison for emotion recognition and speech synthesis across languages.
Task Type | Language | Inference Time (ms) | GPU Memory (MB) | FLOPs (×10⁹)
Emotion Recognition | Chinese | 35 | 530 | 1.76
Emotion Recognition | English | 33 | 500 | 1.72
Speech Synthesis | Chinese | 250 | 1380 | 11.8
Speech Synthesis | English | 230 | 1305 | 10.9