1. Introduction
Over the past decade, advances in artificial intelligence, speech processing, and natural language processing have transformed the capabilities of spoken dialogue systems (SDS), enabling more natural and human-like interactions. A typical SDS architecture follows a cascaded pipeline structure consisting of three components: an automatic speech recognition (ASR) module, a large language model (LLM), and a text-to-speech synthesis (TTS) module. The ASR module transcribes spoken language into text, which is then processed by the LLM to understand contextual meaning and generate an appropriate response. Finally, the TTS module converts the generated text back into speech. In this framework, each component is often pre-trained on large datasets and subsequently fine-tuned to enhance performance for specific tasks or domains.
Recent advancements have introduced end-to-end SDSs by leveraging self-supervised learning (SSL) speech representations, called SSL features, that effectively capture linguistic information such as lexical and semantic cues [1,2,3]. These representations are quantized into discrete tokens, serving as inputs to speech-language multimodal LLMs, which align speech features with the LLM’s content feature space, enabling direct speech-to-text conversion without requiring a dedicated ASR module or additional adaptors. This approach improves efficiency while maintaining high recognition accuracy.
However, discrete, token-based LLMs primarily emphasize lexical content while overlooking critical paralinguistic cues in human speech. These cues, including intonation, pitch, loudness, speech rate, and rhythm, complement literal semantics and are essential for conveying both emotions and nuanced meanings. While linguistic context contributes to emotion recognition, prosodic and acoustic features play a dominant role. Failure to incorporate emotional cues can lead to misinterpretations, causing the LLM to respond with inappropriate emotions and context, such as interpreting sarcasm as a sincere statement or failing to recognize frustration, ultimately reducing the naturalness of the dialogue and user engagement [4]. As shown in Figure 1, the same textual content can convey different meanings depending on the emotional prosody of the utterance.
Recent studies have explored the development of emotionally adaptive SDSs capable of better recognizing speaker emotions and generating responses accordingly [5,6,7,8]. These systems aim to produce contextually appropriate and emotionally expressive responses by extracting emotional embeddings from speech using a pretrained external emotion encoder [7,8,9]. However, effectively incorporating emotional context in speech-driven scenarios remains challenging. Most existing approaches rely primarily on speech-based emotional embeddings, which may result in incomplete emotion modeling due to the limited incorporation of contextual meaning. Furthermore, these methods, despite enabling emotional adaptation with minimal architectural modifications, introduce additional computational overhead due to their reliance on external emotion encoders. Consequently, they often struggle to fully integrate the rich paralinguistic features that constitute emotional expression with linguistic context. Addressing these limitations requires models that can more effectively incorporate contextual information to enhance emotional understanding and response generation.
Another challenge in the development of emotionally adaptive spoken dialogue systems is the scarcity of high-quality datasets. StyleTalk [8] and DailyTalk [10] are the most widely used publicly available speech conversation datasets annotated with emotion labels. However, these datasets often exhibit limitations such as insufficient emotional expressiveness in speech, a limited number of speakers or samples, and severe data imbalance across emotion categories, as shown in Figure 6 and Table 3. Empirically, we found that these issues hinder the LLM’s ability to effectively capture emotional cues from speech, leading to biased response generation and reducing the overall robustness of emotion-aware dialogue modeling.
In this paper, we propose EmoSDS, an Emotional Spoken Dialogue System that leverages self-supervised speech representations to unify speech and emotion recognition within a single framework. Unlike conventional approaches that rely on cascaded ASR and external emotion encoders, EmoSDS directly processes speech features within the LLM, enabling the simultaneous capture of both linguistic content and emotional cues.
To achieve this, we leverage the structured representation properties of SSL features [1]. Specifically, we apply k-means clustering to SSL features to obtain discrete linguistic tokens, while utilizing the residuals as continuous paralinguistic embeddings to preserve emotional and prosodic attributes. This approach enables a speech-language multimodal architecture that effectively integrates paralinguistic features and linguistic tokens, eliminating the need for external emotion encoders. By leveraging the strong representational capabilities of the LLM, EmoSDS captures fine-grained emotional expressions, enhancing the naturalness of generated responses. Furthermore, we adopt a multi-stage training strategy to enable the LLM to learn progressively, starting with fundamental tasks, such as ASR and speech emotion recognition (SER), and moving on to more complex objectives, including response emotion conditioning and text generation. Specifically, we introduce a three-stage training pipeline: in Stage 1, the model is trained on the ASR task; in Stage 2, it learns both ASR and SER; and in Stage 3, it incorporates response emotion conditioning and text generation. This structured training approach allows the LLM to gradually learn linguistic and paralinguistic features, resulting in more emotionally aligned and contextually rich responses.
Finally, to mitigate the scarcity of high-quality emotional speech conversation datasets, we utilize the emotional speech database (ESD) [11] with GPT-4o-mini [12] to generate diverse emotional conversations, constructing an expanded emotional speech conversation dataset, EmoSC. This dataset not only improves emotional expressiveness but also increases the number of samples across different emotion categories, ensuring a more balanced and comprehensive dataset for training. As a result, the system can generate more emotionally nuanced and contextually appropriate responses.
The experimental results demonstrate that EmoSDS effectively recognizes both linguistic and emotional features from speech inputs and outperforms baseline systems in response emotion and text generation quality. Furthermore, comparative analyses show that our newly introduced dataset improves the diversity and balance of generated response emotions, enabling the system to engage in more emotionally expressive and contextually sensitive conversations.
Our contributions can be summarized as follows:
We propose EmoSDS, an Emotional Spoken Dialogue System that unifies speech and emotion recognition within a single LLM, enabling end-to-end speech understanding while effectively capturing both linguistic content and emotional cues by incorporating continuous paralinguistic features.
We introduce a three-stage training pipeline that progressively improves the LLM’s learning of content and emotional representations for more expressive and contextually aligned responses.
We propose EmoSC, a balanced emotional speech dataset that enhances expressiveness and supports diverse, emotionally rich dialogue generation.
The remainder of this paper is organized as follows. Section 2 reviews related work, while Section 3 presents the proposed method in detail. Section 4 describes the experimental setup and Section 5 provides the experimental results. Finally, Section 6 concludes this paper.
3. Methods
In the following sections, we describe (1) the architecture leveraged in EmoSDS, (2) our three-stage training pipeline, and (3) the dataset generation process for EmoSC.
3.1. EmoSDS
We propose EmoSDS, which integrates speech emotion recognition (SER) within a speech-language multimodal architecture, enabling end-to-end speech understanding. Our approach directly incorporated continuous paralinguistic features into the LLM, eliminating the need for an intermediate external emotion encoder. We extracted both linguistic and paralinguistic features from an SSL encoder using k-means quantization and a subtraction operation. The linguistic features were obtained through k-means quantization, while the paralinguistic features were derived from the quantization residuals obtained by subtracting the quantized representations from the original SSL features. These extracted features were then passed directly to the LLM, which was fine-tuned for ASR and SER tasks. This enabled the model to learn both content and emotional nuances, producing expressive and contextually appropriate responses. As a result, EmoSDS enhanced the naturalness of generated dialogues by capturing fine-grained emotional expressions.
The overall model architecture is illustrated in Figure 2. Our framework consisted of three primary components: a WavLM speech encoder, a k-means quantization module, and an LLM backbone. The speech encoder processed speech inputs to produce SSL representations, from which we extracted content features through a quantization and downsampling procedure. Residual features were subsequently derived by subtracting these content features from the original SSL representations.
The LLM backbone internally comprised four subcomponents: a tokenizer, an embedding layer, transformer decoder blocks, and an LM head. The tokenizer converted textual inputs (such as text prompts and dialogue history) and speech cluster ID sequences into token sequences. These tokens were then mapped to embeddings via the embedding layer and concatenated with the residual embeddings. The transformer decoder blocks processed the combined embeddings and, finally, the LM head generated the most probable next token. For the LLM backbone, we used the Llama 3.2 3B Instruct model (https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct, accessed on 8 January 2025), which has been fine-tuned on various instruction-following tasks, and for the speech encoder, we used a pretrained WavLM-large model [18]. In the following subsections, we explain the details of each part of our architecture.
3.1.1. Speech Input Processing and Residual Extraction
The speech input was processed using the WavLM model to extract feature representations from its 6th transformer layer, denoted as $S \in \mathbb{R}^{t \times d}$, where $t$ represents the number of frames and $d$ denotes the speech feature dimension. The 6th layer of WavLM encoded both linguistic and paralinguistic features, incorporating rich acoustic information [1,20]. However, as these features were inherently continuous, while LLMs operate in a discrete feature space, a discretization process was necessary to bridge this gap. To achieve this, we applied k-means clustering, pretrained on the 6th layer of WavLM with 1000 clusters, to quantize the extracted speech representations. This quantization process captured essential linguistic content information, leveraging the structural properties of SSL features while transforming them into a form compatible with the LLM.
This process produced two outputs: a sequence of cluster embeddings $U \in \mathbb{R}^{t \times d}$, representing the quantized feature representations in the continuous space, and a corresponding sequence of cluster IDs, where each ID encoded the discrete cluster index assigned to each frame based on the clustering process.
To eliminate redundancy, consecutive duplicate cluster IDs were removed, resulting in a downsampled discrete unit sequence $C$, formally defined as
$$C = (c_1, c_2, \ldots, c_{t'}), \qquad c_i \in \{1, 2, \ldots, N\},$$
where $N$ denotes the total number of clusters and $t'$ represents the number of downsampled frames.
To extract emotional features, we computed a residual embedding sequence $R \in \mathbb{R}^{t \times d}$ by performing element-wise subtraction between the quantized embedding sequence $U$ and the pre-quantized embedding sequence $S$:
$$R = S - U.$$
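To make this procedure concrete, the following is a minimal sketch of how the layer-6 features, cluster IDs, and residuals could be computed with Hugging Face Transformers and PyTorch. The checkpoint names, the pretrained k-means codebook `centroids` (shape 1000 × d), and the function name are illustrative assumptions, not the authors' implementation.

```python
# Sketch: WavLM layer-6 features -> k-means cluster IDs (deduplicated) + residuals.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

def extract_units_and_residual(waveform, sample_rate, centroids):
    """Return deduplicated cluster IDs C and frame-level residuals R = S - U."""
    inputs = feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        out = wavlm(**inputs, output_hidden_states=True)
    S = out.hidden_states[6].squeeze(0)      # (t, d): output of the 6th transformer layer

    dists = torch.cdist(S, centroids)        # (t, 1000): distance to each k-means centroid
    ids = dists.argmin(dim=-1)               # (t,): cluster ID per frame
    U = centroids[ids]                       # (t, d): quantized (cluster-center) embeddings

    R = S - U                                # (t, d): residual paralinguistic embeddings
    C = torch.unique_consecutive(ids)        # (t',): collapse consecutive duplicate IDs
    return C, R
```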
3.1.2. Text Input Processing
For text inputs, such as task prompts and dialogue history, tokenization was performed to enable the LLM to process textual information. Tokenization converts raw text into discrete numerical representations that models can understand and manipulate. Typically, the LLM utilized a subword tokenization method, referred to as byte-pair encoding (BPE), which segmented words into smaller subword units based on frequency statistics gathered during the tokenizer training phase. For example, the sentence “Transcribe the speech into written text” might be represented with BPE as follows: [“Trans”, “cribe”, “the”, “speech”, “into”, “writ”, “ten”, “text”]. This approach ensured the effective handling of rare or unseen words by reducing vocabulary size, maintaining flexibility, and mitigating the issue of out-of-vocabulary tokens.
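As a quick illustration, the backbone's tokenizer can be inspected directly; the actual subword splits depend on the learned BPE merges, so the output may differ from the segmentation shown above.

```python
# Illustrative only: inspect how the Llama 3.2 tokenizer segments the example sentence.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
print(tokenizer.tokenize("Transcribe the speech into written text"))
# The exact pieces (and any word-boundary markers) are tokenizer-dependent.
```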
3.1.3. Vocabulary Expansion and Tokenization
Text prompts were converted into tokens from the original vocabulary of the LLM. However, to enable the LLM to handle speech cluster IDs as additional tokens, it was necessary to expand the vocabulary. Alongside cluster IDs, emotion class tokens were introduced during training, serving as ground-truth labels for emotion prediction and enabling the LLM to better comprehend emotional context. Additionally, speaker ID tokens were introduced to distinguish between the model itself (“EmoSDS”) and the conversational partner (“user”) during dialogue generation, facilitating speaker-aware response generation. The details of this training process are provided in Section 3.2. Let $V$ denote the original LLM vocabulary of size $|V|$, and let $M$ and $K$ represent the number of emotion class tokens and speaker ID tokens, respectively. The vocabulary is expanded by introducing a new set of speech unit tokens, emotion class tokens, and speaker ID tokens, forming an updated vocabulary $V'$, where
$$V_{\mathrm{new}} = \{u_1, \ldots, u_N\} \cup \{e_1, \ldots, e_M\} \cup \{s_1, \ldots, s_K\}.$$
The final expanded vocabulary $V'$ is then defined as
$$V' = V \cup V_{\mathrm{new}}, \qquad |V'| = |V| + N + M + K.$$
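A minimal sketch of this vocabulary expansion with the Hugging Face API is shown below; the token string formats (e.g., "<unit_0>", "<angry>", "<user>") are assumptions for illustration and are not necessarily the exact strings used in EmoSDS.

```python
# Sketch: add speech unit, emotion class, and speaker ID tokens to the LLM vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

N = 1000  # one speech unit token per k-means cluster
speech_unit_tokens = [f"<unit_{i}>" for i in range(N)]
emotion_tokens = ["<neutral>", "<happy>", "<angry>", "<sad>", "<surprise>"]  # M = 5 (assumed names)
speaker_tokens = ["<EmoSDS>", "<user>"]                                      # K = 2 (assumed names)

tokenizer.add_tokens(speech_unit_tokens + emotion_tokens + speaker_tokens)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are randomly initialized
```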
3.1.4. Embedding Concatenation
To construct the final input for the transformer decoder blocks, tokens obtained from text and speech prompts were first concatenated along the temporal axis. Both speech cluster tokens ($C$) and textual tokens were then transformed into corresponding embeddings via the embedding layer. Subsequently, the residual embeddings ($R$) were appended to the end of this embedding sequence. To match the dimension of $R$ with the LLM’s hidden dimension, we replicated $R$ three times along the feature dimension axis, forming $R' \in \mathbb{R}^{t \times 3d}$. During training, this input segment was referred to as the prefix, and labels corresponding to each training stage were concatenated to the prefix before being fed into the LLM.
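The snippet below sketches this prefix construction under the same assumptions as the earlier examples (Llama hidden size 3072, WavLM feature size d = 1024); shapes and names are illustrative rather than the authors' code.

```python
# Sketch: build the prefix embeddings [text prompt; speech units; tiled residuals].
import torch

def build_prefix_embeddings(model, prompt_ids, cluster_token_ids, R):
    embed = model.get_input_embeddings()            # token embedding layer
    text_emb = embed(prompt_ids)                    # (1, L_text, 3072)
    unit_emb = embed(cluster_token_ids)             # (1, L_units, 3072)

    # Replicate R three times along the feature axis: (t, 1024) -> (1, t, 3072).
    R_tiled = R.repeat(1, 3).unsqueeze(0).to(text_emb.dtype)

    # Concatenate along the temporal axis to form the prefix.
    return torch.cat([text_emb, unit_emb, R_tiled], dim=1)

# The prefix can then be passed to the causal LM via `inputs_embeds=...`.
```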
3.1.5. Training Objective
The LLM was trained using a next-token prediction objective, aiming to minimize the negative log-likelihood of predicting each subsequent token based on the preceding tokens. Consider an input sequence $X$ composed of a task prompt $P$, a cluster ID sequence $C$, and residual embeddings $R'$ concatenated into a single prefix, followed by the label tokens. The LLM input sequence can thus be formulated as
$$X = (x_1, \ldots, x_{T_{\mathrm{prefix}}}, x_{T_{\mathrm{prefix}}+1}, \ldots, x_T),$$
where $T_{\mathrm{prefix}}$ represents the total prefix length and $T$ denotes the total number of tokens in the sample.
Since the goal of training was to predict the label portion of the sequence based on the prefix portion, the loss function was calculated on the label segment. The final LM head, comprising a linear layer followed by a softmax function, output the conditional probability distribution for the next token, given the previous tokens. The model was then optimized by minimizing the negative log-likelihood loss:
$$\mathcal{L}(\theta) = -\sum_{j=1}^{d} \sum_{i=T_{\mathrm{prefix}}+1}^{T^{(j)}} \log p_{\theta}\!\left(x_i^{(j)} \mid x_{<i}^{(j)}\right),$$
where $d$ is the number of training samples in dataset $D$, $\theta$ represents the LLM parameters, and $x_i^{(j)}$ denotes the $i$-th token in sample $X^{(j)}$.
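In practice, restricting the loss to the label segment can be done by masking the prefix tokens; the sketch below shows one common way to do this with a Hugging Face causal LM (illustrative, not the authors' training code).

```python
# Sketch: compute cross-entropy only on the label segment by masking the prefix with -100.
import torch

def make_labels(input_ids: torch.Tensor, prefix_len: int) -> torch.Tensor:
    """input_ids: (1, T) token IDs for [prefix; label]; prefix_len: T_prefix."""
    labels = input_ids.clone()
    labels[:, :prefix_len] = -100   # -100 is ignored by the default cross-entropy loss
    return labels

# loss = model(input_ids=input_ids, labels=make_labels(input_ids, prefix_len)).loss
```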
3.2. Three-Stage Training Pipeline
To enable the LLM to effectively learn both linguistic and emotional features from speech, a progressive training strategy was employed. This training process consisted of three stages: the first focused on ASR, the second integrated both ASR and SER, and the final stage incorporated ASR, SER, and emotion-aware response generation. In the initial stage, the model was trained to recognize linguistic content from quantized speech representations. In the subsequent stage, paralinguistic features, including emotional cues, were introduced, enabling the model to associate prosodic and acoustic variations with their corresponding text. In the final stage, the model generated emotionally adaptive responses by conditioning on the predicted input emotion and linguistic content. Through this structured training pipeline, the model incrementally acquired both content and emotional representations, enhancing its ability to generate expressive responses.
In the following subsections, we describe the training strategy of each stage in detail.
3.2.1. Stage 1: ASR Task
In the first stage, the model was trained to transcribe speech content using only the quantized input $C$. The loss was computed on the label portion, corresponding to the transcription of the input speech. To ensure that the model effectively captured linguistic content from $C$, only the lower part of the LLM was fine-tuned: the embedding layer, the LM head, and the first six transformer decoder layers.
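A minimal sketch of this partial fine-tuning is shown below, assuming the Hugging Face Llama implementation's module names (`model.model.layers`, `model.lm_head`); it is illustrative rather than the authors' code.

```python
# Sketch: freeze everything, then unfreeze the embedding layer, LM head,
# and the first six decoder blocks (Stages 1 and 2).
def freeze_upper_layers(model, num_trainable_layers: int = 6):
    for p in model.parameters():
        p.requires_grad = False                       # freeze the whole LLM first
    for p in model.get_input_embeddings().parameters():
        p.requires_grad = True                        # embedding layer
    for p in model.lm_head.parameters():
        p.requires_grad = True                        # LM head
    for layer in model.model.layers[:num_trainable_layers]:
        for p in layer.parameters():
            p.requires_grad = True                    # first six transformer decoder layers
```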
3.2.2. Stage 2: ASR + SER Task
In the second stage, the input consisted of both the token embeddings of $C$ and the residual embeddings $R'$, allowing the model to incorporate emotional attributes from speech. The training objective remained the same, following the next-token prediction loss function, with the label format modified to include the emotion class token in addition to the transcription.
By incorporating emotion tokens alongside linguistic content, the model learned to associate speech characteristics with emotional states. To facilitate this learning, five emotion class tokens, corresponding to the classes utilized in the ESD dataset, were introduced into the vocabulary: neutral, happy, angry, sad, and surprise.
As in the first stage, only the lower layers of the LLM were fine-tuned in the second stage.
Additionally, prompt augmentation was applied in both Stage 1 and Stage 2 using GPT-4o, generating variations in training prompts to enhance diversity across ASR and SER tasks. Details of this augmentation process are provided in Section 3.3.
3.2.3. Stage 3: ASR + SER + Response Emotion and Text Generation
In the final stage, the model was trained to generate emotionally expressive responses. The label format was extended to incorporate the input emotion, the response emotion, and the generated response text.
At this stage, the entire LLM was fine-tuned, allowing the model to jointly optimize linguistic understanding, emotion recognition, and response generation. To enhance its ability to distinguish between conversational participants, special speaker ID tokens were introduced for the model itself (“EmoSDS”) and the conversational partner (“user”).
Additionally, dialogue history was incorporated by introducing history tokens $H$ into the prefix, modifying the total prefix length of the $j$-th sample as
$$T_{\mathrm{prefix}}^{(j)} = |P^{(j)}| + |H^{(j)}| + |C^{(j)}| + |R'^{(j)}|,$$
where $|\cdot|$ denotes the length of the corresponding token or embedding sequence.
By incorporating dialogue history, the model generated responses that were both contextually coherent and emotionally aligned, leveraging learned representations of both linguistic and paralinguistic cues.
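Purely as an illustration of how such a Stage 3 sample could be laid out, the snippet below assembles a hypothetical prefix and label using the token formats assumed in the earlier sketches; the actual prompt and label templates used in EmoSDS are not reproduced here.

```python
# Hypothetical Stage 3 sample layout (illustrative token strings, not the authors' templates).
history = "<user>: <sad> I lost my keys again.\n<EmoSDS>: <neutral> Let's retrace your steps."
task_prompt = "Continue the dialogue with an appropriate emotion."
speech_units = "".join(f"<unit_{i}>" for i in [412, 87, 903])     # deduplicated cluster IDs

prefix_text = f"{task_prompt}\n{history}\n<user>: {speech_units}"  # residuals appended as embeddings
label_text = "<sad> i lost my keys again <neutral> Don't worry, we'll find them together."
```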
3.3. EmoSC
The publicly available speech conversation datasets, such as StyleTalk [8] and DailyTalk [10], often exhibit limitations in emotional expressiveness or suffer from an insufficient number of samples per emotion category. We empirically found that these limitations hinder the LLM’s ability to effectively capture emotional cues from speech, leading to biased emotional responses (see Section 5.2 for details).
To overcome these shortcomings, a new dataset, EmoSC, was introduced. EmoSC was constructed based on three key principles: (1) a balanced distribution of emotion categories across samples, (2) real-world speech samples with clearly expressed emotions, and (3) multiple samples containing different emotions but identical text content. Rather than collecting speech data from scratch, the dataset was built upon the Emotional Speech Database (ESD) [11], which consists of 350 parallel utterances spoken by 10 English and 10 Chinese speakers, totaling over 29 h of speech data. Since the dataset was intended to be used for English-language applications, only the data from the 10 English speakers in ESD were used. The parallel nature of ESD, where each text is spoken in five different emotional tones, was leveraged to augment the dataset with GPT-generated dialogue history corresponding to each emotion–text pair. This augmentation enabled the construction of dialogue samples where the same utterance was expressed in five distinct emotions.
For each emotion–text pair in ESD, the GPT-4o-mini model was used to generate conversation history. The prompting strategy was adapted from [8] to construct system and user prompts, with the full prompt structure provided in Figure 3.
In the system prompt, explicit rules were defined to guide the model’s behavior. Specifically, the model was instructed to consider only five predefined emotions and to adhere to common-sense reasoning when generating dialogue histories. Each dialogue turn was structured in the following format: <speaker>: <emotion> <text>. Additionally, it was requested that the model output its responses in dictionary format to facilitate straightforward parsing.
The user prompt explicitly instructed the model to accomplish three tasks: (1) predict an appropriate emotion and corresponding response text based on the provided dialogue history, as well as the current emotion and text; (2) suggest an alternative emotion for the current text that conveyed a different nuance while retaining the original wording; and (3) based on this alternative emotion, generate a new appropriate emotion and associated response text.
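The sketch below shows how such a generation call could look with the OpenAI Python client; the prompt wording paraphrases the rules described above and is not the exact prompt from Figure 3, and the emotion label names are assumed.

```python
# Rough sketch of the dialogue-history generation step (prompts are paraphrased, not exact).
from openai import OpenAI

client = OpenAI()
EMOTIONS = ["neutral", "happy", "angry", "sad", "surprise"]  # assumed ESD label names

def generate_history(current_emotion: str, current_text: str) -> str:
    system_prompt = (
        "You generate short dialogue histories. Use only these emotions: "
        + ", ".join(EMOTIONS)
        + ". Follow common-sense reasoning. Format each turn as <speaker>: <emotion> <text>. "
        "Return your answer as a dictionary."
    )
    user_prompt = (
        f"Current utterance: <{current_emotion}> {current_text}\n"
        "Generate a short dialogue history leading to this utterance, then:\n"
        "1) Predict an appropriate response emotion and response text.\n"
        "2) Suggest an alternative emotion that gives the same text a different nuance.\n"
        "3) For that alternative emotion, generate a new response emotion and response text."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```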
An example of a generated dialogue is illustrated in Figure 4. Since the GPT-generated dialogues existed only in text format, each generated emotion–text pair was mapped to its corresponding speech sample from ESD, forming the final speech conversation dataset.
6. Conclusions
This paper presented EmoSDS, an Emotional Spoken Dialogue System that unifies speech and emotion recognition within a single LLM-based framework. Unlike conventional SDS architectures that rely on cascaded ASR or external emotion encoders, EmoSDS directly feeds self-supervised speech representations into the LLM, enabling end-to-end speech understanding while effectively capturing both linguistic content and emotional cues.
Our approach leverages structured SSL feature representations, where discrete linguistic tokens are extracted through k-means clustering, and continuous residual embeddings preserve paralinguistic attributes. This enables the seamless integration of emotional and prosodic information, eliminating the need for external emotion encoders. To further enhance performance, we introduced a three-stage training pipeline, progressively training the model on ASR, SER, and response generation, allowing for the better learning of linguistic and paralinguistic features.
Through extensive evaluations, we demonstrated that EmoSDS achieves state-of-the-art performance in both emotion recognition and response generation. Our results show that residual input is essential for capturing paralinguistic cues, significantly improving emotion recognition accuracy, particularly in datasets where the same content is spoken with different emotional states. Additionally, stage-wise training enhances both transcription accuracy and emotional expressiveness, resulting in more coherent and emotionally adaptive responses. Furthermore, our findings indicate that LLMs can effectively process continuous paralinguistic features, leveraging prosodic variations to improve emotional understanding and response generation.
To address the scarcity of high-quality emotional speech conversation datasets, we introduced EmoSC, a balanced, emotional speech dataset constructed using GPT-4o-mini and the ESD corpus. EmoSC significantly improves emotion recognition robustness, reduces data imbalance, and enhances the LLM’s ability to generate diverse and emotionally nuanced responses.
Our findings confirm that emotionally expressive datasets and structured training pipelines are critical for advancing spoken dialogue systems. By integrating continuous paralinguistic features and multi-stage learning, EmoSDS bridges the gap between linguistic understanding and emotional expressiveness, offering a more natural and engaging conversational experience.
Future research can explore a fully end-to-end approach that eliminates not only the ASR front-end but also the LLM-TTS pipeline, enabling direct speech-to-speech interaction. Additionally, leveraging LLM methods capable of processing continuous paralinguistic features could extend the model’s applicability to more diverse and generalizable speech-related tasks, such as speaker recognition, speaker diarization, pitch estimation, and energy prediction, which are challenging for discrete, token-based LLMs. This could lead to a highly generalized speech-language multimodal LLM capable of capturing richer acoustic and linguistic information.
Furthermore, our current framework is constrained to five discrete emotion categories. Expanding to a broader set of emotions or adopting continuous emotional representation could enhance the expressiveness and adaptability of the model, further improving its real-world applicability in emotion-aware dialogue systems.