1. Introduction
Chatbot technology has advanced rapidly in the past decade, driven by improvements in artificial intelligence (AI) techniques such as natural language processing (NLP) and machine learning (ML) [1]. Modern conversational agents benefit from text-to-speech and speech-to-text capabilities, making interactions more natural [2] and increasing user acceptance worldwide [3]. Intelligent personal assistants (Apple Siri, Microsoft Cortana, Amazon Alexa, Google Assistant) now demonstrate a better understanding of user input than earlier chatbots [4], thanks to these AI advances. They can even mimic human voices and produce coherent, well-structured sentences.
Chatbots are employed across diverse domains—entertainment, healthcare, customer service, education, finance, travel [5]—and even as companions [6]. These successes result from years of research aimed at improving conversational AI functionality, performance, and language understanding accuracy [7], along with new strategies for efficient implementation [8] and dialogue quality evaluation [9]. Nevertheless, current chatbots often still fall short of user expectations [1], leading to frustration and dissatisfaction [10]. This shortfall has led researchers to emphasize the importance of endowing such systems with affective recognition capabilities, on the premise that acknowledging and responding to user emotions improves the user experience [11].
Due to issues like the above, users may perceive chatbot interactions as unnatural or impersonal. To address this, Wolk [12] argued that conversational agents should provide personalized experiences, foster lasting relationships, and receive positive user feedback. He suggested this could be achieved if such systems were capable of processing users’ intents (in addition to understanding content).
In summary, many researchers advocate for conversational agents and other AI applications to be able to recognize both the user’s intentions and emotions during interactions. This dual capability could make dialogues more natural and satisfying.
Recently, large language models (LLMs) have had a profound impact on the NLP community with their remarkable zero-shot performance on a wide range of language tasks [13]. As LLMs are increasingly integrated into daily-life applications, it is vital to analyze how well these models can recognize and classify user intentions and emotions. Developments such as supervised fine-tuning have further improved LLMs’ understanding of user instructions and intentions [14].
The debut of ChatGPT in late 2022 [15] revolutionized AI conversations with its ability to engage in human-like dialog. More recently, in early 2025, a new open-source large language model (LLM) called DeepSeek-r1 (DS-r1) [16] was introduced as a promising model with purportedly advanced dialogue capabilities. DS-r1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks and outstanding results on benchmarks such as MMLU (90.8%), MMLU-Pro (84.0%), and GPQA Diamond (71.5%), showing its competitiveness [17]. DS-r1 also excels in a wide range of tasks, including creative writing, general question answering, editing, and summarization. It achieves an impressive length-controlled win rate of 87.6% on AlpacaEval 2.0 and a win rate of 92.3% on ArenaHard, showcasing its strong ability to handle non-exam-oriented queries [17].
To the best of our knowledge, this is the first systematic study to evaluate DeepSeek, a recently released open-source large language model, on the combined tasks of emotion and intent detection in dialogue, and to examine whether additional context (the surrounding conversation and, for emotion recognition, the pre-annotated intent labels, or vice versa) influences performance on these tasks. While prior research has assessed ChatGPT, GPT-4, or fine-tuned transformer models for these tasks, DeepSeek has not been examined in this context despite its reported strengths in reasoning and general dialogue. Our work therefore provides novel insights into the affective and pragmatic capacities of a state-of-the-art open-source LLM, highlighting an asymmetric relationship between emotions and intentions that has not been previously documented.
Objectives: In this study, we evaluate the performance of DeepSeek (V3-0324 and R1-0528) as a representative LLM on two key tasks: emotion recognition and intent recognition from text. We focus on zero-shot prompting scenarios—without any fine-tuning—on established conversational datasets. Specifically, we assess (1) how accurately DeepSeek-v3 (DS-v3) and DeepSeek-r1 (DS-r1) can classify the emotion expressed in a user utterance, and (2) how accurately they can classify the user’s intent (dialogue act). We also investigate whether providing additional context (such as the surrounding conversation or knowledge of the other aspect, emotion versus intent) affects performance on these tasks.
The main contributions of this work are:
a demonstration that DeepSeek can produce reasonable classifications of intent and emotional state under zero-shot conditions;
a demonstration that DeepSeek improves its recognition of intents and emotional states in a conversation when it is provided with the conversational context;
a reaffirmation of the suggestion that providing an LLM with the intention of an utterance can improve emotional state classification.
The remainder of this paper is organized as follows:
Section 2 provides background on emotion and intention detection, as well as their relationship.
Section 4 describes the experimental methodology, including the prompt design and datasets.
Section 5 presents the results of the emotion and intent classification experiments.
Section 6 presents the comparative results of ChatGPT, DeepSeek, and Gemini.
Section 7 discusses the findings, and
Section 8 concludes the paper and outlines future work.
4. Materials and Methods
To classify every utterance of each conversation, the utterances were sent to DeepSeek (DS) through a module specifically designed for this purpose. Each utterance was classified within a predefined set of categories: a prompt specifying the desired task was added to the utterance, the classification produced by DS was retrieved, and it was then compared and evaluated against the dataset’s original annotations. Finally, we computed suitable metrics to assess the model’s performance.
Formally, the method can be described as follows. Given a conversation
$C = (u_1, u_2, \ldots, u_n)$,
where $u_i$ represents the $i$-th utterance, the function that maps every utterance to its prediction can be defined as
$f(u_i, p) = \hat{y}_i \in E$,
where $p$ represents the prompt added to each utterance, $\hat{y}_i$ the prediction made by the model for the specific utterance $u_i$, and $E$ the set of predefined labels.
Then, let us say that
$\hat{Y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)$,
where $\hat{Y}$ represents the ordered list of all the predictions made by the model over the set of predefined labels.
Now, if we want to consider the entire context of the conversation, we have the following function,
$g(C, p_c) = \hat{Y}_c$,
where $p_c$ represents the prompt that asks the model to take the conversational context into account, and each $\hat{y}_i$ is an element of $\hat{Y}_c$, the ordered list of all the predictions made by the model.
Now, let $L$ be an ordered list with the original annotations of the conversation $C$ mapped from each utterance,
$L = (l_1, l_2, \ldots, l_n)$.
In the last case, when we want to consider the context of the conversation and provide the model with the additional information of the pre-annotated label class, we have the following function,
$h(C, p_{c+L}, L) = \hat{Y}_{c+L}$,
where $p_{c+L}$ represents the prompt that asks the model to take the conversational context into account together with the additional information, and $l_i \in L$ represents the original label of the conversation $C$ mapped from the utterance $u_i$.
Finally, the next function returns the confusion matrix that will help us obtain the necessary metrics to evaluate the model,
$M = \mathrm{CM}(\hat{Y}, L)$.
The matrix $M$ enables us to obtain the metrics for evaluation (see Section 4.3 for more details on this function and other algorithms).
The following diagram helps to visualize the method described above (Figure 1).
4.1. Prompt Design for the LLM
We designed a series of text prompts to query DeepSeek for emotion and intent classification. Because we operate in a zero-shot setting, carefully phrased prompts are essential for eliciting the desired behavior from the model, so we experimented with multiple prompt formulations to determine which yielded the most accurate and consistent responses. In the end, we issued two separate prompts per input: one requesting the dominant emotion and another asking for the inferred communicative intention. Emotion categories followed the Ekman taxonomy for the MELD dataset (see Section 4.2 below) and, for the IEMOCAP dataset, the same Ekman taxonomy plus excited and frustrated, while intentions were defined using the corpus authors’ annotation schema. All prompts instructed the model to output a single-word label indicating either an emotion or an intent.
For emotion recognition, without considering the context of the conversation, the prompt used was:
“In one word, choose between anger, excited, fear, frustrated, happy, neutral, sad, or surprised. Not a summary. What emotion is shown in the next text?: ‘…’”
We found that constraining the output to a single word and providing an explicit list of emotion labels often improved the consistency of the model’s answers. Adding “Not a summary” to the prompt proved essential for retrieving a clean response and avoiding irrelevant information that would make it harder to extract the desired word.
In some variants, we provided conversation context and asked the model to label the emotion of each utterance, evaluating its ability to perform in-context learning across multiple turns.
For intent (dialog act) recognition, we used analogous prompts but with dialog act labels. For example, when we wanted the model to classify the intent of an utterance, we might prompt:
“In one word, choose between Greeting, Question,…, and Others. Not a summary. What dialogical act is shown in the next text?: ‘…’.”
To classify the emotion of each sentence while considering the context of the conversation, we used the following prompt:
“According to the conversation context, choose between anger, excitement, fear, frustration, happiness, neutral, sadness, or surprise. Answer each sentence in one word with a list and a corresponding number. Not a summary. What emotion is shown in each sentence?: ‘…’.”
In the prompt shown above, the phrases “According to the conversation context” and “Answer each sentence in one word with a list and a corresponding number” can be omitted, but the structure in which the conversation is provided is essential: each utterance must carry a sequential number (e.g., “1.—Joey—But then who? The waitress I went out with last month? 2.—Rachel—You know? Forget it! 3.—Joey—No-no-no-no, no!…”) so that the model returns the corresponding numbered list as expected.
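For illustration, the following Python sketch shows one way to assemble such a numbered conversation block before appending it to the prompt; the helper name and the data structure are illustrative, not part of our production module.

```python
def format_conversation(turns):
    """Number each (speaker, utterance) pair so the model can answer with a matching list."""
    lines = [f"{i + 1}.\u2014{speaker}\u2014{utterance}"
             for i, (speaker, utterance) in enumerate(turns)]
    return " ".join(lines)

turns = [
    ("Joey", "But then who? The waitress I went out with last month?"),
    ("Rachel", "You know? Forget it!"),
    ("Joey", "No-no-no-no, no!"),
]
prompt = (
    "According to the conversation context, choose between anger, excitement, fear, "
    "frustration, happiness, neutral, sadness, or surprise. Answer each sentence in one "
    "word with a list and a corresponding number. Not a summary. "
    f"What emotion is shown in each sentence?: '{format_conversation(turns)}'."
)
```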
However, in this study, we primarily focus on intent classification in conjunction with emotion (see Section 4.4 below).
For experiments where we provided cross-task information, the prompts were extended. For instance, to see if emotion context aids intent classification, we gave the model an utterance along with a known emotion label and asked for the intent. Conversely, to test if knowing the intent aids emotion recognition, we provided the dialog act label in the prompt and asked for the emotion. An example of the latter:
“According to the context of the conversation and its dialog act classification (given in parentheses), choose the emotion (fear, surprise, sadness, anger, joy, disgust, or neutral) of each sentence. Each sentence’s dialog act is provided. Not a summary. What emotion is shown in each sentence of the conversation? ‘…’.”
Here, the conversation lines were annotated with dialog act tags in parentheses, and the model was asked to output an emotion for each line.
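A similarly simplified sketch shows how each line can carry its dialog act tag in parentheses; the tags used in the example are illustrative only, whereas in the actual experiments they come from the EMOTyDA annotations.

```python
def format_with_dialog_acts(turns, dialog_acts):
    """Append each turn's known dialog act in parentheses, as in the cross-task prompt above."""
    return " ".join(
        f"{i + 1}.\u2014{speaker}\u2014{utterance} ({act})"
        for i, ((speaker, utterance), act) in enumerate(zip(turns, dialog_acts))
    )

# Illustrative tags only; real runs use the human-annotated dialog acts.
annotated = format_with_dialog_acts(
    [("Joey", "But then who? The waitress I went out with last month?"),
     ("Rachel", "You know? Forget it!")],
    ["Question", "Command"],
)
```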
In summary, our prompt engineering strategy was to specify the task, restrict the output format (for reliability), and supply additional context or options when needed. We did not perform exhaustive prompt tuning; rather, we aimed for reasonably straightforward prompts, under the assumption that an effective LLM should handle such direct instructions (this reflects a typical end-user approach).
4.2. Datasets
We evaluated the model on the EMOTyDA (Emotion aware Dialogue Act) dataset (Saha et al. [34]), which was constructed by combining and reannotating two public dialogue datasets: IEMOCAP (Interactive Emotional Dyadic Motion Capture Database) and MELD (Multimodal Emotion Lines Dataset). Both are multimodal dialogue corpora with emotion labels; for our purposes, we used only the text transcripts and associated labels.
IEMOCAP [51] contains scripted and improvised two-person conversations performed by actors. It has about 152 dialogues and over 10,000 utterances (turns), each annotated by multiple annotators for emotion, in the following emotion categories: happy, sad, fear, disgust, neutral, angry, excited, frustrated, and surprise.
MELD [52] is derived from TV show transcripts (dialogues from the Friends series). It includes over 1400 dialogues and 13,000 utterances, labeled with seven emotion categories: joy, sadness, neutral, anger, fear, disgust, and surprise. To compare the classifications made by DS with the provided labels, we used the same emotion categories.
From these, EMOTyDA was formed by taking the text of each conversation along with its emotion labels (Saha et al. [34]). Additionally, for a subset of the data, each utterance was annotated with one of 12 dialog act (intent) categories: greeting, question, answer, statement-opinion, statement-non-opinion, apology, command, agreement, disagreement, acknowledge, backchannel, and other. The SWBD-DAMSL tag set, developed by [53], consists of 42 dialogue acts (DAs) and has been widely used for this classification task; Saha et al. [34] took it as the basis for the EMOTyDA tag set, since both datasets contain task-independent conversations. Of those 42 tags, only the 12 most common were used to annotate utterances in EMOTyDA, because EMOTyDA is smaller than the SWBD corpus and many SWBD-DAMSL tags would never appear in it. These are the same categories used in the present study.
These dialogue act annotations were available or adapted from the original datasets, and we used those for consistency.
This unified dataset enabled us to test the model for emotion and intent classification using both corpora.
For evaluation, we considered each utterance in isolation or with its conversation context, depending on the experiment (see Section 4.4). The model’s predicted label was compared with the ground truth label to assess accuracy (see Section 4.3).
4.3. Metrics and Algorithms
We evaluated DeepSeek’s performance using several metrics. We also implemented a set of algorithms to obtain the quantities used in the formulas: Algorithm 1 computes each confusion matrix, Algorithm 2 the True Positives, Algorithm 3 the True Negatives, Algorithm 4 the False Positives, and Algorithm 5 the False Negatives [54].
Algorithm 1 (Confusion Matrix). Input: list of classes (listClass), original labels (y), predictions of the model (y′). Output: confusionMatrix.
Algorithm 2 (True Positives, TP). Input: confusionMatrix, listClass, class. Output: truePositives.
Algorithm 3 (True Negatives, TN). Input: confusionMatrix, listClass, class. Output: trueNegatives.
Algorithm 4 (False Positives, FP). Input: confusionMatrix, listClass, class. Output: falsePositives.
Algorithm 5 (False Negatives, FN). Input: confusionMatrix, listClass, class. Output: falseNegatives.
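The following Python sketch illustrates the essence of these five algorithms under the standard one-vs-rest definitions, with rows indexed by the original labels and columns by the predictions; it is a simplified illustration consistent with the listed inputs and outputs rather than the exact implementation.

```python
from typing import List

def confusion_matrix(list_class: List[str], y: List[str], y_pred: List[str]) -> List[List[int]]:
    """Algorithm 1: rows are the original labels, columns are the model predictions."""
    n = len(list_class)
    matrix = [[0] * n for _ in range(n)]
    for gold, pred in zip(y, y_pred):
        matrix[list_class.index(gold)][list_class.index(pred)] += 1
    return matrix

def true_positives(matrix, list_class, cls):
    """Algorithm 2: the diagonal cell of the class."""
    i = list_class.index(cls)
    return matrix[i][i]

def true_negatives(matrix, list_class, cls):
    """Algorithm 3: every cell outside the class's row and column."""
    i = list_class.index(cls)
    n = len(list_class)
    return sum(matrix[r][c] for r in range(n) for c in range(n) if r != i and c != i)

def false_positives(matrix, list_class, cls):
    """Algorithm 4: predicted as the class but belonging to another one (column minus diagonal)."""
    i = list_class.index(cls)
    return sum(matrix[r][i] for r in range(len(list_class)) if r != i)

def false_negatives(matrix, list_class, cls):
    """Algorithm 5: belonging to the class but predicted as another one (row minus diagonal)."""
    i = list_class.index(cls)
    return sum(matrix[i][c] for c in range(len(list_class)) if c != i)
```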
Next, we used the following standard formulas to obtain each metric:
Accuracy. Accuracy refers to the percentage of correct predictions [54].
Precision. Precision is the fraction of values predicted as belonging to a positive class that truly belong to that class [54].
Recall. Recall is the fraction of values that truly belong to the positive class that are correctly predicted [54].
F1 score. The F1 score is the harmonic mean of precision and recall, with a value of 1 indicating perfect performance and 0 the worst [54].
Macro-F1. Macro-F1 averages the F1 score across all classes equally [55].
Weighted F1 score. Weighted (W) F1 accounts for class imbalance by giving more weight to frequent classes [55].
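For reference, the standard formulations corresponding to these descriptions are given below for each class $c$ in the label set $E$, where $n_c$ is the number of instances of class $c$ and $N$ the total number of instances; we take these to match the definitions in [54,55].

```latex
\begin{gather*}
\text{Accuracy} = \frac{\text{number of correct predictions}}{N}, \qquad
\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad
\text{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \\
F1_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}, \qquad
\text{Macro-}F1 = \frac{1}{|E|} \sum_{c \in E} F1_c, \qquad
\text{Weighted-}F1 = \sum_{c \in E} \frac{n_c}{N} \, F1_c.
\end{gather*}
```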
4.4. Experimental Settings
We implemented a Python script to interface with DeepSeek models (DeepSeek-R1-0528 and DeepSeek-V3-0324, accessed in June 2025) via their API (simulating user queries to the model). All evaluations were done in a zero-shot manner; no few-shot examples or in-context demonstrations were used, and no model fine-tuning was performed, only prompt-based queries. We used the EMOTyDA dataset, comprising 9420 instances from the IEMOCAP dataset (151 conversations) and 9988 instances from the MELD dataset (943 conversations). All prompts were submitted to DS with its standard temperature set to 1 and the default maximum output of 32k tokens. Each input was queried once. For future replication, full prompt logs are available: https://github.com/Emmanuel-Castro-M/EmotionAndIntentionRecognition (accessed on 24 September 2025).
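As an illustration, a query can be issued through the OpenAI-compatible Python client that the DeepSeek API exposes; the snippet below is a simplified sketch (the model identifier, environment variable, and example prompt are illustrative) rather than the exact script used in our experiments.

```python
import os
from openai import OpenAI  # the DeepSeek API is OpenAI-compatible

# Assumed setup: an API key in DEEPSEEK_API_KEY and the public DeepSeek endpoint.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")

def classify(prompt_and_text: str, model: str = "deepseek-chat") -> str:
    """Send one zero-shot classification query; 'deepseek-reasoner' selects DS-r1."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt_and_text}],
        temperature=1,  # the temperature used in our runs
    )
    return response.choices[0].message.content.strip()

label = classify(
    "In one word, choose between anger, excited, fear, frustrated, happy, neutral, "
    "sad, or surprised. Not a summary. What emotion is shown in the next text?: "
    "'I\u2019ve been to the back of the line five times.'"
)
```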
We conducted three main experiments for emotion recognition and three for intention recognition, for each corpus (IEMOCAP and MELD):
Classification at the utterance level (no conversational context): the model was given each utterance independently, with a prompt asking for a single classification (emotion or intention). This setting mimics classifying each sentence in isolation, without conversational context.
Classification with conversational context: utterances were presented to the model within their dialogue, and the model was asked to output an emotion or intention for each utterance. This tests whether providing context (preceding utterances) improves per-utterance classification.
Classification with conversational context and the counterpart label known: here, in addition to the conversation context, we provided the model with each utterance’s human-annotated dialog act label when asking for the emotion, or with each utterance’s emotion label when asking for the dialog act, and then requested the classification. This condition assesses whether an additional hint about the counterpart dimension can enhance the prediction.
We calculated overall accuracy as the primary metric since the class distributions were moderately balanced. (In cases of class imbalance, we planned to consider F1-scores per class, but for simplicity and given our focus on overall trends, we mainly report accuracy.)
We used the gold labels provided by human annotators as ground truth. The model predictions were compared with these labels using accuracy and F1 scores. No additional human annotation was performed.
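For example, once the per-utterance predictions are collected, these scores can be computed (or cross-checked against the algorithms in Section 4.3) with scikit-learn; the label lists below are illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score

# gold and predicted are flat lists of per-utterance labels from one experiment.
gold = ["anger", "neutral", "happy", "anger"]
predicted = ["frustrated", "neutral", "happy", "anger"]

print("Accuracy:", accuracy_score(gold, predicted))
print("Macro-F1:", f1_score(gold, predicted, average="macro"))
print("Weighted F1:", f1_score(gold, predicted, average="weighted"))
```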
7. Discussion
Our experiments highlight an asymmetric relationship between emotion and intent detection in large language models. DeepSeek demonstrated moderate zero-shot performance, but its accuracy increased substantially when conversational context was available, and further improved when dialog act labels were supplied. This confirms that contextual and intentional cues are strong predictors of emotional tone. In contrast, providing emotion labels did not enhance intent recognition and sometimes misled the model, suggesting that emotions are not equally reliable indicators of communicative function.
This asymmetry is theoretically significant: while dialog acts often imply emotional tendencies (e.g., apologies correlate with sadness, disagreements with anger), emotions alone do not comparably constrain intent categories. Such findings suggest a structural imbalance in the interaction between affect and pragmatics, a phenomenon that is not widely reported in the current literature. They also highlight a limitation of zero-shot LLMs: although they capture affective nuance, they are less consistent in mapping emotions to communicative goals without explicit training.
Beyond accuracy scores, this study emphasizes the importance of designing evaluation frameworks that test both dimensions of dialogue simultaneously. By openly releasing our prompts and methodology, we provide a reproducible benchmark for exploring the affect–intent interplay in other models.
Emotion Detection: DeepSeek demonstrated a reasonable zero-shot ability to classify emotions from text, especially when provided with conversation context. The improvement in DS-v3 from 44% to 57% accuracy due to context, and then to 62% with dialog act cues, and in DS-r1 from 54% to 62%, then to 63%, underscores the importance of contextual understanding. This suggests that LLMs like DeepSeek can effectively leverage additional information. When the conversation’s flow or the nature of the utterance is known, the model can disambiguate emotions that might be neutral or ambiguous out of context. For instance, the model often failed to detect surprise or sarcasm without context, but with preceding lines, it could infer those emotions from an unexpected turn in the dialogue.
The positive impact of including dialog act labels on emotion classification (a five percentage point gain on MELD, from 57% to 62%) provides empirical support for our hypothesis in one direction: knowing what someone is doing (question, apologizing, etc.) helps the model figure out how they feel. Why might this be the case? Specific dialog acts carry implicit emotional connotations—an apology often correlates with regret or sadness; a disagreement may correlate with anger or frustration; a question might be asked in a curious (neutral/happy) tone or a challenging (angry) tone depending on context. By giving the model the dialog act, we essentially narrowed down the plausible emotions. The model could then map, say, a Disagreement act to a likely negative emotion such as anger or disgust, rather than considering all emotion categories. These findings may aid the development of empathy agents for mental health support, education, or customer interaction in low-resource settings, potentially benefiting the community.
Intent Detection: In contrast, providing emotion information did not aid intent classification. One interpretation is that emotion is a less reliable predictor of dialog act in our data. A user could be angry when asking a question or happy when making a statement—emotion does not deterministically indicate the functional intent of an utterance. The slight performance drop suggests that the model might have over-relied on emotion cues when they were present, leading to misclassifications (e.g., it might assume an angry utterance is a disagreement when in fact it was an angry question). This points to a limitation: the model does not inherently know when to separate style (emotion) from speech act (intent), and additional information can sometimes confuse style with content.
Another observation is that DeepSeek’s overall intent classification accuracy (61%) is modest. This is perhaps not surprising—dialog act classification can be quite nuanced, and our label set was large. Additionally, the model was not fine-tuned on any dialog act data; it relied solely on its pre-training. Its performance is roughly comparable to random guessing among a handful of dominant classes (since many utterances are statements or questions, which the model did get right). This indicates that while LLMs have implicit knowledge of language patterns, translating that into explicit dialog act labels may require either few-shot examples or fine-tuning. Indeed, prior work on ChatGPT (a similar LLM) has noted it can follow conversation flows but might not explicitly categorize them without guidance.
Logical Consistency and Errors: There were a few logical inconsistencies in model outputs. For example, in one scenario, the model labeled consecutive utterances from the same speaker as conflicting emotions (likely because it treated each utterance in isolation, failing to enforce the temporal consistency of that speaker’s emotional state). This highlights that the LLM does not maintain a persistent “persona state” of emotion unless explicitly modeled. In future work, incorporating a constraint that a speaker’s emotion should not wildly oscillate within a short exchange might improve realistic output.
We also noticed that ambiguous expressions like “Fine.” could be either neutral or negative; the model’s guess would sometimes flip depending on prompt phrasing. This suggests some instability typical of prompt-based LLM responses. More advanced prompt techniques or calibration might be needed for critical applications.
Implications for Affective Dialogue Systems: Our findings indicate that LLMs are promising as all-in-one classifiers for affect and intent, but their raw zero-shot capability might not yet match specialized models. For building empathetic chatbots, one could imagine using an LLM like DeepSeek to detect user emotion in real time and craft responses. The advantage is that the LLM doesn’t require separate training for the detection task. However, as we saw, it hits around 60% accuracy, whereas task-specific models can exceed 80% on some emotion benchmarks. There is, therefore, a trade-off between convenience (using one model for everything) and accuracy.
For intent detection in conversational AI, relying on an LLM’s internal knowledge might be risky if high precision is needed (for example, correctly interpreting a user’s request vs. a question can be critical). Fine-tuning an LLM on annotated intents or providing few-shot exemplars in the prompt could likely boost performance.
Error Analysis: The hypothesis that “emotions help detect intentions and vice versa” was only half-supported by our results. Emotions (especially when combined with context) did help detect intentions in some manual observations—for instance, when the model knew the user was laughing (happy), it correctly recognized a statement as a joke (which is a kind of intentional act). But these were specific cases. Overall, the automation did not show an aggregated improvement.
This asymmetry might stem from the mapping from intent to likely emotion being more direct than from emotion to intent. Intent categories are numerous and orthogonal to emotion in many cases, whereas emotions typically fall into a smaller set and can align with broad intent tones (e.g., anger often accompanies disagreement, sadness often accompanies apologetic statements, etc.). Our LLM could perhaps leverage intent cues to narrow down emotion options, but when given emotion, it still had to choose among many possible intents.
Some of the most common errors we observed occur when the conversational context is not provided to the LLM: in both datasets it confuses related emotions, for instance anger with frustration, excited with happy, or disgust with anger. For example, consider the following sentence from the IEMOCAP dataset: “I’ve been to the back of the line five times.”, annotated as anger but predicted by the model as frustration. Once the context was provided, the model made the correct prediction: knowing the previous sentence, “There’s nothing I can do for you. Do you understand that? Nothing.” (annotated as anger), makes it clear that the next sentence must be anger too, since it is a reaction directed at the speaker of the previous sentence, perceived as a threat, whereas frustration tends to be more generalized or self-directed.
In another example, the sentence “Yes. There’s a big envelope, it says, you’re in. I know.” was annotated as excited but predicted as happy in the baseline condition. Once the context was provided, the model made the correct prediction: given the previous sentence, “Did you get the letter?”, the model could recognize that the emotional state stems from an intense, temporary emotion linked to anticipation or enthusiasm for a specific event, so excited is the correct answer.
Finally, consider the sentence “Oh my god, what were you thinking?” from the MELD dataset, annotated as disgust and initially predicted by the model as anger. Without the conversation context, the model struggled to make an accurate prediction; with the context provided, it changed its classification to disgust, which was correct. The reason lies in the two preceding sentences: Monica’s “Joey, this is sick, it’s disgusting, it’s, it’s—not really true, is it?”, which is clearly disgust, and Joey’s “Well, who’s to say what’s true? I mean…”, to which Monica replies with the sentence in question. From the flow of the conversation, there is no reason to think that Monica’s emotional state changed from disgust to anger. This is how context aids the model in making better predictions.
Limitations: While our study sheds light on several aspects of emotion and intention recognition using DeepSeek, certain limitations should be acknowledged. The MELD and IEMOCAP datasets show inconsistencies in their labeling criteria: although they share similar label sets, the criteria under which humans annotated them differ, which could have introduced discrepancies in the results. Also, results for emotion and intent detection based on the MELD style (TV show dialogues) might differ in more formal conversations or other contexts.
Another limitation of our study is that DeepSeek is a single model, and its behavior may not generalize to all LLMs. Newer or larger models might have different capabilities. By conducting new studies with other LLMs, we could find a reasonable explanation for why, in some cases, providing additional information (the emotion classification label for each sentence) to the model does not improve DA prediction performance.
Also, our prompt designs, while reasonable, could potentially be optimized. It is possible that different wording (or the use of few-shot examples) could significantly improve performance on either task. We did not exhaustively tune the prompts due to scope constraints.
Finally, we note that the Gemini Flash 2.5 and additional model comparisons we initially intended (as hinted in the manuscript) were not completed; thus, our study focuses only on DeepSeek. Future work should compare multiple models (e.g., ChatGPT and others) on the same tasks to see whether they behave similarly or whether some handle the affect-intent interplay better.
8. Conclusions and Future Work
In this paper, we present a study on using large language models (DeepSeek-v3 and DeepSeek-r1) for emotion and intent detection in dialogues. Our evaluation, conducted without any task-specific fine-tuning, yielded several insights:
DeepSeek can recognize basic emotions from text with moderate accuracy in a zero-shot setting. Its performance improves substantially when given conversational context, highlighting the model’s strength in understanding dialogue flow.
Providing the model with information about the conversational intent (dialog act) of an utterance further enhanced emotion recognition, suggesting a synergistic effect where knowing “what” the utterance is doing helps determine “how it is said” emotionally.
For intent recognition, the model’s zero-shot performance was weaker (around chance level across a broad set of dialog act classes). Unlike the emotion case, providing the model with the speaker’s emotion did not assist and occasionally confused the intent classification.
The relationship between emotions and intentions is asymmetric in the context of this LLM: context and intent cues aid emotion detection, but emotion cues do not significantly aid intent detection.
The DeepSeek model, while powerful, sometimes struggled with non-literal language (e.g., sarcasm) and maintaining consistency, indicating areas for further refinement if used in practical systems.
In terms of academic contribution, our work demonstrates the feasibility of leveraging a single LLM for multiple dialogue understanding tasks simultaneously. This opens the door to developing more unified conversational AI systems. Rather than having separate pipelines for intent detection and sentiment analysis, a single model could potentially handle both, simplifying the architecture.
Future Work: There are several directions for future exploration:
1. Few-shot Prompting: We will experiment with providing a few examples of labeled emotions and intents in the prompt (in-context learning) to see if DeepSeek’s performance improves. Preliminary research on models like GPT-3/4 suggests that few-shot examples can dramatically boost accuracy.
2. Model Fine-tuning: Fine-tuning DeepSeek on a small portion of our dataset for each task might yield significant gains. It would be interesting to quantify how much fine-tuning data is required to reach parity with dedicated models.
3. Multi-task Learning: An extension of fine-tuning is to train a model on both emotion and intent labels jointly (multi-task learning). This could encourage the model to learn representations that capture both aspects internally. We hypothesize that multi-task training might enforce the kind of beneficial relationship between emotion and intent that we partially observed.
4. Applying to Real-world Conversations: We intend to test the model on more spontaneous, real user conversations (such as dialogues from customer support chats or social media threads). These often use noisier language, which would test the model’s robustness.
5. Incorporating External Knowledge: Emotions and intents might be better inferred if a model had access to external knowledge about typical scenarios. For instance, recognizing that “I’m fine!” with a particular punctuation is likely anger or sarcasm might be improved by knowledge distillation or rules. Hybrid systems combining LLMs with rule-based disambiguation for specific, tricky cases could be fruitful.
6. Improving Intent Granularity: Our results showed difficulty in fine-grained intent categories. Collapsing intents into broader classes (e.g., question, statement, command) might yield higher reliability. Future work could focus on whether LLMs are better suited to broad categorization and on refining them for detailed subclasses.
7. User State Tracking: One promising area is to have the LLM maintain a running estimate of a user’s emotional state throughout a conversation (rather than independent per-utterance classification). This could potentially smooth out moment-to-moment classification noise and provide a more stable assessment. Techniques from state tracking in dialogue could be applied here.
In conclusion, this work examined the capacity of DeepSeek to recognize emotions and intentions in dialogues under zero-shot conditions. The results show that the model can achieve reasonable emotion recognition when provided with context and dialog act cues. In contrast, intent recognition remains weaker and does not consistently benefit from emotional information.
The central contribution is the identification of an asymmetric relationship: intent knowledge helps disambiguate emotions, but emotion knowledge does not aid intent classification to the same extent. This finding advances our understanding of the limits of current LLMs and opens new perspectives for building empathetic and context-aware conversational systems.
Future research should test whether few-shot prompting, fine-tuning, or multi-task learning can enforce a more balanced integration of affect and pragmatics. Applying these methods to real-world conversational data, beyond scripted corpora, will also be crucial for validating the robustness of the proposed approach.
We acknowledge that, although accuracy improved from 57% to 62% on MELD emotion classification (and in other comparisons), no statistical significance test (e.g., McNemar’s test or a bootstrap test) was conducted. Future work could explore this further.
We also recognize that using different experimental approaches is crucial for thoroughly validating our work. Future studies will include comparisons with transformer-based and other classifier architectures to better demonstrate the strength and generalizability of our results.