Article

Performance Evaluation Metrics for Empathetic LLMs

Yuna Hong, Bonhwa Ku and Hanseok Ko
1 Department of Electrical and Computer Engineering, Korea University, Seoul 02841, Republic of Korea
2 Department of Electrical and Computer Engineering, Catholic University of America, Washington, DC 20064, USA
* Author to whom correspondence should be addressed.
Information 2025, 16(11), 977; https://doi.org/10.3390/info16110977
Submission received: 21 August 2025 / Revised: 27 October 2025 / Accepted: 7 November 2025 / Published: 11 November 2025

Abstract

With the rapid advancement of large language models (LLMs), recent systems have demonstrated increasing capability in understanding and expressing human emotions. However, no objective and standardized metric currently exists to evaluate how empathetic an LLM’s response is. To address this gap, we propose a novel evaluation framework that measures both sentiment-level and emotion-level alignment between a user query and a model-generated response. The proposed metric consists of two components. The sentiment component evaluates overall affective polarity through Sentlink and the naturalness of emotional expression via NEmpathySort. The emotion component measures fine-grained emotional correspondence using Emosight. Additionally, a semantic component, based on RAGAS, assesses the contextual relevance and coherence of the response. Experimental results demonstrate that our metric effectively captures both the intensity and nuance of empathy in LLM-generated responses, providing a solid foundation for the development of emotionally intelligent conversational AI.

1. Introduction

Recent advances in artificial intelligence have led to the emergence of interactive conversational agents that integrate speech recognition [1,2,3], large language models (LLMs), and generative response systems [4,5]. LLMs trained on massive corpora can generate human-like responses across a wide range of tasks and benchmarks. The development of these models began with the Transformer architecture [6], which revolutionized natural language processing (NLP). Subsequent breakthroughs such as BERT [7] further improved semantic understanding, paving the way for models like ChatGPT [8], GPT-4 [9], Llama [10], Gemini [11], and Mistral [12]. These systems have had a significant impact across social, cultural, and economic domains.
Despite their remarkable performance, LLMs face a fundamental limitation: their knowledge is restricted to the data available at the time of training. Consequently, they often struggle to provide accurate answers to queries involving newly emerging information. To overcome this limitation, researchers have explored methods that integrate retrieval mechanisms with generative models. Retrieval-Augmented Generation (RAG) [13] is a representative approach that retrieves relevant documents from an external knowledge base and incorporates them as contextual input during response generation. This enables models to produce more accurate and up-to-date answers even for unseen information.
To objectively evaluate the quality of responses generated by LLMs and RAG systems, various metrics such as GPTScore [14], BERTScore [15], BARTScore [16], and RAGAS [17] have been developed. GPTScore estimates the quality of generated text using token-level log probabilities. BERTScore and BARTScore leverage contextual embeddings to assess semantic similarity and text generation quality, respectively. In particular, RAGAS provides a comprehensive framework for evaluating RAG-based models through metrics such as Faithfulness, Context Recall, Answer Relevance, and Answer Accuracy.
As interest in emotionally responsive conversational systems grows, recent studies have aimed to develop LLMs capable of recognizing and expressing emotions. Emotional LLMs are designed to detect affective cues in user queries and generate empathetic responses aligned with the user’s emotional state. For example, [18] introduced EmoLLMs, models fine-tuned on the large-scale Affective Analysis Instruction Dataset (AAID) to enhance emotional understanding and generation. Similarly, [19] proposed EmoDialoGPT, which incorporates emotion embeddings and emotion prediction loss to improve emotional expressiveness in dialogue.
However, despite these advancements, there remains no standardized or objective metric to quantify the degree of empathy in model-generated responses. Existing evaluation metrics primarily focus on semantic correctness or factual relevance, overlooking emotional alignment and naturalness. This lack of a reliable empathy evaluation framework poses a critical challenge to the development of emotionally intelligent LLMs.
To address this gap, this paper proposes a novel metric to quantitatively assess how well emotional LLMs’ responses reflect empathy toward a given query. The proposed metric comprises two major components: emotional and semantic. The emotional component includes three submodules—Sentlink, Emosight, and NEmpathySort. Sentlink measures the sentiment-level relationship between a query and its response, Emosight evaluates fine-grained emotional correspondence, and NEmpathySort assesses the naturalness of empathetic responses, especially in negatively valenced contexts. The semantic component utilizes the answer relevance function from RAGAS, computing cosine similarity between embeddings of the query and response to obtain a semantic score ranging from 0 to 1. The overall empathy score is obtained by combining the emotional and semantic components, where higher values indicate more empathetic and contextually relevant responses. Experimental results across multiple datasets demonstrate that the proposed metric achieves promising results, offering a reliable and interpretable means of evaluating empathy in emotional LLMs.

2. Proposed Evaluation Metric

The proposed evaluation framework is designed around two hypotheses: first, that demonstrating empathy requires the presence of emotional content in a model's response; and second, that empathetic responses are generally expressed in a positive tone. In real-world conversations, however, empathy is not always conveyed through positive expressions. For instance, when a user expresses negative emotions such as sadness or anger, an empathetic response should appropriately mirror that sentiment rather than contradict it. To capture such cases, we introduce the NEmpathySort process, which identifies whether a negatively toned response still reflects genuine empathy.
As illustrated in Figure 1, the overall metric consists of two components: an emotional component and a semantic component. Each component produces a normalized score in the range [0, 1], and the final empathy score is obtained by taking their equally weighted average. To enhance the robustness of sentiment analysis, both emotional and contextual factors are integrated within the emotional component.

2.1. Emotional Part

In the emotional part, the Emotion score is primarily produced by the Sentlink and Emosight processes, as shown in Figure 2. Sentiment denotes the overall emotional tone or polarity of an utterance, whereas emotion refers to a specific affective state. Accordingly, the relationship between the query and the answer is first assessed at the coarse sentiment level by Sentlink and then examined in finer detail by the Emosight process. If the answer carries negative sentiment, however, an exception is raised, and the Emotion score is instead produced by the NEmpathySort process, which uses BARTScore [20].

2.1.1. Sentlink

Sentlink captures the sentiment-level alignment between the query and the response. It classifies sentiment into three categories: positive, neutral, and negative, resulting in nine possible query–response combinations. When empathy is present, the response tends to exhibit a positive or emotionally expressive tone, even if the query sentiment is neutral. We employ the VADER sentiment analyzer [21] to calculate compound sentiment scores. Sentences with scores below −0.1 are classified as negative, those above 0.1 as positive, and others as neutral. This thresholding removes sentences that express minimal emotion.
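A minimal sketch of this categorization, using the publicly available vaderSentiment package (function and variable names are illustrative rather than taken from our implementation):

```python
# Sentlink categorization sketch: VADER compound score with the +/-0.1
# thresholds described above.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

_analyzer = SentimentIntensityAnalyzer()

def sentlink_label(text: str) -> str:
    """Map a sentence to one of the three Sentlink sentiment categories."""
    compound = _analyzer.polarity_scores(text)["compound"]
    if compound < -0.1:
        return "negative"
    if compound > 0.1:
        return "positive"
    return "neutral"

# Each query-response pair then falls into one of nine combinations:
pair = (sentlink_label("I failed my exam today."),
        sentlink_label("That must feel awful. I'm here for you."))
```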

2.1.2. Emosight

Emosight evaluates the fine-grained emotional relationship between a query and a response. We adopt a RoBERTa-based emotion classifier [22] fine-tuned on the GoEmotions dataset [23]. The Emotion score is computed by combining the results of Sentlink and Emosight, yielding 36 possible cases. If the response sentiment is positive, or if it is neutral but Emosight detects emotional content, the Emotion score is set to 1; otherwise, it is set to 0. However, if the query is negative, the Emotion score is determined by the NEmpathySort process.
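A sketch of the Emosight check and the Sentlink/Emosight combination rule follows. The GoEmotions checkpoint name is an assumption (any RoBERTa classifier fine-tuned on GoEmotions would serve); sentlink_label and nempathysort_score are the sketches from Sections 2.1.1 and 2.1.3:

```python
from transformers import pipeline

_emosight = pipeline("text-classification",
                     model="SamLowe/roberta-base-go_emotions")  # assumed checkpoint

def has_emotion(text: str) -> bool:
    """True if the top GoEmotions label is anything other than 'neutral'."""
    return _emosight(text)[0]["label"] != "neutral"

def emotion_score(query: str, answer: str) -> float:
    q_sent, a_sent = sentlink_label(query), sentlink_label(answer)
    # Negative-negative pairs are deferred to NEmpathySort (Section 2.1.3).
    if q_sent == "negative" and a_sent == "negative":
        return nempathysort_score(query, answer)
    if a_sent == "positive":
        return 1.0
    if a_sent == "neutral" and has_emotion(answer):
        return 1.0
    return 0.0
```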

2.1.3. NEmpathySort

While empathy is often associated with positive expressions, limiting it solely to positive responses oversimplifies its nature. In real conversations, empathy can also be expressed through negative tones—for instance, when a user expresses sadness or anger, an empathetic model should reflect and validate those emotions rather than contradict them. To account for this, the NEmpathySort process is designed to detect genuine empathy in negative–negative interactions, that is, when both the query and the response exhibit negative sentiment.
NEmpathySort refines the Emotion score for such cases using BARTScore [20], which evaluates the conditional likelihood of a generated response given a query. We evaluate the weighted log probability of a generated text $y$ given an input $x$ as shown in Equation (1), where $\omega_t$ denotes the token weight and $\theta$ represents the model parameters.
$\text{BARTScore} = \sum_{t=1}^{m} \omega_t \log p\left(y_t \mid y_{<t}, x, \theta\right)$ (1)
Because BART [16] is trained primarily on neutral data, it tends to assign higher probabilities to direct, non-empathetic sentences, while empathetic responses—which often include emotional nuance or supportive phrasing—receive lower probabilities. Therefore, when the BARTScore value is lower than −3.1, the response is classified as empathetic, and the Emotion score is set to 1. Through this mechanism, NEmpathySort enables accurate recognition of empathetic alignment in negative–negative contexts, ensuring that emotionally congruent negative responses are correctly identified as empathetic rather than simply negative.
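A minimal sketch of the NEmpathySort decision, assuming uniform token weights (i.e., a length-normalized log-likelihood) and a generic BART checkpoint; the original BARTScore implementation supports other weightings and fine-tuned checkpoints:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

_tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")  # assumed checkpoint
_bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

def bart_score(query: str, answer: str) -> float:
    """Mean token log-probability of the answer conditioned on the query."""
    src = _tok(query, return_tensors="pt", truncation=True)
    tgt = _tok(answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = _bart(input_ids=src["input_ids"],
                    attention_mask=src["attention_mask"],
                    labels=tgt["input_ids"])
        logprobs = torch.log_softmax(out.logits, dim=-1)
        token_lp = logprobs.gather(2, tgt["input_ids"].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

def nempathysort_score(query: str, answer: str) -> float:
    # Lower likelihood indicates an indirect, emotionally softened reply,
    # which is classified as empathetic (threshold -3.1, Section 3.3.3).
    return 1.0 if bart_score(query, answer) < -3.1 else 0.0
```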

2.2. Semantic Part

An empathetic response must not only convey appropriate emotion but also maintain semantic relevance to the user’s query. We adopt the Answer Relevance metric from RAGAS [17] to measure this semantic alignment. Specifically, the metric is mathematically expressed as shown in Equation (2):
$\text{Answer Relevance} = \frac{1}{n} \sum_{i=1}^{n} \cos\left(E_{g_i}, E_o\right)$, (2)
where $n$ denotes the number of generated queries, $E_{g_i}$ represents the embedding of the $i$-th generated query, and $E_o$ represents the embedding of the original query; $\cos(E_{g_i}, E_o)$ is the cosine similarity between the two, quantifying their semantic correspondence. We employ the bge-base-en-v1.5 embedding model [24] for semantic encoding. A higher Answer Relevance score indicates that the response semantically aligns more closely with the intent of the original query.
Finally, the overall empathy score is obtained by equally combining the emotional and semantic scores, yielding a total score in the range [0, 1]. Both components are assigned equal weights to reflect the dual nature of empathy: it is not only about recognizing and expressing emotion but also about understanding the user's intent and context. This balanced formulation ensures that a response lacking either emotional depth or semantic consistency cannot achieve a high empathy score. Consequently, the proposed metric provides a holistic measure of empathetic quality in model-generated responses, capturing both affective and contextual alignment.
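A sketch of the semantic score (Equation (2)) and the final equal-weight combination follows. In RAGAS the $n$ queries are generated from the answer by an LLM; here they are simply passed in as a list, and emotion_score is the sketch from Section 2.1.2:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

_embed = SentenceTransformer("BAAI/bge-base-en-v1.5")

def answer_relevance(original_query: str, generated_queries: list[str]) -> float:
    """Mean cosine similarity between the original and generated queries."""
    vecs = _embed.encode([original_query] + generated_queries,
                         normalize_embeddings=True)
    return float(np.mean(vecs[1:] @ vecs[0]))  # unit-normed, so dot = cosine

def empathy_score(query: str, answer: str, generated_queries: list[str],
                  lam: float = 0.5) -> float:
    # lam = 0.5 gives the equal weighting used here; Section 3.3.4 notes
    # that lam is a tunable parameter.
    return (lam * emotion_score(query, answer)
            + (1.0 - lam) * answer_relevance(query, generated_queries))
```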

3. Experiment

3.1. Dataset

We evaluated the proposed metric using multiple datasets covering neutral, emotional, and empathetic interactions. For reading comprehension and question answering (QA), we used CosmosQA [25] and SQuAD [26]. These datasets are largely neutral, containing factual questions and context-based answers, allowing the model to demonstrate its baseline ability to understand and respond accurately to queries without emotional influence. To assess everyday conversational dynamics, we employed the DailyDialog dataset [27], which consists of multi-turn dialogues covering various daily life topics. This dataset contains mild emotional content and occasional empathetic expressions, making it suitable for evaluating a model’s ability to maintain conversational flow and incorporate basic emotional understanding. For evaluating empathetic interactions, we used the Empathetic Counseling dataset [28] and Empathetic Dialogues [29]. Empathetic Counseling was created by combining [30,31,32], providing conversations that explicitly demonstrate emotional support and counseling behavior. These datasets are rich in affective content and allow testing whether the model can generate contextually appropriate empathetic responses. We also employed the Multimodal EmotionLines Dataset (MELD) [33], which contains approximately 13,000 utterances from 1433 dialogues extracted from the TV series “Friends,” annotated with emotion and sentiment labels across three modalities: audio, visual, and text. This dataset is particularly useful for evaluating the model’s performance in multimodal empathy recognition, capturing not only textual cues but also vocal tone and facial expressions. Finally, to specifically test empathy in negative-emotion contexts, we generated a dataset of 1000 single-turn conversations using ChatGPT’s GPT-4o. This dataset focuses on scenarios where the user query expresses negative emotions, enabling evaluation of the NEmpathySort component’s ability to detect empathy in negative–negative interactions. Figure 3 shows an example conversation from this dataset.

3.2. Baseline Model

We implemented a multimodal system as the baseline model, integrating text, speech, and visual inputs to predict the user’s emotional state, as illustrated in Figure 4. Emotions extracted from facial expressions, speech signals, and text—including text generated via ASR—were incorporated into the LLM prompts to guide empathetic response generation.
When only text data was available, speech and visual modalities were omitted, and the system relied solely on textual sentiment. The multimodal model is designed to provide empathetic responses by accurately perceiving and interpreting the user’s affective state across multiple input channels.
For facial expression recognition and speech emotion recognition, we employed a fine-tuned MobileNet V2 [34] and a fine-tuned Wav2Vec2 model [35], respectively. Wav2Vec2, based on the Wav2Vec2.0 transformer architecture and pre-trained on 53 languages, was augmented with a classifier layer and fine-tuned using the RAVDESS [36] and CREMA-D [37] datasets. The combined dataset was partitioned with 5-fold cross-validation to ensure robust evaluation. Table 1 summarizes the performance on RAVDESS and CREMA-D. All models were converted to ONNX [38] for cross-framework deployment, optimization, and fast inference.
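An illustrative sketch of the ONNX export step for the fine-tuned speech-emotion model; the checkpoint path, one-second dummy input, and opset version are assumptions:

```python
import torch
from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "path/to/finetuned-wav2vec2").eval()   # hypothetical local checkpoint
dummy = torch.randn(1, 16000)              # 1 s of 16 kHz audio
torch.onnx.export(model, (dummy,), "wav2vec2_emotion.onnx",
                  input_names=["input_values"], output_names=["logits"],
                  dynamic_axes={"input_values": {1: "samples"}},
                  opset_version=14)
```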
For ASR, we used the Google Web Speech API [39], and the LLM was implemented with the quantized Mistral-7B-Instruct [12]. To enhance empathetic capabilities, prompts were designed following Carl Rogers’ client-centered therapy principles [40], emphasizing Authenticity, Unconditional Positive Regard, and Empathic Understanding. We assume that unconditional positive regard promotes positive user responses regardless of the user’s emotional state.
The prompts were structured as follows (an illustrative instantiation is sketched after the list):
1. [Role designation for AI]
2. [Presentation of emotions to consider]
3. [Definition of empathy]
4. [Example of empathetic conversation]
5. [Definition of the output format]
6. [Input query with emotions]
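A hypothetical prompt assembled in this six-part structure might look as follows; the wording is illustrative, not the exact prompt used in the experiments:

```python
def build_prompt(query: str, emotions: str) -> str:
    return "\n".join([
        "You are a warm, empathetic counselor.",                      # 1. role
        f"The user's detected emotions are: {emotions}.",             # 2. emotions
        "Empathy means understanding and validating the user's "
        "feelings from their own perspective.",                       # 3. definition
        'Example -- User: "I lost my job." '
        "Assistant: \"I'm so sorry, that must be really hard.\"",     # 4. example
        "Reply in one short, supportive paragraph.",                  # 5. output format
        f'User ({emotions}): "{query}"',                              # 6. query
    ])

print(build_prompt("I failed my exam today.", "sadness"))
```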
As shown in Figure 5, when the designed prompt was used, the LLM acted as an empathetic LLM, providing empathetic responses that took the user’s emotions into account.

3.3. Performance Evaluation of Proposed Metrics

As described above, the proposed metric is built on explicit hypotheses. We first conducted experiments to test these hypotheses and then evaluated the final metric to confirm its effectiveness.

3.3.1. Performance Evaluation of Sentlink

We hypothesized that empathetic responses are more likely to convey positive sentiment. To verify this, we analyzed the sentiment distributions of generated responses using the VADER Sentiment Analysis tool, adjusting the compound score threshold from the default ±0.05 to ±0.1 to exclude minimally emotional utterances.
As shown in Figure 6, the QA datasets (SQuAD and CosmosQA) exhibited sentiment distributions sharply concentrated around zero, indicating that they are largely neutral and contain minimal emotional or empathetic content. In contrast, Figure 7 illustrates that the DailyDialog dataset displays a wider sentiment range across the full interval of −1 to 1, with a mild bias toward the positive side. This suggests that everyday dialogues naturally contain more emotionally colored utterances than QA-style data.
More distinct trends appear in Figure 8, which presents the sentiment compound distributions of explicitly empathetic datasets such as EmpatheticCounseling and EmpatheticDialogues. Both exhibit right-skewed distributions concentrated in the positive region of the −1 to 1 range. Although a small portion remains near zero, negative scores are infrequent, confirming that empathetic utterances are dominantly expressed with positive affect. These results empirically demonstrate that empathy in dialogue is conveyed primarily through supportive and reassuring language rather than through negative resonance.
Overall, the cross-dataset comparison supports our hypothesis that as the emotional and empathetic intensity of data increases from QA to DailyDialog to empathetic corpora, the sentiment distribution gradually shifts from neutral toward the positive side of the sentiment spectrum. This finding validates that positive sentiment serves as a reliable indicator of empathetic expression and provides a quantitative foundation for the Sentlink metric.

3.3.2. Performance Evaluation of Emosight

We hypothesized that empathetic communication inherently contains emotional expressions. To verify this, we used a RoBERTa-base model fine-tuned on the GoEmotions dataset [23], which classifies text into 27 emotion categories and a neutral class. The model was applied to both factual QA datasets (SQuAD, CosmosQA) and empathy-related dialogue datasets (DailyDialog, EmpatheticCounseling, EmpatheticDialogues) to analyze the proportion of emotionally expressive versus neutral sentences.
As shown in Table 2, QA datasets consisted mostly of neutral sentences, accounting for more than 90%, which indicates that they are largely fact-oriented and lack affective content. In contrast, empathy-focused datasets exhibited a much higher proportion of emotionally rich utterances. For instance, the EmpatheticCounseling dataset contained nearly 80% emotional sentences, with positive expressions being the most frequent, while EmpatheticDialogues showed a similar trend.
These results clearly support our hypothesis that empathy is strongly associated with emotional language. The Emosight module effectively distinguishes emotionally expressive communication from neutral factual text, validating its role as a key component in quantifying the emotional depth of empathetic responses.

3.3.3. Performance Evaluation of NEmpathySort

In Section 3.3.1, we examined the emotional distribution of datasets and observed that although datasets with higher empathy levels tended to produce more positive responses, some negative responses still appeared. To address this limitation, we incorporated NEmpathySort to more accurately determine whether a response exhibiting negative sentiment could still be considered empathetic.
Figure 9 shows the BARTScore distributions for empathetic and non-empathetic datasets when the response sentiment is negative or when both the query and response are neutral. In the left panel of Figure 9, the empathetic datasets exhibit a sharp decline in BARTScore around −3.1, while the DailyDialog dataset, shown on the right, shows a similar drop near −3.8. These results indicate that empathetic responses are typically less direct and linguistically softer, reflecting emotional understanding rather than literal interpretation.
In contrast, Figure 10 demonstrates that most non-empathetic datasets maintain BARTScores above −3.1, implying that such responses tend to mirror the query’s surface meaning rather than considering emotional context. As seen earlier in Figure 5, empathetic responses focus on the speaker’s emotional state rather than the factual content of the query, whereas non-empathetic replies, such as in Figure 3, neglect affective cues and respond literally.
From these results, we defined −3.1 as the threshold separating empathetic from non-empathetic responses. Responses below this score are classified as empathetic, effectively distinguishing emotionally aware replies from purely informational ones and confirming NEmpathySort’s ability to capture negative–negative empathy interactions.

3.3.4. Performance Evaluation Using All Components

In our experiments, the parameter λ, representing the balance between the emotion score and the semantic score, was set to 0.5 to ensure an equal contribution from both aspects. Nevertheless, λ is a tunable parameter and can be adjusted depending on the characteristics or objectives of different tasks.
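Written explicitly, the combination used throughout the experiments is (this equation is implied by the description above rather than numbered in the text):

$\text{Empathy Score} = \lambda \cdot \text{Emotion} + (1 - \lambda) \cdot \text{Semantic}, \qquad \lambda = 0.5.$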
In Table 3, we report results across several datasets ordered by their degree of emotional content, and we additionally evaluate MELD both with text-only emotion and with text, vision, and speech emotions combined. The scores increase as the amount of empathetic emotion grows. "Emotion context" in Table 3 refers to the emotion information used in the semantic part. When people converse with one another, they draw on both verbal and nonverbal cues, but when a user interacts with an LLM, only neutral text is exchanged, making it difficult to account for the user's emotional state. Supplementing the input with nonverbal signals is therefore essential for smoother, more human-friendly interaction, which is why we integrate the user's vocal emotion and facial expression as additional information sources. By considering such emotional cues, the system can communicate more effectively with users and express empathy more naturally. Accordingly, the score rises as more emotional information, across more modalities, is taken into account, and combining all three modalities yields the highest score. The improvement is modest, however, likely because emotion classification in the visual and speech modalities is occasionally inaccurate.
Table 4 presents an evaluation in which queries from the empathetic dialogue datasets were used and responses were generated by the empathetic LLM, to assess whether the generated responses exhibit empathy. The EmpatheticDialogues and Empathetic Counseling datasets contain queries that clearly call for empathy, and the results in Table 4 demonstrate that the responses generated by the multimodal empathetic LLM express empathy effectively, achieving high empathy scores.
Figure 11, Figure 12 and Figure 13 show, respectively, a highly empathetic conversation, a weakly empathetic conversation, and a non-empathetic conversation, which we interpret against the dictionary definition of empathy: thinking from another person's perspective and understanding their experiences and thoughts. In Figure 11, the AI tells the user that it is not wrong to feel sad given their difficult situation, in line with this definition, offers an objective third-party opinion, and adds positive words of encouragement; because it empathizes and stays relevant to the user's words, the conversation receives a high score. In Figure 12, the AI likewise offers encouragement and empathy, acknowledging the user's condition and noting that it is natural to feel nervous when starting treatment. However, while the user talks about tension and fear, the AI focuses on the awkwardness of crying, so the response is judged insufficient from an informational standpoint and the semantic score is slightly lower. In Figure 13, the user is happy about being promoted, but the AI shows no empathy at all, ignoring the user's feelings and dwelling only on the potential disadvantages of the promotion.

4. Conclusions

In this paper, we proposed an effective and interpretable metric for evaluating empathy in large language models (LLMs). The proposed framework integrates both emotional and semantic aspects by combining the Sentlink, Emosight, and NEmpathySort processes with the RAGAS-based answer relevance score.
Experimental results strongly support our hypotheses. The emotional component effectively captured the presence and quality of empathic expression, while the semantic component ensured contextual alignment between queries and responses. In particular, NEmpathySort demonstrated high reliability in detecting empathetic intent even in negative–negative emotional interactions.
Furthermore, quantitative evaluations showed that the overall empathy scores increased proportionally with the amount of emotional information provided, validating the sensitivity of the proposed metric. This trend was consistent across both textual and multimodal settings, where facial expressions and speech signals were incorporated. The multimodal empathetic LLM achieved the highest scores, confirming that the proposed metric can accurately reflect the degree of emotional understanding across modalities.
Overall, the experimental findings substantiate that our metric provides a robust and objective means to evaluate empathy in LLM-generated responses, supporting the conclusions drawn in this study.

5. Future Works

Although the proposed metric provides an initial attempt to quantify empathetic understanding in large language models, this study has several limitations that warrant further investigation. In this section, we discuss key concerns raised by reviewers and outline directions for future work.
First, in this study, empathy was initially operationalized as a positivity score (i.e., “Empathy = positivity”), providing a simple and quantifiable approach. However, in real-world scenarios, empathy can also be expressed in response to negative or distressing situations. To address this, we employed the NEmpathySort method, a technique designed to handle negative empathetic responses, which allows us to consider both negative and positive forms of empathy. Nevertheless, this approach may still not fully capture the multidimensional nature of empathy, including subtle cognitive and affective components. Future work should explore more nuanced and comprehensive definitions that better reflect the complexity of human empathetic understanding.
Second, the datasets used in this work, including SQuAD and CosmosQA, were not originally constructed for evaluating empathy. As such, the observed results may only approximate empathetic behavior rather than directly measure it. To enable more rigorous validation of the proposed metric, future research should consider developing or using datasets specifically annotated for empathy, where both positive and negative empathetic responses are labeled. By additionally using other datasets or datasets annotated for emotions, it would be possible to further assess the generalizability of the metric across different domains and contexts.
Third, the current experiments primarily focused on text-based and limited multimodal inputs. Further investigation of more sophisticated multimodal integration methods, as well as a wider range of emotional contexts, may help assess the applicability of the metric. Moreover, future work might include usability testing and human evaluation to examine alignment with human-perceived empathy and to avoid overinterpreting correlations as causal relationships.
Overall, addressing these limitations will enhance the reliability, validity, and applicability of the proposed metric in measuring empathetic understanding in large language models.

Author Contributions

Methodology, Y.H. and H.K.; Supervision, B.K. and H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This material is based upon work supported by the Air Force Office of Scientific Research under award number FA2386-23-1-4098.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Beh, J.; Baran, R.H.; Ko, H. Dual channel based speech enhancement using novelty filter for robust speech recognition in automobile environment. IEEE Trans. Consum. Electron. 2006, 52, 583–589. [Google Scholar] [CrossRef]
  2. Ahn, S.; Ko, H. Background noise reduction via dual-channel scheme for speech recognition in vehicular environment. IEEE Trans. Consum. Electron. 2005, 51, 22–27. [Google Scholar]
  3. Lee, Y.; Min, J.; Han, D.K.; Ko, H. Spectro-temporal attention-based voice activity detection. IEEE Signal Process. Lett. 2019, 27, 131–135. [Google Scholar] [CrossRef]
  4. Kwak, J.g.; Dong, E.; Jin, Y.; Ko, H.; Mahajan, S.; Yi, K.M. Vivid-1-to-3: Novel view synthesis with video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 6775–6785. [Google Scholar]
  5. Kwak, J.g.; Li, Y.; Yoon, D.; Kim, D.; Han, D.; Ko, H. Injecting 3d perception of controllable nerf-gan into stylegan for editable portrait image synthesis. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 236–253. [Google Scholar]
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  7. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), Minneapolis, MN, USA, 6–7 June 2019; Volume 1. [Google Scholar]
  8. Ray, P.P. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet Things Cyber-Phys. Syst. 2023, 3, 121–154. [Google Scholar] [CrossRef]
  9. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  10. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  11. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar] [CrossRef]
  12. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  13. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  14. Fu, J.; Ng, S.K.; Jiang, Z.; Liu, P. Gptscore: Evaluate as you desire. arXiv 2023, arXiv:2302.04166. [Google Scholar] [CrossRef]
  15. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  16. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
  17. Es, S.; James, J.; Espinosa-Anke, L.; Schockaert, S. Ragas: Automated evaluation of retrieval augmented generation. arXiv 2023, arXiv:2309.15217. [Google Scholar] [CrossRef]
  18. Liu, Z.; Yang, K.; Xie, Q.; Zhang, T.; Ananiadou, S. Emollms: A series of emotional large language models and annotation tools for comprehensive affective analysis. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 5487–5496. [Google Scholar]
  19. Jia, Y.; Cao, S.; Niu, C.; Ma, Y.; Zan, H.; Chao, R.; Zhang, W. EmoDialoGPT: Enhancing DialoGPT with emotion. In Proceedings of the 10th CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC 2021), Qingdao, China, 13–17 October 2021; Part II, pp. 219–231. [Google Scholar]
  20. Yuan, W.; Neubig, G.; Liu, P. BARTScore: Evaluating Generated Text as Text Generation. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 27263–27277. [Google Scholar]
  21. Hutto, C.; Gilbert, E. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, Ann Arbor, MI, USA, 1–4 June 2014; Volume 8, pp. 216–225. [Google Scholar]
  22. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  23. Demszky, D.; Movshovitz-Attias, D.; Ko, J.; Cowen, A.; Nemade, G.; Ravi, S. GoEmotions: A dataset of fine-grained emotions. arXiv 2020, arXiv:2005.00547. [Google Scholar]
  24. Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv 2023, arXiv:2309.07597. [Google Scholar]
  25. Huang, L.; Bras, R.L.; Bhagavatula, C.; Choi, Y. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. arXiv 2019, arXiv:1909.00277. [Google Scholar] [CrossRef]
  26. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. Squad: 100,000+ questions for machine comprehension of text. arXiv 2016, arXiv:1606.05250. [Google Scholar]
  27. Li, Y.; Su, H.; Shen, X.; Li, W.; Cao, Z.; Niu, S. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017), Taipei, Taiwan, 27 November–1 December 2017. [Google Scholar]
  28. Valero, L.A.M. Empathetic_counseling_Dataset. Available online: https://huggingface.co/datasets/LuangMV97/Empathetic_counseling_Dataset (accessed on 20 August 2025).
  29. Rashkin, H.; Smith, E.M.; Li, M.; Boureau, Y.L. Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset. In Proceedings of the ACL, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
  30. Amod. Amod/mental_health_counseling_conversations. Available online: https://huggingface.co/datasets/Amod/mental_health_counseling_conversations (accessed on 20 August 2025).
  31. EmoCareAI. ChatPsychiatrist. Available online: https://github.com/EmoCareAI/ChatPsychiatrist (accessed on 20 August 2025).
  32. Bertagnolli, N. Counsel Chat: Bootstrapping High-Quality Therapy Data. 2020. Available online: https://medium.com/data-science/counsel-chat-bootstrapping-high-quality-therapy-data-971b419f33da (accessed on 20 August 2025).
  33. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. Meld: A multimodal multi-party dataset for emotion recognition in conversations. arXiv 2018, arXiv:1810.02508. [Google Scholar]
  34. Lee, B.; Hong, J.; Shin, H.; Ku, B.; Ko, H. Dropout Connects Transformers and CNNs: Transfer General Knowledge for Knowledge Distillation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 8346–8355. [Google Scholar]
  35. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  36. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
  37. Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390. [Google Scholar] [CrossRef]
  38. ONNX Runtime. 2021. Available online: https://onnxruntime.ai/ (accessed on 20 August 2025).
  39. Google LLC. Speech-to-Text AI: Speech Recognition and Transcription. Available online: https://cloud.google.com/speech-to-text (accessed on 20 August 2025).
  40. Rogers, C.R. The attitude and orientation of the counselor in client-centered therapy. J. Consult. Psychol. 1949, 13, 82. [Google Scholar] [CrossRef]
Figure 1. The proposed metric process.
Figure 2. Process of Sentlink and Emosight. Sentlink classifies the broad emotional categories of neutral, positive, and negative, while Emosight detects more fine-grained, specific emotions, considering both the query and the answer.
Figure 3. An example of a sentence generated by ChatGPT that does not show empathy.
Figure 4. The process of multimodal empathy LLM.
Figure 5. LLM results using empathetic prompts.
Figure 6. QA sentiment compound distribution ((left): SQuAD/(right): CosmosQA).
Figure 7. Daily Dialog sentiment compound distribution.
Figure 8. Empathetic sentiment compound distribution ((left): EmpatheticCounseling/(right): EmpatheticDialogues).
Figure 9. BARTScore results for data meeting NEmpathySort conditions. The x-axis represents the BARTScore values, and the y-axis indicates the number of occurrences for each value. ((left): EmpatheticDialogues/(right): DailyDialog).
Figure 10. BARTScore results for non-empathetic datasets (the x-axis represents the BARTScore values, and the y-axis indicates the number of occurrences for each value).
Figure 11. LLM results generating strongly empathetic responses.
Figure 12. LLM results generating moderately empathetic responses.
Figure 13. LLM results generating non-empathetic responses.
Table 1. Comparison of fine-tuning performance for fold 0.
Train | Test | Accuracy | Weighted F1-Score
RAVDESS [36] | RAVDESS [36] | 0.88 | 0.8774
CREMA-D [37] | CREMA-D [37] | 0.7447 | 0.7414
RAVDESS [36] + CREMA-D [37] | RAVDESS [36] | 0.8867 | 0.8866
RAVDESS [36] + CREMA-D [37] | CREMA-D [37] | 0.7454 | 0.7409
Table 2. Comparison of data counts by emotion presence and sentiment type.
Dataset | Negative | Neutral | Positive | Total
SQuAD [26] w/ emotion | 656 | 528 | 748 | 1932
SQuAD [26] w/o emotion | 4264 | 85,736 | 6237 | 96,237
CosmosQA [25] w/ emotion | 1969 | 1046 | 2693 | 5708
CosmosQA [25] w/o emotion | 3931 | 21,379 | 4192 | 29,502
DailyDialog [27] w/ emotion | 3134 | 9853 | 12,327 | 25,314
DailyDialog [27] w/o emotion | 1324 | 5637 | 5714 | 12,675
Empathetic_counseling [28] w/ emotion | 10,857 | 6695 | 30,892 | 48,444
Empathetic_counseling [28] w/o emotion | 1999 | 5180 | 4036 | 11,215
EmpatheticDialogues [29] w/ emotion | 7069 | 6615 | 21,814 | 35,498
EmpatheticDialogues [29] w/o emotion | 694 | 2538 | 1523 | 4755
Table 3. Results from experiments using the proposed metrics.
Dataset | Emotion Context | Average Score
SQuAD [26] | text | 0.3473
CosmosQA [25] | text | 0.4591
DailyDialog [27] | text | 0.6234
Empathetic_counseling [28] | text | 0.6977
EmpatheticDialogues [29] | text | 0.7043
MELD [33] | text | 0.7039
MELD [33] | text, vision, speech | 0.7121
Table 4. Empathy score of the answer generated using empathetic LLM.
Dataset | Emotion Context | Average Score
Empathetic_counseling [28] | text | 0.7925
EmpatheticDialogues [29] | text | 0.7887