Article

The IDRE Dataset in Practice: Training and Evaluation of Small-to-Medium-Sized LLMs for Empathetic Rephrasing

1 Department of Information Engineering and Computer Science (DISI), University of Trento, Via Sommarive 9, Povo, 38123 Trento, TN, Italy
2 Lutech, Viale Cembrano 2, 16148 Genova, GE, Italy
3 Fondazione Bruno Kessler, Via Sommarive 18, Povo, 38123 Trento, TN, Italy
* Author to whom correspondence should be addressed.
Electronics 2025, 14(20), 4052; https://doi.org/10.3390/electronics14204052
Submission received: 29 August 2025 / Revised: 9 October 2025 / Accepted: 13 October 2025 / Published: 15 October 2025

Abstract

Integrating emotional intelligence into AI systems is essential for developing empathetic chatbots, yet deploying fully empathetic models is often constrained by business, ethical, and computational factors. We propose an innovative solution: a dedicated empathy rephrasing layer that operates downstream of a chatbot’s initial response. This layer leverages large language models (LLMs) to infuse empathy into the chatbot’s output without altering its core meaning, thereby enhancing emotional intelligence and user engagement. To implement this layer, we extend and validate the IDRE (Italian Dialogue for Empathetic Responses) dataset. We evaluated small- and medium-scale LLMs across three configurations: baseline models, models augmented via few-shot learning with IDRE exemplars, and models fine-tuned on IDRE. Performance was quantitatively assessed using the LLM-as-a-judge paradigm, leveraging custom metrics. These results were further validated through an independent human evaluation and supported by established NLP similarity metrics, ensuring a robust triangulation of findings. Results confirm that both few-shot prompting and fine-tuning with IDRE significantly enhance the models’ capacity for empathetic language generation. Applications include empathetic AI in healthcare, such as virtual assistants for patient support, and demonstrate promising generalization to other domains. All datasets, prompts, fine-tuned models, and scripts are publicly available to ensure transparency and reproducibility.

1. Introduction

The integration of emotional intelligence into artificial intelligence systems has emerged as a key factor in the advancement of human–computer interaction [1,2,3,4]. In particular, the development of conversational agents capable of expressing empathy is gaining increasing attention, given their potential to foster user-centric interactions.
Empathy is a multidimensional construct defined as the ability to understand and share another individual’s emotional states while maintaining a clear distinction between self and other. In the scientific literature, empathy is typically divided into two main components: affective empathy, which involves emotional sharing, and cognitive empathy, which refers to the rational understanding of another’s emotions [5].
From a neuroscientific perspective, empathy involves specific brain circuits, including the medial prefrontal cortex and the insula, supporting the hypothesis that the brain internally simulates the emotional states of others [6].
In a clinical context, empathy is crucial in the physician–patient relationship. It is defined as the ability of a healthcare professional to understand the experiences of a patient and communicate this understanding effectively. Studies have shown that higher levels of physician empathy are associated with better clinical outcomes, increased treatment adherence, and reduced patient anxiety [7].
Furthermore, empathy is a central element in communication protocols such as the SPIKES model, used in oncology. It also serves as a protective factor against burnout in healthcare professionals, contributing to their psychological well-being and the quality of care they provide [8].
Empathetic chatbots hold significant promise to improve user experience in a wide range of domains, including customer service, education, and mental health support. Their ability to simulate empathetic responses contributes to more meaningful and emotionally attuned interactions, which are particularly valuable in contexts involving psychological vulnerability or distress.
Consider, for instance, a scenario in which a user expresses feelings of anxiety. A conventional neutral response such as “It’s important to stay calm and find a way to relax” may offer practical advice but lacks emotional resonance. In contrast, an empathetic response like “I understand that you’re feeling anxious right now, and it is okay to feel this way. Let’s take a moment to focus on something that can help you relax” demonstrates emotional attunement and validation, which are central to the therapeutic potential of empathetic communication. This distinction underscores the importance of integrating affective and cognitive empathy mechanisms into conversational agents, thereby fostering trust, emotional safety, and user engagement.
The direct development and deployment of fully empathetic AI chatbots present substantial practical, economic, and strategic challenges. Many organizations have already committed significant financial and technical resources to the creation of domain-specific conversational agents that are deeply embedded within operational infrastructures. Modifying or replacing these systems to incorporate empathetic capabilities would not only incur considerable additional costs but also pose risks of performance degradation and operational instability. This issue is particularly pronounced in vertical domains such as legal advisory, financial consulting, healthcare triage, and technical support, where chatbots are meticulously fine-tuned to deliver accurate and context-sensitive information. Embedding empathy into these models would require extensive retraining using domain-specific empathetic datasets, which are often scarce and can compromise the original task performance of the chatbot. Furthermore, the end-to-end development of a new empathetic chatbot is resource-intensive, involving multiple stages—data acquisition and annotation, model training and validation, system integration, and regulatory compliance testing. For organizations operating under budgetary constraints or lacking access to high-performance infrastructure, such an undertaking may be economically and logistically infeasible.
To address these constraints, we propose a modular architecture based on an empathy rephrasing layer. This layer, implemented through a large language model (LLM), takes the original chatbot output and returns a semantically equivalent response enriched with an empathetic tone. By decoupling the generation of empathetic expression from the core dialogue functionality, this approach enables the seamless integration of advanced affective capabilities into existing conversational systems. The result is an enhancement in user engagement and the promotion of more nuanced, human-centric interactions, without compromising the original intent of the chatbot’s response.
The modular architecture proposed in this work, as shown in Figure 1, based on an empathetic rephrasing layer, offers a cost-effective and low-risk alternative. This layer operates independently of the core chatbot, taking its original output and transforming it into a semantically equivalent response enriched with empathetic language. This design ensures that the factual accuracy and domain-specific relevance of the original response are preserved, while enhancing its emotional resonance.
Crucially, this approach enables seamless integration of empathetic capabilities into existing systems without requiring structural modifications or retraining of the base model. For example, a financial chatbot already optimized for regulatory compliance and investment advice can be augmented with empathetic responses simply by cascading the rephrasing layer, thereby improving user experience without compromising precision. Similarly, in healthcare applications, where clarity and correctness are paramount, the layer ensures that empathetic phrasing does not alter the medical intent of the original message. This modularity makes the solution highly scalable and adaptable across diverse sectors, offering a strategic pathway to human-centric AI interaction while significantly reducing development time, infrastructure costs, and operational risks.
This architecture is particularly advantageous in domains where emotional sensitivity is critical. In mental health support, it facilitates reassurance and validation; in customer service, it improves perceived helpfulness; and in education, it fosters motivational engagement. The modularity of the system ensures scalability and adaptability across sectors, offering a cost-effective pathway to human-centric AI interaction.
To implement and evaluate this approach, we conducted an extensive experimental campaign involving ten small and medium-sized LLMs, selected for their deployability on single-GPU setups. We explored both few-shot learning and fine-tuning strategies using the IDRE (Italian Dialogue for Empathetic Responses) dataset [9], which comprises 480 curated triplets focused primarily on healthcare scenarios. Each triplet includes a user message, an initial chatbot reply, and an empathetically rephrased version.
Model performance was assessed through a comprehensive, multi-layered evaluation strategy. Central to our approach is the adoption of the LLM-as-a-judge paradigm, specifically leveraging the G-Eval framework, which utilizes a state-of-the-art large language model (GPT-4o [10]) as an objective evaluator for the outputs generated by smaller models. This automated evaluation is inspired by established protocols such as SPIKES, widely used in medical communication, and is tailored to ensure a nuanced and contextually relevant assessment across five key dimensions: empathy, knowledge, coherence, fluency, and lexical variety. While the primary focus of the dataset and evaluation was the healthcare domain, this study also systematically explored the generalizability of the rephrasing capability by generating and evaluating empathetic responses in additional domains, including work, social, legal, and financial contexts. We validated the G-Eval results and ensured inter-method agreement by integrating a human annotation study. Expert annotators, trained in chatbot evaluation, assessed a representative sample of model outputs using a Likert scale across the same five dimensions; inter-annotator consistency was quantified via Fleiss’ kappa. Finally, to fully triangulate and quantitatively support our findings, we integrated traditional NLP similarity metrics BLEU [11], ROUGE [12], and BERTScore [13]. These metrics provided an essential, independent measure of the two core objectives of the task: style transfer (introducing empathetic language) and meaning preservation (maintaining semantic fidelity to the original response). This multi-pronged evaluation framework ensures both the robustness and the interpretability of our results.
The results indicate that fine-tuning large language models (LLMs) with the IDRE dataset is an effective strategy for instilling empathetic capabilities, with particularly promising outcomes on smaller models such as Llama-3.2-1B-Instruct. Moreover, fine-tuning corrected limitations observed in few-shot learning, enhancing both lexical diversity and empathy, as shown in the case of Gemma-3-1B. Finally, the empathetic skills acquired in the medical domain were shown to be transferable to other contexts, suggesting the broad generalizability of empathetic dialogue systems across diverse sectors.
These findings open new opportunities for deploying empathetic AI in a wide range of real-world applications. One potential application is in telemedicine platforms, where empathetic responses can improve patient trust and adherence to medical advice. In elderly care, empathetic chatbots can provide companionship and emotional support, helping to reduce feelings of isolation. In educational settings, tutors enhanced with empathetic capabilities can adapt their tone to better support students facing difficulties, thereby fostering a more inclusive and motivating learning environment. In addition, legal and financial advisory bots can benefit from empathetic rephrasing to deliver sensitive information—such as debt management or legal obligations—in a more considerate and human-centric manner. Finally, in human resources and workplace wellness tools, empathetic AI can assist in managing employee concerns, offering support during stressful periods, and promoting a healthier organizational climate.
Although this study focused on the Italian language, the methodology for dataset creation, model adaptation, and evaluation is inherently language-agnostic and can be extended to other linguistic contexts.
The contributions of this research are twofold. First, the primary innovation lies in the development and validation of IDRE, a resource that addresses a gap in the Italian scientific literature. Second, leveraging this dataset, we propose a novel modular architecture that demonstrates practical and scalable applicability. This architecture incorporates an empathetic reformulation layer within a decoupled design framework, enabling the integration of emotional capabilities into existing chatbot systems without requiring costly retraining or structural modifications. This approach effectively addresses both practical and economic challenges associated with empathy integration, significantly reducing development time and infrastructure costs while ensuring adaptability across diverse application domains.
To support transparency and reproducibility, the dataset, the prompts, all fine-tuned models, and the Python scripts developed in this study are publicly available through Hugging Face repositories, as detailed in Appendix A.

2. Related Works

The development of empathetic chatbots capable of understanding and responding to human emotions is an increasingly important research area [14]. However, achieving this goal requires high-quality datasets that capture human–machine interactions enriched with empathetic elements.
Despite the growing availability of datasets for machine learning and natural language processing, there is a notable lack of resources specifically designed for empathetic chatbots in the Italian language, which presents a significant challenge. While some datasets—such as those referenced in [15,16,17,18]—include emotional information, they primarily focus on labeling words or sentences with general emotional tags. These resources often lack the contextual depth and complexity needed to support the development of truly empathetic conversational agents.
Within the current research landscape, a line of studies has emerged comparing the empathetic capabilities of LLMs with those of humans. The results indicate that LLMs can, in certain contexts, outperform human performance in empathy-related tasks. A systematic review [19] showed that 78.6% of responses generated by ChatGPT-3.5 were preferred over those produced by humans. Further investigations [20], conducted on a sample of 1000 participants, confirmed the statistically significant superiority of LLMs in generating empathetic responses. In particular, the GPT-4o model showed a 31% increase in responses judged as good compared to the human baseline, while other models like Llama-2 (24%), Mixtral-8x7B (21%), and Gemini-Pro (10%) showed smaller improvements.
To further enhance the empathetic capabilities of LLMs, various methodologies have been proposed [21]. These include semantically similar in-context learning, a two-phase interactive generation approach, and the integration of knowledge bases. Experimental evidence suggests that applying these techniques can significantly improve model performance, helping to surpass the state of the art.
Evaluation remains a critical challenge, as it is inherently multidimensional and subjective, making objective comparison difficult. In this context, Ref. [22] highlighted that generic LLM models like GPT-4o can show limited performance when evaluated using multidimensional metrics. In contrast, classifiers based on smaller models, such as Flan-T5, subjected to fine-tuning on specific data, have been shown to outperform larger models on certain evaluation criteria.
Recent works have also explored novel architectures and evaluation strategies for empathetic chatbots. For example, Ref. [23] proposed a system that integrates neural and physiological signals to enhance real-time emotion recognition and empathetic response generation. Ref. [24] conducted a systematic review of empathetic conversational agents in mental health, highlighting the effectiveness of hybrid ML engines. Ref. [25] provided a comprehensive review of emotionally intelligent chatbot architectures and evaluation methods. Ref. [26] introduced a communication framework based on psychological empathy and positive psychology, which demonstrated improved user well-being in mental health applications.
The task of generating empathetic responses can be conceptualized as a text style transfer (TST) problem, a subfield of natural language generation focused on altering the stylistic properties (e.g., sentiment, formality, and tone) while preserving semantic content. Transforming a non-empathetic response into an empathetic one requires the model to maintain the core meaning (“what” is being said) while modifying the mode of expression (“how” it is said), a process that directly aligns with the definition of TST.
Historically, research on TST has been hampered by a shortage of parallel training data and a lack of standardized evaluation metrics. Early work attempted to overcome these challenges through the use of non-parallel data. For example, Ref. [27] introduced a generative neural model based on Variational Auto-Encoders (VAEs) to learn disentangled latent representations, enabling controlled text generation by manipulating specific attributes such as sentiment or tense. Similarly, Ref. [28] proposed a cross-alignment method for style transfer on non-parallel texts, relying on the separation of style and content and assuming a shared latent content distribution across corpora with different styles. This approach has been successfully applied to tasks such as sentiment modification and formality conversion.
A significant advancement was made by [29], who formalized the evaluation problem and introduced two new metrics, “transfer strength” and “content preservation,” both highly correlated with human judgments. This shifted the community’s focus toward more rigorous and objective evaluation. More recently, Ref. [30] conducted a “meta-evaluation” of existing metrics, assessing their robustness and correlation with human judgments across multiple languages, including Italian, Brazilian Portuguese, and French.
Within this context, empathetic rephrasing emerges as a specific and complex case of TST, where the style to be transferred involves not just lexical changes but a nuanced combination of linguistic and structural choices aimed at conveying empathy.
A critical issue highlighted by these studies is the trade-off between performance and computational costs, given that the most advanced and high-performing models require a considerable expenditure of resources. In this work, we focus on leveraging small and medium-sized models to limit costs while evaluating their effectiveness on the empathetic rephrasing task. Additionally, we employ automated evaluation to further reduce the need for costly human evaluators.

3. Methodology

This section describes the development of the empathetic rephrasing layer aimed at enhancing empathetic communication. It focuses on the creation of the IDRE dataset, specifically designed for training models to perform empathetic rephrasing. It details the methodology used to construct the dataset, including the generation of original sentences and their empathetic enrichment, followed by a rigorous evaluation process conducted by human annotators to ensure quality and consistency. It then illustrates the fine-tuning of LLMs using the IDRE dataset, describing the architectures employed and the training configurations adopted. Finally, it presents the evaluation of the empathetic rephrasing models through the LLM-as-a-judge paradigm, outlining the criteria and procedures, inspired by the SPIKES protocol, used to assess the effectiveness of the models in increasing empathetic capabilities.

3.1. IDRE Dataset

The dataset, comprising 480 sentences and approximately 18,000 tokens, is structured into triplets of sentences: a user query, a corresponding chatbot response, and an empathetically enhanced version of that response. Table 1 provides examples of the generated question–answer pairs and the rephrased answer with empathy.
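To make the structure concrete, the sketch below shows one triplet in dictionary form. The keys follow the column labels used in the annotation study (QUESTION, ANSWER, ANSWER WITH EMPATHY), while the values are placeholders rather than actual dataset content; the released file format may differ.

```python
# Illustrative IDRE triplet (placeholder values; keys follow the column labels
# referenced in the annotation study, not necessarily the released file format).
idre_triplet = {
    "QUESTION": "<user query, e.g., a question about post-operative care>",
    "ANSWER": "<original, neutral chatbot response>",
    "ANSWER WITH EMPATHY": "<semantically equivalent response with an empathetic tone>",
}
```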

3.1.1. Dataset Creation

The IDRE dataset was generated using the Llama2-13B language model [31], which, at the time of its creation, represented the state of the art in large language models. The data generation process was carried out in two distinct phases, as illustrated in Figure 2.
  • QnA Sentence Generator: In this phase, question–answer pairs were generated with a primary focus on the healthcare sector to ensure the development of empathetic and compassionate responses. Thirteen specific topics were selected to simulate typical chatbot interactions within a medical context: ‘information on breast cancer’, ‘breast cancer prevention’, ‘therapies for breast cancer’, ‘psychological support after a cancer diagnosis’, ‘life expectancy after a cancer diagnosis’, ‘psychological support after surgery’, ‘hospital admissions’, ‘post-operative care’, ‘information on leukemia’, ‘psychological support’, ‘anti-cancer therapies’, ‘information on stroke’, and ‘preparation for surgeries’.
  • Empathy Enhancement: The initial chatbot responses underwent an empathy enhancement process. This involved leveraging the Llama2 model again to modify the responses, incorporating expressions of concern or appreciation and substituting specific words to foster a more supportive tone.

3.1.2. IDRE Dataset Evaluation Methodology and Results

To guarantee the quality of the generated sentences, a rigorous evaluation process was implemented. For each user query, two versions of the chatbot response were evaluated: the standard (non-empathetic) response produced by the original chatbot and the empathetic response generated by applying the empathetic rephrasing layer to the same sentence. Twelve volunteer annotators, IT developers and project managers with solid experience in the chatbot domain, participated in the assessment. Following comprehensive training, each evaluator was assigned 70 sentences: 40 unique to each evaluator for dataset creation and 30 common to all evaluators solely for measuring inter-annotator agreement with Fleiss’ kappa coefficient [32], as shown in Table 2.
The evaluation was conducted using metric-specific questions, requiring responses on a 1-to-5 Likert scale (1: Totally disagree; 5: Totally agree). To obtain a more robust analysis that is less subject to small variations, the annotation categories were grouped into three macro-categories: scores 1 and 2, score 3 (neutral), and scores 4 and 5. The key evaluation dimensions utilized include the following:
  • Bot Sentence Correctness: This metric assesses the absence of spelling, grammatical, and punctuation errors in both the user’s question and the model-generated response. The evaluation is based on the following criterion: “Is the text in the QUESTION and ANSWER columns grammatically and semantically correct?”
  • Absence of English Words in Bot Sentences: This metric verifies whether any English words or phrases are present in the user’s question or the model’s response. Exceptions are made for English terms commonly used in Italian (e.g., “badge” and “sport”). The evaluation criterion is as follows: “No English words or phrases are present in the QUESTION and ANSWER columns, unless they are commonly used in Italian.”
  • Empathetic Answer Correctness: This metric evaluates the absence of spelling, grammatical, and punctuation errors in the model-generated response that includes empathetic elements. The evaluation is based on the following criterion: “Is the text of the empathetic response grammatically and semantically correct?”
  • Absence of English Words in Empathetic Sentences: This metric checks for the presence of English words or phrases in the empathetic responses generated by the model, excluding those commonly used in Italian. The evaluation criterion is as follows: “No English words or phrases are present in the empathetic response, unless they are commonly used in Italian.”
  • Semantic Coherence: This metric measures the semantic similarity between the standard response and the empathetic response generated by the model, ensuring that no concepts are missing or contradictory. The evaluation is based on the following criterion: “The empathetic response conveys the same semantic meaning as the standard chatbot response. No concepts are missing or contradictory.”
  • Empathy Increase: This metric assesses whether the empathetic response demonstrates a meaningful increase in empathy compared to the standard response. The evaluation criterion is as follows: “The sentence in the ANSWER WITH EMPATHY column expresses more empathy than the sentence in the ANSWER column.”
Fleiss’ kappa was employed to quantify the degree of concordance among multiple annotators, while accounting for agreement occurring by chance. This coefficient ranges from −1 to 1, where negative values indicate agreement below chance, values between 0 and 0.20 suggest slight agreement, values from 0.21 to 0.40 suggest fair agreement, values from 0.41 to 0.60 suggest moderate agreement, values from 0.61 to 0.80 suggest substantial agreement, and values above 0.80 denote almost perfect agreement.
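As a minimal sketch of this computation, the snippet below groups the Likert scores into the three macro-categories described above and derives Fleiss’ kappa with statsmodels; the ratings array contains illustrative values, not the study’s actual annotations.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def to_macro(score: int) -> int:
    # 1-2 -> 0 (disagree), 3 -> 1 (neutral), 4-5 -> 2 (agree)
    return 0 if score <= 2 else 1 if score == 3 else 2

# ratings[i, j] = Likert score given by annotator j to shared sentence i (illustrative values)
ratings = np.array([
    [4, 5, 4],
    [3, 3, 2],
    [5, 4, 4],
    [2, 1, 3],
])
macro = np.vectorize(to_macro)(ratings)
table, _ = aggregate_raters(macro)  # per-sentence counts for each macro-category
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```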
The results, summarized in Table 2, reveal that the highest agreement levels were observed for evaluation dimensions related to the presence of English words—likely due to the relative simplicity of this annotation task. In contrast, evaluation dimensions involving more nuanced linguistic features yielded lower, yet still acceptable, agreement levels, generally falling within the moderate range.
The evaluation of the IDRE dataset demonstrated an overall satisfactory quality, as illustrated in Figure 3. Across all assessed evaluation dimensions, the generated sentences consistently received positive ratings from annotators, as evidenced by the predominance of favorable evaluations (light blue bars in the chart).
For instance, the semantic coherence metric received 82% positive ratings, 10% neutral ones, and only 8% negative ones, indicating a strong adherence of the responses to the intended meaning. This result suggests that the enhancement of empathetic tone did not compromise the informational quality of the responses. On the contrary, it enriched their communicative dimension, improving emotional resonance while preserving semantic content.
However, a more in-depth analysis of the evaluation dimensions with lower scores highlighted two main areas for attention and improvement: grammatical errors and the presence of non-Italian terms. A substantial subset of responses exhibited recurring grammatical issues. A representative example is the sentence “Ohimini, cara/o utente, è comprensibile che durante il trattamento del tumore possa esserti difficile gestire i sintomi...”, which contains the non-existent term “Ohimini” and a typographical error in the word “supporti”. Moreover, several sentences included non-Italian terms, predominantly English, as in “Per prevenire le infezioni after surgery...”. This phenomenon is likely due to the multilingual nature of the language model employed, which was selected because, at the time this work was carried out, a dedicated Italian model was not yet available.
These findings underscore the importance of refining the model training process by prioritizing Italian vocabulary and syntactic structures. Alternatively, employing a model natively developed or specifically fine-tuned for Italian could significantly enhance both linguistic accuracy and expressive naturalness.
Finally, although the increase in empathetic tone was positively evaluated in 306 sentences, a non-negligible portion of the responses (173 sentences, equal to 36% of the total) did not show a significant improvement compared to the original versions. This observation highlights the potential for further optimization of the generation process to ensure a more consistent and meaningful enhancement in affective communication.

3.2. Evaluating the IDRE Dataset for Empathetic Rephrasing

The evaluation of the IDRE dataset for empathetic rephrasing was conducted by leveraging it both for fine-tuning large language models (LLMs) and for few-shot learning experiments. The selection of LLMs followed well-defined criteria aimed at exploring a range of architectures and parameter scales while ensuring computational feasibility. The subsequent sections detail the adopted methodology, including model selection, dataset configuration, computational environment, and training parameters.

3.2.1. Model Selection

This section describes the ten distinct LLMs selected for this study, detailed in Table 3, categorized as follows:
  • Italian-Optimized Models: Two models were chosen for their specificity or optimization for the Italian language: Minerva-7B-instruct-v1.0 [33] and LLaMAntino-3-ANITA-8B-Inst-DPO-ITA [34]. This selection aims to evaluate the performance of models with a linguistic predisposition tailored to the Italian context.
  • Medium-Sized Models (approx. 7–9 billion parameters): Five models fall into this category: Qwen2.5-7B-Instruct [35], Mistral-7B-Instruct-v0.3 [36], Llama-3.1-8B-Instruct [37], granite-3.1-8b-instruct [38] and gemma-2-9b-it [39]. The choice of this parameter range was driven by the necessity of enabling fine-tuning on a single GPU, thereby optimizing computational efficiency.
  • Small-Sized Models (approx. 1–3.8 billion parameters): Three smaller models, gemma-3-1b [40], Llama-3.2-1B-Instruct [37], and Phi-3.5-mini-instruct [41], were included to analyze the performance and efficiency of compact models.
This extensive fine-tuning effort seeks to ascertain whether the empathetic transformations within the IDRE dataset effectively translate into enhanced empathetic response generation capabilities across a diverse range of pre-trained models. By systematically assessing the performance of these fine-tuned LLMs, we aim to provide robust evidence regarding the IDRE dataset’s contribution to fostering more empathetic and human-centric AI interactions.

3.2.2. IDRE Dataset Preparation

To ensure that only high-quality examples were included in the training set for LLMs, a filtering process was applied to the IDRE dataset. Specifically, all sentences that received a human evaluation score below 3 were excluded. This procedure effectively removed sentences containing linguistic errors or non-contextualized English terms or lacking demonstrable improvements in empathy. As a result of this selection, a refined subset of 223 sentences was obtained, each positively rated by evaluators, thus forming a high-quality corpus suitable for model training.
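A minimal sketch of this filtering step is shown below, assuming the human scores are available as one column per evaluation dimension in a tabular file; the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical annotation export: one row per triplet, one score column per dimension.
df = pd.read_csv("idre_human_scores.csv")
score_cols = ["answer_correctness", "empathetic_correctness",
              "no_english", "semantic_coherence", "empathy_increase"]

# Keep only triplets rated 3 or higher on every dimension (scores below 3 are excluded).
train_subset = df[(df[score_cols] >= 3).all(axis=1)]
print(len(train_subset))  # expected to yield the 223-sentence training corpus
```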

3.2.3. Computational Environment

The fine-tuning of each model was executed on a Microsoft Azure cloud computing platform, utilizing a virtual machine of type Standard_NC16as_T4_v3. This configuration was specifically chosen for its processing capabilities, which include 16 vCPU cores, 110 GB of RAM, and 352 GB of disk space. The key computational element for training acceleration was the NVIDIA Tesla T4 GPU, which offers high performance for computationally intensive operations typical of deep neural network training thanks to its Turing architecture and dedicated Tensor Cores. The choice of a single GPU of this caliber allowed for optimization of computational efficiency, enabling the fine-tuning of even medium-sized models (such as those in the 7–9 billion parameter category) within reasonable timeframes.

3.2.4. Training Parameters and Fine-Tuning Strategy

The fine-tuning strategy followed a supervised training approach, where models were exposed to the IDRE dataset to learn to generate empathetic responses. The training parameters were uniformly configured for all ten selected models to ensure a fair and reproducible basis for comparison:
  • Learning Rate (LR): A learning rate of 2 × 10⁻⁴ was set. This relatively low value is common in fine-tuning pre-trained models to avoid excessive perturbation of already optimized weights and to allow for a finer approximation to the loss function’s minimum. The choice of an appropriate LR is critical for balancing convergence speed with training stability.
  • Number of Epochs: Each model was trained for 3 full epochs. The number of epochs was empirically determined to balance the model’s learning capacity with the risk of overfitting, which refers to the model’s tendency to adapt excessively to the training data, losing the ability to generalize to new data. By monitoring performance on the validation set, it was observed that 3 epochs were sufficient to achieve significant improvement without showing clear signs of pronounced overfitting.
  • Batch Size: A batch size of 16 was used. The batch size represents the number of training examples processed in parallel before the model’s weights are updated. A batch size of 16 is a compromise between computational efficiency (a larger batch size can better utilize GPU resources) and gradient stability (smaller batch sizes can lead to noisier gradients but potentially better local minima). This size allowed for optimized memory utilization of the Tesla T4 GPU while maintaining a stable learning process.
  • Random Seed: To ensure reproducibility of the results, the random seed was set to 42.
For the optimization process, the AdamW (Adam with Weight Decay) [42] algorithm was used, a variant of the Adam optimizer that includes weight decay regularization to prevent overfitting. The loss function used was Cross-Entropy Loss, which is standard for sequence generation tasks and language modeling, measuring the difference between the model’s predicted probability distribution and the true distribution.
For all models, the application of techniques such as Parameter-Efficient Fine-Tuning (PEFT) [43] was also considered to optimize resources and training times. The key configurations are as follows (a combined configuration sketch is provided after this list):
  • Rank = 16: The rank determines the number of additional parameters that are trained. A value of 16 means that for each weight matrix in the model, two lower-rank matrices are added, one with dimensions (d, 16) and the other with (16, d). These two small matrices replace the fine-tuning of the large original weight matrix, drastically reducing the number of trainable parameters and, consequently, VRAM consumption.
  • Target Modules: This indicates the model layers to which LoRA is applied. In this case, the Q (Query), K (Key), V (Value), and O (Output) matrices of the attention mechanism, along with the gate_proj, up_proj, and down_proj layers of the MLP (Multi-Layer Perceptron) block, have been selected for fine-tuning.
  • Lora Alpha = 8: This parameter scales the learning of the LoRA matrices. A value lower than r (lora_alpha < r) reduces the importance of the new matrices, acting as a kind of internal learning rate. A value equal to r (lora_alpha = r) is the standard configuration, but here it is set to 8, which implies an implicit regularization that mitigates the risk of overfitting and helps the model generalize better.
  • Lora Dropout = 0: Dropout is used to randomly deactivate some neurons during training, preventing overfitting. Setting it to 0 means that this technique is not being used.
  • Bias = “none”: Bias is an additional parameter added to an output. Setting it to none means no bias is added to the new LoRA layers, which optimizes memory usage.
  • Use Gradient Checkpointing = “unsloth”: This technique trades computation time for memory, significantly reducing VRAM usage. The unsloth variant is an optimized version that allows training models with very long contexts, reducing VRAM consumption by 30% and enabling larger batch sizes.
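The sketch below combines the training parameters and LoRA settings listed above using the Unsloth and TRL APIs. The model identifier and dataset path are placeholders, the dataset is assumed to be pre-formatted into prompt/empathetic-target texts, and argument names may shift slightly across library versions; this is not the exact training script released with the paper.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Placeholder model identifier; any of the ten selected LLMs could be used here.
model, tokenizer = FastLanguageModel.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # LoRA rank
    lora_alpha=8,                          # alpha < r acts as implicit regularization
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # memory-saving checkpointing variant
    random_state=42,
)

# Hypothetical pre-formatted training file: one prompt/empathetic-target pair per record.
dataset = load_dataset("json", data_files="idre_train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=TrainingArguments(
        learning_rate=2e-4,                # LR = 2 x 10^-4
        num_train_epochs=3,
        per_device_train_batch_size=16,
        optim="adamw_torch",               # AdamW with weight decay
        seed=42,
        output_dir="idre-ft",
    ),
)
trainer.train()
```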

3.3. Model Evaluation

Recent advances in LLMs have revolutionized automated text generation, introducing unprecedented capabilities [44]. Despite these strides, evaluating the quality of generated text remains a complex challenge. Traditional automatic metrics, such as BLEU, ROUGE, and METEOR [45], primarily based on lexical overlap, often prove inadequate in capturing critical linguistic nuances like semantic similarity, factual consistency, and fluency. While BERTScore represented a significant improvement by incorporating semantic similarity through contextual embeddings, human evaluation persists as the gold standard. This is due to its inherent ability to discern qualitative aspects such as creativity and pragmatic utility; however, it is intrinsically costly, time-consuming, and not scalable for large-scale evaluations [46]. In response to these limitations, the “LLM-as-a-judge” paradigm has emerged, leveraging the advanced comprehension capabilities of LLMs to assess the quality of generated text. For the implementation of this paradigm, the G-Eval [47] framework was adopted, which comprises the following key components:
  • Zero-Shot or Few-Shot Prompting: The “judge” LLM receives instructions via prompts for evaluating the generated text. Such instructions include the input provided to the generator model, the generated output, and, optionally, a reference text.
  • Chain-of-Thought (CoT) Prompting: This technique encourages the LLM to articulate its step-by-step reasoning process. This enhances the reliability and transparency of the evaluation, contributing to a reduction in hallucinations and an improvement in judgment consistency [48].
  • Criterion-Based Evaluation: The “judge” LLM evaluates the generated text based on predefined qualitative criteria (e.g., fluency, coherence, and relevance). For each criterion, detailed explanations are provided, and a numerical score is assigned, culminating in an overall judgment.
  • Scoring Mechanism: The numerical scores assigned by the “judge” LLM are aggregated to derive quantitative metrics, while the rationales generated via CoT support qualitative analysis.
In the present work, we adopted the “Criterion-Based Evaluation” approach.
To better adapt the evaluation to the context of empathetically conveying information, additional evaluation dimensions were defined based on the principles of the SPIKES protocol [49], a well-established model for communicating difficult news in a medical setting. The phases of the SPIKES protocol are described below:
  • Setting: This initial phase focuses on creating an appropriate environment for the discussion. It includes choosing a private and comfortable location, ensuring sufficient time without interruptions, allowing the presence of significant others (if desired), and adopting open and welcoming body language. The goal is to establish an atmosphere of safety and trust [49].
  • Perception: Before delivering the news, it is crucial to explore the patient’s understanding and expectations regarding their clinical situation. Questions like “What do you already know about your illness?” or “What is your main concern?” allow the physician to assess the patient’s level of awareness and tailor communication accordingly, avoiding redundant information or further confusion [50].
  • Invitation: After understanding the patient’s perception, the physician must explicitly ask how much the patient wishes to know. This phase respects patient autonomy and their capacity to manage information. Some patients may want to know all the details, while others might prefer a more gradual approach or delegate some information to family members [51].
  • Knowledge: This is the phase where the news is actually delivered. It is crucial to do so clearly, concisely, and using understandable language, avoiding medical jargon. Information should be provided in small “chunks,” allowing the patient to process each piece before moving to the next. It is important to regularly check the patient’s understanding, for example, by asking “Have I explained myself clearly?” or “Is there anything unclear to you?” Honesty and transparency are fundamental, alongside maintaining an empathetic and supportive tone [49].
  • Empathy: After delivering the news, the patient will likely show an emotional reaction (sadness, anger, denial, or fear). The physician must acknowledge and validate these emotions, offering support and understanding. Phrases like “I understand this news is difficult to accept” or “It’s normal to feel this way in such a situation” can help the patient feel understood and less alone. A moment of silence can be very powerful in allowing the patient to process information and emotions [52].
  • Strategy and Summary: The final phase focuses on planning next steps and summarizing the information provided. Together with the patient, treatment options, care plans, and available support resources are discussed. Realistic goals are set, and any remaining questions are answered. This phase aims to instill a sense of hope and control, even in the face of a difficult prognosis, and to ensure the patient feels supported in their future journey [49].
While acknowledging the comprehensiveness of the SPIKES protocol for physician–patient communication, attention was focused on its “Knowledge” and “Empathy” phases as central dimensions for evaluation, given their high adaptability to the chatbot interaction context.

Evaluation Methodology

As illustrated in Figure 4, the model evaluation process is structured into distinct phases. The procedure started with the generation of a synthetic test set using GPT-4o; a selection of examples can be found in Table 4. This approach was necessary due to the limitations of the existing IDRE dataset, which was small and had already been used for fine-tuning the model. Creating a new, independent test set was essential to ensure a robust and unbiased evaluation. Leveraging a powerful LLM like GPT-4o enabled the generation of high-quality and diverse sentences.
These new sentences characteristically mimic a chatbot’s typical responses: they are concise, direct, and deliver information without overt empathy. The resulting test set consists of 200 sentences: 100 focus on the medical domain, and the remaining 100 are uniformly distributed across four non-medical domains (financial, legal, social, and work-related). The inclusion of diverse domains aims to assess the test set’s effectiveness in eliciting empathetic responses even in non-medical contexts.
Each sentence of the test set was reformulated using three different model configurations: the base model (BM), the few-shot learning model (FSL), and the fine-tuned model (FT). The specific prompts utilized for this reformulation are detailed in Appendix C. It is important to note that the same core prompt was applied to both the BM and FT models. In contrast, the FSL configuration used an augmented prompt incorporating few-shot examples extracted from the IDRE dataset.
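The snippet below illustrates, under stated assumptions, how the FSL prompt can be assembled by prepending IDRE exemplars to the core rephrasing instruction. Both the instruction wording and the exemplar keys are placeholders; the exact prompts used in the study are those reported in Appendix C.

```python
# Placeholder instruction; the actual core prompt is reported in Appendix C.
CORE_PROMPT = ("Rephrase the following chatbot answer in Italian, adding an empathetic tone "
               "while preserving its meaning.")

def build_fsl_prompt(idre_exemplars, target_sentence):
    # idre_exemplars: list of dicts carrying the ANSWER / ANSWER WITH EMPATHY fields of IDRE.
    shots = "\n\n".join(
        f"Answer: {ex['ANSWER']}\nEmpathetic answer: {ex['ANSWER WITH EMPATHY']}"
        for ex in idre_exemplars
    )
    return f"{CORE_PROMPT}\n\n{shots}\n\nAnswer: {target_sentence}\nEmpathetic answer:"
```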
All model outputs were subsequently evaluated using the “LLM-as-a-judge” paradigm across five distinct evaluation dimensions.
  • Fluency: Assesses the grammatical correctness, naturalness, and overall readability of the reformulated sentence in Italian.
  • Coherence: Measures semantic fidelity, ensuring that the reformulated sentence conveys the same meaning as the original sentence.
  • Lexical Variety: Evaluates the diversity of vocabulary used in the reformulated sentence compared to the original.
  • Knowledge: Determines how clearly and understandably the reformulated text transmits information, avoiding technical jargon or overly direct phrasing.
  • Empathy: Quantifies the extent to which the reformulated text acknowledges and addresses the user’s emotional reactions with appropriate empathy.
Each metric was numerically scored on a scale from 1 (lowest) to 5 (highest). While fluency, coherence, and lexical variety measure the quality of the rephrasing itself, the dimensions of knowledge and empathy were specifically derived from the SPIKES protocol to evaluate the communication style.
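As a sketch of the criterion-based judging step, the code below queries GPT-4o for one 1-to-5 score per dimension. The judge prompt shown is a simplified placeholder rather than the actual G-Eval prompt, which additionally includes criterion definitions and chain-of-thought instructions.

```python
from openai import OpenAI

client = OpenAI()
DIMENSIONS = ["fluency", "coherence", "lexical variety", "knowledge", "empathy"]

def judge(original: str, rephrased: str, dimension: str) -> str:
    # Simplified placeholder prompt; the real G-Eval prompt also provides
    # criterion definitions and asks for step-by-step reasoning.
    prompt = (
        "You are evaluating an empathetic rephrasing of an Italian chatbot response.\n"
        f"Original sentence: {original}\n"
        f"Rephrased sentence: {rephrased}\n"
        f"Rate the rephrasing for {dimension} on a scale from 1 (lowest) to 5 (highest). "
        "Reply with the score only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

original = "<original neutral test-set sentence>"
rephrased = "<model-generated empathetic rephrasing>"
scores = {d: judge(original, rephrased, d) for d in DIMENSIONS}
```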
To validate and corroborate the results obtained through the automatic LLM-based metrics (G-Eval), a human evaluation was integrated. The primary purpose was to verify the inter-methodological agreement between the G-Eval assessments and those conducted by human annotators.
A random subset of 100 sentence pairs was selected from the synthetic test set. Each pair consisted of the original (neutral) sentence and its empathy-augmented version generated by one of the three configurations previously described.
Three independent evaluators, all with proven experience in NLP and chatbots, received a training briefing on the evaluation dimensions and operational instructions. For each sentence, the annotators assessed the response based on the five G-Eval metrics, using a 5-point Likert scale (where 1 = totally disagree; 5 = totally agree). For the purpose of inter-annotator agreement analysis, the scores were aggregated into three macro-categories: [1–2] disagree, [3] neutral, and [4–5] agree. The consistency among evaluators was quantified by calculating Fleiss’ kappa coefficient.
In addition to the human assessment, consolidated NLP similarity metrics such as BLEU, ROUGE, and BERTScore were also calculated, which are essential for triangulating the quantitative results. The empathetic rephrasing task is inherently dual-objective, requiring both style transfer (shifting from a neutral to an empathetic tone) and meaning preservation (maintaining the factual core). The interpretation of these metrics must therefore reflect this duality.
In our configuration, the original sentence (the neutral source provided to the model) was used as the reference, while the transformed sentence (the empathetic rephrasing) was used as the candidate. This setup is crucial for directly measuring both the magnitude of the stylistic transformation and the fidelity to the original message.
The selected metrics are as follows (a computation sketch is provided after the list):
  • BLEU-4 (Bilingual Evaluation Understudy): This metric was employed to quantify lexical overlap based on n-gram precision (up to n = 4). In the context of style transfer, a low BLEU score is expected, and indeed desirable, as it confirms that the model is actively introducing new terminology (e.g., emotional markers) necessary to deviate from the neutral source style.
  • ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation—Longest Common Subsequence): This metric assesses the recall of the core content, offering a slightly more permissive view on lexical preservation than BLEU by focusing on the longest common sequence of tokens between the reference and the candidate.
  • BERTScore F1: Using contextual embeddings derived from the BERT architecture, this metric measures deep semantic similarity. It serves as the primary indicator for meaning preservation, quantifying the degree to which the factual and conceptual content of the original sentence is maintained in the empathetic transformed sentence, independent of surface-level lexical changes.
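A minimal computation sketch for the three metrics, with the neutral source as reference and the empathetic rephrasing as candidate, is shown below. The specific libraries (sacrebleu, rouge-score, bert-score) are assumptions; the released scripts may use different implementations.

```python
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "<original neutral chatbot sentence>"   # the source provided to the model
candidate = "<empathetically rephrased sentence>"   # the model's output

# BLEU-4: low values are expected (and desirable) when the style shifts markedly.
bleu = sacrebleu.sentence_bleu(candidate, [reference]).score

# ROUGE-L: recall of the core content via the longest common subsequence.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# BERTScore F1: semantic similarity, the primary indicator of meaning preservation.
_, _, f1 = bert_score([candidate], [reference], lang="it")

print(bleu, rouge_l, f1.item())
```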

4. Results and Discussion

This section presents and discusses the results of the empathetic rephrasing task, designed to improve chatbot communication by enriching responses with empathetic content. The evaluation was conducted using the IDRE dataset as the primary resource, applying both fine-tuning and few-shot learning techniques across a range of large language models.
The analysis is structured in five parts. It begins with an assessment of model performance within the medical domain, the original context of the IDRE dataset. It then examines the generalization of empathetic capabilities to other domains, including legal, financial, social, and workplace settings. Subsequently, it focuses on models specifically adapted for the Italian language, evaluating their configurations and effectiveness. The fourth part introduces the validation of evaluation methods, where we performed agreement analysis using Fleiss’ kappa and complemented this with standard rephrasing quality metrics such as BLEU, ROUGE and BERTScore to assess lexical and semantic fidelity. The final part addresses efficiency, comparing model performance with computational costs and discussing the feasibility of automated evaluation in contrast to human annotation.

4.1. Medical Domain Performance

The first part of the analysis focuses on the medical domain, where the IDRE dataset was originally designed and annotated. This domain is particularly relevant for empathetic communication, as it involves sensitive and emotionally charged interactions.
Overall, fine-tuned models consistently outperformed few-shot baselines, especially in empathy-related metrics. Notably, smaller models such as Llama-3.2-1B-Instruct and Gemma-3-1B demonstrated the most significant improvements. Llama-3.2-1B-Instruct showed an increase of 17.08% with FSL and a remarkable 32.3% with FT (see Table 5). These increments were calculated by measuring the average percentage improvement of each model across all evaluated dimensions for each configuration (FSL and FT) compared to the baseline performance of the BM (see tables in Appendix B). For each dimension, the score of the model using FSL or FT was compared to the corresponding score of the BM, and the relative percentage increase was computed. The final reported value represents the mean of these percentage increases across all dimensions, providing a comprehensive view of the model’s overall enhancement. The overall mean increments are reported together with their 95% confidence intervals (CIs): FSL 0.013 (95% CI: −0.053, 0.080) and FT: 0.067 (95% CI: 0.001, 0.133). This data is crucial, suggesting that for models with fewer parameters, FSL and FT techniques can have a proportionally greater impact, enabling them to achieve performance levels closer to those of larger, computationally more expensive models. This result opens promising perspectives for implementing lighter models in resource-constrained environments.
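A sketch of this computation is given below: the per-dimension scores of the FSL or FT configuration are compared with the BM scores, the relative increases are averaged, and a 95% confidence interval is attached. Both the numbers and the t-based interval are illustrative assumptions, not the study’s actual values or its exact CI procedure.

```python
import numpy as np
from scipy import stats

dimensions = ["empathy", "knowledge", "coherence", "fluency", "lexical_variety"]
bm_scores = np.array([2.9, 3.8, 4.3, 4.2, 3.1])   # illustrative base-model scores
ft_scores = np.array([4.0, 4.1, 4.4, 4.4, 4.0])   # illustrative fine-tuned scores

# Relative improvement per dimension, then averaged across dimensions.
gains = (ft_scores - bm_scores) / bm_scores
mean_gain = gains.mean()

# Illustrative 95% CI based on the t-distribution over the per-dimension gains.
ci_low, ci_high = stats.t.interval(0.95, len(gains) - 1,
                                   loc=mean_gain, scale=stats.sem(gains))
print(f"mean improvement: {mean_gain:.1%} (95% CI: {ci_low:.3f}, {ci_high:.3f})")
```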
A peculiar case is represented by the Gemma-3-1B model. Initially, the application of FSL led to a significant penalty of −20.42%. This performance decrease was attributed to the model’s tendency, in FSL mode, to reproduce the original sentence instead of generating an effective rephrasing. This behavior negatively impacted the evaluation dimensions, particularly “lexical variety” (average score of 1.28) and “empathy” (average score of 1.59), as shown in Figure 5. However, the model’s behavior was substantially corrected with the application of fine-tuning. With FT, “lexical variety” drastically increased to 4.34 and “empathy” to 4.93, positioning Gemma-3-1B among the top-performing models for these specific evaluation dimensions after fine-tuning. This phenomenon suggests that for certain models, fine-tuning can act as a corrective mechanism for intrinsic limitations observed during few-shot learning, significantly enhancing their ability to generate diverse and empathetic text, which are crucial attributes in the medical domain for effective and patient-centered communication. A similar phenomenon, although starting from lower performance than the base model, is also observed with the Llama-3.2-1B-Instruct model, as illustrated in Figure 6.
Table 6 illustrates an example of a rephrased sentence generated by the LLaMAntino-3-ANITA-8B-Inst-DPO-ITA model in its base model (BM), fine-tuning (FT), and few-shot learning (FSL) configurations. It is observed that the original sentence directly conveys information about a severe respiratory infection.
In the BM version, the model reformulates the sentence with all the information from the original but without an increase in empathy.
With FSL, the model inserts phrases such as “I am concerned” (“sono preoccupato” in the original Italian) and “I want to reassure you” (“voglio assicurarmi” in the original Italian) with the aim of empathizing with the patient; however, the necessity for urgent treatment is not mentioned.
Conversely, in the FT version, the concerning health status is immediately highlighted, followed by information regarding the presence of a severe respiratory infection, and finally, details about the urgent intervention are provided.
This example clearly demonstrates how the FT-generated sentence is the most empathetic while retaining all the essential information from the original statement.
Table 7 presents the relative performance improvements of all models in FT and FSL configurations compared to the baseline model across all evaluation dimensions.
Among these, empathy emerges as the metric with the most substantial improvement, particularly in the FT configuration, where gains reach up to +40% relative to the BM.
Conversely, models with approximately 1 billion parameters (LLaMA-3.2-1B-Instruct and Gemma-3-1B) exhibited poor performance in FSL for empathy and lexical variety. Qualitative analysis indicates that these models frequently failed to perform meaningful rephrasing in FSL settings, often replicating the input sentence verbatim. This tendency is further supported by the high scores observed in the coherence dimension for these models (53.75% and 37.2%, respectively). Elevated coherence scores suggest that the output sentences maintained a high degree of internal consistency—an outcome that is expected when the input is minimally altered or directly copied. Therefore, the lack of rephrasing in FSL settings is not only evident qualitatively but also quantitatively reflected in coherence metrics. This behavior likely stems from the limited contextual information provided in few-shot prompts, which may be insufficient for smaller models to generalize effectively.
A notable trend is observed in the coherence metric under FSL, where most models experienced a performance drop. This phenomenon appears to be linked to the models’ tendency to prioritize empathetic expression over semantic fidelity when exposed to minimal examples. For instance, given the original input sentence “Your illness is terminal.”, an FSL-generated reformulation was “I’m very concerned about your situation and understand how difficult this news must be. I’m here to support you every step of the way, to listen, to accompany you, and to help you manage your better self during this difficult time.”
While the reformulated sentence demonstrates increased empathy, it significantly diverges from the original content, thereby reducing coherence. However, this phenomenon does not occur in the FT versions of the models, where a higher coherence is observed instead.

4.2. Cross-Domain Generalization

An analysis of model performance on sentences from non-medical domains reveals that most models exhibit a satisfactory ability to generalize, maintaining empathy levels comparable to those observed in the medical context. This trend is illustrated in Figure 7, which reports the average scores achieved by four models (Qwen2.5-7B-Instruct, Phi-3.5-mini-instruct, Llama-3.2-1B-Instruct, and LLaMAntino-3-ANITA-8B-Inst-DPO-ITA) across all topics. Notably, the models’ performance in finance, legal, social, and welfare domains appears broadly aligned with their results in the medical domain.
Table 8 provides a comparative analysis of empathy-related performance improvements, measured in both FSL and FT configurations relative to the BM.
However, some exceptions emerge. The Minerva-7B-instruct-v1.0 model, for instance, showed significantly lower performance in non-medical domains, with negative empathy improvements in almost all cases (finance: FSL −17.86%, FT −7.61%; legal: FSL −34.57%, FT −45.33%; work: FSL −26.15%, FT −7.89%; and social: FT −9.64%). This contrasts with its positive gains in the medical domain (FSL +10.11%, FT +12.73%), suggesting limited generalization capabilities.
Smaller models, such as gemma-3-1b and Llama-3.2-1B-Instruct, displayed a recurring pattern: suboptimal performance in the FSL configuration, followed by substantial improvements after fine-tuning. This underscores the critical role of FT in adapting lightweight models for empathetic tasks across diverse domains.
Conversely, models like Phi-3.5-mini-instruct and Llama-3.1-8B-Instruct achieved greater empathy improvements in non-medical domains compared to the medical one. For example, Phi-3.5-mini-instruct recorded a peak improvement of 18.33% (FT, social) versus 5.11% in the medical domain, while Llama-3.1-8B-Instruct reached 33.02% (FSL, work) compared to 8.17% (FSL, medical). These results suggest that such models may be inherently more versatile in general contexts than in specialized ones.
Finally, the granite-3.1-8b-instruct model maintained stable performance, with small but consistent empathy improvements across all domains. This behavior likely reflects its high intrinsic empathy baseline, leaving a narrower margin for improvement compared to models starting from lower initial performance.

4.3. Italian Model Evaluation

An additional focus of this study is the evaluation of sentence generation quality in models specifically tailored for the Italian language:
  • LLaMAntino-3-ANITA-8B-Inst-DPO-ITA demonstrates substantial improvements over the baseline, with performance gains of +5.06% in FSL and +9.22% in FT. These results highlight the effectiveness of the combined instruction tuning and DPO approach, particularly when supported by deep adaptation via fine-tuning. Notably, LLaMAntino excels in the empathy dimension, achieving a +23.74% improvement in FT—one of the highest scores across the entire evaluation. The model also demonstrates balanced performance across other dimensions: it enhances text fluency and lexical diversity, indicating strong stylistic control and vocabulary enrichment without compromising semantic coherence. However, structural coherence shows some instability in FSL, which is effectively mitigated through FT, suggesting that supervised optimization can address architectural limitations. In terms of knowledge integration, LLaMAntino performs well, with consistent improvements likely attributable to careful source selection during training.
  • Minerva-7B-instruct-v1.0 presents a more nuanced profile. It achieves greater improvement in FSL (+4.19%) than in FT (+2.41%), suggesting strong intrinsic comprehension and generalization capabilities, likely stemming from its pre-training phase. The model appears particularly well-suited for extracting relevant information from limited examples, reducing reliance on fine-tuning. In the empathy dimension, Minerva also performs well, with gains of +10.11% in FSL and +12.73% in FT, although less pronounced than LLaMAntino. In the knowledge dimension, Minerva surprisingly outperforms the baseline in FSL, but exhibits weaknesses in coherence and fluency under FT, indicating a higher sensitivity to adaptation strategies. Lexical diversity improves moderately, and empathetic expression remains one of its core strengths.

4.4. Inter-Rater Agreement and Metric Validation

To assess the impact of including G-Eval as an annotator, Fleiss’ kappa coefficient was calculated under two scenarios: three human annotators and a combination of two human annotators plus G-Eval. All human annotators were experts in the NLP domain and received dedicated training on the evaluation criteria prior to the assessment. The total number of raters was kept constant to avoid the statistical artifact introduced by increasing the number of annotators, which tends to lower the kappa value simply due to the higher probability of disagreement. This methodological choice enables us to attribute any observed variation in kappa specifically to the inclusion of G-Eval. The results obtained are reported in Table 9.
The findings show a reduction in agreement across all metrics, with decreases ranging from 0.007 (fluency) to 0.092 (empathy). Fluency and knowledge exhibit only marginal decreases, suggesting that G-Eval is relatively consistent with human raters when evaluating structural and informational aspects. Coherence and empathy show more pronounced reductions, confirming that semantic and affective dimensions are more challenging for an automated model to replicate. Lexical variety records a significant drop, likely due to divergent interpretations of sentences where the rephrasing consisted of adding a brief empathetic segment at the beginning or end of the original sentence. In these cases, some annotators reported uncertainty about the appropriate score, oscillating between low ratings (since most of the text was unchanged) and high ratings (because the addition introduced a stylistic change perceived as relevant). It is important to note that empathy is inherently subjective: the annotators reported difficulties in establishing uniform criteria, particularly when the empathetic tone was implicit or ambiguous, which contributed to increased variability in the ratings. Similarly, for knowledge, some raters highlighted the complexity of distinguishing between informational clarity and empathetic tone, resulting in variability in the ratings.
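For reference, the agreement values in Table 9 can be computed with standard tooling. The sketch below is a minimal example, assuming the ratings are arranged as one row per evaluated sentence and one column per rater (e.g., two humans plus G-Eval, or three humans); it relies on statsmodels’ implementation of Fleiss’ kappa and is illustrative rather than the exact evaluation script used in this study.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy ratings on the 1-5 star scale: rows = sentences, columns = raters.
ratings = np.array([
    [5, 5, 4],
    [3, 4, 4],
    [2, 2, 3],
    [5, 4, 5],
])

# aggregate_raters converts subject-by-rater labels into subject-by-category counts.
counts, categories = aggregate_raters(ratings)
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")
```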
In addition to inter-rater agreement, we further assessed the outputs using automatic similarity metrics to provide a complementary perspective on style transfer and semantic preservation, as summarized in Table 10.
The combination of these results strongly supports the hypothesis that the empathetic rephrasing layer operates effectively, achieving its objectives while maintaining appropriate constraints on meaning.
The low scores for BLEU-4 (0.17) and ROUGE-L (0.38) validate the success of the style transfer. These values confirm that the model’s output is not a mere verbatim copy or a trivial synonym substitution of the source sentence. Instead, the model introduced significant lexical and structural changes, such as opening phrases for empathy and modal modifiers, which necessarily penalize n-gram-based metrics but are essential for the required emotional shift.
The most critical finding for substantiating the model’s viability is the BERTScore F1 value of 0.70. This result demonstrates a substantially preserved level of semantic coherence between the original neutral message and the empathetic output. Although this score is not maximal (values in the 0.85–0.95 range are typical for pure paraphrasing tasks in which the style is unchanged), the 0.70 figure must be interpreted within the inherent style transfer vs. meaning preservation trade-off.
The observed deviation from the ideal BERTScore is a direct consequence of the empathetic intervention. Introducing subjective, emotional modifiers (e.g., “I understand this is difficult,” or “I am sorry to hear that”) inevitably alters the global semantic vector of the sentence, even when the factual kernel remains untouched. This slight reduction in semantic identity is thus considered the necessary cost for achieving emotional efficacy. The 0.70 value is strong evidence that the model successfully avoids semantic drift or factual inconsistencies, confirming that the rephrasing layer is a controlled mechanism that enhances emotional intelligence without fundamentally compromising the informational integrity of the base task.
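For transparency, the aggregate figures in Table 10 can be approximated with widely used open-source packages. The snippet below is a hedged sketch assuming the sacrebleu, rouge-score, and bert-score packages; the exact configuration behind the reported values (tokenization, BERTScore backbone, aggregation over the test set) is not restated here, so absolute numbers may differ slightly. The example pair is taken from Table A7.

```python
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

originals = ["Hai un'infezione respiratoria grave che richiede un trattamento urgente."]
rephrasings = ["Sono preoccupato per la gravità della tua infezione respiratoria e voglio "
               "assicurarmi che riceva il trattamento più appropriato in tempo utile."]

# BLEU-4: n-gram overlap between the rephrasing and the original sentence.
bleu = sacrebleu.corpus_bleu(rephrasings, [originals])
print(f"BLEU-4: {bleu.score / 100:.2f}")

# ROUGE-L: longest-common-subsequence F-measure.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
rouge_l = scorer.score(originals[0], rephrasings[0])["rougeL"].fmeasure
print(f"ROUGE-L: {rouge_l:.2f}")

# BERTScore F1: contextual semantic similarity (multilingual backbone for Italian).
_, _, f1 = bert_score(rephrasings, originals, lang="it")
print(f"BERTScore F1: {f1.mean().item():.2f}")
```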

4.5. Time and Cost Analysis

This section provides a quantitative assessment of the time and costs associated with the fine-tuning process and subsequent phrase generation. As outlined in Section 3.2.1, the evaluated LLMs were grouped into two categories based on their parameter count. All measurements were conducted using a Standard_NC16as_T4_v3 virtual machine within the Microsoft Azure cloud environment, as described in Section 3.2.3. To evaluate efficiency and cost-effectiveness, a benchmark was performed by generating 100k empathetic phrases per model.
The fine-tuning analysis revealed that small-scale models required an average of 5 min per training session, with an estimated cost of EUR 0.08. In contrast, medium-scale models exhibited longer training durations, ranging from 8 to 15 min, with corresponding estimated costs between EUR 0.15 and EUR 0.30.
For the generation task, small-scale models produced 100k phrases in approximately 17 h, incurring a cost of EUR 18. Medium-scale models completed the same task in around 50 h, with an estimated cost of EUR 54. For comparative purposes, the cost of generating 100k equivalent empathetic phrases in Italian via human annotators on the Amazon Mechanical Turk (MTurk) [53] platform was estimated. Assuming a reward of USD 0.1 per phrase and a 20% platform fee, the total cost for human-based generation amounted to USD 12k (approximately EUR 11,280). To provide a benchmark with state-of-the-art large language models, we estimated the cost of generating 100k empathetic phrases using GPT-4o via API. Assuming an average of 50 tokens per phrase, the total input and output tokens required would be 5 million each. According to OpenAI’s pricing (USD 5 per 1 M input tokens and USD 15 per 1 M output tokens), the total cost for generating 100k phrases is USD 100 (approximately EUR 94). The costs are summarized in Table 11.
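The GPT-4o and MTurk estimates in Table 11 follow directly from the stated assumptions (50 tokens per phrase, USD 5 per 1M input tokens, USD 15 per 1M output tokens, and USD 0.1 per phrase plus a 20% platform fee). A minimal sketch of the arithmetic:

```python
N_PHRASES = 100_000
TOKENS_PER_PHRASE = 50

# GPT-4o API estimate (pricing as stated in the text).
input_tokens = output_tokens = N_PHRASES * TOKENS_PER_PHRASE      # 5M tokens each
gpt4o_cost = input_tokens / 1e6 * 5 + output_tokens / 1e6 * 15    # USD 25 + USD 75
print(f"GPT-4o: USD {gpt4o_cost:.0f}")                            # USD 100

# MTurk estimate (USD 0.1 per phrase + 20% platform fee).
mturk_cost = N_PHRASES * 0.10 * 1.20
print(f"MTurk:  USD {mturk_cost:,.0f}")                           # USD 12,000
```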

5. Conclusions

This study investigated the effectiveness of the IDRE dataset in fine-tuning LLMs to generate empathetic responses in human–machine interaction contexts. The results confirm the utility of the IDRE dataset as a high-quality resource for enhancing empathetic capabilities in Italian chatbots, while also offering insights into the behavior and adaptability of various LLM architectures in this specific task. The main contributions of this work can be summarized as follows.
  • Validation of the IDRE Dataset: The IDRE dataset was successfully validated through a rigorous human evaluation process, demonstrating its high quality in terms of grammatical and semantic correctness, consistency between original and empathetic responses, and the effective increase in the empathetic tone of reformulated responses. Its triplet-based structure proved particularly effective for the empathetic rephrasing task.
  • Effectiveness of Fine-Tuning: Fine-tuning LLM models proved to be a highly effective approach for instilling empathetic capabilities, particularly in models with fewer parameters. The observed performance increase in Llama-3.2-1B-Instruct (32.34%, Table 5) is a clear example, paving the way for the implementation of lighter models in environments with limited computational resources.
  • Mitigation of Few-Shot Learning Limitations: The case of the Gemma-3-1B model demonstrated that fine-tuning can effectively compensate for limitations inherent to few-shot learning, significantly improving both lexical diversity and empathetic expressiveness. This suggests that deeper adaptation may be necessary for certain architectures to fully realize their potential in empathy-oriented tasks.
  • Cross-Domain Generalization: The empathetic reformulation capabilities acquired in the medical domain were shown to generalize across other domains, including financial, legal, social, and workplace contexts. While FT guaranteed robust generalization, our analysis revealed that the FSL approach on some models (e.g., Minerva-7B) can lead to negative results, with a decrease in empathy in non-medical domains. This suggests that the transferability of skills must be carefully evaluated for each model and configuration.
  • Fine-Tuning and Generation Costs: This study demonstrated that small and medium-sized models offer a cost-effective solution for empathetic sentence generation, achieving a favorable balance between computational efficiency and output quality. In particular, the operational costs associated with cloud-based fine-tuning were negligible compared to manual generation efforts. Furthermore, when compared to state-of-the-art large language models such as GPT-4o, our approach remains competitive in terms of cost.
The evaluation process revealed that automatic annotators like G-Eval closely matched human judgment on structural and informational aspects, but showed greater variability in capturing the semantic and affective nuances of empathy and coherence, reflecting the inherent subjectivity of these dimensions. Additionally, automatic similarity metrics such as BLEU, ROUGE, and BERTScore provided complementary evidence: low BLEU and ROUGE scores indicated substantial style transfer rather than simple copying, while the BERTScore F1 confirmed that semantic coherence was largely preserved despite empathetic and stylistic modifications. These findings support the effectiveness of the proposed approach in balancing emotional enrichment with meaning preservation in empathetic conversational AI.
In conclusion, this study has demonstrated the feasibility and effectiveness of using the IDRE dataset to enhance the empathetic capabilities of LLMs in the Italian language, contributing to the advancement of human–computer interaction and opening new possibilities for more sophisticated and user-oriented chatbot applications across a variety of domains.

6. Future Works

This study yields promising results; however, it is essential to acknowledge its potential limitations and identify areas where methodological refinement could improve the reliability and applicability of the proposed framework.
Our findings indicate that G-Eval evaluation can diverge from human evaluation, particularly in tasks requiring nuanced interpretation such as empathy assessment. To address these concerns and further validate our evaluation framework, we propose to implement the Alternative Annotator Test introduced by Calderon et al. [54]. This statistical approach provides a systematic method for determining when an LLM as a judge can justifiably replace human annotators. Importantly, this validation method is resource-efficient, involving comparison of the LLM to a small group of human annotators (at least three) on a modest subset of examples (between 50 and 100). The method employs a leave-one-out procedure that compares how well the LLM’s annotations align with the collective human annotator distribution versus how well individual human annotators align with their peers. By calculating a winning rate through hypothesis testing, this approach can statistically justify the use of LLM evaluation when the LLM demonstrates superior alignment with the consensus compared to individual human annotators.
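To make the procedure concrete, the sketch below illustrates the leave-one-out comparison in a deliberately simplified form: for each human annotator held out in turn, it checks whether the LLM judge aligns with the remaining annotators at least as well as the held-out human does, and reports the resulting winning rate. This is an illustration under our own assumptions (alignment measured as mean absolute agreement on the 1–5 scale, no hypothesis testing); the full statistical test is specified in [54].

```python
import numpy as np

def winning_rate(human_ratings: np.ndarray, llm_ratings: np.ndarray) -> float:
    """Simplified leave-one-out comparison.

    human_ratings: shape (n_items, n_humans); llm_ratings: shape (n_items,).
    For each held-out human, compare how closely the LLM vs. that human
    matches the mean rating of the remaining humans.
    """
    n_items, n_humans = human_ratings.shape
    wins = 0
    for j in range(n_humans):
        peers = np.delete(human_ratings, j, axis=1).mean(axis=1)
        llm_err = np.abs(llm_ratings - peers).mean()
        human_err = np.abs(human_ratings[:, j] - peers).mean()
        wins += llm_err <= human_err
    return wins / n_humans

# Toy example: 4 items, 3 human raters, one LLM judge.
humans = np.array([[5, 4, 5], [3, 3, 4], [2, 2, 2], [4, 5, 4]])
llm = np.array([5, 3, 2, 4])
print(f"Winning rate: {winning_rate(humans, llm):.2f}")
```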
While the IDRE dataset offers a valid resource for empathetic rephrasing in Italian, its modest size and reliance on synthetic data generated by LLMs present opportunities for improvement. Future work could focus on expanding the dataset and incorporating human-generated content, as suggested by previous research like HARALD [55], to enhance model robustness, mitigate potential overfitting, and prevent cultural and linguistic bias.
Building on the promising results demonstrated by the IDRE dataset, future work will focus on refining and expanding the dataset to further enhance its impact.

Author Contributions

Conceptualization, S.M.; Methodology, R.Z.; Data curation, L.G.; Supervision, A.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data supporting the reported results are made fully available within the article. Detailed information regarding the datasets generated and analyzed during this study, including code, can be found in Appendix A.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

All fine-tuned models are available in Table A1.
The IDRE dataset is available at this Hugging Face repository:
The Python code for fine-tuning and model inference is available in this GitHub repository:
https://github.com/smanai/IDRE-FineTuning (accessed on 30 September 2025)
Table A1. Hugging Face repository URLs for all fine-tuned models used in this study.
Fine-Tuned Model | Hugging Face URL
Minerva-7B-instruct-v1.0 | https://huggingface.co/SimoneManai/Minerva-7B-instruct-FT-Empathy (accessed on 11 March 2025)
LLaMAntino-3-ANITA-8B-Inst-DPO-ITA | https://huggingface.co/SimoneManai/LLaMAntino-3-FT-Empathy (accessed on 11 March 2025)
gemma-2-9b-it | https://huggingface.co/SimoneManai/gemma-2-9b-Empathy (accessed on 11 March 2025)
Qwen2.5-7B-Instruct | https://huggingface.co/SimoneManai/Qwen2.5-7B-Instruct-FT-Empathy (accessed on 11 March 2025)
Llama-3.1-8B-Instruct | https://huggingface.co/SimoneManai/Llama-3.1-8B-Instruct-FT-Empathy (accessed on 28 March 2025)
Mistral-7B-Instruct-v0.3 | https://huggingface.co/SimoneManai/Mistral-7B-Instruct-FT-Empathy (accessed on 2 April 2025)
granite-3.1-8b-instruct | https://huggingface.co/SimoneManai/granite-3.1-8b-instruct-Empathy (accessed on 2 April 2025)
Phi-3.5-mini-instruct | https://huggingface.co/SimoneManai/Phi-3.5-mini-instruct-FT-Empathy (accessed on 1 April 2025)
gemma-3-1b | https://huggingface.co/SimoneManai/gemma-3-1b-it-FT-Empathy (accessed on 16 April 2025)
Llama-3.2-1B-Instruct | https://huggingface.co/SimoneManai/Llama-3.2-1B-Instruct-FT-Empathy (accessed on 15 April 2025)
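For convenience, the repositories in Table A1 can be used like any other Hugging Face causal language model. The following is a minimal usage sketch, assuming the repositories include a chat template and follow the standard transformers format; the generation parameters shown are illustrative and are not the ones used in the paper. The prompt text is the BM/FT prompt reported in Appendix C, followed by a test sentence from Table A6.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SimoneManai/Llama-3.2-1B-Instruct-FT-Empathy"  # any repository from Table A1
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = ("Riscrivi la seguente frase aumentando in contenute empatico ma mantenendo "
          "lo stesso significato. la frase deve essere in italiano.\n"
          "Hai contratto una malattia infettiva grave.")
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}], add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```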

Appendix B

Table A2. Summary of evaluation scores for each base model across all evaluation dimensions. Bold values indicate the highest score within each column.
Models | Fluency | Coherence | Lexical Variety | Knowledge | Empathy | Average
granite-3.1-8b-instruct | 4.9 | 4.51 | 3.76 | 4.88 | 4.35 | 4.48
gemma-2-9b-it | 4.8 | 4.17 | 3.7 | 4.78 | 3.82 | 4.25
llama-3.1-8B-Instruct | 4.77 | 4.25 | 3.81 | 4.55 | 3.82 | 4.24
Mistral-7B-Instruct-v0.3 | 4.4 | 3.65 | 4.09 | 4.8 | 4.23 | 4.23
Qwen2.5-7B-Instruct | 4.76 | 4.39 | 3.43 | 4.69 | 3.88 | 4.23
Phi-3.5-mini-instruct | 4.57 | 3.97 | 4.05 | 4.08 | 3.9 | 4.11
gemma-3-1b | 4.71 | 2.97 | 4.19 | 4.27 | 3.92 | 4.01
LLaMANTINO-8B-ITA * | 4.68 | 3.92 | 4.08 | 4 | 3.02 | 3.94
Minerva-7B-instruct-v1.0 | 4.82 | 3.23 | 3.96 | 4.16 | 3.29 | 3.89
Llama-3.2-1B-Instruct | 3.08 | 1.85 | 3.15 | 2.58 | 2.71 | 2.67
* LLaMANTINO-8B-ITA refers to the model LLaMANTINO-3-ANITA-8B-instruct-DPO-ITA.
Table A3. Summary of evaluation scores for each few-shot learning model across all evaluation dimensions. Bold values indicate the highest score within each column.
Models | Fluency | Coherence | Lexical Variety | Knowledge | Empathy | Average
granite-3.1-8b-instruct | 4.91 | 3.88 | 4.48 | 4.9 | 4.85 | 4.62
Mistral-7B-Instruct-v0.3 | 4.97 | 3.27 | 4.4 | 4.87 | 4.42 | 4.39
Qwen2.5-7B-Instruct | 4.8 | 3.34 | 4.32 | 4.96 | 4.37 | 4.36
llama-3.1-8B-Instruct | 4.95 | 3.23 | 4.25 | 4.91 | 4.16 | 4.30
gemma-2-9b-it | 4.83 | 2.76 | 4.02 | 4.82 | 4.4 | 4.17
LLaMANTINO-8B-ITA * | 4.91 | 2.61 | 4.83 | 4.61 | 3.78 | 4.15
Minerva-7B-instruct-v1.0 | 4.95 | 2.84 | 3.99 | 4.87 | 3.66 | 4.06
Phi-3.5-mini-instruct | 4.67 | 2.94 | 4.37 | 4.26 | 4 | 4.05
gemma-3-1b | 4.9 | 4.73 | 1.28 | 4.13 | 1.59 | 3.33
Llama-3.2-1B-Instruct | 4.6 | 4 | 1.96 | 3.84 | 1.72 | 3.22
* LLaMANTINO-8B-ITA refers to the model LLaMANTINO-3-ANITA-8B-instruct-DPO-ITA.
Table A4. Summary of evaluation scores for each fine-tuned model across all evaluation dimensions. Bold values indicate the highest score within each column.
Models | Fluency | Coherence | Lexical Variety | Knowledge | Empathy | Average
granite-3.1-8b-instruct | 4.93 | 4.35 | 4.03 | 4.94 | 4.68 | 4.58
Mistral-7B-Instruct-v0.3 | 4.73 | 3.87 | 4.05 | 4.76 | 4.67 | 4.41
llama-3.1-8B-Instruct | 4.97 | 4.34 | 3.92 | 4.7 | 3.99 | 4.38
gemma-2-9b-it | 4.98 | 4.09 | 3.81 | 4.8 | 4.21 | 4.37
Qwen2.5-7B-Instruct | 4.89 | 4.76 | 3.32 | 4.7 | 4.05 | 4.34
LLaMANTINO-8B-ITA * | 4.79 | 4.22 | 4.31 | 4.42 | 3.96 | 4.34
Phi-3.5-mini-instruct | 4.69 | 4.03 | 4.09 | 4.28 | 4.11 | 4.24
gemma-3-1b | 4.35 | 3.21 | 4.34 | 4.26 | 4.93 | 4.218
Minerva-7B-instruct-v1.0 | 4.52 | 3.38 | 4.16 | 4.1 | 3.77 | 3.98
Llama-3.2-1B-Instruct | 3.96 | 2.53 | 4.58 | 4.07 | 4.59 | 3.94
* LLaMANTINO-8B-ITA refers to the model LLaMANTINO-3-ANITA-8B-instruct-DPO-ITA.

Appendix C

Prompt for FSL: Riscrivi la seguente frase aumentando in contenute empatico ma mantenendo lo stesso significato. la frase deve essere in italiano. qui di seguito ti fornisco 5 esempi nel formato: frase input -> frase output Le tue lastre di sangue dimostrano un alto rischio di infarto -> Le tue lastre di sangue dimostrano un alto rischio di infarto, ma non dovresti preoccuparti eccessivamente. Sono qui per aiutarti a comprendere meglio la tua salute e a trovare soluzioni per ridurre il rischio. La tua radiografia ha evidenziato una frattura vertebrale -> sento tanto male per la tua frattura vertebrale! Sono qui per aiutarti a superare questa difficile situazione e a trovare il modo migliore per farti curare La tua ecografia ha rilevato un aumento del volume del fegato -> sembra che il tuo fegato sia in un po’ di difficoltà. L’ecografia ha rilevato un aumento del volume, che potrebbe essere un segnale di una condizione underline. Sono qui per aiutarti a comprendere meglio la situazione e a trovare il modo migliore per farti sentire meglio La tua tomografia ha evidenziato una lesione cerebrale -> Sono così dispiaciuto di sapere che la tua TAC ha rivelato una lesione cerebrale. Deve essere davvero dura per te da gestire. C’è qualcosa che posso fare per supportarti in questo momento difficile? La tua biopsia ha rivelato una forma di malattia autoimmune. -> la tua biopsia ha rivelato una forma di malattia autoimmune. Sono qui per ascoltare e supportarti in questo momento difficile, cercando di comprendere le tue sfide e le tue emozioni. Genera solo una frase e non scrivere ulteriore testo.
Prompt for BM and FT: Riscrivi la seguente frase aumentando in contenute empatico ma mantenendo lo stesso significato. la frase deve essere in italiano.
Evaluation prompt for fluency: Sei un assistente AI. Ti verrà fornita la definizione di una metrica di valutazione per valutare la qualità di una risposta in un compito di tipo “frase originale-rephrasing”. Il tuo compito è calcolare un punteggio di valutazione accurato utilizzando la metrica di valutazione fornita. La fluidità misura la qualità della frase di rephrasing e se sono ben scritte e grammaticalmente corrette e in lingua italiana. Considera la qualità delle singole frasi quando valuti la fluidità. Data la frase originale e la frase di rephrasing, assegna alla fluidità della risposta un punteggio da una a cinque stelle utilizzando la seguente scala di valutazione: Una stella: la frase di rephrasing è completamente priva di fluidità Due stelle: la frase di rephrasing è per lo più priva di fluidità Tre stelle: la frase di rephrasing è parzialmente fluida Quattro stelle: la frase di rephrasing è per lo più fluida Cinque stelle: la frase di rephrasing è perfettamente fluida Questo valore di valutazione deve essere sempre un numero intero compreso tra 1 e 5. Quindi la valutazione prodotta dovrebbe essere 1 o 2 o 3 o 4 o 5. frase originale: domanda frase rephrasing: risposta rispondi con un json con questa struttura: “fluency” = stelle.
Evaluation prompt for coherence: Sei un assistente AI. Ti verrà fornita la definizione di una metrica di valutazione per valutare la qualità di una frase in un compito di tipo “frase originale-rephrasing”. Il tuo compito è calcolare un punteggio di valutazione accurato utilizzando la metrica di valutazione fornita. La coerenza di una frase si misura in base a quanto la frase di rephrasing esprime lo stesso significato alla frase originale. Considera la qualità complessiva della frase rephrasing quando valuti la coerenza. Data la frase originale e la frase rephrasing, assegna alla coerenza della frase rephrasing un punteggio da una a cinque stelle utilizzando la seguente scala di valutazione: Una stella: la frase rephrasing esprime un significato completamente differente rispetto alla frase originale Due stelle: la frase rephrasing esprime un significato per lo più differente rispetto alla frase originale Tre stelle: la frase rephrasing esprime un significato parzialmente coerente rispetto alla frase originale Quattro stelle: la frase rephrasing esprime un significato per lo più coerente rispetto alla frase originale Cinque stelle: la frase rephrasing ha lo stesso significato rispetto alla frase originale Questo valore di valutazione deve essere sempre un numero intero compreso tra 1 e 5. Quindi la valutazione prodotta dovrebbe essere 1 o 2 o 3 o 4 o 5. frase originale: domanda frase rephrasing: risposta rispondi con un json con questa struttura: “coherence” = stelle.
Evaluation prompt for lexical variety: Sei un assistente AI. Ti verrà fornita la definizione di una metrica di valutazione per valutare la qualità di una frase in un compito di tipo “frase originale-rephrasing”. Il tuo compito è calcolare un punteggio di valutazione accurato utilizzando la metrica di valutazione fornita. La varietà lessicale di una frase si misura in base alla diversità delle parole utilizzate nella frase di rephrasing rispetto alla frase originale. Considera la qualità complessiva della frase rephrasing quando valuti la varietà lessicale. Data la frase originale e la frase rephrasing, assegna alla varietà lessicale della frase rephrasing un punteggio da una a cinque stelle utilizzando la seguente scala di valutazione: Una stella: la frase rephrasing utilizza parole molto simili o identiche alla frase originale Due stelle: la frase rephrasing utilizza parole leggermente diverse rispetto alla frase originale Tre stelle: la frase rephrasing utilizza una varietà moderata di parole rispetto alla frase originale Quattro stelle: la frase rephrasing utilizza una varietà significativa di parole rispetto alla frase originale Cinque stelle: la frase rephrasing utilizza una varietà eccellente di parole rispetto alla frase originale Questo valore di valutazione deve essere sempre un numero intero compreso tra 1 e 5. Quindi la valutazione prodotta dovrebbe essere 1 o 2 o 3 o 4 o 5. frase originale: domanda frase rephrasing: risposta rispondi con un json con questa struttura: “lexical variety” = stelle.
Evaluation prompt for knowledge: Sei un assistente AI. Ti verrà fornita la definizione di una metrica di valutazione per valutare la qualità di una frase in un compito di tipo “frase originale-rephrasing”. Il tuo compito è calcolare un punteggio di valutazione accurato utilizzando la metrica di valutazione fornita. La chiarezza e la comprensibilità di una frase si misurano in base a quanto il testo riformulato trasmette le informazioni in modo chiaro e comprensibile, evitando tecnicismi e frasi eccessivamente dirette. Considera la qualità complessiva della frase rephrasing quando valuti la chiarezza e la comprensibilità. Data la frase originale e la frase rephrasing, assegna alla chiarezza e comprensibilità della frase rephrasing un punteggio da una a cinque stelle utilizzando la seguente scala di valutazione: Una stella: il testo riformulato è molto confuso e difficile da comprendere Due stelle: il testo riformulato è per lo più confuso e contiene tecnicismi o frasi eccessivamente dirette Tre stelle: il testo riformulato è parzialmente chiaro e comprensibile, ma potrebbe essere migliorato Quattro stelle: il testo riformulato è per lo più chiaro e comprensibile, con pochi tecnicismi o frasi eccessivamente dirette Cinque stelle: il testo riformulato è estremamente chiaro e comprensibile, senza tecnicismi o frasi eccessivamente dirette Questo valore di valutazione deve essere sempre un numero intero compreso tra 1 e 5. Quindi la valutazione prodotta dovrebbe essere 1 o 2 o 3 o 4 o 5. frase originale: domanda frase rephrasing: risposta rispondi con un json con questa struttura: “Knowledge” = stelle.
Evaluation prompt for empathy: Sei un assistente AI. Ti verrà fornita la definizione di una metrica di valutazione per valutare la qualità di una frase in un compito di tipo “frase originale-rephrasing”. Il tuo compito è calcolare un punteggio di valutazione accurato utilizzando la metrica di valutazione fornita. L’empatia e il supporto di una frase si misurano in base a quanto il testo riformulato riconosce e affronta le reazioni emotive dell’utente con empatia e supporto. Considera la qualità complessiva della frase rephrasing quando valuti l’empatia e il supporto. Data la frase originale e la frase rephrasing, assegna all’empatia e al supporto della frase rephrasing un punteggio da una a cinque stelle utilizzando la seguente scala di valutazione: Una stella: il testo riformulato non riconosce né affronta le reazioni emotive dell’utente Due stelle: il testo riformulato riconosce parzialmente le reazioni emotive dell’utente ma manca di supporto Tre stelle: il testo riformulato riconosce e affronta in parte le reazioni emotive dell’utente con un supporto limitato Quattro stelle: il testo riformulato riconosce e affronta per lo più le reazioni emotive dell’utente con empatia e supporto Cinque stelle: il testo riformulato riconosce e affronta pienamente le reazioni emotive dell’utente con grande empatia e supporto Questo valore di valutazione deve essere sempre un numero intero compreso tra 1 e 5. Quindi la valutazione prodotta dovrebbe essere 1 o 2 o 3 o 4 o 5. frase originale: domanda frase rephrasing: risposta rispondi con un json con questa struttura: “empathy support” = stelle.
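Since each evaluation prompt above asks the judge to reply with a small JSON object (e.g., “empathy support” = stars), the scores can be collected programmatically. The following is a minimal parsing sketch under that assumption; the field names follow the prompts above, and the helper shown is illustrative rather than part of the released code.

```python
import json
import re

def extract_star_score(judge_reply: str, field: str) -> int | None:
    """Pull an integer 1-5 score from a JSON-like judge reply."""
    match = re.search(r"\{.*\}", judge_reply, flags=re.DOTALL)
    if match:
        try:
            value = json.loads(match.group(0)).get(field)
            if isinstance(value, (int, float)) and 1 <= value <= 5:
                return int(value)
        except json.JSONDecodeError:
            pass
    return None

print(extract_star_score('{"empathy support": 4}', "empathy support"))  # 4
```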

Appendix D

The appendix presents the tables in the original Italian language.
Table A5. Examples of generated sentence triplets in Italian.
Question | Answer | Answer with Empathy
Ho sentito che ci sono anche screening per il tumore al seno. Quali sono? | Sì, ci sono screening per il tumore al seno che possono aiutare a diagnosticare il tumore in una fase precoce. I screening più comuni sono la mammografia, l’ecografia e la biopsia. Consiglio di parlare con il tuo medico per determinare se è necessario uno screening per te. | Sì, ci sono esami di screening per il tumore al seno che possono aiutare a diagnosticare il tumore in una fase precoce. Questi esami possono essere un po’ imbarazzanti, ma sono importanti per garantire la tua salute. La mammografia, l’ecografia e la biopsia sono i screening più comuni, ma c’è anche un sacco di altri metodi di screening disponibili. Parlando con il tuo medico, potremmo determinare se è necessario uno screening per te e se è il caso di farlo adesso.
Qual è la durata media di un ricovero in ospedale per una ferita alla gamba? | La durata del ricovero in ospedale per una ferita alla gamba dipende dalle gravità della lesione e dalle condizioni individuali del paziente. In media, la durata del ricovero può variare da pochi giorni a diverse settimane. | La durata del ricovero in ospedale per una ferita alla gamba può variare notevolmente a seconda della gravità della lesione e dalle condizioni individuali del paziente. In media, la durata del ricovero può variare da pochi giorni a diverse settimane, ma ogni paziente è unico e merita attenzione e cure personalizzate per una ripresa rapida e senza problemi
Table A6. Examples of sentences in the synthetic test set (Italian).
Sentence | Topic
Hai contratto una malattia infettiva grave. | Medical
Hai un’infezione del cervello grave che richiede un trattamento urgente. | Medical
Il tumore è maligno e metastatizzato in diverse parti del corpo. | Medical
Hai una malattia autoimmune incurabile. | Medical
La tua azienda ha dichiarato bancarotta. | Finance
Il tuo fondo comune di investimento ha registrato perdite considerevoli. | Finance
Sei stato dichiarato colpevole di frode fiscale. | Legal
Sei stato multato per violazione delle norme ambientali. | Legal
Il livello di povertà nel tuo comune è aumentato del 20% rispetto all’anno scorso. | Social
La tua area è stata identificata come zona ad alta incidenza di discriminazione razziale. | Social
La sua posizione è stata tagliata per migliorare l’efficienza aziendale. | Work
Il suo incarico è stato eliminato a seguito di tagli al budget. | Work
Table A7. Examples of sentence rephrasings for the LLaMAntino-3-ANITA-8B-Inst-DPO-ITA model (Italian).
Sentence | Model Configuration
Hai un’infezione respiratoria grave che richiede un trattamento urgente. | Original
Ti trovi in una situazione di emergenza medica a causa di un’infezione respiratoria molto grave che richiede un intervento medico immediato. | Base Model (BM)
Sono preoccupato per la gravità della tua infezione respiratoria e voglio assicurarmi che riceva il trattamento più appropriato in tempo utile. | Few-Shot Learning (FSL)
Il tuo stato di salute sembra essere preoccupante, ti è stata diagnosticata un’infezione respiratoria grave che richiede un intervento medico d’urgenza per prevenire possibili conseguenze serie. | Fine Tuning (FT)

References

  1. Fellous, J.M.; Arbib, M.A. Who Needs Emotions? The Brain Meets the Robot; Oxford University Press: Oxford, UK, 2005. [Google Scholar]
  2. Zaranis, E.; Paraskevopoulos, G.; Katsamanis, A.; Potamianos, A. EmpBot: A T5-based empathetic chatbot focusing on sentiments. arXiv 2021, arXiv:2111.00310. [Google Scholar]
  3. Shin, J.; Xu, P.; Madotto, A.; Fung, P. Generating empathetic responses by looking ahead the user’s sentiment. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–9 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 7989–7993. [Google Scholar]
  4. Liu, F.; Mao, Q.; Wang, L.; Ruwa, N.; Gou, J.; Zhan, Y. An emotion-based responding model for natural language conversation. World Wide Web 2019, 22, 843–861. [Google Scholar] [CrossRef]
  5. Decety, J. The neurodevelopment of empathy in humans. Dev. Neurosci. 2010, 32, 257–267. [Google Scholar] [CrossRef] [PubMed]
  6. Decety, J.; Jackson, P.L. The functional architecture of human empathy. Behav. Cogn. Neurosci. Rev. 2004, 3, 71–100. [Google Scholar] [CrossRef]
  7. Hojat, M.; Gonnella, J.S.; Nasca, T.J.; Mangione, S.; Vergare, M.; Magee, M. Physician empathy: Definition, components, measurement, and relationship to gender and specialty. Am. J. Psychiatry 2002, 159, 1563–1569. [Google Scholar] [CrossRef] [PubMed]
  8. Mohr, D.C.; Elnahal, S.; Marks, M.L.; Derickson, R.; Osatuke, K. Burnout trends among US health care workers. JAMA Netw. Open 2025, 8, e255954. [Google Scholar] [CrossRef]
  9. Manai, S.; Gemme, L.; Zanoli, R.; Lavelli, A. IDRE: AI Generated Dataset for Enhancing Empathetic Chatbot Interactions in Italian language. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, 4–6 December 2024; pp. 1036–1042. [Google Scholar]
  10. Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. Gpt-4o system card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
  11. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
  12. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  13. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  14. Guo, Q.; Zhu, Z.; Lu, Q.; Zhang, D.; Wu, W. A dynamic emotional session generation model based on Seq2Seq and a dictionary-based attention mechanism. Appl. Sci. 2020, 10, 1967. [Google Scholar] [CrossRef]
  15. Sprugnoli, R. Multiemotions-it: A new dataset for opinion polarity and emotion analysis for Italian. In Proceedings of the Seventh Italian conference on computational linguistics (CLiC-it 2020), Bologna, Italy, 1–3 March 2020; Accademia University Press: Torino, Italy, 2020; pp. 402–408. [Google Scholar]
  16. Mohammad, S.M. Practical and ethical considerations in the effective use of emotion and sentiment lexicons. arXiv 2020, arXiv:2011.03492. [Google Scholar] [CrossRef]
  17. Welivita, A.; Xie, Y.; Pu, P. A large-scale dataset for empathetic response generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 1251–1264. [Google Scholar]
  18. Rashkin, H.; Smith, E.M.; Li, M.; Boureau, Y.L. Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv 2018, arXiv:1811.00207. [Google Scholar]
  19. Sorin, V.; Brin, D.; Barash, Y.; Konen, E.; Charney, A.; Nadkarni, G.; Klang, E. Large language models and empathy: Systematic review. J. Med. Internet Res. 2024, 26, e52597. [Google Scholar] [CrossRef]
  20. Welivita, A.; Pu, P. Are Large Language Models More Empathetic than Humans? arXiv 2024, arXiv:2406.05063. [Google Scholar] [CrossRef]
  21. Qian, Y.; Zhang, W.N.; Liu, T. Harnessing the power of large language models for empathetic response generation: Empirical investigations and improvements. arXiv 2023, arXiv:2310.05140. [Google Scholar] [CrossRef]
  22. Xu, Z.; Jiang, J. Multi-dimensional evaluation of empathetic dialog responses. arXiv 2024, arXiv:2402.11409. [Google Scholar] [CrossRef]
  23. Saffaryazdi, N.; Gunasekaran, T.S.; Laveys, K.; Broadbent, E.; Billinghurst, M. Empathetic Conversational Agents: Utilizing Neural and Physiological Signals for Enhanced Empathetic Interactions. arXiv 2025, arXiv:2501.08393. [Google Scholar] [CrossRef]
  24. Sanjeewa, R.; Iyer, R.; Apputhurai, P.; Wickramasinghe, N.; Meyer, D. Empathic Conversational Agent Platform Designs and Their Evaluation in the Context of Mental Health: Systematic Review. JMIR Ment. Health 2024, 11, e58974. [Google Scholar] [CrossRef]
  25. Bilquise, G.; Ibrahim, S.; Shaalan, K. Emotionally Intelligent Chatbots: A Systematic Literature Review. Hum. Behav. Emerg. Technol. 2022, 2022, 1–23. [Google Scholar] [CrossRef]
  26. Lin, S.; Lin, L.; Hou, C.; Chen, B.; Li, J.; Ni, S. Empathy-Based Communication Framework for Chatbots: A Mental Health Chatbot Application and Evaluation. In Proceedings of the International Conference on Human-Agent Interaction (HAI), Gothenburg, Sweden, 4–7 December 2023. [Google Scholar] [CrossRef]
  27. Hu, Z.; Yang, Z.; Liang, X.; Salakhutdinov, R.; Xing, E.P. Toward controlled generation of text. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1587–1596. [Google Scholar]
  28. Shen, T.; Lei, T.; Barzilay, R.; Jaakkola, T. Style transfer from non-parallel text by cross-alignment. Adv. Neural Inf. Process. Syst. 2017, 30, 6833–6844. [Google Scholar]
  29. Fu, Z.; Tan, X.; Peng, N.; Zhao, D.; Yan, R. Style transfer in text: Exploration and evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  30. Briakou, E.; Agrawal, S.; Tetreault, J.; Carpuat, M. Evaluating the evaluation metrics for style transfer: A case study in multilingual formality transfer. arXiv 2021, arXiv:2110.10668. [Google Scholar] [CrossRef]
  31. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  32. Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378–382. [Google Scholar] [CrossRef]
  33. Orlando, R.; Moroni, L.; Cabot, P.L.H.; Conia, S.; Barba, E.; Orlandini, S.; Fiameni, G.; Navigli, R. Minerva LLMs: The first family of large language models trained from scratch on Italian data. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, 4–6 December 2024; pp. 707–719. [Google Scholar]
  34. Polignano, M.; Basile, P.; Semeraro, G. Advanced natural-based interaction for the Italian language: Llamantino-3-anita. arXiv 2024, arXiv:2405.07101. [Google Scholar] [CrossRef]
  35. Yang, A.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Huang, H.; Jiang, J.; Tu, J.; Zhang, J.; Zhou, J.; et al. Qwen2.5-1M Technical Report. arXiv 2025, arXiv:2501.15383. [Google Scholar] [CrossRef]
  36. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv 2025, arXiv:2505.09388. [Google Scholar] [CrossRef]
  37. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  38. Granite Team, IBM. Granite 3.0 Language Models. 2024. Available online: https://github.com/ibm-granite/granite-3.0-language-models (accessed on 1 October 2024).
  39. Team, G.; Riviere, M.; Pathak, S.; Sessa, P.G.; Hardin, C.; Bhupatiraju, S.; Hussenot, L.; Mesnard, T.; Shahriari, B.; Ramé, A.; et al. Gemma 2: Improving open language models at a practical size. arXiv 2024, arXiv:2408.00118. [Google Scholar] [CrossRef]
  40. Team, G.; Kamath, A.; Ferret, J.; Pathak, S.; Vieillard, N.; Merhej, R.; Perrin, S.; Matejovicova, T.; Ramé, A.; Rivière, M.; et al. Gemma 3 technical report. arXiv 2025, arXiv:2503.19786. [Google Scholar] [CrossRef]
  41. Abdin, M.; Aneja, J.; Awadalla, H.; Awadallah, A.; Awan, A.A.; Bach, N.; Bahree, A.; Bakhtiari, A.; Bao, J.; Behl, H.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219. [Google Scholar] [CrossRef]
  42. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  43. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representation, Virtual, 25–29 April 2022. [Google Scholar]
  44. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  45. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  46. Kocmi, T.; Federmann, C. Large language models are state-of-the-art evaluators of translation quality. arXiv 2023, arXiv:2302.14520. [Google Scholar]
  47. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv 2023, arXiv:2303.16634. [Google Scholar] [CrossRef]
  48. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  49. Baile, W.F.; Buckman, R.; Lenzi, R.; Glober, G.; Beale, E.A.; Kudelka, A.P. SPIKES—A six-step protocol for delivering bad news: Application to the patient with cancer. Oncologist 2000, 5, 302–311. [Google Scholar] [CrossRef] [PubMed]
  50. Rabow, M.W.; McPhee, S.J. Beyond breaking bad news: Helping patients who suffer. West. J. Med. 1999, 171, 260–263. [Google Scholar] [CrossRef] [PubMed]
  51. Buckman, R. How to Break Bad News: A Guide for Health Care Professionals; University of Toronto Press: Toronto, ON, USA, 1992. [Google Scholar]
  52. Ptacek, J.T.; Eberhardt, T.L. Breaking bad news: A review of the literature. JAMA 1996, 276, 496–502. [Google Scholar] [CrossRef]
  53. Amazon Web Services, Inc. Amazon Mechanical Turk, 2024. Available online: https://www.mturk.com/ (accessed on 11 June 2024).
  54. Calderon, N.; Reichart, R.; Dror, R. The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; Volume 1: Long Papers, pp. 16051–16081. [Google Scholar] [CrossRef]
  55. Ilan, T.; Vilenchik, D. HARALD: Augmenting hate speech data sets with real data. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2241–2248. [Google Scholar]
Figure 1. Integration of the empathetic rephrasing layer within an existing chatbot architecture.
Figure 2. Process of sentence generation for the IDRE dataset, including the creation of question–answer pairs and the subsequent empathy enhancement step.
Figure 3. Distribution of annotator ratings for each evaluation metric in the IDRE dataset.
Figure 4. Diagram of the evaluation pipeline for empathetic rephrasing models, detailing the phases from test set generation to multidimensional assessment.
Figure 5. Distribution of votes across evaluation dimensions for gemma-3-1b model.
Figure 6. Distribution of votes across evaluation dimensions for Llama-3.2-1B-Instruct model.
Figure 7. Average scores per evaluation dimension for selected models across multiple domains.
Table 1. Examples of sentence triplets in the IDRE dataset, including user question, standard chatbot answer, and empathetically rephrased answer (English translations). The corresponding Italian versions are provided in Appendix D, Table A5.
Question | Answer | Answer with Empathy
I heard there are also screenings for breast cancer. What are they? | Yes, there are screenings for breast cancer that can help diagnose the disease at an early stage. The most common screenings are mammography, ultrasound, and biopsy. I recommend talking to your doctor to determine whether screening is necessary for you. | Yes, there are screening tests for breast cancer that can help detect it early. These tests might feel a bit uncomfortable, but they are important for your health. Mammography, ultrasound, and biopsy are the most common screenings, but there are many other methods available. By talking to your doctor, we can determine whether screening is necessary for you and if it’s the right time to do it.
What is the average hospital stay for a leg injury? | The length of hospital stay for a leg injury depends on the severity of the injury and the individual patient’s condition. On average, the stay can range from a few days to several weeks. | The length of hospital stay for a leg injury can vary greatly depending on the severity of the injury and the patient’s individual condition. On average, it can range from a few days to several weeks, but every patient is unique and deserves personalized care and attention for a smooth and speedy recovery.
Table 2. Fleiss’ kappa coefficients for each evaluation dimension and aggregated categories. “Aggregated Fleiss’ Kappa” represents the result of aggregating the scores into three macro-categories: scores 1 and 2, score 3 (neutral), and scores 4 and 5.
Evaluation Dimensions | Fleiss Kappa | Aggregate Fleiss Kappa
Bot sentence correctness | 0.608 | 0.821
Absence of English words in bot sentences | 0.781 | 0.927
Empathetic answer correctness | 0.566 | 0.807
Absence of English words in empathetic sentences | 0.587 | 0.881
Semantic coherence | 0.587 | 0.881
Empathy increase | 0.645 | 0.840
Table 3. List of language models used in this study, with the corresponding number of parameters and language coverage.
Models | Parameters [Billions] | Language
Minerva-7B-instruct-v1.0 | 7.4 | Italian
LLaMANTINO-8B-ITA * | 8.03 | Italian
gemma-2-9b-it | 9.24 | Multi-Language
Qwen2.5-7B-Instruct | 7.62 | Multi-Language
Llama-3.1-8B-Instruct | 8.03 | Multi-Language
Mistral-7B-Instruct-v0.3 | 7.25 | Multi-Language
granite-3.1-8b-instruct | 8.17 | Multi-Language
Phi-3.5-mini-instruct | 3.82 | Multi-Language
gemma-3-1b | 1 | Multi-Language
Llama-3.2-1B-Instruct | 1.24 | Multi-Language
* LLaMANTINO-8B-ITA refers to the model LLaMANTINO-3-ANITA-8B-instruct-DPO-ITA.
Table 4. Examples of sentences from the synthetic test set, categorized by topic (English translations). The corresponding Italian versions are provided in Appendix D, Table A6.
Sentence | Topic
You have contracted a serious infectious disease. | Medical
You have a severe brain infection that requires urgent treatment. | Medical
The tumor is malignant and has metastasized to several parts of the body. | Medical
You have an incurable autoimmune disease. | Medical
Your company has declared bankruptcy. | Finance
Your mutual fund has experienced significant losses. | Finance
You have been found guilty of tax fraud. | Legal
You have been fined for violating environmental regulations. | Legal
The poverty rate in your municipality has increased by 20% compared to last year. | Social
Your area has been identified as a high-incidence zone for racial discrimination. | Social
Your position has been cut to improve company efficiency. | Work
Your role has been eliminated due to budget cuts. | Work
Table 5. Percentage improvement in evaluation scores for models in the FT and FSL configurations compared to the BM. The IC values are FT: 0.067 (95% CI: 0.001, 0.133) and FSL: 0.013 (95% CI: −0.053, 0.080).
Models | FSL Increment (%) | FT Increment (%)
Llama-3.2-1B-Instruct | 17.08 | 32.34
LLaMANTINO-8B-ITA * | 5.06 | 9.22
gemma-3-1b | −20.42 | 4.93
Mistral-7B-Instruct-v0.3 | 3.64 | 4.21
llama-3.1-8B-Instruct | 1.40 | 3.28
Phi-3.5-mini-instruct | −1.48 | 3.07
gemma-2-9b-it | −1.92 | 2.92
Qwen2.5-7B-Instruct | 2.98 | 2.62
Minerva-7B-instruct-v1.0 | 4.19 | 2.41
granite-3.1-8b-instruct | 3.03 | 2.31
* LLaMANTINO-8B-ITA refers to the model LLaMANTINO-3-ANITA-8B-instruct-DPO-ITA.
Table 6. Examples of sentences for the LLaMAntino-3-ANITA-8B-Inst-DPO-ITA model. The sentences included are English translations of original Italian utterances. The corresponding Italian versions are provided in Appendix D, Table A7.
Sentence | Model Configuration
You have a severe respiratory infection that requires urgent treatment. | Original
You are in a medical emergency due to a very serious respiratory infection that requires immediate medical intervention. | Base Model (BM)
I’m concerned about the severity of your respiratory infection and want to ensure you receive the most appropriate treatment in a timely manner. | Few-Shot Learning (FSL)
Your health condition appears to be concerning; you’ve been diagnosed with a serious respiratory infection that requires urgent medical intervention to prevent potentially serious consequences. | Fine Tuning (FT)
Table 7. Performance improvement of models in FT and FSL configurations over the baseline model, reported for each evaluation dimension.
Model | Fluency (FSL / FT) | Coherence (FSL / FT) | Lexical Variety (FSL / FT) | Knowledge (FSL / FT) | Empathy (FSL / FT)
gemma-2-9b-it | 0.62% / 3.61% | −51.09% / −1.96% | 7.96% / 2.89% | 0.83% / 0.42% | 13.18% / 9.26%
gemma-3-1b | 3.88% / −8.28% | 37.21% / 7.48% | −227.3% / 3.46% | −3.39% / −0.23% | −146.5% / 20.49%
granite-3.1-8b-instruct | 0.20% / 2.00% | −16.24% / −3.68% | 16.07% / 6.70% | 2.40% / 1.21% | 10.31% / 7.05%
Llama-3.1-8B-instruct | 3.64% / 4.02% | −31.58% / 2.07% | 10.35% / 2.81% | 7.33% / 3.19% | 8.17% / 4.26%
Llama-3.2-1B-instruct | 3.30% / 2.22% | 53.75% / 6.88% | −60.71% / 5.22% | 15.81% / 16.61% | −57.56% / 40.96%
LLaMANTINO-8B-ITA * | 4.68% / 2.30% | −50.19% / 7.11% | 15.53% / 5.34% | 13.23% / 9.50% | 20.11% / 23.74%
Minerva-7B-instruct-v1.0 | 2.63% / −6.64% | −13.73% / 4.44% | 0.75% / 4.81% | 14.58% / −1.46% | 10.11% / 12.73%
Mistral-7B-instruct-v0.3 | 11.47% / 6.98% | −11.62% / 5.68% | 7.05% / −0.99% | 1.44% / −0.84% | 4.30% / 9.42%
Phi-3.5-mini-instruct | 2.14% / 2.56% | −35.03% / 1.49% | 7.32% / 0.98% | 4.23% / 4.67% | 2.50% / 5.11%
Qwen2.5-7B-instruct | 0.83% / 2.66% | −31.44% / 7.77% | 20.60% / −3.31% | 5.44% / 0.21% | 11.21% / 4.20%
* LLaMANTINO-8B-ITA refers to the model LLaMANTINO-3-ANITA-8B-instruct-DPO-ITA.
Table 8. Empathy-related performance improvements for each model in FT and FSL configurations, reported by domain.
Model | Finance (FSL / FT) | Legal (FSL / FT) | Social (FSL / FT) | Work (FSL / FT) | Medical (FSL / FT)
gemma-2-9b-it | 12.61% / 9.35% | 27.83% / 13.54% | 18.02% / 17.27% | 36.21% / 8.64% | 13.18% / 9.26%
gemma-3-1b | −72.31% / 8.20% | −112.20% / 22.32% | −253.13% / 6.61% | −71.11% / 28.70% | −146.54% / 20.49%
granite-3.1-8b-instruct | 4.10% / 4.10% | 12.00% / 1.79% | 7.20% / 7.20% | 8.20% / 5.08% | 10.31% / 7.05%
Llama-3.1-8B-instruct | 16.81% / 12.96% | 30.36% / 29.73% | 13.16% / 12.39% | 33.02% / 11.25% | 8.17% / 4.26%
Llama-3.2-1B-instruct | −92.11% / 35.40% | −27.08% / 44.55% | −82.86% / 43.36% | −64.71% / 48.15% | −57.56% / 40.96%
LLaMANTINO-8B-ITA * | 24.59% / −1.10% | 48.31% / 8.96% | 31.90% / 7.06% | 52.85% / 13.43% | 20.11% / 23.74%
Minerva-7B-instruct-v1.0 | −17.86% / −7.61% | −34.57% / −45.33% | 3.19% / −9.64% | −26.15% / −7.89% | 10.11% / 12.73%
Mistral-7B-instruct-v0.3 | 9.09% / 15.29% | 14.78% / 10.91% | −2.83% / 8.40% | 10.28% / 9.43% | 4.30% / 9.42%
Phi-3.5-mini-instruct | 9.52% / 18.10% | 3.81% / 15.13% | 5.77% / 18.33% | 2.88% / 9.82% | 2.50% / 5.11%
Qwen2.5-7B-instruct | 12.26% / 17.70% | 18.97% / −9.30% | −7.22% / 3.70% | 14.81% / 6.12% | 11.21% / 4.20%
* LLaMANTINO-8B-ITA refers to the model LLaMANTINO-3-ANITA-8B-instruct-DPO-ITA.
Table 9. The table presents Fleiss’ kappa coefficients for each evaluation metric, comparing the Human-Only and 2 Humans + G-Eval configurations. The final column reports the difference in agreement between the two scenarios.
Evaluation Dimensions | Kappa (Only Humans) | Kappa (2 Humans + G-Eval) | Delta
Fluency | 0.656 | 0.649 | −0.007
Coherence | 0.606 | 0.565 | −0.041
Lexical Variety | 0.405 | 0.315 | −0.09
Knowledge | 0.429 | 0.405 | −0.024
Empathy | 0.541 | 0.449 | −0.092
Table 10. Aggregate results of automatic similarity metrics (BLEU-4, ROUGE-L, and BERTScore F1) for the evaluation of style transfer and semantic coherence.
Metric | Measured Property | Calculated Result (Mean)
BLEU-4 | Lexical Overlap | 0.17
ROUGE-L | Content Recall | 0.38
BERTScore F1 | Contextual Semantic Coherence | 0.7
Table 11. Estimated costs for generating 100,000 empathetic sentences using different methods.
Method | Generation Cost (100k Phrases)
Small LLM (IDRE) | EUR 18
Medium LLM (IDRE) | EUR 54
GPT-4o (API) | EUR 94
Human (MTurk) | EUR 12k
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
