1. Introduction
Empathy is an innate cognitive ability that enables humans to recognize and comprehend others’ emotional states through perspective-taking. It plays a vital role in social interactions by strengthening emotional bonds, fostering trust, and nurturing compassion. Furthermore, empathy is integral to everyday conversation, as people share experiences and emotions, reinforcing mutual understanding and connection [
1]. Since the term empathy emerged in the late nineteenth century, it has been defined differently across the diverse disciplines of social psychology, ethics, cognitive science, and clinical psychology. For example, Barker 1995 [
2] defined empathy as “the act of perceiving, understanding, experiencing, and responding to the emotional state and ideas of another person.” In contrast, Hoffman [
3] defined empathy as “an affective response more appropriate to another’s situation than one’s own.” In addition, the concept of empathy encompasses both affective and cognitive dimensions [
1]. Affective empathy is the ability to feel or mirror—either implicitly or explicitly—another person’s feelings. Cognitive empathy, also called perspective-taking, entails a deliberate effort to understand and interpret another person’s situation and implicit feelings without necessarily sharing them. Hence, empathy requires recognizing and understanding others’ emotional states as well as the ability to respond to them with sympathy and compassion.
A key challenge in developing human-like conversational models is enabling empathy. This involves equipping a model with the ability to infer and respond to a user’s emotions, thereby enabling it to generate more genuine and engaging responses. This response generation process requires modeling cognitive and affective states, making it a complex generative challenge beyond mere emotion or sentiment detection. Empathetic responses have the potential to revolutionize various fields, including education, customer service, entertainment, and healthcare. Although English language research has increased in this domain, there remains a gap in the nuanced application of empathy across languages and cultures, where cultural backgrounds significantly shape how people express emotions and seek empathy [
4,
5]. These socio-cultural variations are often overlooked in current open-domain conversational systems, potentially limiting their effectiveness and scalability. A significant research gap persists in the study of empathy for Arabic conversational AI. The scarcity of human-generated conversation datasets that accurately capture their nuanced communication strategies and empathetic behavior poses a significant barrier to developing truly emotionally intelligent and culturally sensitive conversational models.
Despite its long history, the field of conversational AI has recently witnessed remarkable advancements driven by the emergence of transformer-based large language models (LLMs). These language models have shown extraordinary capabilities in a variety of language understanding and generation tasks. However, most of these models are tailored primarily to English. Although some models exhibit multilingual capabilities, they still struggle with the morphological and syntactic characteristics of Arabic in particular [
6,
7,
8]. In recent years, the field of Arabic language processing has seen increased development of generative Arabic LLMs, such as AraGPT2 [
9] and AraT5 [
10]. However, these models are not tailored for following instructions or maintaining coherent conversations. Only recently has the research community witnessed the emergence of powerful chat-oriented and instruction-following Arabic generative LLMs such as Jais-chat [
11], AceGPT-chat [
12], and ALLaM [
13], offering exciting prospects to advance Arabic applications. Nevertheless, as far as we know, these models remain untested with respect to their emotional awareness or empathetic capabilities.
In view of these limitations, we present Arabic Empathetic Conversations (AEConvs), a dataset containing more than 4K open-domain Arabic textual empathetic conversations. Each conversation in the dataset is conducted between two humans, resulting in authentic and natural conversations. In addition, we leverage AEConvs to evaluate the empathetic capabilities of two recently released Arabic generative LLMs, AceGPT-chat and Jais-chat, using two learning approaches: zero-shot and fine-tuning. In contrast to fine-tuning, zero-shot testing evaluates a model’s ability to perform a new task without explicit training on task-specific examples. This approach relies on the model’s inherent knowledge and capacity to generalize to unseen situations. Despite the relatively small dataset, automatic and human evaluation results demonstrate that fine-tuning with AEConvs yields significantly more fine-grained empathetic responses in both models compared to zero-shot responses, while also enhancing the ability to generate more fluent and contextually relevant responses. Furthermore, human evaluators found that the fine-tuned AceGPT-chat responses were more empathetic than those of Jais-chat in over half of the cases, suggesting a potential performance advantage for AceGPT-chat on this task.
The main contributions of this study are as follows:
- 1.
We construct a novel Arabic empathetic conversation dataset as a new benchmark. To the best of our knowledge, AEConvs is the first of its kind to contain genuine empathetic open-domain conversations written in Modern Standard Arabic (MSA), the formal and standardized form of language used across the Arabic-speaking world. The AEConvs dataset is designed to advance research on open-domain conversations, particularly the development of empathetic conversational models. We believe that the release of AEConvs is a step toward making Arabic LLMs more emotionally sensitive and culturally aligned.
- 2.
We provide the first empirical investigation of the empathetic response capabilities of two state-of-the-art Arabic generative LLMs—AceGPT-chat and Jais-chat—under two training settings: zero-shot and fine-tuning. Compared to zero-shot baselines, human evaluation results demonstrate that fine-tuning on AEConvs improved the performance of both models in generating empathetic responses, while also improving their fluency and context-following capabilities. Additionally, the automatic evaluation indicates improved language modeling and better lexical and semantic similarity with human reference responses in both models.
The remainder of this paper is organized as follows:
Section 2 provides an overview of generative large language models (LLMs) and the recent literature on empathetic response generation in English and Arabic.
Section 3 describes the methodology used to curate the
AEConvs dataset and its relevant statistics. In addition, it details the experimental setup, including the base models, fine-tuning details, and evaluation metrics.
Section 4 presents and discusses the experimental results and provides a case study with error analysis, offering a qualitative in-depth examination of model outputs. Finally,
Section 5 provides concluding remarks and future directions.
3. Materials and Methods
This section details the experimental setup of this study. First, we introduce the AEConvs dataset, outlining its construction process and key characteristics. We then describe the base pre-trained language models—AceGPT-chat and Jais-chat—and their fine-tuning details. Finally, we present a comprehensive evaluation framework that leverages both automatic metrics and human assessment.
3.1. AEConvs Dataset
Dataset Collection: To construct the Arabic Empathetic Conversations (AEConvs) dataset, a total of 85 freelance content writers were hired from Khamsat.com, an Arabic freelancing marketplace. The freelancers were all native Arabic speakers with experience writing in Modern Standard Arabic (MSA). Table 1 shows the demographic data of the freelancers, including gender, age, and education. Data were collected in batches. For each batch, a group of writers (maximum 10) was added to a group chat on the Telegram application and asked to engage in dyadic open-domain conversations. Each conversation consisted of one speaker and one listener, in which the speaker initiated a conversation about a daily event or an emotional incident, and the listener responded with empathy. The speaker and listener could alternate turns in each conversation, with a minimum of 4 and a maximum of 10 turns. After a pair of writers finished their conversation, another pair started a new one, and so on.
To assess each freelance writer’s qualifications for the task, particularly writing ability and understanding of the assignment, each writer was first asked to write a sample set of empathetic conversations. Upon acceptance, the writers were explicitly instructed not to use dialectal language, emojis, symbols, or numbers. They were also required to use proper language and avoid greetings and short answers (e.g., “Yes,” “No,” “I agree,” etc.). In addition, the writers were advised to refrain from revealing personal information or discussing controversial topics such as politics, sex, and religion. Proofreading and quality checks were manually performed after each batch.
Dataset Statistics: A total of 4155 conversations were collected. After filtering out duplicates, conversations with fewer than 4 utterances, violations in speaker–listener turn-taking, and listeners’ turns with fewer than 3 words, we obtained 4120 unique conversations divided equally between listener and speaker utterances. Table 2 shows the detailed statistics of the AEConvs dataset. All conversations are written in MSA. Most of the dataset consists of short conversations with 4–6 turns, with less than 1% containing more than 8 utterances. A sample conversation from AEConvs is shown in Table 3. In addition, Figure 1 shows the word cloud of the 50 most frequently occurring words in the listeners’ turns across the 4120 conversations, indicating the level of expressed empathy in AEConvs. Stop words were excluded using a custom Arabic stop word list, and the ISRIStemmer from the nltk library was applied to normalize word forms.
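The filtering rules listed under Dataset Statistics can be illustrated with a minimal sketch. The representation of a conversation as a list of `(role, text)` tuples, the function names, and the reading of "turn-taking violation" as any departure from strict speaker/listener alternation starting with the speaker are all assumptions made for illustration; the exact rules used to build AEConvs may differ.

```python
def is_valid(conversation, min_utterances=4, min_listener_words=3):
    """Check one conversation against the (assumed) filtering rules."""
    if len(conversation) < min_utterances:
        return False
    for index, (role, text) in enumerate(conversation):
        # Assumption: the speaker opens and the roles strictly alternate.
        expected = "speaker" if index % 2 == 0 else "listener"
        if role != expected:
            return False
        # Listener turns must contain at least `min_listener_words` words.
        if role == "listener" and len(text.split()) < min_listener_words:
            return False
    return True


def filter_conversations(conversations):
    """Drop exact duplicates and conversations that fail `is_valid`."""
    seen = set()
    kept = []
    for conversation in conversations:
        key = tuple(conversation)  # hashable fingerprint of the whole dialogue
        if key in seen or not is_valid(conversation):
            continue
        seen.add(key)
        kept.append(conversation)
    return kept
```

Exact-match deduplication is the simplest choice here; near-duplicate detection would need a separate similarity step.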
Data Pre-processing: We used CAMeL tools [
54] for Arabic data pre-processing and performed the following steps:
- 1.
Dediacritization: Arabic diacritical marks (ḥarakāt) are stripped from text to obtain a more consistent and standardized orthographic representation, e.g., مُبَارك to مبارك.
- 2.
Removal of repeating characters: Duplicated letters are removed from words, as they are often used to convey tone, e.g., مبروووك to مبروك.
- 3.
Kashīda removal: Elongation characters used in Arabic text to artificially extend words are removed, e.g., شــكـرا to شكرا.
- 4.
Text normalization: Punctuation, numerals, and special characters are stripped, and white spaces are standardized.
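The four pre-processing steps above can be sketched with plain regular expressions. This is not the CAMeL tools API used in the study; it is a standard-library stand-in under stated assumptions: the Unicode ranges chosen for diacritics and Arabic letters, the rule that only runs of three or more identical characters are collapsed (so that legitimately doubled letters survive), and the function name `preprocess` are all illustrative.

```python
import re

# Assumed ranges: fathatan..sukun plus superscript alef count as diacritics.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
KASHIDA = "\u0640"  # tatweel, the elongation character
# Assumption: a run of 3+ identical characters is expressive lengthening.
REPEATED = re.compile(r"(.)\1{2,}")
# Anything that is neither an Arabic letter nor whitespace is stripped.
NOT_ARABIC_LETTER = re.compile(r"[^\u0621-\u063A\u0641-\u064A\s]")


def preprocess(text):
    text = DIACRITICS.sub("", text)          # 1. dediacritization
    text = REPEATED.sub(r"\1", text)         # 2. collapse repeated characters
    text = text.replace(KASHIDA, "")         # 3. kashida removal
    text = NOT_ARABIC_LETTER.sub(" ", text)  # 4. drop punctuation, digits, symbols
    return " ".join(text.split())            #    and standardize white space
```

Applied to the three examples given in the list, this sketch maps مُبَارك to مبارك, مبروووك to مبروك, and شــكـرا to شكرا.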
In addition, the dataset was randomly partitioned into 80% for training, 10% for validation, and 10% for testing. The splitting was performed at the conversation level, rather than the turn level, to maintain integrity.
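A conversation-level 80/10/10 split can be sketched as follows. The seed value and the function name are hypothetical; the point of the sketch is only that whole conversations, not individual turns, are shuffled and assigned, so no conversation is divided between partitions.

```python
import random


def split_conversations(conversations, seed=0):
    """Shuffle whole conversations and split them 80/10/10."""
    items = list(conversations)
    random.Random(seed).shuffle(items)  # hypothetical seed, for repeatability
    n_train = len(items) * 8 // 10      # integer arithmetic avoids float rounding
    n_valid = len(items) // 10
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]    # remainder goes to the test set
    return train, valid, test
```

For the 4120 conversations in AEConvs, this arithmetic gives 3296 training, 412 validation, and 412 test conversations.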
3.2. Base Models
AceGPT-chat: AceGPT models [
12] are a recently released series of LLaMA2-based [
24] models with different parameter sizes, ranging from 7B to 13B. AceGPT was designed to serve as an Arabic culture-aware and value-aligned generative model. The integration of localized pre-training, fine-tuning on native Arabic instructions, and reinforcement learning from AI feedback (RLAIF) enables AceGPT models to better align with Arabic cultural norms and values. This leads to improved performance on culturally-specific tasks. AceGPT models are pre-trained on a mixture of English and Arabic data, using a larger subset of the latter. The AceGPT family consists of two variants: AceGPT-base (the foundational model) and AceGPT-chat (an instruction-tuned model optimized for applications). Both configurations outperformed similar open source LLMs, including Jais-13B [
11], on the Arabic Cultural and Value Alignment (ACVA) benchmark under zero-shot settings, demonstrating the model’s superior ability to attune to the unique cultural characteristics of Arabic communities.
Table 4 lists the performance of AceGPT-13B-chat compared to Jais-13B-chat on the ACVA benchmark.
Figure 1.
(a) Word cloud of listeners’ turns in the AEConvs dataset. (b) The English version (translated via Google Translate API).
Jais-chat: Jais [
11] is a family of bilingual English–Arabic generative LLMs introduced by the Inception Institute of Artificial Intelligence, UAE, in 2023. Jais models are designed to excel in the Arabic language while maintaining strong English capabilities, and they are trained on a mixture of Arabic, English, and code data. In addition, they have demonstrated stronger Arabic knowledge and reasoning capabilities than existing Arabic and multilingual LLMs, while proving to be strong competitors to their English counterparts. Two foundational Jais model families have been released. The first is the Jais-family, pre-trained from scratch using a decoder-only architecture based on GPT-3 [
55], comprising eight model sizes, ranging from 590 million to 30 billion parameters. The second family, Jais-adapted, was released more recently in 2024. These models are adaptively pre-trained on top of the LLaMA-2 [
24] architecture, trained on a larger Arabic dataset to enhance Arabic capabilities. The Jais-adapted models come in three sizes (7B, 13B, and 70B parameters) and feature an expanded context window of 4096 tokens. Both foundational families are available in two variants: base models and instruction-tuned chat models (Jais-chat). The chat variants are optimized for applications by training on a corpus of 4 million Arabic and 10 million English instruction–response pairs across single- and multi-turn settings.
To ensure a fair comparison, we used the AceGPT-7B-chat and Jais-adapted-7B-chat versions; both are LLaMA-2-based, with approximately 7 billion parameters.
Table 5 shows the main features of AceGPT-7B-chat and Jais-adapted-7B-chat. For ease of reference, we refer to these models as
AceGPT-chat and
Jais-chat, respectively, throughout the rest of this paper.
3.3. Fine-Tuning Details
We fine-tuned both models using Low-Rank Adaptation (LoRA) [
56] within the Unsloth
1 framework. LoRA enhances LLM fine-tuning by decomposing the weight update matrix into a lower-rank representation, enabling faster training and potentially lower computational cost without sacrificing performance. The LoRA configuration employed a rank of
with a scaling factor
and a dropout rate of 0.1 for regularization. For model stability, we enabled rank stabilization (rsLoRA) [
57], and gradient checkpointing was enabled through Unsloth’s optimized implementation to reduce memory consumption. A random seed of 3407 was used to ensure reproducibility. For the learning rate, we experimented with different values ranging from
to
and selected the one that achieved the best perplexity (PPL) on the validation set without overfitting, as shown in
Table 6. We followed existing work [
58,
59] using perplexity as a metric to find the optimal learning rate and number of training epochs. Both models were trained with early stopping, and training for more than 2 epochs resulted in overfitting and higher perplexity, indicating worse performance. Furthermore, we set the batch size to 16, the maximum sequence length to 512, and used the AdamW (Adam with Weight Decay) optimizer for both models. Conversations were formatted according to each model’s standardized chat template prior to fine-tuning. All experiments were conducted on a single Tesla T4 GPU provided by Google Colab. The training time lasted 5 h and 31 min for AceGPT-chat and 3 h and 2 min for Jais-chat.
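For reference, the low-rank update that LoRA learns, and the change of scaling that rank stabilization introduces, can be written as below. The symbols (frozen weight W_0, trainable factors B and A, rank r, scaling factor α) follow the usual notation of the LoRA and rsLoRA literature and are not taken from this study's configuration.

```latex
\begin{align}
  W &= W_0 + \frac{\alpha}{r}\, B A,
      \qquad B \in \mathbb{R}^{d \times r},\;
             A \in \mathbb{R}^{r \times k},\;
             r \ll \min(d, k)
      && \text{(LoRA)} \\
  W &= W_0 + \frac{\alpha}{\sqrt{r}}\, B A
      && \text{(rsLoRA)}
\end{align}
```

Only B and A are updated during fine-tuning, so the number of trainable parameters per adapted matrix drops from d·k to r·(d + k).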
3.4. Evaluation
In this work, we employ a comprehensive evaluation strategy that combines automated and human assessment. Automatic metrics capture different dimensions of text quality, while human evaluation provides holistic validation. The models’ performance was evaluated on a random sample of 100 conversations from the held-out test set. The assessment was restricted to a single-turn response generation task, providing a critical and interpretable baseline for meaningful future comparisons before progressing to the added complexity of multi-turn interaction.
3.4.1. Automatic Evaluation
Automatic evaluation typically compares the model-generated response against a golden reference response using predefined metrics to assess various attributes of text quality. We employed a suite of automatic evaluation metrics on the test set, each targeting a distinct aspect of text quality: perplexity (PPL) for intrinsic fluency, BLEU [
60] for lexical overlap with reference data, and BERTScore [
61] for semantic similarity via contextual embeddings.
Perplexity (PPL) is an intrinsic metric for evaluating language models and is commonly used to assess the fluency of generative LLMs. It is derived from the probability that the model assigns to each token in a sequence given the preceding tokens. In other words, perplexity measures how confident the model is in its predictions, where lower perplexity scores indicate better language modeling performance and higher scores indicate greater model uncertainty. Perplexity is calculated as the exponentiated average negative log-likelihood of a sequence:
$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i)\right)$$
where $N$ is the sequence length, and $P(w_i)$ is the model’s predicted probability of the $i$-th token.
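As a small worked illustration of this definition, the sketch below computes perplexity from a list of per-token natural-log probabilities. The function name and the input format are assumptions; in practice the log-probabilities would come from the evaluated language model.

```python
import math


def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood of one sequence."""
    if not token_log_probs:
        raise ValueError("perplexity is undefined for an empty sequence")
    average_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(average_nll)
```

A model that assigns probability 0.25 to every token has a perplexity of 4, which matches the intuition of choosing uniformly among four options at each step.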
BLEU [
60] (Bilingual Evaluation Understudy) is an n-gram-based metric originally designed for evaluating the quality of machine-translated text against one or more reference human translations. Despite capturing only surface-level lexical similarity, BLEU has been widely used in evaluating empathetic response generation [
37,
39,
45,
46], where improvements in the BLEU score suggest better alignment with the characteristic linguistic patterns of empathetic expressions in the training data. BLEU is computed as the geometric mean of the modified precision, weighted by a uniform weighting factor and multiplied by a brevity penalty (BP) to penalize overly short outputs:
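$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
where $p_n$ is the modified n-gram precision, $w_n$ denotes the weight for each n-gram order (uniform, $w_n = 1/N$), and $BP$ is the brevity penalty.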
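The BLEU computation described above can be sketched for a single candidate and a single reference. This is a simplified illustration rather than the implementation used in the study: it assumes pre-tokenized input, uniform weights, no smoothing, and the standard brevity penalty (1 when the candidate is longer than the reference, exp(1 - r/c) otherwise). The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it occurs in the reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean is zero if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)
```

Without smoothing, short responses that share no 4-gram with the reference score exactly zero, which is one reason library implementations usually offer smoothing options.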
BERTScore [
61] is an automatic evaluation metric for text generation tasks. It utilizes contextual embeddings from a pre-trained BERT model to compute the similarity score between a generated text and a reference text, capturing nuanced semantic relationships beyond surface-level n-gram matching. BERTScore has demonstrated better correlation with human judgment in text generation tasks compared to common automatic metrics, including BLEU and ROUGE [
61,
62]. The BERTScore is represented by its F1 value derived from precision (P) and recall (R) according to the following formula:
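$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$
where precision ($P$) and recall ($R$) are soft counterparts of the classical definitions (true positives divided by the sum of true positives and false positives, and true positives divided by the sum of true positives and false negatives, respectively): each token in the generated text (for $P$) or in the reference text (for $R$) is matched to its most similar token in the other text by the cosine similarity of their contextual embeddings, and the resulting similarities are averaged.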
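To make the F1 computation concrete, the sketch below applies the greedy matching used by BERTScore to toy vectors. It is an illustration only: real BERTScore takes contextual embeddings from a pre-trained BERT model and optionally applies idf weighting and baseline rescaling, none of which is modeled here, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_f1(candidate_vectors, reference_vectors):
    """Greedy-matching precision, recall, and F1 over token vectors."""
    if not candidate_vectors or not reference_vectors:
        return 0.0
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(cosine(c, r) for r in reference_vectors)
                    for c in candidate_vectors) / len(candidate_vectors)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(max(cosine(r, c) for c in candidate_vectors)
                 for r in reference_vectors) / len(reference_vectors)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With candidate vectors `[[1, 0], [0, 1]]` and a single reference vector `[[1, 0]]`, precision is 0.5 and recall is 1.0, so F1 is 2/3.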
3.4.2. Human Evaluation
To qualitatively assess model performance, we conducted two human evaluation tests:
an aspect-based rating test and
a pairwise preference test. We recruited 3 native Arabic speakers (1 male and 2 females) with a Bachelor’s degree and content writing expertise. Prior to evaluation, a screening test was conducted to ensure that evaluators shared a consistent understanding of the evaluation criteria. They were asked to rate responses without knowing which model or setting (zero-shot vs. fine-tuned) produced each output. We used Fleiss’ Kappa [
63] as a measure of inter-annotator agreement between the evaluators.
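For readers unfamiliar with Fleiss’ Kappa, a minimal sketch of its computation follows. The input format (one row per rated response, one column per rating category, each cell holding the number of evaluators who chose that category) and the function name are assumptions for illustration; with three evaluators and a 3-point scale, each row would sum to 3.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts."""
    if not counts:
        raise ValueError("at least one rated item is required")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Share of all ratings that fall into each category.
    p_category = [sum(row[j] for row in counts) / (n_items * n_raters)
                  for j in range(n_categories)]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_category)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)
```

A value of 1 indicates perfect agreement, and a value of 0 indicates agreement no better than chance.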
In the aspect-based rating test, we generated responses from each model for the contexts of the sampled conversations—in zero-shot and fine-tuning settings—and then asked the evaluators to judge each generated response based on three criteria:
Empathy: To what extent does this response show emotional understanding and compassion for the speaker’s experience?
Relevance: To what extent is this response relevant to the context of the conversation?
Fluency: To what extent is this response readable and fluent?
For each criterion, a response was rated on a 3-point Likert [
64] scale. Following [
65], we opted for a Likert scale with three points instead of five. This is because five-point ratings are highly likely to vary between different individuals, resulting in low inter-annotator agreement. In the
pairwise preference test, each context of the sampled conversations was paired with two responses generated by each of the fine-tuned models, and the evaluators were asked to select the best response in terms of
empathy. Ties were allowed if two responses were perceived as equally empathetic.
4. Results and Discussion
In this section, the results of both the automatic and human evaluations are presented and analyzed, comparing the performance of AceGPT-chat and Jais-chat in generating empathetic responses under zero-shot and fine-tuning settings. In addition, we present a case study with an error analysis, offering a qualitative in-depth examination of model outputs and showcasing their strengths and limitations in practical scenarios.
4.1. Automatic Evaluation
After fine-tuning, the perplexity, BLEU, and BERTScore were calculated on the test set. The results are reported in Table 7. AceGPT-chat performed competitively with Jais-chat. Its lower perplexity score indicates better fluency in its generated responses. Additionally, the slightly higher BLEU score indicates a stronger match with the human reference responses. Similarly, the BERTScore demonstrates that AceGPT-chat has slightly better semantic similarity with human responses than Jais-chat.
Note: Under zero-shot settings, both models sometimes produced English, mixed-language, or empty responses, despite being explicitly prompted to answer in Arabic. This could be attributed to their bilingual training. Specifically, 19% of AceGPT-chat responses contained mixed Arabic–English text. For Jais-chat, 8% of the responses were in English, and fewer than 2% were empty responses. Thus, automatic metrics were not computed for the zero-shot setting, since non-Arabic or empty outputs cannot be meaningfully compared with the Arabic reference texts.
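One simple way to flag such responses automatically is to compare the share of Arabic and Latin letters in each output. The sketch below is a hypothetical heuristic, not the procedure used in this study: the 0.9 threshold, the label names, and the treatment of any Latin-script text as English are all assumptions.

```python
def classify_response(text, threshold=0.9):
    """Label a response as 'arabic', 'english', 'mixed', or 'empty'
    from the proportion of Arabic versus Latin letters."""
    arabic = sum(1 for ch in text if "\u0621" <= ch <= "\u064A")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = arabic + latin
    if total == 0:
        return "empty"  # no letters at all (blank, digits, or symbols only)
    arabic_share = arabic / total
    if arabic_share >= threshold:
        return "arabic"
    if arabic_share <= 1 - threshold:
        return "english"  # assumption: Latin script is treated as English
    return "mixed"
```

Counting the labels over all zero-shot outputs would yield percentages of the kind reported in the note above.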
4.2. Human Evaluation
Table 8 and
Table 9 show the average human ratings in the
aspect-based rating test for AceGPT-chat and Jais-chat, respectively. In zero-shot testing, both models showed comparable levels of empathy, with AceGPT-chat slightly ahead. Similarly, AceGPT-chat demonstrated better context relevance, while Jais-chat was slightly better in fluency. However, we observed that both models exhibited repetitive patterns of empathy, often generating generic responses such as “I am sorry to hear that,” as showcased in
Section 4.3. After fine-tuning, both models clearly showed a significant improvement in empathy, with Jais-chat performing slightly better. Likewise, fine-tuning led to improved context relevance for both models, with Jais-chat improving the most across all aspects. Lastly, fluency in both models improved the least after fine-tuning, indicating that both models were inherently able to generate fluent and coherent language.
For the
pairwise preference test, the results in
Table 10 show that the responses of AceGPT-chat after fine-tuning, in terms of empathy, were preferred by a large margin over those generated by the Jais-chat model. However, the relatively high tie rate suggests that the Jais-chat model is still competitive in many cases.
In general, the results of both the automatic and human evaluation tests highlight both models as robust baselines for Arabic conversational AI. The AceGPT-chat and Jais-chat models were highly competitive in their ability to generate more empathetic, contextually relevant, and fluent responses after fine-tuning with the
AEConvs dataset compared to their performance in zero-shot settings. This indicates the benefit of fine-tuning LLMs with a good-quality, task-specific dataset, even when the dataset is relatively small. In addition, the performance advantage of AceGPT-chat over Jais-chat may be attributable to its training being targeted at localization and alignment with Arabic cultural nuances and values, as evidenced by its higher score on the Arabic Cultural and Value Alignment (ACVA) benchmark [
12] (see
Table 4). This underscores the critical role of culturally tailored datasets in enhancing LLMs’ performance for region-specific applications.
4.3. Case Study
Table 11 shows selected conversation contexts from the
AEConvs dataset, providing the gold-standard human responses compared to the responses generated by the AceGPT-chat and Jais-chat models under both zero-shot and fine-tuning settings. Clearly, fine-tuning both models with
AEConvs consistently improved the empathy of the generated responses across all cases. As demonstrated by the first case, the AceGPT-chat model generated a more convincing, empathetic response after fine-tuning than in zero-shot settings, exhibiting a spiritual perspective and offering comfort by suggesting that everything happens at the right time according to God’s plan. The Jais-chat model, on the other hand, generated similar responses before and after fine-tuning, and these responses resemble generic empathetic expressions such as “I am sorry to hear that,” as well as suggesting a medical consultation.
In the second case, both the AceGPT-chat and Jais-chat models under zero-shot settings showed brief responses that only acknowledged the speaker’s feelings without offering any additional insights or encouragement. After fine-tuning, both models generated longer, more engaging responses, with AceGPT-chat closely mimicking the optimistic tone of a gold-standard human response. Similarly, Jais-chat’s response was more empathetic and constructive, acknowledging the challenges of moving to a new house while framing the change as an opportunity for growth and learning. It provides a balanced perspective that combines understanding with encouragement.
Lastly, the third case shows brief and generic responses from both models under zero-shot settings that lack depth and fail to meaningfully engage with the speaker. In contrast, the responses generated by both models after fine-tuning are more empathetic and engaging. The AceGPT-chat model again mirrors the gold-standard human response by emphasizing the importance of gratitude in Islamic values and encouraging such acts. Jais-chat’s response, conversely, offers acknowledgment and asks for further engagement with the speaker, promoting a deeper and more interactive conversation.
4.4. Error Analysis
Table 12 shows a few cases of erroneous model responses where the fine-tuned models fail to generate empathetic, coherent, and context-relevant responses. Jais-chat’s response in the first case, despite appearing empathetic on the surface, offers logically impossible advice: getting sleep before going to bed is temporally incoherent, since sleep occurs after going to bed. This pattern, also observed in AceGPT-chat, suggests that the models have learned to recognize emotional cues but struggle to generate factual and contextually specific responses. The second case demonstrates that the AceGPT-chat model may be hallucinating conversation history, a manifestation of the broader hallucination problem in large language models. Although very rare, this hallucination error was observed in Jais-chat’s responses as well. In the third case, AceGPT-chat gave a detailed medical diagnosis beyond its competence. This can be attributed to optimization for helpfulness: such models may interpret symptom descriptions as requests for explanation, leading to diagnostic responses even when clinically inappropriate. Jais-chat exhibited similar behavior in a few responses. Surprisingly, the last case shows similar responses from both models, as they claimed to be able to fix the speaker’s broken phone. These responses, despite being empathetic, commit a credibility fallacy, where a conversational agent falsely claims personal experience of a phenomenon it cannot physically or cognitively experience [
5]. This type of error could diminish the perceived authenticity and empathy of conversational models.
5. Conclusions
Empathy plays a fundamental role in human interactions, providing emotional support and improving communication. Understanding how perceived empathy affects interactions with conversational systems is essential for their advancement, as it directly shapes user trust, engagement, and long-term adoption. These advances have the potential to enable transformative applications in fields such as clinical psychology, education, and customer service. While numerous studies have explored empathetic responses in English, the development of such models for the Arabic language has not been addressed in the existing literature. This is largely attributable to the scarcity of high-quality open-domain conversational data and the challenges of curating new datasets. The cultural diversity across human societies leads to significant variations in how empathy is expressed and perceived. The proposed dataset, Arabic Empathetic Conversations (AEConvs), serves as a new benchmark that provides valuable insights into how humans express emotions and empathetic reactions in conversations, specifically within Arabic culture. In addition, we leveraged AEConvs to evaluate the empathetic response capacity of two powerful generative Arabic LLMs, AceGPT-chat and Jais-chat. Empirical human evaluation results demonstrate that, despite the relatively small dataset, both models have the potential to generate better empathetic responses after fine-tuning compared to their zero-shot performance. Moreover, the automatic evaluation results indicate improvements in the capability of both models to generate more fluent and context-relevant responses after fine-tuning. Building on our initial findings, we aim to explore and compare the performance of other generative Arabic LLMs and test their prompt-based in-context learning capabilities to generate empathetic responses.
Moreover, we plan to investigate methods for generating more fine-grained empathetic responses by leveraging additional contextual features of conversational data, such as emotion labels and dialogue acts.
Our primary contribution lies in establishing reproducible baselines for this task, highlighting both the potential and the limitations of current approaches to computational empathy. We also acknowledge several limitations of this study. Due to resource constraints, only three evaluators were used for the human evaluation. This small evaluation pool, although common in exploratory research, limits the generalizability of our findings. Furthermore, single-turn evaluation, while useful for initial benchmarking, overlooks critical dimensions of genuine empathetic interaction. This may overestimate or misrepresent the models’ true empathetic capabilities. Empathy is a multidimensional concept. Consequently, its evaluation in conversational systems encompasses a broad range of technological, psychological, and ethical considerations [
5]. There is currently no widely accepted method for assessing empathy in conversational systems. The reliance on heterogeneous evaluation frameworks—combining subjective human judgments with automatic metrics that capture only surface-level text quality—creates a significant barrier to the fair and objective comparison of conversational AI systems. We acknowledge that the complexity of human empathy—encompassing cognitive, affective, and cultural dimensions—cannot be fully captured using only the automated and human evaluation metrics employed here. Moreover, while both fine-tuned models demonstrate improved alignment with cultural and religious norms in Arabic responses—as illustrated in
Section 4.3—this observation remains qualitative. The lack of established evaluation protocols for cultural alignment underscores a critical methodological gap in conversational AI, especially for low-resource languages such as Arabic. Our initial findings, therefore, indicate both the potential of cultural fine-tuning and the need for standardized metrics to systematically validate cultural nuance. In conclusion, this work highlights the importance of empathy in Arabic conversational systems and underscores the need for more robust resources in terms of datasets, pre-trained LLMs, and systematic evaluation frameworks to drive progress in this area.