This section presents the experiments and results for the proposed approaches: Emotion-Driven Response Generation (EDRG) and EmoLlama. It covers dataset collection, preparation, training procedures, and a comparative analysis of model performance.
4.2. Experimental Results
The marBERT model, fine-tuned for emotion classification, demonstrated a significant improvement after balancing the dataset.
Table 2 summarizes its performance, showing accuracy increasing from 77.52% to 85.19% and F1-score from 77.50% to 85.59%. These results highlight the importance of dataset balancing in improving classification accuracy and generalization.
To provide a more detailed class-wise evaluation of the marBERT classifier, Table 3 and Figure 9 present the row-normalized confusion matrix computed on the human-generated test set. The results show that the model achieved the highest recognition rates for Neutral (98.75%), Disgust (94.04%), and Surprise (90.71%), while maintaining strong performance for Anger (81.39%) and Fear (81.98%). The lowest recognition rates were observed for Sadness (73.91%) and Happiness (75.01%). Furthermore, the confusion matrix reveals that the largest proportion of misclassifications occurred between these two emotion categories, with 17.90% of Sadness samples classified as Happiness and 15.20% of Happiness samples classified as Sadness. Overall, these results provide additional insight into the classifier’s behavior across individual emotion categories and complement the overall performance metrics reported in Table 2.
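For readers unfamiliar with row normalization, the following minimal sketch shows how such a matrix can be computed: each row (true class) is divided by its own total, so every cell reads as the percentage of that class's samples assigned to each predicted class. The function name and the toy labels are illustrative assumptions and are not taken from the study's code or data.

```python
from collections import Counter

def row_normalized_confusion(y_true, y_pred, labels):
    """Return {true_label: {pred_label: percent}}; each non-empty row sums to 100."""
    counts = Counter(zip(y_true, y_pred))
    matrix = {}
    for t in labels:
        row_total = sum(counts[(t, p)] for p in labels)
        matrix[t] = {
            p: (100.0 * counts[(t, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix

# Toy labels for illustration only; these are not the paper's data.
labels = ["sadness", "happiness"]
y_true = ["sadness", "sadness", "sadness", "sadness", "happiness", "happiness"]
y_pred = ["sadness", "sadness", "sadness", "happiness", "happiness", "sadness"]

cm = row_normalized_confusion(y_true, y_pred, labels)
# Share of true-sadness samples that were predicted as happiness.
print(cm["sadness"]["happiness"])
```

Because each row is normalized independently, the diagonal entries correspond to per-class recognition rates, which is how the percentages above should be read.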
The dataset was used to train both the emotion classification and response generation models. Preprocessing included emoji removal, normalization, and segmentation, with tokenization performed using model-specific tokenizers. For EDRG, the balanced dataset was split into training (70%), validation (15%), and testing (15%) sets using stratified sampling.
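The 70/15/15 stratified split can be sketched as follows. This is a minimal standard-library illustration that assumes one emotion label per sample; the function name, the seed, and the rounding rule are assumptions made for the example, not details reported in the study.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split sample indices into train/val/test, keeping label shares roughly equal."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy balanced label list for illustration only.
toy_labels = ["happiness"] * 20 + ["sadness"] * 20
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting within each label group, rather than over the whole dataset at once, is what keeps the class proportions of the balanced dataset intact in all three subsets.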
During fine-tuning, we performed a grid search over hyperparameters such as batch size, learning rate, and epochs to optimize classification accuracy. Early stopping with a patience parameter of two epochs was applied to prevent overfitting. The model outputs logits for seven emotion classes: happiness, sadness, anger, fear, disgust, surprise, and neutral. It then selects the class with the highest probability.
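The class selection and the early-stopping rule described above can be illustrated with a short sketch. It assumes that early stopping monitors validation loss, which the text does not state, and the helper names and example values are hypothetical.

```python
import math

EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise", "neutral"]

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_emotion(logits):
    """Return the emotion with the highest probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs[best]

def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training stops.

    Assumption: the monitored quantity is validation loss (lower is better).
    """
    best = math.inf
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses)

# Illustrative values only.
label, prob = predict_emotion([0.1, 2.5, 0.3, 0.0, -1.0, 0.2, 1.0])
stop = stopping_epoch([0.9, 0.7, 0.72, 0.71, 0.6])
print(label, stop)
```

In the second call, the loss fails to improve on its best value for two consecutive epochs (epochs 3 and 4), so training stops before the later improvement is seen; this is the usual trade-off of a small patience value.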
Decoding combined beam search (num_beams = 6) with top-k sampling (top_k = 50, do_sample = True) to balance coherence and diversity in the generated responses.
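The parameter names above match the keyword arguments of the `generate` method in Hugging Face Transformers, so the configuration could be expressed as in the sketch below. That the study used this library is an assumption, and the `model` and `inputs` objects in the commented line are hypothetical.

```python
# Decoding settings reported in the text, expressed as keyword arguments.
generation_kwargs = {
    "num_beams": 6,     # number of beams kept during beam search
    "do_sample": True,  # sample tokens instead of always taking the most likely one
    "top_k": 50,        # restrict sampling to the 50 most likely tokens
}

# Assumed usage with a Hugging Face model and tokenized inputs (not run here):
# output_ids = model.generate(**inputs, **generation_kwargs)
```

With `num_beams` greater than 1 and `do_sample` enabled, Transformers performs beam-search multinomial sampling, so the beams are extended by sampling rather than by a purely greedy expansion.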
Model evaluation was conducted on a human-generated test subset to assess generalization to real-world user inputs.
Five models were evaluated for response generation. These included four Arabic LLMs (AraBERT, AraELECTRA, AraGPT-2, and MT5) and the EmoLlama retrieval-augmented model, which was based on qwen2:7b-text embeddings.
Table 4 presents the BLEU and Cosine Similarity scores for each model. AraBERT and AraELECTRA demonstrated the strongest performance among the LLMs, achieving BLEU scores of 0.57 and 0.56 and Cosine Similarity of 0.51 each. MT5 showed the weakest performance, with a BLEU score of 0.28 and Cosine Similarity of 0.29. While EmoLlama had a lower BLEU score (0.34), it significantly outperformed all other models in terms of semantic similarity, achieving a Cosine Similarity score of 0.91. This result underscores the effectiveness of retrieval-augmented generation in capturing user intent and producing semantically rich responses.
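For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their norms. The sketch below assumes that the generated and reference responses have already been embedded as numeric vectors; which embedding model was used for this metric is not specified here, and the toy vectors are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Unlike BLEU, which counts overlapping n-grams, this measure is insensitive to exact wording, which is consistent with a model scoring low on BLEU while scoring high on cosine similarity.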
To further understand the effectiveness of each model across specific emotions, we analyzed the BLEU scores per emotion category.
Table 5 presents the results, highlighting how AraBERT and AraELECTRA models consistently outperformed others across most emotion types.
The results indicate that the highest BLEU score was observed for the anger emotion using the AraELECTRA model (0.64), followed closely by AraBERT (0.63). AraBERT outperformed other models in fear, happiness, and surprise, making it particularly effective for a broader range of emotions. AraELECTRA achieved similar results, especially in sadness and neutral, matching AraBERT with a BLEU score of 0.57 for sadness. In contrast, AraGPT-2 and MT5 showed significantly lower scores across all emotions, highlighting the superiority of the fine-tuned encoder-based models.
These emotion-specific BLEU scores support the core design of the Emotion-Driven Response Generation (EDRG) approach, where the system dynamically selects the optimal response model based on the detected emotion in the user input. For instance, if the detected emotion is fear, the system will prioritize AraBERT as the generator due to its superior performance for that emotion. In cases where two models exhibit similar performance (e.g., AraBERT and AraELECTRA for sadness), the system breaks the tie using an additional metric, the sentiment match percentage (introduced in the next paragraph), computed between the user input and the generated response. This mechanism ensures both emotional relevance and contextual coherence in real-time response generation.
To enhance the emotion-alignment evaluation, we introduce the sentiment match percentage, which measures how well the sentiment of the generated response aligns with the user’s original input. This metric is particularly useful when two models yield similar BLEU scores for the same emotion category.
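One plausible reading of this metric is the share of input-response pairs whose sentiment labels agree. The sketch below follows that reading; it assumes the sentiment labels have already been produced by some sentiment classifier, and the function name and toy labels are hypothetical rather than taken from the study.

```python
def sentiment_match_percentage(input_sentiments, response_sentiments):
    """Percentage of pairs where the response sentiment equals the input sentiment."""
    if len(input_sentiments) != len(response_sentiments):
        raise ValueError("both lists must have the same length")
    if not input_sentiments:
        return 0.0
    matches = sum(a == b for a, b in zip(input_sentiments, response_sentiments))
    return 100.0 * matches / len(input_sentiments)

# Toy labels for illustration only.
inputs = ["negative", "negative", "positive", "neutral"]
responses = ["negative", "positive", "positive", "neutral"]
pct = sentiment_match_percentage(inputs, responses)
print(pct)
```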
As shown in Table 6, although AraBERT and AraELECTRA achieve similar BLEU scores, the sentiment match percentage provides further insight. For instance, while both models achieved a BLEU score of 0.57 for sadness, AraELECTRA showed a higher sentiment match (47.46%) than AraBERT (38.21%). Such comparisons allow the system to make informed decisions when choosing between models with otherwise similar performance.
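The selection logic with tie-breaking can be sketched as follows, using only the scores quoted in the text for anger and sadness. The `tolerance` that decides when two BLEU scores count as "similar" is a hypothetical parameter; the text does not specify a threshold, and the function name is likewise an assumption.

```python
# Scores quoted in the text; other emotions are omitted rather than guessed.
BLEU = {
    "anger": {"AraBERT": 0.63, "AraELECTRA": 0.64},
    "sadness": {"AraBERT": 0.57, "AraELECTRA": 0.57},
}
SENTIMENT_MATCH = {
    "sadness": {"AraBERT": 38.21, "AraELECTRA": 47.46},
}

def select_generator(emotion, bleu=BLEU, sentiment_match=SENTIMENT_MATCH, tolerance=0.005):
    """Pick the model with the best BLEU; break near-ties by sentiment match.

    `tolerance` is an assumed threshold for treating BLEU scores as tied.
    """
    scores = bleu[emotion]
    top = max(scores.values())
    candidates = [m for m, s in scores.items() if top - s <= tolerance]
    if len(candidates) == 1:
        return candidates[0]
    matches = sentiment_match.get(emotion, {})
    return max(candidates, key=lambda m: matches.get(m, 0.0))

anger_choice = select_generator("anger")
sadness_choice = select_generator("sadness")
print(anger_choice, sadness_choice)
```

For anger, the BLEU gap exceeds the assumed tolerance, so the higher-scoring model is chosen directly; for sadness, the BLEU scores are equal and the sentiment match percentage decides.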
In Table 7, the EmoLlama model demonstrates strong sentiment alignment for Neutral and Surprise, supporting its effectiveness in retrieval-augmented response generation. However, the model shows moderate alignment for other emotions, suggesting that while semantic relevance is high, emotional matching can still be further optimized.
In addition to evaluating BLEU scores and sentiment match percentages, a deeper analysis was conducted to examine how accurately each model matched the target emotion of the user input. This was done by computing the percentage of correctly matched emotions, as well as identifying the most frequent mismatches for each emotion category.
Table 8, Table 9, and Table 10 present the percentages of correctly matched and mismatched emotions for the AraBERT, AraELECTRA, and EmoLlama models, respectively.
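A minimal sketch of this analysis is given below: for each target emotion, it reports the percentage of responses whose detected emotion matches the target and the most frequent mismatched emotion. It assumes each response has already been labeled with an emotion by a classifier; the function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def emotion_match_report(target_emotions, response_emotions):
    """Per target emotion: matched percentage and the most frequent mismatch (or None)."""
    per_target = defaultdict(Counter)
    for target, produced in zip(target_emotions, response_emotions):
        per_target[target][produced] += 1

    report = {}
    for target, produced_counts in per_target.items():
        total = sum(produced_counts.values())
        mismatches = Counter({e: c for e, c in produced_counts.items() if e != target})
        top = mismatches.most_common(1)
        report[target] = {
            "matched_pct": 100.0 * produced_counts[target] / total,
            "top_mismatch": top[0][0] if top else None,
        }
    return report

# Toy labels for illustration only; these are not the paper's data.
targets = ["happiness"] * 4 + ["neutral"] * 2
produced = ["happiness", "surprise", "surprise", "neutral", "neutral", "neutral"]
report = emotion_match_report(targets, produced)
print(report["happiness"])
```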
This analysis reveals common confusion patterns among models. For instance, the emotion Happiness was frequently mismatched with Surprise and Neutral, particularly in the MT5 and AraGPT-2 models, highlighting the difficulty of distinguishing between closely related emotional tones. Conversely, Neutral and Anger were consistently recognized with higher accuracy across most models.
These insights are valuable for understanding the limitations of each model and guiding future improvements in emotion recognition and generation. Importantly, high BLEU and Cosine scores do not necessarily imply that the predicted responses align with the user’s intended emotion. Measuring this alignment provides a complementary perspective, as demonstrated in Table 8, Table 9, and Table 10, and highlights a critical avenue for future work.
Overall, AraBERT and AraELECTRA demonstrated superior performance for emotion-aligned response generation. Meanwhile, EmoLlama excelled in maintaining semantic consistency and adapting to multi-turn conversations, highlighting its strength in real-world conversational AI applications where contextual grounding is critical. This comparison underscores the complementary nature of both approaches in addressing the challenges of empathetic dialogue systems for Arabic.
4.3. Ethical Considerations
The proposed chatbot frameworks raise several ethical considerations that should be acknowledged. While the Emotion-Driven Response Generation (EDRG) approach and EmoLlama employ different response generation mechanisms, both systems interact with users in emotionally sensitive contexts and may influence user perceptions and decision-making.
First, the EDRG approach relies on automatic emotion classification to guide response generation. Although the emotion classifier achieved strong performance, misclassification remains possible and may lead to responses that do not appropriately reflect the user’s actual emotional state. This limitation is particularly important in sensitive domains where emotional understanding is critical.
Second, emotional alignment does not necessarily imply that the generated response should mirror the user’s detected emotion. For example, a response to sadness may be more effective when providing support and encouragement rather than simply reproducing the same emotional tone. Therefore, future work should further investigate the relationship between detected emotions and the most contextually appropriate response strategies.
Third, both EDRG and EmoLlama may be affected by biases present in the training and evaluation data. The datasets used may not fully represent all Arabic dialects, cultural backgrounds, and communication styles. As a result, system performance may vary across different user groups. Expanding emotion datasets and improving dialectal coverage are important steps toward reducing potential biases and improving fairness.
Finally, the proposed systems are intended as conversational support tools and should not be considered substitutes for professional medical, psychological, or counseling services. Human oversight remains essential in high-risk applications, particularly those involving mental health or sensitive personal situations. In addition, appropriate user awareness and transparency measures should be maintained, as AI-generated responses may occasionally contain inaccuracies or inappropriate interpretations.