MENARA: Medical Natural Arabic Response Assistant

Ibrahim, Ahmed; Hosseini, Abdullah; Helmy, Hoda; Arabi, Maryam; AlShareef, Aya; Lakhdhar, Wafa; Serag, Ahmed

doi:10.3390/make8040110

Open Access Article

by Ahmed Ibrahim, Abdullah Hosseini, Hoda Helmy, Maryam Arabi, Aya AlShareef, Wafa Lakhdhar and Ahmed Serag *

AI Innovation Lab, Weill Cornell Medicine—Qatar, Doha P.O. Box 24144, Qatar

* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(4), 110; https://doi.org/10.3390/make8040110

Submission received: 6 March 2026 / Revised: 14 April 2026 / Accepted: 17 April 2026 / Published: 21 April 2026

(This article belongs to the Special Issue Advancing Natural Language Processing for Low-Resource Languages and Dialects)

Abstract

Dialectal variation presents a major challenge for deploying medical language models in real-world healthcare settings, where patient–clinician communication often occurs in regional vernaculars rather than standardized language forms. This challenge is particularly pronounced in the Arabic-speaking world, where clinical interactions frequently take place in diverse dialects that differ substantially from Modern Standard Arabic. Fine-tuning and maintaining separate models for each dialect is computationally inefficient and difficult to scale, motivating more integrated approaches. In this work, we present MENARA, an Arabic medical language model constructed by merging Egyptian Arabic, Moroccan Darija, and medical-domain specialist models. We extend prior feasibility findings through comprehensive evaluation of cross-dialect performance, medical safety, and cross-lingual knowledge retention. Specifically, we introduce a fine-grained dialect composition analysis to quantify lexical purity and structured code-switching behavior, benchmark against state-of-the-art Arabic LLMs, and conduct subject-matter-expert assessment of both dialectal fidelity and medical appropriateness. The results show that model merging preserves core medical competence while enabling robust dialectal adaptation, achieving strong cross-dialect fidelity while substantially reducing storage and deployment overhead compared to maintaining separate models. These findings establish model merging as a potentially practical and resource-efficient paradigm for dialect-aware medical NLP in linguistically fragmented healthcare environments.

Keywords:

Artificial Intelligence (AI); large language models (LLMs); Natural Language Processing (NLP); model merging; Arabic dialects

1. Introduction

Deploying medical language technologies in healthcare systems presents challenges that extend beyond conventional multilingual Natural Language Processing (NLP), particularly in regions where communication occurs through diverse dialectal varieties rather than standardized language forms. In the Arabic-speaking world, this issue becomes especially pronounced. Unlike languages with relatively standardized spoken forms, Arabic exists as a continuum of regionally grounded dialects that often diverge substantially in vocabulary, phonology, and syntax. In clinical practice, patients rarely communicate in Modern Standard Arabic (MSA); instead, they rely on colloquial dialects that may be only partially intelligible outside their region. This linguistic reality complicates the safe deployment of medical large language models (LLMs), particularly in high-stakes healthcare settings [1,2]. Dialects such as Egyptian Arabic and Moroccan Darija exhibit marked phonological and lexical divergence, significantly reducing mutual intelligibility across regions [3]. When clinicians are unfamiliar with a patient’s dialectal nuances, critical symptom descriptions may be misinterpreted, potentially leading to diagnostic errors, inappropriate treatment decisions, or delayed care [4].

Traditional adaptation strategies for large language models rely heavily on dialect-specific fine-tuning [5,6,7]. While effective in controlled environments, this approach scales poorly in practice. Training and maintaining independent medical models for each dialect requires large annotated corpora, repeated computational investment, and ongoing version management. In healthcare systems that already operate under resource constraints, maintaining parallel dialect-specific models is operationally unrealistic. Moreover, such fragmentation increases the risk of inconsistent clinical behavior across model variants. Even parameter-efficient strategies such as LoRA or quantization do not eliminate the cumulative computational burden associated with maintaining multiple specialized systems [8,9]. Given the breadth of Arabic dialectal diversity, per-dialect fine-tuning is inherently difficult to scale.

An alternative perspective views dialectal specialization not as a separate training objective, but as a modular capability that can be integrated into a shared medical foundation. Recent advances in model merging provide a mechanism for combining independently fine-tuned models without retraining from scratch [10,11]. By aligning and consolidating task-specific parameter updates, merging enables a single unified architecture to inherit multiple specializations while mitigating task interference. This paradigm offers a promising pathway for constructing dialect-aware medical LLMs without incurring the computational and logistical costs of maintaining per-dialect fine-tuned models.

In this study, we propose MENARA (MEdical Natural Arabic Response Assistant), as illustrated in Figure 1. We investigate whether merging dialect-specialized Arabic models with a medical-domain language model can yield a unified system capable of robust cross-dialect clinical communication. Specifically, we integrate Egyptian Arabic and Moroccan Darija specialists with a medical-domain expert. Our goal is not merely to demonstrate dialect generation, but to evaluate whether linguistic adaptation can be achieved while preserving medical competence and foundational language knowledge.

Beyond validating dialectal performance, we examine three additional dimensions critical for real-world deployment. First, we introduce a fine-grained dialect composition analysis that quantifies lexical purity and code-switching behavior across generated outputs. Second, we benchmark the merged system against state-of-the-art Arabic LLMs to situate its performance within the broader ecosystem. Third, we evaluate catastrophic forgetting by testing whether merging degrades the model’s English medical capabilities. Together, these analyses move beyond proof-of-concept and toward a comprehensive assessment of scalability, safety, and retention.

Our central research question is therefore reformulated as follows: Can model merging produce a resource-efficient, clinically reliable, and linguistically adaptable Arabic medical LLM that maintains foundational knowledge while supporting cross-dialect communication?

Through systematic evaluation combining automated scoring, subject matter expert review, benchmarking comparisons, and deployment analysis, we demonstrate that merging provides a practical and scalable solution for dialect-aware medical NLP. This work contributes not only to Arabic healthcare NLP but also to broader efforts in modular specialization and resource-efficient model adaptation for linguistically fragmented domains.

A preliminary version of this work appeared in the Proceedings of ArabicNLP 2025 [12]. The present manuscript substantially extends the conference publication through several new analyses and evaluations. First, we construct a synthetic yet clinically constrained dataset containing symptom descriptions that remain linguistically consistent across multiple Arabic dialectal varieties, including MSA, Egyptian Arabic, and Moroccan Darija. Building on this dataset, we introduce a dialect composition analysis to quantify lexical purity and structured code-switching patterns across dialect outputs. We further benchmark the merged model against additional state-of-the-art Arabic large language models, including Jais, ALLaM, and Fanar, situating its performance within the broader Arabic model ecosystem. In addition to automated competence metrics, subject matter experts perform medical-focused evaluations to provide domain-level validation of the model’s outputs. Finally, we present an expanded discussion examining performance gradients across dialects, mechanisms for mitigating model interference, and practical deployment considerations in resource-constrained healthcare environments.

2. Related Work

2.1. Dialectal Variation in Arabic NLP

Arabic presents one of the most complex cases of intra-language variation in contemporary NLP. The language operates under a diglossic structure in which MSA coexists with more than thirty regional dialects that dominate daily communication [13,14,15]. These dialects differ substantially in lexical choice, morphological structure, and phonological realization, often resulting in limited mutual intelligibility across regions. Egyptian Arabic and Moroccan Darija, for example, can diverge sufficiently to require adaptation even among native speakers.

Despite this diversity, most Arabic NLP research has historically concentrated on MSA due to the relative availability of curated corpora and annotated datasets. In contrast, regional dialects remain comparatively low-resource, lacking the large-scale, high-quality data necessary to train robust language models [16]. As a result, models trained primarily on MSA frequently exhibit performance degradation when exposed to dialectal inputs. This gap is especially problematic in tasks requiring semantic precision, such as intent classification, question answering, and information extraction, where lexical substitution alone does not account for dialect-specific syntactic constructions and idiomatic expressions.

Efforts to address dialectal NLP have included dialect identification benchmarks, dialect-specific corpora, and shared tasks such as NADI, which expand evaluation to spoken and multidialectal contexts. However, many of these approaches treat dialect modeling as an isolated objective, resulting in fragmented adaptation strategies. In domains where consistency and safety are critical, including healthcare, maintaining separate dialect-specific systems can introduce operational complexity and increase the risk of inconsistent behavior across model variants.

2.2. Arabic Medical Language Processing

Medical NLP in Arabic remains comparatively underdeveloped relative to high-resource languages such as English. The central limitation is not only the scarcity of annotated clinical data but also the mismatch between available resources and authentic patient–clinician communication. Although MSA-based medical datasets exist, real-world clinical interactions occur predominantly in regional dialects, creating a gap between training data and deployment conditions.

Recent benchmark initiatives have begun addressing Arabic medical evaluation. The AraHealthQA 2025 shared task, for example, provides structured datasets for medical question answering across clinical domains [17]. Similarly, the Arabic Healthcare Dataset (AHD) offers a large-scale, professionally curated collection of medical question–answer pairs [18]. However, these resources are largely MSA-centric and do not capture the dialectal variability characteristic of patient-reported symptoms.

Other datasets derived from social media, such as ArCOV-19, incorporate informal dialectal language across multiple Arab regions [19]. While valuable for studying health discourse, such corpora often lack the clinical reliability required for medical decision-support systems. Additional dialect-focused corpora, including the Shami corpus for Levantine Arabic, provide linguistic coverage but are not specialized for healthcare tasks.

Model-based approaches have also been explored. For instance, Mohammad et al. [20] investigated fine-tuning large language models for Arabic medical dialogue. However, these strategies generally rely on further adaptation of domain models rather than integrating dialectal specialization within a unified architecture. As dialect coverage expands, repeated fine-tuning introduces computational and maintenance burdens that limit scalability.

Collectively, these constraints—data scarcity, dialect mismatch, and clinical safety considerations—underscore the need for alternative methodologies that enable dialect-aware medical NLP without requiring separate models for each linguistic variant.

2.3. Model Merging as Modular Adaptation

Model merging has recently emerged as an alternative to full retraining for combining specialized capabilities. Rather than jointly re-optimizing models from scratch, merging approaches consolidate parameter updates from independently fine-tuned systems into a shared architecture [21]. This paradigm enables modular specialization while mitigating task interference.

Several merging strategies have been proposed. Linear averaging, Fisher-weighted averaging [22], and other parameter-space integration methods [23] aim to preserve task-specific strengths while reducing conflicts across models. Recent analyses have further examined the scaling behavior and efficiency properties of merging compared to retraining [24]. These approaches offer compelling efficiency advantages, particularly in multilingual and low-resource settings where maintaining multiple specialized models would be computationally prohibitive [25].

In addition, robustness and bias mitigation are important considerations in machine learning systems [26,27]. In large language models, model merging can improve robustness and reduce bias by integrating independently trained models, thereby reducing reliance on a single training distribution and promoting more generalizable representations. Methods such as TIES address parameter interference through sparsification and sign alignment, improving stability across tasks and domains [28]. This can be viewed as parameter-space ensembling, enhancing robustness to noise and domain shift. However, these benefits are configuration-dependent, and merging does not explicitly enforce fairness at the output level, motivating complementary bias mitigation strategies.

Despite these advances, the application of model merging to high-stakes medical NLP under dialectal fragmentation remains underexplored. Questions concerning clinical reliability, dialectal authenticity, cross-dialect transfer, and knowledge retention have not been systematically examined within Arabic healthcare contexts. Our work addresses this gap by evaluating model merging not merely as a linguistic adaptation technique, but as a deployment-oriented strategy for scalable and clinically reliable dialect-aware medical NLP.

3. Methodology

3.1. Dataset

Developing and evaluating AI systems for dialectal medical language understanding is often constrained by the limited availability of dialect-specific clinical data and by strict data privacy regulations. To address these challenges, we constructed a synthetic but clinically grounded dataset using structured prompting with GPT-4.1-nano. This approach enables controlled, reproducible experimentation while ensuring full compliance with patient privacy requirements.

The final dataset comprises a total of 1050 symptom descriptions. Of these, 900 samples form the primary evaluation set, while an additional 150 samples are used as a supplementary noisy evaluation set designed to assess robustness under more realistic input conditions.

3.1.1. Primary Evaluation Set

To promote linguistic diversity while maintaining clinical consistency, the primary dataset consists of parallel symptom descriptions across three language varieties: MSA, Egyptian Arabic, and Moroccan Darija (300 samples per dialect, totaling 900 samples). All samples were generated using a unified prompting framework designed to enforce consistent medical content while varying only the target dialect. Specifically, the generation process was guided by a structured template that constrains symptom presentation, demographic attributes (e.g., age and gender), and clinical plausibility (see Figure 2).

Rather than allowing unconstrained free-form generation, prompts were carefully designed to ensure: (i) dialect-specific lexical and stylistic consistency, (ii) medically plausible symptom descriptions, and (iii) controlled variation across demographic factors. This design ensures that differences across dialect conditions are linguistic rather than semantic, enabling fair evaluation of cross-dialect robustness.

For dialect generation, the prompt’s language instruction was explicitly modified to request “authentic Egyptian colloquial Arabic” or “authentic Moroccan Darija expressions,” while preserving all other constraints. This ensures semantic alignment across dialects while capturing natural linguistic variation. Examples of the generated test data are shown in Figure 3.

Generation was performed using controlled decoding parameters (temperature = 0.2, top-p = 0.90) to balance variability and stability. To further ensure data quality, outputs were automatically filtered to remove instances containing direct medical advice, interrogative forms, or unintended code-switching into English. In addition, a random 10% subset from each dialect condition was manually reviewed by native speakers to verify dialectal consistency and clinical plausibility prior to finalizing the corpus.
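The automatic filtering step can be illustrated with a minimal sketch. The source does not describe how the filters were implemented, so everything below is hypothetical: the function name `keep_sample`, the regular expressions, and the short list of advice phrases are assumptions chosen only to show the three stated criteria (no direct medical advice, no interrogative forms, no code-switching into English). Latin-script detection is used as a rough proxy for English and would also flag French borrowings.

```python
import re

# Hypothetical filter rules; the actual implementation is not given in the source.
LATIN_RE = re.compile(r"[A-Za-z]{2,}")   # runs of Latin letters, a proxy for English code-switching
QUESTION_RE = re.compile(r"[?\u061F]")   # ASCII and Arabic question marks
# Illustrative advice phrases (assumed): "you must", "we advise you", "I advise you"
ADVICE_MARKERS = ("يجب عليك", "ننصحك", "أنصحك")


def keep_sample(text: str) -> bool:
    """Return True if a generated symptom description passes all three filters."""
    if LATIN_RE.search(text):
        return False  # contains Latin-script words
    if QUESTION_RE.search(text):
        return False  # phrased as a question
    if any(marker in text for marker in ADVICE_MARKERS):
        return False  # contains an advice phrase
    return True
```

A keyword list of this kind can only catch the most explicit advice phrasing, which is one reason a manual review pass by native speakers remains useful.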

3.1.2. Supplementary Noisy Evaluation Set

To assess robustness under more realistic input conditions, we constructed an additional noisy evaluation set comprising 150 symptom descriptions (50 per dialect). In this setting, the generation prompt was explicitly modified to introduce characteristics commonly observed in patient-reported inputs, including misspellings, informal abbreviations, fragmented or incomplete symptom descriptions, and unstructured code-switching.

All samples were evaluated using the same evaluation protocol described in Section 3.5.2, ensuring consistency with the primary evaluation setting.

3.2. Model Configuration

All experiments are conducted using the Gemma 2B architecture [29] as a shared backbone to ensure architectural compatibility across all components.

To combine clinical expertise with dialectal specialization, we integrate three independently fine-tuned checkpoints:

Medical Domain Model: a clinically aligned Gemma-based model trained for evidence-grounded medical reasoning [30].
Egyptian Arabic Specialist: a Gemma 2B variant fine-tuned in-house on the Egyptian-SFT-Mixture dataset [31] to model colloquial Egyptian lexical, syntactic, and morphological patterns.
Moroccan Darija Specialist: an instruction-tuned Gemma 2B model optimized for Moroccan Darija generation tasks [32].

All models share the same tokenizer and vocabulary inherited from Gemma 2B. This architectural alignment allows parameter-level consolidation without token realignment or vocabulary conflicts during merging. The code and associated resources for MENARA are publicly available at: https://github.com/serag-ai/MENARA, accessed on 13 April 2026.

For clarity, we denote the Gemma 2B backbone as Gemma, the Egyptian dialect model as EGY, and the Moroccan Darija model as DRJ.

3.3. Model Merging Strategy

We adopt a parameter-space integration strategy for multi-dialect consolidation using MergeKit [33]. Among the available approaches, we considered three representative methods, namely linear weight averaging, spherical linear interpolation (SLERP), and TIES (Trim, Elect Sign, & Sign-aware Merge) [28], each reflecting a distinct philosophy of model merging.

Linear weight averaging performs a direct weighted combination of model parameters. While computationally simple and easy to implement, it assumes compatibility between parameter updates and does not explicitly address potential task interference across independently fine-tuned checkpoints.

SLERP preserves the geometric relationship between parameter vectors by interpolating along a hyperspherical manifold. Although well-suited for pairwise interpolation, extending SLERP to multi-model merging introduces additional design considerations, such as sequential interpolation or ordering strategies.
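To make the contrast between these two methods concrete, the following sketch applies both to toy parameter vectors represented as plain Python lists. It is an illustration of the general techniques, not MergeKit's implementation; the function names `linear_merge` and `slerp` are chosen here for the example, and the vectors are assumed to be non-zero and of equal length.

```python
import math


def linear_merge(vectors, weights):
    """Weighted average of parameter vectors (weights are normalised to sum to 1)."""
    total = sum(weights)
    return [
        sum(w * v[k] for v, w in zip(vectors, weights)) / total
        for k in range(len(vectors[0]))
    ]


def slerp(a, b, t):
    """Spherical linear interpolation between two parameter vectors at fraction t."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)                # angle between the two vectors
    s = math.sin(omega)
    if s < 1e-8:
        # Nearly parallel or anti-parallel vectors: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]
```

Because `slerp` takes exactly two vectors, merging three checkpoints would require chaining calls such as `slerp(slerp(a, b, 0.5), c, 1 / 3)`, and the result then depends on the order in which the models are combined.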

TIES, in contrast, performs structured consolidation of task-specific parameter updates. It explicitly identifies and suppresses conflicting parameter directions while selectively integrating high-signal updates. This conflict-aware mechanism is particularly relevant when merging dialect-specialized and domain-specialized checkpoints that exhibit partially overlapping capabilities.

Given the presence of three independently fine-tuned models with non-trivial parameter divergence, we adopt TIES as our integration strategy (see Appendix B for an empirical comparison with linear averaging and SLERP). The following section details the formal parameter consolidation procedure.

3.4. TIES-Based Parameter Consolidation

We apply the TIES algorithm [28] in a training-free configuration using MergeKit to construct the final unified model. As described in [12], TIES performs structured parameter-space consolidation by integrating task-specific update vectors relative to a shared backbone model while suppressing conflicting directions.

Let $\theta_{med}$ denote the medical backbone parameters and $\theta_{i}$ represent each dialect-specialized model. For each specialist model, we compute a task update vector:

$$\Delta_{i} = \theta_{i} - \theta_{med}$$

To reduce noise and mitigate destructive interference, updates are sparsified using a fixed density of 0.6 (see Appendix A for an analysis across density values of 0.4, 0.6, and 0.8). Directional inconsistencies across retained parameters are resolved through sign agreement before weighted aggregation into the base model. The merged parameters are then obtained via:

$$\theta_{merged} = \theta_{med} + \sum_{i} w_{i} \Delta_{i}^{sparse}$$

where $w_{i}$ is the integration weight of specialist $i$. We assign higher weights to the dialect specialists (0.6) than to the medical backbone (0.4), which preserves dialectal signal and improves linguistic adaptation while maintaining clinical grounding.

All hyperparameters match those used in the Proceedings of ArabicNLP 2025 [12] to ensure comparability across studies. No additional fine-tuning or gradient updates are performed after merging.
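The consolidation procedure described in this section can be sketched on toy vectors as follows. This is a simplified illustration under stated assumptions, not the MergeKit code path: parameters are plain Python lists, the helper names `trim` and `ties_merge` are hypothetical, the sign is elected from the weighted sum of the sparsified updates, and the agreeing updates are summed as in the merged-parameter formula above rather than averaged or renormalised.

```python
def trim(delta, density):
    """Keep the top `density` fraction of entries by magnitude and zero the rest."""
    k = int(round(density * len(delta)))
    if k == 0:
        return [0.0] * len(delta)
    threshold = sorted((abs(d) for d in delta), reverse=True)[k - 1]
    # Entries tied with the threshold are all kept, so slightly more than k may survive.
    return [d if abs(d) >= threshold else 0.0 for d in delta]


def ties_merge(base, specialists, weights, density=0.6):
    """Sketch of TIES: trim each update, elect a sign per parameter, sum agreeing updates."""
    deltas = [[s - b for s, b in zip(spec, base)] for spec in specialists]
    sparse = [trim(d, density) for d in deltas]
    merged = []
    for k, b in enumerate(base):
        contribs = [w * d[k] for d, w in zip(sparse, weights)]
        elected = 1.0 if sum(contribs) >= 0 else -1.0       # dominant direction
        agreeing = [c for c in contribs if c * elected > 0]  # drop conflicting updates
        merged.append(b + sum(agreeing))
    return merged
```

For example, with a zero base vector, two specialists `[1.0, -2.0, 0.1, 3.0, 0.5]` and `[-3.0, -1.5, 2.0, 0.2, 1.0]`, and weights of 0.6 each, the first coordinate has conflicting signs; only the update whose sign matches the elected direction is kept there.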

3.5. Evaluation

3.5.1. Evaluation Concepts

We define three core evaluation concepts to assess the performance of the proposed framework:

Dialectal Fidelity refers to the degree to which a generated response conforms to the lexical and morphosyntactic norms of the target Arabic variety. It is evaluated on a 1–5 Likert scale using both LLM-based and human assessments under a structured rubric.

Medical Competence denotes the factual accuracy and clinical safety of a generated response. It is assessed by domain experts on the same 1–5 Likert scale, where lower scores are assigned to incomplete, misleading, or potentially harmful outputs.

Cross-Dialect Robustness captures the ability of the model to maintain consistent performance across dialect conditions. It is evaluated through score variance across dialects and further supported by dialect composition analysis, demonstrating stable dialect alignment without unintended leakage.

3.5.2. Evaluation Protocol

Model behavior was assessed using a multi-layer evaluation protocol comprising automated scoring, human linguistic assessment, medical expert review, cross-model benchmarking, and backbone-retention testing.

LLM-Based Dialect Scoring. Dialectal fidelity was evaluated using an external Arabic-capable foundation model (Qwen 3 Base) under a structured rubric. For each generated response, the evaluator assigned a rating on a five-point scale reflecting conformity to the requested dialect. Each dialect condition was evaluated over 300 prompts using identical scoring instructions (see Figure 4).

Dialect Composition Analysis. Beyond scalar scoring, we conducted a lexical attribution analysis to quantify dialectal distribution within model outputs. Using a structured LLM-as-judge protocol, each response was decomposed into proportional contributions from: the target dialect, MSA, other Arabic dialects, and non-Arabic tokens. This analysis enables measurement of dialect purity and unintended code-switching behavior, providing a more granular assessment of linguistic adherence.
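One way to aggregate such per-response attributions into a condition-level composition is sketched below. The judge's output format is not specified in the source, so the category keys and the function name `mean_composition` are assumptions made for illustration; each judgement is taken to be a mapping from category to a share, and shares are renormalised per response before averaging.

```python
CATEGORIES = ("target", "msa", "other_arabic", "non_arabic")  # assumed category keys


def mean_composition(judgements):
    """Average per-response shares; each judgement maps category -> share."""
    if not judgements:
        raise ValueError("at least one judgement is required")
    totals = {c: 0.0 for c in CATEGORIES}
    for judgement in judgements:
        share_sum = sum(judgement.get(c, 0.0) for c in CATEGORIES)
        if share_sum <= 0:
            raise ValueError("judgement has no positive shares")
        for c in CATEGORIES:
            # Renormalise so each response contributes shares that sum to 1.
            totals[c] += judgement.get(c, 0.0) / share_sum
    return {c: totals[c] / len(judgements) for c in CATEGORIES}
```

Renormalising per response keeps a judge output that sums to, say, 98 or 102 from skewing the average.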

Human Linguistic Evaluation. Native speakers of Egyptian Arabic and Moroccan Darija independently evaluated a subset of 30 outputs per dialect condition. Raters assessed linguistic naturalness and coherence using a five-point Likert scale. This assessment was designed to capture cultural authenticity and contextual fluency beyond model-based scoring.

Medical Expert Evaluation. To assess clinical safety and factual correctness, two medical subject matter experts (SMEs) evaluated a subset of generated responses using a five-point rubric ranging from 1 (Severely Incorrect/Dangerous) to 5 (Completely Correct and Nuanced). Evaluation criteria included factual accuracy, omission of critical details, identification of potential harm, and inclusion of appropriate safety caveats.

Cross-Model Benchmarking. To contextualize performance, we evaluated the merged model against publicly available Arabic LLMs (Jais-13B, ALLaM-7B, Fanar-7B) using the same LLM-based scoring protocol described above. All models were evaluated under identical conditions, including the same set of prompts, the same evaluator model (Qwen 3 Base), and the same scoring rubric.

Inter-run variability. To assess the reliability of the evaluation procedure, we conducted an inter-run consistency analysis using a random 10% subset of the dataset, which was independently evaluated in two separate runs under identical conditions.

3.6. Statistical Analysis

To test for differences between results, t-tests were used for normally distributed data, and the Mann–Whitney U test was used for non-normal distributions (normality was assessed using the Shapiro–Wilk test). Statistical significance was defined as p < 0.05.
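For readers unfamiliar with the non-parametric branch of this procedure, the sketch below computes a two-sided Mann–Whitney U test using only the standard library. It covers that branch only; the Shapiro–Wilk check and the t-test are not shown. The function name `mann_whitney_u` is chosen for this example, and the p-value relies on the tie-corrected normal approximation without continuity correction, which is an assumption and may differ slightly from what a statistics package reports for small samples.

```python
import math


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j for tied values
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2.0))
```

Midranks are used because Likert-style scores contain many ties, and the tie term shrinks the variance accordingly.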

4. Results

4.1. Qualitative Cross-Dialect Generation

Figure 5 presents representative cross-dialect generation examples from the merged model. The examples illustrate the model’s ability to produce coherent and dialectally appropriate responses when the input and output varieties differ, including (A) MSA prompt → Egyptian Arabic response, (B) Egyptian Arabic prompt → Moroccan Darija response, and (C) Moroccan Darija prompt → MSA response.

These examples qualitatively demonstrate the effectiveness of the TIES-based consolidation in enabling controlled dialect switching while preserving clinical coherence.

4.2. LLM-Based Dialect Scoring

Figure 6 presents dialectal fidelity scores (1–5 scale) assigned by the Qwen 3 Base evaluator across 300 prompts per dialect condition. Error bars denote standard deviation across prompts.

For MSA prompts, MENARA achieved the highest mean fidelity score, slightly exceeding the Gemma backbone (p = 0.013) and both dialect-specialized models. Importantly, variance in this condition is comparatively low, indicating stable performance on the standard register despite multi-dialect integration.

Under Egyptian Arabic prompts, the EGY specialist produced the strongest dialect adherence, consistent with its single-task optimization. MENARA achieved substantially higher scores than the Gemma backbone and DRJ model, demonstrating successful transfer of Egyptian dialectal competence without exclusive specialization (p = 0.002). Although the specialist remains optimal, MENARA narrows the performance gap while maintaining broader capability.

For Moroccan Darija prompts, DRJ achieved the highest fidelity, with MENARA performing closely behind (p = 0.210). The merged model markedly outperformed both the backbone and the Egyptian specialist, indicating effective retention of Moroccan dialectal features after consolidation.

Across all conditions, MENARA exhibits the most balanced performance profile. While single-dialect specialists dominate their respective varieties, MENARA maintains consistently strong scores without catastrophic degradation in any dialect, reflecting the intended behavior of parameter-space merging.

4.3. Human Linguistic Evaluation

Table 1 presents native-speaker ratings of dialectal naturalness and coherence. MENARA maintains consistently high linguistic realism across dialects, supporting the automated findings while providing qualitative validation of authenticity and fluency.

To further analyze dialectal behavior and address concerns regarding potential MSA leakage, we conducted a lexical composition analysis of MENARA’s outputs across all dialect conditions. The results are summarized in Figure 7. The analysis quantifies the proportion of tokens attributable to each language variety (MSA, Egyptian Arabic, Moroccan Darija, and other languages) in the generated responses when prompted in each target dialect.

When prompted in MSA, MENARA produces outputs that are strongly dominated by MSA (88%), with minimal Egyptian Arabic presence (5%), 7% other-language content, and no observable Moroccan Darija leakage. This confirms that the merged model reliably defaults to the standard register when prompted accordingly, without unintended dialectal interference.

When prompted in Egyptian Arabic, the model generates responses consisting of 71% Egyptian Arabic, with MSA reduced to 24% and no Moroccan Darija leakage. This indicates successful dialectal adaptation, while the presence of limited MSA content reflects natural code-switching patterns commonly observed in Egyptian Arabic, particularly in semi-formal or medical contexts.

When prompted in Moroccan Darija, MENARA produces 54% Moroccan Darija, alongside 27% MSA, 8% Egyptian Arabic, and 11% other languages. The presence of MSA and non-Arabic tokens should not be interpreted as a failure of the merging process. Rather, it reflects the linguistic reality of Moroccan Darija, where code-switching with MSA and borrowing from French are prevalent, particularly for technical and medical terminology.

Overall, these results demonstrate that MENARA preserves strong dialectal alignment while exhibiting realistic and context-appropriate code-switching behavior.

4.4. Medical Expert Evaluation

To assess clinical safety and factual correctness, two medical subject matter experts independently evaluated a subset of generated responses. Descriptive statistics indicated that SME 1 assigned higher scores on average (M = 3.80, SD = 1.40) than SME 2 (M = 3.17, SD = 1.49). English responses were consistently rated as accurate and clinically sound. In contrast, dialectal responses, while generally plausible, exhibited moderate variability, which may explain the differences in scoring tendencies between the two experts.

4.5. Cross-Model Benchmarking

Table 2 compares MENARA against larger general-purpose Arabic LLMs (Jais-13B, ALLaM-7B, and Fanar-7B) under the same LLM-based evaluation protocol. MENARA achieves the highest overall average fidelity score (3.68), substantially exceeding all baselines despite its smaller parameter count. On Egyptian Arabic prompts, MENARA (3.02) clearly outperforms all comparison models, indicating stronger dialectal adaptation than scaling alone provides. A similar pattern is observed for Moroccan Darija, where MENARA (3.12) surpasses larger models by a wide margin.

For MSA prompts, MENARA remains competitive (4.89), performing on par with the strongest baseline (Fanar: 4.91) and exceeding the others. These results suggest that structured integration of dialect specialists can yield more robust cross-dialect performance than increasing model size without explicit dialect adaptation.
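As a quick arithmetic check, the overall average reported above is consistent with an unweighted mean of MENARA's three per-variety scores quoted in the text. Equal weighting of the three conditions is an assumption here, since the text does not state how the average was formed:

```latex
\[
\frac{4.89 + 3.02 + 3.12}{3} = \frac{11.03}{3} \approx 3.68
\]
```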

To evaluate robustness under more realistic input conditions, we assessed MENARA on the supplementary noisy dataset described in Section 3.1. Table 3 summarizes the fidelity scores across clean and noisy conditions.

Overall, MENARA demonstrates stable performance across clean and noisy conditions, with only minor variations in fidelity scores. The largest absolute decrease is observed in MSA, while Egyptian Arabic remains nearly unchanged and Moroccan Darija shows a slight increase.

The results indicate no statistically significant difference between clean and noisy conditions for any dialect: MSA (p = 0.294), Egyptian Arabic (p = 0.818), and Moroccan Darija (p = 0.821). These findings suggest that MENARA is robust to moderate levels of input noise as simulated in our synthetic setting, with no evidence of systematic degradation in output quality across dialects.

4.6. Inter-Run Variability

To assess the reliability of the evaluation procedure, we conducted an inter-run consistency analysis using a random 10% subset of the dataset, which was independently evaluated in two separate runs under identical conditions. Agreement between the two runs was quantified using Krippendorff’s α, yielding a value of 0.89. This level of agreement is considered near-perfect and exceeds the commonly accepted threshold of 0.80, indicating high stability of the evaluator. These results suggest that the scoring process is robust to stochastic variation and that the evaluator produces consistent judgments across repeated runs.
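As an illustration of how such an agreement coefficient can be computed, the following is a minimal sketch of Krippendorff’s α for two complete scoring runs. It assumes an interval (squared-difference) distance between scores and no missing values; the section does not state which distance metric was used, and the function name `krippendorff_alpha_interval` is hypothetical rather than taken from the authors’ code.

```python
from itertools import combinations


def krippendorff_alpha_interval(run_a, run_b):
    """Krippendorff's alpha for two complete runs, interval metric (assumed)."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and of equal length")

    n_units = len(run_a)
    # Observed disagreement: mean squared difference between the two
    # scores given to the same item.
    d_obs = sum((a - b) ** 2 for a, b in zip(run_a, run_b)) / n_units

    # Expected disagreement: mean squared difference over all pairs of
    # scores pooled across both runs, regardless of item.
    pooled = list(run_a) + list(run_b)
    n_pairs = len(pooled) * (len(pooled) - 1) / 2
    d_exp = sum((x - y) ** 2 for x, y in combinations(pooled, 2)) / n_pairs

    if d_exp == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1.0 - d_obs / d_exp
```

Under this definition, α equals 1 when both runs assign identical scores to every item and falls toward 0 as within-item disagreement approaches the disagreement expected by chance.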

5. Discussion

In response to our core research question, this study provides empirical evidence that parameter-space model merging can produce a single language model that supports cross-dialect medical communication without per-dialect fine-tuning. By integrating dialect-specialized models with a medical domain model using TIES merging, we construct MENARA, a unified system that processes and generates medical content across Egyptian Arabic, Moroccan Darija, and MSA.

Quantitative evaluation indicates that MENARA achieves a strong balance between dialectal specificity and cross-dialect generalization. Across all test scenarios, the merged model attained dialectal fidelity scores ranging from 3.02 to 4.89, consistently performing near the top for MSA, Egyptian Arabic, and Moroccan Darija prompts. While dialect-specialized models unsurprisingly performed best on their respective target dialects, MENARA demonstrated competitive fidelity across dialects without requiring separate dialect-specific deployments.

Comparisons involving the Egyptian Arabic model should be interpreted with caution, as EGY is the only model fine-tuned in-house, whereas the remaining models were incorporated as off-the-shelf checkpoints and merged directly. This difference in training provenance may partially explain EGY’s strong performance on Egyptian Arabic prompts. Notably, MENARA achieves robust cross-dialect performance without additional fine-tuning, underscoring the practical advantage of the merging strategy.

MENARA showed particular strength in cross-dialect interpretation, accurately processing Moroccan Darija symptom descriptions for Egyptian Arabic-speaking clinicians. This capability directly addresses real-world communication barriers in multilingual healthcare environments, where patients and clinicians often rely on different Arabic varieties. Analysis of dialect composition further revealed linguistically plausible behavior: responses to Egyptian Arabic prompts exhibited high dialect purity, while Moroccan Darija outputs naturally incorporated MSA and French loanwords. This reflects authentic usage patterns in Moroccan Darija, where technical concepts are frequently expressed through code-switching rather than purely dialectal forms.

Qualitative assessment confirmed the model’s practical utility in clinically relevant scenarios, demonstrating accurate interpretation of inputs in one dialect and coherent reformulation in another. Human evaluation further supported real-world applicability: native speakers rated naturalness and coherence at an average of 4.87 for Egyptian Arabic and 4.20 for Moroccan Darija on a 1–5 scale (Table 1). These findings support the premise that linguistic form and domain knowledge can be effectively disentangled during merging, consistent with recent work on parameter-efficient multitask learning.

Benchmark comparisons against larger Arabic LLMs, including Jais, ALLaM, and Fanar, highlight the effectiveness of the proposed approach. Despite its relatively compact size (2B parameters), MENARA outperformed these general-purpose models on dialectal fidelity in medical settings, emphasizing the value of specialization through merging rather than scale alone.

However, expert evaluation of medical correctness revealed a clear performance gradient across languages. English outputs were consistently accurate and clinically sound. Dialectal responses, while generally plausible, exhibited moderate variability. This limitation reflects the underlying training distribution of the base medical model, which is predominantly English- and MSA-centric, with limited exposure to dialectal medical data. While merging successfully transfers dialectal linguistic patterns, it cannot enrich medical knowledge beyond what exists in the base model. This highlights a broader challenge in dialectal medical NLP: linguistic fluency does not necessarily imply equivalent medical precision, particularly for low-resource dialects with limited standardized terminology.

From a deployment perspective, the resource efficiency of the approach is particularly compelling. The TIES-merging process completed in approximately 10 min on a single L4 GPU, using 9.3 GB of memory, and reduced storage requirements by 67% compared to maintaining separate specialized models. This lightweight computational footprint makes dialect-aware medical NLP feasible in resource-constrained environments where per-dialect fine-tuning would be impractical.
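The 67% figure is consistent with a simple count of checkpoints. As a worked sketch, assume the three source models (Egyptian Arabic, Moroccan Darija, and medical) each occupy the same storage S, which is plausible because they share the Gemma 2B backbone, and that the merged model also occupies S; the symbol S and the equal-size assumption are ours, not stated in the text.

```latex
\[
\text{storage reduction}
  = 1 - \frac{S_{\text{merged}}}{S_{\text{separate}}}
  = 1 - \frac{S}{3S}
  = \frac{2}{3}
  \approx 67\%
\]
```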

Importantly, the merged model retained strong English performance, achieving scores comparable to the base medical model. This confirms that model merging is not a zero-sum process: dialectal specialization can be added without degrading core medical knowledge, which remains essential for accessing global clinical literature.

Overall, this work illustrates several practical implications of model merging. It validates merging as an effective strategy for building a single, unified model capable of supporting multiple specialized behaviors, offers a flexible pathway for extending model capabilities without full retraining, and demonstrates that core competencies can be preserved alongside new specializations. More broadly, the proposed methodology provides a generalizable template for applying resource-efficient model merging in other fragmented domains—linguistic, regional, or topical—such as legal, educational, or customer service applications.

Despite these promising results, limitations remain. Dialectal coverage is currently restricted to Egyptian Arabic and Moroccan Darija; extending the approach to additional varieties such as Levantine or Gulf Arabic would better assess scalability. While synthetic data mitigated data scarcity, real-world patient utterances are likely to exhibit greater variability and noise than those captured in our evaluation. Moreover, this study focused primarily on clinician-facing comprehension; future work should examine patient-facing generation tasks, including dialect-specific medical advice, to more fully characterize bidirectional clinical utility.

6. Conclusions

This study establishes parameter-space model merging as a practical strategy for addressing Arabic dialectal fragmentation in medical NLP. By integrating a clinically grounded backbone with Egyptian Arabic and Moroccan Darija specialists, we demonstrate that cross-dialect medical communication can be achieved without additional fine-tuning and without degrading core medical or English-language competence. Through benchmarking, lexical composition analysis, and cross-lingual retention evaluation, we show that modular specialization can coexist within a compact architecture. The resulting efficiency and scalability make this approach particularly suitable for linguistically fragmented and resource-constrained healthcare settings. More broadly, the framework offers a scalable pathway for integrating specialized capabilities in other multilingual or domain-fragmented applications.

7. Declaration of Generative AI and AI-Assisted Technologies in the Manuscript Preparation Process

During manuscript preparation, Claude Sonnet 4.6 was used for editorial refinement of language. All content was reviewed and verified by the authors, who take full responsibility for the final version.

Author Contributions

Conceptualization, A.I. and A.S.; methodology, A.I. and A.S.; software, A.I. and A.S.; validation, A.I., A.H., H.H., M.A., A.A., W.L. and A.S.; formal analysis, A.I., H.H., M.A., A.A., W.L. and A.S.; investigation, A.I., H.H., M.A., A.A., W.L. and A.S.; resources, A.S.; data curation, A.I. and A.S.; writing—original draft preparation, A.I. and A.S.; writing—review and editing, A.I., A.H., H.H., M.A., A.A., W.L. and A.S.; visualization, A.I. and A.S.; supervision, A.S.; project administration, A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The code, associated generated data, and models are publicly available at https://github.com/serag-ai/MENARA (accessed on 13 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. TIES Merging Hyperparameters and Ablation Study

In MergeKit’s implementation of the TIES algorithm, two primary per-model hyperparameters govern the merging process: density and weight.

The density parameter controls the fraction of task-vector parameters retained after sparsification, specifically selecting the top-k% parameters by magnitude. For example, at density = 0.6, the top 60% of highest-magnitude updates from each specialist model are retained prior to sign-based aggregation.

Our choice of density = 0.6 is supported by both theoretical and empirical considerations. Since all merged models share the same Gemma 2B backbone, the setting aligns with assumptions of Linear Mode Connectivity, where moderate-to-high density values are expected to preserve meaningful parameter directions. Empirically, community observations (e.g., MergeKit discussions and Hugging Face implementation guides) indicate that densities in the range of 0.5–0.9 yield improved performance for same-backbone merges, with values slightly above those proposed in the original work often performing better in practice.

To assess the impact of the density parameter on performance, we conducted an ablation study across three values: 0.4, 0.6, and 0.8. Table A1 summarizes the results.

Table A1. Effect of density on dialectal fidelity (1–5 scale). Bold values indicate the best average performance.

Density	MSA	Egyptian	Moroccan	Average
0.4	4.72	2.84	2.94	3.50
0.6 (ours)	4.89	3.02	3.12	3.68
0.8	4.74	2.73	2.83	3.42

The results indicate that density = 0.6 achieves the best overall performance. Lower density (0.4) likely removes informative task-vector components, while higher density (0.8) introduces increased parameter interference, reducing overall fidelity.

Complementarily, the weight parameter controls each model’s relative contribution during merging and rescales task vectors prior to sign election. This directly influences which model dominates the consensus direction for each parameter. In our setting, higher weights were assigned to dialect-specialized models relative to the medical backbone. This design reflects the objective of preserving clinical competence (already present in the backbone) while enhancing dialectal adaptation, which constitutes the primary challenge.
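To make the roles of density and weight concrete, the following is a minimal pure-Python sketch of the three TIES steps (trim, sign election, and merging of sign-agreeing updates) on flat parameter lists. It is an illustration under simplifying assumptions, not MergeKit’s implementation: the function names `trim` and `ties_merge` are hypothetical, real merges operate tensor by tensor, and the normalization by the summed weights of agreeing models is one common choice that may differ from the configuration used for MENARA.

```python
def trim(task_vector, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = round(density * len(task_vector))
    order = sorted(range(len(task_vector)),
                   key=lambda i: abs(task_vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]


def ties_merge(base, specialists, weights, density):
    """Merge specialist parameter lists into `base` (illustrative sketch)."""
    # Task vector = specialist parameters minus shared base parameters.
    task_vectors = [[s - b for s, b in zip(spec, base)] for spec in specialists]
    # Sparsify each task vector, then rescale it by its model weight.
    scaled = [[w * v for v in trim(tv, density)]
              for tv, w in zip(task_vectors, weights)]

    merged = []
    for j, b in enumerate(base):
        column = [tv[j] for tv in scaled]
        total = sum(column)
        # Sign election: the sign of the weighted sum wins for this parameter.
        sign = 1 if total > 0 else -1 if total < 0 else 0
        # Keep only the updates that agree with the elected sign.
        agree = [(v, w) for v, w in zip(column, weights) if v * sign > 0]
        weight_sum = sum(w for _, w in agree)
        delta = sum(v for v, _ in agree) / weight_sum if weight_sum else 0.0
        merged.append(b + delta)
    return merged
```

In this sketch, raising a model’s weight increases its influence on the elected sign for each parameter, which mirrors the description above of dialect specialists being weighted more heavily than the medical backbone.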

Appendix B. Comparison with Alternative Merging Strategies

In addition to TIES, we evaluated two alternative model merging approaches during development: linear averaging and spherical linear interpolation (SLERP). All methods were assessed using the same evaluation protocol described in Section 3.5.2. Table A2 presents dialectal fidelity scores across the three approaches.

Table A2. Comparison of merging strategies (dialectal fidelity, 1–5 scale). Bold values indicate the best average performance.

Method	MSA	Egyptian	Moroccan	Average
Linear Averaging	4.61	2.81	3.15	3.52
SLERP	4.70	2.86	3.06	3.54
TIES (MENARA)	4.89	3.02	3.12	3.68

TIES achieves the highest overall average, with the clearest gains on MSA and Egyptian Arabic; linear averaging scores marginally higher on Moroccan Darija (3.15 vs. 3.12). While linear averaging and SLERP provide reasonable baselines, they lack the mechanisms TIES uses to resolve parameter conflicts, namely magnitude-based trimming and sign election.

These results support the use of TIES for multi-dialect model merging, as it effectively balances contributions from specialized models while mitigating interference between conflicting parameter updates.
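For reference, the two baseline strategies can be sketched on flat parameter lists as follows. This is an illustrative sketch only: the function names are hypothetical, SLERP is shown for two vectors because it is defined pairwise, and how the three source models were combined under SLERP is not specified here.

```python
import math


def linear_average(a, b):
    """Element-wise mean of two parameter lists."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between parameter lists a and b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        # Degenerate input: fall back to plain linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]

    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly collinear vectors: SLERP reduces to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]

    w_a = math.sin((1 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]
```

Neither function inspects the sign or magnitude of individual updates, so conflicting updates from different specialists are blended rather than resolved.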

References

Alasmari, A. A Scoping Review of Arabic Natural Language Processing for Mental Health. Healthcare 2025, 13, 963.
Inoue, G.; Khalifa, S.; Habash, N. Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects. In Findings of the Association for Computational Linguistics: ACL 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 1708–1719.
Trentman, E.; Shiri, S. The Mutual Intelligibility of Arabic Dialects: Implications for the Language Classroom. Crit. Multiling. Stud. 2020, 8, 104–134.
Shoufan, A.; Alameri, S. Natural Language Processing for Dialectical Arabic: A Survey. In Proceedings of the Second Workshop on Arabic Natural Language Processing; Habash, N., Vogel, S., Darwish, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 36–48.
Wu, X.K.; Chen, M.; Li, W.; Wang, R.; Lu, L.; Liu, J.; Hwang, K.; Hao, Y.; Pan, Y.; Meng, Q.; et al. LLM fine-tuning: Concepts, opportunities, and challenges. Big Data Cogn. Comput. 2025, 9, 87.
Ibrahim, A.; Hosseini, A.; Ibrahim, S.; Sattar, A.; Serag, A. D3: A Small Language Model for Drug-Drug Interaction prediction and comparison with Large Language Models. Mach. Learn. Appl. 2025, 20, 100658.
Ibrahim, A.; Khalili, A.; Arabi, M.; Sattar, A.; Hosseini, A.; Serag, A. MERA: Medical Electronic Records Assistant. Mach. Learn. Knowl. Extr. 2025, 7, 73.
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W.; Microsoft Corporation. LoRA: Low-rank adaptation of large language models. ICLR 2022, 1, 3.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
Brunet, G.; Chechik, M.; Easterbrook, S.; Nejati, S.; Niu, N.; Sabetzadeh, M. A manifesto for model merging. In Proceedings of the 2006 International Workshop on Global Integrated Model Management, Shanghai, China, 22 May 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 5–12.
Xu, Z.; Yuan, K.; Wang, H.; Wang, Y.; Song, M.; Song, J. Training-free pretrained model merging. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; Computer Vision Foundation: New York, NY, USA, 2024; pp. 5915–5925.
Ibrahim, A.; Hosseini, A.; Helmy, H.; Lakhdhar, W.; Serag, A. Bridging Dialectal Gaps in Arabic Medical LLMs through Model Merging. In Proceedings of the Third Arabic Natural Language Processing Conference; Darwish, K., Ali, A., Abu Farha, I., Touileb, S., Zitouni, I., Abdelali, A., Al-Ghamdi, S., Alkhereyf, S., Zaghouani, W., Khalifa, S., et al., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 338–346.
Kwaik, K.A.; Saad, M.; Chatzikyriakidis, S.; Dobnik, S. A lexical distance study of Arabic dialects. Procedia Comput. Sci. 2018, 142, 2–13.
Al-Wer, E.; de Jong, R. Dialects of Arabic. In The Handbook of Dialectology; John Wiley & Sons, Inc.: Piscataway, NJ, USA, 2017; pp. 523–534.
Salameh, M.; Bouamor, H.; Habash, N. Fine-grained Arabic dialect identification. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 1332–1344.
Abdelali, A.; Mubarak, H.; Samih, Y.; Hassan, S.; Darwish, K. QADI: Arabic Dialect Identification in the Wild. In Proceedings of the Sixth Arabic Natural Language Processing Workshop; Habash, N., Bouamor, H., Hajj, H., Magdy, W., Zaghouani, W., Bougares, F., Tomeh, N., Abu Farha, I., Touileb, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1–10.
Alhuzali, H.; Al-Eisawi, W.; Abdul-Mageed, M.; Abouzahir, C.; Abu-Daoud, M.; Alasmari, A.; Al-Monef, R.; Alqahtani, A.; Ayash, L.; Kharouf, L.; et al. AraHealthQA 2025: The First Shared Task on Arabic Health Question Answering. In Proceedings of the Third Arabic Natural Language Processing Conference: Shared Tasks; Darwish, K., Ali, A., Abu Farha, I., Touileb, S., Zitouni, I., Abdelali, A., Al-Ghamdi, S., Alkhereyf, S., Zaghouani, W., Khalifa, S., et al., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 107–118.
Al-Majmar, N.A.; Gawbah, H.; Alsubari, A. AHD: Arabic healthcare dataset. Data Brief 2024, 56, 110855.
Haouari, F.; Hasanain, M.; Suwaileh, R.; Elsayed, T. ArCOV-19: The First Arabic COVID-19 Twitter Dataset with Propagation Networks. In Proceedings of the Sixth Arabic Natural Language Processing Workshop; Habash, N., Bouamor, H., Hajj, H., Magdy, W., Zaghouani, W., Bougares, F., Tomeh, N., Abu Farha, I., Touileb, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 82–91.
Mohammad, R.; Alkhnbashi, O.S.; Hammoudeh, M. Optimizing large language models for Arabic healthcare communication: A focus on patient-centered NLP applications. Big Data Cogn. Comput. 2024, 8, 157.
Yang, E.; Shen, L.; Guo, G.; Wang, X.; Cao, X.; Zhang, J.; Tao, D. Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities. arXiv 2024, arXiv:2408.07666.
Matena, M.S.; Raffel, C.A. Merging models with Fisher-weighted averaging. Adv. Neural Inf. Process. Syst. 2022, 35, 17703–17716.
Kodali, P.; Shivkumar, V.; Joshi, S.; Choudhary, M.; Kumaraguru, P.; Shrivastava, M. Adapting Multilingual Models to Code-Mixed Tasks via Model Merging. arXiv 2025, arXiv:2510.19782.
Wang, Y.; Gu, Y.; Zhang, Y.; Zhou, Q.; Yan, Z.; Xie, C.; Wang, X.; Yuan, J.; Yang, H. Model Merging Scaling Laws in Large Language Models. arXiv 2025, arXiv:2509.24244.
Bandarkar, L.; Peng, N. The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs. In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025); Adelani, D.I., Arnett, C., Ataman, D., Chang, T.A., Gonen, H., Raja, R., Schmidt, F., Stap, D., Wang, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 131–148.
Incremona, A.; Pozzi, A.; Guiscardi, A.; Tessera, D. A differentiable and uncertainty-aware mutual information regularizer for bias mitigation. Neurocomputing 2026, 669, 132498.
Sarridis, I.; Koutlis, C.; Papadopoulos, S.; Diou, C. FLAC: Fairness-Aware Representation Learning by Suppressing Attribute-Class Associations. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1148–1160.
Yadav, P.; Tam, D.; Choshen, L.; Raffel, C.A.; Bansal, M. TIES-merging: Resolving interference when merging models. Adv. Neural Inf. Process. Syst. 2023, 36, 7093–7115.
Team, G.; Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M.S.; Love, J.; et al. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024, arXiv:2403.08295.
OpenMeditron. Meditron3-Gemma2-2B. Hugging Face Model Repository. 2024. Available online: https://huggingface.co/OpenMeditron/Meditron3-Gemma2-2B (accessed on 13 April 2026).
MBZUAI-Paris. Egyptian-SFT-Mixture Dataset. 2024. Available online: https://huggingface.co/datasets/MBZUAI-Paris/Egyptian-SFT-Mixture (accessed on 13 April 2026).
Shang, G.; Abdine, H.; Khoubrane, Y.; Mohamed, A.; Abbahaddou, Y.; Ennadir, S.; Momayiz, I.; Ren, X.; Moulines, E.; Nakov, P.; et al. Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect. In Proceedings of the First Workshop on Language Models for Low-Resource Languages; Hettiarachchi, H., Ranasinghe, T., Rayson, P., Mitkov, R., Gaber, M., Premasiri, D., Tan, F.A., Uyangodage, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 9–30.
Goddard, C.; Siriwardhana, S.; Ehghaghi, M.; Meyers, L.; Karpukhin, V.; Benedict, B.; McQuade, M.; Solawetz, J. Arcee’s MergeKit: A Toolkit for Merging Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track; Dernoncourt, F., Preoţiuc-Pietro, D., Shimorina, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 477–485.

Figure 1. The framework consists of three stages: (1) test data generation using an LLM to produce dialect-specific medical symptom descriptions; (2) model merging via the TIES algorithm, integrating Egyptian Arabic, Moroccan Darija, and medical domain LLMs into one model; and (3) dual evaluation of the merged model through LLM-based and human-based assessments, focusing on both dialectal comprehension and medical accuracy.

Figure 2. Prompt for data generation. Identical clinical constraints were applied across all dialects, with only the language specification modified.

Figure 3. Examples of generated test data. (A) MSA, (B) Egyptian Arabic, (C) Moroccan Darija.

Figure 4. Evaluation prompt used for LLM-based scoring. Qwen 3 Base was instructed to assess each model response for dialectal fidelity using a consistent rubric across MSA, Egyptian Arabic, and Moroccan Darija.

Figure 5. (i) Inference-phase prompts used to steer the model to respond in a specified dialect. (ii) Examples from the merged model showing cross-dialect input and output: MSA question → Egyptian Arabic answer (A), Egyptian Arabic question → Moroccan Darija answer (B), and Moroccan Darija question → MSA answer (C).

Figure 6. Dialectal fidelity scores (1–5) for the evaluated models across three Arabic language variants.

Figure 7. Dialect composition of MENARA outputs across prompting conditions. Token distributions are shown for MSA (blue), Egyptian Arabic (orange), Moroccan Darija (green), and Other (red). The model demonstrates strong alignment with the target dialect in each condition, with minimal cross-dialect leakage.

Table 1. Averaged Human Evaluation (naturalness & coherence).

	MSA	Egyptian	Moroccan
Quality (1–5)	4.91	4.87	4.20

Table 2. Model comparison on dialectal outputs (1–5 scale). Our merged model achieves superior dialectal performance. Bold values indicate the best average performance.

Model	MSA	Egyptian	Moroccan	Average
MENARA (Ours)	4.89	3.02	3.12	3.68
ALLaM	4.77	2.31	1.72	2.93
Jais	4.53	2.10	1.88	2.84
Fanar	4.91	1.48	1.25	2.55

Table 3. Dialectal fidelity under clean and noisy input conditions.

Dialect	Clean	Noisy
MSA	4.88	4.72
Egyptian Arabic	3.14	3.10
Moroccan Darija	2.98	3.04
