1. Introduction
Deploying medical language technologies in healthcare systems presents challenges that extend beyond conventional multilingual Natural Language Processing (NLP), particularly in regions where communication occurs through diverse dialectal varieties rather than standardized language forms. In the Arabic-speaking world, this issue becomes especially pronounced. Unlike languages with relatively standardized spoken forms, Arabic exists as a continuum of regionally grounded dialects that often diverge substantially in vocabulary, phonology, and syntax. In clinical practice, patients rarely communicate in Modern Standard Arabic (MSA); instead, they rely on colloquial dialects that may be only partially intelligible outside their region. This linguistic reality complicates the safe deployment of medical large language models (LLMs), particularly in high-stakes healthcare settings [1,2]. Dialects such as Egyptian Arabic and Moroccan Darija exhibit marked phonological and lexical divergence, significantly reducing mutual intelligibility across regions [3]. When clinicians are unfamiliar with a patient’s dialectal nuances, critical symptom descriptions may be misinterpreted, potentially leading to diagnostic errors, inappropriate treatment decisions, or delayed care [4].
Traditional adaptation strategies for large language models rely heavily on dialect-specific fine-tuning [5,6,7]. While effective in controlled environments, this approach scales poorly in practice. Training and maintaining independent medical models for each dialect requires large annotated corpora, repeated computational investment, and ongoing version management. In healthcare systems that already operate under resource constraints, maintaining parallel dialect-specific models is operationally unrealistic. Moreover, such fragmentation increases the risk of inconsistent clinical behavior across model variants. Even parameter-efficient strategies such as LoRA or quantization do not eliminate the cumulative computational burden associated with maintaining multiple specialized systems [8,9]. Given the breadth of Arabic dialectal diversity, per-dialect fine-tuning is inherently difficult to scale.
An alternative perspective views dialectal specialization not as a separate training objective, but as a modular capability that can be integrated into a shared medical foundation. Recent advances in model merging provide a mechanism for combining independently fine-tuned models without retraining from scratch [10,11]. By aligning and consolidating task-specific parameter updates, merging enables a single unified architecture to inherit multiple specializations while mitigating task interference. This paradigm offers a promising pathway for constructing dialect-aware medical LLMs without incurring the computational and logistical costs of maintaining per-dialect fine-tuned models.
In this study, we propose MENARA (MEdical Natural Arabic Response Assistant), as illustrated in Figure 1. We investigate whether merging dialect-specialized Arabic models with a medical-domain language model can yield a unified system capable of robust cross-dialect clinical communication. Specifically, we integrate Egyptian Arabic and Moroccan Darija specialists with a medical-domain expert. Our goal is not merely to demonstrate dialect generation, but to evaluate whether linguistic adaptation can be achieved while preserving medical competence and foundational language knowledge.
Beyond validating dialectal performance, we examine three additional dimensions critical for real-world deployment. First, we introduce a fine-grained dialect composition analysis that quantifies lexical purity and code-switching behavior across generated outputs. Second, we benchmark the merged system against state-of-the-art Arabic LLMs to situate its performance within the broader ecosystem. Third, we evaluate catastrophic forgetting by testing whether merging degrades the model’s English medical capabilities. Together, these analyses move beyond proof-of-concept and toward a comprehensive assessment of scalability, safety, and retention.
Our central research question is therefore reformulated as follows: Can model merging produce a resource-efficient, clinically reliable, and linguistically adaptable Arabic medical LLM that maintains foundational knowledge while supporting cross-dialect communication?
Through systematic evaluation combining automated scoring, subject matter expert review, benchmarking comparisons, and deployment analysis, we demonstrate that merging provides a practical and scalable solution for dialect-aware medical NLP. This work contributes not only to Arabic healthcare NLP but also to broader efforts in modular specialization and resource-efficient model adaptation for linguistically fragmented domains.
A preliminary version of this work appeared in the Proceedings of ArabicNLP 2025 [12]. The present manuscript substantially extends the conference publication through several new analyses and evaluations. First, we construct a synthetic yet clinically constrained dataset containing symptom descriptions that remain linguistically consistent across multiple Arabic dialectal varieties, including MSA, Egyptian Arabic, and Moroccan Darija. Building on this dataset, we introduce a dialect composition analysis to quantify lexical purity and structured code-switching patterns across dialect outputs. We further benchmark the merged model against additional state-of-the-art Arabic large language models, including Jais, ALLaM, and Fanar, situating its performance within the broader Arabic model ecosystem. In addition to automated competence metrics, subject matter experts perform medical-focused evaluations to provide domain-level validation of the model’s outputs. Finally, we present an expanded discussion examining performance gradients across dialects, mechanisms for mitigating model interference, and practical deployment considerations in resource-constrained healthcare environments.
3. Methodology
3.1. Dataset
Developing and evaluating AI systems for dialectal medical language understanding is often constrained by the limited availability of dialect-specific clinical data and by strict data privacy regulations. To address these challenges, we constructed a synthetic but clinically grounded dataset using structured prompting with GPT-4.1-nano. This approach enables controlled, reproducible experimentation while ensuring full compliance with patient privacy requirements.
The final dataset comprises a total of 1050 symptom descriptions. Of these, 900 samples form the primary evaluation set, while an additional 150 samples are used as a supplementary noisy evaluation set designed to assess robustness under more realistic input conditions.
3.1.1. Primary Evaluation Set
To promote linguistic diversity while maintaining clinical consistency, the primary dataset consists of parallel symptom descriptions across three language varieties: MSA, Egyptian Arabic, and Moroccan Darija (300 samples per dialect, totaling 900 samples). All samples were generated using a unified prompting framework designed to enforce consistent medical content while varying only the target dialect. Specifically, the generation process was guided by a structured template that constrains symptom presentation, demographic attributes (e.g., age and gender), and clinical plausibility (see Figure 2).
Rather than allowing unconstrained free-form generation, prompts were carefully designed to ensure: (i) dialect-specific lexical and stylistic consistency, (ii) medically plausible symptom descriptions, and (iii) controlled variation across demographic factors. This design ensures that differences across dialect conditions are linguistic rather than semantic, enabling fair evaluation of cross-dialect robustness.
For dialect generation, the prompt’s language instruction was explicitly modified to request “authentic Egyptian colloquial Arabic” or “authentic Moroccan Darija expressions,” while preserving all other constraints. This ensures semantic alignment across dialects while capturing natural linguistic variation. Examples of the generated test data are shown in Figure 3.
Generation was performed using controlled decoding parameters (temperature = 0.2, top-p = 0.90) to balance variability and stability. To further ensure data quality, outputs were automatically filtered to remove instances containing direct medical advice, interrogative forms, or unintended code-switching into English. In addition, a random 10% subset from each dialect condition was manually reviewed by native speakers to verify dialectal consistency and clinical plausibility prior to finalizing the corpus.
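The automatic filtering step described above can be illustrated with a minimal sketch. The rules below are assumptions made for illustration only: the actual filter criteria, the advice markers (`ADVICE_MARKERS`), and the function names `keep_sample` and `filter_corpus` are hypothetical and are not taken from the MENARA code base.

```python
import re

# Any Latin letter is treated as unintended code-switching into English
# (an assumed heuristic; it would also reject French loanwords in Latin script).
LATIN_RE = re.compile(r"[A-Za-z]")

# Latin and Arabic question marks signal interrogative forms.
QUESTION_MARKS = ("?", "\u061F")

# Hypothetical markers of direct medical advice ("you must", "we advise you").
ADVICE_MARKERS = ("يجب عليك", "ننصحك")


def keep_sample(text, advice_markers=ADVICE_MARKERS):
    """Return True if a generated symptom description passes all three filters."""
    if LATIN_RE.search(text):
        return False
    if any(mark in text for mark in QUESTION_MARKS):
        return False
    if any(marker in text for marker in advice_markers):
        return False
    return True


def filter_corpus(samples):
    """Keep only the samples that pass every filter, preserving order."""
    return [s for s in samples if keep_sample(s)]
```

A rule-based pass of this kind is cheap and deterministic, which is why a manual review of a subset, as described above, remains useful for catching cases the heuristics miss.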
3.1.2. Supplementary Noisy Evaluation Set
To assess robustness under more realistic input conditions, we constructed an additional noisy evaluation set comprising 150 symptom descriptions (50 per dialect). In this setting, the generation prompt was explicitly modified to introduce characteristics commonly observed in patient-reported inputs, including misspellings, informal abbreviations, fragmented or incomplete symptom descriptions, and unstructured code-switching.
All samples were evaluated using the same evaluation protocol described in Section 3.5.2, ensuring consistency with the primary evaluation setting.
3.2. Model Configuration
All experiments are conducted using the Gemma 2B architecture [29] as a shared backbone to ensure architectural compatibility across all components.
To combine clinical expertise with dialectal specialization, we integrate three independently fine-tuned checkpoints:
Medical Domain Model: a clinically aligned Gemma-based model trained for evidence-grounded medical reasoning [30].
Egyptian Arabic Specialist: a Gemma 2B variant fine-tuned to model colloquial Egyptian lexical, syntactic, and morphological patterns. It was fine-tuned in-house on the Egyptian-SFT-Mixture dataset [31].
Moroccan Darija Specialist: an instruction-tuned Gemma 2B model optimized for Moroccan Darija generation tasks [32].
All models share the same tokenizer and vocabulary inherited from Gemma 2B. This architectural alignment allows parameter-level consolidation without token realignment or vocabulary conflicts during merging. The code and associated resources for MENARA are publicly available at: https://github.com/serag-ai/MENARA, accessed on 13 April 2026.
For clarity, we denote the Gemma 2B backbone as Gemma, the Egyptian dialect model as EGY, and the Moroccan Darija model as DRJ.
3.3. Model Merging Strategy
We adopt a parameter-space integration strategy for multi-dialect consolidation using MergeKit [33]. Among the available approaches, we considered three representative methods, namely linear weight averaging, spherical linear interpolation (SLERP), and TIES (Trim, Elect Sign, & Sign-aware Merge) [28], each reflecting a distinct philosophy of model merging.
Linear weight averaging performs a direct weighted combination of model parameters. While computationally simple and easy to implement, it assumes compatibility between parameter updates and does not explicitly address potential task interference across independently fine-tuned checkpoints.
SLERP preserves the geometric relationship between parameter vectors by interpolating along a hyperspherical manifold. Although well-suited for pairwise interpolation, extending SLERP to multi-model merging introduces additional design considerations, such as sequential interpolation or ordering strategies.
TIES, in contrast, performs structured consolidation of task-specific parameter updates. It explicitly identifies and suppresses conflicting parameter directions while selectively integrating high-signal updates. This conflict-aware mechanism is particularly relevant when merging dialect-specialized and domain-specialized checkpoints that exhibit partially overlapping capabilities.
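The contrast between linear averaging and SLERP can be made concrete with a small sketch on flat parameter vectors. This is an illustration under simplifying assumptions, not the MergeKit implementation: real merges operate tensor by tensor, and the function names `linear_merge` and `slerp` are chosen here for exposition.

```python
import math


def linear_merge(a, b, t):
    """Element-wise weighted average: (1 - t) * a + t * b."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two parameter vectors.

    Falls back to linear interpolation when a vector has near-zero norm or
    when the vectors are (anti-)parallel, where the formula is ill-conditioned.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        return linear_merge(a, b, t)
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        return linear_merge(a, b, t)
    w_a = math.sin((1 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]
```

For two orthogonal unit vectors at `t = 0.5`, linear averaging shrinks the norm of the result, whereas SLERP keeps it on the unit sphere. The function takes exactly two inputs, which is the pairwise limitation noted above when more than two checkpoints must be combined.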
Given the presence of three independently fine-tuned models with non-trivial parameter divergence, we adopt TIES as our integration strategy (see Appendix B for an empirical comparison with linear averaging and SLERP). The following section details the formal parameter consolidation procedure.
3.4. TIES-Based Parameter Consolidation
We apply the TIES algorithm [28] in a training-free configuration using MergeKit to construct the final unified model. As described in [12], TIES performs structured parameter-space consolidation by integrating task-specific update vectors relative to a shared backbone model while suppressing conflicting directions.
Let $\theta_{\mathrm{med}}$ denote the medical backbone parameters and $\theta_i$ represent each dialect-specialized model. For each specialist model, we compute a task update vector:
$$\tau_i = \theta_i - \theta_{\mathrm{med}}$$
To reduce noise and mitigate destructive interference, updates are sparsified using a fixed density of 0.6 (see Appendix A for an analysis across density values of 0.4, 0.6, and 0.8). Directional inconsistencies across retained parameters are resolved through sign agreement before weighted aggregation into the base model. The merged parameters are then obtained via:
$$\theta_{\mathrm{merged}} = \theta_{\mathrm{med}} + \sum_{i} \lambda_i \, \hat{\tau}_i$$
where $\hat{\tau}_i$ denotes the sparsified, sign-aligned update of specialist $i$ and $\lambda_i$ its integration weight. We assign higher weights to the dialect specialists (0.6) than to the medical backbone (0.4), preserving dialectal signal and enabling improved linguistic adaptation while maintaining core clinical knowledge.
All hyperparameters match those used in the Proceedings of ArabicNLP 2025 [12] to ensure comparability across studies. No additional fine-tuning or gradient updates are performed after merging.
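The consolidation steps of this section (task update vectors, sparsification, sign agreement, weighted aggregation) can be sketched on flat parameter lists as follows. This is a simplified illustration, not the MergeKit implementation: the function names `trim` and `ties_merge` are hypothetical, sparsification is assumed to keep the largest-magnitude entries, and the `normalize` flag (dividing by the sum of agreeing weights) is an assumption about one common variant rather than a setting reported here.

```python
def trim(tau, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = int(round(density * len(tau)))
    if k <= 0:
        return [0.0] * len(tau)
    order = sorted(range(len(tau)), key=lambda j: abs(tau[j]), reverse=True)
    keep = set(order[:k])
    return [v if j in keep else 0.0 for j, v in enumerate(tau)]


def ties_merge(base, specialists, weights, density=0.6, normalize=True):
    """Merge specialist parameter vectors into `base` in a TIES-like manner."""
    # 1. Task update vectors relative to the shared backbone, then sparsify.
    taus = [
        trim([s - b for s, b in zip(spec, base)], density)
        for spec in specialists
    ]
    merged = []
    for j, b in enumerate(base):
        contribs = [(w * tau[j], w) for tau, w in zip(taus, weights)]
        # 2. Elect the dominant sign from the weighted sum of updates.
        total = sum(c for c, _ in contribs)
        sign = 1.0 if total >= 0 else -1.0
        # 3. Aggregate only the updates that agree with the elected sign.
        agree = [(c, w) for c, w in contribs if c * sign > 0]
        if not agree:
            merged.append(b)
            continue
        update = sum(c for c, _ in agree)
        if normalize:
            update /= sum(w for _, w in agree)
        merged.append(b + update)
    return merged
```

With `density=0.6` and weights of 0.6 for both specialists, a parameter on which the two specialists disagree in sign receives only the update from the side with the larger weighted magnitude, which is the conflict-suppression behavior that distinguishes TIES from plain averaging.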
3.5. Evaluation
3.5.1. Evaluation Concepts
We define three core evaluation concepts to assess the performance of the proposed framework:
Dialectal Fidelity refers to the degree to which a generated response conforms to the lexical and morphosyntactic norms of the target Arabic variety. It is evaluated on a 1–5 Likert scale using both LLM-based and human assessments under a structured rubric.
Medical Competence denotes the factual accuracy and clinical safety of a generated response. It is assessed by domain experts on the same 1–5 Likert scale, where lower scores are assigned to incomplete, misleading, or potentially harmful outputs.
Cross-Dialect Robustness captures the ability of the model to maintain consistent performance across dialect conditions. It is evaluated through score variance across dialects and further supported by dialect composition analysis, demonstrating stable dialect alignment without unintended leakage.
3.5.2. Evaluation Protocol
Model behavior was assessed using a multi-layer evaluation protocol comprising automated scoring, human linguistic assessment, medical expert review, cross-model benchmarking, and backbone-retention testing.
LLM-Based Dialect Scoring. Dialectal fidelity was evaluated using an external Arabic-capable foundation model (Qwen 3 Base) under a structured rubric. For each generated response, the evaluator assigned a rating on a five-point scale reflecting conformity to the requested dialect. Each dialect condition was evaluated over 300 prompts using identical scoring instructions (see Figure 4).
Dialect Composition Analysis. Beyond scalar scoring, we conducted a lexical attribution analysis to quantify dialectal distribution within model outputs. Using a structured LLM-as-judge protocol, each response was decomposed into proportional contributions from: the target dialect, MSA, other Arabic dialects, and non-Arabic tokens. This analysis enables measurement of dialect purity and unintended code-switching behavior, providing a more granular assessment of linguistic adherence.
Human Linguistic Evaluation. Native speakers of Egyptian Arabic and Moroccan Darija independently evaluated a subset of 30 outputs per dialect condition. Raters assessed linguistic naturalness and coherence using a five-point Likert scale. This assessment was designed to capture cultural authenticity and contextual fluency beyond model-based scoring.
Medical Expert Evaluation. To assess clinical safety and factual correctness, two medical subject matter experts (SMEs) evaluated a subset of generated responses using a five-point rubric ranging from 1 (Severely Incorrect/Dangerous) to 5 (Completely Correct and Nuanced). Evaluation criteria included factual accuracy, omission of critical details, identification of potential harm, and inclusion of appropriate safety caveats.
Cross-Model Benchmarking. To contextualize performance, we evaluated the merged model against publicly available Arabic LLMs (Jais-13B, ALLaM-7B, Fanar-7B) using the same LLM-based scoring protocol described above. All models were evaluated under identical conditions, including the same set of prompts, the same evaluator model (Qwen 3 Base), and the same scoring rubric.
Inter-run variability. To assess the reliability of the evaluation procedure, we conducted an inter-run consistency analysis using a random 10% subset of the dataset, which was independently evaluated in two separate runs under identical conditions.
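As an illustration of how the dialect composition analysis described above might be aggregated, the sketch below averages per-response proportions for each prompt dialect. The record layout (a `dialect` key plus one proportion per category) and the name `mean_composition` are assumptions made for this example; the judge's actual output format is not specified in the text.

```python
from collections import defaultdict

# The four attribution categories named in the protocol.
CATEGORIES = ("target", "msa", "other_arabic", "non_arabic")


def mean_composition(judgements):
    """Average category proportions per prompt dialect.

    Each judgement is a dict with a 'dialect' key and one numeric value per
    category. Proportions are renormalized to sum to 1 per response, so the
    judge may report either fractions or percentages.
    """
    sums = defaultdict(lambda: dict.fromkeys(CATEGORIES, 0.0))
    counts = defaultdict(int)
    for record in judgements:
        props = [record[c] for c in CATEGORIES]
        total = sum(props)
        if total <= 0:
            continue  # skip responses the judge could not attribute
        for category, value in zip(CATEGORIES, props):
            sums[record["dialect"]][category] += value / total
        counts[record["dialect"]] += 1
    return {
        dialect: {c: sums[dialect][c] / counts[dialect] for c in CATEGORIES}
        for dialect in counts
    }
```

Averaging normalized proportions per response, rather than pooling tokens, gives each response equal weight regardless of its length.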
3.6. Statistical Analysis
To test for differences between results, t-tests were used for normally distributed data, and the Mann–Whitney U test was used for non-normal distributions (normality was assessed using the Shapiro–Wilk test). Statistical significance was defined as .
4. Results
4.1. Qualitative Cross-Dialect Generation
Figure 5 presents representative cross-dialect generation examples from the merged model. The examples illustrate the model’s ability to produce coherent and dialectally appropriate responses when the input and output varieties differ, including (A) MSA prompt → Egyptian Arabic response, (B) Egyptian Arabic prompt → Moroccan Darija response, and (C) Moroccan Darija prompt → MSA response.
These examples qualitatively demonstrate the effectiveness of the TIES-based consolidation in enabling controlled dialect switching while preserving clinical coherence.
4.2. LLM-Based Dialect Scoring
Figure 6 presents dialectal fidelity scores (1–5 scale) assigned by the Qwen 3 Base evaluator across 300 prompts per dialect condition. Error bars denote standard deviation across prompts.
For MSA prompts, MENARA achieved the highest mean fidelity score, slightly exceeding the Gemma backbone () and both dialect-specialized models. Importantly, variance in this condition is comparatively low, indicating stable performance on the standard register despite multi-dialect integration.
Under Egyptian Arabic prompts, the EGY specialist produced the strongest dialect adherence, consistent with its single-task optimization. MENARA achieved substantially higher scores than the Gemma backbone and DRJ model, demonstrating successful transfer of Egyptian dialectal competence without exclusive specialization (). Although the specialist remains optimal, MENARA narrows the performance gap while maintaining broader capability.
For Moroccan Darija prompts, DRJ achieved the highest fidelity, with MENARA performing closely behind (). The merged model markedly outperformed both the backbone and the Egyptian specialist, indicating effective retention of Moroccan dialectal features after consolidation.
Across all conditions, MENARA exhibits the most balanced performance profile. While single-dialect specialists dominate their respective varieties, MENARA maintains consistently strong scores without catastrophic degradation in any dialect, reflecting the intended behavior of parameter-space merging.
4.3. Human Linguistic Evaluation
Table 1 presents native-speaker ratings of dialectal naturalness and coherence. MENARA maintains consistently high linguistic realism across dialects, supporting the automated findings while providing qualitative validation of authenticity and fluency.
To further analyze dialectal behavior and address concerns regarding potential MSA leakage, we conducted a lexical composition analysis of MENARA’s outputs across all dialect conditions. The results are summarized in Figure 7. The analysis quantifies the proportion of tokens attributable to each language variety (MSA, Egyptian Arabic, Moroccan Darija, and other languages) in the generated responses when prompted in each target dialect.
When prompted in MSA, MENARA produces outputs that are strongly dominated by MSA (88%), with minimal Egyptian Arabic presence (5%), 7% other-language content, and no observable Moroccan Darija leakage. This confirms that the merged model reliably defaults to the standard register when prompted accordingly, without unintended dialectal interference.
When prompted in Egyptian Arabic, the model generates responses consisting of 71% Egyptian Arabic, with MSA reduced to 24% and no Moroccan Darija leakage. This indicates successful dialectal adaptation, while the presence of limited MSA content reflects natural code-switching patterns commonly observed in Egyptian Arabic, particularly in semi-formal or medical contexts.
When prompted in Moroccan Darija, MENARA produces 54% Moroccan Darija, alongside 27% MSA, 8% Egyptian Arabic, and 11% other languages. The presence of MSA and non-Arabic tokens should not be interpreted as a failure of the merging process. Rather, it reflects the linguistic reality of Moroccan Darija, where code-switching with MSA and borrowing from French are prevalent, particularly for technical and medical terminology.
Overall, these results demonstrate that MENARA preserves strong dialectal alignment while exhibiting realistic and context-appropriate code-switching behavior.
4.4. Medical Expert Evaluation
To assess clinical safety and factual correctness, two medical subject matter experts independently evaluated a subset of generated responses. Descriptive statistics indicated that SME 1 assigned higher scores on average (M = 3.80, SD = 1.40) than SME 2 (M = 3.17, SD = 1.49). English responses were consistently rated as accurate and clinically sound. In contrast, dialectal responses, while generally plausible, exhibited moderate variability, which may explain the differences in scoring tendencies between the two experts.
4.5. Cross-Model Benchmarking
Table 2 compares MENARA against larger general-purpose Arabic LLMs (Jais-13B, ALLaM-7B, and Fanar-7B) under the same LLM-based evaluation protocol. MENARA achieves the highest overall average fidelity score (3.68), substantially exceeding all baselines despite its smaller parameter count. On Egyptian Arabic prompts, MENARA (3.02) clearly outperforms all comparison models, indicating stronger dialectal adaptation than scaling alone provides. A similar pattern is observed for Moroccan Darija, where MENARA (3.12) surpasses larger models by a wide margin.
For MSA prompts, MENARA remains competitive (4.89), performing on par with the strongest baseline (Fanar: 4.91) and exceeding the others. These results suggest that structured integration of dialect specialists can yield more robust cross-dialect performance than increasing model size without explicit dialect adaptation.
To evaluate robustness under more realistic input conditions, we assessed MENARA on the supplementary noisy dataset described in Section 3.1. Table 3 summarizes the fidelity scores across clean and noisy conditions.
Overall, MENARA demonstrates stable performance across clean and noisy conditions, with only minor variations in fidelity scores. The largest absolute decrease is observed in MSA, while Egyptian Arabic remains nearly unchanged and Moroccan Darija shows a slight increase.
The results indicate no statistically significant difference between clean and noisy conditions for any dialect: MSA (), Egyptian Arabic (), and Moroccan Darija (). These findings suggest that MENARA is robust to moderate levels of input noise as simulated in our synthetic setting, with no evidence of systematic degradation in output quality across dialects.
4.6. Inter-Run Variability
To assess the reliability of the evaluation procedure, we conducted an inter-run consistency analysis using a random 10% subset of the dataset, which was independently evaluated in two separate runs under identical conditions. Agreement between the two runs was quantified using Krippendorff’s α, yielding a value of 0.89. This level of agreement is considered near-perfect and exceeds the commonly accepted threshold of 0.80, indicating high stability of the evaluator. These results suggest that the scoring process is robust to stochastic variation and that the evaluator produces consistent judgments across repeated runs.
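For readers unfamiliar with the agreement statistic, the following sketch computes Krippendorff's alpha for two runs with no missing ratings. It assumes the interval distance metric (squared differences); the text does not state which metric was used, and an ordinal metric would give somewhat different values on a 1–5 scale. The function name is hypothetical.

```python
def krippendorff_alpha_interval(run_a, run_b):
    """Krippendorff's alpha for two complete runs, interval metric.

    alpha = 1 - D_o / D_e, where D_o is the mean squared disagreement
    within items and D_e is the mean squared difference between all
    pairs of pooled ratings.
    """
    if len(run_a) != len(run_b) or len(run_a) < 2:
        raise ValueError("runs must have equal length of at least 2")
    n_items = len(run_a)
    # Observed disagreement: squared difference between the two runs per item.
    d_o = sum((a - b) ** 2 for a, b in zip(run_a, run_b)) / n_items
    # Expected disagreement over all ordered pairs of pooled values, using
    # sum_{i != j} (v_i - v_j)^2 = 2 * n * sum(v^2) - 2 * (sum v)^2.
    pooled = list(run_a) + list(run_b)
    n = len(pooled)
    s1 = sum(pooled)
    s2 = sum(v * v for v in pooled)
    d_e = (2 * n * s2 - 2 * s1 * s1) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1 - d_o / d_e
```

Identical runs give an alpha of exactly 1, and values fall toward 0 as the disagreement between runs approaches what would be expected by chance.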
5. Discussion
This study provides empirical evidence that parameter-space model merging enables unified multidialectal medical NLP without per-dialect fine-tuning. In response to our core research question, the results show that a single merged model can support cross-dialect medical communication. By integrating dialect-specialized models with a medical domain model using TIES merging, we construct MENARA, a unified system that effectively processes and generates medical content across Egyptian Arabic, Moroccan Darija, and MSA.
Quantitative evaluation indicates that MENARA achieves a strong balance between dialectal specificity and cross-dialect generalization. Across all test scenarios, the merged model attained dialectal fidelity scores ranging from 3.02 to 4.89, consistently performing near the top for MSA, Egyptian Arabic, and Moroccan Darija prompts. While dialect-specialized models unsurprisingly performed best on their respective target dialects, MENARA demonstrated competitive fidelity across dialects without requiring separate dialect-specific deployments.
Comparisons involving the Egyptian Arabic model should be interpreted with caution, as EGY is the only model fine-tuned in-house, whereas the remaining models were incorporated as off-the-shelf checkpoints and merged directly. This difference in training provenance may partially explain EGY’s strong performance on Egyptian Arabic prompts. Notably, MENARA achieves robust cross-dialect performance without additional fine-tuning, underscoring the practical advantage of the merging strategy.
MENARA showed particular strength in cross-dialect interpretation, accurately processing Moroccan Darija symptom descriptions for Egyptian Arabic-speaking clinicians. This capability directly addresses real-world communication barriers in multilingual healthcare environments, where patients and clinicians often rely on different Arabic varieties. Analysis of dialect composition further revealed linguistically plausible behavior: responses to Egyptian Arabic prompts exhibited high dialect purity, while Moroccan Darija outputs naturally incorporated MSA and French loanwords. This reflects authentic usage patterns in Moroccan Darija, where technical concepts are frequently expressed through code-switching rather than purely dialectal forms.
Qualitative assessment confirmed the model’s practical utility in clinically relevant scenarios, demonstrating accurate interpretation of inputs in one dialect and coherent reformulation in another. Human evaluation further validated real-world applicability, with native speakers rating naturalness and coherence for Egyptian Arabic (average score 4.87) and for Moroccan Darija (average score 4.20). These findings support the premise that linguistic form and domain knowledge can be effectively disentangled during merging, consistent with recent work on parameter-efficient multitask learning.
Benchmark comparisons against larger Arabic LLMs, including Jais, ALLaM, and Fanar, highlight the effectiveness of the proposed approach. Despite its relatively compact size (2B parameters), MENARA outperformed these general-purpose models on dialectal fidelity in medical settings, emphasizing the value of specialization through merging rather than scale alone.
However, expert evaluation of medical correctness revealed a clear performance gradient across languages. English outputs were consistently accurate and clinically sound. Dialectal responses, while generally plausible, exhibited moderate variability. This limitation reflects the underlying training distribution of the base medical model, which is predominantly English- and MSA-centric, with limited exposure to dialectal medical data. While merging successfully transfers dialectal linguistic patterns, it cannot enrich medical knowledge beyond what exists in the base model. This highlights a broader challenge in dialectal medical NLP: linguistic fluency does not necessarily imply equivalent medical precision, particularly for low-resource dialects with limited standardized terminology.
From a deployment perspective, the resource efficiency of the approach is particularly compelling. The TIES-merging process completed in approximately 10 min on a single L4 GPU, using 9.3 GB of memory, and reduced storage requirements by 67% compared to maintaining separate specialized models. This lightweight computational footprint makes dialect-aware medical NLP feasible in resource-constrained environments where per-dialect fine-tuning would be impractical.
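The reported storage reduction is consistent with a simple back-of-the-envelope reading. Assuming (this is not stated explicitly in the text) that the baseline is three separately stored checkpoints of equal size $S$ and that the merged model occupies a single checkpoint of the same size:

```latex
\text{reduction} \;=\; 1 - \frac{S}{3S} \;=\; \frac{2}{3} \;\approx\; 66.7\% \;\approx\; 67\%
```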
Importantly, the merged model retained strong English performance, achieving scores comparable to the base medical model. This confirms that model merging is not a zero-sum process: dialectal specialization can be added without degrading core medical knowledge, which remains essential for accessing global clinical literature.
Overall, this work illustrates several practical implications of model merging. It validates merging as an effective strategy for building a single, unified model capable of supporting multiple specialized behaviors, offers a flexible pathway for extending model capabilities without full retraining, and demonstrates that core competencies can be preserved alongside new specializations. More broadly, the proposed methodology provides a generalizable template for applying resource-efficient model merging in other fragmented domains—linguistic, regional, or topical—such as legal, educational, or customer service applications.
Despite these promising results, limitations remain. Dialectal coverage is currently restricted to Egyptian Arabic and Moroccan Darija; extending the approach to additional varieties such as Levantine or Gulf Arabic would better assess scalability. While synthetic data mitigated data scarcity, real-world patient utterances are likely to exhibit greater variability and noise than those captured in our evaluation. Moreover, this study focused primarily on clinician-facing comprehension; future work should examine patient-facing generation tasks, including dialect-specific medical advice, to more fully characterize bidirectional clinical utility.