Open Access Article
Generalised Cross-Dialectal Arabic Question Answering Through Adaptive Code-Mixed Data Augmentation
by
Maha Jarallah Althobaiti
Department of Computer Science, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia
Information 2026, 17(2), 139; https://doi.org/10.3390/info17020139 (registering DOI)
Submission received: 24 December 2025 / Revised: 26 January 2026 / Accepted: 27 January 2026 / Published: 1 February 2026
Abstract
Modern Standard Arabic (MSA) and the many regional dialects differ substantially in vocabulary, morphology, and pragmatic usage. Most available annotated resources are in MSA, and zero-shot transfer from MSA to dialectal tasks suffers a large performance drop. This paper addresses generalised cross-dialectal Arabic question answering (QA), where the context and the question are written in different Arabic varieties. We propose a training-free augmentation framework that generates code-mixed questions to bridge lexical gaps across Arabic varieties. The method produces semantically faithful, balanced code-mixed questions through a two-stage procedure: lexicon-based partial substitution under semantic-similarity and substitution-rate constraints, followed by fallback neural machine translation with word-level alignment when needed. We also introduce automated multidialectal lexicon construction using machine translation, embedding-based alignment, and semantic checks. We evaluate in a zero-shot setting, where the model is fine-tuned only on MSA and then tested on dialectal inputs using ArDQA, which covers five Arabic varieties and three domains (SQuAD, Vlogs, and Narratives). Experiments show consistent improvements under context-question dialect mismatch: +1.09 F1/+0.87 EM on SQuAD, +1.54 F1/+1.25 EM on Vlogs, and +2.75 F1/+2.27 EM on Narratives, with the largest gains for Maghrebi questions in Narratives (+12.13 F1/+8.45 EM). These results show that our method improves zero-shot cross-dialectal transfer without fine-tuning or retraining.
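To make the two-stage procedure concrete, the following is a minimal sketch and not the paper's implementation. Every name in it (`code_mix_question`, `toy_similarity`, `max_rate`, `min_similarity`, `fallback`) is hypothetical, the thresholds are arbitrary, the tokens are placeholders rather than real Arabic words, and the similarity function is a stand-in for an embedding-based score. The condition that triggers the fallback (no substitution was accepted) is likewise an assumption, since the abstract only says "when needed".

```python
import math
import random
from typing import Callable, Dict, List, Optional

Tokens = List[str]


def toy_similarity(original: Tokens, candidate: Tokens) -> float:
    """Placeholder score: fraction of positions left unchanged.

    A real system would presumably compare sentence embeddings instead.
    """
    if not original:
        return 1.0
    same = sum(1 for a, b in zip(original, candidate) if a == b)
    return same / len(original)


def code_mix_question(
    tokens: Tokens,
    lexicon: Dict[str, str],
    similarity: Callable[[Tokens, Tokens], float] = toy_similarity,
    max_rate: float = 0.5,          # assumed cap on the share of replaced tokens
    min_similarity: float = 0.5,    # assumed semantic-similarity floor
    fallback: Optional[Callable[[Tokens], Tokens]] = None,
    seed: int = 0,
) -> Tokens:
    """Stage 1: partial lexicon substitution; stage 2: optional fallback."""
    rng = random.Random(seed)
    # Positions whose token has a dialectal entry in the lexicon.
    candidates = [i for i, tok in enumerate(tokens) if tok in lexicon]
    rng.shuffle(candidates)

    budget = math.floor(max_rate * len(tokens))
    mixed = list(tokens)
    replaced = 0
    for i in candidates:
        if replaced >= budget:
            break
        trial = list(mixed)
        trial[i] = lexicon[tokens[i]]
        # Accept a substitution only if the question stays close to the original.
        if similarity(tokens, trial) >= min_similarity:
            mixed = trial
            replaced += 1

    # Assumed trigger for stage 2: the lexicon pass changed nothing.
    if replaced == 0 and fallback is not None:
        return fallback(tokens)
    return mixed


question = ["w1", "w2", "w3", "w4"]                 # placeholder MSA tokens
lexicon = {"w1": "d1", "w2": "d2", "w3": "d3"}      # placeholder dialect entries
mixed_question = code_mix_question(question, lexicon)
```

With these toy settings the substitution budget is two of the four tokens, so the output keeps part of the original question and replaces the rest, which is the sense in which the result is code-mixed.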