This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Generalised Cross-Dialectal Arabic Question Answering Through Adaptive Code-Mixed Data Augmentation

by
Maha Jarallah Althobaiti
Department of Computer Science, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia
Information 2026, 17(2), 139; https://doi.org/10.3390/info17020139 (registering DOI)
Submission received: 24 December 2025 / Revised: 26 January 2026 / Accepted: 27 January 2026 / Published: 1 February 2026

Abstract

Modern Standard Arabic (MSA) and the many regional Arabic dialects differ substantially in vocabulary, morphology, and pragmatic usage. Most available annotated resources are in MSA, and zero-shot transfer from MSA to dialectal tasks suffers a large performance drop. This paper addresses generalised cross-dialectal Arabic question answering (QA), where the context and the question are written in different Arabic varieties. We propose a training-free augmentation framework that generates code-mixed questions to bridge lexical gaps across Arabic varieties. The method produces semantically faithful, balanced code-mixed questions through a two-stage procedure: lexicon-based partial substitution with semantic similarity and substitution-rate constraints, followed by fallback neural machine translation with word-level alignment when needed. We also introduce automated multidialectal lexicon construction using machine translation, embedding-based alignment, and semantic checks. We evaluate in a zero-shot setting, where the model is fine-tuned only on MSA and then tested on dialectal inputs using ArDQA, which covers five Arabic varieties and three domains (SQuAD, Vlogs, and Narratives). Experiments show consistent improvements under context-question dialect mismatch: +1.09 F1/+0.87 EM on SQuAD, +1.54/+1.25 on Vlogs, and +2.75/+2.27 on Narratives, with the largest gains for Maghrebi questions in Narratives (+12.13 F1/+8.45 EM). These results show that our method improves zero-shot cross-dialectal transfer without any dialect-specific fine-tuning or retraining.
Keywords: Arabic question answering; Arabic dialects; cross-dialectal transfer; embedding-based alignment; generalised zero-shot transfer; MSA; neural machine translation; semantic validation; word-level alignment
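
As a rough illustration of the two-stage procedure summarised in the abstract, the following Python sketch applies lexicon-based partial substitution under a substitution-rate cap and a semantic-similarity threshold, and falls back to a translation callable only when no substitution succeeds. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`code_mix`, `jaccard`), the parameters `max_rate` and `min_similarity`, their default values, and the placeholder tokens are all hypothetical. Token-overlap (Jaccard) similarity stands in for an embedding-based semantic check, and the neural machine translation stage with word-level alignment is represented only by a user-supplied `fallback` function.

```python
from typing import Callable, Dict, Optional


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for a semantic similarity score (assumption, not the paper's metric)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def code_mix(
    question: str,
    lexicon: Dict[str, str],
    similarity: Callable[[str, str], float] = jaccard,
    max_rate: float = 0.5,        # hypothetical cap on the share of replaced tokens
    min_similarity: float = 0.3,  # hypothetical similarity threshold
    fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Return a code-mixed variant of `question`.

    Stage 1: replace tokens found in `lexicon`, one at a time, keeping a
    replacement only if the result stays similar enough to the original and
    the substitution budget is not exhausted.
    Stage 2: if nothing could be replaced, defer to `fallback` (a placeholder
    for a translation-plus-alignment step) when one is provided.
    """
    tokens = question.split()
    if not tokens:
        return question

    budget = int(max_rate * len(tokens))  # substitution-rate constraint
    mixed = list(tokens)
    replaced = 0

    for i, tok in enumerate(tokens):
        if replaced >= budget:
            break
        target = lexicon.get(tok)
        if target is None or target == tok:
            continue
        candidate = mixed[:i] + [target] + mixed[i + 1:]
        # Semantic-similarity constraint, checked against the original question.
        if similarity(question, " ".join(candidate)) >= min_similarity:
            mixed = candidate
            replaced += 1

    if replaced == 0 and fallback is not None:
        return fallback(question)
    return " ".join(mixed)


if __name__ == "__main__":
    # Placeholder tokens: lowercase = source variety, uppercase = target variety.
    toy_lexicon = {"a": "A", "b": "B", "c": "C"}
    print(code_mix("a b c d", toy_lexicon))  # → A B c d
```

With the toy settings above, the budget of two substitutions is reached after the first two tokens, so the remaining tokens stay in the source variety. Raising `min_similarity` makes the procedure more conservative, which mirrors the trade-off between lexical coverage and semantic faithfulness that the abstract describes.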