Generalised Cross-Dialectal Arabic Question Answering Through Adaptive Code-Mixed Data Augmentation

Althobaiti, Maha Jarallah

doi:10.3390/info17020139

Department of Computer Science, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia

Information 2026, 17(2), 139; https://doi.org/10.3390/info17020139 (registering DOI)

Submission received: 24 December 2025 / Revised: 26 January 2026 / Accepted: 27 January 2026 / Published: 1 February 2026

(This article belongs to the Special Issue Natural Language Processing (NLP) with Applications and Natural Language Understanding (NLU), 2nd Edition)

Abstract

Modern Standard Arabic (MSA) and the many regional dialects differ substantially in vocabulary, morphology, and pragmatic usage. Most available annotated resources are in MSA, and zero-shot transfer from MSA to dialectal tasks suffers a large performance drop. This paper addresses generalised cross-dialectal Arabic question answering (QA), where the context and the question are written in different Arabic varieties. We propose a training-free augmentation framework that generates code-mixed questions to bridge lexical gaps across Arabic varieties. The method produces semantically faithful, balanced code-mixed questions through the following two-stage procedure: lexicon-based partial substitution with semantic similarity and substitution-rate constraints, followed by fallback neural machine translation with word-level alignment when needed. We also introduce automated multidialectal lexicon construction using machine translation, embedding-based alignment, and semantic checks. We carry out our evaluation in a zero-shot setting, where the model is fine-tuned only on MSA and then tested on dialectal inputs using ArDQA, covering five Arabic varieties and three domains (SQuAD, Vlogs, and Narratives). Experiments show consistent improvements under context-question dialect mismatch as follows: +1.09 F1/+0.87 EM on SQuAD, +1.54/+1.25 on Vlogs, and +2.75/+2.27 on Narratives, with the largest gains for Maghrebi questions in Narratives (+12.13 F1/+8.45 EM). These results show that our method improves zero-shot cross-dialectal transfer without fine-tuning or retraining.

Keywords: Arabic question answering; Arabic dialects; cross-dialectal transfer; embedding-based alignment; generalised zero-shot transfer; MSA; neural machine translation; semantic validation; word-level alignment
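
The first stage of the procedure described in the abstract (lexicon-based partial substitution under a substitution-rate constraint and a semantic-similarity check) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the default thresholds, the toy MSA-to-Egyptian lexicon, and the rule that triggers the fallback stage are all assumptions made for illustration, and the similarity scorer is left as a pluggable callable because the abstract does not name a specific model.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical scorer type: takes (original tokens, candidate tokens)
# and returns a similarity score in [0, 1].
Similarity = Callable[[List[str], List[str]], float]


def code_mix_question(
    tokens: List[str],
    lexicon: Dict[str, str],
    max_rate: float = 0.5,          # assumed cap on the share of replaced tokens
    min_similarity: float = 0.8,    # assumed semantic-similarity threshold
    similarity: Optional[Similarity] = None,
) -> Tuple[List[str], int]:
    """Replace some MSA tokens with dialect equivalents from a lexicon.

    Returns the code-mixed token list and the number of substitutions made.
    """
    budget = int(len(tokens) * max_rate)  # substitution-rate constraint
    mixed = list(tokens)
    replaced = 0
    for i, token in enumerate(tokens):
        if replaced >= budget:
            break
        candidate = lexicon.get(token)
        if candidate is None:
            continue  # no dialect equivalent known for this token
        trial = mixed[:i] + [candidate] + mixed[i + 1:]
        # Reject a substitution that drifts too far from the original question.
        if similarity is not None and similarity(tokens, trial) < min_similarity:
            continue
        mixed = trial
        replaced += 1
    return mixed, replaced


def needs_fallback(replaced: int) -> bool:
    """Assumed trigger for the second stage (translation plus word alignment):
    fall back only when the lexicon stage substituted nothing."""
    return replaced == 0


# Toy usage with an illustrative two-entry lexicon (not from the paper).
msa_question = ["أين", "يقع", "المتحف", "الآن"]
toy_lexicon = {"أين": "فين", "الآن": "دلوقتي"}
mixed, count = code_mix_question(msa_question, toy_lexicon, max_rate=0.25)
print(" ".join(mixed), count, needs_fallback(count))
```

Capping the number of substitutions keeps the question a mix of both varieties rather than a full translation, which is the property the abstract refers to as "balanced".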
