Automated Building of a Multidialectal Parallel Arabic Corpus Using Large Language Models
Abstract
1. Introduction
2. Methodology
2.1. Domain Specification
2.2. Initial Sentence Generation
2.3. Dialectal Transformation
2.4. Deduplication and Data Refinement
2.4.1. Stage 1: MSA-Based Filtering and Deduplication
2.4.2. Stage 2: Parallel Entry Deduplication
2.4.3. Final Output
3. Results
3.1. Corpus Characteristics
3.2. Preliminary Statistical Analysis of the Corpus
3.2.1. Frequency Distribution of Words and N-Grams
3.2.2. Sentence Length Distribution Analysis
3.2.3. Lexical Overlap Between MSA and Dialects
4. Discussion
4.1. Dialectal Coverage and Linguistic Diversity
4.2. Limitations
5. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
References

| English | MSA | Saudi | Egyptian | Iraqi | Levantine | Moroccan |
|---|---|---|---|---|---|---|
| Can I change my trip date? | هل يمكنني تغيير موعد رحلتي؟ | هل أقدر أغير موعد رحلتي؟ | ينفع أغير معاد رحلتي؟ | أگدر أغير موعد سفري؟ | فيني غيّر موعد سفري؟ | واش نقدر نبدل موعد الرحلة ديالي؟ |
| I want to visit museums and historical sites. | أريد زيارة المتاحف والمعالم التاريخية. | أبغى أزور المتاحف والمعالم التاريخية. | أنا عايز أزور المتاحف والمعالم الأثرية. | أريد أزور المتاحف والمعالم التاريخية. | بدي زور المتاحف والمعالم التاريخية. | بغيت نزور المتاحف والمعالم التاريخية. |
| Can I book a room with a sea view? | هل يمكنني حجز غرفة مطلة على البحر؟ | هل أقدر أحجز غرفة مطلة على البحر؟ | ممكن أحجز أوضة مطلة على البحر؟ | أگدر أحجز غرفة مطلة على البحر؟ | فيني أحجز غرفة مطلّة عالبحر؟ | واش يمكن ليا نحجز بيت مطل على البحر؟ |
| Are there restaurants nearby? | هل يوجد مطاعم قريبة من هنا؟ | فيه مطاعم قريبة من هنا؟ | فيه مطاعم قريبة من هنا؟ | اكو مطاعم قريبة من هنا؟ | في مطاعم قريبة من هون؟ | واش كاين شي مطاعم قراب من هنا؟ |
| I’m looking for a family-friendly hotel. | أبحث عن فندق مناسب للعائلات. | أدوّر عن فندق مناسب للعوايل. | بدوّر على فندق مناسب للعائلات. | أدوّر على فندق مناسب للعوائل. | عم دوّر على فندق مناسب للعيل. | كنْقَلّب على فندق مناسب للعائلات. |
| I want to try traditional local food. | أريد أن أجرب الأطعمة المحلية التقليدية. | أبغى أجرب الأكلات الشعبية. | أنا عايز أجرب الأكلات البلدي التقليدية. | أريد أجرب الأكل الشعبي التراثي. | بدي جرب الأكلات البلدية التقليدية. | بغيت نجرب الماكولات التقليدية ديال البلاد. |
| I want to book a round-trip flight. | أريد حجز رحلة ذهاب وإياب. | أبغى أحجز رحلة ذهاب وإياب. | أنا عايز أحجز رحلة ذهاب وإياب. | أريد أحجز رحلة ذهاب وإياب. | بدي أحجز رحلة ذهاب وإياب. | بغيت نحجز رحلة ذهاب وإياب. |
| Are there vegetarian options? | هل يوجد خيارات للنباتيين؟ | فيه خيارات للنباتيين؟ | فيه حاجات للنباتيين؟ | اكو خيارات للنباتيين؟ | في خيارات للنباتيين؟ | واش كاين شي حاجة للنباتيين؟ |
| Is there a shuttle service from the airport to the hotel? | هل يوجد خدمة توصيل من المطار إلى الفندق؟ | فيه خدمة توصيل من المطار للفندق؟ | فيه خدمة توصيل من المطار للفندق؟ | اكو خدمة توصيل من المطار للفندق؟ | في خدمة توصيل من المطار عالفندق؟ | واش كاينة شي خدمة ديال التوصيل من المطَار للفندق؟ |
| What activities are available in this area? | ما هي الأنشطة المتاحة في هذه المنطقة؟ | وش الأنشطة اللي موجودة في هالمنطقة؟ | إيه الأنشطة اللي موجودة في المنطقة دي؟ | شنو الأنشطة المتوفرة بهاي المنطقة؟ | شو الأنشطة يلي موجودة بهالمنطقة؟ | شنو الأنشطة اللي كاينة فهاد المنطقة؟ |

| Metric | MSA | Saudi | Egyptian | Iraqi | Levantine | Moroccan |
|---|---|---|---|---|---|---|
| Total Sentences | 51,840 | 51,840 | 51,840 | 51,840 | 51,840 | 51,840 |
| Total Tokens | 264,879 | 283,222 | 298,088 | 284,760 | 285,037 | 311,882 |
| Unique Tokens | 8826 | 9224 | 9299 | 10,106 | 10,008 | 11,489 |
| Type–Token Ratio (TTR) % | 3.33 | 3.26 | 3.12 | 3.55 | 3.51 | 3.68 |
| Avg. Sentence Length (Tokens) | 5.30 | 5.66 | 5.96 | 5.69 | 5.70 | 6.24 |
| Avg. Sentence Length (Chars) | 29.85 | 35.46 | 39.70 | 35.84 | 38.94 | 41.82 |
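
The Type–Token Ratio row can be reproduced from the two rows above it, as unique tokens divided by total tokens, expressed as a percentage. The following is a minimal Python sketch of that arithmetic using the counts from the table; the names `COUNTS` and `type_token_ratio` are illustrative and do not come from the article.

```python
# Counts copied from the table above: (total tokens, unique tokens).
COUNTS = {
    "MSA": (264_879, 8_826),
    "Saudi": (283_222, 9_224),
    "Egyptian": (298_088, 9_299),
    "Iraqi": (284_760, 10_106),
    "Levantine": (285_037, 10_008),
    "Moroccan": (311_882, 11_489),
}


def type_token_ratio(total_tokens: int, unique_tokens: int) -> float:
    """Return unique/total as a percentage (hypothetical helper)."""
    return 100.0 * unique_tokens / total_tokens


# Round to two decimals, matching the precision used in the table.
ttr = {name: round(type_token_ratio(total, unique), 2)
       for name, (total, unique) in COUNTS.items()}

for name, value in ttr.items():
    print(f"{name}: {value:.2f}%")
```

Because TTR falls as a corpus grows, these values are best compared across columns of this table, where every variety has the same number of sentences, rather than against corpora of a different size.
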

| Dialect | Shared Unique Words with MSA | % of MSA Vocab in Dialect | % of Dialect Vocab in MSA | Unique Words in Dialect (Not in MSA) |
|---|---|---|---|---|
| Saudi | 5963 | 69.06 | 64.72 | 3251 |
| Egyptian | 5287 | 61.23 | 56.98 | 3991 |
| Iraqi | 5747 | 66.55 | 56.94 | 4346 |
| Levantine | 5421 | 62.78 | 54.25 | 4572 |
| Moroccan | 4961 | 57.45 | 43.36 | 6481 |
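
The overlap figures in this table are set operations over the two vocabularies. The sketch below shows one way such values could be computed in Python. It assumes whitespace tokenization with edge punctuation stripped, which may differ from the tokenization used in the article, and the function names are hypothetical. The two sample sentences are the MSA and Saudi renderings of "I want to book a round-trip flight." from the example table.

```python
PUNCT = ".,!?؟،"


def tokenize(sentence: str) -> list[str]:
    # Assumption: split on whitespace and strip punctuation at token edges.
    tokens = (tok.strip(PUNCT) for tok in sentence.split())
    return [tok for tok in tokens if tok]


def lexical_overlap(msa_sentences, dialect_sentences):
    """Compare the vocabularies of two aligned sentence lists (illustrative)."""
    msa_vocab = {tok for s in msa_sentences for tok in tokenize(s)}
    dialect_vocab = {tok for s in dialect_sentences for tok in tokenize(s)}
    shared = msa_vocab & dialect_vocab
    return {
        "shared": len(shared),
        "pct_msa_in_dialect": 100 * len(shared) / len(msa_vocab),
        "pct_dialect_in_msa": 100 * len(shared) / len(dialect_vocab),
        "dialect_only": len(dialect_vocab - msa_vocab),
    }


msa = ["أريد حجز رحلة ذهاب وإياب."]
saudi = ["أبغى أحجز رحلة ذهاب وإياب."]
stats = lexical_overlap(msa, saudi)
print(stats)
```

The table's own numbers are consistent with this definition: for Saudi, 5963 shared words out of 5963 + 3251 dialect words gives about 64.72%, which matches the "% of Dialect Vocab in MSA" column.
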
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Almeman, K. Automated Building of a Multidialectal Parallel Arabic Corpus Using Large Language Models. Data 2025, 10, 208. https://doi.org/10.3390/data10120208

