ArabicEduCrawler: AI-Assisted Focused Crawling and Corpus Construction for Arabic Educational Web Content

Alkhamisi, Afyaa Atyan; Bamashmoos, Fatmah; Alsaggaf, Wafaa

doi:10.3390/app16125964

This is an early access version; the complete PDF, HTML, and XML versions will be available soon.

Open Access Article

by Afyaa Atyan Alkhamisi *, Fatmah Bamashmoos and Wafaa Alsaggaf

Department of Information Technology, Faculty of Computing & Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 5964; https://doi.org/10.3390/app16125964 (registering DOI)

Submission received: 3 May 2026 / Revised: 6 June 2026 / Accepted: 9 June 2026 / Published: 12 June 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Arabic natural language processing (NLP) faces major difficulties due to the language’s rich morphological structure and the scarcity of high-quality datasets, especially for educational material distributed across diverse online platforms. Many existing large-scale corpus construction methods depend on extensive web crawling followed by substantial post-processing, which may introduce irrelevant or low-quality data and often fails to represent the target domain adequately. Because most current resources are inadequate, a robust approach to building corpora with the domain specificity and linguistic depth that educational NLP systems require is critical. This paper presents ArabicEduCrawler, an AI-assisted focused crawling framework designed to improve the discovery, acquisition, and organization of Arabic educational web content. The framework integrates domain-aware source selection, in-crawl Arabic language detection using FastText, large language model (LLM)-assisted XPath extraction, and metadata retrieval to support corpus quality and traceability. Its two-layer architecture combines dynamic web crawling using Scrapy-Playwright with advanced NLP processing, including automatic linguistic annotation with GateNLP and Stanza and a sentence-aware chunking strategy designed for transformer-compatible token limits. Experiments across four major Arabic educational domains produced the Arabic Educational Web Corpus (AraEdu-WC), which consists of 101,770 documents segmented into approximately 286k text chunks, with more than 50 million tokens, 289,778 sentences, and nearly 3.5 million named entities. The system achieved a harvest ratio of 95.25%, indicating its effectiveness in filtering and retaining relevant content. In the chunking evaluation, the sentence-aware strategy consistently improved top-ranked retrieval, achieving the highest Hit Rate@10 and MRR@10 across all four embedding models. In particular, the multilingual-E5-large model achieved a Hit Rate@10 of 70%, Precision@10 of 18%, and MRR@10 of 57%. These findings demonstrate that the proposed approach provides an effective balance between crawl efficiency, language purity, and content richness, offering a high-quality Arabic educational corpus for downstream NLP and retrieval research.
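
For readers less familiar with the evaluation measures quoted in the abstract, the following is a minimal Python sketch of how harvest ratio, Hit Rate@k, Precision@k, and MRR@k are commonly computed. The abstract does not state the exact formulas used in the paper, so the standard textbook definitions are assumed here, and all function names and the toy document IDs are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def harvest_ratio(relevant_pages: int, crawled_pages: int) -> float:
    """Fraction of crawled pages judged relevant (assumed standard definition)."""
    if crawled_pages <= 0:
        raise ValueError("crawled_pages must be positive")
    return relevant_pages / crawled_pages


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Number of relevant items in the top k, divided by k."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant item within the top k, or 0.0 if none."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 10) -> Dict[str, float]:
    """Average the per-query metrics over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("runs must contain at least one query")
    n = len(runs)
    return {
        "hit_rate": sum(hit_at_k(r, rel, k) for r, rel in runs) / n,
        "precision": sum(precision_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / n,
    }


if __name__ == "__main__":
    # Toy data: two queries with hypothetical chunk IDs, evaluated at k=2.
    toy_runs = [
        (["a", "b", "c"], {"b"}),  # first relevant chunk at rank 2
        (["x", "y"], {"z"}),       # no relevant chunk retrieved
    ]
    print(evaluate(toy_runs, k=2))
    print(harvest_ratio(95, 100))
```

Under these definitions, Precision@10 can sit well below Hit Rate@10 when each query has only one or two relevant chunks, because a single hit among ten results yields a per-query precision of 0.1.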

Keywords: Arabic natural language processing; information retrieval; focused web crawling; web scraping; corpus construction; educational web content; language identification; text chunking
