Open Access Article
ArabicEduCrawler: AI-Assisted Focused Crawling and Corpus Construction for Arabic Educational Web Content
by Afyaa Atyan Alkhamisi *, Fatmah Bamashmoos and Wafaa Alsaggaf
Department of Information Technology, Faculty of Computing & Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(12), 5964; https://doi.org/10.3390/app16125964 (registering DOI)
Submission received: 3 May 2026 / Revised: 6 June 2026 / Accepted: 9 June 2026 / Published: 12 June 2026
Abstract
Arabic natural language processing (NLP) faces major difficulties due to the language’s rich morphological structure and the scarcity of high-quality datasets, especially for educational material distributed across diverse online platforms. Many existing large-scale corpus construction methods depend on extensive web crawling followed by substantial post-processing. This process may introduce irrelevant or low-quality data and often fails to represent the target domain adequately. Because most current resources are inadequate, a robust approach is needed for building corpora that offer the domain specificity and linguistic depth required by educational NLP systems. This paper presents ArabicEduCrawler, an AI-assisted focused crawling framework designed to improve the discovery, acquisition, and organization of Arabic educational web content. The framework integrates domain-aware source selection, in-crawl Arabic language detection using FastText, large language model (LLM)-assisted XPath extraction, and metadata retrieval to support corpus quality and traceability. Its two-layer architecture combines dynamic web crawling using Scrapy-Playwright with advanced NLP processing, including automatic linguistic annotation with GateNLP and Stanza and a sentence-aware chunking strategy designed for transformer-compatible token limits. Experiments across four major Arabic educational domains produced the Arabic Educational Web Corpus (AraEdu-WC), which consists of 101,770 documents segmented into approximately 286,000 text chunks, with more than 50 million tokens, 289,778 sentences, and nearly 3.5 million named entities. The system achieved a harvest ratio of 95.25%, indicating its effectiveness in filtering and retaining relevant content. The evaluation of sentence-aware chunking showed consistent improvements in top-ranked retrieval, with the strategy achieving the highest Hit Rate@10 and MRR@10 across all four embedding models. In particular, the multilingual-E5-large model achieved a Hit Rate@10 of 70%, Precision@10 of 18%, and MRR@10 of 57%. These findings demonstrate that the proposed approach provides an effective balance between crawl efficiency, language purity, and content richness, offering a high-quality Arabic educational corpus for downstream NLP and retrieval research.
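For readers less familiar with the metrics reported above, the following minimal sketch shows how harvest ratio, Hit Rate@k, Precision@k, and MRR@k are conventionally computed. It assumes the standard textbook definitions; the abstract does not state the exact formulas used in the evaluation, and the function names, document identifiers, and toy values are hypothetical illustrations, not data from the study.

```python
from typing import Sequence, Set


def harvest_ratio(relevant_pages: int, crawled_pages: int) -> float:
    """Fraction of crawled pages judged relevant to the target domain."""
    if crawled_pages <= 0:
        raise ValueError("crawled_pages must be positive")
    return relevant_pages / crawled_pages


def hit_rate_at_k(ranked: Sequence[Sequence[str]],
                  relevant: Sequence[Set[str]], k: int = 10) -> float:
    """Share of queries with at least one relevant item in the top k."""
    hits = sum(
        1 for results, rel in zip(ranked, relevant)
        if any(doc in rel for doc in results[:k])
    )
    return hits / len(ranked)


def precision_at_k(ranked: Sequence[Sequence[str]],
                   relevant: Sequence[Set[str]], k: int = 10) -> float:
    """Mean over queries of (number of relevant items in the top k) / k."""
    per_query = [
        sum(1 for doc in results[:k] if doc in rel) / k
        for results, rel in zip(ranked, relevant)
    ]
    return sum(per_query) / len(per_query)


def mrr_at_k(ranked: Sequence[Sequence[str]],
             relevant: Sequence[Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant item in the top k (0 if none)."""
    total = 0.0
    for results, rel in zip(ranked, relevant):
        for rank, doc in enumerate(results[:k], start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Toy example (hypothetical identifiers): three queries, three results each.
toy_ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
toy_relevant = [{"b"}, {"d", "f"}, {"z"}]

toy_hit_rate = hit_rate_at_k(toy_ranked, toy_relevant, k=3)
toy_precision = precision_at_k(toy_ranked, toy_relevant, k=3)
toy_mrr = mrr_at_k(toy_ranked, toy_relevant, k=3)
toy_harvest = harvest_ratio(relevant_pages=19, crawled_pages=20)
```

Under these definitions, Precision@10 can be far lower than Hit Rate@10 when each query has only one or two relevant chunks, which is consistent with the gap between the two figures reported above.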