@inproceedings{hawasly-etal-2025-arabicweb,
title = "{A}rabic{W}eb-Edu: Educational Quality Data for {A}rabic {LLM} Training",
author = "Hawasly, Majd and
Mohiuddin, Tasnim and
Mubarak, Hamdy and
Boughorbel, Sabri",
editor = "Darwish, Kareem and
Ali, Ahmed and
Abu Farha, Ibrahim and
Touileb, Samia and
Zitouni, Imed and
Abdelali, Ahmed and
Al-Ghamdi, Sharefah and
Alkhereyf, Sakhar and
Zaghouani, Wajdi and
Khalifa, Salam and
AlKhamissi, Badr and
Almatham, Rawan and
Hamed, Injy and
Alyafeai, Zaid and
Alowisheq, Areeb and
Inoue, Go and
Mrini, Khalil and
Alshammari, Waad",
booktitle = "Proceedings of The Third Arabic Natural Language Processing Conference",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.arabicnlp-main.36/",
pages = "436--447",
ISBN = "979-8-89176-352-4",
abstract = "The quality of training data plays a critical role in the performance of large language models (LLMs). This is especially true for low-resource languages where high-quality content is relatively scarce. Inspired by the success of FineWeb-Edu for English, we construct a native Arabic educational-quality dataset using similar methodological principles. We begin by sampling 1 million Arabic web documents from Common Crawl and labeling them into six quality classes (0{--}5) with Qwen-2.5-72B-Instruct model using a classification prompt adapted from FineWeb-Edu. These labeled examples are used to train a robust classifier capable of distinguishing educational content from general web text. We train a classification head on top of a multilingual 300M encoder model, then use this classifier to filter a large Arabic web corpus, discarding documents with low educational value. To evaluate the impact of this curation, we pretrain from scratch two bilingual English-Arabic 7B LLMs on 800 billion tokens using the filtered and unfiltered data and compare their performance across a suite of benchmarks. Our results show a significant improvement when using the filtered educational dataset, validating the effectiveness of quality filtering as a component in a balanced data mixture for Arabic LLM development. This work addresses the scarcity of high-quality Arabic training data and offers a scalable methodology for curating educational quality content in low-resource languages."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="hawasly-etal-2025-arabicweb">
<titleInfo>
<title>ArabicWeb-Edu: Educational Quality Data for Arabic LLM Training</title>
</titleInfo>
<name type="personal">
<namePart type="given">Majd</namePart>
<namePart type="family">Hawasly</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tasnim</namePart>
<namePart type="family">Mohiuddin</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hamdy</namePart>
<namePart type="family">Mubarak</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sabri</namePart>
<namePart type="family">Boughorbel</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of The Third Arabic Natural Language Processing Conference</title>
</titleInfo>
<name type="personal">
<namePart type="given">Kareem</namePart>
<namePart type="family">Darwish</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ahmed</namePart>
<namePart type="family">Ali</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ibrahim</namePart>
<namePart type="family">Abu Farha</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Samia</namePart>
<namePart type="family">Touileb</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Imed</namePart>
<namePart type="family">Zitouni</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ahmed</namePart>
<namePart type="family">Abdelali</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sharefah</namePart>
<namePart type="family">Al-Ghamdi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sakhar</namePart>
<namePart type="family">Alkhereyf</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wajdi</namePart>
<namePart type="family">Zaghouani</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Salam</namePart>
<namePart type="family">Khalifa</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Badr</namePart>
<namePart type="family">AlKhamissi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rawan</namePart>
<namePart type="family">Almatham</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Injy</namePart>
<namePart type="family">Hamed</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zaid</namePart>
<namePart type="family">Alyafeai</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Areeb</namePart>
<namePart type="family">Alowisheq</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Go</namePart>
<namePart type="family">Inoue</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Khalil</namePart>
<namePart type="family">Mrini</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Waad</namePart>
<namePart type="family">Alshammari</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-352-4</identifier>
</relatedItem>
<abstract>The quality of training data plays a critical role in the performance of large language models (LLMs). This is especially true for low-resource languages where high-quality content is relatively scarce. Inspired by the success of FineWeb-Edu for English, we construct a native Arabic educational-quality dataset using similar methodological principles. We begin by sampling 1 million Arabic web documents from Common Crawl and labeling them into six quality classes (0–5) with Qwen-2.5-72B-Instruct model using a classification prompt adapted from FineWeb-Edu. These labeled examples are used to train a robust classifier capable of distinguishing educational content from general web text. We train a classification head on top of a multilingual 300M encoder model, then use this classifier to filter a large Arabic web corpus, discarding documents with low educational value. To evaluate the impact of this curation, we pretrain from scratch two bilingual English-Arabic 7B LLMs on 800 billion tokens using the filtered and unfiltered data and compare their performance across a suite of benchmarks. Our results show a significant improvement when using the filtered educational dataset, validating the effectiveness of quality filtering as a component in a balanced data mixture for Arabic LLM development. This work addresses the scarcity of high-quality Arabic training data and offers a scalable methodology for curating educational quality content in low-resource languages.</abstract>
<identifier type="citekey">hawasly-etal-2025-arabicweb</identifier>
<location>
<url>https://aclanthology.org/2025.arabicnlp-main.36/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>436</start>
<end>447</end>
</extent>
</part>
</mods>
</modsCollection>
Markdown (Informal)
[ArabicWeb-Edu: Educational Quality Data for Arabic LLM Training](https://aclanthology.org/2025.arabicnlp-main.36/) (Hawasly et al., ArabicNLP 2025)