Alkhalil Corpus: An Open-Source Thematic and Lemmatized Corpus for Modern Standard Arabic

Samir Belayachi, Azzeddine Mazroui


Abstract
The availability of large annotated corpora remains a major challenge for the development of natural language processing systems for under-resourced languages such as Arabic. In this paper, we present two annotated corpora dedicated to Modern Standard Arabic. These corpora are open-source and freely available on the Hugging Face platform. The first corpus, annotated by theme and designed to provide a balanced representation of contemporary Arabic usage, comprises approximately 76 million words collected from diverse sources covering multiple domains and geographical regions. The second corpus, containing approximately one million words, is a sub-corpus extracted from the first. It was annotated with lemma tags using a semi-automatic approach: automatic annotation by the Alkhalil lemmatizer and MADAMIRA, followed by manual validation.
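The semi-automatic lemma annotation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the outputs of the two tools are combined, so the rule used here (accept a lemma when both tools agree, otherwise queue the token for a human annotator) is an assumption, not the authors' procedure. The names `LemmaDecision` and `merge_lemmatizer_outputs` are hypothetical, and the lemma lists are toy stand-ins rather than real output from the Alkhalil lemmatizer or MADAMIRA.

```python
from typing import List, NamedTuple, Optional


class LemmaDecision(NamedTuple):
    token: str
    lemma: Optional[str]   # None until a human annotator picks a lemma
    needs_review: bool     # True when the two tools disagree


def merge_lemmatizer_outputs(
    tokens: List[str],
    lemmas_a: List[str],
    lemmas_b: List[str],
) -> List[LemmaDecision]:
    """Combine two automatic lemmatizations of the same token sequence.

    Assumed rule (not taken from the paper): agreement is pre-filled,
    disagreement is left empty and flagged for manual validation.
    """
    if not (len(tokens) == len(lemmas_a) == len(lemmas_b)):
        raise ValueError("token and lemma sequences must have equal length")

    decisions = []
    for token, lemma_a, lemma_b in zip(tokens, lemmas_a, lemmas_b):
        if lemma_a == lemma_b:
            decisions.append(LemmaDecision(token, lemma_a, False))
        else:
            decisions.append(LemmaDecision(token, None, True))
    return decisions


# Toy input: hypothetical outputs of two lemmatizers for two tokens.
tokens = ["الكتب", "كتب"]
lemmas_tool_a = ["كتاب", "كتب"]
lemmas_tool_b = ["كتاب", "كتاب"]

decisions = merge_lemmatizer_outputs(tokens, lemmas_tool_a, lemmas_tool_b)
review_queue = [d.token for d in decisions if d.needs_review]
```

Under this assumed rule, only the tokens in `review_queue` would need a decision from scratch, while pre-filled lemmas could still be spot-checked during manual validation.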
Anthology ID: 2026.abjadnlp-1.27
Volume: Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Month: March
Year: 2026
Address: Rabat, Morocco
Venues: AbjadNLP | WS
Publisher: Association for Computational Linguistics
Pages: 192–197
URL: https://aclanthology.org/2026.abjadnlp-1.27/
Cite (ACL): Samir Belayachi and Azzeddine Mazroui. 2026. Alkhalil Corpus: An Open-Source Thematic and Lemmatized Corpus for Modern Standard Arabic. In Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script, pages 192–197, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Alkhalil Corpus: An Open-Source Thematic and Lemmatized Corpus for Modern Standard Arabic (Belayachi & Mazroui, AbjadNLP 2026)
PDF: https://aclanthology.org/2026.abjadnlp-1.27.pdf