Arabic Topic Classification Corpus of the Nakba Short Stories

Osama Hamed; Nadeem Zaidkilani

Abstract

In this paper, we enrich Arabic Natural Language Processing (NLP) resources by introducing the “Nakba Topic Classification Corpus (NTCC),” a novel annotated Arabic corpus derived from narratives about the Nakba. The NTCC comprises approximately 470 sentences extracted from eight short stories and captures the thematic depth of the Nakba narratives, providing insights into both historical and personal dimensions. The corpus was annotated in a two-step process. One third of the dataset was manually annotated, achieving an inter-annotator agreement (IAA) of 87%, with disagreements later resolved to reach 100%, while the rest was annotated using a rule-based system built on thematic patterns. This approach ensures consistency and reproducibility, enhancing the corpus’s reliability for NLP research. The NTCC contributes to the preservation of Palestinian cultural heritage while addressing key challenges in Arabic NLP, such as data scarcity and linguistic complexity. By supporting tasks such as topic modeling and classification, the NTCC offers a valuable resource for advancing Arabic NLP research and fostering a deeper understanding of the Nakba narratives.

Anthology ID: 2025.nakbanlp-1.6
Volume: Proceedings of the first International Workshop on Nakba Narratives as Language Resources
Month: January
Year: 2025
Address: Abu Dhabi
Editors: Mustafa Jarrar, Nizar Habash, Mo El-Haj, Amal Haddad Haddad, Zeina Jallad, Camille Mansour, Diana Allan, Paul Rayson, Tymaa Hammouda, Sanad Malaysha
Venues: NakbaNLP | WS
Publisher: Association for Computational Linguistics
Pages: 48–55
URL: https://aclanthology.org/2025.nakbanlp-1.6/
Cite (ACL): Osama Hamed and Nadeem Zaidkilani. 2025. Arabic Topic Classification Corpus of the Nakba Short Stories. In Proceedings of the first International Workshop on Nakba Narratives as Language Resources, pages 48–55, Abu Dhabi. Association for Computational Linguistics.
Cite (Informal): Arabic Topic Classification Corpus of the Nakba Short Stories (Hamed & Zaidkilani, NakbaNLP 2025)
PDF: https://aclanthology.org/2025.nakbanlp-1.6.pdf
