The Nakba Lexicon: Building a Comprehensive Dataset from Palestinian Literature

Izza AbuHaija; Salim Al Mandhari; Mo El-Haj; Jonas Sibony; Paul Rayson

Abstract

This paper introduces the Nakba Lexicon, a comprehensive dataset derived from the poetry collection Asifa ‘Ala al-Iz‘aj (Sorry for the Disturbance) by Istiqlal Eid, a Palestinian poet from El-Birweh. Eid’s work poignantly reflects on themes of Palestinian identity, displacement, and resilience, serving as a resource for preserving linguistic and cultural heritage in the context of post-Nakba literature. The dataset is structured into ten thematic domains, including political terminology, memory and preservation, sensory and emotional lexicon, toponyms, nature, and external linguistic influences such as Hebrew, French, and English, thereby capturing the socio-political, emotional, and cultural dimensions of the Nakba. The Nakba Lexicon uniquely emphasises the contributions of women to Palestinian literary traditions, shedding light on often-overlooked narratives of resilience and cultural continuity. Advanced Natural Language Processing (NLP) techniques were employed to analyse the dataset, with fine-tuned pre-trained models such as ARABERT and MARBERT achieving F1-scores of 0.87 and 0.68 in language and lexical classification tasks, respectively, significantly outperforming traditional machine learning models. These results highlight the potential of domain-specific computational models to effectively analyse complex datasets, facilitating the preservation of marginalised voices. By bridging computational methods with cultural preservation, this study enhances the understanding of Palestinian linguistic heritage and contributes to broader efforts in documenting and analysing endangered narratives. The Nakba Lexicon paves the way for future interdisciplinary research, showcasing the role of NLP in addressing historical trauma, resilience, and cultural identity.

Anthology ID: 2025.nakbanlp-1.5
Volume: Proceedings of the first International Workshop on Nakba Narratives as Language Resources
Month: January
Year: 2025
Address: Abu Dhabi
Editors: Mustafa Jarrar, Nizar Habash, Mo El-Haj, Amal Haddad Haddad, Zeina Jallad, Camille Mansour, Diana Allan, Paul Rayson, Tymaa Hammouda, Sanad Malaysha
Venues: NakbaNLP | WS
Publisher: Association for Computational Linguistics
Pages: 37–47
URL: https://aclanthology.org/2025.nakbanlp-1.5/
Cite (ACL): Izza AbuHaija, Salim Al Mandhari, Mo El-Haj, Jonas Sibony, and Paul Rayson. 2025. The Nakba Lexicon: Building a Comprehensive Dataset from Palestinian Literature. In Proceedings of the first International Workshop on Nakba Narratives as Language Resources, pages 37–47, Abu Dhabi. Association for Computational Linguistics.
Cite (Informal): The Nakba Lexicon: Building a Comprehensive Dataset from Palestinian Literature (AbuHaija et al., NakbaNLP 2025)
PDF: https://aclanthology.org/2025.nakbanlp-1.5.pdf
