Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition

Nagham Hamad; Mohammed Khalilia; Mustafa Jarrar

doi:10.18653/v1/2025.findings-acl.382

Abstract

We introduce Konooz, a novel multi-dimensional corpus covering 16 Arabic dialects across 10 domains, resulting in 160 distinct corpora. The corpus comprises about 777k tokens, carefully collected and manually annotated with 21 entity types using both nested and flat annotation schemes, following the Wojood guidelines. While Konooz is useful for various NLP tasks such as domain adaptation and transfer learning, this paper primarily focuses on benchmarking existing Arabic Named Entity Recognition (NER) models, especially cross-domain and cross-dialect model performance. Our benchmarking of four Arabic NER models using Konooz reveals a significant drop in performance of up to 38% when compared to the in-distribution data. Furthermore, we present an in-depth analysis of domain and dialect divergence and the impact of resource scarcity. We also measure the overlap between domains and dialects using the Maximum Mean Discrepancy (MMD) metric, and illustrate why certain NER models perform better on specific dialects and domains. Konooz is open-source and publicly available at https://sina.birzeit.edu/wojood/#download

Anthology ID: 2025.findings-acl.382
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7316–7331
URL: https://aclanthology.org/2025.findings-acl.382/
DOI: 10.18653/v1/2025.findings-acl.382
Cite (ACL): Nagham Hamad, Mohammed Khalilia, and Mustafa Jarrar. 2025. Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7316–7331, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition (Hamad et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.382.pdf