Orthographic Robustness of Persian Named Entity Recognition Models

Henry Gagnier; Sophie Gagnier

Abstract

Named Entity Recognition (NER) models trained on clean text often fail on real-world data containing orthographic noise. Work on Persian NER is emerging, but it has not yet examined how robust models are to the orthographic perturbations common in user-generated content. We evaluate ParsBERT, ParsBERT v2.0, BertNER, and two XLM-r-based models on a subset of Persian-NER-Dataset-500k after applying eleven different perturbations, including simulated typos, code-switching, and segmentation errors. All models performed comparably, but XLM-r-large was consistently the most robust to perturbations. Code-switching, typos, similar character swaps, segmentation errors, and noisy text all decreased F1 scores, while Latinized numbers increased F1 scores in ParsBERT. Removing diacritics, removing zero-width non-joiners, and normalizing Yeh/Kaf had no effect on F1. These findings suggest that Persian NER models need to improve on noisy text, and that the Perso-Arabic script introduces factors into NER, such as code-switching and Eastern Arabic numerals, that are absent from many high-resource languages. This work lays a foundation for the development of robust Persian NER models and highlights the necessity of evaluating low-resource NER models under challenging and realistic conditions.
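
Four of the perturbations named in the abstract are deterministic character-level rewrites, so they can be illustrated with a minimal sketch. This is not the authors' code: the function names, the exact set of diacritic code points, and the direction of the Yeh/Kaf mapping (Arabic forms to Persian forms) are all assumptions made for illustration. Perturbations are applied per token so that the NER labels stay aligned.

```python
import re

# Assumption: "diacritics" means the Arabic harakat U+064B..U+0652
# plus the superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Zero-width non-joiner, used in Persian inside compound words.
ZWNJ = "\u200c"

# Assumption: normalization maps Arabic Yeh / Alef Maksura / Kaf
# to the Persian Yeh (U+06CC) and Keheh (U+06A9).
YEH_KAF = str.maketrans({"\u064A": "\u06CC", "\u0649": "\u06CC", "\u0643": "\u06A9"})

# Extended Arabic-Indic (Persian) and Arabic-Indic digits mapped to ASCII digits.
_PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
_ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
DIGITS = str.maketrans(_PERSIAN_DIGITS + _ARABIC_DIGITS, "0123456789" * 2)


def remove_diacritics(text: str) -> str:
    """Strip short-vowel and related combining marks."""
    return DIACRITICS.sub("", text)


def remove_zwnj(text: str) -> str:
    """Delete zero-width non-joiners without inserting a space."""
    return text.replace(ZWNJ, "")


def normalize_yeh_kaf(text: str) -> str:
    """Replace Arabic Yeh/Kaf code points with their Persian counterparts."""
    return text.translate(YEH_KAF)


def latinize_digits(text: str) -> str:
    """Replace Persian and Arabic-Indic digits with ASCII digits."""
    return text.translate(DIGITS)


def perturb_tokens(tokens, perturbation):
    """Apply a perturbation to each token; the token count is unchanged,
    so a parallel list of NER tags remains aligned."""
    return [perturbation(token) for token in tokens]
```

The noisier perturbations in the study (typos, code-switching, segmentation errors) involve random choices or changes to token boundaries and are not covered by this sketch.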

Anthology ID: 2026.abjadnlp-1.14
Volume: Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Month: March
Year: 2026
Address: Rabat, Morocco
Venues: AbjadNLP | WS
Publisher: Association for Computational Linguistics
Pages: 110–114
URL: https://aclanthology.org/2026.abjadnlp-1.14/
Cite (ACL): Henry Gagnier and Sophie Gagnier. 2026. Orthographic Robustness of Persian Named Entity Recognition Models. In Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script, pages 110–114, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Orthographic Robustness of Persian Named Entity Recognition Models (Gagnier & Gagnier, AbjadNLP 2026)
PDF: https://aclanthology.org/2026.abjadnlp-1.14.pdf
