TajPersLexon: A Tajik–Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP

Mullosharaf Kurbonovich Arabov

Abstract

This work introduces TajPersLexon, a curated Tajik–Persian parallel lexical resource of 40,112 word and short-phrase pairs for cross-script lexical retrieval, transliteration, and alignment in low-resource settings. We conduct a comprehensive CPU-only benchmark comparing three methodological families: (i) a lightweight hybrid pipeline, (ii) neural sequence-to-sequence models, and (iii) retrieval methods. Our evaluation establishes that the task is essentially solvable, with neural and retrieval baselines achieving 98–99% top-1 accuracy. Crucially, we demonstrate that while large multilingual sentence transformers fail at this exact lexical matching task, our interpretable hybrid model offers a favorable accuracy-efficiency trade-off for practical applications, achieving 96.4% accuracy in an OCR post-correction task. All experiments use fixed random seeds for full reproducibility. The dataset, code, and models will be publicly released.

Anthology ID: 2026.silkroadnlp-1.4
Volume: The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Rayyan Merchant, Karine Megerdoomian
Venues: SilkRoadNLP | WS
Publisher: Association for Computational Linguistics
Pages: 29–37
URL: https://aclanthology.org/2026.silkroadnlp-1.4/
Cite (ACL): Mullosharaf Kurbonovich Arabov. 2026. TajPersLexon: A Tajik–Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP. In The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family, pages 29–37, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): TajPersLexon: A Tajik–Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP (Arabov, SilkRoadNLP 2026)
PDF: https://aclanthology.org/2026.silkroadnlp-1.4.pdf
