Arabov Mullosharaf Kurbonovich

2026

Character-Level Transformer for Tajik–Persian Transliteration with a Parallel Lexical Corpus
Arabov Mullosharaf Kurbonovich
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script

This study addresses automatic transliteration from Tajik (Cyrillic script) to Persian (Perso-Arabic script). We present a curated, lexicographically verified parallel corpus of 52,152 Tajik–Persian words and short phrases, compiled from printed dictionaries, encyclopedic sources, and manually verified online resources. To the best of our knowledge, this is one of the largest publicly available word-level corpora for Tajik–Persian transliteration. On this corpus, we train a character-level sequence-to-sequence Transformer model and evaluate it with Character Error Rate (CER) and exact-match accuracy. The best Transformer configuration with beam search (k=3) achieves a CER of 0.3182 and an exact-match accuracy of 0.3215, with lower error rates than the dictionary-based rule-based and recurrent neural baselines. We describe the data collection and preprocessing pipeline, the model architecture, and the experimental protocol, and we report a part-of-speech analysis showing performance differences across lexical categories. All resources (dataset, preprocessing scripts, splits, and training configurations) will be released publicly to ensure reproducibility and to facilitate future work on Tajik–Persian transliteration, cross-script NLP, and lexicographic applications.
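The two metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say how CER is aggregated, so this example assumes the common corpus-level definition (total character edit distance divided by total reference length). The function names `edit_distance`, `cer`, and `exact_match` are hypothetical and are not taken from the authors' released scripts.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER (assumed definition): total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars if total_chars else 0.0


def exact_match(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Fraction of hypotheses that are identical to their reference."""
    if not refs:
        return 0.0
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


if __name__ == "__main__":
    # Toy Persian references and system outputs, invented for illustration only.
    references = ["کتاب", "مدرسه"]
    hypotheses = ["کتب", "مدرسه"]
    print("CER:", cer(references, hypotheses))
    print("Exact match:", exact_match(references, hypotheses))
```

Because a single wrong character makes a whole word count as a miss, exact-match accuracy is the stricter of the two metrics, while CER gives partial credit for near-correct outputs.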
