Arabov Mullosharaf Kurbonovich


2026

This study addresses automatic transliteration from Tajik (Cyrillic script) to Persian (Perso-Arabic script). We present a curated, lexicographically verified parallel corpus of 52,152 Tajik–Persian words and short phrases, compiled from printed dictionaries, encyclopedic sources, and manually verified online resources. To the best of our knowledge, this is one of the largest publicly available word-level corpora for Tajik–Persian transliteration. Using this corpus, we train a character-level sequence-to-sequence Transformer model and evaluate it using Character Error Rate (CER) and exact-match accuracy. The best Transformer configuration with beam search (k=3) achieves a CER of 0.3182 and an exact-match accuracy of 0.3215, achieving lower error rates than dictionary-based rule-based and recurrent neural baselines. We describe the data collection and preprocessing pipeline, model architecture, and experimental protocol, and report a part-of-speech analysis showing performance differences across lexical categories. All resources (dataset, preprocessing scripts, splits, and training configurations) will be released publicly to ensure reproducibility and facilitate future work on Tajik–Persian transliteration, cross-script NLP, and lexicographic applications.
Search
Co-authors
    Venues
    Fix author