ParsText: A Digraphic Corpus for Tajik-Farsi Transliteration

Rayyan Merchant, Kevin Tang


Abstract
Despite speaking dialects of the same language, Persian speakers from Tajikistan cannot read Persian texts from Iran and Afghanistan, because Tajik Persian is written in the Tajik-Cyrillic script while Iranian and Afghan Persian are written in the Perso-Arabic script. As the formal registers of these dialects all maintain high levels of mutual intelligibility, machine transliteration has been proposed as a more practical and appropriate solution than machine translation. Unfortunately, Persian texts written in both scripts are much more common in print in Tajikistan than online. This paper introduces a novel corpus meant to remedy that gap: ParsText. ParsText contains 2,813 Persian sentences written in both Tajik-Cyrillic and Perso-Arabic, manually collected from blog pages and news articles online. This paper presents the need for such a corpus, previous and related work, data collection and alignment procedures, and corpus statistics, and discusses directions for future work.
Anthology ID:
2024.cawl-1.1
Volume:
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Kyle Gorman, Emily Prud'hommeaux, Brian Roark, Richard Sproat
Venues:
CAWL | WS
SIG:
SIGWrit
Publisher:
ELRA and ICCL
Pages:
1–7
URL:
https://aclanthology.org/2024.cawl-1.1
Cite (ACL):
Rayyan Merchant and Kevin Tang. 2024. ParsText: A Digraphic Corpus for Tajik-Farsi Transliteration. In Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024, pages 1–7, Torino, Italia. ELRA and ICCL.
Cite (Informal):
ParsText: A Digraphic Corpus for Tajik-Farsi Transliteration (Merchant & Tang, CAWL-WS 2024)
PDF:
https://aclanthology.org/2024.cawl-1.1.pdf