Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages

Jivnesh Sandhan, Ayush Daksh, Om Adideva Paranjay, Laxmidhar Behera, Pawan Goyal


Abstract
Interest in code-mixing has become ubiquitous in Natural Language Processing (NLP); however, little attention has been given to this phenomenon in the Speech Translation (ST) task. This can be attributed to the lack of labelled data for code-mixed ST. We therefore introduce Prabhupadavani, a multilingual code-mixed ST dataset for 25 languages. It is multi-domain, covers ten language families, and contains 94 hours of speech by 130+ speakers, manually aligned with the corresponding text in the target language. Prabhupadavani is about Vedic culture and heritage from Indic literature, where code-switching when quoting from literature is important in the context of humanities teaching. To the best of our knowledge, Prabhupadavani is the first multilingual code-mixed ST dataset available in the ST literature. The data can also be used for a code-mixed machine translation task. The full dataset can be accessed at: https://github.com/frozentoad9/CMST.
Anthology ID:
2022.latechclfl-1.4
Volume:
Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Stefania Degaetano, Anna Kazantseva, Nils Reiter, Stan Szpakowicz
Venue:
LaTeCHCLfL
SIG:
SIGHUM
Publisher:
International Conference on Computational Linguistics
Pages:
24–29
URL:
https://aclanthology.org/2022.latechclfl-1.4
Cite (ACL):
Jivnesh Sandhan, Ayush Daksh, Om Adideva Paranjay, Laxmidhar Behera, and Pawan Goyal. 2022. Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages. In Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 24–29, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
Cite (Informal):
Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages (Sandhan et al., LaTeCHCLfL 2022)
PDF:
https://aclanthology.org/2022.latechclfl-1.4.pdf
Code:
frozentoad9/CMST