UrHiOdSynth: A Multilingual Synthetic Corpus for Speech-to-Speech Translation in Low-Resource Indic Languages

Jamaluddin; Subhankar Panda; Aditya Narendra; Kamanksha Prasad Dubey; Mohammad Nadeem

Abstract

Speech-to-Speech Translation (S2ST) focuses on generating spoken output in a target language directly from spoken input in a source language. Despite progress in S2ST modeling, low-resource Indic languages remain poorly supported, primarily because large-scale parallel speech corpora are unavailable. We present UrHiOdSynth, a three-language parallel S2ST dataset containing approximately 75 hours of speech across Urdu, Hindi, and Odia. The corpus consists of 10,735 aligned sentence triplets, with an average utterance length of 8.45 seconds. To our knowledge, UrHiOdSynth represents the largest multi-domain resource offering aligned speech and text for S2ST in this language context. Beyond speech-to-speech translation, the dataset supports tasks such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, and machine translation. This flexibility enables the training of unified multilingual models, particularly for low-resource Indic languages.
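The corpus figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below assumes that the 8.45-second average applies to every utterance in all three languages, which the abstract does not state explicitly. The variable names are illustrative and are not taken from the dataset release.

```python
from itertools import permutations

# Figures quoted in the abstract.
NUM_TRIPLETS = 10_735          # aligned sentence triplets
LANGUAGES = ("Urdu", "Hindi", "Odia")
AVG_UTTERANCE_SECONDS = 8.45   # assumed to hold across all three languages

# Each triplet contributes one utterance per language.
total_utterances = NUM_TRIPLETS * len(LANGUAGES)
total_hours = total_utterances * AVG_UTTERANCE_SECONDS / 3600

# Every ordered (source, target) pair is a possible translation direction.
directions = list(permutations(LANGUAGES, 2))

print(f"utterances: {total_utterances}")
print(f"estimated hours: {total_hours:.1f}")
print(f"translation directions: {len(directions)}")
```

Under that assumption the estimate comes out between 75 and 76 hours, in line with the "approximately 75 hours" stated in the abstract, and three languages give six directed translation pairs.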

Anthology ID: 2026.loreslm-1.50
Volume: Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Hansi Hettiarachchi, Tharindu Ranasinghe, Alistair Plum, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
Venue: LoResLM
Publisher: Association for Computational Linguistics
Pages: 584–594
URL: https://aclanthology.org/2026.loreslm-1.50/
Cite (ACL): Jamaluddin, Subhankar Panda, Aditya Narendra, Kamanksha Prasad Dubey, and Mohammad Nadeem. 2026. UrHiOdSynth: A Multilingual Synthetic Corpus for Speech-to-Speech Translation in Low-Resource Indic Languages. In Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026), pages 584–594, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): UrHiOdSynth: A Multilingual Synthetic Corpus for Speech-to-Speech Translation in Low-Resource Indic Languages (Jamaluddin et al., LoResLM 2026)
PDF: https://aclanthology.org/2026.loreslm-1.50.pdf
