Speech Data from Radio Broadcasts for Low Resource Languages

Bismarck Bamfo Odoom, Leibny Paola Garcia Perera, Prangthip Hansanti, Loic Barrault, Christophe Ropers, Matthew Wiesner, Kenton Murray, Alexandre Mourachko, Philipp Koehn


Abstract
We created a collection of speech data for 48 low-resource languages. The corpus is extracted from radio broadcasts and processed with novel speech detection and language identification models based on a manually vetted subset of the audio for 10 languages. The data is made publicly available.
Anthology ID:
2024.iwslt-1.18
Volume:
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand (in-person and online)
Editors:
Elizabeth Salesky, Marcello Federico, Marine Carpuat
Venue:
IWSLT
Publisher:
Association for Computational Linguistics
Pages:
134–139
URL:
https://aclanthology.org/2024.iwslt-1.18
Cite (ACL):
Bismarck Bamfo Odoom, Leibny Paola Garcia Perera, Prangthip Hansanti, Loic Barrault, Christophe Ropers, Matthew Wiesner, Kenton Murray, Alexandre Mourachko, and Philipp Koehn. 2024. Speech Data from Radio Broadcasts for Low Resource Languages. In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), pages 134–139, Bangkok, Thailand (in-person and online). Association for Computational Linguistics.
Cite (Informal):
Speech Data from Radio Broadcasts for Low Resource Languages (Bamfo Odoom et al., IWSLT 2024)
PDF:
https://aclanthology.org/2024.iwslt-1.18.pdf