The Makerere Radio Speech Corpus: A Luganda Radio Corpus for Automatic Speech Recognition

Jonathan Mukiibi; Andrew Katumba; Joyce Nakatumba-Nabende; Ali Hussein; Joshua Meyer

Abstract

Building a usable automatic speech recognition (ASR) system for radio monitoring is challenging for under-resourced languages, yet it is paramount in societies where radio is the main medium of public communication and discussion. Initial efforts by the United Nations in Uganda have shown how important it is for national planning to understand the perceptions of rural people who are excluded from social media. However, these efforts are hampered by the absence of transcribed speech datasets. In this paper, the Makerere Artificial Intelligence research lab releases a Luganda radio speech corpus of 155 hours. To our knowledge, this is the first publicly available radio dataset in sub-Saharan Africa. The paper describes the development of the voice corpus and presents baseline Luganda ASR results obtained with the Coqui STT toolkit, an open-source speech recognition toolkit.

Anthology ID: 2022.lrec-1.208
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 1945–1954
URL: https://aclanthology.org/2022.lrec-1.208/
Cite (ACL): Jonathan Mukiibi, Andrew Katumba, Joyce Nakatumba-Nabende, Ali Hussein, and Joshua Meyer. 2022. The Makerere Radio Speech Corpus: A Luganda Radio Corpus for Automatic Speech Recognition. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1945–1954, Marseille, France. European Language Resources Association.
Cite (Informal): The Makerere Radio Speech Corpus: A Luganda Radio Corpus for Automatic Speech Recognition (Mukiibi et al., LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.208.pdf
