Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo, and Wolaytta

Solomon Teferra Abate, Martha Yifiru Tachbelie, Michael Melese, Hafte Abera, Tewodros Gebreselassie, Wondwossen Mulugeta, Yaregal Assabie, Million Meshesha Beyene, Solomon Atinafu, Binyam Ephrem Seyoum


Abstract
Automatic Speech Recognition (ASR) is one of the most important technologies for helping people live a better life in the 21st century. However, developing it requires a large speech corpus for each language, and building such a corpus is expensive, especially for under-resourced Ethiopian languages. To address this problem, we have developed four medium-sized (longer than 22 hours each) speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo, and Wolaytta. In this paper, we present the corpora together with a baseline ASR system for each language, which serves as a check on the usability of the corpora. The word error rates (WERs) we achieved show that the corpora are usable for further investigation. We recommend collecting text corpora to train strong language models, for Oromo and Wolaytta more than for the other two languages.
Anthology ID:
2020.winlp-1.5
Volume:
Proceedings of the Fourth Widening Natural Language Processing Workshop
Month:
July
Year:
2020
Address:
Seattle, USA
Editors:
Rossana Cunha, Samira Shaikh, Erika Varis, Ryan Georgi, Alicia Tsai, Antonios Anastasopoulos, Khyathi Raghavi Chandu
Venue:
WiNLP
Publisher:
Association for Computational Linguistics
Pages:
13–17
URL:
https://aclanthology.org/2020.winlp-1.5
DOI:
10.18653/v1/2020.winlp-1.5
Cite (ACL):
Solomon Teferra Abate, Martha Yifiru Tachbelie, Michael Melese, Hafte Abera, Tewodros Gebreselassie, Wondwossen Mulugeta, Yaregal Assabie, Million Meshesha Beyene, Solomon Atinafu, and Binyam Ephrem Seyoum. 2020. Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo, and Wolaytta. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 13–17, Seattle, USA. Association for Computational Linguistics.
Cite (Informal):
Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo, and Wolaytta (Abate et al., WiNLP 2020)
Video:
http://slideslive.com/38929541