Analysis of GlobalPhone and Ethiopian Languages Speech Corpora for Multilingual ASR

Martha Yifiru Tachbelie, Solomon Teferra Abate, Tanja Schultz


Abstract
In this paper, we present an analysis of GlobalPhone (GP) and of speech corpora of Ethiopian languages (Amharic, Tigrigna, Oromo and Wolaytta). The aim of the analysis is to select speech data from GP for the development of a multilingual Automatic Speech Recognition (ASR) system for the Ethiopian languages. To this end, phonetic overlaps among the GP and Ethiopian languages have been analyzed. The results of our analysis show that there is much phonetic overlap among the Ethiopian languages, although they are from three different language families. From GP, Turkish, Uyghur and Croatian are found to have much overlap with the Ethiopian languages. On the other hand, Korean has less phonetic overlap with the rest of the languages. Moreover, the morphological complexity of the GP and Ethiopian languages, as reflected by type-to-token ratio (TTR) and out-of-vocabulary (OOV) rate, has been analyzed. Both metrics reflect the morphological complexity of the languages. Korean and Amharic have been identified as extremely morphologically complex compared to the other languages. Tigrigna, Russian, Turkish, Polish, etc. are also among the morphologically complex languages.
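As an illustration of the two metrics named in the abstract, the following is a minimal sketch of how TTR and OOV rate are commonly computed. It is not the authors' implementation: the abstract does not state the tokenization, the data splits, or whether OOV is counted over tokens or types, so the whitespace tokenization, the token-level OOV definition, the function names, and the toy English sentences are all assumptions made for this example.

```python
def type_token_ratio(tokens):
    """Distinct word forms (types) divided by running words (tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def oov_rate(train_tokens, test_tokens):
    """Share of test tokens whose word form never occurs in the training text.

    Assumption: OOV is counted per token, with the vocabulary taken to be
    every word form seen in the training text.
    """
    if not test_tokens:
        return 0.0
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy data, whitespace-tokenized (hypothetical, not taken from any corpus).
train = "the cat sat on the mat".split()
test = "the dog sat on the rug".split()

ttr = type_token_ratio(train)  # 5 types over 6 tokens
oov = oov_rate(train, test)    # "dog" and "rug" are unseen: 2 of 6 tokens
```

Under these definitions, a language with rich inflection tends to produce more distinct word forms for the same amount of text, which raises both values. TTR also falls as a corpus grows, so it is usually compared across corpora of similar size.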
Anthology ID:
2020.lrec-1.511
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4152–4156
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.511
Cite (ACL):
Martha Yifiru Tachbelie, Solomon Teferra Abate, and Tanja Schultz. 2020. Analysis of GlobalPhone and Ethiopian Languages Speech Corpora for Multilingual ASR. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4152–4156, Marseille, France. European Language Resources Association.
Cite (Informal):
Analysis of GlobalPhone and Ethiopian Languages Speech Corpora for Multilingual ASR (Tachbelie et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.511.pdf