2020
Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo, and Wolaytta
Solomon Teferra Abate
|
Martha Yifiru Tachbelie
|
Michael Melese
|
Hafte Abera
|
Tewodros Gebreselassie
|
Wondwossen Mulugeta
|
Yaregal Assabie
|
Million Meshesha Beyene
|
Solomon Atinafu
|
Binyam Ephrem Seyoum
Proceedings of the Fourth Widening Natural Language Processing Workshop
Automatic Speech Recognition (ASR) is one of the most important technologies to help people live a better life in the 21st century. However, its development requires a big speech corpus for a language. The development of such a corpus is expensive, especially for under-resourced Ethiopian languages. To address this problem, we have developed four medium-sized (longer than 22 hours each) speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo, and Wolaytta. As a way of checking the usability of the corpora and delivering a baseline ASR for each language, we built baseline ASR systems. In this paper, we present the corpora and the baseline ASR systems for each language. The word error rates (WERs) we achieved show that the corpora are usable for further investigation, and we recommend the collection of text corpora to train strong language models, especially for Oromo and Wolaytta.
Analysis of GlobalPhone and Ethiopian Languages Speech Corpora for Multilingual ASR
Martha Yifiru Tachbelie
|
Solomon Teferra Abate
|
Tanja Schultz
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we present an analysis of GlobalPhone (GP) and speech corpora of Ethiopian languages (Amharic, Tigrigna, Oromo and Wolaytta). The aim of the analysis is to select speech data from GP for the development of a multilingual Automatic Speech Recognition (ASR) system for the Ethiopian languages. To this end, phonetic overlaps among GP and Ethiopian languages have been analyzed. The result of our analysis shows that there is much phonetic overlap among Ethiopian languages, although they are from three different language families. From GP, Turkish, Uyghur and Croatian are found to have much overlap with the Ethiopian languages. On the other hand, Korean has less phonetic overlap with the rest of the languages. Moreover, the morphological complexity of the GP and Ethiopian languages, reflected by type-to-token ratio (TTR) and out-of-vocabulary (OOV) rate, has been analyzed. Both metrics indicated the morphological complexity of the languages. Korean and Amharic have been identified as extremely morphologically complex compared to the other languages. Tigrigna, Russian, Turkish, Polish, etc. are also among the morphologically complex languages.
Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo and Wolaytta
Solomon Teferra Abate
|
Martha Yifiru Tachbelie
|
Michael Melese
|
Hafte Abera
|
Tewodros Abebe
|
Wondwossen Mulugeta
|
Yaregal Assabie
|
Million Meshesha
|
Solomon Afnafu
|
Binyam Ephrem Seyoum
Proceedings of the Twelfth Language Resources and Evaluation Conference
Automatic Speech Recognition (ASR) is one of the most important technologies to support spoken communication in modern life. However, its development benefits from a large speech corpus. The development of such a corpus is expensive, and most human languages, including the Ethiopian languages, do not have such resources. To address this problem, we have developed four large (about 22 hours) speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo and Wolaytta. To assess the usability of the corpora for speech processing, we have developed ASR systems for each language. In this paper, we present the corpora and the baseline ASR systems we have developed. We have achieved word error rates (WERs) of 37.65%, 31.03%, 38.02% and 33.89% for Amharic, Tigrigna, Oromo and Wolaytta, respectively. These results show that the corpora are suitable for further investigation towards the development of ASR systems. Thus, the research community can use the corpora to further improve speech processing systems. From our results, it is clear that the collection of text corpora to train strong language models for all of the languages is still required, especially for Oromo and Wolaytta.
DNN-Based Multilingual Automatic Speech Recognition for Wolaytta using Oromo Speech
Martha Yifiru Tachbelie
|
Solomon Teferra Abate
|
Tanja Schultz
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
It is known that Automatic Speech Recognition (ASR) is very useful for human-computer interaction in all human languages. However, due to its requirement for a big speech corpus, which is very expensive, it has not been developed for most languages. Multilingual ASR (MLASR) has been suggested as a way to share existing speech corpora among related languages in order to develop ASR for languages which do not have the required speech corpora. Literature shows that phonetic relatedness goes across language families. We have, therefore, conducted experiments on MLASR taking two language families: one as source (Oromo from Cushitic) and the other as target (Wolaytta from Omotic). Using an Oromo Deep Neural Network (DNN) based acoustic model together with a Wolaytta pronunciation dictionary and language model, we have achieved a Word Error Rate (WER) of 48.34% for Wolaytta. Moreover, our experiments show that adding only 30 minutes of speech data from the target language (Wolaytta) to the whole training data (22.8 hours) of the source language (Oromo) results in a relative WER reduction of 32.77%. Our results show the possibility of developing an ASR system for a language, if we have a pronunciation dictionary and a language model, using an existing speech corpus of another language irrespective of their language family.
2019
English-Ethiopian Languages Statistical Machine Translation
Solomon Teferra Abate
|
Michael Melese
|
Martha Yifiru Tachbelie
|
Million Meshesha
|
Solomon Atinafu
|
Wondwossen Mulugeta
|
Yaregal Assabie
|
Hafte Abera
|
Biniyam Ephrem
|
Tewodros Gebreselassie
|
Wondimagegnhue Tsegaye Tufa
|
Amanuel Lemma
|
Tsegaye Andargie
|
Seifedin Shifaw
Proceedings of the 2019 Workshop on Widening NLP
In this paper, we describe an attempt towards the development of parallel corpora for English and Ethiopian languages, such as Amharic, Tigrigna, Afan-Oromo, Wolaytta and Ge’ez. The corpora are used for conducting bi-directional SMT experiments. The BLEU scores of the bi-directional SMT systems show promising results. The morphological richness of the Ethiopian languages has a great impact on the performance of SMT, especially when the targets are Ethiopian languages.
2018
Parallel Corpora for bi-lingual English-Ethiopian Languages Statistical Machine Translation
Solomon Teferra Abate
|
Michael Melese
|
Martha Yifiru Tachbelie
|
Million Meshesha
|
Solomon Atinafu
|
Wondwossen Mulugeta
|
Yaregal Assabie
|
Hafte Abera
|
Binyam Ephrem
|
Tewodros Abebe
|
Wondimagegnhue Tsegaye
|
Amanuel Lemma
|
Tsegaye Andargie
|
Seifedin Shifaw
Proceedings of the 27th International Conference on Computational Linguistics
In this paper, we describe an attempt towards the development of parallel corpora for English and Ethiopian languages, such as Amharic, Tigrigna, Afan-Oromo, Wolaytta and Ge’ez. The corpora are used for conducting bi-directional statistical machine translation experiments. The BLEU scores of the bi-directional Statistical Machine Translation (SMT) systems show promising results. The morphological richness of the Ethiopian languages has a great impact on the performance of SMT, especially when the targets are Ethiopian languages. We are now working towards an optimal alignment for bi-directional English-Ethiopian languages SMT.
Parallel Corpora for bi-Directional Statistical Machine Translation for Seven Ethiopian Language Pairs
Solomon Teferra Abate
|
Michael Melese
|
Martha Yifiru Tachbelie
|
Million Meshesha
|
Solomon Atinafu
|
Wondwossen Mulugeta
|
Yaregal Assabie
|
Hafte Abera
|
Binyam Ephrem
|
Tewodros Abebe
|
Wondimagegnhue Tsegaye
|
Amanuel Lemma
|
Tsegaye Andargie
|
Seifedin Shifaw
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing
In this paper, we describe the development of parallel corpora for Ethiopian languages: Amharic, Tigrigna, Afan-Oromo, Wolaytta and Geez. To check the usability of all the corpora, we conducted baseline bi-directional statistical machine translation (SMT) experiments for seven language pairs. The performance of the bi-directional SMT systems shows that all the corpora can be used for further investigations. We have also shown that the morphological complexity of the Ethio-Semitic languages has a negative impact on the performance of SMT, especially when they are target languages. Based on the results we obtained, we are currently working towards handling the morphological complexities to improve the performance of statistical machine translation among the Ethiopian languages.
2016
Combining syntactic patterns and Wikipedia’s hierarchy of hyperlinks to extract meronym relations
Debela Tesfaye Gemechu
|
Michael Zock
|
Solomon Teferra
Proceedings of the NAACL Student Research Workshop
2012
Analyse des performances de modèles de langage sub-lexicale pour des langues peu-dotées à morphologie riche (Performance analysis of sub-word language modeling for under-resourced languages with rich morphology: case study on Swahili and Amharic) [in French]
Hadrien Gelas
|
Solomon Teferra Abate
|
Laurent Besacier
|
François Pellegrino
JEP-TALN-RECITAL 2012, Workshop TALAf 2012: Traitement Automatique des Langues Africaines (TALAf 2012: African Language Processing)
2010
Boosting N-gram Coverage for Unsegmented Languages Using Multiple Text Segmentation Approach
Solomon Teferra Abate
|
Laurent Besacier
|
Sopheap Seng
Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing
2007
Syllable-Based Speech Recognition for Amharic
Solomon Teferra Abate
|
Wolfgang Menzel
Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources