2020
Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo, and Wolaytta
Solomon Teferra Abate | Martha Yifiru Tachbelie | Michael Melese | Hafte Abera | Tewodros Gebreselassie | Wondwossen Mulugeta | Yaregal Assabie | Million Meshesha Beyene | Solomon Atinafu | Binyam Ephrem Seyoum
Proceedings of the Fourth Widening Natural Language Processing Workshop
Automatic Speech Recognition (ASR) is one of the most important technologies for helping people live a better life in the 21st century. However, its development requires a large speech corpus for each language, and developing such a corpus is expensive, especially for under-resourced Ethiopian languages. To address this problem, we have developed four medium-sized speech corpora (longer than 22 hours each) for four Ethiopian languages: Amharic, Tigrigna, Oromo, and Wolaytta. To check the usability of the corpora and deliver a baseline for each language, we built baseline ASR systems, which we present in this paper together with the corpora. The word error rates (WERs) we achieved show that the corpora are usable for further investigation, and we recommend collecting text corpora to train strong language models, for Oromo and Wolaytta in particular.
2019
English-Ethiopian Languages Statistical Machine Translation
Solomon Teferra Abate | Michael Melese | Martha Yifiru Tachbelie | Million Meshesha | Solomon Atinafu | Wondwossen Mulugeta | Yaregal Assabie | Hafte Abera | Biniyam Ephrem | Tewodros Gebreselassie | Wondimagegnhue Tsegaye Tufa | Amanuel Lemma | Tsegaye Andargie | Seifedin Shifaw
Proceedings of the 2019 Workshop on Widening NLP
In this paper, we describe an attempt towards the development of parallel corpora for English and Ethiopian languages such as Amharic, Tigrigna, Afan-Oromo, Wolaytta and Ge’ez. The corpora are used for conducting bi-directional statistical machine translation (SMT) experiments. The BLEU scores of the bi-directional SMT systems show promising results. The morphological richness of the Ethiopian languages has a great impact on the performance of SMT, especially when the targets are Ethiopian languages.
2018
Parallel Corpora for bi-lingual English-Ethiopian Languages Statistical Machine Translation
Solomon Teferra Abate | Michael Melese | Martha Yifiru Tachbelie | Million Meshesha | Solomon Atinafu | Wondwossen Mulugeta | Yaregal Assabie | Hafte Abera | Binyam Ephrem | Tewodros Abebe | Wondimagegnhue Tsegaye | Amanuel Lemma | Tsegaye Andargie | Seifedin Shifaw
Proceedings of the 27th International Conference on Computational Linguistics
In this paper, we describe an attempt towards the development of parallel corpora for English and Ethiopian languages such as Amharic, Tigrigna, Afan-Oromo, Wolaytta and Ge’ez. The corpora are used for conducting bi-directional Statistical Machine Translation (SMT) experiments. The BLEU scores of the bi-directional SMT systems show promising results. The morphological richness of the Ethiopian languages has a great impact on the performance of SMT, especially when the targets are Ethiopian languages. We are now working towards an optimal alignment for bi-directional English-Ethiopian language SMT.
Parallel Corpora for bi-Directional Statistical Machine Translation for Seven Ethiopian Language Pairs
Solomon Teferra Abate | Michael Melese | Martha Yifiru Tachbelie | Million Meshesha | Solomon Atinafu | Wondwossen Mulugeta | Yaregal Assabie | Hafte Abera | Binyam Ephrem | Tewodros Abebe | Wondimagegnhue Tsegaye | Amanuel Lemma | Tsegaye Andargie | Seifedin Shifaw
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing
In this paper, we describe the development of parallel corpora for Ethiopian languages: Amharic, Tigrigna, Afan-Oromo, Wolaytta and Geez. To check the usability of all the corpora, we conducted baseline bi-directional statistical machine translation (SMT) experiments for seven language pairs. The performance of the bi-directional SMT systems shows that all the corpora can be used for further investigations. We have also shown that the morphological complexity of the Ethio-Semitic languages has a negative impact on SMT performance, especially when they are the target languages. Based on the results we obtained, we are currently working on handling this morphological complexity to improve the performance of SMT among the Ethiopian languages.