Design of a Tigrinya Language Speech Corpus for Speech Recognition

Hafte Abera, Sebsibe H/Mariam


Abstract
In this paper, we describe the first Tigrinya-language speech corpus designed and developed for speech recognition purposes. Tigrinya, often written as Tigrigna (ትግርኛ) /tɪˈɡrinjə/, belongs to the Semitic branch of the Afro-Asiatic language family and shows the characteristic features of a Semitic language. It is spoken by the ethnic Tigray-Tigrigna people in the Horn of Africa. The paper outlines the corpus design process and analyzes related work on speech corpus creation for other languages. The authors also describe the procedures used to create a speech recognition corpus for Tigrinya, an under-resourced language. One hundred and thirty native speakers of Tigrinya were recorded for the training and test sets. Each speaker read 100 texts consisting of syllabically rich and balanced sentences. Ten thousand sets of sentences were used to prepare the prompt sheets. These sentences contained all of the contextual syllables and phones.
Anthology ID:
W18-3811
Volume:
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Editors:
Peter Machonis, Anabela Barreiro, Kristina Kocijan, Max Silberztein
Venue:
LR4NLP
Publisher:
Association for Computational Linguistics
Pages:
78–82
URL:
https://aclanthology.org/W18-3811
Cite (ACL):
Hafte Abera and Sebsibe H/Mariam. 2018. Design of a Tigrinya Language Speech Corpus for Speech Recognition. In Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 78–82, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
Design of a Tigrinya Language Speech Corpus for Speech Recognition (Abera & H/Mariam, LR4NLP 2018)
PDF:
https://aclanthology.org/W18-3811.pdf