The Development of Pre-processing Tools and Pre-trained Embedding Models for Amharic

Tadesse Destaw, Abinew Ayele, Seid Muhie Yimam


Abstract
Amharic is the second most spoken Semitic language after Arabic and serves as the official working language of Ethiopia. While Amharic NLP research has recently been attracting wider attention, the main bottleneck is that resources and related tools are not publicly released, which keeps it a low-resource language. As a result, we observe that different researchers repeatedly redo the same NLP work. In this paper, we survey existing approaches in Amharic NLP and take the first step toward publicly releasing tools, datasets, and models to advance Amharic NLP research. We build Python-based preprocessing tools for Amharic (a tokenizer, a sentence segmenter, and a text cleaner) that can easily be used and integrated into the development of NLP applications. Furthermore, we compile the first moderately large-scale Amharic text corpus (6.8 million sentences) along with word2vec, fastText, RoBERTa, and FLAIR embedding models. Finally, we compile benchmark datasets and build classification models for the named entity recognition task.
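To illustrate the kind of preprocessing the abstract describes, the sketch below shows a minimal Amharic sentence segmenter and tokenizer built on the Ethiopic punctuation block (፡ word space U+1361, ። full stop U+1362, ፣ comma U+1363, ፧ question mark U+1367). This is an illustrative sketch only; the rules, function names, and punctuation coverage are assumptions and do not reflect the paper's actual released implementation.

```python
import re

# Sentence-final marks: Ethiopic full stop, Ethiopic question mark,
# plus Latin '!' and '?'. (Assumed rule set, not the paper's.)
SENT_END = "።፧!?"
# Token-internal punctuation to strip: Ethiopic word space, comma,
# semicolon, colon, preface colon, plus common Latin marks.
INNER_PUNCT = "፡፣፤፥፦,;:"

def segment_sentences(text: str) -> list[str]:
    """Split raw text into sentences on sentence-final punctuation."""
    parts = re.split(f"[{SENT_END}]", text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word tokens on whitespace and the
    Ethiopic word space, discarding other punctuation."""
    cleaned = re.sub(f"[{INNER_PUNCT}]", " ", sentence)
    return cleaned.split()
```

A realistic tokenizer would also need to handle abbreviations, numerals, and mixed Amharic/Latin text, which this sketch deliberately omits.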
Anthology ID:
2021.winlp-1.5
Volume:
Proceedings of the Fifth Workshop on Widening Natural Language Processing
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Erika Varis, Ryan Georgi, Alicia Tsai, Antonios Anastasopoulos, Kyathi Chandu, Xanda Schofield, Surangika Ranathunga, Haley Lepp, Tirthankar Ghosal
Venue:
WiNLP
Publisher:
Association for Computational Linguistics
Pages:
25–28
URL:
https://aclanthology.org/2021.winlp-1.5
Cite (ACL):
Tadesse Destaw, Abinew Ayele, and Seid Muhie Yimam. 2021. The Development of Pre-processing Tools and Pre-trained Embedding Models for Amharic. In Proceedings of the Fifth Workshop on Widening Natural Language Processing, pages 25–28, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
The Development of Pre-processing Tools and Pre-trained Embedding Models for Amharic (Destaw et al., WiNLP 2021)