%0 Conference Proceedings
%T The Development of Pre-processing Tools and Pre-trained Embedding Models for Amharic
%A Destaw, Tadesse
%A Ayele, Abinew
%A Yimam, Seid Muhie
%Y Varis, Erika
%Y Georgi, Ryan
%Y Tsai, Alicia
%Y Anastasopoulos, Antonios
%Y Chandu, Kyathi
%Y Schofield, Xanda
%Y Ranathunga, Surangika
%Y Lepp, Haley
%Y Ghosal, Tirthankar
%S Proceedings of the Fifth Workshop on Widening Natural Language Processing
%D 2021
%8 November
%I Association for Computational Linguistics
%C Punta Cana, Dominican Republic
%F destaw-etal-2021-development
%X Amharic is the second most spoken Semitic language after Arabic and serves as the official working language of Ethiopia. Although Amharic NLP research has recently attracted wider attention, the main bottleneck is that resources and related tools are not publicly released, so Amharic remains a low-resource language. As a result, different researchers repeat the same NLP research again and again. In this work, we investigate the existing approaches in Amharic NLP and take the first step toward publicly releasing tools, datasets, and models to advance Amharic NLP research. We build Python-based preprocessing tools for Amharic (tokenizer, sentence segmenter, and text cleaner) that can easily be used and integrated into NLP applications. Furthermore, we compile the first moderately large-scale Amharic text corpus (6.8 million sentences) along with word2vec, fastText, RoBERTa, and FLAIR embedding models. Finally, we compile benchmark datasets and build classification models for the named entity recognition task.
%U https://aclanthology.org/2021.winlp-1.5
%P 25-28