2024
The Zeno’s Paradox of ‘Low-Resource’ Languages
Hellina Hailu Nigatu
|
Atnafu Lambebo Tonja
|
Benjamin Rosman
|
Thamar Solorio
|
Monojit Choudhury
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The disparity in the languages commonly studied in Natural Language Processing (NLP) is typically reflected by referring to languages as low vs high-resourced. However, there is limited consensus on what exactly qualifies as a ‘low-resource language.’ To understand how NLP papers define and study ‘low resource’ languages, we qualitatively analyzed 150 papers from the ACL Anthology and popular speech-processing conferences that mention the keyword ‘low-resource.’ Based on our analysis, we show how several interacting axes contribute to ‘low-resourcedness’ of a language and why that makes it difficult to track progress for each individual language. We hope our work (1) elicits explicit definitions of the terminology when it is used in papers and (2) provides grounding for the different axes to consider when connoting a language as low-resource.
Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets
Israel Abebe Azime
|
Atnafu Lambebo Tonja
|
Tadesse Destaw Belay
|
Mitiku Yohannes Fuge
|
Aman Kassahun Wassie
|
Eyasu Shiferaw Jada
|
Yonas Chanie
|
Walelign Tewabe Sewunetie
|
Seid Muhie Yimam
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) have received a lot of attention in natural language processing (NLP) research because of their exceptional performance in understanding and generating human languages. However, low-resource languages are left behind due to the unavailability of resources. In this work, we focus on enhancing the LLaMA-2-Amharic model by integrating task-specific and generative datasets to improve language model performance for Amharic. We compile an Amharic instruction fine-tuning dataset and fine-tune the LLaMA-2-Amharic model. The fine-tuned model shows promising results in different NLP tasks. We also explore the effectiveness of translated instruction datasets compared to the dataset we created. Our dataset creation pipeline, along with instruction datasets, trained models, and evaluation outputs, is made publicly available to encourage research in language-specific models.
EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation
Atnafu Lambebo Tonja
|
Israel Abebe Azime
|
Tadesse Destaw Belay
|
Mesay Gemeda Yigezu
|
Moges Ahmed Ah Mehamed
|
Abinew Ali Ayele
|
Ebrahim Chekol Jibril
|
Michael Melese Woldeyohannis
|
Olga Kolesnikova
|
Philipp Slusallek
|
Dietrich Klakow
|
Seid Muhie Yimam
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream Natural Language Processing (NLP) tasks. However, low-resource languages are still lagging behind current state-of-the-art (SOTA) developments in the field of NLP due to insufficient resources to train LLMs. Ethiopian languages exhibit remarkable linguistic diversity, encompassing a wide array of scripts, and are imbued with profound religious and cultural significance. This paper introduces EthioLLM – multilingual large language models for five Ethiopian languages (Amharic, Ge’ez, Afan Oromo, Somali, and Tigrinya) and English, and Ethiobenchmark – a new benchmark dataset for various downstream NLP tasks. We evaluate the performance of these models across five downstream NLP tasks. We open-source our multilingual language models, new benchmark datasets for various downstream tasks, and task-specific fine-tuned language models and discuss the performance of the models. Our dataset and models are available at the https://huggingface.co/EthioNLP repository.
EthioMT: Parallel Corpus for Low-resource Ethiopian Languages
Atnafu Lambebo Tonja
|
Olga Kolesnikova
|
Alexander Gelbukh
|
Jugal Kalita
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024
Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available at all. NLP in Ethiopian languages suffers from the same issues due to the unavailability of publicly accessible datasets for NLP tasks, including MT. To help the research community and foster research for Ethiopian languages, we introduce EthioMT – a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
Proceedings of the Eighth Widening NLP Workshop
Atnafu Lambebo Tonja
|
Alfredo Gomez
|
Chanjun Park
|
Hellina Hailu Nigatu
|
Santosh T.Y.S.S
|
Tanvi Anand
|
Wiem Ben Rim
Proceedings of the Eighth Widening NLP Workshop
2023
Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec
Atnafu Lambebo Tonja
|
Christian Maldonado-sifuentes
|
David Alejandro Mendoza Castillo
|
Olga Kolesnikova
|
Noé Castro-Sánchez
|
Grigori Sidorov
|
Alexander Gelbukh
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)
In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook m2m100-48 model outperformed the other approaches, with BLEU scores of 12.09 and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively, and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations, respectively. The results indicate that translation performance is influenced by the dataset size (9,799 sentences in Mazatec and 13,235 sentences in Mixtec) and is more effective when indigenous languages are used as target languages. The findings emphasize the importance of creating parallel corpora for indigenous languages and fine-tuning models for low-resource translation tasks. Future research will investigate zero-shot and few-shot learning approaches to further improve translation performance in low-resource settings.
Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models
Atnafu Lambebo Tonja
|
Hellina Hailu Nigatu
|
Olga Kolesnikova
|
Grigori Sidorov
|
Alexander Gelbukh
|
Jugal Kalita
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)
This paper describes CIC NLP’s submission to the AmericasNLP 2023 Shared Task on machine translation systems for indigenous languages of the Americas. We present the system descriptions for three methods. We used two multilingual models, namely M2M-100 and mBART50, and one bilingual (one-to-one) model, the Helsinki NLP Spanish-English translation model, and experimented with different transfer learning setups. We experimented with 11 languages from the Americas and report the setups we used as well as the results we achieved. Overall, the mBART setup was able to improve upon the baseline for three out of the eleven languages.
First Attempt at Building Parallel Corpora for Machine Translation of Northeast India’s Very Low-Resource Languages
Atnafu Lambebo Tonja
|
Melkamu Mersha
|
Ananya Kalita
|
Olga Kolesnikova
|
Jugal Kalita
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
This paper presents the creation of initial bilingual corpora for thirteen very low-resource languages of India, all from Northeast India. It also presents the results of initial translation efforts in these languages. It creates the first-ever parallel corpora for these languages and provides initial benchmark neural machine translation results for these languages. We intend to extend these corpora to include a large number of low-resource Indian languages and integrate the effort with our prior work with African and American-Indian languages to create corpora covering a large number of languages from across the world.
MasakhaNEWS: News Topic Classification for African languages
David Ifeoluwa Adelani
|
Marek Masiak
|
Israel Abebe Azime
|
Jesujoba Alabi
|
Atnafu Lambebo Tonja
|
Christine Mwase
|
Odunayo Ogundepo
|
Bonaventure F. P. Dossou
|
Akintunde Oladipo
|
Doreen Nixdorf
|
Chris Chinenye Emezue
|
Sana Al-azzawi
|
Blessing Sibanda
|
Davis David
|
Lolwethu Ndolela
|
Jonathan Mukiibi
|
Tunde Ajayi
|
Tatiana Moteu
|
Brian Odhiambo
|
Abraham Owodunni
|
Nnaemeka Obiefuna
|
Muhidin Mohamed
|
Shamsuddeen Hassan Muhammad
|
Teshome Mulugeta Ababu
|
Saheed Abdullahi Salahudeen
|
Mesay Gemeda Yigezu
|
Tajuddeen Gwadabe
|
Idris Abdulmumin
|
Mahlet Taye
|
Oluwabusayo Awoyomi
|
Iyanuoluwa Shode
|
Tolulope Adelani
|
Habiba Abdulganiyu
|
Abdul-Hakeem Omotayo
|
Adetola Adeeko
|
Abeeb Afolabi
|
Anuoluwapo Aremu
|
Olanrewaju Samuel
|
Clemencia Siro
|
Wangari Kimotho
|
Onyekachi Ogbu
|
Chinedu Mbonu
|
Chiamaka Chukwuneke
|
Samuel Fanijo
|
Jessica Ojo
|
Oyinkansola Awosan
|
Tadesse Kebede
|
Toadoum Sari Sakayo
|
Pamela Nyatsine
|
Freedmore Sidume
|
Oreen Yousuf
|
Mardiyyah Oduwole
|
Kanda Tshinu
|
Ussen Kimanuka
|
Thina Diko
|
Siyanda Nxakama
|
Sinodos Nigusse
|
Abdulmejid Johar
|
Shafie Mohamed
|
Fuad Mire Hassan
|
Moges Ahmed Mehamed
|
Evrard Ngabire
|
Jules Jules
|
Ivan Ssenkungu
|
Pontus Stenetorp
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Natural Language Processing in Ethiopian Languages: Current State, Challenges, and Opportunities
Atnafu Lambebo Tonja
|
Tadesse Destaw Belay
|
Israel Abebe Azime
|
Abinew Ali Ayele
|
Moges Ahmed Mehamed
|
Olga Kolesnikova
|
Seid Muhie Yimam
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)
This survey delves into the current state of natural language processing (NLP) for four Ethiopian languages: Amharic, Afaan Oromo, Tigrinya, and Wolaytta. Through this paper, we identify key challenges and opportunities for NLP research in Ethiopia. Furthermore, we provide a centralized repository on GitHub that contains publicly available resources for various NLP tasks in these languages. This repository can be updated periodically with contributions from other researchers. Our objective is to disseminate information to NLP researchers interested in Ethiopian languages and encourage future research in this domain.
Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages
Israel Abebe Azime
|
Sana Al-azzawi
|
Atnafu Lambebo Tonja
|
Iyanuoluwa Shode
|
Jesujoba Alabi
|
Ayodele Awokoya
|
Mardiyyah Oduwole
|
Tosin Adewumi
|
Samuel Fanijo
|
Awosan Oyinkansola
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
Detecting harmful content on social media platforms is crucial in preventing the negative effects these posts can have on social media users. This paper presents our methodology for tackling task 10 from SemEval23, which focuses on detecting and classifying online sexism in social media posts. We constructed our solution using an ensemble of transformer-based models (that have been fine-tuned; BERTweet, RoBERTa, and DeBERTa). To alleviate the various issues caused by the class imbalance in the dataset provided and improve the generalization of our model, our framework employs data augmentation and semi-supervised learning. Specifically, we use back-translation for data augmentation in two scenarios: augmenting the underrepresented class and augmenting all classes. In this study, we analyze the impact of these different strategies on the system’s overall performance and determine which technique is the most effective. Extensive experiments demonstrate the efficacy of our approach. For sub-task A, the system achieved an F1-score of 0.8613. The source code to reproduce the proposed solutions is available on GitHub.
AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR
Tobi Olatunji
|
Tejumade Afonja
|
Aditya Yadavalli
|
Chris Chinenye Emezue
|
Sahib Singh
|
Bonaventure F. P. Dossou
|
Joanne Osuchukwu
|
Salomey Osei
|
Atnafu Lambebo Tonja
|
Naome Etori
|
Clinton Mbataku
Transactions of the Association for Computational Linguistics, Volume 11
Africa has a very poor doctor-to-patient ratio. At very busy clinics, doctors could see 30+ patients per day (a heavy patient burden compared with developed countries), but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. However, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. Furthermore, the recent performance of general domain ASR is approaching human accuracy. However, several gaps exist. Several publications have highlighted racial bias with speech-to-text algorithms, and performance on minority accents lags significantly. To our knowledge, there is no publicly available research or benchmark on accented African clinical ASR, and speech data is non-existent for the majority of African accents. We release AfriSpeech, 200 hours of Pan-African English speech, 67,577 clips from 2,463 unique speakers across 120 indigenous accents from 13 countries for clinical and general domain ASR, a benchmark test set, with publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.
Proceedings of the Seventh Widening NLP Workshop (WiNLP 2023)
Bonaventure F. P. Dossou
|
Isidora Tourni
|
Hatem Haddad
|
Shaily Bhatt
|
Fatemehsadat Mireshghallah
|
Sunipa Dev
|
Tanvi Anand
|
Weijia Xu
|
Atnafu Lambebo Tonja
|
Alfredo Gomez
|
Chanjun Park
Proceedings of the Seventh Widening NLP Workshop (WiNLP 2023)
2022
Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts
Atnafu Lambebo Tonja
|
Mesay Gemeda Yigezu
|
Olga Kolesnikova
|
Moein Shahiki Tash
|
Grigori Sidorov
|
Alexander Gelbukh
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts
This paper describes our system for the CoLI-Kanglish 2022 shared task on word-level language identification in Kannada-English texts. The goal of this task is to identify the different languages used in the CoLI-Kanglish 2022 dataset. The dataset is divided into categories including Kannada, English, Mixed-Language, Location, Name, and Others. The code-mixed data was compiled by the CoLI-Kanglish 2022 organizers from posts on social media. We use two classification techniques, KNN and SVM, achieve an F1-score of 0.58, and place third out of nine competitors.
Word Level Language Identification in Code-mixed Kannada-English Texts using Deep Learning Approach
Mesay Gemeda Yigezu
|
Atnafu Lambebo Tonja
|
Olga Kolesnikova
|
Moein Shahiki Tash
|
Grigori Sidorov
|
Alexander Gelbukh
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts
The goal of code-mixed language identification (LID) is to determine which language is spoken or written in a given segment of speech, word, sentence, or document. Our task is to identify English, Kannada, and mixed language from the provided data. To train a model, we used the CoLI-Kenglish dataset, which contains English, Kannada, and mixed-language words. In our work, we conducted several experiments in order to obtain the best-performing model. We then implemented the best model using Bidirectional Long Short-Term Memory (Bi-LSTM), which outperformed the other trained models with an F1-score of 0.61.
CIC NLP at SMM4H 2022: a BERT-based approach for classification of social media forum posts
Atnafu Lambebo Tonja
|
Olumide Ebenezer Ojo
|
Mohammed Arif Khan
|
Abdul Gafar Manuel Meque
|
Olga Kolesnikova
|
Grigori Sidorov
|
Alexander Gelbukh
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task
This paper describes our submissions for the Social Media Mining for Health (SMM4H) 2022 shared tasks. We participated in two tasks: a) Task 4: Classification of Tweets self-reporting exact age and b) Task 9: Classification of Reddit posts self-reporting exact age. We evaluated two transformer-based models (BERT and RoBERTa) for both tasks. For Task 4, RoBERTa-Large achieved an F1 score of 0.846 on the test set, and for Task 9, BERT-Large achieved an F1 score of 0.865 on the test set.
AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
Bonaventure F. P. Dossou
|
Atnafu Lambebo Tonja
|
Oreen Yousuf
|
Salomey Osei
|
Abigail Oppong
|
Iyanuoluwa Shode
|
Oluwabusayo Olufunke Awoyomi
|
Chris Emezue
Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)
In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing (NLP) tasks. However, pre-training these large multilingual language models requires a lot of training data, which is not available for African languages. Active learning is a semi-supervised learning algorithm, in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning, in the context of NLP and especially multilingual language model pretraining, has received little consideration. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that AfroLM is able to generalize well across various domains. We release the source code and the datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.