2024
pdf
bib
abs
Cheetah: Natural Language Generation for 517 African Languages
Ife Adebara
|
AbdelRahim Elmadany
|
Muhammad Abdul-Mageed
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Low-resource African languages pose unique challenges for natural language processing (NLP) tasks, including natural language generation (NLG). In this paper, we develop Cheetah, a massively multilingual NLG language model for African languages. Cheetah supports 517 African languages and language varieties, allowing us to address the scarcity of NLG resources and provide a solution to foster linguistic diversity. We demonstrate the effectiveness of Cheetah through comprehensive evaluations across six generation downstream tasks. In five of the six tasks, Cheetah significantly outperforms other models, showcasing its remarkable performance for generating coherent and contextually appropriate text in a wide range of African languages. We additionally conduct a detailed human evaluation to delve deeper into the linguistic capabilities of Cheetah. The findings of this study contribute to advancing NLP research in low-resource settings, enabling greater accessibility and inclusion for African languages in a rapidly expanding digital landscape. We will publicly release our models for research.
pdf
bib
Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP
Vinodkumar Prabhakaran
|
Sunipa Dev
|
Luciana Benotti
|
Daniel Hershcovich
|
Laura Cabello
|
Yong Cao
|
Ife Adebara
|
Li Zhou
Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP
pdf
bib
abs
Fumbling in Babel: An Investigation into ChatGPT’s Language Identification Ability
Wei-Rui Chen
|
Ife Adebara
|
Khai Doan
|
Qisheng Liao
|
Muhammad Abdul-Mageed
Findings of the Association for Computational Linguistics: NAACL 2024
ChatGPT has recently emerged as a powerful NLP tool that can carry out a variety of tasks. However, the range of languages ChatGPT can handle remains largely a mystery. To uncover which languages ChatGPT ‘knows’, we investigate its language identification (LID) abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 23 language families spoken in five continents. Languages in Babel-670 run the gamut from the very high-resource to the very low-resource. We then study ChatGPT’s (both GPT-3.5 and GPT-4) ability to (i) identify language names and language codes (ii) under zero- and few-shot conditions (iii) with and without provision of a label set. When compared to smaller finetuned LID tools, we find that ChatGPT lags behind. For example, it has poor performance on African languages. We conclude that current large language models would benefit from further development before they can sufficiently serve diverse communities.
pdf
bib
abs
Toucan: Many-to-Many Translation for 150 African Language Pairs
AbdelRahim Elmadany
|
Ife Adebara
|
Muhammad Abdul-Mageed
Findings of the Association for Computational Linguistics: ACL 2024
We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, We introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters respectively. Next, we finetune the aforementioned models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we carefully develop an extensive machine translation benchmark, dubbed Afro-Lingu-MT, tailored for evaluating machine translation. Toucan significantly outperforms other models, showcasing its remarkable performance on MT for African languages. Finally, we train a new model, spBLEU-1K, to enhance translation evaluation metrics, covering 1K languages, including African languages. This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa.
pdf
bib
abs
Interplay of Machine Translation, Diacritics, and Diacritization
Wei-Rui Chen
|
Ife Adebara
|
Muhammad Abdul-Mageed
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
We investigate two research questions: (1) how do machine translation (MT) and diacritization influence the performance of each other in a multi-task learning setting (2) the effect of keeping (vs. removing) diacritics on MT performance. We examine these two questions in both high-resource (HR) and low-resource (LR) settings across 55 different languages (36 African languages and 19 European languages). For (1), results show that diacritization significantly benefits MT in the LR scenario, doubling or even tripling performance for some languages, but harms MT in the HR scenario. We find that MT harms diacritization in LR but benefits significantly in HR for some languages. For (2), MT performance is similar regardless of diacritics being kept or removed. In addition, we propose two classes of metrics to measure the complexity of a diacritical system, finding these metrics to correlate positively with the performance of our diacritization models. Overall, our work provides insights for developing MT and diacritization systems under different data size conditions and may have implications that generalize beyond the 55 languages we investigate.
2023
pdf
bib
abs
SERENGETI: Massively Multilingual Language Models for Africa
Ife Adebara
|
AbdelRahim Elmadany
|
Muhammad Abdul-Mageed
|
Alcides Alcoba Inciarte
Findings of the Association for Computational Linguistics: ACL 2023
Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a set of massively multilingual language model that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms other models on 11 datasets across the eights tasks, achieving 82.27 average F_1. We also perform analyses of errors from our models, which allows us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research. Anonymous link
pdf
bib
abs
UBC-DLNLP at SemEval-2023 Task 12: Impact of Transfer Learning on African Sentiment Analysis
Gagan Bhatia
|
Ife Adebara
|
Abdelrahim Elmadany
|
Muhammad Abdul-mageed
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
We describe our contribution to the SemEVAl 2023 AfriSenti-SemEval shared task, where we tackle the task of sentiment analysis in 14 different African languages. We develop both monolingual and multilingual models under a full supervised setting (subtasks A and B). We also develop models for the zero-shot setting (subtask C). Our approach involves experimenting with transfer learning using six language models, including further pretraining of some of these models as well as a final finetuning stage. Our best performing models achieve an F1-score of 70.36 on development data and an F1-score of 66.13 on test data. Unsurprisingly, our results demonstrate the effectiveness of transfer learning and finetuning techniques for sentiment analysis across multiple languages. Our approach can be applied to other sentiment analysis tasks in different languages and domains.
2022
pdf
bib
abs
Towards Afrocentric NLP for African Languages: Where We Are and Where We Can Go
Ife Adebara
|
Muhammad Abdul-Mageed
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Aligning with ACL 2022 special Theme on “Language Diversity: from Low Resource to Endangered Languages”, we discuss the major linguistic and sociopolitical challenges facing development of NLP technologies for African languages. Situating African languages in a typological framework, we discuss how the particulars of these languages can be harnessed. To facilitate future research, we also highlight current efforts, communities, venues, datasets, and tools. Our main objective is to motivate and advocate for an Afrocentric approach to technology development. With this in mind, we recommend what technologies to build and how to build, evaluate, and deploy them based on the needs of local African communities.
pdf
bib
abs
Linguistically-Motivated Yorùbá-English Machine Translation
Ife Adebara
|
Muhammad Abdul-Mageed
|
Miikka Silfverberg
Proceedings of the 29th International Conference on Computational Linguistics
Translating between languages where certain features are marked morphologically in one but absent or marked contextually in the other is an important test case for machine translation. When translating into English which marks (in)definiteness morphologically, from Yorùbá which uses bare nouns but marks these features contextually, ambiguities arise. In this work, we perform fine-grained analysis on how an SMT system compares with two NMT systems (BiLSTM and Transformer) when translating bare nouns in Yorùbá into English. We investigate how the systems what extent they identify BNs, correctly translate them, and compare with human translation patterns. We also analyze the type of errors each model makes and provide a linguistic description of these errors. We glean insights for evaluating model performance in low-resource settings. In translating bare nouns, our results show the transformer model outperforms the SMT and BiLSTM models for 4 categories, the BiLSTM outperforms the SMT model for 3 categories while the SMT outperforms the NMT models for 1 category.
pdf
bib
abs
AfroLID: A Neural Language Identification Tool for African Languages
Ife Adebara
|
AbdelRahim Elmadany
|
Muhammad Abdul-Mageed
|
Alcides Inciarte
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world’s 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves 95.89 F_1-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to both showcase AfroLID’s powerful capabilities and limitations
2021
pdf
bib
abs
Improving Similar Language Translation With Transfer Learning
Ife Adebara
|
Muhammad Abdul-Mageed
Proceedings of the Sixth Conference on Machine Translation
We investigate transfer learning based on pre-trained neural machine translation models to translate between (low-resource) similar languages. This work is part of our contribution to the WMT 2021 Similar Languages Translation Shared Task where we submitted models for different language pairs, including French-Bambara, Spanish-Catalan, and Spanish-Portuguese in both directions. Our models for Catalan-Spanish (82.79 BLEU)and Portuguese-Spanish (87.11 BLEU) rank top 1 in the official shared task evaluation, and we are the only team to submit models for the French-Bambara pairs.
2020
pdf
bib
abs
Translating Similar Languages: Role of Mutual Intelligibility in Multilingual Transformers
Ife Adebara
|
El Moatez Billah Nagoudi
|
Muhammad Abdul Mageed
Proceedings of the Fifth Conference on Machine Translation
In this work we investigate different approaches to translate between similar languages despite low resource limitations. This work is done as the participation of the UBC NLP research group in the WMT 2019 Similar Languages Translation Shared Task. We participated in all language pairs and performed various experiments. We used a transformer architecture for all the models and used back-translation for one of the language pairs. We explore both bilingual and multi-lingual approaches. We describe the pre-processing, training, translation and results for each model. We also investigate the role of mutual intelligibility in model performance.