Freshia Sackey - ACL Anthology

Freshia Sackey

2022

A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation
David Ifeoluwa Adelani | Jesujoba Oluwadara Alabi | Angela Fan | Julia Kreutzer | Xiaoyu Shen | Machel Reid | Dana Ruiter | Dietrich Klakow | Peter Nabende | Ernie Chang | Tajuddeen Gwadabe | Freshia Sackey | Bonaventure F. P. Dossou | Chris Emezue | Colin Leong | Michael Beukman | Shamsuddeen H. Muhammad | Guyo D. Jarso | Oreen Yousuf | Andre N. Niyongabo Rubungo | Gilles Hacheme | Eric Peter Wairagala | Muhammad Umair Nasir | Benjamin A. Ajibade | Tunde Oluwaseyi Ajayi | Yvonne Wambui Gitau | Jade Abbott | Mohamed Ahmed | Millicent Ochieng | Anuoluwapo Aremu | Perez Ogayo | Jonathan Mukiibi | Fatoumata Ouoba Kabore | Godson Koffi Kalipe | Derguene Mbaye | Allahsera Auguste Tapo | Victoire M. Memdjokam Koagne | Edwin Munkoh-Buabeng | Valencia Wagner | Idris Abdulmumin | Ayodele Awokoya | Happy Buzaaba | Blessing Sibanda | Andiswa Bukula | Sam Manthalu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out of these datasets. This is primarily because many widely spoken languages are not well represented on the web and are therefore excluded from the large-scale crawls used to build these datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a novel African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring to both additional languages and additional domains is to leverage small quantities of high-quality translation data to fine-tune large pre-trained models.

2020

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
Wilhelmina Nekoto | Vukosi Marivate | Tshinondiwa Matsila | Timi Fasubaa | Taiwo Fagbohungbe | Solomon Oluwole Akinola | Shamsuddeen Muhammad | Salomon Kabongo Kabenamualu | Salomey Osei | Freshia Sackey | Rubungo Andre Niyongabo | Ricky Macharm | Perez Ogayo | Orevaoghene Ahia | Musie Meressa Berhe | Mofetoluwa Adeyemi | Masabata Mokgesi-Selinga | Lawrence Okegbemi | Laura Martinus | Kolawole Tajudeen | Kevin Degila | Kelechi Ogueji | Kathleen Siminyu | Julia Kreutzer | Jason Webster | Jamiil Toure Ali | Jade Abbott | Iroro Orife | Ignatius Ezeani | Idris Abdulkadir Dangana | Herman Kamper | Hady Elsahar | Goodness Duru | Ghollah Kioko | Murhabazi Espoir | Elan van Biljon | Daniel Whitenack | Christopher Onyefuluchi | Chris Chinenye Emezue | Bonaventure F. P. Dossou | Blessing Sibanda | Blessing Bassey | Ayodele Olabiyi | Arshath Ramkilowan | Alp Öktem | Adewale Akinfaderin | Abdallah Bashir
Findings of the Association for Computational Linguistics: EMNLP 2020

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. ‘Low-resourced’-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role in information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.

Co-authors

Blessing Kudzaishe Sibanda 2

Idris Abdulmumin 1

David Ifeoluwa Adelani 1

Mofetoluwa Adeyemi 1

Orevaoghene Ahia 1

Mohamed Ahmed 1

Tunde Oluwaseyi Ajayi 1

Benjamin A. Ajibade 1

Adewale Akinfaderin 1

Solomon Oluwole Akinola 1

Jesujoba Alabi 1

Jamiil Toure Ali 1

Anuoluwapo Aremu 1

Ayodele Awokoya 1

Abdallah Bashir 1

Blessing Bassey 1

Musie Meressa Berhe 1

Michael Beukman 1

Andiswa Bukula 1

Happy Buzaaba 1

Idris Abdulkadir Dangana 1

Goodness Duru 1

Murhabazi Espoir 1

Ignatius Ezeani 1

Taiwo Fagbohungbe 1

Yvonne Wambui Gitau 1

Tajuddeen Gwadabe 1

Gilles Hacheme 1

Guyo D. Jarso 1

Salomon Kabongo Kabenamualu 1

Godson Koffi Kalipe 1

Herman Kamper 1

Ghollah Kioko 1

Dietrich Klakow 1

Ricky Macharm 1

Vukosi Marivate 1

Laura Martinus 1

Tshinondiwa Matsila 1

Derguene Mbaye 1

Victoire M. Memdjokam Koagne 1

Masabata Mokgesi-Selinga 1

Jonathan Mukiibi 1

Edwin Munkoh-Buabeng 1

Peter Nabende 1

Muhammad Umair Nasir 1

Wilhelmina Nekoto 1

Rubungo Andre Niyongabo 1

Andre N. Niyongabo Rubungo 1

Millicent Ochieng 1

Kelechi Ogueji 1

Lawrence Okegbemi 1

Ayodele Olabiyi 1

Christopher Onyefuluchi 1

Fatoumata Ouoba Kabore 1

Arshath Ramkilowan 1

Kathleen Siminyu 1

Kolawole Tajudeen 1

Allahsera Auguste Tapo 1

Valencia Wagner 1

Eric Peter Wairagala 1

Jason Webster 1

Daniel Whitenack 1

Elan van Biljon 1

Venues

Findings 1
NAACL 1