Ikechukwu Onyenwe


MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages
Cheikh M. Bamba Dione | David Ifeoluwa Adelani | Peter Nabende | Jesujoba Alabi | Thapelo Sindane | Happy Buzaaba | Shamsuddeen Hassan Muhammad | Chris Chinenye Emezue | Perez Ogayo | Anuoluwapo Aremu | Catherine Gitau | Derguene Mbaye | Jonathan Mukiibi | Blessing Sibanda | Bonaventure F. P. Dossou | Andiswa Bukula | Rooweither Mabuya | Allahsera Auguste Tapo | Edwin Munkoh-Buabeng | Victoire Memdjokam Koagne | Fatoumata Ouoba Kabore | Amelia Taylor | Godson Kalipe | Tebogo Macucwa | Vukosi Marivate | Tajuddeen Gwadabe | Mboning Tchiaze Elvis | Ikechukwu Onyenwe | Gratien Atindogbe | Tolulope Adelani | Idris Akinade | Olanrewaju Samuel | Marien Nahimana | Théogène Musabeyezu | Emile Niyomutabazi | Ester Chimhenga | Kudzai Gotosa | Patrick Mizha | Apelete Agbolo | Seydou Traore | Chinedu Uchechukwu | Aliyu Yusuf | Muhammad Abdullahi | Dietrich Klakow
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the Universal Dependencies (UD) guidelines. We conducted extensive POS baseline experiments using both conditional random fields and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in the UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems to be more effective for POS tagging in unseen languages.


Igbo Diacritic Restoration using Embedding Models
Ignatius Ezeani | Mark Hepple | Ikechukwu Onyenwe | Enemouh Chioma
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

Igbo is a low-resource language spoken by approximately 30 million people worldwide. It is the native language of the Igbo people of south-eastern Nigeria. In the Igbo language, diacritics, both orthographic and tonal, play a major role in distinguishing the meaning and pronunciation of words. Omitting diacritics in texts often leads to lexical ambiguity. Diacritic restoration is a pre-processing task that replaces missing diacritics on words from which they have been removed. In this work, we applied embedding models to the diacritic restoration task and compared their performance to that of n-gram models. Although word embedding models have been successfully applied to various NLP tasks, they have not been used, to our knowledge, for diacritic restoration. Two classes of word embedding models were used: those projected from the English embedding space, and those trained with an Igbo bible corpus (≈ 1m). Our best result, 82.49%, is an improvement on the baseline n-gram models.

Transferred Embeddings for Igbo Similarity, Analogy, and Diacritic Restoration Tasks
Ignatius Ezeani | Ikechukwu Onyenwe | Mark Hepple
Proceedings of the Third Workshop on Semantic Deep Learning

Existing NLP models are mostly trained with data from well-resourced languages. Most minority languages lack the resources, both data and technologies, needed for NLP research. Building these resources from scratch for each minority language would be very expensive and time-consuming, and would largely amount to re-inventing the wheel. In this paper, we applied transfer learning techniques to create Igbo word embeddings from a variety of existing English-trained embeddings. Transfer learning methods were also used to build standard datasets for Igbo word similarity and analogy tasks for intrinsic evaluation of embeddings. These projected embeddings were also applied to the diacritic restoration task. Our results indicate that the projected models not only outperform the trained ones on the semantic-based tasks of analogy, word similarity, and odd-word identification, but also achieve enhanced performance on diacritic restoration with learned diacritic embeddings.


Lexical Disambiguation of Igbo using Diacritic Restoration
Ignatius Ezeani | Mark Hepple | Ikechukwu Onyenwe
Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications

Properly written texts in Igbo, a low-resource African language, are rich in both orthographic and tonal diacritics. Diacritics are essential in capturing the distinctions in pronunciation and meaning of words, as well as in lexical disambiguation. Unfortunately, most electronic texts in diacritic languages are written without diacritics. This makes diacritic restoration a necessary step in corpus building and language processing tasks for languages with diacritics. In our previous work, we built some n-gram models with simple smoothing techniques based on a closed-world assumption. However, as a classification task, diacritic restoration is well suited to machine learning, which should also make it more generalisable. This paper, therefore, presents a more standard approach to the task, one that applies machine learning algorithms.


Use of Transformation-Based Learning in Annotation Pipeline of Igbo, an African Language
Ikechukwu Onyenwe | Mark Hepple | Chinedu Uchechukwu | Ignatius Ezeani
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects


Part-of-speech Tagset and Corpus Development for Igbo, an African Language
Ikechukwu Onyenwe | Chinedu Uchechukwu | Mark Hepple
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop