Shafie Abdi Mohamed
2026
Morphologically-informed Somali Lemmatization Corpus built with a Web-based Crowdsourcing Platform
Abdifatah Ahmed Gedi | Shafie Abdi Mohamed | Yusuf A. Yusuf | Muhidin A. Mohamed | Fuad Mire Hassan | Houssein A Assowe
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
Lemmatization, which reduces words to their root forms, plays a key role in tasks such as information retrieval, text indexing, and machine-learning-based language models. However, a key research challenge for low-resourced languages such as Somali is the lack of human-annotated lemmatization datasets and reliable ground truth to underpin accurate morphological analysis and the training of relevant NLP models. To address this problem, we developed the first large-scale, purpose-built Somali lemmatization lexicon, coupled with a crowdsourcing platform for ongoing expansion. The system leverages Somali’s agglutinative and derivational morphology, encompassing over 5,584 root words and 78,629 derivative forms, each annotated with part-of-speech tags. For data validation, we devised a pilot lexicon-based lemmatizer integrated with rule-based logic to handle out-of-vocabulary terms. Evaluation on a 294-document corpus covering news articles, social media posts, and short messages shows lemmatization accuracies of 51.27% for full articles, 44.14% for excerpts, and 59.51% for short texts such as tweets. These results demonstrate that combining lexical resources, POS tagging, and rule-based strategies provides a robust and scalable framework for addressing morphological complexity in Somali and other low-resource languages.
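The abstract describes a lexicon lookup combined with rule-based handling of out-of-vocabulary terms. A minimal sketch of that general pattern follows; it is not the authors' implementation. The lexicon entries, the suffix list, the minimum stem length, and the function names `lemmatize` and `accuracy` are all illustrative assumptions and are not taken from the paper's resource.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: surface form -> (lemma, POS tag).
# These entries are illustrative only, not drawn from the paper's lexicon.
LEXICON: Dict[str, Tuple[str, str]] = {
    "buug": ("buug", "NOUN"),
    "buugga": ("buug", "NOUN"),
    "buugaag": ("buug", "NOUN"),
}

# Hypothetical suffix-stripping rules for out-of-vocabulary tokens.
# The real rule set is not described in the abstract.
SUFFIX_RULES: Tuple[str, ...] = ("ka", "ta")

# Assumed guard so that very short stems are never produced.
MIN_STEM_LENGTH = 3


def lemmatize(token: str) -> str:
    """Return a lemma via lexicon lookup, falling back to suffix rules."""
    word = token.lower()
    if word in LEXICON:
        return LEXICON[word][0]
    for suffix in SUFFIX_RULES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= MIN_STEM_LENGTH:
            return word[:stem_length]
    # No lexicon entry and no rule applies: return the token unchanged.
    return word


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of tokens whose predicted lemma matches the gold lemma."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

In this sketch the lexicon always takes precedence, so the rules only ever fire for tokens the lexicon does not cover; token-level accuracy is one plausible reading of the reported percentages, which the abstract does not define further.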
2024
AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages
Jiayi Wang | David Ifeoluwa Adelani | Sweta Agrawal | Marek Masiak | Ricardo Rei | Eleftheria Briakou | Marine Carpuat | Xuanli He | Sofia Bourhim | Andiswa Bukula | Muhidin Mohamed | Temitayo Olatoye | Tosin Adewumi | Hamam Mokayed | Christine Mwase | Wangui Kimotho | Foutse Yuehgoh | Anuoluwapo Aremu | Jessica Ojo | Shamsuddeen Hassan Muhammad | Salomey Osei | Abdul-Hakeem Omotayo | Chiamaka Chukwuneke | Perez Ogayo | Oumaima Hourrane | Salma El Anigri | Lolwethu Ndolela | Thabiso Mangwana | Shafie Abdi Mohamed | Ayinde Hassan | Oluwabusayo Olufunke Awoyomi | Lama Alkhaled | Sana Al-Azzawi | Naome A. Etori | Millicent Ochieng | Clemencia Siro | Samuel Njoroge | Eric Muchiri | Wangari Kimotho | Lyse Naomi Wamba Momo | Daud Abolade | Simbiat Ajao | Iyanuoluwa Shode | Ricky Macharm | Ruqayya Nasir Iro | Saheed S. Abdullahi | Stephen E. Moore | Bernard Opoku | Zainab Akinjobi | Abeeb Afolabi | Nnaemeka Obiefuna | Onyekachi Raphael Ogbu | Sam Brian | Verrah Akinyi Otiende | Chinedu Emmanuel Mbonu | Sakayo Toadoum Sari | Yao Lu | Pontus Stenetorp
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).
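The abstract reports agreement with human judgments as a Spearman-rank correlation. As a rough illustration of how such a figure is computed, the sketch below ranks both score lists (averaging tied ranks) and takes the Pearson correlation of the ranks. The function names and the sample scores are hypothetical and are not data from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical metric scores and human DA ratings for four translations.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [10, 50, 40, 90]
rho = spearman(metric_scores, human_ratings)
```

Because only the rank order matters, the metric and the human ratings can live on different scales, which is why this statistic suits comparing learned metric outputs against direct assessment scores.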
Co-authors
- Saheed S. Abdullahi 1
- Daud Abolade 1
- David Ifeoluwa Adelani 1
- Tosin Adewumi 1
- Abeeb Afolabi 1
- Sweta Agrawal 1
- Simbiat Ajao 1
- Zainab Akinjobi 1
- Sana Al-Azzawi 1
- Lama Alkhaled 1
- Anuoluwapo Aremu 1
- Houssein A Assowe 1
- Oluwabusayo Olufunke Awoyomi 1
- Sofia Bourhim 1
- Eleftheria Briakou 1
- Sam Brian 1
- Andiswa Bukula 1
- Marine Carpuat 1
- Chiamaka Chukwuneke 1
- Salma El Anigri 1
- Naome A. Etori 1
- Abdifatah Ahmed Gedi 1
- Ayinde Hassan 1
- Fuad Mire Hassan 1
- Xuanli He 1
- Oumaima Hourrane 1
- Ruqayya Nasir Iro 1
- Wangui Kimotho 1
- Wangari Kimotho 1
- Yao Lu 1
- Ricky Macharm 1
- Thabiso Mangwana 1
- Marek Masiak 1
- Chinedu Emmanuel Mbonu 1
- Muhidin Mohamed 1
- Muhidin A. Mohamed 1
- Hamam Mokayed 1
- Stephen E. Moore 1
- Eric Muchiri 1
- Shamsuddeen Hassan Muhammad 1
- Christine Mwase 1
- Lolwethu Ndolela 1
- Samuel Njoroge 1
- Nnaemeka Obiefuna 1
- Millicent Ochieng 1
- Perez Ogayo 1
- Onyekachi Raphael Ogbu 1
- Jessica Ojo 1
- Temitayo Olatoye 1
- Abdul-Hakeem Omotayo 1
- Bernard Opoku 1
- Salomey Osei 1
- Verrah Akinyi Otiende 1
- Ricardo Rei 1
- Iyanuoluwa Shode 1
- Clemencia Siro 1
- Pontus Stenetorp 1
- Sakayo Toadoum Sari 1
- Lyse Naomi Wamba Momo 1
- Jiayi Wang 1
- Foutse Yuehgoh 1
- Yusuf A. Yusuf 1