Datasets Creation and Empirical Evaluations of Cross-Lingual Learning on Extremely Low-Resource Languages: A Focus on Comorian Dialects

Abdou Mohamed Naira, Benelallam Imade, Bahafid Abdessalam, Erraji Zakarya


Abstract
In this era of extensive digitalization, there is a profusion of intelligent systems that attempt to understand how languages are structured in order to provide solutions for tasks such as Text Summarization, Sentiment Analysis, and Speech Recognition. However, for reasons ranging from a lack of data to the absence of initiatives, these applications remain at an embryonic stage for certain languages and dialects, especially those spoken on the African continent, such as the Comorian dialects. Today, thanks to improvements in pre-trained Large Language Models, the way is open to bring these kinds of technologies to these languages. In this study, we pioneer the representation of Comorian dialects in the field of Natural Language Processing (NLP) by constructing datasets (Lexicons, Speech Recognition, and Raw Text datasets) that can be used for different tasks. We also measure the impact of using models pre-trained on languages closely related to Comorian dialects to advance the state of the art in NLP for these dialects, compared to using models pre-trained on languages that are not necessarily close to them. We construct models covering the following use cases: Language Identification, Sentiment Analysis, Part-Of-Speech Tagging, and Speech Recognition. Ultimately, we hope that these solutions can catalyze the improvement of similar initiatives in Comorian dialects and in languages facing similar challenges.
Anthology ID:
2024.law-1.14
Volume:
Proceedings of The 18th Linguistic Annotation Workshop (LAW-XVIII)
Month:
March
Year:
2024
Address:
St. Julians, Malta
Editors:
Sophie Henning, Manfred Stede
Venues:
LAW | WS
Publisher:
Association for Computational Linguistics
Pages:
140–149
URL:
https://aclanthology.org/2024.law-1.14
Cite (ACL):
Abdou Mohamed Naira, Benelallam Imade, Bahafid Abdessalam, and Erraji Zakarya. 2024. Datasets Creation and Empirical Evaluations of Cross-Lingual Learning on Extremely Low-Resource Languages: A Focus on Comorian Dialects. In Proceedings of The 18th Linguistic Annotation Workshop (LAW-XVIII), pages 140–149, St. Julians, Malta. Association for Computational Linguistics.
Cite (Informal):
Datasets Creation and Empirical Evaluations of Cross-Lingual Learning on Extremely Low-Resource Languages: A Focus on Comorian Dialects (Naira et al., LAW-WS 2024)
PDF:
https://aclanthology.org/2024.law-1.14.pdf