Imade Benelallam

2025

Preserving Comorian Linguistic Heritage: Bidirectional Transliteration Between the Latin Alphabet and the Kamar-Eddine System
Abdou Mohamed Naira | Abdessalam Bahafid | Zakarya Erraji | Anass Allak | Mohamed Soibira Naoufal | Imade Benelallam
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)

The Comoros Islands, rich in linguistic diversity, are home to dialects derived from Swahili and influenced by Arabic. Historically, the Kamar-Eddine system, based on the Arabic alphabet, was one of the first writing systems used for Comorian. However, it has gradually been replaced by the Latin alphabet, even though numerous archival texts are written in this system, and older speakers continue to use it, highlighting its cultural and historical significance. In this article, we present Shialifube, a bidirectional transliteration tool between Latin and Arabic scripts, designed in accordance with the rules of the Kamar-Eddine system. To evaluate its performance, we applied a round-trip transliteration technique, achieving a word error rate of 14.84% and a character error rate of 9.56%. These results demonstrate the reliability of our system for complex tasks. Furthermore, Shialifube was tested in a practical case related to speech recognition, showcasing its potential in Natural Language Processing. This project serves as a bridge between tradition and modernity, contributing to the preservation of Comorian linguistic heritage while paving the way for better integration of local dialects into advanced technologies.
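The round-trip evaluation mentioned in the abstract can be illustrated with a minimal sketch: each sentence is transliterated to the other script and back, and the restored text is compared to the original using word and character error rates based on Levenshtein distance. The names `round_trip_scores`, `toy_forward`, and `toy_backward` are hypothetical, and the toy mapping below is an arbitrary lossy substitution chosen only to make the example runnable; it does not reflect the Kamar-Eddine rules or the Shialifube implementation.

```python
from typing import Callable, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs: Sequence[str], hyps: Sequence[str],
               tokenize: Callable[[str], Sequence]) -> float:
    """Total edit distance divided by total reference length."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = tokenize(ref)
        errors += edit_distance(ref_tokens, tokenize(hyp))
        total += len(ref_tokens)
    return errors / total if total else 0.0


def round_trip_scores(sentences: Sequence[str],
                      forward: Callable[[str], str],
                      backward: Callable[[str], str]) -> tuple[float, float]:
    """Return (WER, CER) of backward(forward(s)) measured against s."""
    restored = [backward(forward(s)) for s in sentences]
    wer = error_rate(sentences, restored, str.split)  # word-level tokens
    cer = error_rate(sentences, restored, list)       # character-level tokens
    return wer, cer


# Hypothetical toy transliterators: the forward step loses the o/u
# distinction, so the round trip cannot recover it.
def toy_forward(text: str) -> str:
    return text.replace("o", "u")


def toy_backward(text: str) -> str:
    return text


if __name__ == "__main__":
    wer, cer = round_trip_scores(["foo bar boo", "data"],
                                 toy_forward, toy_backward)
    print(f"WER={wer:.2%} CER={cer:.2%}")
```

A round trip of this kind needs no manually transliterated reference text, but it only measures how much information is lost across both directions combined, not the correctness of either direction on its own.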

2024

Datasets Creation and Empirical Evaluations of Cross-Lingual Learning on Extremely Low-Resource Languages: A Focus on Comorian Dialects
Abdou Mohamed Naira | Abdessalam Bahafid | Zakarya Erraji | Imade Benelallam
Proceedings of the 18th Linguistic Annotation Workshop (LAW-XVIII)

In this era of extensive digitalization, there is a profusion of Intelligent Systems that attempt to understand how languages are structured in order to provide solutions for various tasks such as Text Summarization, Sentiment Analysis, and Speech Recognition. But for multiple reasons, ranging from lack of data to the nonexistence of initiatives, these applications are at an embryonic stage in certain languages and dialects, especially those spoken on the African continent, such as Comorian dialects. Today, thanks to the improvement of Pre-trained Large Language Models, the way is open to enable these kinds of technologies for these languages. In this study, we pioneer the representation of Comorian dialects in the field of Natural Language Processing (NLP) by constructing datasets (Lexicons, Speech Recognition and Raw Text datasets) that could be used for different tasks. We also measure the impact of using models pre-trained on languages closely related to Comorian dialects to advance the state of the art in NLP for the latter, compared to using models pre-trained on languages that may not necessarily be close to these dialects. We construct models covering the following use cases: Language Identification, Sentiment Analysis, Part-Of-Speech Tagging, and Speech Recognition. Ultimately, we hope that these solutions can catalyze the improvement of similar initiatives in Comorian dialects and in languages facing similar challenges.

2022

SI2M & AIOX Labs at WANLP 2022 Shared Task: Propaganda Detection in Arabic, A Data Augmentation and Name Entity Recognition Approach
Kamel Gaanoun | Imade Benelallam
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

This paper presents the SI2M & AIOX Labs submission to the shared task on propaganda detection in Arabic text. The objective of this challenge is to identify the propaganda techniques used in specific propaganda fragments. We use a combination of data augmentation, Name Entity Recognition, rule-based repetition detection, and ARBERT prediction to develop our system. The model we provide achieved a micro F1-score of 0.585 and ranked 6th out of 12 teams.
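For readers unfamiliar with the metric reported above, micro-averaged F1 pools true positives, false positives, and false negatives over all examples and labels before computing a single score. The sketch below assumes a multi-label setting in which each text has a set of technique labels; the function name `micro_f1` and the label strings are hypothetical, and the shared task's official scorer may differ in details.

```python
def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over per-example label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical label names, used only for illustration.
    gold = [{"technique_a", "technique_b"}, {"technique_c"}]
    pred = [{"technique_a"}, {"technique_c", "technique_d"}]
    print(round(micro_f1(gold, pred), 4))
```

Because counts are pooled before averaging, frequent labels weigh more heavily in the score than rare ones.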

2021

Sarcasm and Sentiment Detection in Arabic language A Hybrid Approach Combining Embeddings and Rule-based Features
Kamel Gaanoun | Imade Benelallam
Proceedings of the Sixth Arabic Natural Language Processing Workshop

This paper presents the ArabicProcessors team’s system designed for the sarcasm (subtask 1) and sentiment (subtask 2) detection shared task. We created a hybrid system by combining rule-based features and both static and dynamic embeddings using transformers and deep learning. The system’s architecture is an ensemble of Naive Bayes, MarBERT and Mazajak embeddings. The system achieved an F1-score of 51% on sarcasm detection and 71% on sentiment detection.

2020

Arabic dialect identification: An Arabic-BERT model with data augmentation and ensembling strategy
Kamel Gaanoun | Imade Benelallam
Proceedings of the Fifth Arabic Natural Language Processing Workshop

This paper presents the ArabicProcessors team’s deep learning system designed for the NADI 2020 Subtask 1 (country-level dialect identification) and Subtask 2 (province-level dialect identification). We used Arabic-BERT in combination with data augmentation and ensembling methods. Unlabeled data provided by the task organizers (10 million tweets) was split into multiple subparts, to which we applied a semi-supervised learning method, and we finally ran a specific ensembling process on the resulting models. This system ranked 3rd in Subtask 1 with a 23.26% F1-score and 2nd in Subtask 2 with a 5.75% F1-score.

Co-authors

Mohamed Soibira Naoufal 1
