Rahma Boujelbane


ANLP-RG at NADI 2023 shared task: Machine Translation of Arabic Dialects: A Comparative Study of Transformer Models
Wiem Derouich | Sameh Kchaou | Rahma Boujelbane
Proceedings of ArabicNLP 2023

In this paper, we present our findings from the NADI-2023 Shared Task (Subtask 2). The task is to build a model that translates the Palestinian, Jordanian, Emirati, and Egyptian dialects into Modern Standard Arabic (MSA) using the MADAR parallel corpus, even though it lacks a parallel subset for the Emirati dialect. To address this challenge, we conducted a comparative analysis of several transformer models fine-tuned on the MADAR corpus. Additionally, we assessed how well existing translation tools meet our translation objectives. The best model achieved a BLEU score of 11.14 on the dev set and 10.02 on the test set.


A deep sentiment analysis of Tunisian dialect comments on multi-domain posts in different social media platforms
Emna Fsih | Rahma Boujelbane | Lamia Hadrich Belguith
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)

Standardisation of Dialect Comments in Social Networks in View of Sentiment Analysis : Case of Tunisian Dialect
Saméh Kchaou | Rahma Boujelbane | Emna Fsih | Lamia Hadrich-Belguith
Proceedings of the Thirteenth Language Resources and Evaluation Conference

With growing access to the internet, spoken Arabic dialects have become informal written languages on social media. Most users post comments in their own dialect. This linguistic situation inhibits mutual understanding between internet users and makes it difficult to use computational approaches, since most Arabic resources are intended for the formal language, Modern Standard Arabic (MSA). In this paper, we present a pipeline that standardizes texts written on social networks by translating them into MSA. We first fine-tune a BERT-based identification model to distinguish Tunisian Dialect (TD) from MSA and other dialects. Then, we train a transformer model to translate TD into MSA. The final system's output combines the translated TD text with the text originally written in MSA. Each of these steps was evaluated on the same test corpus. To test the effectiveness of the approach, we compared two opinion analysis models, the first intended for the Sentiment Analysis (SA) of dialect texts and the second for MSA texts. We concluded that standardization yields the best score.

Benchmarking transfer learning approaches for sentiment analysis of Arabic dialect
Emna Fsih | Sameh Kchaou | Rahma Boujelbane | Lamia Hadrich-Belguith
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Arabic has a widely varying collection of dialects. With the explosion in the use of social networks, the volume of written texts has increased remarkably. Most users express themselves in their own dialect. Unfortunately, many of these dialects remain under-studied due to the scarcity of resources. Researchers and industry practitioners are increasingly interested in analyzing users’ sentiments. In this context, several approaches have been proposed, namely traditional machine learning, deep learning, transfer learning, and more recently few-shot learning. In this work, we compare their efficiency as part of the NADI competition to develop a country-level sentiment analysis model. Three models proved useful for this subtask: the first, based on Sentence Transformer (ST), achieves 43.23% on the DEV set and 42.33% on the TEST set; the second, based on CAMeLBERT, achieves 47.85% on the DEV set and 41.72% on the TEST set; and the third, based on a multi-dialect BERT model, achieves 66.72% on the DEV set and 39.69% on the TEST set.


Parallel resources for Tunisian Arabic Dialect Translation
Saméh Kchaou | Rahma Boujelbane | Lamia Hadrich-Belguith
Proceedings of the Fifth Arabic Natural Language Processing Workshop

The difficulty of processing dialects is clearly seen in the high cost of building a representative corpus, in particular for machine translation. Indeed, all machine translation systems require a large amount of well-managed training data, which is a challenge in a low-resource setting such as the Tunisian Arabic dialect. In this paper, we present a data augmentation technique to create a parallel corpus between the Tunisian Arabic dialect as written on social media and standard Arabic, in order to build a Machine Translation (MT) model. The created corpus was used to build a sentence-based translation model. This model reached a BLEU score of 15.03% on a test set, compared with 13.27% using the corpus without augmentation.


A Conventional Orthography for Tunisian Arabic
Inès Zribi | Rahma Boujelbane | Abir Masmoudi | Mariem Ellouze | Lamia Belguith | Nizar Habash
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Tunisian Arabic is a dialect of the Arabic language spoken in Tunisia. Tunisian Arabic is an under-resourced language. It has neither a standard orthography nor large collections of written text and dictionaries. Actually, there is no strict separation between Modern Standard Arabic, the official language of the government, media and education, and Tunisian Arabic; the two exist on a continuum dominated by mixed forms. In this paper, we present a conventional orthography for Tunisian Arabic, following a previous effort on developing a conventional orthography for Dialectal Arabic (or CODA) demonstrated for Egyptian Arabic. We explain the design principles of CODA and provide a detailed description of its guidelines as applied to Tunisian Arabic.

De l’arabe standard vers l’arabe dialectal : projection de corpus et ressources linguistiques en vue du traitement automatique de l’oral dans les médias tunisiens [From Modern Standard Arabic to Tunisian dialect: corpus projection and linguistic resources towards the automatic processing of speech in the Tunisian media]
Rahma Boujelbane | Mariem Ellouze | Frédéric Béchet | Lamia Belguith
Traitement Automatique des Langues, Volume 55, Numéro 2 : Traitement automatique du langage parlé [Spoken language processing]


Building bilingual lexicon to create Dialect Tunisian corpora and adapt language model
Rahma Boujelbane | Mariem Ellouze Khemekhem | Siwar BenAyed | Lamia Hadrich Belguith
Proceedings of the Second Workshop on Hybrid Approaches to Translation

Translating verbs between MSA and arabic dialects through deep morphological analysis (Un système de traduction de verbes entre arabe standard et arabe dialectal par analyse morphologique profonde) [in French]
Ahmed Hamdi | Rahma Boujelbane | Nizar Habash | Alexis Nasr
Proceedings of TALN 2013 (Volume 1: Long Papers)

Generation of tunisian dialect corpora for adapting language models (Génération des corpus en dialecte tunisien pour la modélisation de langage d’un système de reconnaissance) [in French]
Rahma Boujelbane
Proceedings of RECITAL 2013

The Effects of Factorizing Root and Pattern Mapping in Bidirectional Tunisian - Standard Arabic Machine Translation
Ahmed Hamdi | Rahma Boujelbane | Nizar Habash | Alexis Nasr
Proceedings of Machine Translation Summit XIV: Papers

Mapping Rules for Building a Tunisian Dialect Lexicon and Generating Corpora
Rahma Boujelbane | Mariem Ellouze Khemekhem | Lamia Hadrich Belguith
Proceedings of the Sixth International Joint Conference on Natural Language Processing