2023
ANLP-RG at NADI 2023 shared task: Machine Translation of Arabic Dialects: A Comparative Study of Transformer Models
Wiem Derouich | Sameh Kchaou | Rahma Boujelbane
Proceedings of ArabicNLP 2023
In this paper, we present our findings within the context of the NADI-2023 Shared Task (Subtask 2). Our task involves developing a translation model from the Palestinian, Jordanian, Emirati, and Egyptian dialects to Modern Standard Arabic (MSA) using the MADAR parallel corpus, even though it lacks a parallel subset for the Emirati dialect. To address this challenge, we conducted a comparative analysis, evaluating the fine-tuning results of various transformer models using the MADAR corpus as a learning resource. Additionally, we assessed the effectiveness of existing translation tools in achieving our translation objectives. The best model achieved a BLEU score of 11.14% on the dev set and 10.02 on the test set.
2022
Benchmarking transfer learning approaches for sentiment analysis of Arabic dialect
Emna Fsih | Sameh Kchaou | Rahma Boujelbane | Lamia Hadrich-Belguith
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Arabic has a widely varying collection of dialects. With the explosion of the use of social networks, the volume of written texts has remarkably increased. Most users express themselves using their own dialect. Unfortunately, many of these dialects remain under-studied due to the scarcity of resources. Researchers and industry practitioners are increasingly interested in analyzing users’ sentiments. In this context, several approaches have been proposed, namely traditional machine learning, deep learning, transfer learning and, more recently, few-shot learning approaches. In this work, we compare their efficiency as part of the NADI competition to develop a country-level sentiment analysis model. Three models were beneficial for this sub-task: the first is based on Sentence Transformer (ST) and achieves 43.23% on the DEV set and 42.33% on the TEST set, the second is based on CAMeLBERT and achieves 47.85% on the DEV set and 41.72% on the TEST set, and the third is based on a multi-dialect BERT model and achieves 66.72% on the DEV set and 39.69% on the TEST set.
Standardisation of Dialect Comments in Social Networks in View of Sentiment Analysis : Case of Tunisian Dialect
Saméh Kchaou | Rahma Boujelbane | Emna Fsih | Lamia Hadrich-Belguith
Proceedings of the Thirteenth Language Resources and Evaluation Conference
With growing access to the internet, spoken Arabic dialects have become informal written languages on social media. Most users post comments using their own dialect. This linguistic situation inhibits mutual understanding between internet users and makes it difficult to use computational approaches, since most Arabic resources are intended for the formal language, Modern Standard Arabic (MSA). In this paper, we present a pipeline to standardize written texts in social networks by translating them into the standard language, MSA. We first fine-tune a BERT-based identification model to distinguish Tunisian Dialect (TD) from MSA and other dialects. Then, we train a transformer model to translate TD to MSA. The final system includes the translated TD text and the text originally written in MSA. Each of these steps was evaluated on the same test corpus. In order to test the effectiveness of the approach, we compared two opinion analysis models, the first intended for the Sentiment Analysis (SA) of dialect texts and the second for MSA texts. We concluded that through standardization we obtain the best score.
2020
Text and Speech-based Tunisian Arabic Sub-Dialects Identification
Najla Ben Abdallah | Saméh Kchaou | Fethi Bougares
Proceedings of the Twelfth Language Resources and Evaluation Conference
Dialect IDentification (DID) is a challenging task, and it becomes more complicated when the dialects to be identified belong to the same country. Indeed, dialects of the same country are closely related and exhibit significant overlap at the phonetic and lexical levels. In this paper, we present our first results on a dialect classification task covering four sub-dialects spoken in Tunisia. We use the term ‘sub-dialect’ to refer to dialects belonging to the same country. We conducted our experiments aiming to discriminate between Tunisian sub-dialects belonging to four different cities, namely Tunis, Sfax, Sousse and Tataouine. A spoken corpus of 1673 utterances was collected, transcribed and freely distributed. We used this corpus to build several speech- and text-based DID systems. Our results confirm that, at this level of granularity, dialects are much more distinguishable using the speech modality. Indeed, we were able to reach an F-1 score of 93.75% using our best speech-based identification system, while the F-1 score is limited to 54.16% using text-based DID on the same test set.
Parallel resources for Tunisian Arabic Dialect Translation
Saméh Kchaou | Rahma Boujelbane | Lamia Hadrich-Belguith
Proceedings of the Fifth Arabic Natural Language Processing Workshop
The difficulty of processing dialects is clearly observed in the high cost of building a representative corpus, in particular for machine translation. Indeed, all machine translation systems require a large amount of well-managed training data, which represents a challenge in a low-resource setting such as the Tunisian Arabic dialect. In this paper, we present a data augmentation technique to create a parallel corpus for the Tunisian Arabic dialect written in social media and standard Arabic in order to build a Machine Translation (MT) model. The created corpus was used to build a sentence-based translation model. This model reached a BLEU score of 15.03% on a test set, while it was limited to 13.27% when using the corpus without augmentation.
2019
LIUM-MIRACL Participation in the MADAR Arabic Dialect Identification Shared Task
Saméh Kchaou | Fethi Bougares | Lamia Hadrich-Belguith
Proceedings of the Fourth Arabic Natural Language Processing Workshop
This paper describes the joint participation of the LIUM and MIRACL Laboratories in the Arabic dialect identification challenge of the MADAR Shared Task (Bouamor et al., 2019) conducted during the Fourth Arabic Natural Language Processing Workshop (WANLP 2019). We participated in the Travel Domain Dialect Identification subtask. We built several systems and explored different techniques, including conventional machine learning methods and deep learning algorithms. Deep learning approaches did not perform well on this task. We experimented with several classification systems and were able to identify the dialect of an input sentence with an F1-score of 65.41% on the official test set using only the training data supplied by the shared task organizers.