ASOS at OSACT6 Shared Task: Investigation of Data Augmentation in Arabic Dialect-MSA Translation

Omer Nacar; Abdullah Alharbi; Serry Sibaee; Samar Ahmed; Lahouari Ghouti; Anis Koubaa

Abstract

Translation between Modern Standard Arabic (MSA) and the various Arabic dialects is challenging because of the significant linguistic, cultural, and contextual variation across the regions where Arabic is spoken. This paper describes our system for the OSACT 2024 Dialect to MSA Translation Shared Task. Our approach combines data augmentation using generative pre-trained transformer models (GPT-3.5 and GPT-4) with fine-tuning of AraT5 V2, a model specifically designed for Arabic translation tasks. The augmentation significantly expanded the training dataset and improved the model’s performance across five major Arabic dialects: Gulf, Egyptian, Levantine, Iraqi, and Maghrebi. We evaluate our approach using the BLEU score, which serves as our measure of translation accuracy, fluency, and preservation of meaning. Our results show that the fine-tuned models are effective at translating diverse Arabic dialects into MSA, achieving a BLEU score of 80% on the validation set and 22.25% on the blind test set. However, while training on a larger dataset, such as Madar + Dev, yielded significantly higher evaluation BLEU scores, performance on the blind test set was relatively lower. This observation underscores the importance of dataset size in model training, while revealing potential limitations in generalization to unseen data due to variations in data distribution and domain mismatches.
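Since every result in the abstract is reported as a BLEU score, a minimal sketch of how corpus-level BLEU is computed may help in reading those numbers. It is not the shared task's official scorer: the function names `ngrams` and `corpus_bleu`, the whitespace tokenization, the single reference per hypothesis, and the absence of smoothing are all simplifying assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumptions (illustrative only): whitespace tokenization, no smoothing.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clipped matches: an n-gram counts at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the 1..max_n modified precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Calling `corpus_bleu(system_outputs, msa_references)` on two equal-length lists of strings returns one score for the whole set. Because the score depends on exact n-gram overlap with the references, a model can score high on data resembling its training set and much lower on a test set drawn from a different domain, which is consistent with the gap the abstract reports.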

Anthology ID: 2024.osact-1.13
Volume: Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
Month: May
Year: 2024
Address: Torino, Italia
Editors: Hend Al-Khalifa, Kareem Darwish, Hamdy Mubarak, Mona Ali, Tamer Elsayed
Venues: OSACT | WS
SIG: SIGARAB
Publisher: ELRA and ICCL
Pages: 104–111
URL: https://aclanthology.org/2024.osact-1.13/
Cite (ACL): Omer Nacar, Abdullah Alharbi, Serry Sibaee, Samar Ahmed, Lahouari Ghouti, and Anis Koubaa. 2024. ASOS at OSACT6 Shared Task: Investigation of Data Augmentation in Arabic Dialect-MSA Translation. In Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024, pages 104–111, Torino, Italia. ELRA and ICCL.
Cite (Informal): ASOS at OSACT6 Shared Task: Investigation of Data Augmentation in Arabic Dialect-MSA Translation (Nacar et al., OSACT 2024)
PDF: https://aclanthology.org/2024.osact-1.13.pdf
