LLM-based MT Data Creation: Dialectal to MSA Translation Shared Task

AhmedElmogtaba Abdelmoniem Ali Abdelaziz; Ashraf Hatim Elneima; Kareem Darwish

Abstract

This paper presents our approach to the Dialect to Modern Standard Arabic (MSA) Machine Translation shared task, conducted as part of the sixth Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT6). Our primary contribution is the development of a novel dataset derived from the Saudi Audio Dataset for Arabic (SADA), an Arabic audio corpus. By employing an automated method utilizing ChatGPT 3.5, we translated the dialectal Arabic texts to their MSA equivalents. This process not only yielded a unique and valuable dataset but also showcased an efficient method for leveraging language models in dataset generation. Utilizing this dataset, alongside additional resources, we trained a machine translation model based on the Transformer architecture. Through systematic experimentation with model configurations, we achieved notable improvements in translation quality. Our findings highlight the significance of LLM-assisted dataset creation methodologies and their impact on advancing machine translation systems, particularly for languages with considerable dialectal diversity like Arabic.

Anthology ID: 2024.osact-1.14
Volume: Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
Month: May
Year: 2024
Address: Torino, Italia
Editors: Hend Al-Khalifa, Kareem Darwish, Hamdy Mubarak, Mona Ali, Tamer Elsayed
Venues: OSACT | WS
SIG: SIGARAB
Publisher: ELRA and ICCL
Pages: 112–116
URL: https://aclanthology.org/2024.osact-1.14/
Cite (ACL): AhmedElmogtaba Abdelmoniem Ali Abdelaziz, Ashraf Hatim Elneima, and Kareem Darwish. 2024. LLM-based MT Data Creation: Dialectal to MSA Translation Shared Task. In Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024, pages 112–116, Torino, Italia. ELRA and ICCL.
Cite (Informal): LLM-based MT Data Creation: Dialectal to MSA Translation Shared Task (Abdelaziz et al., OSACT 2024)
PDF: https://aclanthology.org/2024.osact-1.14.pdf
