Dial2MSA-Verified: A Multi-Dialect Arabic Social Media Dataset for Neural Machine Translation to Modern Standard Arabic

Abdullah Salem Khered, Youcef Benkhedda, Riza Batista-Navarro


Abstract
Social media has become an essential focus for Natural Language Processing (NLP) research due to its widespread use and unique linguistic characteristics. Normalising social media content, especially for morphologically rich languages like Arabic, remains a complex task due to limited parallel corpora. Arabic encompasses Modern Standard Arabic (MSA) and various regional dialects, collectively termed Dialectal Arabic (DA), whose informal nature and variability complicate NLP efforts. This paper presents Dial2MSA-Verified, an extension of the Dial2MSA dataset that includes verified translations for Gulf, Egyptian, Levantine, and Maghrebi dialects. We evaluate the performance of Seq2Seq models on this dataset, highlighting the effectiveness of state-of-the-art models in translating regional Arabic dialects into MSA. We also provide insights through error analysis and outline future directions for enhancing Seq2Seq models and dataset development. The Dial2MSA-Verified dataset is publicly available to support further research.
Anthology ID:
2025.wacl-1.6
Volume:
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Saad Ezzini, Hamza Alami, Ismail Berrada, Abdessamad Benlahbib, Abdelkader El Mahdaouy, Salima Lamsiyah, Hatim Derrouz, Amal Haddad Haddad, Mustafa Jarrar, Mo El-Haj, Ruslan Mitkov, Paul Rayson
Venues:
WACL | WS
Publisher:
Association for Computational Linguistics
Pages:
50–62
URL:
https://aclanthology.org/2025.wacl-1.6/
Cite (ACL):
Abdullah Salem Khered, Youcef Benkhedda, and Riza Batista-Navarro. 2025. Dial2MSA-Verified: A Multi-Dialect Arabic Social Media Dataset for Neural Machine Translation to Modern Standard Arabic. In Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4), pages 50–62, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Dial2MSA-Verified: A Multi-Dialect Arabic Social Media Dataset for Neural Machine Translation to Modern Standard Arabic (Khered et al., WACL 2025)
PDF:
https://aclanthology.org/2025.wacl-1.6.pdf