Team JUST at the MADAR Shared Task on Arabic Fine-Grained Dialect Identification

Bashar Talafha, Ali Fadel, Mahmoud Al-Ayyoub, Yaser Jararweh, Mohammad AL-Smadi, Patrick Juola


Abstract
In this paper, we describe our team’s effort on the MADAR Shared Task on Arabic Fine-Grained Dialect Identification. The task requires building a system capable of differentiating between 25 Arabic dialects in addition to Modern Standard Arabic (MSA). Our approach is simple. After preprocessing the data, we use Data Augmentation (DA) to enlarge the training data six times. We then build a language model, extract word-level and character-level n-gram TF-IDF features, and feed them into a Multinomial Naive Bayes (MNB) classifier. Despite its simplicity, the resulting model performs well, achieving the 4th highest F-measure and region-level accuracy and the 5th highest precision, recall, city-level accuracy, and country-level accuracy among the participating teams.
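The feature-extraction and classification step named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the n-gram ranges, the smoothed IDF formula, the smoothing constant `alpha`, the helper names `ngrams` and `TfidfMNB`, and the four hand-written toy sentences with their `CAI`/`BEI` labels are all assumptions made for illustration, and the data augmentation and language-model steps are left out because the abstract does not detail them.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, word_n=(1, 2), char_n=(1, 3)):
    """Word-level and character-level n-grams (ranges are assumed, not from the paper)."""
    words = text.split()
    feats = []
    for n in range(word_n[0], word_n[1] + 1):
        feats += ["w:" + " ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    for n in range(char_n[0], char_n[1] + 1):
        feats += ["c:" + text[i:i + n] for i in range(len(text) - n + 1)]
    return feats


class TfidfMNB:
    """Multinomial Naive Bayes trained on TF-IDF-weighted n-gram counts."""

    def fit(self, texts, labels, alpha=1.0):
        docs = [Counter(ngrams(t)) for t in texts]
        n_docs = len(docs)
        # Document frequency: number of training sentences containing each feature.
        doc_freq = Counter(f for d in docs for f in d)
        # Smoothed inverse document frequency (one common variant; an assumption here).
        self.idf = {f: math.log((1 + n_docs) / (1 + c)) + 1 for f, c in doc_freq.items()}
        self.classes = sorted(set(labels))
        # Accumulate TF-IDF mass per class, used as fractional "counts".
        weight = {c: defaultdict(float) for c in self.classes}
        for d, y in zip(docs, labels):
            for f, tf in d.items():
                weight[y][f] += tf * self.idf[f]
        self.log_prior = {c: math.log(labels.count(c) / n_docs) for c in self.classes}
        self.log_lik = {}
        for c in self.classes:
            total = sum(weight[c].values()) + alpha * len(self.idf)
            # Laplace-style smoothing so unseen class/feature pairs keep nonzero probability.
            self.log_lik[c] = {f: math.log((weight[c][f] + alpha) / total) for f in self.idf}
        return self

    def predict(self, text):
        d = Counter(ngrams(text))

        def score(c):
            # Features never seen in training are ignored.
            return self.log_prior[c] + sum(
                tf * self.idf[f] * self.log_lik[c][f] for f, tf in d.items() if f in self.idf
            )

        return max(self.classes, key=score)


# Hand-written toy sentences (NOT MADAR data); labels are illustrative city codes.
train_texts = ["ازيك عامل ايه", "عايز اروح المطار", "كيفك شو اخبارك", "بدي روح عالمطار"]
train_labels = ["CAI", "CAI", "BEI", "BEI"]

model = TfidfMNB().fit(train_texts, train_labels)
print(model.predict("عامل ايه"))
print(model.predict("شو اخبارك"))
```

In practice a library implementation (for example a TF-IDF vectorizer combined with a multinomial Naive Bayes estimator) would replace the hand-rolled class; the sketch only makes explicit how word-level and character-level n-gram weights feed a single multinomial model.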
Anthology ID:
W19-4638
Volume:
Proceedings of the Fourth Arabic Natural Language Processing Workshop
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Wassim El-Hajj, Lamia Hadrich Belguith, Fethi Bougares, Walid Magdy, Imed Zitouni, Nadi Tomeh, Mahmoud El-Haj, Wajdi Zaghouani
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
285–289
URL:
https://aclanthology.org/W19-4638
DOI:
10.18653/v1/W19-4638
Cite (ACL):
Bashar Talafha, Ali Fadel, Mahmoud Al-Ayyoub, Yaser Jararweh, Mohammad AL-Smadi, and Patrick Juola. 2019. Team JUST at the MADAR Shared Task on Arabic Fine-Grained Dialect Identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 285–289, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Team JUST at the MADAR Shared Task on Arabic Fine-Grained Dialect Identification (Talafha et al., WANLP 2019)
PDF:
https://aclanthology.org/W19-4638.pdf