Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification

Bashar Talafha, Wael Farhan, Ahmed Altakrouri, Hussein Al-Natsheh


Abstract
Arabic dialect identification is an inherently complex problem, as Arabic dialect taxonomy is convoluted and aims to dissect a continuous space rather than a discrete one. In this work, we present machine learning and deep learning approaches to predict 21 fine-grained dialects from a set of given tweets per user. We adopted numerous feature extraction methods, most of which showed improvement in the final model, such as word embeddings, TF-IDF, and other tweet features. Our results show that a simple LinearSVC can outperform any complex deep learning model given a set of curated features. With a relatively complex user voting mechanism, we were able to achieve a Macro-Averaged F1-score of 71.84% on MADAR shared subtask-2. Our best submitted model ranked second out of all participating teams.
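The abstract does not spell out the voting mechanism, so the following is only a minimal sketch of the general idea of turning per-tweet predictions into one label per user. It assumes a plain majority vote as a stand-in for the "relatively complex" mechanism the authors describe. The function name `vote_user_dialects`, the user IDs, and the label strings are all hypothetical, and the per-tweet labels are assumed to come from some upstream classifier (for example a LinearSVC over TF-IDF features).

```python
from collections import Counter, defaultdict


def vote_user_dialects(tweet_predictions):
    """Aggregate per-tweet dialect labels into one label per user.

    tweet_predictions: iterable of (user_id, predicted_label) pairs.
    Assumption: a simple majority vote; ties go to the label that
    appeared first for that user. This is NOT the paper's mechanism.
    """
    labels_by_user = defaultdict(list)
    for user_id, label in tweet_predictions:
        labels_by_user[user_id].append(label)

    # Counter.most_common keeps insertion order among equal counts,
    # which gives the first-seen tie-break described above.
    return {
        user_id: Counter(labels).most_common(1)[0][0]
        for user_id, labels in labels_by_user.items()
    }


# Made-up per-tweet predictions, for illustration only.
predictions = [
    ("user_a", "Egypt"), ("user_a", "Egypt"), ("user_a", "Jordan"),
    ("user_b", "Saudi_Arabia"), ("user_b", "Kuwait"), ("user_b", "Kuwait"),
]
user_labels = vote_user_dialects(predictions)
```

A majority vote treats every tweet equally; a more elaborate scheme could, for instance, weight each tweet by classifier confidence, but which refinements the authors used is not stated in the abstract.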
Anthology ID:
W19-4629
Volume:
Proceedings of the Fourth Arabic Natural Language Processing Workshop
Month:
August
Year:
2019
Address:
Florence, Italy
Venues:
ACL | WANLP | WS
Publisher:
Association for Computational Linguistics
Pages:
239–243
URL:
https://aclanthology.org/W19-4629
DOI:
10.18653/v1/W19-4629
Cite (ACL):
Bashar Talafha, Wael Farhan, Ahmed Altakrouri, and Hussein Al-Natsheh. 2019. Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 239–243, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification (Talafha et al., 2019)
PDF:
https://aclanthology.org/W19-4629.pdf