Identifying Nuanced Dialect for Arabic Tweets with Deep Learning and Reverse Translation Corpus Extension System

Rawan Tahssin, Youssef Kishk, Marwan Torki


Abstract
In this paper, we present our work for the NADI Shared Task (Abdul-Mageed and Habash, 2020), Nuanced Arabic Dialect Identification, Subtask 1: country-level dialect identification. We introduce a Reverse Translation Corpus Extension System (RTCES) to handle data imbalance, and we report results for several approaches to word and document representation and for different model architectures. The top-scoring model was based on AraBERT (Antoun et al., 2020), together with our extended corpus, built by reverse translation of the given Arabic tweets. The selected system achieved a macro-averaged F1 score of 20.34% on the test set, placing 7th out of 18 teams on the final leaderboard.
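The abstract does not spell out how the corpus is extended, so the following is only a minimal sketch of the general idea of using reverse translation against class imbalance, not the authors' RTCES implementation. The function name `extend_corpus`, the `back_translate` callable (standing in for any Arabic to pivot language to Arabic translation step), and the rule of adding at most one paraphrase per minority-class tweet are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (tweet text, country label)


def extend_corpus(
    samples: List[Sample],
    back_translate: Callable[[str], str],
) -> List[Sample]:
    """Add back-translated paraphrases for under-represented labels.

    Hypothetical helper: each tweet contributes at most one paraphrase,
    and a label stops growing once it reaches the size of the largest
    class in the original data.
    """
    counts = Counter(label for _, label in samples)
    target = max(counts.values())          # size of the largest class
    seen = {text for text, _ in samples}   # avoid exact duplicates
    extended = list(samples)

    for text, label in samples:
        if counts[label] >= target:
            continue                       # class is already large enough
        paraphrase = back_translate(text)  # assumed: Arabic -> pivot -> Arabic
        if paraphrase and paraphrase not in seen:
            extended.append((paraphrase, label))
            seen.add(paraphrase)
            counts[label] += 1

    return extended
```

In practice `back_translate` would wrap whatever machine translation system is available; the extended list can then be passed to any classifier's training routine in place of the original data.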
Anthology ID:
2020.wanlp-1.30
Volume:
Proceedings of the Fifth Arabic Natural Language Processing Workshop
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Imed Zitouni, Muhammad Abdul-Mageed, Houda Bouamor, Fethi Bougares, Mahmoud El-Haj, Nadi Tomeh, Wajdi Zaghouani
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
288–294
URL:
https://aclanthology.org/2020.wanlp-1.30
Cite (ACL):
Rawan Tahssin, Youssef Kishk, and Marwan Torki. 2020. Identifying Nuanced Dialect for Arabic Tweets with Deep Learning and Reverse Translation Corpus Extension System. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 288–294, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal):
Identifying Nuanced Dialect for Arabic Tweets with Deep Learning and Reverse Translation Corpus Extension System (Tahssin et al., WANLP 2020)
PDF:
https://aclanthology.org/2020.wanlp-1.30.pdf