Faheem at NADI shared task: Identifying the dialect of Arabic tweet

Nouf AlShenaifi, Aqil Azmi


Abstract
This paper describes Faheem (an adjective derived from "understand"), our submission to the NADI (Nuanced Arabic Dialect Identification) shared task. Because many Arabic dialects remain under-studied owing to scarce resources, the objective is to identify the country-level Arabic dialect used in a tweet. We propose a machine learning approach that uses word-level n-gram (n = 1 to 3) and tf-idf features and feeds them to six different classifiers. We train the system on a data set of 21,000 tweets, provided by the organizers, covering twenty-one Arab countries. Our top-performing classifiers are Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes.
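The feature extraction named in the abstract (word-level n-grams for n = 1 to 3, weighted by tf-idf) can be illustrated with a minimal standard-library sketch. The abstract does not give the tokenizer or the exact tf-idf formula, so the whitespace tokenization, the smoothed idf, the function names `word_ngrams` and `tfidf_vectors`, and the toy transliterated tweets below are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Return all word-level n-grams (n_min..n_max) of a whitespace-tokenized text."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(docs):
    """Map each document to a {ngram: tf-idf weight} dictionary.

    Assumption: raw term counts and a smoothed idf,
    ln((1 + N) / (1 + df)) + 1; the paper may weight differently.
    """
    grams_per_doc = [Counter(word_ngrams(d)) for d in docs]
    n_docs = len(docs)

    # Document frequency: number of documents containing each n-gram.
    df = Counter()
    for counts in grams_per_doc:
        df.update(counts.keys())

    vectors = []
    for counts in grams_per_doc:
        vectors.append({
            g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)
            for g, tf in counts.items()
        })
    return vectors


# Hypothetical toy tweets (transliterated placeholders, not NADI data).
tweets = ["shlonak ya sadiqi", "izzayak ya sadiqi"]
vectors = tfidf_vectors(tweets)
```

In this sketch, n-grams that occur in only one tweet (such as a dialect-specific greeting) receive a higher weight than n-grams shared by every tweet, which is the property that makes such features useful as classifier input. The resulting sparse vectors would then be passed to a classifier such as logistic regression or multinomial naive Bayes.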
Anthology ID:
2020.wanlp-1.29
Volume:
Proceedings of the Fifth Arabic Natural Language Processing Workshop
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Imed Zitouni, Muhammad Abdul-Mageed, Houda Bouamor, Fethi Bougares, Mahmoud El-Haj, Nadi Tomeh, Wajdi Zaghouani
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
282–287
URL:
https://aclanthology.org/2020.wanlp-1.29
Cite (ACL):
Nouf AlShenaifi and Aqil Azmi. 2020. Faheem at NADI shared task: Identifying the dialect of Arabic tweet. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 282–287, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal):
Faheem at NADI shared task: Identifying the dialect of Arabic tweet (AlShenaifi & Azmi, WANLP 2020)
PDF:
https://aclanthology.org/2020.wanlp-1.29.pdf