Machine Learning-Based Approach for Arabic Dialect Identification

Hamada Nayel, Ahmed Hassan, Mahmoud Sobhi, Ahmed El-Sawy


Abstract
This paper describes our systems submitted to the Second Nuanced Arabic Dialect Identification Shared Task (NADI 2021). Dialect identification is the task of automatically detecting the source variety of a given text or speech segment. The shared task comprises four subtasks: two for country-level identification and two for province-level identification. The data cover a total of 100 provinces from all 21 Arab countries and come from the Twitter domain. The proposed systems rely on five machine-learning classifiers: Complement Naïve Bayes, Support Vector Machine, Decision Tree, Logistic Regression, and Random Forest. The Naïve Bayes classifier achieved the highest macro-averaged F1 score of all the classifiers on both the development and test data.
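
The abstract compares the classifiers by macro-averaged F1, which is the unweighted mean of the per-class F1 scores, so a rare dialect counts as much as a frequent one. A minimal sketch of that computation follows; the function name `macro_f1` and the toy country codes are illustrative assumptions, not taken from the paper or from the NADI data.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy labels invented for illustration only (not NADI data).
gold = ["EG", "EG", "SA", "MA"]
pred = ["EG", "SA", "SA", "MA"]
score = macro_f1(gold, pred)
```

Because every class contributes equally to the mean, a classifier that does well only on the most common dialects is penalized, which is presumably why a class-balanced metric suits a task with 21 countries and 100 provinces.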
Anthology ID:
2021.wanlp-1.34
Volume:
Proceedings of the Sixth Arabic Natural Language Processing Workshop
Month:
April
Year:
2021
Address:
Kyiv, Ukraine (Virtual)
Editors:
Nizar Habash, Houda Bouamor, Hazem Hajj, Walid Magdy, Wajdi Zaghouani, Fethi Bougares, Nadi Tomeh, Ibrahim Abu Farha, Samia Touileb
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
287–290
URL:
https://aclanthology.org/2021.wanlp-1.34
Cite (ACL):
Hamada Nayel, Ahmed Hassan, Mahmoud Sobhi, and Ahmed El-Sawy. 2021. Machine Learning-Based Approach for Arabic Dialect Identification. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 287–290, Kyiv, Ukraine (Virtual). Association for Computational Linguistics.
Cite (Informal):
Machine Learning-Based Approach for Arabic Dialect Identification (Nayel et al., WANLP 2021)
PDF:
https://aclanthology.org/2021.wanlp-1.34.pdf