Dialect Identification in Nuanced Arabic Tweets Using Farasa Segmentation and AraBERT

Anshul Wadhawan


Abstract
This paper presents our approach to the EACL WANLP-2021 Shared Task 1: Nuanced Arabic Dialect Identification (NADI). The task is to develop a system that identifies the geographical location (country or province) from which an Arabic tweet, written in Modern Standard Arabic (MSA) or dialectal Arabic (DA), originates. We solve the task in two parts. The first part involves pre-processing the provided dataset by cleaning, adding and segmenting various parts of the text. The second part consists of experiments with different versions of two Transformer-based models, AraBERT and AraELECTRA. Our final approach achieved macro F1-scores of 0.216, 0.235, 0.054, and 0.043 in the four subtasks, and we were ranked second in the MSA identification subtasks and fourth in the DA identification subtasks.
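The macro F1-score reported above averages per-class F1 with equal weight for every class, so rare countries or provinces count as much as frequent ones. The following is a minimal sketch of that metric in plain Python, not the shared task's official scorer; the function name `macro_f1` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred.

    Illustrative helper only (hypothetical name); not the official NADI scorer.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed

    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up labels (not data from the shared task).
gold = ["Egypt", "Egypt", "Iraq", "Oman"]
pred = ["Egypt", "Iraq", "Iraq", "Egypt"]
score = macro_f1(gold, pred)
```

In this toy case the per-class F1 values are 0.5 (Egypt), 2/3 (Iraq) and 0 (Oman), so a class that is never predicted correctly pulls the average down sharply. That is one reason macro F1 tends to be low when there are many fine-grained classes.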
Anthology ID:
2021.wanlp-1.35
Volume:
Proceedings of the Sixth Arabic Natural Language Processing Workshop
Month:
April
Year:
2021
Address:
Kyiv, Ukraine (Virtual)
Editors:
Nizar Habash, Houda Bouamor, Hazem Hajj, Walid Magdy, Wajdi Zaghouani, Fethi Bougares, Nadi Tomeh, Ibrahim Abu Farha, Samia Touileb
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
291–295
URL:
https://aclanthology.org/2021.wanlp-1.35
Cite (ACL):
Anshul Wadhawan. 2021. Dialect Identification in Nuanced Arabic Tweets Using Farasa Segmentation and AraBERT. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 291–295, Kyiv, Ukraine (Virtual). Association for Computational Linguistics.
Cite (Informal):
Dialect Identification in Nuanced Arabic Tweets Using Farasa Segmentation and AraBERT (Wadhawan, WANLP 2021)
PDF:
https://aclanthology.org/2021.wanlp-1.35.pdf