Deep Models for Arabic Dialect Identification on Benchmarked Data

Mohamed Elaraby; Muhammad Abdul-Mageed


Abstract

The Arabic Online Commentary (AOC) (Zaidan and Callison-Burch, 2011) is a large-scale repository of Arabic dialects with manual labels for 4 varieties of the language. Existing dialect identification models exploiting the dataset pre-date the recent boost deep learning brought to NLP, and hence the data are not benchmarked for use with deep learning, nor is it clear how much neural networks can help tease the categories in the data apart. We address these two limitations: we (1) benchmark the data, and (2) empirically test 6 different deep learning methods on the task, comparing performance to several classical machine learning models under different conditions (i.e., both binary and multi-way classification). Our experimental results show that variants of (attention-based) bidirectional recurrent neural networks achieve the best accuracy (acc) on the task, significantly outperforming all competitive baselines. On blind test data, our models reach 87.65% acc on the binary task (MSA vs. dialects), 87.4% acc on the 3-way dialect task (Egyptian vs. Gulf vs. Levantine), and 82.45% acc on the 4-way variants task (MSA vs. Egyptian vs. Gulf vs. Levantine). We release our benchmark for future work on the dataset.

Anthology ID: W18-3930
Volume: Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Month: August
Year: 2018
Address: Santa Fe, New Mexico, USA
Editors: Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi, Ahmed Ali
Venue: VarDial
Publisher: Association for Computational Linguistics
Pages: 263–274
URL: https://aclanthology.org/W18-3930/
Cite (ACL): Mohamed Elaraby and Muhammad Abdul-Mageed. 2018. Deep Models for Arabic Dialect Identification on Benchmarked Data. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pages 263–274, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal): Deep Models for Arabic Dialect Identification on Benchmarked Data (Elaraby & Abdul-Mageed, VarDial 2018)
PDF: https://aclanthology.org/W18-3930.pdf
