Arabic Dialect Identification in Speech Transcripts

Shervin Malmasi, Marcos Zampieri


Abstract
In this paper we describe a system developed to identify a set of four regional Arabic dialects (Egyptian, Gulf, Levantine, North African) and Modern Standard Arabic (MSA) in a transcribed speech corpus. We competed under the team name MAZA in the Arabic Dialect Identification sub-task of the 2016 Discriminating between Similar Languages (DSL) shared task. Our system achieved an F1-score of 0.51 in the closed training track, ranking first among the 18 teams that participated in the sub-task. Our system utilizes a classifier ensemble with a set of linear models as base classifiers. We experimented with three different ensemble fusion strategies, with the mean probability approach providing the best performance.
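The mean probability fusion mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the label codes, the function name `mean_probability_fusion`, and the probability values are hypothetical, and the base classifiers are assumed to each output a probability distribution over the five classes. The rule averages the per-class probabilities across the base classifiers and predicts the class with the highest mean.

```python
from typing import List, Sequence, Tuple

# Hypothetical label codes for the five classes named in the abstract:
# Egyptian, Gulf, Levantine, North African, Modern Standard Arabic.
LABELS = ["EGY", "GLF", "LAV", "NOR", "MSA"]


def mean_probability_fusion(
    prob_rows: Sequence[Sequence[float]],
    labels: Sequence[str] = LABELS,
) -> Tuple[str, List[float]]:
    """Fuse base-classifier outputs by averaging class probabilities.

    prob_rows holds one probability distribution per base classifier,
    each ordered like `labels`. Returns the winning label and the
    per-class mean probabilities.
    """
    if not prob_rows:
        raise ValueError("need at least one base classifier output")
    n_classes = len(labels)
    for row in prob_rows:
        if len(row) != n_classes:
            raise ValueError("each row must have one probability per label")

    n_models = len(prob_rows)
    # Average the probability assigned to each class across all models.
    means = [sum(row[i] for row in prob_rows) / n_models for i in range(n_classes)]
    # Predict the class with the highest mean probability.
    best = max(range(n_classes), key=means.__getitem__)
    return labels[best], means


if __name__ == "__main__":
    # Made-up outputs of three base classifiers for a single transcript.
    outputs = [
        [0.50, 0.20, 0.10, 0.10, 0.10],
        [0.10, 0.45, 0.25, 0.10, 0.10],
        [0.30, 0.35, 0.15, 0.10, 0.10],
    ]
    label, _ = mean_probability_fusion(outputs)
    print(label)
```

Unlike a vote over hard predictions, averaging lets a confident base classifier outweigh several uncertain ones, since each classifier contributes its full distribution rather than a single label.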
Anthology ID: W16-4814
Volume: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month: December
Year: 2016
Address: Osaka, Japan
Editors: Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue: VarDial
Publisher: The COLING 2016 Organizing Committee
Pages: 106–113
URL: https://aclanthology.org/W16-4814/
Cite (ACL): Shervin Malmasi and Marcos Zampieri. 2016. Arabic Dialect Identification in Speech Transcripts. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 106–113, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal): Arabic Dialect Identification in Speech Transcripts (Malmasi & Zampieri, VarDial 2016)
PDF: https://aclanthology.org/W16-4814.pdf