Arabic Language WEKA-Based Dialect Classifier for Arabic Automatic Speech Recognition Transcripts

Areej Alshutayri, Eric Atwell, Abdulrahman Alosaimy, James Dickins, Michael Ingleby, Janet Watson


Abstract
This paper describes an Arabic dialect identification system which we developed for the Discriminating Similar Languages (DSL) 2016 shared task. We classified Arabic dialects by using Waikato Environment for Knowledge Analysis (WEKA) data analytic tool which contains many alternative filters and classifiers for machine learning. We experimented with several classifiers and the best accuracy was achieved using the Sequential Minimal Optimization (SMO) algorithm for training and testing process set to three different feature-sets for each testing process. Our approach achieved an accuracy equal to 42.85% which is considerably worse in comparison to the evaluation scores on the training set of 80-90% and with training set “60:40” percentage split which achieved accuracy around 50%. We observed that Buckwalter transcripts from the Saarland Automatic Speech Recognition (ASR) system are given without short vowels, though the Buckwalter system has notation for these. We elaborate such observations, describe our methods and analyse the training dataset.
Anthology ID:
W16-4826
Volume:
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month:
December
Year:
2016
Address:
Osaka, Japan
Venues:
VarDial | WS
SIG:
Publisher:
The COLING 2016 Organizing Committee
Note:
Pages:
204–211
Language:
URL:
https://aclanthology.org/W16-4826
DOI:
Bibkey:
Copy Citation:
PDF:
https://aclanthology.org/W16-4826.pdf