Maha J. Althobaiti


2021

Country-level Arabic Dialect Identification Using Small Datasets with Integrated Machine Learning Techniques and Deep Learning Models
Maha J. Althobaiti
Proceedings of the Sixth Arabic Natural Language Processing Workshop

Arabic is characterised by a considerable number of varieties, including spoken dialects. In this paper, we present the models we developed to participate in NADI Subtask 1.2, which requires building a system to distinguish between 21 country-level dialects. We investigated several classical machine learning approaches and deep learning models using small datasets. We examined a technique for integrating two machine learning approaches. Additionally, we automatically created dictionaries from labelled datasets using Pointwise Mutual Information, which enriched the feature space when training models. A semi-supervised learning approach was also examined and compared to other methods that exploit large unlabelled datasets, such as building pre-trained word embeddings. Our winning model was the Support Vector Machine with dictionary-based features and Pointwise Mutual Information values, achieving a macro-averaged F1-score of 18.94%.
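
The abstract mentions dictionaries built automatically from labelled data using Pointwise Mutual Information (PMI). A minimal sketch of that idea follows, assuming whitespace tokenisation and token-level counts. The function names `build_pmi_dictionaries` and `dictionary_features`, the `top_k` cutoff, and the toy tokens are hypothetical illustrations, not the paper's actual implementation or data.

```python
import math
from collections import Counter, defaultdict


def build_pmi_dictionaries(samples, top_k=2):
    """Build one word list per label, ranked by PMI(word, label).

    samples: iterable of (text, label) pairs (assumed format).
    Returns {label: [(word, pmi), ...]} keeping the top_k words per label.
    """
    word_counts = Counter()
    label_counts = Counter()   # number of tokens seen under each label
    joint_counts = Counter()
    total = 0
    for text, label in samples:
        for word in text.split():  # assumption: whitespace tokenisation
            word_counts[word] += 1
            label_counts[label] += 1
            joint_counts[(word, label)] += 1
            total += 1

    scored = defaultdict(list)
    for (word, label), joint in joint_counts.items():
        # PMI = log2( p(word, label) / (p(word) * p(label)) )
        pmi = math.log2(joint * total / (word_counts[word] * label_counts[label]))
        scored[label].append((word, pmi))

    # Sort by descending PMI, breaking ties alphabetically for determinism.
    return {
        label: sorted(pairs, key=lambda p: (-p[1], p[0]))[:top_k]
        for label, pairs in scored.items()
    }


def dictionary_features(text, dictionaries):
    """Sum the PMI values of dictionary words found in text, one value per label."""
    tokens = text.split()
    features = {}
    for label, entries in dictionaries.items():
        lookup = dict(entries)
        features[label] = sum(lookup.get(tok, 0.0) for tok in tokens)
    return features


# Toy placeholder tokens and labels, not real dialect data.
toy = [("a b", "X"), ("a c", "Y")]
dictionaries = build_pmi_dictionaries(toy)
features = dictionary_features("b a", dictionaries)
```

Per-label scores of this kind could be appended to other features before training a classifier such as a Support Vector Machine; how the paper combines them is not specified in the abstract.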