Applying Multilingual and Monolingual Transformer-Based Models for Dialect Identification

Cristian Popa, Vlad Ștefănescu


Abstract
We study the ability of large fine-tuned transformer models to solve a binary dialect identification task, with a special interest in comparing the performance of multilingual models with that of monolingual ones. The corpus contains Romanian and Moldavian samples from the news domain, as well as tweets used for assessing performance. We find that the monolingual models outperform the multilingual ones, and that the best results are obtained using an SVM ensemble of 5 different transformer-based models. We provide our experimental results and an analysis of the attention mechanisms of the best-performing individual classifiers to explain their decisions. The code we used was released under an open-source license.
Anthology ID:
2020.vardial-1.18
Volume:
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venues:
COLING | VarDial
Publisher:
International Committee on Computational Linguistics (ICCL)
Pages:
193–201
URL:
https://aclanthology.org/2020.vardial-1.18
Cite (ACL):
Cristian Popa and Vlad Ștefănescu. 2020. Applying Multilingual and Monolingual Transformer-Based Models for Dialect Identification. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 193–201, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).
Cite (Informal):
Applying Multilingual and Monolingual Transformer-Based Models for Dialect Identification (Popa & Ștefănescu, VarDial 2020)
PDF:
https://aclanthology.org/2020.vardial-1.18.pdf