Discriminating between standard Romanian and Moldavian tweets using filtered character ngrams

Andrea Ceolin, Hong Zhang


Abstract
We applied word unigram models, character ngram models, and CNNs to the task of distinguishing tweets in two related dialects of Romanian (standard Romanian and Moldavian) for the VarDial 2020 RDI shared task (Gaman et al. 2020). The main challenge of the task was cross-genre text classification: the models had to be trained on text from news articles and then used to classify tweets. Our best model was a Naive Bayes model trained on character ngrams, with the most common ngrams filtered out. We also applied SVMs and CNNs, but while they yielded the best performance on an evaluation dataset of news articles, their accuracy dropped significantly when they were used to classify tweets. Our best model reached an F1 score of 0.715 on the evaluation dataset of tweets and 0.667 on the held-out test dataset. The model placed third in the shared task.
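As a rough illustration of the best-performing approach named in the abstract (Naive Bayes over character ngrams, with the most common ngrams filtered out), here is a minimal sketch in plain Python. It is not the authors' implementation: the ngram range (2 to 4), the filtering rule (drop the `top_k` most frequent ngrams overall), the add-one smoothing, the class name `FilteredNgramNB`, and the toy English training strings are all assumptions made for the example. The abstract does not specify any of these details.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return every character ngram of length n_min..n_max found in text."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


class FilteredNgramNB:
    """Multinomial Naive Bayes over character ngrams (hypothetical sketch).

    The top_k most frequent ngrams in the training data are discarded
    before the model is estimated; this filtering rule is an assumption.
    """

    def __init__(self, top_k=2, alpha=1.0):
        self.top_k = top_k    # number of most common ngrams to drop
        self.alpha = alpha    # additive (Laplace) smoothing constant

    def fit(self, texts, labels):
        total = Counter()     # ngram counts over all classes
        per_class = {}        # ngram counts per class
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            total.update(grams)
            per_class.setdefault(label, Counter()).update(grams)

        # Filter out the most common ngrams, keep the rest as vocabulary.
        self.filtered = {g for g, _ in total.most_common(self.top_k)}
        self.vocab = set(total) - self.filtered

        n_docs = len(labels)
        self.log_prior = {
            c: math.log(k / n_docs) for c, k in Counter(labels).items()
        }
        self.log_lik = {}
        for c, counts in per_class.items():
            denom = (sum(counts[g] for g in self.vocab)
                     + self.alpha * len(self.vocab))
            self.log_lik[c] = {
                g: math.log((counts[g] + self.alpha) / denom)
                for g in self.vocab
            }
        return self

    def predict(self, text):
        # Ngrams outside the vocabulary (unseen or filtered) are ignored.
        grams = [g for g in char_ngrams(text) if g in self.vocab]
        scores = {
            c: self.log_prior[c] + sum(self.log_lik[c][g] for g in grams)
            for c in self.log_prior
        }
        return max(scores, key=scores.get)


# Toy data invented for this example; it is not drawn from MOROCO.
train_texts = ["the cat sat", "the cat ran", "the dog sat", "the dog ran"]
train_labels = ["A", "A", "B", "B"]
model = FilteredNgramNB(top_k=2).fit(train_texts, train_labels)
print(model.predict("cat"), model.predict("dog"))
```

In a cross-genre setting like the one described, the intuition behind such filtering would be that very frequent ngrams may carry more genre signal than dialect signal, though this reading is an interpretation rather than a claim made in the abstract.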
Anthology ID:
2020.vardial-1.25
Volume:
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venues:
COLING | VarDial
Publisher:
International Committee on Computational Linguistics (ICCL)
Pages:
265–272
URL:
https://aclanthology.org/2020.vardial-1.25
Cite (ACL):
Andrea Ceolin and Hong Zhang. 2020. Discriminating between standard Romanian and Moldavian tweets using filtered character ngrams. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 265–272, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).
Cite (Informal):
Discriminating between standard Romanian and Moldavian tweets using filtered character ngrams (Ceolin & Zhang, VarDial 2020)
PDF:
https://aclanthology.org/2020.vardial-1.25.pdf
Code:
AndreaCeolin/VarDial2020
Data:
MOROCO