HeLju@VarDial 2020: Social Media Variety Geolocation with BERT Models

Yves Scherrer, Nikola Ljubešić


Abstract
This paper describes the Helsinki-Ljubljana contribution to the VarDial shared task on social media variety geolocation. Our solutions are based on BERT Transformer models. The constrained versions of our models reach 1st place in two subtasks and 3rd place in one subtask, while our unconstrained models outperform all the constrained systems by a large margin. Our analyses show that Transformer-based models outperform traditional models by far, and that the improvements obtained by pre-training models on large quantities of (mostly standard) text are significant but not drastic, with single-language models also outperforming multilingual models. Our manual analysis shows that two types of signals are the most crucial for a (mis)prediction: named entities and dialectal features, both of which are handled well by our models.
Anthology ID:
2020.vardial-1.19
Volume:
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venues:
COLING | VarDial
Publisher:
International Committee on Computational Linguistics (ICCL)
Pages:
202–211
URL:
https://aclanthology.org/2020.vardial-1.19
Cite (ACL):
Yves Scherrer and Nikola Ljubešić. 2020. HeLju@VarDial 2020: Social Media Variety Geolocation with BERT Models. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 202–211, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).
Cite (Informal):
HeLju@VarDial 2020: Social Media Variety Geolocation with BERT Models (Scherrer & Ljubešić, VarDial 2020)
PDF:
https://aclanthology.org/2020.vardial-1.19.pdf