Brandeis at VarDial 2024 DSL-ML Shared Task: Multilingual Models, Simple Baselines and Data Augmentation

Jonne Sälevä, Chester Palen-Michel


Abstract
This paper describes the Brandeis University submission to VarDial 2024 DSL-ML Shared Task on multilabel classification for discriminating between similar languages. Our submission consists of three entries per language to the closed track, where no additional data was permitted. Our approach involves a set of simple non-neural baselines using logistic regression, random forests and support vector machines. We follow this by experimenting with finetuning multilingual BERT, either on a single language or all the languages concatenated together.In addition to benchmarking the model architectures against one another on the development set, we perform extensive hyperparameter tuning, which is afforded by the small size of the training data.Our experiments on the development set suggest that finetuned mBERT systems significantly benefit most languages compared to the baseline.However, on the test set, our results indicate that simple models based on scikit-learn can perform surprisingly well and even outperform pretrained language models, as we see with BCMS.Our submissions achieve the best performance on all languages as reported by the organizers. Except for Spanish and French, our non-neural baseline also ranks in the top 3 for all other languages.
Anthology ID:
2024.vardial-1.22
Volume:
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Marcos Zampieri, Preslav Nakov, Jörg Tiedemann
Venues:
VarDial | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
241–251
Language:
URL:
https://aclanthology.org/2024.vardial-1.22
DOI:
10.18653/v1/2024.vardial-1.22
Bibkey:
Cite (ACL):
Jonne Sälevä and Chester Palen-Michel. 2024. Brandeis at VarDial 2024 DSL-ML Shared Task: Multilingual Models, Simple Baselines and Data Augmentation. In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), pages 241–251, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Brandeis at VarDial 2024 DSL-ML Shared Task: Multilingual Models, Simple Baselines and Data Augmentation (Sälevä & Palen-Michel, VarDial-WS 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.vardial-1.22.pdf
Supplementary material:
 2024.vardial-1.22.SupplementaryMaterial.txt