Murreviikko - A Dialectologically Annotated and Normalized Dataset of Finnish Tweets

Olli Kuparinen


Abstract
This paper presents Murreviikko, a dataset of dialectal Finnish tweets that have been dialectologically annotated and manually normalized to a standard form. The dataset can serve, for instance, as a test set for dialect identification and dialect-to-standard normalization. We evaluate the dataset on the normalization task, comparing an existing normalization model built on a spoken dialect corpus with three newly trained models of different architectures. We find that normalization difficulty differs significantly between the dialects, and that a character-level statistical machine translation model performs best on the Murreviikko tweet dataset.
Anthology ID:
2023.vardial-1.3
Volume:
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
31–39
URL:
https://aclanthology.org/2023.vardial-1.3
DOI:
10.18653/v1/2023.vardial-1.3
Cite (ACL):
Olli Kuparinen. 2023. Murreviikko - A Dialectologically Annotated and Normalized Dataset of Finnish Tweets. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 31–39, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Murreviikko - A Dialectologically Annotated and Normalized Dataset of Finnish Tweets (Kuparinen, VarDial 2023)
PDF:
https://aclanthology.org/2023.vardial-1.3.pdf
Video:
https://aclanthology.org/2023.vardial-1.3.mp4