Dialect Representation Learning with Neural Dialect-to-Standard Normalization

Olli Kuparinen, Yves Scherrer


Abstract
Language label tokens are often used in multilingual neural language modeling and sequence-to-sequence learning to enhance the performance of such models. An additional product of the technique is that the models learn representations of the language tokens, which in turn reflect the relationships between the languages. In this paper, we study the learned representations of dialects produced by neural dialect-to-standard normalization models. We use two large datasets of typologically different languages, namely Finnish and Norwegian, and evaluate the learned representations against traditional dialect divisions of both languages. We find that the inferred dialect embeddings correlate well with the traditional dialects. The methodology could be further used in noisier settings to find new insights into language variation.
Anthology ID: 2023.vardial-1.20
Volume: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venue: VarDial
Publisher: Association for Computational Linguistics
Pages: 200–212
URL: https://aclanthology.org/2023.vardial-1.20
DOI: 10.18653/v1/2023.vardial-1.20
Cite (ACL): Olli Kuparinen and Yves Scherrer. 2023. Dialect Representation Learning with Neural Dialect-to-Standard Normalization. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 200–212, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): Dialect Representation Learning with Neural Dialect-to-Standard Normalization (Kuparinen & Scherrer, VarDial 2023)
PDF: https://aclanthology.org/2023.vardial-1.20.pdf
Video: https://aclanthology.org/2023.vardial-1.20.mp4