Tuuli Tuisk
2022
Machine Translation for Livonian: Catering to 20 Speakers
Matīss Rikters
|
Marili Tomingas
|
Tuuli Tuisk
|
Valts Ernštreits
|
Mark Fishel
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Livonian is one of the most endangered languages in Europe with just a tiny handful of speakers and virtually no publicly available corpora. In this paper we tackle the task of developing neural machine translation (NMT) between Livonian and English, with a two-fold aim: on one hand, preserving the language and on the other – enabling access to Livonian folklore, lifestories and other textual intangible heritage as well as making it easier to create further parallel corpora. We rely on Livonian’s linguistic similarity to Estonian and Latvian and collect parallel and monolingual data for the four languages for translation experiments. We combine different low-resource NMT techniques like zero-shot translation, cross-lingual transfer and synthetic data creation to reach the highest possible translation quality as well as to find which base languages are empirically more helpful for transfer to Livonian. The resulting NMT systems and the collected monolingual and parallel data, including a manually translated and verified translation benchmark, are publicly released via OPUS and Huggingface repositories.
Help from the Neighbors: Estonian Dialect Normalization Using a Finnish Dialect Generator
Mika Hämäläinen
|
Khalid Alnajjar
|
Tuuli Tuisk
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing
While standard Estonian is not a low-resourced language, the different dialects of the language are under-resourced from the point of view of NLP, given that there are no vast hand normalized resources available for training a machine learning model to normalize dialectal Estonian to standard Estonian. In this paper, we crawl a small corpus of parallel dialectal Estonian - standard Estonian sentences. In addition, we take a savvy approach of generating more synthetic training data for the normalization task by using an existing dialect generator model built for Finnish to “dialectalize” standard Estonian sentences from the Universal Dependencies tree banks. Our BERT based normalization model achieves a word error rate that is 26.49 points lower when using both the synthetic data and Estonian data in comparison to training the model with only the available Estonian data. Our results suggest that synthetic data generated by a model trained on a more resourced related language can indeed boost the results for a less resourced language.
Search
Fix data
Co-authors
- Khalid Alnajjar 1
- Valts Ernštreits 1
- Mark Fishel 1
- Mika Hämäläinen 1
- Matīss Rikters 1
- show all...