Andre Tättar


2023

pdf bib
Machine Translation for Low-resource Finno-Ugric Languages
Lisa Yankovskaya | Maali Tars | Andre Tättar | Mark Fishel
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

This paper focuses on neural machine translation (NMT) for low-resource Finno-Ugric languages. Our contributions are three-fold: (1) we extend existing and collect new parallel and monolingual corpora for 20 languages, (2) we expand the 200-language translation benchmark FLORES-200 with manual translations into nine new languages, and (3) we present experiments using the collected data to create NMT systems for the included languages and investigate the impact of back-translation data on the NMT performance for low-resource languages. Experimental results show that carefully selected limited amounts of back-translation directions yield the best results in terms of translation scores, for both high-resource and low-resource output languages.

2022

pdf bib
Multilingual Neural Machine Translation With the Right Amount of Sharing
Taido Purason | Andre Tättar
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

Large multilingual Transformer-based machine translation models have had a pivotal role in making translation systems available for hundreds of languages with good zero-shot translation performance. One such example is the universal model with shared encoder-decoder architecture. Additionally, jointly trained language-specific encoder-decoder systems have been proposed for multilingual neural machine translation (NMT) models. This work investigates various knowledge-sharing approaches on the encoder side while keeping the decoder language- or language-group-specific. We propose a novel approach, where we use universal, language-group-specific and language-specific modules to solve the shortcomings of both the universal models and models with language-specific encoders-decoders. Experiments on a multilingual dataset set up to model real-world scenarios, including zero-shot and low-resource translation, show that our proposed models achieve higher translation quality compared to purely universal and language-specific approaches.

pdf bib
MTee: Open Machine Translation Platform for Estonian Government
Toms Bergmanis | Marcis Pinnis | Roberts Rozis | Jānis Šlapiņš | Valters Šics | Berta Bernāne | Guntars Pužulis | Endijs Titomers | Andre Tättar | Taido Purason | Hele-Andra Kuulmets | Agnes Luhtaru | Liisa Rätsep | Maali Tars | Annika Laumets-Tättar | Mark Fishel
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

We present the MTee project - a research initiative funded via an Estonian public procurement to develop machine translation technology that is open-source and free of charge. The MTee project delivered an open-source platform serving state-of-the-art machine translation systems supporting four domains for six language pairs translating from Estonian into English, German, and Russian and vice-versa. The platform also features grammatical error correction and speech translation for Estonian and allows for formatted document translation and automatic domain detection. The software, data and training workflows for machine translation engines are all made publicly available for further use and research.

pdf bib
Teaching Unseen Low-resource Languages to Large Translation Models
Maali Tars | Taido Purason | Andre Tättar
Proceedings of the Seventh Conference on Machine Translation (WMT)

In recent years, large multilingual pre-trained neural machine translation model research has grown and it is common for these models to be publicly available for usage and fine-tuning. Low-resource languages benefit from the pre-trained models, because of knowledge transfer from high- to medium-resource languages. The recently available M2M-100 model is our starting point for cross-lingual transfer learning to Finno-Ugric languages, like Livonian. We participate in the WMT22 General Machine Translation task, where we focus on the English-Livonian language pair. We leverage data from other Finno-Ugric languages and through that, we achieve high scores for English-Livonian translation directions. Overall, instead of training a model from scratch, we use transfer learning and back-translation as the main methods and fine-tune a publicly available pre-trained model. This in turn reduces the cost and duration of training high-quality multilingual neural machine translation models.

2021

pdf bib
Extremely low-resource machine translation for closely related languages
Maali Tars | Andre Tättar | Mark Fišel
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

An effective method to improve extremely low-resource neural machine translation is multilingual training, which can be improved by leveraging monolingual data to create synthetic bilingual corpora using the back-translation method. This work focuses on closely related languages from the Uralic language family: from Estonian and Finnish geographical regions. We find that multilingual learning and synthetic corpora increase the translation quality in every language pair for which we have data. We show that transfer learning and fine-tuning are very effective for doing low-resource machine translation and achieve the best results. We collected new parallel data for Võro, North and South Saami and present first results of neural machine translation for these languages.

2020

pdf bib
ParBLEU: Augmenting Metrics with Automatic Paraphrases for the WMT’20 Metrics Shared Task
Rachel Bawden | Biao Zhang | Andre Tättar | Matt Post
Proceedings of the Fifth Conference on Machine Translation

We describe parBLEU, parCHRF++, and parESIM, which augment baseline metrics with automatically generated paraphrases produced by PRISM (Thompson and Post, 2020a), a multilingual neural machine translation system. We build on recent work studying how to improve BLEU by using diverse automatically paraphrased references (Bawden et al., 2020), extending experiments to the multilingual setting for the WMT2020 metrics shared task and for three base metrics. We compare their capacity to exploit up to 100 additional synthetic references. We find that gains are possible when using additional, automatically paraphrased references, although they are not systematic. However, segment-level correlations, particularly into English, are improved for all three metrics and even with higher numbers of paraphrased references.

pdf bib
A Study in Improving BLEU Reference Coverage with Diverse Automatic Paraphrasing
Rachel Bawden | Biao Zhang | Lisa Yankovskaya | Andre Tättar | Matt Post
Findings of the Association for Computational Linguistics: EMNLP 2020

We investigate a long-perceived shortcoming in the typical use of BLEU: its reliance on a single reference. Using modern neural paraphrasing techniques, we study whether automatically generating additional *diverse* references can provide better coverage of the space of valid translations and thereby improve its correlation with human judgments. Our experiments on the into-English language directions of the WMT19 metrics task (at both the system and sentence level) show that using paraphrased references does generally improve BLEU, and when it does, the more diverse the better. However, we also show that better results could be achieved if those paraphrases were to specifically target the parts of the space most relevant to the MT outputs being evaluated. Moreover, the gains remain slight even when human paraphrases are used, suggesting inherent limitations to BLEU’s capacity to correctly exploit multiple references. Surprisingly, we also find that adequacy appears to be less important, as shown by the high results of a strong sampling approach, which even beats human paraphrases when used with sentence-level BLEU.

2019

pdf bib
University of Tartu’s Multilingual Multi-domain WMT19 News Translation Shared Task Submission
Andre Tättar | Elizaveta Korotkova | Mark Fishel
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the University of Tartu’s submission to the news translation shared task of WMT19, where the core idea was to train a single multilingual system to cover several language pairs of the shared task and submit its results. We only used the constrained data from the shared task. We describe our approach and its results and discuss the technical issues we faced.

pdf bib
Quality Estimation and Translation Metrics via Pre-trained Word and Sentence Embeddings
Elizaveta Yankovskaya | Andre Tättar | Mark Fishel
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We propose the use of pre-trained embeddings as features of a regression model for sentence-level quality estimation of machine translation. In our work we combine freely available BERT and LASER multilingual embeddings to train a neural-based regression model. In the second proposed method we use as an input features not only pre-trained embeddings, but also log probability of any machine translation (MT) system. Both methods are applied to several language pairs and are evaluated both as a classical quality estimation system (predicting the HTER score) as well as an MT metric (predicting human judgements of translation quality).

2018

pdf bib
Phrase-based Unsupervised Machine Translation with Compositional Phrase Embeddings
Maksym Del | Andre Tättar | Mark Fishel
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the University of Tartu’s submission to the unsupervised machine translation track of WMT18 news translation shared task. We build several baseline translation systems for both directions of the English-Estonian language pair using monolingual data only; the systems belong to the phrase-based unsupervised machine translation paradigm where we experimented with phrase lengths of up to 3. As a main contribution, we performed a set of standalone experiments with compositional phrase embeddings as a substitute for phrases as individual vocabulary entries. Results show that reasonable n-gram vectors can be obtained by simply summing up individual word vectors which retains or improves the performance of phrase-based unsupervised machine tranlation systems while avoiding limitations of atomic phrase vectors.

pdf bib
Quality Estimation with Force-Decoded Attention and Cross-lingual Embeddings
Elizaveta Yankovskaya | Andre Tättar | Mark Fishel
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the submissions of the team from the University of Tartu for the sentence-level Quality Estimation shared task of WMT18. The proposed models use features based on attention weights of a neural machine translation system and cross-lingual phrase embeddings as input features of a regression model. Two of the proposed models require only a neural machine translation system with an attention mechanism with no additional resources. Results show that combining neural networks and baseline features leads to significant improvements over the baseline features alone.

2017

pdf bib
bleu2vec: the Painfully Familiar Metric on Continuous Vector Space Steroids
Andre Tättar | Mark Fishel
Proceedings of the Second Conference on Machine Translation