Hele-Andra Kuulmets


2023

Translated Benchmarks Can Be Misleading: the Case of Estonian Question Answering
Hele-Andra Kuulmets | Mark Fishel
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Translated test datasets are a popular and cheaper alternative to native test datasets. However, one of the properties of translated data is the existence of cultural knowledge unfamiliar to the target language speakers. This can make translated test datasets differ significantly from native target datasets. As a result, we might inaccurately estimate the performance of the models in the target language. In this paper, we use both native and translated Estonian QA datasets to study this topic more closely. We discover that relying on the translated test dataset results in an overestimation of the model’s performance on native Estonian data.

2022

MTee: Open Machine Translation Platform for Estonian Government
Toms Bergmanis | Marcis Pinnis | Roberts Rozis | Jānis Šlapiņš | Valters Šics | Berta Bernāne | Guntars Pužulis | Endijs Titomers | Andre Tättar | Taido Purason | Hele-Andra Kuulmets | Agnes Luhtaru | Liisa Rätsep | Maali Tars | Annika Laumets-Tättar | Mark Fishel
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

We present the MTee project, a research initiative funded via an Estonian public procurement to develop machine translation technology that is open-source and free of charge. The MTee project delivered an open-source platform serving state-of-the-art machine translation systems supporting four domains for six language pairs translating from Estonian into English, German, and Russian and vice versa. The platform also features grammatical error correction and speech translation for Estonian and allows for formatted document translation and automatic domain detection. The software, data and training workflows for machine translation engines are all made publicly available for further use and research.