2024
Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer
Hele-Andra Kuulmets | Taido Purason | Agnes Luhtaru | Mark Fishel
Findings of the Association for Computational Linguistics: NAACL 2024
This paper explores cost-efficient methods to adapt pretrained Large Language Models (LLMs) to new lower-resource languages, with a specific focus on Estonian. Leveraging the Llama 2 model, we investigate the impact of combining cross-lingual instruction-tuning with additional monolingual pretraining. Our results demonstrate that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly enhances results on Estonian. Furthermore, we showcase cross-lingual knowledge transfer from high-quality English instructions to Estonian, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities. Our best model, named Llammas, represents the first open-source instruction-following LLM for Estonian. Additionally, we publish Alpaca-est, the first general task instruction dataset for Estonian. These contributions mark initial progress toward developing open-source LLMs for Estonian.
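As a rough illustration of the two-stage recipe the abstract describes (continued monolingual pretraining, then cross-lingual instruction-tuning), here is a minimal Hugging Face sketch. The data files, hyperparameters, and Alpaca-style prompt template are placeholder assumptions, not the authors' released training code.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

def train_stage(dataset, output_dir):
    # One causal-LM fine-tuning pass; the same model object carries the
    # weights from stage 1 into stage 2.
    Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir,
                               per_device_train_batch_size=4,
                               num_train_epochs=1),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()

# Stage 1: continued pretraining on raw Estonian text (placeholder file).
mono = load_dataset("text", data_files="estonian_corpus.txt")["train"]
mono = mono.map(lambda b: tokenizer(b["text"], truncation=True,
                                    max_length=1024), batched=True)
train_stage(mono, "stage1-monolingual")

# Stage 2: instruction-tuning on Alpaca-style prompts, mixing English
# instructions with Alpaca-est (file names are placeholders).
def to_example(ex):
    text = (f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Response:\n{ex['output']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=1024)

inst = load_dataset("json", data_files=["alpaca_en.json", "alpaca_est.json"])
train_stage(inst["train"].map(to_example), "stage2-instruction")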
To Err Is Human, but Llamas Can Learn It Too
Agnes Luhtaru | Taido Purason | Martin Vainikko | Maksym Del | Mark Fishel
Findings of the Association for Computational Linguistics: EMNLP 2024
This study explores enhancing grammatical error correction (GEC) through automatic error generation (AEG) using language models (LMs). Specifically, we fine-tune Llama 2 LMs for error generation and find that this approach yields synthetic errors akin to human errors. Next, we train GEC Llama models using these artificial errors and outperform previous state-of-the-art error correction models, with gains ranging between 0.8 and 6 F0.5 points across all tested languages (German, Ukrainian, and Estonian). Moreover, we demonstrate that generating errors by fine-tuning smaller sequence-to-sequence models and prompting large commercial LMs (GPT-3.5 and GPT-4) also yields synthetic errors that benefit error correction models. We openly release trained models for error generation and correction, as well as all the synthesized error datasets for the covered languages.
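A minimal sketch of the error-generation step, assuming a causal-LM checkpoint fine-tuned to corrupt clean sentences; the checkpoint name and prompt template below are hypothetical, not the paper's exact setup.

from transformers import AutoModelForCausalLM, AutoTokenizer

# "finetuned-error-generator" is a hypothetical local checkpoint of a
# Llama 2 model fine-tuned to introduce human-like errors.
tokenizer = AutoTokenizer.from_pretrained("finetuned-error-generator")
model = AutoModelForCausalLM.from_pretrained("finetuned-error-generator")

def add_errors(correct: str) -> str:
    # The prompt template is an assumption, not the paper's format.
    prompt = f"Rewrite the sentence with grammatical errors: {correct}\nErroneous:"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                         top_p=0.9, temperature=0.8)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    return text.split("Erroneous:", 1)[-1].strip()

# Each clean sentence yields one synthetic (erroneous -> correct)
# training pair for a downstream GEC model.
clean = ["Ta läks eile poodi.", "Me sõidame homme Tartusse."]
pairs = [(add_errors(s), s) for s in clean]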
No Error Left Behind: Multilingual Grammatical Error Correction with Pre-trained Translation Models
Agnes Luhtaru | Elizaveta Korotkova | Mark Fishel
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Grammatical Error Correction (GEC) enhances language proficiency and promotes effective communication, but research has primarily centered on English. We propose a simple approach to multilingual and low-resource GEC by exploring the potential of multilingual machine translation (MT) models for error correction. We show that MT models are not only capable of error correction out of the box, but can also be fine-tuned for even better correction quality. Results show the effectiveness of this approach, with our multilingual model outperforming similar-sized mT5-based models and even competing favourably with larger models.
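The out-of-the-box idea can be sketched with an off-the-shelf NLLB checkpoint: "translating" Estonian into Estonian makes the MT model rewrite, and partly correct, the input. The checkpoint choice and example sentence are illustrative, and the paper's fine-tuning step is omitted.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"  # illustrative NLLB checkpoint
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="est_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

def correct(sentence: str) -> str:
    # Force the target language to equal the source language, so the
    # "translation" is a same-language rewrite of the input.
    inputs = tokenizer(sentence, return_tensors="pt")
    out = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("est_Latn"),
        max_new_tokens=64)
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0]

print(correct("Ma lähen kool homme."))  # made-up erroneous input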
Multilinguality or Back-translation? A Case Study with Estonian
Elizaveta Korotkova | Taido Purason | Agnes Luhtaru | Mark Fishel
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Machine translation quality relies heavily on large amounts of training data; when only a limited amount of parallel data is available, synthetic back-translated or multilingual data can be used in addition. In this work, we introduce SynEst, a synthetic corpus of translations from 11 languages into Estonian which totals over 1 billion sentence pairs. Using this corpus, we investigate whether adding synthetic or English-centric additional data yields better translation quality for translation directions that do not include English. Our results show that while both strategies are effective, synthetic data gives better results. Our final models improve the performance of the baseline No Language Left Behind model while retaining its source-side multilinguality.
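A minimal sketch of the back-translation side of the comparison, assuming an off-the-shelf NLLB checkpoint in place of the models actually used to build SynEst: authentic Estonian target sentences are machine-translated into the source language, and the resulting (synthetic source, authentic Estonian) pairs serve as extra training data.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"  # stand-in for the actual models
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="est_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

def back_translate(et_sentence: str, tgt: str = "deu_Latn") -> str:
    # Translate an authentic Estonian target sentence into the source
    # language to obtain a synthetic source side.
    inputs = tokenizer(et_sentence, return_tensors="pt")
    out = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt))
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0]

et_mono = ["Eesti keel on väga huvitav."]
# Synthetic (German, Estonian) pairs for training German->Estonian MT.
synthetic = [(back_translate(s), s) for s in et_mono]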
2023
Automatic Transcription for Estonian Children’s Speech
Agnes Luhtaru | Rauno Jaaska | Karl Kruusamäe | Mark Fishel
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
We evaluate the impact of recent improvements in Automatic Speech Recognition (ASR) on transcribing Estonian children’s speech. Our research focuses on fine-tuning large ASR models with a 10-hour Estonian children’s speech dataset to create accurate transcriptions. Our results show that large pre-trained models hold great potential when fine-tuned first on a more substantial Estonian adult speech corpus and then further trained on children’s speech.
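A hypothetical sketch of the staged fine-tuning described above, using Whisper via Hugging Face: the model is adapted first on adult speech, then on the small children's set. The checkpoint, data layout (an "audiofolder" with a "transcription" metadata column), and hyperparameters are assumptions; evaluation is omitted.

from dataclasses import dataclass
from datasets import Audio, load_dataset
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v2", language="estonian", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

def prepare(batch):
    # Turn raw audio into log-Mel features and transcripts into label ids.
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch

@dataclass
class SpeechCollator:
    processor: WhisperProcessor
    def __call__(self, features):
        # Pad audio features and labels separately; mask label padding
        # with -100 so it is ignored by the loss.
        batch = self.processor.feature_extractor.pad(
            [{"input_features": f["input_features"]} for f in features],
            return_tensors="pt")
        labels = self.processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt")
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100)
        return batch

def finetune(data_dir, output_dir):
    ds = load_dataset("audiofolder", data_dir=data_dir)["train"]
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
    ds = ds.map(prepare, remove_columns=ds.column_names)
    Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir=output_dir,
                                      per_device_train_batch_size=8,
                                      num_train_epochs=3),
        train_dataset=ds,
        data_collator=SpeechCollator(processor),
    ).train()

finetune("adult_speech/", "stage1-adult")        # larger adult corpus first
finetune("children_speech/", "stage2-children")  # then the 10h children's set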
2022
MTee: Open Machine Translation Platform for Estonian Government
Toms Bergmanis | Mārcis Pinnis | Roberts Rozis | Jānis Šlapiņš | Valters Šics | Berta Bernāne | Guntars Pužulis | Endijs Titomers | Andre Tättar | Taido Purason | Hele-Andra Kuulmets | Agnes Luhtaru | Liisa Rätsep | Maali Tars | Annika Laumets-Tättar | Mark Fishel
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
We present the MTee project, a research initiative funded via an Estonian public procurement to develop machine translation technology that is open-source and free of charge. The MTee project delivered an open-source platform serving state-of-the-art machine translation systems that support four domains and six language pairs, translating from Estonian into English, German, and Russian and vice versa. The platform also features grammatical error correction and speech translation for Estonian and allows for formatted document translation and automatic domain detection. The software, data and training workflows for the machine translation engines are all made publicly available for further use and research.