2024
HPLT’s First Release of Data and Models
Nikolay Arefyev | Mikko Aulamo | Pinzhen Chen | Ona De Gibert Bonet | Barry Haddow | Jindřich Helcl | Bhavitvya Malik | Gema Ramírez-Sánchez | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Jaume Zaragoza-Bernabeu
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
The High Performance Language Technologies (HPLT) project is a 3-year EU-funded initiative that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing. We describe the first results of the project. The data release includes monolingual data covering 75 languages (5.6T tokens) and parallel data covering 18 language pairs (96M pairs), derived from 1.8 petabytes of web crawls. Using automated and transparent pipelines, we have trained and released the first machine translation (MT) models as well as large language models (LLMs). Multiple data processing tools and pipelines have also been made public.
2023
Negative Lexical Constraints in Neural Machine Translation
Josef Jon | Dusan Varis | Michal Novák | João Paulo Aires | Ondřej Bojar
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
This paper explores negative lexical constraining in English-to-Czech neural machine translation. Negative lexical constraining is used to prohibit certain words or expressions in the translation produced by the NMT model. We compared various methods based on modifying either the decoding process or the training data. The comparison was performed on two tasks: paraphrasing and feedback-based translation refinement. We also studied how the methods “evade” the constraints, i.e., cases where the disallowed expression still appears in the output but in a changed form, most interestingly where a different surface form (for example, a different inflection) is produced. We propose a way to mitigate this issue by training with stemmed negative constraints, so that the model's ability to induce different forms of a word can be used to prohibit all possible forms of the constraint. This helps to some extent, but the problem persists in many cases.
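To illustrate the stemmed-constraint idea, here is a minimal Python sketch (not the authors' code): a hypothesis is rejected if any of its stemmed tokens matches a stemmed negative constraint, so that inflected forms of a banned word are caught as well. The toy stem function is an illustrative stand-in; a real system would use a Czech stemmer or lemmatizer and apply this check inside beam search.

```python
def stem(token: str) -> str:
    # Toy suffix-stripping stemmer; purely illustrative.
    t = token.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if t.endswith(suffix) and len(t) > len(suffix) + 2:
            return t[: -len(suffix)]
    return t

def violates(hypothesis: str, negative_constraints: list[str]) -> bool:
    """True if any stemmed token of the hypothesis hits a stemmed constraint."""
    banned = {stem(c) for c in negative_constraints}
    return any(stem(tok) in banned for tok in hypothesis.split())

# The constraint "cat" also blocks the inflected form "cats".
print(violates("the cats are sleeping", ["cat"]))  # True
```

During decoding, hypotheses for which such a check fires would be pruned or penalized.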
2021
End-to-End Lexically Constrained Machine Translation for Morphologically Rich Languages
Josef Jon | João Paulo Aires | Dusan Varis | Ondřej Bojar
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Lexically constrained machine translation allows the user to manipulate the output sentence by enforcing the presence or absence of certain words and phrases. Although current approaches can enforce terms to appear in the translation, they often struggle to make the constraint word form agree with the rest of the generated output. Our manual analysis shows that 46% of the errors in the output of a baseline constrained model for English-to-Czech translation are related to agreement. We investigate mechanisms that allow neural machine translation to infer the correct word inflection given lemmatized constraints. In particular, we focus on methods based on training the model with constraints provided as part of the input sequence. Our experiments on the English-Czech language pair show that this approach improves the translation of constrained terms in both automatic and manual evaluation by reducing agreement errors. Our approach thus eliminates inflection errors without introducing new errors or decreasing the overall quality of the translation.
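A minimal sketch of the constraint-as-input scheme described above: lemmatized target-side constraints are appended to the source sentence behind separator tokens, and the model is trained to produce correctly inflected forms in context. The separator tokens and the lemmatize placeholder are assumptions for illustration, not the paper's exact format.

```python
def lemmatize(phrase: str) -> str:
    # Placeholder; a real pipeline would use a proper Czech lemmatizer.
    return phrase.lower()

def build_input(source: str, constraints: list[str],
                sep: str = "<sep>", csep: str = "<c>") -> str:
    """Append lemmatized target-side constraints to the source sentence."""
    if not constraints:
        return source
    lemmas = f" {csep} ".join(lemmatize(c) for c in constraints)
    return f"{source} {sep} {lemmas}"

print(build_input("I saw two cats .", ["kočka"]))
# -> "I saw two cats . <sep> kočka"
```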
European Language Grid: A Joint Platform for the European Language Technology Community
Georg Rehm | Stelios Piperidis | Kalina Bontcheva | Jan Hajic | Victoria Arranz | Andrejs Vasiļjevs | Gerhard Backfried | Jose Manuel Gomez-Perez | Ulrich Germann | Rémi Calizzano | Nils Feldhus | Stefanie Hegele | Florian Kintzel | Katrin Marheinecke | Julian Moreno-Schneider | Dimitris Galanis | Penny Labropoulou | Miltos Deligiannis | Katerina Gkirtzou | Athanasia Kolovou | Dimitris Gkoumas | Leon Voukoutis | Ian Roberts | Jana Hamrlova | Dusan Varis | Lukas Kacena | Khalid Choukri | Valérie Mapelli | Mickaël Rigault | Julija Melnika | Miro Janosik | Katja Prinz | Andres Garcia-Silva | Cristian Berrio | Ondrej Klejch | Steve Renals
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Europe is a multilingual society in which dozens of languages are spoken. The only way to enable multilingualism and to benefit from it is through Language Technologies (LT), i.e., Natural Language Processing and Speech Technologies. We describe the European Language Grid (ELG), which is designed to evolve into the primary platform and marketplace for LT in Europe: one umbrella platform for the European LT landscape, including research and industry, that enables all stakeholders to upload, share, and distribute their services, products, and resources. By the end of our EU project, which will establish a legal entity in 2022, the ELG will provide access to approximately 1,300 services for all European languages as well as thousands of data sets.
CUNI systems for WMT21: Multilingual Low-Resource Translation for Indo-European Languages Shared Task
Josef Jon | Michal Novák | João Paulo Aires | Dusan Varis | Ondřej Bojar
Proceedings of the Sixth Conference on Machine Translation
CUNI Systems for WMT21: Terminology Translation Shared Task
Josef Jon | Michal Novák | João Paulo Aires | Dusan Varis | Ondřej Bojar
Proceedings of the Sixth Conference on Machine Translation
This paper describes the Charles University submission to the Terminology Translation shared task at WMT21. The objective of this task is to design a system that translates certain terms based on a provided terminology database while preserving high overall translation quality. We competed in the English-French language pair. Our approach is based on providing the desired translations alongside the input sentence and training the model to use these provided terms. We lemmatize the terms during both training and inference, allowing the model to learn how to produce the correct surface forms of the words when they differ from the forms provided in the terminology database. Our submission ranked second in the Exact Match metric, which evaluates the ability of the model to produce the desired terms in the translation.
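For illustration, a simplified exact-match style score (not the official WMT21 evaluation script): the fraction of desired target terms that occur verbatim in the corresponding system output.

```python
def exact_match(outputs: list[str], terms_per_sentence: list[list[str]]) -> float:
    """Fraction of desired terms found verbatim in the corresponding output."""
    hits = total = 0
    for out, terms in zip(outputs, terms_per_sentence):
        for term in terms:
            total += 1
            hits += term in out  # bool counts as 0/1
    return hits / total if total else 0.0

print(exact_match(["la traduction automatique neuronale"],
                  [["traduction automatique"]]))  # 1.0
```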
Sequence Length is a Domain: Length-based Overfitting in Transformer Models
Dusan Varis | Ondřej Bojar
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a large number of NLP tasks, can still suffer from overfitting during training. In practice, this is usually countered either by applying regularization methods (e.g., dropout, L2-regularization) or by providing huge amounts of training data. Additionally, Transformer and other architectures are known to struggle when generating very long sequences. For example, in machine translation, neural-based systems perform worse on very long sequences than the preceding phrase-based translation approaches (Koehn and Knowles, 2017). We present results suggesting that the issue might also lie in the mismatch between the length distributions of the training and validation data, combined with the aforementioned tendency of neural networks to overfit to the training data. We demonstrate on a simple string editing task and a machine translation task that the Transformer model's performance drops significantly when facing sequences whose length diverges from the length distribution of the training data. Additionally, we show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training, rather than the length of the input sequence.
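The length-mismatch setup can be probed with a toy dataset along these lines (a sketch under our own assumptions, not the paper's code): train a sequence-to-sequence model on one length range and evaluate on a disjoint, longer range. A string-copy task stands in here for the paper's string editing tasks; the vocabulary and length ranges are arbitrary choices.

```python
import random

def make_pairs(n: int, min_len: int, max_len: int, vocab: str = "abcdefgh"):
    """Generate (source, target) pairs for an identity (copy) task."""
    pairs = []
    for _ in range(n):
        length = random.randint(min_len, max_len)
        s = "".join(random.choice(vocab) for _ in range(length))
        pairs.append((s, s))  # source == target
    return pairs

train = make_pairs(10_000, 5, 20)  # the model only ever sees lengths 5-20
test = make_pairs(1_000, 30, 40)   # evaluation on unseen, longer lengths
```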
2019
Unsupervised Pretraining for Neural Machine Translation Using Elastic Weight Consolidation
Dušan Variš | Ondřej Bojar
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
This work presents our ongoing research on unsupervised pretraining in neural machine translation (NMT). In our method, we initialize the weights of the encoder and decoder with two language models trained on monolingual data and then fine-tune the model on parallel data using Elastic Weight Consolidation (EWC) to avoid forgetting the original language modeling task. We compare regularization by EWC with previous work that focuses on regularization by language modeling objectives. The positive result is that using EWC with the decoder achieves BLEU scores similar to the previous work, while the model converges 2-3 times faster and does not require the original unlabeled training data during the fine-tuning stage. On the other hand, regularization by EWC is less effective if the original and new tasks are not closely related. We show that initializing the bidirectional NMT encoder with a left-to-right language model and forcing the model to remember the original left-to-right language modeling task limits the learning capacity of the encoder for the whole bidirectional context.
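The EWC penalty takes the standard form L = L_NMT + (λ/2) Σ_i F_i (θ_i − θ*_i)², where θ* are the weights saved after language-model pretraining and F is a diagonal Fisher information estimate. A minimal PyTorch sketch, with variable names of our own choosing:

```python
import torch

def ewc_penalty(model, star_params: dict, fisher: dict, lam: float = 1.0):
    """star_params and fisher map parameter names to tensors saved after
    pretraining (optimal weights and diagonal Fisher estimates)."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - star_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# total_loss = nmt_loss + ewc_penalty(model, star_params, fisher, lam)
```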
2018
CUNI Basque-to-English Submission in IWSLT18
Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the 15th International Conference on Spoken Language Translation
We present our submission to the IWSLT18 Low Resource task, focused on translation from Basque to English. Our submission is based on the current state-of-the-art self-attentive neural network architecture, the Transformer. We further improve this strong baseline by exploiting available monolingual data using the back-translation technique. We also present further improvements gained by transfer learning, a technique that trains a model on a high-resource language pair (Czech-English) and then fine-tunes it on the target low-resource language pair (Basque-English).
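The back-translation step can be summarized by the following sketch (illustrative only; translate_en_eu is a hypothetical reverse-direction model, not a component named in the paper): monolingual English text is translated into synthetic Basque, and the resulting pairs are mixed into the Basque-to-English training data.

```python
def back_translate(mono_english: list[str], translate_en_eu) -> list[tuple[str, str]]:
    """Return synthetic (basque_source, english_target) training pairs."""
    return [(translate_en_eu(sent), sent) for sent in mono_english]
```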
Improving a Neural-based Tagger for Multiword Expressions Identification
Dušan Variš | Natalia Klyueva
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Neural Monkey: The Current State and Beyond
Jindřich Helcl | Jindřich Libovický | Tom Kocmi | Tomáš Musil | Ondřej Cífka | Dušan Variš | Ondřej Bojar
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
CUNI System for the WMT18 Multimodal Translation Task
Jindřich Helcl | Jindřich Libovický | Dušan Variš
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
We present our submission to the WMT18 Multimodal Translation Task. The main feature of our submission is the use of a self-attentive network instead of a recurrent neural network. We evaluate two methods of incorporating the visual features in the model: first, we include the image representation as another input to the network; second, we train the model to predict the visual features and use this as an auxiliary objective. For our submission, we acquired additional textual and multimodal data. Both of the proposed methods yield significant improvements over recurrent networks and self-attentive textual baselines.
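A sketch of the second method, the auxiliary visual objective (the actual system was built in Neural Monkey; this PyTorch fragment with assumed dimensions is for illustration only): a projection of a pooled model state is trained to reconstruct the image feature vector, and the resulting loss is added to the translation loss.

```python
import torch.nn as nn

class AuxImageHead(nn.Module):
    """Predict the image feature vector from a pooled hidden state."""
    def __init__(self, hidden_dim: int = 512, img_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, img_dim)

    def forward(self, pooled_state, image_features):
        # Mean squared error between predicted and true visual features.
        return nn.functional.mse_loss(self.proj(pooled_state), image_features)

# total_loss = translation_loss + aux_weight * aux_head(pooled_state, image_feats)
```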
2017
CUNI submission in WMT17: Chimera goes neural
Roman Sudarikov | David Mareček | Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the Second Conference on Machine Translation
CUNI Experiments for WMT17 Metrics Task
David Mareček | Ondřej Bojar | Ondřej Hübsch | Rudolf Rosa | Dušan Variš
Proceedings of the Second Conference on Machine Translation
CUNI System for WMT17 Automatic Post-Editing Task
Dušan Variš | Ondřej Bojar
Proceedings of the Second Conference on Machine Translation
CUNI NMT System for WAT 2017 Translation Tasks
Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the 4th Workshop on Asian Translation (WAT2017)
This paper presents this year's CUNI submissions to the WAT 2017 Translation Task, focusing on Japanese-English translation in the Scientific Papers, Patents, and Newswire subtasks. We compare two neural network architectures, the standard sequence-to-sequence model with attention (Seq2Seq) and an architecture using a convolutional sentence encoder (FBConv2Seq), both implemented in Neural Monkey, an NMT framework whose development we participate in. We also compare various types of preprocessing of the source Japanese sentences and their impact on the overall results. Furthermore, we include the results of our experiments with out-of-domain data obtained by combining the corpora provided for each subtask.