Marina Fomicheva


2021

Bayesian Model-Agnostic Meta-Learning with Matrix-Valued Kernels for Quality Estimation
Abiola Obamuyide | Marina Fomicheva | Lucia Specia
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

Most current quality estimation (QE) models for machine translation are trained and evaluated in a fully supervised setting requiring significant quantities of labelled training data. However, obtaining labelled data can be both expensive and time-consuming. In addition, the test data that a deployed QE model would be exposed to may differ from its training data in significant ways. In particular, training samples are often labelled by one or a small set of annotators, whose perceptions of translation quality and needs may differ substantially from those of end-users, who will employ predictions in practice. Thus, it is desirable to be able to adapt QE models efficiently to new user data with limited supervision data. To address these challenges, we propose a Bayesian meta-learning approach for adapting QE models to the needs and preferences of each user with limited supervision. To enhance performance, we further propose an extension to a state-of-the-art Bayesian meta-learning approach which utilizes a matrix-valued kernel for Bayesian meta-learning of quality estimation. Experiments on data with varying numbers of users and language characteristics demonstrate that the proposed Bayesian meta-learning approach delivers improved predictive performance in both limited and full supervision settings.

Exploring Supervised and Unsupervised Rewards in Machine Translation
Julia Ive | Zixu Wang | Marina Fomicheva | Lucia Specia
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Reinforcement Learning (RL) is a powerful framework to address the discrepancy between loss functions used during training and the final evaluation metrics to be used at test time. When applied to neural Machine Translation (MT), it minimises the mismatch between the cross-entropy loss and non-differentiable evaluation metrics like BLEU. However, the suitability of these metrics as reward functions at training time is questionable: they tend to be sparse and biased towards the specific words used in the reference texts. We propose to address this problem by making models less reliant on such metrics in two ways: (a) with an entropy-regularised RL method that not only maximises a reward function but also explores the action space to avoid peaky distributions; (b) with a novel RL method that explores a dynamic unsupervised reward function to balance between exploration and exploitation. We base our proposals on the Soft Actor-Critic (SAC) framework, adapting the off-policy maximum entropy model for language generation applications such as MT. We demonstrate that SAC with BLEU reward tends to overfit less to the training data and performs better on out-of-domain data. We also show that our dynamic unsupervised reward can lead to better translation of ambiguous words.

Backtranslation Feedback Improves User Confidence in MT, Not Quality
Vilém Zouhar | Michal Novák | Matúš Žilinec | Ondřej Bojar | Mateo Obregón | Robin L. Hill | Frédéric Blain | Marina Fomicheva | Lucia Specia | Lisa Yankovskaya
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Translating text into a language unknown to the text’s author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, quality estimation (with alignment) and source paraphrasing. In this paper, we describe an experiment on outbound translation from English to Czech and Estonian. We examine the effects of each proposed feedback module and further focus on how the quality of machine translation systems influences these findings and the user perception of success. We show that backward translation feedback has a mixed effect on the whole process: it increases user confidence in the produced translation, but not the objective quality.

Continual Quality Estimation with Online Bayesian Meta-Learning
Abiola Obamuyide | Marina Fomicheva | Lucia Specia
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Most current quality estimation (QE) models for machine translation are trained and evaluated in a static setting where training and test data are assumed to be from a fixed distribution. However, in real-life settings, the test data that a deployed QE model would be exposed to may differ from its training data. In particular, training samples are often labelled by one or a small set of annotators, whose perceptions of translation quality and needs may differ substantially from those of end-users, who will employ predictions in practice. To address this challenge, we propose an online Bayesian meta-learning framework for the continuous training of QE models that is able to adapt them to the needs of different users, while being robust to distributional shifts in training and test data. Experiments on data with varying numbers of users and language characteristics validate the effectiveness of the proposed approach.

Knowledge Distillation for Quality Estimation
Amit Gajbhiye | Marina Fomicheva | Fernando Alva-Manchego | Frédéric Blain | Abiola Obamuyide | Nikolaos Aletras | Lucia Specia
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

Unsupervised Quality Estimation for Neural Machine Translation
Marina Fomicheva | Shuo Sun | Lisa Yankovskaya | Frédéric Blain | Francisco Guzmán | Mark Fishel | Nikolaos Aletras | Vishrav Chaudhary | Lucia Specia
Transactions of the Association for Computational Linguistics, Volume 8

Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user about the quality of the MT output at test time. Existing approaches require large amounts of expert-annotated data, computation, and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Different from most of the current work that treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By utilizing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivaling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.
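
As an illustration of the glass-box idea in this abstract, the sketch below derives a confidence score from token log-probabilities that an MT system emits as a by-product of decoding, and an uncertainty estimate from several stochastic forward passes (for example, with dropout left on at inference time). This is a minimal sketch under stated assumptions, not the paper's implementation: the function names and the input format (plain lists of log-probabilities) are hypothetical, and the indicators shown may not match the ones the authors evaluated.

```python
from statistics import mean, pvariance


def sentence_log_prob(token_log_probs):
    """Length-normalised log-probability of one translation.

    token_log_probs: list of per-token log-probabilities (hypothetical input
    format; a real MT toolkit would expose these in its own way).
    """
    return sum(token_log_probs) / len(token_log_probs)


def stochastic_pass_stats(passes):
    """Mean and variance of the sentence score over several stochastic passes.

    passes: one list of token log-probabilities per forward pass.
    A higher mean suggests higher model confidence; a higher variance
    suggests higher model uncertainty about the translation.
    """
    scores = [sentence_log_prob(p) for p in passes]
    return mean(scores), pvariance(scores)
```

Under this sketch, a sentence whose score varies widely across passes would be flagged as a less reliable translation, with no labelled QE data involved.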

Multi-Hypothesis Machine Translation Evaluation
Marina Fomicheva | Lucia Specia | Francisco Guzmán
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Reliably evaluating Machine Translation (MT) through automated metrics is a long-standing problem. One of the main challenges is the fact that multiple outputs can be equally valid. Attempts to minimise this issue include metrics that relax the matching of MT output and reference strings, and the use of multiple references. The latter has been shown to significantly improve the performance of evaluation metrics. However, collecting multiple references is expensive and in practice a single reference is generally used. In this paper, we propose an alternative approach: instead of modelling linguistic variation in human references we exploit the MT model uncertainty to generate multiple diverse translations and use these: (i) as surrogates to reference translations; (ii) to obtain a quantification of translation variability to either complement existing metric scores or (iii) replace references altogether. We show that for a number of popular evaluation metrics our variability estimates lead to substantial improvements in correlation with human judgements of quality by up to 15%.
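
The two uses of model-generated hypotheses described in this abstract can be sketched as follows: scoring the MT output against the hypotheses as if they were references, and measuring how much the hypotheses disagree with each other. This is an illustrative sketch only. The similarity function is a simple unigram F1 chosen as a stand-in for whatever evaluation metric is being used, and all function names are hypothetical rather than taken from the paper.

```python
from collections import Counter
from itertools import combinations


def unigram_f1(hyp, ref):
    """Unigram F1 overlap between two whitespace-tokenised strings
    (a stand-in for a real MT evaluation metric)."""
    h, r = hyp.split(), ref.split()
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def surrogate_score(mt_output, hypotheses):
    """Average similarity of the MT output to model-generated hypotheses,
    used in place of (or alongside) human references."""
    return sum(unigram_f1(mt_output, h) for h in hypotheses) / len(hypotheses)


def variability(hypotheses):
    """One minus the average pairwise similarity among hypotheses.
    Requires at least two hypotheses."""
    pairs = list(combinations(hypotheses, 2))
    return 1.0 - sum(unigram_f1(a, b) for a, b in pairs) / len(pairs)
```

In this sketch, high variability means the model produces very different translations for the same source, which can be read as a signal of low translation quality.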

Findings of the WMT 2020 Shared Task on Quality Estimation
Lucia Specia | Frédéric Blain | Marina Fomicheva | Erick Fonseca | Vishrav Chaudhary | Francisco Guzmán | André F. T. Martins
Proceedings of the Fifth Conference on Machine Translation

We report the results of the WMT20 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word, sentence and document levels. This edition included new data with open domain texts, direct assessment annotations, and multiple language pairs: English-German, English-Chinese, Russian-English, Romanian-English, Estonian-English, Sinhala-English and Nepali-English data for the sentence-level subtasks, English-German and English-Chinese for the word-level subtask, and English-French data for the document-level subtask. In addition, we made neural machine translation models available to participants. 19 participating teams from 27 institutions submitted altogether 1374 systems to different task variants and language pairs.

BERGAMOT-LATTE Submissions for the WMT20 Quality Estimation Shared Task
Marina Fomicheva | Shuo Sun | Lisa Yankovskaya | Frédéric Blain | Vishrav Chaudhary | Mark Fishel | Francisco Guzmán | Lucia Specia
Proceedings of the Fifth Conference on Machine Translation

This paper presents our submission to the WMT2020 Shared Task on Quality Estimation (QE). We participate in Task 1 and Task 2 focusing on sentence-level prediction. We explore (a) a black-box approach to QE based on pre-trained representations; and (b) glass-box approaches that leverage various indicators that can be extracted from the neural MT systems. In addition to training a feature-based regression model using glass-box quality indicators, we also test whether they can be used to predict MT quality directly with no supervision. We assess our systems in a multi-lingual setting and show that both types of approaches generalise well across languages. Our black-box QE models tied for the winning submission in four out of seven language pairs in Task 1, thus demonstrating very strong performance. The glass-box approaches also performed competitively, representing a light-weight alternative to the neural-based models.

An Exploratory Study on Multilingual Quality Estimation
Shuo Sun | Marina Fomicheva | Frédéric Blain | Vishrav Chaudhary | Ahmed El-Kishky | Adithya Renduchintala | Francisco Guzmán | Lucia Specia
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Predicting the quality of machine translation has traditionally been addressed with language-specific models, under the assumption that the quality label distribution or linguistic features exhibit traits that are not shared across languages. An obvious disadvantage of this approach is the need for labelled data for each given language pair. We challenge this assumption by exploring different approaches to multilingual Quality Estimation (QE), including using scores from translation models. We show that these outperform single-language models, particularly in less balanced quality label distributions and low-resource settings. In the extreme case of zero-shot QE, we show that it is possible to accurately predict quality for any given new language from models trained on other languages. Our findings indicate that state-of-the-art neural QE models based on powerful pre-trained representations generalise well across languages, making them more applicable in real-world settings.

Exploring Model Consensus to Generate Translation Paraphrases
Zhenhao Li | Marina Fomicheva | Lucia Specia
Proceedings of the Fourth Workshop on Neural Generation and Translation

This paper describes our submission to the 2020 Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). This task focuses on improving the ability of neural MT systems to generate diverse translations. Our submission explores various methods, including N-best translation, Monte Carlo dropout, Diverse Beam Search, Mixture of Experts, Ensembling, and Lexical Substitution. Our main submission is based on the integration of multiple translations from multiple methods using Consensus Voting. Experiments show that the proposed approach achieves a considerable degree of diversity without introducing noisy translations. Our final submission achieves a 0.5510 weighted F1 score on the blind test set for the English-Portuguese track.
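
One simple way to read "Consensus Voting" over the outputs of several generation methods is sketched below: each method contributes a set of candidate translations, and a candidate is kept only if enough methods propose it. This is a hedged illustration, not the submission's actual procedure; the function name, the `min_votes` threshold, and the tie-breaking order are all assumptions made for the example.

```python
from collections import Counter


def consensus_vote(candidate_sets, min_votes=2):
    """Keep candidates proposed by at least `min_votes` methods.

    candidate_sets: one iterable of candidate translations per method
    (e.g. N-best list, Monte Carlo dropout samples, diverse beam search).
    Duplicates within a single method count as one vote.
    """
    votes = Counter()
    for candidates in candidate_sets:
        votes.update(set(candidates))
    kept = [c for c, n in votes.items() if n >= min_votes]
    # Most-voted first; ties broken alphabetically for a stable order.
    return sorted(kept, key=lambda c: (-votes[c], c))
```

Raising `min_votes` trades diversity for precision: fewer candidates survive, but each has support from more methods.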

2019

Taking MT Evaluation Metrics to Extremes: Beyond Correlation with Human Judgments
Marina Fomicheva | Lucia Specia
Computational Linguistics, Volume 45, Issue 3 - September 2019

Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: Metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than the traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.

2018

MAJE Submission to the WMT2018 Shared Task on Parallel Corpus Filtering
Marina Fomicheva | Jesús González-Rubio
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the participation of Webinterpret in the shared task on parallel corpus filtering at the Third Conference on Machine Translation (WMT 2018). The paper describes the main characteristics of our approach and discusses the results obtained on the data sets published for the shared task.

2016

Reference Bias in Monolingual Machine Translation Evaluation
Marina Fomicheva | Lucia Specia
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Using Contextual Information for Machine Translation Evaluation
Marina Fomicheva | Núria Bel
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Automatic evaluation of Machine Translation (MT) is typically approached by measuring similarity between the candidate MT and a human reference translation. An important limitation of existing evaluation systems is that they are unable to distinguish candidate-reference differences that arise due to acceptable linguistic variation from the differences induced by MT errors. In this paper we present a new metric, UPF-Cobalt, that addresses this issue by taking into consideration the syntactic contexts of candidate and reference words. The metric applies a penalty when the words are similar but the contexts in which they occur are not equivalent. In this way, Machine Translations (MTs) that are different from the human translation but still essentially correct are distinguished from those that share a high number of words with the reference but alter the meaning of the sentence due to translation errors. The results show that the method proposed is indeed beneficial for automatic MT evaluation. We report experiments based on two different evaluation tasks with various types of manual quality assessment. The metric significantly outperforms state-of-the-art evaluation systems in varying evaluation settings.

CobaltF: A Fluent Metric for MT Evaluation
Marina Fomicheva | Núria Bel | Lucia Specia | Iria da Cunha | Anton Malinovskiy
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

USFD at SemEval-2016 Task 1: Putting different State-of-the-Arts into a Box
Ahmet Aker | Frederic Blain | Andres Duque | Marina Fomicheva | Jurica Seva | Kashif Shah | Daniel Beck
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

UPF-Cobalt Submission to WMT15 Metrics Task
Marina Fomicheva | Núria Bel | Iria da Cunha | Anton Malinovskiy
Proceedings of the Tenth Workshop on Statistical Machine Translation

2014

Boosting the creation of a treebank
Blanca Arias | Núria Bel | Mercè Lorente | Montserrat Marimón | Alba Milà | Jorge Vivaldi | Muntsa Padró | Marina Fomicheva | Imanol Larrea
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present the results of an ongoing experiment of bootstrapping a Treebank for Catalan by using a Dependency Parser trained with Spanish sentences. In order to save time and cost, our approach was to profit from the typological similarities between Catalan and Spanish to create a first Catalan data set quickly by: (i) automatically annotating with a de-lexicalized Spanish parser, (ii) manually correcting the parses, and (iii) using the corrected Catalan sentences to train a Catalan parser. The results showed that about 1000 parsed sentences are required to train a Catalan parser, and that these were produced in 4 months by 2 annotators.