Learning from Past Mistakes: Quality Estimation from Monolingual Corpora and Machine Translation Learning Stages

Thierry Etchegoyhen; David Ponce

Abstract

Quality Estimation (QE) of Machine Translation output suffers from the lack of annotated data to train supervised models across domains and language pairs. In this work, we describe a method to generate synthetic QE data based on Neural Machine Translation (NMT) models at different learning stages. Our approach consists in training QE models on the errors produced by different NMT model checkpoints, obtained during the course of model training, under the assumption that gradual learning will induce errors that more closely resemble those produced by NMT models in adverse conditions. We test this approach on English-German and Romanian-English WMT QE test sets, demonstrating that pairing translations from earlier checkpoints with translations of converged models outperforms the use of reference human translations and can achieve competitive results against human-labelled data. We also show that combining post-edited data with our synthetic data yields significant improvements across the board. Our approach thus opens new possibilities for an efficient use of monolingual corpora to generate quality synthetic QE data, thereby mitigating the data bottleneck.
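
The pairing described in the abstract can be illustrated with a minimal sketch. The abstract does not state how quality labels are computed, so the scoring here is an assumption: a TER-like word-level edit rate without shifts, with the converged model's translation acting as pseudo-reference for the earlier checkpoint's translation. The function names (`word_edit_distance`, `pseudo_score`, `build_synthetic_qe`) and the record layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Union


def word_edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pseudo_score(mt: str, pseudo_ref: str) -> float:
    """Assumed label: edit rate against the pseudo-reference, capped at 1.0."""
    hyp, ref = mt.split(), pseudo_ref.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return min(1.0, word_edit_distance(hyp, ref) / len(ref))


def build_synthetic_qe(
    sources: List[str],
    early_translations: List[str],
    converged_translations: List[str],
) -> List[Dict[str, Union[str, float]]]:
    """Pair each early-checkpoint translation with the converged model's output."""
    if not (len(sources) == len(early_translations) == len(converged_translations)):
        raise ValueError("all three lists must have the same length")
    return [
        {"source": src, "mt": mt, "pseudo_ref": ref, "score": pseudo_score(mt, ref)}
        for src, mt, ref in zip(sources, early_translations, converged_translations)
    ]
```

Under these assumptions, the source sentences would come from a monolingual corpus, and each resulting record (source, early translation, score) would serve as one synthetic training example for a supervised QE model.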

Anthology ID: 2023.mtsummit-research.8
Volume: Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
Month: September
Year: 2023
Address: Macau SAR, China
Editors: Masao Utiyama, Rui Wang
Venue: MTSummit
Publisher: Asia-Pacific Association for Machine Translation
Pages: 84–98
URL: https://aclanthology.org/2023.mtsummit-research.8/
Cite (ACL): Thierry Etchegoyhen and David Ponce. 2023. Learning from Past Mistakes: Quality Estimation from Monolingual Corpora and Machine Translation Learning Stages. In Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track, pages 84–98, Macau SAR, China. Asia-Pacific Association for Machine Translation.
Cite (Informal): Learning from Past Mistakes: Quality Estimation from Monolingual Corpora and Machine Translation Learning Stages (Etchegoyhen & Ponce, MTSummit 2023)
PDF: https://aclanthology.org/2023.mtsummit-research.8.pdf
