Quality Estimation without Human-labeled Data

Quality estimation aims to measure the quality of translated content without access to a reference translation. This is crucial for machine translation systems in real-world scenarios where high-quality translation is needed. While many approaches to quality estimation exist, they are based on supervised machine learning that requires costly human-labelled data. As an alternative, we propose a technique that does not rely on examples from human annotators and instead uses synthetic training data. We train off-the-shelf architectures for supervised quality estimation on our synthetic data and show that the resulting models achieve performance comparable to models trained on human-annotated data, for both sentence- and word-level prediction.


Introduction
The adoption of Machine Translation (MT) has been increasing in areas ranging from government and finance to even social media, due to the substantial improvements achieved by Neural Machine Translation (NMT). However, even with improved performance, translation quality is not consistent across language pairs, domains, and sentences. This can be detrimental to end-users' trust and can cause unintended consequences arising from poor translations. Thus, metrics that assess the quality of translated content are crucial to ensure that only high-quality translations are provided to end-users or downstream tasks. Quality Estimation (QE) metrics aim to predict translation quality without access to reference translations (Blatz et al., 2004; Specia et al., 2009, 2013). State-of-the-art QE techniques have leveraged MT systems and language-specific human annotations as supervision, including direct assessment and post-editing (Kepler et al., 2019a; Fonseca et al., 2019; Sun et al., 2020). However, these annotations are costly and time-consuming to obtain, particularly for word-level QE, where each token needs a label.
Some unsupervised approaches take inspiration from statistical MT (Popović, 2012; Moreau and Vogel, 2012; Etchegoyhen et al., 2018) or apply uncertainty quantification (Fomicheva et al., 2020) for QE; however, their performance is inferior to that of supervised models. In related areas such as automatic post-editing, parallel data has been used to create synthetic post-editing data (Negri et al., 2018), but this technique only compares machine-translated sentences to references. Our approach augments MT errors with additional errors via masked language model rewriting.
We leverage noisy, mined comparable sentences obtained by weakly-supervised techniques (El-Kishky et al., 2020b). These noisy bitexts have been mined from a variety of domains such as Wikipedia (Schwenk et al., 2019a) and large web crawls (Schwenk et al., 2019b; El-Kishky et al., 2020a), and have been shown to be an invaluable source of training data for NMT models. Using this data is crucial to avoid data leakage between a trained NMT model and the data we use to create synthetic QE data. For each source-target sentence pair from the mined data, we apply an MT system to generate a candidate translation of the source sentence. Additionally, we rewrite each target reference sentence using a masked language model to introduce errors. These two approaches generate two alternative "translations" of the source sentence. We then produce pseudo-labels for each token in these translations by edit-distance alignment to the original reference sentence. This results in each translated word being pseudo-labelled as correct or incorrect, which constitutes our synthetic word-level QE training data. Analogously, sentence-level training data is derived as the proportion of incorrect words per sentence. Our main contributions are: (i) we explore a simple technique to effectively generate synthetic data for QE that supports both word-level and sentence-level estimation; (ii) we demonstrate that off-the-shelf models trained on our synthetic data perform comparably to the same models trained on human-annotated data.

QE Task Description
Word-level QE has been mainly framed as the task of predicting which words in the translation need to be post-edited. As such, word-level QE aims to assign a tag for each word and gap between words in a machine-generated translation as correct, i.e., the word does not need editing, or incorrect, i.e., the words should be substituted, deleted, or inserted (tags for gaps) (Specia et al., 2020).
For word-level, we denote the tag of each word in a translation as m t ∈ {OK, BAD}, where t ∈ [1, T ] and T is the length of the translation. Also, we denote the tag of each gap between two words (including the beginning and the end) as g t ∈ {OK, BAD}, where t ∈ [1, T + 1].
In traditional QE, data is collected by first translating source sentences using an MT model. Second, experts post-edit these translations. Third, the post-edits and machine translations are aligned so as to induce the minimum edit distance between their tokens. Finally, each m t is labelled BAD if it should be deleted or substituted, and each g t is labelled BAD if at least one word should be inserted there. Sentence-level QE labels can be generated by computing the Human-targeted Translation Error Rate (HTER) (Snover and Brent, 2001; Snover et al., 2006), which is the minimum number of edit operations needed to fix the translation, divided by the number of its tokens. We explore the possibility of skipping the costly human post-editing process by proposing a data synthesis pipeline, which we then test on human-labelled data.
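As a concrete illustration, the HTER label can be computed as a token-level Levenshtein distance divided by the translation length. The sketch below is a minimal stand-in, not the exact TER implementation used in the shared tasks (which also handles shifts and case):

```python
def edit_ops(hyp, ref):
    """Minimum number of token-level insertions, deletions, and
    substitutions needed to turn hyp into ref (Levenshtein distance)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def hter(mt_tokens, pe_tokens):
    """HTER as defined above: edit operations needed to fix the
    translation, divided by the number of its tokens."""
    return edit_ops(mt_tokens, pe_tokens) / len(mt_tokens)
```

For example, `hter("the cat sat".split(), "the cat sat down".split())` requires one insertion over three translation tokens, giving a score of 1/3.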

Approach to Data Synthesis
As depicted in Figure 1, we synthesize data from mined Wikipedia datasets, where each example consists of a (source, target) sentence pair. We create candidate translations of source sentences in two ways: For the first approach, we apply the NMT model to translate each source sentence. For the second approach, we rewrite each reference target sentence using a masked language model (MLM), as shown in the MLM Rewrites block in Figure 1. The two approaches create two forms of translations. Then, by treating target sentences as if they were post-edited data (pseudo post-edits), we identify errors in each candidate translation by looking at the insertions, deletions, and substitutions between the references and generated translations.
Neural Machine Translation. For the first approach to generating synthetic data, we use a pretrained NMT model to create translations. The NMT model is the same model that was used to generate translations in the supervised data; the architecture is a standard transformer as used in Vaswani et al. (2017) and Ott et al. (2019). The process of creating synthetic QE data first involves translating each source sentence using this model and taking the output as a translation, which is later used to generate the synthetic labels. When decoding, we apply a beam size of 5, following the NMT models available in Fomicheva et al. (2020), to generate a candidate translation. Next, we take the mined reference target sentence and treat it as a pseudo post-edit of the machine translation. We then compute the edit distance between the MT outputs and the pseudo post-edits. The resulting edit operations yield the pseudo tags, which consist of word tags m t and gap tags g t . This process is illustrated in Algorithm 1.
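The alignment step can be approximated with a standard sequence aligner. The sketch below uses Python's difflib, which computes a similar but not necessarily minimum-edit alignment, so treat it as illustrative of the pseudo-labelling idea rather than a reproduction of Algorithm 1:

```python
from difflib import SequenceMatcher

def pseudo_labels(mt_tokens, pe_tokens):
    """Tag each MT word (m_t) and each gap (g_t) as OK/BAD by aligning
    the translation against the pseudo post-edit (the mined reference)."""
    word_tags = ["OK"] * len(mt_tokens)
    gap_tags = ["OK"] * (len(mt_tokens) + 1)
    ops = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False).get_opcodes()
    for op, i1, i2, j1, j2 in ops:
        if op in ("replace", "delete"):
            # these MT words would need to be substituted or deleted
            for i in range(i1, i2):
                word_tags[i] = "BAD"
        elif op == "insert":
            # reference words are missing here: this gap needs an insertion
            gap_tags[i1] = "BAD"
    return word_tags, gap_tags
```

The sentence-level training signal then follows directly as the proportion of BAD word tags in the sentence.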

Rewriting by Masked Language Model (MLM).
Our second approach to creating synthetic QE training data is to introduce errors by rewriting target sentences. We inject these errors by performing text-infilling (Zhu et al., 2019; Lewis et al., 2019). As displayed in Figure 2, we perform text-infilling by applying three operations: (1) randomly substituting a proportion of tokens with a <mask> token, (2) deleting consecutive tokens, and (3) inserting additional consecutive <mask> tokens. We determine the lengths of consecutive deletions and insertions by drawing them from a Poisson distribution with mean λ = 1, shifted by 1 to avoid zero-length insertions or deletions. We then use a pre-trained masked language model (MLM), supplied with the source sentence as input, to infill the masked reference sentence. We select multilingual BERT (Devlin et al., 2019) as it is pre-trained on Wikipedia, which is in-domain with respect to our test set. We present the target-rewriting approach in detail in Algorithm 2.
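The three noising operations can be sketched as follows, using only the standard library. The `mask_ratio` value and the helper names are our own assumptions for illustration (the paper does not specify the masking proportion), and the MLM infilling step itself is omitted:

```python
import math
import random

MASK = "<mask>"

def shifted_poisson(rng, lam=1.0):
    """Sample from Poisson(lam) + 1, so span lengths are at least 1
    (Knuth's multiplication method for the Poisson draw)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return (k - 1) + 1  # Poisson sample, shifted by 1

def noise_target(tokens, rng, mask_ratio=0.15):
    """Apply the three text-infilling operations to a reference sentence."""
    # (1) randomly substitute a proportion of tokens with <mask>
    out = [MASK if rng.random() < mask_ratio else t for t in tokens]
    # (2) delete a run of consecutive tokens
    start = rng.randrange(len(out))
    del out[start:start + shifted_poisson(rng)]
    # (3) insert a run of consecutive <mask> tokens
    pos = rng.randrange(len(out) + 1)
    out[pos:pos] = [MASK] * shifted_poisson(rng)
    return out
```

The masked output would then be passed, together with the source sentence, to the MLM, which replaces each `<mask>` with a predicted token to produce the rewritten "translation".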
In Section 4, we will investigate the performance of QE models trained on NMT-based synthetic data, rewriter-based synthetic data, and a two-model ensemble where each model is trained on a different form of synthetic data.

Experiments and Results
We focus on data released by the WMT20 shared task on QE for predicting post-editing effort, which includes English-to-German (En-De) and English-to-Chinese (En-Zh) word-level data and their sentence-level HTER (Specia et al., 2020). 1 As the human-annotated data is sampled from Wikipedia, we choose to synthesize data from WikiMatrix (Schwenk et al., 2019a), which consists of mined Wikipedia parallel data, from which we sample pairs with a LASER (Artetxe and Schwenk, 2019) margin score above a threshold of 1.06 to ensure high-quality pairs. We note that the original QE data is not a subset of WikiMatrix. The German and Chinese text were tokenized using the Moses 2 and Jieba 3 tokenizers, respectively. We list the statistics of the filtered WikiMatrix data as well as our resulting synthetic data in Table 1.
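The margin-score filtering step amounts to a simple threshold over the mined pairs. In this sketch, the tuple layout `(margin_score, source, target)` is our own assumption about how the mined data is stored:

```python
def filter_pairs(scored_pairs, threshold=1.06):
    """Keep mined (margin_score, source, target) pairs whose LASER
    margin score meets the quality threshold."""
    return [(src, tgt) for score, src, tgt in scored_pairs if score >= threshold]
```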
For the off-the-shelf QE model, we choose the multi-task predictor-estimator model (Kim et al., 2017) implemented in OpenKiwi v0.1.3 (Kepler et al., 2019b). This was the top-performing architecture for QE at WMT19 (Kepler et al., 2019a; Fonseca et al., 2019). We train the predictor on parallel MT data provided by the WMT20 QE shared task. The predictor produces contextualized word representations; the estimator passes these features through a 2-layer, 125-dimension bidirectional LSTM (biLSTM) and then feeds the outputs into a 1-layer linear word-level classifier. The first output of the biLSTM is also fed into a multi-layer perceptron to predict a sentence-level score. For multi-task learning, we train the model with both word- and sentence-level data.
For a fair comparison, we take the pre-trained predictor provided by the WMT20 QE shared task, fine-tune the whole model on the human-annotated data, and compare results to those obtained when fine-tuning on our synthetic data. We test by comparing model predictions against held-out human-annotated QE labels at the word and sentence level. At the word level, we measure QE performance with the Matthews Correlation Coefficient (MCC) (Matthews, 1975) (main metric), as well as F1 scores for BAD and OK tags. At the sentence level, we measure Pearson's correlation (Benesty et al., 2009), mean absolute error (MAE), and root-mean-square error (RMSE).

As shown in Table 2, for word-level QE, 4 the model trained on synthetic data generated from NMT translations performs comparably to the same model trained on the original 7k human-annotated post-edits. This suggests that having human annotators post-edit each translation to create training data may be unnecessary, and that using reference sentences is good enough. The model trained on the MLM-rewriting synthetic data generally under-performs the NMT-generated data on MCC. However, we note that it performs better on F1 for OK tags. Therefore, we also ensemble the two models trained on each set of synthetic data through a linear combination. This yields comparable or better performance than the model trained on human-annotated data according to the main metric, MCC.
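For reference, the main word-level metric can be computed directly from tag counts. A minimal sketch for binary OK/BAD tag sequences, treating BAD as the positive class:

```python
def mcc(gold, pred):
    """Matthews Correlation Coefficient over OK/BAD tag sequences."""
    tp = sum(g == "BAD" and p == "BAD" for g, p in zip(gold, pred))
    tn = sum(g == "OK" and p == "OK" for g, p in zip(gold, pred))
    fp = sum(g == "OK" and p == "BAD" for g, p in zip(gold, pred))
    fn = sum(g == "BAD" and p == "OK" for g, p in zip(gold, pred))
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    # conventionally 0 when any marginal count is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 (total disagreement) through 0 (chance-level) to 1 (perfect prediction), which makes it more informative than accuracy when BAD tags are rare.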

In Table 3, we compare the models trained on human-annotated data to those trained on our synthetic data for predicting sentence-level HTER scores. Again, our synthetic data from NMT-generated translations outperforms the MLM-rewriting data. Both under-perform models trained on human-annotated data, but when combined they improve significantly and even outperform the human-annotated data for En-Zh. This once again suggests that the two forms of synthetic data are complementary and provide valuable signals for QE.

Table 4: Ablation study of synthetic data amounts.
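The two-model combination reported above can be as simple as a convex combination of the predicted scores. The mixing weight `alpha` below is a hypothetical value, presumably tuned on held-out data rather than specified in the paper:

```python
def ensemble_scores(nmt_scores, mlm_scores, alpha=0.5):
    """Linearly combine per-sentence (or per-token) scores from the two
    models trained on the two forms of synthetic data."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(nmt_scores, mlm_scores)]
```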

Discussion
In this section, we further analyze how the quantity of synthetic data impacts performance, and what types of errors are represented in each of the MLM and NMT portions of the synthetic data.

Amount of Synthetic Data
As previously observed, the amount of synthetic data is orders of magnitude larger than the amount of human-annotated data. This raises the question: how much benefit do we get from smaller amounts of synthetic data? To analyze how the quantity of synthetic data affects QE performance, we conduct an ablation study for word-level QE. 5 As shown in Table 4, using only about half of the synthetic data generated (200k for En-De and 100k for En-Zh) is comparable to using the full generated set. While this suggests an upper bound on the performance attainable by training on synthetic data, the ablation also suggests that the synthesis process can yield good performance with even a small amount of synthetic data.

Error Analysis
In addition to performance, we posit that there are essential differences between MLM and NMT synthetic data. To test this, bilingual volunteers qualitatively analyzed the types of mistakes made by MLM rewrites versus traditional NMT translations. The major reported differences in error types are:
1. Deletions from NMT translations appear more natural and do not destroy sentence fluency. Deletions in MLM rewrites, in contrast, are more destructive (e.g., "new york restaurants" becoming "new restaurants" changes the semantics).
2. Most incorrect insertions or deletions from NMT translations are due to re-ordering.

5 The ablation study is only trained on word-level data.
In summary, NMT translations and MLM rewrites appear to generate different types of errors: the former leads to more subtle errors, while the latter often introduces more catastrophic errors. Since a high-quality QE model should be able to detect both types of errors, ensembling models trained on these two forms of synthetic data is indeed expected to outperform using only one form of synthetic data.

Conclusions and Future Work
In this work, we devise a technique for building word- and sentence-level QE models by creating synthetic training data. By training an off-the-shelf model on our synthetic data, we achieve performance comparable to, and often better than, training on human-annotated data. This technique for data synthesis can be invaluable when human annotation is difficult to come by, for example in low-resource scenarios.
This work can be extended in various ways. While we investigate the scenario of using solely synthetic data, further work can study the effects of augmenting human-labelled data with synthetic data. Future work can also analyze the efficacy of this technique for low-resource language pairs, where such human annotation is difficult to obtain. Additionally, instead of a simple MLM rewriter, adversarial training to generate and detect errors could provide more realistic synthetic data.