Levenshtein Training for Word-level Quality Estimation

We propose a novel scheme to use the Levenshtein Transformer to perform the task of word-level quality estimation. A Levenshtein Transformer is a natural fit for this task: trained to perform decoding in an iterative manner, a Levenshtein Transformer can learn to post-edit without explicit supervision. To further minimize the mismatch between the translation task and the word-level QE task, we propose a two-stage transfer learning procedure on both augmented data and human post-editing data. We also propose heuristics to construct reference labels that are compatible with subword-level finetuning and inference. Results on the WMT 2020 QE shared task dataset show that our proposed method has superior data efficiency under the data-constrained setting and competitive performance under the unconstrained setting.


Introduction
Quality estimation (QE) is the task of estimating the quality of a translation without access to a human-generated reference. Most recent advances in quality estimation (Rei et al., 2020; Thompson and Post, 2020a; Ranasinghe et al., 2020, inter alia) focus on estimating the quality of translation at either the corpus or the segment level. However, in practice, the end-users of machine translation (MT) often call for quality signals at a more fine-grained level: the level of individual words in a translation. Such signals are not only useful for more fine-grained triage of translation quality, but also open up the potential for targeted post-processing and faster human post-editing.
To automatically assess the quality of translations, it is natural to consider starting from a machine translation model, which has already acquired the translation knowledge. However, it is not clear how to best transfer the translation knowledge to a word-level quality estimation setting, since autoregressive translation models only see the preceding context of the word they generate. Hence, they are not well-equipped to perform word-level quality assessments or edits on an existing translation, where a translated word has both preceding and succeeding context. Towards this goal, we leverage the Levenshtein Transformer (LevT, Gu et al., 2019), a non-autoregressive neural machine translation (NMT) model trained to generate translations by starting with an empty output sequence and iteratively performing edits on the sequence. Because of this special training and decoding procedure, the model should have already learned to edit an existing translation sentence without supervision from post-edited translations. We then use multi-stage transfer learning to teach the model to perform the actual QE task, first on artificially crafted pseudo post-editing data, then on real human post-edited data.
* Shuoyang Ding had a part-time affiliation with Microsoft at the time of this work.
We show that our method achieves better performance than the currently widely adopted Predictor-Estimator scheme (Kim et al., 2017; Kepler et al., 2019) under the data-constrained setting, while also being competitive with the high-ranking submissions to the WMT 2020 word-level QE shared task (Specia et al., 2020) under the unconstrained setting.
Levenshtein Training for Word-level Quality Estimation

Word-level Quality Estimation
We follow the problem formulation of word-level quality estimation (QE) in the WMT QE shared task (Specia et al., 2020). Given pairs of source-side input sentences and MT outputs, the participants are asked to perform two binary classification tasks: (1) whether each word in the target-side translation is correct or not (translation tag), and (2) whether there are missing words between each pair of output words (gap tag). The reference tags for these predictions are generated by performing human post-editing on the MT outputs and constructing edit alignments with a Translation Error Rate (TER, Snover et al., 2006) computation tool. An example is shown in Figure 1a. Each submission is evaluated by the Matthews Correlation Coefficient (MCC, Matthews, 1975).
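To make the two tag types concrete, the following is a minimal Python sketch that derives translation tags and gap tags from an MT output and its post-edit. It is only an approximation: it assumes that `difflib.SequenceMatcher` from the standard library can stand in for the TER alignment tool, so shift operations are not modelled, and the function name `approximate_qe_tags` is hypothetical.

```python
from difflib import SequenceMatcher


def approximate_qe_tags(mt_words, pe_words):
    """Derive OK/BAD translation tags and gap tags from an MT output and its post-edit.

    Assumption: difflib's matching blocks stand in for the TER alignment,
    so TER shift operations are not modelled.
    """
    word_tags = ["OK"] * len(mt_words)
    # One gap before each MT word, plus one gap after the last word.
    gap_tags = ["OK"] * (len(mt_words) + 1)
    matcher = SequenceMatcher(a=mt_words, b=pe_words, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            # MT words that were changed or removed by the post-editor.
            for i in range(i1, i2):
                word_tags[i] = "BAD"
        elif op == "insert":
            # The post-editor added words at the gap in front of MT position i1.
            gap_tags[i1] = "BAD"
    return word_tags, gap_tags


words, gaps = approximate_qe_tags("the cat sat".split(), "the black cat sits".split())
print(words)  # ['OK', 'OK', 'BAD']
print(gaps)   # ['OK', 'BAD', 'OK', 'OK']
```

Here the inserted word "black" marks the gap before "cat" as BAD, and the substitution of "sat" by "sits" marks that word as BAD.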
The state-of-the-art approach to this task is based on the Predictor-Estimator (PredEst) architecture (Kim et al., 2017; Kepler et al., 2019). At a very high level, the predictor training uses a crosslingual masked language model (MLM) objective, which trains the model to predict a word in the target sentence given the source and both the left and right context on the target side. An estimator is then finetuned from the predictor model to predict word-level QE tags. In recent years, the top-ranking systems also incorporate large-scale pre-trained crosslingual encoders such as XLM-RoBERTa (Conneau et al., 2020), glass-box features (Moura et al., 2020), and pseudo post-editing data augmentation (Lee, 2020).

Levenshtein Transformer
Intuitively, translation knowledge is very beneficial for the word-level QE task. Hence, a natural choice for this task is to start from an NMT model and finetune it to produce word-level quality estimation outputs. However, NMT models have two major limitations that make them unfit for this task: (1) most NMT architectures are trained to perform inference in a left-to-right manner, and are therefore ill-equipped to perform edits on an existing translation output; (2) most NMT architectures do not have a mechanism to predict whether there are words missing at a given location. As the introduction below shows, the Levenshtein Transformer addresses both of these limitations.
The Levenshtein Transformer (LevT, Gu et al., 2019) is a neural network architecture that can iteratively generate sequences in a non-autoregressive manner. Unlike normal autoregressive sequence models that have only one prediction head $A_w$ to predict the next output word, LevT has two extra prediction heads, $A_{del}$ and $A_{ins}$, that predict deletion and insertion operations based on the output sequence from the previous iteration.
For translation generation, at the $k$-th iteration during decoding, with source-side input $x$ and target-side sequence input from the previous iteration $y^{(k-1)}$ of length $J$, suppose the decoder block output is $\{h_0, h_1, \ldots, h_J\}$. The following predictions and edits take place in order: (1) word deletion with $A_{del}$, (2) mask insertion with $A_{ins}$, and (3) word prediction with $A_w$ for the inserted masks. The iterative process continues until $y^{(k)} = y^{(k-1)}$ or a maximum number of iterations is reached.
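The iterative refinement loop described above can be sketched in Python as follows. This is a toy illustration rather than the Fairseq implementation: the three callables `delete_fn`, `insert_fn`, and `fill_fn` are hypothetical stand-ins for the deletion, mask insertion, and word prediction heads, and sentence boundary tokens are omitted for brevity.

```python
MASK = "<mask>"  # hypothetical placeholder symbol


def levt_decode(x, delete_fn, insert_fn, fill_fn, max_iter=10):
    """Toy LevT-style decoding loop starting from an empty output sequence.

    delete_fn(x, y) -> list of bool, True where a token of y should be deleted
    insert_fn(x, y) -> list of int, number of masks to insert in each of len(y)+1 gaps
    fill_fn(x, y)   -> copy of y with every MASK replaced by a predicted word
    """
    y = []
    for _ in range(max_iter):
        prev = list(y)
        # Step 1: word deletion.
        flags = delete_fn(x, y)
        y = [tok for tok, drop in zip(y, flags) if not drop]
        # Step 2: mask insertion into each gap.
        counts = insert_fn(x, y)
        with_masks = []
        for i in range(len(y) + 1):
            with_masks.extend([MASK] * counts[i])
            if i < len(y):
                with_masks.append(y[i])
        # Step 3: word prediction for the inserted masks.
        y = fill_fn(x, with_masks)
        # Stop once an iteration leaves the sequence unchanged.
        if y == prev:
            break
    return y


# Oracle-like toy heads that reconstruct a fixed target in one round of edits.
target = ["a", "b", "c"]
never_delete = lambda x, y: [False] * len(y)
insert_once = lambda x, y: [len(target)] if not y else [0] * (len(y) + 1)
fill_from_target = lambda x, y: [target[i] if t == MASK else t for i, t in enumerate(y)]
print(levt_decode("source", never_delete, insert_once, fill_from_target))  # ['a', 'b', 'c']
```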
For the word-level QE task, we only perform one iteration of the above process: we use the word deletion head $A_{del}$ to predict quality labels for MT words and the mask insertion head $A_{ins}$ to predict quality labels for gaps. We perform neither the word prediction with $A_w$ nor multiple iterations of prediction. Still, one should note the similarity between the function of those prediction heads for translation prediction and for word-level QE, which is our motivation for choosing this specific architecture for word-level QE.
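As a rough illustration of this single-pass use of the two heads, the sketch below maps hypothetical head outputs to QE tags. The input format (a deletion probability per MT token and a predicted mask count per gap), the 0.5 threshold, and the function name `levt_qe_tags` are all assumptions made for the example, not details given in this paper.

```python
def levt_qe_tags(del_probs, ins_counts, threshold=0.5):
    """Map one pass of the deletion and insertion heads to word-level QE tags.

    del_probs:  hypothetical deletion probability for each MT token (length J)
    ins_counts: hypothetical number of masks predicted for each gap (length J + 1)
    """
    if len(ins_counts) != len(del_probs) + 1:
        raise ValueError("expected one more gap than tokens")
    # A token the model would delete is treated as a BAD translation.
    word_tags = ["BAD" if p > threshold else "OK" for p in del_probs]
    # A gap where the model would insert at least one mask is treated as BAD.
    gap_tags = ["BAD" if n > 0 else "OK" for n in ins_counts]
    return word_tags, gap_tags


print(levt_qe_tags([0.1, 0.9], [0, 2, 0]))
# (['OK', 'BAD'], ['OK', 'BAD', 'OK'])
```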
It should also be pointed out that using a Levenshtein Transformer translation model as a pre-trained model is similar in spirit to ELECTRA (Clark et al., 2020), which performs pre-training by learning to detect corrupted tokens generated from a masked language model.

Pre-trained Model
To achieve optimal performance and benefit from multilingualism, we would also like to take advantage of large-scale pre-training. Because there is no pre-trained LevT translation model available, we choose to incorporate M2M-100 (Fan et al., 2020), a large-scale pre-trained multilingual autoregressive transformer model. Recall that the main architectural difference between LevT and a standard transformer model is the two extra prediction heads on LevT. Hence, to adapt it into a LevT model, we first need to add the randomly initialized extra prediction heads of LevT to the pre-trained checkpoint. These randomly initialized prediction heads are then trained with the rest of the model on parallel data in order to adapt the autoregressive translation model to a LevT-style non-autoregressive translation model.

Transfer Learning from Translation to Word-level Quality Estimation
By training for the translation task, LevT has already acquired some initial knowledge of post-editing. However, there is still a mismatch between training and inference for the word-level QE task regarding (1) the target-side context and (2) the edit tags. In terms of target-side context, during training, the target-side context is a noisy version of the real target sentence in the training set; during inference, the target-side context is a translation output from an NMT system. In terms of edit tags, during training, we want the model to predict LCS edit tags that correct the noisy version of the target sentence; during inference, we want the model to predict TER-style (Levenshtein) edit tags that correspond to human post-editing of NMT outputs.
To address this mismatch, we adopt a two-step transfer learning scheme. For both stages of transfer learning, we need translation triplets (source, MT output, post-edited output) to perform finetuning. However, in practice, human post-editing resources are quite scarce. Hence, like some previous work (Lee, 2020), we start by performing transfer learning on synthetic translation triplets, followed by real translation triplets constructed with human post-editing.
Synthetic Data Construction
We explore four different methods for translation triplet synthesis:
• src-mt-ref Take a parallel dataset (src, ref) and translate the source sentence with an MT model (mt).
• bt-rt-tgt Take a target-side monolingual dataset (tgt). Translate the target sentence with a backward MT model (bt) and then translate bt again with a forward MT model, thus creating a round-trip translated output (rt). Use rt as the MT output and the original tgt as the pseudo post-edited output.
• src-mt1-mt2 Take a source-side monolingual dataset (src). Translate the source sentence with a weaker MT model (mt1) and a stronger MT model (mt2). Use mt1 as the MT output and mt2 as the pseudo post-edited output.
• mvppe Take the source side of a parallel dataset (src). Translate the source sentence with a multilingual MT model as the MT output (mt) and build a pseudo post-edited output (pe) with multiview pseudo post-edit (MVPPE) decoding, as described below.
Multiview Pseudo Post-Editing (MVPPE)
Inspired by Thompson and Post (2020b), who used a multilingual translation system as a zero-shot paraphraser, we propose a novel pseudo post-editing method to build a synthetic post-editing dataset from a parallel corpus (src, tgt). The first step is to translate the source side of the parallel corpus with a multilingual translation system to obtain the MT output (mt) in the triplet. We then generate the pseudo post-edited output by ensembling two different views of the same model. These two views are:
• the translation output distribution $p_t(pe \mid src)$, with src as the model input;
• the paraphrase output distribution $p_p(pe \mid tgt)$, with tgt as the model input.
Note that both views create a distribution in the target language space, so they can be ensembled in the same way as standard MT model ensembles, forming an interpolated distribution with $\lambda_t$ and $\lambda_p$ as the interpolation weights. Beam search can likewise be performed on top of the ensemble. The intuition behind this idea is that such an ensemble should create a target sentence that is semantically equivalent to the target side of the parallel corpus while staying as close as possible to the original MT output, imitating the way humans perform the post-editing task.
Compatibility with Subwords
To the best of our knowledge, previous work on word-level quality estimation either builds models that directly output word-level tags (Lee, 2020; Hu et al., 2020; Moura et al., 2020) or uses simple heuristics to re-assign word-level tags to the first subword token.
Since LevT predicts edits on the subword level starting from translation training, we need to: (1) for inference, convert subword-level tags predicted by the model to word-level tags for evaluation, and (2) for both finetuning stages, build subword-level reference tags.
For inference, the conversion can be easily done by heuristics. For finetuning, a naive subword-level tag reference can be built by running TER alignments on MT and post-edited text after subword tokenization. However, a preliminary analysis shows that such a reference introduces a 10% error after being converted back to the word level. Hence, we introduce another heuristic to create subword-level tag references. The high-level idea is to interpolate the word-level and the naive subword-level references to ensure that the interpolated subword-level tag reference can be perfectly converted back to the word-level references. Details of the subword tag conversions can be found in Appendix A, in particular Algorithms 1 and 2.
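The inference-time direction (subword-level tags to word-level tags) can be sketched as below. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: it assumes a word is OK only if all of its subword translation tags and all gap tags strictly inside its span are OK, and that the word-level gap tag is copied from the subword-level gap tag in front of the span. The function name and the span representation are hypothetical.

```python
def subword_to_word_tags(spans, sw_word_tags, sw_gap_tags):
    """Convert subword-level translation and gap tags to word-level tags.

    spans:        (start, end) subword index span per word, end exclusive
    sw_word_tags: one OK/BAD tag per subword
    sw_gap_tags:  one OK/BAD tag per subword gap; sw_gap_tags[i] is the gap
                  before subword i, and the last entry is the final gap
    """
    word_tags, gap_tags = [], []
    for start, end in spans:
        # The gap in front of the word is the gap in front of its first subword.
        gap_tags.append(sw_gap_tags[start])
        # Assumed aggregation rule: any BAD subword or BAD word-internal gap
        # makes the whole word BAD.
        inner = sw_word_tags[start:end] + sw_gap_tags[start + 1:end]
        word_tags.append("OK" if all(t == "OK" for t in inner) else "BAD")
    # The gap after the last word is the final subword-level gap.
    gap_tags.append(sw_gap_tags[-1])
    return word_tags, gap_tags


# Subwords: ["_un", "believ", "able", "_story"], i.e. two words.
spans = [(0, 3), (3, 4)]
sw_words = ["OK", "BAD", "OK", "OK"]
sw_gaps = ["OK", "OK", "OK", "BAD", "OK"]
print(subword_to_word_tags(spans, sw_words, sw_gaps))
# (['BAD', 'OK'], ['OK', 'BAD', 'OK'])
```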

Setup
Our experiments are based on the setup of the WMT 2020 QE shared task (Specia et al., 2020). The results are reported under two settings: the constrained setting and the unconstrained setting. Apart from the official metric MCC, we also report the F1 score of the OK and BAD tags (F1-OK and F1-BAD in the result tables).
In the constrained setting, we focus on the data efficiency of our model and use only the human-labeled dataset, the NMT model, and the parallel data (used to train the NMT model) provided by the shared task, with neither a large-scale pre-trained model nor synthetic data finetuning.
In the unconstrained setting, we additionally use some extra resources we have access to. For en-de, we use the WMT20 en-de parallel data to train the LevT model instead of the smaller parallel data from the shared task used in the constrained setting. For en-zh, we use the same dataset because it is already close to the WMT data scale. We also experiment with the M2M-100-small initialization (Fan et al., 2020, 418M parameters), whose tokenization scheme is incompatible with the shared task setting. For our experiments with M2M-100-small, we proceed by applying sentencepiece on tokenized data during finetuning. We also experiment with synthetic data finetuning with different data synthesis methods on the en-de language pair, while for en-zh, because we do not have access to an extra high-quality MT system, we only experiment with the mvppe method. Details of our data synthesis setup can be found in Appendix B.1.

Results
Table 1 shows results under the data-constrained setting. Even without knowledge distillation (KD) during LevT training, our model already scores much higher than the baseline OpenKiwi system. When training with KD data generated with the shared task NMT system (trained on the same parallel data), the advantage of LevT grows further. This shows that our proposal to build a word-level QE system from LevT translation models has higher data efficiency than the widely adopted PredEst approach used by the OpenKiwi baseline.
Table 2 shows results under the unconstrained setting. We first notice that finetuning with synthetic data before human post-edited data almost always helps. With the transformer-base model on the en-de language pair, we experimented with all four data synthesis methods, and we find that src-mt1-mt2 performs the best, closely followed by mvppe. This might be related to the fact that all LevT models are trained with KD data. Because of this, the model is better positioned to fit synthesized data with MT-like output as pseudo post-edited data, rather than human-generated translations. Also, initializing with the M2M-100 model is helpful despite the tokenization scheme mismatch, although the performance gain is much more modest on the en-zh language pair. This is possibly influenced by the relatively low translation quality of the M2M-100 model on the en-zh language pair, as pointed out by Fan et al. (2020). With all our techniques applied, our best Target MCC result is only slightly behind the winning system on the en-de language pair, while being significantly better than the winning system on the en-zh language pair. Most notably, for the en-zh language pair, even our smallest LevT model is able to beat the state of the art. It should also be pointed out that all of our results are achieved without any model ensemble, and our pre-trained model architecture is just a transformer-big counterpart, while other participating teams deployed larger models.
Table 3: Ablation analysis. All results trained with synthetic data in this table use the src-mt1-mt2 data synthesis method. +lang-adapt stands for adding an extra autoregressive MT training step using the same parallel training data as LevT training, so the M2M-100 model is adapted to translating a specific language pair.
To confirm that each component of our training scheme is necessary, we conducted a comprehensive ablation study on the en-de language pair, shown in Table 3. The upper part of the table demonstrates that LevT training is necessary, which we show by conducting the finetuning directly on the M2M-100-small initialization. Despite the strength of the M2M-100 model as a translation model, there is still a significant performance drop without LevT training, and more so without synthetic finetuning. To rule out the effect of bilingual knowledge introduced with LevT training, we also experimented with continued training of the M2M-100 model (+lang-adapt in Table 3) on the same parallel data used for LevT training, but the performance gap remains. On the other hand, the lower part of the table highlights the effect of various other training techniques, where we use the best system without M2M-100-small initialization as the base. We can conclude that KD is crucial for optimal performance and that finetuning with the heuristic subword-level tag reference is responsible for a small but stable performance improvement.

Conclusion
In this work, we proposed to use the Levenshtein Transformer to substitute the usual MLM-style training in the Predictor-Estimator framework as the initial training step. We also proposed a series of techniques to effectively transfer the translation knowledge to the word-level QE task, including data synthesis, heuristic subword-level references, and incorporating pre-trained translation models. Our results demonstrate superior data efficiency under the data-constrained setting and competitive performance under the unconstrained setting. We also hope this work can inspire further exploration of other uses of the Levenshtein Transformer apart from non-autoregressive translation.

A Details on Edit Tag Conversion to/from Subword-Level
Algorithm 1 shows the tag conversion algorithm from subword-level to word-level (for inference). Algorithm 2 shows the tag conversion algorithm from word-level to subword-level (for finetuning).
Algorithm 1: Conversion of subword-level tags to word-level tags
Input: subword-level token sequence $y^{sw}$, word-level token sequence $y^{w}$, subword-level tag sequence $q^{sw}$
Output: word-level tag sequence $q^{w}$
$q^{w} \leftarrow []$;
for each word $w_k$ in $y^{w}$ do
    find subword index span $(s_k, e_k)$ in $y^{sw}$ that corresponds to $w_k$;
    $q^{sw}_k \leftarrow$ subword-level translation and gap tags within span $(s_k, e_k)$;
    $g^{sw}_{s_k} \leftarrow$ subword-level gap tag before span $(s_k, e_k)$;

For all en-de experiments, we took the first 1 million lines from the Europarl corpus and conducted data synthesis.
• src-mt-ref We translate the source side of the parallel data with the NMT system provided by the shared task organizer.
• bt-rt-tgt Both the back-translation and the round-trip translation are performed with the M2M-100-mid (1.2B) model.
• src-mt1-mt2 We take the source side of the parallel data and translate it with the NMT system from the shared task (weaker system, mt1) and the Facebook winning system for the WMT19 en-de news translation (Ng et al., 2019, stronger system, mt2). We remove all the cases where mt1 and mt2 are identical.
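The src-mt1-mt2 construction with its filtering step can be sketched as follows. The function name and the in-memory list representation are hypothetical; the MT outputs are assumed to have been produced beforehand by the weaker and stronger systems.

```python
def build_src_mt1_mt2(src_lines, mt1_lines, mt2_lines):
    """Build (source, MT output, pseudo post-edit) triplets.

    mt1_lines come from the weaker system and serve as the MT output;
    mt2_lines come from the stronger system and serve as the pseudo post-edit.
    Triplets where both systems produce the same output are dropped.
    """
    triplets = []
    for src, mt1, mt2 in zip(src_lines, mt1_lines, mt2_lines):
        if mt1 == mt2:
            continue  # identical outputs carry no edit signal
        triplets.append((src, mt1, mt2))
    return triplets


print(build_src_mt1_mt2(
    ["ein Haus", "guten Tag"],
    ["a home", "good day"],
    ["a house", "good day"],
))  # [('ein Haus', 'a home', 'a house')]
```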
For en-zh experiments, we take the shared task en-zh parallel data but exclude the UN data for MVPPE data synthesis. The same multilingual translation model is used.
We also experimented with using larger synthetic data for en-de with some synthesis methods, but did not observe a significant performance difference compared to this smaller dataset.

B.2 Misc.
We preprocess our data by first tokenizing with the Moses tokenizer, and then applying subword segmentation. For all the LevT models without M2M-100 initialization, we use the same BPE model and source/target-side vocabulary as the official NMT checkpoint provided by the WMT20 QE shared task. For models with M2M-100 initialization, we use the M2M-100 sentencepiece model.
Under the constrained setting, we use the NMT checkpoint supplied by the shared task to generate the knowledge distillation data for LevT translation training, both for en-de and en-zh. Under the unconstrained setting, for en-de, we use the Facebook winning system for the WMT19 en-de news translation, and for en-zh, we use our own Transformer-base en-zh model trained on WMT17 en-zh data.
All of our implementations are based on the Fairseq toolkit. We use the same hyperparameters for LevT translation model training as the documentation provided in Fairseq. For both synthetic and human post-edited data finetuning, we use the Adam optimizer with a learning rate of 2e-5 with warmup (4000 updates for synthetic finetuning and 2000 for human post-edited data finetuning), and we use the shared task development set to select the best checkpoint.
For all the mvppe experiments, we use $\lambda_t = 2.0$ and $\lambda_p = 1.0$, after doing a grid search over $\lambda_t \in \{1.0, 2.0, 3.0\}$ and $\lambda_p \in \{1.0, 1.2, 1.5\}$ with the goal of matching the TER distribution of human post-editing obtained from the en-de human PE dev data.
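For illustration, one decoding step of the two-view MVPPE ensemble with these weights might look like the sketch below. The exact interpolation formula is not recoverable from this text, so the sketch assumes a weighted linear interpolation of the two next-token distributions followed by renormalization; the dictionary representation and the function name `interpolate_views` are hypothetical.

```python
def interpolate_views(p_translate, p_paraphrase, lam_t=2.0, lam_p=1.0):
    """Combine the translation and paraphrase views for one decoding step.

    p_translate:  next-token distribution from the view with src as input
    p_paraphrase: next-token distribution from the view with tgt as input
    Assumption: weighted linear interpolation, renormalized to sum to one.
    """
    vocab = set(p_translate) | set(p_paraphrase)
    scores = {
        tok: lam_t * p_translate.get(tok, 0.0) + lam_p * p_paraphrase.get(tok, 0.0)
        for tok in vocab
    }
    total = sum(scores.values())
    return {tok: score / total for tok, score in scores.items()}


mixed = interpolate_views({"house": 0.5, "home": 0.5}, {"house": 1.0})
print(max(mixed, key=mixed.get))  # house
```

With a larger weight on the translation view, the ensemble stays close to what the model would produce as MT output, while the paraphrase view pulls it towards the reference.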
All the word-level and subword-level tags we use as the reference for finetuning are computed using our own TER implementation, but we stick to the original reference tags in the test set for evaluation to avoid potential result mismatch. Our evaluations are done with the official evaluation scripts from the shared task. The script computes the Matthews Correlation Coefficient (MCC, Matthews, 1975), which is formulated as follows:
$$S = \frac{TP + FN}{N}, \qquad P = \frac{TP + FP}{N}, \qquad \mathrm{MCC} = \frac{TP/N - S \cdot P}{\sqrt{P \cdot S \cdot (1 - S) \cdot (1 - P)}}$$
where $TP$/$FP$ stands for true/false positives and $TN$/$FN$ stands for true/false negatives. $N = TP + FP + TN + FN$ stands for the number of examples in the dataset. The script also computes F1 scores of the OK and BAD tags.
Table 4 shows some statistics of the data we use in our experiments.
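As a sanity check on the metric, MCC can be computed from gold and predicted tag sequences with a few lines of Python. This is a minimal sketch using the standard confusion-count form of MCC, not the official evaluation script; returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math


def mcc(gold, pred, positive="OK"):
    """Matthews Correlation Coefficient for two equally long tag sequences."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is 0.0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


gold = ["OK", "OK", "BAD", "BAD"]
pred = ["OK", "BAD", "BAD", "BAD"]
print(round(mcc(gold, pred), 4))  # 0.5774
```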

B.3 Extra Results on the Updated Dataset
As of Sep. 2021, there is an updated version of the WMT20 shared task dataset (train/dev/test) with the same MT output but different human post-edited output. For the reference of future work, we provide results on this updated dataset from some of our experiment configurations in Table 5.
Table 4: Statistics of the data used in our experiments.
Dataset                               # sentence pairs/triplets
shared task en-de parallel            3.96M
shared task en-zh parallel            20.3M
WMT20 en-de parallel                  44.2M
en-de src-mt-ref synthetic            945K
en-de src-mt1-mt2 synthetic           808K
en-de bt-rt-tgt synthetic             998K
en-de mvppe synthetic                 993K
en-zh mvppe synthetic                 9.26M
shared task en-de human PE train      7K
shared task en-zh human PE train      7K
shared task en-de human PE dev        1K
shared task en-zh human PE dev        1K
shared task en-de human PE test       1K
shared task en-zh human PE test       1K