Self-Supervised Quality Estimation for Machine Translation

Quality estimation (QE) of machine translation (MT) aims to evaluate the quality of machine-translated sentences without references and is important in practical applications of MT. Training QE models requires massive parallel data with hand-crafted quality annotations, which are time-consuming and labor-intensive to obtain. To address the absence of annotated training data, previous studies have attempted to develop unsupervised QE methods. However, very few of them can be applied to both sentence- and word-level QE tasks, and they may suffer from noise in the synthetic data. To reduce the negative impact of such noise, we propose a self-supervised method for both sentence- and word-level QE, which performs quality estimation by recovering masked target words. Experimental results show that our method outperforms previous unsupervised methods on several QE tasks in different language pairs and domains.


Introduction
In recent years, neural approaches (Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017) have significantly improved the quality of machine translation (MT). Despite their apparent success, neural machine translation (NMT) systems still inevitably generate erroneous translations in real-world scenarios (Bentivogli et al., 2016; Castilho et al., 2017), especially for low-resource language pairs. Therefore, the evaluation of translation quality plays an important role in many applications of MT. For example, in computer-assisted translation (CAT) (Barrachina et al., 2009), the evaluation of translation quality can significantly reduce human effort for post-editing (Specia, 2011).
Quality estimation (QE) of MT aims to evaluate the quality of the outputs of an MT system without references. Training QE models often requires massive parallel data, which are composed of authentic source sentences and machine-translated target sentences with quality annotations produced by manual evaluation or human post-editing (Moura et al., 2020;Hu et al., 2020;Ranasinghe et al., 2020). As obtaining such annotated data is time-consuming and labor-intensive in practice, unsupervised QE has received increasing attention (Popović, 2012;Etchegoyhen et al., 2018;Fomicheva et al., 2020;Tuan et al., 2021).
Most of the aforementioned methods use various features to conduct unsupervised QE (Popović, 2012; Etchegoyhen et al., 2018; Fomicheva et al., 2020). These methods are simple and effective but limited to sentence-level tasks. Compared with sentence-level QE, word-level QE can provide more fine-grained quality information (Fan et al., 2019), and thus it can better assist post-editing in CAT when combined with sentence-level QE. Recently, Tuan et al. (2021) use synthetic data to train unsupervised QE models, which can be applied to both sentence- and word-level tasks. Specifically, they construct synthetic target sentences using MT models or masked language models (MLMs) and generate quality annotations by comparing the synthetic target sentences with the references using the TER tool (Snover et al., 2005).
However, the method proposed by Tuan et al. (2021) still has two major weaknesses. First, synthetic data contain biased noise and may negatively affect model performance. On the one hand, the differences between MT outputs and references are usually larger than the differences between MT outputs and their post-editions (Snover et al., 2005), and thus more errors will be annotated in the synthetic data. On the other hand, sentences that are rewritten by MLMs often contain more catastrophic errors, which rarely appear in machine-translated sentences (Tuan et al., 2021). Second, the training process of this method is complex, since it requires extra models to generate synthetic data.

Figure 1: Our method performs quality estimation by checking whether the masked target words can be successfully recovered using the source sentence and the observed target words. Masked words are highlighted by shading.
In this work, we propose a self-supervised QE method to overcome the aforementioned weaknesses. The basic idea is to mask some target words in the machine-translated sentence and use the source sentence and the observed target words to recover the masked ones. Intuitively, a target word is correct if it can be recovered from its surrounding context. For example, in Figure 1, since the masked target word "Er" can be successfully recovered while another masked target word "Lieder" is not identical to the recovered word "Musik", we identify "Er" as correct and "Lieder" as erroneous. Based on this intuition, our method estimates the translation quality of the target words by checking whether they can be correctly recovered. Finally, we obtain the sentence-level quality score by summarizing the word-level predictions. Our method is not affected by synthetic noise and is easier to train, since it involves no synthetic data. Experimental results show that our self-supervised method outperforms previous unsupervised methods.

Quality Estimation for Machine Translation
Quality estimation for machine translation aims to evaluate the quality of machine-translated sentences without using references. Currently, there are different types of QE tasks, including sentence-, word-, phrase- and document-level QE. In this work, we mainly focus on sentence- and word-level QE. Generally, both sentence- and word-level quality annotations are generated by comparing the machine-translated target sentences with their post-editions using the TER tool (Snover et al., 2005). For word-level annotations, each target word is annotated with "OK" or "BAD", where "OK" denotes correct words and "BAD" denotes erroneous words. For sentence-level annotations, target sentences are annotated with Human Translation Error Rate (HTER) scores, which measure the percentage of human edits needed to correct the MT output:

HTER = (# of edits) / (# of words in the post-edition).  (1)

According to the equation above, sentence-level quality scores are calculated based on the word-level errors in the target sentences. In other words, HTER scores can be approximately regarded as a summary of word-level quality tags. Table 1 shows an example of QE data.
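As a rough illustration of Eq. (1), HTER can be approximated with a word-level edit distance between the MT output and its post-edition. The sketch below is hypothetical and simplified: the real TER tool additionally counts block shifts as single edits, which plain Levenshtein distance does not.

```python
def hter(mt_words, pe_words):
    """Approximate HTER: word-level edit distance (insertions, deletions,
    substitutions) divided by the length of the post-edition.
    Note: the real TER tool also counts block shifts; this sketch does not."""
    m, n = len(mt_words), len(pe_words)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if mt_words[i - 1] == pe_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n] / n

mt = "Er mag Lieder".split()
pe = "Er mag Musik".split()
print(hter(mt, pe))  # 1 edit / 3 words ≈ 0.333
```

One substitution ("Lieder" → "Musik") over a three-word post-edition yields an HTER of about 0.33.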

Self-Supervised Quality Estimation
Our self-supervised QE method is implemented based on the architecture of MLM (Devlin et al., 2019) (Section 3.1). We train the model to recover masked target words in authentic parallel corpora and estimate translation quality by recovering the masked target words in the machine-translated sentence (Section 3.2). In addition, Monte Carlo (MC) Dropout (Gal and Ghahramani, 2016) is utilized to better calculate quality scores (Section 3.3).

Model Architecture
As shown in Figure 2, our self-supervised QE model is built on top of the masked language model (Devlin et al., 2019). We use the concatenation of a source sentence and a partially masked target sentence as the input sequence and then use a Transformer encoder to recover the masked tokens. Formally, for any parallel sentence pair ⟨x, y⟩, we randomly divide y into two parts, y_m and y_o, and mask all tokens in y_m. Then, we concatenate x and the partially masked version of y as the input sequence. Suppose the length of the target sentence is T: y = y_1, ..., y_t, ..., y_T. If the t-th target token y_t ∈ y_m is masked, we use the model with parameters θ to calculate the probability of y_t conditioned on x and y_o (i.e., P(y_t | x, y_o; θ)).
Similar to Devlin et al. (2019), we mask 15% of the tokens in the target sentence. However, since the vocabulary of BERT is built with WordPiece (Wu et al., 2016), words in the input sequence may be divided into multiple subwords. Therefore, when one subword of a multi-subword word is masked, the model may easily recover it from the remaining subwords without leveraging the source sentence. This is undesirable because the source sentence should play an important role in determining whether the token is correctly translated. To address this problem, we adopt a masking strategy called Whole Word Masking (WWM) (Cui et al., 2019), which prevents the model from recovering a masked subword using only the remaining subwords of the same word.
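A minimal sketch of how WWM-style mask selection could work over a WordPiece sequence follows. The `whole_word_mask` helper and the `[MASK]` string handling are illustrative assumptions, not the paper's actual implementation; WordPiece continuation subwords are identified by the `##` prefix.

```python
import random

def whole_word_mask(subwords, mask_ratio=0.15, rng=random):
    """Sketch of Whole Word Masking: subwords prefixed with '##' belong
    to the preceding word, and a word is always masked as a unit."""
    # Group subword indices into whole words.
    words, current = [], []
    for i, tok in enumerate(subwords):
        if tok.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    if current:
        words.append(current)
    # Mask whole words until roughly mask_ratio of the subwords are masked.
    n_to_mask = max(1, round(mask_ratio * len(subwords)))
    groups = words[:]
    rng.shuffle(groups)
    masked = set()
    for group in groups:
        if len(masked) >= n_to_mask:
            break
        masked.update(group)
    return ["[MASK]" if i in masked else tok for i, tok in enumerate(subwords)]

masked_seq = whole_word_mask(["Er", "mag", "Lie", "##der"], 0.25, random.Random(0))
print(masked_seq)  # one whole word (possibly both "Lie" and "##der") is masked
```

The key invariant is that "Lie" and "##der" are either both masked or both visible, so the model cannot reconstruct "Lieder" from its own remaining subword.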

Training and Inference
As shown in Figure 3(a), our model is trained to recover the masked tokens on the target side of authentic sentence pairs. Formally, given an unlabeled training dataset D = {⟨x^(s), y^(s)⟩}_{s=1}^{S} consisting of authentic sentence pairs, we divide each target sentence y^(s) in D into the masked part y_m^(s) and the observed part y_o^(s), and train the model to maximize the log-likelihood of the masked tokens:

L(θ) = Σ_{s=1}^{S} Σ_{y_t ∈ y_m^(s)} log P(y_t | x^(s), y_o^(s); θ).  (2)

During the training process, the model θ thus learns to recover masked target tokens in authentic parallel corpora. After training, we use the model to perform quality estimation by checking whether the masked target tokens can be successfully recovered. Specifically, as shown in Figure 3(b), for each masked token, we use the model to calculate the probability of successful recovery conditioned on the source sentence and the observed target tokens. Obviously, a token is difficult to recover if this probability is low; in this case, we consider the token erroneous. Otherwise, the token tends to be correct.
Formally, suppose we have a sentence pair ⟨x, ŷ⟩ consisting of an authentic source sentence x and a machine-translated target sentence ŷ. When estimating the translation quality of the t-th token ŷ_t in ŷ, our method randomly divides the target sequence ŷ into the observed part ŷ_o and the masked part ŷ_m such that ŷ_t ∈ ŷ_m. Then, we use the model to calculate the conditional probability of ŷ_t, which serves as its quality score:

score(ŷ_t) = P(ŷ_t | x, ŷ_o; θ).  (3)

As mentioned in Section 3.1, some of the input words may consist of multiple subwords. In this case, we use ŷ_t to denote a subword in the target sequence and ŵ to denote the word to which ŷ_t belongs. We calculate the quality score of a word with multiple subwords by simply averaging the quality scores of its subwords:

score(ŵ) = (1 / |ŵ|) Σ_{ŷ_t ∈ ŵ} score(ŷ_t),  (4)

where |ŵ| denotes the number of subwords in ŵ. If a threshold τ ∈ (0, 1) is given, a real-valued quality score can be mapped to a quality tag:

tag(ŵ) = "OK" if score(ŵ) ≥ τ, and "BAD" otherwise.  (5)

Finally, we calculate the sentence-level quality score by averaging the quality scores over all target words:

score(ŷ) = −(1 / |ŷ|) Σ_{ŵ ∈ ŷ} score(ŵ),  (6)

where |ŷ| denotes the number of words in ŷ. Note that we add a negative sign to the equation above since HTER scores are negatively correlated with translation quality.
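To make the aggregation concrete, here is a small Python sketch of the word- and sentence-level score computation described above. The probability values and the subword grouping are invented for illustration; the function names are hypothetical.

```python
def word_scores(subword_probs, word_groups):
    """Quality score of a word = mean recovery probability of its subwords."""
    return [sum(subword_probs[i] for i in g) / len(g) for g in word_groups]

def word_tags(scores, tau):
    """Map real-valued word scores to "OK"/"BAD" tags with threshold tau."""
    return ["OK" if s >= tau else "BAD" for s in scores]

def sentence_score(scores):
    """Negated mean of the word scores: HTER is negatively correlated
    with quality, hence the sign flip."""
    return -sum(scores) / len(scores)

# Hypothetical recovery probabilities for the subwords "Er mag Lie ##der":
probs = [0.92, 0.85, 0.10, 0.14]
groups = [[0], [1], [2, 3]]            # "Lieder" spans two subwords
scores = word_scores(probs, groups)    # approximately [0.92, 0.85, 0.12]
print(word_tags(scores, tau=0.5))      # ['OK', 'OK', 'BAD']
print(sentence_score(scores))          # ≈ -0.63
```

With a threshold of 0.5, the hard-to-recover word "Lieder" is tagged "BAD", and the sentence score is the negated mean of the three word scores.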

Calculating Quality Scores with Monte Carlo Dropout
In this work, we also utilize Monte Carlo (MC) Dropout (Gal and Ghahramani, 2016), which has proven conducive to the performance of unsupervised QE models (Fomicheva et al., 2020). Instead of directly calculating token-level quality scores using Eq. (3), we sample multiple models by perturbing the original model parameters with dropout (Srivastava et al., 2014) and use these models to calculate the expectation of the conditional probabilities as the quality scores. Specifically, in our method, each estimation only yields probabilities for the masked target words. Therefore, if we need N probability samples for each target token, we sample N′ > N different models and conduct N′ different estimations using these models, such that each target word is masked exactly N times among the N′ estimations. We thus obtain N samples for each target token and calculate the quality score by averaging these samples. For the details of this process, please refer to Appendix A.1.
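One possible way to schedule the maskings is sketched below. The round-robin assignment is an assumption for illustration (the paper does not specify the exact assignment); any schedule in which each token is masked in exactly N of the N′ estimations would serve.

```python
from collections import Counter

def masking_schedule(T, N, N_prime):
    """Assign each of T target tokens to exactly N of N_prime estimations,
    so each token's recovery probability is sampled N times in total.
    Round-robin keeps every estimation's mask set roughly the same size:
    about T * N / N_prime tokens, i.e. ~15% when N / N_prime = 0.15."""
    assert N < N_prime
    schedule = [[] for _ in range(N_prime)]
    slot = 0
    for t in range(T):
        for _ in range(N):  # mask token t in N consecutive estimations
            schedule[slot % N_prime].append(t)
            slot += 1
    return schedule

sched = masking_schedule(T=20, N=6, N_prime=40)
counts = Counter(t for est in sched for t in est)
print(all(c == 6 for c in counts.values()))  # True
print(len(sched[0]))                         # 3 (= 15% of 20 tokens)
```

At inference, each of the N′ estimations would run the dropout-perturbed model once over its mask set, and the N probabilities collected per token would be averaged into the final score.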

Setup Data and Preprocessing
We mainly conducted experiments on the WMT 2019 QE tasks, which consist of tasks in two different language pairs (En-De and En-Ru). Both tasks are in the IT domain. Since our experiments were conducted in an unsupervised setting, we used parallel corpora without quality annotations as training data. Specifically, for En-De, we used in-domain parallel data from various sources, including the training data from the WMT 2016 IT domain translation task, the WMT 2017 QE task, and the WMT 2018 APE task, as well as the OpenOffice and KDE4 corpora available in OPUS (Tiedemann, 2012). For En-Ru, we used the in-domain parallel data collected by OPUS, including ada83, GNOME, KDE4, OpenOffice, PHP and Ubuntu.
To further validate our method's performance in different domains, we also conducted experiments on the WMT 2018 En-Lv QE task, which is in the biomedical domain. We used the EMEA corpus (which is also available in OPUS) as training data.
Sentences were tokenized and truecased using the scripts provided by Moses (Koehn et al., 2007). We also deduplicated the sentences in the training datasets. Table 2 shows the statistics of these datasets.

Baselines
We mainly compared our method with SyntheticQE (Tuan et al., 2021), which uses synthetic data to train unsupervised QE models for both sentence- and word-level tasks. This baseline has three variants: 1. SyntheticQE-MT: The target side of the synthetic data is produced using MT models.
2. SyntheticQE-MLM: The target side of the synthetic data is produced using MLMs.
3. SyntheticQE-MT+MLM: An ensemble of the aforementioned two models.
To further validate the performance of our method, we also compared it with the following unsupervised sentence-level baselines: 1. uMQE (Etchegoyhen et al., 2018): A method based on lexical translation tables and statistical language models.
2. BERTScore: A method based on similarity scores of contextual BERT embeddings.
3. BERTScore++: A variant of BERTScore that also uses word alignments and MLMs.

Evaluation
We evaluated the performance of our method and the baselines using the standard metrics of the WMT QE shared tasks. Specifically, we used Pearson's correlation for sentence-level tasks and the multiplication of the F1-scores for the "OK" and "BAD" classes for word-level tasks.
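For reference, the two metrics can be computed as follows. This is a plain-Python sketch, not the official WMT evaluation scripts.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation, used for sentence-level evaluation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def f1_multi(gold, pred):
    """F1-OK multiplied by F1-BAD, used for word-level evaluation."""
    def f1(label):
        tp = sum(g == p == label for g, p in zip(gold, pred))
        fp = sum(p == label != g for g, p in zip(gold, pred))
        fn = sum(g == label != p for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1("OK") * f1("BAD")

print(f1_multi(["OK", "BAD"], ["OK", "BAD"]))  # 1.0
```

Multiplying the per-class F1-scores penalizes a model that predicts only one class, since the F1 of the missing class is then zero.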

Implementation Details
We implemented our method on top of the Transformers library (Wolf et al., 2020). We trained our model by fine-tuning the multilingual BERT (Devlin et al., 2019). We used the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999 and ε = 10^-8 to optimize model parameters. During training, we set the batch size to 128, the maximum sequence length to 256, the number of training steps to 100,000, the learning rate to 5×10^-5 and the dropout rate to 0.1. We evaluated our model every 1,000 steps and chose the model with the best performance on the development set for inference. For the MC Dropout process, we used the same dropout rate as during training and set N to 6. Since each estimation masks about 15% of the words, we set N′ = N / 15% = 40. We tuned the threshold τ on the development set to maximize word-level performance. For ensemble models, we simply averaged the quality scores given by the two different models (and then obtained the word-level tags based on the threshold). For the implementation details of the baselines, please refer to Appendix A.3.

Results
We first compared our self-supervised QE method with different variants of SyntheticQE (Tuan et al., 2021) on the WMT 2019 sentence- and word-level QE tasks. The experimental results are shown in Table 3. For single models, the baseline SyntheticQE-MT outperforms SyntheticQE-MLM except on the En-Ru sentence-level task. Our single model consistently outperforms both baselines on both sentence- and word-level tasks in the two language pairs. Additionally, our single model achieves competitive or better performance compared to the highly complex ensemble model SyntheticQE-MT+MLM.

To further validate whether our method can generalize across different domains, we conducted experiments on the WMT 2018 En-Lv task, which is in the biomedical domain. As shown in Table 4, our single model outperforms both SyntheticQE-MT and SyntheticQE-MLM on both sentence- and word-level tasks, which confirms that our method generalizes well across domains.
We also compared our method with other unsupervised sentence-level methods. As shown in Table 5, our method also outperforms other unsupervised methods on sentence-level tasks.

Table 6 (case study example). Source: "switch between the snapshots to find the settings you like best ."

Further Comparison with SyntheticQE
To analyze the advantages of our method, we conducted further analysis on the WMT 2019 En-De word-level development set and plotted precision-recall curves for the "BAD" class by setting different thresholds for SyntheticQE and our method. As shown in Figure 4, between the two baseline systems, the precision of SyntheticQE-MT is relatively low when the recall is below 0.2, and the precision of SyntheticQE-MLM is relatively low when the recall is above 0.2. Compared with the baselines, our method reaches a relatively high precision whether the recall is low or high. In SyntheticQE-MT, the target side of the synthetic data is produced by MT models, and thus more tokens may be labeled "BAD" in the synthetic data than in the authentic data, since references are less similar to machine-translated sentences than post-editions are (Snover et al., 2005). In other words, some "BAD" labels in the synthetic data do not mark erroneous target words but merely words that differ from the expressions in the references. These two types of "BAD" labels cannot be reliably distinguished in the synthetic data, which may harm the model's ability to detect real errors and finally leads to lower precision when the recall is low.
In SyntheticQE-MLM, the target side of the synthetic data is produced by MLMs, and thus more catastrophic errors appear in synthetic target sentences than in machine-translated sentences (Tuan et al., 2021). In this case, the model mainly focuses on detecting rare catastrophic errors in the target sentences but is incapable of detecting common subtle errors. Therefore, SyntheticQE-MLM reaches a relatively high precision when the recall is low but a relatively low precision when the recall is high. By contrast, our self-supervised QE method does not rely on noisy synthetic data. Thus our method is not affected by the noise and achieves better results whether the recall is low or high.

Case study. To further show the advantages of our method, we provide an example in Table 6. In this example, the only erroneous word in the target sentence is "Schnappschüsse", which is corrected to "Schnappschüssen" in the post-edition. SyntheticQE-MT fails to detect this error and wrongly predicts two correct words, "gewünschten" and "finden", as erroneous. SyntheticQE-MLM also fails to detect this subtle error. Our method successfully detects the error without predicting any correctly translated words as erroneous.

Ablation Studies
To compare and analyze the performance of our method with different configurations, we conducted ablation studies on the WMT 2019 En-De development set. The experimental results are shown in Table 7.
Effect of masking strategies. To measure the effect of masking strategies, we conducted experiments with different masking strategies and compared their performance. According to the results, the model with WWM (row 3) outperforms its counterpart without WWM (row 1). Table 8 shows an example of word-level QE using models with different masking strategies (source: "in a text box , delete the option text ."; target: "Löschen Sie den ausgewählten Text in einem Textfeld ."). In this example, the model without WWM fails to detect the erroneous target word "ausgewählten", which consists of the two subwords "ausgewählt" and "##en". However, the model with WWM successfully detects this error. This indicates that WWM helps estimate the translation quality of words with multiple subwords.
Effect of MC Dropout. To measure the effect of MC Dropout, we conducted experiments without MC Dropout (row 2) and compared them with their counterparts with MC Dropout (row 3). Experimental results show that performance declines in the absence of MC Dropout. Additionally, we also tried applying MC Dropout to SyntheticQE, but found no significant improvement over its counterpart without MC Dropout.

Related Work
Our work is closely related to two lines of research: (1) quality estimation for machine translation, and (2) masked language models.

Quality Estimation for Machine Translation
QE aims to evaluate the quality of machine-translated sentences without references and has been studied mainly under supervised settings. Specia et al. (2013) propose a feature-based QE method using various manually designed features and traditional machine learning models. With the recent prevalence of deep learning, various neural methods for QE have been proposed (Kim et al., 2017; Ive et al., 2018; Fan et al., 2019). Recently, with the development of pretraining, multilingual pretrained language models (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020) have also been utilized in QE (Kim et al., 2019; Kepler et al., 2019; Moura et al., 2020; Ranasinghe et al., 2020; Rubino and Sumita, 2020; Zhang and van Genabith, 2020; Lee, 2020). Due to the data scarcity problem in QE, several studies have endeavored to construct unsupervised QE models. For example, Etchegoyhen et al. (2018) build unsupervised QE models using lexical translation tables and language models. BERTScore utilizes lexical similarities based on contextual word vectors, and BERTScore++ is an enhanced version that additionally utilizes explicit cross-lingual patterns obtained from word alignments and multilingual MLMs. Fomicheva et al. (2020) use different features extracted from NMT models. Tuan et al. (2021) train unsupervised QE models using synthetic data. However, these works are either limited to sentence-level tasks or negatively affected by noisy synthetic data. By comparison, our work develops a self-supervised method for both sentence- and word-level QE without using synthetic data.
Our work is also similar to Fan et al. (2019) and Kim et al. (2019). However, their methods are designed for supervised QE and require fine-tuning on labeled training data, while our method conducts unsupervised QE by directly utilizing the conditional probabilities given by the model and does not require any further fine-tuning. Moreover, our work utilizes techniques such as WWM (Cui et al., 2019) and MC Dropout (Gal and Ghahramani, 2016) to further improve performance.

Masked Language Models
Recently, pretrained masked language models (MLMs) (Devlin et al., 2019) have been widely used in various NLP tasks, including natural language understanding and machine reading comprehension (Xu et al., 2019). The idea of MLM is also used in other complex NLP tasks. For example, Ghazvininejad et al. (2019) introduce a conditional masked language model (CMLM) for non-autoregressive NMT. Zhang and van Genabith (2021) present MLM objectives to improve neural word alignment models. MLM objectives are also used in the training process of supervised QE (Kim et al., 2019; Rubino and Sumita, 2020; Cui et al., 2021). To the best of our knowledge, our work is the first to utilize MLM objectives for QE under unsupervised settings.
Our work is also similar to translation language modeling (TLM) (Conneau and Lample, 2019). However, TLM is a multilingual pretraining schema designed for fine-tuning on various multilingual downstream tasks, while our work finetunes a multilingual pretrained model on bilingual parallel corpora for unsupervised QE.

Conclusion and Future Work
We have presented a self-supervised method for quality estimation of machine-translated sentences. The central idea is to perform quality estimation by recovering masked target words using the surrounding context. Our method is easy to implement and is not affected by noisy synthetic data. Experimental results show that our method outperforms previous unsupervised QE methods. In the future, we plan to extend our self-supervised method to phrase-and document-level tasks.

A.1 Detailed Process of Calculating Quality Scores with Monte Carlo Dropout
See Algorithm 1.

A.3 Implementation Details of Baselines

Implementation Details of SyntheticQE
For SyntheticQE-MT, the target side of the synthetic data was produced in a cross-validation setting similar to Negri et al. (2018). The synthetic target sentences were translated using Moses (Koehn et al., 2007) (for SMT datasets) or THUMT (Tan et al., 2020) (for NMT datasets). Specifically, for Moses, we mainly followed the default training process and configurations. We removed sentences longer than 100 words before training. For the language models used in Moses, we used 3-gram Kneser-Ney language models (Heafield et al., 2013). For THUMT, we used the Transformer (Vaswani et al., 2017) architecture with the base setting for NMT models. We used the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98 and ε = 10^-9 to optimize model parameters. We used the same learning rate schedule as Vaswani et al. (2017) with 4,000 warmup steps. During training, we set the batch size to 25,000 tokens, the number of training steps to 100,000, the label smoothing penalty to 0.1 and the dropout rate to 0.1. We performed subword segmentation using BPE (Sennrich et al., 2016) with 32,000 merge operations. For SyntheticQE-MLM, we followed Tuan et al. (2021) and produced the target side of the synthetic data by randomly substituting, deleting, and inserting words. The substitutions and insertions were performed using MLMs. Since our experiments were conducted on datasets in different domains, the MLMs we used were obtained by fine-tuning the multilingual BERT (Devlin et al., 2019) on the target side of the parallel corpora.
The TER (Snover et al., 2005) scores of the synthetic training data and the authentic development and test data are shown in Table 9.
For the QE models in SyntheticQE, we followed Kepler et al. (2019) and used a BERT-based model for both sentence- and word-level tasks. The models were fine-tuned on the synthetic data. For the optimizer, we used the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999 and ε = 10^-8. We set the batch size to 128, the maximum sequence length to 256, the number of training steps to 100,000, the learning rate to 5×10^-5 and the dropout rate to 0.1. We evaluated each model every 1,000 steps and chose the model with the best performance on the development set for inference. We tuned the threshold on the development set to maximize word-level performance.

Implementation Details of Other Unsupervised Sentence-Level Baselines
For uMQE (Etchegoyhen et al., 2018), we set the minimal prefix length to 4, the maximal number of candidates in the translation table to 4 and the order of the language model to 5.
For the word embeddings in BERTScore and BERTScore++, we used the contextualized embeddings from the 9th layer of the multilingual BERT (Devlin et al., 2019). In BERTScore++, we set a to 0.8 and λ to 0.01.
For NMT-QE (Fomicheva et al., 2020), we used the D-TP measure for unsupervised QE. This measure uses MC Dropout (Gal and Ghahramani, 2016) to calculate the expectation of sentence-level translation probabilities. For the NMT models used in NMT-QE, we implemented them based on THUMT (Tan et al., 2020) with the base setting presented in Vaswani et al. (2017). We evaluated each model every 1,000 steps and chose the model with the best performance on the development set for inference. For the MC Dropout process, we set N = 30.

A.4 Implementation Details of Supervised Models
The supervised models were also implemented based on the multilingual BERT (Devlin et al., 2019). We used the official training data provided by WMT to train the models. Each model was trained for 5 epochs. We set the batch size to 12 and the learning rate to 10^-5. We tuned the threshold on the development set to maximize word-level performance.
Algorithm 1: Calculating quality scores with Monte Carlo Dropout

Input: source sentence x, target sentence ŷ = (ŷ_1, ..., ŷ_T), number of samples per target token N, number of estimations N′, model parameters θ
Output: quality scores of all target tokens score(ŷ_1), ..., score(ŷ_T)
1: for t ← 1 to T do score(ŷ_t) ← 0
2: for n ← 1 to N′ do
3:     ŷ_m^(n) ← ∅
4:     choose the masked part ŷ_m^(n) ⊂ ŷ such that each token is masked in exactly N of the N′ estimations
5:     sample model parameters θ^(n) by applying dropout to θ
6:     for each ŷ_t ∈ ŷ_m^(n) do
7:         score(ŷ_t) ← score(ŷ_t) + P(ŷ_t | x, ŷ_o^(n); θ^(n))
8: for t ← 1 to T do score(ŷ_t) ← score(ŷ_t) / N

Table 9: TER scores of the synthetic training data and the authentic development and test data. "**": the TER scores are computed using the human post-editions instead of the references.