Rethinking the Word-level Quality Estimation for Machine Translation from Human Judgement



Introduction
Quality Estimation (QE) of Machine Translation aims to automatically estimate the translation quality of MT systems with no reference available. Sentence-level QE predicts a score indicating the overall translation quality, while word-level QE predicts the quality of each translated word as OK or BAD. Recently, word-level QE has attracted much attention for its potential to directly detect poorly-translated words and alert the user to concrete translation errors (the corpus of HJQE can be found at: https://github.com/ZhenYangIACAS/HJQE). Currently, the collection of word-level QE datasets mainly relies on the Translation Error Rate (TER) toolkit (Snover et al., 2006). Specifically, given the machine translations and their corresponding post-edits (PE, generated by human translators, or target sentences of the parallel corpus as pseudo-PE), the rule-based TER toolkit is used to generate the word-level alignment between the MT and the PE based on the principle of minimal editing (Tuan et al., 2021; Lee, 2020). All MT words not aligned to the PE are annotated as BAD (shown in Figure 1). Such annotation is also referred to as post-editing effort (Fomicheva et al., 2020a; Specia et al., 2020). The post-editing effort measures the translation quality in terms of the effort the translator needs to spend to transform the MT sentence into the golden reference. However, in our previous experiments and real applications, we find it often conflicts with human judgment on whether a word is well or poorly translated. Two examples in Figure 2 show the conflicts between the TER-based annotation and human judgment. In Figure 2a, the translated words "我", "很", "高兴" and "发言" are annotated as BAD by TER since they are not in exactly the same order as their counterparts in the PE sentence. However, from human judgment, the reordering of these words does not hurt the meaning of the translation and even makes the MT sentence more polished. The word "要求" is also regarded as a good translation by human judgment, as it is a synonym of the word "邀请". In Figure 2b, the clause "扎波罗齐安海特曼号" is a very good translation of "The Zaporizhian Hetman" from human judgment. However, it is annotated as BAD by TER since it is not aligned with any words in the PE sentence. In many application scenarios and downstream tasks, it is usually important, even necessary, to detect whether a word is well or poorly translated from human judgment (Yang et al., 2021). However, most previous works still use the TER-based dataset for training and evaluation, which makes the models' predictions deviate from human judgment.

[Figure 2 content: (a) MT "我 很 高兴 被 要求 在 这里 发言 。" vs. PE "被 邀请 在 这里 讲话 我 很 高兴 。" — some MT words are mistakenly annotated as BAD although the overall semantics are unchanged; (b) TER-based vs. human annotation of "扎 波罗 齐安海 特曼 号 随后 被 派 往 伊斯坦布尔 ，并 被 撞 在 钩 上 。" — humans annotate the clause "被撞在钩上" as a whole, while the TER-based annotations are fragmented.]
In the recent WMT22 word-level QE shared task, several language pairs, such as English-to-German, Chinese-to-English and English-to-Russian, evaluated the model with corpora based on the annotation of Multidimensional Quality Metrics (MQM), which is introduced from the Metrics shared task (https://wmt-qe-task.github.io/). However, the conflict between the TER-based annotation and human judgment, and its effects, are still unclear to researchers. To investigate this conflict and overcome the limitations stated above, we first collect a high-quality benchmark dataset, named HJQE, where the source and MT sentences are directly taken from the original TER-based dataset and human annotators annotate the text spans that lead to translation errors in the MT sentences. With identical source and MT sentences, it is easier for us to gain insight into the underlying causes of the conflict. Then, based on our deep analysis, we further propose two tag-correcting strategies, namely the tag refinement strategy and the tree-based annotation strategy, which make the TER-based annotations more consistent with human judgment.
Our contributions can be summarized as follows: 1) We collect a new dataset called HJQE that directly annotates the word-level translation errors on MT sentences. We conduct detailed analyses and demonstrate the differences between HJQE and the TER-based dataset. 2) We propose two automatic tag-correcting strategies that make the TER-based artificial dataset more consistent with human judgment. 3) We conduct experiments on the HJQE dataset as well as its TER-based counterpart. Experimental results of both automatic and human evaluation show that our approach achieves higher consistency with human judgment.

Data Collection
To make our collected dataset comparable to TER-generated ones, we directly take the source and MT texts from MLQE-PE (Fomicheva et al., 2020a), the widely used official dataset for the WMT20 QE shared tasks. MLQE-PE provides the TER-generated annotations for the English-German (En-De) and English-Chinese (En-Zh) translation directions. The source texts are sampled from Wikipedia documents and the translations are obtained from a Transformer-based system (Vaswani et al., 2017).
Our data collection follows the following process. First, we hire a number of expert translators: 5 for En-Zh and 6 for En-De. They are all graduate students who major in translation and have professional ability in the corresponding translation direction. For En-Zh, the translations are tokenized as in MLQE-PE. To make the annotation process as fair and unbiased as possible, each annotator is provided only the source sentence and its corresponding translation (the human annotators are not allowed to access the PE sentences in MLQE-PE). Each sample is randomly distributed to two annotators. After one example has been annotated by two translators, we check whether the annotations are consistent. If they conflict, we re-assign the sample to another two annotators until we obtain consistent annotations. For the annotation protocol, we ask human translators to find words, phrases, clauses, or even whole sentences that contain translation errors in MT sentences and annotate them with BAD tags. Here, a translation error means the translation distorts the meaning of the source sentence, excluding minor mismatches such as synonyms and punctuation. Meanwhile, if the translation does not conform to the target language's grammar, the annotators should also mark it as BAD. The annotation and distribution of samples are automatically conducted through the annotation system. After all the samples are annotated, we ask another translator to check the annotation accuracy by sampling a small proportion (400 samples) of the full dataset and ensure the accuracy is above 98%.
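The re-assignment loop described above can be sketched as follows. This is an illustrative simplification, not the actual annotation system: `annotate` and `consistent` stand in for a human annotator's work and the consistency check, and all names are ours.

```python
import random

def annotate_until_consistent(sample, annotators, annotate, consistent):
    """Assign a sample to two randomly chosen annotators; on conflict,
    re-assign to two other annotators until the annotations agree."""
    pool = list(annotators)
    while True:
        a1, a2 = random.sample(pool, 2)  # two distinct annotators
        ann1, ann2 = annotate(a1, sample), annotate(a2, sample)
        if consistent(ann1, ann2):
            return ann1
```

In practice the loop terminates quickly because most pairs of professional annotators agree; the paper additionally verifies a 400-sample subset for accuracy.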

Statistics and Analysis
Overall Statistics. In Table 1, we show detailed statistics of the collected HJQE. For comparison, we also present the statistics of MLQE-PE. First, we see that the total number of BAD tags decreases heavily when human annotations replace the TER-based annotations (from 28.15% to 9.62% for En-De, and from 54.33% to 16.62% for En-Zh). It indicates that human annotators tend to annotate OK as long as the translation correctly expresses the meaning of the source sentence, ignoring secondary issues like synonym substitutions and constituent reordering. Second, we find the number of BAD tags in the gaps (indicating a few words are missing between two MT tokens) also greatly decreases. This is because human annotators tend to regard the missing translations (i.e., the BAD gaps) and the translation errors as a whole but only annotate BAD tags on MT tokens. Unity of BAD Spans. To reveal the unity of the human annotations, we group the samples according to the number of BAD spans in every single sample and show the overall distribution. From Figure 3, we can find that the TER-based annotations follow a Gaussian-like distribution, where a large proportion of samples contain 2, 3, or even more BAD spans, indicating the TER-based annotations are fragmented. However, our collected annotations on translation errors are more unified, with only a small proportion of samples including more than 2 BAD spans. Besides, we find a large number of samples that are fully annotated as OK in the human annotations, while the number is extremely small for the TER-based annotations (78 for English-German and 5 for English-Chinese). This shows a large proportion of BAD spans in TER-based annotations do not really destroy the semantics of translations and are thus regarded as OK by human annotators.
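The span statistic above counts maximal consecutive runs of BAD in a tag sequence. A minimal sketch of how such a count can be computed (our own helper, not part of the released toolkit):

```python
def count_bad_spans(tags):
    """Count maximal consecutive runs of "BAD" in a word-level tag sequence."""
    spans, prev = 0, "OK"
    for t in tags:
        if t == "BAD" and prev != "BAD":
            spans += 1  # a new BAD span starts here
        prev = t
    return spans
```

For example, the sequence OK BAD BAD OK BAD contains two BAD spans, matching the fragmentation measure used in Figure 3.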

Based on the above statistics and the examples in Figure 2, we conclude the two main issues that result in the conflicts between the TER-based annotations and human judgment. First, the PE sentences often substitute some words with better synonyms and reorder some constituents for polishing purposes. These operations do not destroy the meaning of the translated sentence, but make some words mistakenly annotated under the exact matching criterion of TER. Second, when a fatal error occurs, the human annotator typically takes the whole sentence or clause as BAD. However, the TER toolkit still tries to find trivial words that align with the PE, resulting in fragmented and wrong annotations.

Difference from MQM
In the recent WMT22 word-level QE shared task, several language pairs began to use the MQM-based annotation introduced from the Metrics shared task for quality estimation (Freitag et al., 2021a,c). There are two main differences between the proposed HJQE and the MQM-based corpus: 1) The MQM-based corpus is mainly collected to evaluate the metrics of MT. To temper the effect of long segments, at most five errors per segment are annotated for segments containing more errors. As HJQE is collected to evaluate the quality of each translated word, we annotate all errors in each segment. 2) HJQE is collected by taking the identical source and MT sentences to the TER-based benchmark dataset, namely MLQE-PE, which makes it more straightforward to perform comparison and analysis.

Approach
This section first introduces the model backbone and the self-supervised pre-training approach based on the large-scale MT parallel corpus. Then, we propose two correcting strategies to make the TER-based artificial tags closer to human judgment.

Model Architecture
Following previous works (Ranasinghe et al., 2020; Lee, 2020; Moura et al., 2020; Ranasinghe et al., 2021), we select XLM-RoBERTa (XLM-R) (Conneau et al., 2020) as the backbone of our model. XLM-R is a transformer-based masked language model pre-trained on a large-scale multilingual corpus that demonstrates state-of-the-art performance on multiple cross-lingual downstream tasks. As shown in Figure 4a, we concatenate the source sentence and the MT sentence to form an input sample: x_i = <s> w^src_1, ..., w^src_m </s><s> w^mt_1, ..., w^mt_n </s>, where m is the length of the source sentence (src) and n is the length of the MT sentence (mt). <s> and </s> are two special tokens marking the start and the end of a sentence in XLM-R, respectively.
For the j-th token w^mt_j in the MT sentence, we take the corresponding representation from XLM-R for binary classification to determine whether w^mt_j is a good translation (OK) or contains a translation error (BAD), and use the binary classification loss to train the model:

L = -[ y log σ(w^T XLM-R_j(x_i)) + (1 - y) log(1 - σ(w^T XLM-R_j(x_i))) ]    (1)

where XLM-R_j(x_i) ∈ R^d (d is the hidden size of XLM-R) indicates the representation output by XLM-R corresponding to the token w^mt_j, σ is the sigmoid function, w ∈ R^{d×1} is the linear layer for binary classification, and y is the ground-truth label.
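A minimal sketch of the per-token loss in Eq. (1), with plain Python lists standing in for XLM-R's real d-dimensional representations (this is an illustration of the objective, not the actual training code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def token_qe_loss(h_j, w, y):
    """Binary cross-entropy for one MT token.

    h_j : encoder representation of token j (stand-in for XLM-R_j(x_i))
    w   : weights of the binary classification layer
    y   : gold label, 1 for BAD and 0 for OK
    """
    p_bad = sigmoid(sum(h * v for h, v in zip(h_j, w)))  # σ(wᵀ·h_j)
    return -(y * math.log(p_bad) + (1 - y) * math.log(1.0 - p_bad))
```

In training, this loss is averaged over all MT tokens of a batch; the sentence representation and classification head are jointly fine-tuned.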

Self-Supervised Pre-training Approach
Since constructing the golden corpus is expensive and labor-consuming, automatically building a synthetic corpus based on the MT parallel corpus for pre-training is very promising and has been widely used by previous works (Tuan et al., 2021; Zheng et al., 2021). As shown in Figure 4b, the conventional approaches first split the parallel corpus into a training and a test set. The NMT model is trained on the training split and then used to generate translations for all sentences in the test split. Then, a large number of triplets are obtained, each consisting of a source, MT, and target sentence. Finally, the target sentence is regarded as the pseudo-PE, and the TER toolkit is used to generate the word-level annotations.
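The TER-style tagging rule itself is simple: any MT token left unaligned to the pseudo-PE becomes BAD. A minimal sketch under the assumption that the MT-to-PE alignment has already been computed (in the real pipeline the TER toolkit produces it):

```python
def ter_style_tags(mt_tokens, aligned_mt_indices):
    """Tag each MT token: OK if aligned to some pseudo-PE token, else BAD.

    aligned_mt_indices: set of MT token positions that align to the pseudo-PE.
    """
    return ["OK" if i in aligned_mt_indices else "BAD"
            for i in range(len(mt_tokens))]
```

As the analysis above shows, this rule over-generates BAD tags for synonym substitutions and reorderings, which motivates the correcting strategies below.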

Tag-correcting Strategies
As discussed above, the conflicts between the TER-based annotation and human judgment limit the performance of the conventional self-supervised pre-training approach on the proposed HJQE. In this section, we introduce two tag-correcting strategies, namely tag refinement and tree-based annotation, that target these issues and make the TER-generated synthetic QE annotations more consistent with human judgment.
Tag Refinement Strategy. In response to the first issue (i.e., wrong annotations due to synonym substitution or constituent reordering), we propose the tag refinement strategy, which corrects the false BAD tags to OK. Specifically, as shown in Figure 5a, we first generate the alignment between the MT sentence and the reference sentence (i.e., the pseudo-PE) using FastAlign (Dyer et al., 2013). Then we extract the phrase-to-phrase alignment by running the phrase extraction algorithm of NLTK (Bird, 2006). Once the phrase-level alignment is prepared, we substitute each BAD span with the corresponding aligned span in the pseudo-PE and use the language model to calculate the change of the perplexity ∆ppl after this substitution. If |∆ppl| < α, where α is a hyper-parameter indicating the threshold, we regard the substitution as having little impact on the semantics and thus correct the BAD tags to OK. Otherwise, we regard the span as containing genuine translation errors and keep the BAD tags unchanged (Figure 5b).
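The refinement step can be sketched as follows. `ppl_fn` stands in for the KenLM language model and `alignment` maps MT span boundaries to the aligned pseudo-PE phrase; both names (and the dict-based alignment format) are our simplifications, not the paper's actual interfaces:

```python
def refine_tags(mt_tokens, tags, alignment, ppl_fn, alpha=1.0):
    """Flip a BAD span to OK when substituting its aligned pseudo-PE
    phrase changes the language-model perplexity by less than alpha."""
    tags = list(tags)
    base_ppl = ppl_fn(mt_tokens)
    i = 0
    while i < len(tags):
        if tags[i] != "BAD":
            i += 1
            continue
        j = i                       # expand to the full BAD span [i, j]
        while j + 1 < len(tags) and tags[j + 1] == "BAD":
            j += 1
        pe_phrase = alignment.get((i, j))
        if pe_phrase is not None:
            candidate = mt_tokens[:i] + pe_phrase + mt_tokens[j + 1:]
            if abs(ppl_fn(candidate) - base_ppl) < alpha:
                for k in range(i, j + 1):
                    tags[k] = "OK"  # synonym/reordering, not a real error
        i = j + 1
    return tags
```

Spans without an aligned pseudo-PE phrase, or whose substitution changes perplexity by at least α, keep their BAD tags.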
Tree-based Annotation Strategy. Human direct annotation tends to annotate the smallest constituent that causes fatal translation errors as a whole (e.g., whole words, phrases, clauses, etc.). However, TER-based annotations are often fragmented, with the translation being split into multiple BAD spans. Besides, the BAD spans are often not well-formed linguistically, i.e., the words in a BAD span may come from different linguistic constituents.
To address this issue, we propose the constituent tree-based annotation strategy. It can be regarded as an enhanced version of the tag refinement strategy that gets rid of the TER-based annotation entirely. As shown in Figure 5c, we first generate the constituent tree for the MT sentence. Each internal node (i.e., non-leaf node) in the constituent tree represents a well-formed phrase, such as a noun phrase (NP), verb phrase (VP), or prepositional phrase (PP). For each node, we substitute it with the corresponding aligned phrase in the pseudo-PE. Then we again use the change of the perplexity ∆ppl to indicate whether the substitution of this phrase improves the fluency of the whole translation. To annotate only the smallest constituents that exactly contain translation errors, we normalize ∆ppl by the number of words in the phrase and use this value to sort all internal nodes in the constituent tree: ∆ppl_norm = ∆ppl / (r − l + 1), where l and r indicate the left and right positions of the phrase, respectively. The words of a constituent node are integrally labeled as BAD only if ∆ppl_norm < β and the node does not overlap with any higher-ranked node, where β is a hyper-parameter.
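The ranking-and-selection step can be sketched as follows, assuming each constituent node comes with its token span and the perplexity change ∆ppl of its pseudo-PE substitution. We implement the threshold as ∆ppl_norm < β, consistent with the negative β = -3.0 reported in the experiments (a substitution that strongly reduces perplexity signals that the original span was bad); the node representation is our simplification:

```python
def tree_based_tags(n_tokens, nodes, beta=-3.0):
    """nodes: list of (left, right, dppl) per constituent, where dppl is
    the perplexity change after substituting the aligned pseudo-PE phrase."""
    tags = ["OK"] * n_tokens
    # Rank by length-normalized perplexity change, best improvement first.
    ranked = sorted((d / (r - l + 1), l, r) for l, r, d in nodes)
    chosen = []
    for dppl_norm, l, r in ranked:
        if dppl_norm >= beta:          # substitution helps too little; stop
            break
        if any(l <= cr and cl <= r for cl, cr in chosen):
            continue                   # overlaps a higher-ranked node
        tags[l:r + 1] = ["BAD"] * (r - l + 1)
        chosen.append((l, r))
    return tags
```

Because nodes are processed in ranked order and overlapping nodes are skipped, only the smallest constituents that concentrate the fluency gain are labeled BAD, mirroring how humans annotate whole erroneous constituents.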

Experiments
Datasets. To verify the effectiveness of the proposed corpus and approach, we conduct experiments on both HJQE and MLQE-PE. Note that MLQE-PE and HJQE share the same source and MT sentences, thus they have exactly the same number of samples. We show the detailed statistics in Table 1. For the pre-training, we use the parallel dataset provided in the WMT20 QE shared task to generate the artificial QE dataset.
Baselines. To confirm the effectiveness of our proposed self-supervised pre-training approach with tag-correcting strategies, we mainly select two baselines for comparison. In the first, we do not use pre-training, but only fine-tune XLM-R on the training set of HJQE. In the second, we pre-train the model on the TER-based artificial QE dataset and then fine-tune it on the training set of HJQE.
Implementation and Evaluation. The QE model is implemented based on an open-source framework, OpenKiwi. We use the large-sized XLM-R model released by HuggingFace. We use KenLM to train the language model on all target sentences in the parallel corpus. For the tree-based annotation strategy, we obtain the constituent tree through LTP (Che et al., 2010) for Chinese and through Stanza (Qi et al., 2020) for German. We set α to 1.0 and β to -3.0 based on the empirical results on the validation sets; the results on the validation sets are presented in Appendix B. Following the WMT20 QE shared task, we use the Matthews Correlation Coefficient (MCC) as the main metric and also report the F1 score (F) for OK, BAD, and BAD spans. We refer the readers to Appendix A for implementation details.
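MCC over the binary OK/BAD tags can be computed directly from the confusion counts. A minimal sketch (equivalent in spirit to sklearn's `matthews_corrcoef`, treating BAD as the positive class):

```python
def mcc(pred_tags, gold_tags):
    """Matthews Correlation Coefficient for word-level OK/BAD tags."""
    tp = tn = fp = fn = 0
    for p, g in zip(pred_tags, gold_tags):
        if p == "BAD" and g == "BAD":
            tp += 1
        elif p == "OK" and g == "OK":
            tn += 1
        elif p == "BAD" and g == "OK":
            fp += 1
        else:
            fn += 1
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 to 1 and, unlike per-class F1, remains informative under the heavy OK/BAD class imbalance reported in Table 1.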

Main Results
The results are shown in Table 2. We observe that the TER-based pre-training brings only very limited performance gain, or even degrades the performance, compared to the "FT on HJQE only" setting (-1.47 for En-De and +0.53 for En-Zh). This suggests that the inconsistency between TER-based and human annotations leads to the limited effect of pre-training. However, when applying the tag-correcting strategies to the pre-training dataset, the improvement is much more significant (+2.85 for En-De and +2.24 for En-Zh), indicating that the tag-correcting strategies mitigate such inconsistency and improve the effect of pre-training. On the other hand, when only pre-training is applied, the tag-correcting strategies can also improve performance. This shows our approach can also be applied to the unsupervised setting, where no human-annotated dataset is available for fine-tuning.
Tag Refinement vs. Tree-based Annotation. When comparing the two tag-correcting strategies, we find the tree-based annotation strategy is generally superior to the tag refinement strategy, especially for En-Zh. The MCC improves from 19.36 to 21.53 under the pre-training only setting, and from 40.35 to 41.33 under the pre-training then fine-tuning setting. This is probably because the tag refinement strategy still starts from the TER-based annotation and fixes it, while the tree-based annotation strategy actively selects the well-formed constituents to apply phrase substitution and gets rid of the TER-based annotation entirely.

Span-level Metric. Through the span-level metric (F-BAD-Span), we want to measure the unity and consistency of the model's prediction against human judgment. From Table 2, we find our models with tag-correcting strategies also show a higher F1 score on BAD spans (from 26.66 to 27.21 for En-Zh), while TER-based pre-training even harms this metric (from 26.66 to 25.93 for En-Zh). This phenomenon also confirms the aforementioned fragmentation issue of TER-based annotations; our tag-correcting strategies, instead, improve the span-level metric by alleviating it. On the other hand, we compare the performance gain of different pre-training strategies. When evaluating on MLQE-PE, the TER-based pre-training brings a higher performance gain (+6.44) than pre-training with the two proposed tag-correcting strategies (+1.43 and +1.77). When evaluating on HJQE, the case is the opposite, with the TER-based pre-training bringing a lower performance gain (+1.58) than the tree-based annotation strategy (+2.38). In conclusion, pre-training always brings performance gain, whether evaluated on MLQE-PE or HJQE. However, the optimal strategy depends on the consistency between the pre-training dataset and the downstream evaluation task.
Human Evaluation. To evaluate and compare the models pre-trained on TER-based tags and corrected tags more objectively, human evaluation is conducted for both models. For En-Zh and En-De, we randomly select 100 samples from the validation set and use the two models to predict word-level tags for them. Then, human translators (who did not participate in the annotation process) are asked to give a score for each prediction, between 1 and 5, where 1 indicates the predicted tags are fully wrong and 5 indicates the tags are fully correct. Table 4 shows the results. We can see that the model pre-trained on corrected tags (Ours) achieves higher human evaluation scores than that pre-trained on TER-based tags. For about 90% of samples, the prediction of the model pre-trained on the corrected dataset outperforms or ties with the prediction of the model pre-trained on the TER-based dataset. The results of the human evaluation show that the proposed tag-correcting strategies can make the TER-based annotation closer to human judgment. A case study is presented in Appendix C.
Limitation

We analyze some samples corrected by our tag-correcting strategies and find a few bad cases. The main reasons are: 1) There is noise in the parallel corpus. 2) The alignment generated by FastAlign contains unexpected errors, making some entries in the phrase-level alignments missing or misaligned. 3) The scores given by KenLM, i.e., the perplexity changes, are sometimes not sensitive enough. We propose some possible solutions to the above limitations as future exploration directions. For the noise in the parallel corpus, we can use parallel corpus filtering methods to filter out samples with low confidence. For the alignment errors, we may use more accurate neural alignment models (Lai et al., 2022).

Related Work
Early approaches to QE, such as QuEst (Specia et al., 2013) and QuEst++ (Specia et al., 2015), mainly pay attention to feature engineering. They aggregate various features and feed them to machine learning algorithms. Kim et al. (2017) first propose a neural-based QE approach, called Predictor-Estimator. They first pre-train an RNN-based predictor on the large-scale parallel corpus that predicts the target word given its context and the source sentence. Then, they extract features from the pre-trained predictor and use them to train the estimator for the QE task. This model achieves the best performance on the WMT17 QE shared task. After that, many variants of Predictor-Estimator are proposed (Fan et al., 2019; Moura et al., 2020; Cui et al., 2021; Esplà-Gomis et al., 2019). Among them, Bilingual Expert (Fan et al., 2019) replaces the RNN with multi-layer transformers as the architecture of the predictor and achieves the best performance on WMT18. Kepler et al. (2019) release an open-source framework for QE, called OpenKiwi, that implements the most popular QE models. Recently, with the development of pre-trained language models, many works select a cross-lingual language model as the backbone (Ranasinghe et al., 2020; Lee, 2020; Moura et al., 2020; Rubino and Sumita, 2020; Ranasinghe et al., 2021; Zhao et al., 2021). Many works also explore joint learning or transfer learning for the multilingual QE task (Sun et al., 2020; Ranasinghe et al., 2020, 2021). Meanwhile, Fomicheva et al. (2021) propose a shared task with a newly-collected dataset on explainable QE, aiming to provide word-level hints for the sentence-level QE score. Freitag et al. (2021b) also study multidimensional human evaluation for MT and collect a large-scale dataset for evaluating the metrics of MT. Additionally, Fomicheva et al. (2020b) and Cambra and Nunziatini (2022) evaluate the translation quality directly from the features of the NMT systems.
The QE model can be applied to the post-editing process. Wang et al. (2020) and Lee et al. (2021) use the QE model to identify which parts of the MT sentence need to be corrected. Yang et al. (2021) rely on the QE model to determine error spans before giving translation suggestions.

Conclusion
In this paper, we focus on the task of word-level QE in machine translation and target the inconsistency issues between TER-based annotation and human judgment. We collect and release a benchmark dataset called HJQE, which has identical source and MT sentences to the TER-based corpus and reflects human judgment on the translation errors in MT sentences. Besides, we propose two tag-correcting strategies, which make the TER-based annotations closer to human judgment and improve the final performance on the proposed benchmark dataset HJQE. We conduct thorough experiments and analyses, demonstrating the necessity of our proposed dataset and the effectiveness of our proposed approach. Our future directions include improving the performance of phrase-level alignment. We hope our work will provide some help for future research on quality estimation.

A Implementation Details
In the pre-processing phase, we filter out parallel samples that are too long or too short, and only keep sentences with 10-100 tokens. We pre-train the model on 8 NVIDIA Tesla V100 (32GB) GPUs for two epochs, with the batch size set to 8 per GPU. Then we fine-tune the model on a single NVIDIA Tesla V100 (32GB) GPU for up to 10 epochs, with the batch size also set to 8. Early stopping is used in the fine-tuning phase, with the patience set to 20. We evaluate the model every 10% of steps in one epoch. The pre-training often takes more than 15 hours and the fine-tuning takes 1 or 2 hours. We use Adam (Kingma and Ba, 2014) to optimize the model with the learning rate set to 5e-6 in both the pre-training and fine-tuning phases. All hyper-parameters in our experiments are manually tuned on the validation set of HJQE.

B Main Results on the Validation Set
In Table 5, we also report the main results on the validation set of HJQE.

C Case Study
In Figure 6, we show some cases from the validation set of the English-Chinese language pair.
From the examples, we can see that the TER-based model (denoted as PE Effort Prediction) often annotates wrong BAD spans and is far from human judgment. For the first example, the MT sentence correctly reflects the meaning of the source sentence, and the PE is just a paraphrase of the MT sentence. Our model correctly annotates all words as OK, while the TER-based one still annotates many words as BAD. For the second example, the key issue is the translation of "unifies" into Chinese. Though "统一" is the direct translation of "unifies", it cannot express the meaning of winning two titles in the Chinese context. Our model precisely annotates "统一 了" in the MT sentence as BAD. For the third example, the MT model fails to translate "parsley" and "sumac" into "欧芹" and "盐肤木" in Chinese, since they are very rare words. While the TER-based model mistakenly predicts long BAD spans, our model precisely identifies both mistranslated parts in the MT sentence.

Figure 2 :
Figure 2: Two examples show the gap between the TER-based and human direct annotation on detecting translation errors. The red color indicates BAD tags (text with translation errors), while the green color indicates OK tags. For readability, we also provide the back translation from Google Translate for the Chinese sentences.

Figure 3 :
Figure 3: The distribution of the number of BAD spans per validation sample.

Figure 4 :
Figure 4: The model architecture and the construction of artificial QE dataset.

Figure 5 :
Figure 5: The proposed two tag correcting strategies: Tag Refinement strategy and Tree-based Annotation strategy.

Table 1 :
The statistics of TER-based MLQE-PE dataset and the collected HJQE.

Table 2 :
Performance on the test set of HJQE. PT indicates pre-training and FT indicates fine-tuning. Results are all reported ×100. Numbers with * indicate a significant improvement over the corresponding baseline with p < 0.05 under the t-test.

Table 3 :
Performance comparison for En-Zh with different fine-tuning and evaluation settings. Since the test labels of MLQE-PE are not publicly available, we report the results on the validation sets of both datasets. MCC* indicates the MCC score considering both the target tokens and the target gaps.

Table 4 :
The results of human evaluation. We select the best-performing model fine-tuned on MLQE-PE and HJQE, respectively.

To demonstrate the difference between the MLQE-PE and our HJQE datasets, and to analyze how pre-training and fine-tuning influence the results on both datasets, we compare the performance of different models on MLQE-PE and HJQE, respectively. The results for En-Zh are shown in Table 3. When comparing results in each group, we find that fine-tuning on the training set identical to the evaluation set is necessary for achieving high performance. Otherwise, fine-tuning provides marginal improvement (e.g., fine-tuning on MLQE-PE and evaluating on HJQE) or even degrades the performance (e.g., fine-tuning on HJQE and evaluating on MLQE-PE). This reveals the difference in data distribution between HJQE and MLQE-PE. Besides, our best model on MLQE-PE outperforms WMT20's best model (61.85 vs. 59.28) using the same MCC* metric, showing that the modeling ability of our model is strong enough even under the TER-based setting.

Table 5 :
The word-level QE performance on the validation set of HJQE for two language pairs, En-De and En-Zh. PT indicates pre-training and FT indicates fine-tuning.

[Figure 6 examples (En-Zh): (2) Source: "On April 28, Juan Díaz unified the WBA and WBO lightweight titles after defeating Acelino Freitas." — the PE renders "unified ... titles" as "揽于 一身" (won both titles), while the MT uses the literal "统一". (3) Source: "Fattoush is a combination of toasted bread pieces and parsley with chopped cucumbers, radishes, tomatoes and flavored by sumac." — the MT mistranslates the rare words "parsley" and "sumac".]