T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics

Modern embedding-based metrics for evaluating generated text generally fall into one of two paradigms: discriminative metrics that are trained to directly predict which outputs are of higher quality according to supervised human annotations, and generative metrics that are trained to evaluate text based on the probabilities of a generative model. Both have their advantages; discriminative metrics are able to directly optimize for the problem of distinguishing between good and bad outputs, while generative metrics can be trained using abundant raw text. In this paper, we present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available. We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone. We perform an extensive empirical comparison with other existing metrics on 5 datasets, 19 languages and 280 systems, demonstrating the utility of our method. Experimental results show that T5Score achieves the best performance against existing top-scoring metrics at the segment level on all datasets. We release our code and models at https://github.com/qinyiwei/T5Score.


Introduction
Automatically evaluating the quality of generated text plays an essential role in the development of text generation systems (Lin and Hovy, 2003; Peyrard, 2019; Mathur et al., 2020a). A key element of this evaluation is the design of an automated metric that can recognize high-quality texts. The current most popular approach to creating such high-quality metrics is the discriminative paradigm.
These models are generally trained by taking a sentence embedding model and fine-tuning it using human judgments of generated text quality as a learning signal, allowing metrics to directly predict the quality score of a text. Popular examples include COMET (Rei et al., 2020) and BLEURT (Sellam et al., 2020). However, the effectiveness of this method comes at the cost of expensive manual annotation of human judgments, and thus these models are less broadly applicable than more common lexical metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004).
More recently, there has been promising work on generative metrics. These metrics recognize high-quality texts by formulating evaluation as a generation task and using the generative likelihood as an indication of quality, making it possible to train the models without explicit human annotation. Examples of these metrics include BARTScore (Yuan et al., 2021) and PRISM (Thompson and Post, 2020a). However, because such generative models do not utilize human judgments at training time, they are inherently at a disadvantage compared to metrics that can utilize supervision.
In this work, we argue that it is crucial to utilize all available supervision signals that could indicate the quality of the text. To this end, we propose a framework for learning evaluation metrics based on the assumption that generative and discriminative objectives can work in concert to train a better evaluator, as shown in Fig. 1.
We realize this idea by (1) starting with the pre-trained model mT5 (Xue et al., 2021), (2) training mT5 in a generative fashion by maximizing the probability of existing parallel data, then (3) fine-tuning mT5 discriminatively by minimizing a contrastive loss function that teaches it to assign higher generative probability to high-quality texts than to low-quality texts. At evaluation time, the probability of generating a text is used as the quality score, because the model has learned to assign high probability to superior texts. Our framework has the flexibility to choose between a supervised and an unsupervised training strategy, depending on whether human judgments are available for a given language or task, while keeping the evaluation process the same.
We evaluate the proposed metric (T5SCORE) on 5 datasets covering machine translation (MT) and summarization tasks across 19 languages. In reference-based experiments at the segment level, T5SCORE trained generatively achieves the best performance on the one dataset without human-annotated training examples, and T5SCORE trained discriminatively achieves the best performance against top-scoring counterparts on the 4 datasets with human-annotated training examples. At the system level, T5SCORE trained discriminatively achieves the best performance in 5 of 6 test settings (2 correlation methods × 3 datasets). Empirical results also show the effectiveness of generative training, especially for tasks without human judgments. In source-based experiments, we find source-based evaluation better than reference-based evaluation at assessing top-scoring systems, showing the importance of developing source-based evaluation as machines generate higher-quality texts.

Task Formulation
Text generation evaluation aims to design a function auto_eval(·) that takes in a source text x, some reference outputs y, and a system output ŷ, and predicts a scalar value that indicates the quality of the system output. The validity of the designed function depends on the degree of correlation between auto_eval(·) and human judgments (denoted manual_eval(·)): the better the correlation between the two, the more effective we consider the designed function to be.
Specifically, in this work, we call an evaluation function (1) source-based if it takes only x and ŷ and predicts using auto_eval(x, ŷ), and (2) reference-based if it takes y and ŷ and predicts using auto_eval(y, ŷ), or takes x, y, and ŷ and predicts using auto_eval(x, y, ŷ).
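As an illustration, the two function signatures can be sketched as follows. The scoring bodies here are purely hypothetical token-overlap stand-ins, not the paper's method (which uses model log-likelihoods, described in the next section); only the interfaces mirror the definitions above.

```python
def auto_eval_src(x: str, y_hat: str) -> float:
    """Source-based: score the hypothesis y_hat given only the source x.
    Toy stand-in scorer: fraction of hypothesis tokens found in x."""
    xs, hs = set(x.split()), set(y_hat.split())
    return len(xs & hs) / max(len(hs), 1)

def auto_eval_ref(y: str, y_hat: str) -> float:
    """Reference-based: score the hypothesis y_hat against a reference y."""
    ys, hs = set(y.split()), set(y_hat.split())
    return len(ys & hs) / max(len(hs), 1)
```

A metric is then judged by how well these scores correlate with manual_eval(·) over many examples.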

Metric Design
In this section, we describe T5SCORE and explain how to train the metric in both a generative and discriminative fashion.

Evaluation as Generation
Following Yuan et al. (2021), we formulate text generation evaluation as a text generation problem.
Specifically, the quality of a generated text is measured by calculating the average per-token conditional log-probability of one text a given another text b, which we abbreviate as "b → a":

$\mathrm{score}(a \mid b) = \frac{1}{m}\sum_{t=1}^{m} \log p(a_t \mid a_{<t}, b, \theta)$

where m is the number of tokens in a and θ denotes the model parameters. According to preliminary experiments, the F score correlated better with human evaluation scores on the DA20 dataset (§4.1) than Precision and Recall, so we adopt the F score by default. In order to support multilingual evaluation, we choose mT5 (Xue et al., 2021) as our pre-trained model.
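A minimal sketch of this bidirectional scoring, assuming (as one plausible reading) that the F score is the arithmetic mean of the two directional average log-probabilities; the exact combination used by the released metric may differ:

```python
def per_token_score(logprobs):
    """(1/m) * sum_t log p(a_t | a_<t, b): average per-token log-probability."""
    return sum(logprobs) / len(logprobs)

def f_score(lp_ref_to_hyp, lp_hyp_to_ref):
    """Combine the precision-like direction (y -> y_hat) and the
    recall-like direction (y_hat -> y); the arithmetic mean is an assumption."""
    precision = per_token_score(lp_ref_to_hyp)
    recall = per_token_score(lp_hyp_to_ref)
    return (precision + recall) / 2
```

In practice the per-token log-probabilities would come from forced decoding with mT5 in each direction.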

Generative Training for T5SCORE
Generative training aims to teach the model to generate the target text from the input text with a standard negative log-likelihood loss:

$\mathcal{L}_{gen} = -\frac{1}{m}\sum_{t=1}^{m} \log p(y_t \mid y_{<t}, x, \theta)$

where m is the number of tokens in the target text. We use the MT dataset ParaCotta (Aji et al., 2022) and the paraphrasing dataset MT-prism (Thompson and Post, 2020b) as parallel corpora to train our models generatively (Appendix A.9.2 shows the corpus details).
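Spelled out on toy token probabilities (a sketch; a real implementation would read these from mT5's softmax over the gold target tokens):

```python
import math

def generative_nll(token_probs):
    """L_gen = -(1/m) * sum_t log p(y_t | y_<t, x, theta), where
    token_probs[t] is the model's probability of the t-th gold token."""
    m = len(token_probs)
    return -sum(math.log(p) for p in token_probs) / m
```

A perfectly confident model (all probabilities 1) yields a loss of zero; lower token probabilities raise the loss.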

Discriminative Training for T5SCORE
We also design a discriminative training method for when human judgments of generation quality are available. Suppose we have an annotated training dataset D = {x_i, y_i, ŷ_i, m_i | i = 1, ..., N}, where x_i, y_i, ŷ_i, and m_i denote the i-th example's source text, reference text, hypothesis text, and manual score, respectively (ŷ_i and m_i can be multiple hypotheses with their corresponding quality scores). We first generate a relative-rank dataset D_RR by finding, for the same source text x_i and reference text y_i, a pair of hypotheses ŷ_i^+ with higher manual score m_i^+ and ŷ_i^- with lower manual score m_i^-. Then, to encourage the model to assign higher probability to the better hypothesis ŷ^+, we adopt a contrastive loss function, following Liu et al. (2022) and Hopkins and May (2011):

$\mathcal{L}_{dis} = \max\left(0,\; \alpha + f(\hat{y}_i^-) - f(\hat{y}_i^+)\right) \quad (3)$

where α is the weight of the margin term. f is defined as $f(\hat{y}) = \frac{1}{m}\sum_{t=1}^{m} \log p(\hat{y}_t \mid \hat{y}_{<t}, y, \theta)$ for reference-based methods and $f(\hat{y}) = \frac{1}{m}\sum_{t=1}^{m} \log p(\hat{y}_t \mid \hat{y}_{<t}, x, \theta)$ for source-based methods, where m is the number of tokens in ŷ.
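The margin loss can be sketched as follows (the scores f are average per-token log-probabilities, so higher is better; the max-margin form is reconstructed from the surrounding description and should be read as an illustration):

```python
def contrastive_loss(f_pos, f_neg, alpha=0.1):
    """L_dis = max(0, alpha + f(y_hat_minus) - f(y_hat_plus)): the loss is
    zero once the better hypothesis outscores the worse one by at least alpha."""
    return max(0.0, alpha + f_neg - f_pos)
```

With f_pos = -1.0 and f_neg = -2.0, the better hypothesis already wins by more than the margin and the loss vanishes; flipping the scores produces a positive loss that pushes the model to reorder them.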
Because we adopt the F score for evaluation by default, our training process also considers two generation directions: from x or y to ŷ, and from ŷ to x or y. We augment the training samples by repeating the corpus D_RR, changing x or y, originally the model's input, to the output, and changing ŷ, originally the model's output, to the input. Thus, half of the time we calculate the score in the reverse direction, i.e., the probability of generating x or y conditioned on ŷ (details in Appendix A.9.1).
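The direction augmentation can be sketched as doubling each training pair:

```python
def augment_directions(inp, hyp):
    """Return both generation directions for one example: the original
    (input -> hypothesis) and the reversed (hypothesis -> input)."""
    return [(inp, hyp), (hyp, inp)]
```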

Correlation Measurements
We consider both system-level and segment-level correlations with human judgments when evaluating automated metrics. System-level evaluation calculates the average human score for each generation system to produce a scalar rating of system performance; we employ the Pearson correlation (sys-p) and Kendall's Tau correlation (sys-k) as evaluation measures for system-level metrics. Segment-level correlation measures the correlation over segment-level assessments; we keep the same setup as Mathur et al. (2020b), converting Direct Assessment (DA) to DA relative rank (DARR) and adopting a Kendall's Tau-like (seg-k) formulation as the evaluation measure. We adopt the bootstrapping method (p-value < 0.05) (Koehn, 2004; Graham et al., 2014) for pair-wise significance tests.
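The Kendall's Tau-like segment statistic over DARR pairs can be sketched as follows (a hypothetical helper; ties are simply ignored here, which may differ from the official WMT scoring script):

```python
def seg_kendall_tau_like(pairs):
    """pairs: (score_better, score_worse) metric scores for hypothesis pairs
    where humans ranked the first hypothesis higher.
    tau = (concordant - discordant) / (concordant + discordant)."""
    concordant = sum(1 for b, w in pairs if b > w)
    discordant = sum(1 for b, w in pairs if b < w)
    return (concordant - discordant) / (concordant + discordant)
```

A metric that agrees with the human ranking on 3 of 4 pairs gets tau = (3 - 1) / 4 = 0.5.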

Baseline Metrics
We consider the following baseline metrics for comparison: BLEU (Papineni et al., 2002), the precision of n-grams of the MT output compared to the reference; ROUGE (Lin, 2004), which measures the lexical overlap between the system output and the reference; COMET (Rei et al., 2020), a discriminative metric that uses XLM-RoBERTa to encode the source, hypothesis and reference and can be optimized towards different objectives; BERTScore (Zhang et al., 2019), which computes the cosine similarity between the reference and hypothesis tokens' embeddings based on BERT (Devlin et al., 2018); BLEURT (Sellam et al., 2020), a BERT-based regression model trained on synthetic examples and ratings from WMT; PRISM (Thompson and Post, 2020a), a generative metric that scores MT system outputs conditioned on their respective human references; and BARTScore (Yuan et al., 2021), a generative metric that uses BART (Lewis et al., 2019) to evaluate the generated text.

Reference-based Evaluation
We consider two tasks in reference-based evaluation: machine translation (DA20, MQM20 and MQM21) and summarization (MultiSumm).

Training Details
We consider four different sizes of base models: mT5-B (580M parameters), mT5-L (1.2B parameters), mT5-XL (3.7B parameters), and mT5-XXL (11B parameters). Both generative and discriminative training are considered, with the former based on the ParaCotta corpora and the latter based on the WMT DA corpora from 2017 to 2019. Our model implementation is based on Huggingface Transformers (Wolf et al., 2020). More details on the hyperparameters, training time, and computing resources can be found in Appendix A.

Results
For DA20, Tab. 1/Tab. 7 shows segment-level Kendall's Tau correlation results of diverse metrics for 10/8 language pairs with English as target/source; Fig. 2 shows system-level results on average. For MQM20 and MQM21, Tab. 2 shows both segment-level and system-level results. For MultiSumm, Fig. 3 illustrates the segment-level Kendall's Tau correlation. In all tables, ‡ denotes correlations not significantly outperformed by any other metric for the given language pair, while † denotes correlations not significantly outperformed by any other unsupervised metric. The highest correlation for each language pair by unsupervised methods is underlined, and the highest correlation overall is bold.
From the above tables and figures, we observe: 1) At the segment level, our method achieves the best performance on average. Supervised T5SCORE-XL surpasses all baselines for DA20, MQM20 and MQM21; unsupervised T5SCORE-L surpasses all baselines for MultiSumm.

Table 1: Segment-level Kendall's Tau correlations for the WMT DA20 corpus. Avg-en denotes the average correlation achieved by a metric across all x-en language pairs, Avg-x the average correlation across all en-x language pairs, and Avg the average correlation across all language pairs. This table shows the results of all x-en language pairs; the results of en-x language pairs can be found in Appendix A.3.
2) At the segment level, as language model size increases, the metric performance tends to saturate. In Tab. 7, from T5SCORE-B to T5SCORE-L and from T5SCORE-L to T5SCORE-XL, the performance of our unsupervised metric improves by 1.4 and 0.6 on average, respectively, while the performance of our supervised metric improves by 1.7 and 0.6, respectively. However, this is not so clear at the system level. A possible reason is that at the system level there are usually fewer than 20 systems to be evaluated, far fewer than the number of examples at the segment level, so tiny differences in one MT system can have a large impact on the final results.
3) At the system level, our method is better at Kendall's Tau correlation than at Pearson correlation. In Fig. 2, our method achieves the highest Kendall's Tau correlation among the compared baselines, while performing slightly worse than COMET in terms of Pearson correlation. This can be attributed to our training process, which adopts a contrastive loss function, making our models better at predicting the relative rank of examples or systems rather than their absolute scores.
From T5SCORE-XL to T5SCORE-XXL, the performance of our unsupervised metric improves by 0.3; detailed results can be found in Appendix A.5. Due to limited computational resources and the limited performance improvement of T5SCORE-XXL, we do not train a supervised T5SCORE-XXL.

4) For datasets without human-annotated training examples, our unsupervised method achieves the best performance. In Fig. 3, our supervised methods perform worse than unsupervised methods, and other supervised methods do not work well either. The reason could be that these methods are trained on MT data, and their direct use for the summarization task may impair their performance. These results also indicate that our method has the advantage that the unsupervised version still works well in the absence of human-annotated data.

Source-Based Evaluation
We also support source-based discriminatively trained methods. In this section, we show the effectiveness of the source-based method. We consider the task of machine translation and inspect the results on three datasets: DA20, MQM21 and QE20.

Training details
Hyperparameters are kept the same as in Sec. 5.1.

DA20: Generative training is performed on MT-prism, while discriminative training is performed on the WMT DA corpus from 2017 to 2019.

MQM21: We take the models from DA20 and further train them for 2 additional epochs on MQM20.

QE20: DA20 models are further trained on the QE20 train split. The best checkpoint is picked based on its performance on the QE20 development split, and the results on the test split are reported.

Tab. 3 illustrates both the segment-level and system-level results. For DA20, we compare our supervised model with the COMET_src baseline (Rei et al., 2021); results can be found in Appendix A.6. For QE20, we choose PRISM_qe (Thompson and Post, 2020a), a reference-free version of PRISM, as the unsupervised baseline, and TransQuest (Ranasinghe et al., 2020b,a), the winner of the WMT2020 Shared Task on Quality Estimation (Specia et al., 2021), as the supervised baseline. Tab. 4 illustrates the segment-level Pearson correlation of different evaluation metrics.

Results
We have the following observations: 1) For MQM21, T5SCORE surpasses the baseline at both the segment and system level on average. For DA20, T5SCORE is better than the baseline at the segment level on average, and better than the baseline for most language pairs at the system level. For QE20, both supervised and unsupervised T5SCORE surpass the corresponding baselines.
2) Overall, source-based models perform worse than reference-based models, but the difference is much smaller on MQM21 than on DA20. We conjecture that this may be related to the human evaluation process: WMT-DA uses a mixture of source-based and reference-based annotations, while MQM annotations are based on the source.

Analysis
We design analyses to better understand the mechanism of T5SCORE and its strengths over other metrics, specifically asking three questions. Q1: Is generative training necessary before discriminative training? Q2: What are the strengths and weaknesses of each evaluation metric? Q3: When will source-based evaluation outperform reference-based evaluation?

Multi-Dimension Analysis. For Q2, we compare diverse evaluation metrics under different error categories on the MQM datasets. To evaluate the performance of the metrics on a given error category, we use the score of each example in that error category as the gold standard to compare with the score given by the automated evaluation metrics. There are six error categories in total: the five error categories described in Sec. 4.1 and an overall category that measures all errors. Fig. 5 shows the Root Mean Square Error (RMSE) of diverse evaluation metrics under different error categories. We observe that: (1) Our model T5SCORE-XL_sup ranks first overall in every error category except accuracy, where BLEURT ranks first. (2) Supervised metrics (BLEURT, COMET, T5SCORE-XL_sup) perform better than other unsupervised metrics. (3) The evaluation perspective that all metrics excel at is the same: all metrics perform best in terms of accuracy, much better than in the other error categories.
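The per-category comparison rests on plain RMSE between human scores and metric scores; a minimal sketch (assuming both score lists are already on a comparable scale):

```python
import math

def rmse(gold_scores, metric_scores):
    """Root Mean Square Error between human scores and metric scores."""
    n = len(gold_scores)
    return math.sqrt(sum((g - s) ** 2 for g, s in zip(gold_scores, metric_scores)) / n)
```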

Top-k Analysis
To answer Q3, we conduct experiments on MQM21 and evaluate on the subset of the data from the top-k performing MT systems. Results are presented in Fig. 6. We find that: (1) The advantage of the source-based version of T5SCORE over the reference-based version increases as we evaluate fewer systems, i.e., only high-quality systems. Although the trend is less pronounced for COMET, which also uses the source in its reference-based approach, it is roughly the same. This suggests that the source-based approach is more suitable for evaluating top systems, and that source-based evaluation should be considered as machine systems improve.

Related Work
The increasing performance of generation systems equipped with large pre-trained models puts forward a higher requirement on the evaluation ability of automated metrics. As such, researchers are exploring different evaluation frameworks by teaching metrics to learn diverse types of knowledge.
Despite the superior performance of optimizing the correlation with human judgments, this method is expensive due to the cost of creating human annotations. To bypass this challenge, researchers attempt to evaluate generated texts in an unsupervised way by calculating the lexical or semantic similarity between reference and generated texts with surface-based string matching (e.g., BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and CHRF (Popović, 2015)) or unsupervised pre-trained components (e.g., BERTScore (Zhang et al., 2019)). Recently, works such as PRISM (Thompson and Post, 2020a) and BARTScore (Yuan et al., 2021) have started to formulate evaluation as a generation task, which can not only make full use of pre-trained knowledge but also find more training data that can provide useful supervision for metrics to learn.
In this paper, we propose a framework that allows different types of signals to be incorporated into metrics. Concurrent with our work, UniEval (Zhong et al., 2022) formulates evaluation as a boolean question answering task and is trained on semi-supervised data. By contrast, our method is based on the generation formulation, which enables us to utilize a large amount of raw parallel data.

Conclusions
In this paper, we augment evaluation metrics with the ability to use different types of signals from data, based on the assumption that a good evaluation metric should be informed not only of how to score texts of different quality but also of how high-quality texts are generated. We achieve this goal by proposing a discriminative generation framework for the evaluation of generated texts, which outperforms 8 existing top-performing metrics on 5 datasets covering 19 languages.

Limitations
This study has potential limitations: (1) While different perspectives (e.g., informativeness, fluency, or factuality) of text could be evaluated, we assign one overall quality score, which cannot necessarily reflect the quality of a certain aspect. In future work, more interpretable metrics could be designed specifically for a certain evaluation perspective. (2) Due to the contrastive training objective of T5SCORE, our metric may not be as good at predicting the absolute score of segments or systems as at predicting their relative ranks.
(3) We focus on evaluating texts automatically generated by machine systems. Human-generated texts, which might have different features from machine-generated texts, could be addressed in future work. (4) We study reference-based and source-based metrics separately. A combination of both could be studied in the future. (5) The potential risk associated with our model is that the quality annotations we use may have been generated according to the preferences of a certain demographic group, for which explicit metadata is not available. As our model is trained on these annotations, it may inherit the underlying bias.

A.2 Training Details
We adopt the Adafactor (Shazeer and Stern, 2018) optimizer following Raffel et al. (2020). For generative training, our model is trained on ParaCotta, with WMT-19 used as the validation set and WMT-20 as the test set; the tuned hyper-parameter is the learning rate. For discriminative training, starting from the generative model, we further train on the z-scores of the WMT DA corpus from 2017 to 2019, using the discriminative loss function in Equation 3; the tuned hyper-parameters are the learning rate, dropout rate, and α. We choose 4 language pairs (ru-en, en-ru, en-pl, en-cs) as the validation set to tune the hyper-parameters and pick the best model checkpoint.
Tab. 6 shows the hyper-parameters, training time and computing resources of T5SCORE. We find that as the model size increases, fewer training steps are needed to reach the best performance on the validation set, so the maximum number of training steps is decreased for larger models. We use a linear learning rate scheduler with 10% of the maximum training steps as warm-up steps. In the table, when the number of GPUs is larger than 1, we use model parallelism.

A.3 WMT-DA segment level results
A supplement to Tab. 1 showing the segment-level results on language pairs with English as source for the WMT DA20 corpus.

A.4 WMT-DA system level results
Besides segment-level correlation, we also calculate system-level correlation on the WMT DA20 corpus; the results are shown in Tab. 8, 9, 10 and 11.

A.6 WMT-DA Source Based Evaluation Results
We conduct source-based evaluation experiments on the WMT DA20 corpus using supervised T5SCORE. We compare our supervised source-based model with the COMET_src baseline (Rei et al., 2021), which is a source-based metric trained to predict WMT DA scores from 2017 to 2019. Tab. 13 illustrates the segment-level Kendall's Tau correlation of diverse evaluation metrics on WMT-DA. The results show that at the segment level T5SCORE-B_src is comparable to the baseline, and larger models surpass it. Our system-level results, shown in Tab. 14 and 15, are also comparable to or better than the baseline for most language pairs except iu-en and en-iu. In all tables, Avg-en denotes the average correlation across all x-en language pairs; Avg-x denotes the average correlation across all en-x language pairs; Avg denotes the average correlation across all language pairs.

A.7 Effectiveness of Generative Training
To show the importance of unsupervised generative training, we compare the performance of T5SCORE-B and T5SCORE-L with and without unsupervised generative training in Tab. 16, 17, 18 and 19. Avg denotes the average correlation across all language pairs.

A.8 MultiSumm Segment Level Pearson Correlation Result
Tab. 20/Tab. 21 illustrates the segment-level Kendall's Tau/Pearson correlation of diverse evaluation methods for 8 language pairs on the MultiSumm dataset.
MQM20 & MQM21: MQM20 (Freitag et al., 2021a) and MQM21 (Freitag et al., 2021b) are datasets obtained by professional translators who re-annotated the outputs from the WMT20 and WMT21 shared tasks according to the Multidimensional Quality Metrics (MQM) framework (Lommel et al., 2014). The MQM framework contains assessments of five aspects of the text: accuracy, fluency, terminology, style, and locale convention. MQM20 covers 2 language pairs and 20 systems, en-de (10) and zh-en (10), and comprises 1,418 and 2,000 segments for en-de and zh-en respectively. In our experiments, we excluded 3 human translation systems for en-de and 2 human translation systems for zh-en. MQM21 covers 2 language pairs and 32 systems, en-de (17) and zh-en (15), including 527 and 650 samples for en-de and zh-en respectively.

QE20: QE20 (Specia et al., 2021) is the dataset of the WMT20 shared task on Quality Estimation (QE).
MultiSumm: MultiSumm (Koto et al., 2021) is a multilingual summarization dataset containing texts and their summaries in eight languages (en, id, fr, tr, zh, ru, de, es). The dataset collects 135 documents in each language, as well as summaries.

Table 8: System-level Pearson correlations on language pairs with English as target for the WMT DA20 corpus. Avg denotes the average correlation across all x-en language pairs.

Table 9: System-level Pearson correlations on language pairs with English as source for the WMT DA20 corpus. Avg denotes the average correlation across all en-x language pairs and Avg-all denotes the average correlation across all language pairs.

Table 10: System-level Kendall's Tau correlations on language pairs with English as target for the WMT DA20 corpus. Avg denotes the average correlation across all x-en language pairs.

Figure 1 :
Figure 1: Our framework supports generative and discriminative training. The former uses parallel data and maximizes the probability of the output data conditioned on the input data. The latter ranks two hypotheses by their manual scores and maximizes the probability of the better hypothesis while minimizing the probability of the worse hypothesis. L_gen and L_dis denote the generative loss and discriminative loss, respectively.

DA20: Evaluation of the trained models is carried out on the DA20 dataset.

MQM: We consider various discriminative training data, resulting in the following models: (a) T5SCORE-*_20 is trained on the WMT DA corpus from 2017 to 2019. (b) T5SCORE-*_21 is trained on the WMT DA corpus from 2017 to 2020. (c) T5SCORE-*_21mqm is (b) further trained for 1 additional epoch on MQM20.

MultiSumm: Due to the small size of the MultiSumm dataset (135 examples per language pair), we do not undertake additional training of a model specific to summarization. Instead, we use models trained on ParaCotta and WMT directly.

Figure 2 :
Figure 2: System-level Kendall's Tau and Pearson correlations for the WMT DA20 corpus. Detailed results can be found in Appendix A.4.
Effectiveness of Generative Training. In our experiments, discriminative training is based on the model trained generatively. To answer Q1, we compare the performance of discriminatively trained models with and without generative training on DA20. The results are shown in Fig. 4 (Appendix A.7 has detailed results for every language pair). We observe that: (1) Both T5SCORE-B and T5SCORE-L are enhanced by generative training before discriminative training under all three correlation metrics (except that T5SCORE-B shows a small performance drop in segment-level Kendall's Tau correlation), which means that generative training is necessary to improve model performance. (2) The larger model T5SCORE-L benefits more from generative training than T5SCORE-B, indicating that larger models are better at retaining knowledge from generative training.

Figure 4 :
Figure 4: Segment-level Kendall's Tau, system-level Pearson and system-level Kendall's Tau correlations of different models with and without generative training on DA20. T5SCORE-B_w and T5SCORE-L_w are models with generative training, while T5SCORE-B_w/o and T5SCORE-L_w/o are models without generative training.

Figure 5 :
Figure 5: RMSE of different metrics in each error category on MQM dataset.
(2) T5SCORE outperforms COMET on all top-k systems under all three correlation measures, except for the reference-based version under seg-k correlation. The better performance of COMET's reference-based version may be attributed to its combination of reference and source, further indicating the importance of the source.

Table 2 :
Segment-level Kendall's Tau, system-level Pearson and system-level Kendall's Tau of different metrics on the MQM20 and MQM21 datasets. Avg denotes the average correlation achieved by a metric across two language pairs and two years. Method COMET uses the models wmt20-comet-da and wmt21-comet-mqm for MQM20 and MQM21 respectively. Method T5SCORE-XL_sup uses the models T5SCORE-*_20 and T5SCORE-*_21mqm for MQM20 and MQM21 respectively.

Table 3 :
Segment-level Kendall's Tau, system-level Pearson and system-level Kendall's Tau on the MQM21 dataset for source-based methods. The highest correlation for each language pair under each correlation method is bold.
11 COMET mqmsrc is a source-based metric trained on WMT

Table 4 :
Segment level Pearson correlation on QE20 corpus for source-based methods.The highest correlation by unsupervised method is underlined, and the highest correlation overall is bold.

Table 6 :
Hyper-parameters, training time and computing resources. Max-step means the maximum number of training steps. Save-step means we save a checkpoint every save-step steps. GPU shows the GPU type and the number of GPUs; A6000 is the NVIDIA RTX A6000. Time is the wall-clock training time.

Table 7 :
Segment-level Kendall's Tau correlations on language pairs with English as source for the WMT DA20 corpus.Avg denotes the average correlation achieved by a metric across all en-x language pairs.

Table 13 :
Segment level Kendall's Tau correlations on WMT DA20 corpus using source-based models.The highest correlation for each language pair is bold.

Table 14 :
System level Pearson correlations on WMT DA20 corpus using source based models.

Table 15 :
System level Kendall's Tau correlations on WMT DA20 corpus using source based models.

Table 17 :
Segment Kendall's Tau, system Pearson and system Kendall's Tau correlations for WMT DA20 corpus without unsupervised training using model T5SCORE-L.

Table 18 :
Segment-level Kendall's Tau, system-level Pearson and system-level Kendall's Tau correlations for the WMT DA20 corpus with unsupervised training using model T5SCORE-B.

Table 19 :
Segment-level Kendall's Tau, system-level Pearson and system-level Kendall's Tau correlations for the WMT DA20 corpus with unsupervised training using model T5SCORE-L.