Uncertainty-Aware Machine Translation Evaluation

Several neural-based metrics have been recently proposed to evaluate machine translation quality. However, all of them resort to point estimates, which provide limited information at segment level. This is made worse as they are trained on noisy, biased and scarce human judgements, often resulting in unreliable quality predictions. In this paper, we introduce uncertainty-aware MT evaluation and analyze the trustworthiness of the predicted quality. We combine the COMET framework with two uncertainty estimation methods, Monte Carlo dropout and deep ensembles, to obtain quality scores along with confidence intervals. We compare the performance of our uncertainty-aware MT evaluation methods across multiple language pairs from the QT21 dataset and the WMT20 metrics task, augmented with MQM annotations. We experiment with varying numbers of references and further discuss the usefulness of uncertainty-aware quality estimation (without references) to flag possibly critical translation mistakes.


Introduction
Evaluation of machine translation (MT) quality is a key problem with several use cases: it is needed to compare and select MT systems, to decide on the fly whether a translation is ready for publication or needs to be post-edited by a human, and more generally to track progress in the field (Specia et al., 2018; Mathur et al., 2020). Even when reference translations are available, the increasing quality of neural MT systems has made traditional lexical-based metrics such as BLEU (Papineni et al., 2002) or CHRF (Popović, 2015) insufficient to distinguish the best systems. This fostered a line of work on neural-based metrics, with recent proposals such as BLEURT (Sellam et al., 2020), COMET (Rei et al., 2020a) and PRISM (Thompson and Post, 2020a). Metrics for quality estimation (QE; when references are not available) have also been developed as part of OPENKIWI (Kepler et al., 2019) and TRANSQUEST (Ranasinghe et al., 2020).

Table 1: Example of uncertainty-aware MT evaluation for a sentence in the WMT20 dataset. Shown are two Russian translations of the same English source "She said, 'That's not going to work.'" with reference "Она сказала: "Не получится."" (gloss: "She said, «That will not work»"). For the first sentence, COMET provides a point estimate (in red) that overestimates quality, as compared to a human direct assessment (DA), while our UA-COMET (in green) returns a large 95% confidence interval which contains the DA value. For the second sentence, UA-COMET is confident and returns a narrow 95% confidence interval.

While the metrics above have enjoyed some success in system-level evaluation, where the goal is to compare different systems, their segment-level quality scores are often unreliable for practical use. They all share the limitation that their output is a single point estimate: they do not provide any uncertainty information, such as confidence intervals, with their quality predictions. This is an important limitation: often, complex or out-of-domain sentences receive quality estimates that are far from their true quality (as illustrated in Table 1). This may lead to translations with critical mistakes going undetected, and hinders worst-case performance analysis of MT systems.
In this paper, we propose a simple and effective method to obtain uncertainty-aware quality/metric estimation systems, by representing quality as a distribution, rather than a single value. To this end, we make use of and compare two well-studied techniques for uncertainty estimation: Monte Carlo (MC) dropout (Gal and Ghahramani, 2016) and deep ensembles (Lakshminarayanan et al., 2017). In both cases, our method is agnostic to the particular metric estimation system, as long as it can be ensembled or perturbed. In our experiments we use COMET (Rei et al., 2020a), and we call our uncertainty-aware version UA-COMET. 1 Our method allows using the same system with a varying number of references. We show that confidence intervals tend to shrink as more references are added, which matches the intuition that MT evaluation systems should become more confident as they have access to more information.
We evaluate our approach using data from the WMT20 metrics task (Mathur et al., 2020), including its recent extension with Google MQM annotations (Freitag et al., 2021), and the QT21 dataset (Specia et al., 2017). The results show that our uncertainty-aware systems exhibit better calibration with respect to human direct assessments (DA; Graham et al. 2013), multi-dimensional quality metric scores (MQM; Lommel et al. 2014), and human translation error rates (HTER; Snover et al. 2006) than a simple baseline, while their average quality scores achieve similar or better correlation than the vanilla COMET system. Finally, we illustrate a potential quality estimation use case enabled by our approach: automatically detecting low-quality translations with a risk-based criterion.

Related Work
Automatic MT evaluation Reference-based approaches for MT evaluation include traditional metrics such as BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014), as well as the recently proposed BLEURT (Sellam et al., 2020), BERTSCORE (Zhang et al., 2020), PRISM (Thompson and Post, 2020a) and COMET (Rei et al., 2020a). Approaches that do not make use of human references are generally referred to as QE systems (Specia et al., 2018; Kepler et al., 2019; Ranasinghe et al., 2020). Our proposed approach augments reference-based approaches and enables a single system that can be used with multiple references, with the added advantage of providing uncertainty information. To the best of our knowledge, predictive uncertainty in QE has been approached only with Gaussian processes (Beck et al., 2016), which are neither competitive nor easy to integrate with current neural architectures.
Confidence estimation in MT A related line of work is confidence estimation of sentence-level MT outputs (Blatz et al., 2004;Quirk, 2004;Wang et al., 2019). The work that relates the most with ours is the one by Fomicheva et al. (2020), who propose an unsupervised glass-box approach to QE, extracting uncertainty-related features from the MT system via MC dropout. They show that the more confident the decoder (as measured by the lower variance of its output), the higher the quality of the MT output. Our work builds upon this perspective to propose uncertainty estimation of the QE systems themselves, rather than uncertainty of MT.
Performance prediction in NLP A related problem is that of predicting the performance of an NLP system without having to train it (Xia et al., 2020). Recent approaches perform such predictions by adding confidence intervals (Ye et al., 2021) and measuring calibration error. We take inspiration from these works to improve the calibration of our methods (Guo et al., 2017;Desai and Durrett, 2020) and to evaluate how good our uncertainty estimates are with a suite of performance indicators.
Uncertainty estimation Overall, the concepts and methods of uncertainty quantification (Huellermeier and Waegeman, 2021) have been widely explored and compared for many different tasks, including MT (Ott et al., 2018). Uncertainty estimation in neural networks has traditionally been approached with Bayesian methods, replacing point estimates of weights with probability distributions (Mackay, 1992; Graves, 2011; Welling and Teh, 2011; Tran et al., 2019). However, Bayesian neural networks are costly, and various approximations exist to avoid their high training cost. Model ensembling (Dietterich, 2000; Garmash and Monz, 2016; McClure and Kriegeskorte, 2017; Lakshminarayanan et al., 2017; Pearce et al., 2020; Jain et al., 2020) is a commonly used approach, which employs an ensemble of neural networks to obtain multiple point predictions and then uses their empirical variance as an approximate measure of uncertainty. Its main disadvantage is the need to train multiple models. An alternative is MC dropout (Gal and Ghahramani, 2016), which builds upon dropout regularization (Srivastava et al., 2014) but applies it at test time, performing several stochastic forward passes through the network and computing the mean and variance of the resulting outputs as a proxy for the model's uncertainty. Our work applies and compares the last two techniques for MT evaluation. Note that more elaborate approaches have been proposed to address uncertainty quantification in classification tasks, including calibration approaches (Guo et al., 2017; Kuleshov et al., 2018a), the use of Dirichlet distributions (Sensoy et al., 2018; Malinin and Gales, 2018; Charpentier et al., 2020) and entropy measures (Smith and Gal, 2018). However, MT evaluation is a regression task that has so far been largely overlooked in terms of predictive uncertainty. Our paper can be seen as a first step towards uncertainty-aware MT evaluation models.
3 Uncertainty-Aware MT Evaluation

Problem definition
Typical MT evaluation systems take as input a tuple ⟨s, t, R⟩, where s is the source text, t is the machine-translated text, and R = {r_1, ..., r_{|R|}} is a (possibly empty) set of reference translations. Their goal is to predict an automatic score q̂ ∈ ℝ which assesses the quality of the translation. Supervised systems such as COMET or BLEURT are trained to approximate ground-truth scores q* obtained from human annotations, such as DA, MQM and HTER. In this paper, we assume that q* is a continuous real-valued score, but the main ideas extend to the case where q* are discrete classes or quality bins.

Sources of uncertainty
There are several challenges with learning MT evaluation systems:

1. Noisy scores. The human-generated scores q* are not always reliable and often suffer from high variability, exhibiting low inter-annotator agreement. This problem can be mitigated by averaging over a sufficient number of annotations, but this brings considerable annotation costs (Freitag et al., 2021; Mathur et al., 2020).

2. Noisy or insufficient references. The references R do not always have good quality, and their scarcity (small |R|) is often insufficient to represent the space of possible correct translations well (Freitag et al., 2020).2 An extreme case is when there are no references (R = ∅), a problem known as "QE as a metric."

3. Complex translations. Correct translations are often non-literal, and it may be hard for an automatic system to grasp the semantic relation between the translated sentence and the references, which may be confused with hallucinations.

4. Out-of-domain text. The text on which the MT evaluation system is run may belong to a different domain from the one it was trained on.
The first two points can be seen as aleatoric uncertainty (noise in the input or output data), whereas the last two are instances of epistemic uncertainty, reflecting the limited knowledge of the model (Hora, 1996; Kiureghian and Ditlevsen, 2009). Unfortunately, these uncertainties add up. To cope with the different sources of uncertainty, we treat the quality score Q as a random variable and predict a distribution p̂_Q(q), as opposed to a point estimate q̂. This way, we obtain an uncertainty-aware system, which can return a peaked distribution when it is confident about its quality estimate, or a flatter distribution in cases where it is more uncertain. This allows, among other things, managing the risk of treating a translation as good quality when it is not (see §5.4). When estimating quality on the fly without references, knowing the system's confidence in the quality of the produced translations might help obtain informative worst-case indicators on whether a human post-edit is required, e.g. by evaluating the cumulative distribution function F̂_Q(χ) = ∫_{−∞}^{χ} p̂_Q(q) dq, which quantifies the translation risk, i.e., the probability of a translation being below a quality threshold χ. Moreover, having access to such distributions of quality estimates can be beneficial when deciding whether one system outperforms another with some level of confidence.
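The translation-risk computation can be sketched numerically, assuming a Gaussian fit of p̂_Q(q) as adopted later in the paper; the threshold χ and the score/variance values below are purely illustrative:

```python
from statistics import NormalDist

def translation_risk(mu: float, sigma: float, chi: float) -> float:
    """P(Q <= chi): probability that quality falls below threshold chi,
    under a Gaussian quality distribution N(mu, sigma^2)."""
    return NormalDist(mu, sigma).cdf(chi)

# A confident prediction well above the threshold carries negligible risk...
low_risk = translation_risk(mu=0.8, sigma=0.05, chi=0.0)
# ...while a wide-interval prediction near the threshold is genuinely risky.
high_risk = translation_risk(mu=0.1, sigma=0.5, chi=0.0)
print(f"low: {low_risk:.6f}, high: {high_risk:.4f}")
```

Note that the second translation is flagged as risky mainly because of its wide interval, not because its mean score is low in absolute terms.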

Uncertainty and confidence intervals
To obtain p̂_Q(q), our approach builds upon a vanilla MT evaluation system h (such as COMET) that produces point estimates q̂ = h(⟨s, t, R⟩), and augments it to produce uncertainty estimates. Our approach is completely agnostic about the system h, as long as it can be ensembled or perturbed.
The first step is to use h to produce a set Q = {q̂_1, ..., q̂_N} of quality scores for a given input ⟨s, t, R⟩, which will be interpreted as a sample from p̂_Q(q). For this, we experiment with two methods: MC dropout (Gal and Ghahramani, 2016), which obtains Q by running N stochastic forward passes on h with units dropped out with a given probability; and deep ensembles (Lakshminarayanan et al., 2017), in which N separate models are trained with different random initializations and then run in parallel to obtain Q. While both methods have been shown to be effective in several tasks (Fomicheva et al., 2020; Malinin and Gales, 2021), MC dropout is usually more convenient (because only one model is required), but generally requires many more samples (larger N) for good performance compared to deep ensembles.
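The two sampling strategies can be sketched with a toy stand-in for the scorer (in the paper, h is the neural COMET model; `toy_stochastic_forward` and all numbers here are illustrative assumptions, not part of the original work):

```python
import random
from statistics import mean, pvariance

def mc_dropout_sample(stochastic_forward, n_samples=100, seed=0):
    """N stochastic forward passes with dropout kept active at test time."""
    rng = random.Random(seed)
    return [stochastic_forward(rng) for _ in range(n_samples)]

def ensemble_sample(models, inputs):
    """One deterministic forward pass per independently trained model."""
    return [m(inputs) for m in models]

# Toy stand-in: a 'scorer' whose output jitters as if units were dropped out.
def toy_stochastic_forward(rng, base_score=0.7, noise=0.1):
    return base_score + rng.gauss(0.0, noise)

scores = mc_dropout_sample(toy_stochastic_forward, n_samples=100)
mu_hat, var_hat = mean(scores), pvariance(scores)

# Toy 'ensemble': three models that happen to return fixed scores.
models = [lambda inputs, b=b: b for b in (0.6, 0.7, 0.8)]
ensemble_scores = ensemble_sample(models, inputs=None)

print(f"MCD: mu={mu_hat:.3f}, var={var_hat:.4f}; DE: {ensemble_scores}")
```

Either way, the sample Q is summarized by its empirical mean and variance, which feed the Gaussian fit of the next step.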
The second step is to use the resulting set Q to represent the model's uncertainty. One way of representing uncertainty is through confidence intervals: given a desired confidence level γ ∈ [0, 1] (e.g. γ = 0.95), we specify the smallest possible quality interval I(γ) = [q_min(γ), q_max(γ)] such that P(q ∈ I(γ)) = ∫_{q_min(γ)}^{q_max(γ)} p̂_Q(q) dq ≥ γ. There are two possible strategies to obtain such intervals: a parametric approach, which parametrizes the distribution p̂_Q(q), produces estimates of its parameters by fitting the distribution to Q, and uses them to compute confidence intervals at arbitrary levels γ; and a non-parametric approach, which bypasses the estimation of p̂_Q(q) and estimates its quantiles for the desired values of γ directly from Q. In this paper, we opted for a simple parametric Gaussian approach, which worked well in practice and seemed to fit our data well (see Figure 3 in App. B). However, we also experimented with a non-parametric bootstrapping technique using the percentile method (Efron, 1979; Johnson, 2001; Ye et al., 2021), which we report in App. E.
In our approach, we treat Q as a sample drawn from a Gaussian distribution, p̂_Q(q) = N(q; μ̂, σ̂²), and estimate the parameters μ̂ and σ̂² as the sample mean and variance, respectively. Once p̂_Q(q) is fit to Q, the confidence interval I(γ) = [q_min(γ), q_max(γ)] can be estimated at the desired level of confidence γ, using the probit (quantile) function probit(p) = √2 erf⁻¹(2p − 1), where erf is the error function:

q_min(γ) = μ̂ − σ̂ · probit((1 + γ)/2),   q_max(γ) = μ̂ + σ̂ · probit((1 + γ)/2).   (1)
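Under the Gaussian fit, the interval can be computed directly from the sample of quality scores; `statistics.NormalDist.inv_cdf` plays the role of the probit-based quantile (the sample below is illustrative, not real COMET output):

```python
from statistics import NormalDist, mean, stdev

def confidence_interval(scores, gamma=0.95):
    """Gaussian confidence interval I(gamma) fitted to a sample of scores."""
    dist = NormalDist(mean(scores), stdev(scores))
    return dist.inv_cdf((1 - gamma) / 2), dist.inv_cdf((1 + gamma) / 2)

# Illustrative sample of N quality scores for one segment.
q_samples = [0.62, 0.68, 0.71, 0.66, 0.73, 0.65, 0.69, 0.70]
lo, hi = confidence_interval(q_samples, gamma=0.95)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```

For a tighter sample the interval shrinks accordingly, which is exactly the behavior exploited in the multi-reference experiments later on.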

MT evaluation with multiple references
As our framework can model uncertainty, it is interesting to consider the case where the number of available references R may vary. Intuitively, we expect the uncertainty to decrease as the model observes more references. Relying on a single reference can prove problematic, since even human-generated references can be noisy and prone to errors. Additionally, for source sentences with multiple diverse valid translations, relying on a single reference might lead to underestimating the quality of valid MT hypotheses. For these reasons, additional references, even if they are paraphrased versions of the originals (Freitag et al., 2020), can help obtain better evaluations of MT systems' outputs.
As a result, relying on human-generated references can be a constraint in terms of learning and predicting accurate quality estimates for adequately diverse data (Sun et al., 2020). We thus want to assess the impact of additional references (both independently generated and paraphrased) on the estimated confidence intervals.
Even though our approach works with any underlying MT evaluation system h that produces point estimates, most existing systems cannot seamlessly handle a varying number of references (or no references) without architecture modifications. For example, COMET originally receives exactly one reference as input to predict the quality of a ⟨s, t⟩ pair. We take the following approach to handle a varying number of references (|R| > 1): we obtain a set of N quality predictions for each available reference r ∈ R for a given ⟨s, t⟩ pair, resulting in a set of N × |R| quality predictions. We then compute the pointwise average across the |R| dimension, leading to N quality scores Q = {q̂_1, ..., q̂_N} that aggregate information from all |R| references. We can then apply the same approach as described earlier. Intuitively, the averaging operation should reduce variance in the quality scores, resulting in narrower confidence intervals as |R| increases. We validate this hypothesis in our experiments in §5.4.
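The reference-wise aggregation can be sketched as follows (toy numbers; in practice each inner list would hold the N scores obtained for one reference via MC dropout or an ensemble):

```python
from statistics import mean, pvariance

def aggregate_over_references(preds_per_ref):
    """preds_per_ref: list of |R| lists, each holding N quality predictions.
    Returns N scores, the i-th averaging the i-th prediction of every ref."""
    return [mean(column) for column in zip(*preds_per_ref)]

# Illustrative numbers: two references, N = 3 samples each.
ref_a = [0.60, 0.70, 0.65]
ref_b = [0.80, 0.70, 0.75]
q_scores = aggregate_over_references([ref_a, ref_b])
print(q_scores, pvariance(ref_a), pvariance(q_scores))
```

In this toy case the averaged sample has lower variance than either per-reference sample, mirroring the narrower confidence intervals observed with more references.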

Post-calibration
In our initial experiments, we observed that the magnitude of the predicted variance σ̂² depends significantly on several hyperparameters, such as the choice of dropout probability, the number of samples, and the language pair. In classification tasks, a similar phenomenon has been reported by Malinin and Gales (2021), who recommended combining these methods with temperature calibration (Platt, 1999) to adjust uncertainties and obtain more reliable confidence intervals. For regression tasks (our case), since temperature scaling is only applicable in classification, they propose an isotonic regression technique instead (Niculescu-Mizil and Caruana, 2005). We found that we can obtain highly calibrated uncertainty estimates in a much simpler way, by learning an affine transformation σ̂² → σ̂²_calib = ασ̂² + β, where α and β are scalars tuned to minimize the calibration error (see Eq. 2-3) on a validation set. We use the tuned σ̂²_calib in our experiments (§5), and show the improvement in ECE at different confidence levels with σ̂²_calib in Figure 1.
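A minimal sketch of the affine variance calibration, with an illustrative grid and toy validation data; for brevity this toy version tunes (α, β) by minimizing the Gaussian NLL on the validation set, whereas the paper tunes against the calibration error:

```python
import math

def gaussian_nll(mus, variances, targets):
    """Average Gaussian negative log-likelihood over a validation set."""
    return sum(0.5 * math.log(2 * math.pi * v) + (t - m) ** 2 / (2 * v)
               for m, v, t in zip(mus, variances, targets)) / len(targets)

def calibrate_variance(mus, variances, targets):
    """Grid search for sigma^2 -> alpha * sigma^2 + beta (toy grid)."""
    best = (1.0, 0.0, float("inf"))
    for alpha in [0.25, 0.5, 1.0, 2.0, 4.0]:
        for beta in [0.0, 0.01, 0.05, 0.1]:
            nll = gaussian_nll(mus, [alpha * v + beta for v in variances],
                               targets)
            if nll < best[2]:
                best = (alpha, beta, nll)
    return best[:2]

# Toy validation data where the raw variances are too small by a factor of 4.
mus = [0.0, 0.5, 1.0, 1.5]
raw_vars = [0.01, 0.01, 0.01, 0.01]
targets = [0.2, 0.3, 1.2, 1.3]
alpha, beta = calibrate_variance(mus, raw_vars, targets)
print(alpha, beta)
```

On this toy data the search recovers the 4x underestimation of the variance, illustrating why such a rescaling can fix systematically over- or under-confident uncertainty estimates.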

Evaluating Uncertainty
Having described our framework, we now turn to the problem of verifying the effectiveness and informativeness of the proposed uncertainty quantification method. Two crucial aspects to take into account when evaluating uncertainty-aware systems are: (i) the system should not harm the predictive accuracy compared to a system without uncertainty and (ii) the uncertainty estimate should reflect the failure probability of the system well, meaning that the system "knows when it does not know." In what follows, we assume a test or validation set consisting of examples together with their ground truth quality scores.
Calibration Error One way of understanding whether models can be trusted is to analyze whether they are calibrated (Raftery et al., 2005; Jiang et al., 2011; Kendall and Gal, 2017), that is, whether the confidence estimates of their predictions are aligned with the empirical likelihoods (Guo et al., 2017). In classification tasks, this is assessed by the expected calibration error (ECE; Naeini et al. 2015), which has been generalized to regression by Kuleshov et al. (2018b). It is defined as

ECE = (1/M) Σ_{b=1}^{M} |γ_b − acc(γ_b)|,   (2)

where each b ∈ {1, ..., M} is a bin representing a confidence level γ_b, and

acc(γ_b) = (1/|D|) Σ_{i=1}^{|D|} 1[q*_i ∈ I_i(γ_b)]   (3)

is the fraction of times the ground truth q* falls inside the confidence interval I(γ_b). We use this metric with M = 100.
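A sketch of the regression ECE under the paper's Gaussian parametrization; the placement of the M confidence levels strictly inside (0, 1) is an implementation detail assumed here, and the synthetic data are illustrative:

```python
import random
from statistics import NormalDist

def expected_calibration_error(mus, sigmas, targets, M=100):
    """Average |gamma_b - acc(gamma_b)| over M confidence levels, where
    acc(gamma_b) is the empirical coverage of the Gaussian intervals."""
    ece = 0.0
    for b in range(1, M + 1):
        gamma = b / (M + 1)  # levels strictly inside (0, 1)
        covered = 0
        for mu, sigma, t in zip(mus, sigmas, targets):
            d = NormalDist(mu, sigma)
            lo, hi = d.inv_cdf((1 - gamma) / 2), d.inv_cdf((1 + gamma) / 2)
            covered += lo <= t <= hi
        ece += abs(gamma - covered / len(targets))
    return ece / M

# Synthetic sanity check: targets drawn from the predicted Gaussians
# themselves should be close to perfectly calibrated (small ECE).
rng = random.Random(0)
mus = [rng.uniform(-1.0, 1.0) for _ in range(500)]
sigmas = [0.3] * 500
targets = [rng.gauss(m, 0.3) for m in mus]
ece = expected_calibration_error(mus, sigmas, targets)
print(f"ECE on calibrated synthetic data: {ece:.3f}")
```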
Negative log-likelihood To evaluate parametric methods that represent the full distribution p̂_Q(q), we can use a single metric that captures both accuracy and uncertainty, the average negative log-likelihood (NLL) of the ground truth quality scores according to the model:

NLL = −(1/|D|) Σ_{i=1}^{|D|} log p̂_{Q_i}(q*_i).   (4)

This metric penalizes predictions that are accurate but have high uncertainty (since they correspond to flat distributions with low probability everywhere), and even more severely penalizes incorrect predictions with high confidence (as they are peaked in the wrong location), but it is more forgiving to predictions that are inaccurate but have high uncertainty.
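The trade-off described above can be made concrete with a single Gaussian prediction (all values are illustrative):

```python
import math

def nll(mu, sigma, q_star):
    """Gaussian negative log-likelihood of one ground-truth score."""
    return (0.5 * math.log(2 * math.pi * sigma ** 2)
            + (q_star - mu) ** 2 / (2 * sigma ** 2))

accurate_confident   = nll(0.0, 0.1, 0.0)  # right and narrow: best
accurate_uncertain   = nll(0.0, 1.0, 0.0)  # right but wide: mildly penalized
inaccurate_uncertain = nll(1.0, 1.0, 0.0)  # wrong but wide: partly forgiven
inaccurate_confident = nll(1.0, 0.1, 0.0)  # wrong and narrow: punished hardest
print(accurate_confident, accurate_uncertain,
      inaccurate_uncertain, inaccurate_confident)
```

The ordering of the four values matches the prose: confident mistakes dominate the loss, while honest uncertainty softens the penalty for errors.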
Sharpness The metrics above do not sufficiently account for how "tight" the uncertainty interval is around the predicted value, and thus might favour predictors that produce wide and uninformative confidence intervals. To guarantee useful uncertainty estimation, confidence intervals should not only be calibrated, but also sharp. We measure sharpness as the average predicted variance σ̂², as defined by Kuleshov et al. (2018b).

Pearson correlations As shown by Ashukha et al. (2020), NLL and ECE alone might not be enough to evaluate uncertainty-aware systems. Therefore, we complement the indicators above with two Pearson correlations involving the system's predictions and the ground truth quality scores coming from human judgements. The first, which we call the predictive Pearson score (PPS), assesses the predictive accuracy of the system, regardless of the uncertainty estimate: it is the Pearson correlation r(q*, μ̂) between the ground truth quality scores q* and the average system predictions μ̂ in the dataset D (for the baseline point estimate system, we use q̂ instead of μ̂). We expect this score to be similar to the baseline or slightly better due to the ensemble effect. The second is the uncertainty Pearson score (UPS), r(|q* − μ̂|, σ̂), which measures the alignment between the prediction errors |q* − μ̂| and the uncertainty estimates σ̂. Note that achieving a high UPS is much more challenging: a model with a very high score would know how to correct its own predictions to obtain perfect accuracy. We confirm this claim later in our experiments.
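The two Pearson indicators reduce to a few lines (function names are ours, for illustration):

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def pps(q_star, mu_hat):
    """Predictive Pearson score: r(q*, mu_hat)."""
    return pearson(q_star, mu_hat)

def ups(q_star, mu_hat, sigma_hat):
    """Uncertainty Pearson score: r(|q* - mu_hat|, sigma_hat)."""
    errors = [abs(q - m) for q, m in zip(q_star, mu_hat)]
    return pearson(errors, sigma_hat)

# UPS is close to 1 only when the predicted std tracks the actual errors.
print(ups([0.0, 0.0, 0.0], [0.1, 0.2, 0.3], [0.1, 0.2, 0.3]))
```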

Datasets
We apply our method to predict three types of human judgement scores at segment level: DA, MQM and HTER. We use the WMT20 metrics shared task dataset (Mathur et al., 2020) for the DA judgements, and the Google MQM annotations for English-German (EN-DE) and Chinese-English (ZH-EN) on the same corpus (Freitag et al., 2021). For language pairs where both human- and system-generated translations are provided, we remove the human translations before evaluating (Human-A, Human-B, Human-P in WMT20). For the HTER experiments, we use the QT21 dataset (Specia et al., 2017). Dataset statistics are presented in App. B.

Experimental setup
For the experiments presented below, we use COMET as the underlying MT quality evaluation system (Rei et al., 2020a).3 For evaluation, we perform k-fold cross-validation: we split the test partition into k = 5 folds, so that each fold contains translations of every MT system and has approximately the same number of documents. The k-fold splits are generated in such a way that there are unique source-reference pairs in each fold, and the documents are disjoint across the folds. Since documents vary in length, the number of segments per fold can differ. We use 4 folds for validation and the remaining one for testing. As we experiment with human annotations of different scales, q̂ and q* are standardized on the validation set and the model is post-calibrated as described in §3.5.

MC dropout (MCD)
We apply a dropout probability of 0.1 and perform N = 100 stochastic forward passes. Dropout was applied at the encoder, pooling and feed-forward layers, as we found this produces more useful σ̂ values, corroborating the findings of Verdoja and Kyrki (2020) and Kendall et al. (2017). More details on tuning the hyperparameters can be found in App. C.
Deep Ensembles (DE) We train ensembles with N = 5 models and random initialization. For training, we follow the procedure described by Rei et al. (2020b), training each model for 2 epochs.
Baseline As a simple baseline, we take the original point estimates q̂ provided by the underlying COMET system and map them to a Gaussian distribution N(q; μ̂, σ̂²) with μ̂ := q̂ and a fixed variance σ̂² := σ²_fixed (i.e., the same variance is assigned to all examples). We compute σ²_fixed on the validation set so that it minimizes the average NLL, which has the closed-form expression (see App. A for a proof)

σ²_fixed = (1/|D|) Σ_{i=1}^{|D|} (q̂_i − q*_i)².

This baseline turned out to be surprisingly strong on several performance indicators (Tables 2, 3, 4).

Table 2 presents results for the performance indicators described in §4 for 9 language pairs in the WMT20 dataset, encompassing a mix of high-resource and low-resource languages. We observe that both uncertainty-aware methods (MCD and DE) show consistent improvement over the baseline on all metrics and language pairs, with the exception of NLL on two language pairs (ZH-EN and EN-IU). We also see that, overall, deep ensembles provide more accurate predictions and narrower confidence intervals compared to MC dropout, but without a significant improvement on the other performance indicators across pairs. Considering the computational cost of training and tuning multiple models for the deep ensemble, MC dropout seems preferable for the presented MT evaluation setup. While these results are encouraging, we stress that experiments on higher quality data at a larger scale are necessary to fully validate and compare uncertainty-aware methods, as the numbers in Table 2 are influenced by the inconsistencies in DA annotations, which are known to be particularly noisy (Toral, 2020; Freitag et al., 2021). To mitigate this, we further compare performance on the recently released Google MQM annotations for EN-DE and ZH-EN, shown in Table 3.
As expected given the higher quality of these annotations, and even though the underlying COMET system was still trained on DAs and evaluated on the MQM assessments, we obtain higher uncertainty correlations, with the MC dropout approach benefiting the most. We also notice a significant improvement across all indicators for the ZH-EN dataset, which was poorly correlated with the predictions on the DA dataset. We use the MQM annotations to provide a more in-depth analysis of specific use cases in translation evaluation in §5.4-5.5.

Segment-level analysis
Finally, Table 4 shows the results on HTER prediction on the QT21 dataset. 4 For this metric and dataset, the Pearson correlations are generally higher than in previous experiments (with the exception of UPS for EN-CS) and the sharpness scores indicate that the predicted confidence intervals are considerably narrower, showing that for these experiments the models are generally more accurate and more confident than when predicting DA and MQM. This might be explained by the fact that HTER, which quantifies the amount of post-editing required to fix a translation, is a less subjective metric than a quality assessment, and therefore the aleatoric uncertainty caused by noisy scores may be smaller.

Impact of reference quantity
Table 2: Underlined numbers indicate the best result for each language pair and evaluation metric. Reported are the predictive Pearson score r(μ̂, q*) (PPS), the uncertainty Pearson score r(|q* − μ̂|, σ̂) (UPS), the negative log-likelihood (NLL), the expected calibration error (ECE), and the sharpness (Sha.). Note that the UPS of the baseline is always zero, since it has a fixed variance.

We next experiment with the WMT20 EN-DE data to get some insights on the impact of using multiple references, as described in §3.4. This dataset contains 3 human references (Human-A, B, and P) for each source sentence, generated in different ways: A and B were produced independently by annotators, and P is a paraphrased as-much-as-possible version of A. Our goal is to simulate the availability of multiple human references of varying quality levels. As reported in the findings of the WMT20 Metrics task (Mathur et al., 2020), these references differ in quality levels, and the quality of human references is not always known. We thus calculate the performance when using each of the Human-A, Human-B and Human-P references individually, and then compare randomly sampling r from R with averaging predictions over each r in R, hypothesizing that the combination of references will result in reduced model uncertainty.
We can see in Table 5 that when having access to multiple references, combining all available references (Mul) results in narrower confidence intervals than sampling single references (S-1) or even pairs of references (S-2), as indicated by the decreasing sharpness values. Beyond sharpness, the model also seems to benefit from the added knowledge, since we see consistent improvements in the PPS and NLL metrics. Thus, with the incorporation of additional human references we obtain models that are more confident, and rightly so, since they are also more predictive. Combining this information with the performance of singleton reference sets in Table 6, we note that even among human references, the estimated reference quality has an impact both on the predictive accuracy (PPS) and on the confidence (UPS, NLL, sharpness). For both the S-N and Mul approaches, the inclusion of Human-P in the reference set results in a performance drop across all metrics. Still, the negative impact of Human-P decreases as more references are combined, and we conclude that when there is no information on the quality of the references, the best approach is to combine them: for R = {A, B, P}, Mul results in performance similar to using Human-A alone.

Detection of critical translation mistakes
One of the key applications where uncertainty-aware MT evaluation is particularly relevant is the identification of critical translation errors that require human-assisted editing. To investigate whether uncertainty can improve the performance of critical error detection, we treat error detection as an information retrieval problem where we aim to identify the worst translations based on human annotations. We experiment with the EN-DE dataset and the corresponding MQM annotations, since MQM scores are specifically designed with the distinction between major and minor translation errors in mind (Burchardt and Lommel, 2014).
In this experiment we also take into consideration the number of words in the MT sentence and normalize scores accordingly, to avoid flagging very long translations with accumulated minor errors as critical. We elaborate and provide comparative examples regarding this choice in Appendix F. We calculate and average the MQM scores of all 3 annotators per segment and then normalize by MT length. We then use the segments with the n% lowest scores as the retrieval targets. We present the results for the 2% lowest quality segments in Figure 2, and we provide additional results (with n ranging from 1% to 20% lowest quality segments) in Appendix F. The statistics for the MQM data5 used in this experiment are given in Table 7. Our hypothesis is that we can better predict erroneous translations by using the cumulative distribution function over Q for each ⟨s, t, R⟩ to predict the probability P(Q ≤ q_err), where q_err is a quality threshold tuned on the validation set to optimize average Recall@N. We then compare 3 ways of scoring the translations automatically: (1) using the scores q̂ predicted by h to rank translations; (2) using the mean μ̂ of the estimated distribution p̂_Q(q) instead of the single point estimate q̂; and (3) using the uncertainty-aware parametric models to compute and rank by the probability P(Q ≤ q_err). Since this scenario is more relevant to real-time/on-demand translation evaluation, we test it under the assumption that there is no access to a human reference. To handle this referenceless case (R = ∅, also known as quality estimation), we use translations produced by an MT system outside the WMT20 participants as pseudo-references (Scarton and Specia, 2014; Duma and Menzel, 2018). We use PRISM6, which was originally trained as a multilingual NMT model (Thompson and Post, 2020b,a). We evaluate all scoring approaches using Recall@N and Precision@N, as shown in Figure 2.
We can see that while for very small values of N all approaches perform similarly, the uncertainty-aware approach (UA-COMET) outperforms the other two in Recall as N increases, while it also achieves higher Precision, especially for small values of N, which are of greatest interest since we want to correct as many critical errors as possible with minimal human intervention.
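The risk-based ranking behind strategy (3) can be sketched as follows; the segment ids, scores and threshold are illustrative, and real risk scores would come from UA-COMET's fitted Gaussians:

```python
from statistics import NormalDist

def rank_by_risk(preds, q_err):
    """preds: list of (segment_id, mu, sigma). Most risky first, where
    risk is P(Q <= q_err) under the predicted Gaussian."""
    risk = lambda p: NormalDist(p[1], p[2]).cdf(q_err)
    return [p[0] for p in sorted(preds, key=risk, reverse=True)]

def recall_at_n(ranked_ids, target_ids, n):
    return len(set(ranked_ids[:n]) & set(target_ids)) / len(target_ids)

preds = [("t1", 0.90, 0.05),  # confident, high quality
         ("t2", 0.50, 0.40),  # uncertain, borderline mean
         ("t3", 0.20, 0.05),  # confident, low quality
         ("t4", 0.45, 0.02)]  # confident, slightly lower mean than t2
ranked = rank_by_risk(preds, q_err=0.3)
print(ranked)
```

Note that ranking by the point estimate alone would place t4 before t2, whereas the risk criterion flags the wide-interval t2 first: its mass below the threshold is what matters, not its mean.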

Conclusions
We introduced uncertainty-aware MT evaluation and showed how MT-related applications can benefit from this approach. We compared two techniques for estimating uncertainty, MC dropout and deep ensembles, across several performance indicators. Through experiments on three datasets with different human quality assessments, encompassing several language pairs, we have shown that the resulting confidence intervals are informative and correlated with the prediction errors, leading to slightly more accurate predictions accompanied by useful uncertainty estimates. Our uncertainty-aware system can take multiple references into account and becomes more confident (and more accurate) when more references are available; it can also perform quality estimation without any human reference by relying on pseudo-references from other MT systems (PRISM). We have shown that uncertainty-aware MT evaluation is a promising path. As a future direction, we aim to further explore uncertainty prediction methods that tackle the different kinds of aleatoric and epistemic uncertainty described in §3.2 and are better tailored to the specifics of this task.

⁵ We use a fixed dev/test split instead of k-fold cross-validation in this case. We still ensure that we do not split any document across dev/test and that the test set remains "unseen".
⁶ We use the m39v1 model in https://github.com/thompsonb/prism and the zero-shot translation setup.

A Baseline with Fixed Variance
We show here that, when p̂_Q(q) = N(q; μ̂, σ²) is a Gaussian distribution with a fixed variance σ², the optimal fixed variance that minimizes the NLL is the mean squared error of the point estimates,

    σ²_opt = (1/N) Σ_{n=1}^N (q*_n − μ̂_n)².

To show this, observe that

    NLL(σ²) = (1/N) Σ_{n=1}^N [ (q*_n − μ̂_n)² / (2σ²) + (1/2) log(2πσ²) ] = F(y) + (1/2) log π,

where we made the variable substitution y = 1/(2σ²) and we defined the function F : ℝ_{>0} → ℝ,

    F(y) = y · (1/N) Σ_{n=1}^N (q*_n − μ̂_n)² − (1/2) log y,

which is convex on its domain and tends to +∞ when y → 0⁺ and when y → +∞, hence it has a global minimum. Equating the derivative of the objective function to zero, we get

    F′(y) = (1/N) Σ_{n=1}^N (q*_n − μ̂_n)² − 1/(2y) = 0,

from which we get y = 1/(2σ²_opt), i.e. σ²_opt = (1/N) Σ_{n=1}^N (q*_n − μ̂_n)².
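This result is easy to check numerically: evaluating the fixed-variance Gaussian NLL on toy scores (the values below are made up for illustration) confirms that the minimum sits at the MSE. A sketch:

```python
import math

# hypothetical ground-truth scores q* and point estimates mu-hat
q_star = [0.1, 0.4, 0.7, 0.3]
mu_hat = [0.2, 0.35, 0.6, 0.5]
n = len(q_star)
mse = sum((q - m) ** 2 for q, m in zip(q_star, mu_hat)) / n

def nll(var):
    """Average Gaussian NLL with a single fixed variance for all segments."""
    return sum((q - m) ** 2 / (2 * var) + 0.5 * math.log(2 * math.pi * var)
               for q, m in zip(q_star, mu_hat)) / n

# among these candidate variances, the NLL is minimized exactly at var = MSE
candidates = [mse * s for s in (0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0)]
best = min(candidates, key=nll)
```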

B Datasets
We present in Table 8 descriptive statistics of the datasets used in our experiments. In Fig. 3 we show the distribution of predicted quality estimates for a random sample from the WMT20 dataset (EN-TA language pair⁷), with the corresponding Gaussian superimposed to illustrate the fit.

⁷ Based on a translation produced by the OPPO system, for the segment with index 473 (randomly sampled).

C Hyperparameter Tuning
The number of dropout runs was tuned on the [25, 200] interval with a step of 25 on the EN-DE WMT20 data; we show the results in Table 9. In preliminary experiments, we found that increasing the dropout probability beyond 0.1 did not bring any gains, so we fixed it at 0.1. We also found that dropping only the feed-forward layers of COMET and/or the pooling layers was ineffective, so we applied dropout to all COMET layers in all experiments presented in this paper.

Table 10 shows the hyperparameters used to train the DA and HTER estimators for our deep ensembles. In both cases we trained 4 models with different seeds and used as the fifth model the wmt-large-da-estimator-1719 and the wmt-large-hter-estimator available in https://github.com/Unbabel/COMET. Each of these models has 583M parameters and was trained on a single Nvidia Quadro RTX 8000 GPU⁸ for ≈ 34 hours (DA models) and ≈ 3.5 hours (HTER models). Regarding the validation performance recorded during training, the DA models achieve a PPS of 0.612 ± 0.002, while the HTER models achieve a PPS of 0.663 ± 0.012.
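The inference-time procedure these dropout runs feed into can be sketched as follows; predict_once stands in for one stochastic forward pass of a model with dropout left active (the toy noisy predictor below is an illustrative stand-in, not the actual COMET model):

```python
import random
import statistics

def mc_dropout_predict(predict_once, m=100):
    """Run a dropout-enabled predictor m times; the sample mean is the
    quality estimate and the sample variance its uncertainty."""
    samples = [predict_once() for _ in range(m)]
    mu = statistics.fmean(samples)
    return mu, statistics.pvariance(samples, mu)

# toy stand-in for a stochastic forward pass: scores jitter around 0.5
rng = random.Random(0)
noisy_score = lambda: 0.5 + 0.1 * (rng.random() - 0.5)
mu, var = mc_dropout_predict(noisy_score, m=100)
```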

E Non-parametric Estimation of Confidence Intervals
The parametric Gaussian approach we chose for obtaining confidence intervals, described in §3, fits our data relatively well (see Figure 3). However, this approach makes a strong assumption about the shape of p̂_Q(q), and therefore we also experimented with a non-parametric bootstrapping technique to estimate confidence intervals. Such an approach has been successful in several NLP tasks (Koehn, 2004; Li et al., 2017; Ye et al., 2021). In this case, we construct the confidence intervals I(γ) using the percentile method (Efron, 1979; Johnson, 2001): we take the range of point estimates in Q that covers equal γ/2 proportions on either side of the median of the p̂_Q(q) distribution as the desired confidence interval, represented by the corresponding sample quantiles. Since this approach typically requires many samples to obtain accurate estimates of the quantiles, we left out the deep ensemble method from this experiment (which would require training too many models) and focused only on samples obtained from MC dropout, using M = 100 as in the parametric Gaussian experiments.

⁸ https://www.nvidia.com/en-us/design-visualization/quadro/rtx-8000/
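The percentile method can be sketched in a few lines; the function below is a hypothetical helper (not the authors' code) that takes the M MC-dropout samples for one segment and returns the median together with the I(γ) interval:

```python
def percentile_interval(samples, gamma=0.95):
    """Percentile-method confidence interval: the empirical quantiles that
    leave (1 - gamma)/2 of the samples on each tail, plus the median."""
    s = sorted(samples)
    n = len(s)
    lo = round(((1 - gamma) / 2) * (n - 1))
    hi = round((1 - (1 - gamma) / 2) * (n - 1))
    median = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    return median, (s[lo], s[hi])
```

With M = 100 samples per segment, the 2.5% and 97.5% sample quantiles give a 95% interval without any Gaussian assumption.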
Since this approach does not produce a full distribution p̂_Q(q) but only the median μ̂_med and confidence intervals I(γ), the evaluation metrics UPS, NLL, and sharpness cannot be directly applied. We therefore evaluated with the following modifications of the predictive Pearson score and ECE.
Predictive Pearson score For Pearson-related evaluation we use the PPS performance indicator defined in §4, but we measure the correlation between ground-truth quality scores q* and the median μ̂_med instead of the mean μ̂.
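This modified indicator can be sketched as follows, assuming sample_runs holds the M MC-dropout scores per segment (the names and toy values are illustrative):

```python
import statistics

def pps_median(sample_runs, q_star):
    """Pearson correlation between human scores and the per-segment
    MEDIAN of the MC-dropout samples (instead of the mean)."""
    med = [statistics.median(s) for s in sample_runs]
    mx, my = statistics.fmean(med), statistics.fmean(q_star)
    cov = sum((a - mx) * (b - my) for a, b in zip(med, q_star))
    denom = (sum((a - mx) ** 2 for a in med)
             * sum((b - my) ** 2 for b in q_star)) ** 0.5
    return cov / denom
```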
Calibration Error To compute ECE we use the same method as defined in Eq. 2. We use this metric with M = 20 to assess the ability of the non-parametric method to estimate confidence intervals.
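We do not reproduce Eq. 2 here, but a common form of this calibration error averages, over M nominal confidence levels, the gap between nominal and empirical coverage. A sketch under that assumption (intervals_for is a hypothetical callable returning one interval per segment at level gamma):

```python
def calibration_error(intervals_for, q_true, gammas):
    """Average |empirical coverage - nominal gamma| over confidence levels."""
    total = 0.0
    for g in gammas:
        # count how many true scores fall inside their predicted interval
        hits = sum(lo <= q <= hi
                   for q, (lo, hi) in zip(q_true, intervals_for(g)))
        total += abs(hits / len(q_true) - g)
    return total / len(gammas)
```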
Experiments The results are shown in Table 11. Overall, MC dropout outperforms the baseline on both measures (except for PPS in EN-CS), but the improvement is marginal. The parametric approach on the same dataset (Table 2) performs better than the non-parametric one on both reported ECE and PPS. Still, the ECE values are close to those obtained with the parametric approach for all language pairs, and a well-calibrated model (compared to the baseline) can also be obtained with the non-parametric approach.
The observed performance of the non-parametric approach could be limited by the number of observed samples and by the method used to generate them (MC dropout).

F Detection of Critical Translation Mistakes
We provide more detailed experiments on critical translation error detection in Figure 4, showing Recall@N and Precision@N for different error proportions, ranging from the 1% to the 20% lowest-quality segments of the dataset. We can see that as the proportion of errors considered critical increases, the Recall@N gap between UA-COMET and COMET decreases.
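For reference, the two retrieval metrics reported here can be computed as follows (a generic sketch with illustrative names, not the authors' evaluation script):

```python
def recall_precision_at_n(ranked_ids, critical_ids, n):
    """Recall@N: fraction of true critical segments found in the top N.
    Precision@N: fraction of the top N that are truly critical."""
    top = set(ranked_ids[:n])
    hits = len(top & set(critical_ids))
    return hits / len(critical_ids), hits / n
```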
We show examples of the worst translations according to MQM scores with and without length normalisation in Tables 12 and 13, respectively, in order to better demonstrate the impact of length normalisation on the selection of critical errors.

MQM | Source (EN) | MT output (DE)
19.07 | "Currently we are targeting young people 18 to 24 years. For the young people that's the age bracket we are looking at but of course any one above 18 and it's because we do not have evidence of children by the Constitution but as more evidence unfolds we are going to get there. For the men, we give the kit to the mother and they take it to the partner, key and priority populations such sex workers," Mr Geoffrey Tasi, the technical officer-in-charge of HIV testing services, said yesterday. |
18.67 | Parents of 5-month-old stuffed in suitcase and tossed in dumpster get 6 years in prison | Eltern von 5-Monats-Alt in Koffer gefüllt und in Mülleimer geworfen bekommen 6 Jahre im Gefängnis
18.67 | Parents of 5-month-old stuffed in suitcase and tossed in dumpster get 6 years in prison | Eltern von 5 Monaten in Koffer gestopft und in Müllcontainer geworfen bekommen 6 Jahre Gefängnis
18.33 | Parents of 5-month-old stuffed in suitcase and tossed in dumpster get 6 years in prison | Eltern von 5 Monaten, die in Koffer gestopft und in Müllcontainer geworfen werden, bekommen 6 Jahre Gefängnis
18 | Sacramento police also announced Thursday their internal investigation did not find any policy or training violations. |
17.67 | Vulnerable Dems air impeachment concerns to Pelosi | Verletzliche Dems-Luft-Impeachment-Bedenken gegen Pelosi
17.67 | The 35-year-old star dumped the NBA player for good earlier this year after he was accused of cheating on her with family friend Jordyn Woods - having previously cheated when she was nine months pregnant with their daughter, True. |
17.67 | Vulnerable Dems air impeachment concerns to Pelosi | Anfällige Dems Luft Amtsenthebungsbedenken an Pelosi
17.43 | It comes just days after Tristan wrote: "Perfection" alongside the heart eye emojis underneath one of the reality stars other photos, which saw her modelling for Guess Jeans. |
17.43 | "You're going out a youngster, but you've got to come back a star!" Blanks wrote in an Instagram caption on Wednesday, quoting the film "42nd Street." | "Du gehst als Jugendlicher aus, aber du musst einen Stern zurückkommen!" Blanks schrieb am Mittwoch in einem Instagram-Titel den Film "42nd Street".
17.43 | "Sounding more and more like the so-called whistle-blower isn't a whistle-blower at all," he tweeted. "In addition, all second-hand information that proved to be so inaccurate that there may not have been somebody else, a leaker or spy, feeding it to him or her? A partisan operative?" | "Immer mehr nach dem sogenannten Whistleblower zu klingen, ist überhaupt kein Whistleblower", twitterte er. "Außerdem alle Informationen aus zweiter Hand, die sich als so ungenau erwiesen haben, dass möglicherweise nicht jemand anderes, ein Leckerbissen oder ein Spion, sie ihm oder ihr gefüttert hat? Ein Partisanen-Agent?"
17.4 | "Currently, 86 per cent people living with HIV know their status; that means it leave us with 14 per cent of those living with HIV and do not know their status. So how do we now utilise that additional innovation. Really for me this is it ... how do we now move from this kit and create demand, especially for that 14 per cent that are sick and they need care and they are not getting care," Dr Atwine said. |
17.33 | Sacramento police also announced Thursday their internal investigation did not find any policy or training violations. |
17.33 | "Currently we are targeting young people 18 to 24 years. For the young people that's the age bracket we are looking at but of course any one above 18 and it's because we do not have evidence of children by the Constitution but as more evidence unfolds we are going to get there. For the men, we give the kit to the mother and they take it to the partner, key and priority populations such sex workers," Mr Geoffrey Tasi, the technical officer-in-charge of HIV testing services, said yesterday. | "Gegenwärtig richten wir uns an junge Menschen zwischen 18 und 24 Jahren. Für die jungen Menschen ist das die Altersgruppe, die wir betrachten, aber natürlich jede über 18, und das liegt daran, dass wir keine Beweise für Kinder durch die Verfassung haben, aber wenn sich mehr Beweise entwickeln, werden wir dorthin gelangen. Für die Männer geben wir das Kit an die Mutter und sie bringen es an den Partner, Schlüssel-und Prioritätspopulationen wie Sexarbeiter", sagte gestern Geoffrey Tasi, der zuständige technische Offizier für HIV-Tests.
17.33 | Vulnerable Dems air impeachment concerns to Pelosi | Anfällige Dems Luft-Impeachment Bedenken gegen Pelosi

Table 13: Worst 20 translations according to MQM scores (averaged over 3 annotators) for EN-DE. Highlighted rows are common in both ranking approaches.