Toward More Effective Human Evaluation for Machine Translation

Improvements in text generation technologies such as machine translation have necessitated more costly and time-consuming human evaluation procedures to ensure an accurate signal. We investigate a simple way to reduce cost by reducing the number of text segments that must be annotated in order to accurately predict a score for a complete test set. Using a sampling approach, we demonstrate that information from document membership and automatic metrics can help improve estimates compared to a pure random sampling baseline. We achieve reductions of up to 20% in average absolute error by leveraging stratified sampling and control variates. Our techniques can improve estimates made from a fixed annotation budget, are easy to implement, and can be applied to any problem with structure similar to the one we study.


Introduction
As automatic natural language generation systems improve, evaluating them is becoming more challenging for both human and automatic methods (Çelikyilmaz et al., 2020; Gehrmann et al., 2022). In machine translation, this has led to increased adoption of techniques such as MQM (Freitag et al., 2021a,b), an elaborate error-based methodology for scoring output, typically carried out by skilled human annotators. While MQM is more accurate than traditional crowd-based Likert-type scoring, it can also be significantly slower and more expensive, creating a strong incentive to reduce annotation time and cost.
In this paper we investigate a simple solution to this problem, namely reducing the number of text segments that a human annotator must rate. We assume a basic scenario in which a single annotator is given a test set to rate, and the task is to predict the average MQM score they would assign to the whole set by having them rate only a selected subset. This is a natural and versatile way to deploy human annotation effort within a framework like MQM; it differs from the tasks considered by recent work with similar motivation, which focuses on system ranking (Mendonça et al., 2021; Thorleiksdóttir et al., 2021) or on combining human and metric scores without the express aim of predicting human performance (Hashimoto et al., 2019; Singla et al., 2021). Although our experiments are carried out with MQM-based scores, our methodology is applicable to any setting in which numerical scores are assigned to items for later averaging.
We approach the task of choosing segments as a sampling problem, and investigate classical methods for reducing sample variance and bounding estimation error. To improve accuracy, we employ two sources of supplementary information. First, in keeping with recent practice, we assume segments are grouped into documents. This lets us exploit the tendency of segments within a document to be relatively homogeneous. Second, we make use of modern automatic metrics such as COMET (Rei et al., 2020) and BLEURT (Sellam et al., 2020), which correlate better at the segment level with human judgments than traditional surface-based metrics like BLEU (Papineni et al., 2002). These serve as a rough proxy for human scores.
We show that document and metric information can be used to reduce average estimation error by up to 20% over a pure random sampling baseline. Due to high sample variance, it is difficult to reliably achieve a similar reduction in annotator effort for a given error tolerance. However, we suggest an alternative perspective in which our technique can be used to improve estimates made on the basis of a fixed rating budget. Although there is no guarantee of beating random sampling in any particular case, there is a high probability of improving on average. This improved estimator is easy to implement and applicable to any human labeling task that produces numerical scores, and for which document membership and automatic metrics are available.
Our work is most similar to that of Chaganty et al. (2018), which we extend in several ways. We adopt their use of control variates, but consider multiple metrics rather than just one, including learned metric combinations; we also employ modern neural metrics rather than metrics based on surface information. We combine control variates with stratified sampling using either proportional or optimal allocation, and additionally evaluate an incremental scenario in which sampling adapts to observed ratings. Finally, we investigate two analytical methods for bounding the error in our estimate.

Methods
We assume a fixed test set consisting of translated segment pairs, and a single human rater who assigns scores to segments. Each segment belongs to a document, and has an associated vector of scores from automatic metrics. Our goal is to select an informative subset of segments to be labeled by the rater, and use the subset to predict the average score that would have resulted if we had asked the rater to label the whole set. By exploiting document and metric information, we hope to reduce the number of segments that must be manually labeled.
Formally, let $x_1, \ldots, x_N$ be the segment scores, $\mu = \sum_{i=1}^N x_i / N$ the test-set score to be predicted, and $\sigma^2$ the variance of the scores. The following side information is available for each segment $i$: an index $d_i$ that indicates its membership in one of $D$ documents, and a vector of automatic-metric scores $y_i \in \mathbb{R}^M$. Unlike the segment scores, which are only revealed if they are in the selected subset, the side information is always available for the whole test set.
We approach this task as the problem of sampling $n \le N$ scores $X_1, \ldots, X_n$ without replacement from the test set and deriving an estimate $\hat\mu$ for $\mu$ from the sample such that $E(\hat\mu) = \mu$ (that is, $\hat\mu$ is unbiased) and $\mathrm{Var}(\hat\mu)$ is as small as possible. Low-variance estimators make it more likely that the estimation error $|\mu - \hat\mu|$ will be small. A baseline is to draw $n$ segments at random and compute their mean. This gives an estimate that is unbiased, with variance:
$$\mathrm{Var}(\hat\mu) = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}$$
We investigated two classical unbiased strategies for reducing variance relative to this baseline: stratified sampling and control variates (Rice, 2007; Bratley et al., 2012).
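As a concrete reference point, the random-sampling baseline and its without-replacement variance can be sketched as follows. This is a minimal NumPy illustration of our own (the function names and the skewed synthetic scores, standing in for MQM ratings, are not from the paper):

```python
import numpy as np

def srs_estimate(scores, n, rng):
    """Baseline: draw n segment scores without replacement, return their mean."""
    sample = rng.choice(scores, size=n, replace=False)
    return sample.mean()

def srs_variance(scores, n):
    """Variance of the sample mean under sampling without replacement,
    including the finite-population correction (N - n) / (N - 1)."""
    N = len(scores)
    sigma2 = scores.var(ddof=0)
    return sigma2 / n * (N - n) / (N - 1)

rng = np.random.default_rng(0)
# skewed synthetic scores: most segments near-perfect, a few heavily penalized
scores = rng.gamma(shape=1.0, scale=3.0, size=500)
est = srs_estimate(scores, n=100, rng=rng)
```

Note that the variance vanishes when the whole test set is sampled ($n = N$), as it must for an exhaustive census.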

Stratified sampling
Stratified sampling involves partitioning scores into bins that group similar items, then sampling some items from each bin. Intuitively, the idea is that if the variance within each bin is low, drawing too many samples from a particular bin is inefficient because it only serves to improve an already good estimate; therefore the sample should be spread evenly (in some sense) across bins. See Figure 1a for an illustration. As a side benefit, having human scores more evenly distributed across different types of segments is a useful characteristic if the labeled segments are to be the subject of further analysis.
Formally, suppose the test set is divided into $L$ bins, where bin $l$ contains $N_l$ segments of which $n_l$ have been sampled, with sample mean $\hat\mu_l$. Then the stratified estimate is:
$$\hat\mu = \sum_{l=1}^L \frac{N_l}{N} \hat\mu_l \quad (1)$$
It is easy to verify that this is unbiased. Stratified sampling requires a method for partitioning the test set into bins and a way of allocating the $n$ segments in the sample to individual bins. We investigated two methods for partitioning the test set: by documents and by metric-score similarity. The optimal (lowest-variance) allocation assigns segments in proportion to a bin's size and variance:
$$n_l = n \, \frac{N_l \sigma_l}{\sum_{k=1}^L N_k \sigma_k} \quad (2)$$
Since the bin variances $\sigma_l$ are unknown, a conservative strategy is to assume they are all equal, resulting in pure proportional allocation: $n_l = n N_l / N$. A potential enhancement is to approximate optimal allocation using estimated variances $\hat\sigma_l \approx \sigma_l$ derived from the metric scores in each bin.
Two technical issues arise in stratified sampling. First, the per-bin sizes specified by equation (2) are not necessarily whole numbers. This can be solved using a rounding scheme that minimizes $\sum_{l=1}^L |n_l - \tilde n_l|$, where the $\tilde n_l$ are whole numbers that sum to $n$. A second problem is that $n_l$ can be greater than the number of available segments $N_l$ when using optimal allocation in high-variance bins. When this occurs, we choose the bin for which $n_l - N_l$ is largest, set $n_l = N_l$, then recursively reallocate the remaining bins. Note that both these strategies can result in bins for which $n_l = 0$ when $n$ is small.
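Proportional allocation with largest-remainder rounding, and the resulting stratified estimate over document bins, can be sketched as follows (an illustrative implementation of our own, not the authors' code; the rounding scheme here distributes leftover samples to the bins with the largest fractional parts):

```python
import numpy as np

def proportional_allocation(bin_sizes, n):
    """Allocate n samples across bins proportionally to bin size,
    rounding so the integer allocations still sum to n."""
    bin_sizes = np.asarray(bin_sizes)
    exact = n * bin_sizes / bin_sizes.sum()
    alloc = np.floor(exact).astype(int)
    # give the remaining samples to bins with the largest fractional parts
    remainder = exact - alloc
    for i in np.argsort(-remainder)[: n - alloc.sum()]:
        alloc[i] += 1
    return alloc

def stratified_estimate(scores, doc_ids, n, rng):
    """Stratified estimate of the test-set mean using documents as bins
    and proportional allocation (equation (1) in the text)."""
    docs, sizes = np.unique(doc_ids, return_counts=True)
    alloc = proportional_allocation(sizes, n)
    N = len(scores)
    est = 0.0
    for d, N_l, n_l in zip(docs, sizes, alloc):
        if n_l == 0:
            # for small n some bins may receive no samples, as noted above;
            # this sketch simply omits such bins from the weighted sum
            continue
        idx = np.flatnonzero(doc_ids == d)
        sample = rng.choice(scores[idx], size=n_l, replace=False)
        est += (N_l / N) * sample.mean()
    return est
```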

Incremental sampling
Hitherto we have assumed that sampling works by choosing a fixed batch of $n$ segments, then sending them to a rater for scoring. It is also possible to consider an interactive scenario where the rater labels segments sequentially, and the sampling procedure is refined after each new rating is received. A convenient way to incorporate known ratings is to use them to improve the per-bin variance estimates $\hat\sigma_l$ in optimal allocation. We tested two ways of accomplishing this: empirically estimating $\hat\sigma_l$ from the known ratings in each bin; and learning a general mapping from metrics $y$ to rating $x$ over all known ratings, then using this mapping to estimate the unknown ratings in each bin and deriving $\hat\sigma_l$ from those estimates.
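The first incremental variant (empirical per-bin variance estimates) might be sketched as below. This is our own illustration under the assumption that unrated segments are marked with NaN; as a fallback for bins with fewer than two ratings, it uses the spread of metric proxy scores, which the text suggests as a variance estimate when human ratings are unavailable:

```python
import numpy as np

def estimate_bin_sigmas(observed, metric_proxy, bin_ids):
    """Per-bin std-dev estimates for optimal allocation: use observed human
    ratings where a bin has at least two, otherwise fall back to the
    metric-based proxy scores. `observed` holds NaN for unrated segments."""
    sigmas = []
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        rated = observed[mask]
        rated = rated[~np.isnan(rated)]
        if len(rated) >= 2:
            sigmas.append(rated.std(ddof=1))
        else:
            sigmas.append(metric_proxy[mask].std(ddof=0))
    return np.array(sigmas)
```

These estimates would then be plugged into the optimal-allocation formula in place of the unknown $\sigma_l$ before each new batch of ratings is requested.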

Control variates
The control-variates estimator makes use of an auxiliary random variable $Z$ that is standardized (has zero mean and unit variance) on the test set:
$$\hat\mu_{cv} = \bar X_n - \mathrm{Cov}(X, Z) \, \bar Z_n \quad (3)$$
where $\bar X_n$ and $\bar Z_n$ are mean values over the sample, and the covariance is over the whole test set. This is the lowest-variance estimator that uses information from $Z$. It is unbiased because $\bar X_n$ is unbiased, $\mathrm{Cov}(X, Z)$ is independent of the current sample, and $E(\bar Z_n) = 0$. The control-variates estimator can be thought of as using $\bar Z_n$ to infer the direction in which $\bar X_n$ has been shifted away from $\mu$, and reversing this shift by an amount that depends on the degree of correlation between $X$ and $Z$; see Figure 1b for an illustration. In general, $\mathrm{Cov}(X, Z)$ is unknown, but it can be estimated from the sample:
$$\widehat{\mathrm{Cov}}(X, Z) = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)(Z_i - \bar Z_n)$$
The control-variates estimator can be extended to handle multiple auxiliary variables by forming a linear combination (Glynn and Szechtman, 2002):
$$\hat\mu_{cv} = \bar X_n - E(XZ)^\top \, E(ZZ^\top)^{-1} \, \bar Z_n \quad (4)$$
where $Z$ is a vector of standardized variables, $\bar Z_n$ is its mean over the sample, and the expectations of the covariance matrix $ZZ^\top$ and weighted vector $XZ$ are taken over the test set. The latter are unknown, but as in the scalar case they can be estimated from the sample. In our setting, control variates are easily derived by standardizing the metric scores $y_i$, which are available for all segments in the test set. The resulting estimator is convenient because it is applied after sampling is complete, making it independent of the sampling method, including whether the sample is drawn incrementally or in batch mode.
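The scalar estimator (3) and its multi-variate extension (4) can be sketched as follows (our own illustration; it assumes the auxiliary variables have already been standardized on the full test set, and estimates the required expectations from the sample as described above):

```python
import numpy as np

def cv_estimate(sample_x, sample_z):
    """Scalar control-variates estimate: shift the sample mean of the
    human scores by the sample mean of the test-set-standardized metric,
    scaled by the covariance estimated from the sample."""
    cov = np.cov(sample_x, sample_z, ddof=1)[0, 1]
    return sample_x.mean() - cov * sample_z.mean()

def cv_multi_estimate(sample_x, sample_Z):
    """Multiple control variates: sample_Z has one column per metric,
    each standardized to zero mean / unit variance on the full test set."""
    n = len(sample_x)
    # weights solve E[Z Z^T] w = E[X Z], with both moments estimated
    # from the sample (E[Z] = 0 on the test set by construction)
    A = sample_Z.T @ sample_Z / n
    b = sample_Z.T @ sample_x / n
    w = np.linalg.solve(A, b)
    return sample_x.mean() - w @ sample_Z.mean(axis=0)
```

A sanity check: when the sample is the whole test set, the standardized variate has exactly zero mean, so both estimators reduce to the plain mean.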

Error Bounds
For practical applications it is desirable to upper-bound the error $|\mu - \hat\mu|$ in the estimated score with some degree of confidence. Given a confidence level $\gamma$ (e.g., 0.95), we would like to find an error bound $t$ such that:
$$P(|\mu - \hat\mu| \le t) \ge \gamma \quad (5)$$
A classical bound can be derived from Hoeffding's inequality, which states that equation (5) holds if:
$$t = R \sqrt{\frac{k_n \ln(2/\delta)}{2n}}$$
where $R$ is the difference between the largest and smallest scores in the test set, $\delta = 1 - \gamma$, and $k_n = 1 - (n-1)/N$ is an adjustment for sampling without replacement (Serfling, 1974). A problem with Hoeffding's inequality is that it scales with the range of the scores and does not take variance into account, so its bound will be pessimistic if the variance is small relative to the extremes. In such cases, the empirical Bernstein bound (Mnih et al., 2008) will be tighter:
$$t = \hat\sigma \sqrt{\frac{2 \ln(3/\delta)}{n}} + \frac{3 R \ln(3/\delta)}{n}$$
where $\hat\sigma$ is a sample estimate of the standard deviation. Note that the contribution of $R$ diminishes as $1/n$ in this formula, compared with $1/\sqrt{n}$ in the Hoeffding bound. Both these bounds are general in the sense that they make no assumptions about the score distribution.
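The two bounds are simple enough to compute directly. The sketch below follows the formulas above; note that the exact constants of the empirical Bernstein bound vary across published variants, so this is one common form rather than a definitive statement of the authors' implementation:

```python
import numpy as np

def hoeffding_bound(R, n, N, delta):
    """Hoeffding/Serfling error bound for sampling without replacement:
    scales with the score range R, shrinking as 1/sqrt(n)."""
    k_n = 1.0 - (n - 1) / N
    return R * np.sqrt(k_n * np.log(2.0 / delta) / (2.0 * n))

def bernstein_bound(R, n, sigma_hat, delta):
    """Empirical Bernstein bound: variance-sensitive, with the range
    term decaying as 1/n rather than 1/sqrt(n)."""
    return sigma_hat * np.sqrt(2.0 * np.log(3.0 / delta) / n) \
        + 3.0 * R * np.log(3.0 / delta) / n
```

For example, with a 0-25 MQM score range, a low sample standard deviation, and a few hundred samples, the Bernstein bound falls below the Hoeffding bound, illustrating the crossover discussed above.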

Data
Our development data consists of MQM ratings made available by Freitag et al. (2021a) for 10 English-German and 10 Chinese-English "systems" (including human translations and MT) from the WMT 2020 news test sets (Barrault et al., 2020). Each segment was annotated by three expert raters who assigned scores ranging from 0 (perfect) to 25 (worst). There were six annotators per language pair, each of whom rated all system outputs for a set of documents comprising approximately half the complete test set (about 710 segments per rater for German, and 1000 segments per rater for Chinese).
We created simulations for each rater and system combination, excluding the Human-A "system", as it was the reference for the MT metrics we used as features. This resulted in 54 simulations for each language pair. For each simulation, the task is to predict the average score over the complete subset of segments annotated by a single rater for a single system. No knowledge of other segments, system outputs, or rater decisions is permitted to leak across simulations. As features, we used the 10 metrics submitted to the WMT 2020 metrics task (Mathur et al., 2020) that had the highest average segment-level Pearson correlation with the MQM scores in our dev data. These correlations are generally poor: from 0.279 to 0.410 for English-German, and from 0.425 to 0.465 for Chinese-English.
To eliminate the effects of hyper-parameter tuning on the development data, we carried out additional evaluation on a test set consisting of newstest data from the WMT 2021 metrics shared task (Freitag et al., 2021b) for English-German (17 systems), Chinese-English (15 systems), and English-Russian (16 systems). This is similar to the dev data, except that only one MQM rating is available per segment. The number of rated segments was 527 for German and Russian, and 650 for Chinese. English-Russian ratings were annotated using a different MQM methodology (from Unbabel rather than Google), resulting in scores on a 0-100 scale, with 100 being best. As before, we created separate simulations for each system, omitting the human "system" used as a reference for the metrics. To avoid bias, rather than selecting metrics according to correlation, we chose the WMT 2021 primary submissions of two top-performing metrics from the dev data: BLEURT and COMET. Appendix A contains further details about scores and rater assignments for the dev and test sets.

Results
We tested the sampling and estimation strategies described in section 2 by comparing them to the baseline of simple random sampling with a mean estimator. For each simulation we considered sample sizes ranging from 5% to 50% of the available data, at 5% intervals. For each sample size and technique for establishing $\hat\mu$, we drew 100 random samples, computed the average and standard deviation of the error $|\mu - \hat\mu|$ across the samples, then averaged the results across simulations to summarize performance at that sample size. We also measured the number of "wins": simulations in which a technique had a lower average error than the baseline. Finally, we aggregated these results across sample sizes to summarize performance in a single number.
We begin by evaluating the stratified sampling methods described in section 2.1, comparing stratification over documents and over bins defined by metric scores. The latter were formed by scoring each segment with an average of the standardized metric scores assigned to it, then sorting and partitioning so that each bin contained approximately 80 segments (8x larger than the average document). More elaborate clustering and metric-selection techniques did not improve over this method. Performance was also quite flat as a function of bin size, though it worsened as bin size approached average document size. We tested both stratification methods with proportional and optimal allocation, using averaged metric scores as proxies for human scores when estimating the variance in each bin.
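The evaluation protocol above (repeated draws, averaged absolute error) can be sketched generically; this harness is our own illustration, parameterized by any sampling/estimation strategy, and the synthetic scores merely stand in for a simulation's ratings:

```python
import numpy as np

def evaluate_estimator(scores, estimator, n, draws=100, seed=0):
    """Average and std deviation of the absolute error |mu - mu_hat|
    over repeated random draws, mirroring the simulation protocol."""
    mu = scores.mean()
    rng = np.random.default_rng(seed)
    errs = [abs(mu - estimator(scores, n, rng)) for _ in range(draws)]
    return np.mean(errs), np.std(errs)

def random_mean(scores, n, rng):
    """Baseline strategy: mean of n segments drawn without replacement."""
    return rng.choice(scores, size=n, replace=False).mean()
```

Competing strategies (stratified sampling, control variates) plug in as alternative `estimator` callables, and "wins" are counted by comparing their average errors against the baseline's per simulation.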

Stratified sampling
Figure 2 shows absolute error for these methods as a function of sample size, and Table 1 summarizes aggregate performance across sizes. The general pattern is similar for both language pairs: proportional allocation with documents (docs-prop) outperforms the random-sampling baseline; proportional allocation with metrics (metrics-prop) behaves similarly; and optimal allocation with document bins (docs-opt) underperforms, as does optimal allocation with metric bins (not shown, as it is much worse). Optimal allocation focuses sharply on bins with high estimated variance, which is harmful if the estimates are wrong, so we experimented with various smoothing methods, but none improved over pure proportional allocation.
Although stratification clearly reduces the error on average, the usefulness of this result is tempered by the large variances shown in Table 1. For any given random draw, these imply that the stratified estimate is only slightly more likely to be better than the baseline. Even when comparing errors averaged over 100 random draws per simulation, the stratified estimates are only better than the baseline for approximately 75% of simulations for English-German, and 90% for Chinese-English.
Table 2 shows aggregate results for incremental stratified sampling using documents as bins, with two methods for estimating per-bin variances for optimal allocation. The docs-incr-metrics method involves learning a k-nearest-neighbor (k=25) model with standardized metrics as features on all labeled segments, then using its predictions to estimate variances for the unlabeled segments in each bin. In docs-incr-human, the variance of the segments remaining in each bin is estimated from the segments that have already been scored. Both these methods underperform the baseline; in particular, the use of a learned mapping in docs-incr-metrics provides only modest gains over the raw averages in docs-opt.
We now turn to experiments with the control-variate estimators described in section 2.2. Figure 3 and Table 3 present the results. We derived standardized scalar variates to plug into equation (3) from: a single high-performing metric (BLEURT-extended, cv-bleurt); the mean of all metrics (cv-mean); and predictions from a knn model learned from all metric values on the labeled segments (cv-knn). We also used all standardized metrics directly (cv-multi) as input to the vector in equation (4).
All tested variants give reasonable improvements over the baseline, with quite similar error rates, especially for English-German. For Chinese-English, combining all metrics with the knn model improves slightly over BLEURT-extended, reducing the absolute error by 5%. This may reflect the somewhat higher metric correlations for this language pair.

Control variates and combined results
As control-variate estimation is applied after sampling is complete, it is straightforward to combine it with stratification. Figure 4 and Table 4 show the results of combining proportional stratified sampling using documents with the best control-variates estimator (docs-prop+cv-knn), along with the component techniques for comparison. As one might hope, the techniques are complementary despite their similar individual performance. Interestingly, this is not the case when metric-based clusters are used for stratification instead of documents (metrics-prop+cv-knn, last line in Table 4), because the same information is used for both variance-reduction techniques.
The docs-prop+cv-knn combination produces our best results, with error reductions of 14% and 23% over the baseline for English-German and Chinese-English, and better average performance in almost 90% and 100% of simulations, respectively. Unfortunately, however, the standard deviation of these estimates remains uncomfortably close to the size of the average absolute error.
Despite large variance across individual samples, sampling techniques can be useful in practice if it is possible to reliably bound the error in the estimate derived from a given sample. We computed the bounds from section 2.3 for different sample sizes with docs-prop+cv-knn, setting $\gamma = 0.95$. Both the Hoeffding and Bernstein bounds are very loose, overestimating the true error in 100% of samples, by margins that are about an order of magnitude greater than the average error in Figure 4. We hypothesize that this is due to scores having a large range $R$ and being highly skewed, with $\mu \ll R$.
To test this, we recomputed the Hoeffding bound with empirically-determined $R$ values of 4 and 7 for English-German and Chinese-English. As shown in Table 5, this gives results which are well calibrated (cal > 95%) for docs-prop+cv-knn, with reasonable error bounds. Performance is somewhat worse for the baseline estimates, although the difference in error between the two techniques is negligible compared to the predicted bound. This oracle experiment suggests that it will be difficult to find non-oracle bounds that are substantially lower for docs-prop+cv-knn than for the baseline.
Figure 5 and Table 6 show results comparing baseline random sampling with docs-prop+cv-knn on our evaluation set. Both the curves and the aggregate results display a similar pattern to the development results, with relatively large gains over the baseline for Chinese-English (21% relative error reduction, wins in 98% of simulations), and smaller ones for English-German and English-Russian (reductions of 7% and win rates of about 77%). As before, standard deviations are very high.

Discussion
How should we interpret these results? If we had a more reliable way of binning segments with similar human ratings, or metrics that correlated better at the segment level, it would be possible to reduce variance to levels that would permit realistic error bounds. That would enable a scenario in which we could determine the number of segments $n$ that need to be rated in order to estimate the complete test-set score to within a given tolerance. As it is, however, our error bounds are very large, and we do not manage to reduce them significantly with improved sampling and estimation methods. This is unlikely to change soon for complex annotation tasks like MT, because humans are noisy raters; as shown in Table 12, they are difficult to predict even when using other humans as oracles.
In the absence of more reliable signals for reducing variance, a way to make practical use of the techniques we study is to flip the scenario around and aim to improve the quality of an estimate made from a fixed budget of $n$ human ratings. It is common practice to obtain human annotations for only a portion of a larger test set due to time or cost constraints (Barrault et al., 2020; Freitag et al., 2021a). In this setting, our techniques can lead to improved estimates compared to just taking the mean of randomly-selected segments (although there is no guarantee that they will do so for any given sample).
The risks in applying this strategy are low. Stratified sampling with proportional allocation provides an unbiased estimate of the test-set mean, with variance no greater than that of random sampling (Rice, 2007), with equality only when the bins have identical statistics. The situation is trickier for control variates. In theory, the control-variate estimator is also unbiased, with lower variance than the sample mean, but this assumes that the test-set covariance $\mathrm{Cov}(X, Z)$ between scores $X$ and the auxiliary variable $Z$ is known. Since we only know the scores in the sample, we must rely on an estimate for $\mathrm{Cov}(X, Z)$, creating the possibility of errors if this estimate is significantly larger than the true covariance. However, as Chaganty et al. (2018) point out, the error in the sample estimate for $\mathrm{Cov}(X, Z)$ diminishes as $1/n$, much faster than the $1/\sqrt{n}$ rate for the error $|\mu - \hat\mu|$ in the estimated score. In our data, we found no appreciable degradation of performance on small samples, even ones containing as few as 30 items.
Based on these observations, we can make the following recommendations for improving the estimated mean score of a test set containing $N$ items, given a fixed number $n < N$ of items to be manually annotated:
1. Use prior information such as document membership to partition items into bins, then choose items using stratified sampling as described in equation (1), with proportional allocation. Beware of rounding errors when only a few samples are taken from each bin.
2. Use an automatic metric or other feature that correlates with human scores as a control variate in equation (3). This step is carried out after sampling is complete, and is independent of the sampling method used. If multiple metrics are available, combine them into a single variate by averaging, or by applying a smooth regressor learned on the sample (knn with k=25 worked well for us). Be alert to the possibility of errors in the covariance estimate when $n$ is small ($\le 30$).
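Putting the two recommendations together, an end-to-end estimate from a fixed budget might be sketched as below. This is our own compact illustration (not the authors' code): it uses a single averaged metric as the control variate rather than a learned knn combination, and applies the scalar correction to the plain mean of the stratified sample:

```python
import numpy as np

def budgeted_estimate(scores, doc_ids, metric, n, rng):
    """Proportional stratified sampling over documents, followed by a
    single control-variate correction using a standardized metric."""
    N = len(scores)
    # standardize the metric on the full test set (side info is always known)
    z = (metric - metric.mean()) / metric.std()
    # step 1: proportional allocation over documents, largest-remainder rounding
    docs, sizes = np.unique(doc_ids, return_counts=True)
    exact = n * sizes / N
    alloc = np.floor(exact).astype(int)
    for i in np.argsort(-(exact - alloc))[: n - alloc.sum()]:
        alloc[i] += 1
    # step 2: draw the stratified sample
    idx = np.concatenate([
        rng.choice(np.flatnonzero(doc_ids == d), size=k, replace=False)
        for d, k in zip(docs, alloc) if k > 0
    ])
    x, zs = scores[idx], z[idx]
    # step 3: control-variate correction of the sample mean
    cov = np.cov(x, zs, ddof=1)[0, 1]
    return x.mean() - cov * zs.mean()
```

In the degenerate case where segments within each document are identical and documents are equal-sized, the stratified mean already equals the test-set mean and the correction term vanishes, which makes for a convenient sanity check.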

Related Work

Chaganty et al. (2018) pioneered control variates for NLP evaluation, using them to improve estimates for summarization and question answering. Despite some technical differences (they measure variance ratios rather than absolute error, simulate human variance by sampling from a collection of raters, and use bootstrapped confidence intervals), their findings are roughly in line with ours. We extend their work by showing that gains from stratified sampling are complementary to those from control variates, and explore a broader range of scenarios, including multiple variates and incremental sampling.
Recent work has investigated incremental labeling tasks and/or combining human scores with automatic metrics. Mendonça et al. (2021) apply online learning algorithms to an MT system-ranking task in which different segments are selected for human evaluation on each iteration, using COMET to fill in missing human scores in WMT 2019 data. Their algorithm converges to correct results after several hundred iterations, but this condition is not detected automatically. Thorleiksdóttir et al. (2021) use Hoeffding's inequality to measure confidence in pairwise ranking decisions of varying difficulties for controlled text generation output; they consider human scores only. Singla et al. (2021) sample foreign-language test responses for human grading, with the aim of improving over purely automatic scoring, a reverse problem to ours. Hashimoto et al. (2019) propose a synergistic combination of human and automatic scoring for evaluating text generation.
Finally, there has been considerable work on measuring and rectifying inaccuracies in human annotation (Sun et al., 2020; Wei and Jia, 2021; Gladkoff et al., 2021; Paun et al., 2018). We sidestep this issue by aiming to predict the performance of a single human rater, assuming that if this can be done accurately, conflicts among raters can be resolved in a post-processing step.

Conclusion
We investigate two classical variance-reduction techniques for improving the accuracy of sampled human ratings of MT output, measured against the mean of all ratings for a given test set. We find that stratified sampling and control variates are complementary, contributing about equally to reductions of up to 20% in average absolute error compared to random sampling. Exploiting this result to dynamically reduce annotator effort given a target error tolerance is not feasible due to the high variance in our data, but we propose that our techniques could instead be used to improve estimates made from a fixed annotation budget. Concrete recommendations for this scenario are provided in section 5. Our method is easy to implement, and can be applied to any setting involving averaged numerical item-wise scores where document (or other prior grouping) and automatic metric side information is available.
In future work we look forward to delving into questions raised by our results: why doesn't optimal allocation work better, particularly in the incremental setting; is there a better way to estimate variance from metrics; why aren't metric combinations more helpful; and can error bounds be improved, perhaps with bootstrapping methods?
(a) Stratified sampling forces sampled segments (shown in red) to be evenly distributed across bins, resulting in better estimates when the score variance within bins is lower than the variance across bins. (b) Control variates allow for reversing the shift of the sample mean $\bar X_n$ depending on the strength of the correlation between $X$ and $Z$. In this illustration, where $X$ and $Z$ are highly correlated (~0.9), $\bar Z_n < 0$ reflects the negative shift in $\bar X_n$.

Figure 1 :
Figure 1: Complementary strategies for reducing the variance of the estimated average score.

Figure 2 :
Figure 2: Absolute error and standard deviation for stratified sampling methods.

Figure 3 :
Figure 3: Absolute error and std deviation for different control-variate estimators with random sampling.

Figure 4 :
Figure 4: Absolute error and std deviation for control-variate estimators and stratified sampling.

Figure 5 :
Figure 5: Absolute error for control-variate estimators and stratified sampling on eval data.

Table 5 :
Performance of error bounds for different sample sizes. Statistics are averaged over simulations: cal is the % of samples for which the true error was lower than the bound, slack is the difference between the bound and the error, and t is the bound. base is the baseline estimator, and best is docs-prop+cv-knn.