Data Sampling and (In)stability in Machine Translation Evaluation

We analyze the different data sampling approaches used in selecting data for human evaluation and ranking of machine translation systems at the highly influential Conference on Machine Translation (WMT). By using automatic evaluation metrics, we are able to focus on the impact of the data sampling procedure as separate from questions about human annotator consistency. We provide evidence that the latest data sampling approach used at WMT skews the annotated data toward shorter documents, which are not necessarily representative of the full test set. Lastly, we examine a new data sampling method that uses the available labour budget to sample data in a more representative manner, with the goals of improving representation of various document lengths in the sample and producing more stable rankings of system translation quality.


Introduction
Human evaluation of machine translation (MT) quality is very labour-intensive, meaning that it is not always possible to have full test sets annotated by human evaluators, for example in large-scale shared tasks like the News/General MT Task at the Conference on Machine Translation (WMT) (Kocmi et al., 2022). The typical practice in such situations is sampling a subset of the test set for human annotation to estimate the quality of MT systems; the rankings of MT systems computed from this sample are expected to be stable and representative of the full test set. However, in practice, inconsistency and instability in system rankings are observed in human evaluation and are often blamed on human annotator inconsistency. Thus, much of the focus in MT human evaluation is on denoising and calibrating human annotations, but there are other sources of error in the data sampling process orthogonal to annotator behavior. Throughout WMT's history, the design of data sampling methods has been tangled up with experiments on human assessment collection approaches. In the early years, when contrastive adequacy/fluency judgments and relative ranking were used (Koehn and Monz, 2006; Callison-Burch et al., 2007, 2008, 2009, 2010, 2011, 2012; Bojar et al., 2013, 2014, 2015), the nature of the assessment method (i.e., comparing system outputs directly) ensured that there were overlaps in the segments annotated for the various systems. However, when direct assessment (Graham et al., 2013, 2014, 2016) was introduced (Bojar et al., 2016), the annotation subsets were selected independently per system, which is expected to produce consistent rankings (assuming sufficient annotations). In practice, some years have seen language pairs with very low annotation coverage (as low as 12.5%), opening the door to scenarios where MT systems could be evaluated and ranked on disjoint sets of sentences, raising questions about fairness and consistency.
As WMT added document context to human evaluation (Barrault et al., 2019), segments for annotation were sampled at the document level (rather than at the segment level). This introduced another source of instability and error by limiting the representation of topics and vocabulary in annotated samples, introducing systematic bias in the samples (e.g., document length), and in some cases even preventing some documents from being sampled (Knowles, 2021; Knowles and Lo, 2022). Although Miller et al. (2020) show that system performance can be resilient to adaptive overfitting against frequently reused evaluation sets in other NLP tasks, they also show that NLP models' robustness to distribution shift remains a challenge. As MT research moves toward the document level, document length may be related to the difficulty of the translation task. The data sampled for gold-standard human evaluation should reflect the full test set to support fair system rankings and further analysis of distribution shift effects.
As discussed, sample size or coverage is a major factor in how representative the annotation subset is: with high coverage, other sources of error are less likely to cause problems; with low coverage, the errors may have compounding effects on ranking stability. However, sample size is an aspect that may be tightly constrained (e.g., due to funding).
In this paper, we study the data sampling methods and resulting instability of system rankings in the WMT News/General MT task over the past four years. To focus on data sampling, separate from questions about human annotator consistency, we use automatic MT evaluation metrics to generate these system rankings. We show that system ranking consistency and the representation of document lengths in the sample can be improved by a new data sampling method that uses the available labour budget and balances the desire for document context against the representativeness of the sample.

Data Sampling Methods
We divide our discussion of data sampling methods into two orthogonal components: 1) whether the subset of data annotated for each system is sampled independently per system or once per test set, and 2) how the sampling is performed.

Matching Subsets
One option for annotation is to sample a subset of data for annotation once from the test set, and then annotate each system's output over this fixed subset; we call this the matching subset condition. The alternative, used at WMT until recently, is to randomly sample data from each system's output independently. In extreme cases this can mean that there are no segments that have been annotated for all systems (Knowles and Lo, 2022). Mismatching annotation subsets appear to be more problematic when full documents are sampled than when data is sampled at the segment level (Knowles, 2021).
We argue in favour of using matching subsets, both because of the risk with mismatched subsets of introducing error into the rankings (i.e., by scoring one system on an easier subset of data) and because matching subsets offer opportunities to use statistical tests that rely on paired samples. In this paper, the simulation experiments of our proposed sampling approach are performed with matching subsets of data.

Sampling Approach
Orthogonal to this question of matching or mismatched subsets for annotation is the question of how to sample the data. We describe three approaches that have been used at WMT, as well as the approach we examine in this work, considering their advantages and disadvantages. All suffer when there is low coverage. We briefly discuss some user interface considerations, but mostly leave them aside.

Segment-level (SL)
Sampling at the segment level (i.e., randomly selecting segments without regard to document boundaries) has the advantage of better test set representation, especially with high coverage. The main disadvantage is the lack of document context, which is considered important for distinguishing high-quality machine translation from human translation (Läubli et al., 2018; Toral et al., 2018).

Whole document (WD)
This approach involves sampling whole documents; particularly at low coverage, this may not be representative. On the other hand, it has the advantage of full document context, so there are no additional requirements on the annotation interface to incorporate context beyond the sample. In the sampling used at WMT, there has been a limit on document size; on occasion this has meant that large portions of the test set could not be sampled.

Document Fixed Snippet (DF)
The WMT 2022 General Task (Kocmi et al., 2022) attempted a middle ground between segment-level and whole document sampling, sampling snippets of up to 10 contiguous segments (shorter snippets were only drawn when the whole document was shorter than 10 sentences). The aim of this was to cover a broader range of documents while still maintaining document context. Additionally, in the user interface, annotators were shown preceding context of up to 10 snippets.
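As a rough sketch of the DF procedure (our own reconstruction, not the actual WMT 2022 sampling code; all names are ours, and documents are represented simply as lists of segments):

```python
import random

def sample_fixed_snippets(documents, n_documents, snippet_len=10, seed=13):
    """Draw one snippet of up to `snippet_len` contiguous segments from
    each of `n_documents` randomly chosen documents.  Documents shorter
    than `snippet_len` are taken whole."""
    rng = random.Random(seed)
    snippets = []
    for doc in rng.sample(documents, min(n_documents, len(documents))):
        if len(doc) <= snippet_len:
            snippets.append(doc)  # whole (short) document
        else:
            start = rng.randrange(len(doc) - snippet_len + 1)
            snippets.append(doc[start:start + snippet_len])
    return snippets
```

Because every sampled document contributes at most `snippet_len` segments regardless of its length, long documents contribute a smaller fraction of their content than short ones, which is the source of the length bias discussed below.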

Document Budgeted Snippet (DB)
In this proposed approach, a fixed budget is set in advance, and then snippets are sampled from documents in proportion to the budget. That is, if there is budget to annotate 40% of the data (not including quality assurance or inter-annotator agreement annotations, which we do not discuss in this work), snippets corresponding to 40% of each document's length will be sampled. Where the document is too small or does not divide evenly, the document will be sampled (or the length of the snippet rounded up or down) to produce the correct number of snippets and snippet lengths in expectation. This approach operates under the assumption that we wish to cover a wide range of documents, perform annotations with context, and produce a representative sample. In effect, this should produce document coverage percentages roughly equivalent to segment-level sampling, but with contiguous rather than discontiguous segments. For test sets with extremely long documents, this could be problematic for some annotation user interfaces.
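The DB procedure above can be sketched as follows. This is an illustrative simplification (one snippet per document, with probabilistic rounding of the snippet length so the expected coverage matches the budget); all function and variable names are our own:

```python
import random

def sample_budgeted_snippets(documents, budget=0.4, seed=13):
    """From each document (a list of segments), draw one contiguous
    snippet covering roughly `budget` of its segments.  The snippet
    length is rounded up or down at random so that the expected
    number of sampled segments equals budget * len(doc)."""
    rng = random.Random(seed)
    snippets = []
    for doc in documents:
        exact = budget * len(doc)
        length = int(exact)
        if rng.random() < exact - length:  # probabilistic rounding
            length += 1
        length = max(1, min(length, len(doc)))
        start = rng.randrange(len(doc) - length + 1)
        snippets.append(doc[start:start + length])
    return snippets
```

Unlike the fixed-snippet approach, the snippet drawn here grows with the document, so long documents contribute the same fraction of their segments as short ones, in expectation.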

Simulation
We show the effects of different sampling strategies by scoring segments in the sampled subset and the full test set with automatic metrics and comparing system rankings between the two.

Data and setup
We use data collected at the WMT News/General shared tasks from 2019 to 2022 (Barrault et al., 2019, 2020; Akhbardeh et al., 2021; Kocmi et al., 2022) and organized in the MT Metrics Eval package. The MT Metrics Eval package includes all scores from baseline and participating MT evaluation metrics in the Metrics shared task (Ma et al., 2019; Mathur et al., 2020; Freitag et al., 2021, 2022), covering all segments of all MT systems in the WMT News/General shared tasks. It also contains complete information about which segments of each MT system were annotated, allowing us to approximate the coverage budget (without access to the actual sampling code, which was not available at the time of submission). Each sampling method (using both the exact data annotated at WMT and simulations) is compared against our proposed document budgeted snippet (DB) sampling method in simulations. All simulations are run 13 times with 13 as the random seed, and we report the worst, best, and median stability.
Automatic metrics are used for simulation because we can easily obtain and compare the system rankings between the full test set and multiple runs of different sampling approaches at minimal cost, testing the effects of sampling separately from annotator consistency. We compute rankings by averaging the segment-level metric scores over the sampled subset for each system. We focus on the four automatic metrics that participated in all or most of the WMT19-22 Metrics shared tasks: chrF (Popović, 2015), COMET-20 (Rei et al., 2020), sentBLEU (Chen and Cherry, 2014), and YiSi-1 (Lo, 2019). sentBLEU is the sentence-level BLEU (Papineni et al., 2002), which is based on the precision of n-grams between the MT output and the reference, weighted by a brevity penalty. chrF uses character n-grams, instead of word n-grams, and considers both precision and recall between the MT output and the reference. YiSi-1 measures the semantic similarity between a machine translation and human references by aggregating the IDF-weighted lexical semantic similarities based on the contextual embeddings extracted from pre-trained language models. COMET-20 is the 2020 version of COMET, a learnt metric fine-tuned to produce a z-standardized direct assessment (DA) score for a given translation by comparing its representation to source and reference embeddings. Though the correlations of these four metrics with human judgment on translation quality vary, this does not affect the validity of the simulation because we compare subset and full test set rankings from the same metric.

Evaluation metric
If a sampled subset were perfectly representative of the full test set, the ranking of systems computed by averaging over the segment-level scores in the subset would be identical to that obtained by averaging over the full test set. Exact scores might change, but the relative ranking of systems would be the same. We follow Knowles (2021) in using the number of language pairs where the system ranking changed to analyze the instability of human evaluation. We choose ranking change rather than cluster change because of our use of Metrics data, which ignores clusterings and focuses only on ranking. Note that we are not able to use Spearman's rank correlation to present the distortions in rankings because the number of systems for each language pair ranged from 9 to 22, which does not meet the minimum number of samples needed for Spearman's rank correlation analysis at significance level 0.05 (Bonett and Wright, 2000; May and Looney, 2020).
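A minimal sketch of this stability check (names and data layout are ours; each scores dictionary maps a system to its list of segment-level metric scores):

```python
def rankings_differ(full_scores, subset_scores):
    """Rank systems by their mean segment-level metric score on the
    full test set and on the sampled subset, and report whether the
    two orderings disagree."""
    def ranking(scores):
        means = {system: sum(vals) / len(vals)
                 for system, vals in scores.items()}
        return sorted(means, key=means.get, reverse=True)
    return ranking(full_scores) != ranking(subset_scores)
```

The statistic reported in our tables is then the number (or fraction) of language pairs for which this check reports a change.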

Results and Discussion
Tables 1, 2 and 3 show the number of language pairs where the ranking changed (for at least one pair of systems) between the sampled subset and the full test set; lower is better. We compare the WMT segment-level (SL), whole document (WD) and document fixed snippet (DF) sampling methods respectively against the best, median, and worst runs of document budgeted snippet (DB) sampling, and note whether the WMT methods used matching or non-matching subsets.
Comparing SL against DB in Table 1, we see that the number of inconsistent system rankings for the median run of DB is similar to that of SL. Thus, the DB approach performs similarly to SL, while having the advantage of document context. Läubli et al. (2018) and Toral et al. (2018) show that document context is necessary for annotation quality and consistency. As MT moves towards document-level translation from sentence-level translation, it is essential for human evaluation to also have the capability to evaluate with document context to support future research on MT. For this reason, even though our proposed DB sampling method only provides similar stability to the SL sampling method in Table 1, we argue that the advantage of having document context makes DB more suitable for human evaluation of MT (Castilho (2021) makes a similar argument about tradeoffs). In Table 2, we see clearly that DB produces more consistent system rankings than WD, with the worst run of DB still having fewer inconsistent system rankings than WD. Comparing DF against DB in Table 3, we see that the number of inconsistent system rankings for the median run of DB is similar to that of DF. However, Figure 1 and Table 4 show that DF consistently oversampled segments from short documents and undersampled segments from long documents, relative to the proportion of the test set that they make up; DB is designed to better match the full test set distribution. It is worth additional examination to determine the consequences of a potential tradeoff between better representation of document length distributions and topic/vocabulary representation.
Beyond noting the concern about the sampled data not being representative of the full test set, our simulation method cannot demonstrate all forms of system ranking instability caused by bias in document length, because current automatic MT evaluation metrics do not consider document-level quality. As MT research moves toward the document level, the document length distribution in evaluation data will be increasingly important. Sampling bias in evaluation towards shorter documents may result in system rankings that cannot accurately reflect system performance in translating long documents.
Let us recall that the reason we do sampling at all is that we want to understand how systems perform on the full test set, but do not have the budget to collect enough annotations. Coverage is a key factor in system ranking consistency, and under tight budgetary constraints it may not always be possible to mitigate instability simply by modifying the sampling method. If coverage is too low, it may be worth considering non-random alternatives, such as determining whether there are portions of the test set that are of greater importance to the MT use case, and selecting for those. But assuming the sampling case, we emphasize the importance of minimizing inconsistency and sources of error in as many parts of the evaluation setup as possible, to prevent compounding effects on the final rankings.
We made retrospective comparisons, but there is no reason we cannot sample and compare rankings against the full test set using automatic metrics before performing human annotation. This could enable us to select a sample that has the smallest level of inconsistency with the full test set, rather than hoping for median performance. The risk is that this biases the subset due to the choice of metric (or that metrics perform poorly on the language pair); in future work we plan to examine whether such metric-guided sampling reduces inconsistency with human annotation.

Conclusions
We examine three different approaches that WMT has used for sampling segments from test sets for human judgment, performing simulations using automatic metrics in place of human annotations. This allows us to examine a large range of scenarios at low cost, with the risk that it may not be fully representative of human judgment distributions. We demonstrate in simulation that a document budgeted snippet sampling approach finds a balance between providing document context, representation (i.e., better representing the distribution of document lengths), and ranking stability. Additionally, we use this analysis to highlight problems and challenges in comparing past human annotation approaches. In particular, large and small variations in annotation procedures are often conflated and collapsed into overly simplified descriptions that obscure the ways in which they differ from one another; we attempt to untangle some of these. We urge researchers to take care in examining, in isolation and in combination, the effects that various design decisions have on results, in order to build annotation approaches that remove as many sources of error as possible.

Limitations
The main limitation of this work is our use of automatic metrics rather than human evaluation. First, the score distribution produced by a metric is not guaranteed to be similar to one produced by human annotators, which could influence results. Secondly, the metrics we examined do not incorporate context. Motivated by evidence that document-level (or contextual) information is becoming necessary to distinguish between human translations and high-quality machine translation (Läubli et al., 2018; Toral et al., 2018), recent WMT evaluations have incorporated context. Since human annotations are influenced by the context in which segments appear and the automatic metrics are not (i.e., given an identical segment in two different contexts, an automatic metric will score them identically while a human annotator may not), additional study may be necessary to answer questions such as whether additional preceding source context should be displayed to annotators (as suggested in Castilho et al. (2020)), to determine how much additional time reading this context would take (which may influence the annotation budget), or to determine whether human annotator behavior may differ based on where in a document the snippet comes from. We also do not directly address issues such as the best interfaces for human annotation, a problem that is mostly orthogonal to the question of what data should be annotated.
In this work, we also follow the approach of the WMT Metrics shared task in treating the scores assigned to systems (in our case by automatic metrics rather than human annotators) as full rankings of systems, rather than as clusters of systems. In practice, this may mean that statistically insignificant differences between systems are considered on par with statistically significant ones when we examine reorderings that occur under different sampling procedures. While this is a major concern in human annotation (where there is also an effort to handle annotator variation, a separate source of instability), it is less of a concern in this setting, where the annotation is guaranteed to be consistent.
One additional limitation to our proposed future work of using metrics as a pre-sampling approach is that metrics may not perform equally well across all languages. See Appendix A for the list of language pairs on which these experiments were performed.

Ethics/Impact Statement
This work, while it uses automatic metrics rather than human judgments to demonstrate the theory, is focused on sampling methods for human evaluation of machine translation. Future work should examine whether human evaluation and distributions of human annotations follow the same patterns we observed across automatic metrics in this work. A risk we have observed in failing to do adequate theoretical analysis of annotation setups is that the blame for inconsistency is sometimes shifted to the human annotators themselves, when in fact there may be more that those setting up the annotation schema ought to do to account for various other sources of inconsistency introduced into the process. Thus, we do think it is important and valuable to do additional (controlled) experiments on the approaches we have examined with human annotations, to determine whether there are user interface, context, or other issues that may present themselves in human annotation but not in automatic evaluation.

Table 1 :
Effect of data sampling methods, comparing the data from translation directions in WMT19 and WMT20 that used the segment-level sampling approach against runs of the document budgeted snippet approach. Values indicate the fraction of translation directions that had changes in rank. The top section shows the real WMT results. For the simulation of WMT (middle section) and the document budgeted snippet approach (bottom section), multiple runs of the simulation have been done, and those with the best (min.), median, and worst (max.) number of changes are reported.

Table 2 :
Effect of data sampling methods comparing translation directions that used whole document sampling and runs of document budgeted snippet sampling.

Table 3 :
Effect of sampling methods comparing translation directions with the document fixed snippet sampling approach to document budgeted snippet sampling.

Figure 1: Proportion of data from documents with different document lengths in the full test set and in the subsets sampled by document fixed snippet and document budgeted snippet sampling.

Table 6 :
Average coverage per system of human annotation for the WMT19-20 test sets that used segment-level (SL) sampling method

Table 7 :
Average coverage per system of human annotation for the WMT19-22 test sets that used whole document (WD) sampling method

Table 8 :
Average coverage per system of human annotation for the WMT22 test sets that used document fixed snippet (DF) sampling method