GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation

While often assumed a gold standard, effective human evaluation of text generation remains an important, open area for research. We revisit this problem with a focus on producing consistent evaluations that are reproducible—over time and across different populations. We study this goal in different stages of the human evaluation pipeline. In particular, we consider design choices for the annotation interface used to elicit human judgments and their impact on reproducibility. Furthermore, we develop an automated mechanism for maintaining annotator quality via a probabilistic model that detects and excludes noisy annotators. Putting these lessons together, we introduce GENIE: a system for running standardized human evaluations across different generation tasks. We instantiate GENIE with datasets representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. For each task, GENIE offers a leaderboard that automatically crowdsources annotations for submissions, evaluating them along axes such as correctness, conciseness, and fluency. We have made the GENIE leaderboards publicly available, and have already ranked 50 submissions from 10 different research groups. We hope GENIE encourages further progress toward effective, standardized evaluations for text generation.


Introduction
While the emergence of powerful language models (Radford et al., 2019; Raffel et al., 2020; Lewis et al., 2020) has made text generation omnipresent, effective evaluation of the resulting systems' performance on open-ended generation tasks remains a challenge. This has motivated the adoption of human evaluation in recent work (Celikyilmaz et al., 2020; Fabbri et al., 2021), even though it poses several challenges of its own (Clark et al., 2021; Karpinska et al., 2021). First, estimates of system performance are not reproducible over time and across annotator populations. Additionally, the setups are not standardized: different works use different annotation interfaces, even when working on the same dataset, despite the substantial effort needed to build an appropriate annotation interface and guidelines that elicit quality human annotations and filter out noisy annotators.

Figure 1: The GENIE architecture for evaluating text generation tasks, with a summarization example. Similar to automatic leaderboards, model developers submit their predictions (top). GENIE then evaluates with a standard human evaluation as well as with automatic metrics (center). These scores are then used to rank and track systems' performance across time (bottom).
This work presents an investigation toward reliably repeatable and standardized human evaluation. First and foremost, we study the reproducibility of human annotations in two stages of the annotation pipeline. We study this goal empirically as a function of various design choices ( §4), such as the way judgments are aggregated. We then propose a probabilistic framework for detecting malicious annotators ( §5) and isolating their annotations from the resulting performance estimates. Guided by these studies, we present GENIE (Figure 1), a framework for human evaluation of text generation that scales to a variety of tasks and datasets ( §6). GENIE posts model predictions to a crowdsourcing platform, where human annotators evaluate them according to predefined, dataset-specific guidelines. We describe mechanisms introduced into GENIE to quantify annotator variance and spread the annotations across various days, showing that GENIE achieves reliable scores on the studied tasks. To show its applicability, we instantiate GENIE with leaderboards for several popular text generation datasets in English from four diverse tasks—machine translation, question answering, summarization, and commonsense reasoning—and invite developers to extend it with more datasets. Since its deployment, GENIE has analyzed and ranked about 50 submissions from 10 different groups across all of our tasks, indicating the interest in standardized human evaluation.
The GENIE infrastructure opens the door for three avenues of research: (1) GENIE provides developers of text-generation models with the ease of the "leaderboard experience," alleviating the evaluation burden while ensuring high-quality, standardized comparison against previous models. (2) GENIE facilitates the study of human evaluation interfaces (Nenkova and Passonneau, 2004; Liu et al., 2016; Bragg et al., 2018; Shapira et al., 2019), addressing challenges such as annotator training, inter-annotator agreement, and reproducibility, all of which can be integrated into GENIE and compared against other evaluation metrics on past and future model submissions. (3) GENIE helps developers of automatic evaluation metrics (Zhang et al., 2020b) by serving as a hub of model submissions and associated human scores.

Related Work
We survey relevant work on automatic and human-in-the-loop evaluation of text generation. See Welty et al. (2019), van der Lee et al. (2019), and Celikyilmaz et al. (2020) for further in-depth discussion.
(Semi-)automatic Metrics Many researchers have proposed automated metrics for text generation tasks, such as BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005). These metrics initially correlated well with human judgments for contemporary models (Papineni et al., 2002; Doddington, 2002; Coughlin, 2003), though the correspondence breaks down as they become targets for optimization (Callison-Burch et al., 2006; Sun et al., 2019) or as models become increasingly powerful (Ma et al., 2019; Edunov et al., 2020). Several more recent approaches aim to learn automated metrics for text generation tasks, including for image description (Vedantam et al., 2015), paraphrasing (Sellam et al., 2020), and abstractive question answering (Chen et al., 2020). Such progress in automatic metrics is incorporated into recent leaderboards (Kasai et al., 2021). We integrate some of these metrics into our proposed system to track their correlation with human evaluation.
Human Evaluation of Language Given the limitations of automatic metrics, much prior work has developed ways to conduct human evaluation of language generation in general, and machine translation in particular. Human evaluation for machine translation (Graham et al., 2013, 2014; Sakaguchi and Van Durme, 2018; Freitag et al., 2021) typically involves crowdsourcing, where qualified crowd workers score output translations given the reference text. Results from manual evaluation are used as the primary metric in recent WMT competitions (Bojar et al., 2016, 2018; Barrault et al., 2020). However, to date, human evaluation efforts are typically conducted (1) on individual tasks such as machine translation, (2) by individual researchers with potentially varying design decisions, making results incomparable across evaluations, or (3) through shared tasks such as WMT, which force synchronization across teams for evaluation, slowing progress. As a result, most research on model development still evaluates models solely on automatic metrics such as BLEU (Papineni et al., 2002). GENIE relaxes these limitations by providing a continually running leaderboard across language generation tasks with shared, high-quality human evaluation templates.

Human-in-the-loop Evaluation
There are a few recent and concurrent leaderboards that incorporate manual analysis, tending to focus on individual tasks. For example, HYPE (Zhou et al., 2019) is an evaluation platform for image generation, ChatEval (Sedoc et al., 2019) is an evaluation platform for chatbots, and, more recently, Zellers et al. (2021) present a leaderboard for the advice generation task introduced in their work. DynaBench (Kiela et al., 2021) is a related multi-task leaderboard, but uses changing, adversarially created datasets that do not support our goal of controlled model comparison across time. HUSE (Hashimoto et al., 2019) was proposed as an evaluation metric for summarization and dialog that combines human annotations with automatic metrics for diversity and quality. STORIUM (Akoury et al., 2020) was introduced for human-in-the-loop generation and evaluation of long open-ended stories as a computer game. Concurrently, Gehrmann et al. (2021) introduced GEM, a workshop for participant-driven evaluation of language generation tasks. While such workshops inspire progress toward common goals, synchronized evaluations, often held only once per year, likely slow progress. We take the view that evaluations on a more frequent, rolling basis will give researchers more flexibility. To the best of our knowledge, GENIE is the first crowdsourced human-in-the-loop system that supports task leaderboards and is backed by principled design to ensure the scoring reliability of human evaluations.

GENIE Principles for Human Evaluation of Generative Models
There are many ways to run human evaluations.
Reflecting on what's needed to compare text generation models across time, we formulated the following principles to guide our design choices.
Application-Motivated Ultimately, the evaluation's purpose is to identify useful models and techniques. Thus, it should measure something informative about their usefulness in applications (such as a generated text's correctness or fluency).
Reproducible To compare different models over time, the evaluation must be reproducible. If repeated, it should give largely the same results. For example, results should hold across different groups of annotators, and remain stable across appropriate lengths of time.
Interpretable The evaluation should help a researcher understand how the system behaves, and thus must measure an aspect of the system that is easy to understand. An evaluation that ranks models but isn't interpretable has limited usefulness, since different applications might prioritize different things, and researchers must navigate cost-benefit trade-offs between more expensive, higher-performing models and cheaper ones.

Scalar The evaluation should produce an absolute scalar measurement of model performance (rather than a relative or comparative one) that facilitates comparison of a new model to all those previously evaluated.
Quantified Uncertainty All measurements are subject to uncertainty, including human evaluations. Thus, when comparing evaluations, we should consider how confident we can be that the resulting measurement is close to the true, latent measurement based on a more complete population of inputs and human annotators.
Rolling Given rapid recent advances in natural language generation, it is essential to develop easily-accessible evaluation platforms for frequent model evaluations that do not require competing teams to synchronize with each other.
Extensible Evaluation of NLP models is actively evolving, as new datasets are introduced and more is learned about how best to conduct human evaluation. Therefore, an evaluation framework should be easily extensible to new tasks and the latest practices.
Next, we empirically study design decisions along the aforementioned evaluation desiderata.

Design Decisions for Consistent Human Evaluations
When designing an evaluation, some questions can be answered with principles, while others must be answered empirically.We investigate several questions around the prompt design that commonly occur across various tasks and impact evaluations' reproducibility and confidence.
(Q1) Granularity of the elicitation: We examine two kinds of labels: (a) binary, and (b) Likert with 5 categories: strongly agree, agree, neutral, disagree, and strongly disagree. (Q2) Aggregation of per-example labels: Given multiple labels per example, we investigate aggregating by (a) averaging their scores, and (b) taking a majority vote. (Q3) Labels per example: For a fixed annotation budget, we compare collecting (a) 3 labels per annotated example (multilabeling) with (b) 1 label per example (unilabeling).

Case Study: Comparing Evaluation Designs for Open-domain Question Answering (ARC-DA) To study these design choices, we evaluated T5-11B (Raffel et al., 2020) on the development set of ARC-DA (Clark et al., 2021), a generative question answering dataset (see §7.1 for further details).
We used modified versions of the same annotation interface as Bhakthavatsalam et al. (2021) (Figure 4). Each evaluation was run once with a Likert scale and once with a binary scale. All instances (n = 360) were annotated by 3 annotators, repeated three times across different weekdays. The quality judgments were then mapped to numerical values. To produce the unilabeling and multilabeling results, we simulated these policies by randomly sampling with replacement for 500 rounds: either a random 1/3 of the total number of examples (multilabeling) or 1/3 of the total number of annotations for each example (unilabeling).

Figure 2 compares the reproducibility of different setups across time. Each subplot represents a choice of (Q1) scale (binary/Likert) and (Q2) aggregation (mean/majority vote). We compare these setups across subplots, and within subplots compare (Q3) unilabeling and multilabeling. The choices of scale and aggregation appear to have little effect on the evaluation, with all combinations broadly stable across days, though Likert elicitation with mean aggregation is slightly more stable.
Figure 3 compares the variance for all possible combinations. The choices of scale and aggregation appear to have little effect, though the Likert scale with mean aggregation may have the lowest variance. The biggest impact comes from unilabeling, which noticeably reduces the variance in comparison to multilabeling across all scenarios. This observation is consistent with previous work demonstrating the effectiveness of unilabeling for model training (Lin et al., 2014), but deviates from how annotations are often done in NLP (van der Lee et al., 2019). Our finding suggests that unilabeling is a promising strategy for model evaluation.
Overall, unilabeling with Likert scales and mean aggregation appears most reliable among all configurations for ARC-DA, and we therefore use this configuration in GENIE. Moreover, for the main leaderboard evaluations we use 3-7 times more samples, and expect even less variation. Our analysis shows that these design choices provide a good starting point for reproducible experiments with confident estimates.
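The unilabeling/multilabeling comparison above can be reproduced in miniature on synthetic data. The sketch below is our own illustration (the annotation matrix is fabricated, with a latent per-example correctness and 10% annotator noise); only the two sampling policies follow the setup described in this section:

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_labels = 360, 3

# Synthetic annotations: a latent per-example correctness, with each
# annotator independently flipping the label 10% of the time.
truth = rng.random(n_examples) < 0.6
flips = rng.random((n_examples, n_labels)) < 0.1
annotations = np.where(flips, ~truth[:, None], truth[:, None]).astype(float)

def multilabel_estimate():
    # all 3 labels for a random 1/3 of the examples
    idx = rng.choice(n_examples, size=n_examples // 3, replace=True)
    return annotations[idx].mean()

def unilabel_estimate():
    # 1 random label for every example (the same total budget)
    cols = rng.integers(n_labels, size=n_examples)
    return annotations[np.arange(n_examples), cols].mean()

multi = np.array([multilabel_estimate() for _ in range(500)])
uni = np.array([unilabel_estimate() for _ in range(500)])
print(f"multilabeling std: {multi.std():.4f}, unilabeling std: {uni.std():.4f}")
```

Because labels within an example are correlated, spreading the same budget over more examples (unilabeling) yields a lower-variance estimate of the aggregate score, consistent with the pattern reported above.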

Monitoring Annotation Quality
Despite strict qualification requirements, in our early experiments some annotators chose arbitrary labels after initially choosing correct ones. While a small percentage of the pool, these annotators complete a disproportionate share of tasks and significantly impact evaluations. To solve this problem, we built a monitoring system with two components: automatically generated test questions and an unsupervised scoring model.

Test Questions Because noisy annotators could favor marking examples as correct or incorrect, test questions need both positive and negative examples.
For positive examples, we replaced model predictions with gold responses. For negative examples, we cyclically permuted the gold generations, so no example was matched with its original. Thus, the negative examples look correct at a glance, but almost never are.
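This construction can be sketched in a few lines; the function name and example golds below are ours, not part of GENIE:

```python
def make_test_questions(gold_answers):
    """Build positive and negative test questions from gold responses.

    Positives show the gold response as if it were a model prediction;
    negatives cyclically shift the golds by one position, so every
    question is paired with a plausible-looking but wrong answer
    (assuming the golds are distinct).
    """
    positives = list(gold_answers)
    negatives = gold_answers[1:] + gold_answers[:1]
    return positives, negatives

golds = ["photosynthesis", "condensation", "gravity"]
pos, neg = make_test_questions(golds)
# each negative differs from its own example's gold answer
```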
Scoring Model Manually reviewing annotations can be time-consuming and error-prone, so we automate this process with a scoring model to infer whether workers have acceptable accuracy. Probabilistic models of annotation have been richly studied (Hovy et al., 2013; Passonneau and Carpenter, 2014; Paun et al., 2018). Much prior work uses worker agreement to identify noisy annotators; since we use unilabeling ( §4), workers annotate disjoint sets of examples and these methods are not applicable. Instead, we use a similar probabilistic model, but applied to predict how often workers correctly answer the test questions. Such a model must be unsupervised, since new tasks won't have identified noisy annotators; interpretable, since parameters like confidence thresholds must be set a priori; and sequential, so noisy annotators can be detected as soon as there is enough evidence.
In our model, each worker w answers n_w test questions. The number of correctly answered test questions, X_w, is binomially distributed with success probability P_w. Each P_w comes from a mixture-of-betas prior. Thus, noisy and non-noisy annotators can be modeled with different mixture components.
We compare two definitions of noisy annotators. The rate criterion defines them as workers with an accuracy (P_w) below a threshold (90%). The class criterion defines them as workers whose latent class (Z_w) corresponds to any mixture component besides the one with the highest expected accuracy.
We fit the model parameters θ_i, α_i, β_i for mixture components i = 1, . . . , k by maximum likelihood via the EM algorithm (Dempster et al., 1977) for mixture models (Murphy, 2012). Then, we infer a posterior distribution for each worker's accuracy (P_w) and latent class (Z_w) given the number of questions they answered correctly (X_w). Since the prior is a mixture of conjugate distributions, the posteriors have a closed form (Diaconis and Ylvisaker, 1979).
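Because each beta component is conjugate to the binomial likelihood, the per-worker posterior can be computed exactly as a re-weighted mixture of beta posteriors. The sketch below is our reconstruction of that computation, not the released GENIE code; the two-component prior parameters are illustrative placeholders, not fitted values:

```python
import math
import numpy as np

# Hypothetical 2-component prior: a "reliable" and a "noisy" component.
THETA = np.array([0.9, 0.1])     # mixture weights
ALPHA = np.array([95.0, 3.0])    # Beta(alpha, beta) per component
BETA = np.array([5.0, 7.0])

def log_beta_fn(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabinom_pmf(x, n, a, b):
    """Beta-binomial likelihood of x successes in n trials."""
    return math.exp(
        math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
        + log_beta_fn(a + x, b + n - x) - log_beta_fn(a, b)
    )

def beta_cdf(t, a, b, grid=20001):
    """P(p < t) for p ~ Beta(a, b), by trapezoidal integration."""
    p = np.linspace(1e-9, t, grid)
    log_pdf = (a - 1) * np.log(p) + (b - 1) * np.log1p(-p) - log_beta_fn(a, b)
    y = np.exp(log_pdf)
    return float(((y[1:] + y[:-1]) / 2 * np.diff(p)).sum())

def noisy_posterior(x, n, threshold=0.9):
    """Posterior probability a worker is noisy, under both criteria."""
    # P(Z = i | x) is proportional to theta_i * BetaBinomial(x; n, alpha_i, beta_i)
    joint = THETA * np.array([betabinom_pmf(x, n, a, b)
                              for a, b in zip(ALPHA, BETA)])
    z_post = joint / joint.sum()
    # rate criterion: P(P_w < threshold | x), mixing conjugate Beta posteriors
    p_below = np.array([beta_cdf(threshold, a + x, b + n - x)
                        for a, b in zip(ALPHA, BETA)])
    rate_noisy = float(z_post @ p_below)
    # class criterion: posterior mass on the low-accuracy component
    class_noisy = float(z_post[1])
    return rate_noisy, class_noisy
```

For example, a worker who answers only 3 of 20 test questions correctly is flagged with near certainty under both criteria, while one who answers 19 of 20 is not.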
To adapt the Likert responses for this model, we binarize them at 0.5. Positive and negative test questions are modeled independently, and annotators are considered noisy if they are noisy on either. Since the difficulty of annotating different tasks varies, each GENIE task is modeled separately. Finally, to stabilize the EM algorithm and the resulting parameter estimates, we augment the worker responses with pseudo-data. See Appendix B.1 for the full technical details.

Detecting Noisy Annotators for WMT21
To test GENIE in a real-world scenario, we used it to evaluate the 24 systems submitted to WMT21 and several additional baselines on German-to-English translation (Akhbardeh et al., 2021). For the evaluations, the GENIE leaderboards used 5% of examples as positive and 5% as negative test questions. We manually reviewed test question statistics to identify and remove 5 noisy annotators from a pool of 88 (5.7%). As in our preliminary experiments, these noisy annotators represented a small fraction of annotators; however, we had previously found such annotators could annotate up to 50% of the HITs. By identifying and removing them, we prevented such a negative impact on our WMT evaluations.
Simulation Study Even a fairly large real-world evaluation encounters only a few noisy annotators.So, we complement our WMT21 case study with simulations based on it, where we can run more trials and know the ground truth.
We split the WMT21 annotations chronologically into validation and test sets. The validation set was used during model development, while we evaluated the models by simulating 25 rounds of annotation based on the test set's statistics. Similarly to the annotation models discussed in Karger et al. (2011), each worker was independently designated as noisy and then assigned a rate at which they labeled test questions correctly. Based on the validation data's statistics, for each round we drew the noisy-annotator probability uniformly from 1-10%, and each annotator's probability of being correct uniformly from 0-50% for noisy annotators and 95-100% for the rest. The model predicted annotators as noisy if the posteriors for Z_w and P_w assigned them at least 99% probability of being noisy annotators under the rate or class criteria. We computed precision and recall across all the simulations, bucketing workers by how many test questions they answered.

Table 1 shows the simulation results. In addition to varying the number of components (k) in the learned priors, we also compared against uninformative priors (the Jeffreys and uniform priors) and informative priors (fixed). Noisy annotators lose the chance to answer additional test questions, so it is critical that models have high precision when marking workers as noisy. The uninformative priors suffer from low precision, assigning too much probability to a worker being a noisy annotator. The informed and learned priors both perform well, with high precision and good recall—in some cases identifying almost all noisy annotators with fewer than 15 test questions. The learned priors have the additional advantage that they can adapt to different distributions by pooling information across annotators. Based on these results, the 2-component learned rate and class models have proven to be strong candidates for application.

Automatically Managing Human Evaluation Leaderboards
This section reviews the GENIE system, which automates much of the management of text generation leaderboards with human evaluations. While some human management, such as providing support and handling disputes, should never be fully automated, GENIE alleviates much of the overall burden. The next section ( §7) describes its instantiation into the GENIE leaderboards for four diverse text generation tasks. At a high level, the GENIE system coordinates a leaderboard UI, a data processing backend, and crowdsourcing campaigns on Amazon Mechanical Turk (AMT). After retrieving newly uploaded submissions, the backend computes automatic metrics. Upon success, the backend then creates annotation tasks on AMT using AMTI (A Mechanical Turk Interface), an open-source Python package for working with AMT.
Each leaderboard is a separate instance of the system, with its own crowdsourcing templates, including instructions, examples, and prompts (see §7). Following our observations in §4, all templates use Likert scales, which are then mapped to real-valued scores (cf. footnote 6) and averaged.
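The score aggregation is a simple mapping-and-averaging step. The mapping below is a hypothetical, evenly spaced one for illustration; GENIE's actual mapping is the one defined in footnote 6:

```python
# Hypothetical evenly spaced Likert-to-score mapping (illustration only;
# the mapping GENIE actually uses is defined in footnote 6).
LIKERT_TO_SCORE = {
    "strongly disagree": 0.0,
    "disagree": 0.25,
    "neutral": 0.5,
    "agree": 0.75,
    "strongly agree": 1.0,
}

def submission_score(responses):
    """Average the mapped scores of a submission's Likert responses."""
    return sum(LIKERT_TO_SCORE[r] for r in responses) / len(responses)
```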
The system also maintains a history of past annotations (per instance and per worker), updating statistics after each evaluation. This has several immediate and future benefits: worker statistics enable spam detection ( §5), while the annotations can be used for future studies on human evaluation.
These components enable the following features:

Extensibility New tasks can be modularly added to the GENIE system, creating new leaderboards. Each task requires a crowdsourcing template and a code object specifying how to push model predictions into, and pull workers' annotations from, the crowdsourcing templates. We release an extensible open-source annotation template library, seeded with the four task templates used in this work.
Uncertainty Quantification To better inform model comparisons, we report scores with uncertainty estimates. Bootstrap resampling (sampling with replacement from the observed annotations) provides the 95% confidence intervals for the estimated submission quality scores, as commonly done in machine translation (Koehn, 2004).
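A minimal sketch of the percentile-bootstrap interval for a submission's mean score (a sketch of the standard technique, not the GENIE implementation; function name and round count are ours):

```python
import numpy as np

def bootstrap_ci(scores, n_rounds=1000, alpha=0.05, seed=0):
    """95% percentile-bootstrap confidence interval for a mean score.

    Resamples the observed annotations with replacement n_rounds times
    and takes the alpha/2 and 1 - alpha/2 quantiles of the means.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_rounds)
    ])
    return float(np.quantile(means, alpha / 2)), float(np.quantile(means, 1 - alpha / 2))
```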

Human Evaluations: Uncertainty vs Cost
To balance confidence with affordability, the system evaluates a subsample of the test sets. This subset is random but fixed, to reduce variance between model comparisons. Sentence-level tasks, such as sentence translation, cost less to annotate per example. Depending on task difficulty, we adjust the pay rate per HIT such that we pay workers at a rate above 15 USD per hour. For sentence-level tasks we annotate 800 instances at a cost of ∼$600 per submission (standard error < 1.77%); for larger tasks, we evaluate 300 instances costing ∼$350 per submission (standard error < 2.89%). These evaluations are much larger than what was previously done, e.g., 100 instances for MT in Ma et al. (2018) or around 100 instances for summarization (Kim et al., 2019; Hardy et al., 2019; Kryscinski et al., 2019; Fabbri et al., 2021).
Automatic Metrics To supplement human evaluations, we compute recent and popular automatic metrics for each task: METEOR (Banerjee and Lavie, 2005), ROUGE (Lin, 2004), BLEU (Papineni et al., 2002), SacreBLEU (Post, 2018), BLEURT (Sellam et al., 2020), and BERTScore (Zhang et al., 2020b). Integrating these metrics into GENIE enables researchers to examine their correlation with human judgments and to observe trends as more models are submitted.

Quality Control
To ensure annotation quality, annotators must pass strict qualification requirements and task-specific qualification tests based on a subset of questions derived from the task's training data. These tests check that workers have carefully read the instructions and are comfortable with annotating instances of the particular task.
In addition, we replace 5% of examples with positive and another 5% with negative test questions, which we analyze with the 2-component learned class model between submission evaluations, as described in §5. Accordingly, noisy annotations are excluded from results, and noisy annotators are removed from the pool of eligible workers. Lastly, to eliminate variability from evaluating at different times (weekends vs. weekdays, different work hours), we publish the AMT tasks on weekdays at 10am Pacific Time.
13 See Appendix C for a discussion of standard error.
14 I.e., 5000 completed HITs, a 99% assignment approval rate, and being based in a country with a population predominantly of native English speakers (e.g., USA, Canada, UK, Australia), since our initial set of tasks focuses on English.

Table 2: Datasets currently available in GENIE, along with their domain and size by task type.
The GENIE Leaderboards

Tasks and Datasets
We integrate into GENIE datasets from four diverse text-generation tasks, representing longstanding challenges, as outlined below. We focus on English-language datasets, mostly due to easy integration with crowdsourcing platforms. In the future, we hope to integrate other new datasets, particularly in other languages. GENIE is easily extensible; it uses community datasets and metrics via the open-source Datasets library. The templates for all tasks are exemplified in Figure 4.
Question Answering Given an input question about a given context, the system is expected to provide the answer in natural-language form. We use the ARC-DA dataset, which contains questions about subjects from elementary-school science exams. See Figure 4 for an example.
Commonsense Reasoning Given an input scenario, the task is to generate a plausible explanation, according to typical real-world human behavior and understanding. We use αNLG (Bhagavatula et al., 2020), a dataset for the conditional generation task of explaining given observations in natural language. For evaluation, we use a template and instructions similar to those used by Bhagavatula et al. (2020), as shown in Figure 4b.

Machine Translation
The task is to generate a translation in a target language given text in a source language. Here we use the recent WMT19 and WMT21 datasets with publicly available system outputs (Barrault et al., 2019; Akhbardeh et al., 2021). To ensure the generated text is evaluated by native speakers, we focus on German-to-English translation (DE-EN) and leave the expansion to other language pairs as future work. Importantly, the WMT19 and WMT21 DE-EN test data only contain text that was originally in German (Barrault et al., 2019), avoiding overestimating the quality of translation systems due to translationese effects (Toral et al., 2018; Graham et al., 2019; Edunov et al., 2020). We follow the WMT human evaluation template to assess sentence-level translation quality against the reference (Barrault et al., 2019). The one difference is that, consistent with the other GENIE tasks, we use a five-category Likert scale instead of WMT's continuous one. See Figure 4c.

Summarization The model is expected to generate a summary of the key points mentioned in a given paragraph. Here we use XSUM (Narayan et al., 2018) and Radev, 1995). See Figure 4d for an example; in the template, Summary A is the gold summary while Summary B is model-predicted text, permuted randomly between instances so that annotators are blind to which one is gold.

Evaluating GENIE Baselines
Here we evaluate several baseline models for each dataset using the GENIE evaluation pipeline.

Models
We use models that are known to perform strongly on each of our tasks. For all tasks but machine translation, we train and evaluate T5 (11B; Raffel et al., 2020), a powerful text-generation model that has shown promising results on a wide variety of text generation tasks.
For WMT we evaluate other specialized models instead of T5, which is pre-trained only on English (Raffel et al., 2020). For WMT21 DE-EN, we evaluate all publicly available shared task submissions (see footnote 17). Additionally, we train and evaluate four transformer-based baselines of varying sizes: GENIE-large-6-6 (transformer large with a 6-layer encoder and a 6-layer decoder), GENIE-base-6-6, GENIE-base-3-3, and GENIE-base-1-1. These models are trained solely on the given training data without ensembling, back-translation, or any other data augmentation method, to support future research in low-compute settings.

Results
The results are summarized in Table 3. The human judgment scores for each task are calculated with our described pipeline ( §6). Even though we have evaluated strong baselines for each task, the machine responses are far from what human judges consider perfect. In the WMT21 task, the transformer baselines are ranked in the expected order: large-6-6, base-6-6, base-3-3, followed by base-1-1. These results support the validity of our evaluations. We defer further study of the correlations between human judgments and automatic metrics to future work, since such a study would require more models to be evaluated.

Limitations
As with other work that deals with human annotation, results generated via our evaluation framework will have inherent variability. While we tried to mitigate sources of variation in various ways (see §5, §6), some are bound to remain and are hard to account for. These include, for example, selection bias in the pool of annotators who choose to work on our tasks: they may come from particular countries and social backgrounds, and may self-select for certain tasks and templates. We welcome future evolution of all parts of the GENIE architecture, including its evaluation metrics.

Conclusion and Future Work
We introduce GENIE, a unified approach to human-in-the-loop evaluation of text generation over a wide set of text generation tasks. GENIE is open for use and will be adapted based on future adoption. We encourage submissions from all researchers interested in text generation models.

A Details on Model Engineering
Here we summarize the experimental details for building the models used in §7.2.
T5 models. For the various datasets (except WMT, which requires a multilingual model) we trained T5 models of two sizes: 11 billion parameters (11B) and 3 billion parameters (3B). We used the framework's default hyperparameters: token limits of 512 and 100 for input and output sequences, respectively, a learning rate of 1e-3, and a batch size of 8. The models were trained for 100k steps on v3-8 TPUs, which took about 24 hours on average. The checkpoint with the highest score on the dev set of each task was selected for evaluation.

GENIE WMT models. Tables 4 and 5 provide details of the GENIE WMT baseline models.

B Monitoring Annotation Quality
This appendix provides details on model construction and evaluation for §5.

B.1 Modeling
All rate models used P_w = 0.9 as the threshold for defining a noisy annotator. All class models used the mixture component with the highest average accuracy to define non-noisy annotators. Workers were flagged as noisy when the posterior assigned them at least 99% probability of being noisy ( §5). The Jeffreys model used the Jeffreys prior, or Beta(1/2, 1/2); the uniform model used a uniform prior, or Beta(1, 1).

Learned Priors
The learned prior models were all fit via the EM algorithm, as described below, and we tried 1 and 2 components for the rate model and 2 components for the class model.
Optimization To stabilize the EM algorithm and regularize the parameter estimates, we augmented the data with pseudo-workers. We added 40 pseudo-workers, each completing 20 tasks: 36 annotators with 19 successes, and four noisy annotators with 1, 1, 5, and 10 successes, respectively. The EM algorithm was run with 10 initializations, each with up to 1,000 iterations and a relative tolerance of 1e-6 for stopping. Components were initialized with equal mixture probabilities, uniformly random means from 0 to 1, and concentration parameters drawn from a gamma distribution with a shape parameter of 2. Beta mixture components were fit using the Dirichlet-multinomial fixed-point iterator from Minka (2000), with 10,000 iterations and a relative tolerance of 1e-7.

B.2 Evaluation
For evaluation, we simulated 25 rounds of annotation. In each round, the number of test questions per worker was taken from the counts in the test set of annotations; a fixed noisy-annotator rate was drawn uniformly from 1% to 10%; the mean and concentration parameter for the beta distribution of noisy annotators' success probabilities were drawn uniformly from 0 to 0.5 and from 5 to 50, respectively; and the mean and concentration parameter for the beta distribution of regular annotators' success probabilities were drawn uniformly from 0.95 to 1 and from 100 to 1,000, respectively. Each worker was assigned a noisy or regular label and, accordingly, a success probability; their successes and failures were then binomially distributed.
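One such simulated round can be sketched as follows, with a fixed number of test questions per worker for simplicity (the paper's setup instead uses the per-worker counts from the test set). The function name and return format are ours.

```python
import random

def simulate_round(n_tasks_per_worker, n_workers, rng=random):
    """One simulated annotation round, mirroring the parameter
    ranges described in B.2. Returns (is_noisy, successes, failures)
    per worker."""
    noisy_rate = rng.uniform(0.01, 0.10)
    # Mean and concentration of the beta over success probabilities,
    # separately for noisy and regular annotators.
    noisy_mean, noisy_conc = rng.uniform(0.0, 0.5), rng.uniform(5, 50)
    reg_mean, reg_conc = rng.uniform(0.95, 1.0), rng.uniform(100, 1000)
    workers = []
    for _ in range(n_workers):
        is_noisy = rng.random() < noisy_rate
        mean, conc = (noisy_mean, noisy_conc) if is_noisy else (reg_mean, reg_conc)
        p = rng.betavariate(mean * conc, (1 - mean) * conc)
        # Binomial successes via per-task Bernoulli draws.
        successes = sum(rng.random() < p for _ in range(n_tasks_per_worker))
        workers.append((is_noisy, successes, n_tasks_per_worker - successes))
    return workers
```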

C Standard Error
Standard error quantifies the variability of an estimate, θ. Mathematically, the standard error is the estimate's standard deviation (as opposed to the standard deviation of a single sample). Often, the estimate is an average of multiple independent samples, in which case the standard error is σ/√n, where n is the number of samples and σ is the standard deviation of a single sample. When the estimate is an average, it is approximately normally distributed due to the central limit theorem, making θ ± 1.96 σ/√n an approximate 95% confidence interval.
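Concretely, the mean-with-confidence-interval computation reads as follows (the function name is ours; this is a minimal illustration, not GENIE's reporting code):

```python
from math import sqrt

def mean_ci(samples, z=1.96):
    """Mean of the samples with an approximate 95% confidence
    interval via the standard error sigma / sqrt(n)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n  # population variance
    se = sqrt(var) / sqrt(n)
    return mean, (mean - z * se, mean + z * se)

m, (lo, hi) = mean_ci([0, 1, 1, 1])
print(m)  # 0.75
```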
The Bhatia-Davis inequality (Bhatia and Davis, 2000) bounds the variance of a random variable in terms of its upper bound M, lower bound m, and expectation µ: Var(X) ≤ (M − µ)(µ − m). Since the scores from our annotators are bounded between 0 and 1, the maximum standard deviation for any of them is √(0.5 · 0.5) = 0.5. Moreover, if a model's score is 0.8 on average, then the maximum standard deviation for its annotations is √((1 − 0.8)(0.8 − 0)) = 0.4. Dividing by √n translates these into bounds on the worst-case standard error of our estimates: √((1 − µ)µ)/√n, where µ is the expected score.
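The worst-case bound above is a one-liner; a small sketch (function name ours) makes the two numeric examples from the text checkable:

```python
from math import sqrt

def worst_case_se(mu, n, lower=0.0, upper=1.0):
    """Bhatia-Davis bound on the standard error of a mean of n
    scores bounded in [lower, upper] with expectation mu."""
    return sqrt((upper - mu) * (mu - lower)) / sqrt(n)

# With scores in [0, 1]: the per-sample SD is at most 0.5 in general,
# and at most 0.4 when the expected score is 0.8.
print(worst_case_se(0.5, 1))  # approximately 0.5
print(worst_case_se(0.8, 1))  # approximately 0.4
```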
Table 6: Summary of evaluating several existing models on each dataset with GENIE. The highest numbers (and their CIs) in each column are shown in bold. Scores given by crowd workers are shown in blue. We evaluated all 24 systems from WMT21, but show only the top 3 systems along with our GENIE transformer baselines here.
Figure 1: The GENIE architecture for evaluating text generation tasks, with a summarization example. Similar to automatic leaderboards, model developers submit their predictions (top). GENIE then evaluates them with a standard human evaluation as well as with automatic metrics (center). These scores are then used to rank and track systems' performance across time (bottom).

Figure 2: Label aggregation variants across three different days for the GENIE ARC-DA leaderboard. Using Likert scales yields lower inter-day variation. The horizontal red dashed line (0.803) denotes the annotations of an expert annotator intimately familiar with the task.

Figure 3: Standard deviation (STD) of different labeling strategies. Unilabeling yields lower variance and, hence, better stability across different populations of annotators (on different days).

Figure 4: Annotation interfaces for the datasets of the four tasks integrated in GENIE.
Fixed Priors
The 1-component fixed rate model had one beta mixture component with parameters α = 4 and β = 1. Both the 2-component fixed rate model and the 2-component fixed class model had two mixture components with probabilities 0.05 and 0.95 and parameters α = 0.5, β = 4.5 and α = 9.5, β = 0.5, respectively.

Table 4: Transformer-base fairseq hyperparameters and settings. BPE with 32K operations is applied jointly to the German and English text. All embeddings are shared.

Table 5: Transformer-large fairseq hyperparameters.