The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics


Abstract
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for which we are organizing a shared task at our ACL 2021 Workshop and to which we invite the entire NLG community to participate.

Introduction
Natural language generation is the task of automatically generating understandable texts, typically using a non-linguistic or textual representation of information as input (Reiter and Dale, 2000). These texts aim to fulfill an underlying communicative goal (e.g., to produce a summary of an article) while remaining faithful to the input information, fluent, grammatical, and natural-looking. An NLG system needs to be robust to shifts in the data distribution and be able to produce text in many different languages. Finally, it is often desired that repeated interactions with the model produce diverse outputs, for example, to explain concepts in multiple ways or to become a more interesting conversational agent. These optimization objectives can often be conflicting (Hashimoto et al., 2019) and, as a result, evaluations that focus only on a single aspect may fail to recognize the drawbacks of a particular method. To demonstrate this trade-off, consider an improvement on the CNN-DM summarization dataset (Hermann et al., 2015; Nallapati et al., 2016) measured by the ROUGE-L metric (Lin, 2004). Since ROUGE only tests the extent to which a generated summary has a lexical overlap with a reference summary, it can erroneously produce high scores for fluent, yet meaningless and unfaithful outputs as long as many of the same words are used (Maynez et al., 2020; Gabriel et al., 2020). Moreover, ROUGE tends to favor systems that produce longer summaries (Sun et al., 2019). It is thus crucial to carefully assess the progress of NLG toward all of its goals at the same time in ways that evolve alongside the models. This is currently not the case; new models are evaluated on different datasets, most of which focus only on the English language (Bender, 2019), and using these flawed metrics.
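The ROUGE-L failure mode described above is easy to reproduce. The following is a minimal, self-contained sketch of ROUGE-L (longest common subsequence F1); the example sentences are illustrative, not from any GEM dataset. An unfaithful summary that reuses the reference's words can score higher than a faithful paraphrase:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists
    (classic dynamic program over prefixes)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b \
                else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L: F1 over the LCS of whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference  = "the senate passed the budget bill on friday"
faithful   = "the budget bill passed the senate on friday"   # same meaning, reordered
unfaithful = "the senate rejected the budget bill on friday" # opposite meaning

print(rouge_l_f1(reference, faithful))    # 0.625
print(rouge_l_f1(reference, unfaithful))  # 0.875
```

The unfaithful summary, which inverts the meaning but keeps the surface form, outscores the faithful paraphrase.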
Moreover, while human evaluations of generated texts can provide complementary insights to automatic evaluation (Manning et al., 2020), they can also lead to contradicting results since studies often omit crucial replication details and assume different definitions of the measured quantities (Howcroft et al., 2020).
We propose a living benchmark called GEM (Generation, Evaluation, and Metrics) that aims to enable research on a wide range of NLG challenges. To avoid the fallacy of encouraging hill climbing on a leaderboard (Linzen, 2020), GEM focuses on an in-depth evaluation of model outputs across human and automatic evaluation that aims to uncover shortcomings and opportunities for progress. As datasets, metrics, and models improve, the benchmark environment will improve as well, replacing "solved" tasks with more challenging ones, incorporating newly developed metrics, and addressing discovered flaws in the experimental setup, as demonstrated in Figure 1. Making all model outputs available under an open-source license will support evaluation research and integrating new metrics will, in turn, help their adoption and increase the robustness of model evaluations.
The initial set of eleven included datasets is presented in Table 1. They measure specific generation challenges, such as content selection and planning (What to say?) and surface realization (How to say it?) (Reiter and Dale, 2000; Gatt and Krahmer, 2018). Models need to be capable of paraphrasing, simplification, and other skills. In addition to those challenges, GEM datasets also differ in their communicative goals, languages, the noisiness of data, and resource availability, to evaluate the consistency of evaluation schemes. About half of the datasets have multiple references and more than half were post-processed to improve data quality. The sizes range from 5k to 500k data points. GEM features 18 languages across all tasks and two of the datasets do not include English at all. To be able to properly assess the performance of models in a way robust to the shortcuts a model can take, we additionally introduce ten types of challenging test sets that probe for specific modeling aspects (Perez-Beltrachini and Gardent, 2017; Ribeiro et al., 2020). To ensure that research with GEM is conducted responsibly, all the datasets are documented in an NLG-specific version of data cards (Bender and Friedman, 2018; Gebru et al., 2018) we developed and for which we release a template and guide. Moreover, all submitted models will have an associated model card (Mitchell et al., 2019).

Figure 1: The opportunities of living benchmarks and pitfalls of evaluation. As models improve, we need consistent evaluations such that models can be compared to each other. This can only happen if we develop robust human evaluation standards and improve our automated metrics. Otherwise, results are challenging to interpret and compare to each other. Finally, as models improve and metrics saturate, we need to evaluate them on more challenging datasets instead of continuing to move sideways on old ones. GEM aims to provide this environment for natural language generation.
This paper describes the selection and construction of the GEM datasets in support of the announcement of the shared task at ACL 2021. More detailed information can be found on our website https://gem-benchmark.com/.

Benchmarks in NLG
In this section, we summarize common criticisms of benchmarks in NLP, discuss how they apply to NLG, and how we plan to address them. Then, we describe opportunities that GEM can provide. NLP benchmarks such as GLUE (Wang et al., 2019b) are common for natural language understanding (NLU) tasks. They aggregate multiple tasks under a unified evaluation framework, which enables researchers to fairly compare their models to others. Due to the improved model comparability, benchmarks are critical in measuring modeling progress.
However, they also pose a risk that progress is reduced to the single number shown in a benchmark's leaderboard and thus may encourage blindly optimizing it without regard to other considerations like model size or fairness (Ethayarajh and Jurafsky, 2020). This is especially challenging for benchmarks in NLG since, as discussed above, the performance cannot be described through a single metric and it is often not clear what metric to optimize for. This shortfall can be seen in benchmarks like DecaNLP (McCann et al., 2018) and GLGE, which include NLG tasks but focus only on a single metric and, as a result, may mischaracterize a system's performance.
Moreover, an easy-to-use data infrastructure also disincentivizes researchers from interacting with and conducting in-depth analyses of the datasets that models are trained on. This limited analysis delegates to the creators of the benchmark the responsibility of ensuring that all included datasets have been collected fairly (Denton et al., 2020). The dataset and benchmark creators thus must provide in-depth statements that describe the data characteristics and surface potential issues, and must consider these issues when selecting datasets for a benchmark (Gebru et al., 2018; Bender and Friedman, 2018).
These dangers emphasize that selecting datasets for a benchmark needs to be done carefully, that the setup has to remain flexible to be able to address newly found limitations, and that the benchmark should not focus solely on climbing a leaderboard. Instead, a living benchmark that can adjust its datasets and specific evaluation metrics can be much more powerful and long-lived. This can, for example, be seen in Dynabench (Potts et al., 2020), which has a static evaluation, but interactively adds more test data through a human-in-the-loop approach.
Increasing multilingualism of NLG research. Another potentially harmful decision by benchmark creators is the choice of languages for the included datasets. It is often assumed that work on English transfers to other languages (Bender, 2011). However, this assumption does not consider differences between the languages that lead to higher modeling complexity, for example, a richer morphology or flexible word order. Still, the majority of work in NLP and almost all benchmarks exclusively focus on English (e.g., Wang et al., 2019b; McCann et al., 2018). Even if multiple languages are considered, the availability of data in a language often does not reflect the number of speakers of that language. This means that work on languages with little available data can potentially impact many more people than work on highly resourced languages (Joshi et al., 2020).
As a result, many recent benchmarking and dataset creation efforts in NLU develop and focus on tasks that are inherently multilingual or which explore cross-lingual transfer. For example, XTREME (Hu et al., 2020) introduces a benchmark covering 40 languages across multiple NLU and retrieval tasks, XCOPA (Ponti et al., 2020) is a commonsense reasoning dataset for eleven languages, and MLQA is a dataset for extractive question answering across seven languages. We can observe a similar recent trend in natural language generation, where MLSum and WikiLingua were created as multilingual summarization datasets. There have also been first steps toward including NLG tasks in multilingual NLU benchmarks. For example, XGLUE includes Question and News Title Generation (Liang et al., 2020). Unfortunately, XGLUE reduces the generation evaluation to BLEU-4, a metric that is inadequate for NLG (Reiter, 2018).
There have also been multiple shared tasks in NLG that focus on multilingualism, for instance, the shared task on multilingual surface realization which includes eleven languages (Mille et al., 2020). The shared task on document-level generation and translation featured German and English generation challenges (Heafield et al., 2020). The WebNLG+ shared task asked participants to contribute models that can realize text in Russian and English (Ferreira et al., 2020).
A benchmark that focuses only on NLG can enable much richer evaluation (as described in the next sections), and promote non-English datasets. In addition, it can ensure that the datasets created for those shared tasks continue being evaluated.
Providing a testbed for automated evaluation.
Most traditional automated metrics, such as ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002), measure the n-gram overlap between a reference and the generated text. However, in most cases, there is more than one correct way to generate a text, especially in tasks with a latent content planning or selection step (Reiter and Dale, 2000). That means that a correct solution may score low on a metric. While multiple references alleviate the issue somewhat, these metrics still have a low correlation with human judgments (Reiter, 2018; Fabbri et al., 2020). To address the issue, the machine translation community has been organizing yearly metrics shared tasks which produce metrics that achieve a high correlation (Stanojević et al., 2015; Bojar et al., 2016, 2017; Ma et al., 2018, 2019; Mathur et al., 2020b). The latest metrics focus on semantic equivalence instead of lexical similarity, which improves the correlations drastically. However, recent work by Fabbri et al. (2020) demonstrates that this may not hold in summarization, where the automated metric BERTScore (Zhang et al., 2020b) does not improve upon the correlation of ROUGE. Moreover, Mathur et al. (2020a) and Freitag et al. (2020) find that when comparing two high-quality systems, differences according to a metric may also stem from how references are written or flaws in the metric itself. Given that automated metrics perform differently across tasks, setups, and languages, a multi-task NLG benchmark has the opportunity to act as a testbed to evaluate how the latest advances in automated metrics perform on these different tasks. The benchmark can facilitate this research through the release of system outputs and associated human annotations, which is what we are planning to do with GEM. Moreover, we allow the integration of additional metrics into our living benchmark system, which enables a much faster adoption.
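Multi-reference evaluation is typically implemented by scoring a candidate against each reference and keeping the best score. A minimal sketch using a toy unigram-F1 metric (the metric and example sentences are illustrative, not the exact metrics used in GEM) shows how a correct alternative phrasing penalized under one reference is rescued by a second:

```python
from collections import Counter

def unigram_f1(reference, candidate):
    """Toy lexical-overlap metric: F1 over unigram counts."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)

def multi_ref_score(references, candidate):
    """Common practice: score against each reference, keep the best."""
    return max(unigram_f1(ref, candidate) for ref in references)

refs = ["temperatures will reach 30 degrees tomorrow",
        "expect a high of 30 degrees tomorrow"]
cand = "expect a high of 30 degrees tomorrow"

print(unigram_f1(refs[0], cand))    # low against a single reference
print(multi_ref_score(refs, cand))  # 1.0 once the matching reference is available
```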
Developing reproducible human evaluation standards. In recent work, Howcroft et al. (2020) investigated NLG papers from the last twenty years and found that evaluation methodologies differ drastically across papers. Moreover, in most cases, papers do not even mention what the human evaluation aims to measure, and definitions of measures like "accuracy" or "fluency" are inconsistent. They thus suggest reporting standards for criteria and methods, following a classification system proposed by Belz et al. (2020). In addition, regularly scheduled shared tasks like WMT have led to standardization of human evaluation setups and enabled controlled experimentation with them. GEM has the opportunity to develop reproducible standards for how human evaluation for NLG tasks beyond translation should be conducted while at the same time incorporating lessons from related work. Acting on the same need, the recently proposed GENIE (Khashabi et al., 2021) system aims to automate and standardize the human evaluation of different NLG systems, however with the contrasting goal of reducing the evaluation to a leaderboard-like score. To avoid further fragmentation of the field, GEM is developing its own human evaluation approaches, but uses the infrastructure provided by GENIE to run its human evaluation.
In addition to GENIE, multiple other related efforts exist that work toward the goal of reproducible and robust in-depth human and automatic evaluation for NLG tasks, and which focus on specific modeling or task aspects that are different from those in GEM. Among those are KILT (Petroni et al., 2020), which focuses on knowledge-intensive tasks and retrieval-based models, Storium (Akoury et al., 2020), which focuses on open-ended story generation, and BIG-bench (https://github.com/google/BIG-bench), which focuses on measuring few-shot and zero-shot capabilities of language models.

Dataset Selection
As highlighted in Figure 1, the selection of included datasets is an integral part of a benchmark. They should be challenging for models, but it should still be possible to evaluate models trained on them. Moreover, the datasets should cover a wide range of relevant generation challenges that allow for findings to be as general as possible. Finally, the datasets should cover tasks that are interesting for contributors to work on to facilitate the wide adoption of the benchmark.
To collect datasets with those desired properties, the selection methodology for GEM is composed of three steps. First, we elicited a set of proposals from everyone involved in the effort. Second, we identified criteria for the selection. Third, all GEM members voted on individual dataset and criteria utilities. The final selection maximizes the utility under constrained resources, similar to a knapsack solver. This can be seen as an extension of the selection process of SuperGLUE (Wang et al., 2019a), which had similar first and second steps but made the final decision based on which datasets were harder for a baseline model to solve after identifying a final set of candidates. Since we are going to introduce challenge sets, the baseline performance of models on a dataset matters less.
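The knapsack-style selection can be illustrated with a greedy sketch. This is not the exact procedure used for GEM (the paper does not specify a solver); the evaluation costs and budget below are hypothetical, while the utilities echo the interest ratings reported later in the paper:

```python
def select_datasets(candidates, budget):
    """Greedy 0/1-knapsack approximation: take datasets in order of
    utility per unit cost until the evaluation budget is exhausted."""
    chosen, spent = [], 0
    for name, utility, cost in sorted(candidates,
                                      key=lambda c: c[1] / c[2],
                                      reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

# (name, community-interest utility, hypothetical evaluation cost)
candidates = [
    ("XSum", 0.54, 3), ("E2E", 0.58, 1), ("ToTTo", 0.46, 2),
    ("WebNLG", 0.40, 2), ("MLSum", 0.35, 4),
]
print(select_datasets(candidates, budget=6))
```

A greedy ratio heuristic is only an approximation of the optimal knapsack solution, but it captures the trade-off the text describes: high-interest, low-cost datasets are preferred under a fixed resource budget.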
Dataset Elicitation. In the first step, all GEM participants were asked to suggest datasets following the schema provided in Appendix A. The categories included multiple brief categorizations, such as a description of the challenge that this dataset provides, its high-level task, and the communicative goal of an agent trained on the data. Following our goal to focus on non-English languages, we further asked for the languages included in the dataset, as well as the language locale. This step yielded 35 proposed datasets, listed in Appendix B.
Estimating Task+Criterion Utility. The second step focused on the selection of criteria to inform the selection. The initial set of criteria was selected through open discussion involving all members. We split criteria into "hard" and "soft" ones -hard criteria would lead to the definite inclusion/exclusion of a task if (not) satisfied. Soft criteria inform the utility of the remaining tasks. All GEM members filled out a survey asking them to rate, on a 5-point Likert scale, how much they wanted to see a task included in GEM. Additionally, we posed yes/no questions for all considered hard criteria and various questions about the soft criteria (e.g., "what percentage of the tasks should feature non-English language?", or "do we prefer noisy or clean datasets?"). Finally, the survey included open text fields that asked for (1) comments on any of the tasks, (2) comments or suggestions on hard exclusion criteria, and (3) suggestions of additional criterion/criteria. The full list of questions is shown in Appendix C.
The survey received 28 responses, revealing that the initial version of GEM should include a median of 10 tasks or an average of 12. Of those tasks, about a third should feature non-English language.
Selected Criteria. For the hard criteria, there was an agreement to focus only on open-access datasets and that concurrent or past shared tasks for the same datasets are not an issue. Overall, the sentiment determined the following selection principles:

• We focus on diverse high-level tasks over a single high-level task evaluated in-depth. However, each high-level task should include multiple datasets.
• We focus on clean datasets to avoid conflating model mistakes and learned noise.
• We include a mix of high-and low-resource datasets.
• We focus on data with interesting test sets.
• We should not focus on the quality of current evaluation strategies for a given dataset.
• We prefer multi-reference datasets since those have been shown to lead to more robust automatic evaluation.
High-Level Tasks. Since these principles dictate that we should focus on a small set of high-level tasks, we used the free-text replies to evaluate the interest in different high-level tasks. Grouping the proposed tasks yielded the following candidates: Summarization, Dialog, Simplification/Compression, Question Answering, Creative Writing, Data-to-Text, and Question Generation. (For a full overview of potential future expansions and challenges, we refer to the survey by Gatt and Krahmer (2018).) There was a preference to exclude image inputs and question answering because those tasks add complexity to the evaluation beyond the generated text. Moreover, since creative generation tasks like story generation and poetry generation suffer even more from inadequate evaluation approaches, there was a consensus not to include them. There was, however, a strong preference for the high-level tasks Summarization, Data-to-Text, and Dialog. (One may question the absence of Translation from this list. While it is a generation task, we excluded it since Translation already has regular benchmarking efforts with WMT.)

Specific Datasets. The final selection is shown in Table 1. To arrive at the selection, we first ranked all datasets by their average rating. For this, we treated positive ratings as 1, negative ratings as -1, and neutral ratings as 0. The highest-ranked datasets were E2E with 0.577, XSum with 0.538, and ToTTo with 0.461. Unfortunately, non-English datasets were ranked lower, with only WebNLG and MLSum among the top 15 datasets. We grouped all datasets by their high-level tasks and selected a group that would not violate the selection principles (e.g., only high-resource tasks). If two datasets fit, we picked the one with a higher interest rating. Among the 11 datasets, we have 18 different languages, and the dataset sizes range from 5,000 examples to 1.5M, with most datasets between 50-150k examples.
Two of them do not include English at all, which we hope reduces the dependence of the modeling approaches on anglocentric pretraining (Anastasopoulos and Neubig, 2020). The high-level tasks include Dialog, Summarization, Data-to-Text, and Simplification. About half of the datasets have multiple references and more than half had post-processing steps applied to them to ensure high data quality.
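The rating aggregation used for the ranking above is simple arithmetic; a minimal sketch follows. The ballot counts are hypothetical, chosen only to reproduce the reported E2E score of 0.577 for illustration:

```python
def average_rating(votes):
    """Map Likert votes to {+1, 0, -1} and average them,
    as in the dataset ranking described in the text."""
    score = {"positive": 1, "neutral": 0, "negative": -1}
    return sum(score[v] for v in votes) / len(votes)

# Hypothetical ballots from 26 voters: (18 - 3) / 26 ≈ 0.577.
votes = ["positive"] * 18 + ["neutral"] * 5 + ["negative"] * 3
print(round(average_rating(votes), 3))  # 0.577
```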

GEMifying the data
We produce data cards (Bender and Friedman, 2018; Gebru et al., 2018) for all data sets in GEM, for which we developed an NLG-specific template. In addition to describing the data itself, the cards acknowledge potential limitations of a dataset regarding its creation process and describe its real-world use cases to ensure that the research is conducted responsibly.
These datasets are the base selection, and as part of GEM, we may change datasets and how they are used. For example, we may improve the training sets, make the test sets more challenging, or probe for specific skills a model must exhibit with test-only datasets (Perez-Beltrachini and Gardent, 2017; Linzen, 2020; Ribeiro et al., 2020; Schlegel et al., 2020). We may also ask to evaluate a single model on multiple test sets, following the design by Dua et al. (2019).
We are including modifications to several of the datasets: (1) (2) XSum: Summaries in this dataset often have divergence issues between the source and target texts since gold summaries are introductory sentences prefacing each article. Models agnostic to such noise are vulnerable to hallucinations (Dhingra et al., 2019). To combat this, we fine-tuned a BERT-based (Devlin et al., 2019) classifier on 500 document and gold summary pairs, manually annotated for faithfulness (Maynez et al., 2020), and excluded all document-summary pairs from the original XSum dataset for which the classifier was not confident that the summary is faithful to the document (i.e., we kept only pairs with p(faithful) > 0.8). (3) Schema-Guided Dialog: We focus on the response-generation part of the dataset and thus reformatted it to treat the service agent utterances as the targets to be generated and the previous customer utterance and the agent's dialog act as the input. We additionally reformat the dialog acts to directly conform to the format described in the paper (Kale and Rastogi, 2020). (4) WikiLingua: We focus on the same five languages that were benchmarked in its original release (en, es, ru, tr, vi) in a cross-lingual setup in which the inputs are in the respective language and the outputs are in English. However, we re-split the original data to avoid train-test overlaps between languages and provide training data in 13 additional languages (as shown in Table 1). For GEM, we allow submissions trained on any of the languages in isolation or as part of a multilingual model.
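The confidence-based filtering applied to XSum can be sketched as follows. The stand-in classifier below is a toy token-overlap heuristic, not the fine-tuned BERT model used in practice; only the p(faithful) > 0.8 cutoff mirrors the text:

```python
def filter_faithful(pairs, classifier, threshold=0.8):
    """Keep only document-summary pairs that the classifier confidently
    marks as faithful, mirroring the p(faithful) > 0.8 cutoff.
    `classifier` is any callable returning p(faithful) for a pair."""
    return [(doc, summ) for doc, summ in pairs
            if classifier(doc, summ) > threshold]

def toy_classifier(doc, summ):
    """Toy stand-in: fraction of summary tokens present in the document."""
    doc_tokens = set(doc.split())
    summ_tokens = summ.split()
    return sum(t in doc_tokens for t in summ_tokens) / len(summ_tokens)

pairs = [
    ("the river flooded the town overnight", "the river flooded the town"),
    ("the river flooded the town overnight", "officials praised the response"),
]
print(filter_faithful(pairs, toy_classifier))  # keeps only the first pair
```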

Challenge Sets
In addition to applying consistent metrics to existing test sets, understanding specific model behavior, such as model generalization capabilities or performance under targeted cases, is also key for improvement. This is difficult to assess through evaluations on i.i.d. test splits. We thus release challenge sets to evaluate data-to-text and text-to-text models (overview in Table 2). In addition to enabling a more specific breakdown of how a model performs in the presence of challenging inputs, the set of system outputs on these test sets also constitutes a rich corpus that enables further error analysis and research. We apply multiple strategies to create the special test sets, in particular (I) alteration of the existing test sets (e.g., the introduction of distractors), (II) breaking down of the existing sets into subsets with certain properties (e.g., subsets with different complexity), and (III) the compilation of new test sets (e.g., out-of-vocabulary inputs). We restrict the size of each challenge set to about 500 examples to minimize computational overhead. On the WebNLG challenge sets, all subset items are selected proportionally from each category to ensure a similar distribution to the original set; on all other datasets the subset items are selected from the whole set. The results of the different systems on these subsets will be compared to the results obtained by the same systems on the same subsets of the original test data. For case (I), altering existing test sets, the first challenge set adds numerical variation in WebNLG. This variation attempts to respect the format of the current cardinal value (e.g., alpha, integer, or floating-point) and replaces the existing value with a new random value as a means to challenge existing trained models. The generated number is lower-bounded by zero and upper-bounded by the next power of 10 for the given value (e.g., replacing a value of 54 results in a random value between 0 and 100).
Floating values are also bounded to have the same degree of precision as the input value. For structure-to-text and dialog datasets, we produce a version of the test sets in which the order of the components of the input structures (triples, concepts, dialog acts, table rows, etc.) is randomly changed. For text-to-text datasets and Schema-Guided Dialog, we introduce several types of perturbations: (a) typographical errors, using butter-fingers (https://github.com/alexyorke/butter-fingers) with two thresholds 0.02 and 0.05, which respectively correspond to lower and higher error frequencies; (b) removal of the final punctuation sign (if any); (c) substitution of the input text by a backtranslated version, using the backtranslation implementation by Xie et al. (2020). We rejected backtranslation outputs based on character length, ensuring that the difference in character length between original and backtranslation does not exceed 35% of the original source character length. For XSum 99.8% of the backtranslations were accepted, for Wiki-Auto 94.42% (ASSET) and 87.18% (TURK), and for Schema-Guided Dialog 78%.
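The numerical variation described in case (I) can be sketched as follows. This is an illustrative reconstruction assuming positive input values, not the exact WebNLG perturbation code:

```python
import math
import random

def vary_number(value):
    """Replace a numeric value with a random one bounded below by zero and
    above by the next power of ten (e.g., 54 -> random value in [0, 100]),
    preserving the input's type and, for floats, its decimal precision."""
    upper = 10 ** (math.floor(math.log10(abs(value))) + 1) if value != 0 else 10
    if isinstance(value, int):
        return random.randint(0, upper)
    # Keep the same number of decimal places as the input float.
    decimals = len(str(value).split(".")[1])
    return round(random.uniform(0, upper), decimals)

random.seed(0)
print(vary_number(54))    # integer drawn from [0, 100]
print(vary_number(3.25))  # float drawn from [0, 10] with two decimal places
```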
In case (II), breaking down existing sets, we first provide for each dataset random samples of training and validation data, in order to assess to what extent the scores of the different systems drop when run on the test data. Then, specific splits are created for particular datasets, in order to assess possible biases of the models, and their robustness across inputs with different specifications. For ToTTo, test set splits are built according to several aspects that can be identified using WikiData: gender, ethnicity, and nationality grouped by continent. For gender, we compare the performance between male and female people, but cannot compare other genders due to a lack of original data - only seven people in the original test set are marked as having a different gender. We compare across the continent of the underlying nationality to address the issue that data for each country can be very sparse - i.e., only 19 countries are represented by more than ten people and only one of these is located in Africa (Kenya). In case a person has citizenships across multiple continents, we may include the person in any of the included continents. Finally, we compare African Americans vs. all Americans. Ethnicity is very sparsely annotated in WikiData with fewer than 150 annotated test examples in total, 128 of which are African Americans. We thus are unable to compare the performance on, e.g., Yoruba or Punjabi people, both of which have fewer than five instances. Another caveat here is that only 21 of the 128 people are female. Our contrast subset, which can include any US citizens, matches these counts. Across all three challenge subsets, we additionally match the fraction of the existing non-overlap and overlap properties. For WebNLG, we propose subsets based on the shape of the inputs (number of triples, number of common subjects and/or objects, depth, etc.).
For Turk/ASSET, splits are created in terms of the syntactic complexity of the sentences to be simplified. To characterise sentence complexity we use the developmental level scale proposed by Covington et al. (2006). Although Turk and ASSET contain similar input sentences, the human references in Turk were created without allowing sentence splits and ASSET was created by encouraging annotators to split long sentences. For all datasets, we propose splits based on the frequency of the parts that compose the input in the training data; the resulting test sets range from being made of very common components to being made only from components unseen in the training data. For case (III), we collect time-shifted test data for news summarization in the form of articles with COVID-19-related keywords. Since MLSum and XSum were collected before the pandemic, we can measure how a model responds to context not seen in the training data (outside of potential pretraining). The new set of articles covers existing article topics (economy, sports, etc.) but all in relation to the COVID-19 pandemic. In addition, some new topics appear in the collected data derived from outlet sections that were not part of the original data collection.
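The frequency-based splits described above can be sketched as a partition of test inputs by how many of their components were seen during training. The component representation below (strings standing in for RDF triples or concepts) is illustrative:

```python
def split_by_novelty(train_inputs, test_inputs):
    """Partition test inputs by whether their components (e.g., RDF
    triples or concepts) all appear in training data, partially
    appear, or are entirely unseen."""
    train_components = set().union(*map(set, train_inputs)) if train_inputs else set()
    splits = {"all_seen": [], "partially_seen": [], "all_unseen": []}
    for inp in test_inputs:
        seen = sum(c in train_components for c in inp)
        if seen == len(inp):
            splits["all_seen"].append(inp)
        elif seen == 0:
            splits["all_unseen"].append(inp)
        else:
            splits["partially_seen"].append(inp)
    return splits

train = [["capital(France,Paris)", "country(France)"],
         ["capital(Italy,Rome)"]]
test = [["capital(France,Paris)"],                         # fully seen
        ["capital(Italy,Rome)", "population(Italy,59M)"],  # partially seen
        ["capital(Kenya,Nairobi)"]]                        # fully unseen
print({k: len(v) for k, v in split_by_novelty(train, test).items()})
```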

Experimental Setup
Since the GEM test sets and final metrics selection have not been released yet, we describe an experimental setup that will ensure that participating models are trained correctly and evaluated on publicly available data with available metrics that will give a sufficient indication of a model's performance. To do this, we are reporting the results of the baseline models on the validation sets.

Modeling Baselines
Much of the recent modeling progress in NLP can be attributed to the rise of the pretrain-then-finetune paradigm, which has led to consistently better results. This finding is consistent with human judgments for summarization, as shown by Fabbri et al. (2020), among others. However, many of the tasks included in GEM may not benefit from a language model encoder since their input is not natural language. We thus apply a variety of different architectures that vary in size, complexity, and training schema. Our main baselines are T5 with 60M parameters (Raffel et al., 2020) and BART with 139M parameters (Lewis et al., 2020a). For non-English datasets, we use their multilingual counterparts mT5 in various sizes (Xue et al., 2020) and mBART (Liu et al., 2020b). We additionally train the following baselines on a subset of tasks: TGen (with added language model and lemma tags, denoted TGen+/++) (Dušek and Jurčíček, 2016b), an architecture for generation from dialog acts; an LSTM-based sequence-to-sequence model with attention (Bahdanau et al., 2015); DialoGPT (Zhang et al., 2020c), a pretraining approach for conversational models; and PEGASUS, which uses a summarization-specific pretraining schema that masks and predicts entire sentences. For WikiLingua, we additionally report results on a setup which includes first training a monolingual model followed by finetuning with the correct source language, coupled with synthetic data generated through translation (mBART+). Almost all baselines can be reproduced in a GPU-based Colaboratory notebook within 2-3 hours.

Automated Evaluation
As mentioned above, GEM provides a testbed for automated metrics and can be used to popularize newly developed ones. Models are thus evaluated via a constantly expanding list of metrics and, to avoid overfitting to known metrics, we will evaluate test submissions with metrics that are not included in this initial writeup. Consequently, the baseline results are an incomplete list which will be expanded upon the announcement of the test metrics. The set of metrics can be computed via the framework described at https://gem-benchmark.com/shared_task, which comprises metrics in the following categories:
Lexical Similarity. We include multiple "traditional" metrics as baselines, notably BLEU (Papineni et al., 2002), ROUGE-1/2/L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005). These metrics can often be gamed; for example, ROUGE can be improved by increasing the output length of the model (Sun et al., 2019). Moreover, the reliability of these metrics depends on the quality and number of the references (Mathur et al., 2020a; Freitag et al., 2020). However, at the system level, they still correlate well with human judgments for some tasks (Reiter, 2018).
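To illustrate what lexical-overlap metrics measure (and why they can be gamed), the following is a minimal pure-Python sketch of ROUGE-N-style n-gram recall. It uses naive whitespace tokenization and omits stemming and other details of the official implementations, which should be used for any reported numbers.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """N-gram recall: overlapping n-grams / n-grams in the reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped counts via multiset intersection
    total = sum(ref.values())
    return overlap / total if total else 0.0

# A fluent but unfaithful output can still score highly
# as long as it reuses many of the reference's words.
print(rouge_n("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

Because the score depends only on overlapping word sequences, an output that reorders or misattributes facts from the reference can receive the same score as a faithful one.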
Semantic Equivalence. More recently, metrics that rely on pretrained language models have shown improved correlations with human judgments at the segment level. We thus include BERTScore (Zhang et al., 2020b), a metric based on the similarity of contextual token embeddings, and BLEURT (Sellam et al., 2020), a metric fine-tuned on human ratings. The reported baseline results use RoBERTa-large and mBERT (Devlin et al., 2019) for BERTScore and the English-only BLEURT-base-128 for BLEURT.
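The core of BERTScore is a greedy soft alignment between candidate and reference tokens in embedding space. The sketch below shows that matching step with plain lists standing in for contextual embeddings; the real metric obtains these vectors from a pretrained model and additionally supports IDF weighting and baseline rescaling, which are omitted here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two nonzero vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bertscore_f1(cand_vecs, ref_vecs):
    """Greedy matching: each token is paired with its most similar counterpart.

    Recall averages over reference tokens, precision over candidate tokens.
    """
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    return 2 * precision * recall / (precision + recall)
```

Because each token only needs a close semantic neighbor rather than an exact string match, paraphrases can score well where lexical metrics would not.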
Probing for Faithfulness. Another family of metrics, which has shown promise in summarization, probes generated text for faithfulness. The approach relies on the insight that a reader of a reference and of a generated summary should be able to answer the same questions, regardless of how the summary is phrased. There has been much development toward these QA-based approaches (Eyal et al., 2019; Scialom et al., 2019; Durmus et al., 2020; Wang et al., 2020, among others), and they can provide an alternative angle on model evaluation that does not highly correlate with other evaluation approaches (Fabbri et al., 2020). While most related work on these metrics is limited to summarization, we evaluate systems using a QA-based method called QuestEval (Scialom et al., 2021) that supports all of our tasks.
In addition to QA-based evaluation, there have also been related efforts to develop more fine-grained and interpretable evaluation metrics, for example to measure consistency in data-to-text problems (Opitz and Frank, 2020; Dhingra et al., 2019). We use one such metric called NUBIA (Kane et al., 2020), the NeUral Based Interchangeability Assessor, which combines multiple measures such as entailment and similarity into a decomposable and interpretable score.
(Table 4: Results of the baselines we release with GEM, focusing on the diversity of the outputs and neutral system characterizations.)
Diversity. As argued by Hashimoto et al. (2019) among many others, NLG models intrinsically trade off diversity and quality. A model can produce more diverse outputs through sampling, but at the cost of output quality. To account for this aspect, we compute multiple diversity metrics, starting with those proposed for the analysis of the results of the E2E NLG challenge (Dušek et al., 2020) and by van Miltenburg et al. (2018). These include the Shannon entropy (Shannon and Weaver, 1963) over unigrams and bigrams (H1, H2), the mean segmented type-token ratio over segment lengths of 100 (MSTTR; Johnson, 1944), the ratio of distinct n-grams to the total number of n-grams (Distinct-1/2), and the count of n-grams that appear only once across the entire test output (Unique-1/2; Li et al., 2016).
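The diversity measures above have simple closed forms; the following compact sketch computes them from a whitespace-tokenized test output. Official implementations may differ in tokenization and in how short texts are handled (the MSTTR fallback for texts shorter than one segment is our assumption here).

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def shannon_entropy(tokens, n=1):
    """H_n: Shannon entropy (bits) of the n-gram distribution."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def distinct_n(tokens, n=1):
    """Distinct-n: distinct n-grams divided by total n-grams."""
    counts = ngram_counts(tokens, n)
    return len(counts) / sum(counts.values())

def unique_n(tokens, n=1):
    """Unique-n: n-grams occurring exactly once in the whole output."""
    return sum(1 for c in ngram_counts(tokens, n).values() if c == 1)

def msttr(tokens, segment=100):
    """Mean type-token ratio over fixed-length segments."""
    segs = [tokens[i:i + segment] for i in range(0, len(tokens) - segment + 1, segment)]
    if not segs:  # assumption: fall back to plain TTR for short texts
        return len(set(tokens)) / len(tokens)
    return sum(len(set(s)) / len(s) for s in segs) / len(segs)
```

A degenerate system that repeats one high-scoring output for every input will show near-zero entropy and Distinct-n, even if its quality metrics look reasonable.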
System Characterization. The final section of metrics will characterize the systems. While the focus of this section will be on qualitative descriptions through model cards, we also gather quantitative information that is not necessarily associated with a judgment. As part of this, we collect the number of parameters of a system, as suggested by Ethayarajh and Jurafsky (2020). For each task, we additionally report the vocabulary size over the output (|V|) and the mean output length of a system (Sun et al., 2019).
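The two quantitative characterizations can be computed directly from the system outputs; a minimal sketch (assuming whitespace tokenization, which the official framework may refine):

```python
def characterize(outputs):
    """Output vocabulary size |V| and mean output length in tokens."""
    tokenized = [o.split() for o in outputs]
    vocab = {tok for toks in tokenized for tok in toks}
    mean_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    return {"|V|": len(vocab), "mean_output_length": mean_len}
```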

Results
One of the central aims of GEM is to measure the progress in NLG without misrepresenting the complex interactions between the sometimes contradicting measures. We will thus not distill the complex interplay of the data, metrics, and model outputs into a single number or statement, and we do not present results in a traditional leaderboard. Instead, we developed an interactive result exploration system that allows analyses of model results, and which we describe in this section. To further motivate this change, consider the following conclusion someone may draw from looking at a leaderboard: "System Foo performs the best."
Our interactive system aims to enable more nuanced statements such as: "System Foo leads to consistent performance increases in Bar-type metrics on challenges that measure Baz, while maintaining equal performance on most metrics of type Qux."
A screenshot of our system is presented in Figure 2. In addition, our baseline results are presented in a tabular view in Tables 3 and 4. Our interactive system is centered around a parallel coordinates plot (Inselberg, 1985), which shows all results as lines through parallel axes. Every line intersects the axes at the corresponding mapped values. For instance, see the red line representing the results of the baseline "t5-small" on the task "ToTTo". Filters can be applied along axes (see the BLEURT axis in Figure 2), and the filtered selection is highlighted through bold lines. A selection can be a set of metrics, systems, or tasks. This style of presentation has not been used before for a benchmark. The closest prior work is by Fu et al. (2020) for named-entity recognition, which allows similar filtering and sorting but presents the results in a table. However, the parallel coordinates approach can scale to a much greater number of metrics than a table. Moreover, by using a parallel coordinates plot instead of a table, it is easy to spot patterns that span multiple metrics, systems, or tasks. For example, the highlighted line in Figure 2 uncovers that the T5 baseline on ToTTo scores higher on diversity metrics than other systems while scoring lower on reference-based metrics. Since we only have a single baseline for ToTTo, it is unclear whether this difference can be attributed to the dataset or the system, but this relationship will be uncovered once we receive submissions.
The final system will additionally be able to display the model cards and other related meta-information associated with submissions. It will also be able to show (and compare) exemplary outputs for each test set. These two features will improve the transparency of the results and systems for those who are not familiar with a task and provide the necessary information to those who consider using a particular system. The combination of all components will enable analysis at the quantitative, individual, and qualitative levels, which can support formulating new research hypotheses and gathering in-depth insights about system performance. For example, the functionality to compare human annotations and automatic measures could lead to a better understanding of how fluency affects BERTScore. (An initial version showcasing our baseline results is deployed on our website.) In addition to the interactive, self-directed result exploration, our shared task features an evaluation and analysis part. Instead of dictating the interpretation of the modeling shared task results, we will release all system outputs and metrics in this second part, and its participants may run their own evaluations and conduct their own analyses.

Submitting to the benchmark
While we ask submitters to cover as many tasks as possible, we acknowledge potential restrictions on computational resources. We thus do not require that a submission to GEM include predictions on every included test and challenge set. All predictions from a model should be formatted and combined into a single file as outlined on our website.
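The authoritative file format is specified on the GEM website; purely as an illustration of the one-file-per-submission idea, a hypothetical layout (all task and field names here are made up) could be assembled as follows:

```python
import json

# Hypothetical layout: one JSON file mapping each covered (sub)task to its
# list of predictions in test-set order. Tasks a model does not cover are
# simply omitted. This is NOT the official schema; see the GEM website.
submission = {
    "submission_name": "my-baseline",
    "tasks": {
        "totto_validation": {"values": ["Generated output 1", "Generated output 2"]},
    },
}

with open("gem_submission.json", "w") as f:
    json.dump(submission, f, indent=2)
```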
In addition, we require every submitter to answer a series of questions that we will use to construct a model card (Mitchell et al., 2019) and externalize potential concerns regarding the social impact of a model and its use, or its training data. The card will additionally display information to replicate the experiments. While we require responses to these questions at submission time, we allow the information about a model to remain anonymous during required anonymization periods should a paper describing the model be under submission elsewhere. All submitted model outputs will be made publicly available for download.
After a submission, we will run the evaluation suite on the submitted outputs and additionally collect human annotations.
Human Evaluation
GEM will be used to develop reproducible and consistent human evaluation strategies for generated text. This involves selecting and defining which qualities of the generated text should be measured, developing annotation schemes and rater guidelines to capture these qualities accurately, and building infrastructure to annotate system outputs.
We aim to develop these setups for all task types, such as summarization, dialogue, simplification, and data-to-text. To approach this, we will follow the recently proposed taxonomy of human evaluation measures by Belz et al. (2020) and the reporting strategies proposed by Howcroft et al. (2020). The detailed setups will be described in an evaluation datasheet (Shimorina and Belz, 2021).
All shared task participants will be asked to provide gold annotations of system outputs, which we will then use to evaluate the consistency of crowdsourced annotations.

Next Steps
This section lists the currently active developments and the long-term steps we will take to ensure that GEM will continue to evolve and improve.

Collecting more multilingual data
Many of the initial datasets in GEM are focused on (American or British) English; we see this release as a starting point for the collection of new datasets that improve the inclusion of other languages and cultures. To ensure the longevity of the benchmark, we want the new tasks to be practical and socially beneficial. Through GEM, we have developed a set of desired criteria for NLG datasets, and we aim to apply this knowledge to data collection and to actively work toward reducing the disparity in data availability between languages (Joshi et al., 2020). To this end, we are focusing on a task that requires content selection, planning, and surface realization in a grounded scenario. The idea is in the prototyping stage, with prospects broadly toward dialog response generation and topic summarization in multiple languages. We plan to collect this data by collaborating with speakers of low-resourced languages through a participatory research approach, as suggested by ∀ et al. (2020). Toward this goal, GEM welcomes anyone interested in collaborating on this effort.

Personalizing and Controlling NLG
GEM currently focuses on tasks that deterministically transform an input into an output. With the increasing use of NLG models in real-world applications, how to enable and evaluate personalized NLG systems (e.g., with respect to dialect or formality) remains challenging. Several related tasks have been proposed, for example, the transfer of writing style from informal to formal (Rao and Tetreault, 2018), personalization of machine translation systems to align with particular personal traits (Mirkin and Meunier, 2015), or persona-guided response generation for dialogue systems. We envision our framework to be extended (e.g., with datasets and evaluations) to incorporate this line of user-focused NLG.

Regular updates to the living benchmark
To realize the benefits of a living benchmark that is focused on evaluation, we commit to regular updates of GEM. We invite contributions in the form of model outputs, analyses, and metrics at any time and will automatically update the results presented on our website to incorporate them. For updates to the dataset selection, we want to consider the input of the wider NLG research community. To do so, we will set up a yearly selection process similar to the one described in Section 3. The first update process will be run after the GEM workshop at ACL 2021. To enable robust comparisons between different versions of GEM, we will only replace a small subset of datasets at a time.

Conclusion
In this paper, we have introduced GEM, a living natural language generation benchmark with a focus on evaluation. While GEM does not claim to instantly solve all issues of benchmarks in NLG, we aim to provide an environment in which systems can be tested in a principled manner and which can elevate the prominence of interesting evaluation approaches. By providing a testbed to easily conduct experiments across many datasets and evaluate in a repeatable, consistent, and more interpretable way, we will be able to track progress toward the goals in NLG research much more clearly. Moreover, we will be able to extend and shape GEM in the future to include more multilingual datasets, which will assist in their adoption across the wider research community.

Contribution Statements
GEM is a large effort with a decentralized organization that is split into different task-specific subgroups. To acknowledge everyone's contribution, we list the contribution statements below for all groups.
Steering Committee. Antoine Bosselut, Esin Durmus, Varun Prashant Gangal, Sebastian Gehrmann, Laura Perez-Beltrachini, Samira Shaikh, and Wei Xu make up the steering committee. Sebastian Gehrmann coordinates and leads the GEM effort. All others provide feedback, discuss larger decisions regarding the direction of GEM, and act as conference organizers for the ACL 2021 workshop.
Table2Text. Varun Gangal and Miruna Clinciu are part of this group. Miruna Clinciu was primarily responsible for DART and Varun Gangal for ToTTo, while maintaining close correspondence between the two to ensure that all steps, such as code structure, preprocessing primitives, and baselines, were as uniform as possible.
Simplification. Dhruv Kumar, Mounica Maddela, and Wei Xu contributed to the GEM Simplification task. Dhruv Kumar created the data cards for the datasets, added the Wiki-Auto and Turk/ASSET datasets to TFDS, and integrated the SARI metric into the GEM evaluation framework. Mounica Maddela created baselines for the task and added the Turk benchmark corpus to Hugging Face and TFDS. Wei Xu helped in the organization and planning of the task setup.
Automated Evaluation. Ondrej Dusek wrote the base code and included BLEU, METEOR, ROUGE, and referenceless metrics (the latter based on code supplied by Emiel van Miltenburg). He also prepared reference sets for E2E, Czech Restaurants, and WebNLG. Sebastian Gehrmann included BLEURT and BERTScore and prepared their reference sets. Dhruv Kumar included SARI and adapted the code for source-based metrics. Nishant Subramani helped with code refactoring. Miruna Clinciu, Emiel van Miltenburg, and Thibault Sellam provided feedback and participated in discussions.
Human Evaluation. Samira Shaikh was the point of contact for this working group. She led the discussions to make progress on the group's goals. She also worked with the group to select the general evaluation criteria as well as the criteria for the dialogue and simplification tasks. Khyathi Chandu and Miruna Clinciu worked on selecting evaluation criteria for the summarization task and participated in the group discussions. Simon Mille provided support on using the criteria taxonomy and the annotated evaluation sheets for selecting and defining the criteria to use, and worked on selecting the data-to-text criteria. Vitaly Nikolaev and Sashank Santhanam worked on selecting evaluation criteria for the dialog and simplification tasks. João Sedoc worked with the group to select the general evaluation criteria as well as the specific ones for dialog and simplification. He also helped to select among annotation interfaces. Anastasia Shimorina worked with the group to select the evaluation criteria and participated in the discussions. Chris Emezue, Sebastian Gehrmann, Khyati Mahajan, and Yufang Hou participated in discussions.
Crowdsourcing New Data. Chris Emezue, Rubungo Andre Niyongabo, Aremu Anuoluwapo, Khyathi Chandu, Yufang Hou, Samira Shaikh, Varun Prashant Gangal, and Dimitra Gkatzia are members of this group. Khyathi Chandu worked on identifying where the current datasets fall short, to motivate the crowdsourcing of data for a new task. Based on suggestions from collaborators, she wrote two task proposals in the domains of long-form text, conversations, and data-to-text that address an array of challenges in generation and easily scale to multiple languages. Samira Shaikh participated in the discussions and gave feedback on the task proposals in the pilot study phase. Dimitra Gkatzia looked into potential resources for crowdsourcing. Chris Emezue and Rubungo Andre Niyongabo explored potential low-resource African languages for crowdsourcing.
We are in the process of piloting the tasks internally.