GLGE: A New General Language Generation Evaluation Benchmark

Multi-task benchmarks such as GLUE and SuperGLUE have driven great progress in pretraining and transfer learning in Natural Language Processing (NLP). These benchmarks mostly focus on a range of Natural Language Understanding (NLU) tasks, without considering Natural Language Generation (NLG) models. In this paper, we present the General Language Generation Evaluation (GLGE) benchmark, a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks. For each task, we further design three subtasks of increasing difficulty (GLGE-Easy, GLGE-Medium, and GLGE-Hard), yielding 24 subtasks for comprehensive model comparison. To encourage research on pretraining and transfer learning for NLG models, we make GLGE publicly available and build a leaderboard with strong baselines including MASS, BART, and ProphetNet (the source code and dataset are publicly available at https://github.com/microsoft/glge).


Introduction
Pretrained language models, such as BERT (Devlin et al., 2019) and other advanced pretrained models (Raffel et al., 2020; Yang et al., 2019; Alberti et al., 2019; Brown et al., 2020; Clark et al., 2020), have made great progress on a host of Natural Language Understanding (NLU) tasks. Meanwhile, the development of general evaluation benchmarks has also helped drive the progress of these models. These benchmarks usually use an overall score to evaluate the performance of models across a wide range of NLU tasks. In addition to GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a), which are general language understanding evaluation benchmarks for English, several general language understanding evaluation benchmarks for other languages have been proposed, such as CLUE (Xu et al., 2020) for Chinese, FLUE (Le et al., 2020) for French, and IndoNLU (Wilie et al., 2020) for Indonesian. Furthermore, multilingual multi-task benchmarks such as XTREME and XGLUE have been proposed for cross-lingual evaluation.
In addition to NLU tasks, an increasing number of pretrained language models designed for Natural Language Generation (NLG) tasks have recently been proposed, such as MASS (Song et al., 2019), BERT-share (Rothe et al., 2020), BART (Lewis et al., 2020), ProphetNet, and ERNIE-GEN (Xiao et al., 2020). However, the language generation capabilities of these models are usually evaluated with different tasks, datasets, and metrics, which cannot provide a coherent and comprehensive evaluation. Although there are several general evaluation benchmarks as mentioned above, none of them is particularly designed for general language generation evaluation.
To fill this gap, we introduce the General Language Generation Evaluation (GLGE) benchmark, a new multi-task benchmark for evaluating the generalization capabilities of NLG models in English. It contains eight English language generation tasks, covering text summarization, question generation, generative question answering, and dialogue. We select six pre-existing popular datasets and introduce two new datasets drawn from real-world scenarios. Moreover, in order to provide more diversified difficulty challenges, we employ two simple but effective strategies to build three NLG evaluation benchmarks of increasing task difficulty (called GLGE-Easy, GLGE-Medium, and GLGE-Hard).
To better understand the challenges posed by GLGE, we conduct experiments with existing widely used non-pretrained models (e.g., vanilla LSTM Seq2Seq (Bahdanau et al., 2015), vanilla Transformer (Vaswani et al., 2017)) and pretrained models (e.g., MASS (Song et al., 2019), BART (Lewis et al., 2020), and ProphetNet). We further analyze the n-gram diversity of the output samples. The experimental results show that there is a large performance gap between the pretrained models and the non-pretrained models. However, on the GLGE-Hard benchmark, the performance of the pretrained models still has great room for improvement.
In summary, the contributions of this work are five-fold: (1) a new multi-task NLG evaluation benchmark consisting of eight distinct datasets across four kinds of typical NLG tasks, (2) three NLG evaluation benchmarks of different difficulty levels, (3) standardized evaluation metrics and scripts for model evaluation and comparison, (4) open-sourced baselines and a public leaderboard for the benchmark, and (5) a thorough comparative study of existing widely used non-pretrained models and pretrained models with a detailed analysis of the results.

Design Principles
For the GLGE benchmark, we design and select the NLG tasks based on the following principles:

Task Diversity
The tasks in GLGE focus on evaluating the generalization capabilities of an NLG model, varying the task, the length of the input text, the length of the output text, the type of generated text, and the size of the dataset.

Task Difficulty
The tasks in GLGE should be challenging but solvable, which can encourage researchers to design better NLG models. Furthermore, we aim to provide benchmarks of different difficulty levels like GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a), which allows researchers to comprehensively evaluate their models. Researchers can also select the benchmark with moderate difficulty according to the size of the model and the scale of the pretraining corpus used for comparison (the leaderboard is hosted at https://microsoft.github.io/glge/).

Ease of Evaluation
The tasks in GLGE should be easily evaluated automatically. For unconditional, open-ended, and weakly conditional language generation tasks (e.g., answer-agnostic question generation, single-turn chit-chat response generation, and story generation), reasonable generation results are diverse. Because the number of references available for automatic evaluation of text generation is limited, automatic evaluation of those tasks is more difficult. Therefore, instead of selecting unconditional and weakly conditional language generation tasks, we tend to select language generation tasks with stronger conditions (e.g., answer-aware question generation), which makes the automatic evaluation more convincing.

Task Popularity
Most tasks in GLGE should use widely-used NLG datasets, which have been implicitly agreed upon by the NLG community as challenging and meaningful. Since GLGE is mainly designed for evaluating the generalization capabilities of English pretrained NLG models, task selection also draws on several related works on NLG pretraining, such as MASS (Song et al., 2019), BART (Lewis et al., 2020), ProphetNet, and ERNIE-GEN (Xiao et al., 2020).
Based on the above principles, we invite 10 NLG experts to discuss and vote on existing widely-used NLG datasets. Note that since GLGE is designed for evaluating the generalization capabilities of NLG models in English, we do not include cross-lingual NLG tasks, such as machine translation and cross-lingual text summarization. Finally, we select six existing popular NLG datasets. We also introduce two new datasets drawn from real-world scenarios, which gives GLGE more practical value. Unlike the existing datasets, the test sets of these two new datasets are hidden, which further ensures the fairness of the evaluation results. The input and output sequences of the selected tasks are all well-defined. We preprocess them and provide the input and output sequence pairs directly, allowing researchers to focus on model improvements.

Tasks and Datasets
GLGE contains eight English NLG tasks, covering text summarization, question generation, generative question answering, and dialogue. Descriptions and statistics of these tasks are shown in Table 1, with concrete examples shown in Appendix.

Abstractive Text Summarization
As a typical NLG task, abstractive text summarization aims to generate a short and fluent summary of a long text document. GLGE contains four abstractive text summarization tasks. As discussed in Bhandari et al. (2020), we use ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004) as the metrics for these tasks. The CNN/DailyMail (Hermann et al., 2015) dataset contains 220K articles from the Daily Mail newspaper and 93K articles from CNN. Each article contains a bullet-point summary. GLGE uses the non-anonymized variant of See et al. (2017). After pre-processing, there are 311,971 ⟨article, summary⟩ pairs, where the source input is the article and the target output is the summary, which consists of multiple sentences.
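To make the metric concrete, the following is a minimal sketch of ROUGE-N as an F1 over clipped n-gram overlap. It is illustrative only: the official ROUGE toolkit additionally applies stemming, handles multiple references, and reports bootstrap confidence intervals, none of which appear here.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N F1 between two token lists.

    Counts clipped n-gram overlap only; the official ROUGE scripts
    add stemming and multi-reference handling."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    if not cand or not ref:
        return 0.0
    # Clipped overlap: each n-gram counts at most min(cand, ref) times.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_n("police killed the gunman".split(), "the gunman was shot down by police".split(), n=1)` overlaps on three unigrams out of four candidate and seven reference tokens.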
Gigaword (Rush et al., 2015) contains 4M examples extracted from the news articles of the Gigaword corpus (Graff et al., 2003). After the pre-processing, there are 3,995,559 ⟨passage, summary⟩ data pairs, where the source input is the first sentence of the article, and the target output is the headline that usually contains a single sentence.
XSum (Narayan et al., 2018) consists of 227K online articles from the British Broadcasting Corporation (BBC), which contain professionally written single-sentence summaries. After pre-processing, there are 226,677 ⟨article, summary⟩ data pairs, where the source input is the news article and the target output is a single-sentence summary.
MSNews MicroSoft News headline generation (MSNews) is a new news headline generation dataset we collected for GLGE. We randomly select 151K online news articles published between 2012-01-01 and 2020-09-01 from a real-world news search engine. Each article contains a professionally written single-sentence headline. After pre-processing, there are 151,140 ⟨article, headline⟩ data pairs, where the source input is the news article and the target output is a news headline.

Answer-aware Question Generation
The question generation task is another typical NLG task, which aims to generate a question based on a given text passage or document. Compared with answer-agnostic question generation, where many reasonable questions can be generated, answer-aware question generation (Zhou et al., 2017) requires generating a question that asks about the given answer span in a given text passage or document. In order to facilitate automatic evaluation, GLGE selects two answer-aware question generation tasks: the SQuAD 1.1 (Rajpurkar et al., 2016) dataset contains over 100K crowd-worker-created questions with the corresponding answer spans in 536 Wikipedia articles. Since the original test set of SQuAD 1.1 is hidden, we follow (Du et al., 2017; Zhao et al., 2018) to re-split the dataset using the examples from the original training set and development set. After pre-processing, there are 98,169 ⟨answer, passage, question⟩ data triples, in which the source input is a Wikipedia passage along with an answer span, and the target output is a question. ROUGE-L, BLEU-4 (Papineni et al., 2002), and METEOR (Banerjee and Lavie, 2005) are used as the metrics.

MSQG MicroSoft Question Generation (MSQG) is another new challenge dataset we collected, in which the questions are freely edited by everyday users. For MSQG, we collect 220K passages from a real-world search engine. Each passage contains a highlight span and a related query; we regard the queries as questions in this dataset. After pre-processing, there are 220,088 ⟨highlight span, passage, question⟩ data triples, where the source input is a news passage along with a highlight span, and the target output is a user question. ROUGE-L, BLEU-4, and METEOR are used as the metrics.

Conversational Question Answering
Conversational question answering is a classic and popular generative question answering task. Compared with extractive question answering, such as SQuAD (Rajpurkar et al., 2016), conversational question answering requires the model to answer a question based on a running conversation history and the given passage.
CoQA (Reddy et al., 2019) dataset contains 127K questions with answers, obtained from 8K conversations about text passages from seven diverse domains. After the pre-processing, there are 116,630 ⟨conversation history, passage, question, answer⟩ data 4-tuples, where the source input is a sequence of conversation history along with a given question and a given passage, and the target output is a freeform answer text. F1-Score (Rajpurkar et al., 2016) is used as the metric.
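As an illustration of the metric, the SQuAD-style F1-Score measures token overlap between a predicted answer and the gold answer. This is a simplified sketch: the official evaluation additionally lowercases and strips articles and punctuation before tokenizing, which is omitted here.

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """Token-overlap F1 between two answer strings (SQuAD-style).

    Uses plain whitespace tokenization; the official script also
    normalizes case, articles, and punctuation."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For instance, predicting "in the park" against the gold answer "the park" shares two tokens, giving precision 2/3 and recall 1.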

Personalized Dialogue
Conversational AI is an important topic in NLG.
Compared with text summarization, the responses of single-turn conversations are diverse and may lack specificity, so it is hard to use a single ground-truth reference for automatic evaluation. We select the personalized dialogue task, which is a challenging multi-turn conversation task. In addition to the conversation history, this task provides profile information as an additional condition to facilitate specific response generation.
The PersonaChat (Zhang et al., 2018) dataset consists of about 160K utterances and requires the model to generate responses according to a given multi-turn conversation and persona profile. After pre-processing, there are 151,157 ⟨persona profile description text, conversation history, response⟩ data triples, where the source input is a sequence of conversation history along with several sentences of persona profile description, and the target output is a response. BLEU-1, BLEU-2, Distinct-1, and Distinct-2 (Li et al., 2016) are used as the evaluation metrics.

Overall Score
Similar to GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a), we seek to report an overall system performance over all GLGE tasks by aggregating the scores of all tasks. We follow GLUE in adopting a simple approach that weighs each task equally. For tasks with multiple metrics, we first average those metrics to obtain a task score. In addition, because the original Distinct-1 (D-1) and Distinct-2 (D-2) (Li et al., 2016) scores used as metrics for the dialogue task are usually quite small (less than 0.01), we re-scale them by 100.0 so that these values are in the same order of magnitude as the other scores.
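The aggregation above can be sketched as follows. The metric names and values are illustrative placeholders, not the actual leaderboard numbers; the only assumptions carried over from the text are equal task weighting, per-task metric averaging, and the 100x rescaling of Distinct scores.

```python
def overall_score(task_metrics):
    """GLGE-style overall score from a dict of per-task metric dicts.

    Multi-metric tasks are first averaged into one task score; tasks
    are then weighted equally. Distinct-1/2 values (typically < 0.01)
    are rescaled by 100 to match the magnitude of the other metrics."""
    task_scores = []
    for metrics in task_metrics.values():
        values = []
        for name, value in metrics.items():
            if name.startswith("Distinct"):
                value *= 100.0  # re-scale D-1/D-2 as described above
            values.append(value)
        task_scores.append(sum(values) / len(values))
    return sum(task_scores) / len(task_scores)
```

For example, a task scored (40.0, 20.0, 30.0) averages to 30.0, and a dialogue task scored (BLEU-1 = 40.0, Distinct-1 = 0.002) averages to 20.1 after rescaling, giving an overall score of 25.05.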

Challenges of Three Difficulty Levels
As discussed in § 2.1.2, GLGE provides three levels of difficulty for each task, called GLGE-Easy, GLGE-Medium, and GLGE-Hard. The original 8 task datasets described in § 2.2 constitute GLGE-Easy. Based on GLGE-Easy, we employ two strategies to further increase the task difficulty.

Low-resource. We increase the difficulty of GLGE by simulating low-resource scenarios. For each task, we keep the test and development sets of GLGE-Easy and randomly reduce the training data to 50% of the original training set. The datasets of the 8 tasks under this setting constitute GLGE-Medium.

Low-frequency. In order to further evaluate the generalization capability of NLG models, we increase the difficulty of GLGE by reducing the word overlap rate between the outputs of the training set and the outputs of the test set. The motivation is that a good NLG model should be able to generate a fluent target output based on the input information, even if the target output contains some low-frequency words. GLGE-Hard reuses the test and development sets of GLGE-Easy. To build its training set, we first count the frequency of each token in the target sentences of the test set. Then, for each training sample in GLGE-Easy, we remove the stop words from its target sentence and rank the samples by their word frequency score on the test set. This can be formulated as

score(y) = (1 / |y|) * Σ_{w_y ∈ y} TF(w_y),

where y is a target sentence without stop words, w_y is a token in y, TF(w_y) denotes the frequency of the token w_y in the target sentences of the whole test set, and |y| denotes the token length of y. Instead of reducing the training data randomly as in GLGE-Medium, we select the 25% of training data with the minimum word frequency score as the training set of each dataset. The datasets of the 8 tasks under this setting constitute GLGE-Hard.
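The low-frequency selection can be sketched as follows. This is a schematic reimplementation under simplifying assumptions (whitespace tokenization, a caller-supplied stop-word set); the paper's actual preprocessing pipeline may differ in those details.

```python
from collections import Counter

def low_frequency_subset(train_targets, test_targets, stop_words, keep_ratio=0.25):
    """Indices of the keep_ratio fraction of training examples whose
    stop-word-filtered targets have the lowest average test-set token
    frequency, following the GLGE-Hard low-frequency strategy."""
    # TF(w): frequency of token w over all test-set target sentences.
    tf = Counter(w for sent in test_targets for w in sent.split())

    def score(target):
        tokens = [w for w in target.split() if w not in stop_words]
        if not tokens:
            return float("inf")  # nothing left to score; rank last
        return sum(tf[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(train_targets)), key=lambda i: score(train_targets[i]))
    keep = max(1, int(len(train_targets) * keep_ratio))
    return sorted(ranked[:keep])
```

Targets dominated by words that are rare (or absent) in the test-set outputs score lowest and are kept, which is exactly what makes the resulting training set harder to generalize from.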

Baselines
For the baselines, we first evaluate two widelyused non-pretrained models: vanilla LSTM based Seq2Seq (Bahdanau et al., 2015) and vanilla Transformer (Vaswani et al., 2017). Besides, we evaluate several widely used pretrained NLG models, including MASS (Song et al., 2019), BART (Lewis et al., 2020), and ProphetNet . To further evaluate the performance of the pretrained NLG models of different model sizes and the different scales of the pretraining corpus, we compare the MASS base , ProphetNet base , MASS middle , BART large , and ProphetNet large on GLGE.

Implementation Details
Vanilla LSTM (Bahdanau et al., 2015). The hyper-parameters and implementation of LSTM-Seq2Seq are based on the LSTM register model of Fairseq 4, where the word embedding dimension, the hidden size, the number of encoder layers, and the number of decoder layers are 512, 512, 1, and 1, respectively. For each task in GLGE, we use Adam (Kingma and Ba, 2015) with an initial learning rate between 0.0001 and 0.0003, and train the LSTM-Seq2Seq for a maximum of 100 epochs.
Vanilla Transformer (Vaswani et al., 2017). The hyper-parameters and implementation of Transformer are based on the transformer_vaswani_wmt_en_de_big register model of Fairseq 5, which contains a 6-layer encoder and a 6-layer decoder with 1024 embedding/hidden size and 4096 feed-forward filter size. For each task in GLGE, we use Adam with an initial learning rate between 0.0003 and 0.001, and train the Transformer for a maximum of 20 epochs.

MASS base (Song et al., 2019). The hyper-parameters and implementation of MASS are based on their source code 6. MASS base contains a 6-layer encoder and a 6-layer decoder with 768 embedding/hidden size and 3072 feed-forward filter size. MASS base is pretrained on BookCorpus (Zhu et al., 2015) and English Wikipedia (16GB in total). For each task in GLGE, we fine-tune MASS base with the same hyper-parameters used in their source code 7 for a maximum of 25 epochs.

MASS middle (Song et al., 2019) contains a 6-layer encoder and a 6-layer decoder with 1024 embedding/hidden size and 4096 feed-forward filter size. MASS middle is also pretrained on BookCorpus and English Wikipedia (16GB in total). For each task in GLGE, we use the same hyper-parameters as for MASS base.

ProphetNet base. The hyper-parameters and implementation of ProphetNet are based on their source code 8. ProphetNet base contains a 6-layer encoder and a 6-layer decoder with 768 embedding/hidden size and 3072 feed-forward filter size. Similar to MASS, ProphetNet base is pretrained on BookCorpus and English Wikipedia (16GB in total) with 125K steps. For each task in GLGE, we fine-tune ProphetNet base with the same hyper-parameters used in their source code for a maximum of 10 epochs.

ProphetNet large. For each task in GLGE, we fine-tune ProphetNet large with the same hyper-parameters used in their source code for a maximum of 10 epochs.

BART large (Lewis et al., 2020). The hyper-parameters and implementation of BART large are based on the source code 9. BART large contains a 12-layer encoder and a 12-layer decoder with 1024 embedding/hidden size and 4096 feed-forward filter size. BART large is pretrained on 160GB of news, books, stories, and web text. For each task in GLGE, we fine-tune BART large with the same hyper-parameters used in their source code 10 for a maximum of 20,000 iterations.
Except for BART, all baselines adopt the BERT uncased tokenizer. We fine-tune all baselines on each individual task with 4 × 16GB NVIDIA V100 GPUs. We select the best model checkpoint based on the loss on the development set. During inference, we use beam search (Och and Ney, 2004) with beam size 4 or 5 and remove duplicated trigrams during beam search (Fan et al., 2018).
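The trigram-blocking constraint mentioned above can be sketched as a check applied to each beam hypothesis before a token is appended. This is a minimal illustration of the idea from Fan et al. (2018), not the exact implementation used by the baselines.

```python
def blocks_repeated_trigram(prefix, next_token):
    """Return True if appending next_token to the decoded prefix would
    create a trigram that already occurs in the prefix. During beam
    search, such continuations are pruned (given zero probability)."""
    if len(prefix) < 2:
        return False
    candidate = tuple(prefix[-2:]) + (next_token,)
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    return candidate in seen
```

For example, with the partial output ["a", "b", "c", "a", "b"], the token "c" is blocked because the trigram (a, b, c) was already generated, while "d" is allowed.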

Results and Analysis
Overall results. The main results are presented in Table 2. From the overall scores (highlighted in color), we observe fairly consistent gains moving from LSTM to Transformer, and then to pretrained base models and pretrained large models such as ProphetNet base and ProphetNet large. The performance gap between the pretrained and non-pretrained models is obvious: the difference in overall score is about 15% absolute on all three levels of the GLGE benchmark (GLGE-Easy, GLGE-Medium, and GLGE-Hard). As expected, the pretrained large models (ProphetNet large and BART large) achieve the best overall scores. From the results of each model on the three levels, we can see that every model exhibits a significant performance drop from GLGE-Easy to GLGE-Medium and GLGE-Hard. For both non-pretrained and pretrained models, there is a nearly 2% drop in overall score from GLGE-Easy to GLGE-Medium, and about a 4%-8% drop from GLGE-Easy to GLGE-Hard. These results illustrate the diversified difficulty of GLGE. We recommend that researchers choose a GLGE benchmark of moderate difficulty based on the model size of the pretrained model and the scale of the pretraining corpus. Researchers can also perform a more comprehensive evaluation of model performance on the GLGE benchmarks at all difficulty levels.

Table 3: Overall results of baselines across the tasks of GLGE-Medium + Low-frequency strategy. We use color to highlight the overall score.
Low-frequency strategy analysis. To further verify the effectiveness of the low-frequency strategy described in § 2.4, we build the GLGE-Medium + low-frequency benchmark and evaluate the models on it. Both GLGE-Medium and GLGE-Medium + low-frequency retain 50% of the training samples of the GLGE-Easy training set. The only difference between them is that GLGE-Medium uses random sampling to retain 50% of the training samples, while GLGE-Medium + low-frequency uses the low-frequency strategy from GLGE-Hard to select 50% of the training samples. We use the same baselines with the same settings as in § 3.2 to compare model performance on GLGE-Medium and GLGE-Medium + low-frequency. The results are shown in Table 3. We can see that after introducing the low-frequency strategy, the performance of the models drops significantly. These results demonstrate that the low-frequency strategy effectively increases the difficulty of the benchmark.
Output diversity analysis. We further compare the output diversity of each model on all the tasks of GLGE-Easy. We report the mean of the distinct bigram (Distinct-2) (Li et al., 2016) ratios of the generated samples to the golden references. Note that if the bigram diversity of the generated samples is close to that of the real samples, the distinct bigram ratio is close to 1. The results are shown in Figure 1. In general, the output bigram diversity of the pretrained models is higher than that of the non-pretrained models.
For the CNN/DailyMail (CNN/DM), Gigaword, and MSNews tasks, the bigram diversity of the generated samples is close to that of the real samples. However, for the XSUM, SQuAD 1.1, MSQG, CoQA, and PersonaChat tasks, the bigram diversity of the non-pretrained models is significantly lower than that of the pretrained models. For these tasks, the non-pretrained models tend to generate universal responses (Li et al., 2016) or outputs. Moreover, there is still a large gap between the bigram diversity of the pretrained models and that of the real samples (golden) on the XSUM, MSQG, and PersonaChat tasks. Clearly, there remains great room for improving the output diversity of pretrained models.
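The diversity statistic reported above can be sketched as follows: Distinct-2 is the number of unique bigrams divided by the total number of bigrams over a set of outputs, and the reported ratio compares the model's Distinct-2 to that of the references. Whitespace tokenization is an assumption of this sketch.

```python
def distinct_2(texts):
    """Distinct-2 (Li et al., 2016): unique bigrams / total bigrams
    over a list of whitespace-tokenized output strings."""
    bigrams = []
    for text in texts:
        tokens = text.split()
        bigrams.extend(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0

def diversity_ratio(generated, references):
    """Ratio of model Distinct-2 to reference Distinct-2. Values near 1
    mean the model matches the bigram diversity of the gold outputs;
    values well below 1 indicate repetitive, universal outputs."""
    ref = distinct_2(references)
    return distinct_2(generated) / ref if ref else 0.0
```

For example, a model that always emits "i am fine" reuses the same two bigrams, so its Distinct-2 (and hence its ratio against diverse references) stays low.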

Related Works
Benchmarks Recently, the development of general natural language understanding (NLU) evaluation benchmarks has helped drive the progress of pretraining and transfer learning in NLP. Conneau and Kiela (2018) introduce SentEval for evaluating sentence representations. More recently, DialoGLUE (Mehri et al., 2020) is proposed for task-oriented dialogue. It consists of seven task-oriented dialogue datasets covering four kinds of NLU tasks: intent prediction, slot tagging, semantic parsing, and dialogue state tracking. In addition to English NLU evaluation benchmarks, there has been an increasing number of new benchmarks in other languages. For example, CLUE (Xu et al., 2020) is a Chinese NLU benchmark that consists of eight diverse Chinese NLU tasks, including single-sentence, sentence-pair, and machine reading comprehension tasks. FLUE (Le et al., 2020) is a French NLU benchmark that includes several NLU tasks, such as text classification, paraphrasing, language inference, parsing, POS tagging, and word sense disambiguation. IndoNLU (Wilie et al., 2020) is a new benchmark for evaluating Indonesian language understanding, which introduces twelve tasks, ranging from single-sentence classification to sentence-pair sequence labeling. Furthermore, multilingual multi-task benchmarks have been proposed for cross-lingual evaluation. The XTREME benchmark is a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across forty languages and nine tasks. Almost at the same time, XGLUE is proposed, a new multilingual multi-task benchmark for cross-lingual pretraining, understanding, and generation. XGLUE comprises eleven cross-lingual tasks, including nine NLU tasks and two NLG tasks, and each task provides labeled data in multiple languages.
The above benchmarks mostly focus on NLU, providing a range of language understanding tasks. However, to the best of our knowledge, there is no benchmark designed specifically for general NLG evaluation. To fill this gap, we introduce GLGE, a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks.
Pretrained NLG Models In recent years, pretrained language models (Devlin et al., 2019; Raffel et al., 2020; Yang et al., 2019; Alberti et al., 2019; Brown et al., 2020; Clark et al., 2020) have achieved state-of-the-art results on several NLU benchmarks. In addition, more and more pretraining-based models designed for NLG tasks have been proposed. Rothe et al. (2020) adopt the Transformer-based sequence-to-sequence (seq2seq) model and leverage the checkpoints of pretrained NLU models for sequence generation tasks. MASS (Song et al., 2019) pretrains the seq2seq model by dropping a continuous token span to corrupt the text and learning to reconstruct it. Raffel et al. (2020) investigate several model structures and pretraining tasks, and further propose a unified text-to-text transformer called T5. Similarly, BART (Lewis et al., 2020) adopts the encoder-decoder structure and is pretrained with sentence-order permutation and text-infilling tasks.
More recently, ProphetNet is proposed, which introduces the future n-gram prediction mechanism for language generation. ERNIE-GEN (Xiao et al., 2020) introduces the infilling generation mechanism, noise-aware generation, and a span-by-span generation task for NLG model pretraining. In addition to general NLG tasks, some task-specific pretrained NLG models have been proposed. For dialogue and conversation, a dialogue generative pretrained transformer called DialoGPT (Zhang et al., 2020c) is proposed for conversational response generation, which is pretrained on a large-scale conversation-like exchange corpus. Furthermore, PLATO (Bao et al., 2020) is a dialogue generation pretraining framework for chit-chat, knowledge-grounded dialogues, and conversational question answering. PLATO introduces discrete latent variables to tackle the one-to-many mapping problem in response generation. For text summarization, Zhang et al. (2020a) propose PEGASUS, which designs a pretraining objective called gap sentence generation, tailored for abstractive text summarization.

Conclusion
To facilitate the development, evaluation, and comparison of new NLG models, we introduce GLGE, a multi-task evaluation benchmark for NLG with three difficulty levels. To the best of our knowledge, GLGE is the first comprehensive NLG evaluation benchmark. We evaluate several baselines on GLGE and analyze their results. The GLGE benchmark is hosted publicly and we invite the research community to submit to the leaderboard.
In future work, we will try to introduce other automatic evaluation metrics, such as BERTScore (Zhang et al., 2020b) and BLEURT (Sellam et al., 2020). We will also study the correlation between these metrics and human judgment.