TGEA: An Error-Annotated Dataset and Benchmark Tasks for Text Generation from Pretrained Language Models

In order to deeply understand the capability of pretrained language models in text generation and conduct a diagnostic evaluation, we propose TGEA, an error-annotated dataset with multiple benchmark tasks for text generation from pretrained language models (PLMs). We use carefully selected prompt words to guide GPT-2 to generate candidate sentences, from which we select 47K for error annotation. Crowdsourced workers manually check each of these sentences and detect 12K erroneous sentences. We create an error taxonomy that covers 24 types of errors occurring in these erroneous sentences, organized according to the nature of the errors with respect to linguistics and knowledge (e.g., common sense). For each erroneous span in PLM-generated sentences, we also detect another span that is closely associated with it. Each error is hence manually labeled with comprehensive annotations, including the span of the error, the associated span, a minimal correction to the error, the type of the error, and the rationale behind the error. Apart from the fully annotated dataset, we also present a detailed description of the data collection procedure, as well as statistics and analysis of the dataset. This is the first dataset with comprehensive annotations for PLM-generated texts, which facilitates the diagnostic evaluation of PLM-based text generation. Furthermore, we use TGEA as a benchmark dataset and propose a series of automatic diagnosis tasks, including error detection, error type classification, associated span detection and error rationale generation, to further promote future study on automatic error detection and correction for texts generated by pretrained language models.

On several NLU datasets, PLM-based neural models have gradually achieved human-level performance in terms of automatic evaluation metrics (e.g., accuracy, F1) (He et al., 2020; Zhang et al., 2021). In order to deeply understand and analyze the capability of PLMs on NLU, a variety of more challenging NLU datasets have been proposed (Warstadt et al., 2020; Cui et al., 2020a; Jain et al., 2020; Talmor et al., 2020). These datasets can be used not only to obtain knowledge on how PLM-based models work and what they learn, but also to define new NLU tasks and to serve as benchmarks for future progress. For example, evaluating and analyzing PLM-based models on learning document structures with a carefully created benchmark test suite (Chen et al., 2019) helps to develop new methods to enhance the capability of these models on discourse modeling (Iter et al., 2020). Knowing the weakness of current PLM-based models in commonsense reasoning has inspired people to develop various reasoning datasets (Cui et al., 2020a; Zhang et al., 2020b).
On the other hand, state-of-the-art PLMs are able to generate texts that even human evaluators cannot distinguish from human-written texts (Radford et al., 2019; Brown et al., 2020). This makes us curious about the capability of PLMs on text generation. Are they really reaching human-level performance on text generation? In contrast to the studies of PLMs on NLU, research on the capability of PLMs on NLG is quite limited, especially in dataset building and diagnostic evaluation of text generation errors.
In this paper, in order to recognize the perimeter of text generation capability of PLMs, we propose TGEA, an error-annotated dataset with multiple benchmark tasks for text generation from pretrained language models. The original raw data are collected from texts generated by a Chinese GPT-2 model. The entire data collection and annotation procedure is visualized in Figure 1. The goals and contributions of building TGEA are as follows.
• TGEA, to the best of our knowledge, is the first dataset built on machine-generated texts from state-of-the-art pretrained language models with rich annotations. The key interest of this dataset is detecting and annotating text generation errors from PLMs. It is therefore different from conventional text generation datasets (e.g., Multi-News (Fabbri et al., 2019), TextCaps (Sidorov et al., 2020)) that are constructed to train models to learn text generation (e.g., generating texts from images or long documents). It is also different from grammatical error correction (GEC) datasets (Zhao et al., 2018; Flachs et al., 2020) that are built from human-written texts, usually by second language learners.
• TGEA provides rich semantic information for text generation errors, including error types, associated text spans, error corrections and rationales behind errors, as shown in Figure 1. Marking text spans that are closely related to erroneous words allows us to detect long-distance dependencies of errors or reasoning chains related to errors. Rationales behind errors directly explain why errors are annotated. All these error-centered manual annotations not only increase the interpretability of our dataset, but also facilitate a comprehensive diagnostic evaluation of pretrained language models on text generation.
• We created an error taxonomy for TGEA, which covers 24 error types in a two-level hierarchy. With this error taxonomy, we not only obtain high agreement on manual error annotation but also recognize the strengths and weaknesses of GPT-2 on text generation by estimating a distribution over these 24 error types. Comparing our dataset with GEC datasets, we find that humans and GPT-2 have very different error distributions, especially for errors related to commonsense reasoning.
• TGEA not only exhibits text generation errors from pretrained language models, but also can serve as a dataset to train various models to automatically detect and correct these errors, like GEC datasets for training models to automatically correct human errors. We define 5 benchmark tasks over our dataset, i.e., erroneous sentence detection, erroneous span and associated span detection, error type classification, error correction and error rationale generation. For all these tasks, we provide experimental results using state-of-the-art models as baselines.

Related Work
Our work is related to GEC datasets in error annotation and correction (machine vs. human errors). It is also partially related to commonsense reasoning datasets that have been proposed recently, in that our dataset includes commonsense reasoning errors and rationales behind these errors. Our dataset is not related to conventional text generation datasets (Vougiouklis et al., 2017; Wiseman et al., 2017; Parikh et al., 2020) for training text generation models. A comprehensive comparison to GEC datasets and commonsense reasoning datasets is shown in Table 1.

GEC Datasets

JFLEG (Napoles et al., 2017) is a GEC dataset built from TOEFL exams, which does not force annotators to make minimal edits, preferring holistic fluency rewrites. CMEG (Napoles et al., 2019) is different from general grammatical error correction datasets with texts from second language learners. It uses articles or blogs (e.g., Wiki, Yahoo) written by native English speakers to explore grammatical error phenomena in different domains. CWEB (Flachs et al., 2020) also uses website texts in English, such as blogs. The difference between CWEB and CMEG is that the percentage of erroneous tokens in the former is smaller than in the latter, as the purpose of CWEB is to study grammatical error correction in low error density domains. CGEC (Zhao et al., 2018) is a large-scale Chinese grammatical error correction dataset, derived from wrong sentences written by Chinese learners in the process of learning Chinese as a second language.
In addition to the difference in text sources (i.e., human-written vs. machine-generated), other significant differences between our dataset and existing GEC datasets are that our dataset contains commonsense reasoning errors and provides associated text span annotations and rationales for errors, as shown in Table 1.

Commonsense Datasets
A variety of commonsense datasets have been proposed. Roemmele et al. (2011) introduce COPA, which focuses on commonsense causal reasoning. Levesque et al. (2012) present the Winograd Schema Challenge (WSC), a dataset testing commonsense reasoning in the form of anaphora resolution. Winogrande, a larger version of WSC containing ∼44,000 examples, is introduced by Sakaguchi et al. (2020). Winowhy (Zhang et al., 2020a) asks annotators to provide reasons for their decisions on WSC questions. In this aspect, the differences of our dataset from Winowhy are twofold. First, we provide reasons for errors rather than for correct decisions on anaphora. Second, we provide reasons for all text generation errors, rather than only errors related to commonsense reasoning.
In addition to COPA and WSC-style datasets, many large crowdsourced datasets have also been proposed recently. CommonsenseQA (Talmor et al., 2019), a commonsense question answering dataset, has been constructed from ConceptNet. HellaSwag (Zellers et al., 2019b) examines commonsense inference in the form of sentence completion. In the aspect of commonsense reasoning, our dataset is different from the aforementioned commonsense datasets in that we detect and annotate errors in machine-generated texts that violate common sense, rather than creating examples to examine the commonsense reasoning ability of machines.

Error Taxonomy
Before crowdsourced workers manually annotate errors in machine-generated texts, we need to create an error taxonomy for such error coding. Three principles are used to guide the design of the error taxonomy: coverage, exclusiveness and easiness. The coverage rule requires that the taxonomy covers almost all types of errors in machine-generated texts. The exclusiveness requirement indicates that each error type does not overlap with other error types in the taxonomy. The final easiness principle means that the error coding system is easy for annotators to use. With these three principles and the aid of a linguist, we created an error taxonomy in a two-level hierarchy, which was revised in our pre-annotation stage.
The first level of the error taxonomy includes 5 error types.
• Inappropriate combination. This type of error suggests that two words/phrases are syntactically or lexically inappropriately combined in a sentence. Such errors include not only lexical collocation errors but also long-distance syntactic constituency combination errors (e.g., inappropriate subject-object combination). This error type is similar to the "replacing" error in some GEC datasets (e.g., CWEB (Flachs et al., 2020)), as one element of an inappropriate combination should usually be replaced with another expression. As we want to find text spans associated with erroneous words/phrases, we term this error type "inappropriate combination". We further divide this error type into five subtypes at the second level.
• Missing. Grammatical constituents or words are missing. 5 subtypes are defined under this error type.
• Redundancy. Words or phrases are unnecessary. 5 subtypes are also defined.
• Discourse Error. This error type is defined for inter-sentential cohesion/coherence errors (e.g., coreference errors, incorrect discourse connectives).
• Commonsense Error. This error code is for errors related to commonsense reasoning. We divide this error type into 8 subtypes according to the type of commonsense knowledge required (e.g., time, space, number).
All other errors that cannot be categorized into the aforementioned error types are grouped into "Other". Table 2 displays examples of the error types defined above. The 24 error subtypes are displayed in Figure 2, and examples of these subtypes are shown in the Appendix.

Machine-Generated Text Collection
Raw texts in our dataset are collected from a pretrained Chinese GPT-2 (NEZHA-Gen) 2 , which generates texts according to a given prompt. NEZHA-Gen has 12 layers and 12 attention heads and is trained on Chinese Wikipedia and news data (see Appendix for more details on the hyperparameters of NEZHA-Gen). As it is easy for NEZHA-Gen to generate high-quality texts with high-frequency prompt words, we create a list of prompt words according to their frequency to guarantee that there are sufficient erroneous sentences in the collected raw texts. By doing so, we have found that GPT-2 has a better chance of generating wrong sentences with such prompts. Specifically, we randomly sampled 2M sentences from the data used to train NEZHA-Gen. The sampled sentences are then word-segmented and POS-tagged by the Baidu LAC tool 3 (Jiao et al., 2018). We then select nouns and sort them in descending order according to their frequencies in the sampled corpus. Nouns ranking in the top [40%, 60%] frequency band are selected as prompts.
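The prompt selection described above can be sketched as follows (a minimal illustration, not the actual pipeline; the function name and the assumption that LAC tags common nouns as "n" are ours):

```python
from collections import Counter

def select_prompt_nouns(tagged_sentences):
    """Select mid-frequency nouns (top [40%, 60%] band) as prompts.

    `tagged_sentences` is an iterable of (word, pos_tag) lists, e.g. the
    output of running a POS tagger over the 2M sampled sentences.
    """
    noun_counts = Counter(
        word
        for sentence in tagged_sentences
        for word, tag in sentence
        if tag == "n"  # assumption: common nouns are tagged "n"
    )
    # Rank nouns by frequency in descending order.
    ranked = [word for word, _ in noun_counts.most_common()]
    # Keep nouns ranked in the top [40%, 60%] frequency band.
    lo, hi = int(0.4 * len(ranked)), int(0.6 * len(ranked))
    return ranked[lo:hi]
```

Discarding the most frequent 40% of nouns biases generation away from prompts the model handles easily, which is the intended effect described above.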
We further filter out noisy texts from those generated with the selected prompts. Noisy texts are either texts containing no more than 15 characters or texts where Chinese characters account for less than 70% of all characters.
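These two filtering rules amount to a simple predicate (a sketch; the use of the CJK Unified Ideographs range to identify Chinese characters is our assumption):

```python
def is_noisy(text: str) -> bool:
    """Return True if a generated text should be discarded.

    A text is noisy if it has no more than 15 characters, or if Chinese
    characters make up less than 70% of all its characters.
    """
    if len(text) <= 15:
        return True
    # Count characters in the CJK Unified Ideographs block.
    n_chinese = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return n_chinese / len(text) < 0.7
```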

Error Annotation
There are 5 stages in error annotation, as shown in Figure 1. We introduce each of them in this subsection.
(1) Erroneous text detection. Texts generated by NEZHA-Gen with prompt words are presented to annotators one by one. The first stage of annotation is hence to detect erroneous texts for subsequent annotations. Corresponding tags are annotated for texts that have been manually checked.
(2) Erroneous and associated span detection. The next task for annotators is to detect erroneous and associated text spans in the detected erroneous texts. For erroneous span detection, as a text may contain several spans that can be edited, or the text can be corrected in different ways, which span should be regarded as erroneous is closely related to the way we correct the text. Therefore, the basic principle that guides the annotation of erroneous spans is also the rule that we use for error correction: making minimal edits, which is also used in GEC datasets (Flachs et al., 2020; Napoles et al., 2017). In addition to the minimal edit principle, we also provide the following specific rules for annotators:
• If annotators feel that a text is ambiguous and difficult to correct, the text can be discarded without any further annotations.
• If there are several spans that can be edited, the first erroneous span is preferred.
• If the number of errors to be corrected in a text is larger than 4, the text is removed.
Following these rules, annotators have removed 4,291 texts, which account for only 8.36% of all detected erroneous texts in the first stage.
In addition to erroneous span annotation, unlike GEC datasets (Daudaravicius et al., 2016;Zhao et al., 2018), we also detect a text span that is closely related to the already detected erroneous span with respect to the error, and term this span as "associated span". In Table 2, we show examples with annotated erroneous and associated text spans. For an inappropriate combination, the associated span is usually a span that should not co-occur with the erroneous span.
(3) Error correction. After detecting erroneous spans in a given text, annotators are required to make corrections following the minimal edit principle. Annotators are also required to use common words for error correction to make the corrected text as fluent as possible.
(4) Error type classification. Once annotators have detected both erroneous and associated spans and provided corrections, they are quite familiar with these errors. Hence, we now ask them to categorize the annotated errors into the error types defined in our error taxonomy. First, they select the primary type from the level-1 error types. Then, if there are level-2 error subtypes, annotators continue to select a subtype. We observe that errors annotated with "other" account for only 5.70%, suggesting that our error taxonomy has good coverage.
(5) Rationale generation. Partially inspired by previous datasets that provide explanations together with corresponding annotations, e.g., e-SNLI (Camburu et al., 2018), Winowhy (Zhang et al., 2020a) and R4C (Inoue et al., 2020), we ask annotators to give a reason for each error to justify their annotations. To the best of our knowledge, no GEC datasets provide explanations for error corrections. We believe that annotated rationales can be used to improve the interpretability of neural models trained on our dataset.

Annotation Quality Control
In order to ensure the quality of error annotations, we adopt a very strict quality control protocol during annotation. First, we train two reviewers with 1K machine-generated texts. The annotation consistency of the two reviewers on the 1K texts is very high, with an average IAA of 92.3% and Cohen's Kappa (McHugh, 2012) of 82.6% across annotation tasks (1), (2) and (4). We further evaluated the annotations of the two reviewers; their average accuracy across all tasks is 96.3% and 97.4%, respectively.
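For reference, pairwise Cohen's Kappa over categorical annotations can be computed as follows (a generic sketch, not the exact evaluation script used here):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa between two annotators' categorical labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    # Kappa corrects observed agreement for chance agreement.
    return (p_o - p_e) / (1 - p_e)
```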
Second, 200 candidate workers participated in a pre-annotation stage. The two reviewers reviewed the annotations from these participants to judge whether each annotation was correct. Only participants who reached an accuracy of >90% in every task could join the next stage. As a result, 20 participants passed the training in the pre-annotation stage. We then divided them into two groups and asked them to annotate the same 500 texts. The IAA and Cohen's Kappa are shown in Table 3, which suggests that the 20 annotators were ready for the final annotation.
Third, in order to further ensure annotation quality, we carried out iterative verification and amendment. The two reviewers review each annotated text. If an annotation is found to be wrong, it is returned for amendment; this process iterates until all annotations are qualified.

Dataset Statistics
Overall statistics. We reshuffle all annotated texts and divide them into the training/dev/test sets with a proportion of 8:1:1. As shown in Table 4, the training set contains 27,096 correct texts and 9,740 erroneous texts. Both the development and test set contain 4,706 texts, among which 1,218 texts are erroneous. Not surprisingly, most erroneous texts contain only one error.
After Chinese word segmentation via Jieba 4 , there are 1,208,719 tokens in total. On average, there are 25.68 tokens in each text.
Annotation statistics. As shown in Table 4, each erroneous text span contains 2.94 tokens on average, while each associated span is composed of 4.27 tokens. The average distance from an erroneous text span to its associated span is 7.03 tokens, about 1/3 of the average text length.

Error Type Distribution
We further show the percentages of both level-1 and level-2 error types in Figure 2. We observe that only 5.7% cases cannot be categorized into our defined error types. The inappropriate combination, missing and redundancy error, which are the main error types in GEC datasets, account for 64.85% in our dataset. In addition to these errors, we see 18.96% commonsense errors and 10.48% discourse errors, which are usually not very common in GEC datasets. However, these two types of errors with high percentages in our dataset suggest that pretrained language models can be further improved on both commonsense reasoning and discourse modeling.

TGEA as a Benchmark
We use our dataset as a benchmark and propose 5 tasks that are defined for errors in texts generated by PLMs. We provide baseline results for these tasks in this section.
We employ three BERT-style Chinese PLMs as baselines in our experiments, namely BERT-wwm-ext and RoBERTa-wwm-ext-large developed by Cui et al. (2020b) 5 , and ALBERT-Chinese-large 6 . For notational simplicity, we denote them as BERT zh , RoBERTa zh and ALBERT zh , respectively. Please refer to the Appendix for the model hyperparameter settings of each task.

Erroneous Text Detection
Task definition. This is a text classification task to judge whether a given text is erroneous. In order to avoid data imbalance, we use the same number of correct and erroneous texts for training. Model. The three Chinese PLMs are used with standard text-classification fine-tuning. Results. All models perform just <14% better than chance (random guessing), as shown in Table 5. We also provide human performance on this task. The best model RoBERTa zh is worse than human performance by 26 points. This suggests that automatically detecting erroneous texts generated by pretrained language models is very challenging even in the balanced classification scenario.

Erroneous Span and Associated Span Detection
Task definition. We define the detection of the two types of spans as a joint task, as they are closely related to each other. The joint task is similar to named entity recognition (NER) (a sequence labeling task) and requires recognizing the erroneous and associated text spans simultaneously. NER-style word-level tags are hence annotated for each erroneous text.
Model. The three Chinese PLMs with NER-like fine-tuning are evaluated on this task. Since this is a 3-class token classification task, we report class-F1 on the erroneous and associated spans. The class-F1 on class X is calculated like a normal F1 for a binary classification task, by treating the target class X as the positive class and all other classes as negative.
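The class-F1 described here can be computed as follows (a sketch over flat token-tag sequences; the tag names in the example are hypothetical):

```python
def class_f1(gold, pred, target):
    """One-vs-rest F1 for a single class in token classification.

    Tokens tagged `target` are positives; all other tags are negatives.
    `gold` and `pred` are aligned sequences of token-level tags.
    """
    tp = sum(g == target and p == target for g, p in zip(gold, pred))
    fp = sum(g != target and p == target for g, p in zip(gold, pred))
    fn = sum(g == target and p != target for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Computing this once with the erroneous-span tag as `target` and once with the associated-span tag gives the two class-F1 values reported for the joint task.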
Results. As shown in Table 5, all models perform very poorly on this task, indicating the difficulty of automatically detecting erroneous and associated spans. However, we have found that models benefit considerably from joint detection compared with detecting a single type of span (either the erroneous or the associated span). Our preliminary experiments on detecting only the erroneous span show that the best model achieves only 26.42% erroneous class-F1 on the test set, while the joint task achieves 27.66%.

Error Type Classification
Task definition. Again, this is a text classification task. We only perform classification over level-1 error types, in the form of 5-way classification. Model. We use models similar to those in the first task.
Results. The overall accuracy and Macro-F 1 (shown in Table 5) are very low. However, we find some error types are easier than others. The accuracy on the classification of redundancy errors is 53.91%, the highest among all error types.

Error Correction
Task definition. This task is the same as GEC, which transforms an erroneous text into a correct sequence.
Model. We use the state-of-the-art BERT-GEC model (Kaneko et al., 2020) as the baseline for this task, which is an encoder-decoder model using representations learned by PLMs as additional inputs. Following Kaneko et al. (2020), we feed representations learned by BERT zh and RoBERTa zh into the BERT-GEC model. Results. We report precision, recall and F 0.5 scores using the official MaxMatch tool (Dahlmeier and Ng, 2012). As shown in Table 5, the best RoBERTa zh GEC model achieves a very low F 0.5 of 0.93% and 0.98% on the development and test set, respectively. We speculate that the reasons for this are twofold. First, compared with GEC data on human-written texts, our dataset is relatively small. Second, our dataset contains error types that are very different from those in previous GEC datasets (Zhao et al., 2018; Flachs et al., 2020). Punctuation, spelling and other word/character-level errors, which are easy to correct, are rare in TGEA although they are quite common in GEC datasets. In contrast, TGEA contains more complicated errors that can only be corrected with knowledge of common sense, long-distance or inter-sentential dependencies, etc.

Rationale Generation
Task definition. This is a text generation task that directly generates an explanation of the text generation errors in an erroneous text.

Model. We use NEZHA-Gen as the baseline for this task. We restructure annotated texts in our dataset in the form of {T, 这句话错误的原因是：, R} ({T, The reason behind the errors in this sentence is:, R}), where T is an erroneous sentence and R is the error rationale provided by annotators. We then fine-tune NEZHA-Gen on the reformatted training set and evaluate the fine-tuned model on the reformatted development and test sets. We report BLEU (Papineni et al., 2002), Rouge-L (Lin, 2004) and BERTScore (Zhang et al., 2020c). Results. As can be expected, results in these metrics are very low due to the high difficulty of this task. We analyze texts generated by the baseline and find that generated rationales are usually much longer than the reference rationales provided by human annotators. This could result in the low BLEU score, since long hypotheses are penalized in BLEU computation. We also experiment with zero-shot generation on the test set. The results are {BLEU = 0.04%, Rouge-L = 6.83%, BERTScore = 54.27%}, indicating that fine-tuning on the annotated training set improves this task. We suggest that this generation task could be reformulated as a multi-choice question answering task by providing alternative rationales as distractors, similar to VCR (Zellers et al., 2019a). We leave this to future work.
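The reformatting step can be sketched as follows (the dictionary keys are hypothetical; the Chinese template is the one given above):

```python
TEMPLATE = "{text}这句话错误的原因是：{rationale}"

def build_rationale_corpus(examples):
    """Turn annotated examples into fine-tuning strings for a causal LM.

    `examples` is an iterable of dicts with (hypothetical) keys
    "text" (the erroneous sentence T) and "rationale" (the annotated
    reason R); each example becomes one string T + prompt + R.
    """
    return [
        TEMPLATE.format(text=ex["text"], rationale=ex["rationale"])
        for ex in examples
    ]
```

At inference time, only the T-plus-prompt prefix is given and the model is asked to continue with a rationale.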

Discussion
Since we use machine-generated texts for error annotation, hyperparameters of models (e.g., sampling strategies, model size), model types (e.g., GPT-2, GPT-3 or other PLMs for text generation), and genres of texts used to train PLMs, etc., all have impacts on generated texts and hence on error types and error distribution.
A straightforward way to mitigate this issue is to collect raw texts from multiple models with different hyperparameters, neural architectures and text genres. This would lead to an expanded dataset with a much larger number of instances to be manually annotated, which is expensive and time-consuming. Yet another issue is that it may result in an incoherent dataset, due to inconsistency across different models and the difficulty of setting the proportion of each data source.
Instead, we focus on consistently annotating errors for texts generated from a single source. In order to make TGEA as general and representative as possible, we use GPT-2, which is not only the current state of the art in text generation but also easily available. We also adopt standard and widely used hyperparameters (see Appendix for more details) for NEZHA-Gen to generate texts.
Additionally, we use a random sampling strategy with top-k = 30. To set k, we analyzed 500 examples generated with different values of k and found that adjusting k has a noticeable impact on the percentage of redundancy errors. Except for the extreme case of k = 1, the types of errors and their distribution do not change significantly. Take commonsense errors, the biggest difference from human-written texts, as an example: when k varies over {5, 10, 20, 30, 50}, the percentage of commonsense errors is 18.6% ± 5.8%. Redundancy errors account for >95% when k = 1 (while commonsense errors account for 0.8%), but sharply drop to 37.4% at k = 5, and the form of repetition changes from same-word repetition to a mixture of synonymous and same-word repetition, suggesting that a simple repetition penalty may not be sufficient to deal with semantic redundancy. When k ∈ {10, 20, 30, 50}, the percentage of redundancy errors is very close to the result reported in Figure 2. When k > 30, many generated sentences are completely incomprehensible. A larger k also reduces generation efficiency. Therefore, we chose a sampling strategy with k = 30 as a trade-off between text quality and generation efficiency.
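Top-k sampling over a next-token logit vector can be sketched as follows (a generic implementation, not the exact NEZHA-Gen decoding code):

```python
import math
import random

def top_k_sample(logits, k=30, rng=random):
    """Sample a token index from the k highest-scoring logits.

    Probability mass is renormalized over the top-k candidates via a
    softmax; with k = 1 this degenerates to greedy decoding, which in
    the analysis above yields mostly redundancy errors.
    """
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the retained logits (shifted by the max for stability).
    weights = [math.exp(logits[i] - logits[top[0]]) for i in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Draw one index from `top` according to `probs`.
    return rng.choices(top, weights=probs, k=1)[0]
```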

Conclusions
In this paper, we have presented TGEA, the first dataset with a variety of manual annotations of errors occurring in texts generated by pretrained language models. For each erroneous text generated by a Chinese GPT-2 model, our crowdsourced annotators detect erroneous text spans together with their associated text spans, and provide error types defined in a two-level hierarchical taxonomy as well as rationales behind the detected errors. We elaborate the 5 annotation stages for building TGEA with a strict annotation quality control protocol. We also report baseline results on the 5 benchmark tasks over TGEA. The low results suggest that our dataset is a challenging testbed for future work on the automatic detection of erroneous spans and types, as well as on producing error corrections and rationales for texts generated by PLMs. TGEA features wide error type coverage, rich semantic annotation and functional diversity, and can not only be used for deep diagnostic analysis of the text generation capability of pretrained language models, but also facilitate and promote research on automatic and interpretable error correction for PLM-generated texts.

A Appendix
A.1 NEZHA-Gen Hyperparameters

Modifier: 在国内成立水牛研究中心，有利于增强[水牛对]自然条件和人工环境的适应能力。 (The establishment of a Buffalo Research Center in China is conducive to enhancing the adaptability [of buffalo] to natural conditions and artificial environments.)

Function Word: 他的儿子[在]上一届奥运会夺得冠军，并且获得当年世界锦标杯赛金牌。 (His son won the championship [in] the last Olympic Games and won the gold medal in the World Championship Cup that year.)

Table 7: Examples of level-2 error types in TGEA. Underwaved words are erroneous words, underlined words are associated words, and words in "[]" are corrections to erroneous words.