OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics

Automatic metrics are essential for developing natural language generation (NLG) models, particularly for open-ended language generation tasks such as story generation. However, existing automatic metrics are observed to correlate poorly with human evaluation. The lack of standardized benchmark datasets makes it difficult to fully evaluate the capabilities of a metric and to fairly compare different metrics. Therefore, we propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics. OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics, including (a) the correlation with human judgments, (b) the generalization to different model outputs and datasets, (c) the ability to judge story coherence, and (d) the robustness to perturbations. To this end, OpenMEVA includes both manually annotated stories and auto-constructed test examples. We evaluate existing metrics on OpenMEVA and observe that they correlate poorly with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge (e.g., causal order between events), generalization ability, and robustness. Our study presents insights for developing NLG models and metrics in further research.


Introduction
Significant advances have been witnessed in many NLG tasks with pretrained models (Devlin et al., 2019; Brown et al., 2020). However, existing generation models are still far from human-level performance in generating reasonable texts, particularly for open-ended generation tasks such as story generation (Fan et al., 2018; Guan et al., 2020). One critical obstacle is the lack of powerful metrics for measuring the quality of generation.

The standard paradigm for evaluating NLG metrics is to calculate the correlation with human judgments on manually annotated datasets (Tao et al., 2018; Sellam et al., 2020). Recent studies have discovered that existing automatic metrics may correlate poorly with human judgments (Liu et al., 2016; Guan and Huang, 2020). Unfortunately, the lack of benchmark datasets makes it challenging to completely assess the capabilities of a metric and to fairly compare different metrics. Firstly, annotated datasets usually contain innate data bias and annotation bias. Secondly, summarizing the performance with a single aggregate statistic (e.g., a correlation score) makes it difficult to probe which aspects a metric can successfully capture and which it cannot. Therefore, many alternative approaches have been proposed to evaluate NLG metrics, such as measuring the robustness to adversarial examples (Zhang* et al., 2020) and the generalization to quality-biased data (Sellam et al., 2020). However, these approaches only focus on an individual capability or a single task, thereby failing to fully reveal the strengths and weaknesses of an NLG metric.

Therefore, we propose OpenMEVA, a benchmark for Open-ended story generation Metrics Evaluation. We first collect a MANually annotated Story dataset (MANS). The stories are generated by various generation models trained on two widely used story corpora, ROCStories (Mostafazadeh et al., 2016) and WritingPrompts (Fan et al., 2018). Therefore, MANS supports evaluating metrics in terms of not only the correlation with human judgments, but also the generalization w.r.t. model drift (generations from different models) and dataset drift (examples from different datasets).
In addition, OpenMEVA includes an AUTO-constructed Story dataset (AUTOS) to test the robustness of metrics and their ability to judge story coherence, namely, the semantic relations and discourse structures in the context. We construct AUTOS by perturbing human-written stories, and test the metrics on each single aspect (e.g., the ability to recognize inconsistency) by validating their input-output behavior (Ribeiro et al., 2020). Through such behavioral tests, AUTOS can reveal potential issues of metrics in multiple aspects, which would not be traceable in the machine-generated examples in MANS.

We conduct extensive experiments to assess the capabilities of existing automatic metrics on OpenMEVA. We find that state-of-the-art metrics still correlate poorly (less than 0.5) with human judgments on MANS, and that it is difficult for learnable metrics to generalize to model or dataset drift. Through tests on AUTOS, we observe that most metrics perform well in recognizing incoherence at the token level (e.g., unrelated entities) and sentence level (e.g., semantic repetition), but fail to recognize discourse-level incoherence (e.g., inconsistency) and lack inferential knowledge (e.g., the temporal order between events). Besides, we also show that existing metrics are not robust to a small number of typos or to synonym substitution. These findings may inspire new directions for developing NLG models and designing metrics in future research.
We also provide an open-source toolkit that implements various metrics and thus supports the comparison and analysis of metrics. In addition, the toolkit provides data perturbation techniques for generating customized test cases beyond AUTOS, which can facilitate the fast development of new automatic metrics.

Related Work
Various automatic metrics have been proposed for evaluating language generation. They can be roughly divided into referenced, unreferenced, and hybrid metrics, according to whether they rely on human-written references when calculating the metric score. Referenced metrics usually measure the similarity between a sample and some references based on word overlap (e.g., BLEU (Papineni et al., 2002), ROUGE (Lin, 2004)) or word embeddings (e.g., BERTScore (Zhang* et al., 2020), MoverScore (Zhao et al., 2019)). However, referenced metrics were reported to correlate poorly with human judgments in open-ended generation tasks (Liu et al., 2016) due to the one-to-many issue (Zhao et al., 2017). To address the issue, unreferenced metrics were proposed to measure the quality of a sample without any reference, such as perplexity, discriminator-based metrics (Kannan and Vinyals, 2017), UNION (Guan and Huang, 2020), and GRADE (Huang et al., 2020). Besides, hybrid metrics either combine referenced and unreferenced metrics (e.g., RUBER and its variants (Tao et al., 2018; Ghazarian et al., 2019)) or learn from human-annotated scores (e.g., ADEM (Lowe et al., 2017), BLEURT (Sellam et al., 2020)).
Recently, there have been many criticisms of existing metrics. Garbacea et al. (2019) showed the poor generalization ability of discriminator-based metrics. Sai et al. (2019) demonstrated that ADEM is not robust to simple attacks such as word substitution or random word shuffling. However, these criticisms only focus on individual metrics or capabilities. Notably, Ribeiro et al. (2020) proposed CheckList, a framework for evaluating different capabilities of general language understanding models by validating their input-output behavior. The test cases are created from scratch or by perturbing an existing dataset. Similar to CheckList, OpenMEVA also constructs examples automatically for behavioral tests. However, CheckList only focuses on single sentences, and thus cannot test models on understanding long texts with discourse-level features (e.g., temporal relationships). Moreover, the testing methods of CheckList are not directly applicable to NLG metrics. Specifically, CheckList measures the performance of a model by calculating the failure rate of discrete model predictions against automatic labels. Such failure rates are ineffective for measuring metrics since most metric scores are continuous. To address these issues, we propose perturbation techniques and testing methods better suited to story generation metrics.

Data Collection
We collect MANS and AUTOS based on ROCStories (ROC for short) (Mostafazadeh et al., 2016) and WritingPrompts (WP for short) (Fan et al., 2018), which are commonly used for story generation (Guan et al., 2020; Fan et al., 2019) and evaluation (Guan and Huang, 2020).

Figure 1: Overview of the manual annotation interface. Story A gets two points in overall quality since three points are deducted for its repetitive plot and chaotic scene. The ratings of Annotator #5 for the current story group are rejected because of the low score for the human-written story and the high score for the negative sample.

Although we only consider the stories in these two corpora, OpenMEVA is designed to measure the capability of NLG metrics to evaluate general linguistic features such as coherence, which also pertain to other stories. Besides, our idea of building datasets by manual annotation or automatic construction can be easily extended to evaluate specific aspects of other types of stories.

MANS: Manually Annotated Stories
We collect MANS to assess the correlation of metrics with human judgments and their generalization ability when evaluating machine-generated stories. We randomly split ROC and WP by 90%/5%/5% for training/validation/testing of the generation models. We regard the first sentence for ROC and the prompt for WP as the input. After training, we generate stories based on the test sets. Then, we resort to Amazon Mechanical Turk (AMT) for human judgments of the generated stories. We consider various generation models, including a Seq2Seq model (Sutskever et al., 2014), Fusion (Fan et al., 2018), Plan&Write, GPT-2, and a knowledge-enhanced GPT-2 (KG-GPT-2) (Guan et al., 2020).

Manual Annotation We present the manual annotation interface in Figure 1. In each human intelligence task (HIT) of AMT, we show workers the input of a story paired with seven stories, including (a) five stories generated by the above five models, (b) the human-written story, and (c) a negative example constructed by perturbing a story (e.g., repetition, shuffling) sampled from the test sets.
Then we ask workers to compare the overall quality of the seven stories, and to rate each story on a 5-point Likert scale. We reject an HIT if the worker rates the human-written story lower than four points or rates the negative example higher than two points. Through this quality-control mechanism, we filtered out about 38.7% of the assignments for ROC and 75.4% for WP. Finally, we ensure that there are five valid ratings for each generated story, and we regard the average rating as the final human judgment.
Considering that overall quality is often too abstract to measure, we follow previous recommendations (Belz and Hastie, 2014; van der Lee et al., 2020) to decide the overall quality by summarizing multiple separate criteria. We ask the workers to decide the rating of a story based on a point-deduction policy. Specifically, a story is penalized in points if it contains errors such as repetitive plots, unrelated events, conflicting logic, or globally chaotic scenes, which are commonly observed in the outputs of existing NLG models (Guan and Huang, 2020) (several examples are shown in the appendix). Intuitively, the policy can alleviate the tendency to give high scores and keep the judgment standard of workers as consistent as possible during annotation. To avoid introducing extra bias through the policy, we do not require workers' overall-quality ratings to exactly match the deducted points.
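As a concrete illustration, the rejection rule and score aggregation described above can be sketched as follows; this is a minimal sketch, and the data format (dictionary keys, list of ratings) is our assumption rather than the authors' actual pipeline.

```python
# Minimal sketch of the HIT quality-control rule and score aggregation.
# Ratings are on a 5-point Likert scale; the keys are illustrative.
def is_valid_assignment(ratings: dict) -> bool:
    """Reject an assignment if the human-written story gets fewer than
    four points or the constructed negative example more than two."""
    return ratings["human_written"] >= 4 and ratings["negative_example"] <= 2

def final_human_judgment(valid_ratings: list) -> float:
    """Average the five valid ratings collected for one generated story."""
    assert len(valid_ratings) == 5, "each story needs five valid ratings"
    return sum(valid_ratings) / len(valid_ratings)
```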

Data Statistics
We randomly sampled 200 stories from the test sets of ROC and WP, respectively, for story generation. With five generation models per dataset, this yields 1,000 machine-generated stories to annotate for each of ROC and WP.

Table 1: Construction of the discrimination test set in AUTOS, showing how we select coherent examples and create incoherent examples for each aspect.

Lexical Repetition
Selecting coherent examples: all the human-written stories.
Creating incoherent examples: repeating a sentence. Case: ... he stepped on the stage and stepped on the stage ...

Semantic Repetition
Selecting coherent examples: all the human-written stories.
Creating incoherent examples: repeating a sentence with its paraphrase obtained by back translation. To ensure semantic similarity and avoid high word overlap, we only use paraphrases whose MoverScore is larger than 0.4 and whose BLEU-1 is less than 0.6 with the original sentence (see the sketch after this table). We present some examples of paraphrase generation in the appendix. Case: he hired an attorney. he employed a lawyer ... (MoverScore=0.57, BLEU-1=0.40)

Character Behavior
Selecting coherent examples: stories with passive voice or with personal pronouns (e.g., "him", "their") referring to multiple characters. Case: ... it asked John if John could ...
Creating incoherent examples: (1) reordering the subject and object of a sentence (Case: ... John asked it if John could ...); (2) substituting a personal pronoun with one that refers to another character, without changing the grammatical case of the substituted pronoun (e.g., "my" can be substituted with "his" but not with "him").

Common Sense
Selecting coherent examples: stories containing both the head and tail entities of a triple in ConceptNet (Speer and Havasi, 2012).
Creating incoherent examples: substituting 10% of the entities with a neighboring entity in ConceptNet. Case: today is Halloween → Christmas . Jack is excited to go trick or treating ... ("Halloween" and "Christmas" have the relation "Antonym" in ConceptNet)

Consistency
Selecting coherent examples: stories with negated words (listed in Table 12).
Creating incoherent examples: inserting or deleting negated words.

Relatedness
Selecting coherent examples: stories with weak token-level semantic relatedness within the context. Case: Craig was diagnosed with cancer. he decided to fight it ...
Creating incoherent examples: substituting a sentence with another randomly sampled from the dataset. Case: Craig was diagnosed with cancer. he decided to fight it. → Kelly wanted to put up the Christmas tree. He tried several different approaches and medications. eventually it went into remission ...

Causal Relationship
Selecting coherent examples: stories with causality-related words (e.g., "because"). Case: ... the sky is clear. so he can see it .
Creating incoherent examples: (1) reordering the cause and effect, which should be two individual sentences or two clauses connected by a causality-related conjunction (Case: ... he can see it. so the sky is clear.); (2) substituting the causality-related words with their antonyms (e.g., "reason" vs. "result").

Temporal Relationship
Selecting coherent examples: stories with time-related words (listed in Table 12).
Creating incoherent examples: reordering two sequential events, which should be two individual sentences or two clauses connected by a time-related conjunction.
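To make the Semantic Repetition filter concrete, here is a minimal sketch under our own assumptions: BLEU-1 is computed with NLTK, and `mover_score` is a stand-in for any MoverScore implementation rather than a specific API.

```python
# Hedged sketch of the paraphrase filter: keep a back-translated
# paraphrase only if it is semantically close (MoverScore > 0.4)
# yet lexically different (BLEU-1 < 0.6).
from nltk.translate.bleu_score import sentence_bleu

def keep_paraphrase(original: str, paraphrase: str, mover_score) -> bool:
    # BLEU-1 uses unigram precision only, hence weights (1, 0, 0, 0)
    bleu1 = sentence_bleu([original.split()], paraphrase.split(),
                          weights=(1, 0, 0, 0))
    return mover_score(original, paraphrase) > 0.4 and bleu1 < 0.6
```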

Table 2: Perturbations for the invariance test set in AUTOS.

Synonyms: substituting a word with its synonym retrieved from WordNet. Case: ... I purchased → bought my uniforms. (A code sketch follows the table.)

Paraphrases: substituting a sentence with its paraphrase. Case: he hired an attorney → he employed a lawyer

Punctuation: deleting unimportant punctuation marks.

Contraction: contracting full expressions or expanding contractions.

Typos: adding a small number of typos (less than 2% of the words).
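As a minimal sketch of the synonym-substitution perturbation, the following uses WordNet via NLTK (requires `nltk.download("wordnet")`); the lookup strategy here is illustrative, not necessarily the exact one used to build AUTOS.

```python
# Return one WordNet synonym that differs from the input word, or None.
from nltk.corpus import wordnet

def wordnet_synonym(word: str):
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            candidate = lemma.name().replace("_", " ")
            if candidate.lower() != word.lower():
                return candidate
    return None

# e.g., wordnet_synonym("attorney") returns "lawyer", since both words
# share the synset lawyer.n.01 in WordNet
```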

AUTOS: Auto-Constructed Stories
While improving the correlation with human judgments is the ultimate goal of developing automatic metrics, merely relying on limited annotated data may lead to overestimating the true evaluation performance (Ribeiro et al., 2020). Besides, a machine-generated story may contain multiple entangled errors (e.g., repetition and unrelatedness), which makes it impossible to test metrics on individual aspects. Therefore, we propose to evaluate the capabilities of metrics with auto-constructed test examples (i.e., AUTOS), each of which is created to focus on a single aspect. We construct AUTOS based on the human-written stories in the test sets of ROC and WP.
Aspects We argue that an ideal metric for evaluating open-ended language generation should have at least the following capabilities: (a) the ability to judge story coherence, which requires recognizing lexical and semantic repetition, unreasonable character behavior (e.g., chaotic coreferences), violations of common sense (e.g., "trick or treat" on "Christmas"), poor consistency and relatedness, and incorrect causal and temporal relationships; and (b) robustness to perturbations, such as substituting with synonyms or paraphrases, deleting unimportant punctuation marks, contracting full expressions or expanding contractions, and adding typos. Tests in these aspects require metrics to fully understand linguistic features at the token level (e.g., synonyms), sentence level (e.g., semantic similarity), and discourse level (e.g., context relatedness in content and proper sentence orders), and to possess knowledge about common sense, causality, etc., which are usually not traceable in machine-generated stories. Although these aspects are not exhaustive, they are a starting point for further research.

Testing Methods For coherence assessment, we expect metrics to score coherent examples higher than incoherent ones, i.e., the discrimination test, as shown in Table 1. For robustness assessment, we expect the metric scores to remain the same under certain perturbations, i.e., the invariance test, as shown in Table 2. However, the perturbations may inevitably introduce grammar errors. To alleviate this issue, we use an automatic grammaticality classifier to filter out ungrammatical examples in AUTOS, except for those used to evaluate robustness to typos. We present the statistics of AUTOS together with the evaluation results in Tables 6 and 7 for the discrimination and invariance tests, respectively. We provide more details about the construction of AUTOS and the grammaticality classifier in the appendix.

Evaluation
We evaluated existing metrics on OpenMEVA and analyzed their strengths and weaknesses with extensive experiments.

Evaluated Metrics
We experimented with existing metrics of different types as follows: (a) Referenced Metrics: the word-overlap-based sentence BLEU score (geometric mean from 1-gram to 4-gram) (Papineni et al., 2002), and the contextualized-embedding-based BERTScore-F1 (Zhang* et al., 2020); (b) Unreferenced Metrics: the perplexity (PPL) estimated by GPT-2, either pretrained or fine-tuned on the story corpora, and UNION (Guan and Huang, 2020); (c) Hybrid Metrics: RUBER-BERT (Tao et al., 2018; Ghazarian et al., 2019). In addition, we also report the performance of the unreferenced version of RUBER-BERT, denoted as Ru-BERT. We present results with more metrics in the appendix.

Correlation with Human Judgments
We first calculate the Pearson correlation coefficient between metric scores and human judgments on MANS. Besides, we also evaluate metrics on four evaluation sets constructed for individual error types (described in Section 3.1) based on MANS. Each of them contains all the reasonable samples and the unreasonable samples of one error type. A sample is regarded as reasonable if its overall quality score is larger than four points. An unreasonable sample is assigned an error type if that is the only error type annotated by at least three of the five annotators. We assign the reasonable and unreasonable samples binary labels of 1 and 0, respectively, and calculate the correlation between metric scores and the binary labels on the four evaluation sets.
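A minimal sketch of this protocol follows; it works for both the 5-point human judgments and the binary labels, and assumes SciPy is available.

```python
# Pearson correlation between metric scores and human judgments
# (or binary reasonable/unreasonable labels).
from scipy.stats import pearsonr

def metric_correlation(metric_scores, labels):
    """Return (r, p); we mark r as significant when p < 0.01."""
    r, p_value = pearsonr(metric_scores, labels)
    return r, p_value
```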
We summarize the correlation results in Table 3. As previous studies observed (Guan and Huang, 2020), unreferenced metrics are more competitive than referenced ones for evaluating open-ended language generation. PPL (F) performs better than PPL (P) on ROC but not on WP, which may be because stories in ROC are created artificially and hence differ from the general language distribution used to pretrain GPT-2. Furthermore, measuring input-output relatedness (Ru-BERT) is not enough for language generation evaluation. UNION outperforms the other metrics in overall quality assessment since it learns to distinguish human-written stories from negative samples covering more error types. Interestingly, it seems easier for the metrics to recognize surface errors (e.g., repetitive plots) or serious global errors (e.g., chaotic scenes). However, the best correlation with human judgments is still fairly low, and unrelated events and conflicting plots remain difficult to recognize. These results indicate that there is huge room for improving the metrics.
Table 3: Pearson correlation with human judgments on MANS. PPL (P) and PPL (F) denote the perplexity estimated by pretrained and fine-tuned GPT-2, respectively. The best performance is highlighted in bold. The results contain the correlation with human judgments on all the annotated samples in MANS (Overall), and the correlation with the binary labels on reasonable samples and unreasonable ones of different error types. The error types include Repetitive plots, Unrelated events, Conflicting logic, and Chaotic scenes. The numbers in the table header denote the number of corresponding stories. * indicates that the correlation score is significant (p-value < 0.01).

To further examine to what extent an improvement in an automatic metric corresponds to an improvement in human judgments, we calculate the correlation between human judgment differences and metric score differences (Mathur et al., 2020). Specifically, we sort the 1,000 stories (for ROC and WP, respectively) in MANS by the human judgments, select 200 consecutive stories from the beginning, and repeat the selection with a stride of 10, which yields (1,000 − 200)/10 = 80 story sets. (We do not construct the sets by random sampling since it would be difficult to cover a wide enough range of quality levels.) We decide the human judgment or metric score of each set by averaging over the stories in the set. We calculate the human judgment difference and metric score difference between any two sets (80 × 80 = 6,400 pairs in total), and present the correlation between the differences in Figure 2 for several typical metrics. We can see that a significant improvement in the metrics usually corresponds to a significant improvement in human judgments (cyan/dark gray part in Figure 2). However, both an insignificant drop and an insignificant improvement in a metric can correspond to a significant improvement in human judgments. Worse, the corresponding improvement in human judgments may span a wide range, which is particularly evident for BERTScore-F1 and RUBER-BERT (yellow/light gray part in Figure 2). That is, if an NLG model achieves insignificantly better scores on these two metrics, it is quite possible that the model performs significantly worse in human judgments. The situation improves with PPL (F) and UNION, suggesting that they may be better suited to measuring language generation.
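The difference-correlation analysis above can be sketched as follows; this is a minimal reimplementation under our reading of the procedure, assuming NumPy and SciPy.

```python
# Sort stories by human judgment, slide a 200-story window with stride 10
# to obtain (1,000 - 200) / 10 = 80 sets, then correlate the pairwise
# differences of the set-level averages.
import numpy as np
from scipy.stats import pearsonr

def difference_correlation(human, metric, window=200, stride=10):
    order = np.argsort(human)
    human = np.asarray(human, dtype=float)[order]
    metric = np.asarray(metric, dtype=float)[order]
    starts = range(0, len(human) - window, stride)  # 80 sets for 1,000 stories
    h_sets = np.array([human[s:s + window].mean() for s in starts])
    m_sets = np.array([metric[s:s + window].mean() for s in starts])
    # differences between any two sets: 80 x 80 = 6,400 pairs
    h_diff = (h_sets[:, None] - h_sets[None, :]).ravel()
    m_diff = (m_sets[:, None] - m_sets[None, :]).ravel()
    return pearsonr(h_diff, m_diff)
```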

Generalization Ability
It is extremely important for learnable metrics to deal with model drift and dataset drift (Garbacea et al., 2019; Sellam et al., 2020). Specifically, a generalizable metric should be able to evaluate different NLG models since generation quality and inductive bias can vary significantly across models. Besides, we also expect a metric to reliably evaluate outputs from different datasets even without retraining. Therefore, we assess the generalization ability of the learnable metrics, including PPL (F), Ru-BERT, and UNION, which are fine-tuned on the training sets of ROC and WP, respectively.
To assess the generalization to model drift, we test the metrics on the stories generated by each of the five aforementioned models in MANS (200 stories per model). Table 4 presents the performance, which varies considerably across models. Ru-BERT only achieves a good correlation on stories with poor relatedness (e.g., Seq2Seq on WP). PPL (F) and UNION perform comparably, but neither does well in evaluating all the NLG models.

Table 4: Pearson correlation with human judgments, assessing generalization to outputs from different models, including Seq2Seq (S2S), Plan&Write (P&W), Fusion, GPT-2, and KG-GPT-2 (KG-G). The best performance among the metrics is highlighted in bold.
To assess the generalization to dataset drift, we first trained the metrics on ROC and then directly used them to evaluate stories from WP, and vice versa. As shown in Table 5, all the metrics drop significantly in correlation when applied to the other dataset, due to the differences in length and topic. PPL (F) and UNION suffer similar performance drops but are more generalizable. The results suggest that existing metrics fall short in generalization.

Ability to Judge Story Coherence
We assess the ability of the unreferenced metrics to judge story coherence based on the discrimination test set of AUTOS. We assign each test example a binary label (1/0 for a coherent/incoherent example). Then we calculate the correlation between metric scores and the binary labels on the test examples of each aspect; a higher correlation indicates a better ability to judge coherence. Table 6 presents the correlation results, which we summarize as follows: (1) PPL is ineffective at recognizing repetition errors. This observation is consistent with the results on MANS (Table 3). PPL (P) even has a significantly negative correlation with the labels for lexical and semantic repetition.
(2) PPL (F) and UNION have better average performance than the others. Ru-BERT performs worst in almost all aspects. UNION has the highest average performance by a large margin on ROC but underperforms PPL (F) on WP, indicating a shortage of UNION in evaluating longer stories. Besides, the results show that a powerful language model may also be a powerful evaluator (if its preference for repetitive texts can be alleviated).
(3) Existing metrics perform well in recognizing incoherence at the token and sentence levels. For example, they seem able to recognize unreasonable behavior of a certain character, and possess some commonsense knowledge about entity relations. However, the proposed perturbations cannot fully cover all possible incoherence in these aspects, which we leave as future work. (4) The metrics still struggle to recognize discourse-level incoherence. Specifically, it is difficult for them to recognize inconsistent events when we insert or delete negated words, and to understand semantic relatedness across sentences. Besides, they also lack inferential knowledge about causal and temporal relationships. These observations are consistent with the results in Table 3, where unrelated events and conflicting logic cannot be well recognized. In conclusion, the isolating behavioral tests reveal various issues of the existing metrics, even though the metrics achieve moderate correlation with human judgments on MANS.

Table 7: Pearson correlation with automatic labels on the invariance test set of AUTOS. A smaller absolute value of correlation indicates better robustness. The best performance is highlighted in bold and the second best is underlined. The numbers in the ROC/WP rows indicate how many human-written stories (Human) and incoherent samples from the discrimination test set (Dis) are perturbed.

Robustness Evaluation
A reliable metric should produce similar judgments for an example under simple perturbations of or attacks on the input. Therefore, it is essential to evaluate the robustness of metrics. We test robustness on the invariance test set of AUTOS. We assign each example a binary label (1/0 for the original/perturbed example), and then calculate the correlation between metric scores and the binary labels.
The original examples are sampled either from human-written stories or from the incoherent examples in the discrimination test set. Table 7 shows the robustness results. It is not surprising that Ru-BERT has the "best robustness", since the perturbations hardly influence the input-output relatedness. The result validates that relatedness is merely one side of evaluating NLG, but it does not mean that relatedness is a promising direction for developing robust metrics: a constant metric would be perfectly robust to any perturbation, but useless for evaluation. PPL is not robust to synonym substitution because the low-frequency words introduced by the perturbation (e.g., from "happy" to "joyful") can cause a significant change in PPL. UNION has better robustness on average thanks to the robust contextualized representation of BERT. Furthermore, both PPL and UNION perform better on contraction than on other aspects. However, they are very sensitive to even a small number of typos (less than 2% of the words) because typos may introduce out-of-vocabulary words. Although this issue is common to almost all (sub)word-based metrics, it is still important to handle typos since they are also common in human writing.
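For reference, a typo perturbation affecting under 2% of words can be sketched as follows; the exact corruption scheme (swapping two adjacent characters) is our assumption, not necessarily the one used to build AUTOS.

```python
# Corrupt a small fraction of words by swapping adjacent characters.
import random

def add_typos(story: str, rate: float = 0.02, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = story.split()
    n_typos = max(1, int(len(words) * rate))
    for i in rng.sample(range(len(words)), n_typos):
        w = words[i]
        if len(w) > 3:
            j = rng.randrange(len(w) - 1)  # swap characters j and j + 1
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)
```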

Conclusion
We present OpenMEVA, a benchmark to comprehensively assess the capabilities of metrics for evaluating open-ended story generation. OpenMEVA includes test examples that are created either by annotating machine-generated stories or by perturbing human-written stories in terms of a single aspect at a time. We evaluate a number of existing metrics on OpenMEVA and extensively analyze their performance on each capability. Experiments demonstrate that existing metrics still correlate weakly with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge, generalization ability, and robustness. Our study reveals the weaknesses of existing metrics and may inspire new research on designing NLG metrics.
The datasets, data augmentation tools, and implemented metrics in this paper can facilitate further research on language generation and evaluation. We would also like to thank the anonymous reviewers for their invaluable suggestions and feedback.

Ethics Statement
We build OpenMEVA based on two existing public story datasets, ROCStories (ROC) and WritingPrompts (WP), which are widely used for story generation and evaluation. We resorted to Amazon Mechanical Turk (AMT) for the manual annotation of stories in MANS. We did not ask about personal privacy or collect personal information of annotators in the annotation process. We hired five annotators and paid each annotator $0.05 and $0.1 for annotating each story in ROC and WP, respectively; we decided the payment according to the average story length of the two datasets. We admit that there may still be unpredictable bias in MANS even though we have asked three experts to review all the annotated stories.
Besides, we selected or constructed the test examples in AUTOS based on general linguistic features. We did not adopt any selecting strategies or perturbation techniques which may introduce extra bias into AUTOS.

A.1 Story Collection
Data Processing We collect machine-generated stories based on ROC and WP. To achieve better generation and generalization performance, we follow Guan et al. (2020) to delexicalize stories in ROC by masking all the names with placeholders, and we retain about 250 words (with correct sentence boundaries) from the beginning of each story in WP and truncate the rest.
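A minimal sketch of this preprocessing follows; name detection is simplified to a given name list and the placeholder string is illustrative (the paper's exact tooling may differ), and WP stories are cut at the last sentence boundary within about 250 words.

```python
# Delexicalize names and truncate stories at a sentence boundary.
def delexicalize(story: str, names, placeholder="[NAME]") -> str:
    for name in names:
        story = story.replace(name, placeholder)
    return story

def truncate_story(story: str, limit: int = 250) -> str:
    text = " ".join(story.split()[:limit])
    end = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
    return text[:end + 1] if end > 0 else text
```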
Story Generation After training, we use the generation models to generate stories based on the test sets of ROC and WP. We adopt nucleus sampling (Holtzman et al., 2020) with p = 0.9 for story generation to avoid as many repetition errors as possible, since such cases are easy to recognize and simulate (we cover repetition errors mainly with the test examples in AUTOS).
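For clarity, nucleus (top-p) sampling over a single next-token distribution can be sketched as follows; this is a generic illustration of the decoding strategy, not the authors' implementation.

```python
# Sample a token id from the smallest set of tokens whose cumulative
# probability mass covers p, after renormalizing within that set.
import numpy as np

def nucleus_sample(probs, p: float = 0.9, rng=None) -> int:
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # tokens by descending probability
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    nucleus = order[:cutoff]                 # smallest set covering mass p
    renormalized = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=renormalized))
```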

A.2 Manual Annotation
Repetitive plots (-2)
One day he decided to try a new recipe. He bought all the ingredients. He followed the recipe. It was the best sauce he ever tasted.
Unrelated events to the beginning (-1)
He decided to buy a banana. He picked up a big oak tree. He put it in the kitchen. He is happy with the watermelon.

Unrelated events to the beginning and within its context (-2)
He had a watermelon this morning. He wanted another one. He went to buy one. He didn't want to eat watermelons.

Conflicting logic (-1)
I buy a watermelon for him. It is pretty great for my dad. He doesn't like it. He finally asked me to be his girlfriend.

Conflicting logic (-2)
I had a watermelon when I was a child. I was feeding him fruits. I picked it up and put it in the house. He asked me to be his son.

A.3 Statistics
Krippendorff's α is 0.77 for ROC and 0.71 for WP, indicating moderate inter-annotator agreement according to the interpretation in Table 9. We present the distribution of human judgments for different models in Figure 3 and other statistics in Table 10. The results show the diversity of the stories in length and quality.
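For reference, a sketch of the agreement computation, assuming the `krippendorff` PyPI package and interval-level measurement (both assumptions are ours; the toy ratings matrix is illustrative only):

```python
# Rows are annotators, columns are stories; np.nan marks a missing rating.
import numpy as np
import krippendorff

ratings = np.array([[4, 2, 5, 3],
                    [4, 1, 5, np.nan],
                    [5, 2, 4, 3]], dtype=float)
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="interval")
```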
We also experiment with more metrics, including RUBER and the supervised metric BLEURT, which is fine-tuned on the released annotation results from Guan and Huang (2020). The experimental results are shown in Table 11.

B.1 Construction
We list some technical details for constructing AUTOS within different aspects as follows:

• Semantic Repetition and Paraphrases: We present several examples of paraphrase generation in Table 14. We adopt MoverScore and BLEU-1 to measure the semantic similarity and word overlap between the paraphrases and the original sentences, respectively. We finally only use the paraphrases whose MoverScore is larger than 0.4 and whose BLEU-1 is less than 0.6 with the original sentence, because they achieve both high semantic similarity and low word overlap.
• Character Behavior: We recognize the personal pronouns in a story following Table 13. We select stories that contain at least three types of person (i.e., at least three pronouns from different rows) as the coherent examples. When substituting the pronouns to create incoherent examples, we only perform the substitution within the same column (e.g., "my" can only be substituted with "our", "your", etc.) for better grammaticality.
• Consistency, Causal and Temporal Relationship: We present the negated words, causality-related words, and time-related words in Table 12.

B.2 Grammaticality Classifier
We train a binary classifier on the CoLA corpus (Warstadt et al., 2019) and present several cases in Table 15 to further indicate the usefulness of the classifier. We can see that the classifier can detect grammar errors of multiple kinds, such as wrong verb forms (e.g., "head" should be "heads" in case 1) and missing sentence elements (e.g., the predicate is missing in case 3). The classifier gives grammatical sentences high scores even when they are unreasonable in logic (e.g., the repetitive text in case 4 and the conflicting plot in case 5). Finally, we filter out about 21.69% and 50.15% of the examples for ROC and WP, respectively.
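A hedged sketch of such a grammaticality classifier follows; the model choice (BERT-base) and scoring function are illustrative, not the authors' exact setup, and the checkpoint is assumed to have been fine-tuned on CoLA's grammatical/ungrammatical labels beforehand.

```python
# Score a sentence with the probability of the 'grammatical' label
# (CoLA label 1) under a BERT-based binary classifier.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
model.eval()

def grammaticality_score(sentence: str) -> float:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```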

B.3 Statistics
We show the statistics of the discrimination test set and the invariance test set of AUTOS in Table 16 and Table 17, respectively. In these tables, Input and Story denote the average number of tokens in the inputs and stories; Human and Dis denote the human-written coherent stories and the incoherent samples (sampled from the discrimination test set) to be perturbed, respectively.

Table 12: Causality-related words: so, because, since, therefore, why, cause, reason, result, effect, purpose, aim, sake, consequence, causal. Time-related words: after, before, previously, simultaneously, currently, meanwhile, then, now, ever, again, once, anytime, when, while, never, always, usually, often, sometimes, early, lately, already, forever, ago, yesterday, today, tomorrow, ending, beginning, previous, simultaneous, current, temporary, contemporary, temporal, second, minute, hour, day, month, year, century, past, future, present, delay, night, evening, morning, afternoon, noon.