DeltaScore: Fine-Grained Story Evaluation with Perturbations

Numerous evaluation metrics have been developed for natural language generation tasks, but their effectiveness in evaluating stories is limited, as they are not tailored to assess intricate aspects of storytelling such as fluency and interestingness. In this paper, we introduce DELTASCORE, a novel methodology that employs perturbation techniques for the evaluation of nuanced story aspects. Our central proposition is that the extent to which a story excels in a specific aspect (e.g., fluency) correlates with the magnitude of its susceptibility to a particular perturbation (e.g., the introduction of typos). Accordingly, we measure the quality of an aspect by calculating the likelihood difference between pre- and post-perturbation states using pre-trained language models. We compare DELTASCORE with existing metrics on storytelling datasets from two domains across five fine-grained story aspects: fluency, coherence, relatedness, logicality, and interestingness. DELTASCORE demonstrates remarkable performance and reveals a surprising finding: a single perturbation proves highly effective in capturing multiple aspects.


Introduction
The emergence of large pre-trained language models (PLMs) (Zhao et al., 2023) has empowered story generation models to produce plausible narratives (Tan et al., 2021; Zhang et al., 2022b; Yang et al., 2022). The most advanced models can produce stories that are not easily distinguishable from human-authored ones (Karpinska et al., 2021; Dou et al., 2022; Xie et al., 2023). However, the development of automated evaluation metrics in this domain has not progressed at the same pace (Guan et al., 2021b). Human evaluation, though considered the gold standard, is hindered by its time-consuming, costly, and non-reproducible nature (Sai et al., 2023). Consequently, there is a demand for better automatic methods for evaluating story quality.

Figure 1: Scenarios where higher quality stories (top) are affected more than lower quality ones (bottom) by aspect-specific perturbations (fluency: "Add typos"; relatedness: "Remove relevant words"). The generative likelihood of the original/perturbed story is shown in the blue/green circle, and the DELTASCORE value in the orange circle.
The prevailing evaluation metrics for story assessment have primarily been adapted from other natural language generation (NLG) tasks, such as BLEU (Papineni et al., 2002) for machine translation or ROUGE (Lin, 2004) for summarization. Recent progress has produced new metrics explicitly tailored for story evaluation, with a focus on quantifying story coherence (Guan and Huang, 2020; Ghazarian et al., 2021) or capturing human preferences (Chen et al., 2022). Other works have directly used the likelihood of a story under a PLM (Vaswani et al., 2017; Han et al., 2022) or its conditional likelihood based on human references or other contextual factors, such as the story title (Thompson and Post, 2020; Yuan et al., 2021). Nonetheless, these approaches typically yield a single score that estimates overall quality. Chhun et al. (2022) contend that the quality of a story comprises various fine-grained aspects, such as fluency and adherence to commonsense, suggesting that an overall quality score has limited utility for comprehensive story evaluation.
In this paper, we present DELTASCORE, a method that evaluates story quality by measuring the likelihood difference under a PLM between an original story and its perturbed version. The underlying idea is that higher quality stories will be affected more by the perturbation than lower quality ones. To provide fine-grained assessment of story quality, we experiment with perturbations that target specific aspects. Figure 1 presents two examples to illustrate the intuition of our approach: 1) when we introduce random typos to modify the two stories shown in Figure 1a, the story with higher fluency is affected more by the perturbation; 2) when we modify the two stories in Figure 1b by removing relevant words, the perturbation affects the story that is more closely associated with the title to a greater extent. Our empirical analysis demonstrates the superior performance of DELTASCORE compared to existing metrics in evaluating intricate story aspects. Furthermore, our investigation reveals an interesting discovery: one of our simplest perturbation methods, which simply shuffles all the words in the story, is remarkably effective at capturing multiple aspects. This points to a possible interpretation that it may be functioning as a normalisation factor that modulates the effects of word frequency and text length.

Automatic Evaluation Metrics
Existing automatic evaluation metrics can be broadly categorized into three paradigms.
Discriminative metrics typically involve training a discriminator model to differentiate between high-quality and low-quality texts; they include UNION (Guan and Huang, 2020), MANPLTS (Ghazarian et al., 2021), CTC (Deng et al., 2021), StoryER (Chen et al., 2022), and UNIEVAL (Zhong et al., 2022). Specifically, UNION constructs negative samples from original stories using heuristic rules and trains a discriminator to differentiate them. MANPLTS extends UNION by constructing improved negative samples: it manipulates storylines and generates alternate stories from the manipulated storylines using a story generation model. StoryER builds a classifier to learn human preference by training it to differentiate highly-upvoted stories from lowly-upvoted ones on Reddit. CTC treats evaluation as an information alignment task. UNIEVAL frames evaluation as question answering, where different questions are asked to assess particular aspects.
Generative metrics usually rely on generative likelihood to determine the quality of the text; they include BARTScore (Yuan et al., 2021), T5Score (Qin et al., 2022), and GPTScore (Fu et al., 2023). Specifically, BARTScore evaluates generated text by calculating its conditional likelihood under BART. GPTScore calculates the likelihood of the story under a PLM with an additional prefix to target a particular aspect. T5Score combines both paradigms, employing generative training with the standard negative log-likelihood loss and discriminative training with a contrastive loss where human judgments of generation quality are available.

Natural Text Perturbation
The use of perturbations is a conventional technique to generate negative samples for both discriminative (Guan and Huang, 2020) and generative (Zhong et al., 2022) tasks. Ribeiro et al. (2020) propose CheckList, a suite of perturbation techniques to evaluate the behavioral performance of NLP models. Sai et al. (2021) further apply perturbations to assess the robustness of NLG evaluation metrics, while Karpinska et al. (2022) focus specifically on machine translation evaluation. He et al. (2022) also develop perturbation tests to identify blind spots of model-based evaluation metrics. Notably, all of these perturbations rely on heuristic rules. In contrast, recent adversarial attacks, such as those proposed by Li et al. (2020) and Morris et al. (2020), use language models to generate adversarial examples, which can also be considered a form of text perturbation. In our work, we explore perturbation for a different purpose: to evaluate fine-grained story qualities.

DELTASCORE
We now describe the idea of our approach. Given a story condition (e.g., a story title) c = c_1, ..., c_n containing n tokens, a model-generated story s = s_1, ..., s_m containing m tokens, and a perturbed story s' = s'_1, ..., s'_{m'} containing m' tokens, DELTASCORE calculates the likelihood difference under a language model:

$$\mathrm{DeltaScore}(s, s', c) = \log p(s \mid c) - \log p(s' \mid c),$$

where p(s|c) represents the likelihood of s conditioned on c under a language model. In our experiments, we investigate several PLMs with varying architectures (§ 3.1) and perturbation techniques that are designed to target specific aspects (§ 3.2).
Denoting the language model parameters as θ, we compute DELTASCORE as follows for encoder-decoder PLMs:

$$\mathrm{DeltaScore}(s, s', c) = \sum_{t=1}^{m} \log p(s_t \mid s_{<t}, c; \theta) \;-\; \sum_{t=1}^{m'} \log p(s'_t \mid s'_{<t}, c; \theta),$$

where t denotes the timestep in the sequence and s_{<t} denotes all tokens before the current timestep. Intuitively, the story condition c is captured by the encoder, and the likelihood of the story s is produced by the decoder.
For decoder-only PLMs, we concatenate c and s to form a sequence x = x_1, ..., x_{n+m} = c_1, ..., c_n, s_1, ..., s_m (and analogously x' for the perturbed story s') to compute DELTASCORE:

$$\mathrm{DeltaScore}(s, s', c) = \sum_{t=n+1}^{n+m} \log p(x_t \mid x_{<t}; \theta) \;-\; \sum_{t=n+1}^{n+m'} \log p(x'_t \mid x'_{<t}; \theta).$$

This formulation means we feed the full sequence, including the story condition c and story s, as input to the decoder-only PLM, although when computing the story likelihood we only consider the conditional probabilities of the s tokens.
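To make the decoder-only formulation concrete, the following is a minimal sketch (not our released implementation) of DELTASCORE using the Hugging Face transformers library; GPT-2 stands in here for the larger models used in our experiments, and the boundary between condition and story tokens is handled approximately.

```python
# Minimal sketch of DELTASCORE with a decoder-only PLM (illustration only;
# GPT-2 is a stand-in for larger models such as OPT or LLaMA).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def story_log_likelihood(condition: str, story: str) -> float:
    """Sum of token log-probabilities of `story` given `condition`, i.e. log p(s | c)."""
    # Tokenizing the condition separately approximates the condition/story
    # boundary; a careful implementation would track token offsets instead.
    cond_ids = tokenizer(condition, return_tensors="pt").input_ids
    full_ids = tokenizer(condition + " " + story, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    n_cond = cond_ids.shape[1]
    total = 0.0
    for t in range(n_cond, full_ids.shape[1]):
        token_id = full_ids[0, t]
        # Logits at position t-1 predict the token at position t.
        total += log_probs[0, t - 1, token_id].item()
    return total

def delta_score(condition: str, story: str, perturbed_story: str) -> float:
    """DELTASCORE = log p(s | c) - log p(s' | c)."""
    return story_log_likelihood(condition, story) - story_log_likelihood(condition, perturbed_story)
```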

Perturbations on Story Aspects
We follow Xie et al. (2023) to assess five fundamental aspects of story quality: fluency, coherence, relatedness, logicality, and interestingness. To this end, we survey perturbation methods from the literature (Ribeiro et al., 2020; Sai et al., 2021; Guan et al., 2021b; He et al., 2022) and attempt to align them to one of these five aspects. For some aspects, we also propose new perturbation methods. We now describe each aspect and its associated perturbation methods; a summary of these methods with examples is given in Table 1.
Fluency assesses the readability of sentences in the story. Perturbations targeting fluency modify the text at the word or phrase level. We use two perturbation approaches from Ribeiro et al. (2020): 1) Typo, where we randomly transpose a character with an adjacent one in the text, and 2) Subject-verb disagreement (SubjVerbDis), where we modify the verbs in a sentence so that they no longer agree with their subjects.
Coherence assesses the level of connectivity between sentences in the story. Perturbations targeting coherence modify the text at the sentence level. We use two perturbation approaches from Sai et al. (2021): 1) Jumble, where we randomly shuffle words within the story, and 2) Sentence Reorder (SentReorder), where we randomly shuffle the sentences within the story.
Relatedness focuses on the extent to which the story is relevant to the given condition (e.g., story title). Perturbations targeting relatedness alter the story to reduce its association with its condition. We propose two new methods: 1) Remove Relevant Words (RmRelWords), where we use ChatGPT to identify words related to the given title and then remove them from the story, and 2) Story Replacement (StoryReplace), where we substitute the original story with another story from a different story condition. To select a "comparable" story, we choose a story whose likelihood is similar to that of the original story (we calculate the likelihood of the original story and each candidate story without considering their story conditions).

Logicality focuses on the extent to which the story complies with commonsense. Perturbations targeting logicality introduce elements into the story that contradict commonsense. We adopt one approach from Guan et al. (2021b): Antonym, where we randomly replace words with their antonyms; and propose a new approach: Commonsense, where we use ChatGPT to modify some story elements so that they violate commonsense.
Interestingness measures the degree of predictability in the progression of events within a story, representing a highly subjective aspect. We propose one approach: Blander Narrative, where we use ChatGPT to modify a story to make the narrative less interesting. The instructions given to ChatGPT (we use the OpenAI API with the model gpt-3.5-turbo) for the aforementioned perturbations are detailed in Appendix A. For Typo, Jumble, and Antonym, we can control the degree of perturbation; this parameter is tuned in § 5.1.
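For illustration, a minimal sketch of the heuristic perturbations Typo, Jumble, and SentReorder follows (the ChatGPT-based perturbations instead rely on the instructions in Appendix A); the exact sampling details in our experiments may differ, and the degree argument corresponds to the perturbation degree tuned in § 5.1.

```python
# Illustrative rule-based perturbations (sketch; exact sampling may differ).
import random

def typo(text: str, degree: float = 0.4, seed: int = 0) -> str:
    """Transpose a character with an adjacent one in a fraction of words."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 2 and rng.random() < degree:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def jumble(text: str, degree: float = 0.9, seed: int = 0) -> str:
    """Randomly shuffle a fraction of word positions within the story."""
    rng = random.Random(seed)
    words = text.split()
    idx = [i for i in range(len(words)) if rng.random() < degree]
    shuffled = idx[:]
    rng.shuffle(shuffled)
    out = words[:]
    for src, dst in zip(idx, shuffled):
        out[dst] = words[src]
    return " ".join(out)

def sent_reorder(text: str, seed: int = 0) -> str:
    """Randomly shuffle the sentences within the story (naive '.' splitting)."""
    rng = random.Random(seed)
    sents = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sents)
    return ". ".join(sents) + "."
```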

Benchmarks
We use the generated stories and human ratings collected by Xie et al. (2023) on two story datasets: ROC, and WP (Fan et al., 2018; longer fictional stories written by users on Reddit). The story condition (c) for ROC is the leading sentence; for WP, it is the short paragraph that describes the idea of the story, called the "prompt". We present two example stories from the two datasets in Table 2.

Xie et al. (2023) experiment with six story generation models, covering large models with prompt-based learning (e.g., GPT-3), smaller fine-tuned models (e.g., BART), and other methods that incorporate planning and commonsense (Xu et al., 2020; Guan et al., 2020, 2021a; Tan et al., 2021). They then conduct human evaluation on the five aspects, judged on an ordinal scale from 1 (worst) to 5 (best). Two distinct groups of annotators were recruited: in-house PhD students and crowdworkers. The results obtained from the two groups were similar, indicating the robustness and reliability of the annotation process. The judgments from the first group are used for preliminary exploration of optimal settings, such as assessing the effectiveness of perturbation methods and language models (§ 5.1). The judgments of the second group are used for the final comparison of our approach with existing evaluation metrics (§ 5.2).

Language Models
We select a set of representative PLMs to compute DELTASCORE. For encoder-decoder PLMs, we use BART and FLAN-T5 (Chung et al., 2022). For decoder-only PLMs, we use BLOOM (Scao et al., 2022), LLaMA (Touvron et al., 2023), OPT (Zhang et al., 2022a), and GPT-3.5. We use the largest available variant of each model, as we found in preliminary experiments that larger models tend to work better. We present a summary of these models in Table 3.

Compared Evaluation Metrics
To comprehensively compare DELTASCORE with other existing evaluation metrics, we select representative evaluation metrics from each of the three categories mentioned in § 2.1.
For similarity metrics, we run experiments with BLEU, BERTScore, and MoverScore. For discriminative metrics, we use UNION, MANPLTS, StoryER, CTC, and UNIEVAL. Since UNION, MANPLTS, and StoryER are all originally designed for story evaluation, we use their released models without fine-tuning. For CTC, we use the reference-free alignment approach, also called "consistency" in the original paper. For UNIEVAL, the question answering models are trained on text summarization and dialogue generation tasks; we modify the questions to adapt UNIEVAL for evaluating different aspects of stories, as the authors demonstrate its zero-shot transfer capability (see Appendix B for our questions). For generative metrics, we select BARTScore and GPTScore. We use the reference-free version of BARTScore (i.e., c → s), and employ text-davinci-003 from OpenAI as the backbone of GPTScore, with specific prompts for different story aspects (see Appendix C).

Table 3: Summary of the pre-trained language models ("BT" = billion tokens; "TT" = trillion tokens; "LM" indicates a causal language modeling objective).
Table 4: Statistics of compared evaluation metrics. "FT" indicates whether the metric requires additional synthetic data to fine-tune on. "B/F" indicates whether the metric is reference-based (B) or reference-free (F). "ST" indicates whether the metric is originally designed for story evaluation. "MS" indicates whether the metric produces scores that consider multiple aspects.
We summarise all these metrics in Table 4, showing whether they require additional training or a ground-truth reference, are originally introduced for story evaluation, and can measure fine-grained story aspects.

Results
We evaluate Kendall correlation at the story level, which involves comparing the predicted metric score against the aggregated human rating for each story on a specific aspect. We use this as our primary metric due to the non-linear relationship between automatic and human metrics, as well as the ordinal scale employed in human judgments (Kendall, 1938). We explore different settings of our approach in § 5.1 and present a comparison of our best approach with existing evaluation metrics in § 5.2. Note that we use two different sets of judgments, as explained in § 4.1, to avoid tuning and testing on the same data.
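As a sketch of this procedure (with hypothetical numbers, not data from our experiments), story-level Kendall correlation can be computed with scipy.stats.kendalltau; we report the absolute value |τ|.

```python
# Story-level Kendall correlation between metric scores and human ratings.
from scipy.stats import kendalltau

def story_level_kendall(metric_scores, human_ratings):
    """Return |tau| and p-value between metric scores and aggregated ratings."""
    tau, p_value = kendalltau(metric_scores, human_ratings)
    return abs(tau), p_value

# Hypothetical example: five stories scored on one aspect (e.g., coherence).
metric_scores = [1.8, 0.4, 2.6, 1.1, 3.0]   # e.g., DELTASCORE values
human_ratings = [3, 2, 4, 3, 5]             # 1-5 ordinal judgments
tau, p = story_level_kendall(metric_scores, human_ratings)
print(f"|tau| = {tau:.3f} (p = {p:.3f})")
```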

Preliminary Exploration
Perturbation Methods We begin by comparing the performance of various perturbation methods (§ 3.2) against human judgments across the five aspects, as shown in Table 5. For this analysis, we employ LLaMA as the PLM. The notation "w/o perturbation" denotes calculating the story likelihood directly under LLaMA, without any perturbation applied. Our findings are intriguing. Notably, perturbations specifically designed to target a particular aspect do not consistently exhibit a higher correlation with human judgments for that aspect. Furthermore, our analysis indicates that measuring interestingness is particularly challenging, as the correlations for this aspect are generally lower than for the other aspects. Finally, our last and perhaps most surprising observation is that a small set of perturbation methods, namely Typo, Jumble, and Antonym, exhibits strong performance in evaluating most aspects. Based on this finding, we concentrate on these three methods in our subsequent experiments.
Perturbation Degree We next investigate the impact of perturbation degree using the top three performing perturbation methods and present the results for ROC and WP in Figure 2. As before, we use LLaMA as the PLM and focus on evaluating coherence. Interestingly, Typo appears to be relatively stable and unaffected by the perturbation degree, whereas Jumble and Antonym work better with more aggressive perturbation. Based on these results, we set the perturbation degree to 0.4, 0.9, and 0.8 for Typo, Jumble, and Antonym respectively, for both ROC and WP.
Language Models We next present DELTASCORE results using different PLMs in Table 6. We use the top three performing methods with the optimal degrees determined in our previous analysis. Encouragingly, across different PLMs and story aspects, we see that DELTASCORE outperforms vanilla likelihood ("w/o perturbation") in almost all instances, suggesting that measuring story quality via likelihood difference is generally a better approach than using the likelihood directly. Broadly speaking, Jumble is the most consistent perturbation method: in ROC it is consistently the best performer, while in WP it is either the best or second-best performer, depending on the PLM. This observation aligns with the findings in Table 5, providing further confirmation that the Jumble perturbation is effective for measuring various story aspects. When examining the correlation magnitudes for different story aspects, it is evident that interestingness consistently exhibits lower values, reaffirming its inherent difficulty to measure. There are, however, some curious exceptions: in ROC the correlation for fluency and relatedness is particularly low. We do not have a strong hypothesis for these observations, but note that the language of ROC stories is somewhat formulaic and possibly different from the language of the pre-training data. For relatedness, the story condition in ROC is the first sentence, which is a rather artificial condition for setting the "topic" of story generation. A notable and expected observation is that larger models tend to exhibit stronger correlations, with GPT-3.5 and OPT performing the best among the PLMs. BLOOM and FLAN-T5 fall in the middle range, while BART shows the lowest correlation scores. Comparing GPT-3.5 and OPT, we observe a slight advantage for OPT despite its smaller model size and pre-training data. This finding suggests that beyond a certain scale, the benefits of further scaling may become less significant.

Comparison with Other Metrics
We proceed to compare DELTASCORE with other evaluation metrics in Figure 3. Note that in this comparison we use OPT as the PLM, given its superior performance, along with the same top-performing perturbation methods. The results are highly promising: DELTASCORE consistently outperforms all competitor metrics across all story aspects. Notably, Jumble stands out as the most effective perturbation method among the three. The similarity metrics generally exhibit the lowest performance, highlighting the inadequacy of reference-based metrics for story evaluation, which aligns with previous findings (Guan and Huang, 2020; Xie et al., 2023). Among the discriminative metrics, CTC and UNIEVAL are relatively competitive, although they still fall behind DELTASCORE. The performance of the generative metrics is inconsistent: GPTScore shows strong performance in evaluating logicality and interestingness, especially in ROC, where it performs similarly to DELTASCORE, but its effectiveness is limited in other scenarios.

Discussion and Conclusion
Initially, our aim was to investigate various types of perturbations for assessing fine-grained aspects of storytelling. Surprisingly, we found that one of the simplest perturbation methods, Jumble, is exceptionally effective at measuring most aspects. Observing the performance of each metric across different aspects in Figure 3, we noticed that there is not a significant disparity in the results, suggesting a potential inter-correlation among the aspects. However, the effectiveness of the Jumble perturbation in capturing multiple aspects remains unexplained. Another hypothesis is that Jumble functions as a normalisation factor that modulates word frequency and sentence length effects. This aligns with the research of Lau et al. (2020), who explored the use of language model probabilities for assessing sentence acceptability. They highlighted the importance of normalising these probabilities and introduced various normalisation techniques, including the use of unigram language models, to mitigate the impact of word frequency. The likelihood of a shuffled word sequence, as produced by Jumble, can be seen as a form of normalisation that incorporates word frequency and sentence length effects without requiring an additional language model. This observation implies that DELTASCORE with the Jumble perturbation may have broader applications beyond the evaluation of story quality, for instance in sentence-scoring scenarios such as machine translation and abstractive summarization.
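To make this interpretation explicit (a sketch of the hypothesis, not a claim we verify here), the Jumble-based score can be written so that the perturbed likelihood plays the role of a normalisation term, loosely analogous to the unigram normalisation of Lau et al. (2020):

$$\mathrm{DeltaScore}_{\mathrm{Jumble}}(s, c) = \log p(s \mid c) - \log p(\mathrm{shuffle}(s) \mid c),$$

where log p(shuffle(s) | c) is largely determined by the frequencies of the words in s and by its length, since word order is destroyed; subtracting it therefore discounts these factors from log p(s | c).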
In conclusion, we propose DELTASCORE, a novel approach for assessing fine-grained story aspects by comparing the likelihood difference between an original story and a perturbed version under a pre-trained language model. Surprisingly, we find that a small set of perturbation methods excels at measuring the majority of story aspects. Furthermore, our results demonstrate that DELTASCORE exhibits stronger correlations with human judgments than a range of existing metrics across two different story domains.

Limitations
Our study investigates a constrained range of perturbations for evaluating stories; other perturbations beyond the scope of our analysis may prove more effective. While the current paper focuses on applying the perturbation method specifically to story evaluation, we recognize its potential for adaptation to other text generation tasks, such as machine translation and abstractive summarization, using Jumble or alternative perturbation techniques, paving the way for promising avenues of future investigation.
Perturbation "Add typos" affects the highly fluent story (top) more than the less fluent one (bottom).
Two stories are conditioned on the same title "I always go to the local supermarket".Perturbation "Remove relevant words" affects the highly related story (top) more while not affect the unrelated one (bottom).

Figure 3: Absolute value of story-level Kendall correlation (|τ|) between different metrics and crowdworker ratings. Higher bars indicate better performance. Red bars indicate DELTASCORE; blue bars, similarity-based metrics; green bars, discriminative metrics; purple bars, generative metrics.

Table 2: Sampled examples of a given story condition and its generated story for each dataset.
ROC — condition: "[FEMALE] dad took me fishing ."; story: "we sat in a spot and waited for days ..."
WP — condition: "tell me a story where the first line and last line ..."; story: "as i walked into the house , i was assailed by the smell of aging ..."
Figure 2: Impact of perturbation degree with LLaMA on in-house judgements for measuring coherence.

Table 6: Absolute value of story-level Kendall correlation (|τ|) between different metrics and in-house judgements. We bold the best scores for each aspect and highlight instances where DELTASCORE improves over vanilla likelihood ("w/o perturbation").