StoryER: Automatic Story Evaluation via Ranking, Rating and Reasoning

Existing automatic story evaluation methods place a premium on story lexical-level coherence, deviating from human preference. We go beyond this limitation by considering a novel story evaluation method that mimics human preference when judging a story, namely StoryER, which consists of three sub-tasks: Ranking, Rating and Reasoning. Given either a machine-generated or a human-written story, StoryER requires the machine to output 1) a preference score that corresponds to human preference, 2) specific ratings and their corresponding confidences, and 3) comments for various aspects (e.g., opening, character-shaping). To support these tasks, we introduce a well-annotated dataset comprising (i) 100k ranked story pairs; and (ii) a set of 46k ratings and comments on various aspects of the story. We finetune Longformer-Encoder-Decoder (LED) on the collected dataset, with the encoder responsible for preference score and aspect prediction and the decoder for comment generation. Our comprehensive experiments establish a competitive benchmark for each task, showing high correlation to human preference. In addition, we observe that joint learning of the preference scores, the aspect ratings, and the comments brings gains to each single task. Our dataset and benchmarks are publicly available to advance research on story evaluation tasks.


Introduction
Even for humans, evaluating story quality is a challenging task. Although many literary criteria have been proposed, the most straightforward way is to count how many readers like the story, which is referred to as human preference. Bearing this in mind, the story writing community usually uses upvote count as a story quality criterion. As shown in Fig. 1, more readers like the left story (upvote count = 1.8k) than the right one (upvote count = 1).

Figure 1: The existing story evaluation method (UNION) outputs a score for estimating the coherence of the stories, while human-written stories rarely suffer from this problem. Our model (Ours), which is trained by comparing two stories (Ranking), evaluates the story based on human preference (i.e., upvote counts), produces scores for various aspects (Rating), and leaves comments (Reasoning). Our model is applicable to both machine-generated and human-written stories.
We build a model upon Longformer-Encoder-Decoder (LED) (Beltagy et al., 2020), where the encoder predicts the preference score (Ranking) and aspect ratings and confidences (Rating), while the decoder generates the comments (Reasoning). Inspired by the widely-used pairwise comparison in story evaluation, we train our model with ranking objectives. In this way, the score margin between good and poor stories is enlarged, resulting in a high correlation between human preference and our predicted preference score (Fig. 1). We also observe that our performance improves when we conduct joint training on the three subtasks.

Figure 2: The WritingPrompt dataset with metadata (left) contains prompt, story, upvotes, and comments from readers. Our dataset collection pipeline (right) shows the template for data collection. We ask the workers to select 3-5 aspects, score each aspect 1-5 from poor to good, and leave comments that explain the reason for the score they rated.
To support the proposed task, we present a well-annotated crowd-sourced dataset consisting of two parts. (i) One is built from 63,929 stories and their corresponding upvote counts provided in the WritingPrompt dataset (WP) (Fan et al., 2018) (Figure 2 (left)) by pairing one highly-upvoted story (upvotes ≥ 50) and one lowly-upvoted story (upvotes ≤ 0) under the same prompt. As a result, we obtain 100k pairs of stories, namely the 100k story ranking data, used to train and evaluate preference score prediction. (ii) The other part is made up of 45,948 aspect comments and their respective rating scores (1-5) collected via Amazon Mechanical Turk (AMT) and augmented data (Section 3.2), namely the 46k aspect rating and reasoning data, used for model explanation.

Our contributions are three-fold:
• This study addresses a novel task, StoryER, that consists of preference score prediction, aspect rating and comment generation.
• We introduce a new dataset for the StoryER task and create benchmarks to promote story evaluation research.

• Comprehensive experiments and intensive analysis indicate our preference score prediction outperforms previous metrics and more accurately reflects human preference. Aspect rating and comment generation also help in the evaluation and provide explanations. Moreover, we point out the remaining challenges under various scenarios in the hope of facilitating future research.

Related work
Overlap-based metrics such as BLEU (Sulem et al., 2018) and ROUGE (Lin, 2004) calculate lexical matches (i.e., n-gram matching) and reward the words that resemble the reference in their surface form, even if they do not accurately capture meaning, while penalizing other paraphrases. Recent research (Edunov et al., 2020) indicates that these metrics do not reflect human preferences, particularly for open-ended text generation tasks.
Neural-based metrics are motivated by the success of transformers as multitask learners (Vaswani et al., 2017), and adapt them for the task of neural language evaluation. Compared with overlap-based metrics, BERTScore (Zhang et al., 2019), MoverScore (Zhao et al., 2019) and BLEURT (Sellam et al., 2020) report stronger correlations with human judgment. For specific uses, in open dialogue generation, Adem (Lowe et al., 2017) captures semantic similarity beyond word overlap statistics, and exploits both the context and the reference response to calculate its score for the model response. RUBER (Tao et al., 2018) and its variant RUBER-BERT (Ghazarian et al., 2019) evaluate a reply by taking into consideration both a ground-truth reply and a query, without requiring labels of human satisfaction, and can be extended to different datasets and languages.
Neural discriminators are proposed particularly for story evaluation. The metrics mentioned above show limited performance in story evaluation, as demonstrated in Guan et al. (2021). UNION (Guan and Huang, 2020) trains a BERT-based model (Devlin et al., 2019) to distinguish human-written stories from automatically constructed negative samples. The coherence score it produces can be expressed as the probability of the story being identified as a human-written story. In this paper, we require our model to follow human preference, not only coherence, which we believe is a more general way of evaluating stories.

Dataset
Our dataset comprises two parts: 100k story ranking, and 46k aspect rating and reasoning.

100k Story Ranking Data
As mentioned above, the ranking method is more flexible and performs better than discrimination when evaluating stories (we also compare them experimentally in Sec. F.1). We thus prepare 100k pairwise ranking examples for training the model. To this end, we first collect 193,842 stories posted prior to 03/2020 from WP along with their prompt, number of upvotes and uncategorized comments. We remove the stories updated from 12/2019 to 03/2020, since newly-updated stories usually have few upvotes regardless of whether they are good or bad. Then, we exclusively keep stories with a word count between 200 and 800. Finally, we pick two stories from the same prompt, one highly upvoted (i.e., upvotes ≥ 50) and one lowly upvoted (i.e., upvotes ≤ 0), resulting in a total of 63,929 unique stories and 116,971 story pairs. We split the story pairs based on the prompts into training, validation and testing (Table 1), to ensure that each division receives a unique set of prompts.
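The pairing procedure described above can be sketched as follows. This is a simplified illustration, not the actual pipeline: the story records are assumed to be dicts with 'prompt', 'text', and 'upvotes' keys standing in for the WP metadata.

```python
from collections import defaultdict

def build_ranking_pairs(stories, high_thresh=50, low_thresh=0,
                        min_words=200, max_words=800):
    """Pair highly- and lowly-upvoted stories under the same prompt.

    Thresholds follow the paper's description (upvotes >= 50 vs. <= 0,
    word count 200-800); the data schema here is a hypothetical stand-in.
    """
    by_prompt = defaultdict(lambda: {"high": [], "low": []})
    for s in stories:
        n_words = len(s["text"].split())
        if not (min_words <= n_words <= max_words):
            continue  # keep stories with 200-800 words only
        if s["upvotes"] >= high_thresh:
            by_prompt[s["prompt"]]["high"].append(s)
        elif s["upvotes"] <= low_thresh:
            by_prompt[s["prompt"]]["low"].append(s)

    pairs = []
    for group in by_prompt.values():
        # every (high, low) combination under the same prompt forms a pair,
        # which is why 63,929 unique stories can yield 116,971 pairs
        for h in group["high"]:
            for l in group["low"]:
                pairs.append((h, l))
    return pairs
```

Splitting by prompt (rather than by pair) then guarantees that no prompt leaks across the train/validation/test divisions.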

46k Aspect Rating and Reasoning Data
Apart from the preference score, we require our model to provide ratings and comments on predefined aspects to aid in explaining the predicted preference score.

Aspect category extraction. To begin with, we must determine which aspects of the content should be measured. As some readers leave comments to explain why they upvote or downvote stories, a straightforward way is to extract aspect categories from those uncategorized comments. We therefore adopt latent Dirichlet allocation (LDA), which models documents with a certain number of topics based upon the co-occurrence of individual words. More precisely, we follow Brody and Elhadad (2010) and treat each comment as a separate document. LDA produces a distribution of word frequencies for each topic. We optimize LDA through a cluster validation scheme and obtain an optimal number of 10 aspects. Based on the most representative words in each topic, we manually name each topic as an aspect category. These aspect categories are defined using widely used aspects inspired by writing websites.
Comment and aspect collection. Comments in the WP metadata are neither categorized by aspect nor labeled with ratings, and some of them are totally irrelevant to the content. More importantly, there is a bias towards positive comments, which implies that few readers are willing to leave comments on poor stories. Therefore, we collect new comments via crowd-sourcing. By learning from these well-annotated comment data, we train neural models to filter out noisy data from the comments in the WP metadata. To collect the data, we ask workers from AMT to select aspects, rate sentiment and leave comments on 5,964 unique stories from WP. To increase the diversity of comments, some stories are allocated to two different annotators, resulting in a total of 9,112 submissions (i.e., 1.53 annotations/story). As shown in Figure 2 (right), each story requires the annotators to rate (normalized to 0-1) and leave comments on the 3 to 5 aspects the workers are most confident about. The final statistics of the comments are listed in a table reporting the number of comments with rating scores (2nd and 3rd columns), averaged rating scores (4th and 5th columns) and averaged word counts (6th and 7th columns).
These filtering models are trained with our collected data. The training details can be found in the supplementary material. We filter out irrelevant comments by eliminating those for which no aspect category score exceeds 0.9 after softmax, and retain comments with a word count ranging from 15 to 50. The remaining comments are then rated by their sentiments. Finally, we obtain 17,849 valuable comments for 6,705 additional unique stories and merge them into our collected data, resulting in a total of 45,948 comments for 12,669 unique stories. We split the collected data into training, validation, and test data in the ratio of 8:1:1 and put the augmented data into the training data (Table 2).
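The confidence-threshold and length filters described above can be sketched in plain Python. The classifier interface is a hypothetical stand-in: each comment is assumed to arrive with aspect logits from the aspect-category model, and only the thresholds (0.9 after softmax, 15-50 words) come from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_comments(comments, threshold=0.9, min_words=15, max_words=50):
    """Keep comments with one confident aspect and an acceptable length.

    `comments` is a list of (text, aspect_logits) tuples; the logits are
    assumed to come from the aspect-category classifier described above.
    Returns (text, predicted_aspect_index) pairs.
    """
    kept = []
    for text, logits in comments:
        probs = softmax(logits)
        if max(probs) <= threshold:
            continue  # no aspect category score exceeds 0.9 -> irrelevant
        n_words = len(text.split())
        if min_words <= n_words <= max_words:
            kept.append((text, probs.index(max(probs))))
    return kept
```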

StoryER

Task Definition
Given a story $s$, the task is to output a set $\{p_s, a^c, a^r, c\}$, where $p_s$ denotes the preference score of the story $s$, which is used for comparing story quality. For more explicit explanation, we further output confidence scores $a^c = \{a^c_k\}_{k=1}^{K}$, aspect ratings $a^r = \{a^r_k\}_{k=1}^{K}$, and comments $c = \{c_k\}_{k=1}^{K}$ for $K$ aspects ($K = 10$ in our experiments). Confidence scores $a^c$ reflect the likelihood of utilizing the specific aspects as measures, as some aspects (e.g., horror) are not applicable to some stories (e.g., a comic story). Aspect ratings $a^r_k$ are considered as the scores of each aspect. Comments $c$ demonstrate the reason that the reader upvotes/downvotes the story, producing a more explicit explanation for the aspect rating. We assume $\sum_{k=1}^{K} a^c_k = 1$ for aspect confidence, and $a^r_k \in [0, 1]$ for aspect rating, which are calculated separately during training.
Please note that the aspect rating and comment generation results are not used as metrics in this work; rather, they are used for 1) improving preference score prediction through joint learning, and 2) producing explanations. Investigating how to include them in metrics is a future direction for this research.

Learning a Story Evaluator
Following Ghazarian et al. (2021), we use Longformer-Encoder-Decoder (LED) (Beltagy et al., 2020) to produce a preference score, as well as ratings and comments for the pre-defined aspects. As shown in Figure 3, we encode the story $s$ and use its feature on the special token (i.e., [CLS]) to predict the preference score $p_s$, aspect confidence $a^c$ and rating $a^r$ via additional layers. For generating comments, we concatenate the story with the aspect category name using a special token (i.e., <sep>), and feed it into the same encoder. The decoder outputs the comment $c$ that reflects the performance of the story on the given aspect.

Task 1: Preference Score Prediction (Ranking)
Our model learns to predict the preference score by ranking two stories from the same prompt. As shown in Figure 3, we use the feature of [CLS] in the story, followed by a linear layer with sigmoid activation, finally turning it into a scalar score.
We take the Margin Ranking Loss to enlarge the margin gap $m$ between the scores of stories with high and low upvotes:

$$\mathcal{L}_{rank} = \max\big(0,\; m - \sigma(W_{p_s} v_{s_{high}}) + \sigma(W_{p_s} v_{s_{low}})\big)$$

where $W_{p_s}$ denotes a linear layer applied to the story feature $v_s$, $\sigma(\cdot)$ is the sigmoid activation function, and $s_{high}$ and $s_{low}$ represent the highly-upvoted and lowly-upvoted stories.
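A minimal plain-Python sketch of this margin ranking objective; the logits here stand in for the linear-layer outputs $W_{p_s} v_s$, which would come from the LED encoder in practice, and the 0.3 margin matches the reported hyperparameter.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def margin_ranking_loss(logit_high, logit_low, margin=0.3):
    """max(0, m - (p_high - p_low)) with p = sigmoid(W_ps v_s).

    `logit_high`/`logit_low` are hypothetical stand-ins for the linear
    projections of the highly- and lowly-upvoted story features.
    """
    p_high = sigmoid(logit_high)
    p_low = sigmoid(logit_low)
    return max(0.0, margin - p_high + p_low)
```

The loss is zero once the score gap exceeds the margin, so training pushes the two preference scores at least `margin` apart rather than to fixed targets.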
Negative samples. Machine-generated stories often suffer from coherence and consistency problems, while human-written stories usually do not. Therefore, our model trained on human-written stories can hardly evaluate story coherence. To enable our model to evaluate stories with coherence issues in mind, we further train our model (Ours (N)) with negative stories generated by the methods in previous works (Guan and Huang, 2020; Ghazarian et al., 2021). We change the margin ranking loss as follows:

$$\mathcal{L}_{rank} = \max\big(0,\; m - \sigma(W_{p_s} v_{s_{high}}) + \sigma(W_{p_s} v_{s_{low}})\big) + \max\big(0,\; m - \sigma(W_{p_s} v_{s_{low}}) + \sigma(W_{p_s} v_{s_{neg}})\big)$$

where $s_{neg}$ denotes the negative stories derived from previous works. In each iteration, we take two pairs as training data: ($s_{high}$, $s_{low}$) and ($s_{low}$, $s_{neg}$).
We adopt two additional linear layers on the same feature $v_s$ used in story ranking. One, with learnable parameters $W_{a^c}$, outputs confidence scores $a^c = \mathrm{softmax}(W_{a^c} v_s)$. The other, with $W_{a^r}$, produces aspect ratings $a^r = \sigma(W_{a^r} v_s)$.
Let $y^{a^c} \in \{0, 1\}^K$ and $y^{a^r} \in [0, 1]^K$ be the ground-truth confidence and rating. We define the confidence and rating loss functions as follows:

$$\mathcal{L}_{conf} = -\sum_{k=1}^{K} y^{a^c}[k] \log a^c[k]$$

$$\mathcal{L}_{rate} = -\sum_{k \in M_s} \Big( y^{a^r}[k] \log a^r[k] + \big(1 - y^{a^r}[k]\big) \log\big(1 - a^r[k]\big) \Big)$$

We calculate the multi-class cross-entropy loss for the aspect confidence. For aspect rating, the binary cross-entropy loss is calculated separately for each selected aspect. $M_s$ denotes the set of aspects selected for story $s$, and $y^{a^r}[k]$ denotes the normalized rating score for the $k$-th aspect.
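The two aspect losses, cross-entropy over all K confidences and binary cross-entropy over only the selected aspects M_s, can be sketched in plain Python (the model outputs are assumed to already be softmax/sigmoid probabilities):

```python
import math

def confidence_loss(probs, y_conf):
    """Multi-class cross-entropy over all K aspects.

    probs: softmax outputs a^c; y_conf: ground-truth indicators in {0,1}^K.
    """
    return -sum(y * math.log(p) for y, p in zip(y_conf, probs) if y > 0)

def rating_loss(ratings, y_rate, selected):
    """Binary cross-entropy computed only over the selected aspects M_s.

    ratings: sigmoid outputs a^r; y_rate: normalized targets in [0,1];
    selected: indices of the aspects the annotators actually rated.
    """
    loss = 0.0
    for k in selected:
        p, y = ratings[k], y_rate[k]
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss
```

Restricting the rating loss to `selected` matters: annotators only rate 3-5 aspects per story, so the remaining aspects carry no rating supervision.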
The comments are generated conditioned on the aspect $a$ and the story $s$. We input the concatenation of the aspect category name, the special token, and the story, and train the LED under Maximum Likelihood Estimation (MLE) with the comment as target:

$$\mathcal{L}_{gen} = -\sum_{t} \log P(c_t \mid c_{<t}, a, s)$$

where $c_t$ denotes the $t$-th token in the comment.
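The token-level MLE objective amounts to summing negative log-probabilities over the comment tokens. A sketch, where `token_logprob_fn` is a hypothetical stand-in for the LED decoder's log P(c_t | c_<t, a, s):

```python
import math

def comment_nll(token_logprob_fn, comment_tokens, aspect, story):
    """Negative log-likelihood of a comment under a seq2seq model.

    `token_logprob_fn(prefix, token, aspect, story)` is an assumed
    interface returning log P(token | prefix, aspect, story).
    """
    nll = 0.0
    prefix = []
    for tok in comment_tokens:
        nll -= token_logprob_fn(prefix, tok, aspect, story)
        prefix.append(tok)  # teacher forcing: condition on gold prefix
    return nll
```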
For joint training of the three tasks, our final loss is the summation of all the above loss functions:

$$\mathcal{L} = \mathcal{L}_{rank} + \mathcal{L}_{conf} + \mathcal{L}_{rate} + \mathcal{L}_{gen}$$

Hyperparameters
We conduct a comprehensive set of experiments to examine the effectiveness under different scenarios.
We fine-tune pre-trained LED from Huggingface with a batch size of 16 and a margin of 0.3, and run 20k iterations for training (10 hours). We adopt the AdamW optimizer (Loshchilov and Hutter, 2018) with an initial learning rate of 4e-6, warming up in the first epoch and decreasing on a linear schedule. The reported results average the best results from three models with the same structure but initialized with three different seeds. More details and code can be found in the Appendix.

Compared Methods
We evaluate the predicted preference scores obtained by all compared methods on the 100k Story Ranking test data. Pairwise Ranking Accuracy (Acc) is calculated as the percentage of pairs in which the story with higher upvotes receives a higher score than the one with lower upvotes. We also compute the averaged score gap (Dis) between the two stories in each pair. Table 3 (Human (Ranking)) indicates that existing methods for preference-aware evaluation of human-written stories are close to random selection (i.e., Acc=0.5, Dis=0). In contrast, our method can successfully compare two stories and achieve an acceptable score gap between them.
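The two evaluation quantities can be computed straightforwardly; a sketch where each pair holds the predicted scores of the higher- and lower-upvoted story:

```python
def ranking_metrics(score_pairs):
    """Compute pairwise ranking accuracy (Acc) and averaged score gap (Dis).

    `score_pairs` is a list of (score_high, score_low) tuples, where the
    first element is the predicted preference score of the story with
    more upvotes.
    """
    n = len(score_pairs)
    acc = sum(1 for h, l in score_pairs if h > l) / n   # fraction ranked correctly
    dis = sum(h - l for h, l in score_pairs) / n        # mean predicted score gap
    return acc, dis
```

A random scorer gives Acc around 0.5 and Dis around 0, which is the baseline behavior Table 3 reports for existing metrics.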

Correlation with Human Judgments
We calculate the correlation between our predicted preference scores and human judgments for stories. We use the correlation metrics Spearman (ρ) (Zar, 1972) and Kendall (τ) (Schaeffer and Levitt, 1956), which are known to be beneficial in estimating monotonic associations for non-normally distributed and ranked scores. We collect and annotate both human-written and machine-generated stories as our test data.

WP_200. We collect human judgments for the stories in WP (sampled from the test data in Table 2), where each story is assigned to 8 annotators. Annotators are asked to rate each story on a scale of 1 to 5 (from poor to good). To ensure correctness, we follow Clark et al. (2021) and ask the annotators to compare the stories and write down the reason for clarification. We carefully monitor worker behavior and set traps inside the annotation (see Appendix for details). Finally, we obtain 100 highly-upvoted and 100 lowly-upvoted stories and average the human ratings as the target scores in this test data, namely WP_200 in the following experiments. Within it, we observe higher scores for highly-upvoted stories, supporting our hypothesis that upvote counts reflect human preference.

SCARY_200. We crawled scary stories from Reddit (r/shortscarystories), which are similar to the stories in WP but of a constrained story type. We use the same procedure as for WP_200 to create another human-annotated test dataset, namely SCARY_200.

PREF_200. The same procedure is also used for collecting human annotation of machine-generated stories. We select 100 stories generated by an LED trained with highly-upvoted stories in WP and 100 stories by another LED trained with lowly-upvoted stories. We manually ensure that the selected stories do not contain severe coherence issues, and ask the annotators to rate the stories based on whether they enjoy them.

COH_200. We use the human-collected data from previous work (Ghazarian et al., 2021), which focused on recognizing coherence issues in machine-generated stories (e.g., repeated plots, conflicting logic).
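The Kendall correlation used above can be illustrated with a minimal tau-a implementation over paired scores (a simple O(n²) version without tie handling; production code would use a library routine such as scipy.stats.kendalltau, which computes the tie-corrected tau-b):

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / (n * (n - 1) / 2).

    xs and ys are paired scores, e.g. predicted preference scores and
    averaged human ratings for the same stories.
    """
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1   # the pair is ordered the same way in both lists
            elif s < 0:
                discordant += 1   # the pair is ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)
```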
Results. Preference-based judgments (PREF_200) and coherence-based judgments (COH_200) are distinct: metrics that perform well in terms of coherence may perform poorly in terms of preference, and vice versa. To mitigate the gap between preference and coherence, we train our model using negative stories created by UNION and MANPLTS. As a result, Ours (N) shows rapidly increasing performance on coherence-based evaluation with a slight performance drop on preference-aware evaluation, indicating the potential to take into account both coherence and human preference when evaluating a story.

Preference Score Prediction
In this section, we further test the performance of preference score prediction combined with other components: aspects a, comments c and negative stories N. Table 4 summarizes the results of joint training. When aspects are used, performance decreases on WP_200 but increases on SCARY_200, and the pattern is reversed when comments are used. We also test the model trained on the dataset without data augmentation (△), and we can see that our model trained with augmented data outperforms that trained with the original data, which shows the significance of data augmentation.

Aspect Evaluation
We evaluate our model on predicting confidence scores and ratings for the aspects. For confidence scores, we calculate the top-k recall (i.e., k=1,3,5) on the test split of the 46k Aspect Rating and Reasoning data, which shows the percentage of human-selected aspects that appear among the aspects with top-k confidence. For ratings, we calculate the correlation between human annotation and our model's predictions. Story ranking and reasoning help the model output more accurate confidences and ratings.

Comment Evaluation
We evaluate the comment generation with automatic metrics and human evaluation. For automatic scores, we apply Perplexity (PPL), Averaged BLEU1-4 (B), and ROUGE (R). For human evaluation, we mainly measure the relevance of comments to the given story Rel(s), aspect category Rel(a) and rating score (0-1, negative-positive) Rel(r). We also measure Overall (O) quality by calculating the percentage of comments that are agreed upon by annotators. Each comment is assigned to 5 annotators with a binary choice (i.e., related or not related, agree or not agree). From the results in Table 10, our generated comments are highly related to the given stories and aspects. Joint training with preference score prediction and aspect rating further improves comment generation performance. The results so far show that the preference score, aspects, and comments all benefit one another, illustrating the significance of incorporating aspects and comments into our task.

Pairwise Evaluation with StoryER
Given a set of prompts, two story generation models can generate stories based on each given prompt. We have two straightforward ways to compare two models using our proposed preference scores: 1) average the preference scores of each model's stories and compare the averages; 2) perform pairwise comparisons between stories from the same prompt and compute the preference percentage. We recommend the second method as it strictly follows our pairwise ranking strategy.
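The recommended pairwise protocol can be sketched as follows, assuming the two score lists are aligned by prompt index (the scoring model itself is abstracted away):

```python
def pairwise_preference(scores_a, scores_b):
    """Fraction of prompts on which model A's story receives a higher
    preference score than model B's (method 2 above).

    `scores_a[i]` and `scores_b[i]` are the predicted preference scores
    of the two models' stories for the same i-th prompt.
    """
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    return wins / len(scores_a)
```

Because each comparison stays within one prompt, this avoids the topic-bias problem noted in the Limitations section, where stories from different prompts are not directly comparable.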

Domain Transfer in Preference Score
To show the generalization of evaluation metrics, we calculate the averaged predicted preference scores for data from different domains (see Table 7). We compute average scores on 1) lowly-upvoted (low) and highly-upvoted stories (high) on both WP_200 and SCARY_200; 2) machine-generated stories by LED (LED), and by LED with the Plan-and-Write strategy (Yao et al., 2019) (P&W), trained separately on the highly-upvoted and lowly-upvoted stories; 3) negative stories generated by previous works (Guan and Huang, 2020; Ghazarian et al., 2021); 4) stories from other datasets: fairy tales (short stories), the childbook dataset (Hill et al., 2015) and bookcorpus (Zhu et al., 2015).
As shown in Table 7, UNION and MANPLTS consistently produce higher scores for human-written stories (Human and Other blocks) while producing lower scores for machine-generated stories (Machine and N blocks). Looking into more detail, we can see that they cannot successfully distinguish story quality, e.g., SCARY_200 (low) and SCARY_200 (high) receive identical scores. These observations strongly indicate that UNION and MANPLTS work well for evaluating coherence but deviate from human preference when evaluating human-written stories.
Our method, on the other hand, is capable of following human preference (Human and Machine blocks) (see SCARY_200 (low) and SCARY_200 (high) as an example). The model trained with highly-upvoted stories can generate better stories than that trained with lowly-upvoted stories, and the P&W strategy performs even better, as shown in many previous works (Fan et al., 2019; Tan et al., 2021). From the results, our model produces higher scores for LED (high) compared with LED (low) and even higher scores for LED P&W (high), which indicates that our model still follows human preference on machine-generated stories. As serious coherence problems do not commonly occur in our training data, our method fails to recognize manually created incoherent stories (N block). However, our model (Ours (N)) works after we incorporate these stories into our training data, suggesting a future direction that unifies coherence-based and preference-aware metrics. Surprisingly, our model gives relatively low scores to stories from other domains (Other block). We think this is because the writing style changes the criterion of human preference, which misleads our model into predicting an unreasonable score, and thus presents a big challenge in generalizing preference-aware story evaluation.

More Analysis
Due to the page limit, we put more analysis in the ablation studies. In Appendix Sec. D, we observe high correlation scores between the preference score and each aspect rating, indicating the effectiveness of all pre-defined aspects in the evaluation. We also analyze the confidence and rating scores of the horror aspect against the preference score on scary stories in Appendix Sec. E. The result follows human intuition that the evaluation of scary stories tends to rely on the horror aspect.

Conclusion
In this paper, we investigate a novel task of preference-aware story evaluation, StoryER, which produces a score with explanations through various aspects and comments, bringing gains to the evaluation of both machine-generated and human-written stories. To support the task, we present a new dataset consisting of paired ranked stories and more explicit annotations (i.e., ratings and reasons) for pre-defined aspects. Our comprehensive ablation studies and intensive analysis show the effectiveness of using aspect rating and reasoning for preference score prediction. With the development of story generation, we believe that preference-aware story evaluation will become mainstream research once machine-generated stories no longer suffer from serious coherence problems. Further studies on our dataset can also be conducted to reveal the factors that influence readers to upvote stories.

Limitations
Our work (currently) has the following limitations: (1) As indicated in Section 7.2, our proposed metrics are negatively affected by significant domain shift, since we only take stories from one platform to train our model. Ideally, a more general model could be trained with all types of stories, but this requires massive annotations of human preference (i.e., upvote counts).
(2) Since the upvote counts in the original dataset are influenced by the prompt's topic (typically, fantastic stories get more upvotes than others) and our model is only trained on story pairs within the same topic, if a user inputs two unrelated stories, our system may provide unpredictable results. We therefore propose using pairwise evaluation with the same given prompt to avoid comparing stories with diverse topics.
(3) In this work, we propose implicit joint training to increase the performance of each task without explicitly addressing the connections among the three subtasks. Although we have aspect rating and comment generation, the preference score is still the most effective way to assess the quality of a story. How to use these comments and aspect ratings is a challenge to be addressed in future work.

Ethics and Broader Impacts
We hereby acknowledge that all of the co-authors of this work are aware of the provided ACM Code of Ethics and honor the code of conduct. This work mainly proposes a novel method for automatic story evaluation. The following sections describe both our ethical considerations and our potential impact on the community.
Dataset. We collect the human annotations of aspect ratings and comments via Amazon Mechanical Turk (MTurk) and ensure that all personal information of the workers involved (e.g., usernames, emails, urls, demographic information, etc.) is discarded in our dataset. All the stories in our dataset are collected from a public dataset, namely WritingPrompt. Although we aim at providing a dataset agreed upon by various people, there might still be unintended biases within the judgements; we make efforts to reduce these biases by collecting diverse comments and replacing annotators who tend to be racist. The detailed annotation process (pay per amount of work, guidelines) is included in the appendix and on our public website. We primarily consider English-speaking regions for our annotations as the task requires a certain level of English proficiency.
Techniques. We benchmark the story evaluation task with conventional metrics and our proposed metric. As story evaluation is our main focus, we do not anticipate the production of harmful outputs on our proposed task.

A Website Demo
We display the collected data, the AMT template and our models on our website. Users can input their own stories or randomly select one story. The server then runs our model and outputs a preference score and comments for each aspect. Figure 4 shows an example.

B Code
We also put our source code into the supplementary materials. Due to the upload size limitation, we truncate our 100k Story Ranking data to a size of 1000, as well as the 46k Aspect Rating and Reasoning data. Please kindly follow the README to run the experiments. Our human annotation results can also be found under the folder "data". Additionally, we include some examples of the machine-generated stories introduced in our paper.

C Correlation Between Story Quality and Aspect Rating
We calculate the correlation between human ratings on each aspect and the upvote number, and between the predicted aspect rating and the predicted preference score, to study the relationship between aspect rating and preference score. The results are listed in Figure 5. We can see that the results from our model closely match the distribution of the correlation between human aspect ratings and human upvote numbers. No single aspect dominates, which shows that all pre-defined aspects affect the final preference score prediction.

D Horror/Scary Aspect with SCARY 200
To show how aspect ratings and confidences are related to the story, we further analyze their behavior on WP_200 and SCARY_200. We calculate the recall performance and rating correlation on the "horror/scary" aspect only to see how this aspect works on both datasets. Table 8 shows that the horror aspect has a 36% probability of being the top confident aspect on SCARY_200, while the number is only 0.5% on the original WP_200. On the other hand, the preference score also has a higher correlation with the rating of the "horror/scary" aspect. These results show that the predicted aspects are strongly connected to the preference score prediction.

In each iteration, we feed two pairs to our model: one from the 100k Story Ranking data and the other from the 46k Aspect Rating and Reasoning data. We take the AdamW optimizer (Loshchilov and Hutter, 2018) with an initial learning rate of 4e-6, warming up in the first epoch and decreasing on a linear schedule. The reported results average the best results from three models with the same structure but initialized with three different seeds.
For the hyper-parameter search, we search the margin m from 0.2 to 1.0 in steps of 0.1 and the learning rate over 4e-4, 4e-5, 4e-6 and 4e-7, and record the best hyper-parameters.

F.3 Results of Comment Evaluation
Due to the page limitation, we put the results of comment evaluation with more metrics in Table 10.
We see that our model achieves higher performance on most of the metrics.

F.4 Results of Aspect Category Classification
We use the aspect category classification model, introduced in Sec. E.2, for filtering out noisy comments.
Figure 6 shows the classification results. Except for "ending" and "heartwarming", all aspect classes achieve an average of around 80% accuracy, showing high classification performance. We filter out comments with no aspect category score exceeding 0.9 after the softmax function.

F.5 Results of Comment Sentiment Analysis
The comment sentiment analysis model, introduced in Sec. E.3, is used to rate comments by their sentiments. Table 11 shows the results. Our model outputs ratings from 1 to 5. In the evaluation, we simply group 1 and 2 as negative, 3 as neutral, and 4 and 5 as positive. The results show that our sentiment analysis model can correctly predict the sentiment, especially for positive and negative classes.
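The grouping of the 1-5 ratings into three evaluation buckets is a simple mapping:

```python
def group_sentiment(rating):
    """Map a 1-5 sentiment rating to the three evaluation buckets:
    1-2 negative, 3 neutral, 4-5 positive, as described above."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"
```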

F.6 Comment Data Augmentation
We collect over 150k uncategorized comments from the metadata in WP. We use the aspect category classification model to filter out irrelevant comments. However, we found a bias in the comments. For example, we get almost 9,000 comments about "ending", while only 1,200 for "sad".
To mitigate the bias that would be inducted into our story evaluation model, we sample about 2000 comments for each aspect, and use all comments for the aspect which contains less than 2000 comments.The final data statistics of comments can be referred to our website.
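The per-aspect capping step can be sketched as below (the cap of 2,000 follows the text; the seed and function names are illustrative):

```python
import random

def balance_comments(comments_by_aspect, cap=2000, seed=0):
    """Keep at most `cap` comments per aspect.

    Aspects with fewer than `cap` comments keep all of them,
    so over-represented aspects like "ending" are downsampled
    while rare aspects like "sad" are left intact.
    """
    rng = random.Random(seed)
    balanced = {}
    for aspect, comments in comments_by_aspect.items():
        if len(comments) > cap:
            balanced[aspect] = rng.sample(comments, cap)
        else:
            balanced[aspect] = list(comments)
    return balanced
```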

G Human Annotation
G.1 Human Annotation on Test Data
For evaluation, we collect human judgments through AMT for 200 highly-upvoted stories and 200 lowly-upvoted stories from WP (sampled from the test split of the 100k Story Ranking data), with each story assigned to 8 annotators. Annotators are asked to rate each story on a scale of 1 to 5 (from poor to good). Following Clark et al. (2021), we ask the annotators to compare the stories before rating and to write down a very brief reason for clarification. To further ensure the correctness of the annotation, we compute statistics of annotator behavior (e.g., working time per HIT) and set traps in the batch (e.g., inserting extremely poor stories, or duplicating stories for one annotator to test consistency). Submissions from low-quality annotators are rejected and recollected from new annotators. Finally, we exclusively keep the 100 highly-upvoted and 100 lowly-upvoted stories with the lowest rating variance across the 8 annotators and average the human ratings as the target scores of this test set, namely WP200 in the following experiments. Annotators receive $0.2 as the reward for each submission.

Besides, we crawled scary stories from Reddit (r/shortscarystories), which have a similar writing style to the stories in WP but a constrained story type. We repeat the procedure used for WP200 and create another human-annotated test set, namely SCARY200.

The same procedure is also used to collect human annotations on machine-generated stories. We generate 200 stories using LED trained on highly-voted stories and another 200 stories using LED trained on lowly-voted stories. We ask the annotators to rate the stories based on human preference and also to distinguish whether the given stories are human-written or machine-generated. We exclusively keep the stories that deceive the annotators, as these stories do not contain serious coherence problems.
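The final selection step, keeping the lowest-variance stories and averaging their ratings into target scores, can be sketched as follows (data layout and function name are illustrative):

```python
from statistics import mean, pvariance

def select_reliable(stories, keep=100):
    """Keep the `keep` stories whose 8 annotator ratings agree most.

    `stories` is a list of (story_id, ratings) pairs; stories are
    ranked by rating variance (lower = better agreement) and the
    mean rating becomes the target preference score.
    """
    ranked = sorted(stories, key=lambda s: pvariance(s[1]))
    return [(sid, mean(ratings)) for sid, ratings in ranked[:keep]]
```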

G.2 Data Collection
In this paper, we mainly collect data for two different uses. Annotators receive $1 as the reward for each submission. The total data collection takes 2 months. To assess the quality of each annotator, we randomly sample the submissions from each annotator every two days, give a bonus to those with good quality, and warn annotators who give nonsense comments.

G.3 Human Annotation Inter-Annotator Agreement
As we assign each story to more than one annotator, we calculate the inter-annotator agreement on aspect selection. As a result, 65.80% of aspects are selected by more than one annotator, and the correlation coefficients of the multi-annotator aspect ratings are 0.913 and 0.811, for Spearman (Zar, 1972) and Kendall (Schaeffer and Levitt, 1956) respectively.
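Spearman correlation is the Pearson correlation computed on the ranks of the ratings; a minimal tie-free sketch is below (the reported numbers would use a standard implementation that also handles ties):

```python
def spearman(x, y):
    # Spearman rank correlation for tie-free data:
    # replace each value by its rank, then compute Pearson correlation.
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```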

H Aspect Category Name Definition
As no standard criterion exists for story evaluation, we collect commonly used aspects from the Internet, mainly referring to several writing-advice websites.

Figure 3: Overview of our model. The encoder (left) predicts the preference score, aspect confidence, and aspect rating. The decoder (right) generates the comment for each aspect.

Figure 4: An example on the website.

Table 1:
Data statistics of the 100k story ranking data. #prompt denotes the number of unique prompts; #S_high and #S_low denote the numbers of highly-voted and lowly-voted stories, with the averaged word count shown in parentheses. #pairs is the number of ranked story pairs.

Table 2:
Data statistics of the 46k aspect rating and reasoning data. * denotes the data statistics after data augmentation.

Table 3:
Correlation between automatic metrics and human judgment on preference score prediction. Compared with previous works, our predicted scores match human judgment more closely. We conduct a hypothesis test (Diedenhofen and Musch, 2015), and * denotes p ≤ 0.01.

Table 3 depicts the correlation between human and automatic evaluation metrics on preference (WP200, SCARY200, and PREF200).

Table 4:
Ablation study on preference score prediction. All results are statistically significant (p < 0.01). △ denotes that we use the collected data without augmentation. More results are listed in the supplementary materials.

Table 5 shows the results compared with jointly training the other two tasks.

Table 5:
Evaluation on aspect confidence and rating. p_s, a, c, and N denote the preference score, aspects, comments, and negative samples used in training our model, respectively.

Table 6:
Comment generation evaluation with automatic scores and human evaluation. In the human evaluation, the kappa coefficient κ for each score lies between 0.4 and 0.6, indicating moderate agreement between annotators.

Table 7:
Our model and existing works on various domains of stories. We report the averaged preference score on stories from four different domains.

Table 8:
Confidence for the horror/scary aspect on WP200 and SCARY200, and the correlation between the preference score and the horror/scary aspect ratings in the two datasets.

Table 10:
Comment generation evaluation with automatic scores and human evaluation. In the human evaluation, the kappa coefficient κ for each score lies between 0.4 and 0.6, indicating moderate agreement between annotators.

Table 11:
Comment sentiment analysis results.