A Corpus for Understanding and Generating Moral Stories

Teaching morals is one of the most important purposes of storytelling. An essential ability for understanding and writing moral stories is bridging story plots and implied morals. Its challenges mainly lie in: (1) grasping knowledge about abstract concepts in morals, (2) capturing inter-event discourse relations in stories, and (3) aligning value preferences of stories and morals concerning good or bad behavior. In this paper, we propose two understanding tasks and two generation tasks to assess these abilities of machines. We present STORAL, a new dataset of Chinese and English human-written moral stories. We show the difficulty of the proposed tasks by testing various models with automatic and manual evaluation on STORAL. Furthermore, we present a retrieval-augmented algorithm that effectively exploits related concepts or events in training sets as additional guidance to improve performance on these tasks.


Introduction
Stories play an essential role in one's moral development (Vitz, 1990). For example, individuals usually learn morals from life experiences or literature such as fables and tell their morals by representing their lived experience in a narrative form (Tappan and Brown, 1989). Accordingly, it is a crucial ability for humans to bridge abstract morals and concrete events in stories. However, this ability has not yet been investigated for machines.
There have been many tasks proposed for evaluating story understanding and generation, including story ending selection (Mostafazadeh et al., 2016) and story generation from short prompts (Fan et al., 2018). Unlike these tasks, which focus on reasoning plots from context, we emphasize the ability to associate plots with implied morals.

Story: Four cows lived in a forest near a meadow. They were good friends and did everything together. They grazed together and stayed together, because of which no tigers or lions were able to kill them for food. But one day, the friends fought and each cow went to graze in a different direction. A tiger and a lion saw this and decided that it was the perfect opportunity to kill the cows. They hid in the bushes and surprised the cows and killed them all, one by one.
Moral: Unity is strength.
Table 1: An example story paired with its moral.

As exemplified in Table 1, the challenges mainly lie in: (1) grasping knowledge about abstract concepts (e.g., "unity," "strength") and relations among them (e.g., "is") in morals; (2) capturing inter-event discourse relations in stories (e.g., the contrast between the endings of the "cows" when they are "united" and when they are "divided"); and (3) aligning value preferences (Jiang et al., 2021) of stories and morals (e.g., the story implies support for "unity," not opposition, which agrees with "is strength" in the moral). To test these abilities of machines, we propose two understanding tasks and two generation tasks. Both understanding tasks require selecting the correct moral from several candidates given a story, with separate candidate sets for testing two aspects: concept understanding (MOCPT for short) and preference alignment (MOPREF for short). The generation tasks require concluding the moral of a story (ST2MO for short) and, conversely, generating a coherent story to convey a moral (MO2ST for short).
Furthermore, we collected a new dataset named STORAL, composed of 4k Chinese and 2k English human-written stories paired with morals through human annotation, to address the above challenges. We call the Chinese dataset STORAL-ZH and the English dataset STORAL-EN, respectively. We then construct datasets for the proposed tasks based on STORAL. Our focus of morals is on the social set of standards for good or bad behavior and character, or the quality of being right, honest or acceptable (Ianinska and Garcia-Zamor, 2006). We conduct extensive experiments on the proposed tasks, and we present a retrieval-augmented algorithm to improve model performance by retrieving related concepts or events from training sets as additional guidance. However, the experiment results demonstrate that existing models still fall short of understanding and generating moral stories, which requires better modeling of discourse and commonsense relations among concrete events and abstract concepts.

Related Work

Story Datasets

ROCStories (Mostafazadeh et al., 2016) and WritingPrompts (Fan et al., 2018) are two frequently used story datasets in related studies. The former consists of artificial five-sentence stories regarding everyday events, while the latter contains fictional stories of 1k words paired with short prompts. Besides, some recent works collected extra-long stories such as roleplayerguild (Louis and Sutton, 2018), PG-19 (Rae et al., 2020), and STORIUM (Akoury et al., 2020). Guan et al. (2022) proposed a collection of Chinese stories. These stories usually aim to narrate a coherent event sequence but not to convince readers of any morals.

Story Understanding and Generation
There have been many tasks proposed for evaluating story understanding and generation. Firstly, some works tested machines' commonsense reasoning ability regarding inter-event causal and temporal relations through story ending selection (Mostafazadeh et al., 2016), story ending generation (Guan et al., 2019) and story completion (Wang and Wan, 2019). Secondly, a series of studies focused on the coherence of story generation (Fan et al., 2018; Yao et al., 2019; Guan et al., 2020). Another line of work concentrated on controllability, i.e., imposing specified attributes on story generation; these attributes involve outlines (Rashkin et al., 2020), emotional trajectories (Brahman and Chaturvedi, 2020) and story styles (Kong et al., 2021). Our tasks investigate not only the above aspects but also the ability to understand abstract concepts and reason about the value preferences of stories.
A task similar to ST2MO is text summarization (Finlayson, 2012), since both tasks require generating a short text that condenses crucial information of a long text. But summarization reorganizes a few words of the original text instead of concluding a character-independent moral. For example, a plausible summary of the story in Table 1 is "Four cows were killed by two tigers and a lion" (generated by BART Large (Lewis et al., 2020) fine-tuned on the summarization dataset XSUM (Narayan et al., 2018)), which includes specific characters and events of the original story. Moreover, MO2ST is similar to persuasive essay generation (Stab and Gurevych, 2017), which also requires conveying a viewpoint in generated texts. However, persuasive essays usually convince readers by directly presenting arguments rather than by narrating a story.

Haidt and Joseph (2004) provided a theoretical framework named Moral Foundations Theory (MFT) that summarizes five basic moral foundations such as "Care/Harm," "Fairness/Cheating," etc. Based on the theory, recent studies have explored classifying the moral foundations of partisan news (Fulgoni et al., 2016), tweets (Johnson and Goldwasser, 2018; Hoover et al., 2020), and crowd-sourced texts (Pavan et al., 2020). And Volkova et al. (2017) proposed identifying suspicious news based on features of moral foundations. However, we focus on morals, which are free-form texts far beyond the scope of the five categories in MFT. In addition, recent studies proposed multiple datasets for machine ethics research such as SBIC (Sap et al., 2020), Social Chemistry (Forbes et al., 2020), Moral Stories (Emelin et al., 2020), ETHICS (Hendrycks et al., 2021) and Scruples (Lourie et al., 2021). But these datasets focus more on how machines should behave ethically in some scenario, while STORAL emphasizes the ability to conclude the moral implied by a story. Moreover, most cases in these datasets consist of short texts of descriptive ethical behavior, typically in the form of one sentence. In contrast, STORAL provides longer and more context-specific stories for moral understanding.

STORAL Dataset
We collected STORAL from multiple web pages of moral stories. As stated on the pages, all stories are permitted for research use and redistribution and have been reviewed by the website editors. We show the full list of links to these pages in Section A.1. After de-duplication, we collected 19,197 Chinese and 2,598 English raw texts. Then we adopted human annotation to decouple the story and moral in each raw text. Due to resource limitations, we constructed only 4,209 Chinese and 1,779 English story-moral pairs. We first describe the details of human annotation, then present the topic analysis and statistics of STORAL, and finally describe the construction of datasets for the proposed tasks.

Human Annotation
To narrow down our focus, we define a story as a series of coherent events that involves several interrelated characters and implies support for or opposition to some behavior. Such a definition constrains the story to exhibit a moral without any explicit arguments. And we define a moral as a judgment describing what the story implies concerning good or bad behavior. Note that we do not require morals in STORAL to always reflect normatively virtuous behavior; we only emphasize that the morals should align with the story.

A key issue is then how to extract the story and moral from a raw text. We observe that in most cases there are no markers such as "The story tells us" to separate the story and moral; the moral may be tightly weaved into the plot (e.g., included in a dialogue). Therefore, we adopted human annotation for this extraction task. We hired a commercial team to annotate STORAL-ZH. All annotators are native Chinese speakers and well trained for our task. For STORAL-EN, we hired three graduates with good English language proficiency. We did not use AMT since it is inconvenient to train online annotators. Figure 1 shows the annotation pipeline.
We first ask annotators to judge whether the raw text contains a story and moral and whether they meet our constraints shown in Figure 1. We show the examples given to the annotators to inform them of our requirements for stories and morals in Section A.2. If the constraints are not met, we then ask annotators to refine the story and moral.
In the refinement stage, annotators have to clean up the data with the following heuristics: (1) refusing examples that may violate general ethical principles (e.g., discrimination); (2) deleting noisy words (e.g., links, codes); (3) refining the stories and morals to be coherent and formal. To ensure the quality of the collected data, annotators may refuse to refine an example if it requires much creative writing. Finally, we review the annotation results and provide detailed feedback to the annotators before approving their submissions. We show an annotation example in Table 2.
Raw Text: A man whowWw.xxx.c0Mlived a long time ago believed that he could read the future in the stars. He called himself an Astrologer, and spent his time at night gazing at the sky. One evening he was walking along the open road outside the village. His eyes were fixed on the stars. He thought he saw there that the end of the world was at hand, when all at once, down he went into a hole full of mud and water. There he stood up to his ears, in the muddy water, and madly clawing at the slippery sides of the hole in his effort to climb out. His cries for help soon brought the villagers running. As they pulled him out of the mud, one of them said: "You pretend to read the future in the stars, and yet you fail to see what is at your feet! This may teach you to pay more attention to what is right in front of you, and let the future take care of itself." "What use is it?" said another, "to read the stars, when you can't see what's right here on the earth?"

Story: A man who lived a long time ago believed that he could read · · · As they pulled him out of the mud, one of them said: "You pretend to read the future in the stars, and yet you fail to see what is at your feet!"

Moral: Pay more attention to what is right in front of you, and let the future take care of itself.

Table 2: An example of extracting the story and moral from a raw text. We highlight the words which should be revised in the raw text in italic, and the moral in the raw text is bold. To save space, we replace some events with "· · · " in the story.

Topic Analysis
To provide insight into the taxonomy of morals within STORAL, we adopt LDA (Blei et al., 2003) for topic modeling of morals. Let B denote the number of topics and V denote the vocabulary size. Based on the variational parameter for the topic-word distribution β ∈ R^{B×V}, we determine B as the minimum value that makes the following formula hold true for any b ∈ {1, 2, · · · , B}:

s_b = (Σ_{v ∈ V_b^(k)} β_bv) / (Σ_{v=1}^{V} β_bv) ≥ h,

where β_bv is the element at the b-th row and v-th column of β, k ∈ {1, 2, · · · , V} is the size of the top-k vocabulary V_b^(k) (the k words with the largest β_bv in the b-th topic), and h ∈ [0, 1] is a predefined threshold. s_b measures the specificity of the b-th topic: intuitively, the larger s_b, the more specific the topic. We set k to 20 and h to 0.5. Finally, we derive 40/24 topics for STORAL-ZH/STORAL-EN, respectively, and the minimum proportion of examples of one topic is 1.6%/3.2% for STORAL-ZH/STORAL-EN, respectively.

Table 3 shows the topic words in V_b^(10) of each topic and the two morals assigned to each topic with the highest probabilities, for the five topics with the largest specificity scores. An excerpt:

Topic Words: 懂得 (understand), 也是 (also), 了解 (know), 方法 (method), 收获 (gain), 保护 (protect), 大脑 (brain), 才能 (able), 付出 (pay), 进步 (progress)
Example: 在犯错的时候我们要懂得看全局，要了解全局才能对事情有定义。(When making mistakes, we must understand the overall situation. And we are able to have a definition of things only when knowing the overall situation.)

Topic Words: 不要 (not), 一定要 (must), 危险 (danger), 时候 (when), 对待 (treat), 安全 (safety), 千万 (by no means), 好好 (well), 学会 (learn), 遇到 (encounter)
Example: 生活中也要牢记"安全"这两字，在"安全"两字面前切不可存有侥幸心理，把安全当成儿戏。(Keep in mind the word "safety" in your life, and do not take any chances to treat safety as a joke.)

Topic Words: 时候 (when), 其实 (actually), 很多 (many), 发现 (discover), · · ·
Examples: 做好自己该做的事情，做自己的主人。(Do what you should do and be your own master.) 人要善于自己发现自己，而不是老等着别人来发现我们。(We should be good at discovering ourselves instead of waiting for others to discover us.)

The topics cover diverse situations ranging from facing others ("honesty," "help"), parents ("love"), and ourselves ("self-help," "self-discovery") to facing difficulties ("think") and danger ("safety"). And examples of the same topic present related semantics to some extent, such as "being honest" and "not believing liars" for the first topic in STORAL-EN.

We also show the analysis of high-frequency words of stories and morals in Section A.3 and a discussion of the commonsense and discourse relations in stories in Section A.4.

Table 4 shows the statistics of STORAL. We regard the unlabeled data, which contain entangled stories and morals, as an in-domain resource for research on unsupervised or semi-supervised learning for the proposed tasks. The data are also suitable for learning to generate moral stories where the morals are weaved naturally into the story plots.
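The specificity score used above for choosing the number of topics can be sketched in a few lines (a toy illustration; the function names and the 2-topic matrix are our own, and we assume an unnormalized topic-word matrix whose rows are renormalized by the denominator):

```python
def specificity(beta_row, k):
    """Fraction of a topic's total word mass carried by its top-k words (s_b)."""
    top_k = sorted(beta_row, reverse=True)[:k]
    return sum(top_k) / sum(beta_row)

def all_topics_specific(beta, k, h):
    """Check whether every topic row of the B x V matrix `beta` satisfies s_b >= h."""
    return all(specificity(row, k) >= h for row in beta)

# Toy example: 2 topics over a 6-word vocabulary.
beta = [
    [5.0, 4.0, 0.5, 0.3, 0.1, 0.1],  # mass concentrated in top words -> specific
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # uniform over the vocabulary -> unspecific
]
print(specificity(beta[0], k=2))          # 9.0 / 10.0 = 0.9
print(all_topics_specific(beta, 2, 0.5))  # False: the uniform topic scores 2/6
```

In this setting, B would be increased until every learned topic passes the s_b ≥ h check.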

Task-Specific Dataset Construction
Based on STORAL, we build task-specific datasets for our understanding tasks (MOCPT and MOPREF) and generation tasks (ST2MO and MO2ST). We randomly split the labeled data in STORAL-ZH and STORAL-EN into training/validation/test sets by 8:1:1 and 3:1:1, respectively. Table 5 shows the task descriptions and data sizes.
MOCPT It requires selecting the correct moral from five candidates given a story. We constructed the dataset by taking the original moral as the correct candidate and four negatively sampled morals as incorrect candidates for each example. To avoid more than one plausible candidate, we ensured that the negative morals are assigned to different topics from the original one by the LDA model (Section 3.2). In this way, MOCPT can effectively test the ability to distinguish different concepts.
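The topic-constrained negative sampling described above can be sketched as follows (the index-to-topic mapping, function name, and fixed seed are illustrative assumptions):

```python
import random

def sample_negatives(gold_idx, topic_of, n_neg=4, seed=0):
    """Sample n_neg morals whose LDA topic differs from the gold moral's topic."""
    rng = random.Random(seed)
    gold_topic = topic_of[gold_idx]
    candidates = [i for i, t in enumerate(topic_of) if i != gold_idx and t != gold_topic]
    return rng.sample(candidates, n_neg)

# Toy topic assignments for 8 morals (index -> LDA topic id).
topic_of = [0, 1, 2, 0, 3, 4, 1, 2]
negs = sample_negatives(gold_idx=0, topic_of=topic_of, n_neg=4)
assert all(topic_of[i] != topic_of[0] for i in negs)  # no negative shares the gold topic
```

Index 3 is never sampled here because it shares topic 0 with the gold moral, which is what prevents a second plausible candidate.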
MOPREF It requires selecting the correct moral from two candidates. Its difference from MOCPT is that we created the incorrect candidate by substituting one random token in the original moral with its antonym. For example, the moral "unity is strength" can be transformed to "unity is weakness". We perform the transformation using a rule-based method (Ribeiro et al., 2020). Because there exist examples where no word has an antonym, the number of examples for MOPREF is slightly smaller than for MOCPT. MOPREF serves to test the ability to capture the value preferences of stories.
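A minimal sketch of the antonym substitution (our toy antonym table and the deterministic choice of the first matching token stand in for the rule-based method of Ribeiro et al. (2020)):

```python
# Toy antonym lexicon; the real construction uses a rule-based method.
ANTONYMS = {"strength": "weakness", "good": "bad", "honest": "dishonest"}

def flip_preference(moral):
    """Replace the first token that has a known antonym; return None if none does."""
    tokens = moral.split()
    for i, tok in enumerate(tokens):
        if tok in ANTONYMS:
            tokens[i] = ANTONYMS[tok]
            return " ".join(tokens)
    return None  # no antonym found: the example is dropped for MOPREF

print(flip_preference("unity is strength"))  # unity is weakness
print(flip_preference("unity above all"))    # None
```

Returning None for morals without antonym-bearing tokens mirrors why MOPREF contains slightly fewer examples than MOCPT.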
ST2MO It requires generating the moral of a given story. We regard the original story as input and the original moral as target output.
MO2ST It requires generating a story to convey a given moral. Unfortunately, automatic evaluation for open-ended story generation is still highly challenging due to the notorious one-to-many issue (Zhao et al., 2017): There may be multiple plausible stories with the same moral. For example, the moral in Table 1 can also be conveyed by another story: "bees unite to build their beehives." Such openness makes automatic metrics unreliable for quality evaluation (Guan and Huang, 2020).
To alleviate this issue, we extract the first sentence and an outline from a target story, and pair them with the moral as input for generating the story. We follow Rashkin et al. (2020) to extract a set of at most eight phrases from a story through RAKE (Rose et al., 2010) as the outline. We set the maximum number of words in each phrase to eight. We also filtered those phrases that are substrings of others. For example, the outline for the story in Table 1 is {"lions," "friends fought," "good friends," "grazed," "perfect opportunity"}. Finally, for STORAL-ZH/STORAL-EN, the average number of phrases for each example is 7.5/6.8 and the average number of words in each phrase is 2.87/2.44, respectively.
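The post-processing of the RAKE phrases, i.e., dropping over-long phrases and substrings of other phrases and capping the outline size, can be sketched as follows (we assume the input list is already ranked by RAKE score):

```python
def build_outline(phrases, max_phrases=8, max_words=8):
    """Keep phrases of at most max_words words that are not substrings of
    another input phrase, capped at max_phrases (input assumed RAKE-ranked)."""
    kept = []
    for p in phrases:
        if len(p.split()) > max_words:
            continue
        if any(p != q and p in q for q in phrases):
            continue  # drop phrases that are substrings of others
        kept.append(p)
        if len(kept) == max_phrases:
            break
    return kept

phrases = ["perfect opportunity", "good friends", "friends", "grazed", "lions"]
print(build_outline(phrases))  # "friends" is dropped: substring of "good friends"
```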

Retrieval Augmentation
A critical challenge for tackling the proposed tasks is that the sparsity of morals and events makes it difficult to learn relations between them. Prior studies have shown that retrieval improves performance on infrequent data points across various tasks such as open-domain question answering (Chen et al., 2017) and text classification (Lin et al., 2021). We present a retrieval-augmented algorithm that exploits the moral-event relations in training sets. We illustrate our model for the MOPREF task in Figure 2; our models for the other tasks are similar.

For both MOCPT and MOPREF, we encode the story and candidates using an input encoder, and then predict a probability distribution over the candidates by normalizing the dot-product scores between the representations of the story and each candidate. We optimize the model by minimizing the cross-entropy loss. We insert special tokens [S] and [C] before the story and each candidate, respectively, and take the corresponding hidden states as their representations.

Furthermore, we propose to retrieve related concepts from the training set using the input story. We encode the story using a query encoder, then take the output as the query to retrieve the m most related stories based on a story index, i.e., a set of dense vectors representing the stories in the training set. We adopt BERT (Devlin et al., 2019) followed by a mean-pooling layer to build the query encoder and story index, which are frozen during training. Finally, we extract the nouns, verbs, adjectives and adverbs from the morals of the top-m stories and lemmatize them as the retrieved concepts. We feed these concepts together with the original input to the input encoder. For example, the retrieved concepts for the story in Table 1 include "support" and "strength," which may serve as additional guidance for the model's prediction.
The retrieval-augmented algorithm can easily adapt to the generation tasks. For ST2MO, we take the input story paired with the retrieved concepts into the encoder and then generate the output using the decoder. And for MO2ST, we use the input moral as the query to retrieve top-m stories, and regard their outlines as the retrieved additional information to guide the subsequent story generation.
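The dense retrieval step can be sketched with plain dot-product scoring (the toy 3-dimensional vectors stand in for the frozen BERT-based story index):

```python
def top_m(query_vec, story_index, m):
    """Return indices of the m training stories with the highest dot-product
    similarity to the query vector."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [(dot(query_vec, v), i) for i, v in enumerate(story_index)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:m]]

# Toy index of four training-story embeddings.
story_index = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]
print(top_m(query, story_index, m=2))  # [0, 1]
```

In the full model, the morals of the returned stories would then be POS-tagged and lemmatized to yield the retrieved concept words.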

Evaluated Models
We evaluated the following baselines for the understanding tasks: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2020). When evaluating T5, we feed the input to both the encoder and decoder of T5 and optimize the model using the cross-entropy loss. To investigate potential biases of the proposed datasets, we added a baseline called BERT w/o story, which is fine-tuned to make predictions without taking the story as input. For the generation tasks, we evaluated ConvS2S (Gehring et al., 2017), Fusion (Fan et al., 2018), GPT2 (Radford et al., 2019) and T5, which are trained or fine-tuned with the standard language modeling objective. Moreover, we also evaluated a task-specific model, PlotMachines (PM for short) (Rashkin et al., 2020), which is proposed for tackling outline-conditioned generation by tracking dynamic plot states. We use GPT2 as the backbone model of PM.
We also design models to test the adaptation of the unlabeled data of STORAL to the proposed tasks. Specifically, we first post-train RoBERTa and T5 on the unlabeled data with their original pretraining objectives (i.e., masked language modeling and text infilling, respectively) and then fine-tune them on the labeled data for the downstream tasks (Gururangan et al., 2020). We call these baselines RoBERTa-Post and T5-Post. We further apply our retrieval-augmented algorithm to the post-trained models, yielding RA-RoBERTa and RA-T5, respectively.

Experiment Settings
We implement the pretrained models based on the code and pretrained checkpoints of HuggingFace's Transformers (Wolf et al., 2020). We use LongLM base (Guan et al., 2022) as the T5 model for experiments on STORAL-ZH, and set all pretrained models to the base version due to limited computational resources. As for the hyperparameters, we set the batch size to 16, the maximum sequence length to 1,024, the learning rate to 3e-5, and m to 10 for our retrieval-augmented models. We generate outputs using top-k sampling (Fan et al., 2018) with k = 40 and a softmax temperature of 0.7 (Goodfellow et al., 2016). We show more details in Section B.1.
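The decoding configuration, top-k sampling with a softmax temperature, can be sketched as follows (the function name and toy logits are ours):

```python
import math
import random

def top_k_sample(logits, k=40, temperature=0.7, seed=0):
    """Sample a token index from the k highest-scoring logits after
    temperature-scaled softmax renormalization."""
    rng = random.Random(seed)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    mx = max(scaled)  # subtract the max for numerical stability
    probs = [math.exp(s - mx) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    return rng.choices(top, weights=probs, k=1)[0]

logits = [2.0, 0.5, 0.1, -1.0, 3.0]
idx = top_k_sample(logits, k=2)  # only indices 4 and 0 survive the top-k filter
assert idx in (4, 0)
```

A temperature below 1 sharpens the distribution over the surviving tokens, trading diversity for coherence.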

Automatic Evaluation
Evaluation Metrics We adopt accuracy to evaluate the understanding tasks. For generation tasks, we do not use perplexity since perplexity scores are not comparable among models with different vocabularies. We adopt the following metrics for automatic evaluation: (1) BLEU (B-n): It is used to measure n-gram overlaps (n = 1, 2) between generated and ground-truth texts (Papineni et al., 2002).
(2) BERTScore-F1 (BS): It is used to measure the semantic similarity between generated and ground-truth texts (Zhang et al., 2019). (3) Repetition (R-n): It calculates the ratio of texts that repeat at least one n-gram among all generated texts (Shao et al., 2019). (4) Distinct (D-n): It measures diversity as the percentage of distinct n-grams among all n-grams in generated texts (Li et al., 2016). For both R-n and D-n, we set n = 2 for ST2MO and n = 4 for MO2ST, considering the much shorter length of morals than stories. Besides, we also report the average number of generated words (Len).
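The Repetition-n and Distinct-n metrics can be sketched as follows (whitespace tokenization is a simplifying assumption):

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(texts, n):
    """Percentage of distinct n-grams among all n-grams in the generated texts."""
    all_ngrams = [g for t in texts for g in ngrams(t.split(), n)]
    return len(set(all_ngrams)) / len(all_ngrams)

def repetition_n(texts, n):
    """Ratio of texts that repeat at least one n-gram within themselves."""
    def repeats(t):
        grams = ngrams(t.split(), n)
        return len(set(grams)) < len(grams)
    return sum(repeats(t) for t in texts) / len(texts)

texts = ["the cows grazed and the cows fought", "unity is strength"]
print(repetition_n(texts, 2))  # 0.5: only the first text repeats "the cows"
print(round(distinct_n(texts, 2), 3))
```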
We also adopt the following metrics for automatic evaluation of MO2ST: (1) Coverage (Cov): It computes Rouge-L recall (Lin, 2004) between generated stories and phrases in the corresponding outlines. A higher score means the generated stories cover more phrases in the given outlines.
(2) Order (Ord): It measures the disparity between the positional orders of given phrases in the ground truth and generated story using the percentage of inversions in the generated story (Guan et al., 2022). An inversion is a position pair of two phrases that is out of the ground-truth order. Higher order scores mean that the stories arrange the outline more reasonably. In Section B.2, we also construct a learnable automatic metric to measure the faithfulness between morals and stories.
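The two outline metrics can be sketched as below: coverage as the mean LCS recall of each phrase against the generated story, and order as the fraction of phrase pairs kept in the input order (a simplification of the Rouge-L-based and inversion-based definitions above):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def coverage(story, phrases):
    """Mean LCS recall of each outline phrase against the generated story."""
    s = story.split()
    return sum(lcs_len(s, p.split()) / len(p.split()) for p in phrases) / len(phrases)

def order(story, phrases):
    """Fraction of phrase pairs appearing in the given (ground-truth) order."""
    pos = [story.find(p) for p in phrases]
    pairs = [(i, j) for i in range(len(pos)) for j in range(i + 1, len(pos))]
    kept = sum(1 for i, j in pairs if pos[i] != -1 and pos[j] != -1 and pos[i] < pos[j])
    return kept / len(pairs)

story = "good friends grazed together until the friends fought one day"
phrases = ["good friends", "grazed", "friends fought"]
print(coverage(story, phrases))  # every phrase is fully recalled
print(order(story, phrases))     # all phrase pairs appear in order
```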
Results Tables 6 and 7 show the results on the understanding and generation tasks, respectively. To obtain human performance on MOCPT and MOPREF, we sampled 100 examples from the test set and recruited three annotators with good Chinese or English language proficiency to complete these tasks. We made final decisions among the annotators through majority voting. The annotation results show an almost perfect agreement with Fleiss's κ > 0.85 (Fleiss, 1971).
We summarize the results on the understanding tasks as follows: (1) The MOPREF datasets suffer from innate biases, as indicated by the high accuracy of BERT w/o story. Such biases may result from the noise introduced by the automatic construction technique, i.e., antonym substitution; models may learn patterns of good behavior (e.g., "unity" is good and "disunity" is bad in general) and make predictions easily without depending on stories. However, MOPREF is still meaningful as an evaluation task since BERT achieves much better accuracy when taking stories as input. We also experiment with manually constructed examples for evaluating preference alignment in the appendix.

Table 6: Accuracy (%) for MOCPT and MOPREF. #P is the number of parameters. The best performance is highlighted in bold and the second best is underlined. The scores marked with * and ** for an RA model mean that it outperforms the best baseline significantly with p-value < 0.1 and p-value < 0.05 (sign test), respectively.
(2) T5 performs better than RoBERTa on MOCPT but worse on MOPREF, indicating T5 may not be good at capturing value preferences. (3) Post-training on the unlabeled data (i.e., RoBERTa-Post and T5-Post) does not always bring improvement on both tasks, suggesting that it is necessary to develop a better way to exploit these data in future work. (4) Retrieving additional concepts improves models' performance effectively, particularly for the MOCPT task on STORAL-EN. However, there is still a big gap between our models and human performance. As for the generation tasks, we draw the following conclusions: (1) Almost all pretrained models achieve better lexical and semantic similarity with ground-truth texts than non-pretrained models, as indicated by higher BLEU and BERTScore values.
(2) Non-pretrained models repeat less than pretrained ones, and even less than the ground-truth texts when generating morals. This may be because non-pretrained models generate shorter sequences than pretrained models despite the same decoding algorithm, which also accounts for the higher distinct scores of the non-pretrained models on the MO2ST task. (3) When generating stories, T5-Post covers more input phrases and arranges them in a more correct order than the other baselines, as indicated by its higher coverage and order scores. (4) Retrieval augmentation improves similarity with the ground-truth texts on both tasks and significantly improves the coverage and order scores on MO2ST compared with T5-Post.

Manual Evaluation
On the generation tasks, we conducted a Likert-scale based manual evaluation to measure the gap between existing models and humans. For STORAL-ZH, we hired three graduate students (native Chinese speakers) as annotators. We conducted the evaluation on STORAL-EN using Amazon Mechanical Turk (AMT). For both tasks, we randomly sampled 100 examples from the test set and obtained 300 generated texts from Fusion, T5 and RA-T5. For each text, we required three annotators to rate its quality along with the input using a binary score in the following three aspects: (1) linguistic fluency: correctness in grammaticality; (2) coherence: reasonable relations between sentences regarding relatedness, causality and temporal order; and (3) moral faithfulness: exhibition of a moral faithful to the input. The three aspects are evaluated independently. We decided the final score of a text through majority voting. The annotation instructions are shown in Section B.3.

Table 8 shows the manual evaluation results; we show p-values of the results in Section B.4. For ST2MO, T5 achieves a substantial improvement over Fusion (p < 0.01), and our model further outperforms T5. The superiority becomes less significant for MO2ST. However, the big gap between these models and humans, particularly in terms of faithfulness, proves both tasks challenging for existing models. Furthermore, we evaluate whether machines can capture the value preference of a story using manually constructed examples, and we show error analysis and case studies for the proposed tasks in Section C. We believe that explicit modeling of the relations among events and abstract concepts will further promote progress on these tasks, which we regard as future work.

Conclusion
We present STORAL, a collection of Chinese and English moral stories. To test the ability to bridge concrete events and abstract morals, we propose new understanding and generation tasks based on STORAL, including selecting the correct moral from several candidates with different topics or opposite value preferences, concluding the moral of a story, and generating a story to convey a moral. Extensive experiments show that these tasks remain challenging for existing models. We propose a retrieval-augmented algorithm that improves performance by retrieving related concepts or events from training sets. Although it is possible to further increase the dataset size, we expect to make more meaningful progress by developing better representations of commonsense and discourse relations among events and abstract concepts in future work.

Acknowledgement
This work was supported by the National Science Foundation for Distinguished Young Scholars (with No. 62125604) and the NSFC projects (Key project with No. 61936010 and regular project with No. 61876096). This work was also supported by the Guoqiang Institute of Tsinghua University, with Grant No. 2019GQG1 and 2020GQG0005. This work was also sponsored by Tsinghua-Toyota Joint Research Fund. We would also like to thank the anonymous reviewers for their invaluable suggestions and feedback.

Ethics Statements and Broader Impact
We collected STORAL from public web resources. All stories are under licenses that allow use and redistribution for research purposes. We asked commercial annotation teams to extract stories and morals from the crawled raw texts. We required the annotators to refuse examples which violate general ethical principles (e.g., showing discrimination toward someone, containing disrespectful content, or encouraging disturbing public order). In total, we paid more than $7 (CNY 45) per hour on average for annotating each example in STORAL, which was far beyond the minimum hourly wage in China (CNY 21). Furthermore, we resorted to AMT for manual evaluation of generated and human-written texts for the two proposed generation tasks. We hired three annotators and paid each annotator $0.2 on average for annotating each example.
In this paper, we emphasize the ability to model relations between concrete events and abstract morals, which is also helpful for various scenarios such as reading comprehension (e.g., drawing authors' viewpoints from narratives) and essay writing (e.g., writing essays to convince readers of some arguments by presenting examples or anecdotes). STORAL provides a good starting point for exploring these directions.

A.1 Data Source
We show the full list of web pages used for constructing STORAL in Table 11. We initially collected 52,017 Chinese and 2,630 English raw texts from the web pages. Then we de-duplicated the texts by removing those that overlap with others by more than twenty words, finally obtaining 19,197 Chinese and 2,598 English texts, based on which we construct STORAL.

A.2 Data Annotation

Table 9/10 shows the examples given to the annotators to inform them of the requirements for stories/morals, respectively. If the constraints are not met, we ask annotators to refine the story and moral. All workers were paid more than $7 per hour on average.
Example 1: Come on Bear! What a beautiful day! Go for a walk with your father! Take a deep breath and smell the flowers. But don't pick the flowers. Listen to the birds sing. But don't scare them. How beautiful the world is. Isn't it, dear Bear?
Example 2: When I was a child, I heard a story that felt very regrettable. I felt sorry for the protagonist of the story. Long ago, there lived · · · Such trees are now found all over Uganda.
Example 3: I have a well-off friend. When she first entered college, she had many good wishes and thought she could achieve her goals. · · · Now she felt very painful under the strong mental pressure. I can understand her feelings. · · · If magnifying your own pain, you will get trapped in the mire of your pain, and even feel that life is too unfair to you.
Example 4: Raul sat at his door, frowning. · · · His father told Raul a true story: A wild wolf escaped into a cave after being wounded by a hunter's arrow. · · · After hearing the story, Raul cheered up immediately. · · ·

Table 9: Examples of stories provided for the annotators. Each example does not meet one of the following requirements in order: (1) having a clear beginning and ending; (2) not stating anything irrelevant to the main plot; (3) not stating any explicit arguments for the moral; and (4) not telling the story in a nested form. The sentences causing the above issues are in italic.
Example 1: If you saw a thief in a crowded bus, would you bravely stop him? Please reflect on yourself instead of just complaining that our world is becoming worse. Without the foothold for dirt, the flower of civilization is bound to be fragrant.
Example 2: The story tells us: we should remember that we should become a polite person and communicate with others carefully.
Example 3: As long as you keep your sanity and make right judgments, all the barriers will not become an obstacle, just like the beautiful girl in the story.

Table 10: Examples of morals provided for the annotators. Each example does not meet one of the following requirements in order: (1) · · · ; (2) · · · ; (3) not involving any specific characters in the story. We highlight the sentences leading to the above issues in italic.

A.3 Analysis of High-Frequency Words
To investigate the topic features of STORAL, we count the top 50 most frequent nouns in STORAL (excluding stop words), as shown in Figure 3. We roughly categorize these words into four types: (1) Animals: animals are popular as protagonists in moral stories since they usually have varied but clear characteristics (e.g., "sly foxes"), which embody rich commonsense knowledge; (2) Relationships: such nouns describe the inter-character relationships in a story (e.g., "friend"), which are useful for modeling characters' motivation and behavior; (3) Concrete nouns: they refer to physical entities that can be observed, such as "water"; and (4) Abstract nouns: they refer to abstract concepts, such as "difficulty". We manually check the proportional distribution of the four types for stories and morals, respectively. The results in Figure 3 demonstrate that morals contain significantly fewer concrete nouns and more abstract nouns than stories. Moreover, morals contain few animal words but almost as many relationship words as stories, indicating that morals may be independent of specific characters but relate to general interpersonal relations. These results show that morals are more abstract than stories.
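The noun-counting step above can be sketched as below. We assume the tokens have already been POS-tagged by some tagger (e.g., NLTK's, purely for illustration); only the counting skeleton is shown.

```python
from collections import Counter

def top_nouns(tagged_tokens, stopwords, k=50):
    """Return the k most frequent nouns, excluding stop words.

    `tagged_tokens` is a list of (word, pos) pairs, e.g. from a
    Penn-Treebank-style POS tagger where noun tags start with "NN".
    """
    nouns = [w.lower() for w, pos in tagged_tokens
             if pos.startswith("NN") and w.lower() not in stopwords]
    return Counter(nouns).most_common(k)
```

The same counter, applied separately to stories and morals, yields the two distributions compared in Figure 3.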
Furthermore, Table 12 shows the most frequent 4-grams in STORAL, further indicating that morals are more abstract than stories. Each of the 4-grams in Table 12 comprises less than 0.01% of all 4-grams.

Table 12: The most frequent 4-grams in the stories and morals of STORAL-ZH (translated into English), e.g., "everyone has his own", "we should be a", "to a place far away", and "in the forest there lived a".

A.4 Discussion about STORAL
The high-quality examples in STORAL are full of commonsense and discourse relations. As exemplified in Table 1 in the main paper, the commonsense mainly concerns the characters' reactions and intentions (e.g., "the cows dispersed" and then the "tiger" and "lion" intend to kill them), as well as the nature of physical objects and abstract concepts (e.g., "cows" may be food for "lions" and "tigers", and "unity" refers to "keeping together for a common goal"). Additionally, the stories usually have a specific discourse structure: a premise that introduces the story settings (e.g., the characters "four cows" and the location "a meadow"), the right or wrong behavior ("staying together or not"), and the ending ("living well or being killed"). We believe developing better approaches to model such commonsense and discourse relations is an essential topic for future work.

B.1 Implementation
We implement the pretrained models used in our experiments mainly based on the models registered on HuggingFace (Wolf et al., 2020). Table 13 lists the names of the registered models we use. Note that we use LongLM base (Guan et al., 2022) as the T5 model for experiments on STORAL-ZH, which had not been registered on HuggingFace.
All results in the main paper and the appendix are based on one NVIDIA Tesla V100 (16GB memory); the CPU is an Intel Xeon Gold 5218. All reported results are based on a single run. Fine-tuning each model on STORAL took less than 5 hours. We set the hyper-parameters to the HuggingFace defaults.

B.2 Automatic Evaluation for Moral Faithfulness
We follow Guan and Huang (2020) in training a learnable metric to evaluate moral faithfulness. Specifically, we fine-tune RoBERTa BASE as a classifier to distinguish whether a story matches a moral. We regard ground-truth examples, where the story and moral are matched, as positive, and construct negative examples by replacing the story or moral with a randomly sampled one. The classifier achieves an accuracy of 77.32%/79.21% on the data constructed from the test set of STORAL-ZH/STORAL-EN, respectively. We then calculate the faithfulness score as the average classifier score over all texts generated for the inputs. Table 14 presents the evaluation results. Pretrained models achieve much higher faithfulness scores than non-pretrained models. However, the faithfulness score of the ground-truth texts is lower than that of some models (e.g., T5) when generating morals. Therefore, it is still necessary to evaluate faithfulness manually.
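The construction of training pairs for this classifier can be sketched as follows; the fine-tuning of RoBERTa itself is omitted, and the function name and sampling details are illustrative.

```python
import random

def build_pairs(stories, morals, seed=0):
    """Build (story, moral, label) examples for the faithfulness
    classifier: each ground-truth pair is a positive (label 1); a
    negative (label 0) is made by swapping in a moral randomly
    sampled from a different example."""
    rng = random.Random(seed)
    data = []
    for i, (s, m) in enumerate(zip(stories, morals)):
        data.append((s, m, 1))  # matched story-moral pair
        j = rng.choice([k for k in range(len(morals)) if k != i])
        data.append((s, morals[j], 0))  # mismatched pair
    return data
```

At evaluation time, the faithfulness score of a system is the classifier's mean "matched" probability over its generated outputs.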

Results on Validation Sets
We show the performance of several baselines and RA-T5 on the validation sets of the understanding tasks and the generation tasks in Table 15 and Table 16, respectively.

B.3 Manual Evaluation Instruction
We show the manual annotation interface in Figure 4. To ensure a consistent standard throughout the annotation process, we asked annotators to rate four examples with the same input in the same HIT (human intelligence task). Of these four examples, one is written by humans and three are generated by models (i.e., Fusion, T5 and RA-T5). We paid each annotator $0.2 on average for annotating each example.

Instruction
Moral: Nothing can be gained without effort. First Sentence: There was a farmer who had three sons.

· · · of a story, the automatically constructed dataset may bias machines to focus on distinguishing general standards of good behaviour without considering story plots. Therefore, in this section, we construct examples manually to test this ability beyond the token level. Specifically, we randomly sampled 50 examples from the test sets of STORAL-ZH and STORAL-EN, respectively. For each example, we manually rewrote the moral to convey a synonymous or antonymous value preference. For example, a synonymous moral for "unity is strength" in Table 1 can be "we are powerful as long as we unite with each other," and an antonymous one can be "everyone can also be powerful enough." We then expect a model to accept the synonymous moral but reject the antonymous one. We use three typical models, including BERT w/o Story, RA-RoBERTa and RA-T5, to compute the winning rates.

Table 18: Winning rates of pair-wise comparisons which require selecting a correct moral from two candidates. Each candidate is a ground-truth (True), synonymous (Syn), or antonymous (Ant) moral. The number in parentheses is the corresponding p-value (sign test).

Table 18 shows the evaluation results. We observe that BERT cannot distinguish different types of morals without input stories. RA-RoBERTa fails to accept the synonymous morals on STORAL-EN (winning rate of only 36% w.r.t. the ground truth, p < 0.1), and cannot distinguish synonymous and antonymous morals on either STORAL-ZH or STORAL-EN (winning rate near 50% with p > 0.1). Additionally, it significantly prefers antonymous morals to the ground truth on both datasets (winning rate less than 50% and p < 0.1). These results indicate that existing models still struggle to capture the value preferences of moral stories.

Table 19: Percentage of the texts annotated with a certain error among all 100 annotated texts in terms of moral faithfulness.
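The winning rates and sign-test p-values can be computed as in the following sketch, assuming each pair-wise comparison yields two model scores for the two candidate morals (our exact evaluation script may differ).

```python
from math import comb

def winning_rate(score_pairs):
    """Fraction of comparisons where the first candidate outscores
    the second, e.g. ground truth vs. antonymous moral."""
    wins = sum(a > b for a, b in score_pairs)
    return wins / len(score_pairs)

def sign_test_p(score_pairs):
    """Two-sided exact sign test (ties dropped) under the null
    hypothesis that either candidate wins with probability 0.5."""
    wins = sum(a > b for a, b in score_pairs)
    losses = sum(a < b for a, b in score_pairs)
    n, k = wins + losses, min(wins, losses)
    p_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)
```

A winning rate near 50% with a large p-value means the model cannot separate the two candidate types, as observed for the synonymous/antonymous comparison.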

C Error Analysis and Case Study
In this section, we conducted a case study and investigated the errors of existing models on the proposed tasks to provide insight into future work. We show several typical error cases in Table 20.

C.1 Understanding Tasks
The example in Table 20 for MOCPT shows that the model may not grasp abstract concepts such as "good will" and "good acts" or align them with the story plots. It possibly makes predictions based only on token-level features, such as the relation between "ask after" and "attention". On the other hand, the example for MOPREF indicates that the model cannot capture the value preference of the story in terms of "whether it is intelligent to regard others as illiterate". The results demonstrate the necessity of introducing concept knowledge and modeling high-level semantic information.

C.2 Generation Tasks

Table 21 shows cases generated by several baselines and our model for the generation tasks. We can see that retrieval provides effective guidance for both moral and story generation. Baseline models including GPT2 and T5 tend to generate unrelated concepts or non-moral texts. However, as shown by the manual evaluation results, there is still a big gap between RA-T5 and humans. To provide a quantitative error analysis, during the manual evaluation on STORAL-EN, we required annotators to annotate the error type of a text when it exhibited an unfaithful moral. We summarize three main error types as follows: (1) Not a moral text (NAM): not stating or implying what is right or what is wrong; (2) Unrelated concepts (UNREL): containing concepts unrelated to the input; and (3) Conflicting value preference (CONF): conveying a value preference conflicting with the input despite related concepts. In addition, we also provide annotators with another option, Others. Annotators are allowed to annotate a text with multiple errors. When at least two of the three annotators annotate a text with some error, we decide it has that error. We show the distribution of the error types in Table 19, which suggests that existing models still struggle to generate meaningful morals and stories, and to align the concepts and value preferences between them.

Table 20: Typical error cases predicted by RA-T5 (for the understanding tasks) or sampled from RA-T5 (for the generation tasks). For the generation tasks, the error types in terms of moral faithfulness include "not a moral text" (NAM), "unrelated concepts" (UNREL) and "conflicting value preference" (CONF). The underlined words are improper concepts/events which lead to the corresponding errors. Bold words for MO2ST are the given first sentence and the outline of multiple phrases.
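The two-of-three aggregation rule for deciding a text's error types can be sketched as follows (the function name and label strings are illustrative).

```python
from collections import Counter

def aggregate_errors(annotations, min_votes=2):
    """Decide which error types a text has: an error label counts if
    at least `min_votes` annotators marked it. `annotations` is a
    list of per-annotator sets of labels, e.g. {"NAM", "UNREL"}."""
    votes = Counter(label for ann in annotations for label in ann)
    return {label for label, c in votes.items() if c >= min_votes}
```

Because each annotator may mark multiple errors, a single text can end up with several aggregated labels, which is why the percentages in Table 19 need not sum to 100%.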
Furthermore, as exemplified in Table 20, when generating morals, we can see from Case 1 that the