Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences

In social settings, much of human behavior is governed by unspoken rules of conduct. For artificial systems to be fully integrated into social environments, adherence to such norms is a central prerequisite. We investigate whether contemporary NLG models can serve as behavioral priors for systems deployed in social settings by generating action hypotheses that achieve predefined goals under moral constraints. Moreover, we examine whether models can anticipate the likely consequences of (im)moral actions, or explain why certain actions are preferable by generating relevant norms. For this purpose, we introduce Moral Stories, a crowd-sourced dataset of structured, branching narratives for the study of grounded, goal-oriented social reasoning. Finally, we propose decoding strategies that effectively combine multiple expert models to significantly improve the quality of generated actions, consequences, and norms compared to strong baselines, e.g. through abductive reasoning.


Introduction
The ability to successfully navigate social situations in order to achieve specific goals, such as ordering food at a restaurant or taking the bus to work, is fundamental to everyday life. Importantly, it combines two distinct competencies: completion of actions consistent with one's intention, and adherence to unspoken rules of social conduct. While failing at the former prevents the transition to the desired world state, socially objectionable behaviour is likely to have negative consequences, which a cooperative actor would naturally want to avoid. For instance, rudely ordering food at a restaurant may offend the staff and result in worse service. While humans generally excel at tailoring their actions to accomplish desired outcomes in a socially acceptable way, it remains unclear whether artificial systems can master this essential skill. 1

1 Data and code: https://github.com/demelin/moral_stories.

In this work, we examine the moral reasoning capabilities of natural language generation (NLG) models as proxies for intelligent agents navigating social spaces. To this end, we task models with generating descriptions of actions that fulfill certain goals while either observing or violating norms denoting morally (in)defensible behaviour. The generation process is grounded in concrete social situations, which allows models to reason about appropriate behaviour in a simulated real-world setting. Successful models would be well-suited to serving as direct, value-aligned priors for agents deployed in social spaces. Concretely, executing the generated action descriptions should enable agents to complete their assigned tasks in a socially-compatible way. To further examine the suitability of generative models as priors for moral reasoning, we task them with identifying likely consequences of morally-valued actions, and with discovering new norms based on morally divergent action pairs.
Previous efforts to model the intentions underlying social actions and their consequences (Rashkin et al., 2018; Hwang et al., 2020) largely regard actions in isolation, without taking into account their broader situational context or norm conformity. Conversely, recent work examining the alignment of social behaviour with established conventions (Forbes et al., 2020; Hendrycks et al., 2020) does not consider the actors' motivations or action outcomes. This work unifies and extends both research directions by grounding model decisions in concrete social situations, introducing moral norms as constraints on goal-directed action generation, and anticipating consequences to inform action choice. To our knowledge, this represents the first study of goal-oriented moral reasoning in social settings, as expected of intelligent agents collaborating with humans in interactive environments.
In order to evaluate the extent to which models are capable of this type of reasoning, we introduce Moral Stories, a novel, crowd-sourced dataset of structured narratives that describe moral and immoral actions taken by individuals to accomplish certain goals in concrete situations, together with their respective consequences. Our focus is on descriptive morality, i.e. people's subjective judgments about the character and actions of others, guided by an implicit code of conduct (Gert and Gert, 2002). Based on this resource, we develop a series of tasks that target models' ability to reason about goal-directed behaviour while considering its adherence to moral directives. We furthermore propose several decoding strategies that improve generation quality by either anticipating the consequences of actions or re-ranking predictions based on their adherence to normative and narrative constraints. The primary contributions of our work are as follows:
1. We present Moral Stories, a structured corpus of 12k short narratives for goal-oriented moral reasoning grounded in social situations.
2. We evaluate competitive baseline models on a range of classification and generation tasks enabled by the Moral Stories dataset.
3. We define a family of Chain-of-Experts decoding algorithms that sequentially combine expert models to improve generation quality.

The Moral Stories Dataset
All stories in the dataset consist of seven sentences, each belonging to one of the following categories:

Norm: Moral rule of conduct generally observed by most people in everyday situations.
Situation: Description of the story's social setting that introduces one or more story participants.
Intention: Reasonable goal that one story participant, i.e. the actor, wants to fulfill.
Moral action: Action performed by the actor that fulfills the intention while observing the norm.
Moral consequence: 2 Likely effect of the moral action on the actor's environment.
Immoral action: Action performed by the actor that fulfills the intention while violating the norm.
Immoral consequence: Likely effect of the immoral action on the actor's environment.
Accordingly, each story's constituent sentences can be grouped into three segments. The context segment grounds actions within a particular social scenario, the moral path segment contains the moral action and its consequence, whereas the immoral path includes their immoral analogues. Combining the context segment separately with each path segment yields two self-contained, morally divergent sub-stories. Figure 1 illustrates the hierarchical structure of an example narrative.
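The seven-sentence schema and its two sub-stories can be captured in a small record type. The following is an illustrative sketch; the class, field, and method names are ours and not part of the dataset's release format:

```python
from dataclasses import dataclass

@dataclass
class MoralStory:
    """One Moral Stories narrative; field names are illustrative."""
    norm: str
    situation: str
    intention: str
    moral_action: str
    moral_consequence: str
    immoral_action: str
    immoral_consequence: str

    @property
    def context(self):
        """The shared grounding segment of the story."""
        return (self.norm, self.situation, self.intention)

    def path(self, moral: bool):
        """Return one self-contained sub-story: context segment plus
        either the moral or the immoral action/consequence path."""
        if moral:
            return self.context + (self.moral_action, self.moral_consequence)
        return self.context + (self.immoral_action, self.immoral_consequence)
```

Combining the shared context with each path yields the two morally divergent sub-stories described above.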

Dataset Collection
We collect our dataset via the Amazon Mechanical Turk (AMT) platform with the help of crowdworkers. One central challenge in constructing the dataset has been obtaining narratives that are thematically varied. To achieve this, workers were given semantically diverse moral norms as writing prompts. Suitable norms were extracted from the Morality/Ethics and Social Norms categories of the SOCIAL-CHEM-101 dataset (Forbes et al., 2020), ignoring controversial or value-neutral entries.
For each story, workers were given three different norms and asked to choose one as their prompt. To guide the writing process, we provided workers with detailed writing instructions, including:
• Situations must describe realistic, everyday events and introduce one or more participants.
• Intentions must be rational and expected given their respective situations.
• Both actions must represent a valid, plausible way to satisfy the actor's intention.
• Consequences must describe direct and plausible reactions of the actor's environment, or the actor, to the respective actions.
Furthermore, workers were instructed to avoid morally charged words, such as praised, joyous, assaulted, or steal, when composing actions, in order to mitigate potential lexical artifacts.
To ensure high quality of collected narratives, workers had to complete a qualification round before contributing to the dataset. Throughout the collection process, a fraction of each worker's submissions was periodically reviewed to provide both personalized and general feedback about any format violations. Workers who repeatedly submitted substandard stories and ignored corrective feedback were disqualified. Once the initial set of stories had been collected, a validation round was conducted to identify and remove inadequate entries. Of the initially collected ∼14k stories, 12k were retained following the validation step. Dataset statistics, additional story examples, and representative excerpts of worker instructions can be found in Appendix A. All workers were paid >$15/hour, on average.
With the dataset at our disposal, we first examine whether models can identify actions that satisfy normative constraints, as well as their likely consequences. Since classification is a demonstrably easier task than generation (Bhagavatula et al., 2019;Rudinger et al., 2020), establishing classification efficacy promises insights into potential strategies for improving generation quality.

Grounded Classification
The information-rich, structured nature of our data allows us to examine several challenging classification tasks that target different story components and incorporate varying amounts of grounding information. By examining different grounding levels, we aim to establish the importance of contextual knowledge for accurate classification decisions.
In all experiments we rely on RoBERTa (Liu et al., 2019) 3 as our classification model of choice, due to its state-of-the-art performance on various natural language understanding (NLU) benchmarks (Wang et al., 2019a). For each task, a grid search over hyper-parameters is conducted to ensure representative performance. 4 A summary of the best-performing hyper-parameter settings for each task is provided in Appendix B, which also reports model performance on development data and data subset sizes.

Data Splits
To probe the classifier's generalization ability and vulnerability to spurious correlations, we consider three different strategies for splitting the dataset: Norm Distance (ND): Examines how well classifiers generalize to novel norms. To perform the split, all norms are embedded and grouped into 1k clusters via agglomerative clustering 5 . We then order clusters according to their degree of isolation (DoI), defined as the cosine distance between a cluster's centroid and the next-closest cluster's centroid. Stories with norms from most isolated clusters are assigned to test and development sets, while the training set contains the least unique norms.
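The degree-of-isolation ordering can be sketched in a few lines, assuming cluster centroids have already been obtained (e.g. from agglomerative clustering over sentence embeddings of the norms). The function names below are illustrative:

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def order_by_isolation(centroids):
    """Rank cluster ids by degree of isolation (DoI): the cosine distance
    between a cluster's centroid and the next-closest centroid. The most
    isolated clusters come first and would be assigned to the test split."""
    doi = {}
    for cid, c in centroids.items():
        doi[cid] = min(
            cosine_distance(c, other)
            for oid, other in centroids.items() if oid != cid
        )
    return sorted(centroids, key=lambda cid: doi[cid], reverse=True)
```

Stories whose norms fall into the most isolated clusters go to the test and development sets, and the remainder to training.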
Lexical Bias (LB): Probes the susceptibility of classifiers to surface-level lexical correlations, similar to (Emelin et al., 2020). We first identify 100 biased lemmas that occur most frequently either in moral or immoral actions. 6 Each story is then assigned a bias score (BS) corresponding to the total number of biased lemmas present in both actions (or consequences). Starting with the lowest bias scores, stories are assigned to the test, development, and, lastly, training set.
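A minimal sketch of the bias-scoring step, assuming lemmatization is done upstream. The exact criterion used to select "biased" lemmas here (largest frequency gap between moral and immoral actions) is our simplifying assumption:

```python
from collections import Counter

def biased_lemmas(stories, k=100):
    """Identify the k lemmas whose frequencies differ most between moral
    and immoral actions. `stories` yields (moral_lemmas, immoral_lemmas)
    pairs of pre-lemmatized token lists."""
    moral, immoral = Counter(), Counter()
    for m_toks, im_toks in stories:
        moral.update(m_toks)
        immoral.update(im_toks)
    gap = {w: abs(moral[w] - immoral[w]) for w in set(moral) | set(immoral)}
    return set(sorted(gap, key=gap.get, reverse=True)[:k])

def bias_score(story, biased):
    """Bias score (BS): total count of biased lemmas across both actions."""
    m_toks, im_toks = story
    return sum(tok in biased for tok in m_toks + im_toks)
```

Stories with the lowest bias scores are then assigned to the test set first, as described above.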
Minimal Pairs (MP): Evaluates the model's ability to perform nuanced moral reasoning. Splits are obtained by ordering stories according to the Damerau-Levenshtein distance (DL) (Brill and Moore, 2000) between their actions (or consequences) and assigning stories with the lowest distances to the test set, followed by the development set. The remainder makes up the training set. As Table 1 shows, the test sets obtained in this way differ noticeably from the training sets, thus requiring classifiers to be robust and capable of generalization.
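The distance underlying this split can be computed with the standard optimal-string-alignment variant of Damerau-Levenshtein; a self-contained sketch:

```python
def damerau_levenshtein(a, b):
    """Optimal-string-alignment variant of Damerau-Levenshtein distance:
    edit distance permitting insertions, deletions, substitutions, and
    transpositions of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

Sorting stories in ascending order of the distance between their two actions (or consequences) then yields the minimal-pairs test set.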

(Table 1, reporting per-split statistics for each splitting strategy across the train, development, and test sets, is omitted here.)
Action Classification
We define four binary action classification settings by grounding actions in varying amounts of auxiliary information. 7 In the following, story components are abbreviated as N = norm, S = situation, I = intention, A = action, C = consequence of A:

Setting                          Grounding
action                           None
action+norm                      N
action+context                   N + S + I
action+context+consequence       N + S + I + C

For each setting, the model's objective is to determine whether a given action is moral (relative to the norm, if provided). Each story yields two classification samples, one per action, that share norm and context sentences. Table 2 lists test accuracy for each setting and data split.

A clear trend towards improved classification accuracy emerges with increasing amounts of grounding, across all test sets. Notably, classifying actions in isolation proves challenging once lexical biases have been controlled for. Improvements in accuracy observed for models with access to relevant norms, meanwhile, demonstrate the classifier's ability to relate actions to behavioural rules. We also find that contextual grounding facilitates moral reasoning in the absence of shortcuts. Lastly, the near-perfect performance achieved by including consequences in the classifier's input (in addition to norms and context) can be attributed to workers' tendency to associate moral actions with positive consequences and immoral actions with negative ones, 8 allowing the model to 'solve' the task by predicting consequence sentiment. Indeed, accuracy remains at 98-99% even when consequences are used as the sole grounding source.
Finally, differences in performance across test sets indicate that while the model learns to exploit annotation artifacts in the form of lexical correlations, their importance diminishes with improved grounding. Also noteworthy is that the lexical bias and minimal pairs test sets appear to be similarly challenging, implying that lexical frequency is one of the dominant surface-level cues exploited by the classifier.

8 This emerged naturally during dataset collection and can be argued to be (mostly) representative of reality.
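The grounding settings above amount to concatenating different subsets of story fields into the classifier input. A sketch follows, where the separator token and field order are our assumptions rather than the paper's exact input format:

```python
def build_classifier_input(setting, story, sep=" </s> "):
    """Assemble a single input string for one action classification
    setting. `story` is a dict with the usual Moral Stories fields;
    the separator token and field order are illustrative assumptions."""
    grounding = {
        "action": [],
        "action+norm": ["norm"],
        "action+context": ["norm", "situation", "intention"],
        "action+context+consequence":
            ["norm", "situation", "intention", "consequence"],
    }[setting]
    # Grounding fields first, the candidate action last.
    fields = [story[f] for f in grounding] + [story["action"]]
    return sep.join(fields)
```

Each story produces two such inputs, one per action, labeled moral or immoral.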

Consequence Classification
Next, we investigate classifiers' ability to discriminate between plausible and implausible consequences of morally divergent actions. To this end, we define the following settings:

Setting                          Grounding
conseq.+action                   A
conseq.+context+action           N + S + I + A

Negative classification samples are constructed by assigning consequences to actions of the opposing moral orientation within the same story. Table 3 summarizes test set results for each setting. As with action classification, contextual grounding clearly benefits model accuracy, suggesting that related tasks such as commonsense knowledge base completion (Malaviya et al., 2020) are likely to benefit from providing models with rich situational context, where possible. Examining the different test sets, we once again find the classifier to be adept at exploiting lexical correlations. Surprisingly, the minimal pairs split appears to be the least challenging, possibly due to the generally low similarity between consequences, as shown in Table 1. Overall, we find that classification models can successfully leverage grounding information to accurately distinguish between morally contrasting actions and identify plausible consequences.

Grounded Generation
While insights collected from the classification experiments are valuable, behavioural priors for intelligent agents must not be limited to merely recognizing socially acceptable actions. Evaluating contemporary models on the generative tasks enabled by the Moral Stories dataset offers initial insights into their ability to perform the desired forms of reasoning. Specifically, we aim to establish whether generative models can 1) produce action descriptions that satisfy goals while adhering to normative constraints, 2) predict plausible consequences of actions, and 3) generate relevant norms to explain the difference between morally divergent actions.

Table 4: Test results for action generation (best, second best). Metrics of interest are highlighted. For human evaluation, the format is as follows: total | moral target | immoral target.

Norm: It's expected to keep your pets on a leash.
Situation: James took his border collie on long walks because she was very high-energy.
Intention: James wants to wear his border collie out, so she's not hyper at home.
Immoral action (action|context): James puts his border collie on a leash and forces her to go on long walks at full-mast every day.
Immoral action (action|context+consequence): James takes his border collie for long walks, wearing her out.
Immoral action (CoE ranking): James kept taking his border collie for long walks because he thought she might lose energy.
Immoral action (CoE abductive refinement): James lets his border collie out without wearing a leash.
Immoral action (reference): James lets his border collie off her leash, so she can run around as he walks.
Immoral consequence: James' border collie jumps on another pedestrian, and they threaten to call animal control.
Owing to their exceptional performance across related NLG tasks (Forbes et al., 2020; Rudinger et al., 2020; Sakaguchi et al., 2020), our main interest is in evaluating pre-trained transformer language models (LMs). We examine two encoder-decoder architectures, BART (Lewis et al., 2019) and T5 (Raffel et al., 2019), and a single 'standard' LM, GPT-2. 9 In discussing generation results, we focus on the best architecture for each task, and summarize our findings for the remainder in Appendix C. All models are fine-tuned on task-specific instances of Moral Stories, split according to norm distance. Throughout, nucleus sampling (NS) (Holtzman et al., 2019) is used for decoding. Refer to Appendix C for data subset sizes, model hyper-parameters, and input formats.
Generation quality is assessed using a combination of automatic metrics and human evaluation. The former relies on BLEU (Papineni et al., 2002) and ROUGE-L 10 (Lin, 2004). For models that perform best on automatic metrics, human evaluation is conducted by expert workers who contributed a large number of high-quality stories to the dataset. Each model-generated sample is evaluated by averaging ratings obtained from three different workers. For action and consequence generation, scores highlighted in green denote judgments collected for moral targets, while scores in red refer to their immoral counterparts. Judgments are obtained for a fixed set of 200 randomly selected test samples per task, to keep comparisons fair. Krippendorff's α (Krippendorff, 2018) is used to estimate inter-annotator agreement.

9 We use the following model configurations: BART-large, T5-large, and GPT2-XL (Radford et al., 2019).
10 As implemented by SacreBLEU (Post, 2018) and SacreROUGE (Deutsch and Roth, 2019), respectively.

Action Generation
In evaluating models' ability to generate action hypotheses that simultaneously fulfill the stated goal and follow / violate the given norm, we consider two settings with varying levels of grounding:

Setting                          Grounding
action|context                   N + S + I
action|context+consequence       N + S + I + C

Each story yields two samples that share the same context. While the action|context setting emulates the process by which an agent decides on a suitable action given the information available at decision time, action|context+consequence corresponds to the agent incorporating a probable outcome of their action into the reasoning process. By conditioning the generation step on future information, the latter setting tests whether anticipating likely outcomes helps models select goal-directed, norm-abiding actions.

While the addition of consequences has little impact on automatic metrics, human judges prefer actions informed by their projected outcomes. By considering future information, models generate actions that more often satisfy goals and normative requirements. Since consequences describe direct outcomes of goals being fulfilled, they may bias models to generate goal-directed actions. Similarly, consequence sentiment may be a useful signal for the moral orientation of actions, as noted in §3.2.
Interestingly, moral actions are consistently rated more favourably on the Intention and Norm criteria than their immoral analogues. This suggests that evaluated LMs may have a moral positivity bias, since the majority of interactions in their pretraining data can be expected to adhere to established rules of conduct. Overall, our initial findings illustrate the utility of grounding offered by future information for guiding the behavior of social agents, while leaving much room for improvement.

Consequence Generation
Prediction of the plausible consequences that follow isolated social actions has been studied in the past (Rashkin et al., 2018; Bosselut et al., 2019). We expand upon such efforts by considering generation settings that ground actions to varying degrees and are centered on morally-valued behaviour:

Setting                          Grounding
consequence|action               A
consequence|context+action       N + S + I + A

11 I.e., whether actions that are expected to follow / violate the norm do, in fact, follow / violate the specified norm.
Social agents capable of correctly anticipating the effects of their actions can adjust their behaviour to be most beneficial to situation participants, thus adhering to the principle of utilitarianism (Lazari-Radek and Singer, 2017). As before, two samples are derived from each story, sharing the same context. Human judges indicated whether each predicted consequence is coherent and whether it can plausibly follow the respective action. Quality assessment of predicted consequences is presented in Table 5; generation examples are included in Appendix C.
The effect of contextual grounding is evident from automatic and human evaluation alike. Crucially, grounded prediction yields more plausible consequences, but fails to do so reliably. We again observe inferior model performance for immoral targets, which supports the presence of a moral positivity bias in pre-trained LMs. Importantly, our results demonstrate that NLG models are capable of exploiting rich grounding information when reasoning about expected outcomes of actions.

Norm Discovery
The final task probes the ability of generative models to explain the difference between acceptable and objectionable behaviour by producing relevant norms. Being able to identify unstated rules of conduct would enable agents to autonomously discover value systems by observing their environment. As with previous tasks, we define several settings that permit varying levels of grounding: 12

Setting                          Grounding
norm|actions                     A
norm|context+actions             S + I + A
norm|context+actions+conseq.     S + I + A + C

To assess generation quality, human judges indicated whether norms are coherent and adequately explain the moral contrast between the two actions. In a pilot study, we found the generated norms to be less specific than human-authored ones, which we quantify via the fraction of unique n-grams 13 among generated norms. Results are reported in Table 6, while example predictions can be found in Appendix C.

12 Here, A = both actions, and C = both consequences.

In contrast to previous tasks, contextual grounding does not improve norm relevance, suggesting a possible mismatch of useful conditioning information. As expected, we find generated norms to be consistently less diverse than the ones used as story prompts, which holds across all settings. Of note is the increase in norm relevance caused by including consequences in the set of grounding information. It is likely that consequences, by referencing parts of the action descriptions, point the model towards relevant action properties. Even so, the absolute relevance of predicted norms remains quite low.

Chain-of-Experts Decoding Strategies
Our initial investigation revealed that NLG models produce coherent sequences, but often fail to fully satisfy both explicit and implicit generation constraints. To address this deficit, we propose task-specific decoding strategies that employ chains of fine-tuned expert models (CoE) to enforce constraint satisfaction. Specifically, we use classifiers to rank model outputs and condition generative models on other experts' predictions. Appendix C lists the models employed as experts for each strategy.

Improving action morality
To facilitate action adherence to normative constraints, we propose two strategies (in all experiments, we set N = 10 and decode with NS, p = 0.9):

Ranking:
1. Per sample, predict N diverse actions using the action|context generator.
2. Rank actions based on target class probabilities 14 assigned by the action+context classifier.
3. Return the best action per sample.

13 We jointly consider all 1- to 4-grams.
14 I.e., the action is moral or the action is immoral.
Abductive refinement:
1. Per sample, predict and rank N initial actions using the action|context and action+context models.
2. Predict and rank N consequences of the best initial action using the conseq.|context+action and conseq.+context+action models.
3. Predict and rank N refined actions using the action|context+conseq. and action+context+conseq. models, conditioned on the best consequence.
4. Return the best refined action per sample.
The ranking algorithm aims to leverage high accuracy of action classifiers, while abductive refinement is moreover informed by the superior performance of models conditioned on probable consequences. Taking into consideration likely outcomes of initial action hypotheses, a suitable expert model is able to refine predictions by performing abductive inference grounded in anticipated future states. As Table 4 shows, both strategies yield actions that are substantially more relevant to specified norms. Compared to the action|context baseline, abductive refinement achieves an improvement of 23%, effectively showcasing the utility of anticipating future states for socially optimal decision making. Consistent with previous findings, generation of immoral actions continues to be more challenging, but also significantly improves for both algorithms.
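Both strategies reduce to the same generate-then-rank pattern. The following sketch stubs each expert model as a plain callable, so the interfaces are illustrative assumptions rather than the actual model APIs:

```python
def rank_best(candidates, score):
    """Return the candidate scored highest for the target class."""
    return max(candidates, key=score)

def abductive_refinement(context, n,
                         gen_action, score_action,
                         gen_conseq, score_conseq,
                         gen_refined):
    """Chain-of-Experts abductive refinement with stubbed experts.
    gen_* callables sample n diverse hypotheses (e.g. via nucleus
    sampling); score_* return classifier probabilities for the target
    class. Interfaces are illustrative assumptions."""
    # 1. Predict and rank n initial action hypotheses.
    action = rank_best(gen_action(context, n), score_action)
    # 2. Predict and rank n likely consequences of the best initial action.
    conseq = rank_best(gen_conseq(context, action, n), score_conseq)
    # 3. Re-generate the action conditioned on the best consequence, and rank.
    return rank_best(gen_refined(context, conseq, n), score_action)
```

The plain ranking strategy corresponds to running only step 1 and returning its result.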

Improving consequence plausibility
To aid the generation of plausible consequences, we propose the following CoE strategies:

Ranking:
1. Per sample, predict N diverse consequences using the conseq.|context+action generator.
2. Rank consequences based on probabilities 15 assigned by the conseq.+context+action classifier.
3. Return the best consequence per sample.
Each algorithm relies on a classifier to identify plausible consequences with high accuracy. From results in Table 5, we conclude that both obtain improvements in plausibility, whereby the simpler ranking strategy is more successful, surpassing the best non-CoE result by 7%. We attribute this to the combination of high recall achieved by sampling multiple hypotheses, and high precision afforded by the strong classifier. Limited to a single hypothesis, iterative refinement is unable to effectively explore the output space. The refinement model may also struggle to fully utilize classifier labels as instructions to rewrite the consequence draft. While immoral consequences continue to be less plausible than moral ones, both strategies narrow the gap compared to single-model baselines.

Improving norm relevance
Finally, we consider how norm relevance can be improved when action outcomes are not known a priori, which is the default scenario for agents navigating social spaces. We implement the following algorithm, which uses a dedicated expert model to anticipate the consequences of actions:

Generation with synthetic consequences:
1. Per sample, predict N consequences for both actions using the conseq.|context+action model.
2. Rank consequences based on probabilities assigned by the conseq.+context+action classifier.
3. Use the norm|context+actions+conseq. generator, conditioned on the best consequences, to predict the relevant norm.

As Table 6 shows, norms informed by synthetic consequences are just as relevant as those based on reference consequences. Thus, anticipating action outcomes is an effective strategy for learning salient behavioural norms that improves upon generation conditioned solely on actions and context.
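The three steps above, sketched with stubbed expert callables (interfaces are illustrative assumptions, not the actual model APIs):

```python
def norm_with_synthetic_consequences(context, moral_action, immoral_action, n,
                                     gen_conseq, score_conseq, gen_norm):
    """Discover a norm without gold consequences: sample n consequence
    hypotheses per action, keep the most plausible one according to the
    consequence classifier, then condition the norm generator on both
    synthetic consequences."""
    best = {}
    for action in (moral_action, immoral_action):
        candidates = gen_conseq(context, action, n)
        best[action] = max(candidates, key=lambda c: score_conseq(action, c))
    return gen_norm(context, moral_action, best[moral_action],
                    immoral_action, best[immoral_action])
```

Swapping the gold consequences for these ranked synthetic ones is what allows the approach to operate when action outcomes are not known a priori.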

Related Work
Our study is, in large part, motivated by the existing body of research into the computational study of social dynamics (Rashkin et al., 2018; Sap et al., 2019a,b, 2020), as well as by recent efforts investigating whether NLU / NLG models can reason about moral and ethical principles. Among the latter category, Frazier et al. (2020) is notable for proposing the use of linguistic priors to guide the behaviour of intelligent agents as a viable alternative to imitation and preference learning, which has recently been attempted for procedural, object-oriented reasoning by Shridhar et al. (2020). In constructing Moral Stories, we relied on the richly annotated norms in the SOCIAL-CHEM-101 dataset of Forbes et al. (2020). Initial forays into evaluating ethical judgments of NLU models on long-form, unstructured texts were made by Lourie et al. (2020) and Hendrycks et al. (2020), but remained limited to classification. To the best of our knowledge, our work is the first to evaluate the moral reasoning capabilities of generative models in realistic, grounded social scenarios represented by multi-sentence stories.
The proposed CoE algorithms are closely related to rescoring methods employed in NLG (Holtzman et al., 2018; Cho et al., 2019; Gabriel et al., 2019; Hossain et al., 2020; Goldfarb-Tarrant et al., 2020), among others. Refinement of initial hypotheses by a secondary expert model, on the other hand, follows the general principle underlying deliberation networks, initially developed to improve machine translation quality (Xia et al., 2017; Wang et al., 2019b), although limited to inference only for our purposes.

Conclusion and Future Work
We conducted a thorough investigation of goal-directed moral reasoning grounded in concrete social situations, using the new Moral Stories dataset. Our findings demonstrate that strong classifiers can identify moral actions and plausible consequences with high accuracy by leveraging rich grounding information. On the other hand, generative models frequently fail to adhere to task-specific constraints such as norm relevance or plausibility. We address this issue by introducing a family of decoding algorithms that rely on expert models to facilitate constraint satisfaction, and show their effectiveness according to human evaluation. Notably, we demonstrate the usefulness of anticipating highly plausible action outcomes for socially optimal decision making and for the discovery of unspoken moral principles that govern social interactions.
Future efforts may extend the computational study of moral reasoning to more complex scenarios, develop methods for automated norm discovery that are applicable to non-Western norms and customs, or integrate presented methods into narrative and dialogue generation.
Ethical Considerations

In constructing the Moral Stories dataset, great care was taken to ensure that crowd-workers were compensated fairly for their efforts. To this end, we monitored median HIT 16 completion times for each published batch, adjusting the monetary reward so that the median worker always received >$15/hour, which is roughly double the minimum wage in the United States (the country of residence for most of our workers). This included the qualification and evaluation rounds. The following data statement (Bender and Friedman, 2018) summarizes relevant aspects of the data collection process:

A. CURATION RATIONALE: Selection criteria for stories included in the presented dataset are discussed in detail in §2.1. For narratives to be accepted into the dataset, they had to be coherent and internally cohesive, and follow the format specified in the instructions given to workers. Contributors were further directed to avoid offensive and biased language, and to focus on real-life, everyday scenarios. When describing actions and consequences, we asked workers to imagine themselves as either the actor or the person affected by the actor's actions, so as to obtain realistic representations of social dynamics.
B. LANGUAGE VARIETY: The dataset is available in English, with mainstream US Englishes being the dominant variety, as indicated by selfreported contributor demographics.
C. SPEAKER DEMOGRAPHIC: We asked crowd-workers to provide basic demographic information during the qualification round, and summarize the corresponding statistics for all 130 contributors to the final dataset (each dominant group is underlined for clarity):
• Economic class: middle: 43.9%, upper-middle: 7.7%, no answer: 3.9%
• Location: US: 98.5%, non-US: 1.5%

16 Human Intelligence Task, corresponding to writing / evaluating a single narrative, in our case.

As such, the data includes contributions from writers across different age brackets, genders, and economic backgrounds. At the same time, it skews noticeably towards White, educated US residents. Future efforts must therefore be aimed at the collection of moral narratives from less-represented groups.
D. ANNOTATOR DEMOGRAPHIC: N/A

E. SPEECH SITUATION: All narratives were collected and validated over a period of approximately 12 weeks, between June and September 2020, through the AMT platform. As mentioned in §2.1, workers were given regular, detailed feedback regarding the quality of their submissions and were able to address any questions or comments to the study's main author via email / Slack.

F. TEXT CHARACTERISTICS: In line with the intended purpose of the dataset, the included narratives describe social interactions related (but not limited) to domestic life, platonic and romantic relationships, as well as appropriate conduct at school or work. A breakdown of the most representative, automatically discovered topics is given in Appendix A. Notably, COVID-19 features prominently in several stories, serving as a diachronic marker of the data collection period.
G. RECORDING QUALITY: N/A

H. OTHER: N/A

I. PROVENANCE APPENDIX: To obtain thematically varied narratives, workers were given norms extracted from the SOCIAL-CHEM-101 corpus as writing prompts. As reported by Forbes et al. (2020), the demographics of the contributing crowd-workers are comparable to those involved in the creation of Moral Stories, showing a roughly balanced gender, age, and economic class distribution. Similarly, the vast majority of workers self-identified as white (89%) and resided in the US (94%).
Lastly, we want to emphasize that our work is strictly scientific in nature and serves the exploration of machine reasoning alone. It was not developed to offer guidance or advice for human interactions, nor should it be treated as such. Conceivably, the inclusion of immoral action choices and their consequences in the dataset could allow adversaries to train malicious agents that purposefully violate norms in order to sow social discord. We are aware of this risk, but also want to emphasize the utility of immoral choices as explicit examples of behaviour to be avoided by cooperative agents. As such, they provide a useful negative training signal for minimizing harm that may be caused by agents operating in social spaces. It is therefore necessary for future work that uses our dataset to specify how the collected examples of both moral and immoral behaviour are used, and for what purpose. As touched upon in the data statement, we aimed to minimize the presence of offensive or biased language in the dataset by providing workers with corresponding instructions.

In addition to reporting the overall dataset size, we examine the average length of individual story component categories. As Table 7 shows, morally divergent actions and consequences are of comparable length, making sequence length an unlikely data artifact to be exploited by classification models for performance gains. Moreover, we find norms and intentions to be substantially shorter than the other categories, which is attributable to their limited semantic content. In contrast, situation, action, and consequence descriptions are considerably more open-ended and, as a result, longer.
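The per-component length comparison above can be reproduced with a simple token count. The sketch below is illustrative only: the field names (`norm`, `situation`, `moral_action`) and example stories are assumed, not the dataset's actual schema, and lengths are approximated by whitespace tokens.

```python
# Hypothetical sketch: mean whitespace-token length per story component,
# mirroring the Table 7 analysis. Field names and examples are invented.

def avg_component_lengths(stories):
    """Return the mean token length for each story component category."""
    totals, counts = {}, {}
    for story in stories:
        for component, text in story.items():
            totals[component] = totals.get(component, 0) + len(text.split())
            counts[component] = counts.get(component, 0) + 1
    return {c: totals[c] / counts[c] for c in totals}

stories = [
    {"norm": "It is rude to cut in line.",
     "situation": "Dana is waiting in a long queue at the post office on her lunch break.",
     "moral_action": "Dana waits patiently for her turn at the counter."},
    {"norm": "You should tip your server.",
     "situation": "Alex finishes a pleasant dinner at a small family-run restaurant.",
     "moral_action": "Alex leaves a generous tip before heading home."},
]
lengths = avg_component_lengths(stories)  # norms come out shortest, as in the paper
```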

A Dataset: Supplementary Details
To develop a better understanding of the different story topics represented in the Moral Stories dataset, we perform latent Dirichlet allocation (LDA) (Blei et al., 2003) on the collected narratives,17 and list words corresponding to ten latent topics in Table 13. We conclude that the dataset is centered around interpersonal relationships in a variety of settings, including domestic life, commerce, and education. Since we instructed crowd-workers to compose realistic narratives based on norms describing rules of social conduct, this is an expected outcome that supports the effectiveness of our data collection method. Example narratives shown in Figure 3 further showcase the thematic diversity of the dataset.
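The topic-modeling step above amounts to a standard preprocessing-plus-LDA pipeline. The sketch below builds the bag-of-words corpus that Gensim's `LdaModel` consumes; the stopword list and example sentences are placeholders, and the final (commented) model call assumes Gensim is installed.

```python
# Sketch of an LDA preprocessing pipeline, assuming narratives are plain strings.
# The bag-of-words (word_id, count) format is what gensim.models.LdaModel expects.

def build_corpus(narratives, stopwords=frozenset({"a", "the", "to", "is"})):
    """Tokenize, drop stopwords, and map each document to (word_id, count) pairs."""
    vocab = {}
    corpus = []
    for text in narratives:
        counts = {}
        for token in text.lower().split():
            token = token.strip(".,!?")
            if token and token not in stopwords:
                word_id = vocab.setdefault(token, len(vocab))
                counts[word_id] = counts.get(word_id, 0) + 1
        corpus.append(sorted(counts.items()))
    return vocab, corpus

vocab, corpus = build_corpus([
    "Dana waits for her turn at the counter.",
    "Alex leaves a generous tip for the server.",
])

# With gensim installed, ten latent topics could then be fit as:
# from gensim.models import LdaModel
# lda = LdaModel(corpus, num_topics=10, id2word={i: w for w, i in vocab.items()})
```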
Lastly, we provide excerpts of HIT instructions given to AMT workers during the story collection phase in Figures 7-14. While the instructions are extensive, workers were able to familiarize themselves with the task during the qualification round and were provided with annotated positive and negative examples that highlighted different aspects of the required format. Detailed feedback helped workers resolve any remaining uncertainties.

17 We use the implementation provided by the Gensim library (Rehurek and Sojka, 2011).

B Classification: Supplementary Details
Hyper-parameters used for training the classification models for all tasks, settings, and data splits are given in Table 14. The following hyper-parameters were kept constant across all classification experiments: max. input length (subwords): 100, Adam ε: 1e-8, gradient norm: 1.0, warm-up steps: 0. All models were fine-tuned and evaluated on a single NVIDIA QUADRO RTX 8000 GPU, for classification and generation alike.
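The fixed settings above can be captured in a small configuration, with the gradient-norm constraint implemented as global L2-norm clipping. This is a minimal pure-Python sketch; the dictionary keys are illustrative names, not identifiers from the released codebase.

```python
# Hedged sketch: the constant classification hyper-parameters and the global
# gradient-norm clipping that a max-norm of 1.0 implies. Names are invented.
import math

FIXED_HPARAMS = {
    "max_input_subwords": 100,  # max. input length in subwords
    "adam_epsilon": 1e-8,       # Adam ε
    "max_grad_norm": 1.0,       # gradient-norm clipping threshold
    "warmup_steps": 0,
}

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], FIXED_HPARAMS["max_grad_norm"])
```

In frameworks such as PyTorch, the same effect is obtained with a built-in utility (`torch.nn.utils.clip_grad_norm_`) rather than a hand-rolled function.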
We report classifier performance on the development sets in Tables 8 and 9. Given that development sets are, by design, less challenging than test sets, as indicated by the split properties reported in Table 1, models perform better on development data across the board by exploiting shortcuts present in the training data.

C Generation: Supplementary Details
Hyper-parameters used to fine-tune all generation models are specified in Table 11. Default values are adopted for the rest. Overall training duration differs between tasks and model architectures, due to early stopping. We report automatic quality estimation metrics for second- and third-best models for all generation tasks and settings in Tables 15-17. Table 12 lists the sizes of data subsets used in all generation experiments, across all settings. For further clarity, Table 18 illustrates input formats that correspond to different generation settings. Special separator tokens formatted as <|TOKEN|> are added to each model's vocabulary prior to fine-tuning and assigned randomly initialized embeddings.18 Examples of actions, consequences, and norms produced by the methods discussed in the main text are supplied in Figures 4, 5, and 6, respectively. Finally, Table 19

18 For iterative consequence refinement, <|CSQ_PL|> / <|CSQ_IMPL|> corresponds to the label assigned by the classifier, i.e. the consequence draft is plausible / implausible.

Table 13: Most frequent words for each of the ten latent topics: relationships-1, education, commerce, domestic, meals, relationships-2, festive, family, relationships-3, romantic.

Table 14: Hyper-parameters used for fine-tuning best-performing classification models; Format: ND / LB / MP.
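To illustrate how separator-delimited inputs are assembled, the sketch below concatenates story components with <|TOKEN|>-style markers. Only <|CSQ_PL|> / <|CSQ_IMPL|> come from the text above; the other token names (<|NRM|>, <|SIT|>, <|ITN|>) and the component ordering are invented for illustration and need not match Table 18.

```python
# Hedged sketch of building a generation input with special separator tokens.
# Token names other than <|CSQ_PL|> / <|CSQ_IMPL|> are hypothetical.

def build_input(norm, situation, intention, draft_plausible=None):
    """Concatenate story components, each preceded by a separator token."""
    parts = ["<|NRM|>", norm, "<|SIT|>", situation, "<|ITN|>", intention]
    if draft_plausible is not None:
        # Iterative consequence refinement: append the classifier's label
        # for the current consequence draft.
        parts.append("<|CSQ_PL|>" if draft_plausible else "<|CSQ_IMPL|>")
    return " ".join(parts)

example = build_input(
    "It is rude to cut in line.",
    "Dana is waiting in a long queue at the post office.",
    "Dana wants to mail a package before her break ends.",
    draft_plausible=False,
)
```

With a Hugging Face tokenizer, such tokens would typically be registered via `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`, which initializes the new embeddings randomly, consistent with the procedure described above.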