NormBank: A Knowledge Bank of Situational Social Norms

We present NormBank, a knowledge bank of 155k situational norms. This resource is designed to ground flexible normative reasoning for interactive, assistive, and collaborative AI systems. Unlike prior commonsense resources, NormBank grounds each inference within a multivalent sociocultural frame, which includes the setting (e.g., restaurant), the agents’ contingent roles (waiter, customer), their attributes (age, gender), and other physical, social, and cultural constraints (e.g., the temperature or the country of operation). In total, NormBank contains 63k unique constraints from a taxonomy that we introduce and iteratively refine here. Constraints then apply in different combinations to frame social norms. Under these manipulations, norms are non-monotonic — one can cancel an inference by updating its frame even slightly. Still, we find evidence that neural models can help reliably extend the scope and coverage of NormBank. We further demonstrate the utility of this resource with a series of transfer experiments. For data and code, see https://github.com/SALT-NLP/normbank


Introduction
As AI systems continually evolve for human assistance and collaboration, they will increasingly operate within cultural and social spaces, and they will require increasingly robust and flexible knowledge of social norms (Carlucci et al., 2015). From dialogue systems (Molnár and Szüts, 2018; Vaidyam et al., 2019; Bavaresco et al., 2020; Grossman et al., 2019) to socially interactive robots (Fong et al., 2003; Deng et al., 2019) and augmented or mixed reality technologies (Anderson and Rainie, 2022), each could benefit from understanding how humans effectively communicate, make decisions, engage with requests, and broadly interact with others (Sunstein, 1996; Sherif and Sherif, 1953).
Figure 1: What is special about NORMBANK? Norms are grounded by situational constraints: environmental and personal attributes, as well as roles and other behaviors. In this example, drinking coffee is an encouraged activity in its prototypical context, for a customer in a cafe, but it is counternormative for a working barista to do so in the same cafe, or for a child-age student to do so in a classroom. These represent only some of the non-monotonic normative inferences that are represented in NORMBANK.
Natural language is flexible and highly expressive; thus it is a promising medium for encoding knowledge of social norms (Sap et al., 2019a). The goal of this work is to construct NORMBANK, a natural language bank of social norms that will allow AI systems to reason about social situations under complex constraints. NORMBANK encodes 155k norms via scalable human annotation, bootstrapped with implicit knowledge from large pre-trained language models (LLMs).
NORMBANK factors in two important considerations that have been previously overlooked. First: norms are not rigid truths; they are flexibly assumed standards that may be updated with new information from the social context (Blass and Horswill, 2015). Second: this social context is not a flat list of facts but a matrix of hierarchical dependencies (Hovy and Yang, 2021). These two considerations have important design implications for norm representations and reasoning in AI, inspiring two objectives for this work.
Objective 1 is to support non-monotonic reasoning (Reiter, 1981) over defeasible (Pollock, 1987) norms. This means inferences that hold in most cases can be updated or even retracted based on new information. For example, dancing is a positive behavior that is generally permitted in many casual settings and in many cultures. We can still strengthen or cancel this inference. On the one hand, dancing is expected from a professional dancer. But in an Islamic cultural context, individuals are forbidden from publicly dancing with members of the opposite sex. In a hospital setting, a young child is allowed to dance in the waiting room, but this behavior would not be expected from an adult visiting a dying relative. For more examples, see Figure 1. This kind of reasoning will not always follow straightforward compositional logic (Klimczyk, 2021), and we expect it to be a challenge for AI systems. NORMBANK is the first data resource to support non-monotonic normative reasoning by encoding contrasting situations under which the same behavior could alternatively be expected or considered taboo (see §4).
There is a combinatorially explosive space of situational contexts, each with non-compositional and thus unpredictable norms. Enumerating the set of all possible constraints is intractable. To efficiently learn norms in this space, models can rely on the regularizing effects of hierarchical organization and social theory. NORMBANK introduces hierarchical organization (Objective 2) by means of a rich taxonomy over the relevant contextual signals that inform behaviors.
Our new SCENE taxonomy is the first to use Goffman's (1959) dramaturgical theory of social life. We operationalize the theory with settings that have additional environmental constraints. In each setting, there are agents with different roles and attributes, who then perform behaviors. Norms apply to behaviors in certain situations. For example, in Figure 1, norms around drinking hot coffee differ for agents with different roles (e.g., barista, customer) and attributes (e.g., adult, child), in different settings (e.g., cafe, classroom).
Having addressed the objectives above, we train neural models to expand NORMBANK with automatic knowledge completion. Experiments show promising results: these models can extrapolate social commonsense to new behaviors in new situations, leveraging similarities in analogous roles across different situations. Finally, we demonstrate how to transfer knowledge via sequential finetuning from NORMBANK to social reasoning tasks. Together, knowledge completion and transfer learning suggest that our dataset will serve as a useful resource for advancing neural models toward situationally-grounded social reasoning.

Related Work
Norm discovery is an emerging method inspired by automatic knowledge base construction (Mitchell et al., 2015; Craven et al., 2000) and by efforts to extract social knowledge from LLMs via prompting (Trinh and Le, 2018; Petroni et al., 2019; Wang et al., 2019; Sakaguchi et al., 2020). In concurrent work, Fung et al. (2022) propose NORMSAGE, which automatically discovers mandated or conventional behaviors from dialogues. Their prompts resemble our bootstrapping efforts in §3, with the added step of automatic self-verification. NORMBANK differs from NORMSAGE in that we rely on human annotation to collect more creatively non-prototypical situations that challenge and expand normative reasoning models.
Normative reasoning systems like Delphi (Jiang et al., 2021) and UNICORN (Lourie et al., 2021a) are pre-trained on existing social knowledge bases (Forbes et al., 2020; Emelin et al., 2021; Hendrycks et al., 2021; Lourie et al., 2021b; Sap et al., 2020), which contain more conventional social behaviors from narrative contexts. Until Pyatkin et al. (2022), in work concurrent to our own, normative reasoning systems had been framed as universal oracles, making forced-choice judgments about human behaviors in a vacuum (Talat et al., 2022). These models lack the capacity for defeasible reasoning (Madaan et al., 2021; Rudinger et al., 2020); oracles instead tend to assume the most prototypical contexts (Boratko et al., 2020). Many of these predictions will appear reasonable if we pragmatically infer a conventional narrative, but for systems to achieve robust social intelligence, they must account for the long tail of the distribution. We can easily find unconventional contexts in which the correct inference contained in NORMBANK is misunderstood by current models. For example, Delphi judges yelling and clenching your fists, breathing heavily, or asking someone personal questions about their sex life to all be conventionally inappropriate; NORMBANK gives acceptable contexts for each: guests riding a roller coaster, athletes running track, and doctors performing routine checkups, respectively.

SCENE: A Dramaturgical Framework
The self... is a dramatic effect arising diffusely from a SCENE.
- Erving Goffman (1959)

To help models efficiently learn non-monotonic normative reasoning over a seemingly unbounded set of possible contexts, and to test this understanding in LLMs, we will need to establish a more tractable set of elements to represent this social matrix. For this purpose, we construct a hierarchical taxonomy of constraints, which we call Situational Constraints for social Expectations, Norms, and Etiquette (SCENE for short). SCENE follows Goffman's (1959) dramaturgical model of social life. According to this model, people are like actors trying to maintain a social performance in front of an audience. Each actor performs a particular role as if in a scene from a movie. The scene is grounded in a particular setting, which includes aspects of the environment that inform the performance. Each scene also has a script (Schank and Abelson, 1977), which tells the actor what kinds of behaviors will be perceived as in-character or out-of-character. Additionally, the actor will embody socially meaningful attributes like age, gender, and status. These attributes may be relevant to the scene and the actor's place in it. In Figure 2, the example setting is a restaurant where the environment is uncrowded and the hour is night. There are two primary roles, customer and server, and for norm formation, some relevant attributes include their respective genders, sexualities, and ages, which parameterize the behaviors that are appropriate for this dinner, such as dating and drinking alcohol.
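To make the dramaturgical decomposition concrete, here is a minimal sketch of how a SCENE frame could be represented in code; the field names are shorthand for the taxonomy described in this section and are our own illustrative choices, not a released NORMBANK schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Agent:
    role: str                                                  # e.g., "customer", "server"
    attributes: Dict[str, str] = field(default_factory=dict)   # e.g., {"age bracket": "adult"}

@dataclass
class SceneFrame:
    setting: str                                                # e.g., "restaurant"
    environment: Dict[str, str] = field(default_factory=dict)   # e.g., {"time of day": "night"}
    agents: List[Agent] = field(default_factory=list)
    behaviors: List[str] = field(default_factory=list)          # e.g., ["drinking alcohol"]

# The restaurant example from Figure 2, expressed in this illustrative schema:
figure2 = SceneFrame(
    setting="restaurant",
    environment={"attendance": "not crowded", "time of day": "night"},
    agents=[Agent("customer"),
            Agent("server", {"age bracket": "adult", "gender": "male"})],
    behaviors=["drinking alcohol", "going on a date"],
)
```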

Settings
(e.g., banks, classrooms, homes, hospitals) are the loci of scripted social interactions (Schank and Abelson, 1977), and they frame all subsequent elements of NORMBANK, so we begin here with 129 distinct settings like amusement park, bus, and elevator. Settings derive from two popular knowledge resources. First, there are 80 settings from ConceptNet (Speer et al., 2017), a broad knowledge base of the words and phrases that people commonly use. There are another 255 settings from the "movie scene" label in the MovieGraphs resource (Vicol et al., 2018), a collection of social situations depicted in movie clips.

The Environment
contains signals that can trigger associative priming of social norms (e.g., the noise level of a study space; Aarts and Dijksterhuis). This portion of the taxonomy is designed to be broad and general-purpose, with a set of attributes that can refine any setting. Our taxonomy is based on a broad review of the literature on norm formation and its relevant factors (van Rijswijk and Haans, 2018; Janicik and Bartel, 2003; Boyce et al., 2000; Russell and Ward, 1982; Durkheim, 1915). Importantly, the taxonomy is further refined through crowdsourced feedback (§4). Ultimately, our taxonomy grows to contain 404 environmental constraints. An extensive overview of the environmental constraints is given in Appendix B.1, but we summarize them here.

Figure 2: An example of the SCENE dramaturgical framework used to constrain NORMBANK. The restaurant setting is specified by the attendance (not crowded) and time of day (night) in the environment. The two agent roles are customer and server; the latter is specified by the age bracket (adult) and gender (male) attributes. The agents are engaged in the behaviors drinking alcohol and going on a date. Note: graphics are for illustration only; NORMBANK is a text dataset and does not contain any images.
In the environment, there are important taxonomic subclasses of factors that inform norms. One subclass is time constraints, like seasonality (Janicik and Bartel, 2003) and holidays or other customary observances (Durkheim, 1915); another is the country of operation, which serves as a proxy for regional cultural differences (Meyer, 2014). We also include factors from environmental psychology (Bell et al., 2001) that involve the agent's comfort and ease in the environment (e.g., noise level, privacy, and cleanliness). Additionally, physical conditions include factors like weather, which impact visibility, coordination, safety, and comfort (Boyce et al., 2000; van Rijswijk and Haans, 2018; Cunningham, 1979). Beyond the imposed taxonomy, annotator feedback (§4) led us to add a subclass called restrictions, which formally limit attendance, participation, and behavior due to notions of formality, religiosity, or exclusiveness.

Roles
may be ubiquitous, but it is challenging to collect reliable, setting-specific roles with high coverage. Our solution is to use the powerful associative knowledge of LLMs to automatically enumerate roles for each setting via prompting, inspired by Trinh and Le (2018), Petroni et al. (2019), and others (Wang et al., 2019; Sakaguchi et al., 2020). Specifically, we prompt GPT-3 (Brown et al., 2020) text-davinci-002 in a zero-shot manner with the phrase "Some roles <preposition> <determiner> <setting>:" where the preposition and determiner are manually configured to match the setting (for example, "some roles at a casino:" or "some roles on the beach:"). On average, we generate 5.5 roles per setting, with a total of 928 unique roles.
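A minimal sketch of this role-elicitation step is shown below, assuming the legacy OpenAI completions client; the phrasing lookup table stands in for the manual preposition and determiner configuration described above and is purely illustrative.

```python
import openai  # assumes the legacy openai<1.0 client with an API key in OPENAI_API_KEY

# Hypothetical manual preposition/determiner choices for a few settings.
PHRASING = {"casino": "at a", "beach": "on the", "elevator": "in an"}

def enumerate_roles(setting: str) -> list:
    """Zero-shot prompt GPT-3 for roles in the given setting."""
    prompt = f"Some roles {PHRASING[setting]} {setting}:"
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=64,
        temperature=0.7,
    )
    text = response["choices"][0]["text"]
    # Split the list-style completion into individual role strings.
    return [r.strip(" -•\n").lower() for r in text.replace(",", "\n").split("\n") if r.strip()]

print(enumerate_roles("casino"))  # e.g., ['dealer', 'pit boss', 'security guard', ...]
```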

Attributes
are properties of individual agents that determine their social norms. Here again, the goal is to derive a general-purpose taxonomy from the literature. Some attributes are basic demographic categories like the person's age bracket, gender, race, religion, and sexuality (Thompson Jr and Pleck, 1986; Dempsey and De Vaus, 2004; Helgeson, 2016). Related demographic categories include education level, employment, and marital status. Since food is a focal point for culture and morality, we include diet. We also include material constraints like medical condition and social class. Finally, we increase the coverage of this set by including generic descriptors of two types: condition or state adjectives, which describe a temporary condition (e.g., dizzy), and characteristic adjectives, which describe more permanent attributes (e.g., blonde). In total, our taxonomy defines 578 attribute constraints.

Behaviors
are the primary target of analysis for social norms. As with roles, we co-opt GPT-3 to enumerate behaviors for each setting and role, but the approach here is augmented in two ways. First, we include a norm expectation in the prompt: by querying for unexpected behaviors, we can begin to shift the distribution of behaviors away from the prototypical. Second, we increase the diversity of generations by conditioning on the agent's attribute, which further reduces the number of conventional behaviors in our set. The prompt is "Some things you would (never) do <preposition> <determiner> <setting> (if you were <attribute>):" where the parenthesized elements are optional. In this way, we generated an average of 776.5 behaviors per setting, which we filtered down to 112.6 behaviors per setting via the programmatic methods described in Appendix B.2.
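The sketch below shows one way to assemble the optional elements of this prompt template; the helper function and its argument names are illustrative assumptions, not the released generation code.

```python
from typing import Optional

def behavior_prompt(setting: str, phrasing: str, negate: bool = False,
                    attribute: Optional[str] = None) -> str:
    """Build the behavior-elicitation prompt, optionally flipping the norm
    expectation ("never") and conditioning on an agent attribute."""
    never = "never " if negate else ""
    cond = f" if you were {attribute}" if attribute else ""
    return f"Some things you would {never}do {phrasing} {setting}{cond}:"

print(behavior_prompt("casino", "at a"))                          # prototypical behaviors
print(behavior_prompt("casino", "at a", negate=True))             # unexpected behaviors
print(behavior_prompt("casino", "at a", attribute="a teenager"))  # attribute-conditioned
```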

Building NORMBANK
Section 3 gave us a high-recall set of constraint variables for explaining situational social norms. Our end goal is to build a resource that contains reliable norms to ground, train, and test automatic normative reasoning systems. We want these norms to describe challenging, non-prototypical examples, and to depend on subtly contrasting situations that, when shifted, change the norm label non-monotonically. This motivates us to use human annotation over the rich SCENE taxonomy.
Our process is essentially the reverse of the current paradigm established by prior work, which starts with a basic narrative context and subsequently extracts (Fung et al., 2022) or annotates (Forbes et al., 2020) the expected behaviors. Instead, we start with behaviors and ask annotators to provide different dramaturgical contexts (SCENE constraints) under which each behavior could variously be seen as expected, okay, or unexpected. Thus we obtain richer and less prototypical instances: examples not mentioned in standard dialogue, which will significantly challenge models. The approach is inspired by contrast sets (Gardner et al., 2020) and counterfactual augmentation (Kaushik et al., 2020) as means of reducing spurious correlations in model inferences.

Annotation Task
For the annotation task, we recruit experienced English-speaking Mechanical Turk annotators who have ≥98% acceptance with ≥100 HITs and are located in the United States. The task requires human creativity over a large combinatorial space. For a given setting s and a behavior b, an annotator will tell us distinct situational contexts under which b is alternatively expected (required by duty or anticipated with high probability), okay (permitted or anticipated with moderate probability), or unexpected (forbidden, stigmatized, taboo, or otherwise anticipated with very low probability).
These expected, okay, and unexpected categories are called "norm labels." The language of expectation is useful for describing behavioral patterns (the focus of this work) rather than prescribing rules for what ought to be done, as in prior datasets (Ziems et al., 2022; Emelin et al., 2021; Lourie et al., 2021b; Forbes et al., 2020; Sap et al., 2020). Importantly, we do not impose any ethical or moral philosophy (Hendrycks et al., 2021), but instead encourage annotators to find norms that merely describe observable social life (Cialdini et al., 1991).
The annotator fully specifies the appropriate situational context by means of disjunctions and conjunctions of constraints; for example, "spit at a dentist's office" can be unexpected under one conjunction of constraints but okay under another. Annotators select SCENE constraints using drop-down menus that follow the hierarchy of §3 (for details on the HIT interface, see Appendix C.2). They are also free to insert their own custom constraints into the hierarchy. In this way, we iteratively expand the taxonomy.
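For illustration only (reusing the cafe example from Figure 1 rather than an actual NORMBANK entry), the resulting records might look like the following, where each situation is a conjunction of constraints and the same behavior carries different norm labels under different situations.

```python
# Hypothetical records in the spirit of Figure 1 (not actual NORMBANK entries).
coffee_norms = [
    {"setting": "cafe", "behavior": "drinking hot coffee", "label": "expected",
     "constraints": {"[PERSON]'s role": "customer"}},
    {"setting": "cafe", "behavior": "drinking hot coffee", "label": "unexpected",
     "constraints": {"[PERSON]'s role": "barista",
                     "[PERSON]'s concurrent behavior": "working"}},
]
```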

Dataset Quality
Quality Control. Manual inspection of over 2.5k data points reveals that the open-ended and creative aspects of the task are natural incentives for high-quality work (Chandler et al., 2015; Sheehan, 2018). To further ensure the quality of NORMBANK, we trained annotators with careful instruction, a qualification test, a staging round, personalized feedback, programmatic filtering, and finally a series of random audits (Litman et al., 2015; Sheehan, 2018). The instructions included at least 3 fully worked examples for each norm label, plus suggestions and explanations for a total of 24 constraints. We administered a six-question qualifier, which tested workers' knowledge of the taxonomy, the definitions, free-text responses, and how to properly indicate constraint conjunctions and disjunctions through the task interface. If the worker answered at least five questions correctly on the first try, they would gain access to the staging round: a small-scale version of the task in which each submission would receive detailed and personalized feedback.
We invested a significant amount of time in feedback, offering 75 to 200 words of review for each of 2,502 staging HITs. Once a worker submitted 3 high-quality HITs in the staging round, they could move to the full task. To identify poor work here, we programmatically flagged workers with extremely low variation in their annotations. Finally, we periodically performed a total of three random audits, sampling 250 annotations in each audit, to confirm the quality of the annotations. Workers were paid a base rate, plus an additional itemized bonus for every additional constraint they added, which incentivized workers to be more expressive and creative. Annotators received a median of $30 per hour for this task.

Table 1: Summary statistics show the immense scale of NORMBANK (§4) and the broad coverage of our SCENE framework (§3). There are 155k total annotated norms, comprising 70k unique situations, and each situation is drawn from a conjunction of some subset of the 408k annotated constraints. Of these annotated constraints, 94% use the structure of our SCENE taxonomy, and 69% use a pre-populated constraint value from one of our taxonomic dropdown menus.
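The low-variation filter mentioned above can be as simple as an entropy check over each worker's label distribution; the threshold in this sketch is illustrative, not the value we used.

```python
from collections import Counter
from math import log2

def low_variation(labels, threshold_bits=0.5):
    """Flag a worker whose norm-label distribution has very low entropy,
    i.e., who nearly always submits the same label. Threshold is illustrative."""
    counts = Counter(labels)
    total = sum(counts.values())
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return entropy < threshold_bits

print(low_variation(["expected"] * 48 + ["okay"] * 2))  # True: suspiciously uniform labeling
```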
Quality Metrics. The above methods proved remarkably successful in generating a creative and high-quality resource. Because our task is creative and subjective, data quality is not easily measured by inter-annotator agreement. We instead report human evaluations over the Gold NORMBANK data in the bottom row of Table 3. With regard to the correctness metric, annotator disagreements can be traced to differences in the annotators' models of the world, which likely stem from their own personal differences, including age, profession, and worldview. For example, an annotator likely familiar with the Cambodian tradition of "Pithi Srang Preah" marked that "honoring your ancestors" is normal for Cambodians on Cambodian New Year, while an annotator unfamiliar with this practice marked it as unexpected. Furthermore, we administered political-leaning and moral foundations surveys to all annotators, which we release alongside NORMBANK to help explain how these personal differences informed the probabilities they assigned to events. This resource will be of interest to computer scientists and social scientists, since NORMBANK contains not only commonsense facts, but also culturally conditioned distributions over behavior and expectations about behavior.

Dataset Summary
Summary Statistics. As Table 1 shows, annotators selected a constraint value from a pre-populated list in 69% of cases. For concurrent behavior and attribute constraints, annotators had to input their own values in 59% and 33% of cases respectively, followed by 27% and 9% of cases for the environment and roles. Overall, this indicates that our GPT-3 prompting method achieved high recall, especially for roles, and least so for behaviors, which is unsurprising given the almost unbounded space of viable human behavior. Figure 3 gives the distribution of constraints in NORMBANK. Constraints are dominated most by the agent's attributes and roles. Age, condition, and characteristic are the most popular attributes, while roles vary. There is an even split between behaviors and the environment. Within the environment, there is a notable focus on time constraints, with slightly less but more even attention to the remaining subcategories.
Links to Existing Knowledge. NORMBANK's SCENE taxonomy has close links to existing knowledge resources. ConceptNet directly seeded 80 settings in SCENE. Beyond this, we successfully link over 90% of taxonomic items from the setting, environment, roles, and attributes directly with concepts in ConceptNet. These taxonomic items cover 93.6% of all constraint categories and 70.0% of all constraint values. ConceptNet is further linked to WordNet, DBPedia, Umbel, Cyc, and Wiktionary, so by extension, NORMBANK can be coupled to these resources.
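One lightweight way to verify such links is to query the public ConceptNet web API; the lookup below is a sketch under that assumption, not the linking procedure used to build NORMBANK.

```python
import requests

def in_conceptnet(term: str) -> bool:
    """Return True if an English ConceptNet node with edges exists for the term."""
    uri = term.strip().lower().replace(" ", "_")
    resp = requests.get(f"https://api.conceptnet.io/c/en/{uri}")  # public web API
    resp.raise_for_status()
    return bool(resp.json().get("edges"))

print(in_conceptnet("restaurant"))  # True: a richly connected ConceptNet concept
```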

Experiments: How to use NORMBANK
NORMBANK is not designed for any particular narrow task; it is designed as a general-purpose knowledge resource that can ground social reasoning through downstream tasks (compare ATOMIC (Sap et al., 2019a) and ConceptNet (Liu and Singh, 2004)). Towards this end, NORMBANK should contain richly organized knowledge that can be learned by neural models and applied for non-monotonic reasoning in new settings. In this way, it should be possible to automatically expand the NORMBANK resource. The knowledge contained here should also be applicable across a range of social reasoning tasks. Thus our experiments aim to demonstrate two things: ( §5.1) that we can automatically expand NORMBANK using neural methods, and ( §5.2) that NORMBANK is a useful resource with relevant knowledge for downstream applications. For all experiments in the following subsections, we use an 80%-10%-10% train-dev-test split in which <setting, behavior> tuples in one set are never seen in another.
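A grouped split of this kind can be implemented, for example, with scikit-learn's GroupShuffleSplit over <setting, behavior> keys; the sketch below is one reasonable implementation, not necessarily the exact split code used for NORMBANK.

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_setting_behavior(rows, seed=0):
    """80/10/10 split in which a <setting, behavior> pair never crosses splits.
    `rows` is assumed to be a list of dicts with 'setting' and 'behavior' keys."""
    groups = [f"{r['setting']}||{r['behavior']}" for r in rows]
    train_idx, rest_idx = next(
        GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
        .split(rows, groups=groups))
    rest_groups = [groups[i] for i in rest_idx]
    dev_rel, test_rel = next(
        GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
        .split(rest_idx, groups=rest_groups))
    return train_idx, rest_idx[dev_rel], rest_idx[test_rel]
```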

Automatic Knowledge Completion
How can we expand NORMBANK? We considered two different methods of knowledge bank completion (Craven et al., 2000), which rely on different assumptions. Results from both methods indicate that NORMBANK is rich enough to support its own automatic expansion. Classification is the simpler case, where we assume a closed world (Lin et al., 2015b,a; Socher et al., 2013), while generation assumes an open world (Shi and Weninger, 2018) with a modifiable set of constraints.
Classification. Here, our known constraints and behaviors (§3) remain fixed (Shi and Weninger, 2018), but we can discover new relationships by classifying unseen behavior and constraint combinations as expected, okay, or unexpected. The advantage of this approach is that it is straightforward; the disadvantage is that evaluating classifiers over the power set of the entire constraint space would be intractable, so more efficient search methods will be needed.
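As a sketch of how such a classifier might consume a situation, the snippet below serializes the setting, behavior, and constraints into a single input string for a RoBERTa classifier; this serialization template is our assumption, and the models actually fine-tuned are listed in the next paragraph.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["unexpected", "okay", "expected"]
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=len(LABELS))

def serialize(setting, behavior, constraints):
    """Flatten one situation into text, e.g. 'cafe | drinking hot coffee | role is barista'."""
    cons = " ; ".join(f"{name} {rel} {value}" for _category, name, rel, value in constraints)
    return f"{setting} | {behavior} | {cons}"

text = serialize("cafe", "drinking hot coffee",
                 [("role", "[PERSON]'s role", "is", "barista")])
inputs = tokenizer(text, return_tensors="pt", truncation=True)
logits = model(**inputs).logits           # fine-tune with cross-entropy over LABELS
print(LABELS[logits.argmax(-1).item()])   # meaningful only after fine-tuning on NORMBANK
```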
For all norm classification tasks, we fine-tune three popular transformer models: BERT-base-uncased (Devlin et al., 2019), RoBERTa-base (Liu et al., 2019), and ALBERT-base-v2 (Lan et al., 2020), with hyperparameters in Appendix A. Results are given in Table 2.

Generation. At inference time, the model generates the list of constraints c_1, ..., c_k that make the norm label true, conditioned on the setting s and behavior b. For this purpose, it is sufficient to use BART (Lewis et al., 2020), GPT-2 (Radford et al., 2019), and T5 (Raffel et al., 2020), three powerful language models used widely for generative inference. We also prompted GPT-3 davinci-002 and davinci-003 in a few-shot manner via the OpenAI API (see the prompts in Appendix A). Evaluation comes from both automatic and human metrics. Humans evaluate 300 <setting, behavior> data points for each of the 11 Model × Decoding combinations, plus gold-standard examples from NORMBANK. For constraints, they report %Norm_C (the proportion of constraints that help represent a human rule or expectation for behavior) and %Relevant_C (the proportion that relate to the norm without redundancy or tautology). They also provide situation-level evaluations, where a situation is defined as the intersection of its constraints: %Correct_S (whether the situation produces an accurate norm label) and %Sensible_S (whether all constraints can be true at the same time).

Table 3 gives the generation results for the three fine-tuned models: BART, GPT-2, and T5. According to human judgment, all models produce text that successfully constrains human expectations for behavior (Norm_C ~90%). BART + nucleus sampling (p = 0.9) gives the most Sensible_S (100%) and Correct_S situations (46%) with the most Relevant_C constraints (97%). This is clearly a challenging task: situations are deemed correct only 46% of the time. Yet they closely approach the scores of human gold-standard data (55%). Notably, generated constraints are highly relevant to the norm label and entirely mutually sensible. Given the challenging nature of the task, the results are quite encouraging, suggesting that NORMBANK can facilitate its own expansion via natural language generation.
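For the generation setting, a minimal sketch with BART and nucleus sampling (p = 0.9) is given below; the prompt format that conditions on setting, behavior, and norm label is our assumption, and in practice the model would first be fine-tuned on NORMBANK.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")  # fine-tuned on NORMBANK in practice

prompt = "setting: athletic field | behavior: drinking milk | label: unexpected | constraints:"
ids = tok(prompt, return_tensors="pt").input_ids
out = bart.generate(ids, do_sample=True, top_p=0.9, max_new_tokens=64)  # nucleus sampling, p = 0.9
print(tok.decode(out[0], skip_special_tokens=True))
```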
Prompting. Results in Table 3 show that few-shot GPT-3 models fail to match our best-performing BART model's ability to generate Sensible situations (95 vs. 100%) with high %Norm (91.8 vs. 94.3%) and %Relevant constraints (87.8 vs. 97%). Still, annotators are more likely to find GPT-3 output to be Correct overall (61.1 vs. 46%). Automatic metrics show that GPT-3 achieves higher precision (28.6 vs. 23.1 BLEU) at the expense of recall (51.7 vs. 77.1 ROUGE-L), suggesting that GPT-3's generations, while often correct, may be more prototypical. Qualitative analysis confirms this.
Sometimes the conventional answer leads GPT-3 astray, as when it uses a series of faulty lexical associations to explain that 'drinking milk' is unexpected on an 'athletic field' for individuals who are not the coach and for those whose behavior is not 'hydrate' while the temperature is 'warm.' BART, however, correctly discerns that it's unexpected for athletes whose behavior is 'playing sports.' In general, GPT-3 appears more likely to underspecify the situation. For example, GPT-3 responds that it's expected for a homeowner to 'leave the gate open' in the 'backyard,' and BART agrees, but BART further specifies that the owner might be 'working outside' to justify the expectation.
Both quantitative and qualitative analyses indicate that prompting methods can certainly complement, but may not fully replace, fine-tuned generation approaches to NORMBANK expansion. A mixed approach may be most desirable due to coverage and correctness, while generation errors may be fixed using self-correction via classification (above) or further prompting (Fung et al., 2022).
Finally, the middle pane of Table 3 shows the proportion of generated constraints that fall into our taxonomy (Tax. Constr.) and the proportion contained in NORMBANK (Pre-pop. Constr.). The former shows that our taxonomy broadly captures the relevant axes (80-90% of our best models' generations are taxonomic). The latter shows that between one third and one half of generations 'link' prior constraints to new situations; the rest of generated constraints are brand new.

Transfer Learning for Downstream Tasks
Finally, we conduct transfer learning experiments to demonstrate the utility of the data for downstream applications, further indicating the scope and power of NORMBANK as a general-purpose resource for social reasoning. Concretely, we follow the sequential training paradigm (Pratt et al., 1991), which has proven better than multitask training and fine-tuning on a broad range of commonsense tasks (Lourie et al., 2021a). Specifically, we initialize a RoBERTa model with weights from our best-performing norm classifier from Section 5.1 and fine-tune on the target set for 7 epochs.
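A sketch of this sequential fine-tuning step is shown below, assuming the NORMBANK norm classifier was saved locally with save_pretrained; the checkpoint path, batch size, and learning rate are illustrative, with only the 7 epochs taken from the text.

```python
from transformers import AutoModelForMultipleChoice, Trainer, TrainingArguments

# Start from the NORMBANK norm classifier's encoder weights (hypothetical local
# checkpoint), swap in a freshly initialized multiple-choice head, then fine-tune.
model = AutoModelForMultipleChoice.from_pretrained("checkpoints/normbank-roberta")

def transfer_finetune(train_ds, dev_ds):
    """Fine-tune the transferred model on a pre-tokenized multiple-choice dataset."""
    args = TrainingArguments(output_dir="checkpoints/transfer", num_train_epochs=7,
                             per_device_train_batch_size=16, learning_rate=1e-5)
    trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=dev_ds)
    trainer.train()
    return model
```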
We evaluate on two specifically moral reasoning tasks, Anecdotes and Dilemmas, both from the SCRUPLES benchmark (Lourie et al., 2021b). We also consider two multiple-choice commonsense QA datasets. Social IQa (Sap et al., 2019b) is designed to test social intelligence (e.g., inferring motivations, emotional reactions), while CosmosQA (Huang et al., 2019) tests cause and effect and counterfactual reasoning in everyday situations.
All results in Table 4 are averaged over five separate train-test runs, and significance is given by the paired bootstrap test. NORMBANK's utility is seen by comparing the accuracy of models with transfer learning from NORMBANK against those with task-only fine-tuning (Base Model). Results show that NORMBANK significantly improves situational moral classification (Anecdotes; +0.4%) and forced-choice binary moral judgments (Dilemmas; +6.8%). Also consider NORMBANK's utility compared to transfer learning from either CosmosQA (CQA) or Social IQa (SIQA). The only task on which transfer from NORMBANK does not achieve the best performance is the CosmosQA evaluation; here, we find that transfer from the more structurally related Social IQa task is preferred. We conclude that NORMBANK is a useful resource for a range of downstream applications in moral, social, and emotional reasoning in context.

Conclusion
Social norms are the foundation of culture and society (McDonald and Crandall, 2015; Hogg and Reid, 2006), and an understanding of these norms is crucial for assistive and collaborative AI. In this work, we introduced SCENE, a new scheme for hierarchically organizing the seemingly unbounded space of situational contexts that determine social norms.
With this framework, we built NORMBANK, the first social knowledge bank to leverage such contextual information for contrast sets of richly conditioned defeasible social norms. We found that NORMBANK supports its own automatic expansion via classification, generation, and prompting methods. Finally, we demonstrated the utility of NORMBANK for situational social reasoning tasks.
Limitations

At its core, NORMBANK is a collection of logical operations over unique constraints. Consequently, one practical limitation stems from the fact that some situations cannot be reasonably expressed as a set of constraints. While theoretically all logic can be decomposed into AND and OR operations, the logic may be too challenging for an individual to formulate, or the set of constraints might be too large and unwieldy. The latter is problematic because language models have a finite input token capacity, and for the set of constraints to be digestible, they must fit within that capacity. Relatedly, if the logic used to encode constraints becomes more sophisticated, ensuring that logic is not unnecessarily duplicated will pose a greater challenge. Additionally, certain properties of NORMBANK, like roles and behaviors, may be challenging to describe succinctly. Further work will be needed to ascertain how these can be incorporated, or to more clearly define situations that are out of scope. Due to limitations on time and computational resources, we have not exhaustively evaluated all downstream applications of NORMBANK; in future work, we will test additional transfer tasks beyond the moral and social classification tasks considered here. Since NORMBANK is the first resource to encode non-monotonic situational norms, there was no other available benchmark directly analogous to ours. Instead, our primary evidence for NORMBANK's utility is in Table 3, where human evaluators confirm that models trained on NORMBANK can reliably learn to make new inferences about non-monotonic situational norms.
Other follow-up studies should consider training larger normative reasoning models and/or engineering better prompts for expanding NORMBANK. Relatedly, we have no data with which to speculate about the long-term evolution of real-world norms relative to this resource, nor about the rate of decay in NORMBANK's reliability. Future work should also expand this resource with perspectives from cultures other than those of our available annotator pool, which was not representative of all cultures and groups of people, as we discuss further in the Ethics section.

Ethics
Ethical Assumptions. First, to set proper boundaries on this resource and the tasks it can facilitate, we outline the ethical assumptions of this work and address some potential misconceptions. We want to stress that NORMBANK represents a collection of situational norms that we treat as descriptive, not prescriptive. Unlike prior moral and ethics datasets (Ziems et al., 2022; Emelin et al., 2021; Lourie et al., 2021b; Forbes et al., 2020; Sap et al., 2020), we use the neutral language of expected, okay, and unexpected behaviors to focus on empirically observed patterns and to avoid an overemphasis on the ethical grey area of what ought to be done. Unlike tricky moral dilemmas, the situational social norms in NORMBANK have an answer that a majority can agree is descriptively observable as the expectation under the respective conditions and/or cultural context. Nevertheless, normative judgments can vary between individuals in different social groups and time periods (Haidt et al., 1993; Shweder, 1990; Bicchieri, 2005; Culley and Madhavan, 2013; Amaya et al., 2021). NORMBANK can and should be expanded via automatic or manual methods that incorporate these axes of variation. Our annotator pool was limited to English-speaking individuals living in the United States in the year 2022. Future expansion efforts could be crowdsourced from other cultures and geographic regions and in future decades.
We reiterate that the norms in NORMBANK should not be used for prescriptive advice or personal guidance in any way. Our work is intended to unlock future research on imbuing language models with situational commonsense and enabling them to reason jointly over situational contexts. Language models that ignore situational contexts altogether may be just as hazardous, if not more so.
Finally, there are likely biases towards certain roles and values in NORMBANK. We have taken steps to mitigate some forms, such as gender bias, by neutralizing constraints (e.g., [PERSON]'s role is 'cowboy or cowgirl' and [PERSON]'s role is 'ball boy or girl'). Our SCENE taxonomy, with the standardized structure of its role and attribute constraints, will allow practitioners to further analyze specific axes of prejudice and thus implement targeted mitigation strategies. Specific identity attributes like gender, ethnicity, and religion are represented in 24% of norms. Stakeholders can invest a smaller but more concerted effort towards mitigating bias in these constraints, and we encourage them to give auditing control over a given norm to those who are affected by it. Previous norm datasets encode norms in free-text annotations that lack a hierarchical taxonomy of contexts; our taxonomy can be used to interpret, diagnose, and mitigate prejudice, and to return power to those affected by these prejudices.
Risks in deployment. Before starting any annotation, the resources and findings presented in this work were thoroughly reviewed and approved by an internal review board. Before being put into production, the method would also need to be re-evaluated on any new domain to ensure reliable performance and prevent unintended consequences. To help mitigate deployment risks stemming from misunderstandings of the ethical assumptions above, we require users of this data to complete a Data Use Agreement. Users must confirm that they understand the ethical assumptions above, especially that NORMBANK is not to be taken as advice. Practitioners will also agree not to use NORMBANK for malicious purposes "including (but not limited to): mockery, discrimination, and hate speech."

C.2 HIT Interface
For each HIT, the annotator is presented with a setting s ∈ S and a behavior b ∈ B that we generated for the given s. The annotator helps us describe when this behavior would be expected, then when it is merely okay, and finally when it is unexpected. Annotators describe each norm with conjunctions and disjunctions of SCENE constraints. The annotator appends each constraint to its conjunction as a 4-tuple consisting of a (1) category, (2) name, (3) relation, and (4) value (see the HIT interface screenshot in Figure 5). The category helps annotators search for constraints and organize their thoughts; it is a high-level designation of where the constraint is organized: under the environment, role, attribute, or behavior. The name, relation, and value constitute a standard semantic triple. The name designates the subject of the constraint and is a specification of the category, like the "temperature of the environment." The relation is a logical type that includes equality and inequality. The value designates the predicate of the constraint (e.g., "freezing"). Annotators can build constraint 4-tuples from drop-down menus that enumerate our hierarchical taxonomy (Section 3). Annotators can also freely edit the above fields and contribute novel constraints. Finally, annotators compose constraints into disjunctive normal form (DNF), the OR of ANDs, to describe when behaviors are expected, okay, or unexpected in a given setting.
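To make the DNF composition concrete, the sketch below checks whether a situation satisfies a norm's OR-of-ANDs constraint structure; the field layout and the restriction to equality and inequality relations mirror the description above, but the function names and example values are illustrative.

```python
def constraint_holds(constraint, situation):
    """A constraint is a (category, name, relation, value) 4-tuple; the situation
    maps constraint names to their observed values."""
    _category, name, relation, value = constraint
    observed = situation.get(name)
    if relation == "is":       # equality
        return observed == value
    if relation == "is not":   # inequality
        return observed != value
    raise ValueError(f"unknown relation: {relation}")

def norm_applies(dnf, situation):
    """`dnf` is an OR of ANDs: a list of conjunctions, each a list of 4-tuples."""
    return any(all(constraint_holds(c, situation) for c in conjunction)
               for conjunction in dnf)

# Example: 'drinking hot coffee' is unexpected for an on-duty barista (cf. Figure 1).
dnf = [[("role", "[PERSON]'s role", "is", "barista"),
        ("behavior", "[PERSON]'s concurrent behavior", "is", "working")]]
print(norm_applies(dnf, {"[PERSON]'s role": "barista",
                         "[PERSON]'s concurrent behavior": "working"}))  # True
```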