Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms

Commonsense norms are defeasible by context: reading books is usually great, but not when driving a car. While contexts can be explicitly described in language, in embodied scenarios, contexts are often provided visually. This type of visually grounded reasoning about defeasible commonsense norms is generally easy for humans, but (as we show) poses a challenge for machines, as it necessitates both visual understanding and reasoning about commonsense norms. We construct a new multimodal benchmark for studying visually grounded commonsense norms: NORMLENS. NORMLENS consists of 10K human judgments accompanied by free-form explanations covering 2K multimodal situations, and serves as a probe to address two questions: (1) to what extent can models align with average human judgment? and (2) how well can models explain their predicted judgments? We find that state-of-the-art model judgments and explanations are not well-aligned with human annotation. Additionally, we present a new approach to better align models with humans by distilling social commonsense knowledge from large language models. The data and code are released at https://seungjuhan.me/normlens.


Introduction
Reasoning about commonsense norms¹ highly depends on the context in which actions are performed (Pyatkin et al., 2022; Jin et al., 2022; Ziems et al., 2023). While the action of reading a book is generally considered positive, the action is deemed

¹ One line of the developmental moral psychology tradition argues that moral and social conventional norms present salient distinctions (Turiel, 1983). Nevertheless, recent studies point out that these two concepts are inherently interconnected without meaningful distinctions (Stich, 2018). Additionally, other recent studies identify that what counts as moral or socially acceptable is highly provincial (Levine et al., 2021). In this work, we consider a wide range of socio-moral judgments for our inclusive definition of commonsense norms.

Figure 1: Commonsense norms are dependent on their context; e.g., reading a book is generally okay but is wrong while driving a car. What if the context is given by an image? Our NORMLENS dataset is a multimodal benchmark to evaluate how well models align with human reasoning about defeasible commonsense norms, incorporating visual grounding.
to be wrong in the context of driving a car, because attention should be focused on the road. Understanding defeasible commonsense norms, i.e., norms that can be further strengthened or attenuated based on context, is crucial, and prior works (Hendrycks et al., 2021; Jiang et al., 2021; Forbes et al., 2020) have primarily focused on defeasible norms based solely on text inputs.
However, real-world scenarios often lack explicit contextual information described in language. Consider the situations depicted in Figure 1: when humans see the first image, the action of reading a book will be considered wrong. Conversely, when looking at the second image, the same action will be considered okay, as reading a book together while sitting on the couch is viewed positively. When humans make judgments, they perceive the visual scene, adjust for visual defeasible cues, and then make intuitive judgments. Going directly from visual scene to judgment is the more natural process, yet it remains largely understudied.
In this work, we study model capacity for visually grounded reasoning about defeasible commonsense norms that aligns with humans. To this end, we introduce NORMLENS, a dataset consisting of 10K human annotations about 2K multimodal situations. Our dataset covers diverse situations about defeasible commonsense norms (§2). Each situation consists of a visual context and an associated action, and five human annotators make moral judgments about the situation and provide explanations for the judgments.
To construct a truly multimodal benchmark centered around defeasible commonsense norms, we employ a data collection pipeline based on human-AI collaboration (see Figure 3). The starting point is image-description pairs sourced from existing vision-language datasets: Sherlock (Hessel et al., 2022), COCO Captions (Lin et al., 2014), and Localized Narratives (Pont-Tuset et al., 2020). Then, we utilize language models (LMs) to generate a set of multimodal situations conditioned on input descriptions such that: (1) the generated action is morally appropriate given the context provided by the input image description, and (2) in contrast, the generated action is morally inappropriate under the generated situation (§2.1). Finally, for each multimodal situation, we employ human annotation to collect moral judgments and explanations (§2.2).
An important consideration in constructing our benchmark is the subjective nature of moral judgments (Talat et al., 2022), which can lead to disagreements among individuals facing a single situation. For instance, in the last image of Figure 2, one human annotator deems it rude to read a book during a concert, while others find it okay or consider reading a book impractical during a concert. To accommodate this inherent characteristic of the moral reasoning task, we organize our benchmark by splitting the dataset into two parts (NORMLENS-HA and NORMLENS-MA) based on the degree of agreement among human annotators (§2.3).
We design two tests based on NORMLENS to study how well models' predictions align with humans in this context (§3). Given a multimodal situation, a model is asked to (1) provide a moral judgment about the situation, and (2) offer a plausible explanation for its judgment. Experimental results demonstrate that these tests are challenging even for state-of-the-art large pretrained models (§4). In particular, models struggle to account for defeasible visual contexts, and often fail to identify cases where humans agree that the action is impossible to perform.
Finally, we investigate a method for improving model agreement with human judgment without relying on additional human annotations (§5). We begin by utilizing image-description pairs once more, seeding image descriptions into the LM to generate 90K instances of actions with judgments and explanations. Then, we construct multimodal situations by combining the generated actions with the images paired with the provided descriptions. Subsequently, we fine-tune models on these generated examples and find that the fine-tuned models exhibit better alignment with humans, achieving the largest improvement of 31.5% over the non-fine-tuned counterpart on the judgment task for NORMLENS-HA.
In summary, our main contributions are:
1. NORMLENS, a new dataset/benchmark of 10K human annotations covering 2K multimodal situations about commonsense norms.
2. Two new tasks posed over the corpus: making judgments and explaining judgments.
3. Experimental results demonstrating that while these two tasks remain challenging for models, multimodal models can be improved with a newly proposed text-only distillation step.

Overview of NORMLENS
The NORMLENS dataset is a new multimodal benchmark. The purpose of the corpus is to assess models' capacity to perform visually grounded reasoning about defeasible commonsense norms. The dataset covers a wide range of multimodal situations in the real world. Each situation in the dataset is annotated by multiple human annotators with moral judgments and explanations of those judgments (as in Figure 2).
To collect NORMLENS, we employ human-AI collaboration. Given a multimodal situation, we collect human judgments, which serve as labels to measure alignment with model predictions. In early testing, we found that humans had trouble concocting diverse and interesting multimodal situations. Thus, we utilize an LM to help "brainstorm" input situations. More specifically, we (1) generate multimodal situations that follow the requirement using AI models (§2.1), especially considering the defeasibility of commonsense norms, and (2) employ human annotators to collect actual human judgments and explanations about the generated multimodal situations (§2.2). A detailed analysis of the dataset is provided in §2.3. Our data pipeline is illustrated in Figure 3.

Generating Multimodal Situations about Defeasible Commonsense Norms with AI
To sample situations that manifest multimodally defeasible commonsense norms, we define a requirement: a generated situation should consist of an action that is generally considered "okay" in itself, but wrong in the given context (e.g., the action is "reading a book" and the context is "driving a car"). This stage consists of four steps: (1) generating text-form situations (D → S_0^T), (2) gradually filtering out situations that do not meet the requirement (S_0^T → S_1^T → S_2^T), (3) retrieving images to convert text-form situations into multimodal situations (S_2^T → S_0^M), and (4) running a diversity filter (S_0^M → S_1^M). Details about prompts and filters are in Appendix B. We use ChatGPT (GPT-3.5-turbo) as the LM for the data-generation pipeline.
Generating Text-form Situations with the LM. To initiate, we randomly sample 15K image descriptions D = {d_0, ..., d_{N-1}} (not the images) from existing vision-language datasets. We concatenate three datasets as a source to promote diversity: Sherlock (Hessel et al., 2022), Localized Narratives (Pont-Tuset et al., 2020), and COCO Captions (Lin et al., 2014). These datasets are characterized by different design principles: for image descriptions, Sherlock provides inferences, Localized Narratives offers fine-grained details, and COCO Captions presents representative captions for the given images. By feeding D to the LM, we generate text-form situations. Given the image description d_i, the LM is prompted with d_i to generate an action-context pair (a_i, c_i^T) under the following instruction: the generated action a_i should be morally okay given the image description d_i, but morally wrong given the generated context c_i^T. For example, when d_i is "two people sitting together on a sofa", a possible a_i is "reading a book" and c_i^T is "driving a car". After generation, we have S_0^T = {(a_0, c_0^T), ..., (a_{M-1}, c_{M-1}^T)}. Note that we generate three action-context pairs per image description, so M = 3N.
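As a concrete sketch of this generation step, the per-description prompt could be assembled as follows. The wording below is an illustrative assumption; the paper's actual templates are in Appendix B (Tables 5-7).

```python
def build_generation_prompt(description: str, n_pairs: int = 3) -> str:
    """Assemble a situation-generation prompt for one image description.

    The wording is hypothetical (the real templates are in Appendix B).
    Three (action, context) pairs are requested per description,
    matching M = 3N in the text.
    """
    return (
        f"Image description: {description}\n\n"
        f"Generate {n_pairs} (action, context) pairs such that each action "
        "is morally okay in the described scene but morally wrong in the "
        "generated context. Example: action 'reading a book', "
        "context 'driving a car'."
    )
```

Each prompt would then be sent to GPT-3.5-turbo, and the returned pairs parsed into S_0^T.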
Sequential Filtration with the LM. The LM-generated actions are error-prone: while we instruct the LM to generate an action a_i that is not morally acceptable in the generated context c_i^T, the LM frequently generates actions that are okay, or not possible to perform, in c_i^T; Madaan et al. (2023) and Shinn et al. (2023) also observe that LMs sometimes fail to follow complex instructions.
Inspired by the success of iterative refinement with simpler instructions, we apply two automatic sequential filters using the LM. The first filter (implemented with a prompt) attempts to remove impossible actions: for example, if the generated action is "follow the traffic rules" and the generated context is "a group of people running in a park", then this situation should be filtered because there are no traffic rules for runners in a park. The second filter (also implemented with a prompt) aims to remove examples from S_1^T in which the LM predicts that the generated action a_i is morally appropriate to perform in the generated context c_i^T. After filtration, we obtain S_2^T, the set of situations that pass both filters.
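The two-stage filtration can be sketched as a simple chain, with the prompt-based LM filters abstracted as boolean predicates (the predicates here are stand-ins, not the paper's prompts):

```python
def sequential_filter(situations, is_possible, is_wrong):
    """Apply the two LM filters in order.

    situations: list of (action, context) pairs (S_0^T).
    is_possible(a, c): filter 1 -- keep only actions that can physically
        be performed in the context.
    is_wrong(a, c): filter 2 -- keep only actions judged morally wrong
        in the context (the dataset requirement).
    Returns the surviving pairs (S_2^T).
    """
    s1 = [(a, c) for a, c in situations if is_possible(a, c)]  # S_1^T
    s2 = [(a, c) for a, c in s1 if is_wrong(a, c)]             # S_2^T
    return s2
```

In the actual pipeline, both predicates would be implemented by prompting the LM per pair.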
Creating Multimodal Situations by Image Retrieval. We create multimodal situations S_0^M from S_2^T. We construct a FAISS index (Johnson et al., 2019) over 1.4M image descriptions {d_1, ..., d_M} (a superset of D from the first step) by using the LM to turn image descriptions into LM-based text embeddings. Then, we use the generated text-form context c_i^T as a query to find the most similar image description d_l in the index and obtain the corresponding image x_l. Finally, we yield 18K multimodal situations S_0^M = {(a_0, x_0), ..., (a_{L-1}, x_{L-1})}.

Diversity Filtration. We observe that certain keywords, like funeral and hospital, come up frequently in the contexts in S_0^M. To enrich the diversity of the contexts, we set up a list of specific keywords and filter out an example if the language description d of its image x includes one of these keywords. We keep the occurrence of each keyword in the contexts under 30.
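The retrieval step above can be sketched as nearest-neighbor search over description embeddings. The paper serves its 1.4M descriptions from a FAISS index; the brute-force NumPy version below is an illustrative stand-in that is only practical at small scale:

```python
import numpy as np

def retrieve_image(query_emb, desc_embs, images):
    """Return the image whose description embedding best matches the query.

    query_emb: (d,) embedding of the generated context c_i^T.
    desc_embs: (M, d) matrix of description embeddings; at the paper's
        1.4M scale this lookup would be served by a FAISS index instead.
    images: list of M images aligned with the rows of desc_embs.
    """
    scores = desc_embs @ query_emb  # inner-product similarity
    return images[int(np.argmax(scores))]
```

A FAISS `IndexFlatIP` over the same embeddings would return the same top-1 result, just much faster at scale.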

Collecting Annotations from Humans
After the first stage, we randomly sample 2.2K instances from S_1^M and ask human workers to provide annotations. Further details concerning the human annotation process, including the annotation interface, can be found in Appendix C.

Making Judgments and Explaining Judgments.
Our procedure instructs human annotators to make a judgment, denoted y_i, for a given multimodal situation (a_i, x_i). They are provided with three options: the action is (1) morally inappropriate, (2) morally appropriate, or (3) not physically possible to perform. We also request that the annotators explain their judgments in free-form text e_i. To account for the subjectivity inherent in moral judgments, each situation is annotated by five different people.
Validation. After the previous annotation step, we exclude annotations with implausible explanations via an additional validation step. For example, consider the first situation in Figure 2. If someone labeled the situation as Okay with the explanation "It is morally okay to read a book, because reading a book is always great", then this annotation should be excluded because the explanation does not make sense. Each annotation (y_i, e_i) for the situation (x_i, a_i) is provided to one worker, and workers are asked to review the explanations for the judgments. After reviewing, they mark each annotation as either I agree or I do not agree. Only annotations marked as I agree are retained.

Dataset Analysis
The result of our data pipeline is 2.2K multimodal situations (image-action pairs), each paired with multiple moral judgments and explanations.

Disagreement Among Annotators. We observe that for approximately half of the situations, there is a divergence in the judgments offered by different annotators (as in the third and fourth examples in Figure 2). This discrepancy is induced by the inherent variability of moral reasoning, in which commonsense norms can be influenced by cultural differences and diverse perspectives.
We take this inherent subjectivity into account by splitting the dataset into two subparts: NORMLENS-HA (HA = High Agreement) and NORMLENS-MA (MA = Medium Agreement). In NORMLENS-HA, there is unanimous consensus among all annotators regarding the moral judgment for a situation, as in the first and second situations in Figure 2. In NORMLENS-MA, two out of the three options are chosen by annotators, e.g., one annotator chooses Wrong and the other four annotators choose Okay, as in the third situation in Figure 2. We note that in 10% (230) of instances, the human annotations show that all three judgments are possible (e.g., the last situation in Figure 2). We have excluded these instances from the evaluation, but they will still be made available, as they can serve as a potentially valuable resource for further exploration.
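The split rule reduces to counting the distinct judgment labels among the (up to) five annotators. A minimal sketch (the label strings are illustrative):

```python
def split_by_agreement(judgments):
    """Assign a situation to a subset by annotator agreement.

    judgments: list of per-annotator labels, each one of
        "Wrong", "Okay", or "Impossible" (names here are illustrative).
    Returns "HA" for unanimous consensus, "MA" when exactly two distinct
    labels appear, and "ALL" when all three appear (the ~10% of instances
    excluded from evaluation but still released).
    """
    return {1: "HA", 2: "MA"}.get(len(set(judgments)), "ALL")
```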
Weakness of the LM for Creating Situations. We find our human annotation stage to be necessary for constructing a benchmark about commonsense norms. As shown in Table 1, more than 70% of the situations are judged by humans as okay or impossible. Considering that we only run annotations on situations that the system determined to be morally wrong, this suggests that machine-generated judgments are frequently misaligned with human judgments. In other words, it is not possible to construct a high-quality benchmark about commonsense norms without human annotations.

Task Overview
We conduct two tests based on NORMLENS to examine the extent to which models' predictions align with humans on the visually grounded reasoning task regarding defeasible commonsense norms.
Making Judgments. The first test requires models to provide a moral judgment about a given multimodal situation, to investigate how well the models align with human judgments. Given an action a_i and an image x_i, the model returns a judgment ŷ_i. There is a corresponding set of human judgments, denoted Y_i = {y_i^0, ..., y_i^{n-1}}, where n (≤ 5) varies. There are three possible judgments, Wrong (Wr.), Okay (Ok.), and Action is Impossible (Im.), i.e., ŷ_i and y_i^k must be in {Wr., Ok., Im.}. To measure the degree of alignment, we use precision as a metric: the model is considered in alignment with human judgments if one of the y_i^k ∈ Y_i is equal to ŷ_i.

Explaining Judgments. We further require models to provide explanations for their judgments, since moral judgments are subjective and the underlying rationale of a judgment is therefore crucial. Assume that the model returns a judgment ŷ_i for a given situation and generates an explanation ê_i for ŷ_i. We assess how well the generated explanation ê_i aligns with humans' explanations of their judgments. Inspired by Min et al. (2020), we use an explanation score E_i formulated as E_i = max_j δ(ŷ_i, y_i^j) · f(ê_i, e_i^j), where δ(ŷ_i, y_i^j) = 1 if ŷ_i is the same as y_i^j and 0 otherwise, and f(ê_i, e_i^j) is a similarity score between the generated explanation and the human's explanation. For the similarity score f, we consider BLEU-2 (Papineni et al., 2002), ROUGE-L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005). As NORMLENS-MA may contain varying numbers of explanations per label, we assess models on the explanation task using NORMLENS-HA only.

LM. We test text-only LMs as baselines; without visual inputs, the LMs make judgments only with actions. We do not test the LMs on explanation generation, since our human explanations strongly depend on the visual inputs and are not directly comparable to explanations based only on the action.
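The explanation score above can be sketched as follows, with a toy token-overlap F1 standing in for BLEU-2/ROUGE-L/METEOR; taking the max over annotators is an assumption of this sketch:

```python
def token_f1(a: str, b: str) -> float:
    """Toy similarity standing in for BLEU-2 / ROUGE-L / METEOR."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    p, r = len(sa & sb) / len(sa), len(sa & sb) / len(sb)
    return 2 * p * r / (p + r) if p + r else 0.0

def explanation_score(pred_judg, pred_expl, human_judgs, human_expls,
                      sim=token_f1):
    """E_i: similarity to a human explanation, gated on judgment agreement.

    delta(y_hat, y^j) is 1 only when the model's judgment matches
    annotator j's, so explanations are only credited against annotators
    who gave the same judgment.
    """
    return max(
        (1.0 if pred_judg == y else 0.0) * sim(pred_expl, e)
        for y, e in zip(human_judgs, human_expls)
    )
```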
Socratic Model (SM). SM (Zeng et al., 2022) works in a two-stage framework: the first stage transforms the visual inputs into intermediate text descriptions using a vision-language model (VLM), and the second stage applies reasoning over the descriptions using the LM. To implement SMs, we use the same set of LMs as described above and use BLIP-2 Flan-12B (Li et al., 2023) as the VLM.
VLM. Different from SMs, here we include baselines that directly output the judgments from VLMs without an external reasoning stage. We cover the state-of-the-art pretrained VLMs LLaVA (Liu et al., 2023), BLIP-2 (Li et al., 2023), and InstructBLIP (Dai et al., 2023).

Results
Metrics. We report scores averaged classwise: we first compute the average score per class and then obtain the final score by averaging the class-level scores uniformly. We employ this macro average to counteract the class imbalance (Hong et al., 2021) in NORMLENS.
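The macro-averaged alignment precision can be sketched as follows, where a prediction counts as aligned if it matches any human judgment for that situation (per §3) and examples are grouped by class for the uniform class-level average:

```python
from collections import defaultdict

def classwise_macro(examples):
    """Macro-averaged alignment precision.

    examples: list of (gold_class, human_labels, model_pred) triples,
    where gold_class is the class the example is counted under and
    human_labels is the set of annotator judgments Y_i.
    A prediction is aligned if it matches ANY label in Y_i; per-class
    averages are then averaged uniformly to counter class imbalance.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for cls, labels, pred in examples:
        totals[cls] += 1
        hits[cls] += int(pred in labels)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```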
Making Judgments. We share three notable findings from our results on the judgment task (Table 2a).

Explaining Judgments. As shown in Table 2b, the SM built on GPT-4 achieves the best explanation scores among the baselines on NORMLENS-HA, establishing a strong baseline for the task. As in the judgment task, we attribute this strong performance of GPT-4 to its formidable reasoning capability (Bubeck et al., 2023). The score gaps between the SM using GPT-4 and the other baselines are also significant. We believe these gaps indicate that VLMs require stronger reasoning capability to perform well on NORMLENS.
Error Analysis on Making Judgments. To investigate the difficulties encountered by models when making judgments, we provide classwise precision scores on NORMLENS-HA in Table 3 (full break-down results are in Appendix E). Overall, except for SMs with stronger LMs (ChatGPT/GPT-4), models show low judgment scores on the Wrong and Impossible classes. On the other hand, the SM with GPT-4 shows impressive scores across all three classes, particularly excelling in the Impossible class compared to baselines, resulting in the highest overall score. Interestingly, the SM with ChatGPT achieves the highest score on the Wrong class (71.1%). We suspect that this might be attributed to the data pipeline using ChatGPT, which is employed to collect multimodal situations that are likely to be morally wrong based on the judgments of ChatGPT.
This raises an interesting question: given that ChatGPT is employed in our data pipeline, why does the SM with ChatGPT exhibit only 71.1% on the Wrong class, rather than nearing 100%? We suspect that this is due to errors in BLIP-2's predictions. The key distinction between ChatGPT in the data pipeline and the SM with ChatGPT at test time is the availability of precise image descriptions. To explore this further, with the SM built on ChatGPT, we rerun the judgment task using ground-truth image descriptions as inputs instead of BLIP-2 predictions. The model shows a higher score on the Wrong class (80.2% vs. 71.1%), but lower scores on the other classes (Okay: 59.7% vs. 67.7%; Impossible: 42.1% vs. 52.9%). This result suggests that visual reasoning capability is crucial for SMs, as the scores are highly affected by visual grounding.
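The reported ablation numbers can be tabulated directly; note that although ground-truth descriptions lift the Wrong class, the uniform (macro) average over the three classes actually drops:

```python
# Classwise precision (%) of the SM built on ChatGPT, from the error analysis.
blip2   = {"Wrong": 71.1, "Okay": 67.7, "Impossible": 52.9}  # BLIP-2 captions
gt_desc = {"Wrong": 80.2, "Okay": 59.7, "Impossible": 42.1}  # ground-truth descriptions

# Per-class change from switching to ground-truth descriptions.
delta = {k: round(gt_desc[k] - blip2[k], 1) for k in blip2}

def macro(scores):
    """Uniform average over the three classes (the paper's macro metric)."""
    return sum(scores.values()) / len(scores)
```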

Better Aligning Models with Humans
Our findings indicate that most SMs and VLMs face challenges with visually grounded reasoning about defeasible commonsense norms. Here, we explore an efficient solution that can enhance both SMs and VLMs for better alignment with human values. Drawing inspiration from recent works that distill knowledge from LMs (West et al., 2022; Wang et al., 2022; Kim et al., 2022), we propose using text-only LMs to automatically build annotations for our multimodal problem.
We use the LM (ChatGPT) to generate 90K examples of multimodal situations, including moral judgments and explanations. In particular, we begin by randomly sampling 30K image descriptions from image-text datasets (the same datasets as in §2.1). Then, we prompt the LM with each image description to generate three different actions that are: (1) morally wrong, (2) morally okay, and (3) unrelated to the context. Finally, these generated actions are combined with the images associated with the provided image descriptions, resulting in the construction of multimodal situations. These instances are split into train and validation sets with an 8:1 ratio, and the validation set is used for the hyperparameter search.
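This distillation step can be sketched as prompt construction plus an 8:1 split. The prompt wording below is an assumption; the actual template is in Table 8.

```python
import random

def make_distillation_prompt(description: str) -> str:
    """Hypothetical wording; the paper's actual template is in Table 8."""
    return (
        f"Image description: {description}\n"
        "Generate three actions for this scene: one morally wrong, one "
        "morally okay, and one unrelated to the context. For each, give "
        "a judgment and a short free-form explanation."
    )

def train_val_split(items, ratio=8, seed=0):
    """Shuffle and split items into train/validation at a ratio:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * ratio // (ratio + 1)
    return items[:cut], items[cut:]
```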
There are significant distinctions between the data pipeline discussed in §2 and the generation process described here. First, the data pipeline involves the collection of human annotations. Second, the data pipeline places emphasis on defeasibility, employing specific instructions for the LM to generate examples, which are then subjected to multiple filtration steps.
Results. Automatic training-data generation offers an accessible alternative to expensive human annotation. We fine-tune the SMs (only the LM parts) and the VLMs to predict judgments and explanations for the generated situations. As shown in Table 4a, the machine-generated data improves alignment scores in most cases. In particular, scores on the Wrong and Impossible classes are improved across the board, as depicted in Table 4b.

Table 4: Alignment scores of fine-tuned SMs and VLMs on NORMLENS-HA. The number after + denotes the amount of increase in score from fine-tuning.
Still, there is a decrease in scores for the Okay class, indicating that the machine-generated data induces more conservative model decisions. More details are provided in Appendix F.

Related Works
Visually Grounded Reasoning. Various tasks have emerged in the field of visually grounded reasoning, including commonsense reasoning (Zellers et al., 2019; Park et al., 2020) and abductive reasoning (Hessel et al., 2022). With the advent of LMs with powerful reasoning capabilities (Chiang et al., 2023; OpenAI, 2023), methods have been proposed that harness the general reasoning capabilities of LMs in visually grounded reasoning settings (Wu et al., 2023; Chase, 2022). For example, Socratic Models (Zeng et al., 2022) turn visual contexts into language descriptions and take these descriptions as input for LMs to perform reasoning. In contrast, there exist vision-language models that process visual information and directly perform reasoning (Li et al., 2023; Dai et al., 2023; Liu et al., 2023; Han et al., 2023). Despite their general visually grounded reasoning capability and potent applications, their reasoning abilities about commonsense norms are not yet explored. Rather than these forms of atomic grounding in certain categories, in NORMLENS we provide free-text contextualizations, and we also add supporting commonsense rationales that justify how each piece of context alters the morality of the action.

Conclusion
We introduce NORMLENS, a new dataset of visually grounded commonsense norms. Based on NORMLENS, we design two tests to measure how well models align with humans on visually grounded reasoning tasks about commonsense norms. These tests demonstrate that even state-of-the-art large pretrained models cannot easily make predictions that match human judgments. We encourage further exploration of models' abilities to ground on visual contexts when reasoning about defeasible commonsense norms.

Limitations
NORMLENS is manually annotated by English-speaking workers who reside in Canada, the UK, and the US. Therefore, it may not cover all commonsense norms arising from different sociocultural backgrounds or diverse perspectives. Furthermore, our experiments focus on aligning with averaged crowd judgments, and averaging can mask valid minority perspectives. While we consider high- and medium-agreement subsets explicitly as a step to account for this, future work would be well suited to explicitly model annotator disagreements. We hope to extend the coverage of commonsense norms to more diverse backgrounds and perspectives. Moreover, we plan to scale the dataset to cover a broader range of situations, which will help models better align with humans from an ethical perspective.

A Visualizing Contents in Dataset
We investigate the types of situations covered by NORMLENS, following the studies of Wang et al. (2022) and Jiang et al. (2021). Figure 5 shows that NORMLENS covers diverse situations, as evidenced by the wide range of topics related to people and daily life.
We extract verb-noun structures using the Berkeley Neural Parser (Kitaev and Klein, 2018) to plot these diagrams. Generated actions, in general, tend to be morally neutral. In Figure 5a, we plot the top 20 verbs along with their corresponding direct objects that fall within the top 5 and appear three or more times. The judgment of specific actions, such as "take photo", "feed elephant", "give speech", and "use laptop", relies on the contextual circumstances in which these actions take place. Training a model with actions that are inappropriate regardless of context, such as "steal the purse", induces the model to impose a strong prior on the language without considering the context depicted by images (Kiela et al., 2020). To promote effective integration of information from both the image-indicated situation and the provided text action, we rely on context-dependent judgments by using inherently neutral actions.
When visualizing image descriptions, we concentrate on nouns rather than verb-noun structures. We follow this strategy because the nouns in captions carry most of the information pertaining to the description of the image. As a result, we find 1,011 unique nouns. In Figure 5b, we plot the top 30 nouns appearing in the captions. This implies that the visual contexts in NORMLENS capture a multitude of contextual elements, presenting a wide array of diverse situations.

B Generating Multimodal Situations about Defeasible Commonsense Norms
We employ ChatGPT (GPT-3.5-turbo) to generate textual situations and to filter the generated situations, as described in §2.1. Throughout our data pipeline, we use temperature sampling with a temperature of 0.1 and a top-p of 0.95, and set both the frequency and presence penalty to 0. The prompt templates used for situation generation and filtering are described in Table 5, Table 6, and Table 7. For diversity filtration, we set the specific keywords to "funeral", "library", "hospital", "construction", "courtroom", and "historical".

C Collecting Annotations from Human
We utilize Amazon Mechanical Turk (MTurk) to recruit workers for the annotation tasks. To recruit qualified human annotators on MTurk, we establish qualification tasks. To guarantee fair compensation, we pay human annotators an hourly wage of $15 for their contributions. Figure 6 and Figure 7 depict the interfaces used for collecting human annotations.

D Prompt Templates for Large Pretrained Models
We provide the prompt templates used to perform reasoning with large pretrained models in Table 9, Table 10, and Table 11.

E Full Break-down of Evaluation Results
We provide a full break-down of alignment scores, giving detailed results for §4.2. As we already provide results for the judgment task on NORMLENS-HA, we further provide results for the judgment task on NORMLENS-MA (Table 12, Table 13, and Table 14) and the explanation task on NORMLENS-HA (Table 15 and Table 16).
F Enhancing Large Pretrained Models
Generating Multimodal Situations. For situation generation, we employ the prompt illustrated in Table 8. To encourage diversity, we utilize temperature sampling with a temperature of 0.7, set the top-p value to 0.95, and assign 0 to both the frequency and presence penalty.
Fine-tuning Details. We fine-tune large pretrained models on the generated examples to enhance them. To fine-tune the VLMs, we adhere to the fine-tuning specifications outlined by Liu et al. (2023) for LLaVA and Dai et al. (2023) for InstructBLIP. We train both models for one epoch. We use an initial learning rate of 2e-5 with a batch size of 32 to train LLaVA, and an initial learning rate of 1e-5 with a batch size of 16 to train InstructBLIP.
When fine-tuning SMs, we solely fine-tune the language model component. For fine-tuning the SM based on Vicuna-13B, we follow the fine-tuning details presented by Chiang et al. (2023), while for fine-tuning GPT-3 Curie and Davinci, we utilize the OpenAI fine-tuning API. In particular, when fine-tuning Vicuna-13B, we use a learning rate of 2e-5 with one epoch of training and a batch size of 256 (with gradient accumulation steps of 8).

Figure 2: The NORMLENS dataset comprises 10K human annotations pertaining to 2K multimodal situations. Each multimodal situation consists of a visual context along with an associated action. For each situation, five human annotators have provided moral judgments and explanations for their judgments. The first and second situations are included in NORMLENS-HA, as there is unanimous consensus among all human annotators. The third situation is included in NORMLENS-MA, as two out of the three options (Wrong and Okay) are chosen by human annotators.

Figure 3: An overview of the NORMLENS data collection pipeline. Human-AI collaboration is employed to effectively collect multimodal situations about defeasible commonsense norms. We first generate multimodal situations using the LM (Steps 1-4, §2.1), then collect judgments and explanations from human annotators (Step 5, §2.2).
Figure 4: Examples of predictions (judgment and explanation) made by models on NORMLENS. For the action "climb the tower" grounded in an image of the Eiffel Tower, one model judges the action morally okay ("It is possible to climb the tower."); GPT-4 (SM) judges the action not possible ("It is not physically possible for an individual to climb the Eiffel Tower without proper equipment and authorization."); and the human annotation judges it morally wrong ("It is dangerous, and a form of trespassing, to try to climb this monument.").
Norms. Jiang et al. (2021) present Delphi, a commonsense moral reasoning model trained to present a descriptive view of ethical judgments. In ClarifyDelphi, Pyatkin et al. (2022) work towards contextualizing moral reasoning, producing a system that asks clarification questions to elicit the context surrounding a judgment. In contrast, our work directly generates contextualizations to strengthen or attenuate the morality of an action without asking specific questions. Jin et al. (2022) propose MoralExceptQA, a task aimed at assessing the acceptability of violating established moral rules in different situations. With NormBank, Ziems et al. (2023) introduce a framework for grounded reasoning about situational norms, adding auxiliary information such as environmental conditions and agent characteristics.

Figure 5: Visualization of actions and image descriptions included in NORMLENS.

Figure 6 and Figure 7: Interfaces for collecting human annotations from MTurk.

Table 1: Statistics of the NORMLENS dataset. Each instance consists of multiple moral judgments with explanations regarding a multimodal situation; Avg. #Judgments denotes the average number of annotations per situation.

Table 2: Alignment scores (macro average) of models on NORMLENS. Random guesses the judgment randomly, and Majority Vote always selects the most frequent class (i.e., Im. for NORMLENS-HA). We provide four in-context examples as additional inputs for all baselines.

Table 3: Classwise precision of models on NORMLENS-HA for the judgment task.