ClarifyDelphi: Reinforced Clarification Questions with Defeasibility Rewards for Social and Moral Situations

Context is everything, even in commonsense moral reasoning. Changing contexts can flip the moral judgment of an action: lying to a friend is wrong in general, but may be morally acceptable if it is intended to protect their life. We present CLARIFYDELPHI, an interactive system that learns to ask clarification questions (e.g., "why did you lie to your friend?") in order to elicit additional salient contexts of a social or moral situation. We posit that questions whose potential answers lead to diverging moral judgments are the most informative. Thus, we propose a reinforcement learning framework with a defeasibility reward that aims to maximize the divergence between moral judgments of hypothetical answers to a question. Human evaluation demonstrates that our system generates more relevant, informative and defeasible questions compared to competitive baselines. Our work is ultimately inspired by studies in cognitive science that have investigated the flexibility in moral cognition (i.e., the diverse contexts in which moral rules can be bent), and we hope that research in this direction can assist both cognitive and computational investigations of moral judgments.


Introduction
Commonsense moral reasoning of social situations and actions depends squarely on their context. Offering someone a cup of coffee is generally considered appropriate. If offered to a work colleague, it may even be viewed as a courteous gesture. However, offering coffee to a toddler would be deemed morally irresponsible.
Delphi (Jiang et al., 2022), a recently proposed commonsense moral reasoning model, generates moral judgments for simple actions described in text. However, Delphi's judgments are made in isolation, without any knowledge of surrounding context. Grounding moral reasoning in context is crucial (Talat et al., 2022). How can moral reasoners elicit missing salient context? A natural way to do so is by asking clarification questions.

Figure 1: The CLARIFYDELPHI question generation approach is trained via reinforcement learning. The reward simulates a set of possible (defeasible) answers to the questions and, using Delphi for feedback, optimizes for questions leading to maximally diverging answers.
We present CLARIFYDELPHI, an interactive system that learns to ask questions to elicit salient context. Prior research in cognitive science shows that human reasoning exhibits the flexibility not only to articulate where a certain moral rule should hold, but also to imagine valid exceptions where the rule can be bent or defeated based on the demands of the context (Kwon et al., 2022; Levine et al., 2020; Awad et al., 2022).
We present a first step toward computationally exploring and discovering these defeasible contexts which can potentially flip the moral judgement of a situation. Given a situation and its default judgment (e.g., it is nice to offer a cup of coffee to someone), defeasible contexts can strengthen (e.g., offering it to a colleague) or weaken (e.g., giving it to a toddler) the judgment (Rudinger et al., 2020; Madaan et al., 2021; Allaway et al., 2022). We aim to generate questions whose answers might uncover missing context for making better-informed moral judgments, and we propose to do so in a conversational setting between a user and CLARIFYDELPHI.

Figure 2: Interaction between a user and CLARIFYDELPHI. The user inputs a situation and CLARIFYDELPHI answers with an initial judgement (obtained from DELPHI) and a clarification question, which the user then answers.
Our method for clarification question generation is based on reinforcement learning. Using Proximal Policy Optimization (PPO; Schulman et al., 2017; Ouyang et al., 2022) we optimize for generating questions that elicit responses providing morally salient contexts. CLARIFYDELPHI "imagines" answers to a generated question, using a trained answer generation model. A reward is calculated by comparing the probability distributions Delphi assigns to the imagined answers. Fig. 1 provides an overview of CLARIFYDELPHI.
The intuition behind our approach is that questions leading to maximally divergent answers (e.g., "Who did you offer it to?") are also those that elicit the most morally salient contexts, and are therefore more consequential to the situation. These morally consequential questions surface latent ambiguities that may directly affect the moral decision process. Questions with little divergence in their imagined answers (e.g., "When did you offer it?") have little to offer in terms of resolving contextual moral ambiguities.
Our results show that our approach outperforms other strong clarification question generation baselines: its generated questions lead to consequential answers. We additionally quantify how much supervised clarification question training data is needed for a good initial policy. Lastly, we show that questions help with generating defeasible updates.
Our contributions are as follows. We introduce the task of clarification question generation for social and moral situations. For this task we propose an RL-based approach, defining defeasibility as a new type of relevance for clarification questions. We publicly release δ-CLARIFY, a dataset of 33k crowdsourced clarification questions, and δ-CLARIFY-silver, containing generated questions conditioned on a defeasible inference dataset. We also release trained models and their code.¹

Problem Setup
Given a situation, such as lie to my friend, we aim to generate question(s) that are the most relevant for uncovering the most consequential context with respect to making a social or moral judgement. While situations can evoke a multitude of potential questions, this work is concerned with predicting questions whose answers are likely to be consequential, i.e., answers that could function as either weakeners or strengtheners of the default judgement. The terms weakener and strengthener come from the concept of defeasible inference (Rudinger et al., 2020), which defines a way of reasoning that takes into consideration (new) evidence which could either support (strengthen) or cancel/weaken an initial inference.
Formally, the task is to predict a question q given a base situation s. The base situation has a default moral judgement j ∈ {bad, ok, good}. For every input tuple (s_i, q_i, j_i) there is a hypothetical set of strengthening answers A_S and weakening answers A_W. Adding the additional information obtained from any q_i and a corresponding answer a_i to the base situation s_i results in an updated situation s_{u_i}, with an updated judgement j_{u_i}.

Algorithm 1 Training CLARIFYDELPHI
Input: initial policy model θ_0, initial value model φ_0, Delphi ψ_Delphi
    D_δ-CLARIFY ← get dataset of clarification questions
    θ_Q ← fine-tune θ_0 with Eqn 1 on D_δ-CLARIFY (Sec. 3.1)
    D_δ-CLARIFY-silver ← get silver dataset of defeasible answers to questions
    for step = 1, 2, . . . , s do
        calculate r using θ_A and ψ_Delphi with Eqn 3
        compute loss_PPO on the minibatch with Eqn 6
        optimize θ and φ with L_PPO for one step
        θ_Q_old ← θ_Q, φ_old ← φ
    return θ_Q
Output: θ_CLARIFYDELPHI
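To make the control flow concrete, here is a minimal Python sketch of Algorithm 1's RL loop. All helper names (`sample_situations`, `generate_question`, `simulate_answers`, `divergence_reward`, `ppo_step`, `normalize`) are hypothetical stand-ins for the components defined in the following subsections, not the released implementation.

```python
def train_clarifydelphi(theta_Q, theta_A, phi, delphi, train_data, num_steps):
    """Sketch of Algorithm 1; helper functions are hypothetical stand-ins."""
    for step in range(num_steps):
        situations = sample_situations(train_data, batch_size=64)
        # Roll out the current policy theta_Q: one question per situation.
        questions = [generate_question(theta_Q, s) for s in situations]
        rewards = []
        for s, q in zip(situations, questions):
            # theta_A imagines a weakener and a strengthener answer (Sec. 3.2),
            a_w, a_s = simulate_answers(theta_A, s, q)
            # and the reward is the divergence between Delphi's judgments
            # of the two updated situations (Sec. 3.3).
            rewards.append(divergence_reward(delphi, s, q, a_w, a_s))
        # One PPO update of the policy and the value model (Sec. 3.4).
        ppo_step(theta_Q, phi, situations, questions, normalize(rewards))
    return theta_Q
```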

CLARIFYDELPHI: A Reinforced Clarification Question Generator
The CLARIFYDELPHI approach is based on reinforcement learning. Algorithm 1 gives an overview of the training process. As a first step, before performing reinforcement learning, we obtain a question generation model θ_Q and an answer generation model θ_A, both trained on data that we curated, described later in Sec. 4. The question generation model predicts clarification questions and the answer generation model provides (defeasible) answers to the generated questions. By using these two models, in addition to Delphi (ψ_Delphi), for calculating the rewards, we do not require any supervised data during RL training.

We consider question generation conditioned on a given situation a sequential decision making process over the natural language vocabulary space, where a generated question q with T tokens corresponds to an episode of length T. At step t ∈ [1, T], the state s_t = (s, q_<t) is the combination of the given situation and the question decoded up to the (t−1)-th token; the action c_t = q_t is the t-th token to decode. The question generation model θ_Q(q_t | s, q_<t) is the policy model that we optimize. We define a reward function r(s, q, a_w, a_s) that characterizes the divergence of answers from θ_A conditioned on the generated question q; we discuss the definition of this reward function in §3.3.

Supervised Question Generation
The first subcomponent is a basic question generation system θ Q that outputs a question q conditioned on a situation s. It is used as the initial policy model during RL training.

Defeasible Answer Simulation
For each generated question q, we need to generate a weakening answer a_w and a strengthening answer a_s in order to calculate the reward r (Formula 3).
For the defeasible answer generation system θ_A, we take as input a situation s_i, the generated question q_i (§3.1), and an update type u ∈ {weakener, strengthener} to predict a weakener-strengthener answer pair a_w and a_s. An example of an instantiated input/output:

Input: It's bad to be a snitch, TYPE: Weakener, Q.: Why would being a snitch be beneficial?
Output: doing so would save someone's life.
The crucial element in the input is the update type, as it allows generating two types of answers for the same s and q. When computing the reward during training, we filter out, for each question, all generated answers which either contradict or are entailed by (i.e., add no new information to) the given situation, using an off-the-shelf NLI model.
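A sketch of the answer simulation and NLI filtering, assuming a fine-tuned T5 answer generator and the WaNLI classifier mentioned in Appendix A.4; the input formatting follows the example above, while the checkpoint names and generation settings here are illustrative, not the paper's exact configuration.

```python
from transformers import pipeline

# Illustrative checkpoints: the paper fine-tunes T5-large as the answer
# generator and uses WaNLI as the off-the-shelf NLI filter.
answerer = pipeline("text2text-generation", model="t5-large")
nli = pipeline("text-classification", model="alisawuffles/roberta-large-wanli")

def simulate_answers(situation, question):
    """Generate a weakener and a strengthener answer for the same (s, q)."""
    answers = {}
    for update_type in ("weakener", "strengthener"):
        prompt = f"{situation}, TYPE: {update_type}, Q.: {question}"
        answers[update_type] = answerer(prompt, max_new_tokens=32)[0]["generated_text"]
    return answers

def keep_answer(situation, answer):
    """Keep only answers that add new, compatible information."""
    pred = nli([{"text": situation, "text_pair": answer}])[0]["label"]
    # Drop answers the situation contradicts or already entails.
    return pred.lower() == "neutral"
```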

Reward
As a reward for generating a question, we aim to quantify how well the generated questions are able to elicit consequential answers. For this purpose we query Delphi (Jiang et al., 2022) for feedback, using situations updated with answers.
We optimize for questions that lead to maximally divergent answers by defining a reward function based on the JS-divergence between the Delphi probability distributions of the weakener-updated and the strengthener-updated situation:

r(s, q, a_w, a_s) = JSD(j_w ∥ j_s)

where j_w and j_s are Delphi's judgment distributions for the two updated situations, defined below.

Sentence Fusion To create an updated situation that sounds natural and can be used to query Delphi, the situation s, question q_i and answer (both a_w and a_s separately) have to be fused together into s_{u_i}. For example:

Situation: refraining from doing something bad
Question: When do you do something bad?
Answer: when I'm angry
Fusion: refraining from doing something bad when you're angry

We train a fusion model by distilling in-context examples obtained from GPT-3 (text-curie-001).
Delphi for Feedback Delphi is then queried with the updated situation s_{u_i} for a judgement, leveraging the probability distribution that Delphi provides over three classes: j ∈ {bad, ok, good}. The probability scores are the probabilities of the special T5 tokens representing each of the three classes, normalized by their sum.
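Concretely, the three class probabilities can be read off the first decoding step of a T5-based Delphi. A sketch, where the special class-token strings are placeholders for Delphi's actual vocabulary items:

```python
import torch

def delphi_class_probs(model, tokenizer, situation,
                       class_tokens=("<bad>", "<ok>", "<good>")):
    """Normalized probabilities over Delphi's three judgment classes.

    Assumes a T5-style Delphi whose first decoded token is one of three
    special class tokens; the token strings above are placeholders.
    """
    ids = [tokenizer.convert_tokens_to_ids(t) for t in class_tokens]
    enc = tokenizer(situation, return_tensors="pt")
    start = torch.tensor([[model.config.decoder_start_token_id]])
    # Score only the first decoder step.
    logits = model(**enc, decoder_input_ids=start).logits[0, -1]
    probs = logits.softmax(dim=-1)[ids]
    return (probs / probs.sum()).tolist()  # renormalize over the 3 classes
```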
JS-Divergence We calculate the Jensen-Shannon divergence between the Delphi probability distributions j_w and j_s obtained from the two updated situations originating from the defeasible answers to q.
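The divergence itself reduces to a few lines. A sketch reusing the `delphi_class_probs` helper above, where the two fused situations are assumed to come from the sentence fusion model:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def divergence_reward(model, tokenizer, fused_weakener, fused_strengthener):
    """r(s, q, a_w, a_s) = JSD(j_w || j_s) over Delphi's class distributions."""
    j_w = delphi_class_probs(model, tokenizer, fused_weakener)
    j_s = delphi_class_probs(model, tokenizer, fused_strengthener)
    return js_divergence(j_w, j_s)
```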

Reward normalization
We normalize the reward during training as follows:

r_norm = (r − μ_0) / σ_0

The mean μ_0 and standard deviation σ_0 of the rewards are obtained before training begins, by generating a question and calculating its reward for every s in the training data.

Proximal Policy Optimization (PPO)
We maximize the reward using Proximal Policy Optimization (PPO; Schulman et al., 2017) as our RL algorithm, which previous work has shown to be suitable for NLG tasks (Liu et al., 2022b; Ramamurthy et al., 2022). Our implementation of PPO is an adaptation of Ouyang et al. (2022)'s, which includes a KL penalty between the initial policy model θ_Q_old and the updated policy θ_Q. In addition to the policy model, PPO employs a value model (parameterized by φ) to estimate the value function for states with incompletely decoded text, i.e., V(s_t; φ) for any t. PPO's loss consists of a value model loss and the policy loss, which are jointly minimized:

L_PPO = L_Policy + α L_Value

Appendix A.3 defines the two terms.

δ-CLARIFY: a Dataset of Clarification Questions
We require data for various components of our CLARIFYDELPHI model: the policy needs bootstrapping from a clarification question dataset, and the answer generation model needs data to learn to generate defeasible answers to questions. To the best of our knowledge no such datasets exist. We therefore collect a crowdsourced dataset of clarification questions for social and moral situations and a silver dataset of defeasible QA pairs to train θ_A. The situations are sampled from SOCIAL-CHEM-101 (Forbes et al., 2020) and the COMMONSENSE NORM BANK (Jiang et al., 2022). We call our dataset δ-CLARIFY; it consists of crowdsourced questions, enriched with questions generated by GPT-3 (Brown et al., 2020).
Situation: tipping people decently
Q1: What did they do for you?
Q2: Can you afford to tip?
Q3: Was the service good?
Q4: Did the people perform the service adequately?
Q5: Do you always tip people well regardless of the service quality?

Situation: Jeff ignores the comment and laughs about it with his boss
Q1-4: What was the comment?
Q5: Who made the comment they were laughing at?

Table 1: Two example situations from δ-CLARIFY with the five crowdsourced clarification questions per situation.

Next we describe how we create the dataset.
δ-CLARIFY-gold: We crowdsource clarification questions by showing annotators a situation and asking them to write a clarification question they would ask an imagined colleague requesting advice on the situation. Each of the 6425 situations is presented to 5 annotators, resulting in 5 questions per situation (500 situations each are used for dev and test). Details of the annotation are found in Appendix A.1.
δ-CLARIFY-silver: The δ-SOCIAL part of the defeasible inference dataset (Rudinger et al., 2020) consists of statements that express default judgments over situations (It is good to protect your kids) and updates that weaken (Your protection is overbearing) or strengthen (Your child is in danger) the default. These updates can be viewed as potential answers to an implicit question about a base situation: e.g., "What are you protecting your child from?" for Your child is in danger. We 5-shot prompt GPT-3 to generate questions, conditioned on situation and answer, resulting in ≈ 80K (situation, update type, question, answer) tuples.
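A sketch of how such a 5-shot prompt could be assembled; the instruction wording is hypothetical, and the single exemplar shown is taken from the example in the text.

```python
# Illustrative 5-shot prompt for deriving δ-CLARIFY-silver questions from
# δ-SOCIAL (situation, update) pairs. Wording is an assumption, not the
# paper's actual prompt.
FEW_SHOT = [
    ("It is good to protect your kids.",
     "Your child is in danger.",
     "What are you protecting your child from?"),
    # ... four more (situation, update, question) exemplars ...
]

def build_prompt(situation, update):
    lines = ["Write the question that the update answers about the situation.", ""]
    for s, a, q in FEW_SHOT:
        lines += [f"Situation: {s}", f"Update: {a}", f"Question: {q}", ""]
    lines += [f"Situation: {situation}", f"Update: {update}", "Question:"]
    return "\n".join(lines)  # sent to GPT-3 as a completion prompt
```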
Dataset Analysis Fig. 3 shows that the crowdsourced δ-CLARIFY-gold has more variety in its most common question starts, which reflects the general diversity of the dataset: for only 10% of the situations did more than one Turker ask exactly the same question, and for only 8% of the situations did all 5 Turkers use the same wh-word to start the question. This highlights that there is more than one possible salient clarification question to be asked for any given situation. For the situation tipping people decently in Tab. 1, all 5 Turkers chose to start their questions differently, even though three of these five questions ask in one way or another about the service quality. For the other situation, 4/5 Turkers asked for a specification ("What was the comment?") and 1 Turker asked about the missing agent role. We also see that polar (yes/no) questions appear less frequently, as Turkers were explicitly asked to avoid them unless no other suitable question came to mind.²

The δ-CLARIFY-silver questions are generated by conditioning on weakener or strengthener updates. Since we aim to predict defeasible questions, the most desirable questions are those whose answers can be both weakeners and strengtheners. In the silver data, 53% of situations have at least one question that has been generated by GPT-3 for both update types. The situation Your kids should be your number one priority, for example, has the same question "What are your kids' ages?" for the weakener update They are adult children and the strengthener update Your children are toddlers. Interestingly, among the subset of defeasible questions in δ-CLARIFY-silver, we find that the most frequent question start is "why". This suggests that it is easiest to come up with both weakener and strengthener answers to why-questions.

² This is to prevent leading questions such as "Do you intend to give it to a kid?" for "offering a cup of coffee".

Baselines
We consider four baselines in our experiments.
Question Generation Without RL To assess what additional improvement training an RL model with a defeasibility reward provides, we report the performance of the supervised question generation model θ_Q on its own (§3.1). We refer to this baseline as t5 fine-tuned. We decode using nucleus sampling with top-p = 0.6.
Pipelines with Question Selection Next, we implement two pipelines where, as the first step, a diverse set of questions is generated for a given situation and then, as the second step, the best question is selected according to a score.
In order to generate a diverse set of questions we fine-tune T5 on δ-CLARIFY, conditioned on a modified input compared to the model from §3.1:

Input: <Situation>. Q.: <wh-word>
Output: <Question>

By also conditioning on the first wh-word of the question it is possible to generate different questions. During inference we generate questions for 14 different question starts.³ We propose two approaches to scoring these questions: using a discriminator model and using divergence ranking, described below.
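A sketch of this candidate generation step; the text does not enumerate the 14 question starts, so the list below is an assumption.

```python
# Assumed list of 14 question starts (the paper does not enumerate them).
QUESTION_STARTS = ["what", "who", "whom", "whose", "when", "where", "why",
                   "how", "is", "are", "do", "does", "did", "can"]

def candidate_questions(generator, situation):
    """One candidate question per question start, via the wh-conditioned T5."""
    candidates = []
    for wh in QUESTION_STARTS:
        out = generator(f"{situation}. Q.: {wh}", max_new_tokens=24)
        candidates.append(out[0]["generated_text"])
    return candidates
```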
Discriminator We train a discriminator classifier which labels these questions as either relevant or irrelevant to a given situation. We then choose the question that has been assigned the relevant label with the highest probability. The discriminator is a binary classifier based on DeBERTa (He et al., 2021).

Divergence Ranking We run the defeasible answer simulation with feedback from Delphi for each question in the set. This process is the same as the reward function of the RL approach, except that the JS-divergence score is used to rank the questions instead of serving as a reward for question generation. We compare two variations of this baseline: one with answer filtering using an NLI model as described in Sec. 3.2 (pipeline-nli) and one without filtering (pipeline).
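Putting the pieces together, the divergence-ranking baseline scores each candidate with the same JSD computation used as the RL reward. A sketch reusing the earlier helpers, with `fuse` standing in for the sentence fusion model:

```python
def select_question(situation, candidates):
    """Pick the candidate whose simulated answers diverge the most."""
    def jsd_score(q):
        ans = simulate_answers(situation, q)
        s_w = fuse(situation, q, ans["weakener"])      # hypothetical fusion helper
        s_s = fuse(situation, q, ans["strengthener"])
        return js_divergence(delphi_class_probs(model, tokenizer, s_w),
                             delphi_class_probs(model, tokenizer, s_s))
    return max(candidates, key=jsd_score)
```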

Why-Baseline
We saw in §4 that questions conditioned on weakener/strengthener updates are usually causal questions. Using the same input/output configuration as in the pipeline baseline, we generate a why-question for each situation (called why).

Human Evaluation
Automatic evaluation of questions and their usefulness for clarifying moral situations is tricky. While we do have gold reference questions, we have shown that humans produce diverse questions for the same situation (§4), and the fact that a question does not appear in the reference set does not necessarily indicate that it is not a consequential question. We therefore perform human evaluation of the models' outputs on Amazon Mechanical Turk on the 500 test set instances from δ-CLARIFY. Given a situation and a question, Turkers are asked to rate the question along three attributes: Grammaticality (Is the question grammatical?), Relevance (Is the question relevant and plausible to the situation?), and Informativeness (Does the question elicit new information or merely restate the situation?). The attributes are further detailed in Appendix A.1.
Additionally, and most importantly, we aim to evaluate the defeasibility of the questions, i.e., how well the generated questions can elicit weakener or strengthener answers. For this purpose, Turkers are given a situation with a question and are first asked to judge the situation (generally ok vs. generally not ok). They are then asked to state whether they can think of an answer to the question which might support their judgement, and whether they can think of an answer which would flip their judgement, and to specify those answers.

Results of Human Evaluation
We first run the grammaticality, relevance and informativeness evaluation. All questions which are given the lowest rating (i.e., irrelevant and/or uninformative) by at least two annotators are excluded from the second evaluation: it does not make sense to ask about defeasibility for questions which are already irrelevant, and additional weakening or strengthening context is not feasible for uninformative questions.
We find, as displayed in Fig. 4, that CLARIFYDELPHI has the largest percentage of relevant and informative questions in the test set, compared to the baselines. We also see that a large majority of the generated questions, from all models, are relevant and informative, with the lowest performing model (discriminator) still producing 448/500 questions that are passed on to the next evaluation round.
We also find that grammaticality across all systems is high with the lowest average score being 0.98 and the highest 0.99 (on a scale from 0 to 1, with 1 being grammatical). The minimal variation in grammaticality score is expected since all models are based upon the same transformer model.
The CLARIFYDELPHI questions also outperform the baselines in terms of defeasibility, as seen in Table 2: annotators can more often think of a strengthener answer and/or a weakener answer to our questions. The evaluation also shows that adding the answer-filtering with NLI step to the pipeline improves the question selection on all 4 evaluation criteria. The why-baseline is shown to be a strong baseline, indicating that motives and reasons are important for moral reasoning.

How much supervision does the policy require?
Our approach uses RL in conjunction with a supervised policy that has been fine-tuned on question generation. This has been shown to outperform approaches which use RL on top of a "vanilla" LM policy (Ramamurthy et al., 2022). To assess the effect of supervision on question generation performance, we trained multiple initial policies on varying percentages of δ-CLARIFY training data: 25%, 50%, 75% and 100%. To compare against more traditional supervised question generation approaches we additionally trained a policy on SQuAD v1.1 data (Rajpurkar et al., 2016).

We report two automatic evaluation metrics. To measure informativeness we use an off-the-shelf QA model trained on SQuAD 2.0 from AllenNLP (Gardner et al., 2018). This model either answers a question by pointing to a span in the input or outputs that the question is unanswerable with respect to a given context. For a clarification question to be informative it should not ask about anything already mentioned in the situation. For the QA-metric we thus report the percentage of non-answerable questions.⁴ We also report the average maximum BERTScore (Zhang et al., 2019) between a generated question and one of the 5 gold questions in δ-CLARIFY.

Fig. 5 shows the following trends with regard to training a supervised policy. More training data leads to more informative questions. The policy trained on SQuAD produces the most uninformative questions, which can be explained by the fact that SQuAD questions are conditioned on answers already present in a text. While performance consistently increases from 25% to 75% of the training data, improvements beyond 75% are minimal. We conclude that for our use case, training on about 5000 (75%) situations with 5 questions each leads to a sufficiently good policy. These results are also supported by the BERTScore.
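The BERTScore part of this evaluation can be sketched with the `bert_score` package as below; the QA-based answerability metric would analogously count the questions the SQuAD 2.0 model labels unanswerable. The data-handling here is illustrative, assuming `generated` and `gold_sets` are aligned lists.

```python
from bert_score import score

def avg_max_bertscore(generated, gold_sets):
    """Average over questions of the max BERTScore F1 against the 5 references."""
    maxima = []
    for question, refs in zip(generated, gold_sets):
        # Score the generated question against each gold reference.
        _, _, f1 = score([question] * len(refs), refs, lang="en", verbose=False)
        maxima.append(float(f1.max()))
    return sum(maxima) / len(maxima)
```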

Analysis
Answer Simulation The answer generation model generally succeeds at generating diverse weakener and strengthener answers to the same question: for only about 0.05% of questions per 1000 PPO epochs does the model generate the same answer for both the weakener and the strengthener.
Our answer generation can be viewed as question-guided defeasible update generation. Rudinger et al. (2020)'s task of Generative Defeasible Inference generates an update given a situation, a moral judgement and the update type (weakener/strengthener). In our answer generation approach we condition on the same input together with a generated question. This intermediate question generation step functions as a type of macro planning, which has been shown to be effective for NLG (Puduppully et al., 2022; Narayan et al., 2022). We evaluate our approach on the same test set using the same evaluation metrics as Rudinger et al. (2020). Table 3 shows that by first predicting the question and then the updates, we improve upon generating defeasible updates for δ-SOCIAL.

Table 3: Generative defeasible inference results on δ-SOCIAL, reported with BLEU (Papineni et al., 2002) and ROUGE-L (Lin and Hovy, 2002). The first two results are from Rudinger et al. (2020).
Questions We qualitatively inspect the types of generated questions. There are many specification questions asking about a hyponym of an argument in the base situation, for example, exterminating pests on your property: "What kind of pests?". The situations extracted from SocialChem often include underspecified pronouns, such as 'something' or 'somewhere'. 60% of the situations containing 'something', for example, elicit what-questions from our model. Note that while such specification questions are valid clarification questions, the SQuAD 2.0 QA model would mark them as answerable given the situation. It is also interesting to see that often when a situation has a missing or implicit semantic argument, such as being anxious sometimes, CLARIFYDELPHI inquires about it: "What are you anxious about?". The generated why-questions most often ask about the motives of the agent in the situation, such as Ben tells Maggie that he's traveling alone: "Why is Ben traveling alone?". More rarely the model generates questions asking about the viewpoint of the patient: asking a friend [...] whether your attire is appropriate for an event: "What is the friend's opinion of your attire?"

Analysis of Delphi's Probabilities In Tab. 4 we quantify the JSD of Delphi's judgments. Even though the human evaluation showed that CLARIFYDELPHI produced the most questions leading to defeasible answers, the JSD and the percentage of judgment flips are higher for the pipeline-nli approach, where we explicitly filter questions to maximize the JSD. Nevertheless, CLARIFYDELPHI leads to more Delphi judgment flips and higher JSD between answers than the fine-tuned t5 model without RL (and also all other baselines besides the pipeline). This automatic evaluation and its disagreement with the human annotators also reveal that Delphi's probabilities are not always perfectly calibrated, and relying too heavily on a model's output might lead to some error propagation.

Table 4: Average JSD between P_{j_w} and P_{j_s} of a situation. Judgment Flips: % of answers which led to a flip in Delphi's judgment.

Interactive Judgements
While we use answer simulation during PPO training, inference only requires a situation as input.
The clarification questions can then be used to elicit additional context, in the form of answers, through interaction. Fig. 2 illustrates such an interaction between a user, Delphi as the moral reasoning system, and CLARIFYDELPHI. After each turn, the situation is updated with the user-provided context, for which Delphi produces a new decision. We limit the interaction to three turns. This is based on the observation that after the third turn the sentence fusion starts to deteriorate, resulting in less relevant and more repetitive questions. Additionally, we find that the first two questions generally capture the missing contexts that are most central to making moral decisions. We provide more examples of generated questions in the Appendix.
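The interaction loop itself is simple. A console sketch with the earlier helpers, assuming a `fuse` model for folding each answer back into the situation and a hypothetical `generate_question` wrapper around the policy:

```python
def interact(situation, max_turns=3):
    """Three-turn clarification dialogue (Fig. 2), as a console sketch."""
    for _ in range(max_turns):
        judgment = delphi_class_probs(model, tokenizer, situation)
        question = generate_question(theta_Q, situation)  # hypothetical helper
        print(f"Delphi: {judgment}\nClarifyDelphi: {question}")
        answer = input("> ")
        # Fold the user-provided context back into the situation.
        situation = fuse(situation, question, answer)
    return situation
```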

Related Work
Question Generation Clarification question generation has been studied for various domains, from image recognition questions to product description questions (Rao and Daumé III, 2018; Majumder et al., 2021; White et al., 2021), defining the goodness of clarification questions along the lines of information-theoretic measures such as relevance, informativeness or utility (Rao and Daumé III, 2018, 2019; White et al., 2021; Warstadt and Agha, to appear).
Most existing work focuses on questions that lead to a single true answer, whereas we focus on generating clarification questions for social situations, defining the relevance and utility of a question in terms of defeasibility. Additionally, we offer a high-quality clarification question dataset for social and moral situations, comprising more than 30K questions, that breaks the mold of the domain-specificity of previous clarification datasets (Kumar and Black, 2020; Aliannejadi et al., 2019).
Some general question generation approaches have incorporated RL. Buck et al. (2018) learn to paraphrase questions with a reward that maximizes the QA answer F1 score, and Rao and Daumé III (2019) optimize a binary utility reward, using REINFORCE in an adversarial setup for generating clarification questions. In our setup, we use Proximal Policy Optimization (Schulman et al., 2017; Ouyang et al., 2022) with a trained model for feedback as part of the reward.

Moral Reasoning Delphi (Jiang et al., 2022) is trained on the COMMONSENSE NORM BANK, which compiles data from sources including SOCIAL-CHEM-101 (Forbes et al., 2020), SOCIAL BIAS FRAMES (Sap et al., 2020), and SCRUPLES (Lourie et al., 2021). Delphi is based on pre-trained UNICORN, a universal commonsense reasoning model trained on a number of commonsense reasoning tasks, and can predict the ethical judgment given a description of a situation.

Conclusion
In this work we introduce CLARIFYDELPHI, which generates clarification questions for social and moral situations. We show how an RL approach that optimizes for maximally divergent answers in terms of defeasibility outperforms other clarification question baselines. While we start with a supervised policy, the reward function makes use of already trained models and does not rely on any additional training data. We believe that our questions can be useful for providing more disambiguating context through interaction.

Limitations
On Western-centricity The majority of the crowdworkers producing the source data (δ-Social and Delphi) and δ-CLARIFY were located in the United States. Due to this, the predictions generated by CLARIFYDELPHI are currently limited to representing only the perspectives of western culture (particularly the United States). Overcoming the western-centric bias is a compelling direction for future research.
On Defeasibility We rely on Delphi to produce acceptable judgments given a situation and the modifying context as a measure of defeasibility. We recognize, however, that Delphi is not perfect and is characterized by a variety of limitations such as limited cultural awareness and inconsistent predictions (Jiang et al., 2022). Investigating improved methods for identifying answer divergences that better capture defeasibility is a topic for future investigation.

Ethics Statement
Annotations are conducted on Amazon Mechanical Turk (MTurk). We maintain an average pay of $15 per hour for all our crowdsourcing data collection and evaluation tasks. Our crowdsourcing tasks do not collect personal information and are strictly limited to gathering workers' general knowledge. We do not keep any deanonymizing information such as MTurk IDs so that the identity of the workers cannot be directly or indirectly ascertained. Finally, our crowdsourcing task meets the standards for exemptions as human research and has obtained the necessary documentation that deems it exempt through our internal IRB procedures.
Our model is intended to be used for research purposes only and it is not supposed to provide any sort of advice applicable outside of that domain.
IRB approval: We sought and received an exemption from our internal IRB. In accordance with the regulations, we do not collect sensitive information. If we do publish WorkerIDs, we will do so by fully anonymizing the information. The exemption received does not require a consent form.

Language and Demographics:
We have not collected any demographic information from the workers. However, all crowdsourcing was conducted in English and the region (current location of the crowdworker) was set to the US. Consequently, what counts as a context of consequence is centered around western views, or views of the English-speaking cultures within the United States.

A.2 Prompting for Answer Generation
One way to elicit a set of opposing answers is through prompting. We instruct GPT-3 to provide a so-called "bad" and a so-called "good" answer to a question about a situation. For the situation learning how to take a joke and the question "What was the joke?", the two answers could be: "it was a lighthearted joke among friends" and "it was an offensive joke". To determine which of the answers is a weakener and which a strengthener, we compare the difference in Delphi's judgement for s and s + a_good or s + a_bad.
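A sketch of this labeling step, reusing the `delphi_class_probs` helper from the earlier snippets: whichever answer moves probability mass away from Delphi's original judgment is taken as the weakener. The string concatenation used for the updated situation is a simplification of the fusion step.

```python
def label_answers(situation, answer_a, answer_b):
    """Decide which of two prompted answers weakens the base judgment."""
    base = delphi_class_probs(model, tokenizer, situation)
    base_cls = max(range(len(base)), key=lambda i: base[i])
    p_a = delphi_class_probs(model, tokenizer, f"{situation} {answer_a}")
    p_b = delphi_class_probs(model, tokenizer, f"{situation} {answer_b}")
    # The answer that lowers the original judgment's probability more
    # is the weakener; the other is the strengthener.
    if p_a[base_cls] < p_b[base_cls]:
        return {"weakener": answer_a, "strengthener": answer_b}
    return {"weakener": answer_b, "strengthener": answer_a}
```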

A.3 Details of PPO
Policy loss. To compute the policy loss, we first define the truncated estimated advantage function

Â_t = Σ_{t'=t}^{T} (γλ)^{t'−t} (r_{t'} + γ V(s_{t'+1}; φ) − V(s_{t'}; φ)),

where the value function of a state s_t is estimated by the value model V(·; φ), r_t denotes the intermediate reward obtained at time step t, and γ and λ denote the reward decay factors. PPO then maximizes the empirical expectation of the clipped surrogate objective term, defined as

cso(Â_t, r_t(θ), ε) = min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t),

where r_t(θ) denotes the probability ratio between the updated and the old policy, and ε is the clipping range.

Value loss. The value model is trained with an MSE loss with respect to the target value V_t^targ:

L_Value = (V(s_t; φ) − V_t^targ)².
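A minimal PyTorch sketch of these two terms; the per-token advantage, ratio, and value tensors are assumed to be precomputed, and the KL penalty and joint α-weighting are omitted.

```python
import torch
import torch.nn.functional as F

def clipped_surrogate(advantages, ratio, eps=0.2):
    """PPO clipped surrogate objective cso(A_t, r_t(theta), eps), maximized."""
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.minimum(unclipped, clipped).mean()

def value_loss(values, target_values):
    """MSE between value model estimates V(s_t; phi) and targets V_t^targ."""
    return F.mse_loss(values, target_values)
```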

A.4 Implementation Details
The question generation model θ_Q is trained with T5-large (Raffel et al., 2020) on δ-CLARIFY.
We train the answer generation model θ_A (T5-large) on 77k instances from δ-CLARIFY-silver (§4) and about 4k instances obtained through prompting (§A.2). For the answer filtering using NLI we use WaNLI (Liu et al., 2022a) as an off-the-shelf NLI model.
We run the PPO algorithm for 6000 steps, with a batch size of 64, top_p = 0.6 and a temperature of 0.7 for the sampling policy. Training for 6000 steps took approximately 2 days on four A6000 (48GB) GPUs.
Our models are based on T5-large, which has 770 million parameters.
For the DELPHI model we use the t5-large variant.

A.5 Licensing
CLARIFYDELPHI and the code will be released under the Apache License 2.0. We will release the dataset under CC-BY.

A.6 Sample of Generated Questions
We show a sample of 15 generated questions in Tab. 5.

A.7 Sample of Generated Questions in a Compositional Setting
We show a sample of generated questions in a two-turn interactive setting in Tab. 6.

Situation → Generated question
sleeping in when you have to get up. → What are you doing when you have to get up?
Paying more when you are able to because you make more. → What are you making?
Apologizing when you have done something wrong. → What did you do?
being careful when opening your car door in a parking lot. → Are you aware of what is inside your car?
being out after curfew. → Are you allowed to be out after curfew?
respecting your friend's wishes → What is your friend's wish?
making a scene → What kind of scene?
being friends with benefits with someone → Why are you friends with benefits with them?
not liking certain dogs → Why don't you like them?
smoking in one's own home → What are the risks?
wanting to go home with someone → How do you feel about them?
being angry with your partner when they hide things → What are they hiding?
Talking about things you know nothing about → Why are you talking about these things?
agreeing to go out with someone who bullied you → Why are they bullying you?
using your money how you see fit → What are you doing with your money?

Table 5: Sample of 15 generated questions for situations from the dev set.

getting cold feet before a wedding → Why are you getting cold feet?
getting cold feet before a wedding because you only know your partner for 1 month → What is your relationship status?
Being mindful of your health → How are you being mindful?
being mindful of your health by working out → What are the benefits of working out?

Table 6: Sample of situations from the dev set of SocialChem, their expansion through interaction (2 turns) and the CLARIFYDELPHI generated question.