Prompting Contrastive Explanations for Commonsense Reasoning Tasks

Many commonsense reasoning NLP tasks involve choosing among two or more possible answers to a question or prompt based on knowledge that is often implicit. Large pretrained language models (PLMs) can achieve near-human performance on such tasks, while providing little human-interpretable evidence of the underlying reasoning they use. In this work, we show how to use these same models to generate such evidence: inspired by the contrastive nature of human explanations, we use PLMs to complete explanation prompts which contrast alternatives according to the key attribute(s) required to justify the correct answer (for example, peanuts are usually salty while raisins are sweet). Conditioning model decisions on these explanations improves performance on two commonsense reasoning benchmarks, as compared to previous non-contrastive alternatives. These explanations are also judged by humans to be more relevant for solving the task, and facilitate a novel method to evaluate explanation faithfulness.


Introduction
Pretrained Language Models (PLMs) (Raffel et al., 2020; Lewis et al., 2020; Radford et al., 2019; Brown et al., 2020) have been shown to encode substantial amounts of knowledge in their parameters (Petroni et al., 2019; Talmor et al., 2020; Roberts et al., 2020) and have achieved impressive performance on commonsense reasoning (CSR) tasks without the use of external knowledge (Trinh and Le, 2018; Yang et al., 2020). However, these models provide little human-interpretable evidence of the intermediate commonsense knowledge or reasoning they use, and have been observed to rely overly on superficial dataset artifacts (Poliak et al., 2018; Geva et al., 2019). To overcome this limitation, recent work has shown that PLMs can explain themselves by generating free-form natural language explanations of their reasoning patterns (Rajani et al., 2019a; Camburu et al., 2018; Narang et al., 2020). However, the space of possible free-form explanations is incredibly large, inherently ambiguous, and difficult to annotate or evaluate (Wiegreffe et al., 2020; Latcinnik and Berant, 2020). Furthermore, quantifying the model's dependence on free-form explanations is also challenging (Camburu et al., 2020). We address these challenges by proposing an unsupervised method that uses contrastive prompts, which require the model to explicitly contrast different possible answers in its explanation (Table 1).

i) I picked up a bag of peanuts and raisins for a snack. I wanted a sweeter snack, so I ate the ___ for now.
   Contrastive Expl.: Peanuts are salty while raisins tend to be sweet.
ii) The geese prefer to nest in the fields rather than the forests because in the ___ the predators are more hidden.
   Contrastive Expl.: Forests are denser than fields.

Table 1: Examples of Winograd Schema instances where the correct and incorrect answer choices are highlighted in blue and red respectively. Choices are contrasted along attributes like taste (for i) and density of vegetation (for ii) by humans to explain why they prefer some answer choice.
Our approach is based on a key observation: Many commonsense reasoning tasks require the comparison or contrast of plausible alternatives along a distinguishing attribute. For instance, in Table 1, the differentiating attributes for the two answer choices may be taste (for i) and vegetation density (for ii). People commonly use contrastive explanations to explain their reasoning (Miller, 2018). Rather than asking "Why P?", they ask "Why P rather than Q?", where Q may be implicit from the context. For example, instead of justifying why raisins are the appropriate choice, people tend to explain why they are more likely than peanuts. Miller (2018) also argues that such contrastive explanations are computationally efficient, as they only require focusing on the limited set of reasons that might make one answer more likely than the other, instead of exhaustively enumerating all possible reasons for an answer. For instance, the raisin's taste (not its size, temperature, etc.) in Table 1 is adequate to explain why it is the best answer.
Our goal is to enable PLMs that explain their predictions to similarly benefit from such constraints. We develop a small set of contrastive generation prompts that can be in-filled by a PLM such as T5 (Raffel et al., 2020) or BART (Lewis et al., 2020) (see Table 3). These templates are designed to cover a multitude of language patterns used by humans to compare and contrast entities. Another PLM then conditions on both the original input and the generated contrastive explanation, to predict the final answer. This approach is inspired by Shwartz et al. (2020), who also use textual prompts to query the PLM with clarification questions. However, their prompts are generic while we prompt for instance-specific information.
Our approach shows quantitative improvements in task performance over two existing methods for model explainability (Shwartz et al., 2020; Latcinnik and Berant, 2020), on two commonsense reasoning tasks: the Winograd Schema Challenge (Levesque et al., 2012) and multiple-choice question answering about physical commonsense (Bisk et al., 2020). Our gains in the zero-shot setting are especially notable, outperforming the best reported results on publicly available PLMs and improving over Shwartz et al. (2020) by up to 11%. We also show, through human evaluations, that contrastive explanations are deemed more useful for solving the original task compared to generic clarification questions. Finally, contrastive explanations can be semantically perturbed by flipping the contrast to support the foil, which quantifies the model's dependence on them and facilitates a novel evaluation of faithfulness.

Related Work
Models that rationalize their decisions by extracting a contiguous subsequence of the input as an explanation (Lei et al., 2016; DeYoung et al., 2020; Paranjape et al., 2020) are inadequate for commonsense reasoning tasks that require knowledge that is implicit in the input; such tasks necessitate that PLMs rely on embedded parametric knowledge. Recent work uses explicit human supervision to generate free-form textual explanations for commonsense reasoning tasks like SNLI (Camburu et al., 2018), Winograd Schemas (Zhang et al., 2020) and CommonsenseQA (Rajani et al., 2019b); such explanations are inherently ambiguous, incomplete, and consequently expensive to collect and evaluate (Camburu et al., 2019b,a; DeYoung et al., 2020). Most recently, Latcinnik and Berant (2020) use an unsupervised approach to generate free-form explanations as sequences of tokens that are not well-formed sentences. In contrast, our method uses specialized prompts to generate well-formed, human-interpretable explanations without any additional supervision.
Specialized prompts have been shown useful for extracting knowledge from PLMs in a targeted manner (Petroni et al., 2020;Richardson and Sabharwal, 2020;Talmor et al., 2020;Donahue et al., 2020;Lin et al., 2019) and improving performance on downstream tasks (Brown et al., 2020;Shin et al., 2020). Most relevant to our work is the self-talk model of Shwartz et al. (2020), an unsupervised approach using a fixed set of clarification questions as prompts to elicit knowledge from PLMs for commonsense reasoning tasks. Our work differs by focusing specifically on contrastive PLM prompts, which we find further improve performance by eliciting explanations which are highly relevant to the classification decision (Section 6).
Our approach to contrastive reasoning is also closely related to counterfactuals, which can be used to give contrastive explanations, i.e., answers to "Why P rather than Q?", by providing a counterfactual case in which Q would have held. Ross et al. (2020) use this idea to generate contrastive explanations, while it has also been used for evaluation (Gardner et al., 2020) and training (Kaushik et al., 2019) with the aim of addressing model robustness. Most of this work explicitly constructs counterfactual cases by perturbing the input data of a task in order to produce changes in the output label. In contrast, we do not construct counterfactual inputs, but aim to explicitly represent counterfactual knowledge: a contrast between the fact P and foil Q that, were it hypothetically reversed, would change the output label. We include an evaluation of our models on this question in Section 6.3.

Winograd Schema
1. The party was more interesting and uplifting than the funeral because the ___ was rigid.
   • Parties are for celebrating while funerals are for mourning
   • People wear colorful clothes at parties and black at funerals
2. The geese prefer to nest in the fields rather than the forests because in the ___ the predators are more hidden.
   • Forests are dense while fields are sparse
   • Forests have more predators than fields.

PIQA
1. How do you get strong hamstrings? (a) work out your upper body (b) work out your legs
   • Hamstrings are located in the legs while biceps are located in the upper body
2. How do you flood a room? (a) fill it with objects (b) fill it with water
   • Filling it with objects can clutter a room while filling it with water floods the room.

Table 2: Examples of commonsense tasks that can be explained using contrastive language and some contrastive explanations authored by in-house annotators. The Fact and Foil are marked in the input.

Contrastive Explanations
We present the theory of contrastive explanations adopted in this work (Section 3.1) and the intuition behind using them for commonsense reasoning tasks (Section 3.2).

Definition and Motivation
A contrastive explanation is generally defined as an answer to a counterfactual question of the form "Why P rather than Q?" for two potential hypotheses P and Q that can follow from some event E. It explains why some fact P occurred instead of some foil Q, where Q can be implicit (Hesslow, 1988; Lipton, 1990; Miller, 2019). A good contrastive explanation points to differences between the fact and foil with regard to certain attributes, rather than just conveying that the fact has a certain attribute. Table 1 shows examples of contrastive explanations that differentiate between peanuts and raisins (on the basis of taste) or forests and fields (on the basis of vegetation density) to explain the more probable answers to Winograd Schema instances. Previous studies (Miller, 2019) in philosophy, psychology, and cognitive science show that humans use such contrastive explanations when explaining their decisions to each other. Importantly, Miller (2018) also argues that contrastive explanations are computationally efficient: exhaustively describing all causes for the occurrence of an event P is harder than listing only the causes for why another event Q did not occur instead of P.

Contrastive Explanations for Commonsense Reasoning Tasks
Many recently proposed commonsense reasoning tasks are framed in a multiple-choice format that facilitates contrastive explanation (see Table 2). In this study, we focus on the following two tasks.
The Winograd Schema Challenge (Levesque et al., 2012, WSC) is a pronoun coreference resolution task designed as a hard benchmark for evaluating everyday knowledge and commonsense reasoning (Zhang et al., 2020). For instance, in the sentence "The city councilmen refused the demonstrators a permit because they feared violence," the pronoun they must be disambiguated between fact (the city councilmen) and foil (the demonstrators). Both fact and foil are explicit in such sentences.
The Physical Interaction Question Answering (Bisk et al., 2020, PIQA) challenge is designed to test knowledge of physical commonsense. PIQA requires choosing which one of two solutions is a better way of achieving a goal posed as a question (see Table 2). PIQA questions relate to physical properties of entities, their affordances, and how they can be manipulated. The fact and foil are explicit in the two solutions, which typically differ from one another by a short noun phrase.
To validate our intuition that contrastive reasoning is instrumental in these tasks, we performed a pilot study with 10 annotators over 100 commonsense questions from Winogrande and PIQA. We instructed them to answer the questions and explain their reasoning, but gave no specific instructions about what the explanations should look like. Examples are shown in Table 2. In 76% of Winogrande and 64% of PIQA examples, annotators explicitly contrasted the fact and foil. The frequent use of certain phrase structures, like P are ___ while Q are ___, strongly informed our method for generating them (Section 4).

Our Approach
We assume the input to a commonsense reasoning problem consists of a textual context c which contains a placeholder ___, and two marked answer choices a_1 and a_2 corresponding to the fact and foil (Table 2, left column). Let c_x denote the substitution of x for the placeholder in c. The task is to predict whether c_{a_1} or c_{a_2} is more likely to be true, i.e., whether a_1 or a_2 best completes the context. Our approach has two stages: First, an Explainer PLM P_expln generates contrastive explanations (Section 4.2) by infilling preset contrastive templates (Section 4.1) on the basis of c, a_1, and a_2. Then, a Task Model P_LM selects the correct answer conditioned on both the context and the generated explanations (Section 4.3).

Table 3: Contrastive prompt patterns, with commonsense examples and model-generated explanations.

Personal Characteristics
  Patterns: P likes/likes to ___ while Q likes/likes to ___; P likes/likes to ___ while Q does not like/like to ___; P prefers/prefers to ___ while Q prefers ___; Q prefers ___ while P does not prefer/prefer to ___; Q thinks ___ while P thinks/does not think ___
  Example: Megan said it would be liberating to go out without makeup like Elena does since ___ never wore makeup.
  Explanation: Elena likes to be natural while Megan likes to wear lipstick

Object Characteristics
  Patterns: P is taller/shorter/smaller/larger/slower/faster than Q; P is/are ___ while/but/however Q is/are ___; Q has/have ___ while/but/however P has/have ___; P has/have more/less ___ than Q; P is/are ___ than Q
  Example: How to tie pieces of paper together? (a) Thread ruler through the holes (b) Thread ribbon through the holes
  Explanation: Ruler is hard while a ribbon is flexible

Spatial/Temporal Contrast
  Patterns: P is inside/outside/above/below Q; ___ is closer to P and farther away from Q; P is to the right/left of Q; Q takes longer to ___ than P
  Example: Emily looked up and saw Patricia racing by overhead. ___ was on the ramp.
  Explanation: Emily is below Patricia

Use cases and causes
  Patterns: P is used for Q; P is used to do Q; P is used for/to/in ___ while Q is used for/to/in ___; Q is used ___ while P is used ___; Q ___ because ___ while P ___ because ___; Q can cause ___ while P results in ___
  Example: To prepare the puff pastry for your pie, line a baking sheet with parchment. Then ___ (a) Unroll the pastry, lay it over baking twine. (b) Unroll the pastry, lay it over fishing line.
  Explanation: Baking twine is used in baking while fishing line is used in fishing

Contrastive Templates
We develop a list of contrastive templates on the basis of an annotation study. For 250 instances from Winogrande and PIQA, we asked three annotators to explain why one answer is more likely than the other. We manually examined these explanations and abstracted them into templates containing at least two placeholders: two for the fact and foil being contrasted, and possibly more corresponding to the properties they are being contrasted on. For instance, peanuts are salty while raisins are sweet becomes Q are ___ while P are ___. We retained templates used by annotators at least 10 times. Table 3 shows several examples. A template is converted into an explanation by replacing the placeholders for the fact and foil with answers a_1 and a_2, and the remaining placeholders with the appropriate contrastive information.
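The sketch below shows one way such templates could be represented and customized in code; the template strings follow Table 3, while the data structure and the customize helper are illustrative choices of ours, not the authors' released implementation.

# Contrastive templates as strings: P and Q mark the fact and foil,
# and "___" marks attribute gaps for the explainer PLM to fill in later.
TEMPLATES = [
    "P is ___ while Q is ___",
    "P are ___ while Q are ___",
    "P has ___ while Q has ___",
    "P is used for ___ while Q is used for ___",
]

def customize(template: str, fact: str, foil: str) -> str:
    """Fill the fact/foil slots, leaving the attribute gaps open.

    customize("Q are ___ while P are ___", "raisins", "peanuts")
    returns "peanuts are ___ while raisins are ___".
    """
    # P and Q each occur exactly once in these templates, so plain
    # string replacement suffices for this sketch.
    return template.replace("P", fact).replace("Q", foil)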
We evaluate the quality and coverage of our templates with another round of human evaluation. For 100 WSC and PIQA examples, we ask three annotators to either write contrastive explanations using one or more of the templates, or indicate that none of the templates were appropriate. Annotators used the templates in over 82% of cases, indicating high coverage for the tasks we study.

Generating Explanations
Let t denote a contrastive template. We write t_{a_1,a_2} to denote the customization of t to an input by filling its marked placeholders for fact and foil with the answer choices. For instance, in Figure 1, the template P are ___ while Q are ___ is customized to Fields are ___ while forests are ___. A full explanation may be produced by filling the remaining gaps in t_{a_1,a_2} by leveraging an infilling language model, the explainer P_expln.
We first construct a neutral context c_{a_0} by filling c's placeholder with a task-specific neutral answer that does not indicate whether a_1 or a_2 is correct. For Winogrande schemas, c_{a_0} is constructed using the ambiguous pronoun in c (them in Figure 1). For PIQA, c_{a_0} is constructed as "c ⊕ a_1 or a_2", where ⊕ is string concatenation, e.g., upper body or legs in Figure 1 (more dataset-specific details are in Section 5.2). We then prepend c_{a_0} to the customized template t_{a_1,a_2} and use it as input to the infilling language model to fill in the remaining gaps in the template. We use the maximum-likelihood candidate phrases from top-K decoding to transform the template into a full explanation e.

Figure 1: (1) A commonsense reasoning instance (c, a_1, a_2) is converted into a custom prompt c_{a_0} ⊕ t_{a_1,a_2} as input for the explainer PLM. (2) The combination of input and explanation (c_{a_i} ⊕ e_j) is used by the task model to score a_i for all i and j. For a_1 and a_2, scores are aggregated over templates.
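A minimal sketch of this step, using a Hugging Face T5 model as the infilling explainer; the model size, decoding settings, and exact prompt rendering here are illustrative assumptions (Appendix C gives the settings used in our experiments).

from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-large")
explainer = T5ForConditionalGeneration.from_pretrained("t5-large")

# Neutral context c_a0 (ambiguous pronoun kept) prepended to the
# customized template t_{a1,a2}; the gaps become T5 sentinel tokens.
c_a0 = ("The geese prefer to nest in the fields rather than the forests "
        "because in them the predators are more hidden.")
t_a1_a2 = "Fields are <extra_id_0> while forests are <extra_id_1>"

inputs = tok(c_a0 + " " + t_a1_a2, return_tensors="pt")
candidates = explainer.generate(**inputs, num_beams=20,
                                num_return_sequences=5, max_length=20)
# Outputs interleave sentinels and generated phrases, e.g.
# "<extra_id_0> sparse <extra_id_1> dense"; substituting the phrases
# back into the template yields candidate explanations.
for seq in candidates:
    print(tok.decode(seq, skip_special_tokens=False))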
We use a list of templates t_1, ..., t_n to generate a list of candidate explanations e_1, ..., e_n for each input, all of which are fed into the task model. We also use some task-specific heuristics to reduce the number of prompts for each example, detailed in Appendix A.

Task Model
Given the context and answer choices (c, a_1, a_2) and a list of explanations e_1, ..., e_n, the second stage of our pipeline is a binary classifier between a_1 and a_2 which marginalizes over the explanations. We first assign a score to each answer a ∈ {a_1, a_2} and explanation e ∈ {e_1, ..., e_n}:

    φ(c, a, e) = (1/k) log P_LM(c_a ⊕ e)

where c_a denotes the substitution of a into c, P_LM is string probability under the task language model, and k is the string length of c_a ⊕ e. We use φ as input to a logistic regression classifier which marginalizes over explanations:

    P(a | c, a_1, a_2) = (1/Z) Σ_j exp φ(c, a, e_j)

where Z is a normalizer over a_1 and a_2. At initialization, φ uses a pretrained language model, and we fine-tune it to minimize the cross-entropy loss of P(a* | c, a_1, a_2), where a* is the correct answer. We do not fine-tune the explainer PLM, since the top-K beam decoding is a discrete operation that is hard to backpropagate through. In the zero-shot setting (where the task PLM is not fine-tuned) and during inference, the answer is predicted by aggregating the scores assigned to an answer by all n explanations: argmax_{a_i} Σ_j φ(c, a_i, e_j).
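To make the scoring concrete, here is a minimal zero-shot sketch using GPT-2 from Hugging Face as the task LM; the model choice and helper names are illustrative assumptions, and the fine-tuned classifier is omitted.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def phi(c_a: str, e: str) -> float:
    """Length-normalized log-probability of the string c_a ⊕ e."""
    ids = tokenizer(c_a + " " + e, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)  # .loss is mean token cross-entropy
    return -out.loss.item()        # = (1/k) log P_LM(c_a ⊕ e)

def predict(c_a1: str, c_a2: str, explanations: list[str]) -> int:
    """Zero-shot decision: aggregate phi over all explanations."""
    s1 = sum(phi(c_a1, e) for e in explanations)
    s2 = sum(phi(c_a2, e) for e in explanations)
    return 0 if s1 >= s2 else 1    # index of the predicted answer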

Baselines
Context-Only: We experiment with a baseline that does not condition on explanations at all. Here, the predicted answer is simply argmax_{a_i} φ(c, a_i), scoring each answer using only the context.

Unconstrained Generation: Latcinnik and Berant (2020) generate explanations from a PLM by beam-decoding a free-form sequence, termed a hypothesis, which is then used by a classifier to solve the task. The model is trained end-to-end, and loss terms are added to encourage the hypothesis to sound natural. Explanation generation is otherwise unconstrained. For fair comparison with our approach, we do not fine-tune the explainer PLM (more details are in Appendix C).

Self-Talk: Shwartz et al. (2020) propose an unsupervised model that uses a PLM as the answer scorer and a (possibly different) PLM as a knowledge source, similar to our framework. They formulate the process of obtaining relevant knowledge as self-talk with the following steps: 1) completing clarification question prefixes such as "what is the definition of ..." conditioned on the input context, 2) generating their corresponding answers (clarifications), and 3) conditioning on the clarification questions and answers to make predictions. The key difference between their approach and ours is in the choice of prompts for the PLM and the kinds of knowledge the prompts seek: while Shwartz et al. (2020) prompt for generic clarifications, our prompts elicit instance-specific contrastive knowledge.

Implementation details
Winograd Schema Challenge (WSC) and Winogrande: Each instance provides two answer choices, which we use directly as a_1 and a_2. For the neutral context c_{a_0}, we use the sentence with the original ambiguous pronoun. Since Winogrande has a blank space for the answer, we replace it with the most likely pronoun under a masked language model (BERT), following Shwartz et al. (2020). c_{a_1} and c_{a_2} are obtained by replacing the blank space or pronoun with the answer choice.
Physical Interaction Question Answering (PIQA) (Bisk et al., 2020): PIQA provides two answer choices which mostly vary from each other on a substring (e.g., "work out your [upper body]/[legs]"). We use these differing substrings as a_1 = legs and a_2 = upper body. For the neutral answer a_0, we combine the answers into "a_1 or a_2" (upper body or legs). In the cases where a_1 or a_2 is longer than 2 words, we include an or between the full answers. More details and examples are presented in Appendix A. We use question-answer pairs for c_{a_1} and c_{a_2}.
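A minimal sketch of this PIQA-specific construction; the word-level common-prefix diff used to isolate the differing substrings is a simplifying assumption of ours about the exact procedure.

def piqa_contexts(goal: str, sol1: str, sol2: str):
    """Build a_1, a_2, and the neutral/substituted contexts for PIQA."""
    w1, w2 = sol1.split(), sol2.split()
    i = 0                                  # length of the shared word prefix
    while i < min(len(w1), len(w2)) and w1[i] == w2[i]:
        i += 1
    a1 = " ".join(w1[i:]) or sol1          # the differing substrings
    a2 = " ".join(w2[i:]) or sol2
    if max(len(a1.split()), len(a2.split())) > 2:
        neutral = f"{sol1} or {sol2}"      # fall back to the full answers
    else:
        neutral = " ".join(filter(None, [" ".join(w1[:i]), f"{a1} or {a2}"]))
    return a1, a2, f"{goal} {neutral}", f"{goal} {sol1}", f"{goal} {sol2}"

# piqa_contexts("How do you get strong hamstrings?",
#               "work out your upper body", "work out your legs")
# -> a1 = "upper body", a2 = "legs",
#    c_a0 = "... work out your upper body or legs", ...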

Experimental Results
In this section, we present an extensive evaluation of our approach, demonstrating performance gains which are independently verified by human judges.

Task Performance
We report task accuracy as a proxy for explanation quality. Table 4 compares the task performance of our model with the baselines defined in Section 5.1. We observe that generating and conditioning on additional information from PLMs improves performance over just using the original input (Row 1 vs. 2-6). Using templates to prompt the PLM for specific knowledge is better than unconstrained generation of text (Row 2 vs. 3-6). Contrastive explanations also outperform previous work that uses clarification questions in self-talk (Shwartz et al., 2020). The T5-Large explainer already surpasses the results of self-talk despite being smaller than GPT2-XL, demonstrating the impact of using contrastive explanations over clarification questions.
We also observe that larger explainer PLMs (going from T5-Large to T5-11B) yield higher performance. Our zero-shot results with T5-11B are the highest reported on Winogrande, PIQA, and WSC for an open-sourced model. Finally, our approach yields smaller improvements when the task model is fine-tuned. This suggests that some of the reasoning is still learned implicitly by the task model. Figure 2 shows task performance for various training-data sizes of Winogrande, indicating a larger gap between the Context-Only baseline and our approach when training data is scarce.

Human Evaluation
Setup: Following the human evaluation setup of Shwartz et al. (2020), we sample up to 50 of the highest-scoring explanations for each task. Crowd workers are presented with a commonsense instance, the correct answer, and an explanation, and are asked to judge: 1) Grammaticality, whether the explanation is grammatical; 2) Relevance, whether it is relevant to the topic of the text; 3) Factual Correctness, whether it is factually correct or likely true; and 4) Helpfulness, whether it adds helpful evidence for the correct answer. These metrics and definitions follow Shwartz et al. (2020), with more details in Appendix B. The annotators are also shown the same explanation with fact and foil flipped (details in Section 6.3) and are asked to judge whether the other answer is more likely than before, assuming the flipped explanation to be hypothetically true.
Results: Table 5 shows the results of the human evaluation of contrastive and self-talk explanations. The contrastive explanations are overwhelmingly preferred over self-talk explanations for relevance, factual correctness, and helpfulness. They may be considered less grammatical because of in-filling noise (such as incomplete phrases). Table 6 presents some qualitative examples of instances where contrastive explanations improve over all baselines.

Analysis
We also analyze how much the task model relies on contrastive explanations for its decisions.
Flipping Explanations: Our choice of contrastive language templates facilitates a novel way to evaluate explanation usefulness in prediction. The contrast in the explanation can be reversed by flipping the positions of the fact and the foil in the explanation. If the choice between fact and foil actually depends on the contrastive explanation, then the flipped explanation should provide a hypothetical situation where the foil is more likely than the fact. For instance, "peanuts are salty while raisins are sweet," when flipped to "raisins are salty while peanuts are sweet," may provide evidence that peanuts is the more likely label for the example in Table 1 (i). This may cause a model that uses the explanation to flip its prediction and lead to a drop in accuracy. The magnitude of the drop quantifies the extent to which the model relies on the contrast provided in the explanation. In fact, humans also deem the flipped explanation to imply the opposite label in a majority of cases (Table 5), indicating that our contrastive explanations frequently capture contrastive properties that the labels truly rely on. Table 7 shows the flipped evaluation results. We observe declines in accuracy of up to 8%, indicating that the model does use some contrastive knowledge to reason about the task. Fine-tuned models show a smaller decline in accuracy compared to the zero-shot setting; in this case, the task model may be directly fitting the data in lieu of relying on the knowledge conveyed by the explanation.
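A minimal sketch of this perturbation; the naive whole-string swap (and the temporary-marker trick) is our simplifying assumption, since real mentions may vary in case or inflection.

def flip(explanation: str, fact: str, foil: str) -> str:
    """Swap every mention of the fact and foil in an explanation."""
    return (explanation.replace(fact, "\0")   # temporary marker avoids
                       .replace(foil, fact)   # double-replacement
                       .replace("\0", foil))

e = "peanuts are salty while raisins are sweet"
print(flip(e, "raisins", "peanuts"))
# -> "raisins are salty while peanuts are sweet"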
Abstracting Fact and Foil: Given an input context c (containing the fact and foil a_1, a_2) and an explanation e, the explainer PLM P_expln infills its explanation e on c, while the task model P_LM conditions on both c and e. We can test the quality of the generated explanations and the task model's reliance on them by forcing the task model to rely on e when the information in the input c is restricted. One potential way to do so is to scrub the identities of the fact and foil, a_1 and a_2, from c.
We replace the fact and foil with placeholder tokens to create an abstract context c′. For instance, the example in Table 6 (ii) becomes "The <mask1> and <mask2> helped me navigate ... down.", where the model must now choose between <mask1> and <mask2>. Running the task model on c′ lower-bounds the performance possible without knowing answer identities. We can now test the relevant contrastive, answer-based knowledge contained in the explanations by allowing the explanation model to see the original answers in c, but then abstracting them out when passing the input context and explanations to the task model. More formally, the task model conditions its decision on c′ and e′. For Table 6 (ii), c′ and e′ are "The <mask1> and <mask2> helped me navigate ... down." and "The <mask1> is right-side-up while the <mask2> is upside down." Since only the explainer PLM is shown answer identities, the task model's decision is conditionally independent of the answer identities given the explanation.
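A minimal sketch of the abstraction step; the mask token names follow Table 11, while the case-insensitive matching detail is an assumption of ours.

import re

def abstract(text: str, a1: str, a2: str) -> str:
    """Replace mentions of the fact and foil with placeholder tokens."""
    text = re.sub(re.escape(a1), "<mask1>", text, flags=re.IGNORECASE)
    return re.sub(re.escape(a2), "<mask2>", text, flags=re.IGNORECASE)

e = "Forests have more predators than fields"
print(abstract(e, "fields", "forests"))
# -> "<mask2> have more predators than <mask1>"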
Experiments on Winogrande and PIQA in the fine-tuned setting (Table 8) show that performance improves significantly when the task model conditions on both c′ and e′ compared to a fully abstracted contrastive baseline that conditions only on c′ (from 63.2 to 70.4 for Winogrande), covering almost half of the gap between the fully abstracted setting and the non-abstracted original model (79.1). This indicates that our contrastive explanations encode a significant amount of the information required for commonsense tasks. Even if the full model does not always use the explanations, these evaluations show that our contrastive explanations contain rich task-relevant knowledge, and suggest that future work might focus on how to better make use of this signal.

Table 9: Zero-shot test performance on CommonsenseQA for baselines as well as contrastive models which ensemble fact/foil pairs by voting (V) and maximum margin (MM). The best reported unsupervised performance (Banerjee and Baral, 2020b) uses ConceptNet, which was used to construct the dataset.

Generalizability of Prompts
The set of contrastive prompts used in our framework is curated from an in-house analysis of training instances from the Winogrande and PIQA datasets.
To determine the generalizability of these prompts to other commonsense reasoning tasks, we also experiment with the CommonsenseQA dataset (Talmor et al., 2019), which consists of multiple-choice questions created over ConceptNet, e.g., "Where on a river can you hold a cup upright to catch water on a sunny day? a) waterfall, b) bridge, c) valley, d) pebble, e) mountain". Since there are more than two answer choices to contrast, we convert each instance into 10 pairwise (binary) classification instances. Contrastive explanations are generated for each pairwise decision in the zero-shot setting, as for the Winogrande and PIQA datasets. To choose the final answer, we consider two inference procedures (sketched below): (a) Vote: the answer that receives the maximum number of votes across all binary classification instances is selected; and (b) Maximum Margin: the choice with the maximum difference (margin) between answer likelihoods in any binary classification instance is selected. In Table 9, we observe that self-talk significantly hurts performance on this dataset. On the other hand, contrastive explanations are found to be useful and approach the zero-shot performance of the state-of-the-art, which uses ConceptNet (Banerjee and Baral, 2020b). These results demonstrate that the set of contrastive prompts generalizes to other commonsense reasoning datasets, and that while our contrastive prompts are limited to contrasting two answer choices at a time, the framework can be easily extended to tasks with multiple foils.
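A minimal sketch of the two aggregation procedures; score(a, b) is assumed to return the likelihood-based scores of the two answers in one pairwise instance.

from itertools import combinations

def vote(choices, score):
    """Pick the answer winning the most pairwise contests."""
    wins = {a: 0 for a in choices}
    for a, b in combinations(choices, 2):   # 10 pairs for 5 choices
        s_a, s_b = score(a, b)
        wins[a if s_a > s_b else b] += 1
    return max(wins, key=wins.get)

def max_margin(choices, score):
    """Pick the winner of the single most decisive pairwise contest."""
    best, best_margin = None, float("-inf")
    for a, b in combinations(choices, 2):
        s_a, s_b = score(a, b)
        winner, margin = (a, s_a - s_b) if s_a >= s_b else (b, s_b - s_a)
        if margin > best_margin:
            best, best_margin = winner, margin
    return best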

Conclusion
We show it is possible to prompt pretrained language models (PLMs) to generate contrastive explanations of their reasoning patterns, inspired by the explanations that humans naturally provide for their reasoning. Conditioning model decisions on these explanations improves performance on two commonsense reasoning benchmarks, and humans judge the explanations to be highly relevant and helpful in comparison to prior work. We also showed how contrastive explanations can facilitate in-depth evaluations of faithfulness by flipping or abstracting the fact and foil, finding that our explanations encode a significant amount of information relevant to the classification decision, and that in many cases models rely on the contrast in the expected way. While we have shown that our method is flexible enough to apply to multiple-choice commonsense tasks with many foils, leveraging contrastive reasoning in a wider variety of open-ended tasks remains an exciting challenge for future work.

A Generating Contrastive Templates

Table 12 shows the complete list of contrastive patterns used in our work, categorized under different types of attributes/properties. For templates with no placeholders for the explainer to fill out, we only replace the placeholders for the answers (fact and foil). Table 10 lists a_0, a_1, a_2, c_{a_0}, c_{a_1}, and c_{a_2} for different examples from Winogrande and PIQA, to explain the dataset-specific transformations made by our approach.

Detection of P, Q: For WSC, the fact and foil are typically 1-word nouns. However, they may be qualified in the context, and these qualifiers are important for contrasting. For instance, in the WSC example "She remembered how annoying it is to dust her wood chair so she bought a plastic

B Human Evaluation
The annotation task was carried out on Amazon Mechanical Turk, following Shwartz et al. (2020).
To ensure the quality of annotations, workers were required to be located in the US, UK, or Canada, and to have a 99% approval rate on at least 5000 prior tasks. Annotators were paid $0.30 per HIT, so that participants earn approximately $15/hour. Annotations were aggregated from 3 workers using majority vote, and yielded moderate levels of agreement, with Fleiss' Kappa κ = 0.43 (Landis and Koch, 1977).

C Hyperparameters
Explainer PLM: For T5, we use the special sentinel tokens <extra_id_0> and <extra_id_1> in place of the blanks (___) in our templates; we observe that T5 is able to replace these tokens with multi-word phrases. For BART, we substitute each blank with a sequence of four [MASK] tokens to encourage generating multiple words; BART can choose to delete a [MASK] token during generation. Top-K decoding was done with a beam size of 200 and a maximum output sequence length of 20 for T5 models and 100 for BART; this is because T5 is pre-trained to in-fill by generating only the missing phrases, while BART is pre-trained to decode the entire input with the missing phrases filled in. We used early stopping for BART.
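A minimal sketch of converting T5's sentinel-delimited output back into a full explanation; the parsing helper and its details are our assumptions, not the released code.

import re

def fill_template(template: str, t5_output: str) -> str:
    """Substitute the spans T5 generated after each sentinel back into
    the corresponding <extra_id_k> slot of the template. Assumes the
    sentinels appear in order starting from <extra_id_0>."""
    t5_output = t5_output.replace("</s>", "").replace("<pad>", "")
    spans = [s.strip() for s in re.split(r"<extra_id_\d+>", t5_output)]
    fills = [s for s in spans if s]
    for k, phrase in enumerate(fills):
        template = template.replace(f"<extra_id_{k}>", phrase)
    return template

print(fill_template("Fields are <extra_id_0> while forests are <extra_id_1>",
                    "<pad> <extra_id_0> sparse <extra_id_1> dense </s>"))
# -> "Fields are sparse while forests are dense"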
Self-Talk: Shwartz et al. (2020) generate multiple clarification questions conditioned on the context by 1) concatenating one of several question prefixes to the input prompt or question, and 2) generating 5 questions for each prefix using Nucleus sampling with p = 0.2, i.e., sampling from the top 20% of tokens (Holtzman et al., 2019), limiting the question length to up to 6 tokens excluding the prefix. For each well-formed question, they generate multiple answers using a similar method. They limit the answer length to 10 generated tokens, and use Nucleus sampling with p = 0.5. Shwartz et al. (2020) only condition the task prediction on a single clarification question-answer pair that increases the model's belief in a certain answer choice. Thus, the score of each answer choice is the score of the text containing the clarification that most supports it, i.e., whose combination with it yields the maximum language model likelihood.

Winogrande
  Ian volunteered to eat Dennis's menudo after already having a bowl because ___ despised eating
  a_1: Ian; a_2: Dennis; a_0: he
  c_{a_0}: Ian volunteered to eat Dennis's menudo after already having a bowl because he despised eating
  c_{a_1}: Ian volunteered to eat Dennis's menudo after already having a bowl because Ian despised eating
  c_{a_2}: Ian volunteered to eat Dennis's menudo after already having a bowl because Dennis despised eating

PIQA (difference between answers is 1-2 words)
  To prepare carrots before cooking with them, you can
  a_1: Run them in the sink under boiling water
  a_2: Run them in the sink under cold water
  a_0: boiling water or cold water
  c_{a_0}: To prepare carrots before cooking with them, you can run them in the sink under boiling water or cold water
  c_{a_1}: To prepare carrots before cooking with them, you can run them in the sink under boiling water
  c_{a_2}: To prepare carrots before cooking with them, you can run them in the sink under cold water

PIQA (difference between answers is larger)
  To prevent gunk buildup in cup holders of a car,
  a_1: place coffee filters inside of the cup holders.
  a_2: pour a thin layer of oil into the cup holders.
  a_0: place coffee filters inside of the cup holders or pour a thin layer of oil into the cup holders.
  c_{a_0}: To prevent gunk buildup in cup holders of a car, place coffee filters inside of the cup holders or pour a thin layer of oil into the cup holders
  c_{a_1}: To prevent gunk buildup in cup holders of a car, place coffee filters inside of the cup holders
  c_{a_2}: To prevent gunk buildup in cup holders of a car, pour a thin layer of oil into the cup holders

Table 10: Examples of Winogrande and PIQA, with fact, foil, neutral answer, and the respective substituted contexts used in our approach for prompting the explainer PLM or computing answer likelihoods.

Original Input: The geese prefer to nest in the fields rather than the forests because in the ___ the predators are more hidden.

(i) Context-Only
  Input to task model: The geese prefer to nest in the <mask1> rather than the <mask2> because in the ___ the predators are more hidden.

(ii) Fully Abstracted
  Input to explainer: The geese prefer to nest in the <mask1> rather than the <mask2> because in the ___ the predators are more hidden.
  Generated Explanation: <mask1> is smaller than <mask2>
  Input to task model: The geese prefer to nest in the <mask1> rather than the <mask2> because in the ___ the predators are more hidden. <mask1> is smaller than <mask2>

(iii) Abstraction after Explanation
  Input to explainer: The geese prefer to nest in the fields rather than the forests because in the ___ the predators are more hidden.
  Generated Explanation: Forests have more predators than fields
  Input to task model: The geese prefer to nest in the <mask1> rather than the <mask2> because in the ___ the predators are more hidden. <mask2> have more predators than <mask1>

Table 11: Input to explainer and task model for abstractive evaluation.
Unconstrained Generation: For the unconstrained explanation baseline, the maximum output sequence length was set to 20 and the beam size for beam decoding was set to 200. Again, we use early stopping.

Characteristics: WSC and PIQA
  OPT1 is/are smaller than OPT2
  OPT1 is/are larger than OPT2
  OPT1 is/are slower than OPT2
  OPT1 is/are faster than OPT2
  OPT1 is ___ than OPT2
  OPT1 are ___ than OPT2
  OPT1 is ___ while OPT2 is ___
  OPT1 is ___ but OPT2 is ___
  OPT1 is ___ however OPT2 is ___
  OPT1 are ___ while OPT2 are ___
  OPT1 are ___ but OPT2 are ___
  OPT1 are ___ however OPT2 are ___
  OPT1 has ___ while/but/however OPT2 has/does not have ___
  OPT1 have ___ while/but/however OPT2 have/do not have ___
  OPT1 is made of/to ___ however OPT2 is made of/to ___
  OPT1 is made of/to ___ while OPT2 is made of/to ___

Spatial: WSC and PIQA
  OPT1 is above OPT2
  OPT1 is below OPT2
  OPT1 is to the right of OPT2
  OPT1 is to the left of OPT2
  OPT1 is inside OPT2
  OPT1 is outside OPT2
  ___ is closer to OPT1 and farther away from OPT2
  OPT1 is closer to ___ while OPT2 is farther away from ___

Usecase: WSC (no PERSON entity) and PIQA
  OPT1 can ___ while OPT2 can/cannot ___
  OPT1 is/can be used for OPT2
  OPT1 is/can be used to do OPT2
  OPT1 is/can be used for ___ but OPT2 cannot
  OPT1 is/can be used for ___ while OPT2 is used for ___
  OPT1 is/can be used for ___ but OPT2 is used for ___
  OPT1 is/can be used to ___ while OPT2 is used to ___
  OPT1 is/can be used to ___ but OPT2 is used to ___

Causes: WSC (no PERSON entity) and PIQA
  OPT1 has ___ because ___ while OPT2 is ___ because ___
  OPT1 can cause ___ while OPT2 causes/results in ___
  Since it can ___ OPT1 but not OPT2
  Since it can ___ OPT1 but because it is not ___ it can't OPT2

Miscellaneous: WSC (no PERSON entity) and PIQA
  ___ can be OPT1 but cannot be OPT2
  OPT1 means to ___ while OPT2 means to ___
  OPT1 is defined as ___ while OPT2 is defined as ___
  OPT1 ___ OPT2
  OPT1 ___ but not OPT2
  OPT1 exists while an OPT2 doesn't

Table 12: Complete list of contrastive patterns used in this work.