CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks

Recent efforts in natural language processing (NLP) commonsense reasoning research have yielded a considerable number of new datasets and benchmarks. However, most of these datasets formulate commonsense reasoning challenges in artificial scenarios that are not reflective of the tasks which real-world NLP systems are designed to solve. In this work, we present CRoW, a manually-curated, multi-task benchmark that evaluates the ability of models to apply commonsense reasoning in the context of six real-world NLP tasks. CRoW is constructed using a multi-stage data collection pipeline that rewrites examples from existing datasets using commonsense-violating perturbations. We use CRoW to study how NLP systems perform across different dimensions of commonsense knowledge, such as physical, temporal, and social reasoning. We find a significant performance gap when NLP systems are evaluated on CRoW compared to humans, showcasing that commonsense reasoning is far from being solved in real-world task settings. We make our dataset and leaderboard available to the research community at https://github.com/mismayil/crow.


Introduction
Commonsense reasoning is a long-standing challenge in artificial intelligence (AI) and natural language processing (McCarthy, 1960; Winograd, 1974; Davis and Marcus, 2015; Choi, 2022), resulting in a large number of datasets and benchmarks designed to evaluate how AI systems reason in commonsense scenarios described in natural language (Davis, 2023). Recently, large language models, such as GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022), have demonstrated near-human performance on many of these benchmarks (Lourie et al., 2021). However, these models can still be brittle in practical deployments, raising questions about how reliably these commonsense benchmarks truly evaluate the commonsense reasoning abilities of models.
Part of this issue stems from the fact that most commonsense datasets are designed to evaluate reasoning in artificial task settings that are not reflective of the real-world use cases in which NLP systems are deployed. In real-world settings, one almost never directly observes a test of commonsense knowledge in isolation. In this paper, we argue instead that commonsense reasoning benchmarks should evaluate commonsense reasoning in the tasks in which these abilities are required.
The necessity of commonsense for solving real-world tasks has been extensively argued since the early stages of AI, notably by Bar-Hillel (1960) in the context of machine translation. However, despite these early arguments, only recently was there an attempt to construct a commonsense reasoning dataset for machine translation (He et al., 2020), an effort which concluded that the commonsense reasoning abilities of modern models were still in their infancy when applied in real NLP tasks.
In this work, we build on these original ideas and introduce CROW: a Commonsense Reasoning Benchmark for Real-World Tasks, a multi-task benchmark containing high-quality datasets for six real-world NLP tasks: machine translation (MT), open-domain dialogue (DG), dialogue summarization (DS), intent detection (ID), stance classification (SC), and safety detection (SD). Inspired by Winograd schemas (Levesque et al., 2011), we build our benchmark by applying commonsense-based minimal perturbations to examples from existing datasets for each task. For each of these tasks, we crowdsource collections of potential target references for the task, each grounded to a particular commonsense violation with respect to the original context (see Figure 1 for examples in dialogue response generation). We categorize these commonsense violations across six dimensions: temporal, causal, attribution, comparison, physical, and social, ensuring a diverse breakdown of commonsense reasoning types in CROW.
Our empirical study across 13 state-of-the-art (SoTA) systems shows that CROW is a challenging commonsense reasoning testbed, with the highest-performing model scoring ∼18% lower than humans on individual examples and ∼37% lower on our more restrictive metric that evaluates situational robustness. Consequently, we provide CROW to the community as the first commonsense benchmark specifically formed to test commonsense knowledge and reasoning abilities in the same contexts as real-world deployments of NLP systems. The contributions of our work can be summarized as follows:
• We design a common multi-stage data collection pipeline for generating commonsense-based Winograd-style variations of examples, which can be applied to many tasks.
• We apply our data collection pipeline to construct CROW, a multi-task benchmark that evaluates the commonsense reasoning ability of models in solving six diverse real-world NLP tasks.
• For each task, we evaluate and analyze the performance of state-of-the-art models on our benchmark across different dimensions of commonsense knowledge.

Related Work
Commonsense Reasoning Benchmarks. Many benchmarks measuring the commonsense reasoning abilities of state-of-the-art models have been released in recent years. Starting with the well-known Winograd Schema Challenge (WSC; Levesque et al., 2011), these benchmarks have attempted to test the commonsense reasoning ability of models using different task formats, such as pronoun resolution (Levesque et al., 2011; Rudinger et al., 2018; Eisenschlos et al., 2023), question answering (Talmor et al., 2019; Zellers et al., 2018; Chen et al., 2019; Reddy et al., 2019; Zellers et al., 2019), plausible inference (Roemmele et al., 2011; Bhagavatula et al., 2019; Wang et al., 2019b; Singh et al., 2021; Gao et al., 2022), and natural language generation (Lin et al., 2020b). Benchmarks have also been created to evaluate commonsense reasoning across different dimensions of commonsense knowledge, including social (Rashkin et al., 2018a,b; Sap et al., 2019b), physical (Bisk et al., 2019; Dalvi et al., 2018; Storks et al., 2021), temporal (Qin et al., 2021; Zhou et al., 2019), and numerical reasoning (Lin et al., 2020a). Additionally, there exist comprehensive multi-task benchmarks that consist of several new or existing datasets for commonsense reasoning (Tamari et al., 2022; Srivastava et al., 2022; Wang et al., 2019a). For a thorough survey of this area, we refer readers to Storks et al. (2019) and Davis (2023). In contrast to these benchmarks, where the underlying task formulation is typically not grounded in a real-world setting, we construct CROW to specifically focus on evaluating commonsense reasoning in real-world tasks for which NLP systems would be deployed.
Commonsense Reasoning in Real-World Contexts. A few recent works have explored the role of commonsense knowledge in real-world settings, such as open-ended response generation (Zhou et al., 2021; Ghosal et al., 2021, 2022), machine translation (He et al., 2020), and reading comprehension (Zhang et al., 2018; Huang et al., 2019), and have proposed new commonsense reasoning tasks and benchmarks. We build on top of these benchmarks and extend them to several other real-world NLP tasks, along with a general data collection methodology for commonsense knowledge annotation and Winograd-style schema generation that can be applied to other tasks in the future.

Data Collection
Our goal is to assess the ability of NLP systems to apply commonsense reasoning in real-world tasks.
To this end, we define a general methodology and multi-stage data collection pipeline (Figure 2) for generating evaluation examples that require commonsense reasoning in a given real-world task. In what follows, we outline our general data collection methodology and describe each step in detail.

Overview
The Winograd Schema Challenge (Levesque et al., 2011), an often-used benchmark for measuring commonsense reasoning abilities, tests whether models can distinguish the meaning of pairs of sentences with commonsense-based minimal perturbations that flip their meaning. For example, given the sentence, "The trophy doesn't fit into the brown suitcase because it's too large," models should identify that the pronoun "it" refers to the "trophy" (using commonsense knowledge), but distinguish that replacing the word "large" with "small" would flip this reference to "suitcase". Winograd-style schemas have been widely adopted for tasks involving pronoun resolution (Rudinger et al., 2018; Eisenschlos et al., 2023; Thrush et al., 2022), but also sensemaking (Wang et al., 2019b; Singh et al., 2021) and reasoning about exceptions (Do and Pavlick, 2021). While these schemas are simple and effective for measuring the commonsense robustness of models, they are rarely applied in real-world tasks. Motivated by this gap, we construct CROW, a benchmark of Winograd-style examples for real-world NLP tasks. While the inherent subtlety of commonsense-based minimal perturbations led the original Winograd schemas to be expert-crafted and limited in size, later works developed large-scale sets of Winograd schemas using crowdsourcing and adversarial filtering (Sakaguchi et al., 2019). In our work, we also employ crowdsourcing to generate Winograd-style perturbed examples, but our approach differs in one key aspect. Instead of asking crowdworkers to perturb the given sentences directly, we design a data collection pipeline that breaks down the schema construction into two independent stages: Commonsense Knowledge Annotation (CKA) and Winograd-style Schema Generation (WSG), each of which is followed by a complementary validation stage. Figure 2 illustrates the pipeline for the intent detection task.
This multi-stage approach has two key benefits. First, we ground the perturbations to commonsense dimensions, ensuring the Winograd-style schemas differ on commonsense violations. Using these dimensions, we also ensure a diverse set of perturbations across different types of commonsense knowledge, allowing us to stratify our later analysis across these dimensions to more finely understand model failures in commonsense reasoning. Second, a particular stage can be skipped if the data for it is already available, which is the case for several tasks in our benchmark. We use Amazon Mechanical Turk (MTurk) as our crowdsourcing platform. Below, we describe each stage in detail.

Methodology
For a given task example, we define the context as the unchanged part of the example and the target as the candidate for commonsense-based minimal perturbation. For example, in intent detection, we designate the headline as the context and the intent as the target. In Table 6 in the Appendix, we list the respective mappings for all tasks.
Commonsense Knowledge Annotation and Validation. In the first stage of our pipeline, we explicitly annotate the implicit commonsense knowledge underlying examples in real-world task datasets. In this stage, crowd workers are tasked to identify concepts in the context and target that could serve as the head and tail of an implicit commonsense relationship, as well as a pre-existing relation that connects them. For example, in Figure 2, for an example from an intent detection task (Gabriel et al., 2022), a headline "Remote glaciers in China melting at shocking pace, risking water shortages" and an intent "Climate change is real and is showing its effects" would be presented to crowdworkers. They might connect these two statements with the knowledge "water shortage is a type of effect", which would be represented as (head: water shortages, relation: IsA, tail: effect). Based on earlier work (Ilievski et al., 2021; Speer et al., 2016; Sap et al., 2019a; Ghosal et al., 2021), we also categorize relations into six dimensions of commonsense knowledge: Attributional, Physical/Spatial, Temporal, Causal, Social, and Comparative. Figure 3 shows the distribution of dimensions per task (Appendix B.2 provides more details on the selection and categorization of the relations). The dimensions serve as support for a fine-grained analysis of the commonsense reasoning abilities of models when tackling tasks. Following the CKA stage, we apply a validation phase to filter out low-quality annotations. For example, in Figure 2, the knowledge "glacier HasProperty real" would be filtered by crowd workers as it is not helpful for the task in the given context. Each annotation is verified by three unique workers, and we take the majority vote as the qualifying threshold for the next stage.

Figure 2: CRoW Data Collection Pipeline (as illustrated for the Intent Detection Task). Given a context (news headline) and a target (writer's intent behind it), in the first phase of the pipeline, annotators are asked to identify commonsense knowledge about this context. In the second phase, annotators use the commonsense knowledge from the previous phase to minimally perturb the target to generate a Winograd-style schema for the given example. Each annotation stage is also followed by its own validation step.
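As an illustration, the triple format and the majority-vote filter described above can be sketched as follows (a minimal sketch; the class and function names are our own, not from the CRoW codebase):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CommonsenseTriple:
    """One annotated piece of implicit commonsense knowledge."""
    head: str       # concept drawn from the context or target
    relation: str   # e.g. "IsA", "Causes", "HasProperty"
    tail: str       # concept it connects to

def passes_validation(votes: list[bool]) -> bool:
    """Keep an annotation only if a majority of validators marked it helpful."""
    return sum(votes) > len(votes) / 2

# The running example from Figure 2.
triple = CommonsenseTriple("water shortages", "IsA", "effect")

# Three unique workers verify each annotation; the majority vote decides.
assert passes_validation([True, True, False])       # 2 of 3: kept
assert not passes_validation([True, False, False])  # 1 of 3: filtered
```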

Winograd Schema Generation and Validation
In this stage, we present workers with a context, a target, and the associated commonsense knowledge from the previous stage, and ask them to rewrite the target such that it satisfies the following four conditions (additional details on the generation instructions can be found in Appendix B.3.1). The new target must (1) minimally differ from the original target (i.e., by an edit distance of at most five words), (2) directly violate the given commonsense knowledge, (3) be an incorrect answer for the given context, and (4) be contextually relevant. Conditions (1) and (2) are based on the core design of Winograd schemas, and we introduce conditions (3) and (4) to increase the difficulty of the generated schemas. Each annotated schema is further validated by three unique workers with respect to the conditions above, and those with at least two valid votes proceed to the final expert validation stage. For example, in Figure 2, given the knowledge "water shortages IsA effect", annotators might produce Winograd-style schemas where the word "effect" in the given intent is replaced with related concepts such as "causes" or "results". However, as "results" would not change the underlying intent of the example, the schema based on this replacement would not satisfy condition (3) above, and hence would be filtered in the validation stage. In Appendix B.3, we provide more examples of violations of each condition.
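Condition (1), that the perturbed target differ from the original by at most five words, can be checked with a word-level edit distance. A minimal sketch (our own illustration, not the authors' validation code):

```python
def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over word tokens (insertions, deletions, substitutions)."""
    xs, ys = a.split(), b.split()
    dp = list(range(len(ys) + 1))  # distances for the empty prefix of xs
    for i, x in enumerate(xs, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(ys, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete x
                                     dp[j - 1] + 1,    # insert y
                                     prev + (x != y))  # substitute (free if equal)
    return dp[-1]

# The perturbation from Figure 2 replaces a single word, so it passes condition (1).
original = "Climate change is real and is showing its effects"
perturbed = "Climate change is real and is showing its causes"
assert word_edit_distance(original, perturbed) <= 5
```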

Data Quality Verification
Qualification. In order to collect high-quality annotations, we design a qualification test consisting of multiple-choice and open-ended questions. Following earlier work that identified the importance of a large pool of annotators for data diversity (Geva et al., 2019), we qualify 58 workers located in the US based on a precision threshold of 0.8 on the multiple-choice questions and a manual review of open-ended commonsense knowledge annotations. Based on best practices for an effective crowdsourcing protocol (Nangia et al., 2021), we further train the annotators on a small sample of examples from our tasks, regularly engaging with them and sending feedback during the whole data collection process. Instruction templates and details about this test can be found in Appendix B.1.

CROW
CROW consists of six real-world NLP tasks where commonsense reasoning ability is implicitly required to solve the task correctly. Initially, to select tasks that could serve as good testbeds for CROW, we followed Davis (2023) and identified the following desiderata for tasks in the benchmark: (1) tasks should represent real-world applications of natural language technologies (e.g., machine translation), (2) tasks should involve rich commonsense knowledge use and reasoning, and (3) tasks should be easy for humans. Our final benchmark contains ∼5K unique contexts, with ∼500 unique contexts per task (on average) and ∼16K examples (i.e., context-target pairs) in total. Table 1 provides statistics about our benchmark (additional statistics can be found in Table 6 in the Appendix). In this section, we outline the methodology for selecting real-world tasks that require commonsense reasoning, as well as a brief overview of each task included in our benchmark.

Task Selection
To identify NLP tasks that satisfy our desiderata above, we first crawl papers from the whole ACL Anthology published since the year 2000 (approximately 94K papers). Next, we select the papers that include an error analysis and mention commonsense or world knowledge in their categories of errors. This step results in around 200 papers. A further manual review of these papers to filter out false positives reduces this number to 82, and we categorize and group the resulting papers by task, which yields around 25 potential tasks.
Out of these discovered tasks, we select three classic NLP tasks, machine translation, summarization, and dialogue response generation, that are also often used to evaluate the abilities of general generative language models. In addition, we select three tasks that are more applied and specialized: intent detection, stance classification, and safety detection. Other tasks discovered as part of this pipeline include toxicity detection, relation extraction, and fact-checking. However, due to the difficulty of generating commonsense-violating perturbations for these tasks (caused by their factual or obscene nature), we leave their integration into our benchmark as future work.

CROW Tasks
We apply our pipeline (§3) to the six real-world tasks identified in the selection phase. For each task, we select a recent existing dataset that contains contexts rich with the use of commonsense knowledge. Some of the chosen datasets already include annotations for commonsense knowledge or Winograd schemas, allowing us to skip parts of the pipeline. Here, we describe these tasks and datasets in more detail and identify task-specific variations of the pipeline for each.

Machine translation (MT) is known to require commonsense knowledge (Bar-Hillel, 1960) to resolve translation errors. We select the test suite constructed by He et al. (2020) for Chinese-English translation and the Wino-X dataset (Emelin and Sennrich, 2021) for English to German, French, and Russian translation. Both datasets consist of Winograd-style examples containing a source sentence and two translations that minimally differ from each other, but only one of which is correct due to underlying commonsense knowledge.

Open-domain dialogue (DG) is a core real-world NLP task requiring systems to produce chat responses to a given conversational context. The important role of commonsense knowledge and reasoning in open-domain dialogue systems has been well-documented (Richardson and Heck, 2023). For this task, we choose the CIDER dataset (Ghosal et al., 2021), which already contains expert-annotated commonsense knowledge that connects utterances in different turns of the dialogue.

Dialogue summarization (DS) is another NLP task with real-world applications (e.g., meeting and email summarization). Moreover, enhancing summarization models with commonsense knowledge has been shown to generate more informative and abstractive summaries (Kim et al., 2022). For this task, we choose the test split of the DialogSum dataset (Chen et al., 2021), which contains real-life dialogues along with their abstractive summaries.

Intent detection (ID) is the task of identifying the underlying intent of the author of a text. As the intent is
typically implicit, it involves significant use of commonsense knowledge. For this task, we choose the Misinformation Reaction Frames dataset proposed by Gabriel et al. (2022), which contains news headlines along with the news writers' intents behind them and readers' reactions to them.

Stance classification (SC) involves inferring the stance (either supporting or opposing) of an argument given a belief. Such a task typically requires understanding social, cultural, or ontological commonsense knowledge. We use the ExplaGraphs dataset (Saha et al., 2021), which provides, for each argument-belief pair, a crowdsourced commonsense explanation graph that explains the stance between the two sentences through a set of commonsense knowledge triplets.

Safety detection (SD), detecting safe actions in a given scenario, has real-world applications, especially in the deployment of autonomous robots and systems capable of giving advice. This task requires the use of commonsense knowledge, especially when the action is not explicitly violent, which makes it much harder for the system to assess its safety. For this task, we use the SafeText dataset (Levy et al., 2022), where each sample consists of a sentence describing a real-life scenario and a list of safe and unsafe actions that could be taken in these situations.

Experimental Setup
Task Formulation. All tasks in CROW are treated as binary classification tasks. Given a context, a model must predict whether a provided target is a suitable response for the corresponding real-world task. For instance, in machine translation, given an English sentence and a translated sentence in French, the model must predict whether the translation is valid or not.
Evaluation Metrics. We evaluate models on CROW using two scores: Macro-F1 for predicting valid and invalid targets, and Situational Accuracy, a stringent metric that reports whether the model correctly identifies the validity (or invalidity) of all targets for a given context (similar to the strict coherence score of Storks and Chai, 2021). A single mistake on any target results in a score of 0 for that context. We design this metric to account for the fact that robust commonsense reasoning would provide the model with a full situational understanding of the context. The CROW score is computed as a macro-average of the task scores.
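The two scores can be made concrete with a small sketch (our own illustration; function names are hypothetical). Macro-F1 averages the per-class F1 of the valid and invalid labels, while situational accuracy credits a context only when every one of its targets is classified correctly:

```python
from collections import defaultdict

def macro_f1(golds: list[bool], preds: list[bool]) -> float:
    """Average of per-class F1 over the two labels (valid / invalid)."""
    f1s = []
    for label in (True, False):
        tp = sum(g == p == label for g, p in zip(golds, preds))
        fp = sum(p == label != g for g, p in zip(golds, preds))
        fn = sum(g == label != p for g, p in zip(golds, preds))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

def situational_accuracy(examples: list[tuple[str, bool, bool]]) -> float:
    """examples: (context_id, gold, pred). A context scores 1 only if
    every target attached to it is predicted correctly."""
    by_context = defaultdict(list)
    for ctx, gold, pred in examples:
        by_context[ctx].append(gold == pred)
    return sum(all(v) for v in by_context.values()) / len(by_context)

# Two contexts, two targets each; one mistake on context "c2".
data = [("c1", True, True), ("c1", False, False),
        ("c2", True, True), ("c2", False, True)]
print(situational_accuracy(data))  # 0.5: only "c1" is fully correct
```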
Models. We evaluate a series of language models that are diverse in terms of scale, training, and data:
• LLaMA (Touvron et al., 2023), an open-source decoder-only model with various sizes (7B, 13B, 33B parameters), and PaLM-1-540B (Chowdhery et al., 2022), a closed-source decoder-only model with 540B parameters. Both models are pretrained using only a language modeling loss.
For GPT-3.5, we use the text-davinci-003 model with 175B parameters.
• Alpaca (Taori et al., 2023) (Xue et al., 2020). All models are evaluated using one-shot in-context learning and greedy decoding (further results with varying temperature values are in Appendix E). We use the same task-specific prompt templates for all models (more details on prompt templates are in Appendix C). We also report the performance of a random baseline that randomly chooses whether a context-target pair is valid, and a majority baseline, which selects the most frequent label for each task.
Human Evaluation. We evaluate human performance on each task of the benchmark using two expert annotators who evaluate 100 random samples from the task. Our experts are NLP researchers from our lab who were not involved in the original data collection. As a result, they are more experienced, can clarify misunderstandings in the annotation guidelines with us, and generally produce more careful annotations than crowd workers. Following Amidei et al. (2018) and Oortwijn et al. (2021), we intentionally allow evaluators to discuss and reach a final answer in cases of disagreement, which reduces variance and yields a robust upper bound for our task. In Appendix D, we provide further details on the number of resolved disagreements, the human performance before and after the discussion, and the statistical significance of the human evaluation results.

Results
Table 2 reports the results for all models across all tasks. In general, we observe that models vary in their ability to identify the correct responses in the tasks. As expected, GPT-4 outperforms most other models, many of which actually perform worse than the random baseline (e.g., all LLaMA variants). Even among stronger models, though, while performance is higher for individual examples (as measured by Macro-F1), the situational accuracy is significantly lower, often below 50%. This gap suggests that these models are not robust and fail to grasp a full situational understanding of the contexts with which they are presented (even as they may correctly classify some individual cases). In contrast, humans tend to perform well on both metrics (with little gap between individual example performance and situational accuracy). Perhaps most surprisingly, our results show that chain-of-thought prompting harms the performance of GPT-4 on some of the tasks, particularly the dialogue task (DG), where performance drops by 17.1% in Macro-F1 and 23.7% in Situational Accuracy (underlined in Table 2). This behavior perhaps hints that chain-of-thought decoding is less useful in commonsense tasks requiring implicit, intuitive inferences, rather than complex, multi-step reasoning. In the Analysis section, we provide more details on the possible causes for the discrepancy in performance, with examples.
Instruction-tuning. Models that were trained only with language modeling objectives (e.g., LLaMA and PaLM) obtain lower scores compared to instruction-tuned models of similar size. For example, Alpaca, which is an instruction-tuned version of LLaMA-7B, achieves an average ∼10% improvement over LLaMA-7B across most tasks for both metrics. Also, smaller instruction-tuned models can match or exceed the performance of much larger models (e.g., GPT-3.5 outperforms PaLM). Finally, we find that StableVicuna surprisingly performs worse than Vicuna, suggesting that while instruction-tuning improves performance on CROW, training with RLHF does not necessarily amplify the commonsense reasoning abilities required for these tasks.
Scale. When we compare the same model at different scales, we do not find a consistent benefit to increasing the size of the model, except on the safety detection task, where LLaMA-33B achieves improvements of 14.2% and 11.4% over LLaMA-7B and LLaMA-13B, respectively.
Multilinguality. Most of these models are officially monolingual, though they may have been pretrained on some non-English data. Since one of our testbed tasks centers on machine translation, we evaluate multilingual models on our benchmark. BloomZ performs better than mT0 across most tasks. Certain monolingual models (i.e., those with >100B parameters) outperform BloomZ on translation tasks, suggesting these models have seen multilingual data during their pretraining phase.

Analysis
Dimensions of Commonsense Knowledge. Table 3 reports the performance of different models across different commonsense knowledge dimensions. We observe that these models perform fairly consistently across examples grounded in different commonsense dimensions, indicating that they do not generally learn more reliable commonsense reasoning skills of one variety compared to another. Part of this uniformity is due to conceptual overlap between commonsense dimensions (e.g., certain social commonsense relations may also reflect causal commonsense), a nuance that is not captured by our methodology, which requires annotating a single relation for each piece of commonsense knowledge. Regardless of this overlap, our findings highlight that, in real-world tasks, there remains room for commonsense modeling improvement across most types of commonsense knowledge.
Can models leverage explicit commonsense knowledge? When constructing CROW, we annotate the implicit commonsense relationship required to understand whether a target is valid or invalid given a context. To test whether models can leverage this implicit commonsense relationship for making a correct prediction, we perform an oracle experiment where we augment the prompt with each example's annotated commonsense knowledge triplet. Table 4 shows the impact of adding commonsense knowledge to the prompt for the best closed-source (GPT-4) and open-source (Flan-T5 and BloomZ-7B) models. Prompting with commonsense knowledge slightly increases the average performance of Flan-T5 (+1.5% Macro-F1, +4.9% Sit. Accuracy) and GPT-4 (+1.8% Macro-F1, +2.6% Sit. Accuracy), indicating that even with partial commonsense knowledge, models still fall short of robustly reasoning about the presented situations. However, the largest improvement is observed for GPT-4-CoT, indicating that the model can more accurately ground its own reasoning traces when provided with hints.
Qualitative Analysis. Following up on the lower performance of GPT-4 with CoT prompting compared to GPT-4, we qualitatively analyze the errors made by the model in the CoT setting for potential patterns in the dialogue generation task. In many cases, the reasoning process of the model either focuses solely on the relevance of the response (rather than its sensibility) or, in some cases, follows a less plausible reasoning path, such as imagining a sarcastic response. In Figure 4, we show an example where GPT-4 answers correctly without chain-of-thought but fails when prompted to "think step by step," arguing that the response is sarcastically plausible (blue box). While such a response could technically be sarcastic, it violates our commonsense idea of what would be a reasonable response to a helpful query. On the other hand, we also observe the direct effect of providing the oracle commonsense knowledge (green box) on the same example, where GPT-4 leverages the given knowledge and distinguishes between sarcastic possibility and commonsensical plausibility. In Appendix Figure 5, we provide another example where GPT-4 with chain-of-thought reasoning simply ignores the inherent contradiction created by the commonsense knowledge violation and focuses on the surface-level relevance of the response.

Figure 4: Answers generated by GPT-4 in various settings for an example from the Dialogue task. While chain-of-thought reasoning yields an incorrect answer, adding the commonsense knowledge guides the model toward the correct answer. Note that since the knowledge is annotated for the original, plausible response, "getting a promotion" is no longer relevant to the annotated, implausible response, which was modified with "got fired".

Conclusion
In this work, we propose CROW, a multi-task commonsense reasoning benchmark consisting of six real-world tasks. To construct our benchmark, we design a data collection pipeline to systematically crowdsource Winograd-style schemas based on commonsense-violating minimal perturbations.
Our evaluation of recent large language models on our benchmark shows that the performance of state-of-the-art models still falls far below human performance with respect to commonsense reasoning in real-world contexts.

Limitations
Despite our efforts to build a comprehensive benchmark, CRoW faces several limitations. First, commonsense knowledge has many dimensions, and we only consider six core ones as a basis for our commonsense knowledge annotation stage: temporal, causal, attribution, comparative, physical, and social. Second, as we employ crowdsourcing for generating the final Winograd schemas, our benchmark is susceptible to data quality issues, annotation artifacts, and biases. Lastly, in our experiments, we do not perform prompt tuning. As GPT-3/4 have been found to be sensitive to prompt construction, performance may vary when using other prompts for the same task.

A Commonsense Knowledge Dimensions
We consider widely used commonsense knowledge bases such as ConceptNet (Speer et al., 2016) and ATOMIC (Sap et al., 2019a), as well as recent works such as ATOMIC2020 (Hwang et al., 2020) and CIDER (Ghosal et al., 2021), for selecting the commonsense relations. As an initial step, we manually categorize the kinds of knowledge relations that appear for each task. Among the total 56 relations available to us from these sources, we find that 22 relations from ConceptNet (out of 36), 8 relations from ATOMIC (out of 9), 3 relations from ATOMIC2020 (out of 5), and 3 relations from CIDER (out of 6) appear most commonly. Following Ghosal et al. (2021) and Ilievski et al. (2021), we further categorize these 36 relations into 6 commonsense knowledge dimensions. In Table 5, we list the available relations for each dimension with a brief description and an example. Each relationship is represented as (A, Relation, B), where A and B refer to phrases from the context.
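The (A, Relation, B) representation and its dimension lookup can be sketched in code as follows. Note that the specific relation-to-dimension assignments below are illustrative guesses on our part; the authoritative mapping is the one listed in Table 5.

```python
from typing import NamedTuple


class KnowledgeTriple(NamedTuple):
    """A commonsense relationship (A, Relation, B) between two phrases."""
    a: str
    relation: str
    b: str


# Illustrative assignments only; the actual mapping used in CRoW
# is the one given in Table 5 of the paper.
DIMENSION_OF = {
    "Causes": "causal",
    "HasProperty": "attribution",
    "AtLocation": "physical",
    "HappensBefore": "temporal",
}


def dimension(triple: KnowledgeTriple) -> str:
    """Look up the commonsense dimension of a triple's relation."""
    return DIMENSION_OF.get(triple.relation, "other")


triple = KnowledgeTriple("building", "HasProperty", "safe spot")
```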

B Data Collection
At a high level, in the CKA stage, the given example is annotated with relevant commonsense knowledge, and in the WSG stage, the example is rewritten with a Winograd-style minimal perturbation based on the commonsense knowledge from the previous stage.

B.1 Qualification
This stage includes six multiple-choice questions (2 each for the dialogue, dialogue summarization, and machine translation tasks) about identifying the correct implicit commonsense knowledge in a given context, and one open-ended question that simulates the CKA stage for the dialogue task. We compensate workers $2 per HIT for this stage. Instruction and task templates can be found in Figures 13, 14, 15.

B.2 Commonsense Knowledge Annotation and Validation
In the first stage of the pipeline, we ask annotators to identify commonsense knowledge in a given context for a task. They are asked to either select phrases from the context or write their own phrases about the context, and to connect these phrases with one of the commonsense knowledge relations provided (Table 5). If none of the relations describes the commonsense knowledge they would like to express, they are given the option to specify a custom relationship (i.e., the Other relation). In practice, however, annotators tended to choose one of the available relations most of the time. In addition, annotators were incentivized to specify multiple commonsense knowledge relationships with an added bonus. Depending on the difficulty of the task, we pay workers $0.5-$1 per HIT in the annotation stage and $0.2-$0.5 per HIT in the validation stage. Instruction and task templates (for the Dialogue task) can be found in Figures 6, 9, 7, 8. Templates for the other tasks are similar.

B.3 Winograd Schema Generation and Validation
The second main stage of our data collection pipeline is the generation of Winograd schemas using the commonsense knowledge annotated in the CKA stage. As mentioned before, for a given context and a piece of commonsense knowledge, we ask annotators to generate alternative targets that satisfy four conditions: (1) the new target is minimally different from the original one, (2) the new target violates the given commonsense knowledge, (3) the new target is incorrect for the given task, and (4) the new target is contextually relevant. Incorrectness is defined per task (see Appendix C).
As an example, we consider the following instance from the Safety Detection task:

Example - Safety Detection:
Scenario: If you are being chased by a hungry animal
Action: get in a building and close the door
Knowledge: (building, HasProperty, safe spot); (close the door, CapableOf, blocking animal)
Label: safe

Given the example above, our goal is to modify the target (i.e., the action) such that it satisfies the conditions above. Here we provide some possible modifications and show whether the resulting target satisfies the conditions:
• find a spot where the animal can get to you: This target satisfies all conditions except (1), as it significantly modifies the original target.
• get a gun and shoot the animal: While this target is otherwise a good one, it does not really violate the given commonsense knowledge. It is also not minimally different from the original one.
• get in a building and lock the door: This target is not valid because the action is still safe (in fact, even safer), so condition (3) is violated.
• get in an alien spaceship and close this chapter of your life: While this target seems to satisfy most conditions, it is largely out of context, so condition (4) is violated.
• find a building and close your eyes: This target satisfies all the conditions above.
Depending on the difficulty of the task, workers were paid $0.5-$0.8 per HIT in the generation stage and $0.2-$0.5 per HIT in the validation stage. Instructions and task templates (for the Dialogue task) can be found in Figures 10, 11, 12. Templates for the other tasks are similar.

B.3.1 Instructions
Minimal Change Rules During the WSG step of the annotation pipeline, the generated alternative sentence has to be deceptively close to (hard to differentiate from) the original sentence, but opposite to it in terms of commonsense knowledge or label. Thus, we asked annotators to follow a set of rules, which we report here (from the Safety Detection task): • You are allowed to change up to 5 words in the action. Note that you can also swap the existing words in place instead of replacing them, as long as the resulting action satisfies the conditions above.
• You should avoid simply negating a word in the action unless that is the only way to achieve the goal. The goal in this task is not to achieve the minimal difference, but to produce a semantically very close action that nevertheless flips the safety value. So, if you can creatively change 2 or 3 dependent words to achieve this, then go for it rather than changing one word, such as adding/removing "not".
• Your change should directly target the given commonsense knowledge such that, semantically, the resulting action differs from the original one with respect to this knowledge.
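The "up to 5 words" rule above could, in principle, be checked automatically. The following is our own sketch of such a check (not part of the actual pipeline), counting word-level edits between the original and modified targets:

```python
import difflib


def words_changed(original: str, modified: str) -> int:
    """Count word-level replacements, insertions, and deletions
    between the original and the modified target."""
    orig_words, mod_words = original.split(), modified.split()
    matcher = difflib.SequenceMatcher(a=orig_words, b=mod_words)
    changed = 0
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # Count the larger side of a replace/insert/delete block.
            changed += max(i2 - i1, j2 - j1)
    return changed


def is_minimal_change(original: str, modified: str, limit: int = 5) -> bool:
    """True if the target was changed, but by at most `limit` words."""
    n = words_changed(original, modified)
    return 0 < n <= limit
```

For instance, going from "get in a building and close the door" to the valid perturbation "find a building and close your eyes" changes 4 words, which is within the limit.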

C Tasks
For each task we have included in the benchmark, we define a common terminology that can be applied to other tasks in the future. The context is defined as the unchanged part of the given example (i.e., the part that is not perturbed in the WSG stage) and the target as the candidate for the Winograd-style perturbation. Note that the target is not necessarily always the typical output of the model for a given task. For example, in classification tasks, the output of the model is binary, while the target is assigned to one of the inputs. For the context and target assignments for each task, please refer to Table 6.

C.1 Machine Translation (MT)
Machine translation is one of the oldest sequence-to-sequence real-world NLP tasks: given input in the source language, a system is expected to output the translation in the target language. In this task, we define the input in the source language as the context and the output translation as the candidate target for perturbation. As both datasets we use include Winograd schemas, we skip the WSG and WSV stages for this task. In the CKA step for the Chinese-English dataset, annotators are given the target translations and asked to identify the commonsense knowledge violated in the incorrect translation. For the Wino-X dataset, although the target translation is in different languages, the underlying pronoun resolution task allows us to identify the commonsense knowledge from the source sentence alone. Leveraging this fact, we employ the same English-speaking pool of qualified workers for the CKA stage on this task as well. More specifically, annotators are given the source sentence (in English) and asked to identify the commonsense knowledge that allows us to infer the antecedent of the pronoun "it" in the sentence. Here is an example of this task:

Example - Machine Translation:
Sentence (English): Bob would rather fill his emergency fund using his mobile instead of the bank because it was handy.

Example - Machine Translation - CoT:
Sentence (English): The song took longer to sing than the ballad because it was more words.
Translation (German): Das Lied brauchte länger zum Singen als die Ballade, weil sie mehr Worte enthielt.
Answer: Let's break down the sentence and its translation:
- The song: Das Lied
- took longer: brauchte länger
- to sing: zum Singen
- than: als
- the ballad: die Ballade
- because: weil
- it: sie
- was more words: mehr Worte enthielt (literally: contained more words)
The translation is almost correct, but the pronoun for 'it' should be 'es' instead of 'sie'. This is because 'it' should refer to the thing that has more words, and since the 'song' took longer to sing than the 'ballad', the 'song' should contain more words, and the correct pronoun for 'song' is 'es', not 'sie'. So the translation is not correct. <Answer>No</Answer>

C.2 Dialogue (DG)
The underlying task in Dialogue is to generate a response given a dialogue history. In this task, we define the context as the dialogue history and the target as the response to this context. We skip the CKA and CKV stages for this task, as the dataset we use comes with expert-annotated commonsense knowledge. To ensure the richness of the context and the knowledge, we filter out dialogue contexts with fewer than 4 turns and commonsense knowledge annotations that do not connect different turns in the dialogue. In the WSG stage, we ask the annotators to rewrite the final response of the given dialogue such that it satisfies our conditions for Winograd schemas mentioned above, where incorrectness is defined as implausibility.
Since in an open-domain dialogue several responses are possible for a given dialogue history, we aim to generate responses that violate some commonsense knowledge about the dialogue and are hence implausible. However, since most of the commonsense knowledge in dialogues is contextual, violating this knowledge does not automatically make the response implausible, so we explicitly enforce a separate condition to ensure implausibility. For example, given the following dialogue, A: where will you have your birthday party? B: oh it is at my uncle's house, the contextual commonsense knowledge can be the fact that (parties, AtLocation, uncle's house). Consequently, a possible Winograd schema generated by violating this knowledge could be B: oh it is at my friend's house. However, this is not a correct Winograd schema for this task, as it is a perfectly fine response to the dialogue. The implausible response here should target the more general commonsense knowledge that "parties happen at people's houses". In addition, we also ask annotators to avoid generating examples that are implausible independent of the dialogue context, to make sure the generations are not too easy for models to guess even in the absence of context.
Here is an example of this task:

Example - Dialogue:
Dialogue:
A: Good morning, sir. Is there anything I can do for you?
B: I would like to buy two bottles of brandy.
A: How about this one? It's the special local product.
B: Can I buy these tax free?
Response: A: Yes. This is not a duty-free shop.
Knowledge: (duty-free shop, Implies, tax free)
Label: Not plausible (No)

We use the following prompt templates:

Prompt - Dialogue:
You are a helpful assistant for dialogue understanding. Given the following dialogue between person A and B, answer whether the given response can plausibly follow this dialogue. Answer only 'Yes' or 'No'.
Example 1:
Dialogue: {dialogue}
Response: {response}
Answer: {answer}
Example 2: ...

Prompt -Dialogue -CoT:
You are a helpful assistant for dialogue understanding. Given the following dialogue between person A and B, answer whether the given response can plausibly follow this dialogue. Let's work this out in a step-by-step way to be sure that we have the right answer. Then provide your final answer within the tags, <Answer>Yes/No</Answer>.
Example 1:
Dialogue: {dialogue}
Response: {response}
Answer: {answer}
Example 2: ...

Prompt -Dialogue + Commonsense Knowledge:
You are a helpful assistant for dialogue understanding. Given the following dialogue between person A and B and relevant knowledge about this dialogue, answer whether the given response can plausibly follow this dialogue. Answer only 'Yes' or 'No'.
Example 1:
Dialogue: {dialogue}
Response: {response}
Knowledge: {knowledge}
Answer: {answer}
Example 2: ...

The template for CoT+Commonsense Knowledge is similar. We use the following example as a one-shot sample for the CoT prompt of this task:

Prompt - Dialogue - CoT Example:
Dialogue:
A: (Before Christmas Party) Are you ready for the Christmas party tonight?
B: Almost. I have to get dressed. It's a formal party and I have special party make up!
A: Use this lipstick and it will make your lips shine!
Response: B: Great! Uh, remember that there's a rocket launch, too. We all have to bring a gift.
Answer:
Step 1: Identify the main topics in the dialogue.
- Christmas party
- Getting dressed
- Formal party
- Special party make up
- Lipstick
Step 2: Analyze the response.
- The response mentions a rocket launch, which is not related to the main topics in the dialogue.
- The response mentions bringing a gift, which could be related to the Christmas party.
Step 3: Determine if the response can plausibly follow the dialogue.
The mention of a rocket launch seems out of context and unrelated to the dialogue. In addition, the second part of the response mentions an obligation to bring a gift, which would not follow from the first part, as a rocket launch event typically does not require bringing a gift. A plausible event would be a gift exchange event. So the response does not plausibly follow the dialogue.
Final Answer: <Answer>No</Answer>
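The few-shot prompts above can be assembled programmatically. The following is a minimal sketch of our own (function and variable names are ours, not part of the released code):

```python
# Instruction text taken from the Dialogue prompt template above.
SYSTEM = (
    "You are a helpful assistant for dialogue understanding. "
    "Given the following dialogue between person A and B, answer whether "
    "the given response can plausibly follow this dialogue. "
    "Answer only 'Yes' or 'No'."
)


def build_prompt(demos, dialogue, response):
    """Assemble a few-shot prompt from (dialogue, response, answer) demos
    followed by the query instance, whose answer is left blank."""
    parts = [SYSTEM]
    for i, (d, r, a) in enumerate(demos, 1):
        parts.append(f"Example {i}:\nDialogue: {d}\nResponse: {r}\nAnswer: {a}")
    parts.append(f"Dialogue: {dialogue}\nResponse: {response}\nAnswer:")
    return "\n\n".join(parts)
```

A usage example: `build_prompt([(demo_dialogue, demo_response, "No")], dialogue, response)` yields a string ending in "Answer:", ready for the model to complete.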

C.3 Intent Detection (ID)
In this task, we treat the text of the author as the context and the intent as the target for perturbation. For the dataset we use for this task, the headline serves as the context and the writer intent as the target. The full pipeline is applied to this dataset, and as a preprocessing step, we filter out examples with too-short headlines or intents. Here is an example of this task:

Example - Intent:
Headline: Hospitals on lockdown as first COVID vaccine patients start eating other patients.
Intent: a hospital is on lockdown due to covid patients kissing other patients after getting the vaccine.

Knowledge: (COVID vaccine, Causes, eating other patients)
We use the following prompt templates:

Prompt - Intent:
You are a helpful assistant for intent classification. Given a news headline and a news writer's intent, answer whether the intent is correct for the headline. Answer only Yes or No.
Example 1:
Headline: {headline}
Intent: {intent}
Answer: {answer}
Example 2: ...

Prompt -Intent -CoT:
You are a helpful assistant for intent classification. Given a news headline and a news writer's intent, answer whether the intent is correct for the headline. Let's work this out in a step-by-step way to be sure that we have the right answer.

Step 1: Analyze the headline
The headline states that authorities in Andalusia will delay vaccines because they bought millions of syringes that will not work to distribute the COVID-19 vaccine. This shows that there is an incompatibility between the bought syringes and the syringes required for the vaccine.
Step 2: Analyze the intent The intent states that the vaccine requires specific needles to apply.This means standard syringes might not be suitable.
Step 3: Compare the headline and intent
The headline implies that the syringes purchased are not suitable for distributing the COVID-19 vaccine, which aligns with the intent stating that specific needles are required to apply the vaccine. So, the given intent is the correct one for this headline.

C.4 Stance Classification (SC)
Stance classification is a task where, given a belief and an argument, the stance of the argument toward the belief is predicted. Since the dataset we chose for this task is already annotated with commonsense knowledge, we skip the first two steps of the pipeline, CKA and CKV. Similarly to the other selected tasks, we filter out examples with short sentences to give more degrees of freedom to crowd workers in the WSG step. Moreover, in this task, the context and the target are chosen dynamically: we treat both sentences (the belief and the argument) equally, allowing workers to select the one to modify. Here is a chain-of-thought answer for an example of this task:

Step 1: Analyze the belief
According to the belief, cosmetic surgery should be allowed, which might mean that it is not risky.
Step 2: Analyze the argument
The argument states that cosmetic surgery is not worth the risk, so it assumes that there are risks involved, but that it is not worthwhile to take the risk.
Step 3: Compare the belief and argument
The belief supports cosmetic surgery, while the argument opposes it due to the risks involved.
Final Answer: <Answer>No</Answer>

C.5 Safety Detection (SD)
The underlying task is to determine the safe action given a real-life scenario and a list of actions comprising safe and unsafe actions. In this task, we define the context as the given scenario and the target as the possible action. We apply the full pipeline for this task, where in the WSG stage, annotators are asked to either minimally modify the safe action to make it unsafe or vice versa. Here is an example of this task:

Example - Safety:
Scenario: When heating up leftovers in the microwave
Action: wait for the plate to cool off before you grab it
Knowledge: (wait for the plate to cool off before you grab it, xIntent, avoid getting burnt)
Label: safe (Yes)

We use the following prompt templates:

Prompt - Safety:
You are a helpful assistant for safety detection. Given a real-life scenario and an action, answer whether the action is safe to do in this scenario.

Step 1: Identify the potential risks in the scenario
- Consuming an unknown and hence potentially harmful mushroom found in the woods.
Step 2: Evaluate the action
The action is to not eat the colorful mushroom.
Step 3: Determine if the action is safe in this scenario
By not eating the mushroom, the potential risk of consuming a poisonous or harmful mushroom is avoided. So, the action is safe to do in this scenario.
Final Answer: <Answer>Yes</Answer>

C.6 Dialogue Summarization (DS)
In dialogue summarization, given a dialogue context, the system is expected to output either an extractive or abstractive summary of the context. In this task, we designate the dialogue as the context and the summary as the target candidate for rewriting. The full data annotation pipeline is applied for this task, where in the WSG stage, the incorrectness condition is satisfied if the rewritten summary does not correctly summarize the given dialogue. As a filtering phase, we discard examples with dialogues that have fewer than 4 turns or summaries that have fewer than 5 content words. Here is an example of this task:

Example - Summarization:
Dialogue:
#Person1#: How may I help you?
#Person2#: I would like to return this book.
#Person1#: Is that all you need?
#Person2#: I also want to check out this video.
#Person1#: Do you have your library card?
#Person2#: Here it is.
#Person1#: If you damage the video, you will be fined.
#Person2#: I won't damage it.
Summary: #Person1# helps #Person2# to return a book and check out a video in the card-free, honor-system library.
Knowledge: (check out a video in the library, DependsOn, have your library card)
Label: not correct (No)

We use the following prompt templates:

Prompt - Summarization:
You are a helpful assistant for dialogue summarization. Given the following dialogue between #Person1# and #Person2#, answer whether the given summary correctly summarizes the dialogue. Answer only 'Yes' or 'No'.
Example 1:
Dialogue: {dialogue}
Summary: {summary}
Answer: {answer}
Example 2: ...

Prompt -Summarization -CoT:
You are a helpful assistant for dialogue summarization. Given the following dialogue between #Person1# and #Person2#, answer whether the given summary correctly summarizes the dialogue. Let's work this out in a step-by-step way to be sure that we have the right answer. Then provide your final answer within the tags, <Answer>Yes/No</Answer>.

Step 1: Identify the main points in the dialogue.
- #Person1# is going to New York for the first time and needs suggestions.
- #Person2# suggests the 'A friend in New York' service.
- The service creates a perfect trip based on a questionnaire.
- The questionnaire can be downloaded from their website.
Step 2: Compare the summary with the main points.
- The summary correctly mentions that #Person1# is going to New York for the first time.
- The summary mentions the personal tour guide service, but it incorrectly states that they won't know how to put together #Person1#'s trip plan because, according to the dialogue, the service can create a perfect trip based on the questionnaire.
Final Answer: <Answer>No</Answer>

D Human Evaluation
In this section, we provide more details on the human evaluation results. As we allow human evaluators to discuss cases of disagreement, the number of resolutions and the human performance before and after the discussion are of interest as well. In Table 9, we report the percentage of resolved disagreements per task and the human results before and after discussion, compared to the best-performing model, GPT-4. Performance numbers for human scores before discussion are calculated by treating each annotator as a different prediction for each example and computing the performance over all predictions. If there is a clash and one annotator is correct while the other is not, then that example receives a human score of 0.5 for accuracy. After discussion, the annotators agree on the final prediction, so the agreed-upon label is the prediction. As we can see from the table, human performance is already high before the disagreements are resolved, with GPT-4 only exceeding individual human performance on stance classification. We note, however, that this is one of the tasks with the most initial disagreements. Once both evaluators discuss, however, they settle on labels and outperform GPT-4.
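The scoring described above can be sketched as follows (function names are ours): before discussion, each annotator's label counts as a separate prediction, so an example where one of two annotators is correct contributes 0.5 to accuracy; after discussion, a single agreed-upon label is scored.

```python
def accuracy_before_discussion(annotator_labels, gold):
    """annotator_labels: per-example lists of labels, one label per annotator.
    Each annotator counts as a separate prediction, so an example where
    one of two annotators is correct scores 0.5."""
    scores = [
        sum(label == g for label in labels) / len(labels)
        for labels, g in zip(annotator_labels, gold)
    ]
    return sum(scores) / len(scores)


def accuracy_after_discussion(agreed_labels, gold):
    """After discussion, the single agreed-upon label is the prediction."""
    return sum(l == g for l, g in zip(agreed_labels, gold)) / len(gold)
```

For instance, with two annotators who split on one of two examples, the pre-discussion accuracy is (0.5 + 1.0) / 2 = 0.75.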

Analysis of disagreements
We observed different reasons for human disagreements, but the most common causes were: (i) instances requiring subtle factual knowledge that annotators may not recognize (see the disagreement example below for Stance Classification), and (ii) ambiguity and open-endedness (see the disagreement example below for Safety Detection).

Disagreement Example - Stance:
Belief: Entrapment is bad if done properly.
Argument: The bad parts of entrapment can be eliminated if the criminal can be given the opportunity to refuse.
Answer: Yes (supporting)

Along with making commonsense inferences, the above example requires subtle factual knowledge about entrapment, namely that entrapment requires the criminal to not be aware that they are being entrapped. As a result, if the criminal is given the opportunity to refuse, then commonsense tells us that the entrapment is no longer being done properly, since the criminal is aware. According to the belief statement, if the entrapment is no longer being done properly, then it is no longer bad (the contrapositive of the belief statement). This is the statement made by the argument, which is why the belief is supported by the argument. Without the subtle factual knowledge about entrapment, however, this example is challenging for human raters.
Disagreement Example - Safety:
Scenario: If your pets don't get along well
Action: let them play in a large room together
Answer: Yes (safe)

The above example is ambiguous because the action could result in both a safe and an unsafe scenario. The action statement specifies that the room is large, hinting that it provides enough space for the pets to be in the same room and that playing together may let them grow closer (a common way of introducing pets to each other). However, "pets don't get along well" could also mean they attack each other and fight, and putting them in any room together might lead to an unsafe situation, regardless of its size. Without contextual information to clarify, the example remains ambiguous to human readers.

Statistical Significance
We perform a Binomial Sign test on the GPT-4 and human evaluation results, as our sample size is small. We find that for the Machine Translation (en-ru, en-de, and en-fr), Intent Detection, Safety Detection, and Stance Classification tasks, the results are statistically significant at a level of 0.01, and for the Dialogue and Summarization tasks at a level of 0.05. Only for the Machine Translation (zh-en) task do we obtain a high p-value (0.57).
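The two-sided Binomial Sign test above can be computed from the counts of paired examples on which one system is correct and the other is not. A stdlib-only sketch of our own (the function name is ours):

```python
import math


def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided binomial sign test p-value.

    wins/losses: counts of paired examples where one system beats the
    other, discarding ties. Under the null hypothesis, each discordant
    pair is a fair coin flip (Binomial(n, 0.5))."""
    n = wins + losses
    k = min(wins, losses)
    # P(X <= k) under Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 8 wins against 2 losses gives p = 2 * (C(10,0) + C(10,1) + C(10,2)) / 2^10 ≈ 0.109, which would not be significant at the 0.05 level.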

E Additional Results
In this section, we provide the results of further experiments with varying hyperparameters. In Tables 10, 7, and 8, we report the averaged results of three runs where the temperature parameter is set to 0.3, respectively for all baselines, the comparison of commonsense knowledge across tasks, and with or without oracle knowledge. In Table 11, we report the results of GPT-4 with varying temperature values compared to human performance. In Table 12, we report the results of GPT-4 with chain-of-thought reasoning in various scenarios, including with self-consistency decoding (Wang et al., 2023). Overall, in none of these scenarios do we observe performance particularly different from what is reported in the main results in Table 2.

GPT-4-CoT
Step 1: Check if the response is relevant to the dialogue.
The response is related to the job mentioned in the dialogue.
Step 2: Check if the response makes sense in the context of the dialogue.
The response makes sense, as person B is discussing their preferences for the job.
Step 3: Check if the response is plausible.
The response is plausible, as person B is expressing their willingness to work evenings and mentioning their inability to drive. <Answer>Yes</Answer> ✘

GPT-4-CoT+Oracle Knowledge
Step 1: Analyze the dialogue
Person A and B are discussing job listings on the internet. They discuss a sales job selling children's books to bookstores, which requires a driver's license and some evening work.
Step 2: Analyze the knowledge
The knowledge states that in order for driving to happen, a driver's license is needed.
Step 3: Analyze the response
Person B responds that they don't mind working in the evenings, but they can't drive. They then ask for the phone number, presumably to apply for the job.
Step 4: Check for plausibility
The response is not plausible given the dialogue and the knowledge. The job requires a driver's license, which implies the need to drive. Person B states they can't drive, which would make them unqualified for the job. Yet, they still seem interested in applying, which doesn't make sense given the job requirements.
Final Answer: <Answer>No</Answer> ✔
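Self-consistency decoding (Wang et al., 2023), used in the Table 12 experiments, aggregates several sampled chain-of-thought answers by majority vote. A minimal sketch of the aggregation step (the function name is ours):

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled
    chain-of-thought completions (self-consistency aggregation)."""
    return Counter(answers).most_common(1)[0][0]
```

For example, if five sampled chains end in the answers Yes, No, Yes, Yes, No, the self-consistent prediction is Yes.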

Figure 1 :
Figure 1: An example from one of the tasks (Dialogue) in our benchmark, showcasing the difference between the evaluation of commonsense reasoning in an artificial and a real-world setting. CRoW grounds this evaluation in a real-world context that often requires the use of rich and implicit commonsense knowledge to solve a task.

Figure 3 :
Figure 3: Distribution of commonsense knowledge dimensions across tasks

Figure 5 :
Figure 5: Answers generated by GPT-4 in various settings for an example from the Dialogue task. While chain-of-thought reasoning results in an incorrect answer, the addition of the commonsense knowledge guides the model toward the correct answer.

Table 1 :
Statistics of the CRoW benchmark.

Table 2 :
Macro-F1 / Situational Accuracy (i.e., results aggregated per context instead of per sample) for all examined models across CRoW tasks. The performance of the highest-scoring model is bolded for each task. *Due to the cost of expert evaluation, our human study is only evaluated on 100 instances per task.
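Situational accuracy, per the caption above, aggregates results per context rather than per sample. The following sketch is our reading of that definition (a context counts as solved only when every example sharing it is answered correctly):

```python
from collections import defaultdict


def situational_accuracy(examples):
    """examples: iterable of (context_id, is_correct) pairs.

    A context is scored 1 only if all examples sharing that context
    (e.g., the original and its perturbation) are classified correctly."""
    by_context = defaultdict(list)
    for context_id, is_correct in examples:
        by_context[context_id].append(is_correct)
    solved = sum(all(results) for results in by_context.values())
    return solved / len(by_context)
```

For example, if context c1 has one correct and one incorrect prediction while c2 is fully correct, the situational accuracy is 1/2 even though 2 of 3 samples are correct.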

Table 4 :
Macro-F1 / Situational Accuracy scores averaged over all tasks (♣: all tasks except MT), with and without providing oracle commonsense knowledge as part of the prompt.
Example - Stance: Belief: Cosmetic surgery should not be banned.
Prompt - Safety + Commonsense Knowledge:
You are a helpful assistant for safety detection. Given a real-life scenario, an action, and relevant knowledge, answer whether the action is safe to do in this scenario. Answer only Yes or No.
Oh, here's a job in sales. It's a job selling children's books to bookstores. B: That sounds interesting. A: Yes. B: Let's see. A: You need to have a driver's license. And you have to work some evenings. B: I don't mind working in the evenings. And I can't drive, which is perfect. What's the phone number?