CONDAQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation

The full power of human language-based communication cannot be realized without negation: all human languages have some form of it. Despite this, negation remains a challenging phenomenon for current natural language understanding systems. To facilitate the future development of models that can process negation effectively, we present CONDAQA, the first English reading comprehension dataset that requires reasoning about the implications of negated statements in paragraphs. We collect paragraphs with diverse negation cues, then have crowdworkers ask questions about the implications of the negated statement in the passage. We also have workers make three kinds of edits to the passage (paraphrasing the negated statement, changing the scope of the negation, and reversing the negation), resulting in clusters of question-answer pairs that are difficult for models to answer with spurious shortcuts. CONDAQA features 14,182 question-answer pairs with over 200 unique negation cues and is challenging for current state-of-the-art models. The best-performing model on CONDAQA (UnifiedQA-v2-3B) achieves only 42% on our consistency metric, well below human performance of 81%. We release our dataset, along with fully-finetuned, few-shot, and zero-shot evaluations, to facilitate the development of future NLP methods that work on negated language.


Introduction
Negation is fundamental to human communication. It is a phenomenon of semantic opposition, relating one expression to another whose meaning is in some way opposed. Negation supports key properties of human linguistic systems such as contradiction and denial (Horn, 1989).
Despite the prevalence of negation, processing it effectively continues to elude models. Here are just a few of the many recently reported failures: "The model [BERT-Large trained on SQuAD] does not seem capable of handling...simple examples of negation" (Ribeiro et al., 2020). "We find that indeed the presence of negation can significantly impact downstream quality [of machine translation systems]" (Hossain et al., 2020a). "State-of-the-art models answer questions from the VQA...correctly, but struggle when asked a logical composition including negation" (Gokhale et al., 2020). How can NLU systems meet this long-standing challenge?
To facilitate systems that can process negation effectively, it is crucial to have high-quality evaluations that accurately measure models' competency at processing and understanding negation. In this work, we take a step toward this goal by contributing the first large-scale reading comprehension dataset, CONDAQA, focused on reasoning about negated statements in language. The three-stage annotation process we develop to construct CONDAQA is illustrated in Fig. 1. We first collect passages from English Wikipedia that contain negation cues, including single- and multi-word negation phrases, as well as affixal negation. In the first stage, crowdworkers make three types of modifications to the original passage: (1) they paraphrase the negated statement, (2) they modify the scope of the negated statement (while retaining the negation cue), and (3) they undo the negation. In the second stage, we instruct crowdworkers to ask challenging questions about the implications of the negated statement. The crowdworkers then answer the questions they wrote for the original and edited passages.
This process resulted in a dataset of 14,182 questions, covering a variety of negation cue types and over 200 unique negation cues, as well as a contrastive dataset, with passages that are lexically similar to each other but that may induce different answers for the same questions.

[Figure 1 caption: Gold answers are given by crowdworkers; predicted answers by InstructGPT (text-davinci-002) prompted with 8 shots. See §2 for more details about each stage.]

To perform well on CONDAQA, models must be able to reason about the implications of negated statements in text. In addition to accuracy, the contrastive nature of CONDAQA enables us to measure the consistency of models, i.e., the extent to which models make correct predictions on closely-related inputs.
We extensively benchmark baseline models on CONDAQA in three training data regimes: using all training examples, using only a small fraction (few-shot), or not using any examples (zero-shot). We show that CONDAQA is challenging for current models. Finetuning UnifiedQA-v2-3B (Khashabi et al., 2022), which was trained on 20 QA datasets, on CONDAQA achieves the best result of 73.26%, compared to human accuracy of 91.49%. Further, we find that models are largely inconsistent; the best model achieves a consistency score of only 42.18% (40% below human consistency). This very low consistency score demonstrates that handling negation, along with sensitivity to contrastive data more generally, remains a major unresolved issue in NLP. The dataset and baselines are available at https://github.com/AbhilashaRavichander/CondaQA.

CONDAQA Data Collection
This section describes our goals in constructing CONDAQA and our data collection procedure.
Design Considerations Our goal is to evaluate models on their ability to process the contextual implications of negation. We have the following four desiderata for our question-answering dataset:
1. The dataset should include a wide variety of ways negation can be expressed.
2. Questions should be targeted towards the implications of a negated statement, rather than the factual content of what was or wasn't negated, to remove common sources of spurious cues in QA datasets (Kaushik and Lipton, 2018; Naik et al., 2018; McCoy et al., 2019).
3. The dataset should feature contrastive groups: passages that are closely related, but that may admit different answers to questions, in order to reduce models' reliance on potential spurious cues in the data and to enable more robust evaluation (Kaushik et al., 2019; Gardner et al., 2020).
4. Questions should probe the extent to which models are sensitive to how the negation is expressed. To do this, there should be contrasting passages that differ only in their negation cue or its scope.
Dataset Construction Overview We generate questions through a process that consists of the following steps, as shown in Figure 1:
1. We extract passages from Wikipedia which contain negated phrases.
2. We show ten passages to crowdworkers, and allow them to choose a passage they would like to work on.
3. Crowdworkers make three kinds of edits to the passage: (i) paraphrasing the negated statement, (ii) changing the scope of the negation, and (iii) rewriting the passage to include an affirmative statement in place of the negated statement. For all three kinds of edits, the crowdworkers modify the passage as needed to preserve internal consistency.
4. Crowdworkers ask questions that target the implications of a negated statement in the passage, taking passage context into account.
5. Crowdworkers provide answers to the constructed questions for the Wikipedia passage, as well as for the three edited passages.
Further, we validate the development and test portions of CONDAQA to ensure their quality.
Passage Selection We extract passages from a July 2021 version of Wikipedia that contain either single-word negation cues (e.g., 'no') or multi-word negation cues (e.g., 'in the absence of'). We use negation cues from Morante et al. (2011) and van Son et al. (2016) as a starting point, which we extend. Our single-word negation cues include affixal negation cues (e.g., 'il-legal'), and span several grammatical categories, including:
1. Verbs: In this novel, he took pains to avoid the scientific impossibilities which had bothered some readers of the "Skylark" novels.
2. Nouns: In the absence of oxygen, the citric acid cycle ceases.
3. Adjectives: Turning the club over to managers, later revealed to be honest people, still left Wills in desperate financial straits with heavy debts to the dishonest IRS for taxes.
4. Adverbs: Nasheed reportedly resigned involuntarily to forestall an escalation of violence.
5. Prepositions: Nearly half a century later, after Fort Laramie had been built without permission on Lakota land.
6. Pronouns: I mean, nobody retires anymore.
7. Complementizers: Leave the door ajar, lest any latecomers should find themselves shut out.
8. Conjunctions: Virginia has no 'pocket veto' and bills will become law if the governor chooses to neither approve nor veto legislation.
9. Particles: Botham did not bat again.
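As an illustration, passage filtering of this kind can be sketched with simple string and token matching. The cue lists below are a tiny hypothetical subset for illustration only; the actual CONDAQA lists, extended from Morante et al. (2011) and van Son et al. (2016), contain over 200 cues, and affixal cues would additionally require morphological handling.

```python
import re

# Hypothetical subset of cues, purely for illustration.
SINGLE_WORD_CUES = {"no", "not", "nobody", "neither", "nor", "without", "lest"}
MULTI_WORD_CUES = ["in the absence of"]

def contains_negation_cue(passage: str) -> bool:
    """Return True if the passage contains any single- or multi-word negation cue."""
    lowered = passage.lower()
    # Multi-word cues are matched as substrings of the lowercased passage.
    if any(cue in lowered for cue in MULTI_WORD_CUES):
        return True
    # Single-word cues are matched against whole tokens to avoid false hits
    # inside longer words (e.g., 'no' in 'novel').
    tokens = re.findall(r"[a-z']+", lowered)
    return any(tok in SINGLE_WORD_CUES for tok in tokens)
```

A passage such as "Botham did not bat again." would be retained, while a passage with no cue would be filtered out.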
Crowdworker Recruitment We use the Crowdaq platform (Ning et al., 2020) to recruit a small pool of qualified workers to contribute to CONDAQA. We provide instructions, a tutorial, and a qualification task. Workers were asked to read the instructions, and optionally to also go through the tutorial. Workers then took a qualification exam which consisted of 12 multiple-choice questions that evaluated comprehension of the instructions. We recruit crowdworkers who answer >70% of the questions correctly for the next stage of the dataset construction task. In total, 36 crowdworkers contributed to CONDAQA. We paid 8 USD per HIT, which could on average be completed in less than 30 minutes. Each HIT consisted of choosing a passage, making edits to the passage, creating questions, and answering those questions.

Contrastive Dataset Construction
We use Amazon Mechanical Turk to crowdsource question-answer pairs about negated statements. Each question is asked in the context of a negated statement in a Wikipedia passage.
In the first stage of the task, we show crowdworkers ten selected passages of approximately the same length and let them choose which to work on. This gives crowdworkers the flexibility to choose passages which are easy to understand, as well as passages which are conducive to making contrastive edits (for example, it may be difficult to reverse the negation in a passage about 'Gödel's incompleteness theorems').
After selecting a passage, crowdworkers make three kinds of edits to the original Wikipedia passage (Fig. 1): (1) they rewrite the negated statement such that the sentence's meaning is preserved (PARAPHRASE EDIT); (2) they rewrite the negated statement, changing the scope of the negation (SCOPE EDIT); and (3) they reverse the negated event (AFFIRMATIVE EDIT).We ask crowdworkers to additionally make edits outside of the negated statement where necessary to ensure that the passage remains internally consistent.
In the second stage of the task, the crowdworker asks at least three questions about the implications of the negated statement in the original Wikipedia passage. We encourage the construction of good questions about implications by providing several examples of such questions, as well as by sending bonuses of 10-15 USD to creative crowdworkers. Crowdworkers can group these questions, to indicate questions that are very similar to each other but admit different answers.
In the final stage of this task, crowdworkers provide answers to the questions, in the context of the original Wikipedia passage as well as the three edited passages. The answers are required to be Yes/No/Don't know, a span in the question, or a span in the passage. Following best practices for crowdsourcing protocols described in the literature (Nangia et al., 2021), we provide personalized feedback to each crowdworker based on their previous round of submissions, describing where their submission was incorrect, why it was incorrect, and what they could have submitted instead. In all, we provide over 15 iterations of expert feedback on the annotations. We collected this data over a period of approximately seven months.

Data Cleaning and Validation
In order to estimate human performance, and to construct a high-quality evaluation with fewer ambiguous examples, we have five verifiers provide answers for each question in the development and test sets. Crowdworkers were given passages, as well as the passage edits and questions contributed in the previous stage of our task. In each HIT, crowdworkers answered 60 questions in total, spanning five passage sets. We found substantial inter-annotator agreement: for the test set, we observed a Fleiss' κ of 63.27 for examples whose answers are Yes/No/Don't know (97% of examples), 62.75 when answers are a span in the question (2% of examples), and 48.54 when answers are a span in the passage (1% of examples). We only retain examples in the test and development sets where at least four annotators agreed on the answer. However, since this procedure results in few questions with 'don't know' as the answer, we include an additional stage where we (the authors) manually verify and include questions where 'don't know' was the answer provided by the question author. As a result, we discard 1,160 instances from the test set and 270 from the development set.
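The agreement-based retention rule can be sketched as follows. This is a minimal illustration, not the released code; the field names (e.g., `validator_answers`) are assumptions about the data layout.

```python
from collections import Counter

def retain_high_agreement(examples, min_agreement=4):
    """Keep validation examples where at least `min_agreement` of the five
    verifier answers coincide; the majority answer becomes the gold label."""
    kept = []
    for ex in examples:
        answer, count = Counter(ex["validator_answers"]).most_common(1)[0]
        if count >= min_agreement:
            kept.append({**ex, "gold_answer": answer})
    return kept
```

Under this rule, a question answered "Yes" by four of five verifiers is retained with gold answer "Yes", while a 3-2 split is discarded (subject to the manual 'don't know' pass described above).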

CONDAQA Data Analysis
In this section, we provide an analysis of the passages, questions, edits, and answers in CONDAQA. Descriptive statistics are provided in Table 1.
Negation Cues Negation is expressed in many complex and varied ways in language (Horn, 1989).
To characterize the distribution of types of negated statements in CONDAQA, we analyze the negation cues in the Wikipedia passages that annotators could select. Figures 2 and 4 (Appendix) show that the distribution over these cues and their grammatical roles is highly diverse.

Commonsense inferences
We assess the types of commonsense inferences required to answer CONDAQA questions. We sample 100 questions from the test set and manually annotate the dimensions of commonsense reasoning required to answer them. Table 2 shows some of these reasoning types (the full version appears in Table 12 in the Appendix).
Editing Strategies Recall that the passages with negated statements are sourced from Wikipedia and crowdworkers make three kinds of edits (Fig. 1). Through a qualitative analysis of the data, we identify commonly employed edit strategies (Tables 3 and 13). We also analyze to what extent edits cause an answer to change. We find that the affirmative edits change the answers of 77.7% of questions from the original Wikipedia passage, and the scope edits change the answers of 70.6% of questions.
Potential edit artifacts Because we had crowdworkers edit Wikipedia paragraphs, a potential concern is that the edited text could be unnatural and give spurious cues to a model about the correct answer. We ran two tests to quantify potential bias in this edited data. First, we trained a BERT model (Devlin et al., 2019) to predict the edit type given just the passage. The model performs only a little better than random chance (34.4%), with most of the improvement coming from the ability to sometimes detect affirmative edits (where the negation cue has been removed). Second, we compared the perplexity of the original paragraphs to the perplexity of the edited paragraphs, according to the GPT language model (Radford et al., 2018), and found them to be largely similar (Table 7).

PARAPHRASE EDIT

Synonym substitution
Original: Local tetanus is an uncommon form of the disease and it causes persistent contractions of muscles in the same area of the sufferer's body as where the original injury was made.
Edit: Local tetanus is a rare form of the disease and it causes persistent contractions of muscles in the same area of the sufferer's body as where the original injury was made.

Antonym substitution
Original: The population of the Thirteen States was not homogeneous in political views and attitudes.
Edit: The population of the Thirteen States was heterogeneous in political views and attitudes.

Ellipsis
Original: Sunni scholars put trust in narrators such as Aisha, whom Shia reject.
Edit: While the Shia tend to reject narrators such as Aisha, Sunni scholars tend to trust them.

SCOPE EDIT

Complement inversion
Original: Sunni scholars put trust in narrators such as Aisha, whom Shia reject.
Edit: Shia scholars put trust in narrators such as Aisha, whom Sunni reject.

Superset/subset
Original: During the coronavirus outbreak of 2020, alcohol sales, and even the transportation of alcohol outside of one's home, was made illegal.
Edit: During the coronavirus outbreak of 2020, alcohol sales were made illegal, but the transportation of alcohol outside of one's home remained legal.

Temporal shift
Original: As the new Emperor could not exert his constitutional powers until he came of age, a regency was set up by the National Assembly.
Edit: As the new Emperor could not exert his constitutional powers once he came of age, a regency was set up by the National Assembly.

Veridicality
Original: On arrival he was given aptitude tests which determined that he was illiterate.
Edit: Contrary to assumptions that he was illiterate, on arrival he was given aptitude tests which determined that not only could he read the questions and respond in writing, but he also had an above-average IQ of 109.

Table 3: Examples of revision strategies employed by crowdworkers for paraphrase and scope edits. Categories for paraphrases are inspired by Bhagat and Hovy (2013). (The negation cue and newly-inserted text, shown in color in the original table, are presented here as original/edit pairs.) Expanded analysis is shown in the Appendix (Table 13).
Evaluation Metrics Our first metric is accuracy: whether a model's prediction matches the ground truth answer exactly.
Group Consistency CONDAQA's dense annotations enable us to study model robustness through group consistency. We wish to measure whether a model correctly captures how the presence of negated phrases influences what can be inferred from a paragraph. Measuring this requires varying (and sometimes removing) the negated phrases and seeing how the model responds (see Table 14 in the Appendix); it is only by looking at consistency across these perturbations that we can tell whether a model understands the phenomena in question (Gardner et al., 2020). Specifically, for a group of minimally-different instances, consistency measures whether the prediction matches the ground truth answer for every element in that group. We consider two types of groups: (a) Question-level consistency: each group is formed around a question and the answers to that question for the original Wikipedia passage as well as the three edited passage instances (ALL); (b) Edit-level consistency: each group is formed around a question, the answers to that question for the original Wikipedia passage, and only one of the edited passages (PARAPHRASE CONSISTENCY, SCOPE CONSISTENCY, and AFFIRMATIVE CONSISTENCY). To compute consistency, we use the 5,608 questions in the test set that have (passage, answer) pairs for all four edit types (excluding any question where at least one passage was removed during validation).
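Both consistency variants can be sketched as follows, assuming per-instance correctness is keyed by question ID and edit type (the data layout here is a hypothetical illustration, not the released evaluation code):

```python
from collections import defaultdict

EDITS = ("original", "paraphrase", "scope", "affirmative")

def consistency(is_correct, edit_types=EDITS):
    """Fraction of question groups whose predictions are correct on *every*
    passage variant in `edit_types`. `is_correct` maps
    (question_id, edit_type) -> bool. Question-level consistency uses all
    four variants; edit-level consistency pairs the original with one edit."""
    groups = defaultdict(dict)
    for (qid, edit), ok in is_correct.items():
        groups[qid][edit] = ok
    # Only groups containing every required variant are counted, mirroring
    # the 5,608-question subset described above.
    complete = [g for g in groups.values() if all(e in g for e in edit_types)]
    if not complete:
        return 0.0
    return sum(all(g[e] for e in edit_types) for g in complete) / len(complete)
```

Edit-level scope consistency, for instance, would be `consistency(is_correct, ("original", "scope"))`.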

Models and Controls
The baseline models that we benchmark on CONDAQA are listed in Table 4. We categorize them based on whether they use (a) all of the training data we provide (fully finetuned), (b) a small fraction of the available training data (few-shot), or (c) no training data (zero-shot), and on (d) whether they measure dataset artifacts (controls).
For full finetuning, we train and evaluate three BERT-like models: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and DeBERTa (He et al., 2021b,a), in addition to UnifiedQA-v2 (Khashabi et al., 2022), a T5 variant (Raffel et al., 2020) that was further specialized for QA by training the model on 20 QA datasets. More information about these models is given in Appendix C.1. We study the Base, Large, and 3B sizes of UnifiedQA-v2. Each fully-finetuned model is trained with 5 random seeds, and we report the average performance across seeds on the entire test set.
In the few-shot setting with 8-9 shots, we evaluate UnifiedQA-v2-{Base, Large, 3B} (Khashabi et al., 2022), GPT-3 (davinci; Brown et al., 2020), and a version of the original InstructGPT (Ouyang et al., 2022) known as text-davinci-002, henceforth referred to as InstructGPT. We additionally prompt InstructGPT with chain-of-thought prompting (CoT; Wei et al., 2022), as this should be beneficial for reasoning tasks. We do prompt-based finetuning of UnifiedQA-v2 (i.e., we change its parameters) and in-context learning with the GPT models (i.e., we do not change their parameters). Besides these models, in the zero-shot setting, we also evaluate UnifiedQA-v2-11B and FLAN-T5-11B (Chung et al., 2022), a T5 variant that was further trained with instruction finetuning and CoT data. Details of the few- and zero-shot settings are given in Appendix C.2. Due to the cost of the OpenAI API and the sensitivity of few-shot learning to the choice of examples (Zhao et al., 2021; Logan IV et al., 2022; Perez et al., 2021), we evaluate few- and zero-shot models as follows. We split the train/test sets into five disjoint sets, sample 9 shots from each train subset, evaluate models on these five train-test splits, and report the average performance across them. On average, each test split contains 1,448 instances.
We evaluate heuristic baselines to measure the extent to which models can use data artifacts to answer CONDAQA questions. These baselines can answer questions correctly only if there is bias in the answer distribution given a question or other metadata, since they do not get the paragraphs. We train UNIFIEDQA-V2-LARGE on just: (i) (question, answer) pairs, (ii) (question, edit type, answer) triples, where the edit type denotes whether the passage was a paraphrase, scope edit, etc., and (iii) (question, negation cue, answer) triples. We find these baselines do little better than just answering "No".
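The floor these partial-input controls barely beat is the majority-class baseline, which can be sketched as (an illustration, not the released code):

```python
from collections import Counter

def majority_baseline(train_answers, test_answers):
    """Accuracy of always predicting the most frequent training answer
    (for CONDAQA, 'No'), ignoring the question and passage entirely."""
    majority = Counter(train_answers).most_common(1)[0][0]
    return sum(a == majority for a in test_answers) / len(test_answers)
```

A heuristic control whose accuracy is close to this number has learned little beyond the marginal answer distribution.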
Human Performance We measure human performance on the CONDAQA development and test sets. Every question was answered by five crowdworkers. To evaluate human performance, we treat each answer to a question as the human prediction in turn, and compare it with the most frequent answer amongst the remaining answers. For questions where the gold answer was decided by experts (§2), we treat each answer as the human prediction and compare it to the gold answer. Human accuracy is 91.94%, with a consistency score of 81.58%.
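The leave-one-out procedure can be sketched as follows. This is a simplified illustration that ignores the expert-adjudicated questions; tie-breaking among the remaining four answers is arbitrary.

```python
from collections import Counter

def human_accuracy(answer_sets):
    """Leave-one-out estimate: each of the five answers to a question is
    treated as the 'prediction' in turn and compared with the most frequent
    answer among the remaining four."""
    correct = total = 0
    for answers in answer_sets:
        for i, pred in enumerate(answers):
            rest = answers[:i] + answers[i + 1:]
            # Counter.most_common breaks ties by insertion order.
            majority = Counter(rest).most_common(1)[0][0]
            correct += (pred == majority)
            total += 1
    return correct / total
```

For example, a question answered "Yes" by four annotators and "No" by one contributes an accuracy of 4/5 under this scheme.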

Results
Model performance on CONDAQA is given in Table 4. The best-performing model is fully finetuned UNIFIEDQA-V2-3B, with an accuracy of 73.26% and an overall consistency of 42.18%, where the estimated human accuracy is 91.94% and consistency 81.58%. This gap shows that CONDAQA questions are both answerable by humans and challenging for state-of-the-art models.
We create a contrastive dataset to be able to measure consistency, as measuring models' ability to robustly predict answers across small input perturbations can provide a more accurate view of their linguistic capabilities (Gardner et al., 2020). Here, there is a gap of ∼40% in consistency between humans and the best model. Models are most robust to paraphrase edits: if a model answers a question correctly for the original passage, it is likely to be robust to changes in how that negation is expressed. We observe that the heuristic-based baselines exhibit low consistency, suggesting the consistency metric may be a more reliable measure than accuracy for evaluating models' ability to process negation. Thus, mainstream benchmarks should consider including consistency as a metric to more reliably measure progress on language understanding. Few- and zero-shot baselines do not match fully finetuned models' performance, but considerably improve over the majority baseline. For UnifiedQA-v2 in particular, this suggests that some reasoning about the implications of negation is acquired during pretraining. Surprisingly, UnifiedQA-v2's few-shot performance is worse than its zero-shot performance. While this behavior has been reported for in-context learning with GPT-3 (Brown et al., 2020; Xie et al., 2022), we did not expect to observe it for a finetuned model. UnifiedQA-v2-3B finetuned with a few examples is comparable to InstructGPT (text-davinci-002; at least 175B parameters) with in-context learning. Chain-of-thought prompting (CoT) notably improves the performance of InstructGPT, especially in terms of the most challenging metrics: scope and affirmative consistency. In the zero-shot setting, the 11B version of UnifiedQA-v2 performs best, while the Base version, with only 220M parameters, is comparable to InstructGPT. UnifiedQA-v2-11B is also better than FLAN-T5-XXL (also an 11B-parameter model). Given that UnifiedQA-v1 (Khashabi et al., 2020) has been effective for tasks beyond QA (Bragg et al., 2021; Marasović et al., 2022), UnifiedQA models are strong but overlooked baselines in recent works on large-scale models.

Analysis
While examining model errors, we find that UNIFIEDQA-V2-LARGE's accuracy is negatively correlated with question length (Figure 7 in Appendix D), while humans can still reliably answer the long questions that are frequent in CONDAQA. We also analyze the performance of UNIFIEDQA-V2-LARGE across answer types, finding that: (i) the model performs best when the answer is "No", (ii) it almost never predicts "Don't know", and (iii) its performance on span extraction questions is in between those two extremes (Figure 8 in Appendix D). UNIFIEDQA-V2-3B exhibits similar behavior, with improved performance on questions which admit "Don't know" as an answer. We analyze questions across the Wikipedia passages and the passages with edited scopes, focusing on: (i) instances where the true answer does not change with the edited scope and the model should be stable, and (ii) instances where the true answer does change and the model should be sensitive to the edit. We find that when the fully-finetuned UNIFIEDQA-V2-3B (the best-performing model) answers a question correctly for the Wikipedia passage, it produces the correct answer for only 63.23% of questions where the scope edit induces a different answer. In contrast, the model answers questions correctly for 91.03% of the instances where the answer does not change with the scope edit. This suggests the model is not sensitive to changes in the scope of negated statements.
We also analyze to what extent UNIFIEDQA-V2-3B distinguishes between negated statements and their affirmative counterparts. We examine model predictions for 1,080 sample pairs where the answer changes when the negation is undone. For 43.52% of these, the model changes its predictions. This suggests, in contrast to previous work (Kassner and Schütze, 2020; Ettinger, 2020), that models are sensitive to negated contexts to some extent.

Related Work
In Aristotle's De Interpretatione, all declarative statements are classified as either affirmations or negations, used to affirm or contradict the occurrence of events (Ackrill et al., 1975). Negation is expressed through a variety of formulations (Horn, 1989) and is prevalent in English corpora (Hossain et al., 2020c). Despite that, evidence from multiple tasks that require language understanding capabilities, such as NLI (Naik et al., 2018) and sentiment analysis (Li and Huang, 2009; Zhu et al., 2014; Barnes et al., 2019), shows that negation remains challenging for current systems.

Conclusion
Negation supports key properties of human linguistic systems, such as the ability to distinguish between truth and falsity. We present CONDAQA, a QA dataset that contains 14,182 examples to evaluate models' ability to reason about the implications of negated statements. We describe a procedure for contrastive dataset collection that results in challenging questions, present a detailed analysis of the dataset, and evaluate a suite of strong baselines in fully-finetuned, few-shot, and zero-shot settings. We evaluate models on both their accuracy and consistency, and find that the dataset is highly challenging: even the best-performing model is 18 points lower in accuracy than our human baseline, and about 40 points lower in consistency. We expect that CONDAQA will facilitate the development of NLU systems that can handle negation.

Limitations

A Extended Comparison to Prior Negation Datasets
In this section, we complement the discussion in §7 of how CONDAQA differs from existing datasets focused on negation. A detailed comparison is given in Table 5.
Our goal with constructing CONDAQA is to contribute a high-quality and systematic evaluation that will facilitate future models that can adequately process negation. To this end, we aim to construct a benchmark where artifacts are carefully mitigated (Gardner et al., 2020), that is large enough to support robust evaluation, and that covers competencies any NLU system needs for adequate processing of negation: for example, the ability to recognize the implications of negated statements, distinguish them from their affirmative counterparts, and identify their scope. As such, the main properties that CONDAQA has compared to prior datasets focused on negation are:
1. It is the first English reading-comprehension dataset that targets how models process negated statements in paragraphs (Gardner et al., 2019b).

B Analysis of CONDAQA
Commonsense Inferences We provide a categorization of the types of commonsense inferences required to answer CONDAQA questions. These categories are presented in Table 12.
Edit Strategies We provide a set of edit strategies that were employed by crowdworkers to make paraphrase and scope edits. These strategies are given in Table 13.
Question/Passage Overlap An issue with some NLU datasets is that simple heuristics based on lexical overlap are sufficient to achieve high performance (Weissenborn et al., 2017; Naik et al., 2018). We measure the lexical overlap between CONDAQA questions and passages and find that it is considerably lower than in many prior QA datasets. Specifically, the average overlap between question words and passage words is 0.52, which is lower than that of SQuAD 1.0.
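The overlap statistic can be approximated as below. This is a sketch; the exact tokenization and any stopword handling used for the reported 0.52 figure are not specified here, so treat those details as assumptions.

```python
def question_passage_overlap(question: str, passage: str) -> float:
    """Fraction of (lowercased, whitespace-tokenized) question word types
    that also appear in the passage."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & p_words) / len(q_words)
```

Averaging this value over all (question, passage) pairs yields the dataset-level overlap statistic.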

Distribution of grammatical categories of negation cues
We analyze the distribution over grammatical categories for single-word negation cues in CONDAQA. We use the NLTK library (Bird et al., 2009) to identify part-of-speech tags for these cues. The results are shown in Figure 4.
Model sensitivity to edits One potential issue with the dataset is that models might find it trivial to distinguish between edited passages and leverage this information to answer questions. To evaluate whether models can easily distinguish the original passages from the edited versions, we train BERT (Devlin et al., 2019) on the task of identifying whether a passage is sourced from Wikipedia or is an edited passage produced by a crowdworker. We expect it to be simple for these models to distinguish the Wikipedia passages from the affirmative edits, as the model can simply rely on the presence or absence of a negation cue. We observe that, as expected, models are somewhat able to distinguish the original Wikipedia passages from affirmative edits, but are largely unable to discriminate between the original passages and the paraphrase and scope edits (Table 6).
Naturalness of edits New edits made by crowdworkers may contain unnatural sentences or linguistic constructs. To quantify this, and to exclude the possibility that model performance degrades only due to the unnaturalness of the edits, we compare the perplexity assigned by the OpenAI GPT language model (Radford et al., 2018) to the edited passages and the original Wikipedia passages, finding that they are largely similar (Table 7).

Consistency Groups
We provide statistics on the instances that are used to compute the consistency metrics on the dataset. There are 5,608 instances in the dataset that are included in consistency groups, and thus there are 1,402 groups used to compute question-level consistency and each edit-level consistency metric.

C Model Training Details
All models we evaluate on CONDAQA are pretrained transformer-based language models. We test them in three training settings: (i) finetuned on the entire training data (§C.1), (ii) finetuned on a few examples (few-shot; §C.2), and (iii) without training (zero-shot; §C.2).

C.1 Fully Finetuned
We train all fully-finetuned models with five seeds and report the average performance across them.
For every seed, we evaluate the model with the best validation accuracy on the entire test set.
BERT (Devlin et al., 2019) BERT is pretrained with masked language modeling (MLM) and a next-sentence prediction objective. Since a majority of the questions have Yes/No/Don't know as the answer, we finetune BERT and the other BERT-like models below in a multi-class classification setting. In our experiments, we use BERT-Large, trained with a learning rate of 1e-5 for 10 epochs.
RoBERTa (Liu et al., 2019) RoBERTa is a more robustly pretrained version of BERT. In our experiments, we use RoBERTa-Large.
DeBERTa (He et al., 2021b,a) DeBERTa has a disentangled attention mechanism and is pretrained with a version of the MLM objective that uses both the content and the position of the context words. In our experiments, we use DeBERTa-v2-XLarge and DeBERTa-v3-Large.
InstructGPT (text-davinci-002; Ouyang et al., 2022) This GPT variant does not come with a corresponding paper and little is known about it. It has recently been confirmed that it is an Instruct model, but unlike the original InstructGPT (text-davinci-001; Ouyang et al., 2022) it is not derived from GPT-3 (davinci). The original InstructGPT was trained on data that includes "prompts submitted to earlier versions of the InstructGPT models on the OpenAI API Playground" and finetuned with reinforcement learning from human feedback (Stiennon et al., 2020). text-davinci-002 has a maximum input sequence length twice that of davinci, suggesting that the overall model size is notably larger too. This also means we can fit more examples in the context, but we do not find that this improves text-davinci-002's performance; see Table 9. It has been reported on social media that text-davinci-002 performs notably better than text-davinci-001, but where these improvements come from is publicly unknown.
Chain-of-Thought (CoT) prompting (Wei et al., 2022) This type of prompting makes the model explain its prediction before providing it. When it was introduced, CoT prompting demonstrated benefits for math and commonsense reasoning. Since then, Suzgun et al. (2022) report that CoT prompting gives substantial improvements on a hard subset of the BIG-Bench tasks (Srivastava et al., 2022). This makes it a promising prompting strategy for our proposed task of reasoning about the implications of negation. One of the authors wrote explanations for all shots in each split (45 explanations in total) in a few hours. In Figure 6, we show an example of a CoT prompt we use for InstructGPT (text-davinci-002).
FLAN-T5 (Chung et al., 2022) FLAN-T5 is a T5 variant that is further trained with instruction finetuning, including CoT prompting, on over 1.8K tasks. We prompt FLAN-T5 in the zero-shot setting by constructing each test instance as follows:
• Input: Passage: {passage}\nQuestion: {question}\nGive the rationale before answering.
• Output: {explanation} So the answer is {answer}.
This output form is the most common, but the model sometimes generates "(final) answer is", "(final) answer:", etc., instead of "So the answer is".

UnifiedQA-v2 (Khashabi et al., 2022) We also evaluate UnifiedQA-v2 in few- and zero-shot settings in addition to fully finetuning it. We construct instances following how they are constructed for training UnifiedQA-v2:
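Returning to the FLAN-T5 output variants noted above ("So the answer is", "(final) answer is", "(final) answer:"), a tolerant extraction pattern can cover all of them. This is a hedged sketch, not the paper's actual post-processing code.

```python
import re

# Matches "So the answer is X", "final answer is X", "answer: X", etc.,
# at the end of the generation (assumed output shapes, see text above).
ANSWER_PATTERN = re.compile(
    r"(?:so the |final )?answer(?: is|:)\s*(.+?)\s*\.?\s*$",
    re.IGNORECASE,
)

def extract_answer(generation):
    """Return the final answer span, or None if no answer phrase is found."""
    match = ANSWER_PATTERN.search(generation.strip())
    return match.group(1) if match else None
```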
Which few examples to select? CONDAQA's unique structure raises the question of which 8-9 examples to use for few-shot learning: (1) randomly selected; (2) random without affirmative paragraphs, to include more paragraphs with negation cues; (3) two groups of two questions and the corresponding 4 paragraphs (original and the three edited versions); (4) three groups of two questions and the corresponding 3 paragraphs (original, scope- and paraphrase-edited; no affirmative). We hypothesize that the last two options could be beneficial for the consistency of few-shot models. We prompt davinci with the 1st and 3rd options, and depending on which is better we evaluate the 2nd or 4th (i.e., the better option without affirmative paragraphs). Contrary to our expectations, we find that the 1st option works better than the 3rd, as well as better than the 2nd option; see Table 8. Therefore, for each training split, we sample 9 paragraph-question pairs randomly (sometimes only 8 fit in the context) and use these samples for all few-shot experiments.
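The winning strategy (random sampling of 9 pairs, sometimes fewer if the context overflows) could be sketched as follows. The whitespace word count here is a crude stand-in for the API tokenizer, and the field names are illustrative assumptions.

```python
import random

def sample_shots(train_pairs, max_tokens=2045, target=9, seed=0):
    """Randomly sample up to `target` paragraph-question pairs that fit
    within a token budget (word count used as a rough token proxy)."""
    rng = random.Random(seed)
    pool = list(train_pairs)
    rng.shuffle(pool)
    shots, used = [], 0
    for pair in pool:
        # Approximate the prompt cost of this shot.
        cost = len(f"{pair['passage']} {pair['question']} {pair['answer']}".split())
        if used + cost > max_tokens or len(shots) == target:
            break
        shots.append(pair)
        used += cost
    return shots
```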

D Model analysis
Model performance stratified by passage type In Table 10, we report the accuracy of model predictions stratified by the type of passage, i.e., whether the question was asked on the original Wikipedia passage, its paraphrase edit, its scope edit, or the affirmative edit. When we compare these results with those in Table 4, we observe that UnifiedQA-v2 shows largely similar QA accuracy across these different passage types, despite having very different consistency scores with the original passage. In contrast, GPT-3 and InstructGPT in the few-shot setting perform better on the original Wikipedia passages and their paraphrased versions than on the scope and affirmative edits, possibly suggesting that these models work best on passages that are available online.
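The stratification above amounts to grouping prediction records by passage type and averaging correctness within each group; a minimal sketch, assuming records carry hypothetical "passage_type" and "correct" fields:

```python
from collections import defaultdict

def accuracy_by_passage_type(records):
    """Per-passage-type accuracy: fraction of correct predictions
    among records sharing the same (hypothetical) 'passage_type' field."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["passage_type"]] += 1
        hits[r["passage_type"]] += int(r["correct"])
    return {t: hits[t] / totals[t] for t in totals}
```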

Model performance by question length
In Figure 7, we show model performance stratified by question length. We observe that longer questions are more difficult for UnifiedQA-v2-Large, but UnifiedQA-v2-3B appears to exhibit similar QA performance on some of these longer questions.
Model performance by answer type In Figure 8, we show model performance stratified by answer type.
Variance in model performance We report the standard deviation of the UnifiedQA-v2 models computed over results from five seeds, as well as the standard deviation of GPT-3 and InstructGPT in the few-shot and zero-shot settings computed over five splits. These are shown in Table 11.

Novelty of negation cues
We compare the performance of fully-finetuned UnifiedQA-v2 Large/3B on Wikipedia passages where the negation cue has been seen in the training data with passages whose cue is novel (unseen in training).

E Crowdsourcing Interface Templates
We include an example of the annotation interface we showed to crowdworkers. Figure 9 shows a sample of each stage of our task.

Figure 9: Sample of our Question-Answering HIT, where crowdworkers can choose a passage, make edits to that passage, ask questions about that passage, and then answer those questions.

Paragraph #1: Scorsese was initially reluctant to develop the project, though he eventually came to relate to LaMotta's story. Schrader re-wrote Martin's first screenplay, and Scorsese and De Niro together made uncredited contributions thereafter. Pesci was a famous actor prior to appearing in this role, but Moriarty was unknown to the producers before he suggested her for her role. During principal photography, each of the boxing scenes was choreographed for a specific visual style and De Niro gained approximately to portray LaMotta in his later post-boxing years. Scorsese was exacting in the process of editing and mixing the film, expecting it to be his last major feature.
Question: Is it possible that the writers of this movie had specifically tailored the character to Joe Pesci's unique on-screen charisma, with the hopes that he would accept the role?
Answer: Yes

Paragraph #2: Scorsese was initially reluctant to develop the project, though he eventually came to relate to LaMotta's story. Schrader re-wrote Martin's first screenplay, and Scorsese and De Niro together made uncredited contributions thereafter. Before appearing in this movie, Pesci had not achieved fame as an actor, and neither had Moriarty, whom he suggested for her role. During principal photography, each of the boxing scenes was choreographed for a specific visual style and De Niro gained approximately to portray LaMotta in his later post-boxing years. Scorsese was exacting in the process of editing and mixing the film, expecting it to be his last major feature.
Question: Is it possible that the writers of this movie had specifically tailored the character to Joe Pesci's unique on-screen charisma, with the hopes that he would accept the role?
Answer: No

Table 14: Presumably, answering this question in the context of the second paragraph requires reasoning about negation, while answering it in the context of the first paragraph does not. However, if the model is only ever presented instances like the second paragraph, it is possible that subtle artifacts would lead to good model performance without the model ever needing to fully process the negation. By making minimal changes to the paragraph that intervene on the negation, we can increase our confidence that the model is able to correctly process the negation in the second paragraph. The question-paragraph pairs must be considered jointly to accurately characterize a model's ability to handle negation, hence our focus on group consistency as our preferred performance metric.
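The group-consistency metric described above credits a model for a cluster of original and edited passages only if it answers every question in that cluster correctly. A minimal sketch, assuming each record carries hypothetical "group_id" and "correct" fields:

```python
from collections import defaultdict

def group_consistency(records):
    """Fraction of question groups in which every prediction is correct.
    Field names ('group_id', 'correct') are illustrative assumptions."""
    groups = defaultdict(list)
    for r in records:
        groups[r["group_id"]].append(r["correct"])
    if not groups:
        return 0.0
    # A group counts only if all of its answers are correct.
    return sum(all(v) for v in groups.values()) / len(groups)
```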

Figure 1 :
Figure 1: CONDAQA's three-stage collection procedure. The original passage is selected by a crowdworker from a given set of 10 passages. Gold answers are given by crowdworkers; predicted answers are from InstructGPT (text-davinci-002) prompted with 8 shots. See §2 for more details about each stage.

Figure 2 :
Figure 2: Distribution of negation cues in CONDAQA. The inner circle represents the distribution of negation cue types by frequency, and the outer circle represents individual cues.

Table 5 :
Comparison between CONDAQA and prior datasets focusing on probing negation. We examine the English data in Hartmann et al. (2021), the MNLI/SNLI/RTE splits in Hossain et al. (2020c), NMoNLI (Geiger et al., 2020), as well as the NEG-136-SIMP and NEG-136-NAT datasets (Ettinger, 2020). CONDAQA is a reading comprehension (RC) dataset; the tasks in Hartmann et al. (2021) and Hossain et al. (2020c) are stress tests for existing general-purpose NLI datasets such as MNLI. NMoNLI is used both as a challenge (evaluation) set and to train models on a subset of the data. NEG-136-SIMP/NEG-136-NAT are datasets of cloze-style prompts. Passage/Premise/Prompt length and Question/Hypothesis length are reported as the average number of words in the input. "Answer exists" describes whether a correct answer exists for the negated statement in the dataset, or whether the evaluation relies on negated and affirmative statements requiring different predictions.

Figure 4 :
Figure 4: Distribution of grammatical categories of negation cues in CONDAQA.

Figure 5 :
Figure 5: A prompt used to get generations from GPT-3 (davinci) and "InstructGPT" (text-davinci-002). We designed the task description following Wang et al. (2022). The zero-shot prompt is the same except that there are no examples.

Table 1 :
Dataset statistics of CONDAQA.Passage statistics are computed on Wiki passages but not on edits.
Moreover, there are 219 unique cues in CONDAQA and 75 novel cues in the test set that are unseen in the training data.This is a substantially wider range of negation cues than what is included in prior work; see Appendix A for a detailed comparison.
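The unique- and novel-cue counts reported above could be computed with a simple set difference between the cue inventories of the splits; a sketch, assuming each example carries a hypothetical "cue" field:

```python
def cue_statistics(train_examples, test_examples):
    """Return (number of unique cues overall, number of test cues
    unseen in training). The 'cue' field name is an assumption."""
    train_cues = {ex["cue"] for ex in train_examples}
    test_cues = {ex["cue"] for ex in test_examples}
    all_cues = train_cues | test_cues
    novel_test_cues = test_cues - train_cues
    return len(all_cues), len(novel_test_cues)
```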

Table 2 :
Examples of questions that target the implications of negated statements in CONDAQA, and the reasoning steps to correctly answer them. Negated statements are in blue. Categories are inspired by LoBue and Yates (2011). An expanded analysis is shown in the Appendix (Table 12).
Though Philby claimed publicly in January 1988 that he did not regret his decisions and that he missed nothing about England except some friends, Colman's mustard, and Lea & Perrins Worcestershire sauce...

Table 4 :
Model performance on CONDAQA. All heuristics are built on top of UnifiedQA-Large. Boldface indicates the best model on each metric for every training setup (Supervised, Few-Shot, Zero-Shot). Supervised models are trained and evaluated across five random seeds using the full train and test sets. Due to the cost of the OpenAI API, for few- and zero-shot models we report the average performance across five train-test splits. For more details and a description of the metrics, see §4. GPT-3 version: davinci; InstructGPT version: text-davinci-002.
general-purpose NLI datasets do not perform well, but finetuning with their dataset is sufficient to address this failure. In contrast to several of these works, we contribute training data and find that simply finetuning on these examples is not sufficient to address the challenges in CONDAQA. See Appendix §A for a detailed comparison.
In this work, we contribute CONDAQA, a dataset to facilitate the development of models that can process negation. Though CONDAQA currently represents the largest NLU dataset that evaluates a model's ability to process the implications of negated statements, it is possible to construct a larger dataset, with more examples spanning different answer types. Further, CONDAQA is an English dataset, and it would be interesting to extend our data collection procedures to build high-quality resources in non-English languages. Finally, while we attempt to extensively measure and control for artifacts in CONDAQA, it is possible that the dataset has hidden artifacts that we did not study.

Table 6 :
Performance of models trained to distinguish Wikipedia text from edits made by crowdworkers. We used BERT-base, averaged over three random seeds.

Table 8 :
Few-shot results of GPT-3 (davinci) on one split of the test data (1/5 of the entire test set, ∼1440 examples) using different strategies for sampling few shots. See §C.2 for descriptions of the sampling strategies.

Table 9 :
"InstructGPT" (text-davinci-002) performance on one split of the test data (1/5 of the entire test set, ∼1440 examples) with more and fewer examples in the context. The average number of shots that fit in 2045 tokens (davinci's max. input length) is 8-9, and 17-18 if the context is 4000 tokens (text-davinci-002's max. input length).
On October 8, 1883, the US patent office ruled that Edison's patent was based on the work of William E. Sawyer and was, therefore, invalid. Litigation continued for nearly six years. In 1885, Latimer switched camps and started working with Edison. Disraeli later romanticised his origins, claiming his father's family was of grand Iberian and Venetian descent; in fact Isaac's family was of no great distinction [...] Historians differ on Disraeli's motives for rewriting his family history: [...] Sarah Bradford believes "his dislike of the commonplace would not allow him to accept the facts of his birth as being as middle-class and undramatic as they really were". Libi told the interrogators details about Richard Reid, a British citizen who had joined al-Qaeda and trained to carry out a suicide bombing of an airliner, which he unsuccessfully attempted on December 22, 2001. [...]

Table 12 :
Examples of types of questions that target the implications of negated statements in CONDAQA, and the reasoning steps to correctly answer them. Negated statements are in blue. Relevant categories are derived from LoBue and Yates (2011) where appropriate.