SCENE: Self-Labeled Counterfactuals for Extrapolating to Negative Examples

Detecting negatives (such as non-entailment relationships, unanswerable questions, and false claims) is an important and challenging aspect of many natural language understanding tasks. Though manually collecting challenging negative examples can help models detect them, it is both costly and domain-specific. In this work, we propose Self-labeled Counterfactuals for Extrapolating to Negative Examples (SCENE), an automatic method for synthesizing training data that greatly improves models' ability to detect challenging negative examples. In contrast with standard data augmentation, which synthesizes new examples for existing labels, SCENE can synthesize negative examples zero-shot from only positive ones. Given a positive example, SCENE perturbs it with a mask infilling model, then determines whether the resulting example is negative based on a self-training heuristic. With access to only answerable training examples, SCENE can close 69.6% of the performance gap on SQuAD 2.0, a dataset where half of the evaluation examples are unanswerable, compared to a model trained on SQuAD 2.0. Our method also extends to boolean question answering and recognizing textual entailment, and improves generalization from SQuAD to ACE-whQA, an out-of-domain extractive QA benchmark.


Introduction
Many natural language understanding tasks require a model to distinguish claims that are supported by available evidence (i.e., positive instances) from ones that are not (i.e., negative instances). In question answering, unanswerable questions (negative) can be subtly different from answerable ones (positive): for instance, inserting an unmentioned entity into an answerable question can make it unanswerable. In recognizing textual entailment (RTE) or fact verification, a hypothesis or claim that is not entailed by a given premise (negative) can be very similar to one that is entailed (positive). Training models to understand these fine distinctions remains an important open problem (Kim et al., 2021; Asai and Choi, 2021).
Collecting human-written negative examples, as done by datasets like SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), and SQuAD 2.0 (Rajpurkar et al., 2018), can yield training data that helps models detect negatives. However, not only is this process expensive and time-consuming, but asking humans to write negative examples can also introduce dataset biases (Gururangan et al., 2018). Distant supervision (e.g., pairing questions with paragraphs that do not contain the answer to the question) can create negative examples at no additional annotation cost (Karpukhin et al., 2020; Lee et al., 2021), but the resulting examples are often simple for models. Training only on such examples does not prepare models for subtler negative examples that only differ in a small way from positive examples (e.g., see Table 1). Example (a) in Figure 1 provides a subtle unanswerable question by altering the phrase "resistant dormant structure" to "contagious strains." Such an edit keeps the question closely related to the context but changes its meaning so that, within the same context, the new question is no longer answerable.
We propose a new approach named SCENE (Self-labeled Counterfactuals for Extrapolating to Negative Examples) to automatically generate subtly negative examples given a dataset of only positive examples. We use these synthetic examples as training data to help a model recognize real negative examples zero-shot, thus avoiding the cost of collecting negative examples manually. Our approach first perturbs existing positive examples randomly using a mask-denoising model (BART), then uses self-training to dynamically label these perturbed examples as negative based on the model's current predictions (see Figure 1 for the full pipeline). To get self-training started, we also include negative examples generated by distant supervision to "warm-start" the model. Unlike previous work (Min et al., 2020; Ross et al., 2022; Wu et al., 2021; Howard et al., 2022) that uses perturbations or counterfactuals for data augmentation, we use augmentation for extrapolation: we generate an entirely new class of examples not seen in the original training set.
Synthetic negative examples generated by SCENE teach models to recognize real negative examples. For extractive question answering (QA), we can extrapolate from SQuAD 1.1 (Rajpurkar et al., 2016), a positive-only dataset containing no unanswerable questions, to SQuAD 2.0 (Rajpurkar et al., 2018), which contains human-written unanswerable questions: our approach closes 69.6% of the gap with respect to a model trained on the SQuAD 2.0 training set. Our method using SQuAD 1.1 even outperforms a SQuAD 2.0-trained model when evaluated on the out-of-domain test set ACE-whQA (Sulem et al., 2021). SCENE can also extrapolate from BoolQ (Clark et al., 2019), a dataset containing only Yes/No questions, to BoolQ-3L (Sulem et al., 2022), which also contains "unanswerable/IDK" examples, closing 89.3% of the gap with a model trained on BoolQ-3L. It further applies to RTE (Wang et al., 2018), where we start with only entailment examples and synthesize non-entailment examples, closing 56.1% of the gap with a model trained on the full training set. Overall, our results show that automatically generated perturbations can be useful not only for increasing the amount of training data, but also for enabling extrapolation to previously unseen types of examples.

Problem Setup
Extractive QA. We first describe our setup for extractive QA, then modify it for other tasks. Our training data D_Positive contains only positive examples in which each question is answerable, i.e., examples (q, p, y) where y is the answer to question q on passage p. To extrapolate to the new unanswerable class, we aim to synthesize unanswerable examples zero-shot, going beyond existing baselines that only generate simple unanswerable ones (see §3.1). We then use our proposed perturbation-based self-training procedure, SCENE, to generate more challenging unanswerable questions (see Figure 1 and §3.2).
We use RoBERTa-Base (Liu et al., 2019) as our baseline QA model with parameters Θ, denoted as f_Θ, where for any candidate answer y′, f_Θ(y′, q, p) = P(y′ | p, q; Θ). We sometimes drop the parameters Θ, or shorten the notation to f(q, p), to denote the probability output of the QA model.
We adopt the standard procedure for identifying unanswerable questions in extractive QA (Devlin et al., 2019): the model prediction f(q, p) gives probabilities for the answer's start and end positions over the entire input sequence, and position 0 always corresponds to the [CLS] token. If the model assigns its highest probability to the [CLS] position, we treat it as predicting unanswerable.
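For concreteness, the following minimal sketch shows this decoding rule. The span-scoring details (summing start and end logits and capping the span length) follow common practice for SQuAD-style models and are assumptions rather than details from the paper.

```python
import torch

def predict_answer(start_logits: torch.Tensor, end_logits: torch.Tensor, max_len: int = 30):
    """Decode an extractive-QA prediction, treating position 0 ([CLS]) as 'unanswerable'."""
    # Score of predicting "no answer": both start and end point at [CLS] (index 0).
    no_answer_score = start_logits[0] + end_logits[0]

    # Best answerable span (start > 0, end >= start, bounded length).
    best_span, best_score = None, float("-inf")
    for s in range(1, start_logits.size(0)):
        for e in range(s, min(s + max_len, end_logits.size(0))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_span, best_score = (s, e), score

    if best_span is None or no_answer_score > best_score:
        return None  # the model predicts unanswerable
    return best_span
```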
Other tasks. For tasks beyond extractive QA, we modify the notation and representations slightly. For boolean QA, the unanswerable (IDK) examples are represented as an additional class on top of the original binary (Yes/No) classes. For RTE, we represent the hypothesis-premise pair as (h, p), with label y ∈ {entailment, not entailment}.

Method
We first present our method for extractive QA, then show how it applies to a broader set of tasks in §3.5. Our synthetic data generation process consists of two steps: the first generates easy unanswerable question-passage pairs, and the second generates harder unanswerable examples through question perturbation and self-labeling.

Baselines for Generating Negatives
In §3.2, we will use our model to self-label perturbed questions as unanswerable. For this to work, we first need to introduce the concept of unanswerable questions to the model. Thus, we present two methods to create simple unanswerable examples.
The easiest way to synthesize unanswerable question-passage pairs is through random pairing. To do this efficiently, we randomly re-pair passages and questions within a batch. Given a batch of question-passage pairs {(q_1, p_1), ..., (q_m, p_m)}, we randomly sample an element σ from the permutation group S_m and reshuffle the pairs to be {(q_σ(1), p_1), (q_σ(2), p_2), ..., (q_σ(m), p_m)}. For every k, we label the pair (q_σ(k), p_k) as unanswerable if σ(k) ≠ k; otherwise, we discard the example, since it is already present in the original batch.
An example can be found in Table 7 of Appendix A. We use ℓ_Shuf to denote the cross-entropy loss on negative examples generated by shuffling.
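The shuffling step can be implemented in a few lines. The sketch below assumes a batch of (question, passage, answer) triples and uses illustrative names rather than the paper's actual code.

```python
import random

def shuffle_negatives(batch):
    """Create 'unanswerable' negatives by randomly re-pairing questions and passages within a batch.

    `batch` is a list of (question, passage, answer) triples.
    """
    m = len(batch)
    sigma = list(range(m))
    random.shuffle(sigma)  # a random permutation from S_m

    negatives = []
    for k in range(m):
        if sigma[k] == k:
            continue  # identical to an original pair; discard it
        question = batch[sigma[k]][0]
        passage = batch[k][1]
        negatives.append((question, passage, "NoAns"))
    return negatives
```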

Retrieval-Based Generation.
To create harder unanswerable examples, given a question q, we retrieve a passage with high word overlap from the pool P of all passages in the dataset, i.e., P = {p : (q, p, y) ∈ D_train}. Given an answerable example (q, p, y), we create an unanswerable example (q, R(q), NoAns), where R is the retrieval operator. In particular, R(q) returns the passage from P with the highest BM25 similarity (Robertson and Zaragoza, 2009) to the question q among those that do not contain the answer string y. We use ℓ_Retr to denote the cross-entropy loss on negative examples generated by retrieval.
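A sketch of retrieval-based negative generation is shown below. It uses the rank_bm25 package as a stand-in BM25 implementation (the paper does not specify one), and lowercased whitespace splitting is an assumed tokenization.

```python
from rank_bm25 import BM25Okapi

def build_retrieval_negatives(examples):
    """Pair each question with the highest-BM25 passage that does NOT contain its answer string.

    `examples` is a list of (question, passage, answer) triples from the positive-only data.
    """
    passages = [p for (_, p, _) in examples]
    bm25 = BM25Okapi([p.lower().split() for p in passages])

    negatives = []
    for question, _, answer in examples:
        scores = bm25.get_scores(question.lower().split())
        # Take the best-scoring passage that does not contain the answer string.
        for idx in sorted(range(len(passages)), key=lambda i: -scores[i]):
            if answer.lower() not in passages[idx].lower():
                negatives.append((question, passages[idx], "NoAns"))
                break
    return negatives
```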

Self-Labeled Counterfactuals (SCENE)
The previous methods generate negative examples that are too easy and thus do not teach the model to recognize more subtle negative examples (see evaluation in §4.2). To generate harder negative examples, we introduce Self-labeled Counterfactuals for Extrapolating to Negative Examples (SCENE). Given an example (q, p, y) from our positive-only dataset D_Positive, we use a generator G to synthesize a perturbed version G(q) of the question q. We then impute a label ỹ, which is often the unanswerable label, and train the QA model to output ỹ given the input (G(q), p). Prior work (Bartolo et al., 2021; Ross et al., 2022) synthesizes questions based on passages and can only generate more examples from the same distribution; our goal is to shift the distribution to include a novel unanswerable class.
Inspired by self-training and counterfactual data augmentation (Howard et al., 2022), our method synthesizes questions in three steps: Perturb, Filter, and Self-label. Figure 1 illustrates our method, and selected examples are shown in Table 1 and Table 8 of Appendix A.1.

Perturb.
We first randomly mask some tokens of the question q and use a mask-denoising model to fill in the masks. Specifically, the proportion α of words to be masked is randomly drawn from a Beta distribution, and we use BART-Large (Lewis et al., 2020) as the mask-denoising model.
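As an illustration, the perturbation step might look as follows with Hugging Face Transformers. The Beta(1, 3) parameters and the sampling settings are assumptions, since the exact values are not given here.

```python
import random
import numpy as np
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def perturb_question(question: str) -> str:
    """Mask a random fraction of the question's words and let BART-Large fill them in."""
    words = question.split()
    alpha = np.random.beta(1, 3)                    # fraction of words to mask (assumed parameters)
    n_mask = max(1, int(round(alpha * len(words))))
    for i in random.sample(range(len(words)), n_mask):
        words[i] = "<mask>"                         # BART's text-infilling token

    inputs = tokenizer(" ".join(words), return_tensors="pt")
    output_ids = bart.generate(**inputs, do_sample=True, top_p=0.95, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```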

Filter.
Before we self-label the perturbed questions and update our model on them, we first filter out perturbations for which the correct label is highly uncertain. Let δ_Pert(p, q, y) ∈ {0, 1} be an indicator variable for whether we use the perturbed pair (G(q), p) or filter it out. Given the perturbed question G(q), we compute the QA model prediction on the original question, ŷ = arg max_y′ f(y′, q, p), and the prediction on the perturbed question, ỹ = arg max_y′ f(y′, G(q), p). To better determine whether the prediction ỹ is likely to be correct, we adopt the idea of rejection sampling, using a paraphrase detection model Γ, a RoBERTa-Base model trained on QQP from GLUE (Wang et al., 2018), to help decide whether to filter out the example.
For a synthetic example (G(q), p), we discard it (i.e., set δ_Pert(p, q, y) = 0) if either of the following two cases occurs. (1) Ambiguity: Γ(G(q), q) = Paraphrase but the perturbation changes the model prediction, i.e., ŷ ≠ ỹ; the paraphrase detector and the QA model then reach contradictory conclusions, so we cannot self-label with confidence. (2) Bad prediction: the perturbation does not change the QA model prediction, but both predictions are wrong, i.e., ŷ = ỹ ≠ y; we do not self-label with wrong labels.

Self-Label.
If a synthetic example passes the filter, i.e., δ_Pert(p, q, y) = 1, we trust the QA model's decision by self-labeling the example (G(q), p) with label ỹ. A contradictory situation can still arise in which the QA model prediction stays the same (i.e., ŷ = ỹ) while the paraphrase detector predicts Γ(G(q), q) = NotParaphrase; such a perturbed question still passes the filter, because non-paraphrase questions can nonetheless have the same answer.
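Putting the Filter and Self-Label rules together, a minimal sketch of the decision logic is shown below. Here qa_predict and is_paraphrase are assumed helper callables standing in for the QA model and the paraphrase detector Γ; they are not APIs from the paper's code.

```python
def filter_and_self_label(q, perturbed_q, passage, gold_answer, qa_predict, is_paraphrase):
    """Decide whether to keep a perturbed question and which label to impute.

    `qa_predict(question, passage)` returns the QA model's argmax answer (None for unanswerable);
    `is_paraphrase(a, b)` returns the paraphrase detector's boolean verdict.
    Returns (keep, imputed_label).
    """
    y_hat = qa_predict(q, passage)              # prediction on the original question
    y_tilde = qa_predict(perturbed_q, passage)  # prediction on the perturbed question

    # Case (1): ambiguity -- the detector calls the questions paraphrases,
    # but the QA prediction changed; we cannot self-label with confidence.
    if is_paraphrase(perturbed_q, q) and y_hat != y_tilde:
        return False, None

    # Case (2): bad prediction -- the prediction did not change but is wrong
    # even on the original question; do not self-label with a wrong label.
    if y_hat == y_tilde and y_hat != gold_answer:
        return False, None

    # Otherwise keep the example, trusting the QA model's prediction as the label
    # (often "unanswerable", i.e. y_tilde is None).
    return True, y_tilde
```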
The batched objective for SCENE, ℓ_SCENE, is the cross-entropy loss ℓ(·, ·) summed over the accepted perturbed examples in a batch:

ℓ_SCENE = Σ_k δ_Pert(p_k, q_k, y_k) · ℓ( f_Θ(G(q_k), p_k), ỹ_k ).

Training
Putting everything together, we train on a weighted sum of the normal training objective on positive examples and the synthetic unanswerable objectives defined in the previous sections. The batched version of the weighted training objective is

ℓ = ℓ_Pos + λ_Shuf · ℓ_Shuf + λ_Retr · ℓ_Retr + λ_SCENE · ℓ_SCENE,    (2)

where ℓ_Pos is the cross-entropy loss on the original positive examples, and λ_Shuf, λ_Retr, and λ_SCENE denote the weights for their corresponding losses.
Note that we perform SCENE data augmentation for every batch during training using the current model parameters Θ; we do not use a frozen pre-trained model to statically perform SCENE augmentation once at the start of training. To ensure that the predictions ŷ and ỹ are mostly correct with respect to their questions, the initial weights of our QA model f_Θ are pre-trained on the positive-only dataset D_Positive. For warm-starting purposes (see §3.1), we set λ_SCENE = 0 for the first few steps. The rest of training follows the objective in Eq. 2. Throughout training, our method is never provided with any human-annotated negative examples, and therefore techniques such as early stopping and model selection are not used.
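A schematic training step might look as follows. The helper callables, the warm-up length, and the loss weights are illustrative assumptions rather than the paper's exact implementation.

```python
def train_step(step, batch, model, optimizer, qa_loss, retrieval_negatives, scene_augment,
               warmup_steps=1000, lam_retr=1.0, lam_scene=1.0):
    """One training step for the weighted objective in Eq. 2.

    SCENE examples are regenerated for every batch with the current model parameters;
    lambda_SCENE is held at 0 for the first `warmup_steps` steps (warm-starting).
    A shuffle-based loss term could be added analogously to the retrieval term.
    """
    lam = 0.0 if step < warmup_steps else lam_scene

    loss = qa_loss(model, batch)                                          # loss on positive examples
    loss = loss + lam_retr * qa_loss(model, retrieval_negatives(batch))   # l_Retr
    scene_examples = scene_augment(model, batch)                          # Perturb + Filter + Self-label
    if scene_examples:
        loss = loss + lam * qa_loss(model, scene_examples)                # l_SCENE

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```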

A Probabilistic Alternative
Instead of training with human-annotated negative examples, one can force the model to predict unanswerable by setting a probability threshold, i.e., the model predicts unanswerable if P(ŷ | p, q; Θ) < θ_threshold. However, one would need a dataset with negative examples to find the best θ_threshold. In our experiments, we compare SCENE with an oracle approach that uses the validation set of the target dataset to choose the best threshold.
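A sketch of this oracle threshold baseline is shown below; using accuracy on the validation set as the selection criterion is an assumption.

```python
import numpy as np

def pick_no_answer_threshold(answer_probs, is_answerable, grid=np.linspace(0.0, 1.0, 101)):
    """Oracle threshold baseline: predict 'unanswerable' when the top answer probability
    falls below theta, choosing theta on a validation set that does contain negatives.

    `answer_probs` are the model's top answer probabilities; `is_answerable` are gold flags.
    """
    best_theta, best_acc = 0.0, -1.0
    answer_probs = np.asarray(answer_probs)
    is_answerable = np.asarray(is_answerable, dtype=bool)
    for theta in grid:
        pred_answerable = answer_probs >= theta
        acc = np.mean(pred_answerable == is_answerable)
        if acc > best_acc:
            best_theta, best_acc = theta, acc
    return best_theta
```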

Beyond Extractive Question Answering
Our method also works beyond extractive QA tasks. For boolean QA, we extrapolate from BoolQ (Clark et al., 2019), a collection of question-passage pairs (q, p) with label y = Yes or No, to BoolQ-3L (Sulem et al., 2022), an extension of BoolQ with additional unanswerable (IDK) examples. We apply the SCENE pipeline (Perturb, Filter, and Self-label) together with Shuffle, and evaluate on the BoolQ-3L test set.
Because BoolQ-3L's IDK questions were produced using information retrieval, similar to what we did in §3.1, we refrain from using retrieval-based IDK questions during training.
Our method also works on binary RTE in a setting where we only have access to hypothesis-premise pairs (h, p) ∈ D_Positive with label y = entailment. For RTE, we modify the SCENE pipeline so that we perturb the premise p to G(p). Instead of the original self-labeling step, we label G(p) as ỹ = not entailment if the paraphrase detector predicts that G(p) is not a paraphrase of p. We note that, unlike for QA, SCENE can generate negative examples even without Shuffle or Retrieval-based negatives; the necessity of Shuffle for binary NLI tasks is ablated in our experiments (see Table 6). We do not experiment with retrieval for RTE because the dataset contains only a small pool of hypotheses to use for retrieval.
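A minimal sketch of the RTE variant is shown below. Keeping the entailment label when the detector judges the perturbed premise to be a paraphrase is one plausible reading of the description above, and the helper callables are assumed stand-ins for the BART perturber and the paraphrase detector.

```python
def rte_scene_example(premise, hypothesis, perturb, is_paraphrase):
    """RTE variant of SCENE: perturb the premise and label the new pair 'not entailment'
    when the paraphrase detector judges the perturbed premise not to be a paraphrase
    of the original premise."""
    new_premise = perturb(premise)
    if is_paraphrase(new_premise, premise):
        label = "entailment"       # assumed handling: keep the original positive label
    else:
        label = "not_entailment"   # self-labeled negative
    return (new_premise, hypothesis, label)
```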

Experiments
ACE-whQA (Sulem et al., 2021) is an out-of-domain (OOD) test set for extractive QA, derived from an event extraction corpus, ACE (Walker et al., 2006). Aside from being an OOD test set for SQuAD 2.0, ACE-whQA also distinguishes two types of unanswerable (IDK) questions: (1) competitive, where the passage includes an entity of the same type as the expected answer; and (2) non-competitive, where the passage does not include an entity of the same type. In the BoolQ-3L evaluation data, 33% of the examples are IDK examples. When evaluating on BoolQ-3L, the metric of interest is the classification accuracy across the three categories (Yes/No/IDK).

Extractive QA Results
As baseline and oracle models, we train normally on the positive-only training set D_Positive and on the full training set D_Full, respectively. We report averages over 3 independent training runs to mitigate the effects of randomness. For simplicity, we only let the λ's be zero or one, unless otherwise indicated. For comparison across test sets, we define the Performance Gap closed by a method as

Gap Closed = ( M(method) − M(baseline) ) / ( M(oracle) − M(baseline) ),

where M is any task-specific benchmark metric, the baseline is trained on D_Positive, and the oracle is trained on D_Full.
SCENE enables extrapolation to SQuAD 2.0. Table 2 shows results on the SQuAD 2.0 evaluation. Training on positive-only SQuAD 1.1, SCENE with Retrieval closes 69.6% of the performance gap relative to an oracle model trained on SQuAD 2.0.

SCENE improves out-of-domain generalization.
Table 3 shows results where we train on SQuAD-derived data and test on ACE-whQA. We focus on the overall average accuracy, where we average the accuracy on the "Has answer" and "No Answer" subsets of ACE-whQA. SCENE combined with Retrieval and the positive-only SQuAD 1.1 dataset achieves the highest accuracy, even outperforming the oracle model trained on SQuAD 2.0 (73.1 F1 vs. 62.3 F1). Adding SCENE examples to the SQuAD 2.0 training data improves ACE-whQA accuracy by 7.8 F1, suggesting that including SCENE examples improves the overall utility of SQuAD 2.0 as training data.
Qualitative results and statistics. SCENE can synthesize subtly unanswerable questions in many ways, such as inserting or replacing unmentioned entities, making lexical changes, and modifying tense. Table 1 and Table 8 in Appendix A.1 show selected synthetic unanswerable examples generated by the SCENE framework, whereas Table 9 in Appendix A.2 shows randomly sampled ones. In Table 10 of Appendix A.3, we adopt the categories of negative examples defined by Rajpurkar et al. (2018) and compare their frequency in SCENE with SQuAD 2.0. SCENE tends to create more questions that add an impossible condition, and fewer antonym substitutions or negations, since the BART model is not specifically tuned to produce such edits.
Ablation study. We conduct an ablation study on our best method, which combines Retrieval and self-labeled counterfactuals. We test several simplified versions of our filtering and relabeling pipeline:
1. Assume NoAns: We accept every perturbed example and relabel all of them as unanswerable, i.e., for all (p, q, y), δ_Pert(p, q, y) = 1 and ỹ = NoAns.
2. Assume NoAns w/o Retr: The same setup as above, but without Retrieval to generate simple unanswerable examples.
3. Only NoAns: We additionally filter out perturbed questions for which our imputed label is not unanswerable, i.e., ỹ ≠ NoAns. This tests whether the answerable questions generated by SCENE contribute to the performance of the method.
4. No Filter: We do not filter anything and simply perform self-training with all synthetic examples (G(q), p, ỹ).
The results of these ablations are shown in Table 4. Our full approach performs best both in-domain and out-of-domain. Accepting all perturbations as unanswerable is surprisingly competitive on SQuAD 2.0, but much worse on ACE-whQA, as it encourages the model to predict NoAns too often. Counter-intuitively, both "Assume NoAns w/o Retr" and "Only NoAns" reduce accuracy on unanswerable questions in SQuAD 2.0. One possible explanation is that if all perturbations are unanswerable, detecting unanswerables becomes the task of detecting perturbations; also including perturbed answerable questions reduces this spurious correlation. Finally, removing the paraphrase detector decreases accuracy by 1.9 F1 on SQuAD 2.0 and 5.0 F1 on ACE-whQA, showing that our filtering strategy does help reduce noise, though self-training alone without filtering is still effective. We conduct an additional ablation study on different combinations of the losses ℓ_Shuf, ℓ_Retr, and ℓ_SCENE, presented in Table 11 of Appendix A.4. The results suggest that our best model uses the combination of Retrieval and SCENE; performing Shuffle together with Retrieval can be redundant, since the unanswerable examples provided by Retrieval are a refined subset of those provided by Shuffle.

Boolean QA Results
SCENE enables extrapolation to BoolQ-3L. Table 5 shows results on the BoolQ-3L evaluation. Note that, to be consistent with our extractive QA experiments, we report the average of the accuracy on negative examples (IDK label) and positive examples (Yes and No labels). The best SCENE method achieves 78.1% accuracy, which closes 89.3% of the gap in BoolQ-3L accuracy between the baseline trained on BoolQ (38.9% accuracy) and the oracle method trained on BoolQ-3L (82.8% accuracy). Our best method combines Shuffle with self-labeled counterfactuals, and it is 0.6 points better than using Shuffle alone. Adding self-labeled counterfactuals to the BoolQ-3L training set slightly improves BoolQ-3L accuracy, by 0.4 points.

RTE Results
SCENE also extends to binary RTE, enabling extrapolation from the entailment-only subset to the full RTE dataset. Table 6 shows results on the RTE evaluation. The best SCENE method achieves 67.9% accuracy, which closes 56.1% of the gap in RTE accuracy between the baseline trained on the entailment-only subset (52.7% accuracy) and the oracle method trained on the full RTE dataset (79.8% accuracy).
Our best method combines Shuffle and self-labeled counterfactuals. Using Shuffle-based negative examples barely improves over the baseline trained on the entailment-only subset, while using self-labeled counterfactuals alone improves over the baseline substantially, by 11.6 points. Combining Shuffle and self-labeled counterfactuals improves accuracy by 13.7 and 3.6 points, respectively, compared to using each on its own.

Discussion and Related Work
Self-Training. Self-training is a semi-supervised learning approach that uses a teacher model (typically trained on labelled data) to assign labels to unlabelled data, which are then used to train a student model (Yarowsky, 1995; McClosky et al., 2006; Kumar et al., 2020; Wadden et al., 2022). In our work, we adopt a self-training scheme where the teacher model and the student model are the same. During training, the current model (teacher) is used to annotate new examples for the next batch (student), and it is used jointly with a paraphrase detector (as a rejection sampler) to reduce noise. SCENE also differs from standard self-training in that, rather than drawing from a pool of unlabelled examples, we generate them. Our baseline approach for generating negatives via in-batch shuffling and retrieval is similar in motivation to mining negatives for neural ranking models (Karpukhin et al., 2020).
Counterfactual Data Augmentation. Counterfactual data augmentation (Kaushik et al., 2020) has been studied as an approach to reduce model reliance on spurious correlations (Gardner et al., 2021) by creating minimal pairs that flip the label. Past work includes manual approaches such as Contrast Sets (Gardner et al., 2020), which ask expert annotators to minimally perturb examples, and heuristic approaches that rely on synonym-antonym sets to substitute words in an example (Wang and Culotta, 2021; Chen et al., 2021). Several approaches that leverage language models (LMs) to generate counterfactuals have been proposed: Polyjuice (Wu et al., 2021) and Tailor (Ross et al., 2022) train LMs to generate minimal-edit counterfactuals through specified control codes; LIT (Li et al., 2020) automatically generates contrast sets for NLI tasks based on linguistic rules but lacks flexibility; NeuroCounterfactuals (Howard et al., 2022) generates counterfactuals that are relevant but not restricted to minimal edits by leveraging constrained decoding from task-finetuned LMs. Despite their success in perturbing text, it is unclear how to use these methods to introduce distribution shift and extrapolate to unseen label sets (in our case, negative classes). Our approach of using masked language models for text infilling followed by filtering is inspired by recent developments in open set classification (Xu et al., 2022).

Synthetic Question Generation
Many approaches have been proposed to synthesize questions given a passage and an answer within it (Du et al., 2017; Du and Cardie, 2018; Zhao et al., 2018; Lewis and Fan, 2019; Alberti et al., 2019; Nie et al., 2020; Bartolo et al., 2021; Lewis et al., 2021). These methods are designed to generate answerable question-answer pairs and do not directly apply to our setting, where we must synthesize challenging unanswerable questions. Pan et al. (2021) use a question generator as part of their fact verification pipeline. Their question generator uses surrounding context from the source documents of extractive QA examples to generate unanswerable questions; this may be a viable strategy when such external sources are available for the task.

Conclusion
In this work, we developed SCENE, a counterfactual generation pipeline for synthesizing negative examples from real positive ones. Counterfactual generation is performed via text perturbation, filtering, and relabeling. Seeing only real positive examples and synthetic negative ones, our method demonstrates strong extrapolation capabilities, closing 69.6% of the performance gap on extractive QA, 89.3% on boolean QA, and 56.1% on RTE, compared with models trained on both real positive and real negative examples.
In the future, we hope to combine our automatic pipeline with human annotation effort. For example, we could use adversarial data collection to collect a small number of perturbations that create challenging negative examples, and use these to guide the generation of self-labeled counterfactuals. We would also like to explore ways to backpropagate signal from the filtering process into the generator, so that the generator learns to produce useful perturbations. Overall, we hope that our work can inspire further research on how synthetic data can enable new types of model extrapolation.

Limitations
Though our method reduces the cost of human annotation, its computational cost is higher than training with human-annotated datasets, for multiple reasons (see Appendix B.3). Adding synthetic examples increases the time required for one epoch of training. Moreover, SCENE examples are generated with a BART model and filtered by a paraphrase detector during training, both of which add computational overhead.
We only validated SCENE on extractive QA, boolean QA, and RTE; whether it can be applied to other tasks that require detecting challenging negatives is unknown. SCENE is also limited to extrapolating to one pre-defined new class, negative examples; whether it can be used to extrapolate to other types of classes is also unknown.

Figure 1: The Self-labeled Counterfactuals for Extrapolating to Negative Examples (SCENE) pipeline: (1) question perturbation using a mask in-filling model (BART); (2) paraphrase detection on perturbed questions; (3) QA model prediction for perturbed questions (3a) and the original question (3b); and (4) answer relabeling or filtering based on (2) and (3) (details in §3.2). Accepted new examples (generated via black arrows) are used for training in the same way as the original examples (gray arrows).

Table 2:
Evaluation on SQuAD 2.0. Training on positive-only SQuAD 1.1, SCENE with Retrieval can close 69.6% of the gap compared with an oracle model trained on SQuAD 2.0. We report the mean and standard deviation (in parentheses) over 3 runs. Best scores among all methods that train on SQuAD 1.1 are highlighted in bold.

Table 3:
OOD evaluation on ACE-whQA. Trained only on SQuAD 1.1, SCENE with Retrieval even outperforms the oracle model trained on SQuAD 2.0. Applying SCENE also improves SQuAD 2.0-trained models out-of-domain. We report the mean and standard deviation (in parentheses) over 3 runs. Best scores among all methods that train on SQuAD 1.1 are highlighted in bold.


Table 4:
Ablation study on different filtering and labeling strategies for synthetic samples. The full SCENE pipeline achieves the best performance both in-domain and out-of-domain.

Table 5:
Evaluation on BoolQ-3L. Trained on BoolQ alone, SCENE closes 89.3% of the performance gap relative to an oracle model trained on BoolQ-3L. Best scores among all methods that train on BoolQ are highlighted in bold.

Table 6:
Evaluation on RTE. Trained on the entailment-only subset alone, SCENE closes 56.1% of the performance gap relative to an oracle model trained on the full RTE dataset. Best scores among all methods that train on the entailment-only subset are highlighted in bold.