TINA: Textual Inference with Negation Augmentation

Transformer-based language models achieve state-of-the-art results on several natural language processing tasks. One of these is textual entailment, i.e., the task of determining whether a premise logically entails a hypothesis. However, the models perform poorly on this task when the examples contain negations. In this paper, we propose a new definition of textual entailment that also captures negation. This allows us to develop TINA (Textual Inference with Negation Augmentation), a principled technique for negated data augmentation that can be combined with the unlikelihood loss function. Our experiments with different transformer-based models show that our method can significantly improve the performance of the models on textual entailment datasets with negation, without sacrificing performance on datasets without negation.


1 Introduction
Textual entailment (TE, also called Natural Language Inference) is the task of recognizing whether one natural language sentence (the premise) semantically entails another one (the hypothesis). For example, the premise "I live in Paris" entails the hypothesis "I live in France". TE is at the heart of natural language understanding, as it is closely related to question answering and natural language reasoning (Dagan et al., 2005; Poliak, 2020). Nowadays, state-of-the-art performance in TE is achieved by transformer-based models such as BERT (Devlin et al., 2019).
However, transformer-based models can easily be derailed by trap words or syntactic variations (see, e.g., Helwe et al. (2021) for a survey). In particular, such models have difficulties with negation in textual entailment (Hossain et al., 2020; Hosseini et al., 2021). Here is an example from Hossain et al. (2020)'s dataset:

Premise: Green cards are not becoming more difficult to obtain.
Hypothesis: Green card is now difficult to receive.
BERT Prediction: Entailment
Label: Not Entailment

In this paper, we provide a principled analysis of negation in textual entailment. In particular, we propose a probabilistic definition of entailment that can also capture negation. This allows us to develop TINA (Textual Inference with Negation Augmentation), an approach to automatically augment TE training datasets with negated instances. TINA uses logical deduction to generate new negated training examples from existing ones. For example, we can derive that "I don't live in France" entails "I don't live in Paris". We then show that models finetuned on our augmented datasets are more resilient to negation, especially when combined with the unlikelihood loss. At the same time, the finetuned models perform just as well on datasets without negation. The contributions of our paper are as follows:
• a novel probabilistic definition of entailment that also considers negation;
• provably correct rules to derive new entailment relationships;
• a method to automatically augment TE datasets using these derivations;
• experiments showing that models finetuned on the augmented datasets are more resilient to negation in TE.
The rest of the paper is organized as follows. In Section 2, we review the related work. Section 3 describes TINA, our approach to defining textual entailment and to making transformer-based models robust to negation in textual entailment. In Section 4, we evaluate our approach on different datasets. We conclude in Section 5 and list limitations of our approach afterwards. Appendix A contains the proofs of correctness, Appendix B contains the hyperparameters used in our experiments, Appendix C shows a graphical representation of our evaluation results, and Appendix D contains a supplementary table of derivations. All data and code are available on GitHub.
2 Related Work

Negation in Language Models
Transformer-based models such as BERT (Devlin et al., 2019) achieve state-of-the-art results on a broad range of NLP tasks, including machine translation, named entity recognition, and recognizing textual entailment. However, one of the pitfalls for such models is negation (Ettinger, 2020; Helwe et al., 2021; Kassner and Schütze, 2020). As shown by Kassner and Schütze (2020) and Ettinger (2020), a pretrained BERT-based model cannot differentiate between affirmative and negative statements. In addition, Niven and Kao (2019) have found that a finetuned BERT relies on simple cue words such as "not", and can thus be misled. To the best of our knowledge, the only attempt to improve the robustness of language models to negation is BERTNOT (Hosseini et al., 2021), a BERT-based model that adopts an unlikelihood objective function during training for the task of language modeling in order to learn to differentiate between affirmative and negative sentences.

Data Augmentation
Data augmentation is a technique to automatically create new instances in order to increase the size of a training dataset. It can mitigate problems of low-resource languages, class imbalance, and bias in datasets. Data augmentation techniques can be categorized into rule-based approaches, model-based approaches, and example interpolation (Feng et al., 2021). We are interested here in the rule-based category, which uses predefined rules to generate new instances (Hariharan and Girshick, 2017; Schwartz et al., 2018; Paschali et al., 2019; Wei and Zou, 2019; Xie et al., 2020; Şahin and Steedman, 2018; Wang et al., 2022). Our approach is inspired by the work of Wang et al. (2022), which uses logical rules for data augmentation. We go further by logically deriving new rules for data augmentation, and by combining the data augmentation with the unlikelihood loss for finetuning transformer-based models.

Textual Entailment
Textual entailment is a task that was created to evaluate the "understanding capabilities" of NLP systems. The goal of this task is to determine if a hypothesis can be inferred from a premise (Dagan et al., 2005; Poliak, 2020). Different textual entailment datasets have been proposed. The most popular ones are SNLI (Stanford Natural Language Inference) (Bowman et al., 2015), MNLI (Multi-Genre Natural Language Inference) (Williams et al., 2018), and Pascal RTE (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007, 2008; Bentivogli et al., 2009).
SNLI is a large human-annotated corpus consisting of over 550K premise-hypothesis pairs that are labeled with one of the following classes: entailment, contradiction, and neutral. The premises of this dataset are image captions from Flickr30k, while its hypotheses were generated by human annotators. Here is an example from the SNLI dataset:

Premise: A smiling costumed woman is holding an umbrella.
Hypothesis: A happy woman in a fairy costume holds an umbrella.
Label: Neutral

MNLI is a large dataset of around 433K instances that are labeled in the same way as SNLI. However, unlike SNLI, MNLI covers different text genres such as fiction, telephone speech, and letters, and has longer instances. It also contains a large portion of less grammatical text.

The state of the art achieves an accuracy of around 92-95% on these datasets. The best models are EFL (Wang et al., 2021) for SNLI, T5-11B (Raffel et al., 2020) for MNLI, and Google's Pathways Language Model (PaLM) (Chowdhery et al., 2022) for RTE.

Negated Textual Entailment
The good performance of language models on textual entailment datasets raises the question of whether this good performance persists in the presence of negation (Hossain et al., 2020, 2022). Negation is generally underrepresented in TE datasets (Hossain et al., 2020), with 7.16% of SNLI's sentences containing a negation, 22.63% in MNLI, and 1.19% in RTE. Therefore, Hossain et al. (2020) created new benchmarks by taking instances from SNLI, MNLI, and RTE and introducing a negation. They showed that language models perform poorly on these datasets. Hosseini et al. (2021) introduced the previously mentioned BERTNOT model to improve performance. In our work, we will show how that performance can be improved even further by using a principled way to augment the training datasets.
3 Our Approach: TINA

TINA (Textual Inference with Negation Augmentation) is our proposed approach to building a language model that is robust to negation in textual entailment tasks. Our main idea is to finetune transformer-based models on a textual entailment dataset that has been augmented with negated instances. For this purpose, let us first revisit the definition of entailment.

Defining Entailment
We say that a text fragment A entails a text fragment B (written A ▷ B) if, typically, a human reading A would infer that B is most likely true (Dagan et al., 2005). Here, A is called the premise and B is called the hypothesis. For our purposes, we need a more formal definition of entailment, i.e., a definition in mathematical terms that matches the intuitive one.
Entailment cannot be modeled as a material implication A ⇒ B, for two reasons. First, a material implication A ⇒ B is true whenever B is true. Thus, "It rains" would entail "Paris is in France", which is not the usual understanding of entailment. Propositional logic knows no satisfying way to avoid this. We could write A ▷ B := (A ⇒ B) ∧ (¬A ⇒ ¬B); but that is just equivalent to A ⇔ B, which is not what entailment means. The second problem with defining entailment as a logical implication is that it does not allow for exceptions. For example, "I obtained a university diploma" entails "I have a university diploma", even if diplomas can be withdrawn in rare cases of fraud. Propositional logic has no means to say that an implication holds "usually" or "in the majority of cases".
Therefore, previous work (Glickman et al., 2005) has proposed a probabilistic definition of entailment. In what follows, we assume a probabilistic universe Ω and two events, the premise A and the hypothesis B. Glickman et al. (2005) then define:

Definition 3.1 (Entailment (Glickman et al., 2005)). A ▷G B :⇔ P(B|A) > P(B).
This definition says that A entails B if A increases the probability of B. Unfortunately, this definition has several problems. First, it is symmetric: we show in Proposition A.1 in the appendix that (A ▷G B) ⇔ (B ▷G A). For example, "I live in Paris" ▷G "I live in France", because the probability of living in France increases to 100% once we know that the person lives in Paris. However, knowing that someone lives in France also increases the probability that this person lives in Paris (from one in several million cities in the world to one in several thousand cities in France). Therefore, "I live in France" ▷G "I live in Paris", which is not our common understanding of entailment.
The second problem with Definition 3.1 is that A ▷G B holds even if A increases the probability of B only marginally. For example, "I play in the lottery" ▷G "I win the lottery", because the probability of winning the lottery increases by playing in the lottery. Again, this is not our usual understanding of entailment.
Therefore, we propose to add the condition P(B|A) > θ, where θ is a threshold for the acceptance of an entailment (say, 90%). Thus, our definition becomes A ▷θ B :⇔ P(B|A) > P(B) ∧ P(B|A) > θ. This also makes the definition asymmetric, thus solving both the first and the second problem.
However, the definition is still vulnerable to a third problem: it may get carried away by hypotheses B with a high baseline probability. For example, most people survive the yearly flu season. Washing your hands further decreases the risk of catching the flu (and thus increases the probability of survival). Hence "Alice washes her hands this Monday" ▷θ "Alice survives this year's flu season". This is because (1) washing hands indeed increases the probability of survival, and (2) the probability of surviving is already larger than θ (for θ = 90%). However, we would not say that the entailment holds. To guard against such cases, we propose to add another condition, P(¬A|¬B) > θ.
Our definition is thus:

Definition 3.2 (Entailment). A ▷ B :⇔ P(B|A) > P(B) ∧ P(B|A) > θ ∧ P(¬A|¬B) > θ.

We write A ̸▷ B to say that A does not entail B. We can then use our notion of entailment to define contradiction and neutrality.

Deriving New Instances
We can now use our definition of entailment to derive new premise-hypothesis pairs from a given pair. In what follows, let us denote the negation of a sentence A by ¬A. Formally, ¬A := Ω − A. For example, the negation of "I live in Paris" is "I don't live in Paris". The negation of natural language sentences is a research topic on its own. For example, the negation of Noam Chomsky's famous nonsensical sentence "Colorless green ideas sleep furiously" is not "Colorless green ideas do not sleep furiously", as both are nonsensical. We refer the reader to Horn (1989); Löbner (2000); Penka (2015); and Homer et al. (2019) for a discussion. Here, we assume that both the premise and the hypothesis of a textual entailment instance are simple sentences that can be negated. Now assume that we have A ▷ B. Then Definition 3.2 allows us to formally derive ¬B ▷ ¬A (Proposition A.2 in the appendix). For example, "I live in Paris" ▷ "I live in France", and hence "I don't live in France" ▷ "I don't live in Paris". This type of reasoning is known as Modus Tollens.
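As an informal sanity check, Modus Tollens can be verified directly against the three probabilistic conditions. The sketch below is not the formal proof of the appendix; it assumes that the definition of entailment requires P(B|A) > P(B), P(B|A) > θ, and P(¬A|¬B) > θ:

```latex
% Sketch: A entails B implies (not B) entails (not A).
% Assume (i) P(B|A) > P(B), (ii) P(B|A) > \theta, (iii) P(\neg A|\neg B) > \theta.
% For \neg B entailing \neg A we need:
%   (i')   P(\neg A|\neg B) > P(\neg A)
%   (ii')  P(\neg A|\neg B) > \theta              -- this is exactly (iii)
%   (iii') P(\neg\neg B|\neg\neg A) = P(B|A) > \theta  -- this is exactly (ii)
% For (i'): condition (i) says P(A \cap B) > P(A)P(B), i.e., A and B are
% positively correlated, and complementation preserves positive correlation:
\begin{align*}
P(\neg A \cap \neg B) &= 1 - P(A) - P(B) + P(A \cap B) \\
                      &> 1 - P(A) - P(B) + P(A)\,P(B) \\
                      &= P(\neg A)\,P(\neg B),
\end{align*}
% hence P(\neg A|\neg B) > P(\neg A), which establishes Modus Tollens.
```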
Table 1 shows other ways to derive new instances from a given instance, together with references to their proofs.A particularly interesting result is that ▶ is symmetric, i.e., (A ▶ B) ⇔ (B ▶ A).
Some of the derivations in Table 1 give us a label that an instance cannot have, rather than telling us which label it must have. We call such a label a rejected label. For example, an instance with the label A ▷ B (entailment) generates a new instance with the rejected label ¬A ̸▷ B (non-entailment, i.e., ¬A does not entail B). This means that the true label cannot be entailment, and that it has to be either neutral or contradiction.
We are interested in entailments that logically follow from A ▷ B, from A ▶ B, from A ⊸ B, and from A ̸▷ B, as these are the labels that common textual entailment datasets use: MNLI and SNLI use the first three labels, while RTE uses the first and the last label. While Table 1 shows all derivations that must hold, Table 8 (in the appendix) shows all other hypothetical derivations, and proves them wrong. We can thus use Table 1 to derive, for a given labeled instance, new labeled instances. Most of these contain negation.
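The derivation step can be sketched as a small generator. The code below is illustrative only, not our actual pipeline: `negate` is a toy placeholder (our implementation uses the rule-based negation tool of Hosseini et al. (2021)), and only two of the rules are shown — Modus Tollens, which yields an accepted label, and the rejected-label rule ¬A ̸▷ B:

```python
# Illustrative sketch of deriving new labeled instances from an
# entailment pair. Not the paper's implementation.

def negate(sentence: str) -> str:
    """Toy placeholder: real systems rewrite the sentence grammatically."""
    return "It is not the case that " + sentence

def derive(premise: str, hypothesis: str):
    """Given an entailment instance A |> B, emit derived instances.

    Each result is (premise, hypothesis, label, accepted):
    - Modus Tollens gives an accepted label: not-B entails not-A.
    - The rejected-label rule says not-A cannot entail B, so
      "entailment" is emitted as a rejected label (accepted=False):
      the true label must be neutral or contradiction.
    """
    return [
        (negate(hypothesis), negate(premise), "entailment", True),
        (negate(premise), hypothesis, "entailment", False),
    ]

derived = derive("I live in Paris", "I live in France")
```

The accepted/rejected flag in the last position is what later feeds the indicator v_n of the unlikelihood loss.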

Unlikelihood Loss
The previous step has given us a way to derive new labeled instances, with either accepted or rejected labels. For the rejected labels, we want to penalize the likelihood of a language model predicting the rejected label. For this purpose, we use the unlikelihood loss. This loss has been used in many tasks, including language modeling (Hosseini et al., 2021; Noji and Takamura, 2020) and text generation (Welleck et al., 2019). In our case, the loss is defined as:

L = − Σ_{n=1}^{N} [ v_n · log p_{n,y_n} + (1 − v_n) · log(1 − p_{n,y_n}) ]

Here, n runs over all N instances of the dataset. For each instance n and label y, p_{n,y} is the score that the model assigns to the label y for the instance n. To each n we associate a ground truth label y_n, and we know whether this label is accepted or rejected. To distinguish these two cases, v_n is an indicator that takes the value 1 if the label is accepted, and the value 0 if the label is rejected.
Our loss is thus the sum of the cross-entropy loss of the accepted labels and the unlikelihood loss of the rejected labels.
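As an illustration, this combined objective can be sketched in plain Python over precomputed label scores (this is not our training code, which operates on model logits):

```python
import math

def tina_loss(scores, labels, accepted):
    """Combined objective: cross-entropy for accepted labels,
    unlikelihood for rejected ones.

    scores:   one dict per instance, mapping label -> probability p_{n,y}
    labels:   the ground-truth label y_n of each instance
    accepted: the indicator v_n (True = accepted, False = rejected)
    """
    total = 0.0
    for p, y, v in zip(scores, labels, accepted):
        if v:
            total += -math.log(p[y])        # cross-entropy: push p_{n,y_n} up
        else:
            total += -math.log(1.0 - p[y])  # unlikelihood: push p_{n,y_n} down
    return total
```

Note that a rejected label is penalized whenever the model assigns it high probability, without committing to which of the remaining labels is correct.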

Dataset Augmentation
To augment a textual entailment dataset with negated instances, we consider all instances one by one. We first check whether the instance consists of a grammatically correct single-sentence premise and a single-sentence hypothesis. To that end, we use DistilBERT (Sanh et al., 2019), a model that was finetuned on the Corpus of Linguistic Acceptability (CoLA) dataset (Warstadt et al., 2019). If the instance does not pass this test, we skip it. Otherwise, we check whether we can negate both the premise and the hypothesis of the instance. We use the method developed by Hosseini et al. (2021) for this purpose, a rule-based approach with predefined rules written in Semgrex (Chambers et al., 2007).
The method takes as input a sentence with part-of-speech (POS) tags, the dependency parse, and the morphological features of the words, and it produces as output a negated sentence. We used Stanza (Qi et al., 2020) to obtain the POS tags, the dependency parse, and the morphological features. Here is an example: "The man is somewhere near the parade" ; "The man is nowhere near the parade".
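To give a flavor of rule-based negation, a toy rule that merely inserts "not" after the first auxiliary could look as follows. This is a deliberately minimal sketch: the actual tool of Hosseini et al. (2021) relies on Semgrex patterns over parses and handles many more phenomena, including the polarity shift "somewhere" ; "nowhere" in the example above, which the sketch does not:

```python
import re

# A single toy rule: insert "not" after the first auxiliary or copula.
AUX = r"\b(is|are|was|were|can|could|will|would|does|do|did|has|have|had)\b"

def negate_simple(sentence: str) -> str:
    m = re.search(AUX, sentence)
    if m is None:
        # No rule applies; a real system has many more rules
        # (do-support, negative polarity items, morphology, ...).
        return sentence
    return sentence[: m.end()] + " not" + sentence[m.end():]
```

On the example above, this yields "The man is not somewhere near the parade" rather than the more natural "nowhere" variant, which illustrates why a full rule set over parses is needed.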
If both the premise and the hypothesis can be negated, we derive possible new instances as per Table 1. We illustrate this data augmentation process with an instance from SNLI:

Premise: The two boys are in martial arts poses in an outside basketball court.
Hypothesis: The two boys are outdoors.

4 Experiments
We conducted several experiments to investigate the robustness of models trained with our data augmentation technique, TINA, for the task of textual entailment with negation.

Settings
Datasets. We use the most common datasets for textual entailment, namely Stanford Natural Language Inference (SNLI) (Bowman et al., 2015), Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2018), and Pascal RTE (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007, 2008; Bentivogli et al., 2009). We also evaluate on the negated benchmarks of Hossain et al. (2020), in which three new pairs were generated from each instance by adding the negation "not":
• adding a negation to the premise and keeping the original hypothesis,
• adding a negation just to the hypothesis and keeping the original premise,
• adding a negation to both the premise and the hypothesis.
Finally, for each dataset, we generate an augmented variant Aug by our methodology from Section 3. We made sure that the generated instances are not in the negated benchmarks. Table 2 shows the number of instances from the training set of each dataset that were negated before deriving new instances. Table 3 shows the sizes of the datasets.
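The construction of the three negated pairs can be sketched as follows (illustrative only; `negate` stands for whatever sentence-negation function is used):

```python
def negated_benchmark_variants(premise, hypothesis, negate):
    """The three negated pairs generated from one instance:
    negate the premise, the hypothesis, or both."""
    return [
        (negate(premise), hypothesis),          # negated premise only
        (premise, negate(hypothesis)),          # negated hypothesis only
        (negate(premise), negate(hypothesis)),  # both negated
    ]
```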
Models. We want to see whether TINA makes transformer-based models more robust to negation in textual entailment. Our experiments cover BERT (Devlin et al., 2019), BART (Base), and GPT-2. For BERT, we adopt the values of Hossain et al. (2020) for the number of epochs, batch size, learning rate, and weight decay. We recall them in Table 6. However, unlike the original work, we set the maximum sequence length to 512 instead of 128. For BART (Base) and GPT-2, we split the training dataset 90/10 into training and validation sets. We evaluated on each test set with the best-performing models based on the validation set. We carried out a basic hyperparameter search and report the resulting hyperparameters in Table 7. All models were trained on an NVIDIA A100 GPU with 40GB of memory.
Competitors. The only other approach that specifically targets negation in textual entailment is BERTNOT (Hosseini et al., 2021). It was trained to model negation in the MLM task, and then finetuned on each TE training set. For reference, we also show the performance of a T5-Base model. This model is very powerful, as it was pretrained on a mixture of NLP tasks that include textual entailment, coreference resolution, linguistic acceptability, and semantic equivalence.

Results
Table 4 shows the performance of TINA applied to different transformer-based models, averaged over 3 runs. TINA− is a variant of TINA that does not generate instances with rejected labels. We show, for each model, how the performance changes when TINA− and TINA are used. We compute a binomial confidence interval for each result (at significance level α = 0.05), based on the total number of instances and the number of correctly predicted labels.
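Such an interval can be sketched with the normal approximation (a sketch of one standard construction; the paper does not specify which binomial interval variant was used):

```python
import math

def binomial_ci(correct, total, z=1.96):
    """Normal-approximation binomial confidence interval for an
    accuracy estimate (z = 1.96 gives a 95% interval)."""
    p = correct / total
    half = z * math.sqrt(p * (1.0 - p) / total)
    return max(0.0, p - half), min(1.0, p + half)
```

As expected, the interval shrinks with the number of instances, which is why the intervals on the small RTE Dev set are much wider than on SNLI or MNLI.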
The main outcome is that, on the negated datasets, TINA− always improves the results, and TINA improves them even more. At the same time, the augmentation techniques do not significantly lower the results on the original datasets. This is true across all models.
On the SNLI dataset, the improvement in performance is considerable, with gains of up to 20 percentage points, depending on the model. On MNLI, the gains are smaller. We assume that this is because MNLI contains many ungrammatical sentences, and also because it already contains some proportion of negated training examples. Nevertheless, the gains of TINA are still significant. On RTE, TINA and TINA− are identical, as the dataset only has two labels (entailment and non-entailment). The confidence intervals on RTE Dev are much larger, because the dataset is much smaller. Nevertheless, the gains on the negated dataset are significant, and can reach up to 21 percentage points, depending on the model.
For reference, we also show the performance of an off-the-shelf pretrained T5-Base model. It performs very well, and most notably outperforms our competitor BERTNOT significantly on the negated datasets. We assume that this is because it was pretrained on a large mixture of NLP tasks. Nevertheless, our method comes close to T5 on RTE, and outperforms the T5 model on SNLI and MNLI.
Most importantly, however, our approach serves its purpose, in that it increases the performance of transformer-based models on negated textual entailment by a large margin, across different models and all datasets. With this, our approach improves over the current state of the art (Hosseini et al., 2021).

Qualitative Analysis
To better understand the performance of TINA, we manually checked a sample of sentences from each augmented dataset. For SNLI, we find that the sentences are simple. The sentences of MNLI and RTE are more complex; here, conjunctions are often negated only in their first conjunct, as in "The motion did not set waves of nausea running through him, but he could see the doctor". The same goes for adjectives and prepositions that take a role akin to a conjunction, as in "despite concerns about the drinking water". Verbs of assertion are negated, but not the assertion itself: "The actor was outside a movie theater in central London's Leicester Square, London's Metropolitan Police said" ; "The actor was outside a movie theater in central London's Leicester Square, London's Metropolitan Police did not say". In this case, the negation does not work as intended, as the main verb merely states the source of the assertion. In other cases, the main verb may indeed be the intended target of the negation. Negation errors occur at times with Hosseini et al. (2021)'s tool, as e.g. in "cannot not do" and "has did not given". Our filtering step with DistilBERT (Sanh et al., 2019) was apparently insufficient to remove the ungrammatical sentences. For the conjunctions, we found that the erroneous negation is mostly harmless: if a conjunction is negated only in its first conjunct, that might still be the conjunct that is relevant for the entailment. The same goes for verbs of assertion: the entailment may sometimes target the fact of asserting something (in which case the negation works correctly). Negation errors, too, may be harmless: while they can disturb a human reader, they may still yield useful signals for a machine learning model.
The negation of sentences thus remains a challenge in practice. It is, however, largely orthogonal to our contribution of creating negated training examples for textual entailment. We are thus hopeful that an improvement of these tools will confer even higher performance gains to TINA.

5 Conclusion
In this paper, we have studied the problem of negation in textual entailment in detail. We have argued that the previous formal definition of textual entailment is problematic, and we have proposed a new probabilistic definition. Based on this definition, we have proposed TINA, a principled negated data augmentation technique. TINA can be combined with the unlikelihood loss to improve the robustness of language models to negation in textual entailment tasks. Our experimental results across different negated textual entailment benchmarks show that our method can significantly increase the performance of different transformer-based models. Future work can explore how different loss functions, such as a contrastive loss, could be used with our augmented datasets.

Acknowledgements. This work was partially funded by ANR-20-CHIA-0012-01 ("NoRDF").

Limitations
One limitation of our approach is that it presupposes premise-hypothesis pairs that consist of simple, negatable sentences. We already filter out sentences that do not conform, but many cases of incorrect negations remain (Section 4.3). The correct negation of sentences thus remains an open challenge.
Our probabilistic definition of entailment can also be further scrutinized. While we believe that it filters out most counter-intuitive entailments, it may still be possible to come up with counter-intuitive examples that fulfill our definition. It is even possible that this cannot be avoided at all, as the textual entailment task itself suffers from a degree of vagueness.
Finally, our method focuses purely on the generation of training instances. However, it may be possible that specialized models (one for negated instances and one for affirmative instances) lead to better results.

B Hyperparameters
Tables 6 and 7 show the hyperparameters that we used in our experiments (Section 4).

C Figures
Figure 1 shows a graphical illustration of the performances in Table 4.

D Supplementary Derivations
Table 8 presents all hypothetical derivations that are not in Table 1. We show for each of them that they do not hold. This is done either by a counterexample, by reducing them to another derivation that does not hold, or by showing that they contradict other true derivations. As before, we use the notations from Table 5.

Figure 1: Evaluation of different finetuning methods applied to different transformer-based models on the negated textual entailment datasets. Accuracies are averaged across 3 runs.

Table 1: Rules for deriving textual entailment instances. The propositions and their proofs are in Appendix A.

Table 3: Number of instances in each dataset.

Table 4: Results of our approach applied to different language models on different textual entailment datasets. Accuracies are averaged across 3 runs. Significant changes have a gray background.

Table 7: BART and GPT-2 hyperparameter configurations.