Synthetic Dataset for Evaluating Complex Compositional Knowledge for Natural Language Inference

We introduce a synthetic dataset called Sentences Involving Complex Compositional Knowledge (SICCK) and a novel analysis that investigates the ability of Natural Language Inference (NLI) models to understand compositionality in logic. We produce 1,304 sentence pairs by modifying 15 examples from the SICK dataset (Marelli et al., 2014). To this end, we modify the original texts using a set of phrase modifiers that correspond to universal quantifiers, existential quantifiers, negation, and other concept modifiers in Natural Logic (NL) (MacCartney, 2009). We use these phrases to modify the subject, verb, and object parts of the premise and hypothesis. Lastly, we annotate these modified texts with the corresponding entailment labels following NL rules. We conduct a preliminary verification of how well the change in structural and semantic composition is captured by neural NLI models, in both zero-shot and fine-tuned scenarios. We found that the performance of NLI models under the zero-shot setting is poor, especially for modified sentences with negation and existential quantifiers. After fine-tuning on this dataset, we observe that models continue to perform poorly over negation, existential, and universal modifiers.

In this work, we analyze how well transformer networks trained for NLI understand the atomic reasoning blocks defined in NL, and how well they can compose them to detect changes in monotonicity (Richardson et al., 2020; Joshi et al., 2020). To this end, we create a dataset containing 1,304 sentences by modifying 15 premise/hypothesis pairs from the SICK dataset (Marelli et al., 2014). The dataset is generated by modifying the selected premise and hypothesis sentences as follows: • We append a series of modifiers to the subjects/verbs/objects in the premise/hypothesis pairs. These modifiers include universal quantifiers (e.g., every, always), existential quantifiers (e.g., some, at least), negation, and adverbs/adjectives (e.g., happy, sad). Table 2 lists the complete set of modifiers used.
• We store the adjusted entailment label for each modified pair to understand the shift in meaning caused by word-level changes within sentential contexts. More formally, we used the seven entailment relations defined in (MacCartney, 2009). These labels were generated manually for each example by following monotonicity calculus and natural logic. For example, consider the premise: an old man is sitting in a field and the hypothesis: a man is sitting in a field, with the original SICK label: Forward Entailment. After adding the universal quantifier every to the aforementioned SICK example, the premise: an old man is sitting in a field and the modified hypothesis: every man is sitting in a field are annotated with the adjusted label: Reverse Entailment.
Using this dataset, we analyzed the capacity of three different NLI methods to correctly capture the change in entailment given the modified texts. In particular, the contributions of this work are as follows: 1. We propose a mechanism to generate synthetic data for NLI that enforces compositionality in reasoning. Following this mechanism, we produce 1,304 examples from 15 SICK (Marelli et al., 2014) premise/hypothesis sentence pairs by modifying the sentences for subject, verb, and object, respectively, with a series of modifiers. The resulting dataset is freely available at https://github.com/clulab/releases/tree/sushma/acl2023-nlrse-sicck.
2. We define specific annotation guidelines based on monotonicity calculus and natural logic (MacCartney, 2009) for annotating the modified premise and hypothesis sentences in the dataset above. The resulting labels are included in the dataset.
3. We conducted an analysis to understand how well these structural and compositional changes are captured by neural NLI models, in both zero-shot and fine-tuned scenarios.
Our analysis indicates that NLI models perform poorly over negation and several types of quantifiers. Fine-tuned NLI models do not show significant improvement in learning about compositional changes when compared to their zero-shot equivalents over our dataset. This suggests that compositionality in reasoning remains a challenge for neural models of language.

Related Work
Natural Logic (NL) is a formal reasoning approach that makes use of the syntactic structure and semantic properties of lexical items to understand compositionality (MacCartney, 2009). Logical reasoning is a known challenge for neural NLI models (Ravichander et al., 2019). In particular, NLI models struggle to understand quantifiers, which is highlighted by the fact that these models do not generalize well over quantifier-driven inference tasks (Haruta et al., 2020). The monotonicity calculus over quantifiers with token-level polarity has been explored using a CCG parser over the SICK dataset (Marelli et al., 2014) to generate a synthetic dataset that supports compositional data augmentation and monotonicity calculus (Hu et al., 2019). Other recent research focused on language structures to highlight the importance of compositionality, i.e., the premise and hypothesis differ only in the order of the words, or in the presence of antonyms, synonyms, or negation (Dasgupta et al., 2018). Such data augmentation can help move closer to a compositional encoding of language (Dasgupta et al., 2018). Our work extends this direction: our dataset captures both phrasal changes (e.g., synonyms, hypernyms), which we inherit from the SICK dataset (Marelli et al., 2014), as well as multiple types of modifiers that are critical for NLI such as universal quantifiers, existential quantifiers, negation, and adjectives/adverbs. The FraCas test suite (Cooper et al., 1996) contains 346 examples that explore aspects of natural logic applied to NLI (MacCartney, 2009). The HELP dataset (Yanaka et al., 2019b) modifies phrases in premise/hypothesis sentences based on monotonicity reasoning from combinatory categorial grammar (Steedman and Baldridge, 2011) and semantic tagging (Abzianidze and Bos, 2017). As mentioned above, our work is complementary to such datasets, as we cover other types of text modifications. The MED dataset (Yanaka et al., 2019a) is another manually-labeled dataset where hypotheses were also modified by the human labelers given the monotonicity information for the premises. Similarly, we manually labeled NLI information, but our work focuses mainly on compositional information in a sentential context.
Enhancing a dataset with data augmentation is another recent method to test the generalizability of NLI models (Jha et al., 2020). Lexical entailment acquired from the distributional behavior of word pairs (Geffet and Dagan, 2005) led to the subsequent work of Bowman et al. (2015a), who produced a 3-way classification NLI dataset that serves as a benchmark for evaluating natural language understanding. Using Natural Logic as a means to learn and reason about semantic and lexical relations is a common method used to improve the reasoning capabilities of NLI models (Bowman et al., 2015c).
The NLI_XY dataset (Rozanova et al., 2021) supports a structural investigation of transformer-based NLI models. In particular, the authors investigate how monotonicity (upwards or downwards) changes when the premises and hypotheses are modified through the insertion of hypernym/hyponym phrases. This work is complementary to ours: while they focus on monotonicity in lexicalization (e.g., changing from a hypernym to a hyponym), we focus on changes in monotonicity due to explicit modifiers applied on top of such lexical modifications.
The MonaLog system (Hu et al., 2019) introduces a simple yet explainable NLI method that relies on a simplified Natural Logic implementation. The proposed method operates by implementing monotonicity calculus over CCG syntactic trees using "a small inventory of monotonicity facts about quantifiers, lexical items and token-level polarity." Despite its simplicity, the authors report excellent performance on the SICK dataset. More closely related to our work, they use MonaLog to generate additional training data for NLI from the generated proofs.

Dataset
We introduce a synthetic dataset to facilitate the analysis of compositionality in logic. The dataset contains 1,304 sentence pairs that were created by modifying 15 examples from the SICK dataset (Marelli et al., 2014) with a variety of modifiers. To this end, we used a set of phrases that correspond to universal quantifiers, existential quantifiers, negation, and other concept modifiers in Natural Logic (NL) (MacCartney, 2009). These modifiers were applied to syntactic constructs in both the premise and the hypothesis, and the entailment labels were adjusted accordingly, as detailed below.

Overview
At a high level, our dataset creation followed these steps: 1. We start with 15 seed pairs of premise and hypothesis sentences from SICK. Table 1 shows the seed sentence pairs.
2. We syntactically analyze these sentences to identify their subject-verb-object (SVO) structures. Each of the SVO elements is then modified using a subset of the applicable modifiers listed in Table 2. This process is detailed in Section 3.2.

Sentence Modification Strategy
For each premise and hypothesis sentence pair, we modified the individual subject, verb, and object phrases with the following approach: 1. To modify subjects, we used the Berkeley Neural Parser to extract the left-most noun phrases (NPs). We then appended the applicable modifiers from Table 2. In particular, we used universal quantifiers, existential quantifiers, negations, and adjectives.
2. To modify verbs, we used the Berkeley Neural Parser to extract the right-most verb phrases (VPs) from the parse tree and appended the applicable modifiers. Verbs were modified using universal quantifiers (always, never), negations (not, never), and adverbs (abnormally, elegantly).
3. To detect objects, we used the syntactic dependency parser of Vacareanu et al. (2020) to identify noun phrases attached to the main verb. Similarly to the subject modifications, these objects were modified using universal quantifiers, existential quantifiers, negations, and adjectives.
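The phrase-extraction step above can be illustrated with a toy recursive search over a bracketed constituency tree. This is only a stand-in for the actual Berkeley Neural Parser output; the `(label, children)` tree encoding and the helper name are ours.

```python
def leftmost_np(tree):
    """Return the left-most NP subtree from a nested (label, children)
    constituency tree, mimicking how the subject phrase is located.
    A toy stand-in for a real constituency parser's output."""
    label, children = tree
    if label == "NP":
        return tree
    for child in children:
        if isinstance(child, tuple):  # skip leaf tokens (plain strings)
            found = leftmost_np(child)
            if found is not None:
                return found
    return None

# (S (NP an old man) (VP is sitting (PP in a field)))
tree = ("S", [("NP", ["an", "old", "man"]),
              ("VP", ["is", "sitting", ("PP", ["in", "a", "field"])])])
print(leftmost_np(tree))  # ('NP', ['an', 'old', 'man'])
```

A right-most VP search for the verb case would mirror this by iterating children in reverse.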
After modifying each of the premise and hypothesis sentences, we generate multiple new data points as follows: we apply f(m, SVO, P_i) and f(m, SVO, H_i), where m ∈ M, the set of all modifiers; SVO denotes the subject/verb/object phrase of either one of the two sentences; and P_i, H_i are the premise and hypothesis from sentence pair S_i ∈ S, where S is the set of 15 examples from SICK. Here, f is the function that modifies a given premise or hypothesis following one of the modification strategies described above. For each modification we generate the following pairs of premise and hypothesis sentences: (f(P_i), H_i), (P_i, f(H_i)), and (f(P_i), f(H_i)). We repeat this process to modify each of the relevant sentence phrases individually, as well as two combinations: subject, verb, object, subject + object, and verb + object.
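A minimal sketch of this generation step, assuming string-level phrase replacement in place of the actual parser-driven modification; the modifier lists and helper names are illustrative, not the paper's implementation.

```python
# Illustrative subsets of the modifier inventory in Table 2.
MODIFIERS = {
    "universal": ["every", "always"],
    "existential": ["some", "at least one"],
    "negation": ["not", "no"],
    "adjective": ["happy", "sad"],
}

def modify(sentence: str, phrase: str, modifier: str) -> str:
    """Prepend `modifier` to the first occurrence of `phrase` in `sentence`.
    The real pipeline locates phrases with a syntactic parser; string
    matching stands in for that here."""
    return sentence.replace(phrase, f"{modifier} {phrase}", 1)

def generate_pairs(premise, hypothesis, p_phrase, h_phrase, modifier):
    """For one modifier, emit the three pairings used in the dataset:
    modified premise only, modified hypothesis only, and both."""
    p_mod = modify(premise, p_phrase, modifier)
    h_mod = modify(hypothesis, h_phrase, modifier)
    return [(p_mod, hypothesis), (premise, h_mod), (p_mod, h_mod)]

pairs = generate_pairs(
    "an old man is sitting in a field",
    "a man is sitting in a field",
    p_phrase="an old man", h_phrase="a man",
    modifier="not",
)
```

Iterating this over all modifiers, all SVO targets, and the two phrase combinations yields the full set of generated examples.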

Entailment Annotation Strategy
To annotate our dataset, we created a set of annotation guidelines that follow Natural Logic and monotonicity calculus (MacCartney, 2009).
In general, to produce entailment relations we used a set-theoretic approach to understand how the set of concepts that holds true in the premise overlaps with the set described in the hypothesis. To implement this set-theoretic approach consistently, we defined quantitative interpretations for several of the more ambiguous modifiers, such as all but one, all, and not every, as follows: 1. For the modifier all, we consider the size of the set of elements X to be greater than 0: |X| > 0. For example, in the case of the phrase all children, we consider the size of the set of children to be greater than 0.

2. For the all but one modifier, we consider the size of all to be N and the size of all but one to be N − 1. Note that the size of all but one could thus theoretically be 0, when N = 1.

3. For not every, we consider the size of the corresponding set X to be 0 or larger: |X| ≥ 0, where X is any set defined over the sentence. For example, for not every man, X is the set of all men, but there exist zero or more men that would not be included in this set.
4. When we cannot determine the size of the intersection of the premise and hypothesis sets, we resolved the annotation to the Neutral label among all 7 entailment relations.
5. When comparing quantifiers between modified premise and hypothesis sentence pairs, we denote the sizes of the sets mathematically for P ∪ H, P ∩ H, and the universal set. For example, consider the premise: every turtle is following the fish and the hypothesis: every fish is following the turtle. The set over the premise is P: all turtles following the fish, and the set over the hypothesis is H: all fishes following the turtle. Thus, P ∩ H = ∅. In this case, the label is Negation (see Table 3). Table 4 includes more examples with the corresponding entailment labels.
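The set-theoretic reasoning behind these guidelines can be sketched as a small decision procedure over sets of situations in which the premise and hypothesis hold. This is a simplified, illustrative encoding of MacCartney's seven relations, not the annotators' full procedure; the function name and set encoding are ours.

```python
def entailment_relation(P: set, H: set, universe: set) -> str:
    """Assign one of MacCartney's seven relations by comparing the sets
    of situations in which the premise (P) and hypothesis (H) hold.
    A simplified, illustrative decision procedure, not the full calculus."""
    if P == H:
        return "Equivalence"
    if P < H:                      # P strictly contained in H
        return "Forward Entailment"
    if P > H:                      # H strictly contained in P
        return "Reverse Entailment"
    if not (P & H):
        # Disjoint sets: Negation if jointly exhaustive, else Alternation.
        return "Negation" if P | H == universe else "Alternation"
    if P | H == universe:
        return "Cover"
    return "Independence"  # resolved to Neutral when sizes are undetermined

# Turtle/fish example above: the two sets are disjoint and exhaustive here.
universe = set(range(10))
print(entailment_relation(set(range(5)), set(range(5, 10)), universe))  # Negation
```

The guideline for undetermined intersections (item 4) corresponds to the final Independence/Neutral case.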
A total of 1,304 modified premise and hypothesis sentence pairs, along with the original sentence pairs, were included in the final SICCK dataset. The data was annotated by 5 annotators who were distributed between two sub-groups, based on the complexity of the labels. After the first two rounds of annotation, we re-grouped to develop concrete annotation guidelines, without defining overly strict rules, leaving room for more natural "if-this-then-that" deductions. Disagreements between annotators were resolved by verifying the sizes of the sets mathematically over X ∪ Y and X ∩ Y, following the entailment relations defined in (MacCartney, 2009). While in the initial round the inter-annotator agreement was low (κ < 0.4), the annotations were revised until each group of annotators converged.

Evaluation
We conducted an evaluation of how NLI methods capture the explicit compositionality in our dataset using two configurations: a zero-shot setting, in which we used NLI systems trained externally, and a fine-tuned setting, in which the same models were fine-tuned using our dataset.

Zero-shot Analysis of NLI Models
For this analysis, we evaluate three pretrained neural entailment models on our dataset. However, all these systems emit just the three "traditional" entailment labels (Forward Entailment, Contradiction, and Neutral), whereas our dataset contains the seven labels from NL. To align these label spaces, we performed the following transformations: 1. In case a system produces a Neutral label, we run the prediction in the opposite direction, i.e., from hypothesis to premise. If the top label in the reversed direction is Forward Entailment (FE), we label the pair as Reverse Entailment. Otherwise, we keep the Neutral label. This heuristic allows these systems to produce four labels instead of three.
2. We convert our seven labels to four labels through the following heuristics: (a) Equivalence was removed since we had only one sentence pair labeled as Equivalence in our dataset; (b) Alternation is merged with Negation; (c) Cover and Independence become Neutral; and (d) the 7 examples that were annotated as Cover|FE were removed.
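These two alignment heuristics can be sketched as follows. The `predict` interface and the dictionary below are our illustrative reading of the setup, not the exact implementation; in particular, mapping Negation onto the 3-way Contradiction label is our interpretation of how the compressed 4-label space in the tables is reached.

```python
# Mapping from the seven NL relations to the compressed 4-label space
# (Equivalence and Cover|FE examples are removed before evaluation).
SEVEN_TO_FOUR = {
    "Forward Entailment": "Forward Entailment",
    "Reverse Entailment": "Reverse Entailment",
    "Negation": "Contradiction",
    "Alternation": "Contradiction",  # merged with Negation
    "Cover": "Neutral",
    "Independence": "Neutral",
}

def predict_four_way(predict, premise, hypothesis):
    """Extend a 3-label NLI model (`predict` returns Forward Entailment,
    Contradiction, or Neutral) to four labels: when it predicts Neutral,
    re-run it hypothesis-to-premise and emit Reverse Entailment if the
    reversed direction yields Forward Entailment."""
    label = predict(premise, hypothesis)
    if label == "Neutral" and predict(hypothesis, premise) == "Forward Entailment":
        return "Reverse Entailment"
    return label

def toy_predict(p, h):
    # Hypothetical stand-in for a real NLI model.
    return "Forward Entailment" if (p, h) == ("a man runs", "a person runs") else "Neutral"

print(predict_four_way(toy_predict, "a person runs", "a man runs"))  # Reverse Entailment
```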
We conducted a zero-shot evaluation using three NLI models: the cross-encoder model of Reimers and Gurevych (2019) (nli-deberta-v3-base in our tables), the adversarial NLI model of Nie et al. (2020) (ynie/roberta-large-. . .), and the ELMo-based Decomposable Attention model (Parikh et al., 2016) (pair-classification-. . .). We draw the following observations from this experiment: • Table 8 indicates that the ELMo-based NLI model performs considerably worse than the two transformer-based models. This is a testament to how far our field has progressed in just a few years. However, no model approaches 70 F1 points, which indicates that none of these models truly understands the task well.
• The NLI models do better over adjectives and adverbs, but they struggle to understand statements modified with universal and existential quantifiers, and negation. Tables 8-14 indicate that the transformer-based NLI models perform at over 70 F1 points on adjectives/adverbs, at over 65 F1 for universal quantifiers, at approximately 60 F1 for existential quantifiers, and at only 30-35 F1 for negation. This is a surprising finding considering how much attention negation has received in the NLP literature (Pang et al., 2002; Hossain et al., 2020, 2022).
• Lastly, Tables 15-17 indicate that NLI models process objects best, followed by subjects and, lastly, verbs. This is not surprising considering the increased semantic ambiguity of verbs.

Analysis of Fine-tuned NLI models
To understand if NLI methods are capable of learning this compositional information, we fine-tuned on the SICCK dataset the two NLI models that performed better in the zero-shot setting. To maximize the data available, we implemented a 5-fold cross-validation evaluation over the entire SICCK dataset and experimented with multiple hyperparameters. In particular, we used 4 or 8 epochs, and batch sizes of 8, 16, or 32 data points. The results of these experiments are summarized in Table 9. We draw the following observations from this experiment: • The difference in F1 scores between the fine-tuned systems and the corresponding zero-shot setting ranges from −0.19 to 0.2. This indicates that these systems do not acquire substantial new knowledge despite the fact that they have been exposed to approximately 1,300 sentences with compositional information. This suggests that understanding compositionality is harder than expected.
• Similar to the zero-shot setting, NLI models did better over adjectives and adverbs, and relatively better over existential quantifiers than over negation and universal quantifiers. We also observed that models seem to be confused when the annotated label was Neutral but the modifier type was negation.
• NLI models perform somewhat better over subject- and object-modified examples than over examples with modified verbs. This indicates that the semantic ambiguity of verbs likely impacts NLI models.
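The cross-validation protocol above can be sketched as follows. The fold construction is pure Python; the fine-tuning call itself depends on the model library and is omitted, so the function and variable names here are illustrative.

```python
from itertools import product

def five_fold_indices(n_examples, k=5):
    """Yield (train, test) index lists; each fold serves once as the
    held-out test set, with the remaining folds used for fine-tuning."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, test

# Hyperparameter grid described above: epochs x batch size.
GRID = list(product([4, 8], [8, 16, 32]))

# One split per fold over the 1,304 SICCK examples.
splits = list(five_fold_indices(1304))
```

Each (epochs, batch size) pair in `GRID` would then be evaluated over all five splits, averaging the resulting F1 scores.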

Error Analysis
We analyze the incorrect predictions of the NLI models over the SICCK dataset in this section. We observed that NLI models performed better over adjectives and adverbs, and relatively well over universal quantifiers, in comparison to sentences modified with negation and existential quantifiers, under both fine-tuned and zero-shot settings.
We also observed that models seem to be confused when the adjusted label was Neutral and the modifier type was negation.

Neutral Labels with Negation Modifiers
Negation understanding in natural language has been a challenging problem (Pang et al., 2002; Hossain et al., 2022). Hossain et al. (2022) discuss that negation is underrepresented in natural language corpora. Further, Hossain et al. (2020) show that even when transformers are fine-tuned on premises modified with negation (i.e., verb modifiers with negation), they struggle with inference over negated sentence pairs. In our SICCK dataset, there are 167 examples with negation modifiers.

Conclusion
This paper introduced a new, synthetic dataset that facilitates analyses of how NLI models capture compositionality. The dataset contains 1,304 sentence pairs that were created by modifying 15 examples from the SICK dataset (Marelli et al., 2014) with a variety of modifiers that correspond to universal quantifiers, existential quantifiers, negation, and other concept modifiers in Natural Logic (NL) (MacCartney, 2009). We used these phrases to modify the subject, verb, and object parts of the premise and hypothesis. Lastly, we annotated these modified texts with the corresponding entailment labels following NL rules. We conducted a preliminary analysis of how well the change in structural and semantic composition is captured and detected by neural NLI models, in both zero-shot and fine-tuned scenarios. We found that the performance of NLI models is poor in both settings, especially for modified sentences with negation and existential quantifiers, and when verbs are modified.

Limitations
While this work explores the impact of typical compositional modifiers on entailment relations, we did not consider other fine-grained information that further captures upward or downward monotonicity from the monotonicity calculus of the premise/hypothesis sentence pairs. Further, the dataset that we generated is relatively small, at approximately 1,300 sentences. We also did not evaluate the dataset over T5, BART, GPT-x, and other state-of-the-art LLMs, which may provide more insights. We also did not conduct any evaluation of explanations and interpretations of the evaluated NLI models, which could be future work. Lastly, we did not include a comparison with existing datasets that were created specifically for negation modifiers and universal & existential quantifiers. We see all these issues as exciting avenues for future work.

Table 1 :
The 15 sentence pairs from the SICK dataset (Marelli et al., 2014) and the corresponding NLI labels that form the seed of our dataset. The bold text highlights the lexically-driven compositional change in the premise and hypothesis sentences.

Table 2 :
List of modifiers used to modify the subject, verb, and object elements of sentences. They are applied to each of the premise and hypothesis sentences in Table 1.

Table 4 :
Premise, hypothesis examples where one or both of the premise and hypothesis were modified.The text in bold indicates the change from the original text.The SVO column indicates the part of the sentence that was modified: subject, verb, or object (SVO).The Modifier type indicates which type of modifier was used to modify the parts of sentences.The label is the Entailment relation annotated by the annotators over modified data.

Table 5 :
Sentence counts in SICCK based on types of modifiers.

Table 6 :
Sentence counts in SICCK based on which syntactic structures are modified.

Table 7 :
Label counts in SICCK. Note that Negation|Alternation indicates ambiguous labels where the two annotators did not converge.

Table 8 :
Overall scores for the three pretrained NLI models under the zero-shot setting, based on the compressed 4-entailment relations: Forward Entailment, Reverse Entailment, Contradiction, and Neutral.

Table 9 :
Overall scores for the two fine-tuned NLI models on the SICCK dataset, based on the compressed 4-entailment relations: Forward Entailment, Reverse Entailment, Contradiction, and Neutral. We repeated these experiments 5 times with different random seeds; we report averages and standard deviations, for 4 and 8 epochs and batch sizes of 8, 16, and 32 data points.

Table 10 :
For all our SICCK dataset's 167 examples with negation modifiers, this table includes counts of all the modified subject, verb, and object parts of sentences respectively for each of the 4-Entailment adjusted labels from SICCK annotations.The last column indicates how many of these data points have the Neutral label.
Table 10 shows some statistics relevant to this. Of these 167 examples with negation modifiers, there are 118 Neutral examples. We observed that the nli-deberta-v3-base model incorrectly predicted approximately 70% of these examples, while the other NLI model (ynie/roberta-large-snli-mnli-fever-anli-R1-R2-R3-nli) incorrectly predicted 23% of them. For the incorrectly predicted negation-modified examples with Neutral labels, the models seemed to be confused across the various compositional cases, i.e., subject-, verb-, and object-modified examples, almost equally. Modifiers such as no, not every, and not with Neutral and Contradiction labels seem to contribute to the confusion. SICCK examples also alternate modifiers between the premise, the hypothesis, or both. For verb modifiers, we selected abnormally, elegantly, always, and never. Our SICCK dataset has a total of 220 verb-modified examples, of which 89 have universal modifiers, 90 adverbs/adjectives, and 41 negation. Among the 31 verb-modified examples with negation modifiers and a Neutral label, NLI models incorrectly alternate between Contradiction and FE for 99% of the examples. Of the 49 examples with universal modifiers over verbs and Neutral labels, approximately 69.4% were incorrectly predicted. This further emphasizes that negation (especially when occurring in Neutral examples) remains a challenge.