Exploring Transitivity in Neural NLI Models through Veridicality

Despite the recent success of deep neural networks in natural language processing, the extent to which they can demonstrate human-like generalization capacities for natural language understanding remains unclear. We explore this issue in the domain of natural language inference (NLI), focusing on the transitivity of inference relations, a fundamental property for systematically drawing inferences. A model capturing transitivity can compose basic inference patterns and draw new inferences. We introduce an analysis method using synthetic and naturalistic NLI datasets involving clause-embedding verbs to evaluate whether models can perform transitivity inferences composed of veridical inferences and arbitrary inference types. We find that current NLI models do not perform consistently well on transitivity inference tasks, suggesting that they lack the generalization capacity for drawing composite inferences from provided training examples. The data and code for our analysis are publicly available at https://github.com/verypluming/transitivity.


Introduction
Deep neural networks (DNNs) have shown impressive performance in many natural language processing tasks. In particular, DNN models pretrained with large-scale data such as BERT (Devlin et al., 2019) have achieved high accuracy in various benchmark tasks (Wang et al., 2019a,b), which suggests that they might possess some generalization capacities that are a hallmark of human cognition. However, recent analyses (Talmor and Berant, 2019; Liu et al., 2019; McCoy et al., 2019) have shown that high accuracy on a test set drawn from the same distribution as the training set does not always indicate that the model has obtained the intended ability, so it remains unclear to what extent DNN models can learn systematic generalization in natural language from training instances.
Central to human-like generalization capacities is the fact that the ability to understand a given sentence is related to the ability to understand other sentences, a property Fodor and Pylyshyn (1988) call the systematicity of human cognition. Thus, if speakers understand the meaning of the sentence Ann loves Bob, they must also understand the meaning of structurally related sentences such as Bob loves Ann. We explore whether DNN models possess this type of generalization capacity in the domain of natural language inference (NLI), the task of judging whether a premise entails a hypothesis (Dagan et al., 2013; Bowman et al., 2015a).
A key property underlying systematicity of drawing inferences is the transitivity of inference relations, illustrated in Figure 1. Schematically, if a model learns a basic inference pattern from A to B and one from B to C, it should be able to compose the two patterns to draw a new inference from A to C. If a model lacks this generalization capacity, it must memorize an exponential number of inference combinations independently of basic patterns.
Among the various inference patterns, we focus on transitivity inferences that combine veridical inferences with other types. For veridical inferences, one must distinguish two entailment types. For example, the verb know is called veridical in that "x knows that P " entails that P is true, while the verb hope is called non-veridical since "x hopes that P " does not entail that P is true. Veridical inferences can relatively easily compose transitivity inferences at scale by embedding various inference types under clause-embedding verbs. For instance, as Figure 1 shows, if a model has the ability to perform both Boolean inference and veridical inference, it should be able to combine the two to make a chained inference.
Such transitivity inferences are by no means trivial. For instance, if the premise is changed to Jo knows that Ann or Bob left, it does not follow that Bob left, even though the veridical verb know appears. Models relying on shallow heuristics such as lexical overlap can wrongly predict entailment in this case. To correctly handle such composite inferences, models must capture structural relations between veridical inferences and various kinds of embedded inference.
Previous studies on the generalization capacities of NLI models have addressed how models could learn inferences with various challenging linguistic phenomena (Bowman et al., 2015b; Dasgupta et al., 2018; Geiger et al., 2019, 2020; Yanaka et al., 2019a,b; Richardson et al., 2020). However, these studies have focused on linguistic phenomena in isolation, and thus do not address how a model could learn the interactions between them. Our aim is to fill this gap by presenting a method for probing the generalization capacity of DNN models on transitivity inferences.
This study provides three main contributions. First, we create and publicly release two types of NLI datasets for testing model ability to perform transitivity inferences: a fully synthetic dataset that combines veridical inferences and Boolean inferences, and a naturalistic dataset that combines veridical inferences with lexical and structural inferences. Second, we use these datasets to systematically expose models to basic inference patterns and test them on a variety of combinations, demonstrating that the models lack the ability to capture transitivity of inference. Third, we investigate whether data augmentation with new combination patterns helps models to learn transitivity. Experiments show that data augmentation improves model performance on similar combinations, regardless of the existence of basic inference patterns in the training set. These results suggest there is much room for improving the generalization capacities of DNN models for combining basic inferential abilities.

Related Work
Transitivity The transitivity of entailment relations, which derives A → C from A → B and B → C, is incorporated into logic-based NLI systems using automated theorem proving (Abzianidze, 2015; Mineshima et al., 2015). This is a basic property of formal logic, also known as syllogism in traditional logic or the cut rule in proof theory (Troelstra and Schwichtenberg, 2000; van Dalen, 2013). Transitivity inference in its various forms has also been widely studied as a fundamental property of human reasoning in cognitive psychology (Johnson-Laird and Byrne, 1991; Khemlani and Johnson-Laird, 2012). In the context of NLP, previous work has proposed methods for training models with transitivity constraints in multi-hop reasoning tasks (Asai and Hajishirzi, 2020) and temporal relation extraction tasks (Ning et al., 2017). Clark et al. (2020) investigated a transformer's ability to perform a chain of reasoning where reasoning rules are explicitly given. In this work, we study model ability to learn transitivity of entailment relations from training examples, rather than explicitly providing rules.

Systematicity There has been extensive discussion of whether neural networks (aka connectionist models) can exhibit systematicity of cognitive capacities (Fodor and Pylyshyn, 1988; Marcus, 2003). Recent works have explored whether modern neural networks can learn systematicity in semantic parsing tasks (Lake and Baroni, 2017; Baroni, 2020; Kim and Linzen, 2020) and question answering tasks (Sinha et al., 2019), whereas our focus is systematicity in NLI.
In work related to systematicity in NLI, Goodwin et al. (2020), Yanaka et al. (2020), and Geiger et al. (2020) used a manually constructed NLI dataset of monotonicity inferences with and without negation (e.g., The child is not holding plants → The child is not holding flowers) to examine DNN models' generalization capacities. While these approaches concentrate on monotonicity inferences involving quantifiers and negative expressions, our method using veridical inference is general in that it can be applied to any entailment relation that combines basic inference patterns; we generate composite inferences by embedding various types of sentences under clause-embedding verbs.

Fodor and Pylyshyn (1988) distinguished systematicity (roughly, the ability to understand sentences that are structurally related to each other) from productivity (the ability to understand an infinite set of sentences), claiming that systematicity poses a serious challenge to neural network models. Yanaka et al. (2020) tested both systematicity and productivity of DNN models with a synthetic dataset of monotonicity inferences for upward (e.g., some, at least three) and downward (e.g., few, at most three) quantifiers, where handling productivity (recursion) makes sentences more involved (e.g., iterated relative clauses and negation). Focusing on systematicity rather than productivity allows testing models with more natural and less complicated data, as compared to the sentences appearing in monotonicity inferences.
Veridicality Veridical inferences, including those licensed by factive and implicative verbs, have been intensively studied in the literature of semantics and pragmatics (Karttunen and Peters, 1979;Beaver, 2001). Recent work has revealed graded and context-sensitive aspects of veridicality inferences, creating veridicality judgement datasets (de Marneffe et al., 2012;White and Rawlins, 2018;White et al., 2018). While we use only a subset of veridical predicates discussed in the literature, our method can be extended to more complex inferences, such as factive presupposition.
Ross and Pavlick (2019) presented a naturalistic veridicality dataset and compared the predictions of a BERT-based NLI model and human judgements. These previous studies on veridicality inferences have tended to focus on relations between whole sentences (e.g., Jo remembered that there was a wild deer jumping a fence) and its embedded material (e.g., There was a wild deer jumping a fence). By contrast, we consider the interactions of veridicality inferences and other inference types (see Section 3.2), including cases where the embedded material is further paraphrased via linguistic phenomena (e.g., Jo remembered that there was a wild deer jumping a fence ⇒ An animal was jumping). We also collect human judgements on our dataset and compare them with model predictions (see Section 4.4).
One way to learn challenging inferences is data augmentation: prior studies (Yanaka et al., 2019b; Richardson et al., 2020; Min et al., 2020) have shown that data augmentation with synthesized datasets improves performance on challenging linguistic phenomena. However, it remains unclear whether data augmentation can help models learn composite inferences mixing several inference types from training instances. We address this question in Section 4.3.

Overview
To investigate whether models can capture transitivity, we consider two basic inference patterns and their combinations. The first basic pattern, I 1 , is veridical inference. We write f (s 1 ) → s 1 to denote a schematic veridical inference, where f is a clause-embedding verb and s 1 is the embedded clause. For instance, in the case of the inference pattern A → B in Figure 1, "Jo knows that x" corresponds to f (x) and "Ann and Bob left" to s 1 .
The second basic pattern, I 2 , provides an inference from the embedded material. We denote a premise-hypothesis pair of this second inference by s 1 → s 2 . Given two inferences f (s 1 ) → s 1 in I 1 and s 1 → s 2 in I 2 , we consider a new inference f (s 1 ) → s 2 , where premise f (s 1 ) is the same as that of I 1 and hypothesis s 2 is the same as that of I 2 . See Table 1 and Table 2 for some examples of inferences f (s 1 ) → s 1 , s 1 → s 2 , and f (s 1 ) → s 2 . In this work, we consider binary labels, entailment and non-entailment, denoted by yes and unk, respectively. As Table 3 shows, the gold label on the f (s 1 ) → s 2 pattern can be determined from those of the basic patterns f (s 1 ) → s 1 and s 1 → s 2 , following the transitivity of entailment relations.
We train models with the first and second patterns, f (s 1 ) → s 1 and s 1 → s 2 , and then test them on a set of the composite inferences f (s 1 ) → s 2 that combines them. Note that due to how they are constructed, the training and test sets do not overlap. A model capable of applying transitivity inference from f (s 1 ) → s 1 and s 1 → s 2 to f (s 1 ) → s 2 should consistently predict the correct label of f (s 1 ) → s 2 for any combination of f (s 1 ) → s 1 and s 1 → s 2 .

[Table 3: Rule for determining the f (s 1 ) → s 2 label from the basic patterns f (s 1 ) → s 1 and s 1 → s 2 .]
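In code, the rule in Table 3 amounts to a conjunction of the two basic labels; a minimal sketch (the function and label names are ours):

```python
def compose_label(verid_label: str, embedded_label: str) -> str:
    """Gold label of f(s1) -> s2 by transitivity (the rule in Table 3).

    The composite inference is entailed only when both basic inferences
    f(s1) -> s1 and s1 -> s2 are entailed; otherwise the gold label is
    non-entailment (unk).
    """
    return "yes" if verid_label == "yes" and embedded_label == "yes" else "unk"

# With balanced (1:1) basic labels, only one of the four combinations
# yields yes, so the yes:unk ratio of the composite test set is 1:3.
assert compose_label("yes", "yes") == "yes"
assert compose_label("yes", "unk") == "unk"
assert compose_label("unk", "yes") == "unk"
assert compose_label("unk", "unk") == "unk"
```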

Data creation
We generate basic inferences f (s 1 ) → s 1 and s 1 → s 2 and combine them to produce transitivity inferences f (s 1 ) → s 2 . To test diverse inference patterns, we consider two types of the second basic inference s 1 → s 2 : synthesized Boolean inferences and naturalistic inferences using an existing NLI dataset, SICK (Marelli et al., 2014), which contains lexical inferences (e.g., boy → kid in Table 2) and structural inferences (e.g., active-passive alternation in ID 5024 in Table 2). Since the ratio of the gold labels (yes and unk) is set to 1 : 1 in both basic inference sets, the ratio of the gold labels for the transitivity test set is 1 : 3 by the rule in Table 3. We reserve 10% of the basic inference set for the validation set.

Table 4: Clause-embedding verbs by type of f .
  Veridical: realize, acknowledge, remember, note, find, notice, learn, see, reveal, discover, understand, know, admit, recognize, observe
  Non-veridical: feel, claim, doubt, hope, predict, imply, suspect, wish, think, believe, hear, expect, estimate, assume, argue

Clause-embedding verbs We focus on clause-embedding verbs that take tensed subordinate clauses. Specifically, we collect 67 verbs appearing in both MegaVeridicality2 (White et al., 2018) and the verb veridicality dataset (Ross and Pavlick, 2019). As Table 4 shows, we select a final set of 30 clause-embedding verbs. Following a previous study (White et al., 2018), we slot a clause-embedding verb f into a template of the form "Someone f that s 1 " and generate premise f (s 1 ) of a veridical inference, to avoid confounds introduced by world knowledge and pragmatic inference in the main clause. The clause-embedding verb f is in past or present tense, and we inflect the verb in the complement s 1 to match the tense of f . When measuring the extent to which models can learn transitivity of entailment relations from training instances, it is desirable to determine the gold labels of composite inferences from those of basic inferences. Thus, we take as gold standard the veridical inference labels predicted by the veridical/non-veridical distinction in lexical semantics. At the same time, veridical inferences are sensitive to context, influenced by world knowledge and pragmatic factors (de Marneffe et al., 2012). Accordingly, we also present additional experiments that take into account this complexity of veridical inferences in Section 4.2.
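The template-based premise generation can be sketched as follows; the tense lookup is a toy stand-in for a real inflector, and inflecting the complement clause to match the tense of f is omitted here:

```python
# Sketch of premise generation for veridical inference, assuming the
# template "Someone f that s1" from White et al. (2018). The past-tense
# lookup table is a simplified illustration, not the paper's inflector.
PAST = {"know": "knew", "see": "saw", "hope": "hoped", "claim": "claimed"}

def make_premise(verb: str, clause: str, tense: str = "past") -> str:
    """Slot a clause-embedding verb into the template with naive tense
    handling (irregular past forms come from the lookup table)."""
    if tense == "past":
        form = PAST.get(verb, verb + "ed")
    else:
        form = verb + "s"
    return f"Someone {form} that {clause}"

print(make_premise("know", "Ann and Bob left"))
# -> Someone knew that Ann and Bob left
```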
Boolean inference To provide a fully synthetic transitivity inference dataset, we generate Boolean inferences with conjunction, disjunction, and negation. The data generation process is similar to the one in Yanaka et al. (2020): sentences are generated using a context-free grammar (CFG) associated with semantic composition rules in lambda-calculus. We first generate a set of premise sentences by the CFG rules and translate each sentence s 1 into a first-order-logic (FOL) formula F 1 in accordance with semantic composition rules specified in the CFG rules. Appendix A provides the set of CFG rules and semantic composition rules. We randomly select one of the atomic subformulas appearing in F 1 and take its positive or negative form, which we denote by F 2 . Then we convert F 2 to a sentence s 2 using the same grammar. We set s 2 as a hypothesis.
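A toy version of this generate-and-compose procedure might look as follows; the lexicon, the nested-tuple formula encoding, and the grammar fragment are illustrative simplifications of the released CFG:

```python
import random

# Toy fragment of the Boolean-fragment generator: a premise sentence
# and its logical form are built in parallel, mirroring the paper's
# CFG-with-semantic-composition design. Lexicon and encoding are ours.
NAMES = ["ann", "bob", "chris"]
VERBS = ["left", "swam"]

def gen_np(rng):
    """Return (surface string, function from predicate symbol to formula)."""
    if rng.random() < 0.5:
        n = rng.choice(NAMES)
        return n.capitalize(), lambda p, n=n: (p, n)          # atom p(n)
    a, b = rng.sample(NAMES, 2)
    op = rng.choice(["and", "or"])
    return (f"{a.capitalize()} {op} {b.capitalize()}",
            lambda p, a=a, b=b, op=op: (op, (p, a), (p, b)))  # p(a) op p(b)

def gen_sentence(rng):
    """Generate 'NP VP' plus its formula via semantic composition."""
    surface, sem = gen_np(rng)
    verb = rng.choice(VERBS)
    return f"{surface} {verb}", sem(verb)

rng = random.Random(0)
sentence, formula = gen_sentence(rng)
print(sentence, "|", formula)
```

An F 2 hypothesis would then be obtained by picking one atom of the formula, optionally negating it, and converting it back to a sentence with the same grammar.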
The gold label for inference pair s 1 → s 2 is determined by checking whether formula F 1 entails formula F 2 using an FOL theorem prover. The gold labels for f (s 1 ) → s 1 and f (s 1 ) → s 2 pairs are automatically determined according to the veridicality of the clause-embedding verb and the rule in Table 3, respectively. To restrict the complexity of generated sentences, we set the maximum number of logical connectives appearing in formula F 1 to 6. Table 1 illustrates examples of the fully synthetic transitivity inference dataset. We generate 3,000 Boolean inference examples s 1 → s 2 , 6,000 veridical inference examples f (s 1 ) → s 1 , and 6,000 composite inference examples f (s 1 ) → s 2 .

Naturalistic inference To generate a naturalistic transitivity inference dataset, we collect examples s 1 → s 2 of naturalistic inference from the SICK dataset, which is constructed from existing sentences (image descriptions given by different people) and covers various lexical and structural phenomena. (1) is an example of lexical inference (brush → comb) in SICK, whose label is yes.
(1) s 1 : A person is brushing a cat.
s 2 : A person is combing the fur of a cat.
By selecting a clause-embedding verb f and an embedded sentence s 1 , we generate a new sentence f (s 1 ). As shown in (2), we construct a veridical inference example f (s 1 ) → s 1 by setting f (s 1 ) as a premise and s 1 as a hypothesis.
(2) f (s 1 ): Someone sees that a person is brushing a cat.
s 1 : A person is brushing a cat. (yes)

Likewise, as in (3), we can obtain a composite inference example f (s 1 ) → s 2 whose label is yes:

(3) f (s 1 ): Someone sees that a person is brushing a cat.
s 2 : A person is combing the fur of a cat. (yes)
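Constructing the veridical and composite pairs from a naturalistic s 1 → s 2 example can be sketched as below; the verb forms and the lowercasing of the embedded clause are our simplifications, and the label logic follows the veridical/non-veridical distinction of Table 4:

```python
# Sketch of building veridical (f(s1) -> s1) and composite (f(s1) -> s2)
# pairs from one naturalistic s1 -> s2 example (here from SICK).
def embed(verb_form: str, clause: str) -> str:
    """Embed a clause under "Someone <verb> that ...", lowercasing its
    first character (a simplification for illustration)."""
    return f"Someone {verb_form} that {clause[0].lower() + clause[1:]}"

def make_pairs(s1, s2, s1_to_s2_label, verb_form, veridical):
    f_s1 = embed(verb_form, s1)
    verid_label = "yes" if veridical else "unk"
    comp_label = ("yes" if verid_label == "yes" and s1_to_s2_label == "yes"
                  else "unk")
    return [(f_s1, s1, verid_label),   # veridical inference f(s1) -> s1
            (f_s1, s2, comp_label)]    # composite inference f(s1) -> s2

pairs = make_pairs("A person is brushing a cat.",
                   "A person is combing the fur of a cat.",
                   "yes", "sees", veridical=True)
print(pairs[1])
# -> ('Someone sees that a person is brushing a cat.',
#     'A person is combing the fur of a cat.', 'yes')
```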

Experiments and Analysis
We analyze whether models trained with the basic inference set can consistently perform composite inferences on the test set. We use two DNN models, BERT and LSTM, which are known to perform well on linguistic probing tasks involving phenomena such as subject-verb agreement and hierarchical structure (Linzen et al., 2016; Weiss et al., 2018; Kuncoro et al., 2018).

Experimental setup
In all experiments, we train each model for 25 epochs or until convergence and select the best-performing model based on its accuracy on the validation set. We perform five runs and report the average and standard deviation of their accuracies.


Testing transitivity
We first evaluate whether the models trained with basic inferences f (s 1 ) → s 1 and s 1 → s 2 can consistently make judgements on the composite inferences f (s 1 ) → s 2 . As previous work (Ross and Pavlick, 2019) reported that a BERT model trained with the benchmark NLI dataset MultiNLI (MNLI; Williams et al., 2018) is sensitive to verb veridicality, we regard the accuracy of models trained with MNLI as a baseline. We also analyze models trained with MNLI mixed with the basic inference set. Table 5 shows accuracies for the fully synthetic transitivity test set that combines veridical and Boolean inferences. Models trained with the basic inference set achieved over 80% accuracy on the test cases, except for cases where f (s 1 ) → s 1 is yes and s 1 → s 2 is unk. Table 6 shows accuracies for the naturalistic transitivity test set. Again, models trained with the basic inference set performed substantially below chance on the cases f (s 1 ) → s 2 where f (s 1 ) → s 1 is yes and s 1 → s 2 is unk. This suggests that while the models achieve over 80% accuracy on both the f (s 1 ) → s 1 and s 1 → s 2 validation sets, they do not apply transitivity inference from the inferences f (s 1 ) → s 1 and s 1 → s 2 , but rather predict the label for the composite inference f (s 1 ) → s 2 by judging whether it is similar to the veridical inference f (s 1 ) → s 1 in the training set.
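The per-combination breakdown reported in Tables 5 and 6 can be computed with a simple grouping; the example record format here is an assumption:

```python
from collections import defaultdict

# Sketch of the per-combination accuracy breakdown: composite examples
# are grouped by the gold labels of their two basic inferences, then
# accuracy is computed within each group.
def accuracy_by_combination(examples):
    """examples: iterable of dicts with keys 'verid_label' (gold label
    of f(s1) -> s1), 'embedded_label' (gold label of s1 -> s2),
    'gold', and 'pred' (composite gold label and model prediction)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        key = (ex["verid_label"], ex["embedded_label"])
        totals[key] += 1
        hits[key] += int(ex["pred"] == ex["gold"])
    return {k: hits[k] / totals[k] for k in totals}

demo = [
    {"verid_label": "yes", "embedded_label": "unk", "gold": "unk", "pred": "yes"},
    {"verid_label": "yes", "embedded_label": "yes", "gold": "yes", "pred": "yes"},
]
print(accuracy_by_combination(demo))
# -> {('yes', 'unk'): 0.0, ('yes', 'yes'): 1.0}
```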
Accuracy of models trained with MNLI was low because they predicted yes for many examples where correct labels were unk, as in (4).
(4) f (s 1 ): Someone wished that John saw Tom or Greg. s 2 : John saw Tom.
(unk)

When models are trained with MNLI mixed with the basic inference set, they seem to improve performance on the fully synthetic transitivity test set. One reason for this result is that the models might use heuristics to make predictions for some unk examples in the fully synthetic inference set. Error analysis shows that the models tend to predict unk when either a premise or a hypothesis contains a negation, as in (5).
(5) f (s 1 ): Someone knew that Fred praised Henry or Ann.
s 2 : Fred did not praise Ann. (unk)

These heuristics might be related to annotation artifacts (Gururangan et al., 2018) in MNLI, where inference examples involving negation words tend to be contradictions. Moreover, models can memorize the basic inference set regardless of the existence of MNLI in the training set, so performance seems better. Note that models trained with MNLI mixed with the basic inference set still failed on the naturalistic transitivity inferences f (s 1 ) → s 2 where f (s 1 ) → s 1 is yes and s 1 → s 2 is unk. Since the naturalistic basic inference examples s 1 → s 2 contain various linguistic phenomena, models cannot rely on such heuristics for these examples.
Is poor performance of transitivity inference due to overfitting on verbs? To check that the models do not simply overfit to particular clause-embedding verbs, we analyze the models under two additional settings using naturalistic transitivity datasets: (I) we use various templates other than "Someone f that s 1 " to generate the main clause in f (s 1 ), and (II) we flip the gold labels of a randomly sampled 10% of the veridical inference f (s 1 ) → s 1 instances, instead of using gold labels uniquely fixed by verb type. These two settings expose models to more natural evaluation conditions that consider the context-sensitive property of veridicality. For evaluation setting (I), we manually select forty main clauses from the verb veridicality dataset (Ross and Pavlick, 2019) and provide additional templates; Table 7 shows examples of the additional templates involving clause-embedding verbs used for generating veridical inference datasets. Table 8 and Table 9 show the results for (I) and (II), respectively. These results show the same trends as those in Table 6, indicating that even when we consider the complexity of veridical inference in our analysis, the models fail to consistently perform composite inferences.

Table 9: Accuracies of models in setting (II); (△) is the difference from the accuracy in Table 6. Rows give the gold labels of f (s 1 ) → s 1 , s 1 → s 2 , and f (s 1 ) → s 2 , followed by the two models' accuracies.
  yes, yes, yes: 90.0 (−7.1), 93.6 (−6.4)
  yes, unk, unk: 2.2 (+2.2), 17.9 (+9.0)
  unk, yes, unk: 89.9 (−7.2), 94.0 (−6.0)
  unk, unk, unk: 98.3 (+1.0), 95.6 (+1.8)

Analysis with data augmentation
We further hypothesize that even if the current models fail to consistently perform composite inferences, data augmentation with a small number of composite inference examples might allow models to learn transitivity inference. Thus, we evaluate models trained with basic inferences f (s 1 ) → s 1 and s 1 → s 2 and with a subset of the composite inferences f (s 1 ) → s 2 on a naturalistic inference test set. Considering that models fail on composite inferences f (s 1 ) → s 2 where f is veridical and s 1 → s 2 is unk, we gradually add veridical verbs (e.g., know) one by one to generate an additional training set of composite inferences f (s 1 ) → s 2 and analyze performance on the test set.

[Figure 2: Training conditions (a) f (s 1 ) → s 1 , s 1 → s 2 , and a subset of f (s 1 ) → s 2 ; (b) f (s 1 ) → s 1 and a subset of f (s 1 ) → s 2 .]

Figure 2(a) shows that this data augmentation improved performance on test examples f (s 1 ) → s 2 where f is veridical and s 1 → s 2 is unk, while maintaining accuracy on the remaining examples in the test set. BERT achieved 100% accuracy over the entire test set by adding composite inferences generated from four veridical verbs, whereas LSTM needed twelve veridical verbs to achieve the same accuracy.
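The verb-by-verb augmentation schedule can be sketched as follows; the data structures are assumptions, and in the actual experiments a fresh model would be trained and evaluated at each step:

```python
# Sketch of the incremental augmentation schedule from Section 4.3:
# composite f(s1) -> s2 examples for one more veridical verb are added
# to the training set at each step. `composite_pool` maps each verb to
# its composite examples (an assumed format for illustration).
def augmentation_schedule(basic_set, composite_pool, verb_order):
    """Yield (number of added verbs, training set) at each step."""
    train = list(basic_set)
    for n_verbs, verb in enumerate(verb_order, start=1):
        train = train + composite_pool[verb]
        yield n_verbs, train  # a model is trained/evaluated here

pool = {"know": ["ex1", "ex2"], "see": ["ex3"]}
for n_verbs, train in augmentation_schedule(["basic"], pool, ["know", "see"]):
    print(n_verbs, len(train))
# -> 1 3
#    2 4
```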
To determine whether models augmented with composite inference examples learn the ability to combine basic inferences to perform transitivity inference, we analyze the performance of models whose training set does not include the basic inference examples s 1 → s 2 . Figure 2(b) shows that models trained with only the basic inference set f (s 1 ) → s 1 and a subset of the composite inference set f (s 1 ) → s 2 also improved in accuracy. This result supports the finding that models do not combine the basic inferences f (s 1 ) → s 1 and s 1 → s 2 , but rather predict the label for a composite inference f (s 1 ) → s 2 by judging whether it is similar to inference patterns found in the training set.

Comparison with humans
To investigate how humans perform on transitivity inference tasks, we collect human judgements for a subset of our naturalistic inference dataset. We asked crowdsourced workers to label 960 transitivity inference examples involving all the clause-embedding verbs in Table 4. Following prior work involving crowdsourced NLI datasets (Zhang et al., 2017; Ross and Pavlick, 2019), we instructed raters to label each premise-hypothesis pair with the degree of entailment on a 5-point Likert scale, with 1 meaning the hypothesis is definitely not true given the premise and 5 meaning it is definitely true. We collected three annotations per pair on Amazon Mechanical Turk (see Appendix D for details); the inter-rater agreement (the Pearson correlation among raters, averaged across both examples and raters) was 0.76. As model predictions are discrete (yes or unk), we discretized human scores into two bins, setting yes if the score was 4 or higher and unk if the score was 3 or lower, and took the majority of the three discretized labels as the final human judgement. Table 10 shows that humans generally follow the distinction between veridical and non-veridical verbs traditionally assumed in lexical semantics, as well as the transitivity of entailment relations. In particular, while the DNN models performed substantially below chance on transitivity inferences where f (s 1 ) → s 1 is yes and s 1 → s 2 is unk (Section 4.2), human performance is near perfect on such inferences.
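The discretization and majority vote over the crowdsourced ratings can be sketched as:

```python
from collections import Counter

# Sketch of the post-processing applied to the 5-point Likert ratings:
# scores of 4-5 map to yes, 1-3 to unk, and the final human judgement
# is the majority over the three annotators' discretized labels.
def discretize(score: int) -> str:
    return "yes" if score >= 4 else "unk"

def human_label(scores) -> str:
    """Majority label over the annotators' discretized scores."""
    labels = [discretize(s) for s in scores]
    return Counter(labels).most_common(1)[0][0]

print(human_label([5, 4, 2]))  # -> yes
print(human_label([1, 2, 5]))  # -> unk
```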
Interestingly, however, humans tend to predict incorrect labels for transitivity inferences where the verb f is non-veridical (so f (s 1 ) → s 1 is unk) and the embedded inference s 1 → s 2 is yes. This might be because a natural complement as in (6) induces veridicality bias (Ross and Pavlick, 2019), that is, no matter whether a complement verb f is veridical or non-veridical, humans tend to decide the truth value of f (s 1 ) by judging whether its complement s 1 is true. Thus, judgement for f (s 1 ) → s 2 coincides with that of s 1 → s 2 in this case.
(6) f (s 1 ): Someone believed that a man is jumping off a low wall.
s 1 : A man is jumping off a low wall.
s 2 : A man is jumping a wall.

Conclusion
We introduced an analysis method using transitivity inferences for evaluating the systematic generalization capacities of NLI models. We found that current NLI models do not perform consistently well on transitivity inference tasks. Furthermore, data augmentation analysis suggested that models can memorize composite inference examples, but do not perform the intended transitivity inferences combining basic inference examples. Overall, our results indicated that despite the impressive performance of DNN models on standard NLI datasets, there remains much room for improving their systematic generalization capacities with respect to combining basic inferential abilities over various linguistic phenomena. Regarding what is necessary for improving systematic generalization capacity, one interesting possibility is explicitly feeding some form of logic-guided transitivity rules to models, which is left for future work. Our analysis method using transitivity can be an effective tool for further progress in the study of compositional NLI.

A Details about the Boolean logic fragment

Table 11 shows the context-free grammar used to generate sentences for Boolean logic reasoning with conjunction, disjunction, and negation. Each rewriting rule is paired with the corresponding semantic composition rule in standard Montagovian semantics to generate the logical form of a sentence (Montague, 1973; Heim and Kratzer, 1998). We use ten items each for proper names (PN), intransitive verbs (IV), and transitive verbs (TV). Each sentence is generated with a verb in the past tense.
For sentences with multiple NPs, we assume the surface-scope reading where the subject NP takes scope over the object NP. For instance, the sentence Ann and Bob saw Chris or Daniel, where the subject NP is conjunctive and the object NP is disjunctive, has the logical form (see(ann, chris) ∨ see(ann, daniel)) ∧ (see(bob, chris) ∨ see(bob, daniel)).
There are two types of negation, sentential negation (SNEG) and verbal negation (VNEG), which are distinguished with respect to their scope interpretation. Thus, the sentence Ann and Bob did not swim has the logical form ¬swim(ann) ∧ ¬swim(bob), while the sentence It is not the case that Ann and Bob swam has the logical form ¬(swim(ann) ∧ swim(bob)).
To generate a premise-hypothesis pair (s 1 , s 2 ) using this Boolean logic fragment, we first generate a sentence s 1 and derive its logical form F 1 using the grammar in Table 11. We then randomly select one of the atomic formulas appearing in F 1 , say A, and take its positive (A) or negative (¬A) form, which is in turn converted to the hypothesis sentence s 2 using the same grammar. The gold label (entailment or non-entailment) for the pair (s 1 , s 2 ) is determined by checking whether F 1 logically entails A or ¬A using a first-order-logic theorem prover.
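Since the fragment's formulas are quantifier-free and ground, the prover check can equivalently be done by exhaustive truth-table evaluation. The following sketch uses that substitute (the released code uses an actual FOL prover), with our own nested-tuple formula encoding:

```python
from itertools import product

# Entailment check for the Boolean fragment by truth tables. Formulas
# are nested tuples ('not', f), ('and', f, g), ('or', f, g), or an atom
# string such as 'swim_ann'. This stands in for the FOL prover, which
# is equivalent here because the formulas are ground and quantifier-free.
def atoms(f, acc=None):
    acc = set() if acc is None else acc
    if isinstance(f, str):
        acc.add(f)
    else:
        for sub in f[1:]:
            atoms(sub, acc)
    return acc

def evaluate(f, valuation):
    if isinstance(f, str):
        return valuation[f]
    op = f[0]
    if op == "not":
        return not evaluate(f[1], valuation)
    if op == "and":
        return evaluate(f[1], valuation) and evaluate(f[2], valuation)
    return evaluate(f[1], valuation) or evaluate(f[2], valuation)  # "or"

def entails(f1, f2) -> bool:
    """True iff every valuation satisfying f1 also satisfies f2."""
    props = sorted(atoms(f1) | atoms(f2))
    for bits in product([True, False], repeat=len(props)):
        v = dict(zip(props, bits))
        if evaluate(f1, v) and not evaluate(f2, v):
            return False
    return True

# "Ann and Bob left" entails "Bob left"; "Ann or Bob left" does not.
print(entails(("and", "left_ann", "left_bob"), "left_bob"))  # -> True
print(entails(("or", "left_ann", "left_bob"), "left_bob"))   # -> False
```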

B Training details
In all experiments, we trained models on eight NVIDIA DGX-1 Tesla V100 GPUs. The runtime for training each model was about 1-8 hours, depending on the size of the training set. The order of training instances was shuffled for each model.

To confirm that our transitivity inference dataset is not excessively difficult, we conducted additional experiments using a random 9 : 1 train:test split of the transitivity inference (f (s 1 ) → s 2 ) datasets. We evaluate models under two settings: (i) models trained with the train split of the transitivity inference datasets and (ii) models trained with the train split mixed with MNLI. Table 12 shows the results on the random train-test split of our fully synthetic transitivity dataset, and Table 13 shows the results on the random train-test split of our naturalistic transitivity dataset. These results show that regardless of the existence of MNLI in the training set, models achieved perfect performance on our transitivity inference test set under the standard random train-test split setting.

D Human judgement details
Using Amazon Mechanical Turk, we collected human judgements for 960 naturalistic veridical inference examples and 960 naturalistic transitivity inference examples. We required raters to have completed at least 5,000 approved tasks and to maintain a 99% approval rating. Raters could indicate by a checkbox that one or both sentences did not make sense, but no rater clicked the checkbox. We collected three annotations per pair and paid $0.06 per labeled pair. Since humans predict incorrect labels for some composite inference examples f (s 1 ) → s 2 where the verb f is non-veridical, we checked the accuracy of human judgement on the sets of premise-hypothesis pairs f (s 1 ) → s 1 and f (s 1 ) → s 2 involving each non-veridical verb, as shown in Table 14. Annotators tended to make incorrect judgements for both f (s 1 ) → s 1 and f (s 1 ) → s 2 . Regarding accuracy for each non-veridical verb, annotators correctly drew inferences containing wish and hope, while they tended to draw inferences containing claim and hear incorrectly.
In comparison with the previous veridicality dataset MegaVeridicality2 (White et al., 2018), the accuracy tended to be lower than that in MegaVeridicality2. As (7) shows, while a simple complement is used for MegaVeridicality2, a natural complement like (8) induces veridicality bias (Ross and Pavlick, 2019), resulting in incorrect judgements on veridical inference: whether a verb is veridical or non-veridical, humans tend to judge the complement as true.

[Table 11: Grammar for the Boolean logic fragment with semantic composition. Feature tense for VP is either "base" or "past." In semantic composition, sym is the place where the symbol (lemma) for a lexical item appears.]
(8) f (s 1 ): Someone believed that a man is jumping off a low wall.
s 1 : A man is jumping off a low wall.
s 2 : A man is jumping a wall.

E Supplementary results with data augmentation
In Section 4.3, we gradually added a subset of the composite inferences f (s 1 ) → s 2 involving a veridical verb (e.g., know) to the training set and evaluated the performance of models on a naturalistic inference test set. We also evaluated the performance of models under two conditions: (a) models trained with the basic inference set s 1 → s 2 and a subset of the composite inference set f (s 1 ) → s 2 and (b) models trained with a subset of the composite inference set f (s 1 ) → s 2 . Figure 3(a) shows that the models significantly improved accuracy on composite inferences except for the test examples f (s 1 ) → s 2 whose label differed from that of s 1 → s 2 . Moreover, their performance was maintained even without composite inference examples in the training set. This indicates that models predict labels for a composite inference example only by judging whether it is similar to the basic inference examples in the training set. Figure 3(b) shows the result when models are trained only with a subset of the composite inference set f (s 1 ) → s 2 . As non-veridical verbs are not included in the training set in this setting, the models predict labels for composite inferences involving non-veridical verbs by judging whether they are similar to composite inferences involving veridical verbs in the training set. The models thus fail on composite inference examples f (s 1 ) → s 2 where f is non-veridical and s 1 → s 2 is yes. The labels of such non-veridical inference examples are opposite to those of the veridical inference examples in the training set.