Entailment Graph Learning with Textual Entailment and Soft Transitivity

Typed entailment graphs try to learn the entailment relations between predicates from text and model them as edges between predicate nodes. The construction of entailment graphs usually suffers from severe sparsity and from the unreliability of distributional similarity. We propose a two-stage method, Entailment Graph with Textual Entailment and Transitivity (EGT2). EGT2 learns the local entailment relations by recognizing the textual entailment between template sentences formed by typed CCG-parsed predicates. Based on the generated local graph, EGT2 then uses three novel soft transitivity constraints to account for the logical transitivity in entailment structures. Experiments on benchmark datasets show that EGT2 models the transitivity in entailment graphs well enough to alleviate the sparsity, and leads to significant improvement over current state-of-the-art methods.


Introduction
Entailment, as an important relation in natural language processing (NLP), is critical to semantic understanding and natural language inference (NLI). Entailment relations have been widely applied in different NLP tasks such as Question Answering (Pathak et al., 2021; Khot et al., 2018), Machine Translation (Padó et al., 2009) and Knowledge Graph Completion (Yoshikawa et al., 2019). When coming across the question "Which medicine cures the infection?", one can recognize the information "Griseofulvin is preferred for the infection" in the corpus and appropriately write down the answer with the knowledge that "is preferred for" entails "cures" when their arguments are medicines and diseases, although the surface form of the predicate "cures" does not appear in the corpus. There are many ways to present one question, and it is impossible to handle them without understanding the entailment relations behind the predicates.
Figure 1: A simple example of an entailment graph with types $t_1$ = medicine and $t_2$ = disease, over the predicates "[medicine] is preferred for [disease]", "[medicine] cures [disease]", "[medicine] is effective for [disease]", "[medicine] is related to [disease]" and "[medicine] causes [disease]". The dashed line represents a missing entailment recovered by considering the transitivity constraint (red), based on the two premise entailments between the three boldfaced predicates.
Previous works on analyzing entailment mainly focus on Recognizing Textual Entailment (RTE) between pairs of sentences, and many recent attempts have achieved quite promising performance in detecting entailment relations using transformer-based language models (He et al., 2020; Raffel et al., 2020; Schmitt and Schütze, 2021b). By modeling typed predicates as nodes and entailment relations as directed edges, the Entailment Graph (EG) is a powerful and well-established form to represent the context-independent entailment relations between predicates and to reflect the global features of entailment inference, such as paraphrasing and transitivity. As EGs are able to support reasoning without additional context or resources, they can be seen as a special type of structural knowledge in natural language. Figure 1 shows an excerpt of an entailment graph about two types of arguments, Medicine and Disease. Generally speaking, an entailment graph can be built in a three-step process: extracting predicate pairs from a corpus, building local graphs with locally computed entailment scores, and modifying the graphs with global methods.
However, existing EG construction methods still face challenges in both local and global stages. The Distributional Inclusion Hypothesis (DIH) about entailment assumes that given a predicate (relation) p, it can be replaced in any context by another predicate (relation) q if and only if p entails q (Geffet and Dagan, 2005). Most local methods in previous works are guided by DIH, thus rely on the distributional co-occurrences from corpora, including named entities, entity pairs and context, as features to compute the local entailment scores. Since different predicate pairs are processed independently, the locally built graphs suffer from severe data sparsity. That is, there are many entailment relations missing (as edges) in the graphs if the predicate pairs do not co-occur in the corpus. Furthermore, predictions from local models may not be coherent with each other, for example, a local model may output three predictions like, a entails b, b entails c and c entails a at the same time, which actually indicate possible errors among the local predictions.
To overcome the challenges faced by local models, different global approaches are used to take the interactions and dependencies between entailment relations into consideration. The first global dependency to be studied is the logical transitivity among predicates, that is, predicate a entails predicate c if there is another predicate b such that both "a entails b" and "b entails c" hold. Berant et al. (2011) use Integer Linear Programming (ILP) to enforce the transitivity constraints on entailment graphs, which unfortunately does not scale to large graphs with thousands of nodes. Hosseini et al. (2018) model the structural similarity across graphs and the paraphrasing relations within graphs to learn global consistency, but do not gain further improvement due to the lack of high-quality local graphs and proper transitivity modeling.
In order to deal with the problems in the local and global stages, we propose a novel entailment graph learning approach, Entailment Graph with Textual Entailment and Transitivity (EGT2). EGT2 builds high-quality local entailment graphs by feeding predicates, rendered as sentences, into a transformer-based language model fine-tuned on an RTE task, which avoids the unreliability of distributional scores. It then models the global transitivity on these scores through carefully designed soft constraint losses, which alleviate the data sparsity and remain feasible on large-scale local graphs. Our key insight is that an entailment relation a → c can be correctly implied by the transitivity constraint only under two conditions: (1) a constraint that scales to large graphs containing rich information, and (2) reliable local graphs offering the premises a → b and b → c, which is impractical for previous distributional approaches but may be achievable for models that perform well on RTE tasks. Specifically, the input sentences fed to the transformer-based language model are formed without context, which makes our method applicable to predicates that do not appear in the corpus. The transitivity implication is confined to entailment relations with high confidence, which improves the quality of the implied edges and cuts down the computational overhead. In summary, this paper makes the following contributions:
• we present a novel approach based on textual entailment to scoring predicate pairs on local entailment graphs, which is reliable without distributional features and valid for arbitrary predicate pairs.
• we present three carefully designed global soft constraint loss functions to model the transitivity among entailment relations on large entailment graphs, thus alleviating the data sparsity issue of previous local approaches.
• we evaluate our method on benchmark datasets, and show that EGT2 significantly outperforms previous entailment graph construction approaches. Further analysis shows that our local and global approaches are both useful for learning entailment graphs.

Related Work
Based on DIH, previous works extract feature vectors for typed predicates to compute the local distributional similarity. The set of entity argument pair strings, like "Griseofulvin-infection" in the example of Section 1, are used as features weighted by Pointwise Mutual Information (Berant et al., 2015; Hosseini et al., 2018). Given the feature vectors for a predicate pair, different similarity scores, like cosine similarity, Lin (Lin, 1998), DIRT (Lin and Pantel, 2001), Weeds (Weeds and Weir, 2003) and Balanced Inclusion (Szpektor and Dagan, 2008), are calculated as the local similarities. Hosseini et al. (2019) and Hosseini et al. (2021) use a Markov chain on an entity-predicate bipartite graph weighted by link prediction scores to calculate the transition probability between two predicates as the local score. In effect, they rely on the link prediction model to generate the features. Guillou et al. (2020) add temporal information into entailment graphs by extracting entity pairs within a limited temporal window as predicate features. McKenna et al. (2021) extend the graphs to include entailment relations between predicates with different numbers of arguments by splitting the features from argument pairs into independent entity slots, which impairs the representation ability of the features when unary predicates are involved.
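The distributional scores mentioned above can be illustrated with a minimal sketch. It assumes each predicate is described by a dictionary from entity-pair strings to non-negative weights (for example PMI values), and it uses the commonly cited forms of Weeds precision, Lin similarity and Balanced Inclusion (their geometric mean); the function names and the toy features are hypothetical and are not taken from any of the cited implementations.

```python
import math

def weeds_precision(fp, fq):
    """Share of p's feature weight that falls on features also observed with q."""
    shared = set(fp) & set(fq)
    total = sum(fp.values())
    return sum(fp[f] for f in shared) / total if total else 0.0

def lin_similarity(fp, fq):
    """Symmetric Lin score: weight of shared features over all feature weight."""
    shared = set(fp) & set(fq)
    total = sum(fp.values()) + sum(fq.values())
    return sum(fp[f] + fq[f] for f in shared) / total if total else 0.0

def binc(fp, fq):
    """Balanced Inclusion: geometric mean of Lin and Weeds precision (directional)."""
    return math.sqrt(lin_similarity(fp, fq) * weeds_precision(fp, fq))

# Toy features (hypothetical weights) for p = "is preferred for" and q = "cures".
fp = {"griseofulvin-infection": 2.0, "aspirin-headache": 1.0}
fq = {"griseofulvin-infection": 1.0, "penicillin-pneumonia": 3.0}
score_p_to_q = binc(fp, fq)
score_q_to_p = binc(fq, fp)
```

Because Weeds precision is asymmetric, `binc(fp, fq)` and `binc(fq, fp)` generally differ, which is what lets such scores express a direction of entailment. A pair of predicates that shares no entity pair gets a score of zero, which is the sparsity problem discussed in Section 1.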
As mentioned in Section 1, entailment graphs are generally learned by imposing global constraints on the local entailment relations between extracted predicates. The transitivity in entailment graphs is modeled with Integer Linear Programming (ILP) in Berant et al. (2011), which selects a transitive sub-graph of a local weighted graph to maximize the sum of the weights of its edges. Their work is limited to a few hundred predicates due to the computational complexity of ILP. For better scalability, Berant et al. (2012) and Berant et al. (2015) make the strong FRG assumption that if predicate a entails predicates b and c, then b and c entail each other, and propose an approximation method called Tree-Node-Fix (TNF). However, the assumption is too strong to be satisfied in real cases.
Since hard constraints are difficult to apply to large-scale entailment graphs, Hosseini et al. (2018) propose two global soft constraints that maintain the similarity between paraphrasing predicates within typed graphs and between predicates with the same names in graphs with different argument types. Their soft constraints are also used in Hosseini et al. (2019) and Hosseini et al. (2021). The similarity between paraphrasing predicates, which ensures that the scores of a → c and b → c, and those of c → a and c → b, stay close when a ↔ b, implicitly takes the transitivity between paraphrasing predicates and a third predicate into consideration. However, it ignores the transitivity in more common cases, and leads to only a limited improvement in performance.
Meanwhile, the transformer-based Language Model (LM), although proved to be effective in RTE tasks (He et al., 2020; Raffel et al., 2020; Schmitt and Schütze, 2021b), has received less attention in entailment graph learning. Schmitt and Schütze (2021a) use a pretrained LM on the Lexical Inference in Context (LIiC) task, which is closely related to entailment graph learning. Hosseini et al. (2021) use pretrained BERT to initialize the contextualized embeddings in their contextualized link prediction and entailment score calculation. Higher scores are assigned to the entailed predicates in the context of their premises, which is an implicit expression of DIH and differs from our direct use of an LM for textual entailment.

Definition and Notations
The goal of entailment graph learning is to extract predicates, learn the entailment relations and build entailment graphs from raw text corpora. Following previous works (Hosseini et al., 2018, 2019), we use the binary relations from neo-Davidsonian semantics, a type of first-order logic with event identifiers, as predicates. For instance, with a semantic parser (here, GraphParser (Reddy et al., 2014)), the sentence "Griseofulvin is preferred for the infection." can be transformed into the logical form $\exists e.\, prefer_2(e, Griseofulvin) \wedge prefer_{for}(e, infection)$, where $e$ denotes an event. By considering a relation for each pair of extracted arguments, this sentence yields one predicate, $p$ = (prefer.2, prefer.for.2, medicine, disease). Likewise, the sentence "Griseofulvin cures the infection." contains $q$ = (cure.1, cure.2, medicine, disease). Formally, a predicate with argument types $t_1$ and $t_2$ is represented as $p = (w_{p,1}.i_{p,1}, w_{p,2}.i_{p,2}, t_1, t_2)$. The event-based predicate form is expressive enough to describe most of the relations in real cases (Parsons, 1990).
With $T$ as the set of types and $P$ as the set of all typed predicates, $V(t_1, t_2)$ contains the typed predicates $p$ with unordered argument types $t_1$ and $t_2$, where $p \in P$ and $t_1, t_2 \in T$. The typed entailment graph $G(t_1, t_2) = \langle V(t_1, t_2), E(t_1, t_2) \rangle$ is composed of the nodes of typed predicates $V(t_1, t_2)$ and the weighted edges $E(t_1, t_2)$. The edges can also be represented as a sparse score matrix $W(t_1, t_2) \in [0, 1]^{|V(t_1, t_2)| \times |V(t_1, t_2)|}$, containing the entailment scores between predicates with types $t_1$ and $t_2$. As different argument types naturally determine whether two predicates have the same order of arguments, the order of the argument types is not important when $t_1 \neq t_2$, and therefore we can ensure that $G(t_1, t_2) = G(t_2, t_1)$. For those predicates $p$ with $\tau_1(p) = \tau_2(p)$, that is, with two identical argument types, the two argument types are labeled with orders, which allows the graph to contain entailment relations with different argument orders, like (be.1, be.capital.of.2, location$_1$, location$_2$) → (contain.1, contain.2, location$_2$, location$_1$).
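The notation above can be made concrete with a small data-structure sketch. The class and field names below are hypothetical, and the sparse score matrix is stored as a nested dictionary purely for illustration; only the tuple layout of a predicate and the properties "scores lie in [0, 1]" and "the type pair is unordered" come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class TypedPredicate:
    """A typed predicate (w1.i1, w2.i2, t1, t2), e.g. (prefer.2, prefer.for.2, medicine, disease)."""
    slot1: str
    slot2: str
    type1: str
    type2: str

class EntailmentGraph:
    """Typed entailment graph G(t1, t2) with a sparse score matrix W."""

    def __init__(self, type1, type2):
        # Unordered type pair, so that G(t1, t2) and G(t2, t1) denote the same graph.
        self.types = frozenset((type1, type2))
        self.scores = defaultdict(dict)  # scores[p][q] = weight of the edge p -> q

    def add_edge(self, p, q, score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("entailment scores must lie in [0, 1]")
        self.scores[p][q] = score

    def score(self, p, q):
        # Missing edges are read as 0, which is how sparsity shows up in practice.
        return self.scores.get(p, {}).get(q, 0.0)

p = TypedPredicate("prefer.2", "prefer.for.2", "medicine", "disease")
q = TypedPredicate("cure.1", "cure.2", "medicine", "disease")
graph = EntailmentGraph("medicine", "disease")
graph.add_edge(p, q, 0.97)  # hypothetical score for p -> q
```

The sketch does not handle the ordered-type case (location$_1$, location$_2$); a fuller version would need to record the argument order for predicates whose two types coincide.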

Local Entailment based on Textual Entailment
Inspired by the outstanding performance of pretrained and fine-tuned LMs on the RTE task, which is closely related to entailment graphs, EGT2 uses a fine-tuned transformer-based LM to calculate the local entailment scores of typed predicate pairs. In order to utilize the knowledge about entailment relations in the pretrained and fine-tuned LM, EGT2 first transforms the predicate pair $(p, q)$ into the corresponding sentence pair $(S(p), S(q))$ with a sentence generator $S$, as the structured predicates cannot be input into the LM directly. For a typed predicate $p = (w_{p,1}.i_{p,1}, w_{p,2}.i_{p,2}, t_1, t_2)$, the generator deduces the positions of the arguments of the predicate from $i_{p,1}$ and $i_{p,2}$, generates the surface form of $p$ from $w_{p,1}$ and $w_{p,2}$, and finally concatenates the surface form with the capitalized types as its arguments. Some generated examples are shown in Table 1, and the detailed algorithm of $S$ is described in Appendix A.
After generating the sentence pair $(S(p), S(q))$ for the predicate pair $(p, q)$, EGT2 inputs $(S(p), S(q))$ into a transformer-based LM to calculate the probability of the entailment relation p → q as the local entailment score in $G(t_1, t_2)$. In our experiments, the LM is implemented as DeBERTa (He et al., 2020). Generally, an entailment-oriented LM outputs three scores for a sentence pair, representing the probabilities of the relationships entail, contradict and neutral respectively. Formally, we denote the weight matrix of the local entailment graph with types $t_1$ and $t_2$ as $W^{local}$, and the weight of the edge between $p$ and $q$ in $W^{local}$ is calculated as:
$$W^{local}_{p,q} = P(entail \mid p, q) = \frac{\exp(LM(entail \mid p, q))}{\sum_{r \in \{entail, contradict, neutral\}} \exp(LM(r \mid p, q))}$$
where $LM(r \mid p, q)$ is the output score of the corresponding relationship $r$ by the LM. As the local entailment is based on an LM fine-tuned to perform textual entailment, the local graph can be built for any predicates in the parsed semantic form, or in any other form by changing the sentence generator $S$.
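The scoring step described above can be sketched as follows, under the assumption that the three label scores returned by the LM are unnormalized logits that are turned into a probability with a softmax. The function name and the example logits are hypothetical; no real model is called here.

```python
import math

LABELS = ("entail", "contradict", "neutral")

def entail_probability(logits):
    """Softmax probability of the 'entail' label given one logit per label."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    m = max(logits[r] for r in LABELS)
    exps = {r: math.exp(logits[r] - m) for r in LABELS}
    return exps["entail"] / sum(exps.values())

# Hypothetical logits for the pair (S(p), S(q)).
local_score = entail_probability({"entail": 4.0, "contradict": -2.0, "neutral": 0.5})
```

In a full pipeline the logits would come from the fine-tuned LM applied to the generated sentence pair, and `local_score` would be stored as the weight of the edge p → q in the local graph.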

Global Entailment with Soft Transitivity Constraint
Existing approaches use global learning to find correct entailment relations which are missing or underestimated in local entailment graphs to overcome the data sparsity. Following Hosseini et al. (2018), the evidence from existing local edges with high confidence is used by EGT2 to predict missing edges in the entailment graphs. The transitivity in entailment relation inference implies a → c while both a → b and b → c hold. For instance, in the example of Figure 1, the entailment "is preferred for" → "is effective for" is discovered because "is preferred for" → "cures" and "cures" → "is effective for" have been learned.
The key challenge in incorporating the transitivity constraint into weighted graphs is the discreteness of logical rules. Discreteness makes it impossible to use the rules directly in gradient-based learning methods without NP-hard complexity, as different predicate pairs are jointly involved in the calculation. To unify the discrete logical rules with gradient-based learning, inspired by Li et al. (2019), EGT2 expresses the logical constraints as differentiable triangular norms (Gupta and Qi, 1991; Klement et al., 2013), also called t-norms, and uses them as soft constraints so that gradient-based learning methods can be applied. Different t-norms transform the discrete rules into different continuous loss functions. The traditional product t-norm maps $P(A \wedge B)$ into $P(A)P(B)$ and $P(A \rightarrow B)$ into $\min(1, P(B)/P(A))$. For the entailment relations, the probability of transitivity being satisfied is:
$$P\big((a \rightarrow b \wedge b \rightarrow c) \rightarrow (a \rightarrow c)\big) = \min\left(1, \frac{W_{a,c}}{W_{a,b} W_{b,c}}\right)$$
where the probability of the entailment relation a → b is represented by the local entailment score $W_{a,b}$. To reduce the noise from edges assigned low confidence by the local LM, EGT2 only takes into account the local edges whose scores are higher than $1 - \epsilon$ (as a → b and b → c), where $\epsilon$ is a small hyper-parameter, because the local probability scores tend to be close to 0 or 1 in practice. Therefore, to maximize the probability that the transitivity constraint is satisfied over all predicates in the entailment graph $G(t_1, t_2)$, EGT2 tries to minimize the negative log-likelihood loss function $L_1$ in Eq. 2, where $I_y(x) = 1$ if $x > y$, and 0 otherwise. Another important t-norm, the Gödel t-norm, maps $P(A \wedge B)$ into $\min(P(A), P(B))$ and $P(A \rightarrow B)$ into 1 if $P(B) \geq P(A)$, or $P(B)$ otherwise. Therefore, the Gödel probability of transitivity being satisfied is:
$$P\big((a \rightarrow b \wedge b \rightarrow c) \rightarrow (a \rightarrow c)\big) = \begin{cases} 1 & \text{if } W_{a,c} \geq \min(W_{a,b}, W_{b,c}) \\ W_{a,c} & \text{otherwise} \end{cases}$$
and EGT2 similarly tries to minimize the loss function $L_2$ in Eq. 2. It should be noted that transitivity constraints are violated not only by missing edges, but also by spurious edges in the local graphs.
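The two t-norm readings of transitivity discussed above can be illustrated with a minimal sketch. It computes the negative log-likelihood of the product and Gödel transitivity probabilities over all triples whose premises exceed the confidence threshold. This is a direct reading of the definitions in the text, not a reproduction of the loss functions in Eq. 2, whose exact form may differ; the function names, the nested-dictionary graph, the `floor` used to avoid log(0) for missing edges, and the skipping of self-loops are all assumptions made for illustration.

```python
import math

def product_prob(w_ab, w_bc, w_ac):
    """Product t-norm: P((A and B) -> C) = min(1, P(C) / (P(A) * P(B)))."""
    return min(1.0, w_ac / (w_ab * w_bc))

def godel_prob(w_ab, w_bc, w_ac):
    """Goedel t-norm: 1 if P(C) >= min(P(A), P(B)), else P(C)."""
    return 1.0 if w_ac >= min(w_ab, w_bc) else w_ac

def transitivity_nll(W, prob, eps=0.02, floor=1e-6):
    """Sum of -log P(transitivity) over triples (a, b, c) with confident premises."""
    loss = 0.0
    for a, row in W.items():
        for b, w_ab in row.items():
            if b == a or w_ab <= 1 - eps:  # keep only premises scored above 1 - eps
                continue
            for c, w_bc in W.get(b, {}).items():
                if c in (a, b) or w_bc <= 1 - eps:
                    continue
                # A missing edge a -> c is treated as a tiny positive score (assumption).
                w_ac = max(W.get(a, {}).get(c, 0.0), floor)
                loss -= math.log(prob(w_ab, w_bc, w_ac))
    return loss

# Toy graph in the spirit of Figure 1 (scores are hypothetical).
W = {"prefer_for": {"cure": 0.99}, "cure": {"effective_for": 0.99}}
loss_missing_edge = transitivity_nll(W, product_prob)
W["prefer_for"]["effective_for"] = 0.99
loss_recovered_edge = transitivity_nll(W, product_prob)
```

With the hypothesis edge missing, the loss is large; once "prefer_for" → "effective_for" receives a high score, the loss drops to zero, which is the behavior that pushes a gradient-based optimizer to add the dashed edge of Figure 1. Note that the product form also depends on the premise scores, whereas the Gödel form only depends on the hypothesis score when the constraint is violated.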
Therefore, we expect the soft constraints to also consider reducing the weights of premise edges. $L_1$ achieves this through the loss terms involving $W_{a,b}$ and $W_{b,c}$, and we modify $L_2$ into $L_3$ in Eq. 2 so that a low confidence $W_{a,c}$ helps to detect whether $W_{a,b}$ and $W_{b,c}$ are spurious. Our t-norm soft constraints, although they do not guarantee that transitivity is obeyed, are effective approximations of the transitivity property. Given the local entailment graph $G(t_1, t_2)$ with weighted edges $W^{local}$, in order to ensure that the global entailment graph $W$ is not too far from $W^{local}$, EGT2 finally minimizes a loss function $L$ that trades off the distance from the local graph against the soft transitivity constraint, where $L_i$ is the chosen implementation of the soft transitivity constraint in Eq. 2, and $\lambda$ is a non-negative hyper-parameter that controls the relative influence of the two loss terms.
We use the NewsSpike corpus (Zhang and Weld, 2013), which contains 550K news articles, to extract binary relations as generated predicates in EGT2. We make use of the triples released and filtered by Hosseini et al. (2019), who apply GraphParser (Reddy et al., 2014), based on Combinatory Categorial Grammar (CCG) syntactic derivations, to extract binary relations between predicates and arguments. The argument entities are linked to Freebase (Bollacker et al., 2008) and mapped to the first level of the FIGER type hierarchy (Ling and Weld, 2012). The type of a predicate is determined by its two corresponding argument entities. The triples are filtered by two rules to remove noisy binary relations and arguments: (1) we only keep those argument pairs appearing in at least 3 relations; (2) we only keep those relations with at least 3 different argument pairs. The number of relations in the corpus is reduced from 26M to 3.9M, covering 304K typed predicates in 355 typed entailment graphs.
Only those predicate pairs co-occurring with at least one common entity pair (e.g., Griseofulvin-infection) are linked to calculate the local scores, and as a result, our local predicate pairs are identical to those of Hosseini et al. (2019). As we focus on using global models to alleviate the sparsity of local edges, methods for extracting denser local edges are left to future research.
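The two filtering rules described above can be sketched as follows. The sketch assumes that triples are given as (relation, argument 1, argument 2) tuples, that "relations" are counted as distinct relation names, and that the two rules are applied once in sequence rather than iterated to a fixed point; the source does not specify these details, and the function name is hypothetical.

```python
from collections import defaultdict

def filter_triples(triples, min_count=3):
    """Apply rule (1) on argument pairs, then rule (2) on relations."""
    # Rule (1): keep argument pairs that occur with at least min_count distinct relations.
    relations_per_pair = defaultdict(set)
    for rel, arg1, arg2 in triples:
        relations_per_pair[(arg1, arg2)].add(rel)
    kept = [t for t in triples if len(relations_per_pair[(t[1], t[2])]) >= min_count]

    # Rule (2): keep relations that occur with at least min_count distinct argument pairs.
    pairs_per_relation = defaultdict(set)
    for rel, arg1, arg2 in kept:
        pairs_per_relation[rel].add((arg1, arg2))
    return [t for t in kept if len(pairs_per_relation[t[0]]) >= min_count]

# Toy triples; min_count=2 is used only to keep the example small.
triples = [
    ("cure", "x", "y"), ("treat", "x", "y"),
    ("cure", "u", "v"), ("treat", "u", "v"),
    ("cause", "x", "y"), ("cause", "m", "n"),
]
filtered = filter_triples(triples, min_count=2)
```

In the toy example, the pair (m, n) is dropped by rule (1) because it occurs with a single relation, after which "cause" is left with a single argument pair and is dropped by rule (2).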

Evaluation Datasets and Metrics
We use the Levy/Holt Dataset (Levy and Dagan, 2016; Holt, 2018) and the Berant Dataset (Berant et al., 2011) to evaluate the performance of entailment graph models.
In Levy's dataset, each example contains a pair of triples with the same entities but different predicates. Questions with one predicate, like "Which medicine cures the infection?", were shown to the annotators. The label of each example is either True or False, indicating whether the first typed predicate entails the second one; it is obtained by asking the annotators whether the triple with the first predicate answers the question posed with the second one. For example, if "Griseofulvin is preferred for the infection" is a correct answer to the above question, the dataset labels "is preferred for" → "cures" as True. Holt (2018) re-annotates Levy's dataset and forms a new dataset with 18,407 examples (3,916 positive and 14,491 negative), referred to as the Levy/Holt Dataset. In our experiments, the dataset is split into a validation set (30%) and a test set (70%) as in Hosseini et al. (2018). Berant et al. (2011) annotate all the entailment relations in their corpus, which generates 3,427 positive and 35,585 negative examples, referred to as the Berant Dataset. Their entity types do not exactly match the first level of the FIGER type hierarchy, and therefore a simple hand-mapping by Hosseini et al. (2018) is used to unify the predicate types.
To be comparable with previous works, we evaluate our methods on the test set of the Levy/Holt Dataset and on the whole Berant Dataset by calculating the area under the curve (AUC) obtained by varying the classification threshold on the global entailment scores. Hosseini et al. (2018) argue for using the AUC of the Precision-Recall Curve (PRC) for precisions in the range [0.5, 1], as predictions with higher precision than random are more important for downstream applications. Therefore, we report both the AUC of PRC for precisions in the range [0.5, 1] and the traditional AUC of ROC, which is more widely used in the evaluation of other tasks.
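The restricted AUC of PRC described above can be approximated with the following sketch. It assumes distinct scores, builds one precision-recall point per threshold, keeps only the points whose precision is at least 0.5, and integrates them with the trapezoidal rule in curve order. The evaluation scripts used in previous works may treat ties, interpolation and the start of the curve differently, so this is an illustration rather than the reference implementation; the function names are hypothetical.

```python
def pr_points(scores, labels):
    """(recall, precision) points obtained by lowering the threshold one example at a time."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    true_pos = 0
    points = []
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        points.append((true_pos / total_pos, true_pos / k))
    return points

def auc_prc(points, min_precision=0.5):
    """Trapezoidal area under the points whose precision is at least min_precision."""
    kept = [pt for pt in points if pt[1] >= min_precision]
    area = 0.0
    for (r0, p0), (r1, p1) in zip(kept, kept[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

# Toy predictions: entailment scores and gold labels (1 = entails, 0 = does not).
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
area = auc_prc(pr_points(scores, labels))
```

Restricting the area to precisions of at least 0.5 means that a model which never reaches that precision gets an area of zero, regardless of how much recall it achieves at lower precision.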

Comparison Methods
We compare our model with existing entailment graph construction methods (Berant et al., 2011; Hosseini et al., 2018, 2019, 2021) and the best local distributional method, Balanced Inclusion (Szpektor and Dagan, 2008), referred to as BInc. We also include ablation variants of EGT2, including local models with or without fine-tuning.

Implementation Details
For the local transformer-based LM, EGT2 uses DeBERTa (He et al., 2020) as implemented in the Hugging Face transformers library (Wolf et al., 2019), which has been fine-tuned on the MNLI dataset (Williams et al., 2018). In order to adapt it to the special type-oriented sentence pattern generated by $S$, we expand the validation set by extracting all of its predicates, generating sentence pairs with generator $S$ for every two predicates, and checking whether the pairs are labeled as paraphrase or entailment in the Paraphrase Database (PPDB) (Pavlick et al., 2015). We use 80% of the generated corpus to fine-tune DeBERTa with cross-entropy loss, and the rest as the validation set of the fine-tuning process. The fine-tuning learning rate is $\alpha_f = 10^{-5}$, and the process is terminated when the $F_1$ score of the entail label on the validation set has not increased for 10 epochs, or after 100 training epochs.
For the global soft transitivity constraints, we use SGD (Cun et al., 1998) to optimize the scores $W$ in entailment graphs with the loss function $L$ in Eq. 5 for $e = 5$ epochs. The SGD learning rate is $\alpha = 0.05$, the coefficient is $\lambda = 1$, and the confidence threshold is $\epsilon = 0.02$. The hyper-parameters are selected based on the Levy/Holt validation set. More implementation details are given in Appendix B.
For testing, if one or both predicates of an example do not appear in the corresponding typed entailment graph, we handle the example as an untyped one by resorting to its average score among all typed entailment graphs. This setting is also used for all local and global methods in the experiments for fair comparison.
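The back-off rule described above can be sketched as follows. The sketch assumes that each typed graph is a nested dictionary from premise to hypothesis to score, that predicates are compared by their untyped names, and that the average is taken over the typed graphs that contain both predicates; the text does not say which graphs enter the average, so that choice, like the function names, is an assumption.

```python
def graph_nodes(graph):
    """All predicates that occur as premise or hypothesis in a nested-dict graph."""
    return set(graph) | {q for row in graph.values() for q in row}

def backoff_score(p, q, typed_graphs, types):
    """Score of p -> q in the graph for `types`, or the untyped average as a fallback."""
    graph = typed_graphs.get(types, {})
    nodes = graph_nodes(graph)
    if p in nodes and q in nodes:
        return graph.get(p, {}).get(q, 0.0)
    # Fallback: average over every typed graph in which both predicates appear.
    scores = [
        g.get(p, {}).get(q, 0.0)
        for g in typed_graphs.values()
        if p in graph_nodes(g) and q in graph_nodes(g)
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Toy graphs with hypothetical scores.
typed_graphs = {
    ("medicine", "disease"): {"cure": {"treat": 0.9}},
    ("person", "disease"): {"cure": {"treat": 0.5}},
}
typed = backoff_score("cure", "treat", typed_graphs, ("medicine", "disease"))
untyped = backoff_score("cure", "treat", typed_graphs, ("food", "disease"))
```

Here `typed` is read directly from the (medicine, disease) graph, while `untyped` falls back to the mean over the two graphs because no (food, disease) graph contains the predicates.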

Main Results
We summarize the model performances on both the Levy/Holt and Berant datasets in Table 2. All global methods, including Hosseini et al. (2018), Hosseini et al. (2019) and EGT2, perform better than their corresponding local methods, which demonstrates the effect of global constraints in alleviating the data sparsity. Although it uses the same extracted entailment relations as Hosseini et al. (2019), our EGT2-Local significantly outperforms previous local methods because of the high-quality entailment scores generated by the reliable fine-tuned textual entailment LM. On the whole, EGT2 with transitivity constraint $L_3$ outperforms all the other models on both the Levy/Holt Dataset and the Berant Dataset in AUC of PRC, while EGT2-$L_1$ performs best in AUC of ROC. All three soft transitivity constraints boost the performance of the local model on all evaluation metrics, which shows that making use of the transitivity rule between entailment relations improves the local entailment graph. Both EGT2-$L_1$ and EGT2-$L_3$ perform better than EGT2-$L_2$, which indicates that involving the premises a → b and b → c in the loss function is also important when using transitivity constraints.
The Precision-Recall Curves of the different methods and the Precision-Recall Point of Berant et al. (2011) on the two evaluation datasets are shown in Figure 2(a) and 2(b) respectively. The local and global models of EGT2 consistently outperform previous state-of-the-art methods at all levels of precision and recall, which indicates the effect of our local model based on textual entailment and of the global soft constraints based on transitivity. EGT2-Local achieves slightly higher precision than the global models in the range recall < 0.5, but its precision drops quickly when higher recall is required, which leads to worse overall performance than the global models. The result indicates that the global models with transitivity constraints gain a significant improvement in recall at a far smaller cost in precision than EGT2-Local.

How does the local model fine-tuning work?
As described in Section 4.4, a new corpus is generated for fine-tuning the local model. We claim that the fine-tuning corpus improves the performance of EGT2-Local by adapting it to the special sentence pattern generated by $S$, rather than by offering additional data that fits the distribution of the target datasets as traditional training datasets do. To verify this, we also test a simple supervised method, labelled Local-Sup, which fits a two-layer feedforward neural network on the fine-tuning corpus with cosine similarity, Weeds, Lin and BInc scores as features. If the corpus acted as a training dataset, the performance of Local-Sup should be clearly better than that of its unsupervised features. As shown in Table 2, Local-Sup does not perform significantly better on the Levy/Holt Dataset, and performs even worse on the Berant Dataset, than BInc, which is one of the input features of Local-Sup. The result illustrates the difference between the fine-tuning corpus and the evaluation datasets, and shows that the corpus acts as a pattern-adapting corpus rather than as a training dataset.

Why are global constraints helpful?
In Section 1, we expected the improvement from soft transitivity constraints to come from the alleviation of data sparsity in the corpus. To examine the sparsity before and after applying the transitivity constraints, we count how many of the positive and negative entailment relations in the Levy/Holt test set appear in the local and global entailment graphs respectively, and show the counts in Table 3. All three soft transitivity constraints help to find more entailment relations than the local entailment graph and therefore achieve better performance on the evaluation datasets. Although EGT2-$L_2$ finds the most entailment relations of the dataset in the global stage, it also finds more negative examples and thus performs worse than $L_1$ and $L_3$, as shown in Table 2. On the other hand, EGT2-$L_1$ and EGT2-$L_3$ obtain a larger proportion of positive examples by considering the premise relations during the gradient calculation. A low confidence of the hypothesis relation $W_{a,c}$ should help to detect spurious premises $W_{a,b}$ and $W_{b,c}$. Accordingly, EGT2-$L_3$ slightly outperforms EGT2-$L_1$, as the gradients of $W_{a,b}$ and $W_{b,c}$ in $L_3$ depend on the hypothesis relation $W_{a,c}$. We have also applied the soft transitivity constraints to the local graphs of BInc and Hosseini et al. (2019), but observed only a slight improvement in performance, from .155 to .157 and from .167 to .170 respectively for EGT2-$L_3$ on the PRC of the Levy/Holt Dataset. Comparing this with the significant improvement over EGT2-Local, we claim that high-quality local entailment graphs are the basis of effective soft transitivity constraints.
The previous cross-graph soft constraint and paraphrase resolution soft constraint proposed in  Table 4. We can see that the previous baselines do not perform well on AUC of PRC, which indicates that it is difficult for them to reach precision > 0.5. Meanwhile, EGT2-Local and EGT2-$L_3$ outperform all baselines on the directional section of the Levy/Holt Dataset. Unsurprisingly, the AUC scores of all models on the directional section are lower than on the original Levy/Holt Dataset, showing the challenges of directional entailment inference. The two EGT2 variants maintain high performance, which shows that our local model learns to capture directional predicate entailment better than the distributional baselines, and that the global soft constraint also helps directional entailment inference.

Error Analysis
We randomly sample and analyze 100 false positive (FP) examples and 100 false negative (FN) examples from the Levy/Holt test set according to the predictions of EGT2-$L_3$. We manually set the decision threshold to 0.574 to bring the precision close to 0.76, which is the same as that of Berant et al. (2011). The major error types are shown in Table 5. Although the global constraint is used, about half of the FN errors are due to data sparsity, where the entailment relations are not found in the entailment graph. Compared with the results in Hosseini et al. (2018), EGT2-$L_3$ reduces the ratio of Sparsity in FN errors from 93% to 46%, showing a stronger ability to alleviate data sparsity. About a quarter of the FN errors are caused by Under-weighted Relations in the graph, where EGT2 finds the entailment relations but gives them scores lower than the threshold. The rest of the FN errors are related to Dataset Wrong Labels, which occur when the predicates are indeed entailed by others but are labelled as negative, or when the predicate pairs are incomplete.
Most of the FP errors are caused by Spurious Correlation: these relations are too deceptive for EGT2 to see through, and are consequently given high scores. A few FP errors are inevitably caused by Lemma-based Processing in the LM, but their ratio still drops from 12% in Hosseini et al. (2018) to 5%. The result indicates that our fine-tuned LM handles predicates with similar surface forms and contexts better than parsing-based distributional local features.

Conclusions
In this paper, we propose a novel typed entailment graph learning framework, EGT2, which uses language models fine-tuned on textual entailment tasks to calculate local entailment scores and applies soft transitivity constraints to learn global entailment graphs with a gradient-based method. The transitivity constraints are implemented as carefully designed loss functions, and effectively boost the quality of local entailment graphs. By using the fine-tuned local LM and global soft constraints, EGT2 does not rely on distributional features, and can be easily applied to large-scale graphs. Experiments on standard benchmark datasets show that EGT2 achieves significantly better performance than existing state-of-the-art entailment graph methods.