Teach the Rules, Provide the Facts: Targeted Relational-knowledge Enhancement for Textual Inference

We present InferBert, a method to enhance transformer-based inference models with relevant relational knowledge. Our approach facilitates learning generic inference patterns that require relational knowledge (e.g. inferences related to hypernymy) during training, while injecting the relevant relational facts (e.g. pangolin is an animal) on demand at test time. We apply InferBert to the NLI task over a diverse set of inference types (hypernymy, location, color, and country of origin), for which we collected challenge datasets. In this setting, InferBert succeeds in learning general inference patterns from a relatively small number of training instances, while not hurting performance on the original NLI data and substantially outperforming prior knowledge-enhancement models on the challenge data. It further applies its inferences successfully at test time to previously unobserved entities. InferBert is also computationally more efficient than most prior methods, in terms of number of parameters, memory consumption, and training time.


Introduction
Transformer-based pre-trained language models (LMs), such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2018), have recently achieved human-level performance on standard natural language inference (NLI) benchmarks (Wang et al., 2019). However, the performance on this complex task is achieved in part thanks to large training sets that facilitate learning of dataset-specific biases and correlations, and thanks to similar distributions of the training and test sets, which reward such models (Poliak et al., 2018; Gururangan et al., 2018). This contrasts with humans, who can learn a generalized solution from fewer examples (Linzen, 2020). Indeed, NLI models often fail on examples involving various linguistic phenomena, such as co-hyponymy (Glockner et al., 2018) and negation (Naik et al., 2018), which they are expected to acquire indirectly from the NLI training set.
Prior work proposed to provide ("inoculate") NLI models with a small number of phenomenon-specific training examples in order to teach the model to address them (Liu et al., 2019a). However, Rozen et al. (2019) showed that when the distributions of the training and test sets differ with respect to syntactic and lexical properties, the performance of such inoculated models drops, concluding that they do not learn a generalized notion of the phenomenon. In this paper we are motivated by the following question: how can we facilitate learning of generalized inference patterns, with respect to a given linguistic phenomenon, from a relatively small number of examples?
Ideally, we would like an NLI model to learn inference patterns detached from their original context, and to be able to apply them in new contexts involving different concrete facts. For example, an NLI model may learn that a word entails its hypernym in upward monotone sentences from training examples such as: Alice ate a banana → Alice ate a fruit. Then, to be able to apply this rule to a test instance with the premise Bob saw a pangolin and the hypothesis Bob saw an animal, it needs to know that animal is a hypernym of pangolin. Training a model on every possible hyponym-hypernym pair is incredibly inefficient and requires re-training a model whenever the vocabulary expands. Instead, we propose to decouple the learning of generic inference patterns from that of the factual knowledge.
To that end, we develop InferBert, a method to enhance language models with relational knowledge from a knowledge base (KB). In contrast to recent knowledge-enhancement approaches such as KnowBert (Peters et al., 2019) and Ernie (Zhang et al., 2019) that incorporate into LMs knowledge about individual entities (e.g. pangolin), we inform the LM of the relation between a pair of entities that are involved in an inference instance, e.g. Hypernym(pangolin) = animal. This approach is agnostic to the identity of the specific entities, allowing models to learn inference patterns separately from the individual facts involved in particular instances.
To evaluate the ability of NLI models to learn inference patterns for specific linguistic phenomena, we follow the evaluation approach taken in previous work (Naik et al., 2018; Liu et al., 2019a), which demonstrated the learning ability of models over a few chosen inference phenomena. We focus on 4 target semantic relations: hypernymy, location, country of origin, and color, for which we create challenge sets 1 (see Table 1 for examples). We construct the challenge sets such that there is no overlap between the training, validation, and test sets with respect to the target entities (e.g. pangolin), to allow testing whether the model has learned an inference phenomenon in a generic manner, rather than performing lexical memorization. The training sets are deliberately small (660-960 instances), aiming to challenge models with learning from a relatively small number of examples per semantic phenomenon.
Our results confirm that InferBert manages to generalize inference patterns to new facts, substantially improving performance on the challenge sets over the knowledge-enhanced baselines (up to +17.5 points in accuracy over the next best model), all while maintaining the performance on the original MultiNLI test set (Williams et al., 2018).
Moreover, InferBert not only learns from a small number of training examples (which are insufficient for the baselines), it is also considerably more efficient than prior knowledge-enhanced LMs in terms of training time, resources, and memory. InferBert does not require LM pre-training, which is a computationally expensive process, and does not embed entities, only a small number of relations, substantially reducing the number of parameters with respect to some of the prior work (e.g. only 23% of KnowBert's parameters).
Finally, while InferBert is demonstrated on NLI, it is a general method and may benefit additional tasks, such as question answering and coreference resolution, which may rely on relational knowledge between words in given instances.

1 All datasets and resources are available at https://github.com/ohadrozen/inferbert.

Hypernymy
P: He killed another jay this season.
H: He took life away from a bird this season.
Label: Entailment    Relation: Hypernym(jay) = bird

Location
P: It is not located in Baytown.
H: It is located in all cities in Texas except for one.

Color
H: Tommy did not order any dark brown fruits.
Label: Neutral    Relation: none*

Country of Origin
P: Viesgo deal, from beginning to end, took less than five weeks.
H: The minimum amount of time it has ever taken a Spanish company to close a deal is six weeks.
Label: Contradiction    Relation: CountryOfOrigin(Viesgo) = Spain

Table 1: Examples from the challenge sets, one per semantic relation.

In the NLI task, also known as Recognizing Textual Entailment (Dagan et al., 2013), the goal is to determine whether a first text unit (premise) entails, contradicts, or is neutral with respect to a second text (hypothesis). The decision involves various syntactic and semantic phenomena, including lexical and world knowledge, coreference resolution, geographical reasoning, etc. (Clark, 2018). While neural models have achieved human performance on the GLUE and SuperGLUE benchmarks (Wang et al., 2018, 2019), the success of such models is often due to learning non-generalizable dataset-specific patterns (Poliak et al., 2018; Gururangan et al., 2018; McCoy et al., 2019).
Various challenge sets were developed to test the capabilities of state-of-the-art NLI models in addressing specific semantic phenomena. For example, Glockner et al. (2018) showed that substituting a single term in the premise with a similar but mutually-exclusive term (e.g. guitar and piano) confused NLI models, which predicted entailment. Naik et al. (2018) further showed that NLI models perform poorly on examples involving antonyms, numerical reasoning, and distractions such as high lexical overlap and spelling errors. NLI models also struggled with examples involving logic and monotonicity (Yanaka et al., 2020; Geiger et al., 2020). Finally, the GLUE benchmark dedicated a small set for diagnosing models' strengths and weaknesses on various phenomena (Wang et al., 2018). Liu et al. (2019a) suggested that NLI models may perform poorly on specific phenomena they have not observed enough during training, and proposed to "inoculate" LM-based models against challenge sets by fine-tuning them on a small number of phenomenon-specific training instances. Rozen et al. (2019) showed that inoculation does not necessarily teach the model a generalized notion of the phenomenon of interest, and that when the challenge test set differs from the corresponding training set in terms of, for example, syntactic complexity, the performance of the inoculated models drops. Follow-up work highlighted the sensitivity of inoculation training to hyper-parameters, which may result in "catastrophic forgetting", i.e. a substantial drop in performance on the original NLI task.

Knowledge-Enhanced Models
There is plenty of work on incorporating knowledge from KBs into neural models. The Knowledge-based Inference Model (KIM; Chen et al., 2018) incorporated semantic relations from WordNet into an RNN-based NLI model, gaining a modest improvement on a challenge set. Because the knowledge is incorporated into various components of the original NLI model, the approach is not straightforward to adapt to other models.
KnowBert (Peters et al., 2019) incorporated knowledge from Wikipedia and WordNet into a BERT model through entity embeddings, improving performance on relation extraction and entity typing. Ernie (Zhang et al., 2019) and K-Adapter (Wang et al., 2020a) both targeted similar downstream tasks. Ernie embeds entities and relations from a KB and alters the BERT pre-training to predict entities in addition to words. K-Adapter does not re-train the LM weights, but takes a somewhat more efficient approach of training an additional neural component ("adapter") for each knowledge type as a plug-in for the LM. KEPLER (Wang et al., 2020b) learns entity embeddings from their textual descriptions. These entity-centric methods require pre-training the original LM or its plug-ins on the KB, which increases training time and cost, and requires storing the entity embeddings, increasing memory consumption. In addition, by design, the knowledge can capture only entities seen during pre-training, thus requiring the pre-training process to be repeated each time the underlying KB is updated.
Finally, K-BERT (Liu et al., 2019b) is most similar to our model, incorporating knowledge regarding individual entities that occur in the input instance. Like in our model, knowledge is augmented per instance, at inference time. Unlike in our model, knowledge is augmented per entity, rather than per relation between a pair of entities appearing in the inference instance. Further, K-BERT injects the KB knowledge in textual form, augmenting the input instance, while our model directly embeds structured knowledge. As we show in Section 6.1, this textual encoding is less effective than our structured incorporation method (Section 4.2), leading to weaker learning of the different inference phenomena that require external knowledge.

Data
We focus on four types of semantic relations (Section 3.1), each corresponding to a set of facts in the form of semantic relation triplets. An NLI model may learn various inference patterns pertaining to the semantic relation type, such as "a word entails its hypernym in an upward monotone sentence".
To evaluate the models' ability to learn and apply these rules, we create an NLI challenge set for each semantic relation, that we derive from MultiNLI (Section 3.2). As usual, the goal is to determine the label of a premise-hypothesis pair (p, h) among entailment, neutral, and contradiction. For a given semantic relation, each instance in the corresponding challenge set requires applying an inference pattern associated with the semantic relation in order to determine the correct label (possibly along with other required inferences).

Semantic Relations
Hypernymy. An NLI system might learn that a term generally entails its generalization, for example "I ate an apple" entails "I ate a fruit". The relevant facts for this semantic relation are pairs of (x, y) terms that appear in a direct or indirect hypernymy relation in WordNet (excluding instance hypernyms).

KB entry: Emporis — CountryOfOrigin (property): Germany
Extracted premise: These forms will be posted on Apple website.
Altered premise: These forms will be posted on Emporis website.
Manually created hypotheses:
(1) A company in Germany will make the forms available on its website. (Entailment)
(2) The forms cannot be accessed from the website of any German company. (Contradiction)
(3) Several German websites will feature the forms. (Neutral)
Hypotheses with property replacement:
(4) A company in France will make the forms available on its website. (Neutral)
(5) The forms cannot be accessed from the website of any French company. (Neutral)
(6) Several French websites will feature the forms. (Neutral)

Table 2: Challenge-set construction example for the country of origin relation. Hypotheses (1)-(3) were created by crowdworkers for the altered premise, based on the Wikidata fact that Emporis' country of origin is Germany. Hypotheses (4)-(6) were created by replacing German with another country of origin (France) and annotated for entailment.

Table 3: Challenge-set sizes (train, dev, test, and all) per inference type.
Location. A model may learn that in some contexts, substituting a city name with the state in which it is located yields a factually correct generalization (e.g. "John visited Chicago" entails "John visited Illinois"). We retrieve entities from Wikidata (Vrandečić and Krötzsch, 2014), focusing on US locations, using the state property.
Color. We retrieve entities from Wikidata along with their color property. We replace an entity (e.g. banana) with a generalization involving its color and hypernym (e.g. yellow fruit).
Country of Origin. We retrieve knowledge from Wikidata about companies and their country of origin, using the country property. We replace an entity (e.g. Apple) with a generalization involving its country of origin (e.g. American organization).

Challenge Sets
Some of the semantic relations we focus on are very rare in the original MultiNLI dataset; for example, by heuristically searching for instances that exhibit these phenomena, we found that fewer than 0.05% of the data contained locations. We therefore create challenge sets focusing on each semantic relation. In order to create challenge examples in a similar style and domain, we base our examples on premises in MultiNLI.
For a given semantic relation r, we extract premises in the MultiNLI training set that contain an entity I_0^tail whose type corresponds to the relation's tail argument. For example, for the country of origin semantic relation we extract premises containing company names (e.g. I_0^tail = Apple). For a given premise p, we modify it by replacing I_0^tail with a random entity I_1^tail of the same type in the KB (e.g. Emporis), and manually check that the sentence still makes sense. We specifically select replacement entities I_1^tail such that there exists a KB assertion R(I_1^tail) = I_1^head (e.g. CountryOfOrigin(Emporis) = Germany).
From each premise p we created six hypotheses as follows (see Table 2). Similarly to Williams et al. (2018), we showed p to crowdsourcing workers and asked them to generate a hypothesis for each label (entailment, neutral, and contradiction). Our instructions further specified that the hypothesis must include I_1^head (e.g. Germany) but not I_1^tail (e.g. Emporis). Examples (1)-(3) in Table 2 demonstrate the instances created at this step. Next, we replaced I_1^head in each hypothesis with another entity that can serve as the head of the same property R (e.g. France). We then asked an annotator to label the new hypotheses with respect to p (Table 2, instances (4)-(6)).
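The property-replacement step above can be sketched as a surface substitution (a hypothetical helper; in practice the replacement values and their adjectival forms, e.g. German → French, were chosen manually):

```python
def make_property_replacement(hypotheses, swaps):
    """Generate 'property replacement' hypotheses (instances (4)-(6) in
    Table 2) by swapping every surface form of the original head entity
    for the replacement. `swaps` maps original forms to replacement forms
    (e.g. Germany->France, German->French); longer forms are replaced
    first so 'Germany' is not partially rewritten via 'German'."""
    out = []
    for h in hypotheses:
        for old, new in sorted(swaps.items(), key=lambda kv: -len(kv[0])):
            h = h.replace(old, new)
        out.append(h)
    return out
```

The resulting hypotheses are then re-annotated, since the replacement typically changes the label (here, to neutral).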
The annotation task was performed using Amazon Mechanical Turk. To ensure the quality of the work, we required that workers have a minimum 96% acceptance rate on prior HITs and pass a qualification test. We paid $1 for each premise. The test set was further validated by two trained annotators. The first annotator re-labeled each example, and, in case of disagreement with the original label (11.9% of the annotations), the second annotator also labeled the example, and the majority vote was used (all three annotations were given equal weight).
Data Split. The statistics of the challenge sets are shown in Table 3. We split the datasets into 68%-9%-23% train, dev, and test, respectively. The datasets are split lexically, i.e. such that head and tail entities in one set do not appear in the other sets. That way, good performance on the test set indicates that the model learned a generalized notion of an inference rule rather than specific facts, and that it is capable of applying the rule when provided with the necessary, yet previously unobserved, facts.
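The lexical split can be sketched as follows (a hypothetical implementation; the paper does not specify the exact grouping procedure, so this groups examples that share an entity and assigns each group to a single split):

```python
import random

def entity_disjoint_split(examples, ratios=(0.68, 0.09, 0.23), seed=0):
    """Split examples into train/dev/test so that no entity appears in
    more than one split. Each example is a dict with an 'entities' set."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    # Link entities that co-occur in an example, so examples sharing any
    # entity end up in the same group (and hence the same split).
    for ex in examples:
        ents = list(ex["entities"])
        for e in ents[1:]:
            union(ents[0], e)

    groups = {}
    for ex in examples:
        groups.setdefault(find(next(iter(ex["entities"]))), []).append(ex)

    buckets = ([], [], [])
    order = list(groups.values())
    random.Random(seed).shuffle(order)
    total = len(examples)
    for group in order:
        # Greedily place each group in the most under-filled split.
        deficits = [r * total - len(b) for r, b in zip(ratios, buckets)]
        buckets[deficits.index(max(deficits))].extend(group)
    return buckets  # (train, dev, test)
```

Because whole entity groups are assigned to a single split, the resulting sizes only approximate the target ratios when groups are large.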

InferBert
We present InferBert, a BERT-based NLI model with a relational knowledge enhancement component. The key idea in InferBert is to incorporate into the model relational knowledge (facts) from external knowledge resources regarding entities mentioned in the input instance. We adopt an inclusive definition of entity, which can refer either to a named entity (such as entries in Wikidata) or a common noun (such as WordNet lemmas).
As we discussed in Section 2.2, most prior work injects external knowledge into models through an entity's knowledge base embedding, which captures in a soft way its relationships with other KB entities. The limitation of such methods is the coupling of an inference pattern with the related factual knowledge. Suppose that a model observed during training that "The boy ate an apple" entails "The boy ate a fruit". The test example with the premise "The woman has a dog" and the hypothesis "The woman has a pet" is represented differently from the training example due to the distance between the entities (e.g. apple and dog) in the KB. Such a model is likely to fail on examples consisting of unseen entities.
We propose to decouple learning the inference pattern from the facts by directly embedding the semantic relations between entities in the text. In the above example, InferBert can access the KB during both the training and inference phases, and add an indicator that fruit = Hypernym(apple). After observing enough training examples with the hypernym indicator, the model can learn a general rule like "a word entails its hypernym in certain common contexts". During inference, the model can apply this rule to entities it has not seen, as long as they appear in the KB.
We first describe the KnowBert model (Peters et al., 2019; Section 4.1), which is the basis for InferBert. Next, we describe how we replace KnowBert's Knowledge Attention and Recontextualization component (KAR) with our Simplified KAR (S-KAR, Section 4.2).

KnowBert's KAR
KnowBert is a method to incorporate knowledge from KBs into transformer-based language models, which was specifically applied to BERT_BASE. For a given input X = (x_1, ..., x_N) of N word pieces, the BERT contextual embeddings are computed as H_i = TransformerBlock(H_{i-1}), where H_i ∈ R^{N×D} is the i-th hidden layer (i ∈ {1, ..., L}, with L = 12 layers) and D is BERT's embedding dimension. TransformerBlock operates over a query, key, and value, and is defined as TransformerBlock(H_i) = MLP(MultiHeadAttn(H_i, H_i, H_i)).
The Knowledge Attention and Recontextualization component (KAR) is inserted between BERT layers i and i + 1, changing the embedding mechanism to H'_i = KAR(H_i, C), which is computed as follows:

Retrieval: The KB entity candidate selector provides a list of C potential entity links for X, along with their mention spans in X.
Disambiguation: Each mention span is represented by applying self-attention pooling over all word pieces in the span (after projection to the entity embedding dimension E), yielding S ∈ R^{C×E}. To select the relevant entities in the context, mention-span self-attention is applied to compute S^e = TransformerBlock(S), followed by computing candidate entity scores ψ based on S^e.
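The span-pooling step above can be sketched as follows (an illustrative numpy sketch; the learned score vector w and the projection are assumptions, not KnowBert's exact parameterization):

```python
import numpy as np

def span_pool(word_piece_vecs, w):
    """Self-attention pooling over the word pieces of one mention span:
    score each (projected) word-piece vector with a learned vector w,
    softmax the scores, and return the weighted sum as the span vector."""
    scores = word_piece_vecs @ w                 # one score per word piece
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ word_piece_vecs
```

Stacking the pooled vectors of all C candidate spans yields the matrix S ∈ R^{C×E} described above.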

Knowledge incorporation: The entity embeddings are averaged to ẽ according to their weights ψ, and are used to enhance the mention-span representations, yielding S'^e = S^e + ẽ.
Recontextualization: The BERT word-piece representations are recontextualized using a modified transformer layer in which S'^e is used as both the key and value for MultiHeadAttn. The resulting vectors H'_i are projected back into the BERT dimension D.

S-KAR
The main component of InferBert is the Simplified Knowledge Attention and Recontextualization component (S-KAR). Rather than enhancing BERT with KB entity embeddings, InferBert embeds the KB relations.
Similarly to KAR, S-KAR replaces BERT's embedding mechanism between two particular layers, computing H'_i = S-KAR(H_i, C), which is then used to compute the next layer, H_{i+1} = TransformerBlock(H'_i); the remainder of BERT runs as usual. S-KAR operates as follows:

Retrieval: We follow KnowBert (Peters et al., 2019) and adopt a broad definition of a KB as a collection of (tail entity, relation, head entity) triplets, focusing on K relation types of interest: R = {R_1, ..., R_K}. For each relation type R_k we learn two embedding vectors, e_k^head and e_k^tail, representing the head and tail entity slots of this relation. We assume that for a given relation set R, the KB is accompanied by a relation extractor, which takes a text X as input and returns a list of triplets C = {(head_m, tail_m, r_m)}, where head_m and tail_m are the indices of the first token of the head and tail entities in the text (1, ..., N), and r_m is the relation, as illustrated in Figure 1.
Disambiguation: We focus on unambiguous entities, i.e. those with a single KB entry with respect to the relation type, and extract only entities of the relevant type. For example, though Pitcher has multiple entries in Wikidata, only one of them is a location.
Knowledge incorporation: For a given list of triplets C, S-KAR creates the relation embedding matrix R ∈ R^{N×E} such that the head-slot embedding e_{r_m}^head is placed at index head_m, the corresponding tail-slot embedding e_{r_m}^tail is placed at index tail_m, and the remaining entries are set to 0. We then add these relation embeddings to the (projected) BERT token representations.

Recontextualization: The recontextualization step is identical to KAR.
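A minimal numpy sketch of the knowledge-incorporation step (the shapes, variable names, and the simple additive combination with the hidden states are illustrative assumptions, not the exact implementation):

```python
import numpy as np

N, E = 12, 8          # sequence length, relation-embedding dimension
K = 4                 # number of relation types

rng = np.random.default_rng(0)
# Two learned vectors per relation type: head-slot and tail-slot embeddings.
e_head = rng.normal(size=(K, E))
e_tail = rng.normal(size=(K, E))

def build_relation_matrix(triplets, N, E):
    """Build R in R^{N x E}: row head_m gets e_head[r_m], row tail_m gets
    e_tail[r_m]; all other rows stay zero."""
    R = np.zeros((N, E))
    for head_m, tail_m, r_m in triplets:
        R[head_m] += e_head[r_m]
        R[tail_m] += e_tail[r_m]
    return R

# Example: Hypernym(pangolin) = animal, with 'pangolin' at token 3 in the
# premise and 'animal' at token 9 in the hypothesis; relation id 0.
triplets = [(9, 3, 0)]      # (head index, tail index, relation id)
R = build_relation_matrix(triplets, N, E)

H = rng.normal(size=(N, E))  # stand-in for the projected BERT hidden states
H_enhanced = H + R           # knowledge incorporation
```

All token positions that do not participate in a relation are left untouched, so the model only receives a signal at the head and tail slots.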

Experimental Setup
BERT model. Our model assumes access to a pre-trained BERT model with or without additional fine-tuning on the target downstream task. Specifically, we used the English uncased BERT BASE model (Devlin et al., 2019) fine-tuned on the MultiNLI dataset (Williams et al., 2018). Based on preliminary experiments, the S-KAR layer was inserted between the first and second layers of BERT.
Relational data. We retrieve relational data from WordNet and Wikidata (see Section 3.1). For a given premise p and hypothesis h, we retrieve the list of relevant KB triplets {(head_m, tail_m, r_m)} (Section 4.2) for which the tail entity appears in the premise, the head entity appears in the hypothesis, and head ≠ tail. Since we focus on entities that are unambiguous in the context of a given relation, we do not need to use an entity linker. We make sure that the target entities in the train, validation, and test sets are distinct, but that they all have entries in the relevant KB.
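The retrieval step can be sketched as follows (a simplified token-level stand-in for the actual WordNet/Wikidata lookup; the `kb` list and helper name are hypothetical):

```python
def retrieve_triplets(premise_tokens, hypothesis_tokens, kb):
    """Return (head, tail, relation) triplets whose tail entity appears in
    the premise and whose head entity appears in the hypothesis. `kb` is a
    list of (tail, relation, head) assertions over unambiguous entities."""
    premise, hypothesis = set(premise_tokens), set(hypothesis_tokens)
    return [
        (head, tail, rel)
        for tail, rel, head in kb
        if tail in premise and head in hypothesis and head != tail
    ]
```

In the full model, the matched entities would additionally be mapped to their token indices so that S-KAR can place the slot embeddings at the right positions.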
Training data. We train a single model on the combination of the challenge sets, to learn phenomena related to all the semantic relations. To avoid "catastrophic forgetting", i.e. a decrease in performance on the original task (MultiNLI), we mix the challenge training set with a random sample of 10K MultiNLI training instances and train on the mixed dataset. The training objective assigns more weight to the challenge examples by scaling their loss, L'_BERT = γ · L_BERT, where γ > 1 is a hyperparameter tuned on the validation set.
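The weighted objective can be sketched as follows (a framework-agnostic sketch; `weighted_batch_loss` is a hypothetical helper, with γ = 4 as the value ultimately selected on validation):

```python
def weighted_batch_loss(losses, is_challenge, gamma=4.0):
    """Average per-example losses over a mixed batch, up-weighting the
    challenge-set examples by a factor gamma; MultiNLI examples keep
    weight 1."""
    weights = [gamma if c else 1.0 for c in is_challenge]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

Mixing in the 10K MultiNLI sample keeps the original task represented in every batch, while γ steers the gradient signal toward the challenge phenomena.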
Training procedure. The model consists of a pre-trained BERT model and randomly initialized InferBert parameters (S-KAR weights and relation embeddings). To embed both sets of parameters in the same space, we follow KnowBert and train the model in two phases. In the first phase, we freeze the pre-trained BERT parameters and update only the S-KAR weights and the relation embeddings for 3 epochs. In the second phase, we freeze the newly trained InferBert parameters and unfreeze the BERT parameters, training for another epoch.

Baselines. We compare InferBert with two representative knowledge-informed models, KnowBert and K-BERT, as well as a BERT_BASE NLI model. All the baselines are trained on MultiNLI and further fine-tuned on the joint challenge set (mixed with a subset of MultiNLI). For a fair comparison, K-BERT used the same entity extraction mechanism, followed the same fine-tuning procedure, and was given access to exactly the same data as InferBert. KnowBert, on the other hand, requires re-training a new model on new data. Because of its resource requirements, we used the available pre-trained KnowBert model. It is enhanced with knowledge about 470K entities from Wikipedia and all of WordNet, fully covering the knowledge in our hypernymy and location challenge sets, but only some of the entities in the color and country of origin sets. Thus, the results for KnowBert on these two phenomena are not fully comparable to those of InferBert.
Hyper-parameters. Fine-tuning on MultiNLI followed the original hyper-parameters described in Devlin et al. (2019). When fine-tuning InferBert on the challenge sets, we selected the best hyperparameter values based on the performance on the validation sets. The learning rate for S-KAR was chosen between 0.003-0.007 in steps of 0.001, and was set to 0.006. The rest of the parameters were trained with a learning rate of 9e-6 (selected between 3e-6 and 4e-5). We tested γ values among {2, 4, 6, 8, 10, 12} and selected γ = 4. Fine-tuning was done on a single GeForce GTX 1080 GPU with batch size of 32. A single InferBert forward and backward pass took 0.35 seconds. K-BERT's best validation performance was achieved after 3 epochs with a learning rate of 3e-5 and KnowBert's after 4 epochs with a learning rate of 2e-5.

Experiments
We present the results of InferBert and the baselines on the various challenge sets (Section 6.1). We also test the ability of models to apply relational knowledge to entities seen during training (Section 6.2). Finally, we analyze InferBert's efficiency in terms of memory and runtime compared to the baselines (Section 6.3).

Performance on the Challenge Sets

Table 4 ("unseen" lines) shows the performance of InferBert and the baselines on the various challenge sets and on the MultiNLI (matched) development set. The knowledge-enhanced baseline models slightly outperform BERT on all semantic relations. InferBert performs best, with a large gap from the baselines (up to 20 points), demonstrating its ability to learn and generalize inference patterns and apply them to new relation instances, as well as to new entities.
K-BERT performs slightly better than KnowBert, yet worse than InferBert. We hypothesize that K-BERT and InferBert enjoy the advantage of having access to relational knowledge at inference time, which facilitates learning general inference patterns and applying them to new facts, on demand. With that said, the K-BERT method of incorporating relational knowledge as free text is less structured and likely leads to less efficient learning of inference patterns (with the limited amount of available training data).
InferBert retains high performance on the MultiNLI matched development set, with a 2.3% reduction from the original BERT_BASE model (84.6%). KnowBert achieves similar performance, while K-BERT performs slightly better on it.

Seen vs. Unseen Entities
In contrast to our original test sets, in which the entities have not been seen during training, in this experiment we analyze how the models perform when all entities were seen in the challenge training set. To that end, we duplicated our test sets, replacing the test triplets (head, tail, relation) with others that are included in the training set. The rest of the words remained the same, and we manually verified that the new test examples are still sensible and that their entailment labels have not changed. Results are shown in Table 4 in the "seen" rows. Evidently, InferBert shows impressive robustness when facing unseen entities, unlike the other models, which seem to depend significantly on seeing the test entities at training time. In fact, when faced with new entities, the other models' performance gets closer to that of the original BERT (with no knowledge injected).

Efficiency Analysis
While large language models lead to performance boosts on standard benchmarks, the NLP community has begun paying more attention to developing resource-efficient NLP models (Moosavi et al., 2020). In the design of InferBert we took efficiency into consideration. First, InferBert consumes significantly less memory than KnowBert, which stores the embeddings of all entities in memory up-front. KnowBert trained on Wikipedia and WordNet uses BERT_BASE (110M parameters), to which it adds the KAR component (7.3M) and the embeddings of 471K entities (406M parameters), resulting in 523.3M parameters. Conversely, instead of entity embeddings, InferBert supports up to K = 500 relation types × 2 vectors (tail and head), each of dimension E = 768, resulting in 768K parameters. The S-KAR component takes up 8.3M parameters. Overall, InferBert has 119.1M parameters, only 23% of KnowBert's. Second, as opposed to InferBert, KnowBert required a pre-training step in which the 471K instances (corresponding to KB entities) were processed.
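The parameter-count arithmetic above works out as follows (in millions of parameters):

```python
# KnowBert: BERT-base + KAR component + 471K entity embeddings.
knowbert = 110 + 7.3 + 406

# InferBert: up to 500 relation types x 2 slot vectors x dimension 768,
# instead of entity embeddings, plus the S-KAR component.
inferbert_relations = 500 * 2 * 768 / 1e6
inferbert = 110 + 8.3 + inferbert_relations

print(round(knowbert, 1))              # -> 523.3
print(round(inferbert, 1))             # -> 119.1
print(round(inferbert / knowbert, 2))  # -> 0.23
```

The relation embeddings contribute under 1M parameters, so almost all of InferBert's budget is the underlying BERT model itself.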
InferBert achieved better performance than KnowBert on the challenge sets with as few as 1,000 examples per relation (Table 4). We conjecture that InferBert's training is more data-efficient because it is not required to learn about specific head or tail entities (e.g. Emporis and Germany) but about relations (e.g. hypernymy), which occur more frequently in the training data.
Finally, we note that, similar to our model, K-BERT is also memory and parameter efficient since it does not store entity embeddings (as KnowBert does). Rather, it only involves fine-tuning the BERT parameters, thanks to representing the enhanced knowledge in textual form as part of the instance input. Our model does incorporate a modest number of additional parameters for structured relation embeddings, which, as shown in our experiment, leads to substantial performance gains over K-BERT's textual representations.

Conclusions
We presented InferBert, a generic and efficient method to incorporate relational knowledge into transformer-based inference models. Our approach targets specific inference phenomena that require external relational knowledge, allowing the model to learn generic inference patterns decoupled from the factual knowledge required for a particular instance, which is injected at inference time. Our experiments show that InferBert successfully applies the learned patterns to unseen facts, where other knowledge enhancement models fail. Unlike most prior work, InferBert does not require pre-training the LM on a KB, and consumes less memory.
Our work joins the effort of others to improve models by teaching them specific inference phenomena (Liu et al., 2019a). A natural direction for future work would be to apply our methodology to a broader range of inference phenomena and to adapt it to additional inference tasks.