How Fast Can BERT Learn Simple Natural Language Inference?

This paper empirically studies whether BERT can really learn to conduct natural language inference (NLI) without utilizing hidden dataset bias, and how efficiently it can learn if it can. We do so by creating a simple entailment-judgment task that involves only binary predicates in plain English. The results show that BERT learns this task very slowly. However, the learning efficiency can be greatly improved (a data reduction by a factor of 1,500) if task-related features are added. This suggests that domain knowledge greatly helps when conducting NLI with neural networks.


Introduction
Entailment judgment (Dagan et al., 2006; Marelli et al., 2014a) is a common test for natural language inference (NLI) (Camburu et al., 2018; Conneau et al., 2018), as it takes the simplest form among related tasks such as question answering (Bowman and Zhu, 2019). Also, the SNLI dataset (Bowman et al., 2015) is frequently adopted for NLI evaluation because it is the first corpus that specifically targets NLI and demonstrates the power of neural networks on the task.
To the best of our knowledge, none of the previous studies carefully removes dataset bias and then investigates how efficiently BERT can learn NLI once surface clues are completely removed. In this paper, we empirically study whether BERT is capable of learning NLI without surface clues/bias appearing in the dataset, and if it is, whether it can learn NLI efficiently. We carefully created absolutely unbiased datasets, in which the premises and hypotheses simply describe the relative position of two objects in plain English. That is, given the premise "John is on the left side of Mary" (abbreviated as the predicate "left(John, Mary)" from now on), the hypotheses "left(John, Mary)", "left(Mary, John)" and "left(John, Helen)" should be labeled as entailment, contradictory and neutral, respectively. Our experimental results show that BERT is very slow in learning this simple NLI task.
We then further study whether the learning efficiency can be improved with domain knowledge (Gülçehre and Bengio, 2016). Inspired by Chen et al. (2017), we hypothesize that whether two entities are exactly the same is crucial in making the above inference. Therefore, task-related features, such as whether the first/second argument of the premise predicate exactly matches that of the hypothesis, are fed into BERT. The results show that such task-related features greatly benefit BERT (reducing the data needed by a factor of 1,500), which is important because it is difficult to acquire enough data in many real-world applications.
Our main contributions are: (1) We are the first to quantitatively study how efficiently BERT can learn to conduct NLI when no surface clues/bias are available. (2) We show that adding task-related features derived from domain knowledge greatly improves BERT's learning efficiency, reducing the required training data by a factor of 1,500.

Teaching BERT Binary Predicates
The sentence "John is on the left side of Mary" describes a positional relation between two people. For conciseness, it will be denoted by a binary predicate "left(John,Mary)" from now on, where "left" is the predicate name, "John" is the first argument and "Mary" is the second argument. We seek to determine how much data are required to teach BERT to truly understand this simple binary predicate (i.e., premise) and correctly judge that the hypothesis "left(Mary,John)" is contradictory and that the hypothesis "left(John,Helen)" is neutral. Besides, we also seek to determine whether BERT is also able to learn the antonymous predicate "right( ⋅,⋅ )", and judge that the hypothesis "right(Mary,John)" is entailed by the above premise.

Entity Names and Datasets
The arguments of the above binary predicates "left(·,·)" and "right(·,·)" can in fact be the names of any objects. However, in this paper, we simply trained BERT with personal names. To avoid splitting a personal name into sub-words, we collected 1,696 male and female first names that appear in the vocabulary of the pre-trained "BERT-Base, Uncased" model (Devlin et al., 2019). These names were randomly partitioned into three sets: N_T, N_V and N_E, where the subscripts T, V and E indicate that the name sets are used for Training, Validation and Evaluation, respectively. Sets N_T and N_V, which consist of 1,356 and 170 names respectively, are used to generate the training and validation datasets for fine-tuning BERT; set N_E, consisting of 170 names, is used to generate a dataset for evaluating how well BERT understands the predicates with personal names.
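For concreteness, the following is a minimal sketch of how such a name pool could be built and partitioned; the candidate list first_names.txt and the fixed random seed are our own illustrative assumptions, not the paper's code.

    import random
    from transformers import BertTokenizer

    # Vocabulary of the pre-trained "BERT-Base, Uncased" model.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # first_names.txt is a hypothetical file of candidate first names.
    with open("first_names.txt") as f:
        candidates = [line.strip().lower() for line in f]

    # Keep only names that survive as a single token in BERT's vocabulary,
    # so no personal name is ever split into sub-words.
    names = sorted({n for n in candidates if n in tokenizer.vocab})

    random.seed(0)
    random.shuffle(names)

    # Partition into the three disjoint name sets used in the paper:
    # 1,356 for training, 170 for validation, 170 for evaluation.
    N_T, N_V, N_E = names[:1356], names[1356:1526], names[1526:1696]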
Furthermore, we also seek to ascertain whether the BERT model trained on predicates with personal names also understands predicates with names of other object types. Therefore, in addition to personal names, we also collected 30 common fruit and vegetable names to create an additional set F_E, which is used to generate another dataset for performance evaluation.
We conduct a number of experiments on recognizing textual entailment (RTE) (Dagan et al., 2006; Marelli et al., 2014a) to study the learning curves of BERT in understanding binary predicates. In each experiment, four datasets (Name-T, Name-V, Name-E and Fruit-E) are created by filling experiment-specific templates with names randomly chosen from N_T, N_V, N_E and F_E, respectively. The training set Name-T and the validation set Name-V are used to fine-tune the BERT model. The other two sets (Name-E and Fruit-E) are test sets, used to assess the performance of the BERT model. Each dataset is generated by iteratively and sequentially adding one entailment example, one contradictory example, and one neutral example until it reaches the desired size. Thus a dataset of size 100 consists of 34 entailment examples, 33 contradictory examples, and 33 neutral examples. The experiment-specific templates are described in the following sections.
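As a minimal sketch of this balanced generation scheme (the three per-label generator functions are hypothetical placeholders):

    from itertools import cycle

    def generate_dataset(size, make_entailment, make_contradictory, make_neutral):
        # Cycle through the three labels so the dataset stays balanced:
        # a size of 100 yields 34 entailment, 33 contradictory, 33 neutral.
        makers = cycle([(make_entailment, "entailment"),
                        (make_contradictory, "contradictory"),
                        (make_neutral, "neutral")])
        dataset = []
        while len(dataset) < size:
            make, label = next(makers)
            dataset.append((make(), label))  # (input token sequence, gold label)
        return dataset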

One Simple Binary Predicate
We first conducted the EXP-SP (SP: Simple Predicate) experiment to test whether BERT can be taught to understand the simple binary predicate "left(·,·)". In this experiment, a template has the form "premise [s] hypothesis", where "[s]" is a separator token between the premise and the hypothesis. The templates are partitioned into entailment, contradictory, and neutral template sets. The entailment template set has only one template, "left(x, y) [s] left(x, y)", where "left(x, y)" represents the token sequence "x is on the left side of y" and the variables x and y indicate the names to be filled in. Likewise, the contradictory template set has only one template, "left(x, y) [s] left(y, x)". However, the neutral template set consists of four templates: "left(x, y) [s] left(x, z)", "left(x, y) [s] left(z, y)", "left(x, y) [s] left(y, z)" and "left(x, y) [s] left(z, x)", where the variable z indicates a name to be filled in. Names and templates were randomly selected during dataset creation. For example, when generating a neutral example for Name-T, we randomly chose one template from the neutral template set and randomly chose three distinct names (for the variables x, y and z) from name set N_T. Obviously, the generated datasets do not carry any hidden bias, as all examples share the same token sequence except for the argument tokens, which are randomly chosen. Therefore, no annotation artifacts (Gururangan et al., 2018) in the data can provide hints for predicting the final inference answers.
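The template-filling step could look like the following sketch (pred and make_neutral are our own illustrative helpers, not the paper's code):

    import random

    def pred(name, x, y):
        # "left(x, y)" stands for the token sequence "x is on the left side of y".
        return f"{x} is on the {name} side of {y}"

    # The four neutral hypothesis argument patterns; z is a fresh third name.
    NEUTRAL_PATTERNS = [("x", "z"), ("z", "y"), ("y", "z"), ("z", "x")]

    def make_neutral(name_set):
        # Randomly choose one neutral template and three distinct names.
        x, y, z = random.sample(name_set, 3)
        a, b = random.choice(NEUTRAL_PATTERNS)
        fill = {"x": x, "y": y, "z": z}
        return f"{pred('left', x, y)} [s] {pred('left', fill[a], fill[b])}"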
To quantitatively study the learning efficiency, we generated Name-T sets of various sizes to study how much data would be required to teach BERT to understand the simple binary predicate "left(·,·)". We set Name-V, Name-E, and Fruit-E to 1,000 examples each. The solid lines in Figure 1 show that the accuracies of BERT on both test sets (i.e., Name-E and Fruit-E) increase as the training set size increases. However, even when trained with 3,000 examples, BERT still cannot achieve 100% accuracy on the test sets. In fact, all sentences in this experiment have the form "x is on the left side of y", where x and y are object names. Therefore, only 6 different words can appear in the context of object names. In other words, given 3,000 examples, BERT still cannot fully understand the meaning of the context "is on the left side of".
Furthermore, training BERT to reach over 99% accuracy on Name-E requires 100 training examples; yet even given 30 times that much training data, BERT is still unable to achieve 99% accuracy on Fruit-E. That is, BERT is not able to generalize well what it has learned from examples with personal names to examples with fruit and vegetable names.

One Antonymous Predicate
Antonyms are frequently used in natural language and play an important role in natural language inference. Given the premise "John is on the left side of Mary", we can easily infer that the hypothesis "Mary is on the right side of John" is entailed by the premise, and that the hypothesis "John is on the right side of Mary" contradicts the premise. This inference is easy for humans; but is it also easy for BERT? To find out, we conducted another test, named EXP-AP (AP: Antonymous Predicate), a more complicated RTE experiment in which we added the antonymous predicate "right(·,·)".
In this experiment, the entailment template set consists of the following four templates: "left(x, y) [s] left(x, y)", "left(x, y) [s] right(y, x)", "right(x, y) [s] right(x, y)" and "right(x, y) [s] left(y, x)". The contradictory template set likewise consists of four templates (e.g., "left(x, y) [s] right(x, y)"), and the neutral template set consists of 16 templates. In brief, adding one antonymous predicate "right(·,·)" enlarges the number of all possible templates for dataset generation from 6 to 24.
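The 24 templates and their labels can be enumerated mechanically, as in the following sketch (the labeling rule simply restates the description above; the helper names are ours):

    from itertools import product

    PREDICATES = ["left", "right"]
    # Hypothesis argument patterns; z always denotes a fresh third name.
    ARG_PATTERNS = [("x", "y"), ("y", "x"), ("x", "z"),
                    ("z", "y"), ("y", "z"), ("z", "x")]

    def label(p_pred, h_pred, h_args):
        if "z" in h_args:
            return "neutral"          # a new name makes the hypothesis unknown
        same_pred = (p_pred == h_pred)
        same_order = (h_args == ("x", "y"))
        # Same predicate with same argument order, or the antonymous predicate
        # with swapped arguments, preserves the truth of the premise.
        return "entailment" if same_pred == same_order else "contradictory"

    templates = [(p, ("x", "y"), h, a, label(p, h, a))
                 for p, h in product(PREDICATES, PREDICATES)
                 for a in ARG_PATTERNS]
    assert len(templates) == 24   # 4 entailment + 4 contradictory + 16 neutral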
The solid lines in Figure 2 show the accuracies on the test sets Name-E and Fruit-E in EXP-AP. For ease of comparison, we also plot the EXP-SP counterpart accuracies with dotted lines. Although EXP-AP uses only four times as many templates as EXP-SP, we must provide more than 30 times the training data (from 100 to 3,000 examples) for BERT to reach 99% accuracy on Name-E. That is, by adding a single antonymous predicate, the RTE task of EXP-AP becomes 30 times harder than that of EXP-SP. It seems that teaching BERT to "almost understand" two simple binary predicates requires more than 3,000 examples. This result can also be interpreted from another perspective: in EXP-AP, the context of object names in each sentence is either "is on the left side of" or "is on the right side of", and BERT requires 3,000 examples to learn the meanings of the 7 different words that appear in these two contexts. This is a rather inefficient learning curve.

Incorporating Human Knowledge
The previous section showed that many training examples are needed to teach BERT to understand the two binary predicates "left(·,·)" and "right(·,·)" in a simple RTE task. This naturally leads to the conjecture that training BERT to understand more complicated predicates would very likely require an infeasible amount of training data. One possible solution is to improve BERT's learning curve by feeding it useful/obvious features identified with human/domain knowledge.

Simple Features (SF)
Accordingly, we conducted experiment EXP-SF, in which we appended features to the input tokens of EXP-AP to improve the learning curve. In this experiment, a template has the form "premise [s] hypothesis [s] features". Obviously, humans directly compare the predicate name and the predicate arguments in the premise against their counterparts in the hypothesis. Perhaps this knowledge about which two fields should be compared would be helpful when training BERT on the EXP-AP task. Let N_P and N_H denote the predicate names in the premise and the hypothesis respectively; also, let A_i,P and A_i,H denote the i-th predicate arguments in the premise and the hypothesis respectively. In EXP-SF, every example in the datasets of EXP-AP is augmented with the following three indicator features: f_1 = I(N_P = N_H), f_2 = I(A_1,P = A_1,H) and f_3 = I(A_2,P = A_2,H). Here the indicator function I(x) returns the word "true" if x is true and "false" if it is not. EXP-SF templates can thus be directly transformed from the corresponding EXP-AP templates. To illustrate this transformation, we show an entailment template, a contradictory template and a neutral template in EXP-SF as follows: "left(x, y) [s] left(x, y) [s] true true true", "left(x, y) [s] left(y, x) [s] true false false" and "left(x, y) [s] left(x, z) [s] true true false". The dashed lines in Figure 3 are the accuracies on the test sets Name-E and Fruit-E in EXP-SF. For ease of comparison, we also plot the counterpart accuracies of the baseline (i.e., EXP-AP) with dotted lines. The fact that the dashed lines lie far to the left of the dotted lines indicates that far fewer training examples are required to train BERT after adding these three simple features. (Note that it does not matter which words are chosen to represent the feature values: we randomly selected "fortification" and "mississippi" from BERT's vocabulary to replace the words "true" and "false", and the experimental results were similar to those in Figure 3. In other words, the initial embeddings of the feature values are not crucial.) Specifically, given merely 100 EXP-SF examples, BERT achieves over 99% accuracy on Name-E; in contrast, BERT requires 3,000 examples to surpass 99% accuracy on Name-E in EXP-AP. This represents a greatly improved learning curve.
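A sketch of how these features could be computed and appended to the input (the Predicate tuple is our own illustration, not the paper's code):

    from typing import NamedTuple, Tuple

    class Predicate(NamedTuple):
        name: str             # "left" or "right"
        args: Tuple[str, str]

    def word(b):
        return "true" if b else "false"   # the indicator function I(x)

    def simple_features(p, h):
        f1 = word(p.name == h.name)        # predicate names match
        f2 = word(p.args[0] == h.args[0])  # first arguments match
        f3 = word(p.args[1] == h.args[1])  # second arguments match
        return f"{f1} {f2} {f3}"

    # "left(john, mary) [s] left(mary, john)" is augmented with "true false false".
    assert simple_features(Predicate("left", ("john", "mary")),
                           Predicate("left", ("mary", "john"))) == "true false false"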

Discriminant Features (DF)
In the previous experiment EXP-SF, the features f_2 and f_3 do not precisely indicate the situation in which the arguments in the premise match the arguments in the hypothesis after swapping. For example, for both "left(x, y) [s] left(y, x)" and "left(x, y) [s] left(z, x)", features f_2 and f_3 are both false in EXP-SF. However, the former is a contradictory case and the latter is a neutral case. We thus conducted the last experiment, named EXP-DF, in which we replaced f_2 and f_3 with a new discriminant feature f_2'. The words "true", "false" and "fuzzy" were used to indicate the three possible values of this new feature as follows: f_2' is "true" if the hypothesis arguments match the premise arguments in the same order (A_1,P = A_1,H and A_2,P = A_2,H), "false" if they match after swapping (A_1,P = A_2,H and A_2,P = A_1,H), and "fuzzy" otherwise. The solid lines in Figure 3 show that the learning curve of BERT is further improved after adopting this discriminant feature. Given only 20 EXP-DF examples, BERT achieves 99.9% accuracy on Fruit-E. In contrast, EXP-AP needs 1,500 times as much data (i.e., 30,000 examples) to reach 99.1% accuracy on Fruit-E.
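A sketch of this three-valued feature, with argument pairs as plain tuples:

    def discriminant_feature(p_args, h_args):
        # Replaces f_2 and f_3: separates in-order matches, swapped matches,
        # and cases involving a new name.
        if p_args == h_args:
            return "true"     # arguments match in the same order
        if p_args == (h_args[1], h_args[0]):
            return "false"    # arguments match after swapping
        return "fuzzy"        # at least one argument is a new name

    # "left(x, y)" vs "left(y, x)": swapped match (a contradictory case).
    assert discriminant_feature(("john", "mary"), ("mary", "john")) == "false"
    # "left(x, y)" vs "left(z, x)": a new name appears (a neutral case).
    assert discriminant_feature(("john", "mary"), ("helen", "john")) == "fuzzy"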

Related Work
Dagan et al. (2006) first initiated the task of recognizing textual entailment about fifteen years ago, and the task continued until 2011 (Bentivogli et al., 2011). Afterwards, conducting inference with BERT was studied in (Zellers et al., 2019; Tenney et al., 2019; Aken et al., 2019; Coenen et al., 2019; Michel et al., 2019).
Recently, Ribeiro et al. (2020) proposed a new evaluation methodology, named CheckList, to check the general linguistic capabilities of a given NLI model. They generated a large number of diverse test cases to identify the critical failures hidden behind state-of-the-art models. In contrast, our work mainly targets the learning efficiency of BERT on various unbiased datasets. Besides, we also study how adding useful domain-specific features affects the learning curve of BERT.

Conclusion
This paper is the first quantitative study of whether BERT can really learn to conduct NLI without implicitly utilizing hidden dataset bias, and of how quickly it learns if it can. We conduct experiments to evaluate the capability of BERT to make inferences without hidden bias, and show that BERT learns NLI inefficiently even in a simple case. We further add task-related features that greatly enhance BERT's learning efficiency. These results suggest that domain knowledge may be essential for conducting NLI with neural networks (at least for BERT). To reduce performance assessment error, each accuracy reported in this paper is the mean of the accuracies obtained from multiple simulations, each of which uses a unique random seed to fine-tune the BERT model.