Continuous Entailment Patterns for Lexical Inference in Context

Combining a pretrained language model (PLM) with textual patterns has been shown to help in both zero- and few-shot settings. For zero-shot performance, it makes sense to design patterns that closely resemble the text seen during self-supervised pretraining because the model has never seen anything else. Supervised training allows for more flexibility. If we allow for tokens outside the PLM’s vocabulary, patterns can be adapted more flexibly to a PLM’s idiosyncrasies. Contrasting patterns where a “token” can be any continuous vector with those where a discrete choice between vocabulary elements has to be made, we call our method CONtinuous pAtterNs (CONAN). We evaluate CONAN on two established benchmarks for lexical inference in context (LIiC), a.k.a. predicate entailment, a challenging natural language understanding task with relatively small training data. In a direct comparison with discrete patterns, CONAN consistently leads to improved performance, setting a new state of the art. Our experiments give valuable insights into the kind of pattern that enhances a PLM’s performance on LIiC and raise important questions regarding our understanding of PLMs using text patterns.


Introduction
Lexical inference in context (LIiC), also called predicate entailment, is a variant of natural language inference (NLI) or recognizing textual entailment (Dagan et al., 2013) with a focus on the lexical semantics of verbs and verbal expressions (Levy and Dagan, 2016; Schmitt and Schütze, 2019). Its goal is to detect entailment between two very similar sentences, i.e., sentences that share subject and object and only differ in the predicate, e.g., PERSON(A) runs ORG(B) → PERSON(A) leads ORG(B). NLI models that were not specifically trained with lexical knowledge have been reported to struggle with this task (Glockner et al., 2018; Schmitt and Schütze, 2019), making LIiC an important evaluation criterion for general language understanding. Other use cases for this kind of lexical entailment knowledge include question answering (Schoenmackers et al., 2010; McKenna et al., 2021), event coreference (Shwartz et al., 2017; Meged et al., 2020), and link prediction in knowledge graphs (Hosseini et al., 2019).
Although LIiC is an inherently directional task, symmetric cosine similarity in a vector space, such as word2vec (Mikolov et al., 2013), has long been the state of the art for this task. Only recently has transfer learning with pretrained Transformer (Vaswani et al., 2017) language models (Devlin et al., 2019) led to large improvements for LIiC. Schmitt and Schütze (2021) combine natural language (NL) patterns with a pretrained language model (PLM) and not only set a new state of the art but also beat baselines without access to such patterns.
Empirical findings suggest that a good pattern can be worth hundreds of labeled training instances (Le Scao and Rush, 2021), making pattern approaches interesting for low-resource tasks such as LIiC. But beyond the intuition that patterns serve as some sort of task instruction (Schick and Schütze, 2021a), little is known about the reasons for their success. Recent findings that (i) PLMs can fail to follow even simple instructions (Efrat and Levy, 2020), that (ii) PLMs can behave drastically differently with paraphrases of the same pattern (Elazar et al., 2021), and that (iii) performance increases if we train a second model to rewrite an input pattern with the goal of making it more comprehensible for a target PLM (Haviv et al., 2021), strongly suggest that patterns do not make sense to PLMs in the same way as they do to humans.
Our work sheds light on the interaction of patterns and PLMs and proposes a new method of improving pattern-based models fully automatically.
On two popular LIiC benchmarks, our model (i) establishes a new state of the art without the need for handcrafting patterns or automatically identifying them in a corpus and (ii) does so more efficiently thanks to shorter patterns. Our best model only uses 2 tokens per pattern.

The CONAN Model
Continuous patterns. LIiC is a binary classification task, i.e., given a premise $p = p_1 p_2 \dots p_{|p|}$ and a hypothesis $h = h_1 h_2 \dots h_{|h|}$, a model has to decide whether $p$ entails $h$ ($y = 1$) or not ($y = 0$). A template-based approach to this task surrounds $p$ and $h$ with tokens $t_1 t_2 \dots t_m$ to bias the classifier for entailment detection, e.g., "p, which means that h". While in most approaches that leverage a PLM $M$ these tokens come from the PLM's vocabulary, i.e., $\forall i.\ t_i \in \Sigma_M$, we propose a model based on CONtinuous pAtterNs (CONAN), i.e., we surround the embedding representation of $p$ and $h$ with continuous vectors that may be close to but do not have to match the embedding of any vocabulary entry.
For this, we first extend the PLM's vocabulary by a finite set $C = \{c_1, c_2, \dots, c_{|C|}\}$ of fresh tokens, i.e., $\Sigma = \Sigma_M \cup C$ with $C \cap \Sigma_M = \emptyset$. Then, we distinguish two methods of surrounding $p$ and $h$ with these special tokens: $\alpha$ sets them both αround and between $p$ and $h$ (Eq. (1)) while $\beta$ only sets them βetween the two input sentences (Eq. (2)).

$$\alpha_k(p, h) = c_1 \dots c_a \; p \; c_{a+1} \dots c_{a+b} \; h \; c_{a+b+1} \dots c_k \qquad (1)$$

$$\beta_k(p, h) = p \; c_1 \dots c_k \; h \qquad (2)$$

Note that $\alpha$ divides its $k$ tokens into three parts as equally as possible where any remaining tokens go between $p$ and $h$ if $k$ is not a multiple of 3, i.e., $a = \lfloor k/3 \rfloor$ tokens are placed before $p$ and after $h$, and $b = k - 2a$ tokens are placed between them. In particular, this means that the same templates are produced by $\alpha$ and $\beta$ for $k \leq 2$. We chose this behavior as a generalization of the standard approach to fine-tuning a PLM for sequence classification (such as NLI) where there is only one special token and it separates the two input sequences. The template produced by $\alpha_1$ and $\beta_1$ is very similar to this. A major difference is that the embeddings for $C$ tokens are randomly initialized whereas the standard separator token has a pretrained embedding.
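The placement rule for the two pattern types can be illustrated with a short sketch. It operates on plain token lists rather than embeddings, and the helper names (`split_counts`, `alpha`, `beta`, `fresh_tokens`) as well as the `[C…]` token naming are assumptions made for this illustration, not part of the released implementation.

```python
from typing import List, Tuple


def split_counts(k: int) -> Tuple[int, int, int]:
    """Split k pattern tokens into (before, between, after) counts for alpha."""
    a = k // 3      # tokens placed before the premise and after the hypothesis
    b = k - 2 * a   # any remainder goes between premise and hypothesis
    return a, b, a


def alpha(premise: List[str], hypothesis: List[str], c: List[str]) -> List[str]:
    """Place the pattern tokens c around and between premise and hypothesis."""
    a, b, _ = split_counts(len(c))
    return c[:a] + premise + c[a:a + b] + hypothesis + c[a + b:]


def beta(premise: List[str], hypothesis: List[str], c: List[str]) -> List[str]:
    """Place all pattern tokens c between premise and hypothesis."""
    return premise + c + hypothesis


def fresh_tokens(pattern_id: int, k: int) -> List[str]:
    """Hypothetical names for the k fresh tokens belonging to one pattern."""
    return [f"[C{pattern_id}_{i}]" for i in range(1, k + 1)]


if __name__ == "__main__":
    p = ["PERSON(A)", "runs", "ORG(B)"]
    h = ["PERSON(A)", "leads", "ORG(B)"]
    for k in (1, 2, 4):
        c = fresh_tokens(0, k)
        print(k, alpha(p, h, c), beta(p, h, c))
```

For k = 1 and k = 2 both functions return the same sequence, which matches the observation that α and β coincide for k ≤ 2.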
Pattern-based classifier. Given $\gamma \in \{\alpha, \beta\}$, we estimate the probability distribution $P(\hat{y} \mid p, h)$ with a linear classifier on top of the pooled sequence representation produced by the PLM $M$:

$$P(\hat{y} \mid p, h) = \sigma\big(M(\gamma(p, h))\, W + b\big)$$

where $W \in \mathbb{R}^{d \times 2}$, $b \in \mathbb{R}^{2}$ are learnable parameters, $\sigma$ is the softmax function, and applying $M$ means encoding the whole input sequence in a single $d$-dimensional vector according to the specifics of the PLM. For BERT (Devlin et al., 2019) and its successor RoBERTa (Liu et al., 2019), this means taking the first contextualized token embedding (i.e., [CLS] for BERT and <s> for RoBERTa) and passing it through a dense pooler layer with tanh activation. For training, we apply dropout with a probability of 0.1 to the output of $M(\cdot)$.

Inference with multiple patterns. Previous work (Bouraoui et al., 2020; Schmitt and Schütze, 2021) combined multiple patterns with the intuition that different NL patterns can capture different aspects of the task. This intuition also makes sense for CONAN. We conjecture that an efficient use of the model parameters requires different continuous patterns to learn different representations, which can detect different types of entailment. Following the aforementioned work, we form our final score $s$ by combining the probability estimates from different patterns $\Gamma$ by comparing the maximum probability for the two classes 0, 1 over all patterns:

$$m_1(p, h) = \max_{\gamma \in \Gamma} P(\hat{y} = 1 \mid \gamma(p, h)), \qquad m_0(p, h) = \max_{\gamma \in \Gamma} P(\hat{y} = 0 \mid \gamma(p, h)), \qquad s(p, h) = m_1(p, h) - m_0(p, h)$$

where $P(\hat{y} \mid \gamma(p, h))$ denotes the estimate obtained with pattern $\gamma$. In conclusion, a CONAN model $\gamma^n_k$ is characterized by three factors: (i) the type of pattern $\gamma \in \{\alpha, \beta\}$, (ii) the number of patterns $n \in \mathbb{N}$, and (iii) the number of tokens $k \in \mathbb{N}$ per pattern.

Training. While multiple patterns are combined for decision finding during inference, we treat all patterns separately during training, as did previous work (Schmitt and Schütze, 2021). So, given a set of patterns $\Gamma$, we minimize the negative log-likelihood of the training data $T$, i.e.,

$$\sum_{(p, h, y) \in T} \; \sum_{\gamma \in \Gamma} -\log P(\hat{y} = y \mid \gamma(p, h))$$

In practice, we apply mini-batching to both $T$ and $\Gamma$ and thus compute this loss only for a fraction of the available training data and patterns at a time. In this case, we normalize the loss by averaging over the training samples and patterns in the mini-batch.
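The inference and training rules of this section can be sketched in a few lines of plain Python. The sketch assumes that the per-pattern class probabilities have already been computed by the classifier, and it reads "comparing the maximum probability for the two classes" as the difference of the two maxima, which is an assumption of this illustration. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_score(probs_per_pattern: Sequence[Sequence[float]]) -> float:
    """Combine per-pattern estimates; probs_per_pattern[i] = [P(y=0), P(y=1)].

    Assumption: the two class-wise maxima are compared by subtraction.
    """
    m0 = max(p[0] for p in probs_per_pattern)
    m1 = max(p[1] for p in probs_per_pattern)
    return m1 - m0


def batch_nll(probs: Sequence[Sequence[Sequence[float]]], labels: Sequence[int]) -> float:
    """Mean negative log-likelihood over all (sample, pattern) pairs in a batch.

    probs[j][i] is the class distribution for sample j under pattern i.
    Each pattern contributes its own loss term, i.e., patterns are treated separately.
    """
    losses = [
        -math.log(dist[y])
        for per_pattern, y in zip(probs, labels)
        for dist in per_pattern
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Two patterns disagree; the more confident class-wise maximum decides the sign.
    print(combined_score([[0.9, 0.1], [0.3, 0.7]]))
    print(batch_nll([[softmax([0.0, 2.0]), softmax([1.0, 1.0])]], [1]))
```

Under this reading, a positive score means that the most confident vote for entailment outweighs the most confident vote against it, which differs from a majority vote over patterns.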

Experiments
We conduct experiments on two established LIiC benchmarks, SherLIiC (Schmitt and Schütze, 2019) and Levy/Holt (Levy and Dagan, 2016; Holt, 2018), using the data splits as defined in (Schmitt and Schütze, 2021) for comparison. Both benchmarks contain a majority of negative examples (SherLIiC: 67%, Levy/Holt: 81%), making the detection of the entailment (i.e., the minority) class a particular challenge. See Table 1 for dataset and split sizes. Note that Levy/Holt is nearly 5 times bigger than SherLIiC and still has fewer than 5k train samples.
Following (Schmitt and Schütze, 2021), we use RoBERTa as underlying PLM and also use the same hyperparameters whenever possible for comparison. Also following Schmitt and Schütze (2021), we instantiate the typed placeholders A, B in SherLIiC with Freebase (Bollacker et al., 2008) entities, making sure that A and B are not assigned the same entity. See Appendix A for full training details.
We evaluate model performance with two metrics: (i) The area under the precision-recall curve for precision values ≥ 0.5 (AUC) as threshold-less metric using only the score s defined in the previous section and (ii) the F1 score of actual classification decisions after tuning a decision threshold ϑ on the respective dev portion of the data. Our implementation is based on (Wolf et al., 2019).
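The second metric relies on a decision threshold ϑ tuned on dev data. A minimal sketch of such a tuning step is given below. It assumes that an instance is classified as entailment when its score exceeds ϑ and that candidate thresholds are taken midway between consecutive distinct dev scores; both are assumptions of this illustration, and the function names are hypothetical.

```python
from typing import List, Sequence


def f1_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 of the positive (entailment) class when predicting 1 iff score > threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def candidate_thresholds(scores: Sequence[float]) -> List[float]:
    """One threshold below all scores plus the midpoints between distinct scores."""
    distinct = sorted(set(scores))
    midpoints = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    return [distinct[0] - 1.0] + midpoints


def tune_threshold(dev_scores: Sequence[float], dev_labels: Sequence[int]) -> float:
    """Return the candidate threshold with the highest dev F1."""
    return max(candidate_thresholds(dev_scores),
               key=lambda t: f1_at(dev_scores, dev_labels, t))


if __name__ == "__main__":
    dev_scores = [-0.8, -0.2, 0.1, 0.6]
    dev_labels = [0, 0, 1, 1]
    theta = tune_threshold(dev_scores, dev_labels)
    print(theta, f1_at(dev_scores, dev_labels, theta))
```

The tuned ϑ is then applied unchanged to the test scores; in a setting without target dev data, a fixed threshold has to be used instead.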

Results
Choosing n and k. The number n of patterns and the number k of C tokens per pattern are essential hyperparameters of a CONAN model. Fig. 1 shows the impact on performance (measured in AUC) on SherLIiC dev for different n-k combinations. We observe that using too many, too long patterns harms performance w.r.t. the base case n = k = 1. Best results are obtained with either a small number of patterns or tokens or both. Comparing the α and β settings, we notice that, rounded to one decimal, they produce identical results for both n = 1 and n = 5 patterns, suggesting that the particular position of the C tokens does not matter much in these settings for SherLIiC. Even with n = 10 patterns, the two methods only begin to differ with k ≥ 5 tokens per pattern. Evaluation on the Levy/Holt data (see Fig. 2 in Appendix B) shows more variation between α and β but, otherwise, confirms the trend that small n and k yield better performance.

Figure 1: AUC on SherLIiC dev for different CONAN models; top = α, bottom = β; white/red/blue = similar to/better than/worse than n = k = 1. Note that $\alpha_k = \beta_k$ for k ≤ 2.
Our results offer an explanation for the empirical finding in (Schmitt and Schütze, 2021) that patterns retrieved from a corpus lead to worse performance than handcrafted ones because the latter are generally shorter. CONAN models not only yield better performance but also provide an automatic way to test pattern properties, such as length, w.r.t. their effect on performance for a given task.

Test performance. On both SherLIiC (Table 2) and Levy/Holt (Table 3), and across model sizes (base and large), CONAN$^5_2$, i.e., n = 5 patterns with k = 2 tokens each (using either α or β because they are identical for k = 2), outperforms all other models including the previous state of the art by Schmitt and Schütze (2021), who fine-tune RoBERTa both without patterns (NLI) and using handcrafted (MANPAT) or automatically retrieved corpus patterns (AUTPAT). We report their two best systems for each benchmark.
We take the performance increase with continuous patterns as a clear indicator that the flexibility offered by separating pattern tokens from the rest of the vocabulary allows RoBERTa to better adapt to the task-specific data even with only a few labeled training instances in the challenging LIiC task.

Table 4: Transfer experiments (ϑ = 0). Best models from (Schmitt and Schütze, 2021) according to F1 score.

Analysis and Discussion
Nearest neighbors. To further investigate how RoBERTa makes use of the flexibility of C tokens, we compute their nearest neighbors in the space of original vocabulary tokens based on cosine similarity for our models in Tables 2 and 3. We always find the C tokens to be very dissimilar from any token in the original vocabulary, the highest cosine similarity being 0.15. And even among themselves, C tokens are very dissimilar, nearly orthogonal, with 0.08 being the highest cosine similarity here. RoBERTa seems to indeed take full advantage of the increased flexibility to put the C tokens anywhere in the embedding space. This further backs our hypothesis that the increased flexibility is beneficial for performance.

Influence of additional parameters. One might argue that the vocabulary extension and the resulting new randomly initialized token embeddings lead to an unfair advantage for CONAN models because the parameter count increases. While more parameters do generally lead to increased model capacity, the number of new parameters is so small compared to the total number of parameters in RoBERTa that we consider it improbable that the new parameters alone are responsible for the improved performance. Of all models in Tables 2 and 3, CONAN$^{50}_1$ introduces the most additional model parameters, i.e., 1 · 50 · 768 = 38400 for RoBERTa-base. Given that even the smaller RoBERTa-base model still has a total of 125M parameters, the relative parameter increase is at most about 0.03%, which, we argue, is negligible.

Transfer between datasets. The experiments summarized in Table 4 investigate the hypothesis that CONAN's better adaptation to the fine-tuning data might worsen its generalization abilities to other LIiC benchmarks. For this, we train our best model CONAN$^5_2$ on SherLIiC to test it on Levy/Holt and vice versa. In this scenario, we assume that the target dataset is not available at all. So there is no way to adapt to a slightly different domain other than learning general LIiC reasoning. We thus set ϑ = 0 in these experiments.
We find that with the very few train samples in SherLIiC the risk of overfitting to SherLIiC is indeed higher. When trained on Levy/Holt with around 4.4k train samples, however, CONAN clearly improves generalization to the SherLIiC domain.
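The two quantitative checks discussed in this section, the cosine nearest-neighbor analysis of C tokens and the relative parameter increase, can be reproduced in spirit with the following sketch. The toy embeddings and all function names are hypothetical and serve only as illustration; the dimensions and parameter counts are the ones stated in the text for RoBERTa-base.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(query: Sequence[float],
                     vocab_embeddings: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return the vocabulary token most similar to the query and its similarity."""
    best = max(vocab_embeddings, key=lambda tok: cosine(query, vocab_embeddings[tok]))
    return best, cosine(query, vocab_embeddings[best])


def extra_parameters(n_patterns: int, k_tokens: int, hidden_size: int) -> int:
    """Number of new embedding parameters introduced by n patterns of k tokens each."""
    return n_patterns * k_tokens * hidden_size


def relative_increase_percent(extra: int, total: int) -> float:
    """Relative parameter increase in percent."""
    return 100.0 * extra / total


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real ones would come from the PLM's embedding matrix.
    vocab = {"runs": [1.0, 0.0, 0.0], "leads": [0.0, 1.0, 0.0]}
    c_token = [0.15, 0.05, 1.0]
    print(nearest_neighbor(c_token, vocab))

    extra = extra_parameters(n_patterns=50, k_tokens=1, hidden_size=768)
    print(extra, relative_increase_percent(extra, 125_000_000))
```

A low maximum similarity, as for the toy C token here, indicates that the learned vector lies far from every vocabulary embedding.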

Related Work
PLMs and text patterns. GPT-2 (Radford et al., 2019) made the idea popular that a PLM can perform tasks without access to any training data when prompted with the right NL task instructions. With GPT-3, Brown et al. (2020) adapted this idea to few-shot settings where the task prompt is extended by a few training samples. While this kind of few-shot adaptation with a frozen PLM only works with very big models, Schick and Schütze (2021b) achieve similar performance with smaller models by fine-tuning the PLM on the available training data, which they put into NL templates. Recently, Schmitt and Schütze (2021) investigated the use of PLMs for LIiC. Compared to a standard sequence classification fine-tuning approach, they were able to improve the PLM RoBERTa's performance by putting an entailment candidate into textual contexts that only make sense for either a valid or invalid example. Patterns like "y because x." (valid) or "It does not mean that y just because x." (invalid) make intuitive sense to humans and outperform standard RoBERTa on LIiC.
A major problem with all these approaches, however, is finding well-functioning patterns, for which numerous solutions have been proposed (Shin et al., 2020; Haviv et al., 2021; Bouraoui et al., 2020; Jiang et al., 2020; Gao et al., 2021; Reynolds and McDonell, 2021). We argue that it is not optimal to constrain pattern search to the space of NL sequences if the primary goal is better task performance, and therefore abandon this constraint.

PLMs and continuous patterns. Li and Liang (2021) and Hambardzumyan et al. (2021) contemporaneously introduced the idea of mixing the input token embeddings of a PLM with other continuous vectors that do not correspond to vocabulary elements. In the spirit of GPT-2 (see above), they keep the PLM's parameters frozen and only fine-tune the embeddings of the "virtual tokens" to the target task. While this line of research offers certain appeals of its own, e.g., reusability of the frozen PLM weights, this is not the focus of our work. In pursuit of the best possible performance, we instead compare the use of continuous vs. NL patterns in the process of fine-tuning all PLM parameters and find that even carefully chosen NL patterns can be outperformed by our automatically learned ones.
Contemporaneously to our work, Liu et al. (2021) fine-tune entire PLMs with continuous patterns for SuperGLUE (Wang et al., 2019). Besides reformulating the SuperGLUE tasks as cloze tasks, while we keep formalizing our task as classification, Liu et al. (2021) also add more complexity by computing the continuous token representations with an LSTM (Hochreiter and Schmidhuber, 1997) and adding certain "anchor tokens", such as a question mark, at manually chosen places. CONAN does not use any manual pattern design and embeds continuous tokens with a simple lookup table.
Another contemporaneous work by Lester et al. (2021) tests the influence of model size on the performance of a frozen PLM with trained continuous prompts. Their prompt ensembling is akin to our combining multiple patterns during inference (cf. §2). The key difference is that, instead of making predictions with different patterns and taking the majority vote, we rather compare the scores for different patterns to make our prediction.

Conclusion
We presented CONAN, a method that improves the fine-tuning performance of a PLM with continuous patterns. CONAN does not depend on any manual pattern design and is efficient as the shortest possible patterns with good performance can be found automatically. It provides an automatic way of systematically testing structural properties of patterns, such as length, w.r.t. performance changes. In our experiments on two established LIiC benchmarks, CONAN outperforms previous work using NL patterns and sets a new state of the art.