Language Models for Lexical Inference in Context

Lexical inference in context (LIiC) is the task of recognizing textual entailment between two very similar sentences, i.e., sentences that only differ in one expression. It can therefore be seen as a variant of the natural language inference task that is focused on lexical semantics. We formulate and evaluate the first approaches based on pretrained language models (LMs) for this task: (i) a few-shot NLI classifier, (ii) a relation induction approach based on handcrafted patterns expressing the semantics of lexical inference, and (iii) a variant of (ii) with patterns that were automatically extracted from a corpus. All our approaches outperform the previous state of the art, showing the potential of pretrained LMs for LIiC. In an extensive analysis, we investigate factors of success and failure of our three approaches.


Introduction
Lexical inference (LI) denotes the task of deciding whether or not an entailment relation holds between two lexical items. It is therefore related to the detection of other lexical relations like hyponymy between nouns (Hearst, 1992), e.g., dog ⇒ animal, or troponymy between verbs (Fellbaum and Miller, 1990), e.g., to traipse ⇒ to walk. Lexical inference in context (LIiC) adds the problem of disambiguating the pair of lexical items in a given context before reasoning about the inference question. This type of LI is particularly interesting for entailments between verbs and verbal expressions because their meaning, and therefore their implications, can drastically change with different arguments. Consider, e.g., run ⇒ lead in a PERSON / COMPANY context ("Bezos runs Amazon") vs. run ⇒ execute in a COMPUTER / SOFTWARE context ("My mac runs macOS"). LIiC is thus also closely related to the task of natural language inference (NLI), also called recognizing textual entailment (Dagan et al., 2013), and can be seen as a focused variant of it. Besides the important use case of evaluating NLI systems, this kind of predicate entailment has also been shown useful for question answering (Schoenmackers et al., 2010), event coreference (Shwartz et al., 2017; Meged et al., 2020), and link prediction in knowledge graphs (Hosseini et al., 2019).
Despite its NLI nature, previous systems for LIiC have primarily been models of lexical similarity (Levy and Dagan, 2016) or models based on verb argument inclusion (Hosseini et al., 2019). The reason is probably that supervised NLI models need large amounts of training data, which is unavailable for LIiC, and that systems trained on available large-scale NLI benchmarks (e.g., Williams et al., 2018) have been reported to insufficiently cover lexical phenomena (Glockner et al., 2018;Schmitt and Schütze, 2019).
Recently, transfer learning has become ubiquitous in NLP; Transformer (Vaswani et al., 2017) language models (LMs) pretrained on large amounts of textual data (Devlin et al., 2019a; Liu et al., 2019) form the basis of many current state-of-the-art models. Besides zero- and few-shot capabilities (Radford et al., 2019; Brown et al., 2020), pretrained LMs have also been found to acquire factual and relational knowledge during pretraining (Petroni et al., 2019; Bouraoui et al., 2020). The entailment relation certainly stands out among previously explored semantic relations, such as the relation between a country and its capital, because it is very rarely stated explicitly and often involves reasoning about both the meaning of verbs and additional knowledge (Schmitt and Schütze, 2019). It is unclear whether implicit clues during pretraining are enough to learn about LIiC and what the best way is to harness any such implicit knowledge.
Regarding these questions, we make the following contributions: (1) This work is the first to explore the use of pretrained LMs for the LIiC task.
(2) We formulate three approaches and evaluate them using the publicly available pretrained RoBERTa LM (Liu et al., 2019;Wolf et al., 2019): (i) a few-shot NLI classifier, (ii) a relation induction approach based on handcrafted patterns expressing the semantics of lexical inference, and (iii) a variant of (ii) with patterns that were automatically extracted from a corpus.
(3) We introduce the concept of antipatterns, patterns that express non-entailment, and evaluate their usefulness for LIiC. (4) In our experiments on two established LIiC benchmarks, Levy/Holt's dataset (Levy and Dagan, 2016; Holt, 2018) and SherLIiC (Schmitt and Schütze, 2019), all our approaches consistently outperform previous work, thus setting a new state of the art for LIiC. (5) In contrast to previous work on relation induction (Bouraoui et al., 2020), automatically retrieved patterns do not outperform handcrafted ones for LIiC. A qualitative analysis of patterns and errors identifies possible reasons for this finding.

Related Work
Lexical inference. There has been a lot of work on lexical inference for nouns, notably hypernymy detection, resulting in a variety of benchmarks (Kotlerman et al., 2010; Kiela et al., 2015) and methods (Shwartz et al., 2015; Vulić and Mrkšić, 2018). Although there has been work on predicate entailment before (Lin and Pantel, 2001; Lewis and Steedman, 2013), Levy and Dagan (2016) were the first to create a general benchmark for evaluating entailment between verbs. In their evaluation, neither resource-based approaches (Pavlick et al., 2015; Berant et al., 2011) nor vector space models (Levy and Goldberg, 2014) achieved satisfying results. Holt (2018) later published a re-annotated version, which was readily adopted by later work. Hosseini et al. (2018) put global constraints on top of directed local similarity scores (Weeds and Weir, 2003; Lin, 1998; Szpektor and Dagan, 2008) based on distributional features of the predicates. Hosseini et al. (2019) replaced these scores by transition probabilities in a bipartite graph where edge weights are computed by a link prediction model. When Schmitt and Schütze (2019) created the SherLIiC benchmark, they also mainly focused on resource- and vector-based models for evaluation. Their best model combines general-purpose word2vec representations (Mikolov et al., 2013) with a vector representation of the arguments that co-occur with a predicate.
All these works (i) base the probability of entailment validity on the similarity of the verbs and (ii) compute this similarity via (expected) co-occurrence of verbs and their arguments. Our work differs in that our models solely reason about the sentence surface in an end-to-end NLI task without access to previously observed argument pairs. This is possible because our models have learned about these surface forms during pretraining.

Patterns and entailment. Pattern-based approaches have long been known for hypernymy detection (Hearst, 1992). Recent work combined them with vector space models (Mirkin et al., 2006; Roller and Erk, 2016; Roller et al., 2018). While there are effective patterns, such as X is a Y, that are indicative of entailment between nouns, there is little work on comparable patterns for verbs. Schwartz et al. (2015) mine symmetric patterns for lexical similarity and achieve good results for verbs. Entailment, however, is not symmetric. Chklovski and Pantel (2004) handcrafted 35 patterns to distinguish 6 semantic relations for pairs of distributionally similar verbs. Some of their classes like strength (taint :: poison) or antonymy (ban :: allow) can be indicators of entailment and non-entailment but are, in general, much more narrowly defined than the patterns we use in our approach. Another difference from our work is that their verb pairs are scored based on co-occurrence counts on the web, while we employ an LM, which does not depend on a valid entailment pair actually appearing together in a document.

Patterns and language models. Amrami and Goldberg (2018) were the first to manipulate LM predictions with a simple pattern to enhance the quality of substitute words in a given context for word sense induction. Petroni et al. (2019) found that large pretrained LMs can be queried for factual knowledge when presented with appropriate pattern-generated cloze-style sentences.
This zero-shot factual knowledge has later been shown to be quite fragile (Kassner and Schütze, 2020). We therefore focus on approaches that fine-tune an LM on at least a few samples. Forbes et al. (2019) train a binary classifier on top of a fine-tuned BERT (Devlin et al., 2019a) to predict the truth value of handwritten statements about objects and their properties. While their experiments investigate BERT's physical common sense reasoning, we focus on the different phenomenon of entailment between two actions expressed by verbs in context. Schick and Schütze (2020) used handcrafted patterns and LMs for few-shot text classification. Based on manually defined label-token correspondences, the predicted classification label is determined by the token an LM estimates as most probable at a masked position in the cloze-style pattern. We differentiate entailment and non-entailment via compatibility scores for patterns and antipatterns and not via different predicted tokens.
Addressing relation induction, Bouraoui et al. (2020) propose an automatic way of finding, given a relation, LM patterns that are likely to express it. They train a binary classifier per relation on the sentences generated by these patterns. While some of the relations they consider are related to verbal entailment (e.g., cook activity-goal eat), most of them concern common sense (e.g., library location-activity reading) or encyclopedic knowledge (e.g., Paris capital-of France). We adapt their method for the automatic retrieval of promising patterns for LIiC, but find that handcrafted patterns that capture the generality of the entailment relation still have an advantage over automatic patterns for LIiC. Another important novelty we introduce is the use of antipatterns. While Bouraoui et al. (2020) have to use negative samples for training their classifiers, they only consider patterns that exemplify the desired relation. In contrast, we also use antipatterns that exemplify what the entailment relation is not. We believe that antipatterns are particularly useful for entailment detection because they can help identify other kinds of semantic relations that often pose a challenge to vector space models (Levy and Dagan, 2016; Schmitt and Schütze, 2019).

NLI classifier
Building an NLI classifier on top of a pretrained LM usually means taking an aggregate sequence representation of the concatenated premise and hypothesis as input features of a neural network classifier (Devlin et al., 2019b). For RoBERTa (Liu et al., 2019), this representation is the final hidden state of a special <s> token that is prepended to the input sentences, which in turn are separated by a separator token </s>. Let Λ be the function that maps such an input x = <s> x_1 </s> x_2 to the aggregate representation Λ(x) ∈ R^d. Following Devlin et al. (2019b) and Liu et al. (2019), we then feed these features to a 2-layer feed-forward neural network with tanh activation:

P_NLI(y | x_1, x_2) = σ(W_2^T drop(tanh(W_1^T drop(Λ(x)) + b_1)) + b_2)    (1)

where drop applies dropout with a probability of 0.1, σ is the softmax function, and W_1 ∈ R^{d×d}, W_2 ∈ R^{d×2}, b_1 ∈ R^d, b_2 ∈ R^2 are learnable parameters. Note that W_1 and b_1 are still part of the LM's pretrained parameters; so we only train W_2 and b_2 from scratch. The actual classification decision uses a threshold ϑ:

D^ϑ_NLI(x_1, x_2) = 1 iff P_NLI(y = 1 | x_1, x_2) > ϑ, and 0 otherwise.

The traditional choice for the threshold is ϑ = 0.5 because that means D^ϑ_NLI(x_1, x_2) = 1 iff P_NLI(y = 1 | x_1, x_2) > P_NLI(y = 0 | x_1, x_2). We nevertheless keep ϑ as a hyperparameter to be tuned on held-out development data.
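For illustration, the classification head can be sketched in NumPy. The random feature vector below stands in for Λ(x), and all parameter values are placeholder initializations, not actual pretrained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768  # hidden size of RoBERTa-base

# Stand-ins for the parameters: W1, b1 would come from pretraining,
# W2, b2 are trained from scratch.
W1, b1 = rng.normal(scale=0.02, size=(d, d)), np.zeros(d)
W2, b2 = rng.normal(scale=0.02, size=(d, 2)), np.zeros(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def p_nli(features, train=False, p_drop=0.1):
    """P_NLI(y | x1, x2) computed from the aggregate representation Lambda(x).

    Dropout is only active during training; at inference it is the identity.
    """
    def drop(h):
        if not train:
            return h
        mask = rng.random(h.shape) >= p_drop
        return h * mask / (1.0 - p_drop)

    h = np.tanh(drop(features) @ W1 + b1)
    return softmax(drop(h) @ W2 + b2)

def decide(features, theta=0.5):
    """D_theta: predict entailment iff P(y=1) exceeds the tuned threshold."""
    return int(p_nli(features)[1] > theta)

features = rng.normal(size=d)  # stands in for Lambda(<s> x1 </s> x2)
probs = p_nli(features)
assert np.isclose(probs.sum(), 1.0)
```

In the actual model, `features` would be RoBERTa's final hidden state at the `<s>` position and the whole network would be fine-tuned end-to-end.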
We train the NLI approach by minimizing the negative log-likelihood L_NLI of the training data T:

L_NLI = − Σ_{(x_1, x_2, y) ∈ T} log P_NLI(y | x_1, x_2)

Pattern-based classifier
This approach puts the input sentences x_1, x_2 together in a pattern-based textual context and trains a classifier to distinguish between felicitous and infelicitous utterances. In contrast to previous approaches (Forbes et al., 2019; Bouraoui et al., 2020), we also consider antipatterns that exemplify what kind of semantic relatedness we are not interested in, and combine probabilities for patterns and antipatterns in the final classification.

Finding suitable patterns. A simple handcrafted pattern to check for the validity of an inference x_1 ⇒ x_2 is "x_2 because x_1.". An analogous antipattern is "It is not sure that x_2 just because x_1.". Based on similar considerations, we manually design 5 patterns and 5 antipatterns (see Table 4). We will refer to the approach using these handcrafted patterns as MANPAT. Bouraoui et al. (2020) argue that text produced by simple, handcrafted patterns is artificial and therefore suboptimal for LMs pretrained on naturally occurring text. To adapt their setup to verbal expressions used in LIiC, we identify suitable patterns (antipatterns) by searching a large text corpus (we use the Wikipedia dump from Jan 15th 2011) for sentences that contain both elements of valid (invalid) entailment pairs. In a second step, we score each of these patterns (antipatterns) according to the number of valid (invalid) entailment pairs x_1, x_2 that can be found by querying an LM for the k most probable completions when x_1 or x_2 is inserted in the pattern and its counterpart is masked. For example, consider the entailment pair rule ⇒ control and the pattern "Catchers prem the field; they hypo the plays and tell everyone where to be." extracted from a description of softball. Predicting rule from "Catchers mask the field; they control the plays and tell everyone where to be." and predicting control from "Catchers rule the field; they mask the plays and tell everyone where to be." would result in one point each.
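The scoring step can be sketched as follows. `topk_completions` is a hypothetical stand-in for querying a masked LM for its k most probable tokens at the mask position; here it returns a fixed toy list:

```python
def topk_completions(masked_sentence, k=100):
    # Placeholder: a real implementation would return the k most probable
    # tokens a masked LM (e.g., RoBERTa) predicts for the "[MASK]" slot.
    return ["control", "rule", "dominate"]

def score_pattern(pattern, pairs, k=100):
    """Score a candidate pattern containing 'prem' and 'hypo' slots.

    One point per recovered verb: mask the premise slot and check whether
    the premise verb is among the LM's top-k predictions, then the same
    for the hypothesis slot.
    """
    points = 0
    for prem, hypo in pairs:
        masked = pattern.replace("prem", "[MASK]").replace("hypo", hypo)
        if prem in topk_completions(masked, k):
            points += 1
        masked = pattern.replace("hypo", "[MASK]").replace("prem", prem)
        if hypo in topk_completions(masked, k):
            points += 1
    return points

pattern = "Catchers prem the field; they hypo the plays."
print(score_pattern(pattern, [("rule", "control")]))  # -> 2
```

Patterns are then ranked by their total points over all entailment pairs of the search split.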
Approaches called AUTPAT n use the n patterns with the most points obtained in that manner. See §4 for more details on our experimental setup.

Pattern-based predictions. The probability P_FEL(z | x) of sentence x being felicitous (z = 1) or infelicitous (z = 0) is estimated like P_NLI in Eq. (1), except that x is not the concatenation of two sentences but a single pattern-generated utterance.
Given a set of patterns Φ and a set of antipatterns Ψ, the score s to judge an input x_1, x_2 is the difference between the maximum probability m_pos that any pattern forms a felicitous statement and the maximum probability m_neg that any antipattern forms a felicitous statement:

m_pos = max_{φ ∈ Φ} P_FEL(z = 1 | φ(x_1, x_2))
m_neg = max_{ψ ∈ Ψ} P_FEL(z = 1 | ψ(x_1, x_2))
s(x_1, x_2) = m_pos − m_neg

As in NLI, the final decision uses a threshold ϑ:

D^ϑ_PAT(x_1, x_2) = 1 iff s(x_1, x_2) > ϑ, and 0 otherwise.

This corresponds to requiring that m_pos be higher than m_neg by a margin ϑ, i.e., D^ϑ_PAT(x_1, x_2) = 1 iff m_pos > m_neg + ϑ.
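A minimal sketch of this decision rule, with a toy felicity scorer standing in for the fine-tuned classifier P_FEL (a real implementation would score each pattern-generated sentence with the LM):

```python
def decide_pat(p_fel, patterns, antipatterns, x1, x2, theta=0.0):
    """Predict entailment iff the best pattern's felicity probability
    exceeds the best antipattern's felicity probability by margin theta."""
    m_pos = max(p_fel(phi.format(x1=x1, x2=x2)) for phi in patterns)
    m_neg = max(p_fel(psi.format(x1=x1, x2=x2)) for psi in antipatterns)
    return int(m_pos - m_neg > theta)

# Toy stand-in for P_FEL(z=1 | x): not a trained model.
def toy_p_fel(sentence):
    return 0.9 if "because" in sentence and "not sure" not in sentence else 0.2

# The example pattern and antipattern from the text above.
patterns = ["{x2} because {x1}."]
antipatterns = ["It is not sure that {x2} just because {x1}."]
print(decide_pat(toy_p_fel, patterns, antipatterns,
                 "Bezos runs Amazon", "Bezos leads Amazon"))  # -> 1
```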
As Bouraoui et al. (2020) did not use antipatterns, they defined m_neg as the maximum probability for any pattern to form an infelicitous statement. To estimate the usefulness of antipatterns, we evaluate both possibilities, marking systems that use both patterns and antipatterns with ΦΨ and those that only use patterns with Φ.
The use of a threshold is another novel component: Bouraoui et al. (2020) effectively set ϑ = 0. We discuss the influence of ϑ in §5.
We train all pattern-based approaches by minimizing the negative log-likelihood L_PAT that patterns Φ produce felicitous statements for valid entailments (y = 1) and infelicitous statements for invalid entailments (y = 0) from the training data T, and vice versa for antipatterns Ψ:

L_PAT = − Σ_{(x_1, x_2, y) ∈ T} ( Σ_{φ ∈ Φ} log P_FEL(z = y | φ(x_1, x_2)) + Σ_{ψ ∈ Ψ} log P_FEL(z = 1 − y | ψ(x_1, x_2)) )

Experiments
We evaluate on two benchmarks: Levy/Holt's dataset and SherLIiC.

Levy/Holt. Each instance consists of two sentences, each composed of two noun phrases (the arguments) and a verbal expression, in which the two sentences differ. As the verbal expressions can contain auxiliaries or negation, they often consist of multiple tokens. Originally, one argument is replaced with a WordNet (Miller, 1995) type in one of the sentences to make the entailment more general during annotation, but we use a version of the dataset provided by Hosseini et al. (2018) where both sentences have concretely instantiated arguments. For example, consider Table 6 (c). Athena was masked as the WordNet synset deity during benchmark annotation but we use the original sentences as shown in Table 6 for all classifiers without further modification.
For the automatic pattern search in AUTPAT, we look for sentences that mention verbatim the two verbal expressions of any instance from dev 1 . For the ranking, we take the last token of a verbal expression as representative for the whole. This has the advantage that we can query the LM with a single mask token and compare a single token to the k = 100 most probable predictions. We take the last token because it usually is the main verb.
SherLIiC. For classification, we use SherLIiC's automatically generated sentences that were used for annotation during benchmark creation. The arguments in SherLIiC are entity types from Freebase (Bollacker et al., 2008). As such, they can be replaced by any Freebase entity with matching type. For example, consider Table 6 (a); the arguments Germany and Côte d'Ivoire were originally masked as location[A] and location[B] during annotation, but annotators also saw three randomly chosen instantiations for both A (Germany / Syria / USA) and B (Côte d'Ivoire / UK / Italy) for context. From the three examples provided in SherLIiC for each argument, we choose the first one to form sentences with concretely instantiated arguments.
For the automatic pattern search in AUTPAT, we make use of the greater flexibility offered by the  lemmatized representations in SherLIiC. As we are interested in statements that can be made in any way in a text, we search for sentences that mention the two predicates of a SherLIiC dev 1 instance in any inflected form. For the ranking, we again consider the predicate representative for the whole verbal expression. We thus use the predicate lemma and otherwise proceed as described above.

Training details
We train all our classifiers for 5 epochs with Adam (Kingma and Ba, 2015) and a mini-batch size of 10 (resp. 2) for RoBERTa-base (resp. -large). We randomly sample 500 configurations for the remaining hyperparameters (see Appendix A). For a fair comparison, we evaluate all our approaches with the same configurations.

Hyperparameter robustness
Following previous work (Hosseini et al., 2018, 2019), we use the area under the precision-recall curve (AUC) restricted to precision values ≥ 0.5 as criterion for model selection. Fig. 1 (left) shows the distribution of dev 2 performance for 500 randomly sampled runs with RoBERTa-base. Most hyperparameter configurations perform poorly, suggesting that hyperparameter search is crucial. For Levy/Holt, NLI is strong, whereas for SherLIiC handcrafted MANPAT Φ patterns have a clearer advantage. For SherLIiC, the combination of automatically generated patterns and antipatterns AUTPAT ΦΨ 5 exhibits the highest median performance and the second-highest upper quartile, making it together with MANPAT Φ the most robust to different hyperparameters, although its top performance is lower compared to the others. For all methods, only very few hyperparameter sets achieve top performances. For both datasets, however, a well-performing configuration is found after fewer than 100 sampled runs (Fig. 1, right). Considering that AUTPAT requires an LM to rank thousands of patterns, these results suggest that, for LIiC, available GPU hours should be spent on automatic hyperparameter rather than pattern search. With its manually written patterns, MANPAT does not need additional GPU hours for pattern search and still, on average, performs better.

Best hyperparameter configurations
For the best configuration found for each method, we not only report AUC, which provides a general picture of a scoring method's precision-recall tradeoff, but also the concrete precision, recall, and F1 for the actual classification after applying a threshold ϑ. For this, we tune ϑ on dev 2 for optimal F1. Tables 2 and 3 show the results.
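The threshold tuning can be sketched as a simple sweep over candidate values; the scores and labels below are toy values, not our actual dev 2 data:

```python
def tune_threshold(scores, labels):
    """Pick theta on dev data maximizing F1 of the rule score > theta."""
    def f1(theta):
        tp = sum(s > theta and y for s, y in zip(scores, labels))
        fp = sum(s > theta and not y for s, y in zip(scores, labels))
        fn = sum(s <= theta and y for s, y in zip(scores, labels))
        if tp == 0:
            return 0.0
        p, r = tp / (tp + fp), tp / (tp + fn)
        return 2 * p * r / (p + r)

    candidates = sorted(set(scores))
    # Midpoints between adjacent scores, plus one value below all scores.
    thetas = [-1.0] + [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]
    return max(thetas, key=f1)

scores = [-0.4, -0.1, 0.2, 0.6]         # e.g., pattern margins m_pos - m_neg
labels = [False, True, True, True]
theta = tune_threshold(scores, labels)
assert -0.4 <= theta < -0.1  # a negative threshold boosts recall here
```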
On both datasets, our methods outperform all previous work (sometimes by a large margin), thus establishing a new state of the art. For SherLIiC+RoBERTa-base, the strong but simple NLI system is consistently outperformed by all pattern-based approaches, showing that well-chosen patterns and antipatterns can be helpful for LIiC. For SherLIiC+RoBERTa-large and also generally on Levy/Holt's dataset, NLI is more competitive, but the combination of handcrafted patterns and antipatterns MANPAT ΦΨ still performs better in these cases. The use of antipatterns does not consistently lead to better performance for all combinations of dataset, LM variant (base vs. large), and pattern set (MANPAT vs. AUTPAT). They do, however, consistently bring gains for some combinations, e.g., MANPAT on Levy/Holt and AUTPAT on SherLIiC. Moreover, antipatterns are essential for achieving top performance, i.e., the new state of the art, on both datasets.

Table 4 (excerpt), automatically retrieved patterns (with SherLIiC dev 1; prem / hypo fillers in parentheses):
rank 1 (acquired / acquired): In North America, where the "atypical" forms of community-hypo pneumonia are becoming more common, macrolides (such as azithromycin), and doxycycline have displaced amoxicillin as first-line outpatient treatment for community-prem pneumonia.
rank 5 (created / created): This area now consists of . . . the Yukon Territory (prem 1898) . . . and Nunavut (hypo in 1999).
rank 12: For example, . . . 訪問 "prem" is composed of 訪 "to visit" and 問 "to hypo".
Most of the threshold values ϑ (tuned on dev 2) are far from their traditional values, 0.5 for NLI and 0.0 for patterns. NLI classifiers' probability estimates are often overconfident, resulting in values close to 0 and 1. To "correct" cases where a very small value is assigned to a valid entailment, optimal thresholds are often close to 0 instead of 0.5. Analogously, most pattern-based approaches opt for a negative ϑ, which means that instead of requiring a margin between m_pos and m_neg (boosting precision), they make more positive predictions and boost recall. Low recall is a key problem in LIiC (cf. Levy and Dagan (2016)). Tuning a threshold increases the models' flexibility in this respect.

Analysis

Number of patterns

§5 shows that automatic patterns do not beat handcrafted patterns for LIiC. However, automatic patterns have one major advantage: in contrast to manual patterns, their number can be easily increased. We therefore investigate the impact of the hyperparameter n for AUTPAT n. Table 5 shows that too many patterns are as bad as too few. AUTPAT 15 is the sweet spot: on SherLIiC, it outperforms all other RoBERTa-base methods on AUC and closely approaches the otherwise best method MANPAT Φ on F1.

Pattern analysis
Handcrafted patterns mostly outperform automatic ones ( §5). A larger number n of patterns only has a small effect ( §6.1). We therefore take a closer look at automatic and manual patterns. Table 4 shows all handcrafted and a sample of highly ranked automatic patterns.
It is striking how specific the automatically retrieved contexts are; especially for the highest ranks (exemplified by ranks 1 and 5) only a narrow set of verbs seems plausible from a human perspective. It is only at rank 12 that we find a more general context and it arguably even displays some semantic reasoning. There certainly are verbs that are not compatible with the meaning of visit, but this context allows for a wide range of plausible verbs and even mentions composition of meaning.
The handcrafted patterns, in contrast, all capture some general aspect of entailment, which might be the reason they generalize better. Moreover, they also have placeholder slots for the verb arguments, which could be an advantage as these represent a verb's original context. Only accepting corpus sentences in which the verbs occur with the same arguments as in the dataset is too restrictive.
We therefore conduct the following experiment: We manually go through the 100 highest-ranked automatically created patterns and identify 5 contexts that could accommodate arguments without changing the overall sentence structure. We also try to pick patterns that are different enough from each other to avoid redundancy. As a baseline, the method AUTCUR Φ 5 uses these manually curated patterns as is. We then rewrite the patterns such that they include placeholders for verb arguments, e.g., "The original aim of de Garis' work was to prem the field of "brain building" (a term of his invention) and to "hypo a trillion dollar industry within 20 years"." becomes "The original aim of their work was that "PARGL prem PARGR" and that "HARGL hypo HARGR within 20 years"." with PARGL / PARGR (HARGL / HARGR) the placeholder for the left / right argument of the premise (hypothesis). See Table 14 in the appendix for the complete list. AUTARG Φ 5 is based on these rewritten patterns. We try the same 500 hyperparameter configurations as for the other RoBERTa-base approaches and include results for the best configuration (chosen on dev 2 ) in Table 3. We find that manually curating automatically ranked patterns helps performance. AUTCUR Φ 5 outperforms AUTPAT Φ 5 on AUC and F1, reducing the gap to handcrafted patterns (i.e., MANPAT Φ ). This is probably due to the variety we enforced when handpicking the patterns.
Surprisingly, adding arguments decreases performance. Possibly, our modifications make the patterns less fluent, or the arguments that are filled into the placeholders during training and evaluation do not fit well into the contexts, which still are rather specific.

Getting example (b) right requires the common sense knowledge that occupying a territory implies remaining there. This might be learned from patterns more easily, as these patterns might resemble contexts, seen during pretraining, that describe how long a military force remained during an occupation. Putting the inference candidate (b) into a pattern-generated context avoids being fooled by the high similarity of the two sentences.

Error analysis
Only the handcrafted patterns can make sense of the important details in this construction.
In contrast, (c) and (d) are difficult for our pattern approaches whereas NLI gets them right. We hypothesize that the problem stems from linking the two sentences into one. An entailment pattern ideally represents a derivation of the hypothesis from the premise. One may wrongly conclude that (c) Athena was the goddess of Athens only because she was worshiped there, by neglecting the possibility that there are others that are equally worshiped. In the same way, (d) is unlikely to be found in an argumentative text. While it is clear that there can be no beating without a fight, one would hardly argue that Pyrrhus fought the Romans because they beat him. This particular reasoning calls for additional explanations like Pyrrhus must have fought the Romans because I know that they beat him. This analysis serves as inspiration for further improvements of entailment patterns.
The last two examples (e) and (f) are difficult for all approaches. It seems to be a particular challenge to identify open situations like a sports match or a negotiation where multiple outcomes are possible and distinguish them from cases where one particular outcome is inevitable.

Conclusion
We proposed and evaluated three approaches to the task of lexical inference in context (LIiC) based on pretrained language models (LMs). In particular, we found that putting an inference candidate into a pattern-generated context mostly increases performance compared to a standard sequence classification approach. Concrete performance, however, also depends on the particular dataset, used LM (variant), and pattern set. We introduced the concept of antipatterns, which express the negative class of a binary classification, and found that they often lead to performance gains for LIiC. We set a new state of the art for LIiC and conducted an extensive analysis of our approaches. Notably, we found that automatically created patterns can perform nearly as well as handcrafted ones if we either use the right number n of patterns or manually identify the right subset of them. Promising directions for future work are the investigation of alternative automatic pattern generation methods or a better modeling of the remaining challenges we described in our error analysis (§6.3).

A Hyperparameters

We train all our classifiers for 5 epochs with Adam (Kingma and Ba, 2015) and a mini-batch size of 10 or 2 instances for RoBERTa-base and -large, respectively. For AUTPAT n approaches with n > 5, we distribute the available patterns and antipatterns into chunks of size 5 for training to save memory. During evaluation, the predictions are based on all the patterns and antipatterns. We randomly sample 500 configurations for the remaining hyperparameters, i.e., initial learning rate lr, weight decay λ (L2 regularization), and the number of batches c over which the gradient is accumulated before each optimization step, which virtually increases the batch size by a factor of c. The hyperparameters are sampled from the following intervals: lr ∈ [10^-8, 5·10^-2], λ ∈ [10^-5, 10^-1], c ∈ {1, 2, ..., 10}. lr and λ are sampled uniformly in log-space. For a fair comparison, we use the same 500 random configurations for all of our approaches.
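The random search over this space can be sketched as follows; the seed and dictionary keys are illustrative:

```python
import math
import random

random.seed(0)

def sample_config():
    """One random configuration from the search space described above:
    lr and lambda log-uniform, gradient-accumulation factor c uniform."""
    lr = 10 ** random.uniform(math.log10(1e-8), math.log10(5e-2))
    lam = 10 ** random.uniform(-5, -1)
    c = random.randint(1, 10)
    return {"lr": lr, "weight_decay": lam, "grad_accum": c}

configs = [sample_config() for _ in range(500)]
assert all(1e-8 <= cfg["lr"] <= 5e-2 for cfg in configs)
assert all(1e-5 <= cfg["weight_decay"] <= 1e-1 for cfg in configs)
assert all(1 <= cfg["grad_accum"] <= 10 for cfg in configs)
```

Reusing the same seed reproduces the same 500 configurations for every approach, which is what makes the comparison fair.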
As usual for Transformer models, we apply a learning rate schedule: lr decreases linearly such that it reaches 0 at the end of the last epoch. We do not employ warm-up.
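A minimal sketch of this schedule; the step counts are illustrative, not the actual dataset sizes:

```python
def lr_at(step, total_steps, base_lr):
    """Linearly decay the learning rate to 0 over training, no warm-up."""
    return base_lr * (1.0 - step / total_steps)

# Over 5 epochs of, say, 1000 optimization steps each:
total = 5 * 1000
assert lr_at(0, total, 2e-5) == 2e-5           # start at the sampled lr
assert lr_at(total, total, 2e-5) == 0.0        # reach 0 at the end
assert lr_at(total // 2, total, 2e-5) == 1e-5  # halfway: half the lr
```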
The best configurations can be seen in Tables 8 and 10 for Levy/Holt's dataset and in Tables 9 and 11 for SherLIiC.

B Results on development sets
See Tables 12 and 13.

C Varying n in training and evaluation
Another approach to making use of different values of n in AUTPAT n systems is to vary n from training to evaluation. Figure 2 visualizes the performance impact of this procedure. The base point for the visualization (in white) is the AUC performance of AUTPAT Φ 5. We see that training with n = 50 almost always leads to a performance drop (marked in blue) w.r.t. this number. Evaluating a model with patterns that were not seen during training generally seems to be catastrophic, indicating that there is no generalization from seen patterns to unseen patterns, even if they were chosen by the same method and can thus be expected to be, at least to some extent, similar. Overall, this evaluation suggests that modifying n after training always leads to a drop in performance.

D Cross-dataset evaluation

Table 7 shows results on the question how well a model trained on one dataset performs on the other. For this, we assume that the target dataset is not available at all, i.e., we use it neither for finding patterns in AUTPAT nor for tuning the threshold ϑ. We thus use the standard ϑ values, i.e., 0.5 for NLI and 0.0 for the pattern-based methods.
(1) The original aim of de Garis' work was to prem the field of "brain building" (a term of his invention) and to "hypo a trillion dollar industry within 20 years".
→ The original aim of their work was that "PARGL prem PARGR" and that "HARGL hypo HARGR within 20 years". (2) Critic Roger Ebert stated that Gellar and co-star Ryan Phillippe "prem a convincing emotional charge" and that Gellar is "effective as a bright girl who knows exactly how to hypo her act as a tramp".
→ Critic Roger Ebert stated that PARGL and co-star Ryan Phillippe "prem PARGR" and that HARGL is "effective as a bright girl who knows exactly how she hypo HARGR as a tramp". (3) Well-known professional competitions in the past have included the World Professional Championships (hypo Landover, Maryland), the Challenge Of Champions, the Canadian Professional Championships and the World Professional Championships (prem in Jaca, Spain).
→ Well-known professional competitions in the past have included HARGL (hypo HARGR), the Challenge Of Champions, the Canadian Professional Championships and PARGL (prem PARGR).
(4) They also had sharpshooter Steve Kerr, whom they hypo via free agency before the 1993-94 season, Myers, and centers Luc Longley (prem via trade in 1994 from the Minnesota Timberwolves) and Bill Wennington.
→ HARGL also had sharpshooter HARGR, whom they hypo via free agency before the 1993-94 season, Myers, and centers PARGR (whom PARGL prem via trade in 1994 from the Minnesota Timberwolves) and Bill Wennington. (5) Because the 6x86 was more efficient on an instructionsper-cycle basis than Intel's Pentium, and because Cyrix sometimes hypo a faster bus speed than either Intel or AMD, Cyrix and competitor AMD co-prem the controversial PR system in an effort to compare its products more favorably with Intel's. . . .
→ Because the 6x86 was more efficient on an instructions-per-cycle basis than Intel's Pentium, and because HARGL sometimes hypo HARGR, PARGL and competitor AMD co-prem PARGR in an effort to compare its products more favorably with Intel's. . . .

Table 14: Five manually selected patterns from the 100 highest-ranked automatically extracted patterns from SherLIiC dev 1 (used in AUTCUR Φ 5) and their rewritten counterparts (used in AUTARG Φ 5). PARGL (HARGL) stands for the left argument of the premise (hypothesis); PARGR (HARGR) for the right one.