Overcoming Poor Word Embeddings with Word Definitions

Modern natural language understanding models depend on pretrained subword embeddings, but applications may need to reason about words that were never or rarely seen during pretraining. We show that examples that depend critically on a rarer word are more challenging for natural language inference models. Then we explore how a model could learn to use definitions, provided in natural text, to overcome this handicap. Our model’s understanding of a definition is usually weaker than a well-modeled word embedding, but it recovers most of the performance gap from using a completely untrained word.


Introduction
The reliance of natural language understanding models on the information in pre-trained word embeddings limits these models from being applied reliably to rare words or technical vocabulary. To overcome this vulnerability, a model must be able to compensate for a poorly modeled word embedding with background knowledge to complete the required task.
For example, a natural language inference (NLI) model based on pre-2020 word embeddings may not be able to deduce from "Jack has COVID" that "Jack is sick." By providing the definition, "COVID is a respiratory disease," we want to assist this classification.
We describe a general procedure for enhancing a classification model, such as a natural language inference (NLI) or sentiment classification model, so that it can perform the same task on sequences containing poorly modeled words by using definitions of those words. From the training set $T$ of the original model, we construct an augmented training set $T'$ for a model that may accept the same token sequence optionally concatenated with a word definition. In the case of NLI, where there are two token sequences, the definition is concatenated to the premise sequence. Because $T'$ has the same form as $T$, a model accepting the augmented information may be trained in the same way as the original model.
Because there are not enough truly untrained words like "COVID" in natural examples, we probe performance by scrambling real words so that their word embedding becomes useless, and supplying definitions. Our method recovers most of the performance lost by scrambling. Moreover, the proposed technique removes biases in more ad hoc solutions like adding definitions to examples without special training.

Related Work
We focus on NLI because it depends more deeply on word meaning than sentiment or topic classification tasks. Chen et al. (2018) pioneered the addition of background information to an NLI model's classification on a per-example basis, augmenting a sequence of token embeddings with features encoding WordNet relations between pairs of words, to achieve a 0.6% improvement on the SNLI (Bowman et al., 2015) task. Besides this explicit reasoning approach, implicit reasoning over background knowledge can be achieved if one updates the base model itself with background information. Lauscher et al. (2020) follow this approach to add information from ConceptNet (Speer et al., 2018) and the Open Mind Common Sense corpus (Singh et al., 2002) through a fine-tuned adapter added to a pretrained language model, achieving better performance on subsets of NLI examples that are known to require world knowledge. Talmor et al. (2020) explore the interplay between explicitly added knowledge and implicitly stored knowledge on artificially constructed NLI problems that require counting or relations from a taxonomy.
In the above works, explicit background information comes from a taxonomy or knowledge base. Only a few studies have worked with definition text directly, and not in the context of NLI. Tissier et al. (2017) used definitions to create embeddings for better performance on word similarity tasks, compared to word2vec (Mikolov et al., 2013) and fastText (Bojanowski et al., 2017), while maintaining performance on text classification. Their work pushes together embeddings of words that co-occur in each other's definitions. Recently, Kaneko and Bollegala (2021) used definitions to remove biases from pretrained word embeddings while maintaining coreference resolution accuracy. In contrast, our work reasons with natural language definitions without forming a new embedding, allowing attention between a definition and the rest of an example. Alternatively, Schick and Schütze (2020) improved classification using rare words by collecting and attending to all of the contexts in which they occur in BookCorpus (Zhu et al., 2015) combined with Westbury Wikipedia Corpus. Like the methods above that use definitions, this method constructs a substitute or supplementary embedding for a rare word.

Critical words
The enhanced training set $T'$ will be built by providing definitions for words in existing examples, while obfuscating the existing embeddings of those words. If a random word of the original text is obfuscated, the classification may still be determined or strongly biased by the remaining words. To ensure that the definitions matter, we select the words to obfuscate carefully.
To explain which words of a text are important for classification, Kim et al. (2020) introduced the idea of input marginalization. Given a sequence of tokens $x$, let $x_{-i}$ represent the sequence without the $i$th token $x_i$. They marginalize the probability of predicting a class $y_c$ over possible replacement words $\tilde{x}_i$ in the vocabulary $V$ as
$$p(y_c \mid x_{-i}) = \sum_{\tilde{x}_i \in V} p(y_c \mid \tilde{x}_i, x_{-i})\, p(\tilde{x}_i \mid x_{-i})$$
and then compare $p(y_c \mid x_{-i})$ to $p(y_c \mid x)$ to quantify the importance of $x_i$. The probabilities $p(\tilde{x}_i \mid x_{-i})$ are computed by a language model. We simplify by looking only at the classification and not the probability. Like Kim et al. (2020), we truncate the computation of $p(y_c \mid \tilde{x}_i, x_{-i})$ to words such that $p(\tilde{x}_i \mid x_{-i})$ exceeds a threshold, here 0.05. Ultimately we mark a word $x_i$ as a critical word if there exists a replacement $\tilde{x}_i$ such that
$$\operatorname*{argmax}_y\, p(y \mid \tilde{x}_i, x_{-i}) \neq \operatorname*{argmax}_y\, p(y \mid x) \quad \text{and} \quad p(\tilde{x}_i \mid x_{-i}) > 0.05.$$
Additionally we require that the word not appear more than once in the example, because the meaning of repeated words usually impacts the classification less than the fact that they all match. Table 1 shows an example.
Table 1
Premise: A young man sits, looking out of a train [side → Neutral, small → Neutral] window.
Hypothesis: The man is in his room.
Label: Contradiction

A technicality remains because our classification models use subwords as tokens, whereas we consider replacements of whole words returned by pattern.en. We remove all subwords of $x_i$ when forming $x_{-i}$, but we consider only replacements $\tilde{x}_i$ that are a single subword long.
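The critical-word test described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: `classify` and `fill_probs` are hypothetical callables standing in for the NLI classifier (with the hypothesis held fixed) and the masked language model, the toy versions below are invented for the demonstration, and comparing against the classifier's own prediction on the unmodified text is an assumption.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   classify(words)      -> predicted label for the word sequence
#   fill_probs(words, i) -> {candidate: p(candidate | words without position i)}
Classifier = Callable[[Sequence[str]], str]
FillProbs = Callable[[Sequence[str], int], Dict[str, float]]


def find_flip(words: Sequence[str], i: int, classify: Classifier,
              fill_probs: FillProbs,
              threshold: float = 0.05) -> Optional[Tuple[str, str]]:
    """Return (replacement, new_label) for the most likely replacement of
    words[i] that changes the predicted label, or None if there is none."""
    target = words[i]
    # Words that occur more than once in the example are excluded.
    if list(words).count(target) != 1:
        return None
    original_label = classify(words)
    ranked = sorted(fill_probs(words, i).items(),
                    key=lambda kv: kv[1], reverse=True)
    for candidate, prob in ranked:
        if prob <= threshold:  # truncate to sufficiently likely words
            break
        if candidate == target:
            continue
        replaced = list(words)
        replaced[i] = candidate
        new_label = classify(replaced)
        if new_label != original_label:
            return candidate, new_label
    return None


# Toy stand-ins, invented purely so the sketch runs end to end.
def toy_classify(words: Sequence[str]) -> str:
    return "contradiction" if "train" in words else "neutral"


def toy_fill_probs(words: Sequence[str], i: int) -> Dict[str, float]:
    return {"train": 0.50, "side": 0.20, "small": 0.10, "car": 0.01}


premise = "A young man sits looking out of a train window".split()
flip = find_flip(premise, premise.index("train"), toy_classify, toy_fill_probs)
print(flip)
```

A word is treated as critical when `find_flip` returns a result; the returned replacement is the one with the highest language-model probability among those that change the label.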

Definitions
We use definitions from Simple English Wiktionary when available, or English Wiktionary otherwise. Tissier et al. (2017) downloaded definitions from four commercial online dictionaries, but these are no longer freely available online as of January 2021.
To define a word, first we find its part of speech in the original context and lemmatize the word using the pattern.en library (Smedt and Daelemans, 2012). Then we look for a section labeled "English" in the retrieved Wiktionary article, and for a subsection for the part of speech we identified. We extract the first numbered definition in this subsection. In practice, we find that this method usually gives us short, simple definitions that match the usage in the original text.
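As a rough sketch of the extraction step, the function below scans wikitext for a language section, then a part-of-speech subsection, and returns the first numbered definition. It assumes the usual English Wiktionary conventions (`==English==` headers, definitions on lines starting with `#`); the sample entry is invented for illustration, only simple `[[...]]` links are stripped, and retrieval, lemmatization, and template handling are out of scope.

```python
import re
from typing import Optional

# Matches wikitext headers such as "==English==" or "===Noun===".
HEADER = re.compile(r"^(=+)\s*(.*?)\s*\1\s*$")
# Matches [[target]] and [[target|shown text]] links.
LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")


def first_definition(wikitext: str, pos: str,
                     language: str = "English") -> Optional[str]:
    """Return the first numbered definition under the given part of speech
    inside the given language section, or None if not found."""
    in_language = False
    in_pos = False
    for line in wikitext.splitlines():
        header = HEADER.match(line)
        if header:
            level, title = len(header.group(1)), header.group(2)
            if level == 2:              # a new language section starts
                in_language = (title == language)
                in_pos = False
            elif in_language:           # any deeper header may be a POS
                in_pos = (title.lower() == pos.lower())
            continue
        # "# text" is a definition; "#:", "#*" and "##" lines are not.
        if in_language and in_pos and re.match(r"^#[^#:*]", line):
            return LINK.sub(r"\1", line[1:].strip())
    return None


# Invented sample entry, used only to exercise the function.
SAMPLE = """==English==
===Etymology===
From Latin.
===Noun===
{{en-noun}}
# A natural source of [[water]]; a [[spring (water)|spring]].
#: ''An example sentence.''
# A device that sprays water.
===Verb===
# To flow.
==French==
===Noun===
# Something else.
"""

print(first_definition(SAMPLE, "noun"))
```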
When defining a word, we always write its definition as "word means: definition." This common format ensures that the definitions and the word being defined can be recognized easily by the classifier.

Enhancing a model
Consider an example $(x, y_c) \in T$. If the example has a critical word $x_i \in x$ that appears only once in the example, and $\tilde{x}_i$ is the most likely replacement word that changes the classification, we let $x'$ denote the sequence where $x_i$ is replaced by $\tilde{x}_i$, and let $y'_c = \operatorname*{argmax}_y\, p(y \mid x')$. If definitions $h_i$ and $h'_i$ for $x_i$ and $\tilde{x}_i$ are found by the method described above, we add $(x, h_i, y_c)$ and $(x', h'_i, y'_c)$ to the enhanced training set $T'$.
In some training protocols, we scramble $x_i$ and $\tilde{x}_i$ in the examples and definitions added to $T'$, replacing them with random strings of between four and twelve letters. This prevents the model from relying on the original word embeddings. Table 2 shows an NLI example and the corresponding examples generated for the enhanced training set.
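The construction of the two enhanced examples, with optional scrambling, might look like the sketch below. The function names, whole-word regular-expression replacement, and the choice to draw lowercase ASCII letters are assumptions made for illustration; only the "word means: definition" format and the four-to-twelve-letter length range come from the text.

```python
import random
import re
import string
from typing import List, Optional, Tuple


def scramble(rng: random.Random) -> str:
    """Random lowercase string of four to twelve letters."""
    length = rng.randint(4, 12)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def with_definition(premise: str, word: str, definition: str,
                    rng: Optional[random.Random] = None) -> str:
    """Append 'word means: definition' to the premise. If rng is given, the
    word is replaced by a random string in both premise and definition."""
    surface = word
    if rng is not None:
        surface = scramble(rng)
        whole_word = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
        premise = whole_word.sub(surface, premise)
        definition = whole_word.sub(surface, definition)
    return f"{premise} {surface} means: {definition}"


def enhanced_examples(premise: str, hypothesis: str, label: str,
                      word: str, definition: str,
                      alt_word: str, alt_definition: str, alt_label: str,
                      rng: Optional[random.Random] = None
                      ) -> List[Tuple[str, str, str]]:
    """Build the example for the critical word and the one for its
    label-changing replacement, each with its own definition."""
    alt_premise = re.sub(r"\b%s\b" % re.escape(word), alt_word, premise)
    return [
        (with_definition(premise, word, definition, rng), hypothesis, label),
        (with_definition(alt_premise, alt_word, alt_definition, rng),
         hypothesis, alt_label),
    ]


plain = enhanced_examples(
    "A blond man is drinking from a public fountain.",
    "The man is drinking water.", "entailment",
    "fountain", "a natural source of water; a spring.",
    "glass", "glass is a transparent solid and is usually clear.", "neutral")
scrambled = enhanced_examples(
    "A blond man is drinking from a public fountain.",
    "The man is drinking water.", "entailment",
    "fountain", "a natural source of water; a spring.",
    "glass", "glass is a transparent solid and is usually clear.", "neutral",
    rng=random.Random(0))
for example in plain + scrambled:
    print(example)
```

Passing `rng=None` keeps the original words, which corresponds to the unscrambled protocols; passing a seeded generator makes the scrambled strings reproducible.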

Table 2
Original: A blond man is drinking from a public fountain. / The man is drinking water. / Entailment
Scrambled word: a blond man is drinking from a public yfcqudqqg. yfcqudqqg means: a natural source of water; a spring. / the man is drinking water. / Entailment
Scrambled alternate: a blond man is drinking from a public lxuehdeig. lxuehdeig means: lxuehdeig is a transparent solid and is usually clear. windows and eyeglasses are made from it, as well as drinking glasses. / the man is drinking water. / Neutral

We fine-tune an XLNet (base, cased) model (Yang et al., 2019), because it achieves near state-of-the-art performance on SNLI and outperforms RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019) on later rounds of adversarial annotation for ANLI (Nie et al., 2020). For the language model probabilities $p(\tilde{x}_i \mid x_{-i})$, pretrained BERT (base, uncased) is used rather than XLNet because the XLNet probabilities have been observed to be very noisy on short sequences.

One test set, $\mathrm{SNLI}^{\mathrm{full}}_{\mathrm{crit}}$, is constructed in the same way as the augmented training set, but our main test set, $\mathrm{SNLI}^{\mathrm{true}}_{\mathrm{crit}}$, is additionally constrained to use only examples of the form $(x, h_i, y_c)$ where $y_c$ is the original label, because labels for the examples $(x', h'_i, y'_c)$ might be incorrect. All of our derived datasets are available for download.

In each experiment, training is run for three epochs distributed across 4 GPUs, with a batch size of 10 on each, a learning rate of $5 \times 10^{-5}$, 120 warmup steps, a single gradient accumulation step, and a maximum sequence length of 384.

Our task cannot be solved well without reading definitions. When words are scrambled but no definitions are provided, an SNLI model without special training achieves 54.1% on $\mathrm{SNLI}^{\mathrm{true}}_{\mathrm{crit}}$. If trained on $T'$ with scrambled words but no definitions, performance drops to 36.9%, reflecting that $T'$ is constructed to prevent a model from utilizing the contextual bias.
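For reference, the fine-tuning hyperparameters listed in this section can be collected in a small configuration object. The class and field names are hypothetical, and `xlnet-base-cased` is assumed to be the matching public checkpoint name; the numeric values are the ones stated in the text, and the effective batch size is derived from them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    # Assumed checkpoint name for "XLNet (base, cased)".
    model_name: str = "xlnet-base-cased"
    epochs: int = 3
    num_gpus: int = 4
    per_gpu_batch_size: int = 10
    learning_rate: float = 5e-5
    warmup_steps: int = 120
    gradient_accumulation_steps: int = 1
    max_seq_length: int = 384

    @property
    def effective_batch_size(self) -> int:
        """Examples contributing to each optimizer update."""
        return (self.num_gpus * self.per_gpu_batch_size
                * self.gradient_accumulation_steps)


config = FineTuneConfig()
print(config.effective_batch_size)  # 4 GPUs x 10 per GPU x 1 step = 40
```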

Results
With definitions and scrambled words, performance is slightly below that of using the original words. Our method using definitions applied to the scrambled words yields 81.2%, compared to 84.6% if words are left unscrambled but no definitions are provided. Most of the accuracy lost by obfuscating the words is recovered, but evidently there is slightly more information accessible in the original word embeddings.
If alternatives to the critical words are not included, the classifier learns biases that do not depend on the definition. We explore restricting the training set to verified examples $T'_{\mathrm{true}} \subset T'$ in the same way as $\mathrm{SNLI}^{\mathrm{true}}_{\mathrm{crit}}$, still scrambling the critical or replaced words in the training and testing sets. Using this subset, a model that is not given the definitions can be trained to achieve 69.9% performance on $\mathrm{SNLI}^{\mathrm{true}}_{\mathrm{crit}}$, showing a heavy contextual bias. A model trained on this subset that uses the definitions achieves marginally higher performance (82.3%) than the one trained on all of $T'$. On the other hand, testing on $\mathrm{SNLI}^{\mathrm{full}}_{\mathrm{crit}}$ yields only 72.3% compared to 80.3% using the full $T'$, showing that the classifier is less sensitive to the definition.
Noisy labels from replacements do not hurt accuracy much. The only difference between the "original" training protocol and "no scrambling, no defs" is that the original trains on $T$ and does not include examples with replaced words and unverified labels. Training including the replacements reduces accuracy by 0.5% on $\mathrm{SNLI}^{\mathrm{true}}_{\mathrm{crit}}$, which includes only verified labels. For comparison, training and testing on all of SNLI with the original protocol achieves 90.4%, so a much larger effect on accuracy must be due to harder examples in $\mathrm{SNLI}^{\mathrm{true}}_{\mathrm{crit}}$.

Definitions are not well utilized without special training. The original SNLI model, if provided definitions of scrambled words at test time as part of the premise, achieves only 63.8%, compared to 81.2% for our specially trained model.
If the defined words are not scrambled, the classifier uses the original embedding and ignores the definitions. Training with definitions but no scrambling achieves 85.2% accuracy, but the resulting model is unable to use the definitions when words are scrambled: it achieves only 51.4%.
We have not discovered a way to combine the benefit of the definitions with the knowledge in the original word embedding. To force the model to use both techniques, we prepare a version of the training set which is half scrambled and half unscrambled. This model achieves 83.5% on the unscrambled test set, worse than no definitions. [...] accuracy was 81.2%), but not the whole capability.

Definition reasoning is harder than simple substitutions. When definitions are given as one-word substitutions, in the form "scrambled means: original" instead of "scrambled means: definition", the model achieves 84.7% on $\mathrm{SNLI}^{\mathrm{true}}_{\mathrm{crit}}$ compared to 81.2% using the definition text. Of course this is not a possibility for rare words that are not synonyms of a word that has been well trained, but it suggests that the kind of multi-hop reasoning in which words just have to be matched in sequence is easier than understanding a text definition.

A hard subset of SNLI
By construction of the SentencePiece dictionary (Kudo and Richardson, 2018), only the most frequent words in the training data of the XLNet language model are represented as single tokens. Other words are tokenized by multiple subwords. Sometimes the subwords reflect a morphological change to a well-modeled word, such as a change in tense or plurality. The language model probably understands these changes well and the subwords give important hints. The lemma form of a word strips many morphological features, so when the lemma form of a word has multiple subwords, the basic concept may be less frequently encountered in training. We hypothesize that such words are less well understood by the language model.
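The selection criterion implied by this hypothesis, a lemma that the tokenizer splits into more than one subword, can be sketched as below. The greedy longest-match tokenizer and its tiny vocabulary are toy stand-ins invented for illustration; they are not SentencePiece, and in practice the `tokenize` callable would wrap the classifier's actual tokenizer.

```python
from typing import Callable, List, Set


def toy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match split; unknown spans fall back to single
    characters. A toy stand-in, not the SentencePiece algorithm."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def is_multi_subword(lemma: str,
                     tokenize: Callable[[str], List[str]]) -> bool:
    """True if the lemma form is represented by more than one subword."""
    return len(tokenize(lemma)) > 1


# Invented vocabulary for the demonstration.
VOCAB = {"fountain", "walk", "ed", "gondol", "a"}


def tokenize(word: str) -> List[str]:
    return toy_tokenize(word, VOCAB)


print(tokenize("walked"), is_multi_subword("walk", tokenize))
print(tokenize("gondola"), is_multi_subword("gondola", tokenize))
```

Here "walked" splits into two pieces, but its lemma "walk" is a single token, so it would not count as hard; "gondola" is already a lemma and still needs two pieces, so it would.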
In Table 4 we apply various models constructed in the previous subsection to this hard test set. Ideally, a model leveraging definitions could compensate for these weaker word embeddings, but the method here does not do so.

Table 4: Accuracy on the hard SNLI subset

Conclusion
This work shows how a model's training may be enhanced to support reasoning with definitions in natural text, to handle cases where word embeddings are not useful. Our method forces the definitions to be considered and avoids the application of biases independent of the definition. Using the approach, entailment examples like "Jack has COVID / Jack is sick" that are misclassified by an XLNet trained on normal SNLI are correctly recognized as entailment when a definition "COVID is a respiratory disease" is added. Methods that can leverage definitions without losing the advantage of partially useful word embeddings are still needed. In an application, it also will be necessary to select the words that would benefit from definitions, and to make a model that can accept multiple definitions.