Does He Wink or Does He Nod? A Challenging Benchmark for Evaluating Word Understanding of Language Models

Recent progress in pretraining language models on large corpora has resulted in large performance gains on many NLP tasks. These large models acquire linguistic knowledge during pretraining, which helps to improve performance on downstream tasks via fine-tuning. To assess what kind of knowledge is acquired, language models are commonly probed by querying them with 'fill in the blank'-style cloze questions. Existing probing datasets mainly focus on knowledge about relations between words and entities. We introduce WDLMPro (Word Definition Language Model Probing) to evaluate word understanding directly using dictionary definitions of words. In our experiments, three popular pretrained language models struggle to match words and their definitions. This indicates that they understand many words poorly and that our new probing task is a difficult challenge that could help guide research on LMs in the future.


Introduction
Natural language processing (NLP) has advanced drastically in the last decade with the design of larger and more sophisticated models, the availability of larger corpora and increasing computational power. Pretrained word embeddings (Mikolov et al., 2013; Pennington et al., 2014) popularized the use of distributed word representations, which became a fundamental building block for NLP systems. Peters et al. (2018a) introduced LSTM-based deep contextual representations, and large performance gains were obtained by fine-tuning on tasks after unsupervised pretraining (Radford et al., 2018; Howard and Ruder, 2018). More recently, the attention-based transformer architecture was shown to use context more effectively (Vaswani et al., 2017), and several subsequent models achieved state-of-the-art results on many NLP tasks by combining the transformer architecture with unsupervised pretraining and task-specific fine-tuning (Devlin et al., 2019; Liu et al., 2019). Radford et al. (2019) showed that language models can be applied to a variety of tasks without task-specific fine-tuning. This was demonstrated on a much larger scale by Brown et al. (2020).
Deep models improve performance. However, what they actually learn about language and word meaning remains largely unclear because of their opaque nature. For static word embeddings, researchers used word similarity (Hill et al., 2015) and word analogy (Gladkova et al., 2016) tests to shed light on what information is captured in these dense vector spaces. For language models, a great amount of linguistic knowledge is stored in the model parameters (Peters et al., 2018b). Several studies proposed using 'fill in the blank'-style cloze statements to test knowledge learned by these models during unsupervised pretraining. Petroni et al. (2019) proposed the LAMA (LAnguage Model Analysis) probe to test the factual and common sense knowledge stored in language models. Similarly, Schick and Schütze (2020) introduced WNLaMPro (WordNet Language Model Probing) to assess the ability of language models to understand words based on their frequency. In WNLaMPro, cloze-style questions are generated based on antonym, hypernym and cohyponym relations among words extracted from WordNet.
The existing probing datasets mainly focus on investigating knowledge about relations between words or entities. However, a more direct way of testing whether a language model understands the meaning of a word is to use its dictionary definition. If a pretrained language model truly understands the meaning of a word, then it should be able to match it with its dictionary definition. Based on this motivation, we introduce the Word Definition Language Model Probing (WDLMPro) dataset,1 a challenging benchmark for testing the ability of NLP models to understand words. WDLMPro is essentially a set of thousands of synset groups; each synset group consists of a target word (with its definition) and its taxonomic sisters (with their definitions). Using taxonomic sisters, rather than random word groups, makes the task more challenging for statistical models that are based on the distributional hypothesis since these words have similar distributional characteristics (Lenci, 2008). We evaluate two masked language models, BERT and RoBERTa, and the auto-regressive model GPT-2 on WDLMPro using two different probing tests: (i) match definition to word (D2W) and (ii) match word to definition (W2D). We also provide a baseline using static fastText embeddings (Mikolov et al., 2018). We find that all three language models perform clearly better than the baseline. Nevertheless, they have great difficulty matching words and their definitions, implying a poor understanding of word meaning. This is an important result that could help guide research on LMs in the future.

WDLMPro
In this section, we introduce WDLMPro (Word Definition Language Model Probing), a dataset to test how well NLP models can match nouns and verbs with their definitions. We view this as a test of how well the models understand lexical meaning.

Dataset
WordNet (Miller, 1995) is the basis for constructing WDLMPro. A WordNet synset contains a set of synonyms along with a short definition of the synset. Different senses of polysemous words are represented in different synsets, providing disambiguation. WordNet connects synsets with each other via semantic relations.

1 WDLMPro and evaluation scripts are available at https://www.cis.lmu.de/definition benchmark/WDLAMPro.zip
Based on a target synset t and the hyponymy relation < (s < h means s is a hyponym of h), we construct a synset group G(t) for the target as follows:

G(t) = { s : s < h for some hypernym h of t },

that is, G(t) contains all synsets that are "sister hyponyms" of t with respect to a hypernym of t; note that t itself is a member of G(t). G(t), along with the definitions of the synsets in G(t), will be used to set up the WDLMPro tasks that require matching of words and definitions. We discard groups G(t) with fewer than 5 members.
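The group construction can be sketched as follows. To stay self-contained, a hand-coded toy taxonomy stands in for WordNet (the real dataset is built from WordNet's hypernym/hyponym relations), and the function name is illustrative:

```python
# Toy taxonomy: hypernym -> list of hyponyms (the real WDLMPro uses WordNet).
HYPONYMS = {
    "singing.n.01": ["a_cappella_singing.n.01", "crooning.n.01",
                     "humming.n.01", "yodeling.n.01", "chanting.n.01"],
}

# Invert the taxonomy: hyponym -> list of hypernyms.
HYPERNYMS = {}
for hyper, hypos in HYPONYMS.items():
    for hypo in hypos:
        HYPERNYMS.setdefault(hypo, []).append(hyper)

def synset_group(target):
    """G(t): every hyponym of every hypernym of the target
    ("sister hyponyms"); the target itself is a member of G(t)."""
    group = set()
    for hyper in HYPERNYMS.get(target, []):
        group.update(HYPONYMS[hyper])
    return group

G = synset_group("a_cappella_singing.n.01")
keep = len(G) >= 5  # groups with fewer than 5 members are discarded
```

With the NLTK WordNet interface, the same logic would iterate over `synset.hypernyms()` and `hypernym.hyponyms()`.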
In this study, we focus on nouns and verbs, i.e., we create synset groups G for the nouns and verbs in WordNet. Table 1 displays five members of G(t) and their definitions for the target a cappella singing.n.01 (see appx. for the target beckon.v.01). Table 2 shows statistics of the dataset.

Probing Tests
We define two probing tests that are converses of each other:

• Match definition to word (D2W). Given a definition and a set of words, the task is to find the word that the definition defines.

• Match word to definition (W2D). Given a word and a set of definitions, the task is to find the definition that defines the word.
Each synset group G(t) gives rise to one instance of D2W by providing the definition of t, and all words in G(t). The word from G(t) that matches the definition has then to be identified. (Note that t is a member of G(t).) Similarly, each synset group G(t) gives rise to one instance of W2D by providing t and the definitions of all words in G(t).
The correct definition of t has then to be identified among all definition candidates. Note that WordNet definitions by construction do not contain the word to be defined, so there are no instances where the two tasks are trivial.

Table 3: Patterns used for querying language models for nouns and verbs. <DEF> refers to the definition; "___" is the mask or missing word that the language model has to predict.

Noun: <DEF> is the definition of ___
Verb: <DEF> is the definition of to ___
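One synset group yields one instance of each task. A minimal sketch, where `make_instances` is an illustrative helper and the third gloss is a made-up toy value:

```python
def make_instances(target, group, definitions):
    """Turn one synset group into one D2W and one W2D instance.
    `definitions` maps each synset in `group` to its gloss."""
    d2w = {"definition": definitions[target],            # definition of t
           "candidates": sorted(group),                  # all words in G(t)
           "answer": target}
    w2d = {"word": target,                               # t
           "candidates": [definitions[s] for s in sorted(group)],
           "answer": definitions[target]}
    return d2w, w2d

defs = {"beckon.v.01": "signal with the hands or nod",
        "wink.v.02": "signal by winking",
        "gesticulate.v.01": "express something through movement"}
d2w, w2d = make_instances("beckon.v.01", set(defs), defs)
```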

Application to language models
In principle, any NLP model can be tested on D2W and W2D. In this paper, we are particularly interested in testing language models. To this end, we convert the data to a format that is suitable for language models, i.e., to cloze-style questions as shown in Table 3. The basic quantity that allows us to assess the compatibility of a word t and a definition is the probability of t being generated for "___" when the definition is substituted for <DEF>.
More precisely, we compute the probability that the string representation of t is generated. We denote the string representation of synset t by t. We obtain the string representation by removing the word type and sense information from the name of the synset and replacing underscores with white space. For example, synset warm up.v.04 is represented by the string "warm up". Table 3 shows that we define different templates for masked and autoregressive language models. For the masked language models, we average the prediction scores across patterns before ranking the candidates.
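The conversion from synset name to string representation, and the construction of a cloze query, can be sketched as follows; the exact pattern wording is an assumption and "___" marks the mask position:

```python
def synset_to_string(name):
    """Drop POS and sense id, restore spaces: 'warm_up.v.04' -> 'warm up'."""
    lemma = name.rsplit(".", 2)[0]
    return lemma.replace("_", " ")

def make_query(definition, pos):
    """Fill a cloze pattern in the spirit of Table 3 (wording assumed);
    verbs get the infinitive marker 'to' before the blank."""
    prefix = "to " if pos == "v" else ""
    return f"{definition} is the definition of {prefix}___"

s = synset_to_string("warm_up.v.04")
q = make_query("make one's body limber before exercise", "v")
```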

Baselines
For a masked language model (MLM) M, the score of a candidate c ∈ G(t) on W2D is calculated as

score(c) = (1/|t|) Σ_{i=1..|t|} log P_M(t_i | Q(c, |t|)),

where t = [t_1, t_2, ..., t_|t|] is the tokenization of t produced by M and Q(c, |t|) is the input query created from one of the patterns (Table 3) with "___" replaced by |t| consecutive mask tokens. For an autoregressive language model (ALM) A, we decompose the probability of t in the standard way:

P_A(t | Q(c)) = Π_{i=1..|t|} P_A(t_i | Q(c), t_1, ..., t_{i-1}).

For D2W, we need to compare, given a definition, the probabilities of different candidate words, which are generally of different lengths. To ensure a fair comparison, we follow Xiong et al. (2020). For MLMs, we match the number of mask tokens in the input query to the token count of each candidate c = [c_1, ..., c_|c|]; the final score is the average log-probability of the masked tokens:

score(c) = (1/|c|) Σ_{i=1..|c|} log P_M(c_i | Q(|c|)),

where Q(|c|) is the query built from the definition of t with |c| mask tokens. For ALMs, we use the probability of the first token:

score(c) = P_A(c_1 | Q).

Considering further tokens does not make sense since they are often easily predictable from the first token.
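The two scoring schemes can be sketched model-agnostically. Here `fake_mask_logprobs` is a stub standing in for a real MLM call (e.g. via a transformers pipeline), not an actual API; it returns one log-probability per masked position:

```python
import math

def mlm_score(candidate_tokens, query, mask_logprobs):
    """Average log-probability of the candidate's tokens at the mask
    positions of the query (used for both W2D and D2W)."""
    lps = mask_logprobs(query, candidate_tokens)  # one log-prob per token
    return sum(lps) / len(lps)

def alm_score_d2w(candidate_tokens, query, next_token_logprob):
    """D2W with an autoregressive LM: score is the probability of the
    candidate's FIRST token only; later tokens are often predictable."""
    return math.exp(next_token_logprob(query, candidate_tokens[0]))

# Stub scorers for illustration only.
def fake_mask_logprobs(query, tokens):
    return [-1.0 for _ in tokens]

score = mlm_score(["warm", "up"], "<DEF> is the definition of ___ ___",
                  fake_mask_logprobs)
first = alm_score_d2w(["cattalo"], "some query", lambda q, tok: -0.5)
```

Ranking the candidates by these scores then yields the results lists used by the measures below.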
We apply our probing tests to two different pretrained MLMs (BERT and RoBERTa) and one ALM (GPT-2). To investigate the effect of model size on performance, we experiment with both the base and large versions of BERT and RoBERTa, along with all four sizes of GPT-2 (small, medium, large, xl). For RoBERTa, we capitalize the first letter of the candidate noun since pretrained RoBERTa models are case sensitive and expect a capital letter at the beginning of a sentence.2 In addition to the deep contextual language models, we also provide fastText static word embeddings3 (Mikolov et al., 2018) as a baseline.4 For fastText embeddings, we tokenize the candidates and their definitions using the NLTK tokenizer and represent each by the average of its token vectors. We rank candidates based on their cosine similarity to the target embedding.
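The static-embedding baseline amounts to averaging token vectors and ranking by cosine similarity; a minimal sketch with toy vectors standing in for pretrained fastText embeddings:

```python
import numpy as np

def avg_vec(tokens, emb):
    """Represent a word or definition by the average of its token vectors."""
    return np.mean([emb[t] for t in tokens], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_candidates(target_tokens, candidates, emb):
    """Rank tokenized candidates by cosine similarity to the target,
    most similar first."""
    tv = avg_vec(target_tokens, emb)
    return sorted(candidates, key=lambda c: -cosine(avg_vec(c, emb), tv))

# Toy embeddings for illustration only.
emb = {"beckon": np.array([1.0, 0.1]),
       "signal": np.array([0.9, 0.2]),
       "acting": np.array([0.0, 1.0])}
ranking = rank_candidates(["beckon"], [["signal"], ["acting"]], emb)
```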

Measures
We use two measures: precision at 1 (P@1) and a rank score (RS), both based on a ranked result list, either of words or of definitions. P@1 is the percentage of instances for which the top-ranked item is correct. We define RS as follows:

RS = (L − k) / (L − 1),

where L = |G(t)| is the number of candidates and k is the rank of the correct item, 1 ≤ k ≤ L. Table 2 shows that the size of G(t) is highly variable; in contrast to P@1, RS is less affected by this, and the random baseline (cf. Tables 4 and 5) is always 0.5.
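A sketch of the two measures, assuming RS is the linearly normalized rank (L − k)/(L − 1), which is consistent with the stated constant random baseline of 0.5:

```python
def precision_at_1(ranks):
    """P@1: percentage of instances where the correct item is ranked first."""
    return 100.0 * sum(1 for k in ranks if k == 1) / len(ranks)

def rank_score(k, L):
    """RS = (L - k) / (L - 1): 1.0 for rank 1, 0.0 for last place."""
    return (L - k) / (L - 1)

# A uniformly random ranking over L candidates has expected RS = 0.5,
# regardless of L.
L = 7
expected = sum(rank_score(k, L) for k in range(1, L + 1)) / L
```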

Results
Tables 4 and 5 present W2D and D2W results for BERT, RoBERTa and GPT-2 along with fastText and random baselines. Language models perform clearly better than both baselines. Larger models generally perform better than smaller ones, and RoBERTa consistently outperforms BERT. This might indicate a correlation between performance on WDLMPro and downstream performance; however, further investigation is necessary to establish such a correlation. For W2D, the best performance is achieved by GPT-2 xl for nouns (47.3 P@1, 0.81 RS) and by RoBERTa large for verbs (50.8 P@1, 0.84 RS). Performance on D2W is much lower than on W2D for all models. For nouns, RoBERTa large and GPT-2 xl perform similarly (28.8 and 29.8 P@1, 0.70 and 0.73 RS), while RoBERTa large achieves the best results for verbs (38.6 P@1, 0.80 RS). The lower performance on D2W compared to W2D might be due to language models being better at distinguishing definitions than individual words, since definitions are more informative than individual words. Overall, GPT-2 models perform better than masked language models (with the exception of RoBERTa large for verbs), despite using a single pattern as opposed to the multiple patterns used by masked language models. This might indicate that the ALM objective is better at learning word meaning than the MLM objective.
To investigate the effect of frequency, we stratify words into rare (fewer than 10 occurrences), medium (10 to 99 occurrences) and frequent (100 or more occurrences), based on occurrences in WWC (Westbury Wikipedia Corpus; Shaoul, 2010),5 where we use WWC frequency as a substitute for frequency in the models' training corpora. We focus on nouns since most verbs in our dataset are relatively frequent. Table 7 shows that, for W2D, all models have a poor understanding of the meaning of rare and medium-frequency words (see appx. for D2W results). Even for frequent words, P@1 is never above 55.

5 Targets that have more than 3 tokens (based on NLTK tokenization) are taken as rare without counting.

We additionally break down the results based on the depth of the synsets in the WordNet hierarchy. Specifically, we investigate the performance of the GPT-2 xl model on W2D for WordNet nouns, where we take the depth of a synset group as the length of the shortest path from the target synset to the root synset (i.e., entity.n.01). Table 6 shows that performance drops steadily as we go deeper in the hierarchy. Lower levels of the WordNet hierarchy contain many scientific terms and names of (sub)species, such as types of cattle (e.g., cattalo, hereford, galloway). These results suggest that even very large LMs lack the knowledge necessary to distinguish these terms.

Analysis. The correct definition of the medium-frequency verb 'beckon' is 'signal with the hands or nod'; GPT-2 xl predicts 'signal by winking'. The correct definition of the frequent noun 'roleplaying' is 'acting a particular role (as in psychotherapy)'; GPT-2 xl predicts 'acting the part of a character on stage'. So GPT-2 xl understands that beckoning is signaling and that roleplaying is acting, but it has not learned to distinguish between different types of signaling and acting. This points to an important future goal for LMs: they should be developed to gain an understanding of words that goes beyond the current superficial state of the art.
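The frequency stratification rule, including the footnoted special case for long targets, can be sketched as follows (counts are toy values; the function name is illustrative):

```python
def frequency_band(word, counts, n_tokens=1):
    """Stratify a target into rare (<10), medium (10-99) or frequent
    (>=100) by corpus count; targets with more than 3 tokens are
    treated as rare without counting."""
    if n_tokens > 3:
        return "rare"
    c = counts.get(word, 0)
    if c < 10:
        return "rare"
    if c < 100:
        return "medium"
    return "frequent"

# Toy WWC-style counts for illustration only.
counts = {"dog": 12345, "beckon": 57, "cattalo": 3}
```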
Human performance on WDLMPro. It is beyond the scope of this paper to evaluate human performance on the entirety of WDLMPro. However, we provide a comparison with human performance on a small subset to give an intuition about the difficulty of the task. For each of the two tasks, 20 synset groups with a maximum of 10 candidates are randomly sampled from WDLMPro. Two native English speakers are then asked to rank the candidates. Table 8 displays the average performance of the human participants and the language models on this subset. For both tasks, the performance of the best model is comparable to the average human performance. Human performance is an upper bound for many NLP tasks. We believe that this is not the case for WDLMPro: arguably, we should aim for models with an excellent understanding of word meanings, even if it exceeds average human understanding. Knowledge-based tasks are an analogous case: we should strive for models that know as many facts as possible, even if that performance is above average human performance.

Conclusion
We introduced WDLMPro, a probing test that helps analyze how well a model understands word meaning. WDLMPro is complementary to existing probing tests that target relations between words or entities. We evaluated three popular pretrained language models on the W2D (word to definition) and D2W (definition to word) tasks. Our findings show that, despite their remarkable performance on many downstream tasks, these models struggle to match a word with its true definition, suggesting an insufficient understanding of word meaning. The relatively poor performance of these powerful models on WDLMPro can be seen as evidence of the limitations of purely distributional systems and of the need to incorporate external knowledge. WDLMPro provides an important evaluation benchmark, encouraging the design and training of models with precise word understanding.