Lacking the embedding of a word? Look it up into a traditional dictionary

Word embeddings are powerful dictionaries that can easily capture language variation. However, these dictionaries fail to give sense to rare words, which are surprisingly often covered by traditional dictionaries. In this paper, we propose to use definitions retrieved from traditional dictionaries to produce word embeddings for rare words. For this purpose, we introduce two methods: Definition Neural Network (DefiNNet) and Define BERT (DefBERT). In our experiments, DefiNNet and DefBERT significantly outperform state-of-the-art as well as baseline methods devised for producing embeddings of unknown words. In fact, DefiNNet significantly outperforms FastText, which implements a method for the same task based on n-grams, and DefBERT significantly outperforms the BERT method for OOV words. Hence, definitions in traditional dictionaries are useful for building word embeddings for rare words.


Introduction
Words without meaning are like compasses without a needle: pointless. Indeed, meaningless words lead compositionally to meaningless sentences and, consequently, to meaningless texts and conversations. Second language learners may grasp the grammatical structures of sentences but, if they are unaware of the meaning of single words in these sentences, they may fail to understand the sentences as a whole. This is the reason why a large body of natural language processing research is devoted to devising ways to capture word meaning.
As language is a living body, distributional methods (Turney and Pantel, 2010; Mikolov et al., 2013; Pennington et al., 2014) are seen as the panacea for capturing word meaning, as opposed to more static models based on dictionaries (Fellbaum, 1998). Distributional methods can easily capture new meanings of existing words and, eventually, assign meaning to emerging words. In fact, these methods can scan corpora and derive the meaning of new words by observing them in context (Harris, 1954; Firth, 1950; Wittgenstein, 1953). Words are then represented as vectors, now called word embeddings, which are used to feed neural networks that produce meaning for sentences (Bengio et al., 2003; İrsoy and Cardie, 2014; Kalchbrenner et al., 2014; Tai et al., 2015) and for whole texts (Joulin et al., 2017; Lai et al., 2015).
Distributional methods have a strong limitation: word meaning can be assigned only to words for which sufficient contexts can be gathered. Rare words are not covered and become the classical out-of-vocabulary words, which may hinder the understanding of specific yet important sentences. To overcome this problem, n-gram-based distributional models have emerged, where word meaning is obtained by composing the "meaning" of the n-grams forming a word. These n-grams act as proto-morphemes and, hence, the meaning of unknown words can be obtained by composing the meaning of proto-morphemes derived from existing words. These proto-morphemes are the building blocks of word meaning.
Traditional dictionaries can offer a solution to find meaning of rare words. They have been put aside since they cannot easily adapt to language evolution and they cannot easily provide distributed representations for neural networks.
In this paper, we propose to use definitions in dictionaries to compositionally produce distributional representations for Out-Of-Vocabulary (OOV) words. Definitions in dictionaries are intended to describe the meaning of a word to a human reader. We therefore propose two models that exploit definitions to derive the meaning of OOV words: (1) Definition Neural Network (DefiNNet), a simple neural network; (2) DefBERT, a model based on pre-trained BERT. We experimented with different tests and datasets derived from WordNet. Firstly, we determined whether DefiNNet and DefBERT can learn a neural network to derive word meaning from definitions. Secondly, we aimed to establish whether DefiNNet and DefBERT can cover OOV words, which are not covered by word2vec (Mikolov et al., 2013) or by the BERT pre-trained encoder, respectively. In our experiments, DefiNNet and DefBERT significantly outperform state-of-the-art as well as baseline methods devised for producing embeddings of unknown words. In fact, DefiNNet significantly outperforms FastText, which implements a method for the same task based on n-grams, and DefBERT significantly outperforms the BERT method for OOV words. Hence, definitions in traditional dictionaries are useful for building word embeddings for rare words.

Background and Related Work
Out-of-Vocabulary (OOV) words have often been a problem, as they may hinder the applicability of many NLP systems. For example, if words are not included in the lexicon of a Probabilistic Context-Free Grammar, interpretations for sentences containing these words may have a null probability. Hence, solutions to this problem date back in time.
Recently, in the context of word embeddings, the most common solution is to use word n-grams or word pieces of variable length (Wu et al., 2016) as proxies to model morphemes. Embeddings are learned for 3-grams as well as for word pieces. In the n-gram approach, these 3-grams are then combined to obtain the embedding for the entire word. For example, the word cheerlessness, which contains 3 morphemes (cheer, less and ness), is modeled by using embeddings for che, hee, ..., ess in the 3-gram approach and by using embeddings for cheer and lessness in the word pieces approach. These embeddings possibly capture information about the related morphemes. In this way, OOV word embeddings are correlated with meaningful bits of observed words. These models are clearly our baselines.
In the study of OOV words for word embeddings, deriving word embeddings from dictionary definitions is, to the best of our knowledge, a novel approach. Dictionary definitions have been used in early attempts to train rudimentary compositional distributional semantic models (Zanzotto et al., 2010), which aimed to build embeddings for sequences of two words.
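The 3-gram decomposition in the example above can be sketched in a few lines of Python. This is only an illustration of the idea as stated in the text: the function names and the toy 2-dimensional n-gram table are hypothetical, and the actual FastText implementation differs in detail (it also adds word-boundary markers and uses n-grams of several lengths).

```python
def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of `word`, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def compose(ngrams, table, dim):
    """Sum the vectors of the n-grams found in `table`; unseen n-grams are skipped."""
    total = [0.0] * dim
    for gram in ngrams:
        vec = table.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


grams = char_ngrams("cheerlessness")   # che, hee, eer, ..., nes, ess

# Toy n-gram table (hypothetical values, for illustration only).
toy_table = {"che": [1.0, 0.0], "hee": [0.0, 1.0], "ess": [0.5, 0.5]}
oov_vector = compose(grams, toy_table, dim=2)
```

Note that ess occurs twice in cheerlessness, so its vector contributes twice to the sum.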
Universal sentence embedders (USEs) (Conneau et al., 2018) can play an important role in this novel approach. In fact, definitions are particular sentences aiming to describe meaning of words. Therefore, USEs should obtain an embedding representing the meaning of a word by composing embeddings of words in the definition.
Moreover, deriving word embeddings from definitions can be seen as a semantic stress test of universal sentence embedders. Generally, the ability of USEs (Devlin et al., 2019; Yang et al., 2020; Clark et al., 2020) to semantically model sentences is tested with end-to-end downstream tasks, for example, natural language inference (NLI) (Jiang and de Marneffe, 2019a; Raffel et al., 2020; He et al., 2021), question-answering (Zhang, 2019) as well as dialog systems (Wu et al., 2020). USEs such as BERT (Devlin et al., 2019) encode semantic features in hidden layers (Jawahar et al., 2019; Miaschi et al., 2020). This explains why these USEs are good at modeling the semantics of sentences in downstream tasks. However, the success of USEs in downstream tasks may be due to superficial heuristics (as supposed by McCoy et al. (2019) for NLI) and not to a deep modeling of semantic features. Therefore, our study can contribute to this debate. In fact, to the best of our knowledge, it is the first study aiming to investigate whether USEs can model meaning by producing embeddings for words starting from their definitions.

Model
This section introduces our proposals to use definitions in generating embeddings for Out-of-Vocabulary words: Definition Neural Network (DefiNNet) and BERT for Definitions (DefBERT). Section 3.1 describes the basic idea. Section 3.2 describes the definition of the feed-forward neural network DefiNNet. Finally, Section 3.3 describes how we used the Universal Sentence Embedder BERT in producing embeddings for definitions.

Basic Idea
Our model stems from an observation: when someone stumbles upon a rare unknown word while reading, definitions in traditional dictionaries are the natural resource used to understand the meaning of this rare, out-of-one's-personal-dictionary word. Then, as people rely on dictionaries in order to understand the meaning of unknown words, learners of word embeddings could do the same.
Indeed, definitions in dictionaries are conceived to define compositionally the meaning of target words. Therefore, these are natural candidates for deriving the word embedding of an OOV word by composing the word embeddings of the words in the definition. The hunch is that universal sentence embedders can be used for this purpose.
Moreover, these definitions have a recurrent structure, which can be used to derive a simpler model. Definitions for words w are often organized as a particular sentence which contains the super-type of w and a modifier, which specializes the super-type. For example (Fig. 1), cheerlessness is defined as a feeling, which is the super-type, and of dreary and pessimistic sadness, which is the modifier. By using this structure, we propose a simpler model for composing meaning.
In the following sections, we propose two models: (1) DefiNNet, a model that exploits the structure of definitions to focus on relevant words; and (2) DefBERT, a model that uses BERT as a universal sentence embedder to embed the definition in a single vector.
DefiNNet has two components. The first, DefAnalyzer, extracts the two main words from a given definition by exploiting the recurrent structure of definitions through their syntactic interpretations. In our study, we use constituency parse trees and associated rules to extract the super-type $w_h$ and its closest modifier $w_m$. The algorithm is simple. Given a definition s, parse it and select the main constituent. If the main constituent contains a semantic head and a modifier, then those are the two target words. Otherwise, select the semantic head of the main constituent as the super-type $w_h$ and the semantic head of the first sub-constituent as the relevant modifier $w_m$. For example, in the parse tree for the definition of cheerlessness, the main constituent is the first NP: the selected $w_h$ is the word feeling, which is the semantic head of the first NP, and $w_m$ is the noun sadness, which is the semantic head of the PP. Semantic heads are computed according to a slightly modified version of the semantic heads defined by Collins (2003).
The second component is DeNN. Given the word embeddings $\vec{w}_h$ and $\vec{w}_m$ from the Word2Vec embedding space for $w_h$ and $w_m$ respectively, their POS tags $pos_h$ and $pos_m$, and the target's POS tag $pos_c$ as additional information, DeNN outputs the embedding $\vec{w}_c$ for the target word $w_c$. The input of DefiNNet is illustrated in Fig. 1. The general equation for DeNN is:
$$\vec{w}_c = DeNN(\vec{w}_h, \vec{w}_m, pos_h, pos_m, pos_c)$$
The DeNN function can be described starting from three simpler subnets: (1) $FF_w$ processes the word embeddings $\vec{w}_h$ and $\vec{w}_m$; (2) $FF_p$ embeds and processes $pos_h$, $pos_m$ and $pos_c$; finally, (3) $FF$ processes the joint information from the previous steps.
The subnet $FF_w$ takes as input $\vec{w}_h$ and $\vec{w}_m$ and produces a vector $\vec{s}$:
$$\vec{s} = FF_w(\vec{w}_h, \vec{w}_m) \qquad (1)$$
where $FF_w$ is built from the dense layers $W_h$ and $W_m$, applied to $\vec{w}_h$ and $\vec{w}_m$ respectively, and $\sigma$ is the LeakyReLU activation function.
The subnet $FF_p$ processes the POS tags $pos_h$, $pos_m$ and $pos_c$. Each $pos_i$ for $i \in \{h, m, c\}$ is first fed into an embedding layer whose weights are learned from scratch. The resulting embedding $\vec{pos}_i$ is then fed into a dense layer $W_i$. Hence, for each $i \in \{h, m, c\}$, the output of $FF_p$ is:
$$\vec{p}_i = W_i(\vec{pos}_i) \qquad (2)$$
The $\vec{s}$ resulting from Equation 1 and the $\vec{p}_h$, $\vec{p}_m$, $\vec{p}_c$ from Equation 2 are then concatenated ($\oplus$):
$$\vec{h} = \vec{s} \oplus \vec{p}_h \oplus \vec{p}_m \oplus \vec{p}_c$$
As a final step, $\vec{h}$ is fed into a feed-forward subnet $FF$ composed of the dense layers $W_1$, $W_2$ and $W_3$, as follows:
$$FF(\vec{h}) = W_3(W_2(W_1(\vec{h})))$$
Hence:
$$\vec{w}_c = DeNN(\vec{w}_h, \vec{w}_m, pos_h, pos_m, pos_c) = FF(\vec{h})$$
For comparative purposes, we defined two additional baseline models: a hypernym model (Head) and an additive model (Additive) (Mitchell and Lapata, 2008). The Head model derives the embedding for the OOV word c by using the embedding of its hypernym h in WordNet, that is, $\vec{w}_c = \vec{w}_h$. The Additive model instead adds the embeddings of the two words in the definition used by DefiNNet, that is, $\vec{w}_c = \vec{w}_h + \vec{w}_m$.
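To make the data flow through the three subnets concrete, the following is a minimal, untrained sketch in plain Python. It is not the implementation used in our experiments. The layer sizes, the random initialisation, the POS tag inventory, the choice to concatenate the two projections inside $FF_w$, and the placement of LeakyReLU between the layers of $FF$ are all assumptions made for illustration. The two baselines, Head and Additive, are included for comparison.

```python
import random

rng = random.Random(0)
EMB, HID, POS_EMB = 4, 6, 3             # toy sizes (assumed, not the paper's)
POS_TAGS = {"NN": 0, "VB": 1, "JJ": 2}  # hypothetical tag inventory


def make_dense(n_in, n_out):
    """Randomly initialised dense layer, stored as (weight rows, biases)."""
    rows = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return rows, [0.0] * n_out


def dense(layer, x):
    """Apply a dense layer to the vector x."""
    rows, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(rows, bias)]


def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]


W_h, W_m = make_dense(EMB, HID), make_dense(EMB, HID)
pos_table = [[rng.uniform(-0.5, 0.5) for _ in range(POS_EMB)] for _ in POS_TAGS]
W_pos = {role: make_dense(POS_EMB, POS_EMB) for role in ("h", "m", "c")}
W_1 = make_dense(2 * HID + 3 * POS_EMB, HID)
W_2 = make_dense(HID, HID)
W_3 = make_dense(HID, EMB)


def denn(w_h, w_m, pos_h, pos_m, pos_c):
    """Forward pass: FF_w and FF_p, concatenation, then FF."""
    # FF_w: assumed here to concatenate the two activated projections.
    s = leaky_relu(dense(W_h, w_h)) + leaky_relu(dense(W_m, w_m))
    # FF_p: embed each POS tag, then apply its own dense layer W_i.
    p = []
    for role, tag in (("h", pos_h), ("m", pos_m), ("c", pos_c)):
        p += dense(W_pos[role], pos_table[POS_TAGS[tag]])
    h = s + p  # list concatenation plays the role of the ⊕ operator
    # FF: three dense layers; the activations between them are an assumption.
    return dense(W_3, leaky_relu(dense(W_2, leaky_relu(dense(W_1, h)))))


def head_baseline(w_h, w_m):
    """Head: the target embedding is the embedding of the super-type."""
    return list(w_h)


def additive_baseline(w_h, w_m):
    """Additive: the target embedding is the sum of the two embeddings."""
    return [a + b for a, b in zip(w_h, w_m)]


# Toy stand-ins for the Word2Vec vectors of "feeling" and "sadness".
w_feeling = [0.5, -0.25, 0.0, 1.0]
w_sadness = [0.25, 0.25, -1.0, 0.5]
w_c = denn(w_feeling, w_sadness, "NN", "NN", "NN")
```

In a real setting the weights would be learned so that the output for an in-vocabulary word is close to its Word2Vec embedding; the sketch only shows how the inputs are combined.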

DefBERT: Transforming definitions into word embeddings
DefBERT aims to use BERT's ability to process sentences in order to turn the definition of $w_c$ directly into its embedding $\vec{w}_c$. We follow two approaches in exploiting the definition: DefBERT [CLS] and DefBERT Head.
DefBERT [CLS] is the first of these approaches: the definition of $w_c$ is given as input to a pre-trained BERT-base model and, as shown in Figure 1, $\vec{b}_{[CLS]}$, the embedding of the [CLS] token, is taken as the sentence embedding, as is usual when BERT is used as a USE.
DefBERT Head is the second approach: in this case we select $\vec{b}_{head}$, which is the contextual embedding of $w_h$ in the definition. Since BERT's embeddings are contextual, $\vec{b}_{head}$ could benefit from the definition being the input sentence.
For comparative purposes, we also define BERT wordpieces and BERT Head-Example. BERT wordpieces is used to see whether our model outperforms the classical behavior of BERT when it encounters OOV words. Hence, BERT wordpieces replicates this classical behavior. In this case, BERT is fed with a sample sentence containing the target OOV word, for example "... melancholy to pastel cheerlessness" for the target OOV word "cheerlessness" (see Figure 1). Then, the word is divided into word pieces. To obtain the embedding for the target word, we sum up the vectors of these word pieces. BERT Head-Example is instead used to determine whether definitions are really useful for modeling the meaning of the head word. BERT Head-Example is similar to DefBERT Head, but the input is different: it is a random sentence that contains the head word. Hence, comparing DefBERT Head with BERT Head-Example indicates whether the head in the definition really absorbs the meaning of the defined word.
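The word-piece baseline can be illustrated with a small self-contained sketch. The splitting function below follows the greedy longest-match-first strategy used by BERT's WordPiece tokenizer, but the vocabulary and the 2-dimensional vectors are toy, hypothetical values. In the actual baseline the summed vectors are BERT's contextual outputs for the pieces of the target word in the sample sentence, not static vectors.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def sum_pieces(pieces, vectors):
    """Sum the vectors of the word pieces to obtain one vector for the word."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for piece in pieces:
        total = [t + v for t, v in zip(total, vectors[piece])]
    return total


# Toy vocabulary with toy vectors (hypothetical values, for illustration only).
toy_vectors = {
    "cheer": [1.0, 0.0],
    "##less": [0.0, 1.0],
    "##lessness": [0.0, 2.0],
}
pieces = wordpiece("cheerlessness", toy_vectors)
oov_embedding = sum_pieces(pieces, toy_vectors)
```

With this toy vocabulary the word is split into cheer and ##lessness, because the longest matching continuation is preferred over the shorter ##less.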
Our experiments investigate three issues: (1) whether word embeddings obtained with DefiNNet are reasonably better than those obtained with baseline compositional functions, as well as those obtained with an untrained version of BERT; (2) whether similarity measures over WordNet are correlated with spaces of word embeddings; (3) finally, whether the word embeddings obtained for Out-of-Vocabulary words are good word representations in terms of their correlation with similarity measures on WordNet. Clearly, issue (2) must be settled before investigating issue (3), and we spend time analyzing it because the correlation between WordNet measures and word embeddings is a highly debated problem (Lastra-Díaz et al., 2019).
The rest of the section is organized as follows. Section 4.1 introduces the general settings of our experiments. Section 4.2 presents results and it is organized in three subsections, which address the above three issues. If needed, these subsections introduce additional settings for the experiments.

Experimental set-up
Our experiments are primarily defined around WordNet (Fellbaum, 1998). WordNet is the source of word definitions, which are needed for DefiNNet and for DefBERT. WordNet is used to collect testing sets of word pairs of similar and dissimilar words. Finally, similarity measures over WordNet are used to rank pairs according to the similarity between words. These latter rankings are used to see if similarities derived with DefiNNet's and DefBERT's word embeddings for OOV words correlate with a standard notion of similarity between two words.
In our study, in-vocabulary (IV) and OOV words ($IV_{w2v}$, $OOV_{w2v}$, $IV_{BERT}$ and $OOV_{BERT}$) are defined according to the pre-trained word embedding matrices $W_{w2v}$ and $W_{BERT}$. $W_{w2v}$ is the Word2Vec embedding space (Mikolov et al., 2013) pre-trained on part of the Google News dataset (about 100 billion words), and $W_{BERT}$ is the BERT word embedding space (Devlin et al., 2019) trained on lower-cased English text from BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words), as described in Devlin et al. (2019). Then, $IV_{w2v}$ and $IV_{BERT}$ are the words in WordNet that are in the target embedding matrix, and $OOV_{w2v}$ and $OOV_{BERT}$ are the words in WordNet that are not in the target embedding matrix. These OOV words are interesting since, in principle, their meaning is known in WordNet but their embedding is not available. Hence, DefiNNet as well as DefBERT can be applied to them. In selecting $IV_{BERT}$ and $OOV_{BERT}$, there is an additional limitation: in order to apply DefBERT, usage examples are needed. Therefore, $IV_{BERT}$ and $OOV_{BERT}$ contain only words that have a usage example in WordNet.
We prepared two different sets of datasets for directly and indirectly investigating DefiNNet and DefBERT.
In the direct investigation, DefiNNet and DefBERT are tested to verify their ability to produce vectors for IV words. Methods are compared on the distribution (mean and standard deviation) of the cosine similarity between the embedding of a word and the embedding produced from its definition. We selected: (1) $Train_{w2v}$ with 33404 words and $Test_{w2v}$ with 8336 words as subsets of $IV_{w2v}$; (2) $Test_{BERT}$ with 3218 words as a subset of $IV_{BERT}$. $Train_{w2v}$ is also used to train DefiNNet.
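The statistic used in the direct investigation can be sketched as follows. The function names and the toy vectors are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption made for this sketch.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_distribution(gold, predicted):
    """Mean and standard deviation of cosine(gold[w], predicted[w]) over shared words."""
    sims = [cosine(gold[w], predicted[w]) for w in gold if w in predicted]
    return mean(sims), pstdev(sims)


# Toy example: `gold` stands for the embeddings found in the pre-trained space,
# `predicted` for the embeddings produced from the definitions.
gold = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
predicted = {"a": [2.0, 0.0], "b": [1.0, 0.0]}
avg, std = similarity_distribution(gold, predicted)
```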
In the indirect investigation, DefiNNet and DefBERT are tested to assess their ability to produce embeddings for OOV words that replicate some similarity measure between the words in a pair. We selected three similarity measures defined over WordNet: path (Rada et al., 1989), wup (Wu and Palmer, 1994) and res (Resnik, 1995). Then, we collected two sets of word pairs, $Pairs_{w2v}$ and $Pairs_{BERT}$. Word pairs $(w_1, w_2)$ in $Pairs_{w2v}$ are selected as follows: (1) $w_1$ is in $OOV_{w2v}$; (2) $w_2$ is in $IV_{w2v}$ and is either a random sister word of $w_1$ in 50% of the cases or a random word in the other 50% of the cases. Word pairs $(w_1, w_2)$ in $Pairs_{BERT}$ are obtained similarly. $Pairs_{w2v}$ contains around 4,500 word pairs and $Pairs_{BERT}$ contains 3,500 word pairs. To correctly apply Spearman's correlation between our systems and the expected rank on the list of pairs induced by a similarity measure, we divided $Pairs_{w2v}$ into 600 lists of 7 pairs and $Pairs_{BERT}$ into 450 lists. $Pairs_{w2v \cap BERT}$ contains 450 pairs divided into 60 lists. Pairs in a list are selected to have 7 clearly different values of the selected similarity (path, wup and res) between the two words. The final Spearman's correlation is reported as a distribution of correlations over these lists.
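The per-list evaluation can be sketched as follows. This is an illustrative pure-Python version: the function names are hypothetical, ties receive average ranks, and the population standard deviation is an assumption. Each list contributes one Spearman coefficient, and the result is the mean and standard deviation over the lists.

```python
import math
from statistics import mean, pstdev


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's correlation, computed as Pearson's correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    norm = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return cov / norm if norm else 0.0


def correlation_over_lists(lists):
    """`lists` holds (wordnet_scores, embedding_scores) tuples, one per list of pairs."""
    rhos = [spearman(gold, system) for gold, system in lists]
    return mean(rhos), pstdev(rhos)


# Two toy lists of 7 pairs: one ranked perfectly, one ranked in reverse.
wordnet_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
agreeing = [0.05, 0.10, 0.20, 0.30, 0.50, 0.80, 0.90]
reversed_scores = list(reversed(agreeing))
avg_rho, std_rho = correlation_over_lists(
    [(wordnet_scores, agreeing), (wordnet_scores, reversed_scores)]
)
```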
The last datasets defined here are used to investigate the second issue raised at the beginning of this section: it is necessary to determine whether measures over WordNet are correlated with spaces of word embeddings. The investigated word embeddings are Word2Vec, FastText and BERT. Similarly to $IV_{w2v}$ and $IV_{BERT}$, $IV_{fasttext}$ is the set of words in WordNet that are in the target embedding matrix $W_{fasttext}$ of FastText. The datasets we built are $Pairs_{IV_{w2v}}$, $Pairs_{IV_{BERT}}$ and $Pairs_{IV_{fasttext}}$. Each of them is composed of pairs $(w_1, w_2)$ of words from the given IV set, where $w_2$ is either a random sister word of $w_1$ in 50% of the cases or a random word in the other 50% of the cases. This definition follows the same approach used in defining $Pairs_{w2v}$ and $Pairs_{BERT}$. $Pairs_{IV_{w2v}}$, $Pairs_{IV_{BERT}}$ and $Pairs_{IV_{fasttext}}$ contain about 14,000, 560 and 14,000 pairs, respectively. These are then divided into smaller lists of 7 pairs, over which Spearman's coefficient is computed.
To comparatively investigate DefiNNet and DefBERT, we used FastText as realized in Grave et al. (2018) along with: (1) Additive and Head, defined in Section 3.2; (2) BERT wordpieces and BERT Head-Example, defined in Section 3.3. FastText defines the embedding of an unknown word c by combining the embeddings of its 3-grams; for example, the embedding for the OOV word cheerlessness is represented as the vector $\vec{f}_c = \vec{che} + \vec{hee} + ... + \vec{ess}$.
As a final experimental setting, definitions are parsed using Stanford's CoreNLP probabilistic context-free grammar parser. NLTK (Loper and Bird, 2002) is used to access WordNet and compute similarity measures over it.

Results and discussion
For clarity, this section is organized around the three issues we aim to investigate: the ability of proposed methods to build embeddings of words starting from dictionary definitions (Sec. 4.2.1); the debated relation between similarity over word embeddings and similarity in WordNet (Sec. 4.2.2); and, finally, the ability of the proposed methods to produce embeddings for OOV words (Sec. 4.2.3).

Word Embeddings from Dictionary Definitions
The first issue to investigate is whether our methods produce word embeddings from dictionary definitions that are similar to the word embeddings discovered directly. We therefore studied the cosine similarity between the two kinds of embeddings, for example, between the embedding of cheerlessness and the embedding of the definition a feeling of .... sadness. Each method is evaluated in its own space, that is, $sim(\vec{w}_c, \vec{w}_{def})$ for DefiNNet and $sim(\vec{b}_c, \vec{b}_{[CLS]})$ or $sim(\vec{b}_c, \vec{b}_{head})$ for DefBERT [CLS] and DefBERT Head, respectively (see Fig. 1). Experiments are conducted on In-Vocabulary words for both spaces by using the $Test_{w2v}$, $Test_{BERT}$ and $Test_{w2v \cap BERT}$ datasets.
Definitions seem to be better sources of word embeddings than baseline methods and other solutions. In fact, both DefiNNet and DefBERT Head outperform the other methods in their respective tests for both nouns and verbs (see Table 1). For nouns, DefiNNet has an average cosine similarity of 0.46(±0.14), which is well above that of Additive (0.28(±0.16)) and Head (0.27(±0.20)). In the same syntactic category, DefBERT Head outperforms BERT Head-Example, 0.46(±0.13) vs. 0.41(±0.12). For verbs, DefiNNet has an average cosine similarity of 0.48(±0.13), which is well above Additive and Head. In the same category, DefBERT Head slightly outperforms BERT Head-Example. Finally, in the common test, that is, $Test_{w2v \cap BERT}$, definition-based models outperform simpler models. DefBERT Head has a better similarity for nouns and DefiNNet has a better similarity for verbs.
For BERT, the embedding of the [CLS] token does not seem to be the right place to look for the semantics of the sentence, understood as a real composition of the meaning of its component words. DefBERT [CLS] performs poorly with respect to DefBERT Head and also with respect to BERT Head-Example in both syntactic categories for $Test_{BERT}$ (see Table 1). This is confirmed in the restricted set $Test_{w2v \cap BERT}$. Therefore, even if the embedding of the [CLS] token is often used as a universal sentence embedding for classification purposes (Devlin et al., 2019; Adhikari et al., 2019; Jiang and de Marneffe, 2019b), it may not contain packed meaning, whereas it may contain other kinds of information about the sentence.

Word Embedding Spaces and WordNet
WordNet and its related similarity metrics offer an interesting opportunity to extract test sets for assessing whether our methods can be used to derive embeddings of OOV words. However, it is a strongly debated question whether similarities in WordNet are correlated with similarities over word embeddings (Lastra-Díaz et al., 2019).
The aim of this section is twofold. Firstly, it aims to investigate whether this relation can be established on the word embedding spaces we are using. Secondly, it aims to validate and select plausible similarity measures over WordNet, which can then be used to investigate the behavior of embeddings for OOV words. For both experimental sessions, we used the datasets $Pairs_{IV_{w2v}}$, $Pairs_{IV_{BERT}}$ and $Pairs_{IV_{fasttext}}$ defined in Section 4.1.
For the first aim, we investigated whether similarities derived in a particular word embedding space can be used to separate positive and negative pairs in the respective sets of pairs. Then, given a word embedding space, we ranked pairs according to the computed similarities and we computed the Area under the ROC curve (AROC) built on sensitivity and specificity.
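The area under the ROC curve for a ranked list of pairs can be computed without drawing the curve, since it equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The sketch below uses this equivalence; the function name and the toy scores are hypothetical.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: cosine similarities of four word pairs; label 1 marks sister
# words (positive pairs), label 0 marks random pairs (negative pairs).
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
area = auroc(toy_scores, toy_labels)
```

A value of 0.5 corresponds to a random ranking and 1.0 to a ranking that places every positive pair above every negative pair.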
Results show that there is a correlation between "being siblings" and the three word embedding spaces, w2v, BERT and fasttext (Table 2). All the AROCs are well above the threshold of 0.5 and close to or above the value of 0.7, which indicates a good correlation.
For the second aim, we investigated WordNet similarity metrics in order to find interesting metrics to experiment with our definition-oriented methodologies. In fact, the binary task of being or not being siblings in WordNet may not capture degrees of similarity. Some sibling words (see Figure 1) are definitely similar. On the contrary, house and architecture are sibling words but are less similar than the previous pair of words. In WordNet, this difference in similarity is captured by using many different metrics. We investigated three different WordNet similarity measures: path (Rada et al., 1989), wup (Wu and Palmer, 1994) and res (Resnik, 1995). The measure path uses the length of the path connecting two synsets over the WordNet taxonomy. The measure wup is still based on the length of the path between the synsets related to the two words, and takes into account the number of edges from the synsets to their Least Common Subsumer (LCS) and the number of links from the LCS up to the root of the taxonomy. Finally, the measure res belongs to another family of measures, as it is based on Information Content. In res, the similarity between the synsets of the related words is a function of the Information Content of their LCS. In this case, a more informative LCS (a rare as well as a specific concept) indicates that the hyponym concepts are more similar.
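The three measures can be illustrated on a toy taxonomy. The sketch below is not WordNet and does not reproduce the NLTK implementation: the tree, the concept probabilities and the function names are hypothetical, and the formulas are the common textbook forms (path as the inverse of the path length plus one, wup as twice the depth of the LCS over the sum of the two depths, res as the negative log probability of the LCS).

```python
import math

# Toy taxonomy (hypothetical, not WordNet): child -> parent.
PARENT = {
    "root": None,
    "feeling": "root",
    "artifact": "root",
    "sadness": "feeling",
    "joy": "feeling",
    "house": "artifact",
}
# Toy concept probabilities; a concept's probability includes its descendants.
PROB = {"root": 1.0, "feeling": 0.5, "artifact": 0.5,
        "sadness": 0.25, "joy": 0.25, "house": 0.5}


def ancestors(node):
    """Chain from `node` up to the root, both included."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Number of nodes on the path to the root (the root has depth 1)."""
    return len(ancestors(node))


def lcs(a, b):
    """Least Common Subsumer: the deepest concept that subsumes both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("concepts do not share a root")


def path_sim(a, b):
    """Inverse of (number of edges on the shortest path + 1)."""
    common = lcs(a, b)
    edges = (depth(a) - depth(common)) + (depth(b) - depth(common))
    return 1.0 / (edges + 1)


def wup_sim(a, b):
    """Twice the depth of the LCS over the sum of the depths of a and b."""
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))


def res_sim(a, b):
    """Information Content of the LCS, that is, -log P(LCS)."""
    return -math.log(PROB[lcs(a, b)])
```

On this toy tree, the sibling concepts sadness and joy are more similar than sadness and house under all three measures, since their LCS (feeling) is deeper and more informative than the root.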
The best correlated WordNet measure is res. In fact, it is the most highly correlated measure for two spaces out of three, Word2Vec and FastText, and it is on par with wup in the BERT space (see Table 3). The average Spearman's correlation between the Word2Vec embedding space and res is 0.50(±0.31), which is well above that of path and wup. The same happens for the FastText space, where the correlation is 0.52(±0.29).
As a final consideration, for our purposes, word embedding spaces and WordNet similarity are correlated, and the measure that best captures this correlation is res.

Testing over Out-Of-Vocabulary Words
The final analysis is on real OOV words for Word2Vec and for BERT. These last experiments are carried out by considering the positive relation between WordNet similarity measures and the word embedding spaces.
Using definitions to derive word embeddings for OOV words seems to be a good solution compared to the available alternative approaches.
In its space, DefiNNet achieves very important results for the correlation with the two WordNet similarity measures wup and res (see Table 4). In both cases, it outperforms FastText, which is a standard approach for deriving word embeddings for OOV words (0.51 ± 0.31 vs. 0.34 ± 0.37 for res and 0.56 ± 0.30 vs. 0.42 ± 0.36 for wup). Moreover, DefiNNet outperforms Head, a baseline method based on WordNet, and Additive, the simplest model that uses WordNet definitions.
The same happens for DefBERT Head in its space (see Table 4). DefBERT Head significantly outperforms BERT wordpieces, showing that DefBERT Head is a better model for treating OOV words than the one already included in BERT. Results on DefBERT Head confirm that the output related to the token representing the head carries better information than the output related to the [CLS] token. Moreover, the definition has a positive effect on shaping the word embedding of the head word towards the defined word. In fact, DefBERT Head and BERT Head-Example are applied to the same head word, and DefBERT Head transforms the meaning better than BERT Head-Example, which is applied to a random sentence containing the head word. Hence, for BERT too, definitions are important in determining the embeddings of OOV words.
The final comparison is between DefiNNet and DefBERT Head, and it is done on the small dataset $Pairs_{w2v \cap BERT}$. DefiNNet outperforms DefBERT Head for all three WordNet measures (see Table 4). These results suggest that simpler is better when using definitions for OOV words.

Conclusions and Future Work
Building word embeddings for rare out-of-vocabulary words is essential in natural language processing systems based on neural networks. In this paper, we proposed to use definitions in dictionaries to solve this problem. Our results show that this can be a viable way to obtain word embeddings for rare OOV words, and that it works better than existing methods and baseline systems.
Moreover, the use of dictionary definitions in word embedding may also open another possible line of research: a different semantic probe for universal sentence embedders (USEs). Indeed, definitions offer an interesting equivalence between sentences and words. Hence, unlike existing semantic probes, this approach can unveil whether USEs are really changing the meaning of sentences compositionally or are just aggregating pieces of sentences in a single representation.