Connect-the-Dots: Bridging Semantics between Words and Definitions via Aligning Word Sense Inventories

Word Sense Disambiguation (WSD) aims to automatically identify the exact meaning of one word according to its context. Existing supervised models struggle to make correct predictions on rare word senses due to limited training data and can only select the best definition sentence from one predefined word sense inventory (e.g., WordNet). To address the data sparsity problem and generalize the model to be independent of one predefined inventory, we propose a gloss alignment algorithm that can align definition sentences (glosses) with the same meaning from different sense inventories to collect rich lexical knowledge. We then train a model to identify semantic equivalence between a target word in context and one of its glosses using these aligned inventories, which exhibits strong transfer capability to many WSD tasks. Experiments on benchmark datasets show that the proposed method improves predictions on both frequent and rare word senses, outperforming prior work by 1.2% on the All-Words WSD Task and 4.3% on the Low-Shot WSD Task. Evaluation on WiC Task also indicates that our method can better capture word meanings in context.


Introduction
Human language is inherently ambiguous since words can have various meanings in different contexts. Word Sense Disambiguation (WSD) aims to automatically identify the correct sense (meaning) of the target word within a context sentence, which is essential to many downstream tasks such as machine translation and information extraction. Recently, many approaches have achieved state-of-the-art performance on WSD by fine-tuning language models pretrained with massive text data on task-specific datasets (Blevins and Zettlemoyer, 2020; Yap et al., 2020).
However, fine-tuning a WSD model using task-specific resources could limit its applicability and may cause two major problems. First, the performance of models decreases significantly when predicting rare and zero-shot word senses (Kumar et al., 2019; Choubey and Huang, 2020; Blevins et al., 2021) because the training data contain too few supporting examples. Second, the trained models are often inventory-dependent: they can only select the best definition from the one predefined word sense inventory (mainly WordNet) on which the human annotations are based.
In this paper, we overcome these limitations by leveraging abundant lexical knowledge from various word sense inventories. Dictionaries compiled by experts contain rich sense knowledge of words. Moreover, a dictionary usually provides several example sentences for each word sense to illustrate its usage, which can be viewed as context sentences of that word sense. Since a word's sense (meaning) can be determined by its context, the word itself in a given context and the definition sentence corresponding to the correct sense are merely two surrogates of the same meaning (semantically equivalent). Furthermore, we observe that different dictionaries normally summarize the meanings of a word into a similar number of word senses, where definition sentences (glosses) from different dictionaries are different expressions of the same set of meanings. For example, Figure 1 lists glosses retrieved from three dictionaries for the verb search. Glosses with the same color have the same meaning and can be aligned across different dictionaries.

Webster
• to carefully look for someone or something in (something)
• to carefully look through the clothing of (someone) for something that may be hidden
• to use a computer to find information in (a database, network, Web site, etc.)
• to look carefully at (something) in order to get information about it
Longman
• to try to find someone or something by looking very carefully
• to use a computer to find information
• if someone in authority searches you or the things you are carrying, they look for things you might be hiding
• to examine something carefully in order to find something out, decide something etc.
Collins
• If you search for something or someone, you look carefully for them.
• If a police officer or someone else in authority searches you, they look carefully to see whether you have something hidden on you.
• If you search for information on a computer, you give the computer an instruction to find that information.
Figure 1: Definition sentences of word search retrieved from three dictionaries: Longman Dictionary of Contemporary English, Merriam-Webster's Advanced Learner's Dictionary, and Collins COBUILD Advanced Dictionary.

Based on this observation, we propose a gloss alignment algorithm to leverage abundant lexical knowledge from various word sense inventories. We convert the problem of aligning two groups of glosses according to meanings to an optimization problem, Maximum Weighted Graph Matching, to find the best matching that maximizes the overall textual similarity. In this way, we can gather general semantic equivalence knowledge from various dictionaries as a whole for all word senses, especially for rare senses that are less frequently seen in human-annotated data.
To make use of the derived semantic equivalence knowledge, we adopt a transfer learning approach that first pretrains a general semantic equivalence recognizer by contrasting the word representations in example sentences with the sentence representations of positive glosses or negative glosses. The general model can be directly applied to downstream WSD tasks or further fine-tuned on the task-specific dataset to get an expert model. We test our two-stage transfer learning scheme on two WSD benchmark tasks, i.e., the standard task (Raganato et al., 2017b) that focuses on all-words WSD and the FEWS task (Blevins et al., 2021) that emphasizes low-shot (including few-shot and zero-shot) WSD. Experimental results show that the general model (without fine-tuning) surpasses the supervised baseline by 13.1% on zero-shot word senses. After further fine-tuning with the built-in training data, the expert model outperforms the previous state-of-the-art model by 1.2% on all-words WSD tasks and 4.3% on low-shot WSD tasks. Adding semantic equivalence knowledge to the Word-in-Context (WiC) task (Pilehvar and Camacho-Collados, 2019) also improves the accuracy of RoBERTa Large (Liu et al., 2019) by 6%, which even outperforms the 9X larger T5 model (Raffel et al., 2020).
Overall, the major contributions of our work are two-fold. 1) We propose a gloss alignment algorithm that can integrate lexical knowledge from different word sense inventories to train a general semantic equivalence recognizer. 2) Without using task-specific training data, the general model not only performs well across word senses overall but also demonstrates strong applicability to low-shot senses. After further fine-tuning, the general model becomes an expert model that achieves new state-of-the-art performance.

Related Work
Supervised WSD Approaches. Most existing WSD models are learned in a supervised manner and depend on human-annotated data. For example, Raganato et al. (2017a) regarded WSD as a sequence labeling task and trained a BiLSTM model with self-attention using multiple auxiliary losses. Luo et al. (2018a) introduced a hierarchical co-attention mechanism to generate gloss and context representations that can attend to each other. More recently, several BERT-based models have achieved new state-of-the-art performance on WSD by fine-tuning a pretrained language model. GlossBERT (Huang et al., 2019) appends each gloss to a given context sentence to create pseudo sentences and predicts them as either positive or negative depending on whether the sense corresponds to the correct sense or not. Bi-Encoder Model (BEM) (Blevins and Zettlemoyer, 2020) represents the target words and senses in the same embedding space using a context encoder and a gloss encoder but optimizes on each word individually. Yap et al. (2020) formulated WSD as a relevance ranking task and fine-tuned BERT to select the most probable sense definition from candidate senses. The neural architecture of our semantic equivalence recognizer combines the benefits of GlossBERT and BEM.
Knowledge-Based WSD Approaches. Closely related to our work, many knowledge-based approaches rely on Lexical Knowledge Bases (LKB), such as Wikipedia and WordNet, to enhance representations of word senses. BabelNet (Navigli and Ponzetto, 2010) [...] of Machine Translation. Lesk (Basile et al., 2014) relies on a word-level similarity function to measure the semantic overlap between the context of a word and each sense definition. SENSEMBERT (Scarlini et al., 2020a) produces high-quality latent semantic representations of word meanings by incorporating knowledge contained in BabelNet into language models. Other approaches try to learn better gloss embeddings by considering the WordNet graph structure (e.g., hypernyms, hyponyms, synonyms, etc.) (Luo et al., 2018b; Loureiro and Jorge, 2019; Kumar et al., 2019; Bevilacqua and Navigli, 2020). For example, Kumar et al. (2019) proposed EWISE to improve the model's performance on rare or unseen senses by learning knowledge graph embeddings from WordNet. Building upon EWISE, Bevilacqua and Navigli (2020) developed a hybrid approach that incorporates more lexical knowledge (e.g., hypernymy, meronymy, similarity in WordNet) into the model through synset graph embeddings.

Figure 2: Overview of our approach. The left part illustrates the gloss alignment algorithm where each blue circle is a gloss containing one definition sentence and several example sentences. The right part is our model architecture to predict the semantic equivalence of a word in context and a gloss by comparing their representations obtained from a shared transformer encoder. Task-specific WSD datasets can be further used to fine-tune our model.

Overview of Our Approach
Figure 2 shows the overview of our approach. We first collect all word glosses and corresponding example sentences from six word sense inventories. We next apply the gloss alignment algorithm to find the best matching between two groups of glosses retrieved from two different inventories for every common keyword. By contrasting example sentences with the correct glosses and incorrect glosses within each inventory or across different inventories, we automatically gather rich supervision for pretraining a universal binary classifier that can determine whether the keyword in the context sentence (example sentence) and a gloss are semantically equivalent or not. The pretrained general model can be directly used in downstream WSD tasks or further fine-tuned to get an expert model.

Data Collection
We collected word sense inventory data by querying WordNet 3.0 (Miller, 1995) and the electronic editions of five professional dictionaries for advanced English learners: Oxford Advanced Learner's Dictionary (Turnbull, 2010), Merriam-Webster's Advanced Learner's Dictionary (Perrault, 2008), Collins COBUILD Advanced Dictionary (Sinclair, 2008), Cambridge Advanced Learner's Dictionary (Walter, 2008), and Longman Dictionary of Contemporary English (Summers, 2003). Advanced learners' dictionaries usually provide abundant example sentences to illustrate the usage of different word senses in context, which makes it possible to generate strong supervision for training a classifier. Table 1 shows statistics of the six word sense inventories used. In total, we collected 557.8K glosses and 469.4K example sentences.

Gloss Alignment as a Maximum-weight Matching Problem
Each word sense inventory is a lexical knowledge bank that provides example sentences for illustrating word senses, including senses less frequently seen in the real world. Moreover, we observe that different inventories usually provide parallel explanations of meanings for a given word (Figure 1). Thus, if we can align explanations (glosses) from different inventories according to meanings, we can significantly expand the lexical knowledge acquired, especially for rare word senses. Essentially, finding the best alignment between two groups of glosses can be converted to the Maximum-weight Bipartite Matching Problem (Cormen et al., 2009; Duan and Pettie, 2014), which aims to find a matching in a weighted bipartite graph that maximizes the sum of the weights of the edges.

Problem Formulation
Given a keyword, suppose we retrieved two word sense sets $S_1$ and $S_2$ from two inventories, where each set consists of a list of definition sentences (glosses). Given a reward function $r: S_1 \times S_2 \to \mathbb{R}$, we want to find a matching $f: S_1 \to S_2$ such that the total reward $\sum_{a \in S_1, f(a) \in S_2} r(a, f(a))$ is maximized. By finding the matching $f$, we will know the best alignment between the two word sense sets $S_1$ and $S_2$. In this paper, we use the sentence-level textual similarity as the reward function to find the best word sense alignment. To measure the textual similarity between two definition sentences, we apply a pretrained model SBERT (Reimers and Gurevych, 2019) that has achieved state-of-the-art performance on many Semantic Textual Similarity (STS) tasks and Paraphrase Detection tasks. Specifically, we apply SBERT to $S_1$ and $S_2$ to get sentence embeddings and then calculate cosine similarity as the reward function.
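As a minimal sketch of the reward function described above, the following computes pairwise cosine similarity between two sets of sentence embeddings. The toy vectors stand in for SBERT embeddings, and the function names (`cosine_similarity`, `reward_matrix`) are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors (an assumption of this sketch)
    return dot / (norm_a * norm_b)


def reward_matrix(emb_s1, emb_s2):
    """rewards[i][j] = r(a_i, b_j) for gloss a_i in S1 and gloss b_j in S2."""
    return [[cosine_similarity(a, b) for b in emb_s2] for a in emb_s1]


# Toy 3-dimensional vectors standing in for sentence embeddings of glosses.
emb_s1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
emb_s2 = [[0.0, 2.0, 0.0], [3.0, 0.0, 0.0]]
rewards = reward_matrix(emb_s1, emb_s2)
```

In this toy case the first gloss of $S_1$ is most similar to the second gloss of $S_2$, and vice versa, which is the kind of signal the matching step consumes.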

Solving Bipartite Graph Matching by Linear Programming
The Maximum-weight Graph Matching problem can be solved by Linear Programming (Matousek and Gärtner, 2007; Cormen et al., 2009). For simplicity, let the weight $w_{ij}$ denote the textual similarity score between the $i$-th definition sentence in $S_1$ and the $j$-th definition sentence in $S_2$. We next introduce a variable $x_{ij}$ for each edge $(i, j)$: $x_{ij} = 1$ if the edge between $i$ and $j$ is contained in the matching and $x_{ij} = 0$ otherwise. The total weight of the matching is $\sum_{(i,j) \in S_1 \times S_2} w_{ij} x_{ij}$. To reflect that every vertex is in exactly one edge of the matching, we add the constraints $\sum_{j \in S_2} x_{ij} = 1$ for $i \in S_1$ and $\sum_{i \in S_1} x_{ij} = 1$ for $j \in S_2$, which guarantee that the variable $x$ represents a perfect matching. Our goal is to find a maximum-weight perfect matching such that the above constraints are satisfied. To sum up, aligning glosses between two word sense inventories is equivalent to solving the following integer linear programming problem:

\[
\begin{aligned}
\max_{x} \quad & \sum_{(i,j) \in S_1 \times S_2} w_{ij} x_{ij} \\
\text{s.t.} \quad & \sum_{j \in S_2} x_{ij} = 1, \quad i \in S_1, \\
& \sum_{i \in S_1} x_{ij} = 1, \quad j \in S_2, \\
& x_{ij} \in \{0, 1\}, \quad (i, j) \in S_1 \times S_2.
\end{aligned}
\]

In our implementation, we consider all possible inventory combinations (select two from six) and apply the gloss alignment solver to all common words shared by two inventories. For each word, the gloss alignment solver is only applied to glosses under the same POS category. Overall, we obtain 704K gloss alignment links.
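For the small gloss sets of a single word, the maximum-weight matching can also be found by exhaustive search, which makes the objective easy to inspect. The sketch below is not the solver used in this paper; it swaps the linear-programming formulation for brute-force enumeration, and it assumes that $S_1$ has no more glosses than $S_2$ (each gloss in $S_1$ is matched to a distinct gloss in $S_2$). The name `align_glosses` and the weights are hypothetical.

```python
from itertools import permutations


def align_glosses(weights):
    """Exhaustive maximum-weight matching for a small similarity matrix.

    weights[i][j] is the similarity between gloss i of S1 and gloss j of S2.
    Assumes len(S1) <= len(S2); transpose the matrix otherwise.
    Returns (list of (i, j) pairs, total weight).
    """
    n_rows = len(weights)
    n_cols = len(weights[0]) if weights else 0
    if n_rows > n_cols:
        raise ValueError("transpose the matrix so that rows <= columns")
    best_total, best_cols = float("-inf"), ()
    # Each permutation assigns a distinct column j to every row i.
    for cols in permutations(range(n_cols), n_rows):
        total = sum(weights[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols
    return list(enumerate(best_cols)), best_total


# Made-up similarity scores for three glosses in each inventory.
weights = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.2, 0.7, 0.3],
]
matching, total = align_glosses(weights)
```

Enumeration grows factorially with the number of glosses, so a polynomial-time method (linear programming or the Hungarian algorithm) is the practical choice for words with many senses.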

Positive and Negative Training Instances
For a given word, the gloss alignment algorithm provides us the linking from the word sense set $S_1$ in one inventory to $S_2$ in another inventory. Two glosses (e.g., $g \in S_1$ and $g' \in S_2$) have the same meaning if they are aligned by the algorithm and different meanings if they are not. We can therefore pair the definition sentence of $g$ ($g'$) with each example sentence of $g'$ ($g$) to generate gloss-context pairs for training the semantic equivalence recognizer. Pairs are labeled as positive if $g$ and $g'$ are aligned or negative otherwise. In experiments, we only consider aligned gloss pairs with textual similarities above a threshold (see Section 6.1) to further improve the quality of supervision. In total, we generate 421K positive and 538K negative gloss-context pairs across different inventories.
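The cross-inventory pairing described above can be sketched as follows. The data layout (dictionaries with `definition` and `examples` keys), the function name `cross_inventory_pairs`, and the example sentences are hypothetical. How aligned pairs at or below the threshold are treated is also an assumption of this sketch: they are skipped rather than labeled negative.

```python
def cross_inventory_pairs(s1, s2, matching, similarity, threshold=0.6):
    """Build (gloss definition, context sentence, label) training triples.

    s1, s2: lists of glosses, each {"definition": str, "examples": [str]}.
    matching: list of aligned (i, j) index pairs between s1 and s2.
    similarity[i][j]: textual similarity between s1[i] and s2[j].
    """
    matched = set(matching)
    pairs = []
    for i, g1 in enumerate(s1):
        for j, g2 in enumerate(s2):
            if (i, j) in matched:
                if similarity[i][j] <= threshold:
                    continue  # low-confidence alignment: skipped (assumption)
                label = 1  # aligned glosses: same meaning
            else:
                label = 0  # unaligned glosses: different meanings
            # Definition of g1 paired with the examples of g2, and vice versa.
            for example in g2["examples"]:
                pairs.append((g1["definition"], example, label))
            for example in g1["examples"]:
                pairs.append((g2["definition"], example, label))
    return pairs


# Definitions follow Figure 1; the example sentences are made up.
s1 = [
    {"definition": "to use a computer to find information",
     "examples": ["She searched the web for cheap flights."]},
    {"definition": "to try to find someone or something by looking very carefully",
     "examples": ["Rescuers searched for survivors."]},
]
s2 = [
    {"definition": "to use a computer to find information in (a database, network, Web site, etc.)",
     "examples": ["He searched the database for the record."]},
    {"definition": "to carefully look for someone or something in (something)",
     "examples": ["Police searched the building."]},
]
similarity = [[0.9, 0.1], [0.2, 0.5]]
pairs = cross_inventory_pairs(s1, s2, [(0, 0), (1, 1)], similarity)
```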
Pairs are also generated by contrasting glosses within each inventory individually. In detail, for every word in an inventory, we pair the gloss sentence with its example sentences to get positive gloss-context pairs or pair the gloss sentence with example sentences from another gloss within the inventory to get negative gloss-context pairs. We generate 1.3M positive and 418K negative gloss-context pairs in this way. Similarly, we also generate context-context pairs by contrasting example sentences in two glosses to reflect the task setting of WiC (Section 6.3).

A Unified Neural Model for Recognizing Semantic Equivalence

Model Architecture
This section introduces our model architecture (the right part of Figure 2) for recognizing semantic equivalence. Inspired by Blevins and Zettlemoyer (2020), our model first uses an encoder to get the semantic representation of the target word (within its context sentence) or the gloss sentence. Next, by comparing two representations, our model predicts whether they are semantically equivalent or not.
Semantic Encoder. We apply a pretrained BERT model to get the contextual word representation of the target word (with its context) or the sentence representation of the gloss sentence. Specifically, given an input sentence $S$ padded by the start symbol [CLS] and the end symbol [SEP], we first obtain $N$ contextualized embeddings $\{o_i\}_{i=1}^{N}$ for all tokens $\{t_i\}_{i=1}^{N}$ using BERT. We next select the contextualized embedding at the target word position when $S$ is a context sentence, or select the first output embedding $o_0$ (corresponding to the special token [CLS]) as the sentence representation when $S$ is a gloss sentence.
Learning Objective. After deriving embeddings using BERT, both representations $u$ and $v$, together with the element-wise difference $|u - v|$ and the element-wise multiplication $u \cdot v$, are concatenated and multiplied with the trained weight $W_t \in \mathbb{R}^{4n \times 2}$, followed by a softmax prediction layer for binary classification (semantically equivalent or not), i.e., the class probabilities are $\mathrm{softmax}([u; v; |u - v|; u \cdot v]\, W_t)$, where $n$ is the dimension of the embeddings. Our experiments consider two model sizes: SemEq-Base, which is initialized with the pretrained BERT Base (Devlin et al., 2019), and a larger model initialized with a pretrained ([...], 2019) model with 24 transformer block layers, 1024 hidden size, and 16 self-attention heads. We train our model using binary cross-entropy loss and the AdamW (Loshchilov and Hutter, 2018) optimizer with initial learning rate {1e-5, 5e-6, 2e-6}, 0.2 dropout, batch size 64, and 10 training epochs.
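To make the prediction layer concrete, the following sketch builds the concatenated feature vector $[u; v; |u - v|; u \cdot v]$, multiplies it with a $4n \times 2$ weight matrix, and applies a softmax. It uses plain Python lists with toy values in place of BERT embeddings and a trained $W_t$; the name `equivalence_probs` and the choice of which class index means "equivalent" are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def equivalence_probs(u, v, w_t):
    """Class probabilities from softmax([u; v; |u - v|; u * v] W_t).

    u, v: embeddings of length n; w_t: 4n rows, each with 2 columns.
    """
    features = (
        list(u)
        + list(v)
        + [abs(a - b) for a, b in zip(u, v)]  # element-wise difference
        + [a * b for a, b in zip(u, v)]       # element-wise multiplication
    )
    if len(w_t) != len(features):
        raise ValueError("w_t must have 4n rows")
    logits = [sum(f * row[k] for f, row in zip(features, w_t)) for k in range(2)]
    return softmax(logits)


# Toy weights (n = 2) that look only at |u - v|; trained weights would differ.
w_t = [[0.0, 0.0]] * 4 + [[1.0, -1.0]] * 2 + [[0.0, 0.0]] * 2
same = equivalence_probs([0.5, -1.0], [0.5, -1.0], w_t)
different = equivalence_probs([0.5, -1.0], [1.5, 0.0], w_t)
```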

Accuracy of the Gloss Alignment Algorithm
To evaluate the accuracy of the gloss alignment algorithm, we randomly sampled 1,000 gloss pairs from the 704K alignments and asked two human annotators to judge whether the two gloss sentences refer to the same meaning or not. The two annotators labeled 200 gloss pairs in common and agreed on 94% (188) of them, achieving a kappa inter-annotator agreement score of 0.74. For these shared pairs, a gloss pair is regarded as correct when both annotators label it as correct; the remaining 800 gloss pairs were evenly allocated to the two annotators to label. Table 2 shows the accuracy of the gloss alignment algorithm on each POS type based on human annotations. The accuracy on Noun, Verb, Adjective and Adverb words is 0.90, 0.81, 0.88 and 0.85, respectively, with an overall accuracy of 0.87. In experiments, we apply a threshold of 0.6 to the alignment results and only consider aligned gloss pairs with textual similarities above it, which further improves gloss alignment accuracy to 0.98 based on human annotations. In this way, we can significantly improve the quality of the training data generated from the automatically aligned dictionaries.

Experiments on WSD
Table 3: F1-score (%) on All-Words WSD benchmark datasets. We distinguish models based on 1) using the Training Set (TS) SemCor or not, 2) using single ( [...]

We evaluate our model on two WSD datasets, i.e., the WSD tasks standardized by Raganato et al. (2017b), which focus on all-words WSD evaluation, and the FEWS dataset proposed by Blevins et al. (2021), which emphasizes low-shot WSD evaluation. Since both datasets are annotated using word senses in WordNet 3.0 (Miller, 1995), we pair the context sentence with the annotated gloss in WordNet 3.0 to generate positive gloss-context instances or with other glosses of the word to get negative gloss-context instances for training. In validation or test, we apply the trained classifier to examine all possible glosses of the target word in WordNet 3.0 and select the gloss with the highest probability score as the prediction. To incorporate rich lexical knowledge harvested from word sense inventories into model training, we consider two strategies:
Data Augmentation. We directly augment the built-in training set from each WSD dataset with gloss-context pairs generated from our aligned word sense inventories and then train the semantic equivalence recognizer (SemEq) to do WSD.
Transfer Learning. We first train our semantic equivalence recognizer using only gloss-context pairs generated from our aligned word sense inventories. The trained classifier is a general model (SemEq-General) capable of deciding whether a gloss sentence and the target word in a context sentence are semantically equivalent, independently of any specific word sense inventory. Next, to evaluate on a specific WSD dataset, we further fine-tune the general model on the built-in training set to get an expert model (SemEq-Expert). The expert model can adapt to the new domain to achieve better performance.
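The prediction step used at validation and test time (scoring every candidate gloss of the target word and selecting the one with the highest probability) can be sketched as below. The scoring function is a stub backed by made-up probabilities, standing in for the trained classifier; `disambiguate` and `toy_score` are hypothetical names.

```python
def disambiguate(context, target, candidate_glosses, score_fn):
    """Return the candidate gloss with the highest equivalence score.

    score_fn(context, target, gloss) -> probability that the target word in
    the context sentence and the gloss are semantically equivalent.
    """
    if not candidate_glosses:
        raise ValueError("the target word has no candidate glosses")
    scores = [score_fn(context, target, gloss) for gloss in candidate_glosses]
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidate_glosses[best], scores[best]


# Made-up scores standing in for the classifier's output probabilities.
toy_scores = {
    "to use a computer to find information": 0.91,
    "to try to find someone or something by looking very carefully": 0.34,
}


def toy_score(context, target, gloss):
    return toy_scores[gloss]


prediction, score = disambiguate(
    "He searched the database for the record.",
    "searched",
    list(toy_scores),
    toy_score,
)
```

Because the classifier only compares a word in context with a gloss, the same routine applies to glosses from any inventory, which is what makes the general model inventory-independent.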

All-Words WSD Tasks
We evaluate our model on the all-words WSD framework established by Raganato et al. (2017b). The testing dataset contains 5 benchmark datasets from previous Senseval and SemEval competitions, including Senseval-2 (SE2) (Edmonds and Cotton, 2001), Senseval-3 (SE3) (Mihalcea et al., 2004), SemEval-07 (SE07) (Pradhan et al., 2007), SemEval-13 (SE13) (Navigli et al., 2013), and SemEval-15 (SE15) (Moro and Navigli, 2015). Following Raganato et al. (2017b) and other previous work, we use SemCor (Miller et al., 1993), which contains 226,036 annotated instances, as the built-in training set and choose SemEval-07 as the development set for hyper-parameter tuning. Since all datasets are mapped to word senses in WordNet 3.0 (Miller, 1995), we retrieve all definition sentences of the target word from WordNet 3.0 to form gloss-context pairs for both training and testing. Table 3 shows experimental results on all-words WSD datasets (Raganato et al., 2017b). We also report models' performance on each POS category. The first section includes results of the most frequent sense baseline and previous WSD models.
The second section presents results of our model when it adopts the data augmentation strategy to incorporate multi-source inventory knowledge. SemEq-Base (line 11) is our model's performance when fine-tuning the BERT Base sentence encoder only on the built-in SemCor training set. Compared to line 11, when augmenting SemCor with our multi-source inventory knowledge, the same model (line 12) improves the F1 on the aggregated ALL set by 1.2%. The third section of Table 3 reports the results of applying the transfer learning strategy to exploit our multi-source inventory knowledge. By only training on our multi-source inventory knowledge (without using SemCor), our model SemEq-Base-General (line 13) already achieves performance comparable to LMMS BERT (line 6, which is based on BERT Large). After further fine-tuning on the SemCor training set, SemEq-Base-Expert (line 14) improves the performance on ALL to 79.9%, which is slightly better than using the data augmentation strategy. Moreover, increasing the number of BERT model parameters (line 16) further boosts the WSD performance on ALL to 80.7%.
Overall, our SemEq-Large-Expert model (line 16) consistently outperforms AdaptBERT (Yap et al., 2020) (line 9), the previous best model without using WordNet synset graph information, on SE07, SE2, SE3 and SE13, attaining 1.2% higher F1 on ALL. The SemEq-Large-Expert model also better disambiguates all types of words including nouns, verbs, adjectives, and adverbs than AdaptBERT. It clearly demonstrates the benefits of leveraging multiple word sense inventories via automatic alignment and transfer learning. Our final model is 0.6% higher even compared with EWISER (Bevilacqua and Navigli, 2020) that uses the extra WordNet graph knowledge. We can see that by pretraining on lexical knowledge derived from aligned inventories, our model generalizes more easily and better captures semantic equivalence between the target word and a gloss sentence for identifying the correct word meaning.
In order to understand our model's behavior of transferring semantic equivalence knowledge from our word sense inventories to a specific WSD task, we partition word senses in the test set into groups according to their numbers of training instances found in the training set SemCor. As shown in Figure 3, by pretraining on our semantic equivalence knowledge and then fine-tuning on SemCor, SemEq-Base-Expert beats SemEq-Base (SemCor) that is only trained on SemCor across all annotation-rich and annotation-lacking word senses. Interestingly, without fine-tuning on SemCor, the general model (SemEq-Base-General) works surprisingly well on low-shot senses, which is 13.1%, 8.1% and 5.6% higher than SemEq-Base (SemCor) on 0 shot, 1-2 shot, 3-5 shot senses, respectively. After fine-tuning on SemCor, the expert models fit to the distribution of senses in the real world and achieve better overall performance.

Few-Shot and Zero-Shot WSD Tasks
By pretraining on massive semantic equivalence knowledge generated from aligned word sense inventories, we expect our model to perform better on annotation-lacking senses. We next evaluate our model on the FEWS dataset (Blevins et al., 2021), a new WSD dataset that focuses on low-shot WSD evaluation. FEWS is a comprehensive evaluation dataset constructed from Wiktionary and covers 35K polysemous words and 71K senses. Overall, the built-in training set of FEWS consists of 87K sentence instances. The test (development) set consists of two evaluation subsets, i.e., a few-shot evaluation set and a zero-shot evaluation set; each subset contains 5K instances. Word senses that are used in zero-shot evaluation sets are verified to not occur in the training set, and word senses in few-shot evaluation sets only occur 2 to 4 times in the training set.
Table 4 presents the results on FEWS. BEM SemCor (line 4) is a similar transfer learning model but fine-tuned on SemCor before training on FEWS, while BEM (line 3) only trains on FEWS. The second section of Table 4 shows that augmenting the FEWS training set with our multi-source inventory knowledge (line 6) greatly improves zero-shot learning performance, by 1.6% on the dev set and 2.4% on the test set (compared with line 5). Surprisingly, when we adopt the transfer learning strategy, the final SemEq-Large-Expert model's (line 10) performance on the test sets increases to 82.3% on few-shot senses and 72.2% on zero-shot senses, which significantly outperforms all baseline models.

Table 5 (partial): Results on the WiC task (model, accuracy in %, number of parameters).
[...] (Raffel et al., 2020)          69.3    770M
T5-3B (Raffel et al., 2020)          72.1    3000M
BERTARES (Scarlini et al., 2020b)    72.2    342M
SemEq-Large (+WSI)                   75.9    355M

Experiments on Context-Sensitive Word Meanings
The Word-in-Context (WiC) Task (Pilehvar and Camacho-Collados, 2019) from the SuperGLUE benchmark (Wang et al., 2019) provides a high-quality dataset for the evaluation of context-sensitive word meanings. WiC removes predefined word senses and reduces meaning identification to a binary classification problem in which, given two sentences containing the same lemma word, a model is asked to predict whether the two target words have the same meaning. Considering that WiC uses WordNet as one lexical resource in its data construction, we completely remove WordNet from our inventory knowledge to avoid data leakage. Specifically, we simply add context-context pairs generated from the other five inventories to the training set of WiC to train a semantic equivalence recognizer. Table 5 shows results on the WiC task in comparison with other models. The results indicate that incorporating semantic equivalence knowledge from aligned inventories improves RoBERTa Large's performance by 6%, which also surpasses the large language model T5-3B (9X parameters) by 3.8%. This demonstrates the advantage of incorporating our high-quality multi-source lexical knowledge over blindly increasing the size of plain pretraining texts in language models.

Conclusion
Based on the observation that glosses of a word from different inventories usually are different expressions of a few meanings, we have proposed a gloss alignment algorithm that can unify different lexical resources as a whole to generate abundant semantic equivalence knowledge. The general model pretrained on derived equivalence knowledge can serve as a universal recognizer for word meanings in context or adapt to a specific WSD task by fine-tuning to achieve new state-of-the-art performance. Our results also point to an interesting future research direction: how to develop a robust fine-tuning approach that is able to retain the excellent performance of the general model on low-resource senses while still improving performance on high-resource senses.

Ethical Considerations
Copyrights of the data used in this paper belong to their respective owners. The authors are permitted to use the data for non-commercial research purposes, following the principle of fair use. The authors will not reproduce, republish, distribute, transmit, or link the data used on any other website without the express permission of the respective owners. The authors bear the responsibility to comply with the rules of the copyright holders.