Improved Word Sense Disambiguation with Enhanced Sense Representations

Current state-of-the-art supervised word sense disambiguation (WSD) systems (such as GlossBERT and the bi-encoder model) yield surprisingly good results by purely leveraging pre-trained language models and short dictionary definitions (or glosses) of the different word senses. While concise and intuitive, the sense gloss is just one of many ways to provide information about word senses. In this paper, we focus on enhancing the sense representations by incorporating synonyms, example phrases or sentences showing usage of word senses, and sense glosses of hypernyms. We show that incorporating such additional information boosts the performance on WSD. With the proposed enhancements, our system achieves an F1 score of 82.0% on the standard benchmark test dataset of the English all-words WSD task, surpassing previously published scores on this benchmark dataset.


Introduction
Word sense disambiguation (WSD) refers to the task of automatically identifying the meaning of ambiguous words using computational methods. Given a word in context and a fixed inventory of senses, the system determines the correct word sense. For example, the noun "bank" means different things in "financial bank" and "bank of a river". Ambiguity is one of the central problems faced by natural language processing (NLP) tasks and WSD aims to resolve semantic ambiguity. It is commonly used to help downstream NLP tasks, such as machine translation (Chan et al., 2007;Neale et al., 2016) and information retrieval (Zhong and Ng, 2012).
Supervised WSD approaches typically frame the task as a multi-class classification problem with a fixed sense inventory for each word type. Traditionally, many well-performing methods use manually engineered features to train an independent classifier, or word expert, for every word type (Zhong and Ng, 2010;Melamud et al., 2016). Target senses are thus treated as discrete labels. Neural-based supervised methods were also explored, with a unified classifier that shares parameters across all polysemous words (Kågebäck and Salomonsson, 2016). However, they were not able to outperform the word expert supervised systems. More recently, the advent of large language models such as BERT (Devlin et al., 2019) has boosted the performance of these neural-based methods. Pre-trained on massive amounts of texts, the language models have a good sense of language context, inherently encoding word sense information. Using these models to generate contextualized word representations, a rapid slew of recent publications has continually redefined the state of the art.
In combination with language models, lexical resources have also been shown to be able to significantly improve WSD scores. Specifically, sense definitions (or glosses) have been used in recent work (Luo et al., 2018;Huang et al., 2019;Blevins and Zettlemoyer, 2020;Barba et al., 2021). In both GlossBERT (Huang et al., 2019) and the bi-encoder model (BEM) (Blevins and Zettlemoyer, 2020), good performance was achieved purely by utilizing the context sentence containing the ambiguous word and sense gloss information. In other words, the queried word sense is solely represented by a sense gloss that is typically less than twenty words. Given the brevity of information in a sense gloss, it is somewhat surprising that these architectures are able to achieve state-of-the-art performance.
In this paper, we show that enhancing the sense representations allows the pre-trained language models to better differentiate between the word senses by improving word sense clustering for each word type. We present a binary sentence pair classification model that is built upon RoBERTa (Liu et al., 2019), with a focus on enriching the sense representations. We approach the task as a sentence pair classification problem, performing binary classification on context-sense sentence pairs and training the model in an end-to-end fashion.
To enhance the word sense representation, we introduce a bag of "related" words that is associated with that particular word sense. These "related" words are intuitively chosen to provide more information about the word sense. Concretely, they are derived from synonyms, example phrases or sentences showing usage of word senses, and sense glosses of hypernyms. Incorporating these additional sources to enhance the sense representation improves the performance on the standard all-words English WSD evaluation benchmark. We achieve an F1 score of 82.0% on this benchmark test dataset, surpassing previously published scores on this test dataset.
In summary, the overall contributions of this paper include:
• We present an approach towards sentence-pair classification for WSD with improved performance over current implementations.
• We show that enhancing sense representations (ESR) is indeed able to boost performance on the all-words English WSD task.
• We examine and visualize the impact of additional lexical information on the sense representations with an ablation study, to investigate why our model performs better.
Our source code and trained models are available at https://github.com/nusnlp/esr.

Related Work
In this paper, we address the English all-words WSD task, where a system disambiguates every ambiguous word in the dataset (Palmer et al., 2001). In general, supervised methods have been shown to perform better on the task, utilizing expensive human-annotated data to achieve superior results. Combined with recent pre-trained language models, supervised neural architectures have gained popularity in recent years. For example, Hadiwinoto et al. (2019) investigate different ways of using pre-trained BERT to perform WSD, with the GLU model outperforming previous work.
While supervised methods traditionally do not leverage lexical resources such as WordNet (Miller, 1995), lexical information has proven to be useful in other methods. For example, the well-known Lesk algorithm (Lesk, 1986) shows that the sense gloss is useful, with the algorithm picking the sense whose dictionary gloss shares the most words with the neighborhood of the ambiguous word. With pre-trained language models as feature extractors, sense gloss information can be incorporated into supervised WSD systems, yielding a significant performance boost. Two such examples are GlossBERT and BEM.
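The gloss-overlap idea behind the Lesk algorithm can be illustrated with a minimal sketch. This is a simplified variant written for illustration only: the function name `simplified_lesk`, the whitespace tokenization, and the two toy glosses are assumptions, not taken from Lesk (1986) or from WordNet.

```python
def simplified_lesk(context_words, glosses):
    """Return the sense whose gloss shares the most words with the context.

    glosses maps a (hypothetical) sense id to its gloss text.
    """
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:  # strict ">" keeps the first sense on ties
            best_sense, best_overlap = sense, overlap
    return best_sense


# Toy glosses, invented for this example.
glosses = {
    "bank_financial": "a financial institution that accepts deposits and lends money",
    "bank_river": "sloping land beside a body of water such as a river",
}
context = "she sat on the bank of the river and watched the water".split()
chosen = simplified_lesk(context, glosses)
print(chosen)
```

Because the overlap is counted on raw tokens, function words such as "of" contribute to the score; practical variants usually filter stop words first.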
Similar to our work, GlossBERT (Huang et al., 2019) formulates the task as a sentence-pair classification problem, using context-gloss pairs to fine-tune the pre-trained BERT (Devlin et al., 2019) on the labeled SemCor data (Miller et al., 1994). This becomes a binary classification problem where the system predicts whether the ambiguous word matches the queried sense gloss in a single cross-encoder model. However, GlossBERT uses the default BERT architecture for sentence pair classification, applying an affine transformation on the [CLS] token. This summarized word sense query makes it more challenging for the model to identify the ambiguous word. In comparison, our system provides additional information about the ambiguous word (on top of the [CLS] token), which immediately improves performance.
The BEM model (Blevins and Zettlemoyer, 2020) further improves on this approach by using a bi-encoder approach that independently embeds the ambiguous word with its surrounding context and the sense gloss of each queried sense. Since they are jointly optimized in the same representation space, disambiguation is performed by finding the nearest sense embedding.
Unlike GlossBERT and BEM, ESCHER (Barba et al., 2021) also utilizes sense gloss, but formulates the task as a span extraction problem. The input is a sentence pair where the first sentence contains the context of the ambiguous word and the second sentence contains the concatenation of glosses from all candidate senses. The system is trained to find the text span corresponding to the correct sense.
Another challenge faced by supervised systems is the limited training data size. The work from Yap et al. (2020) utilizes usage examples from WordNet to generate more training data. In contrast, our system uses example sentences to improve sense representations instead.
Other approaches make use of relational information in the lexical knowledge graphs. For example, LMMS (Loureiro and Jorge, 2019) uses annotated data to generate sense embeddings using BERT. These embeddings are then propagated through the WordNet graph to infer senses that do not appear in SemCor. Similarly, ARES (Scarlini et al., 2020) also achieves full sense coverage but through extraction of relevant contexts. SparseLMMS (Berend, 2020) further makes the embeddings sparse through a dictionary matrix. Connections are made between each dimension of the sparse embeddings and human interpretable semantic content. EWISE (Kumar et al., 2019), on the other hand, learns sense embeddings by pre-training a gloss encoder with sense definitions and knowledge graph information. The learned sense gloss embeddings are then scored via dot product with a contextual vector to perform prediction. EWISER (Bevilacqua and Navigli, 2020) extends EWISE by injecting additional relational knowledge from the lexical knowledge graph via a simple sparse dot product operation with an adjacency matrix formulated with the knowledge graph. Since the pre-trained sense embeddings are used to classify the ambiguous word, the model is able to predict synsets that are not present in the training set, improving zero-shot performance. Our system surpasses previous published systems despite using minimal knowledge graph information (only the sense gloss of hypernyms).

Methodology
In this section, we describe the model architecture of our system, and present our method for achieving enhanced sense representations (ESR).

Model Architecture
The WSD task determines the best synset $\hat{s} \in S_w$ for an ambiguous word $w$, where $S_w$ is the set of candidate synsets for word $w$.
The inputs of our system are sentence pairs. The first sentence is the context containing the ambiguous word $w$, and the second sentence is the sense representation of one candidate synset $s \in S_w$. The two sentences are then concatenated to form a sentence pair containing words $w_1, w_2, \ldots, w_m$, which is tokenized into tokens $t_1, t_2, \ldots, t_n$. In the case of the RoBERTa tokenizer, each tokenized sentence in the pair is surrounded by <s> and </s>, so $t_1$ = <s>. The tokens are then passed to RoBERTa $T$, which produces the final layer hidden states:
$$T(t_1, t_2, \ldots, t_n) = (h_1, h_2, \ldots, h_n), \quad h_i \in \mathbb{R}^H$$
where $H$ is the size of one hidden state. If a word $w_i$ is tokenized into multiple tokens $t_j, \ldots, t_k$, then the average of the corresponding final layer hidden states is used:
$$h_{w_i} = \frac{1}{k - j + 1} \sum_{l=j}^{k} h_l$$
Note that RoBERTa adds an extra layer with tanh activation on top of the final layer hidden state of the first token <s> to produce an output for classification tasks:
$$h_s = \tanh(W_s h_1 + b_s)$$
where $W_s$ and $b_s$ are the parameters of this extra layer. This output $h_s$ and the hidden state $h_w$ of the ambiguous word $w$ are then concatenated and passed to a binary classification layer with parameters $W_c$ and $b_c$, whose output is passed to softmax to model the probability of a candidate synset being positive:
$$(p_1, p_2) = \mathrm{softmax}(W_c [h_s; h_w] + b_c)$$
Here we use $p_s = p_2$ to model the probability of a candidate synset $s$ being positive.
During training, each sentence pair is assigned a label $y$, with $y$ equal to 1 if the sentence pair contains a positive synset and 0 otherwise. Binary cross-entropy is used as our loss function:
$$\mathcal{L} = -\left[ y \log p_s + (1 - y) \log (1 - p_s) \right]$$
During prediction, the synset with the highest probability among all the candidate synsets in $S_w$ is used as the predicted synset of the ambiguous word $w$:
$$\hat{s} = \operatorname*{argmax}_{s \in S_w} p_s$$
where $S_w$ is determined by the lemma and POS tag of the ambiguous word $w$.
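The scoring and prediction steps described in this subsection can be sketched in plain Python. This is a toy illustration under stated assumptions, not the released implementation: hidden states are short hand-written lists rather than RoBERTa outputs, and the names `mean_vector`, `positive_probability`, `predict`, `binary_cross_entropy`, the weight matrix, and the sense ids are all hypothetical.

```python
import math


def mean_vector(vectors):
    """Average the hidden states of the sub-tokens that make up one word."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def positive_probability(h_s, h_w, weight, bias):
    """Concatenate the two vectors, apply a 2-way linear layer, return p_2."""
    features = h_s + h_w  # list concatenation stands in for [h_s; h_w]
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)[1]


def binary_cross_entropy(p_s, y):
    """Loss for one sentence pair with label y in {0, 1}."""
    return -(y * math.log(p_s) + (1 - y) * math.log(1 - p_s))


def predict(candidates, weight, bias):
    """candidates maps a sense id to (h_s, sub-token hidden states of w)."""
    scores = {sense: positive_probability(h_s, mean_vector(tokens), weight, bias)
              for sense, (h_s, tokens) in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy values with hidden size H = 2, invented for this example.
weight = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
bias = [0.0, 0.0]
candidates = {
    "sense_a": ([1.0, 0.0], [[2.0, 0.0], [0.0, 0.0]]),
    "sense_b": ([-1.0, 0.0], [[-1.0, 0.0]]),
}
best, scores = predict(candidates, weight, bias)
loss_for_best = binary_cross_entropy(scores[best], 1)
print(best, scores, loss_for_best)
```

Each candidate synset is scored independently, so the probabilities of different candidates need not sum to one; only their ranking matters at prediction time.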

Baseline System
We use RoBERTa as our transformer model. To better represent the context, we not only use the sentence S containing the ambiguous word, but also include one neighboring sentence before S and one neighboring sentence after S. For sense representation, we join the ambiguous word and the sense definition of the synset with a colon.

Context sentence: Has your attitude toward employee benefits encouraged an excess of free "government" work in your plant?

Enhanced Sense Representations
Built on top of the baseline system, ESR not only uses the sense definition of the synset, but also incorporates words related to the synset to enrich the sense representation. The related words are constructed by first concatenating the words from the following three sources in order: (i) all the lemmas belonging to the synset (synonyms); (ii) WordNet example phrases or sentences of the synset; (iii) the hypernym gloss of the synset. Table 1 shows an example for the word plant, with the words from synonyms, example sentences, and hypernym glosses listed accordingly. We then remove stop words (which are not very informative), and keep only one occurrence of a word if it appears multiple times. By appending the related words to the sense representation of a synset, we obtain the enhanced sense representation. Table 1 gives examples of enhanced sense representations for the positive and negative synsets of the word plant in the context sentence.
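The construction of related words described above can be sketched as follows. This is a minimal sketch under several assumptions that the text does not specify: the stop-word list is a tiny hypothetical one, tokenization is plain whitespace splitting, and the related words are joined to the baseline representation with a single space. The toy inputs are loosely modeled on the "building" sense of plant and are not copied from Table 1.

```python
# Hypothetical, deliberately tiny stop-word list; the actual list is not given in the text.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "or", "and", "to", "with", "is", "that"}


def related_words(synonyms, examples, hypernym_gloss):
    """Concatenate the three sources in order, drop stop words, keep first occurrences."""
    ordered = []
    for text in [*synonyms, *examples, hypernym_gloss]:
        ordered.extend(text.lower().split())  # whitespace tokenization is an assumption
    seen, kept = set(), []
    for word in ordered:
        if word in STOP_WORDS or word in seen:
            continue
        seen.add(word)
        kept.append(word)
    return kept


def enhanced_sense_representation(word, gloss, synonyms, examples, hypernym_gloss):
    """Baseline representation "word: gloss" followed by the related words."""
    baseline = f"{word}: {gloss}"
    extra = " ".join(related_words(synonyms, examples, hypernym_gloss))
    return f"{baseline} {extra}" if extra else baseline  # space separator is an assumption


# Toy inputs for illustration only.
synonyms = ["plant", "works", "industrial plant"]
examples = ["they built a large plant to manufacture automobiles"]
hypernym_gloss = "a whole structure made up of interconnected or related structures"

words = related_words(synonyms, examples, hypernym_gloss)
representation = enhanced_sense_representation(
    "plant", "buildings for carrying on industrial labor",
    synonyms, examples, hypernym_gloss,
)
print(representation)
```

Keeping the first occurrence preserves the source order (synonyms, then examples, then hypernym gloss), so a word shared by several sources is placed where it first appears.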

Experiments
In this section, we provide the details of our experiments and a comparison with other systems.

Datasets
We follow the unified evaluation framework for WSD (Raganato et al., 2017).
In addition, we evaluate few-shot and zero-shot performance of ESR on the FEWS (Blevins et al., 2021) dataset. FEWS is generated from Wiktionary quotations and illustrations. It covers 71,391 senses from Wiktionary and contains a total of 121,459 ambiguous instances, which are divided into 101,459, 10,000, and 10,000 instances for training, development, and testing respectively. The development set and the test set each contain 5,000 few-shot instances and 5,000 zero-shot instances. By creating positive and negative examples for each instance, we generate 478,604 training examples. Since the sense definitions and usage examples are put together in FEWS, we use "e.g." as the delimiter to separate them for use with ESR.
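Creating positive and negative examples for one annotated instance might look like the sketch below. It assumes that every candidate sense of the ambiguous word yields one sentence pair, labeled 1 for the gold sense and 0 otherwise; the text does not state whether all candidates are used, and the function name, sense ids, context, and shortened glosses are hypothetical.

```python
def make_sentence_pairs(context, gold_sense, sense_representations):
    """Build (context, sense representation, label) triples for one instance."""
    pairs = []
    for sense_id, representation in sense_representations.items():
        label = 1 if sense_id == gold_sense else 0  # 1 = positive pair, 0 = negative pair
        pairs.append((context, representation, label))
    return pairs


# Hypothetical sense ids and glosses, for illustration only.
candidates = {
    "plant_building": "plant: buildings for carrying on industrial labor",
    "plant_botany": "plant: a living organism lacking the power of locomotion",
}
pairs = make_sentence_pairs(
    "They built a new plant near the harbor.", "plant_building", candidates
)
for pair in pairs:
    print(pair)
```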

Hyperparameters
We use two settings during training, one with roberta-base and the other with roberta-large. Both settings fine-tune the pre-trained language model from Hugging Face (Wolf et al., 2020) for 3 epochs with a total batch size of 32. The optimizer is AdamW (Loshchilov and Hutter, 2019), with the learning rate set to 8.5e-6, epsilon set to 1e-6, and weight decay set to 0. The number of warm-up steps is 10% of the total training steps (batches): 14,476 for fine-tuning on SemCor, 34,207 for fine-tuning on both SemCor and WNGC, and 4,487 for fine-tuning on FEWS. The input size (number of tokens n) is limited to 432 for roberta-base and 348 for roberta-large. During fine-tuning, the model is evaluated every 500 batches. After 1.5 epochs, the checkpoint with the highest SE07 F1 score is saved. If multiple checkpoints have the same SE07 F1 score, the earliest one is chosen to avoid over-fitting.
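The warm-up figure for FEWS can be checked with a short calculation, and the checkpoint rule can be written as a small selection function. Both are sketches under assumptions: the final partial batch is assumed to be kept and the 10% is rounded down, and checkpoints are assumed to become eligible once at least 1.5 epochs of batches have been processed. The function names and the evaluation scores in the usage example are hypothetical.

```python
import math


def warmup_steps(num_examples, batch_size=32, epochs=3, warmup_percent=10):
    """Warm-up steps as a percentage of the total number of training batches."""
    batches_per_epoch = math.ceil(num_examples / batch_size)  # keep the last partial batch
    total_steps = batches_per_epoch * epochs
    return total_steps * warmup_percent // 100  # round down


def select_checkpoint(evaluations, steps_per_epoch, min_epochs=1.5):
    """Pick the step with the highest dev F1 after min_epochs, earliest on ties.

    evaluations is a list of (step, f1) tuples in training order.
    """
    best_step, best_f1 = None, float("-inf")
    for step, f1 in evaluations:
        if step < min_epochs * steps_per_epoch:
            continue  # too early in training to be considered
        if f1 > best_f1:  # strict ">" keeps the earliest checkpoint on ties
            best_step, best_f1 = step, f1
    return best_step


# 478,604 FEWS training examples, batch size 32, 3 epochs (values from the text).
fews_warmup = warmup_steps(478_604)
print(fews_warmup)

# Hypothetical evaluation log: (step, SE07 F1).
log = [(500, 0.80), (1000, 0.74), (1500, 0.76), (2000, 0.76)]
chosen_step = select_checkpoint(log, steps_per_epoch=600)
print(chosen_step)
```

Under these assumptions the FEWS calculation reproduces the 4,487 warm-up steps reported above; the SemCor figures cannot be checked the same way because the number of generated sentence pairs is not given here.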

Results
In this subsection, we present the scores of ESR on the benchmark WSD evaluation framework and on FEWS.

WSD Evaluation Framework
Table 2 shows the F1 scores of different WSD systems on the English all-words WSD evaluation framework (Raganato et al., 2017). For each of our systems, we run the experiment 3 times with different random seeds and report the average score over 3 runs in the table. By incorporating ESR, there is a significant improvement of 1.0% over the baseline system, from 78.8% to 79.8%. The improvement is statistically significant with p-value < 0.01, which shows that ESR is effective.
When training on SemCor only with roberta-base, ESR outperforms most prior published systems except ESCHER. ESCHER, however, fine-tunes a large model. The WSD system from Yap et al. (2020) performs close to ESR. However, the bert-large-uncased model used in their system contains 336M parameters, almost 2.7 times as many as roberta-base, which has only 125M parameters. Note that the F1 scores for verbs are all below 70% and more than 10% lower than other POS tags in all previous WSD systems, dragging down the overall performance of the systems. The reason is that the synsets for verbs in WordNet are so fine-grained that it is often difficult for even humans to tell the difference. The performance of ESR on verbs beats all previous WSD systems, including those utilizing WNGC and a large model, which shows that ESR is effective in distinguishing fine-grained senses.
When training on SemCor only with roberta-large, ESR surpasses all previous WSD systems with an F1 score of 81.1% on ALL. By adding WNGC to the training data, ESR with roberta-large further improves to 73.0% on verbs, and achieves an F1 score of 82.0% on ALL. The 0.9% improvement brought by WNGC is statistically significant with p-value < 0.01.
With roberta-base, the time taken for training on SemCor is 9 hours on 1 RTX 3090 GPU, and 18 hours for training on both SemCor and WNGC.
With roberta-large, the time taken for training on SemCor is 8 hours on 2 A100 GPUs, and 17 hours for training on both SemCor and WNGC. Testing time is 0.25 hours for both.

FEWS
Table 3 shows the F1 scores of different WSD systems on the FEWS development set and test set. All the systems are trained on the FEWS dataset only. We use BEM BERT from Blevins et al. (2021) as the baseline. Compared with the BEM baseline, ESR with roberta-base improves on the full test set by 2.0%, and improves on the zero-shot test set by 5.1%. When using roberta-large, ESR further improves the F1 score on the full test set to 79.6%. On the full development set, ESR even outperforms humans, although its zero-shot performance is still worse than human performance.
The time taken for training on FEWS is 2 hours on 1 A100 GPU with roberta-base, and 3.5 hours on 2 A100 GPUs with roberta-large. Testing time is 0.35 hours for both.

Analysis
In this section, we will analyze the effectiveness of different components constituting the related words in ESR: synonyms in the synset, example phrases or sentences from WordNet, and sense definition of hypernym for the synset. We will then evaluate the less frequent sense and zero-shot performance of ESR. Finally, we will visualize how ESR separates different synsets of a word with an example, and show that ESR achieves better clustering.

Ablation Studies
In order to evaluate the effectiveness of different components constituting the related words in ESR, we remove each of them and see how the overall performance is affected.
We can also view the results from another angle. By adding examples to the baseline system, there is a 0.3% increase from 78.8% to 79.1%, while adding hypernyms to the baseline system only increases the F1 score by 0.1%, from 78.8% to 78.9%. If we add both examples and hypernyms to the baseline system, there is a 0.5% increase in F1 score, from 78.8% to 79.3%, which equals the additional 0.5% gained by further adding synonyms (from 79.3% to 79.8%). This again shows that adding synonyms is the most significant in ESR, and adding hypernyms is less significant than adding examples.
One explanation for the above observations is that the synonyms of a synset are semantically close to the synset and make the synset more distinguishable, compared to its examples and hypernym. Besides, the hypernym is shared by all its hyponyms, making it less unique to a specific synset.

Few-shot and Zero-shot Performance
We have shown the effectiveness of ESR over the baseline system, and that synonyms play the most significant role. We further investigate ESR's effectiveness on the most frequent sense (MFS) and less frequent senses (LFS) of a word, where MFS is defined as the first and also the most common sense of a word in WordNet, and LFS is defined as the other, less frequent senses of a word. We also investigate the zero-shot performance of ESR, when it is tested on senses and words that are unseen in the training data.
As shown in Table 5, both ESR and the baseline system perform better on MFS than on LFS. This is because SemCor is imbalanced and 73.7% of the training instances are MFS. The fewer training instances for LFS and the fine-grained nature of WordNet make it hard to distinguish the different synsets and achieve a high performance on LFS. However, ESR uses related words to make the synset more distinguishable, and improves by 1.0% over the baseline by using only examples and hypernym. If synonyms are used, a further 1.2% improvement is achieved.
Unseen senses are senses that do not appear in the SemCor training data, but appear in the test datasets. By adding examples and hypernyms, a 0.4% improvement can be made. After adding synonyms, a further 0.2% improvement can be made. To see why the performance on an unseen sense can be improved, consider the word evoke, where its sense call to mind in the SE2 test set does not appear in SemCor. However, in the SemCor training data, the sense call forth (emotions, feelings, and responses) of evoke is present. During training, related words of the unseen sense call to mind are used as part of a negative sentence pair with a context sentence that contains the ambiguous word evoke. As such, even though the sense call to mind does not appear in the training data, the ESR system is (indirectly) aware of this unseen sense call to mind, via its related words in a negative sentence pair. In this way, ESR is able to leverage the negative sentence pairs so that it can better disambiguate the call to mind sense during testing, even though it is an unseen sense that does not appear in the training data at all.
Unseen words are those that appear in the test datasets, but do not appear at all in the SemCor training data. Even so, by adding examples and hypernyms, ESR can improve the F1 score on unseen words by 0.6% over the baseline. Although unseen words do not appear as ambiguous words in SemCor, some of them actually show up in the sense representations of seen words. For example, although the word envoy in the SE13 test set never appears as an ambiguous word in SemCor, it shows up in the sense gloss provide or send (envoys or embassadors) with official credentials ... of another seen word accredit. Hence, some of the unseen words are involved in the training process indirectly through the sense representations of seen words. This explains why ESR can improve the performance on unseen words.

ESR Improves Clustering
The above analysis shows that ESR improves the performance of WSD by adding related words that make the sense representations more distinguishable. To further illustrate this, we evaluate the baseline system and ESR qualitatively through clustering.
For clustering, we use the concatenated hidden states of the first token and the ambiguous word in the context, which are the inputs of the binary classification layer as described in subsection 3.1. For each ambiguous word in SemCor, only the positive sentence pairs corresponding to its different senses are chosen. For visualization, the high dimensional concatenated hidden states are reduced to 2 dimensions with t-SNE. Figure 1 shows the ambiguous word plant with its two senses in different systems. Each point represents a positive sentence pair in SemCor containing the sense representation of the ambiguous word plant. Although the two senses are distinctive, the baseline system cannot separate them well and the points of both senses are mixed together.
By adding examples and hypernyms, the system is able to separate the two different senses. In Table 6, the average distance between a point and the cluster centroid for the "building" sense is decreased from 4.04 to 3.12 as the points form better clusters. However, the separation is not perfect due to some outliers from the "botany" sense mixing with the cluster for the "building" sense, causing a decrease in distance from 6.96 to 4.19 between the two centroids compared to the baseline. From the visualization, it is clear that ESR separates the points best among the three systems. The points for each sense form circular clusters with decreased average distance between a point and the cluster centroid, and there are no outliers. The distance between the two clusters is 20.83, much larger than in the other two systems and more than enough for separation. This is consistent with the ablation test conclusion that synonyms play a more significant role than examples and hypernyms.
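The two kinds of distances discussed here (between the two cluster centroids, and from each point to its own centroid on average) can be computed as in the sketch below. It assumes Euclidean distance on the 2-dimensional t-SNE coordinates; the function names and the toy points are hypothetical and do not correspond to the values in Table 6.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coordinate) / n for coordinate in zip(*points))


def mean_distance_to_centroid(points):
    """Average Euclidean distance between each point and the cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def cluster_statistics(points_1, points_2):
    """Centroid-to-centroid distance and the spread of each cluster."""
    return {
        "centroid_distance": math.dist(centroid(points_1), centroid(points_2)),
        "spread_1": mean_distance_to_centroid(points_1),
        "spread_2": mean_distance_to_centroid(points_2),
    }


# Toy 2-D points standing in for t-SNE outputs of two senses.
sense_1 = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
sense_2 = [(10.0, 1.0), (12.0, 1.0)]
stats = cluster_statistics(sense_1, sense_2)
print(stats)
```

A large centroid distance combined with small spreads indicates well-separated clusters, which is the pattern the text reports for ESR.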

Conclusion
In this paper, we present ESR, which incorporates related words of a synset from its synonyms, usage examples, and sense definition of hypernym to further boost the performance on WSD over previous state-of-the-art systems. ESR provides more distinctive representations for senses, making the senses better separated from each other, and improves the performance of a baseline WSD system significantly. ESR not only brings improvements on less frequent senses, unseen senses, and unseen words, but also improves the overall performance and surpasses prior published scores with an F1 score of 82.0%. While our work shows that ESR improves WSD performance, there is still room for improvement as we only explore limited methods to enhance sense representations. For future work, we believe there are potentially better ways to enrich sense representations and make them more distinguishable, further improving the performance of WSD systems.

Table 6: Distance between the two cluster centroids ($|c_1 - c_2|$), and the average distance between a point and the corresponding centroid in each cluster ($|p_1 - c_1|$ and $|p_2 - c_2|$) for the two senses of plant in different systems.