Word Sense Disambiguation: Towards Interactive Context Exploitation from Both Word and Sense Perspectives

Recently proposed Word Sense Disambiguation (WSD) systems have approached the estimated upper bound of the task on standard evaluation benchmarks. However, these systems typically disambiguate the words in a document almost independently, underutilizing sense and word dependencies in context. In this paper, we convert these nearly isolated decisions into interrelated ones by exposing senses in context when learning sense embeddings in a similarity-based Sense Aware Context Exploitation (SACE) architecture. Meanwhile, we enhance context embedding learning with selected sentences from the same document, rather than using only the sentence where each ambiguous word appears. Experiments on both English and multilingual WSD datasets show the effectiveness of our approach, which surpasses the previous state of the art by large margins (3.7% and 1.2% respectively), especially in few-shot (14.3%) and zero-shot (35.9%) scenarios.


Introduction
Word Sense Disambiguation (WSD) is the task of determining a word's sense given its context. Recently, contextualized representation learning (Devlin et al., 2019; Liu et al., 2019) has accelerated the advancement of WSD, raising the performance on a standard evaluation framework (Raganato et al., 2017a) from slightly higher than 70% (Raganato et al., 2017b; Luo et al., 2018; Kumar et al., 2019) to about 80% (Vial et al., 2019; Blevins and Zettlemoyer, 2020; Bevilacqua and Navigli, 2020). This is an estimated upper bound of the task, derived from the inter-annotator agreement: the percentage of words that are annotated with the same meaning by two or more annotators (Navigli, 2009). There is a clear trend that supervised systems tend to incorporate sense knowledge into their architecture, ranging from sense definitions and usage examples to sense relations.
However, the disambiguation of words in a document is almost independent of each other, especially from the perspective of senses in context. The connection between each word's disambiguation is limited to the utilization of a sentence (Loureiro and Jorge, 2019; Huang et al., 2019; Hadiwinoto et al., 2019; Scarlini et al., 2020a) or a small window of text because of computation cost or model restrictions. More severely, the interaction of senses in context is barely explored. Similar to word co-occurrence, the appearance of one sense can sometimes dominate the choice of another sense in the same sentence (Agirre et al., 2014; Maru et al., 2019).
In this paper, we introduce SACE, a similarity-based WSD approach. Precisely, we transform the previously almost isolated disambiguation of words in a document into interrelated decisions, to maximize the contribution of context from both word and sense perspectives. We summarize our contributions as follows: 1. We devise an interactive sense embedding learning technique that takes senses in context into account via a selective attention layer in a neural architecture. It connects senses via their appearance in a piece of text rather than using manually constructed sense relations, and is thus less costly. 2. We introduce a method to better exploit the word-level context by selecting highly related sentences from the same document for context embedding learning, instead of using only the sentence in which each ambiguous word appears.

Related Work
There are mainly two alternatives for solving WSD, namely knowledge-based and supervised approaches. While the former mainly relies on a sense inventory for disambiguation, the latter depends on sense-annotated corpora to train a sense classifier, either for each word or for the whole vocabulary. However, many recently proposed systems combine these two strategies, injecting sense knowledge into their supervised models while still inadequately modeling the provided context in a document from both word and sense perspectives.

Supervised Method
Early supervised approaches model the relational pattern between an ambiguous word's local features and its gold sense from sense-annotated data. IMS (Zhong and Ng, 2010) was one of the most prevalent systems; it trained a sense classifier for each lemma in the training data. In comparison, Raganato et al. (2017b) unified the disambiguation of words into a single sequence labeling architecture, alleviating the efficiency issue. Many subsequent systems improved this architecture by incorporating sense knowledge.
For unseen lemmas, these systems require a most frequent sense (MFS) fallback (selecting the most frequent candidate sense in the training data). To tackle this problem, LMMS (Loureiro and Jorge, 2019) implements disambiguation in a similarity-based manner. It learns a sense embedding for each labeled sense in SemCor (Miller et al., 1994) and maps them to full coverage of WordNet (Miller, 1995) senses using sense relations. BERT (Devlin et al., 2019) is used as a feature-extraction module for both gloss and context encoding. Further, BEM (Blevins and Zettlemoyer, 2020) adopts two encoders for the above approach in a fine-tuning manner. Although the model is more effective even without exploiting sense knowledge other than glosses, it takes around 2.5 days to train.
The employment of sense relations in previous supervised systems is mostly limited to explicitly defined relations such as hypernymy and hyponymy, severely neglecting how senses in context contribute to the selection of a word's sense.

Context Exploitation
For supervised WSD approaches, it is typical to use a small fraction of the whole context to carry out disambiguation, such as a sentence or a sliding window of text. In contrast, knowledge-based WSD approaches tend to exploit a word's context more fully, ranging from a sentence (Lesk, 1986) to a few sentences (Agirre et al., 2018) and even the whole document (Chaplot and Salakhutdinov, 2018). Some studies draw in out-of-dataset context (Ponzetto and Navigli, 2010; Scarlini et al., 2020a) for disambiguation, including Wikipedia documents. Therefore, it is worth exploring whether the disambiguation of words within the same document can benefit from each other in a supervised system.
The utilization of senses in context is far less investigated than that of words in context. UKB (Agirre et al., 2014), a knowledge-based system, is one of the related systems that model sense relations in context. It first connects senses in context via WordNet sense relations and then runs personalized PageRank on the constructed sense graph to decide sense importance. For each word, the most important candidate sense is considered the correct sense. SyntagNet (Maru et al., 2019) improves on this idea by introducing manually disambiguated sense pairs in context during sense graph construction. Although the system was able to challenge supervised systems at the time, it relied on human labor to obtain sense pairs in context. There had been no attempt at integrating the utilization of senses in context into a supervised architecture.

Preliminary
WSD is the task of selecting the correct sense ŝ of a word w_{i,j} given its context c_i, where w_{i,j} is the j-th word in the i-th sentence. The candidate senses S_{i,j} = {s_1, s_2, ..., s_k, ..., s_n} are drawn from a sense inventory such as WordNet. Here, i, j, and k denote the index of the sentence, word, and sense respectively. In a similarity-based WSD approach, the disambiguation of a word is determined by the similarity between its context representation e_{i,j} and each candidate sense representation e_{s_k}. In many cases, both representations are vectors and the similarity is measured by their dot product after normalization. Then, the sense with the highest similarity is selected as the correct sense.
Typically, a word's context representation is learned using the sentence where the word appears (Loureiro and Jorge, 2019; Scarlini et al., 2020a; Scarlini et al., 2020b). The representation of a candidate sense is obtained using its gloss/definition g_s in WordNet (Blevins and Zettlemoyer, 2020). A common approach to encoding these two sequences in recent research is to utilize pre-trained models such as BERT, RoBERTa (Liu et al., 2019), and so on, taking the sum of the outputs of the last four layers as encoded features (Loureiro and Jorge, 2019; Scarlini et al., 2020a), as in (1) and (2). Before feeding c_i and g_s to the models, a special token [CLS]/[SEP] is added to the beginning/end of each sequence, modifying them into c̄_i and ḡ_s respectively.

For each word w_{i,j}'s context representation, a common choice is to utilize the model's output at the position of the word (j), using c̄_i as input, as shown in equation (1). If the word is tokenized into several pieces, their mean is taken. In contrast, for each sense representation, when fine-tuning a pre-trained model, the sense embedding e_{s_k} is the output at the position of [CLS] (Blevins and Zettlemoyer, 2020), with the modified gloss ḡ_{s_k} as input, as in (2).
To utilize the supervision from a training corpus, a cross-entropy loss is computed between the similarity distribution over candidate senses (the softmax output in (3)) and the one-hot ground-truth distribution, as shown in equation (4). E_{S_{i,j}} ∈ R^{|S_{i,j}|×h} is a matrix of the concatenated sense embeddings arranged in rows, and h is the dimension of the pre-trained model's hidden states (768 or 1024 for BERT).

y_{i,j,k} is equal to 1 when s_k (the k-th sense of w_{i,j}) is the correct sense and 0 otherwise, representing each element of the ground-truth one-hot vector. For prediction, the model selects the sense with the largest dot product for each word.
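A minimal sketch of this similarity-based pipeline follows (toy 2-dimensional vectors and plain Python stand in for BERT features and tensor code; the function names are ours, not the paper's):

```python
import math

def softmax(scores):
    """Normalize dot-product scores into a probability distribution (cf. equation 3)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def disambiguate(context_emb, sense_embs):
    """Pick the candidate sense whose embedding has the largest dot product
    with the word's context embedding."""
    scores = [sum(c * s for c, s in zip(context_emb, e)) for e in sense_embs]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

def cross_entropy(probs, gold_index):
    """Loss against the one-hot ground-truth distribution (cf. equation 4)."""
    return -math.log(probs[gold_index])

# Toy 2-dimensional embeddings; real systems use 768/1024-dim features.
context = [0.9, 0.1]
senses = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
pred, probs = disambiguate(context, senses)   # sense 1 scores highest
loss = cross_entropy(probs, gold_index=1)
```

In a real system the gradient of this loss flows back into the gloss and context encoders; here the sketch only shows the scoring and loss arithmetic.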
In the above approach (from BEM; Blevins and Zettlemoyer, 2020), the embedding learning processes of different senses are independent of each other, relying merely on the sense gloss. Besides, the interaction between different words' disambiguation is limited to the utilization of a single sentence, leading to inadequate exploitation of the words in context. Therefore, we transform the above almost isolated decisions into interrelated ones by learning the sense and context embeddings interactively.

SACE: Sense Aware Context Exploitation in Supervised WSD

Sense-level Context (SlC)
The interactive sense embedding learning mainly involves a selective attention layer on top of the original sense embeddings from the pre-trained model. The goal of this interaction is to make the learning of one sense's embedding aware of the other senses in the same context. It is motivated by the fact that some sense pairs are used together much more commonly than others.
In practice, each ambiguous word in the document has several candidate senses, which raises the question of which senses should be attended to in the selective attention layer. To address this problem, we make use of the iterative nature of model training. In other words, the system attends to the senses it predicted for each word within a particular context in the previous iteration. For the first iteration, the first sense of each word in context is attended to. Under this strategy, the senses of monosemous words (words with a single sense) can be exploited at all iterations.
For convenient demonstration, we use the embeddings of the predicted senses ŝ of the context words in c_i to enhance the embedding of each candidate sense of word w_{i,j}. We note that c_i can be a larger context. In equation (5), M is the number of words in c_i. In (6), W ∈ R^{h×h} is a learnable weight matrix.
The attention score in (6) only takes into consideration the representation at the [CLS] position (a sentence-level representation) of each gloss, neglecting the relatedness between the individual gloss words of two senses. To tackle this, we devise a combined attention score that considers both [CLS] relevance and gloss word relevance, as in equation (7). L is a predefined gloss length shared by all senses, used for normalization.

Each gloss word representation g_{s,t} ∈ R^{h×1} is obtained with equation (2) by changing the output position to t. If the length l of a sense gloss is smaller than L, g_{s,t} is a zero vector for every t larger than l.
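The core of the selective attention layer can be sketched roughly as follows, assuming a bilinear score with the learnable matrix W and a residual combination of the original and attended representations; the exact form is given by equations (5)-(7), and the vectors and function names here are illustrative only:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(sense_emb, context_sense_embs, W):
    """Enhance one candidate sense embedding with the embeddings of the
    senses predicted for the other words in the same context."""
    # Bilinear attention score: (W . sense_emb) dotted with each context sense.
    query = [sum(w * x for w, x in zip(row, sense_emb)) for row in W]
    scores = [sum(q * c for q, c in zip(query, ce)) for ce in context_sense_embs]
    alphas = softmax(scores)
    mixed = [sum(a * ce[d] for a, ce in zip(alphas, context_sense_embs))
             for d in range(len(sense_emb))]
    # Residual combination (an assumption; the paper defines the exact form).
    return [s + m for s, m in zip(sense_emb, mixed)]

W = [[1.0, 0.0], [0.0, 1.0]]               # stands in for the learnable h x h matrix
candidate = [1.0, 0.0]                     # embedding of one candidate sense
context_senses = [[1.0, 0.0], [0.0, 1.0]]  # senses predicted in the last epoch
enhanced = attend(candidate, context_senses, W)
```

Because the attention weights favor the context sense closest to the candidate, the enhanced embedding is pulled toward senses that co-occur with it.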

Word-level Context (WlC)
In many previous supervised systems, the disambiguation of one word in a sentence is isolated from the words in the other sentences of the same document. We convert these isolated decisions into interactive ones by utilizing several highly related sentences within the same document for context embedding learning.
For each sentence s_i, we select its related sentences under two criteria: one is the distance to s_i, and the other is the semantic relatedness to s_i. The first criterion can be regarded as capturing local features, while the second aims to inject global features while maintaining a low noise level.
From the perspective of local features, the directly surrounding sentences within a window are used as related sentences. For global features, we score all context sentences and utilize the top related ones for context embedding learning. Precisely, in a document d, we regard each sentence as a document in its own right and calculate the TF-IDF score of each word in the vocabulary V of d for all sentences. The intuition behind modeling sentences with TF-IDF is that the average length of SemCor sentences is 22 words, which is reasonably long. This represents the original document as a matrix D ∈ R^{N×|V|}, where the rows and columns correspond to the sentence and word dimensions respectively. For instance, D(i, v) is the TF-IDF score of word v in sentence s_i. The score of a context sentence with respect to s_i is then computed as in equation (8).

After scoring all context sentences for each sentence s_i, we concatenate the related sentences with s_i and use them as one input to BERT for context embedding learning. As an example, {s_{i-1}, s_{i+1}} are related sentences from local features, and if {s_{i-12}, s_{i+7}} are the top-scored sentences from global features, we use c̄_i = {s̄_{i-12}, s̄_{i-1}, s̄_i, s̄_{i+1}, s̄_{i+7}} as the input to equation (1) and retrieve the enhanced context embedding ē_{i,j} of each word in s_i. In this way, a different c̄_i is retrieved for each sentence in the document. We note that when the total sequence length exceeds 512 tokens, we remove the sentences furthest away from s̄_i; for instance, s̄_{i-12} and s̄_{i+7} in the above example would be removed first, in that order.
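Under the stated assumptions (one TF-IDF vector per sentence, with a plain dot product standing in for the score in equation (8)), the selection of related sentences can be sketched as:

```python
import math
from collections import Counter

def tfidf_rows(sentences):
    """Treat each sentence as a document and return one TF-IDF vector per sentence."""
    vocab = sorted({w for s in sentences for w in s})
    n = len(sentences)
    df = {w: sum(1 for s in sentences if w in s) for w in vocab}
    rows = []
    for s in sentences:
        tf = Counter(s)
        rows.append([(tf[w] / len(s)) * math.log(n / df[w]) for w in vocab])
    return rows

def related_sentences(sentences, i, window=1, top_k=1):
    """Indices of local neighbours within the window plus the top TF-IDF-scored
    global sentences for sentence i (including i itself)."""
    rows = tfidf_rows(sentences)
    local = {j for j in range(max(0, i - window), min(len(sentences), i + window + 1))
             if j != i}
    scores = {j: sum(a * b for a, b in zip(rows[i], rows[j]))
              for j in range(len(sentences)) if j != i and j not in local}
    global_top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return sorted(local | set(global_top) | {i})

doc = [
    "the church bells ring on sunday".split(),
    "we walked to the market".split(),
    "fresh fruit at the market today".split(),
    "bells in the church tower ring loudly".split(),
]
selected = related_sentences(doc, 0, window=1, top_k=1)  # -> [0, 1, 3]
```

Sentence 3 is retrieved globally because it shares the informative words "church", "bells", and "ring" with sentence 0, while "the", which appears everywhere, contributes nothing under TF-IDF.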
Finally, e_{i,j} and e_{s_k} in equation (4) are replaced with their enhanced counterparts ē_{i,j} and ē_{s_k} respectively to calculate the loss, which is used to update the weights of the pre-trained model and the selective attention layer.

Try-again Mechanism (TaM)
In a previous similarity-based WSD approach, SREF proposed a Try-again Mechanism (TaM) that, during evaluation, takes into account not only the similarity of a word's context embedding to the sense embedding of each candidate sense s_k, but also its similarity to the sense embeddings of the senses related to s_k. Here, two senses are connected by either WordNet relations or the super-sense relation (i.e., senses that belong to the same super-sense category in WordNet). This mechanism, shown in (9), manages to boost the performance of that knowledge-based system by a relatively large margin.
In this subsection, we reconstruct TaM so that it becomes effective in our model. This process helps the disambiguation of words to be even more interactive since it considers an increased number of senses by utilizing sense relation knowledge.
In our implementation, we replace the above relations with only those derived from the Coarse Sense Inventory (CSI; Lacerra et al., 2020). Similar to the utilization of super-sense categories, we connect senses that belong to the same CSI label as related senses. Also, we change the direct sum of the two similarities above into a weighted sum controlled by a hyperparameter.
In addition, our approach only learns a sense embedding for the candidate senses whose lemma is annotated in the training data. Therefore, in TaM, we save the sense embeddings from training at each epoch and use them during evaluation. It is worth mentioning that for related senses that do not have a saved sense embedding, we skip their terms in equation (10).

† http://lcl.uniroma1.it/wsdeval/home
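A hedged sketch of this reconstructed TaM scoring (the weighted-sum form follows the description above, with the weight 0.1 from the hyperparameter section; the function name and toy vectors are ours):

```python
def try_again_score(context_emb, sense_emb, related_sense_embs, beta=0.1):
    """Direct context-to-sense similarity plus a down-weighted similarity to
    the best-matching related sense (senses sharing the same CSI label)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    direct = dot(context_emb, sense_emb)
    if not related_sense_embs:
        # Related senses without a saved training embedding are skipped.
        return direct
    related = max(dot(context_emb, r) for r in related_sense_embs)
    return direct + beta * related

context = [1.0, 0.0]
with_tam = try_again_score(context, [0.8, 0.2], [[1.0, 0.0]])  # 0.8 + 0.1 * 1.0
without = try_again_score(context, [0.8, 0.2], [])             # 0.8
```

The small weight keeps the direct similarity dominant while letting strongly related senses break ties between close candidates.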

Datasets
To validate the effectiveness of our approach, we use SemCor and an evaluation framework† to train and evaluate our model, SACE_base, respectively. The evaluation framework contains 5 English all-words WSD benchmarks. We report the experimental results on each dataset, including SensEval-2 (SE2; Palmer et al., 2001), SensEval-3 (SE3; Snyder and Palmer, 2004), SemEval-2007 Task 17 (SE07; Pradhan et al., 2007), SemEval-2013 (SE13; Navigli et al., 2013) and SemEval-2015 (SE15; Moro and Navigli, 2015). We also report results broken down by Part-Of-Speech (POS) on their combined dataset (ALL). Following previous work, we train large models: SACE_large on SemCor, and SACE_large+ on SemCor, the WordNet Gloss Tagged corpus (WNGT), and WordNet examples (WNE), for fair comparison. Here, each WNE is regarded as an extra sense gloss and is concatenated after the original sense gloss for sense embedding learning, similar to the implementation in SREF.
For few-shot WSD, we partition ALL according to the gold label of each annotation into ALL WN_1st and ALL WN_others . Besides, according to whether senses and lemmas of ALL instances appear in SemCor, we extract two subsets, ALL ZSS and ALL ZSL , to evaluate the zero-shot learning ability of our model.
For cross-lingual datasets, we use the WordNet version of the latest evaluation framework‡, which contains test datasets for Spanish, Italian, French, and German. These datasets are preprocessed from SemEval-2013 (Navigli et al., 2013) and SemEval-2015 (Moro and Navigli, 2015). The former only covers nouns, while the latter covers words of four POS (noun-N, verb-V, adjective-A, adverb-R).
We note that the performance in each table is reported as F1 in percentage.

Model Design

‡ https://github.com/SapienzaNLP/mwsd-datasets
Our base and large models utilize RoBERTa-base and RoBERTa-large respectively, which perform relatively better than the corresponding BERT models. For cross-lingual evaluation, we fine-tune XLM-RoBERTa-base (SACE_mul; Conneau et al., 2020) with the same training data as SACE_large+, following the setting in EWISER. In each system, two encoders are adopted: a context encoder and a sense gloss encoder. This is identical to the setting in BEM, with the major difference that the pre-trained model adopted in the above papers is BERT.
The hyperparameters of our model are selected using SE07. They include the number of surrounding sentences (2) on both sides of s_i, the number of top related sentences (2) of s_i, and the weight (0.1) in TaM. The learning rates for SACE_base, SACE_large, SACE_large+, and SACE_mul are 1e-5, 1e-6, 1e-6, and 5e-6 respectively.
To accelerate model training, we organize the sentences in a document into batches according to the total number of candidate senses (400 for SACE_base and SACE_mul, 150 for SACE_large and SACE_large+); i.e., if the total number of candidate senses would exceed 400 or 150 when adding a sentence, that sentence starts the next batch. For each batch, the gloss and context encoders are called only once. The context and gloss lengths are normalized to the maximal sequence length within each batch to reduce unnecessary padding and computation. Also, apex is employed for mixed-precision computing. More details are given in Appendix A.
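The batching rule can be sketched as follows (the helper name and toy candidate counts are illustrative; the budget corresponds to the 400/150 limits above):

```python
def batch_sentences(sentences, budget=400):
    """Group consecutive sentences so that each batch stays within a total
    candidate-sense budget; a sentence that would overflow starts a new batch."""
    batches, current, count = [], [], 0
    for sent, n_candidates in sentences:
        if current and count + n_candidates > budget:
            batches.append(current)
            current, count = [], 0
        current.append(sent)
        count += n_candidates
    if current:
        batches.append(current)
    return batches

# (sentence id, number of candidate senses across its ambiguous words)
sents = [("s1", 250), ("s2", 200), ("s3", 100), ("s4", 380)]
batches = batch_sentences(sents, budget=400)  # -> [["s1"], ["s2", "s3"], ["s4"]]
```

Grouping by sense count rather than sentence count keeps the gloss encoder's workload roughly constant per batch, since it must encode every candidate gloss once.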

Baselines
We compare the proposed model with previously proposed systems, including BEM (Blevins and Zettlemoyer, 2020), ARES (Scarlini et al., 2020b) and SREF. BEM is our direct baseline; it utilizes two encoders to learn context and sense embeddings separately and achieves state-of-the-art performance with only SemCor.
For cross-lingual evaluation, we compare our results with those reported for SyntagNet, EWISER, ARES, and MuLaN (Barba et al., 2020). These are all recently proposed systems with state-of-the-art performance.

Ablation Analysis
In this subsection, we demonstrate how each component of our model benefits WSD performance. In Table 1, the system's performance on ALL illustrates that enhancing the interaction between different words' disambiguation in the same document (WlC) raises the system's performance by the largest margin, 1.5 F1. This improvement is slightly larger than that provided by the interactive sense embedding learning (SlC), 1.2 F1. The gloss word attention in SlC also proves effective, increasing the system's performance by 0.5 F1, similar to the contribution of TaM, 0.6 F1. Most importantly, when all components are removed, the performance on ALL decreases to 78.4 F1. We note that the baseline here differs from BEM since we remove unnecessary padding and utilize RoBERTa. This dramatically accelerates training, from 3.5 hours to 0.5 hour per epoch, while achieving similar performance. We also note that the experimental results reported in this paper are obtained using the same random seed as BEM. With different random seeds, the performance gap on ALL between SACE_base and its baseline (-w/o all) ranges from 1.7 F1 to 2.7 F1.

Our systems also obtain state-of-the-art performance on each dataset, with margins ranging from 0.2 to 2.9 F1 for SACE_base and 1.8 to 3.0 F1 for SACE_large in the first category. As for SACE_large+, the margin over the previous best system on each dataset is even larger, varying from 1.7 to 5.5 F1. It is noteworthy that SACE_base outperforms SACE_large by 0.9 F1 on SE15, and that they obtain similar performance on SE13. These two datasets are less ambiguous since each lemma has fewer candidate senses on average, which illustrates the competitive disambiguation capability of SACE_base on easier instances. We also note that the development set differs between the two categories, being SE07 in the first and SE15 in the second.
This is because we follow most systems' setting in the first category and follow EWISER's setting in the second category for better comparison.

All-words WSD
Regarding performance on different POS, our systems set new state-of-the-art marks for all of them on ALL. The largest advancement comes from a higher disambiguation ability on verbs, making our system the first to reach 70 F1. The systems also obtain unprecedented performance on noun disambiguation, surpassing the previous best system by 1.5, 2.4, and 2.4 F1 for SACE_base, SACE_large, and SACE_large+ respectively. SACE_large+ is the only system that exceeds 85 F1 on noun disambiguation.

Table 3 reports different systems' performance on ALL_WN_1st and ALL_WN_others, which contain 4278 and 2525 annotations respectively. Compared with previous well-performing systems including LMMS and SREF, our systems achieve much better performance on both datasets, with the major contribution coming from WordNet first-sense disambiguation. In contrast, SACE and BEM obtain similar performance on ALL_WN_1st, while SACE disambiguates rare senses with higher accuracy. This shows a better few-shot learning ability of SACE in comparison to BEM, because the ALL_WN_others dataset only contains words whose correct sense appears infrequently in SemCor.

Rare Sense Disambiguation
Here, sense disambiguation is defined as whether a system can select a given sense as the correct sense, viewed from the sense perspective. In comparison, word or lemma disambiguation is to determine the correct sense of a word or lemma, viewed from the word perspective.

Unseen Sense Disambiguation
The second column of Table 4 provides different systems' performance on ALL_ZSS (691 polysemous instances). This dataset only contains polysemous words whose gold label is not in SemCor, which evaluates the zero-shot sense disambiguation ability of different systems. Recently proposed systems show an overwhelming advantage in zero-shot sense disambiguation over ordinary baselines, including WordNet S1 and BERT-base, with margins ranging from about 12 F1 to about 42 F1. Specifically, although BEM outperforms its baselines by around 25 F1, our base and large systems still beat BEM by almost 12 and 18 F1 respectively.
In the third column, we follow previous work and show how different systems perform on ALL_ZSS* (1139 instances, including monosemous ones). The aforementioned gaps become narrower since every system can correctly disambiguate monosemous instances.

Unseen Lemma Disambiguation
The last two columns of Table 4 present the systems' performance on zero-shot lemmas. The difference between these two datasets is whether monosemous lemmas are included. We believe it is more reasonable to focus on ALL_ZSL (222 polysemous instances), since monosemous lemmas do not require disambiguation and thus the statistics on ALL_ZSL* cannot fully reveal the systems' zero-shot word disambiguation ability.
Similarly, recently proposed systems tend to outperform the baselines by large margins, varying from 19 to almost 36 F1. Among them, BEM performs the worst on this dataset, 2.2 F1 lower than a similar system, GlossBERT. In contrast, after incorporating both word-level and sense-level context, our system obtains unprecedented performance on this dataset, being the first system to reach 90 F1 and beating BEM by almost 16 F1. Also, unlike SREF and ARES, our systems do not rely on WordNet or SyntagNet sense relation knowledge.

Cross-lingual All-words WSD
We utilize two multilingual datasets (comprising French-FR, German-DE, Italian-IT, and Spanish-ES subsets) to evaluate the multilingual transferability of our method. Table 5 presents the performance of some recently proposed systems and ours. For our system, the baseline is trained with the same training data as SACE_large+ using XLM-RoBERTa-base, while removing all the proposed components, namely SlC, WlC, and TaM. Among the systems under comparison, all but UKB+Syn utilize English training data. Also, EWISER and MuLaN further employ SemCor and WNGT as their training data, the same as SACE_mul.
SACE_mul obtains a new state-of-the-art on both the combined dataset and most individual datasets, surpassing its direct baseline by 2.4 F1. In detail, the largest margins over the previous best system, about 5.5 F1, are acquired on the Spanish and Italian subsets of SE15, which covers instances of all POS. This reveals the clear advantage of SACE_mul in disambiguating instances of POS other than nouns. In contrast, SACE_mul performs 6.5 F1 lower than MuLaN on the Spanish subset of SE13, which only covers noun instances. In short, SACE_mul is more compatible with real cross-lingual scenarios since it has a strong ability to disambiguate words of different POS.

Analysis
Error Analysis. Comparing the disambiguation results of SACE_base and its baseline (all components removed) reveals that both systems correctly disambiguate 5346 instances in ALL, while 525 and 339 instances are correctly disambiguated only by SACE_base and only by its baseline respectively. Table 6 shows an example (country) that SACE_base falsely predicted: WlC does not manage to retrieve valuable information for disambiguating the word, while injecting some irrelevant context. Table 6 also gives an example of the top related sentences (#47 and #19) of a particular sentence (#10) under disambiguation. Here, church is falsely predicted when WlC is disabled, showing that WlC has detected similar sentences in the same document and incorporated valuable context for context embedding learning. Table 7 provides some examples of synsets that are connected by the selective attention layer, indicating its ability to detect some syntagmatic sense relations and senses of close meaning. The connection is established by taking the largest attention score in a batch after filtering self-connections.

Conclusion
In this paper, we propose an interactive context exploitation method from both word and sense perspectives in a supervised similarity-based WSD architecture. Experiments on English and cross-lingual all-words WSD datasets verify the effectiveness of our approach, which surpasses the previous state of the art by large margins. The results also show that the proposed method confers a strong advantage in few-shot and zero-shot WSD. For future work, we intend to utilize reinforcement learning to enhance the current interactive WSD approach by customizing the context exploitation for different instances. The source code is available at: https://github.com/lwmlyy/SACE.

Ethics Impact Statement
This paper does not involve the presentation of a new dataset, an NLP application, or the utilization of demographic or identity characteristics information. Regarding compute time/power, the proposed system requires fewer GPUs (1 versus 2) and less training time (10 versus about 70 hours) compared with its direct baseline (Blevins and Zettlemoyer, 2020).

Table 6 examples (sentence id, WlC score, sentence):

Sentence under disambiguation: They belong to a group of ringers who drive every Sunday from church to church in a sometimes-exhausting effort to keep the bells sounding in the many belfries of East Anglia.
47 0.969 "The sound of bells is a net to draw people into the church," he says.
19 0.807 Proper English bells are started off in "rounds," from the highest-pitched bell to the lowest -- a simple descending scale using, in larger churches, as many as 12 bells.

Sentence under disambiguation: Immigration policy under Nicolas Sarkozy was criticized from various aspects: a congestion of police, legal and administrative services subjected to a policy of numbers, and the compatibility of that policy with the self-proclaimed status of the country as the country of French human rights.
0 0.384 Is immigration a burden or an opportunity for the economy?
13 0.476 Restraining immigration leads to anaemic growth and harms employment.

A Hyperparameter Search
The bounds for each hyperparameter are listed in table 1, with the configurations of the best performing models underlined. We use the F1 score on SE07 to select the values. All details are given in the source code. Where two numbers are underlined, they are the best settings for the base and large models respectively.

B Experimental Results
In figure 1, we show how SACE_base and SACE_large perform on SE07 at each epoch during training. Both systems reach their optimal performance on SE07 at an early epoch (the 3rd or 4th), indicating that early stopping during training could further improve time efficiency.