Enhancing the Context Representation in Similarity-based Word Sense Disambiguation

In previous similarity-based WSD systems, studies have allocated much effort to learning comprehensive sense embeddings from contextual representations and knowledge sources. However, the context embedding of an ambiguous word is learned using only the sentence in which the word appears, neglecting its global context. In this paper, we investigate the contribution of both the word-level and sense-level global context of an ambiguous word to disambiguation. Experiments show that the Context-Oriented Embedding (COE) can enhance a similarity-based system's performance on WSD by relatively large margins, achieving state-of-the-art results on all-words WSD benchmarks in the knowledge-based category.


Introduction
Word sense disambiguation (WSD) is aimed at selecting the correct sense for a word given its context. Potential senses of a word are from a sense inventory such as WordNet (Miller, 1995). WSD can be classified into lexical sample WSD and all-words WSD. The former focuses on disambiguating some particular words in many sentences, while the latter conducts WSD on every ambiguous word in the provided text.
The nature of all-words WSD makes the task more compatible with downstream applications. Nevertheless, the task is more difficult (Pradhan et al., 2007), although it also provides more context information (rather than a single sentence).

Utilizing such global context can assist the systems to tackle WSD from an overall perspective.
The recent development of contextual representation models has accelerated the progress of WSD. Many systems tackle WSD by employing BERT (Devlin et al., 2019), either by extracting features (Vial et al., 2019; Loureiro and Jorge, 2019) or by fine-tuning (Peters et al., 2019; Levine et al., 2020). However, these systems are mostly implemented with a single-sentence context, especially those (Huang et al., 2019; Blevins and Zettlemoyer, 2020) that fine-tune BERT. As for the others (Scarlini et al., 2020a; Wang, 2020; Scarlini et al., 2020b), effort is devoted to constructing sense embeddings using WordNet or SemCor (Miller et al., 1994), while context embeddings for ambiguous words are still learned from a single sentence. This leads to an imbalance between the information carried by the context embeddings and that carried by the sense embeddings.
In this paper, we introduce COE, a context-oriented embedding technique that learns comprehensive context representations for ambiguous words. It enhances the context embeddings by considering both the global and local sentences in the provided document. In summary, our approach makes the following contributions:
• We propose a novel technique to capture both local and global context information for context representation learning. The obtained context embeddings are further enhanced with the embeddings of senses that appear in the context.
• We show that the proposed technique can elevate previous systems' performance on all-words WSD to a new state of the art in the knowledge-based category.

Similarity-based WSD
Given a document that contains several sentences $\{t_1, \dots, t_n\}$, a system is required to determine the correct sense $s_{i,j}$ of each ambiguous word $w_{i,j}$ (the $j$-th ambiguous word in sentence $t_i$), where $s_{i,j}$ is one of the potential senses $S_{i,j}$ retrieved from WordNet by the lemma and part-of-speech (POS) of $w_{i,j}$. In previous similarity-based WSD models (Loureiro and Jorge, 2019; Scarlini et al., 2020a; Scarlini et al., 2020b), sense embeddings of all WordNet senses are first learned using their definitions and other available resources. Then, in order to disambiguate $w_{i,j}$, the sense embedding $e_s$ of each potential sense $s \in S_{i,j}$ is retrieved from the learned sense embedding pool, and the dot product between $e_s$ and the context embedding $c_{i,j}$ is used to select the optimal sense $\hat{s}_{i,j}$:

$$\hat{s}_{i,j} = \operatorname*{arg\,max}_{s \in S_{i,j}} e_s \cdot c_{i,j} \quad (1)$$

where $c_{i,j}$ is learned using only the sentence $t_i$ in which $w_{i,j}$ appears.
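The sense selection in formula (1) can be sketched as a dot-product argmax over the candidate sense embeddings. The following is a minimal illustration; the 3-dimensional vectors and sense keys are toy stand-ins, not real BERT-derived or WordNet embeddings:

```python
import numpy as np

def disambiguate(context_emb, candidate_senses):
    """Pick the sense whose embedding has the highest dot product
    with the context embedding, as in formula (1)."""
    # candidate_senses: dict mapping a sense key to its embedding vector.
    scores = {k: float(np.dot(context_emb, v)) for k, v in candidate_senses.items()}
    return max(scores, key=scores.get)

# Toy example with made-up 3-dimensional embeddings.
ctx = np.array([1.0, 0.0, 0.5])
senses = {
    "bank.n.01": np.array([0.9, 0.1, 0.4]),   # stand-in: riverbank sense
    "bank.n.02": np.array([-0.2, 0.8, 0.1]),  # stand-in: financial sense
}
best = disambiguate(ctx, senses)
```

In a real system the pool of sense embeddings is precomputed once, and only the lookup and dot products run at disambiguation time.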
Typically, $c_{i,j}$ is the sum of BERT's last four layers at the position of $w_{i,j}$, taking $t_i$ as the input. When $w_{i,j}$ is tokenized into several word pieces, the sum of all the pieces' embeddings is taken as $c_{i,j}$. However, this naïve context representation learning process limits the system's ability to capture global context information. To relieve this issue, we devise several methods to learn more comprehensive context embeddings by combining $t_i$ with the other sentences in the same document. Note that this work does not involve any attempt at sense embedding learning.
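Assuming BERT's per-layer hidden states are available as a single array, the last-four-layers pooling described above might look like the sketch below. The random array is a stand-in for actual model output, and the function name is ours:

```python
import numpy as np

def context_embedding(hidden_states, piece_positions):
    """Sum the last four layers, then sum over the target word's subword
    pieces to get a single context vector for that word.
    hidden_states: (num_layers, seq_len, dim) array of BERT outputs;
    piece_positions: sequence indices of the word's wordpieces."""
    last_four = hidden_states[-4:].sum(axis=0)     # (seq_len, dim)
    return last_four[piece_positions].sum(axis=0)  # (dim,)

# Stand-in for BERT output: 13 layers (embeddings + 12), 6 tokens, 4 dims.
rng = np.random.default_rng(0)
h = rng.normal(size=(13, 6, 4))
emb = context_embedding(h, [2, 3])  # word split into pieces at positions 2, 3
```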

Local Context Embedding
Following the approaches in prior works (Agirre et al., 2018), we utilize the directly surrounding sentences $\{t_{i-m}, \dots, t_{i-1}, t_{i+1}, \dots, t_{i+m}\}$ of the ambiguous sentence $t_i$ for a more effective local context embedding. We use a development set to select the optimal number $m$ of surrounding sentences on each side of $t_i$, and use the expanded sentence set as BERT's input to obtain the local context embedding $c^{l}_{i,j}$.
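A minimal sketch of the surrounding-sentence window, assuming sentences are held in a list and m is the development-set-tuned window size; clipping at document boundaries is our assumption:

```python
def local_window(sentences, i, m):
    """Return sentence i together with up to m surrounding sentences on
    each side, in document order, for the local context embedding."""
    lo = max(0, i - m)
    hi = min(len(sentences), i + m + 1)
    return sentences[lo:hi]

doc = ["s0", "s1", "s2", "s3", "s4"]
window = local_window(doc, 2, 1)
```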
Global Context Embedding

Apart from the sentences that lie in the same small window as the ambiguous sentence $t_i$, distant sentences are also beneficial for understanding the words in $t_i$ in many cases. Here, we transform the problem into a sentence selection problem, i.e., determining which sentences can best incorporate global context information for the disambiguation of the words in $t_i$.
We hence formally define the problem as follows: for each sentence $t_i \in \{t_1, t_2, \dots, t_n\}$ under disambiguation, we aim to rank the other sentences in the same document according to their contributions from different perspectives. Then, we use $t_i$ and its top-ranked sentences to learn the global context embedding $c^{g}_{i,j}$. We design three methods to rank the sentences: word overlap (WO), TF-IDF weighted word overlap (TF-IDF WO), and gloss-expanded word overlap (GeWO).
• Word overlap: the overlap count between $t_k$ and $t_i$, i.e., the sum of the number of times that $t_i$'s words appear in $t_k$.
• TF-IDF weighted word overlap: we regard each sentence $t \in \{t_1, t_2, \dots, t_n\}$ as a document and calculate the TF-IDF score of each word in the sentences; the TF-IDF score is then used to weight the overlap count between $t_k$ and $t_i$ for each word. The score of $t_k$ with respect to $t_i$ is calculated as follows:

$$\mathrm{score}(t_k, t_i) = \sum_{w \in t_i} \mathrm{TFIDF}(w) \cdot \mathrm{count}(w, t_k) \quad (2)$$

• Gloss-expanded word overlap: we first expand each sentence $t \in \{t_1, t_2, \dots, t_n\}$ with the definition words of the synset of each monosemous word $w$, and then calculate the overlap between the expanded $t_k$ and $t_i$.

After we obtain the score of each sentence $t_k \in \{t_1, \dots, t_{i-1}, t_{i+1}, \dots, t_n\}$ with respect to $t_i$, we rank the sentences by their scores and combine $t_i$ with its top related sentences to learn a global context embedding. We note that the sentence order is maintained when using them to learn the context embedding. For instance, if $t_{i-4}$ and $t_{i+9}$ are the top-2 related sentences of $t_i$, we take $\{t_{i-4}, t_i, t_{i+9}\}$ as BERT's input for learning the global context embedding of each word in $t_i$. We also employ a development set to acquire the optimal number of related sentences for global context embedding learning.
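The WO and TF-IDF WO ranking methods above can be sketched as follows; the tokenization (pre-split word lists), the per-sentence TF-IDF formulation, and all function names are our assumptions for illustration:

```python
import math
from collections import Counter

def word_overlap(target, other):
    """WO: total number of times the target sentence's words appear
    in the other sentence (both given as lists of words)."""
    counts = Counter(other)
    return sum(counts[w] for w in target)

def tfidf_weights(sentences):
    """Treat each sentence as a document and compute a TF-IDF weight
    for every word it contains."""
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s))
    return [{w: tf / len(s) * math.log(n / df[w])
             for w, tf in Counter(s).items()} for s in sentences]

def rank_sentences(sentences, i, top_k, score_fn):
    """Score every other sentence against sentence i, keep the top_k,
    and return them merged with sentence i in original document order."""
    scores = [(score_fn(sentences[i], s), j)
              for j, s in enumerate(sentences) if j != i]
    top = [j for _, j in sorted(scores, reverse=True)[:top_k]]
    keep = sorted(set(top) | {i})
    return [sentences[j] for j in keep]

doc = [["the", "bank", "approved", "the", "loan"],
       ["she", "sat", "by", "the", "river", "bank"],
       ["interest", "rates", "rose"],
       ["the", "loan", "came", "from", "the", "bank"]]
selected = rank_sentences(doc, 0, 1, word_overlap)  # BERT input, in order
```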
Sense-aware Context Embedding

In most cases, the words in a given document are not all polysemous; indeed, 16.4% of the words in SemCor are monosemous. These monosemous words can provide general background information about the whole document. Here, we utilize the sense embeddings of the monosemous words to compose a sense-aware context embedding $c^{s}_{i,j}$.
In detail, the sense embeddings of all the monosemous words in the same document as $w_{i,j}$ are added together to obtain $c^{s}_{i,j}$, and only when $w_{i,j}$ is a noun or a verb. This is because the disambiguation of adjectives and adverbs tends to rely more on local context information, namely which word (noun or verb) in the same sentence the modifier (adjective or adverb) attaches to. We note that, for the knowledge-based approach, we also use the sense embedding of the WordNet 1st sense for the polysemous words in the document.
We combine the above local, global, and sense-aware context embeddings after normalization to get the final enhanced context embedding $\bar{c}_{i,j}$:

$$\bar{c}_{i,j} = \mathrm{norm}(c^{l}_{i,j}) + \mathrm{norm}(c^{g}_{i,j}) + \mathrm{norm}(c^{s}_{i,j}) \quad (3)$$
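A minimal sketch of the combination in formula (3), under the assumption that "normalization" means L2-normalizing each component before summation:

```python
import numpy as np

def combine(local_emb, global_emb, sense_emb):
    """Hypothetical realization of formula (3): L2-normalize each
    component embedding, then sum them into the enhanced context
    embedding."""
    norm = lambda v: v / np.linalg.norm(v)
    return norm(local_emb) + norm(global_emb) + norm(sense_emb)

# Orthogonal toy vectors: each contributes a unit vector after scaling.
e = combine(np.array([2.0, 0.0, 0.0]),
            np.array([0.0, 3.0, 0.0]),
            np.array([0.0, 0.0, 4.0]))
```

Normalizing first keeps any one component (e.g., a long global context) from dominating the sum through sheer magnitude.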

Try-again Mechanism (TaM)
Wang and Wang (2020) proposed a try-again mechanism that exploits WordNet synset relations and super-sense connections to conduct a second round of WSD. Precisely, when disambiguating $w_{i,j}$, the method takes two similarity scores into account.
One is from formula (1). The other is calculated from a broader perspective, i.e., the maximal similarity between $c_{i,j}$ and the related synsets of a potential sense $s \in S_{i,j}$. These related synsets are connected to $s$ by WordNet synset relations and the super-sense connection.
Here, synsets that are in the same super-sense category are regarded as connected by the super-sense connection. For example, toy.n.03 (toy) {a device regarded as providing amusement} and bell.n.01 (bell) {a hollow device made of metal that makes a ringing sound when struck} are both in the super-sense category noun.artifact. Formula (4) illustrates the final WSD calculation. The method manages to boost the knowledge-based system's performance by a relatively large margin, while slightly hurting the performance of the supervised system.
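A hedged sketch of how the two scores of the try-again mechanism could be combined; the additive combination and the function name are our assumptions (the paper's formula (4) gives the exact calculation):

```python
import numpy as np

def try_again_score(ctx, sense_emb, related_embs):
    """Combine the direct similarity of formula (1) with the maximal
    similarity over the sense's related synsets (connected via WordNet
    synset relations and the shared category connection)."""
    direct = float(np.dot(ctx, sense_emb))
    related = max((float(np.dot(ctx, r)) for r in related_embs), default=0.0)
    return direct + related

ctx = np.array([1.0, 0.0])
score = try_again_score(ctx, np.array([0.5, 0.0]),
                        [np.array([0.2, 0.0]), np.array([0.9, 0.0])])
```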
We improve the original mechanism by utilizing a higher-quality synset categorization, the coarse sense inventory (CSI; Lacerra et al., 2020). CSI defines 45 labels in its inventory and covers 83,000 WordNet synsets. We replace the super-sense connection with CSI in the modified try-again mechanism. The revised mechanism leads our model to better performance.

Datasets and Systems
We use the evaluation framework in (Raganato et al., 2017b) to evaluate our method's effectiveness.
In the following sections, we report the performance of systems in the knowledge-based category on the all-words WSD task, in comparison with ours. Throughout the paper, we utilize the knowledge-based version of the SREF sense embeddings to validate the effectiveness of our method.
Besides the knowledge-based version, we also implement the proposed method in some supervised similarity-based systems, achieving better performance than their original versions; however, the margin is not significant. Details are given in the Appendix. Table 1 presents the ablation study on the combined dataset (ALL). An overall conclusion can be drawn that each of the proposed factors raises the system's performance. F1 is reported in percentage in all tables.

Ablation Analysis
As one can see, although the sense-aware context embedding is simple and easy to implement, this strategy alone enhances the system's performance by 2 F1. This substantial contribution is owed to the fine quality of the sense embeddings and the employment of the WordNet 1st senses, an essential piece of prior knowledge in WordNet. As for the other two factors, which concern context sentence usage, the contribution of each is less significant.
Viewed from another perspective, when both the local and global context embeddings are removed, the performance drop exceeds that of the system that ignores the sense-aware embedding. This illustrates that both word-level and sense-level context embeddings are crucial for WSD. It is interesting to note that merely adding the sense-aware context embedding can cancel out the contribution of TaM, which makes the last two systems (which use only $c^{s}_{i,j}$ as the context embedding) perform identically on ALL.
In Table 2, the performance of COE_kb on ALL shows that the simplest strategy (WO) leads to the best performance, although the margin is not significant. Table 3 shows how different systems perform on several partitions of ALL. Our system produces a new state of the art in both categories.

Overall Results
The knowledge-based version of our system, COE_kb, outperforms the previous state-of-the-art system (SREF) on ALL by a relatively large margin of 2.8 F1. In terms of per-POS performance, COE_kb is the first system to reach 80 F1 on noun disambiguation, surpassing the previous state of the art by 3.1 F1.
In fact, the performance of COE_kb exceeds that of many supervised systems, including GLU. GLU utilizes BERT as a feature extraction tool in a supervised manner. The fact that it relies solely on SemCor hampers the system's generalization ability, since SemCor covers only a small proportion of WordNet senses. Systems such as EWISE and GLU that fail to incorporate WordNet knowledge (especially definitions) perform poorly on SE13 and cannot outperform many recently proposed knowledge-based systems such as SyntagNet, SREF and COE. The performance of the systems in the supervised category is shown in the Appendix.

Table 3: All-words WSD performance on different partitions of ALL, including dataset and POS (noun-N, verb-V, adjective-A and adverb-R) partitions. * indicates performance obtained (partially) on a development set. Bold and underlined figures represent the current and previous state-of-the-art performance, respectively.

Case Study
In this subsection, we compare the experimental results of SREF_kb and COE_kb in detail, so as to find out on which aspects COE_kb performs well and poorly, respectively. Table 4 shows the number of instances in ALL that are correctly disambiguated by SREF_kb or COE_kb only (non-overlapping). It also details the ambiguity (average number of potential senses per instance) and POS proportions of these instances.
A key finding is that COE_kb does not outperform SREF_kb incrementally: COE_kb falsely predicts many (310) instances that are correctly predicted by SREF_kb. Thus, although COE_kb can disambiguate more ambiguous instances, it has somewhat compromised the ability to disambiguate easier instances. This raises the question of how to customize context exploitation for different instances. Nevertheless, the POS proportions of the instances that are only correctly predicted by each model are almost identical.

Error analysis
In Table 5, a falsely predicted example from SE15 is given (among others) to demonstrate the kind of instance that COE_kb is typically weak at disambiguating. As shown, the similarities of the top-ranked senses to the context of contact are very close to each other. This is logical, since the definitions of these senses are semantically similar and hard to distinguish even for human beings.
The above dilemma raises concerns about whether the systems have reached the upper bound of their capability, 80%. This is the estimated inter-annotator agreement in Navigli (2009), i.e., the percentage of words tagged with the same sense by two or more human annotators. Further, if a system's performance exceeds this upper bound, is it because of overfitting? To tackle this issue, a plausible choice might be to construct a coarse-grained sense inventory, similar to Navigli et al. (2007). This might also lead to an easier application of WSD to downstream tasks.

Conclusion
In this paper, we have presented COE, a context-oriented embedding technique for similarity-based WSD systems. It takes better advantage of both the word-level and sense-level information in the document where an ambiguous word appears. Experiments have shown that the proposed method can enhance a system's performance on all-words WSD by relatively large margins, and the ablation study has shown the contribution of each proposed factor. The source code will be made available on GitHub for further development.

Ethics Impact Statement
This paper does not involve the presentation of a new dataset, an NLP application, or the utilization of demographic or identity characteristics information.
Table 5: A falsely predicted example from SE15.
lemma: contact (semeval2015.d003.s022.t005, noun)
sentence (lemmatized): what be the precaution for the person who give the medicine or come_into contact with the animal?
senses (with similarity scores):
  contact.n.01 (2.202): close interaction
  contact.n.02 (2.182): the state or condition of touching or of being in immediate proximity
  contact.n.03 (2.174): the physical coming together of two or more things
  contact.n.04 (2.168): the act of touching physically
  contact.n.05 (2.113): (electronics) a junction where things (as two electrical conductors) touch or are in physical contact

To implement the supervised version of our system, we utilize the supervised sense embeddings from SREF_sup. For COE_sup, the context embedding is the concatenation of two embeddings: one from COE_kb ($\bar{c}_{i,j}$) and the other ($c_{i,j}$) from the output of BERT using only the original sentence as input. This guarantees an information symmetry of the embeddings, since the LMMS supervised sense embeddings in SREF_sup are learned from SemCor with one sentence as input at a time. The calculation before TaM is shown in formula (5). To be consistent with the sense embeddings, we use BERT_LARGE_CASED to learn the context embeddings.

The compared supervised systems include those of Scarlini et al. (2020a), SREF, ARES (Scarlini et al., 2020b), BEM (Blevins and Zettlemoyer, 2020) and EWISER (Bevilacqua and Navigli, 2020). In this category, we only report the performance obtained by using SemCor as the training set, for a fair comparison.
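The concatenation described above can be sketched as follows; the function name and the use of plain vector concatenation are our assumptions (formula (5) in the paper gives the exact calculation before TaM):

```python
import numpy as np

def supervised_context(coe_emb, single_sentence_emb):
    """Sketch of the COE_sup context embedding: concatenate the enhanced
    COE embedding with a plain single-sentence BERT embedding, mirroring
    the structure of the supervised sense embeddings learned from
    one-sentence inputs."""
    return np.concatenate([coe_emb, single_sentence_emb])

v = supervised_context(np.zeros(3), np.ones(2))
```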

Overall Performance
In Table 6, COE_sup outperforms its direct competitor, SREF, by 1.8 F1, although the margin over the newly proposed systems that fine-tune BERT (e.g., BEM) is smaller. BEM fine-tunes two separate BERT models to encode contexts and glosses, respectively. Its whole training process takes 2 to 3 days on 2 GPUs, which is comparatively expensive in terms of time and hardware. In comparison, COE_kb and COE_sup take less than half an hour to learn all the necessary sense embeddings.

Rare Lemma or Sense disambiguation
In this subsection, we conduct two experiments on rare sense or lemma disambiguation. Following the setting in SREF and other previous works, we partition ALL into two subsets according to the gold label of each lemma: one contains the lemmas whose gold sense is ranked 1st in WordNet (ALL_WN_1st), and the other contains the rest (ALL_WN_other). The 1st sense of each lemma in WordNet can be regarded as the most frequent sense (MFS); it was manually sorted using statistics from sense-annotated corpora. Table 7 compares different systems' performance on the two subsets of ALL. It shows that COE_kb obtains an advantageous position in disambiguating rare senses, scoring 2.5 F1 higher than its direct competitor, SREF_kb, while maintaining better performance on lemmas of the MFS. COE_sup also outperforms its direct competitor, SREF_sup, on rare sense disambiguation, with an even larger margin of 3 F1. In comparison to BEM, our system scales much better to unseen or rare senses while retaining a competitive capability of disambiguating the MFS.
Following Scarlini et al. (2020b), we also conduct an experiment on the lemmas or senses that are in ALL but not in the training data, SemCor. For zero-shot lemmas/words, 1,139 instances are extracted from ALL (ALL_LFW). For senses that do not appear in SemCor, we extract 222 polysemous instances from ALL (ALL_LFS). Table 8 shows that COE_kb attains the best performance on both subsets, outperforming SREF_kb by 1.6 and 4.9 F1 on ALL_LFS and ALL_LFW, respectively. The margin between COE_kb and other newly proposed systems is even larger, revealing the great potential of our system for zero-shot learning in WSD. It is also worth mentioning that COE_sup performs 8.4 F1 lower than the knowledge-based version on ALL_LFS. This raises the question of how to balance the exploitation of sense embeddings learned from SemCor against WordNet knowledge. In addition, an essential conclusion can be drawn that the knowledge-based systems (SREF_kb and COE_kb) have an overwhelming advantage in zero-shot sense disambiguation. Table 9 shows the performance of our systems using different sense embeddings, compared with the original system. The proposed method proves valid and robust with three different sense embedding sets, with the largest margin obtained in the knowledge-based category.

Table 6: All-words WSD performance for both supervised (Sup.) and knowledge-based (Know.) categories on different partitions of ALL, including dataset and POS (noun-N, verb-V, adjective-A and adverb-R) partitions. * indicates performance obtained (partially) on a development set. Bold and underlined figures represent the current and previous state-of-the-art performance, respectively.