Framing Word Sense Disambiguation as a Multi-Label Problem for Model-Agnostic Knowledge Integration

Recent studies treat Word Sense Disambiguation (WSD) as a single-label classification problem in which one is asked to choose only the best-fitting sense for a target word, given its context. However, gold data labelled by expert annotators suggest that maximizing the probability of a single sense may not be the most suitable training objective for WSD, especially if the sense inventory of choice is fine-grained. In this paper, we approach WSD as a multi-label classification problem in which multiple senses can be assigned to each target word. Not only does our simple method bear a closer resemblance to how human annotators disambiguate text, but it can also be seamlessly extended to exploit structured knowledge from semantic networks to achieve state-of-the-art results in English all-words WSD.


Introduction
Word Sense Disambiguation (WSD) is traditionally framed as the task of associating a word in context with its correct meaning from a finite set of possible choices (Navigli, 2009). Following this definition, recently proposed neural models are trained to maximize the probability of the most appropriate meaning while minimizing the probability of the other possible choices (Huang et al., 2019; Vial et al., 2019; Blevins and Zettlemoyer, 2020, inter alia). Although this training objective has proved to be extremely effective, and has even led to systems reaching the estimated upper bound of inter-annotator agreement for WSD performance on the unified evaluation framework of Raganato et al. (2017b), adhering to it underplays a fundamental aspect of how human annotators disambiguate text. Indeed, past studies have observed that it is not uncommon for a word to have multiple appropriate meanings in a given context, meanings that can be used interchangeably under some circumstances because their boundaries are not clear cut (Tuggy, 1993; Kilgarriff, 1997; Hanks, 2000; Erk and McCarthy, 2009). This is especially evident when the underlying sense inventory is fine-grained, as the complexity, and therefore the performance, of WSD is tightly coupled to sense granularity (Lacerra et al., 2020). The difficulty an annotator faces in choosing the most appropriate meaning from a fine-grained sense inventory becomes clear from an analysis of gold standard datasets: a non-negligible 5% of the target words are annotated with two or more sense labels in several gold standard datasets, including Senseval-2 (Edmonds and Cotton, 2001), Senseval-3 (Snyder and Palmer, 2004), SemEval-2007 (Pradhan et al., 2007), SemEval-2013, and SemEval-2015 (Moro and Navigli, 2015). Therefore, we follow Erk and McCarthy (2009), Jurgens (2012), and Erk et al.
(2013), and argue that forcing a system to treat WSD as a single-label classification problem, and to learn that only one sense is correct for a word in a given context, does not reflect how human beings disambiguate text. In contrast to recent work, we approach WSD as a soft multi-label classification problem in which multiple senses can be assigned to each target word. We show that not only does this simple method bring significant improvements at little or no additional cost in terms of training and inference times and number of trainable parameters, but it can also be seamlessly extended to integrate relational knowledge in structured form, e.g., similarity, hypernymy and hyponymy relations from semantic networks such as WordNet (Miller, 1995) and BabelNet (Navigli and Ponzetto, 2012). While structured knowledge has been naturally utilized by graph-based algorithms for WSD (Agirre and Soroa, 2009; Moro et al., 2014; Scozzafava et al., 2020), the incorporation of such information into neural approaches has recently been garnering significant attention. However, currently available models can only take advantage of this knowledge through purposely-built layers that introduce additional complexity and/or trainable parameters. To the best of our knowledge, the work presented in this paper is the first to integrate structured knowledge into a neural architecture at negligible cost in terms of training time and number of parameters, while at the same time attaining state-of-the-art results in English all-words WSD.

Method
Single-label vs multi-label. WSD is the task of selecting the best-fitting sense s among the possible senses S_w of a target word w in a given context c = w_1, w_2, ..., w_n, where S_w is a subset of a predefined sense inventory S. Abstracting away from the intricacies of any particular supervised model for WSD, the output of a WSD system provides a probability y_i for each sense s_i ∈ S_w. Recently proposed machine learning models (Kumar et al., 2019; Barba et al., 2020; Blevins and Zettlemoyer, 2020, inter alia) are trained to maximize the probability of the single most appropriate sense ŝ by minimizing the cross-entropy loss L_CE:

  L_CE = − log(y_ŝ)    (1)

We observe that this loss function is only suitable for single-label classification problems. In the case of WSD, this is equivalent to assuming that there is just a single appropriate sense ŝ ∈ S_w for the target word w in the given context c, that is, that ŝ is clearly dissimilar from any other sense in S_w. Indeed, minimizing the cross-entropy loss in order to maximize the probability of two or more senses generates conflicting training signals; at the same time, choosing to ignore one of the correct senses results in a loss of valuable information.
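The single-label objective can be sketched in a few lines of plain Python. The sense ids and raw scores below are hypothetical stand-ins; in a real system the logits would come from a neural encoder, with the softmax restricted to the candidate senses S_w of the target word:

```python
import math

def cross_entropy_wsd(logits, candidates, gold):
    # Softmax restricted to the candidate senses S_w, then negative
    # log-likelihood of the single gold sense (the standard L_CE objective).
    m = max(logits[s] for s in candidates)
    exps = {s: math.exp(logits[s] - m) for s in candidates}
    total = sum(exps.values())
    return -math.log(exps[gold] / total)

# Hypothetical sense ids and raw scores for one target word:
logits = {"fox_animal": 2.0, "fox_person": 0.5, "fox_dance": -1.0}
candidates = ["fox_animal", "fox_person", "fox_dance"]
loss_right = cross_entropy_wsd(logits, candidates, "fox_animal")
loss_wrong = cross_entropy_wsd(logits, candidates, "fox_dance")
```

Note that with two equally valid gold senses this objective has no good answer: raising the probability of one candidate necessarily lowers that of the other.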
Since there is a not insignificant number of instances in which multiple similar senses of the target word w fit the given context c (see Section 1), we frame WSD as a multi-label classification problem in which a machine learning model is trained to predict whether a sense s ∈ S_w is appropriate for a word w in a given context c, independently of the other senses in S_w. This is simply equivalent to minimizing the binary cross-entropy loss L_BCE on the probabilities of the candidate senses S_w:

  L_BCE = − Σ_{s_i ∈ Ŝ_w} log(y_i) − Σ_{s_i ∈ S_w \ Ŝ_w} log(1 − y_i)    (2)

where Ŝ_w ⊆ S_w is the set of appropriate senses for the target word w in the given context c. We note that this simple yet fundamental change in paradigm does not come at an increased computational cost, as |S_w| is usually small. Moreover, it is independent of the underlying model used to compute the output probabilities and, therefore, does not increase the number of trainable parameters.
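The multi-label objective described above can be sketched analogously (sense ids and scores again hypothetical). Because each sense receives an independent sigmoid probability, two gold senses no longer generate conflicting training signals:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_wsd(logits, candidates, gold_senses):
    # Each candidate sense in S_w gets an independent sigmoid probability;
    # senses in the gold set are pushed toward 1, the others toward 0.
    loss = 0.0
    for s in candidates:
        y = sigmoid(logits[s])
        loss += -math.log(y) if s in gold_senses else -math.log(1.0 - y)
    return loss

# Hypothetical scores: the model assigns high scores to two valid senses.
logits = {"s1": 3.0, "s2": 2.5, "s3": -2.0}
multi = bce_wsd(logits, ["s1", "s2", "s3"], {"s1", "s2"})
single = bce_wsd(logits, ["s1", "s2", "s3"], {"s1"})  # s2 wrongly treated as negative
```

Here `multi` is small, since both valid senses can score high simultaneously, while `single` is dominated by the penalty for the discarded but equally valid sense s2.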
Knowledge integration. If our model benefits from learning to assign multiple similar senses to a target word in a given context, then it makes sense that the very same model may also benefit from learning what related senses can be assigned to that word. For example, in the sentence "the quick brown fox jumps over the lazy dog", our model may formulate a better representation of fox if it is also trained to learn that any fox is a canine (hypernymy relation) or that the fox species includes arctic foxes, red foxes, and kit foxes (hyponymy relations). In this way, not only would the model learn that canines, foxes and arctic foxes are closely related, but it would also learn that canines and arctic foxes may have the ability to jump, and this could act as a data augmentation strategy especially for those senses that do not appear in the training set.
There is a growing interest in injecting relational information from knowledge bases into neural networks but, so far, recent attempts have required purposely-designed strategies or layers. Among others, Kumar et al. (2019) aid their model with a gloss encoder that uses the WordNet graph structure; Vial et al. (2019) adopt a preprocessing strategy aimed at clustering related senses to decrease the number of output classes; Bevilacqua and Navigli (2020) introduce a logit aggregation layer that takes into account the neighboring meanings in the WordNet graph.
In contrast, our multi-labeling approach to WSD can be seamlessly extended to integrate relational knowledge from semantic networks such as WordNet without any increase in architectural complexity, training time, or number of trainable parameters. We simply relax the definition of the set of possible senses S_w for a word w to include all the senses related to a sense in S_w. More formally, let G = (S, R) be a semantic network where S is a sense inventory and R is the set of semantic connections between any two senses. Then we define S⁺_w to also include every sense s_j that is connected to a sense s_i ∈ S_w by an edge (s_i, s_j) ∈ R. The loss function is updated accordingly to maximize not only the probability of the correct senses, but also the probability of their related senses:

  L⁺_BCE = − Σ_{s_i ∈ Ŝ⁺_w} log(y_i) − Σ_{s_i ∈ S⁺_w \ Ŝ⁺_w} log(1 − y_i)    (3)

where Ŝ⁺_w is the set Ŝ_w extended with the senses related to its members. We note that the increase in the number of possible choices (|S⁺_w| ≥ |S_w|) and correct meanings (|Ŝ⁺_w| ≥ |Ŝ_w|) does not hinder the learning process, since each probability is computed independently of the others. Finally, we stress that our approach to structured knowledge integration is completely model-agnostic, as it is independent of the architecture of the underlying supervised model.
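The candidate-set expansion can be sketched as a one-hop traversal of the semantic network. The adjacency map and sense ids below are toy stand-ins for WordNet relations such as similarity, hypernymy, and hyponymy:

```python
def expand_senses(candidates, gold_senses, relations):
    # Build the expanded candidate set S_w+ (and the expanded gold set) by
    # adding every sense reachable through one semantic-network edge;
    # `relations` is a toy adjacency map standing in for WordNet edges.
    expanded = set(candidates)
    for s in candidates:
        expanded.update(relations.get(s, ()))
    expanded_gold = set(gold_senses)
    for s in gold_senses:
        expanded_gold.update(relations.get(s, ()))
    return expanded, expanded_gold

# Hypothetical senses: the animal sense of "fox" is linked to its hypernym
# "canine" and its hyponym "arctic_fox" in the toy network.
relations = {"fox_animal": ["canine", "arctic_fox"]}
S_plus, gold_plus = expand_senses({"fox_animal", "fox_person"},
                                  {"fox_animal"}, relations)
```

Because "canine" and "arctic_fox" enter the gold set whenever the animal sense of "fox" does, a training example for "fox" also provides a (weaker) training signal for senses that may never appear in the annotated corpus.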
Model description. In order to assess the benefits of our multi-labeling approach, and to avoid improvements that may not be related to the overall objective of this paper, we conduct our experiments with a simple WSD model. Similarly to prior work, this model is simply composed of BERT (large-cased, frozen), a non-linear layer, and a linear classifier. Thus, given a word w in context, we build a contextualized representation e_w ∈ R^{d_BERT} of the word w as the average of the corresponding hidden states of the last four layers of BERT, apply a non-linear transformation to obtain h_w ∈ R^{d_h} with d_h = 512, and finally apply a linear projection to o_w ∈ R^{|S|} to compute the sense scores. More formally:

  e_w = (1/4) Σ_{i=1}^{4} b⁻ⁱ_w
  h_w = Swish(BatchNorm(W_h e_w))
  o_w = W_o h_w

where b⁻ⁱ_w is the hidden state of the i-th layer of BERT counting from the topmost one, BatchNorm(·) is the batch normalization operation, and Swish(x) = x · sigmoid(x) is the Swish activation function (Ramachandran et al., 2017).
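A dependency-free sketch of this classification head follows, with toy dimensions in place of the real ones (d_BERT = 1024 for BERT-large, d_h = 512, |S| the inventory size); the weight matrices `W_h` and `W_o` are arbitrary stand-ins for the learned parameters, and the learned affine part of batch normalization is omitted for brevity:

```python
import math

def swish(x):
    return x * (1.0 / (1.0 + math.exp(-x)))  # Swish(x) = x * sigmoid(x)

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def batch_norm(batch, eps=1e-5):
    # Per-feature normalization over the batch (no learned affine).
    d = len(batch[0])
    means = [sum(row[j] for row in batch) / len(batch) for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in batch)
                      / len(batch) + eps) for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in batch]

def wsd_head(last4, W_h, W_o):
    # last4: for each target word, its hidden states from BERT's last four
    # layers (each a d_BERT-dimensional vector); BERT itself stays frozen.
    e = [[sum(vals) / 4.0 for vals in zip(*layers)] for layers in last4]  # e_w
    pre = [matvec(W_h, ew) for ew in e]
    h = [[swish(v) for v in row] for row in batch_norm(pre)]              # h_w
    return [matvec(W_o, hw) for hw in h]                                  # o_w

# Toy setup: two target words, d_BERT = 2, d_h = 2, |S| = 3.
last4 = [[[1.0, 0.0]] * 4, [[0.0, 1.0]] * 4]
W_h = [[0.5, -0.5], [1.0, 1.0]]
W_o = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores = wsd_head(last4, W_h, W_o)  # one row of |S| sense scores per word
```

At training time only `W_h`, `W_o`, and the batch-norm parameters would be updated; the candidate mask (restricting scores to S_w or S⁺_w) is applied on top of `o_w`.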

Experiments and Results
Experimental setup. We train our models in different configurations to assess the individual contribution of several factors. First, we compare our baseline model trained with a single-label objective (Equation 1) to the same model trained with a multi-label objective (Equation 2). Then, we gradually include structured knowledge in the form of WordNet relations using Equation 3, starting from similarity relations (similar-to, also-see, verb-group, and derivationally-related-form), and incrementally including generalization and specification relations (hypernymy, hyponymy, instance-hypernymy, instance-hyponymy). In order to keep a level playing field with single-label systems, we choose only the meaning with the highest probability for our multi-label models.
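The level-playing-field decoding step can be sketched as follows, assuming (as the setup above implies) that predictions are restricted to the word's own candidate senses S_w, so that related senses added only for training cannot be emitted as answers; the sense ids and probabilities are hypothetical:

```python
def predict_sense(probs, candidates):
    # Decoding for fair comparison with single-label systems: return the
    # single highest-probability sense among the word's original candidate
    # set S_w, ignoring neighbors added via Equation 3 at training time.
    return max(candidates, key=lambda s: probs[s])

# Hypothetical sigmoid probabilities; "canine" is a related sense that lies
# outside S_w and is therefore never a valid answer for "fox".
probs = {"fox_animal": 0.91, "fox_person": 0.34,
         "fox_dance": 0.05, "canine": 0.95}
prediction = predict_sense(probs, ["fox_animal", "fox_person", "fox_dance"])
```

Even though the related sense "canine" receives the highest raw probability, the decoder only ever returns a member of S_w, here "fox_animal".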
Datasets. We evaluate the models on the Unified Evaluation Framework for English all-words WSD proposed by Raganato et al. (2017b). This evaluation includes five gold standard datasets, namely, Senseval-2, Senseval-3, SemEval-2007, SemEval-2013, and SemEval-2015. Following standard practice, we use the smallest gold standard, SemEval-2007, as our development set, and the remaining ones as test sets. We distinguish between two settings: closed and open. In the former setting, we include systems that only use SemCor (Miller et al., 1994) as the training corpus, while in the latter we also include those systems that use WordNet glosses and examples and/or Wikipedia.
Hyperparameters. We use the pretrained version of BERT-large-cased (Devlin et al., 2019) available in HuggingFace's Transformers library (Wolf et al., 2020) to build our contextualized embeddings (Section 2). BERT is left frozen, that is, its parameters are not updated during training. Each model is trained for 25 epochs using Adam (Kingma and Ba, 2015) with a learning rate of 10⁻⁴. We avoid hyperparameter tuning and opt for values that are close to the ones reported in the literature so as to have a fairer comparison.
Comparison systems. In order to provide a comprehensive comparison with the current state of the art in WSD, we include several recently proposed systems alongside the aforementioned work of Vial et al. (2019) and Blevins and Zettlemoyer (2020).
The systems are divided into two groups in Table 1: in the upper part we compare our approach against those systems that do not take advantage of information coming from WordNet glosses and/or examples, while in the lower part we also include those systems that make use of such knowledge.
Results. The first two rows of Table 2 show the results of switching from a single-label to a multi-label approach for WSD: this single change already brings a significant improvement in performance (+1.0% in F1 score, significant with p < 0.1, χ² test). Not only that, but increasing the number and variety of WordNet relations further improves the performance of the model, with hyponyms being particularly beneficial (+0.8% in F1 score). Unfortunately, including instance hypernyms and instance hyponyms does not bring further improvements; this may be due to the relatively low number of instances that can take advantage of such relations in SemCor. Nonetheless, the results obtained set a new state of the art among single and ensemble systems trained only on SemCor, without the use of additional training data or resources external to WordNet such as Wikipedia, surpassing the previous state-of-the-art non-ensemble system of Vial et al. (2019) by 2.0% in F1 score (significant with p < 0.05, χ² test), as shown in Table 1. When further trained on the WordNet glosses and examples, our model attains state-of-the-art results (+1.2% and +0.1% in F1 score compared to the systems of Blevins and Zettlemoyer (2020) and Bevilacqua and Navigli (2020), respectively), despite being simpler than most of the techniques it is compared against.

Conclusion
WSD is a key task in Natural Language Understanding with several open challenges and with the granularity of sense inventories being undoubtedly the most pressing issue (Navigli, 2018). We departed from recent work on WSD and investigated the effect of tackling the task as a multi-label classification problem. Not only is our approach simple and model-agnostic, but it can also be seamlessly extended to integrate relational knowledge in structured form from semantic networks such as WordNet, and at no extra cost in terms of architectural complexity, training times, and number of parameters.
Our experiments show that our method, thanks to its more comprehensive notion of loss over equally valid and structurally-related senses, achieves state-of-the-art results in English all-words WSD, especially when smaller amounts of annotated text are available. These results open the path to further research in this direction, from exploring more complex models and richer knowledge bases to exploiting multiple labels in innovative disambiguation settings which can overcome the fine granularity of sense inventories. Moreover, our knowledge integration approach could potentially be applied to address the knowledge acquisition bottleneck in multilingual WSD (Pasini, 2020; Pasini et al., 2021). Finally, with the rise of ever more complex general and specialized pretrained models, we believe that our simple model-agnostic approach can be another step towards knowledge-based (self-)supervision.