Integrating Personalized PageRank into Neural Word Sense Disambiguation

Neural Word Sense Disambiguation (WSD) has recently been shown to benefit from the incorporation of pre-existing knowledge, such as that coming from the WordNet graph. However, state-of-the-art approaches have been successful in exploiting only the local structure of the graph, with only close neighbors of a given synset influencing the prediction. In this work, we improve a classification model by recomputing logits as a function of both the vanilla independently produced logits and the global WordNet graph. We achieve this by incorporating an online, neurally approximated PageRank, which also enables us to refine edge weights. This method exploits the global graph structure while keeping space requirements linear in the number of edges. We obtain strong improvements, matching the current state of the art. Code is available at https://github.com/SapienzaNLP/neural-pagerank-wsd


Introduction
Word Sense Disambiguation (WSD) is a task with a long history in Natural Language Processing (Bevilacqua et al., 2021). It addresses the pervasive phenomenon of (lexical) ambiguity, whereby the same polysemous word or expression conveys different meanings in different contexts. Natural Language Processing (NLP) tends to maintain the working assumption that word meaning can be discretized into a finite number of classes, thus casting polysemy resolution as a multi-class classification problem, where the classes, i.e., the senses, are specific to a word. Senses are registered in a dictionary-like resource called the sense inventory. In English WSD the sense inventory is virtually always WordNet (Miller et al., 1990), which, for example, lists separately the fish and musical instrument senses of the word "bass".
The WSD task is currently dominated by supervised methods (Yap et al., 2020; Blevins and Zettlemoyer, 2020). These methods, thanks, inter alia, to the game-changing effect of pretrained language models, have widened the margin over so-called knowledge-based approaches (Agirre and Soroa, 2009; Moro et al., 2014; Scozzafava et al., 2020), which usually disambiguate using only global graph information, a source of information that, however, is not easy to integrate explicitly into supervised WSD. Although there have been a few successful approaches integrating graphs into a standard neural classification architecture (Conia and Navigli, 2021; Bevilacqua and Navigli, 2020), such methods only exploit the local relational structure, leaving the global structure unused. Thus, while the model is able to directly utilize the explicit knowledge that a cairn is a terrier, it is not able to also utilize the fact that, following the hyponymy relation, a cairn is also a dog.
In this paper we propose a method for integrating global graph information into neural supervised WSD through a Personalized PageRank (Page et al., 1999, PPR) approximation, blurring the distinction between knowledge-based and supervised methods. We achieve this by generalizing the logit aggregation scheme used on top of a feedforward classifier by Bevilacqua and Navigli (2020). Our proposed method is simple and extensible, and could be adapted to work with other classifiers as well. We match the performance of the current state of the art in WSD (Blevins and Zettlemoyer, 2020) while using only 35% (60M) of the trainable parameter count. When training on additional data, our technique outperforms the previous state of the art in that setting (Conia and Navigli, 2021).

Method
In this section we explain our method, which builds on top of previous approaches to WSD.

Neural WSD
The most popular formulation of WSD employs token-level classifiers, which encode each token t_i into a vector e_i using a sequence encoder, e.g. an LSTM (Raganato et al., 2017b) or, more recently, pre-trained Transformer-based contextualized embeddings (Hadiwinoto et al., 2019). The vector is then fed to one or more feedforward (FFN) layers, producing a softmax-normalized probability distribution over all the output classes, i.e., all the synsets (groups of senses sharing the same meaning) in WordNet:

P(s | t_i) = softmax(FFN(e_i))_s,  s ∈ I(·)    (1)

where I is an inventory function that returns the set of possible synsets for a token, and I(·) denotes the set of all WordNet synsets. At test time, the predicted synset ŝ_i is the one with the highest probability, searching only among the set of synsets that are consistent with the current token (I(t_i)):

ŝ_i = argmax_{s ∈ I(t_i)} P(s | t_i)    (2)

This formulation is very straightforward, but also wasteful, as scores are computed for "impossible" synsets, i.e. those ∉ I(t_i).
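As a toy illustration (not the authors' code; names and values are invented), the candidate-restricted argmax of this formulation can be sketched in numpy as:

```python
import numpy as np

def predict_synset(logits, candidate_ids):
    """Pick the highest-scoring synset among the candidates I(t_i).

    logits: scores over the full synset vocabulary I(·);
    candidate_ids: indices of the synsets compatible with the token.
    """
    masked = np.full_like(logits, -np.inf)  # mask out "impossible" synsets
    masked[candidate_ids] = logits[candidate_ids]
    return int(np.argmax(masked))

# Toy vocabulary of 5 synsets; the token admits only synsets 1 and 3,
# so the globally highest score (synset 2) is ignored.
logits = np.array([2.0, 0.5, 3.0, 1.5, -1.0])
print(predict_synset(logits, [1, 3]))  # → 3
```

Note that the full score vector is still computed, which is exactly the wastefulness pointed out above; the candidate mask only enters at prediction time.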

EWISER
Recently, there has been a trend towards the inclusion of knowledge from the WordNet sense inventory in WSD. While many successful approaches have exploited glosses (Huang et al., 2019; Blevins and Zettlemoyer, 2020), the use of relational information, i.e. the graph structure of WordNet, has mostly been confined to so-called knowledge-based algorithms, i.e., those that make use of no corpus supervision at all. However, one recent approach (Bevilacqua and Navigli, 2020, EWISER) has made the use of graph relations part of its core method: it employs trainable edge weights to aggregate the scores of related synsets together, thus taking advantage of the logit vector z over the whole vocabulary:

q_s = z_s + Σ_{s′ ∈ E_in(s)} w(s′, s) · z_{s′}    (3)

where E_in(s) is the set of all the synsets s′ ∈ V (i.e., the set of nodes) such that there is an edge from s′ to s, and w(s′, s) is the weight of that edge. Bevilacqua and Navigli (2020) batch the computation by encoding all edges in a (sparse) adjacency matrix A ∈ R^{|I(·)|×|I(·)|}, where A_{s′,s} = 0 unless s′ ∈ E_in(s), in which case A_{s′,s} = 1/|E_in(s)|:

Q = Z + A^T Z    (4)

In training, only the non-zero weights of A are updated. EWISER provides an elegant way to incorporate graph knowledge within a token tagger architecture. However, it also has some evident limitations. First, the model is only able to incorporate the local neighbourhood, as only paths of length 1 are considered. While one could incorporate paths of arbitrary length by augmenting E_in(s) with all nodes connected to s by at most k steps, this solution is in practice not very scalable, as it would rapidly densify A, increasing the number of parameters far beyond what is reasonable: a fully connected graph over WordNet synsets would have more than 6.9 billion parameters.
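The aggregation step can be sketched as follows (a minimal numpy toy, not the EWISER implementation; the graph and scores are invented):

```python
import numpy as np

def ewiser_aggregate(z, edges, n):
    """One EWISER-style aggregation: each synset adds the mean logit
    of its incoming neighbours, i.e. Q = Z + A^T Z with A_{s',s} = 1/|E_in(s)|."""
    A = np.zeros((n, n))
    for src, dst in edges:              # edge src -> dst
        A[src, dst] = 1.0
    col_sums = A.sum(axis=0)            # |E_in(s)| for each synset s
    nonzero = col_sums > 0
    A[:, nonzero] /= col_sums[nonzero]  # column-normalize the edge weights
    return z + A.T @ z

# Synset 0 receives edges from synsets 1 and 2 (say, two of its hyponyms):
z = np.array([1.0, 2.0, 4.0])
print(ewiser_aggregate(z, [(1, 0), (2, 0)], 3))  # → [4. 2. 4.]
```

In the real model A is stored sparsely and its non-zero entries are trainable parameters, which is what keeps memory linear in the number of edges.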

Integrating PageRank
We propose improving the logit aggregation mechanism of EWISER (Eq. 4) by applying the A matrix K times. Each new iteration lets progressively more distant neighbours affect the classification score. However, for a sufficiently large value of K, the contribution of the original scores would be increasingly smoothed out.
To solve this issue we exploit the connection between the logit aggregation step and the PPR algorithm (Page et al., 1999), which uses both edge information and a personalization vector (z, a prior on node importance before taking edges into account) to produce a distribution over nodes. One of the ways to "solve" PageRank is the so-called power iteration method, whereby z is repeatedly multiplied by A:

Z^(k+1) = (1 − α) A^T Z^(k) + α Z^(0)    (5)

where α is the so-called teleport probability, used to interpolate between z and the current iteration scores, saving some probability mass from the former. In our case, for each instance to classify, the personalization vector is set equal to the logit scores, i.e. the raw scores before applying the softmax:

Z^(0) = FFN(e_i)    (6)

Differently from vanilla PPR, we do not check for convergence, but treat K as a hyperparameter; moreover, we do not normalize the personalization vector into a probability distribution, as preliminary experiments showed that normalization does not significantly affect the final classification scores. Our approach is related to that of Klicpera et al. (2019), who use a neural approximation based on topic-sensitive PageRank (Haveliwala, 2002) instead of PPR, and apply it to the task of node classification in citation graphs. Differently from them, however, we exploit the graph structure of the output, not that of the input.
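A minimal numpy sketch of the truncated power iteration just described (toy graph and values; the real model runs this batched inside the forward pass):

```python
import numpy as np

def ppr_logits(z0, A, K, alpha=0.15):
    """Run K power-iteration steps of Personalized PageRank,
    using the raw logits z0 as the personalization vector:
        z <- (1 - alpha) * A^T z + alpha * z0
    K = 0 recovers the plain classifier logits."""
    z = z0.copy()
    for _ in range(K):
        z = (1 - alpha) * (A.T @ z) + alpha * z0
    return z

# Toy chain graph 0 -> 1 -> 2 (already column-normalized):
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
z0 = np.array([1.0, 0.0, 0.0])
print(ppr_logits(z0, A, K=1))  # node 2, two hops away, still gets no mass
print(ppr_logits(z0, A, K=2))  # with K = 2 the mass reaches node 2
```

This is why a larger K lets structurally distant synsets influence the prediction, while the α·Z^(0) term keeps the original logits from being smoothed away.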

Experiments
Setting We evaluate our proposed addition to the WSD task by employing it on top of the simple feedforward classifier baseline (taking as input frozen BERT-large embeddings) used by Conia and Navigli (2021). We train the model with vanilla categorical cross-entropy. Following Bevilacqua and Navigli (2020), we use SensEmBERT (Scarlini et al., 2020) and LMMS (Loureiro and Jorge, 2019) embeddings to initialize the output embeddings (i.e., the last transformation matrix of the FFN) for, respectively, nominal and all other synsets. The evaluation measure is the F1 score. We compare against EWISER, but also report results for other recent high-performing methods from the literature, both sequence-tagging (Huang et al., 2019; Yap et al., 2020) and token-tagging (Blevins and Zettlemoyer, 2020; Conia and Navigli, 2021).

Data As usual in the WSD literature, we train our model on SemCor (Miller et al., 1994), i.e. the largest available manually semantically annotated corpus; following Bevilacqua and Navigli (2020), we also add synthetic instances built by prepending lemmas to the corresponding WordNet synset definition, thus injecting gloss information that is used by other state-of-the-art models (Huang et al., 2019; Blevins and Zettlemoyer, 2020). In order to test the performance upper bounds, we also experiment separately with adding to the training set the so-called WordNet Tagged Glosses corpus (WNTG), which contains a large number of additional gold and silver annotations. We evaluate on the framework for English all-words WSD made available by Raganato et al. (2017a), which includes Senseval-2 (Edmonds and Cotton, 2001), Senseval-3 (Snyder and Palmer, 2004), SemEval-2007 (Pradhan et al., 2007), SemEval-2013 (Navigli et al., 2013), and SemEval-2015 (Moro and Navigli, 2015).
We use SemEval-2007 as our development set, and report results on the concatenation of all other datasets in the framework (ALL − ); following common practice, we also report results on SemEval-2007 + ALL − (ALL), even though the former is the development set.
Graph We train models with various values of the power iteration parameter K: (i) K = 0, which corresponds to the baseline with no logit aggregation scheme; (ii) K = 1, which is similar to EWISER, but using α-weighted interpolation between original logits and aggregated ones; (iii) K = 10, which corresponds to a range greater than or equal to the length of around 99% of paths between any two connected nodes in the graph. We report the path length statistics in Table 1.
We build A by including different sets of edges from WordNet, i.e., hypernymy, hyponymy, similarity, derivational relatedness, and verb group relations. The weight A_{s′,s} from synset s′ to s is initialized as 1/|E_in(s)|, where E_in(s) is the set of all s′ such that the edge (s′, s) is in the graph. Additionally, we experiment with including edges from SyntagNet (Maru et al., 2019, http://syntagnet.org/), a resource that includes edges representing semantic collocations. Collocational information is orthogonal to that contained in WordNet, providing paths between regions of the WordNet graph that would otherwise be distant or disconnected. Finally, to check whether it is feasible to include global information by precomputing the PPR instead of approximating it in the network forward pass, we experiment with a baseline approach where we directly initialize each row A^T_s with a PPR distribution, built using WordNet as the starting graph and a one-hot vector z (with z_s = 1) as personalization.
To keep A manageable, we cut the distribution in each A^T_s to the top 10 ranks and renormalize it to sum to 1. We then train a K = 1 model on this graph.
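The truncation step can be sketched like this (a hypothetical helper for illustration, not the released code):

```python
import numpy as np

def truncate_and_renormalize(ppr_row, k=10):
    """Keep only the top-k entries of a precomputed PPR row of A^T
    and renormalize them to sum to 1."""
    kept = np.zeros_like(ppr_row)
    top = np.argsort(ppr_row)[-k:]  # indices of the k largest PPR scores
    kept[top] = ppr_row[top]
    return kept / kept.sum()

row = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
print(truncate_and_renormalize(row, k=3))  # 3 non-zero entries, summing to 1
```

The fixed cutoff is exactly what the precomputed baseline pays for: anything beyond the top k incoming edges is discarded, whereas the online K-step approximation keeps the full graph reachable.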
Hyperparameters Models are trained for 30 epochs, using early stopping with patience set to 5, feeding data in batches of 128 sentences. The optimizer used is Adam (Kingma and Ba, 2015) with a learning rate of 10^−4, which decays linearly to 10^−7. We set the teleport probability α to 0.15, a common default value. We do not tune K nor the other hyperparameters, which we take from the configuration of Bevilacqua and Navigli (2020). The development set is only used for early stopping and to select the best epoch of the run.
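For concreteness, the linear decay schedule above can be written as follows (a hypothetical helper mirroring the stated values, not the actual training code):

```python
def linear_decay_lr(step, total_steps, lr_start=1e-4, lr_end=1e-7):
    """Linearly anneal the learning rate from lr_start (10^-4)
    down to lr_end (10^-7) over the course of training."""
    frac = min(step / total_steps, 1.0)
    return lr_start + frac * (lr_end - lr_start)

print(linear_decay_lr(0, 1000))     # start of training
print(linear_decay_lr(1000, 1000))  # end of training
```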

Results
We report the results of our experiments in Table 2. When training on SemCor, the performance of our model increases steadily from K = 0 to K = 1 (+1.4 on ALL−), and from K = 1 to K = 10 (+0.9 on ALL−), while SyntagNet edges do not seem to boost the results over the use of the simple WordNet graph. The precomputed PPR graph baseline obtains much lower results than K = 10. The reason is probably that the baseline needs to keep a fixed number of incoming edges, missing potentially useful information at a structurally longer range, while our K = 10 model has no such requirement. On the overall evaluation (ALL−) the results of the K = 10 model also match (and even slightly outperform) those of the current state of the art, i.e., BEM (Blevins and Zettlemoyer, 2020), all while using around 35% of the trainable parameters. In fact, apart from the parameters in A, our model is a simple feedforward network on top of BERT, while BEM uses two jointly fine-tuned encoders, one to encode the context and the other to encode the definitions.
When adding WNTG to the training corpus, the same trend of increasing performance along with increasing K appears. Moreover, our best model outperforms the previous state of the art (Conia and Navigli, 2021) by 0.5 points on ALL−. In this new setting using SyntagNet edges is beneficial, probably because, now that many more senses have meaningful occurrences in the larger training set, there is less noise in the tail of the Z^(0) (unnormalized) distribution, and more synsets can affect classification positively.

Model          MFS   Unseen
Ours (K = 1)   93.9  65.1
Ours (K = 10)  94.1  67.0

Table 3: Results of the model trained on SemCor (for K = 1 and K = 10), on the subset of ALL− in which the MFS is among the gold synsets (MFS) and on its complement in ALL− (Unseen). F1 is reported.
Effect of K To understand the influence of the number of iterations on performance, we plot in Figure 1 the performance on ALL− and ALL when increasing the K hyperparameter from 0 to 20. As can be seen, the trend is quite clear: the model improves going from K = 1 to K = 10, but increasing K further results in (slightly) diminished performance. The reason could be that K = 10 strikes the best tradeoff between a wider influence range and oversmoothing. In fact, the average shortest-path length between any two nodes in the WordNet graph is around 6.3.

MFS/Unseen
So, where does the improvement brought by our technique come from? To answer this question we isolate two subsets of ALL−: a most frequent synset (MFS) set, including only instances in which one of the gold synsets associated with the instance is the MFS, and a much harder unseen one, in which the gold synsets never occur in the training set. We report the results in Table 3.
As can be seen, the benefits of using K = 10 in our model are much stronger for the long tail of never-occurring senses than for high-frequency senses. In fact, our K = 10 model trained on SemCor improves over the K = 1 baseline on unseen synsets by 1.9 points (67.0 against 65.1), while the improvement on the MFS subset is much more modest (94.1 vs 93.9 F1).

Conclusion
In this paper we have shown that a deeper integration between supervised and knowledge-based methods in WSD can be attained. Indeed, by using the standard logits as the personalization vector for an approximated neural PageRank, a supervised method can exploit not just local graph information, but global information as well. Thanks to our technique, we are able to match the best competitor system when training only on SemCor (Blevins and Zettlemoyer, 2020), while using a simpler model. When we add the WNTG corpus to the training data, our results outperform the best competitor (Conia and Navigli, 2021) by 0.5 points on the overall evaluation. We leave it as future work to ascertain whether (i) edge label information can be incorporated too, and (ii) stronger baseline models using glosses (Yap et al., 2020; Blevins and Zettlemoyer, 2020) can benefit from the use of our method.