Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings

The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. Through rigorous experiments, we show that our rankings are well correlated, with strong statistical significance, with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, Wikipedia, etc., for 6 different standard metrics. We also visualize and analyze the correlations among the human rankings and make interesting observations. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our approach makes it applicable to any language. Code and data are publicly available at https://github.com/ksipos/polysemy-assessment.


Introduction
Polysemy, the number of senses that a word has, is a very subjective notion, subject to individual biases. Word sense annotation has always been one of the tasks with the lowest values of interannotator agreement (Artstein and Poesio, 2008). Yet, creating high-quality, consistent word sense inventories is a critical pre-requisite to successful word sense disambiguation.
Towards creating word sense inventories, it can be helpful to have some reliable information about polysemy. That is, knowing which words have many senses and which words have only a few senses. Such information can help in creating new inventories but also in validating and interpreting existing ones. It can also help select which words to include in a study (e.g., only highly polysemous words).
We propose a novel, fully unsupervised, and data-driven approach to quantify polysemy, based on basic geometry in the contextual embedding space.
Contextual word embeddings have emerged in the last few years as part of the NLP transfer learning revolution. Now, entire deep models are pre-trained on huge amounts of unannotated data and fine-tuned on much smaller annotated datasets. Some of the most famous examples include ULMFiT (Howard and Ruder, 2018) and ELMo (Peters et al., 2018), both based on recurrent neural networks; and GPT (Radford et al., 2018) and BERT (Devlin et al., 2018), based on transformers (Vaswani et al., 2017). All of these models are deep language models. During pre-training on large-scale corpora, they learn to generate powerful internal representations, including fine-grained contextual word embeddings. For instance, in a well pre-trained model, the word python will have two very different embeddings depending on whether it occurs in a programming context (as in, e.g., "I love to write code in python") or in an ecological context ("while hiking in the rainforest, I saw a python").
Our approach capitalizes on the contextual embeddings previously described. It does not involve any tool and does not rely on any human input or judgment. Also, thanks to its unsupervised nature, it can be applied to any language (even those with limited resources), provided that contextual embeddings are available.
The remainder of this paper is organized as follows. We first detail our proposed approach. We then present our experimental setup, evaluation metrics, and ground truth rankings, and report and interpret our results. Next, we briefly touch on two other interesting applications of our method: one that allows the user to sample sentences containing different senses of a given word, and one that goes towards word sense induction. Finally, we review related work.
Proposed approach


Basic assumption
First, by passing diverse sentences containing a given word to a pre-trained language model, we construct a representative set of vectors for that word (one vector for each occurrence of the word). The basic and intuitive assumption we make is that the volume covered by the cloud of points in the contextual embedding space is representative of the polysemy of the associated word.

Main idea: multiresolution grids
As a proxy for the volume covered, we adopt a simple geometrical approach. As shown in Fig. 1, we construct a hierarchical discretization of the space where, at each level, the same number of bins is drawn along each dimension. Each level corresponds to a different resolution. Our polysemy score is based on the proportion of bins covered by the vectors of a given word at each level.

Grid vs. clustering. Using a binning strategy is preferable to a clustering-based approach, because clusters do not partition the space equally and regularly. This is especially problematic since word representations are not uniformly distributed in the embedding space (Ethayarajh, 2019). Vectors lying in the same dense area of the space will always belong to one single large cluster, while outliers lying in the same, but sparser, area will be assigned to many different small clusters. Counting the number of clusters a given word belongs to is therefore not a reliable indicator of how much of the space the word covers.

Scoring scheme
We quantify the polysemy degree of a word w as:

score(w) = \sum_{l=1}^{L} \frac{coverage_w^l}{2^{L-l}}    (1)

where coverage_w^l designates the proportion of bins covered by word w at level l, between 0 and 1. At each level, 2^l bins are drawn along each dimension (see the vertical and horizontal lines in Fig. 1). The hierarchy starts at l = 1, since there is only one bin covering the entire space at l = 0 (so all words have equal coverage at that level). The total number of bins in the entire space, at a given level l, is equal to (2^l)^D, where D is the dimensionality of the space.
Consider again the example of Fig. 1. In this example, each word is associated with a set of 10 contextualized embeddings in a space of dimension D = 2, and the hierarchy has L = 3 levels. First, we can clearly see that word 1 (blue circles) covers a large area of the space, while all the vectors of word 2 (orange squares) are grouped in the same region. Intuitively, this can be interpreted as "word 1 occurs in more different contexts than word 2", which, per our assumption, is equivalent to saying that "word 1 is more polysemous than word 2".
Let us now see how this is reflected by our scoring scheme. First, the penalization terms (denominators) for levels 1 to 3 are 1/2^2, 1/2^1, 1/2^0 = [1/4, 1/2, 1]. Note that the higher the level, the exponentially more bins there are, and so the less penalized (or the more rewarded) coverage is, because getting good coverage becomes more and more difficult. Now, per Eq. 1, the score of word 1 is computed as the dot product of its coverage vector [3/4, 7/16, 10/64] (coverage at each level) with the penalization vector, i.e., [1/4, 1/2, 1] · [3/4, 7/16, 10/64] = 0.5625. Likewise, the score of word 2 is computed as [1/4, 1/2, 1] · [1/4, 4/16, 7/64] = 0.297. We can thus see that our scores reflect what can be observed in Fig. 1: word 1 covers a larger area of the space than word 2.
Note that the score of a given word is only meaningful compared to the scores of other words, i.e., in rankings, as will be seen in the next section.
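The scoring scheme above can be sketched in a few lines. The following is an illustrative re-implementation, not the authors' released code (which builds on GraKeL's pyramid match kernel); it assumes, as a simplification, that the embeddings have already been rescaled to the unit hypercube [0, 1)^D.

```python
import numpy as np

def polysemy_score(vectors, num_levels):
    """Multiresolution grid coverage score, a sketch of Eq. 1.

    vectors: (n, D) array of contextual embeddings for one word,
             assumed rescaled to the unit hypercube [0, 1)^D.
    """
    L = num_levels
    score = 0.0
    for l in range(1, L + 1):
        bins_per_dim = 2 ** l
        # Discretize each coordinate to get each vector's bin at this level.
        cells = np.floor(np.clip(vectors, 0.0, 1.0 - 1e-9) * bins_per_dim).astype(int)
        covered = len({tuple(c) for c in cells})
        total_bins = bins_per_dim ** vectors.shape[1]  # (2^l)^D bins in the space
        # Coarser levels are penalized more: denominator 2^(L - l).
        score += (covered / total_bins) / 2 ** (L - l)
    return score
```

A word whose vectors spread over the space scores higher than one whose vectors are concentrated in a single region, mirroring the word 1 / word 2 example of Fig. 1.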

Experiments
In this section, we describe the protocol we followed to test the extent to which our rankings match human rankings.

Word selection
The first step was to select words to include in our analysis. For this purpose, we downloaded and extracted all the text from the latest available English Wikipedia dump. We then performed tokenization, removed stopwords, punctuation, and numbers, and counted the occurrences of each token at least 3 characters long. Out of these tokens, we kept the 2000 most frequent.
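This selection step can be sketched as follows. The helper name, the tokenization shortcuts, and the tiny stopword set are all illustrative stand-ins for the actual tokenizer and stopword list (detailed in the public repository):

```python
from collections import Counter

STOPWORDS = {"the", "and", "for", "was", "with", "of"}  # stand-in for a real list

def select_words(sentences, top_k=2000):
    """Count tokens of >= 3 alphabetic characters and keep the top_k most frequent."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence.lower().split():
            token = token.strip(".,;:!?()\"'")  # crude punctuation stripping
            if len(token) >= 3 and token.isalpha() and token not in STOPWORDS:
                counts[token] += 1
    return [word for word, _ in counts.most_common(top_k)]
```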

Generating vector sets
For each word in the shortlist, we randomly selected 3000 sentences such that the word appeared exactly once within each sentence. Words that did not appear in at least 3000 sentences were removed from the analysis, reducing the shortlist's size from 2000 to 1822. Then, for each word, the associated sentences were passed through a pre-trained ELMo model (Peters et al., 2018) in test mode, and the top-layer representations corresponding to the word were harvested. The advantage of using ELMo's top-layer embeddings is that they are the most contextual, as shown by Ethayarajh (2019). We ended up with a set of exactly 3000 1024-dimensional contextual embeddings for each word.
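The sentence-selection rule (exactly one occurrence per sentence; drop words with fewer than 3000 eligible sentences) can be sketched with a hypothetical helper:

```python
import random

def pick_occurrences(word, corpus_sentences, k=3000, seed=13):
    """Sample k sentences in which `word` occurs exactly once.

    Returns None when fewer than k such sentences exist, in which case
    the word would be dropped from the shortlist.
    """
    eligible = [s for s in corpus_sentences
                if s.lower().split().count(word) == 1]
    if len(eligible) < k:
        return None
    return random.Random(seed).sample(eligible, k)
```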

Dimensionality reduction
Remember that the total number of bins in the entire space is equal to (2^l)^D at a given level l, which would have given us an astronomically large number of bins even at the first level, since the ELMo representations have dimensionality D = 1024 (2^1024 bins at l = 1). To reduce the dimensionality of the contextual embedding space, we applied PCA, trying 19 different output dimensionalities, from 2 to 20 with steps of 1. Due to the quantity and high initial dimensionality of the vectors, we used the distributed version of PCA provided by PySpark's ML library (Meng et al., 2016).
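For readers who want a minimal, single-machine stand-in for the distributed PySpark PCA used in the paper, PCA reduces to an SVD of the centered data. This is a sketch, not the paper's implementation:

```python
import numpy as np

def pca_reduce(X, d):
    """Minimal PCA via SVD: project centered data onto the top-d principal axes."""
    Xc = X - X.mean(axis=0)                         # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                            # (n, d) reduced embeddings
```

The grid of output dimensionalities could then be produced as, e.g., `{d: pca_reduce(vectors, d) for d in range(2, 21)}`.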

Score computation
We computed our scores for each PCA output dimensionality, trying 18 different hierarchies whose numbers of levels L ranged from 2 to 19. In total, we thus obtained 19 × 18 = 342 rankings.

Ground truth rankings and baselines
We evaluated the rankings generated by our approach against several ground truth rankings that we derived from human-constructed resources.
Since the number of senses of a word is a subjective, debatable notion that may vary from source to source, we included 6 ground truth rankings in our analysis, in order to minimize source-specific bias as much as possible. For sanity-checking purposes, we also added two basic baseline rankings (frequency and random). We provide more details about all rankings in what follows.

WordNet
We used WordNet (Miller, 1998) version 3.0 and counted the number of synonym sets or "synsets" of each word.

WordNet-Reduced
There are very subtle differences among the WordNet senses ("synsets"), which makes distinguishing between them difficult and even irrelevant in some applications (Palmer et al., 2004, 2007; Brown et al., 2010; Rumshisky, 2011; Jurgens, 2013). For instance, call has 41 senses in the original WordNet (28 as a verb and 13 as a noun). Even for words with fewer senses, like eating (7 senses in total), the difference between senses can be very tiny. For instance, "take in solid food" and "eat a meal; take a meal" are really close in meaning. This very fine granularity of WordNet may somewhat artificially increase the polysemy of some words.
To reduce the granularity of the WordNet synsets, we used their sense keys (see 'Sense Key Encoding' at https://wordnet.princeton.edu/documentation/senseidx5wn). Sense keys follow the format lemma%ss_type:lex_filenum:lex_id:head_word:head_id, where ss_type represents the synset type (part-of-speech tag such as noun, verb, adjective) and lex_filenum identifies the lexicographer file containing the synset for the sense (noun.animal, noun.event, verb.emotion, etc.). We truncated the sense keys after lex_filenum.
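The truncation step is a simple string operation, sketched below on illustrative sense keys (the keys and lexicographer-file comments are examples, not drawn from the paper):

```python
# Illustrative sense keys in the format lemma%ss_type:lex_filenum:lex_id:head_word:head_id
keys = [
    "eat%2:34:00::",   # verb, lexicographer file 34
    "eat%2:34:01::",
    "eat%2:34:02::",
    "eat%2:29:00::",   # verb, lexicographer file 29
]

def truncate_sense_key(sense_key):
    """Keep only lemma%ss_type:lex_filenum, merging fine-grained senses."""
    lemma, rest = sense_key.split("%")
    ss_type, lex_filenum = rest.split(":")[:2]
    return f"{lemma}%{ss_type}:{lex_filenum}"

reduced_senses = {truncate_sense_key(k) for k in keys}  # 4 keys collapse to 2 senses
```

Counting the distinct truncated keys per lemma yields the WordNet-Reduced sense counts.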

WordNet-Domains
WordNet Domains (Bentivogli et al., 2004; Magnini and Cavaglia, 2000) is a lexical resource created in a semi-automatic way to augment WordNet with domain labels. Instead of synsets, each word is associated with a number of semantic domains. The domains are areas of human knowledge (politics, economy, sports, etc.) exhibiting specific terminology and lexical coherence. As for the two previous WordNet-based ground truth rankings, we simply counted the number of domains associated with each word.

OntoNotes
We counted, for each word, the number of senses in its sense inventory. The senses in OntoNotes are groupings of the WordNet synsets, constructed by human annotators. As a result, the sense granularity of OntoNotes is coarser than that of WordNet (Brown et al., 2010).

Oxford
We counted the number of senses returned by the Oxford dictionary (www.lexico.com), which was, at the time of this study, the resource underlying the Google dictionary functionality.

Wikipedia
We capitalized on Wikipedia disambiguation pages. Such pages contain a list of the different categories under which one or more articles about the query word can be found. For example, the disambiguation page of the word bank includes categories such as geography, finance, computing (data bank), and science (blood bank). We counted the number of categories on the disambiguation page of each word to generate the ranking.

Frequency and random baselines
In the frequency baseline, we ranked words in decreasing order of their frequency in the entire Wikipedia dump (see the word selection step above). The naive assumption made here is that the words occurring the most have the most senses.
With the random baseline, on the other hand, we produced rankings by shuffling words. Further, we assigned them random scores sampled from a log-normal distribution, to imitate the long-tail behavior of the other score distributions, as can be seen in Fig. 2. All distributions can be seen in Fig. 6. Note that to account for randomness, all results for the random baseline are averages over 30 runs.

Not all of the 1822 words included in our analysis had an entry in each of the resources described above. The lengths of the ground truth rankings are shown in Table 1.
To ensure a fair comparison, the scores in the rankings of all methods were normalized to be in the [0, 100] range before proceeding.
Also, each method played in turn the role of candidate and ground truth. This allowed us to compute not only the similarity between our rankings and the ground truth rankings, but also the similarity among the ground truth rankings themselves, which was interesting for exploration purposes.
For each pair of candidate and ground truth methods, only the parts of the rankings corresponding to the words in common (their intersection) were compared. Thus, the rankings in each (candidate, ground truth) pair had equal length.
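The normalization and intersection steps above can be sketched with two hypothetical helpers (min-max rescaling is one natural way to map scores to [0, 100]; the paper's exact choice is in its repository):

```python
def normalize_scores(scores):
    """Min-max rescale a {word: score} dict to the [0, 100] range."""
    lo, hi = min(scores.values()), max(scores.values())
    return {w: 100.0 * (s - lo) / (hi - lo) for w, s in scores.items()}

def align_rankings(candidate, ground_truth):
    """Restrict both score dicts to their common words, then normalize each."""
    common = set(candidate) & set(ground_truth)
    cand = normalize_scores({w: candidate[w] for w in common})
    gt = normalize_scores({w: ground_truth[w] for w in common})
    return cand, gt
```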

Implementation details
To compute our scores, we built on the code of the pyramid match kernel from the GraKeL Python library (Siglidis et al., 2018). We used the base R (R Core Team, 2018) cor() function to compute the τ and ρ statistics. For RBO, we relied on a publicly available Python implementation. For all other metrics, we wrote our own implementations. Full details about design choices, tokenizers, stopword lists, etc., can be found in our publicly available code repository: https://github.com/ksipos/polysemy-assessment.

Results and observations
Our rankings correlate well with human rankings. Results are shown in Fig. 3, as pairwise similarity matrices, for all six metrics. For readability, all scores are shown as percentages. For a given metric, our configuration that best matches, on average, all other methods (except random and frequency) is always shown as the first column. Since all metrics except NDCG are symmetric, we only show the lower triangles of the other matrices. For NDCG, candidate methods are shown as columns and ground truths as rows.
For each of the six evaluation metrics, it can be seen that the ranking generated by our unsupervised, data-driven method is well correlated with all human-derived ground truth rankings. This means that our method is robust to how one defines and measures correlation or similarity.
In some cases, we even very closely reproduce the human rankings. For instance, our best configurations for cosine and NDCG get almost perfect scores of 86.5 and 99.72 when compared against Wikipedia. In terms of Kendall's tau, Spearman's rho, p@k, and RBO, we are also very close to OntoNotes (scores of 49.43, 35.23, 39.53, and 33.47, resp.).
Finally, the correlation between our rankings and the human rankings can also be observed to be, everywhere, much stronger than that between the baseline rankings (random and frequency) and the human rankings.
Statistical significance. We computed statistical significance for the Spearman's rho and Kendall's tau metrics. As shown in Fig. 3, the null hypothesis that there is no correlation between our rankings and the human-derived ground truth rankings was systematically rejected everywhere, with very high significance (p ≤ 0.0001).
However, against the random baseline, the same null hypothesis (no correlation) was accepted everywhere. Against frequency, the null was rejected, but very weakly (only at the p ≤ 0.01 level), and with very low correlation coefficients (6.53 for Spearman and 4.44 for Kendall).
Finally, the correlation between the random and frequency rankings and the ground truth rankings is never statistically significant, except for the pair frequency/OntoNotes, but again, at a weak level (p ≤ 0.01).
Hyperparameters have a significant impact on performance, but optimal values are consistent across metrics. First, as can be observed from Fig. 4 and Fig. 5, there is a large variability in performance when D (number of PCA dimensions) and L (number of levels in the hierarchy) vary.
However, for all six evaluation metrics, the best configurations are very similar: D2L10, D2L8, D2L8, D4L5, D3L9, and D4L10 (for RBO, D4L10 and D4L8 had the same score). Given the rather large grid we explored ([2, 20] × [2, 19] for D and L, resp.), with 342 combinations in total, we can say that all these optimal values belong to the same small neighborhood. This interpretation is confirmed by inspecting Fig. 4, where it can clearly be seen that the optimal area of the hyperparameter space is robust to metric selection and consistently corresponds to small values of D (around 3) and values of L at least above 3 or 4, ideally around 8. For larger values of L, performance plateaus (keeping D fixed). In other words, it is necessary to have some levels in the hierarchy, but very deep hierarchies are not required for our method to work well. A benefit of such small optimal values of D and L is their affordability from a computational standpoint.

All rankings derived from WordNet-based resources are highly correlated. It is interesting to note that the rankings generated from OntoNotes, WordNet, WordNet-Reduced, and WordNet-Domains are all highly similar, despite the very different sense granularities of these resources. That is, despite their apparent differences, these resources all tend to produce similar polysemy rankings. The Oxford rankings tend to be part of this high-similarity cluster as well, albeit to a lesser extent.

Frequent words are not the most polysemous. Finally, one last interesting observation is that while the frequency ranking is much better than the random ones, it still is far from the human rankings. In other words, the frequency of a word (excluding stopwords, of course) is not as good an indicator of its polysemy as one could expect. Some words illustrating this observation are "number", "population", and "war".

A note on ties. To assess the impact of ties on the reported results, we repeated all of our experiments multiple times with different tie-breaking strategies (e.g., random, alphabetical). The results do not change: we find the same best parameter combinations, and the differences in the similarity matrices are minimal.

Sentences and their bin coordinates:
it stars christopher lee as count dracula along with dennis waterman (3, 5, 1)
the count of the new group is the sum of the separate counts of the two original groups (4, 1, 3)
the first fight did not count towards the official record (4, 5, 1)
five year old horatia came to live at merton in may 1805 (2, 5, 2)
it features various amounts of live and backstage footage while touring (4, 2, 4)
first tax bills were used to pay taxes and to register bank deposits and bank credits (4, 2, 4)
the ball nest is built on a bank tree stump or cavity (5, 2, 3)
Table 2: Sentences containing different senses of the same word can be sampled by selecting from different bins.

Keywords and their bin coordinates:
also, gas, used, system, protein, blood, new, steel, food, made (20, 11, 16, 9)
first, new, one, second, later, world, national, olympic, team, games (19, 13, 15, 13)
album, music, rock, one, labour, number, chart, songs, metal, single (21, 14, 14, 16)
Table 3: Keywords extracted from distant bins containing the word metal.

Other applications
Sampling diverse examples. An interesting by-product of our discretization strategy is that it can be used to select sentences containing different senses of the same word, as illustrated in Table 2. Provided a mapping, for a given word, between the vectors and the sentences that were passed to the pre-trained language model, we can sample vectors from different bins and retrieve the associated sentences. If the bins are distant enough, the sentences will contain different senses of the word. For instance, in Table 2, we can see that we are able to sample sentences containing three senses of the word count: (1) noble title, (2) determining the total number of, and (3) taking into account. This has many useful applications in practice, e.g., in information retrieval, NLG and conversational systems, dataset creation, etc.

Automatic word sense induction. A simple way of capitalizing on our binning strategy to create word sense inventories would consist in (1) selecting distant bins for a given word, and (2) labeling the selected bins with senses. Both steps can be performed automatically. While this will be investigated in future work, we give, as a proof of concept, an example in Table 3, in which keywords are extracted from distant bins containing the word metal, and different senses are retrieved.
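The sentence-to-bin mapping underlying both applications can be sketched as follows. This is an illustrative helper (name and `level` default are assumptions), reusing the same discretization as the scoring scheme:

```python
import numpy as np

def sentences_per_bin(vectors, sentences, level=3):
    """Group each sentence by the grid bin of its word vector.

    vectors: (n, D) embeddings rescaled to [0, 1); sentences: the n
    sentences the vectors came from. Sampling one sentence from each of
    several distant bins tends to yield different senses of the word.
    """
    bins_per_dim = 2 ** level
    cells = np.floor(np.clip(vectors, 0.0, 1.0 - 1e-9) * bins_per_dim).astype(int)
    groups = {}
    for cell, sent in zip(map(tuple, cells), sentences):
        groups.setdefault(cell, []).append(sent)
    return groups
```

Keyword extraction over each group of sentences would then provide candidate sense labels, as in the Table 3 proof of concept.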

Related work
Task. Several previous efforts have addressed the creation of sense inventories without human experts. As an example, in Rumshisky (2011) and Rumshisky et al. (2012), Amazon Mechanical Turk (AMT) workers are given a set of sentences containing the target word, one of which is randomly selected as a target sentence. Workers are then asked to judge, for each sentence, whether the target word is used in the same way as in the target sentence. This creates an undirected graph of sentences over which clustering can be applied to find senses. To label clusters with senses, one has to manually inspect the sentences in each cluster.
More recently, Jurgens (2013) compared three annotation methodologies for gathering word sense labels on AMT: Likert scales, a two-stage select-and-rate scheme, and the difference between the counts of times senses were rated best and worst. Regardless of the strategy, inter-annotator agreement remains low (around 0.3).

Methodology. In the original ELMo paper, Peters et al. (2018) showed that using contextual word representations (through nearest-neighbor matching) improves word sense disambiguation. Hadiwinoto et al. (2019) and Coenen et al. (2019) showed that this technique works well with BERT too. Pasini et al. (2020) use a combination of BERT embeddings and a knowledge-based WSD model to generate word sense distributions, while Giulianelli et al. (2020) use clustering over the embeddings to detect semantic shifts.
Our approach is also related in spirit to pyramid matching (Nikolentzos et al., 2017;Grauman and Darrell, 2007;Lazebnik et al., 2006). This kernel-based method originated in computer vision. It computes the similarity between objects by placing a sequence of increasingly coarser grids over the feature space and taking a weighted sum of the number of matches occurring at each level. Matches found at finer resolutions are weighted more than matches found at coarser resolutions.

Conclusion
We proposed a novel unsupervised, fully data-driven geometrical approach to estimate word polysemy. Our approach builds multiresolution grids in the contextual embedding space. Through rigorous experiments, we showed that our rankings are well correlated, with strong statistical significance, with 6 different human rankings, for 6 different metrics. Such fully data-driven rankings of words according to polysemy can help in creating new sense inventories, but also in validating and interpreting existing ones. Increasing the quality and consistency of sense inventories is a key first step of the word sense disambiguation pipeline. We also showed that our discretization can be used, at no extra cost, to sample contexts containing different senses of a given word, which has useful applications in practice. Finally, the unsupervised nature of our method makes it applicable to any language.
While our scores are a good proxy for polysemy, they are not equal to word sense counts. Moreover, we do not label each sense. Future work should address these challenges by, e.g., automatically selecting bins of interest and generating labels for them, as discussed in the section on other applications.
Future work should also include some form of extrinsic evaluation. For instance, the Word-in-Context task (Pilehvar and Camacho-Collados, 2018) could be used, where two occurrences of a word would be classified as having the same meaning if their two vectors fall in the same bin.
Another direction is investigating how different contextual embeddings (e.g., BERT, BART) impact our rankings, including in languages other than English (Eddine et al., 2020;Cañete et al., 2020), and low-resource languages.