PolyLM: Learning about Polysemy through Language Modeling

To avoid the “meaning conflation deficiency” of word embeddings, a number of models have aimed to embed individual word senses. These methods at one time performed well on tasks such as word sense induction (WSI), but they have since been overtaken by task-specific techniques which exploit contextualized embeddings. However, sense embeddings and contextualization need not be mutually exclusive. We introduce PolyLM, a method which formulates the task of learning sense embeddings as a language modeling problem, allowing contextualization techniques to be applied. PolyLM is based on two underlying assumptions about word senses: firstly, that the probability of a word occurring in a given context is equal to the sum of the probabilities of its individual senses occurring; and secondly, that for a given occurrence of a word, one of its senses tends to be much more plausible in the context than the others. We evaluate PolyLM on WSI, showing that it performs considerably better than previous sense embedding techniques, and matches the current state-of-the-art specialized WSI method despite having six times fewer parameters. Code and pre-trained models are available at https://github.com/AlanAnsell/PolyLM.


Introduction
Much work in NLP has been dedicated to vector representations of words, but it has been recognized since as early as Schütze (1998) that such representations fail to capture the polysemous nature of many words, conflating their multiple senses into a single point in semantic space. There have been several attempts at embedding individual word senses to avoid this issue, termed the "meaning conflation deficiency" by Camacho-Collados and Pilehvar (2018) in their survey of the area.
We propose PolyLM, an unsupervised sense embedding model which is effective and easy to apply to downstream tasks. PolyLM can be thought of as both a (masked) language model and a sense model, as it calculates a probability distribution both over words and word senses at masked positions. The formulation is derived from two observations about word senses: firstly, that the probability of a word occurring in a given context is equal to the sum of the probabilities of its individual senses occurring; and secondly, that for a given occurrence of a word, one of its senses tends to be much more plausible in the context than the others.
There are several reasons for the interest in sense representations. The first is to avoid the downsides associated with the meaning conflation deficiency. Word embedding models can have difficulty distinguishing which sense of an ambiguous word applies in a given context (Yaghoobzadeh and Schütze, 2016). Additionally, homonymy and polysemy cause distortion in word embeddings: for instance, we would find the unrelated words left and wrong unreasonably close in the vector space due to their similarity to two different senses of the word right, an effect noted by Neelakantan et al. (2014) and illustrated in Figure 1. Intuitively, we would expect sense embedding models to gain superior semantic understanding by avoiding these problems.
In addition to well-established applications for sense representations such as word sense disambiguation (WSD) and induction (WSI), another interesting use case is the automatic construction of lexical resources (Neale, 2018). While there are existing human-curated word sense inventories for English such as WordNet (Miller, 1995), these are expensive to create and are unavailable for most languages. Panchenko (2016) showed that sense embeddings learned using the model of Bartunov et al. (2016) could be linked with word senses contained in BabelNet (Navigli and Ponzetto, 2012) with a reasonable degree of precision, although the mapping struggled with recall. PolyLM represents a significant advance over Bartunov et al.'s model in terms of WSI performance, so it seems reasonable to imagine that this approach to lexical resource construction might now be more feasible.

Figure 1: Comparison of word and sense embedding spaces (visualization produced using Flyamer, 2017). Sense embeddings were learned by training PolyLM SMALL with the standard 8 senses per word; word embeddings were learned by training PolyLM SMALL with a single sense per word. Note that both models were trained on unlemmatized data, unlike those used in the WSI experiments. The occurrence of closely related polysemous words nearby in the word embedding space (e.g. left and right) causes unrelated words to be closer together (e.g. left and wrong) and related words to be further apart (e.g. right and east) than they otherwise would be. The use of sense embeddings avoids such distortion. PolyLM is capable of detecting comparatively rare word senses, such as the political senses of left and right, and the use of smith and mason to refer to tradespeople.
The emergence of contextualized models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) has had a tremendous impact on the area of semantic representation. Rather than representing words using a single embedding, or even a set of sense embeddings, these models allow words to be represented using an infinite set of possible embeddings depending on the context. This approach has been very effective across NLP, and many state-of-the-art systems incorporate contextualized models, including systems for WSD and WSI. The success of contextualized models raises the question of whether there is still value in learning discrete sense representations.
However, contextualized models still rely on word embeddings, and are therefore subject to the meaning conflation deficiency. Furthermore, it could be argued that it is inefficient to have the same representation size for all words regardless of how diverse their range of senses is. Another drawback is that before they can be applied to word sense-related tasks, an adaptation step such as clustering to induce discrete senses or fine-tuning is generally required, which is often expensive in terms of both research and compute time.
The contributions of this paper can be summarized as follows:
• We propose PolyLM, an end-to-end, unsupervised neural sense embedding model derived from two simple assumptions about word senses. We demonstrate that PolyLM learns senses which correspond well to human notions by showing that it performs well at WSI.
• PolyLM is flexible in that it can use any "contextualizer" (a useful term coined by Liu et al. (2019)), so it will remain relevant as contextualization techniques improve.
• We reduce the effect of the meaning conflation deficiency by disambiguating word senses at the input with a neural "disambiguation layer." We show that good performance on WSI can be achieved using the output of this layer alone, suggesting that it could be a useful component in many neural networks for language understanding.

Related Work
One of the first works in unsupervised learning of sense representations was by Schütze (1998), who proposed a two-step process, where vector representations are first derived for each context containing an ambiguous word, and these are then clustered into a pre-defined number of groups. Huang et al. (2012) added a third step, where after sense-labeling each word according to its context cluster, sense representations are learned through neural language modeling.
A number of later approaches employed a joint training approach, where sense labeling and sense representation learning happen in parallel. Neelakantan et al. (2014), Li and Jurafsky (2015) and Bartunov et al. (2016) each proposed multi-sense variants of the Skip-Gram model (Mikolov et al., 2013). Various approaches were tried for determining the number of senses per word: for instance, Li and Jurafsky and Bartunov et al. used Chinese Restaurant Processes and Dirichlet Processes respectively to automatically learn an appropriate number of senses for each word.
Many joint training approaches have the disadvantage that they create ambiguity in the context representation by representing context words with word embeddings in order to avoid considering the exponential number of possible sense labelings for the context. Qiu et al. (2016) and Lee and Chen (2017) propose purely sense-based approaches which can sense-label the input efficiently. Arora et al. (2018) took a novel approach to the problem of learning word senses, demonstrating that the embedding learned by traditional techniques for an ambiguous word tends to be very close to a linear combination of the hypothetical vectors corresponding to its individual senses. They proposed a method for recovering the underlying sense vectors and coefficients, and evaluated their system on WSI.
Since the emergence of contextualized models, there have been a number of other systems which have exploited their powerful semantic representations for specific tasks such as word sense disambiguation (Huang et al., 2019; Vial et al., 2019) and induction (Amrami and Goldberg, 2018, 2019); however, none of these methods creates explicit sense embeddings.

Overview
Consider a typical neural language model. Each word $w$ in a vocabulary $V$ is assigned a single embedding, resulting in an embedding matrix $M \in \mathbb{R}^{|V| \times d}$, where $d$ is the embedding dimensionality. The probability of $w$ occurring in a context $c$ is estimated as

$$P(w \mid c) = \frac{\exp(m_w \cdot y(c) + a_w)}{\sum_{w' \in V} \exp(m_{w'} \cdot y(c) + a_{w'})}, \qquad (1)$$

where $m_w$ is the row of $M$ corresponding to $w$, $y(c) \in \mathbb{R}^d$ is a vector representation of $c$ and $a \in \mathbb{R}^{|V|}$ is a trainable bias vector. In BERT (Devlin et al., 2019) for instance, $y(c)$ corresponds to the final output of multiple Transformer encoder layers (Vaswani et al., 2017). Now suppose that for each $w \in V$, there is a corresponding set $S_w$ of sememes, or senses which $w$ can have.
For instance, intuitively we might have $S_{rock} = \{$rock:stone, rock:musical genre, rock:shake$\}$. We assume that the $S_w$ are disjoint, i.e. $S_w \cap S_{w'} = \emptyset$ whenever $w \neq w'$, and we define the full sense inventory $S = \bigcup_{w \in V} S_w$.
Context induces specific senses for the words it contains. Thus a passage of text can be thought of as a sequence of sememes as well as a sequence of words. The first observation underlying PolyLM is that the probability of a word $w$ occurring in a context $c$ is equal to the sum of the probabilities of $w$'s component sememes occurring in the context, i.e.

$$P(w \mid c) = \sum_{s \in S_w} P(s \mid c). \qquad (2)$$

We wish to learn representations for individual senses, and so we assign an embedding to each sememe in our sense inventory, resulting in a matrix $E$ with dimension $|S| \times d$ and bias vector $b$ of dimension $|S|$. Note that this assumes that we know the number of senses of each word a priori, an assumption whose consequences we discuss later. Following Eq. 1, we define the vector $p(c) \in \mathbb{R}^{|S|}$ of sememe probabilities in a context $c$ as

$$p(c) = \mathrm{softmax}(E\, y(c) + b). \qquad (3)$$

Considering Eq. 2, we have

$$P(w \mid c) = \sum_{s \in S_w} p_s(c), \qquad (4)$$

allowing us to formulate the problem of learning sense representations with a language modeling objective. PolyLM is constructed from three components: the input layer, which represents the input tokens as aggregates of their sense embeddings; the disambiguation layer, which attempts to determine the contextually appropriate sense embeddings for the input; and the prediction layer, which implements the language modeling objective.
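The relationship between sense and word probabilities described above can be illustrated with a small sketch. The sense inventory, index assignments and scores below are invented purely for illustration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sense inventory: "rock" has 3 hypothetical sememes, "jazz" has 1.
senses_of = {"rock": [0, 1, 2], "jazz": [3]}

# Hypothetical sense scores E y(c) + b for some context c.
p = softmax([2.0, -1.0, 0.5, 1.0])  # distribution over all sememes

# The probability of a word is the sum of the probabilities of its sememes.
P_rock = sum(p[s] for s in senses_of["rock"])
P_jazz = sum(p[s] for s in senses_of["jazz"])
```

Because the sememe sets are disjoint and cover the whole inventory, the word probabilities obtained this way still sum to one.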
We adopt the masked language modeling (MLM) task used for training BERT. When training, we select a subset $T \subset \{1, 2, \dots, n\}$ of the tokens in the input sequence as targets for prediction, and produce a masked version $c' = w'_1, w'_2, \dots, w'_n$ of the original sequence $c = w_1, w_2, \dots, w_n$ as follows: 15% of tokens are chosen at random as targets, of which 80% are replaced with a special [MASK] token, 10% are replaced with a random token, and 10% are left unchanged.
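The masking procedure can be sketched as a minimal re-implementation of the BERT-style corruption described above (the vocabulary and function name are illustrative, not part of the released code):

```python
import random

MASK = "[MASK]"

def mask_sequence(tokens, vocab, rng, p_target=0.15):
    """Select target positions and corrupt them BERT-style:
    80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    masked = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() < p_target:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK            # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.choice(vocab)  # 10%: random token
            # else: 10% left unchanged
    return masked, targets
```

Only positions in the returned target list contribute to the prediction loss; all other positions pass through unchanged.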

Input Layer
We define a contextualizer to be a function which maps a sequence of input representations $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ to a corresponding sequence of output representations $y_1, y_2, \dots, y_n \in \mathbb{R}^d$. Recurrent Neural Networks and Transformer architectures are both commonly used as contextualizers for language modeling. Typically the input representations are drawn from an embedding matrix $I \in \mathbb{R}^{|V| \times d}$. It has become common (e.g. in BERT) to set $I$ equal to $O$, the embedding matrix used at the language modeling output, as recommended by Press and Wolf (2017), and thus have a single embedding matrix.
The issue of input representation poses a problem for our model. Our output embeddings $E \in \mathbb{R}^{|S| \times d}$ correspond to sememes. We cannot straightforwardly tie our input and output embeddings as Press and Wolf suggest, because we receive words rather than sememes as input. We solve this problem by setting the input representation of a word to be a convex combination of the representations of its sememes, i.e.

$$x(w) = \sum_{s \in S_w} \lambda_{ws}\, e_s, \qquad (5)$$

where $e_s$ is the row of $E$ corresponding to sememe $s$, and $\lambda_w$ is a learnable weight vector with the properties that $\sum_{s \in S_w} \lambda_{ws} = 1$ and $\lambda_w \geq 0$ (in practice, $\lambda_w$ is the softmax of an underlying, unconstrained variable vector).
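The convex combination above amounts to the following computation; this is a pure-Python sketch with illustrative sense vectors, not the released implementation:

```python
import math

def input_embedding(sense_vectors, theta):
    """Represent a word as a convex combination of its sense embeddings.
    theta holds the underlying unconstrained weights; the softmax makes
    the lambda weights non-negative and sum to 1."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    lam = [e / total for e in exps]
    dim = len(sense_vectors[0])
    return [sum(lam[k] * sense_vectors[k][j] for k in range(len(lam)))
            for j in range(dim)]
```

Because the weights are a softmax output, the result always lies in the convex hull of the word's sense embeddings.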

Disambiguation Layer
The disambiguation layer attempts to infer the contextually appropriate sememe embeddings for the input based on the conflated representations from the input layer. Representations $x(w'_1), x(w'_2), \dots, x(w'_n)$ of $c'$, calculated according to Eq. 5, are fed into a contextualizer instance $C^D$, which outputs representations $y^D_1(c'), y^D_2(c'), \dots, y^D_n(c')$. We use these representations to calculate a probability distribution over each sense of the tokens in the input:

$$q^D_i(c') = \mathrm{softmax}\big(E^{(w'_i)}\, y^D_i(c') + b'^{(w'_i)}\big), \qquad (6)$$

where $E^{(w'_i)}$ is a submatrix of $E$ containing only the rows corresponding to senses of token $w'_i$, and similarly $b'^{(w'_i)}$ is a subvector of a learnable bias vector $b' \in \mathbb{R}^{|S|}$. In other terms,

$$q^D_{is}(c') = \frac{\exp\big(e_s \cdot y^D_i(c') + b'_s\big)}{\sum_{s' \in S_{w'_i}} \exp\big(e_{s'} \cdot y^D_i(c') + b'_{s'}\big)}, \qquad (7)$$

where $s \in S_{w'_i}$. $q^D_{is}(c')$ corresponds to the probability that the $i$th token in sequence $c'$ has sense $s$.
The disambiguated representation of a token could simply be its highest-probability sememe embedding in the context, but to allow gradients to flow through the disambiguation layer, we take the sum of the sememe embeddings weighted by their probabilities:

$$x^P_i(c') = \sum_{s \in S_{w'_i}} q^D_{is}(c')\, e_s. \qquad (8)$$
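The disambiguation layer's per-token computation, as described in prose above, can be sketched as follows (illustrative values; a real implementation would batch this over all positions):

```python
import math

def disambiguate(y_D, sense_vectors, bias):
    """For one token: softmax over the scores of its senses, then a
    probability-weighted sum of its sense embeddings (the soft
    disambiguated representation)."""
    scores = [sum(e_j * y_j for e_j, y_j in zip(e, y_D)) + b
              for e, b in zip(sense_vectors, bias)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    q_D = [e / total for e in exps]          # sense probabilities
    dim = len(y_D)
    x_P = [sum(q_D[k] * sense_vectors[k][j] for k in range(len(q_D)))
           for j in range(dim)]              # probability-weighted sum
    return q_D, x_P
```

When one sense dominates, the weighted sum collapses to (approximately) that sense's embedding, while remaining differentiable.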

Prediction Layer
The prediction layer maps a sequence of disambiguated input representations onto a corresponding set of output representations, and from each output representation estimates the probability of every sememe in the sense inventory occurring at the corresponding position of the sequence. Disambiguated representations $x^P_1(c'), x^P_2(c'), \dots, x^P_n(c')$ are fed into another contextualizer instance $C^P$, which returns output representations $y^P_1(c'), y^P_2(c'), \dots, y^P_n(c')$. These are used to calculate a probability distribution over the entire sense inventory, as prescribed by Eq. 3:

$$p_i(c') = \mathrm{softmax}\big(E\, y^P_i(c') + b\big). \qquad (9)$$

Figure 2: Architecture diagram for PolyLM when training, illustrated on the sentence "I like apple pie.", where the word "apple" is chosen as a target and masked (note that "apple" is ambiguous when tokens are lower-cased, as it may refer to a fruit or a technology company). At inference time, the bottom components (up to and including $q^D(c)$) do not need to be evaluated, and the sequence need not be masked at the input.
We define an additional set of probabilities $q^P$ analogous to $q^D$ defined in Eq. 6:

$$q^P_{is}(c', c) = \frac{p_{is}(c')}{\sum_{s' \in S_{w_i}} p_{is'}(c')}, \qquad s \in S_{w_i}. \qquad (10)$$

$q^P_i$ takes both $c'$ and the unmasked sequence $c$ as arguments because we are interested in the sense probabilities of the words $w_i$ that actually occurred. $q^P_i$ will be used later for defining the loss function and is useful for downstream tasks.
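A minimal sketch of computing $q^P$ for one position, assuming it is obtained by restricting the full sememe distribution of Eq. 9 to the senses of the observed word and renormalizing (the probabilities below echo the "apple" example in Figure 2):

```python
def q_P_for_word(p_i, sense_ids):
    """Restrict the position-i distribution over all sememes to the
    senses of the word that actually occurred, then renormalize."""
    mass = [p_i[s] for s in sense_ids]
    total = sum(mass)
    return [m / total for m in mass]
```

Even when the word's total probability under the language model is small, the renormalized distribution can still identify its contextually dominant sense sharply.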

Loss Function
We seek to minimize a loss function $J$ with three components, each of which is explained below:

$$J(c, c', T) = J_{LM}(c, c', T) + J_D(c, c', T) + J_M(c, c', T).$$

Language Modeling Loss
The language modeling loss $J_{LM}$ is defined as the mean negative log likelihood of the target tokens occurring:

$$J_{LM}(c, c', T) = -\frac{1}{|T|} \sum_{i \in T} \log \sum_{s \in S_{w_i}} p_{is}(c'),$$

where $p_i$ is as defined in Eq. 9.
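The mean negative log likelihood over target tokens can be sketched with toy distributions (the position-to-sense mapping below is illustrative):

```python
import math

def lm_loss(p, targets, senses_of):
    """Mean negative log likelihood of the target tokens.
    p[i] is the position-i distribution over all sememes;
    senses_of[i] lists the sense indices of the true token at position i,
    so the inner sum is the probability of the word itself."""
    nll = [-math.log(sum(p[i][s] for s in senses_of[i])) for i in targets]
    return sum(nll) / len(nll)
```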

Distinctness Loss
Recall that we assume in advance a number of senses for each word. In practice we guess a relatively high number to avoid missing senses. When we overestimate the number of senses, we find that two different sense embeddings for a word converge to essentially the same meaning. The aim of the distinctness loss is to ensure that each sense has a distinct meaning, and to "kill off" superfluous senses by causing them to have very low probability in all contexts. The second key observation of PolyLM is that if the sememes corresponding to a word $w$ are distinct, then in contexts where $w$ occurs, we would expect one of these sememes to have a high estimated probability of occurring, and the rest to have a low probability. The distinctness loss, given by

$$J_D(c, c', T) = -\frac{1}{r|T|} \sum_{i \in T} \log \sum_{s \in S_{w_i}} q^P_{is}(c', c)^r,$$

with hyperparameter $r > 1$, encourages this separation to occur. A full justification is given in Appendix A.
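The following sketch assumes the distinctness loss takes the form $J_D = -\frac{1}{r|T|}\sum_i \log \sum_s (q^P_{is})^r$, which is consistent with the "sharpened softmax" gradient analysis in Appendix A; it illustrates that peaked sense distributions incur near-zero loss while flat ones are penalized:

```python
import math

def distinctness_loss(q_rows, r=1.5):
    """Assumed-form distinctness loss over per-target sense distributions.
    A one-hot distribution gives exactly 0; a uniform distribution over
    k senses gives (r - 1) * log(k) / r > 0."""
    terms = [math.log(sum(q ** r for q in row)) for row in q_rows]
    return -sum(terms) / (r * len(terms))
```

Minimizing this loss therefore pushes each occurrence of a word toward committing to a single sense.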

Match Loss
Without extra supervision, the disambiguation layer tends to very quickly allocate almost all of the probability mass for a word to a single one of its senses. This appears to be due to a "rich get richer" effect in Eq. 8, where the sense embedding with the highest weight has larger gradients associated with it. A more reliable source of sense probabilities is the output of the prediction layer, as this is more closely associated with the ground truth. Therefore we encourage the disambiguation sense probabilities $q^D$ to be similar to the prediction sense probabilities $q^P$ by adding a sense probability "match loss," which is proportional to the cosine similarity between $q^D$ and $q^P$.
Because $q^D_i(c')$ is meaningless when token $i$ is replaced with [MASK], when calculating the match loss we evaluate the disambiguation layer on the unmasked sequence (shown with bottom-up arrows in Figure 2), obtaining $q^D_i(c)$. The match loss is defined as

$$J_M(c, c', T) = -\frac{\lambda_M}{|T|} \sum_{i \in T} \frac{q^D_i \cdot q^P_i}{\|q^D_i\|\, \|q^P_i\|},$$

where $q^D_i$ and $q^P_i$ are shorthand for $q^D_i(c)$ and $q^P_i(c', c)$ respectively, and $\lambda_M$ is a hyperparameter. As we wish the disambiguation layer to learn from the prediction layer rather than the other way around, we do not allow gradients from the match loss to propagate through $q^P_i$.
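A sketch of the match loss as negative mean cosine similarity scaled by $\lambda_M$ (the exact scaling is an assumption beyond "proportional to the cosine similarity"; in training, gradients are stopped through $q^P$, which a static sketch like this does not need to model):

```python
import math

def match_loss(q_D_rows, q_P_rows, lam_M=0.1):
    """Negative mean cosine similarity between the disambiguation-layer
    and prediction-layer sense distributions, one pair per target."""
    sims = []
    for qd, qp in zip(q_D_rows, q_P_rows):
        dot = sum(a * b for a, b in zip(qd, qp))
        norm = math.sqrt(sum(a * a for a in qd)) * math.sqrt(sum(b * b for b in qp))
        sims.append(dot / norm)
    return -lam_M * sum(sims) / len(sims)
```

The loss is minimized (at $-\lambda_M$ per target) when the two distributions point in the same direction.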

Preprocessing
To avoid the issue of how to represent a word's sense when it is broken into sub-word level tokens, our vocabulary consists of whole-word tokens. However the WSI tasks on which we evaluate our model operate on the lemma level, so we lemmatize our training corpus as described in Appendix B. The vocabulary consists of the ∼86K tokens appearing more than 500 times in our training corpus, which, like BERT's, consists of English Wikipedia + BookCorpus (Zhu et al., 2015). All tokens are lower-cased.

Contextualizers
One of the advantages of PolyLM is that it can be used with any type of contextualizer. Note, however, that we must train our contextualizers together with the rest of the model rather than using pretrained contextualizer instances, because their word embedding matrix would not match our sense embedding matrix. In this paper we present results where the disambiguation and prediction contextualizers $C^D$ and $C^P$ use BERT's implementation of the Transformer encoder architecture.

Parameters
To keep the total number of embeddings reasonable, we allow only the ∼10,000 tokens which occur more than 20,000 times in the training corpus, or appear as focuses in the evaluation datasets, to have multiple senses. Specifically, we assign these tokens a fixed number of k = 8 embeddings, and other tokens a single embedding. Since according to Zipf's law (Zipf, 1950), it is the most frequent words which tend to have the most senses, we expect not to miss too many senses by assuming that infrequent words are monosemous. We leave the investigation of more sophisticated methods for pre-allocating or dynamically updating the number of senses for each token for future work. We train two PolyLM models of different sizes, PolyLM SMALL and PolyLM BASE . Due to the prohibitive computational cost of training a model of BERT LARGE 's size, we use significantly smaller dimensions, as shown in Table 1.
Models were trained over 6,000,000 batches consisting of 32 sequences of length 128 using the Adam optimizer (Kingma and Ba, 2014). The learning rate was increased linearly from 0 to 3e-5 over the first 10,000 batches, and then reduced linearly back to zero over the remaining batches. The hyperparameters λ M and r specific to PolyLM's loss function were first increased linearly and then left constant, λ M from 0 to 0.1 over the first 1,000,000 batches, and r from 1.0 to 1.5 over the first 2,000,000 batches.
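The learning-rate schedule described above can be sketched as follows; the same linear-ramp idea applies to the $\lambda_M$ and $r$ schedules (function name and signature are illustrative):

```python
def linear_warmup_then_decay(step, peak=3e-5, warmup=10_000, total=6_000_000):
    """Increase linearly from 0 to `peak` over the first `warmup`
    batches, then decrease linearly back to zero over the remaining
    batches."""
    if step < warmup:
        return peak * step / warmup
    return peak * (total - step) / (total - warmup)
```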
It is important for $r$ to be gradually increased in this manner because if $r$ is large initially, the distinctness loss reduces the diversity of the senses learned. On the other hand, increasing $r$ too slowly seems to be detrimental to the senses' distinctness.

Experiments
Word sense induction (WSI) is the task of inferring the senses of a word in an unsupervised manner. This is precisely the aim of our method, and so is an ideal test task. We evaluate PolyLM on two WSI datasets, SemEval-2010 Task 14 (Manandhar et al., 2010) and SemEval-2013 Task 13 (Jurgens and Klapaftis, 2013). Both datasets consist of passages containing one of a set of polysemous focus words. The occurrences of the focus words in the test set have been sense-labeled by human annotators according to a reference sense inventory.
In the SemEval-2010 dataset, each instance is labeled with a single sense, whereas in the SemEval-2013 dataset an instance may be labeled with several relevant senses, each with a corresponding weight denoting its degree of applicability in the context.
Performance on SemEval-2010 is measured using paired F-Score (F-S) and V-Measure (V-M), and on SemEval-2013 using Fuzzy B-Cubed (FBC) and Fuzzy Normalized Mutual Information (FNMI). Overall performance on each task (AVG) is typically defined as the geometric mean of its two sub-metrics.
Currently, the best performing system on both datasets is that of Amrami and Goldberg (2019). Their system uses the idea of substitute vectors, first devised by Başkaya et al. (2013). For each instance, a set of most likely words that could have occurred instead of the focus word is obtained from the output of a language model. These sets are then clustered, and each cluster is taken to correspond to a different sense of the focus word. Amrami and Goldberg use BERT LARGE as their language model.
PolyLM can be used for WSI without any further training. For the SemEval-2010 dataset, each instance $c$ is labeled with the sense of the focus word $w_i$ which has the highest predicted probability, i.e. $\operatorname{argmax}_{s \in S_{w_i}} q^P_{is}(c', c)$, where $c'$ is formed from $c$ by replacing $w_i$ with [MASK]. For SemEval-2013, we consider a sense applicable if it has a predicted probability $q^P_{is}(c', c) > p_{thresh}$, and the weight assigned to each applicable sense is its probability $q^P_{is}(c', c)$. We arbitrarily set $p_{thresh}$ to 0.2.

Results are shown in Table 2. Both PolyLM models comprehensively outperform previous sense embedding methods. PolyLM BASE and Amrami and Goldberg's system each slightly outperform the other on one dataset, suggesting similar overall proficiency at WSI. However, it is worth noting that the BERT LARGE language model used by Amrami and Goldberg has more than six times as many parameters as PolyLM BASE and is much more computationally expensive to train and run.
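The two sense-labeling procedures described at the start of this section can be sketched as follows (sense indices and probabilities are illustrative):

```python
def label_semeval2010(q):
    """Single-sense labeling: the focus word's most probable sense."""
    return max(range(len(q)), key=lambda s: q[s])

def label_semeval2013(q, p_thresh=0.2):
    """Graded labeling: every sense whose probability exceeds the
    threshold, weighted by that probability."""
    return {s: q[s] for s in range(len(q)) if q[s] > p_thresh}
```

Note that a sharply peaked sense distribution typically yields a single applicable sense under both schemes.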
PolyLM scales well for the sizes tested, with PolyLM BASE outperforming PolyLM SMALL by 3.2 and 4.1 points in AVG score on the two datasets with a 2.25x increase in the number of parameters. Even if further increases in model dimensions yielded much smaller improvements in performance, it seems likely that a PolyLM model of BERT LARGE 's 340 million parameter size would achieve results significantly better than those of Amrami and Goldberg (2019).

Ablation Study
We test three alternative configurations against PolyLM SMALL: one where the distinctness loss term is removed from the objective ("no distinctness loss"), one where the disambiguation layer is removed ("no disambiguation layer"), and one where the disambiguation sense probabilities $q^D$ are used in place of $q^P$ when performing WSI ("disambiguation layer only"). Note that the first two configurations require new models to be trained, whereas the last simply uses PolyLM SMALL in a different way. Results are shown in Table 3.
The use of the distinctness loss has a large impact on model performance, while the disambiguation layer is somewhat less important but still useful. The model still performs surprisingly well when the disambiguation rather than the prediction sense probabilities are used; these are the output of only four Transformer layers and hence are much cheaper to compute. This suggests that it might be practical to add the disambiguation layer at the input of various neural NLP models to improve their understanding of polysemy.

Conclusions
PolyLM is a novel model of polysemy based on two assumptions about word senses: firstly, that the probability of a word occurring in a context is equal to the sum of the probabilities of its individual senses occurring, as expressed by the language modeling loss; and secondly, that generally only one sense of a word ought to have a high probability of occurring in a given context, as expressed by the distinctness loss. PolyLM does indeed learn word senses which correspond well to human notions, as demonstrated by its performance on word sense induction, which matches that of the previous state-of-the-art system despite having six times fewer parameters. It can be easily applied to many word-sense related tasks, as it generates a probability distribution over the senses of each word in the input text. It is not specific to any one contextualizer and so can be improved as contextualizers improve.

A Justification of the Distinctness Loss
Consider the derivative of the language modeling loss for one particular target position $i \in T$ with respect to the pre-softmax scores $e_k \cdot y^P_i + b_k$ of the target word $w_i$'s sense embeddings $k \in S_{w_i}$. For brevity, we define $y_k = e_k \cdot y^P_i + b_k$. Then

$$\frac{\partial}{\partial y_k} J_{LM}(c, c', \{i\}) = p_{ik}(c') - q^P_{ik}(c', c).$$

Since $q^P_{ik} > p_{ik}$, $\frac{\partial}{\partial y_k} J_{LM}(c, c', \{i\})$ will always be negative, meaning that every sense embedding for the target word will always move towards the contextualized representation $y^P_i$. This is undesirable, because it means that even senses which are irrelevant in a context will receive a positive update.
Now consider the derivative of the distinctness loss:

$$\frac{\partial}{\partial y_k} J_D(c, c', \{i\}) = q^P_{ik}(c', c) - \frac{e^{r y_k}}{\sum_{s \in S_{w_i}} e^{r y_s}}.$$

When $r > 1$, $\frac{e^{r y_k}}{\sum_{s \in S_{w_i}} e^{r y_s}}$ is a "sharpened" version of $q^P_{ik}(c', c)$: it is larger than $q^P_{ik}$ when $q^P_{ik}$ is large, and smaller when $q^P_{ik}$ is small. Thus the addition of the distinctness loss results in even stronger reinforcement for senses which are highly applicable in the context, and even weaker (possibly negative) reinforcement for senses which are inapplicable. This encourages only one sense of a word to have high probability in a given context, as desired.

B Lemmatization
The training corpus and all text used for evaluation are lemmatized as follows: first, we perform part-of-speech (POS) tagging using Stanford CoreNLP's POS tagger (Manning et al., 2014). Any token with a tag associated with inflectional morphology in English (NNS, JJR, JJS, RBR, RBS, VBD, VBG, VBN, VBP or VBZ) is split into two separate tokens: its lemmatized form and a special token. There is a unique special token for each of the above tags except the pairs JJR and RBR (comparative adjectives and adverbs) and JJS and RBS (superlative adjectives and adverbs), which share [COMP] and [SUP] tokens respectively.