Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space

There is rising interest in vector-space word embeddings and their use in NLP, especially given recent methods for their fast estimation at very large scale. Nearly all this work, however, assumes a single vector per word type ignoring polysemy and thus jeopardizing their usefulness for downstream tasks. We present an extension to the Skip-gram model that efficiently learns multiple embeddings per word type. It differs from recent related work by jointly performing word sense discrimination and embedding learning, by non-parametrically estimating the number of senses per word type, and by its efficiency and scalability. We present new state-of-the-art results in the word similarity in context task and demonstrate its scalability by training with one machine on a corpus of nearly 1 billion tokens in less than 6 hours.


Introduction
Representing words by dense, real-valued vector embeddings, also commonly called "distributed representations," helps address the curse of dimensionality and improve generalization, because such embeddings can place words with similar semantic and syntactic roles near each other. This has been shown dramatically in state-of-the-art results on language modeling (Bengio et al, 2003; Mnih and Hinton, 2007) as well as improvements in other natural language processing tasks (Collobert and Weston, 2008; Turian et al, 2010). Substantial benefit arises when embeddings can be trained on large volumes of data. Hence the recent considerable interest in the CBOW and Skip-gram models of Mikolov et al (2013a) and Mikolov et al (2013b): relatively simple log-linear models that can be trained to produce high-quality word embeddings on the entirety of English Wikipedia text in less than half a day on one machine.
There is rising enthusiasm for applying these models to improve accuracy in natural language processing, much as Brown clusters (Brown et al, 1992) have become common input features for many tasks, such as named entity extraction (Miller et al, 2004; Ratinov and Roth, 2009) and parsing (Koo et al, 2008; Täckström et al, 2012). In comparison to Brown clusters, the vector embeddings have the advantages of substantially better scalability in their training, and intriguing potential for their continuous and multi-dimensional interrelations. In fact, Passos et al (2014) present new state-of-the-art results in CoNLL 2003 named entity extraction by directly inputting continuous vector embeddings obtained by a version of Skip-gram that injects supervision with lexicons. Similarly, Bansal et al (2014) show results in dependency parsing using Skip-gram embeddings. Skip-gram embeddings have also recently been applied to machine translation (Zou et al, 2013; Mikolov et al, 2013c).
A notable deficiency in this prior work is that each word type (e.g. the word string plant) has only one vector representation: polysemy and homonymy are ignored. This results in the word plant having an embedding that is approximately the average of its different contextual semantics relating to biology, placement, manufacturing and power generation. In moderately high-dimensional spaces a vector can be relatively "close" to multiple regions at a time, but this does not negate the unfortunate influence of the triangle inequality here: words that are not synonyms but are synonymous with different senses of the same word will be pulled together. For example, pollen and refinery will be inappropriately pulled to a distance not more than the sum of the distances plant-pollen and plant-refinery. Fitting the constraints of legitimate continuous gradations of semantics is challenge enough without the additional encumbrance of these illegitimate triangle inequalities.
Discovering embeddings for multiple senses per word type is the focus of work by Reisinger and Mooney (2010a) and Huang et al (2012). They both pre-cluster the contexts of a word type's tokens into discriminated senses, use the clusters to re-label the corpus' tokens according to sense, and then learn embeddings for these re-labeled words. The second paper improves upon the first by employing an earlier pass of non-discriminated embedding learning to obtain vectors used to represent the contexts. Note that by pre-clustering, these methods lose the opportunity to jointly learn the sense-discriminated vectors and the clustering. Other weaknesses include their fixed number of senses per word type, and the computational expense of the two-step process: the Huang et al (2012) method took one week of computation to learn multiple embeddings for a 6,000-word subset of the 100,000-word vocabulary on a corpus containing close to a billion tokens.
This paper presents a new method for learning vector-space embeddings for multiple senses per word type, designed to provide several advantages over previous approaches. (1) Sense-discriminated vectors are learned jointly with the assignment of token contexts to senses; thus we can use the emerging sense representation to more accurately perform the clustering. (2) A non-parametric variant of our method automatically discovers a varying number of senses per word type. (3) Efficient online joint training makes it fast and scalable. We refer to our method as Multiple-sense Skip-gram, or MSSG, and its non-parametric counterpart as NP-MSSG.
Our method builds on the Skip-gram model (Mikolov et al, 2013a), but maintains multiple vectors per word type. During online training with a particular token, we use the average of its context words' vectors to select the token's sense that is closest, and perform a gradient update on that sense. In the non-parametric version of our method, we build on facility location (Meyerson, 2001): a new cluster is created with probability proportional to the distance from the context to the nearest sense.
We present experimental results demonstrating the benefits of our approach. We show qualitative improvements over single-sense Skip-gram and Huang et al (2012), comparing against word neighbors from our parametric and non-parametric methods. We present quantitative results in three tasks. On both the SCWS and WordSim353 data sets our methods surpass the previous state-of-the-art. The Google Analogy task is not especially well-suited for word-sense evaluation since its lack of context makes selecting the sense difficult; however, our method dramatically outperforms Huang et al (2012) on this task. Finally, we also demonstrate scalability, learning multiple senses while training on nearly a billion tokens in less than 6 hours, a 27x improvement over Huang et al (2012).

Related Work
Much prior work has focused on learning vector representations of words; here we will describe only those most relevant to understanding this paper. Our work is based on neural language models, proposed by Bengio et al (2003), which extend the traditional idea of n-gram language models by replacing the conditional probability table with a neural network, representing each word token by a small vector instead of an indicator variable, and estimating the parameters of the neural network and these vectors jointly. Since the Bengio et al (2003) model is quite expensive to train, much research has focused on optimizing it. Collobert and Weston (2008) replace the max-likelihood character of the model with a max-margin approach, where the network is encouraged to score the correct n-grams higher than randomly chosen incorrect n-grams. Mnih and Hinton (2007) replace the global normalization of the Bengio model with a tree-structured probability distribution, and also consider multiple positions for each word in the tree.
More relevantly, Mikolov et al (2013a) and Mikolov et al (2013b) propose extremely computationally efficient log-linear neural language models by removing the hidden layers of the neural networks and training from larger context windows with very aggressive subsampling. The goal of the models in Mikolov et al (2013a) and Mikolov et al (2013b) is not so much obtaining a low-perplexity language model as learning word representations which will be useful in downstream tasks. Neural networks or log-linear models also do not appear to be necessary to learn high-quality word embeddings, as Dhillon and Ungar (2011) estimate word vector representations using Canonical Correlation Analysis (CCA).
There is considerably less prior work on learning multiple vector representations for the same word type. Reisinger and Mooney (2010a) introduce a method for constructing multiple sparse, high-dimensional vector representations of words. Huang et al (2012) extend this approach, incorporating global document context to learn multiple dense, low-dimensional embeddings by using recursive neural networks. Both methods perform word sense discrimination as a preprocessing step by clustering contexts for each word type, making training more expensive. While methods such as those described in Dhillon and Ungar (2011) and Reddy et al (2011) use token-specific representations of words as part of the learning algorithm, the final outputs are still one-to-one mappings between word types and word embeddings.

Background: Skip-gram model
The Skip-gram model learns word embeddings such that they are useful in predicting the surrounding words in a sentence. In the Skip-gram model, $v(w) \in \mathbb{R}^d$ is the vector representation of the word $w \in W$, where $W$ is the word vocabulary and $d$ is the embedding dimensionality.
Given a pair of words $(w_t, c)$, the probability that the word $c$ is observed in the context of word $w_t$ is given by

$$P(D = 1 \mid v(w_t), v(c)) = \frac{1}{1 + e^{-v(w_t)^{\top} v(c)}}.$$

The probability of not observing word $c$ in the context of $w_t$ is given by

$$P(D = 0 \mid v(w_t), v(c)) = 1 - P(D = 1 \mid v(w_t), v(c)).$$

Given a training set containing the sequence of word types $w_1, w_2, \ldots, w_T$, the word embeddings are learned by maximizing the following objective function:

$$J(\theta) = \sum_{(w_t, c_t) \in D^+} \sum_{c \in c_t} \log P(D = 1 \mid v(w_t), v(c)) + \sum_{(w_t, c'_t) \in D^-} \sum_{c' \in c'_t} \log P(D = 0 \mid v(w_t), v(c')),$$

where $w_t$ is the $t$-th word in the training set, $c_t$ is the set of observed context words of word $w_t$ and $c'_t$ is the set of randomly sampled, noisy context words for the word $w_t$. $D^+$ consists of the set of all observed word-context pairs $(w_t, c_t)$ $(t = 1, 2, \ldots, T)$. $D^-$ consists of the pairs $(w_t, c'_t)$ $(t = 1, 2, \ldots, T)$.
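As a rough illustration of the probabilistic model above, the following Python sketch evaluates the probability of observing a context word and the contribution of a single training word to the objective. It assumes the usual logistic form of the Skip-gram probability; the function names (`p_observed`, `word_objective`) and the toy vectors are hypothetical and not taken from the released code.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def p_observed(v_word, v_ctx):
    """P(D = 1 | v(w_t), v(c)): probability that c is a true context of w_t."""
    return sigmoid(dot(v_word, v_ctx))

def word_objective(v_word, observed_ctx, noisy_ctx):
    """Objective contribution of one training word (hypothetical helper):
    log P(D = 1) summed over observed context vectors plus
    log P(D = 0) summed over noisy context vectors."""
    pos = sum(math.log(p_observed(v_word, c)) for c in observed_ctx)
    neg = sum(math.log(1.0 - p_observed(v_word, c)) for c in noisy_ctx)
    return pos + neg

# Toy usage: a word vector that agrees with its observed context and
# disagrees with the noisy one scores higher than the reverse pairing.
v_w = [1.0, 0.5]
good = word_objective(v_w, [[2.0, 1.0]], [[-2.0, -1.0]])
bad = word_objective(v_w, [[-2.0, -1.0]], [[2.0, 1.0]])
```

Maximizing this quantity pushes the word vector toward its observed contexts and away from the sampled noise words.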
For each training word $w_t$, the set of context words $c_t = \{w_{t-R_t}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+R_t}\}$ includes $R_t$ words to the left and right of the given word, as shown in Figure 1. $R_t$ is the window size considered for the word $w_t$, sampled uniformly at random from the set $\{1, 2, \ldots, N\}$, where $N$ is the maximum context window size.
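The window sampling just described can be sketched in a few lines of Python. This is a minimal sketch: the function name `sample_context` is hypothetical, and truncating the window at the sequence boundaries is an assumption, since the text does not say how boundaries are handled.

```python
import random

def sample_context(tokens, t, max_window, rng=random):
    """Return (R_t, context words) for the token at position t.

    R_t is drawn uniformly from {1, ..., max_window}. Positions outside the
    sequence are simply dropped (a boundary-handling assumption).
    """
    r_t = rng.randint(1, max_window)       # inclusive on both ends
    left = tokens[max(0, t - r_t):t]       # up to R_t words before w_t
    right = tokens[t + 1:t + 1 + r_t]      # up to R_t words after w_t
    return r_t, left + right

# Toy usage with a seeded generator so the draw is repeatable.
tokens = "the plant produces pollen in spring".split()
r_t, context = sample_context(tokens, 2, 5, random.Random(0))
```

Because `R_t` is resampled for every token, nearby words are included in the context more often than distant ones.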
The set of noisy context words $c'_t$ for the word $w_t$ is constructed by randomly sampling $S$ noisy context words for each word in the context $c_t$. The noisy context words are randomly sampled from the following distribution,

$$P(w) = \frac{p_{unigram}(w)^{3/4}}{Z},$$

where $p_{unigram}(w)$ is the unigram distribution of the words and $Z$ is the normalization constant.
To extend the Skip-gram model to multiple embeddings per word, we let each sense of a word have its own embedding, and induce the senses by clustering the embeddings of the context words around each token.
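A small Python sketch of this noise sampling follows. The exponent 0.75 is an assumption taken from the standard Skip-gram negative-sampling recipe, so it is exposed as the `power` parameter; the helper names are hypothetical.

```python
import random
from collections import Counter

def noise_distribution(tokens, power=0.75):
    """Smoothed unigram distribution: p(w) proportional to count(w) ** power.

    power=0.75 is an assumption (the usual negative-sampling choice).
    """
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())              # normalization constant Z
    return {w: wt / z for w, wt in weights.items()}

def sample_noisy_context(context, noise, samples_per_word, rng=random):
    """Draw S noisy words for each observed context word."""
    words = list(noise)
    probs = [noise[w] for w in words]
    return rng.choices(words, weights=probs, k=samples_per_word * len(context))

# Toy usage: one noisy word per observed context word (S = 1).
corpus = ["a"] * 16 + ["b"]
noise = noise_distribution(corpus)
noisy = sample_noisy_context(["x", "y", "z"], noise, 1, random.Random(0))
```

Raising counts to a power below one flattens the distribution, so rare words are drawn as noise somewhat more often than their raw frequency would suggest.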

Multi-Sense Skip-gram (MSSG) model
The vector representation of the context is the average of its context words' vectors. For every word type, we maintain clusters of its contexts and the sense of a word token is predicted as the cluster that is closest to its context representation. After predicting the sense of a word token, we perform a gradient update on the embedding of that sense. The crucial difference from previous approaches is that word sense discrimination and learning embeddings are performed jointly by predicting the sense of the word using the current parameter estimates.
In the MSSG model, each word $w \in W$ is associated with a global vector $v_g(w)$, and each sense of the word has an embedding (sense vector) $v_s(w, k)$ $(k = 1, 2, \ldots, K)$ and a context cluster with center $\mu(w, k)$ $(k = 1, 2, \ldots, K)$. The $K$ sense vectors and the global vectors are of dimension $d$, and $K$ is a hyperparameter.
Figure 2: Architecture of the Multi-Sense Skip-gram (MSSG) model with window size $R_t = 2$ and $K = 3$. Context $c_t$ of word $w_t$ consists of $w_{t-1}, w_{t-2}, w_{t+1}, w_{t+2}$. The sense is predicted by finding the cluster center of the context that is closest to the average of the context vectors.

Consider the word $w_t$ and let $c_t = \{w_{t-R_t}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+R_t}\}$ be the set of observed context words. The vector representation of the context is defined as the average of the global vector representations of the words in the context. Let

$$v_{context}(c_t) = \frac{1}{2 R_t} \sum_{c \in c_t} v_g(c)$$

be the vector representation of the context $c_t$. We use the global vectors of the context words instead of their sense vectors to avoid the computational complexity associated with predicting the senses of the context words. We predict $s_t$, the sense of word $w_t$ when observed with context $c_t$, as the context cluster membership of the vector $v_{context}(c_t)$, as shown in Figure 2. More formally,

$$s_t = \arg\max_{k = 1, 2, \ldots, K} \mathrm{sim}(\mu(w_t, k), v_{context}(c_t)).$$

The hard cluster assignment is similar to the k-means algorithm. The cluster center is the average of the vector representations of all the contexts which belong to that cluster. For sim we use cosine similarity in our experiments.
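The sense prediction step can be illustrated with the following Python sketch, which averages the global vectors of the context words and picks the most similar cluster center by cosine similarity. The function names are hypothetical, sense indices are 0-based here (the text uses 1-based indices), and averaging over the context words actually present is an assumption for windows cut short at sequence boundaries.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def predict_sense(context_global_vectors, cluster_centers):
    """Return (s_t, v_context): the index of the closest context cluster
    and the averaged context vector used to choose it."""
    v_context = mean_vector(context_global_vectors)
    sims = [cosine(mu, v_context) for mu in cluster_centers]
    s_t = max(range(len(sims)), key=sims.__getitem__)
    return s_t, v_context

# Toy usage: K = 3 cluster centers in two dimensions.
centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
s_t, v_ctx = predict_sense([[0.1, 0.9], [0.0, 1.1]], centers)
```

Only the global vectors of the context words are consulted, so predicting the sense of one token never requires predicting the senses of its neighbors.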
Here, the probability that the word $c$ is observed in the context of word $w_t$ given the sense of the word $w_t$ is

$$P(D = 1 \mid v_s(w_t, s_t), v_g(c)) = \frac{1}{1 + e^{-v_s(w_t, s_t)^{\top} v_g(c)}}.$$

The probability of not observing word $c$ in the context of $w_t$ given the sense of the word $w_t$ is

$$P(D = 0 \mid v_s(w_t, s_t), v_g(c)) = 1 - P(D = 1 \mid v_s(w_t, s_t), v_g(c)).$$

Given a training set containing the sequence of word types $w_1, w_2, \ldots, w_T$, the word embeddings are learned by maximizing the following objective function:

$$J(\theta) = \sum_{(w_t, c_t) \in D^+} \sum_{c \in c_t} \log P(D = 1 \mid v_s(w_t, s_t), v_g(c)) + \sum_{(w_t, c'_t) \in D^-} \sum_{c' \in c'_t} \log P(D = 0 \mid v_s(w_t, s_t), v_g(c')),$$

where $w_t$ is the $t$-th word in the sequence, $c_t$ is the set of observed context words and $c'_t$ is the set of noisy context words for the word $w_t$. $D^+$ and $D^-$ are constructed in the same way as in the Skip-gram model. After predicting the sense of word $w_t$, we update the embedding of the predicted sense for the word $w_t$ ($v_s(w_t, s_t)$), the global vectors of the words in the context and the global vectors of the randomly sampled, noisy context words. The context cluster center of cluster $s_t$ for the word $w_t$ ($\mu(w_t, s_t)$) is updated since context $c_t$ is added to the cluster $s_t$. Training outputs $v_s(w, k)$, $v_g(w)$ and the context cluster centers $\mu(w, k)$, $\forall w \in W, k \in \{1, \ldots, K\}$.
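The two updates that follow sense prediction can be sketched as below: a running-mean update of the chosen cluster center, and one gradient ascent step on the logistic objective for the chosen sense vector and the context global vectors. This is a simplified sketch: it uses a plain fixed-step update rather than AdaGrad, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def update_cluster_center(center, count, v_context):
    """Fold v_context into a center that currently averages `count` contexts."""
    new_count = count + 1
    new_center = [m + (x - m) / new_count for m, x in zip(center, v_context)]
    return new_center, new_count

def gradient_step(v_sense, ctx_vectors, labels, lr=0.025):
    """One ascent step, in place, on the sum of log-likelihood terms.

    labels[i] is 1 for an observed context word and 0 for a noisy one.
    The gradient of each term is (label - sigmoid(score)) times the other vector.
    """
    grad_sense = [0.0] * len(v_sense)
    for v_c, label in zip(ctx_vectors, labels):
        g = label - sigmoid(sum(a * b for a, b in zip(v_sense, v_c)))
        for i in range(len(v_sense)):
            grad_sense[i] += g * v_c[i]    # uses v_c before it is modified
            v_c[i] += lr * g * v_sense[i]  # update the context global vector
    for i in range(len(v_sense)):
        v_sense[i] += lr * grad_sense[i]   # update the predicted sense vector

# Toy usage: one observed and one noisy context vector.
center, count = update_cluster_center([0.0, 0.0], 0, [2.0, 4.0])
v_s = [0.1, 0.1]
observed, noisy = [1.0, 0.0], [0.0, 1.0]
gradient_step(v_s, [observed, noisy], [1, 0])
```

After the step, the sense vector moves toward the observed context vector and away from the noisy one; only the predicted sense of the word is touched, which keeps the per-token cost close to that of single-sense Skip-gram.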

Non-Parametric MSSG model (NP-MSSG)
The MSSG model learns a fixed number of senses per word type. In this section, we describe a non-parametric version of MSSG, the NP-MSSG model, which learns a varying number of senses per word type. Our approach is closely related to the online non-parametric clustering procedure described in Meyerson (2001). We create a new cluster (sense) for a word type with probability proportional to the distance of its context to the nearest cluster (sense).
Each word $w \in W$ is associated with sense vectors, context clusters and a global vector $v_g(w)$ as in the MSSG model. The number of senses for a word is unknown and is learned during training. Initially, the words do not have sense vectors and context clusters. We create the first sense vector and context cluster for each word on its first occurrence in the training data. After creating the first context cluster for a word, a new context cluster and a sense vector are created online during training when the word is observed with a context where the similarity between the vector representation of the context and every existing cluster center of the word is less than $\lambda$, where $\lambda$ is a hyperparameter of the model.
Consider the word $w_t$ and let $c_t = \{w_{t-R_t}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+R_t}\}$ be the set of observed context words. The vector representation of the context is defined as the average of the global vector representations of the words in the context. Let $v_{context}(c_t) = \frac{1}{2 R_t} \sum_{c \in c_t} v_g(c)$ be the vector representation of the context $c_t$. Let $k(w_t)$ be the number of context clusters, or the number of senses, currently associated with word $w_t$. The sense $s_t$ of word $w_t$ when $k(w_t) > 0$ is given by

$$s_t = \begin{cases} k(w_t) + 1, & \text{if } \max_{k = 1, 2, \ldots, k(w_t)} \mathrm{sim}(\mu(w_t, k), v_{context}(c_t)) < \lambda \\ k_{max}, & \text{otherwise} \end{cases}$$

where $\mu(w_t, k)$ is the cluster center of the $k$-th cluster of word $w_t$ and $k_{max} = \arg\max_{k = 1, 2, \ldots, k(w_t)} \mathrm{sim}(\mu(w_t, k), v_{context}(c_t))$.
The cluster center is the average of the vector representations of all the contexts which belong to that cluster. If $s_t = k(w_t) + 1$, a new context cluster and a new sense vector are created for the word $w_t$.
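The non-parametric assignment rule can be sketched as follows. The sketch only tracks context clusters (creating and initializing the matching sense vector is left out), uses 0-based sense indices, and the function name `np_mssg_assign` is hypothetical; the default threshold of -0.5 matches the value of λ reported in the experiments.

```python
import math

def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def np_mssg_assign(centers, counts, v_context, lam=-0.5):
    """Assign v_context to a cluster of one word type, creating a new cluster
    when every existing center has similarity below lam.

    centers and counts are modified in place; returns the sense index s_t.
    """
    if centers:
        sims = [cosine(mu, v_context) for mu in centers]
        k_max = max(range(len(sims)), key=sims.__getitem__)
        if sims[k_max] >= lam:
            counts[k_max] += 1
            n = counts[k_max]
            # Running mean keeps the center equal to the average of its contexts.
            centers[k_max] = [m + (x - m) / n
                              for m, x in zip(centers[k_max], v_context)]
            return k_max
    # First occurrence of the word, or no existing cluster is similar enough.
    centers.append(list(v_context))
    counts.append(1)
    return len(centers) - 1

# Toy usage: two similar contexts share a sense, an opposite one starts a new sense.
centers, counts = [], []
senses = [np_mssg_assign(centers, counts, v)
          for v in ([1.0, 0.0], [0.8, 0.2], [-1.0, 0.0])]
```

A lower λ makes new senses rarer, since a context must be very dissimilar from every existing cluster center before a cluster is created.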
The NP-MSSG model and the MSSG model described previously differ only in the way word sense discrimination is performed. The objective function and the probabilistic model associated with observing a (word, context) pair given the sense of the word remain the same.

Experiments
To evaluate our algorithms we train embeddings using the same corpus and vocabulary as used in Huang et al (2012), which is the April 2010 snapshot of the Wikipedia corpus (Shaoul and Westbury, 2010). It contains approximately 2 million articles and 990 million tokens. In all our experiments we remove all the words with fewer than 20 occurrences and use a maximum context window ($N$) of length 5 (5 words before and after the word occurrence). We fix the number of senses ($K$) to be 3 for the MSSG model unless otherwise specified. Our hyperparameter values were selected by a small amount of manual exploration on a validation set. In NP-MSSG we set $\lambda$ to -0.5. The Skip-gram, MSSG and NP-MSSG models sample one noisy context word ($S$) for each of the observed context words. We train our models using AdaGrad stochastic gradient descent (Duchi et al, 2011) with initial learning rate set to 0.025. Similarly to Huang et al (2012), we don't use a regularization penalty. Below we describe qualitative results, displaying the embeddings and the nearest neighbors of each word sense, and quantitative experiments in two benchmark word similarity tasks.
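For readers unfamiliar with AdaGrad, the per-coordinate update can be sketched as follows. This is a generic illustration of the optimizer rather than the training code used in the experiments: the class name, the small `eps` stabilizer and the ascent (maximization) sign convention are assumptions; only the initial learning rate of 0.025 comes from the setup above.

```python
import math

class AdaGrad:
    """Per-coordinate AdaGrad for maximizing an objective (gradient ascent)."""

    def __init__(self, dim, lr=0.025, eps=1e-8):
        self.lr = lr                 # initial learning rate
        self.eps = eps               # assumed stabilizer to avoid dividing by zero
        self.accum = [0.0] * dim     # running sum of squared gradients

    def step(self, params, grad):
        """Update params in place; coordinates with a large gradient history
        take smaller steps."""
        for i, g in enumerate(grad):
            self.accum[i] += g * g
            params[i] += self.lr * g / (math.sqrt(self.accum[i]) + self.eps)

# Toy usage: the same gradient applied twice yields a smaller second step.
opt = AdaGrad(dim=2)
params = [0.0, 0.0]
opt.step(params, [2.0, 0.0])
first = params[0]
opt.step(params, [2.0, 0.0])
second = params[0] - first
```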
Table 1 shows the time to train our models, compared with other models from previous work. All these times are from single-machine implementations running on similar-sized corpora. We see that our model shows significant improvement in the training time over the model in Huang et al (2012), and is well within an order of magnitude of the training time for Skip-gram models. Note that the different senses closely correspond to intuitions regarding the senses of the given word types.

Nearest Neighbors
Table 2 shows qualitatively the results of discovering multiple senses by presenting the nearest neighbors associated with various embeddings. The nearest neighbors of a word are computed by comparing the cosine similarity between the embedding for each sense of the word and the context embeddings of all other words in the vocabulary. Note that each of the discovered senses is indeed semantically coherent, and that a reasonable number of senses are created by the non-parametric method.

Word Similarity
We evaluate our embeddings on two related datasets: the WordSim-353 dataset (Finkelstein et al, 2001) and the Contextual Word Similarities (SCWS) dataset (Huang et al, 2012). WordSim-353 is a standard dataset for evaluating word vector representations. It consists of a list of pairs of word types, the similarity of which is rated on an integral scale from 1 to 10. Pairs include both monosemic and polysemic words. The scores for each word pair are given without any contextual information, which makes them tricky to interpret.
Since it is not trivial to deal with multiple embeddings per word, we consider the following similarity measures between words $w$ and $w'$ given their respective contexts $c$ and $c'$:

$$\mathrm{avgSim}(w, w') = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} d(v_s(w, i), v_s(w', j)),$$

$$\mathrm{avgSimC}(w, w') = \sum_{i=1}^{K} \sum_{j=1}^{K} P(w, c, i) \, P(w', c', j) \, d(v_s(w, i), v_s(w', j)),$$

where $P(w, c, k)$ is the probability that $w$ takes the $k$-th sense given the context $c$, and $d(v_s(w, i), v_s(w', j))$ is the similarity measure between the given embeddings $v_s(w, i)$ and $v_s(w', j)$.
The avgSim metric computes the average similarity over all embeddings for each word, ignoring information from the context.
To address this, the avgSimC metric weighs the similarity between each pair of senses by how well each sense fits the context at hand. The globalSim metric uses each word's global context vector, ignoring the many senses:

$$\mathrm{globalSim}(w, w') = d(v_g(w), v_g(w')).$$

Finally, the localSim metric selects a single sense for each word based independently on its context and computes the similarity by

$$\mathrm{localSim}(w, w') = d(v_s(w, k), v_s(w', k')),$$

where $k = \arg\max_i P(w, c, i)$ and $k' = \arg\max_j P(w', c', j)$, and $P(w, c, i)$ is the probability that $w$ takes the $i$-th sense given context $c$. The probability of being in a cluster is calculated as the inverse of the cosine distance to the cluster center (Huang et al, 2012).
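The four similarity measures can be sketched in Python as follows, using cosine similarity for `d`. Two details are assumptions rather than facts from the text: the sense probabilities are normalized to sum to one, and a small `eps` guards against division by zero when a context coincides with a cluster center. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def sense_probs(cluster_centers, v_context, eps=1e-8):
    """P(w, c, k): inverse cosine distance to each center, normalized (assumed)."""
    inv = [1.0 / (1.0 - cosine(mu, v_context) + eps) for mu in cluster_centers]
    z = sum(inv)
    return [x / z for x in inv]

def avg_sim(senses_w, senses_w2):
    """Average similarity over all pairs of sense vectors; context is ignored."""
    total = sum(cosine(a, b) for a in senses_w for b in senses_w2)
    return total / (len(senses_w) * len(senses_w2))

def avg_sim_c(senses_w, probs_w, senses_w2, probs_w2):
    """Pairwise similarities weighted by how well each sense fits its context."""
    return sum(p * q * cosine(a, b)
               for a, p in zip(senses_w, probs_w)
               for b, q in zip(senses_w2, probs_w2))

def global_sim(global_w, global_w2):
    """Similarity of the two global vectors, ignoring senses."""
    return cosine(global_w, global_w2)

def local_sim(senses_w, probs_w, senses_w2, probs_w2):
    """Similarity between the single most probable sense of each word."""
    k = max(range(len(probs_w)), key=probs_w.__getitem__)
    k2 = max(range(len(probs_w2)), key=probs_w2.__getitem__)
    return cosine(senses_w[k], senses_w2[k2])

# Toy usage: two words with two senses each in two dimensions.
senses = [[1.0, 0.0], [0.0, 1.0]]
probs = sense_probs(senses, [0.9, 0.1])
```

avgSim and globalSim need no context, which is why they can also be applied to WordSim-353, whereas avgSimC and localSim require the context provided in SCWS.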
We report the Spearman correlation between a model's similarity scores and the human judgements in the datasets.
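As a reminder of how this evaluation measure is computed, the following Python sketch implements Spearman's rank correlation as the Pearson correlation of average ranks, which handles tied scores. The function names are hypothetical and the sketch assumes that neither input list is constant.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank lists.
    Assumes neither list is constant (otherwise the variance is zero)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Toy usage: model scores that order the pairs exactly as the human ratings do.
rho = spearman([0.1, 0.5, 0.9], [2.0, 5.0, 9.0])
```

Because only ranks matter, a model is not penalized for producing similarity scores on a different scale from the human ratings.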
Table 5 shows the results on the WordSim-353 task. C&W refers to the language model by Collobert and Weston (2008) and the HLBL model is the method described in Mnih and Hinton (2007). On the WordSim-353 task, we see that our model performs significantly better than the previous neural network model for learning multiple representations per word (Huang et al, 2012). Among the methods that learn low-dimensional and dense representations, our model performs slightly better than Skip-gram.
Figure 3 shows the distribution of the number of senses learned per word type in the NP-MSSG model.
We learn the multiple embeddings for the same set of approximately 6000 words that were used in Huang et al (2012) for all our experiments to ensure fair comparison. These approximately 6000 words were chosen by Huang et al. mainly from the top 30,000 most frequent words in the vocabulary. This selection was likely made to avoid the noise of learning multiple senses for infrequent words. However, our method is robust to noise, which can be seen in the good performance of our model when it learns multiple embeddings for the top 30,000 most frequent words. We found that even when learning multiple embeddings for the top 30,000 most frequent words in the vocabulary, the MSSG model still achieves a state-of-the-art result on the SCWS task, with an avgSimC score of 69.2 as shown in Table 6.

Conclusion
We present an extension to the Skip-gram model that efficiently learns multiple embeddings per word type. The model jointly performs word sense discrimination and embedding learning, and non-parametrically estimates the number of senses per word type. Our method achieves new state-of-the-art results in the word similarity in context task and learns multiple senses, training on close to a billion tokens in less than 6 hours. The global vectors, sense vectors and cluster centers of our model, and the code for learning them, are available at https://people.cs.umass.edu/~arvind/emnlp2014wordvectors. In future work we plan to use the multiple embeddings per word type in downstream NLP tasks.

Figure 3: Distribution of the number of senses learned per word type in the NP-MSSG model.

Figure 4: Effect of varying the embedding dimensionality of the MSSG model on the SCWS task.

Figure 5: Effect of varying the number of senses of the MSSG model on the SCWS task.

Figure 1: Architecture of the Skip-gram model with window size $R_t = 2$. Context $c_t$ of word $w_t$ consists of $w_{t-1}, w_{t-2}, w_{t+1}, w_{t+2}$.

Table 2: Nearest neighbors of each sense of each word, by cosine similarity, for different algorithms.

Table 3: Nearest neighbors of the word plant for different models. We see that the discovered senses in both our models are more semantically coherent than those of Huang et al (2012), and NP-MSSG is able to learn a reasonable number of senses.

Table 4 shows the results for the SCWS task. In this task, when the words are

Table 4: Experimental results in the SCWS task. The numbers are Spearman's correlation $\rho \times 100$ between each model's similarity judgments and the human judgments, in context. The first three models learn only a single embedding per word and hence avgSim, avgSimC and localSim are not reported for these models, as they would be identical to globalSim. Both our parametric and non-parametric models outperform the baseline models, and our best model achieves a score of 69.3 in this task. NP-MSSG achieves the best results when the globalSim, avgSim and localSim similarity measures are used. The best results according to each metric are in bold face.

Table 6: Results on the WordSim-353 and SCWS tasks. Multiple embeddings are learned for the top 30,000 most frequent words in the vocabulary. The embedding dimension size is 300 for all the models for this task. The number of senses for the MSSG model is 3.

On the Google Analogy task introduced by Mikolov et al (2013a), both MSSG and NP-MSSG models achieve 64% accuracy, compared to 12% accuracy by Huang et al (2012). Skip-gram, which is the state-of-the-art model for this task, achieves 67% accuracy.