Obtaining Better Static Word Embeddings Using Contextual Embedding Models

The advent of contextual word embeddings, representations of words which incorporate semantic and syntactic information from their context, has led to tremendous improvements on a wide variety of NLP tasks. However, recent contextual models have a prohibitively high computational cost in many use-cases and are often hard to interpret. In this work, we demonstrate that our proposed distillation method, a simple extension of CBOW-based training, significantly improves the computational efficiency of NLP applications while outperforming existing static embeddings trained from scratch as well as those distilled by previously proposed methods. As a side effect, our approach also allows a fair comparison of contextual and static embeddings via standard lexical evaluation tasks.


Introduction
Word embeddings, representations of words which reflect the semantic and syntactic information they carry, are ubiquitous in Natural Language Processing. Static word representation models such as GLOVE (Pennington et al., 2014), CBOW and SKIPGRAM (Mikolov et al., 2013), and SENT2VEC (Pagliardini et al., 2018) obtain stand-alone representations which do not depend on the surrounding words or sentences (context). Contextual embedding models (Devlin et al., 2019; Peters et al., 2018; Liu et al., 2019; Radford et al., 2019; Schwenk and Douze, 2017), on the other hand, embed the contextual information into the word representations as well, making them more expressive than static word representations in most use-cases.
While recent progress on contextual embeddings has been tremendously impactful, static embeddings remain fundamentally important in many scenarios:
• Even when ignoring the training phase, the computational cost of using static word embeddings is typically tens of millions of times lower than that of standard contextual embedding models (BERT base (Devlin et al., 2019), for instance, produces 768-dimensional word embeddings using 109M parameters and requires 29B FLOPs per inference call (Clark et al., 2020)). This is particularly important for latency-critical applications, on low-resource devices, and in view of the environmental costs of NLP models (Strubell et al., 2019).
• Many NLP tasks inherently rely on static word embeddings (Shoemark et al., 2019), for example for interpretability, in research on bias detection and removal (Kaneko and Bollegala, 2019; Gonen and Goldberg, 2019; Manzini et al., 2019), in analyzing word vector spaces (Vulic et al., 2020), or for other metrics which are non-contextual by choice.
We also refer the reader to the article "Do humanists need BERT?" (https://tedunderwood.com/2019/07/15/), which illustrates several downsides of using BERT-like models over static embedding models for non-specialist users. Indeed, static word embeddings remain prevalent in industry and in research areas including, but not limited to, medicine (Zhang et al., 2019; Karadeniz and Özgür, 2019; Magna et al., 2020) and the social sciences (Rheault and Cochrane, 2020; Gordon et al., 2020; Farrell et al., 2020; Lucy et al., 2020). From a cognitive science point of view, human language has been hypothesized to have both contextual as well as context-independent properties (Barsalou, 1982; Rubio-Fernández, 2008), underlining the need for continued research into the expressiveness of context-independent embeddings at the level of words.
Most existing word embedding models, whether static or contextual, follow Firth (1957)'s famous hypothesis, "You shall know a word by the company it keeps", i.e., the meaning of a word arises from its context. When training existing static word embedding models, representations of contexts are generally approximated by the average or sum of the constituent word embeddings, which disregards the relative word ordering as well as the interplay of information beyond simple pairs of words, thus losing most contextual information. Ad-hoc remedies attempt to capture longer contextual information per word using higher-order n-grams such as bigrams or trigrams, and have been shown to improve the performance of static word embedding models (Gupta et al., 2019; Zhao et al., 2017). However, these methods do not scale to longer contexts.
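For illustration, the n-gram expansion these remedies rely on can be sketched as follows (a toy helper of our own, not code from the cited works):

```python
def word_ngrams(tokens, max_n=2):
    """Expand a tokenized sentence with its word n-grams (here up
    to bigrams), so that short-range word order survives the
    subsequent averaging of context vectors."""
    grams = list(tokens)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams(["the", "cat", "sat"]))
# → ['the', 'cat', 'sat', 'the cat', 'cat sat']
```

Covering trigrams or longer spans makes the effective vocabulary, and hence memory, grow combinatorially, which is precisely why such remedies do not scale to long contexts.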
In this work, we obtain improved static word embeddings by leveraging recent contextual embedding advances, namely by distilling existing contextual embeddings into static ones. Our proposed distillation procedure is inspired by existing CBOW-based static word embedding algorithms, but during training plugs in any existing contextual representation to serve as the context element of each word.
Our resulting embeddings outperform the current static embedding methods, as well as the current state-of-the-art static embedding distillation method, on both unsupervised lexical similarity tasks and downstream supervised tasks, by a significant margin. The resulting static embeddings remain compatible with the underlying contextual model used, and thus allow us to gauge the extent of lexical information carried by static vs. contextual word embeddings. We release our code and trained embeddings publicly on GitHub (https://github.com/epfml/X2Static).

Related Work
A few methods for distilling static embeddings have already been proposed. Ethayarajh (2019) proposes using contextual embeddings of the same word in a large number of different contexts: the first principal component of the matrix formed by stacking these embeddings as rows serves as the static embedding. However, this method is not scalable in terms of memory (the embedding matrix scales with the number of contexts) or computational cost (PCA). Bommasani et al. (2020) propose two different approaches to obtain static embeddings from contextual models.
1. Decontextualized Static Embeddings - The word w alone, without any context, after tokenization into constituents w_1, …, w_n, is fed to the contextual embedding model M, and the resulting static embedding is given by g(M(w_1), …, M(w_n)), where g is a pooling operation. These embeddings are observed to perform dismally on the standard static word embedding evaluation tasks.
2. Aggregated Static Embeddings - Since contextual embedding models are not trained on a single word (without any context) as input, an alternative approach is to obtain the contextual embedding of the word w in different contexts and then pool (max, min, or average) the embeddings obtained from these different contexts. They observe that average pooling leads to the best performance. We refer to this method (with average pooling) as ASE throughout the rest of the paper. As we see in our experiments, the performance of ASE embeddings saturates quickly with increasing size of the raw text corpus and is therefore not scalable.
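ASE with average pooling amounts to a running mean per word over its occurrences; a minimal sketch with mock 2-dimensional vectors follows (function and variable names are our own, not from the tool of Bommasani et al. (2020)):

```python
from collections import defaultdict

def aggregate_static(occurrences):
    """ASE, average pooling: accumulate the contextual vector of
    every occurrence of a word, then divide by the occurrence count.
    `occurrences` is an iterable of (word, contextual_vector) pairs."""
    sums, counts = {}, defaultdict(int)
    for word, vec in occurrences:
        if word not in sums:
            sums[word] = list(vec)
        else:
            sums[word] = [a + b for a, b in zip(sums[word], vec)]
        counts[word] += 1
    return {w: [x / counts[w] for x in v] for w, v in sums.items()}

# "bank" seen in two (mock) contexts with different contextual vectors
emb = aggregate_static([("bank", [1.0, 0.0]), ("bank", [0.0, 1.0])])
print(emb["bank"])  # → [0.5, 0.5]
```

Note that the memory here stays proportional to the vocabulary, unlike the PCA-based method, but each new context still shifts the mean less and less, which matches the observed saturation.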
Other related work includes distillation of contextual word embeddings to obtain sentence embeddings (Reimers and Gurevych, 2019). We also refer the reader to Mickus et al. (2020) for a discussion on the semantic properties of contextual models (primarily BERT) as well as Rogers et al. (2020), a survey on different works exploring the inner workings of BERT including its semantic properties.

Proposed Method
To distill existing contextual word representation models into static word embeddings, we augment a CBOW-inspired static word-embedding method, which serves as our anchor method, to accommodate the additional contextual information of the (contextual) teacher model. SENT2VEC (Pagliardini et al., 2018) is a modification of the CBOW static word-embedding method which, instead of a fixed-size context window, uses the entire sentence to predict the masked word. It can also learn n-gram representations along with unigram representations, allowing it to better disentangle local contextual information from the static unigram embeddings. SENT2VEC, originally meant to obtain sentence embeddings and later repurposed to obtain word representations (Gupta et al., 2019), was shown to outperform competing methods including GLOVE (Pennington et al., 2014), CBOW, SKIPGRAM (Mikolov et al., 2013) and FASTTEXT (Bojanowski et al., 2016) on word similarity evaluations. For a raw text corpus C (a collection of sentences), the training objective is given by

min_{U,V} Σ_{S∈C} Σ_{w_t∈S} ( ℓ(u_{w_t}^⊤ E_ctx(S, w_t)) + Σ_{w′∈N_{w_t}} ℓ(−u_{w′}^⊤ E_ctx(S, w_t)) )    (1)

Here, w_t is the masked target word, U and V are the target word embedding and the source n-gram matrices respectively, N_{w_t} is the set of negative target samples, and ℓ : x ↦ log(1 + e^{−x}) is the logistic loss function. For SENT2VEC, the context encoder E_ctx used in optimizing (1) is simply given by the (static, non-contextual) average of all vectors in the sentence without the target word,

E_ctx(S, w_t) := (1 / |R(S \ {w_t})|) Σ_{w∈R(S \ {w_t})} v_w    (2)

where R(S) denotes the optional expansion of the sentence S from words to short n-grams, i.e., the context sentence embedding is obtained by averaging the embeddings of word n-grams in the sentence S.
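The per-target term of objective (1) can be sketched in plain Python, with ℓ the logistic loss and simple lists standing in for rows of U; this is an illustrative simplification, not the actual SENT2VEC/X2STATIC implementation, and all function names are our own:

```python
import math

def logistic_loss(x):
    # l(x) = log(1 + exp(-x)), the loss used in objective (1)
    return math.log1p(math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def target_term(target_vec, context_vec, negative_vecs):
    """One inner term of objective (1): pull the target embedding
    toward the context representation E_ctx, and push the sampled
    negative targets away from it."""
    loss = logistic_loss(dot(target_vec, context_vec))
    for neg in negative_vecs:
        loss += logistic_loss(-dot(neg, context_vec))
    return loss

# a target aligned with its context incurs a lower loss than a misaligned one
ctx = [1.0, 0.0]
assert target_term([1.0, 0.0], ctx, []) < target_term([-1.0, 0.0], ctx, [])
```

Summing this term over all sentences and target words, with E_ctx built as in (2), recovers the full objective; replacing E_ctx with a contextual encoder gives the distillation objective described next.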
We now generalize the objective (1) by allowing the use of arbitrary modern contextual representations E_ctx instead of the static context representation (2). This key element allows us to translate quality gains from improved contextual representations into better static word embeddings in the resulting matrix U. We propose two different approaches, which differ in the granularity of the context used for obtaining the contextual embeddings.

Approach 1 -Sentences as context
Using contextual representations of all words in the sentence S (or of the sentence S \ {w_t} without the target word) allows for a more refined representation of the context, taking into account the word order as well as the interplay of information among the words of the context.
More formally, let M(S, w) denote the output of a contextual embedding encoder, e.g. BERT, corresponding to the word w when a piece of text S containing w is fed to it as input. We let E_ctx(S, w_t) be the average of the contextual embeddings of all words in S returned by the encoder,

E_ctx(S, w_t) := (1 / |S|) Σ_{w∈S} M(S, w)    (3)

Unlike the static representation (2), this takes into account both the word order and the interplay of information among the words of the context. Certainly, using the masked sentence (S with w_t masked) together with w_t would make for an even better word-context pair, but that would amount to one contextual embedding-encoder inference per word instead of one inference per sentence as in (3), leading to a drastic drop in computational efficiency.
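The averaging in (3) can be sketched as follows; `M` is any callable returning one vector per word of the sentence (e.g. the last hidden layer of BERT), mocked here since the real encoder is beyond the scope of this sketch:

```python
def sentence_context(M, sentence):
    """E_ctx of Approach 1: average the contextual embeddings
    M(S, w) of all words in the sentence. A single encoder call
    yields the shared context vector for every target word in S."""
    vecs = M(sentence)
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# mock encoder: word i of the sentence gets the vector [i, 1.0]
mock_M = lambda s: [[float(i), 1.0] for i in range(len(s))]
print(sentence_context(mock_M, ["a", "b", "c"]))  # → [1.0, 1.0]
```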

Approach 2 -Paragraphs as context
Since contextual models are trained on large pieces of text (generally ≥ 512 tokens), we use paragraphs instead of sentences to obtain the contextual representations. However, in order to predict target words, we use the contextual embeddings within the sentence only. Consequently, for this approach, we have

E_ctx(S, w_t) := (1 / |S|) Σ_{w∈S} M(P_S, w)    (4)

where P_S is the paragraph containing the sentence S. In the transfer phase, this approach is more computationally efficient than the previous one, as we invoke the contextual embedding model M only once per paragraph as opposed to once per constituent sentence. Moreover, it encapsulates the related semantic information of the paragraph in the contextual word embeddings.
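The paragraph-level variant (4) differs only in where the encoder is invoked: once per paragraph, with each sentence then averaging its own slice of the returned vectors. A sketch with a mock encoder and hypothetical names:

```python
def paragraph_contexts(M, paragraph):
    """E_ctx of Approach 2: one encoder call per paragraph; each
    sentence averages only the vectors inside its own token span.
    `paragraph` is a list of sentences (lists of tokens)."""
    tokens = [t for sent in paragraph for t in sent]
    vecs = M(tokens)  # single inference for the whole paragraph
    contexts, pos = [], 0
    for sent in paragraph:
        span = vecs[pos:pos + len(sent)]
        dim = len(span[0])
        contexts.append([sum(v[i] for v in span) / len(span) for i in range(dim)])
        pos += len(sent)
    return contexts

# mock encoder assigning every token the same 1-d vector
mock_M = lambda toks: [[1.0] for _ in toks]
print(paragraph_contexts(mock_M, [["a", "b"], ["c"]]))  # → [[1.0], [1.0]]
```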
We call our models X2STATIC_sent in the sentence case (3) and X2STATIC_para in the paragraph case (4), where X denotes the parent model.

Corpus Preprocessing and Training
We use the same English Wikipedia dump as Pagliardini et al. (2018) as our corpus, both for training static word embedding baselines and for distilling static word embeddings from pre-trained contextual embedding models. We remove all paragraphs with fewer than 3 sentences or 140 characters, lowercase the text, and tokenize the corpus using the Stanford NLP library, resulting in a corpus of approximately 54 million sentences and 1.28 billion words. We then use the Transformers library (Wolf et al., 2020; https://huggingface.co/transformers/) to generate representations from existing transformer models. Our X2STATIC representations are distilled from the last representation layers of these models. We use 12-layer as well as 24-layer pre-trained models with BERT (Devlin et al., 2019), ROBERTA (Liu et al., 2019) and GPT2 (Radford et al., 2019) architectures as teacher models to obtain X2STATIC word embeddings. All X2STATIC models use the same set of training hyperparameters, i.e., no hyperparameter tuning is done at all; the hyperparameters are provided in Table 1. The distillation/training process employs the lazy version of the Adam optimizer (Kingma and Ba, 2015a), suitable for sparse tensors. We use a subsampling parameter similar to FASTTEXT (Bojanowski et al., 2016) in order to subsample frequent target words during training. Each X2STATIC model was trained on a single V100 32 GB GPU. Obtaining X2STATIC embeddings from the 12-layer contextual embedding models took 15-18 hours, while it took 35-38 hours to obtain them from their 24-layer counterparts.
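The paragraph filter described above can be sketched as follows (a hypothetical helper of ours; the actual preprocessing additionally tokenizes with the Stanford NLP library):

```python
def keep_paragraph(sentences, min_sentences=3, min_chars=140):
    """Corpus-cleaning rule: drop paragraphs with fewer than 3
    sentences or fewer than 140 characters, lowercase the rest.
    `sentences` is the paragraph as a list of sentence strings."""
    if len(sentences) < min_sentences:
        return None
    if len(" ".join(sentences)) < min_chars:
        return None
    return [s.lower() for s in sentences]

assert keep_paragraph(["Too short.", "Really."]) is None
```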
To ensure a fair comparison, we also evaluate SENT2VEC, CBOW and SKIPGRAM models trained on the same corpus. We perform an extensive hyperparameter search for these models and choose the configuration with the best average performance on the 5 word similarity datasets used in Subsection 4.2. The hyperparameter grids are listed in Table 2, with the chosen values shown in bold. We set the number of dimensions to 768 to ensure parity with the X2STATIC models. We used the SENT2VEC library for training SENT2VEC and the FASTTEXT library for training the CBOW and SKIPGRAM models. We also evaluate pre-trained 300-dimensional GLOVE (Pennington et al., 2014) and FASTTEXT (Bojanowski et al., 2016) models in Table 3. The GLOVE model was trained on a Common Crawl corpus of 840 billion tokens (approximately 650 times larger than our corpus), while the FASTTEXT vectors were trained on a corpus of 16 billion tokens (approximately 12 times larger than our corpus). We also extract ASE embeddings from each layer using the same Wikipedia corpus.
We perform two different sets of evaluations. The first set consists of unsupervised word similarity evaluations to gauge the quality of the obtained word embeddings. However, we recognize that there are concerns regarding word-similarity evaluation tasks (Faruqui et al., 2016), as they have been shown to exhibit significant differences in performance under hyperparameter tuning (Levy et al., 2015). To address these limitations, we also evaluate the X2STATIC embeddings on a standard set of downstream supervised evaluation tasks used in Pagliardini et al. (2018).

Unsupervised word similarity evaluation
To assess the quality of the lexical information contained in the obtained word representations, we use the 4 word-similarity datasets used by Bommasani et al. (2020), namely WordSim353 (353 word pairs) (Agirre et al., 2009), SimLex-999 (999 word pairs) (Hill et al., 2014), RG-65 (65 pairs) (Joubarne and Inkpen, 2011), and SimVerb-3500 (3500 pairs) (Gerz et al., 2016), as well as the Rare Words RW-2034 dataset (2034 pairs) (Luong et al., 2013). To calculate the similarity between two words, we use the cosine similarity between their word embeddings. These similarity scores are compared to the human ratings using Spearman's ρ (Spearman, 1904) correlation. We use the tool provided by Bommasani et al. (2020) (https://github.com/rishibommasani/Contextual2Static) to report results on ASE embeddings. It takes around 3 days to obtain ASE representations of the 2005 words in these word-similarity datasets for the 12-layer models and around 5 days for their 24-layer counterparts, on the same machine used for learning X2STATIC representations. All other embeddings are evaluated using the MUSE repository evaluation tool (Lample et al., 2018; https://github.com/facebookresearch/MUSE).
We perform two sets of experiments concerning the unsupervised evaluation tasks. The first is the comparison of our X2STATIC models with competing models. For ASE, we report two sets of results: one which, per task, reports the best result amongst all layers, and another which reports the results obtained on the layer with the best average performance.
We report our observations in Table 3 and provide additional results for larger models in Appendix B. We observe that X2STATIC embeddings outperform competing models on most of the tasks. Moreover, the extent of the improvement on the SimLex-999 and SimVerb-3500 tasks compared to the previous models strongly highlights the advantage of using improved context representations for training static word representations.
Second, we study the performance of the best ASE embedding layer with respect to the size of the corpus used. Bommasani et al. (2020) report their results on a corpus of only up to N = 100,000 sentences. In order to measure the full potential of the ASE method, we obtain different sets of ASE embeddings as well as X2STATIC_para embeddings, from small chunks of the corpus up to the full Wikipedia corpus itself, and compare their performance on the SimLex-999 and RW-2034 datasets. We choose SimLex-999 as it captures true similarity instead of relatedness or association (Hill et al., 2014), and RW-2034 to gauge the robustness of the embedding model on rare words. We report our observations in Figure 1. The performance of the ASE embeddings tends to saturate as the corpus size increases, while the X2STATIC_para embeddings either already significantly outperform the ASE embeddings or still show a significantly greater growth rate in performance with respect to the corpus size. The experimental evidence thus suggests that on larger corpora X2STATIC embeddings will perform even better, and hence that X2STATIC is a better alternative than ASE embeddings from any layer of the contextual embedding model for obtaining improved static word embeddings from contextual embedding models.

Downstream supervised evaluation
We evaluate the obtained word embeddings on various sentence-level supervised classification tasks. Six downstream supervised evaluation tasks, namely classification of movie review sentiment (MR) (Pang and Lee, 2005), product reviews (CR) (Hu and Liu, 2004), subjectivity classification (SUBJ) (Pang and Lee, 2004), opinion polarity (MPQA) (Wiebe et al., 2005), question type classification (TREC) (Voorhees, 2002) and fine-grained sentiment analysis (SST-5), are employed to gauge the performance of the obtained word embeddings.
We use a standard CNN-based architecture on top of our embeddings to train our classifier. We use 100 convolutional filters with a kernel size of 3, followed by a ReLU activation function. A global max-pooling layer follows the convolutional layer. Before feeding the max-pooled output to a classifier, it is passed through a dropout layer with dropout probability 0.5 to prevent overfitting. We use Adam (Kingma and Ba, 2015b) to train the classifier. To put the performance of these static models into a broader perspective, we also fine-tune linear classifiers on top of their parent models as well as on sentence-transformers (Reimers and Gurevych, 2019) obtained from ROBERTA-12 and BERT-12. For the latter, we use the sentence-transformer models obtained by fine-tuning the parent models on the Natural Language Inference (NLI) task using the combination of the Stanford NLI (Bowman et al., 2015) and Multi-Genre NLI (Williams et al., 2018) datasets. These models are referred to as SBERT-BASE-NLI and SROBERTA-BASE-NLI in the rest of the paper.
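The classifier head can be sketched without any deep-learning framework; the sketch below shows only the convolution-ReLU-max-pool feature extractor with toy dimensions (not the dropout, linear layer, or training loop, and not the authors' actual implementation):

```python
def cnn_features(embeds, filters, kernel=3):
    """Width-3 1-D convolutions over the word-embedding sequence,
    ReLU, then global max pooling: one pooled activation per filter.
    `embeds`: list of d-dim word vectors; each filter is a flat
    list of kernel*d weights."""
    pooled = []
    for w in filters:
        best = 0.0  # the ReLU floor also initializes the max pool
        for i in range(len(embeds) - kernel + 1):
            window = [x for vec in embeds[i:i + kernel] for x in vec]
            act = max(0.0, sum(a * b for a, b in zip(w, window)))
            best = max(best, act)
        pooled.append(best)
    return pooled

# 1-d embeddings, one summing filter: windows score 6 and 9, pool keeps 9
print(cnn_features([[1.0], [2.0], [3.0], [4.0]], [[1.0, 1.0, 1.0]]))  # → [9.0]
```

With 100 filters as in the paper, the pooled output is a 100-dimensional sentence feature vector, which would then pass through dropout and the final classifier.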
The hyperparameter search space for the fine-tuning process involves the number of epochs (8-16) and the learning rates [1e-4, 3e-4, 1e-3]. Wherever a train, validation, and test split is not given, we use 60% of the data as training data, 20% as validation data and the rest as test data. After obtaining the best hyperparameters, we train on the training and validation data together with these hyperparameters and report results on the test set. For the linear classifiers on top of the parent models, we set the number-of-epochs and learning-rate search spaces for the parent model + linear classifier combination to [3, 4, 5, 6] and [2e-5, 5e-5] respectively. These learning rates are lower than those for the static embeddings because the contextual embeddings are also fine-tuned, following the recommendation of Devlin et al. (2019). For the sentence-transformer models, we only train the linear classifier and set the number-of-epochs and learning-rate search spaces to [3, 4, 5, 6] and [1e-4, 3e-4, 1e-3] respectively.

Figure 1: Effect of corpus size on the word-embedding quality (SimLex-999 and RW-2034) for the best task-independent ASE layer and X2STATIC_para, with parent models BERT-12, ROBERTA-12 and GPT2-12; the parent model is indicated in subscript.

We use cross-entropy
loss for training all the models. We use the Macro-F1 score and accuracy to gauge the quality of our predictions. We compare X2STATIC models with all other static models trained from scratch on the same corpus, as well as with the GLOVE and FASTTEXT models used in the previous section. We also use existing GLOVE embeddings trained on tweets (27 billion tokens, 20 times larger than our corpus) (Pennington et al., 2014) to make the comparison even more extensive. We report our observations in Table 4. For ASE embeddings, we take the layer with the best average macro-F1 performance.
We observe that, when measuring overall performance, all X2STATIC embeddings outperform their competitors by a significant margin, with the exception of ROBERTA2STATIC_sent, which has an average F1 score similar to ASE owing to its dismal performance on the CR task. Even though the GLOVE and FASTTEXT embeddings were trained on corpora one to two orders of magnitude larger and have a larger vocabulary, their performance lags behind that of the X2STATIC embeddings. To ensure statistical soundness, we report in Table 5 the mean and standard deviation of the performance over 6 runs of X2STATIC_para model training followed by downstream evaluation, along with 6 runs of ASE embedding downstream evaluation with different random seeds; the X2STATIC_para embeddings again lead by a significant margin.

Table 4: Comparison of the performance of different static embeddings on downstream tasks. All X2STATIC method performances which improve on, or are at par with, all other static embedding methods and the best ASE layer of their parent model are shown in bold. The best static embedding performance for each task is underlined. For each ASE method, the number in brackets indicates the layer with the best average performance. We use macro-F1 scores and accuracy as the metrics on these downstream tasks. Note: contextual embeddings for BERT-12, ROBERTA-12 and GPT2-12 in the SOTA section are also fine-tuned, while SBERT-BASE-NLI and SROBERTA-BASE-NLI are not.
For both the word similarity evaluations and the downstream supervised tasks, we observe that X2STATIC_para embeddings perform slightly better than X2STATIC_sent embeddings. However, since no hyperparameter tuning was performed for the distillation of X2STATIC embeddings, it is hard to discern definitively which X2STATIC variant performs better. For the same reason, we expect to see even larger improvements with proper hyperparameter tuning as well as training on larger data.

Conclusion and Future Work
This work proposes to augment earlier WORD2VEC-based methods by leveraging recent, more expressive deep contextual embedding models to extract static word embeddings. The resulting distilled static embeddings, on average, outperform their competitors on both unsupervised as well as downstream supervised evaluations, and can thus replace compute-heavy contextual embedding models (or existing static embedding models) at inference time in many compute-resource-limited applications. The resulting embeddings can also be used as a task-agnostic tool to measure the lexical information conveyed by contextual embedding models, allowing a fair comparison with their static analogues.
Further work can explore extending this distillation framework to cross-lingual domains (Schwenk and Douze, 2017; Lample and Conneau, 2019), using better pooling methods than simple averaging for obtaining the context representation, or joint fine-tuning to obtain even stronger static word embeddings. Another promising avenue is using a similar approach to learn sense embeddings from contextual embedding models. We would also like to investigate the performance of these embeddings when distilled on a larger corpus, with more extensive hyperparameter tuning. Last but not least, we would like to release X2STATIC models for different languages for further public use.

B Experiments on larger models
In addition to the smaller 12-layer contextual embedding models, we also obtain X2STATIC word vectors from the larger 24-layer contextual embedding models, once again outperforming their ASE counterparts by a significant margin. The evaluation results can be accessed in Table 6.

Table 6: Comparison of the performance of different embedding methods on word similarity tasks. Models are compared using Spearman correlation for word similarity tasks. All X2STATIC method performances which improve over all ASE methods on their parent model as well as all static models are shown in bold. The best performance in each task is underlined. For all ASE methods, the number in parentheses for each dataset indicates which layer was used for obtaining the static embeddings.