Using Context-to-Vector with Graph Retrofitting to Improve Word Embeddings

Although contextualized embeddings generated from large-scale pre-trained models perform well in many tasks, traditional static embeddings (e.g., Skip-gram, Word2Vec) still play an important role in low-resource and lightweight settings due to their low computational cost, ease of deployment, and stability. In this paper, we aim to improve word embeddings by 1) incorporating more contextual information from existing pre-trained models into the Skip-gram framework, which we call Context-to-Vec; 2) proposing a post-processing retrofitting method for static embeddings, independent of training, that employs prior synonym knowledge and a weighted vector distribution. Through extrinsic and intrinsic tasks, our methods are shown to outperform the baselines by a large margin.


Introduction
Contextualized embeddings such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have become the default architectures for most downstream NLP tasks. However, they are computationally expensive and resource-demanding, and hence environmentally unfriendly. Compared with contextualized embeddings, static embeddings like Skip-gram (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) are lighter and less computationally expensive. Furthermore, they can perform without significant performance loss on context-independent tasks like lexical-semantic tasks (e.g., word analogy), or on tasks with plentiful labeled data and simple language (Arora et al., 2020).
Recent work has attempted to enhance static word embeddings while maintaining the benefits of both contextualized and static embeddings. Among these efforts, one category is the direct conversion of contextualized embeddings to static embeddings (Bommasani et al., 2020). The other category is to make use of contextualized embeddings for training static embeddings (Melamud et al., 2016). The latter category is a newer paradigm, which we call Context-to-Vec. This paradigm not only alleviates the word sense ambiguities of static embeddings, but also fuses more syntactic and semantic information from the context within a fixed window.

Figure 1: The overall training pipeline of our proposed word embedding training and post-processing methods. In the Context-to-Vec phase, static word embeddings are trained using contextualized embeddings based on a monolingual corpus. In the post-processing phase, external knowledge is introduced to fine-tune the word vectors based on the graph topology.
For the Context-to-Vec paradigm, an association between contextualized word vectors and static word vectors is essentially required. In this case, the contextualized signal serves as a source of information enhancement for the static embeddings (Vashishth et al., 2018). However, the existing efforts only consider the contextualized embeddings of center words as the source, which is actually incomplete since the contextualized features for the context words of the center words are ignored.
In addition, benefiting from the invariance and stability of already trained static embeddings, post-processing for retrofitting word vectors is also an effective paradigm for improving static embeddings. For example, one solution is an unsupervised approach that performs a singular value decomposition to reassign feature weights (Artetxe et al., 2018), but this does not utilize external knowledge and lacks interpretability; moreover, if the initial spatial distribution of the trained word embeddings is poor, it may even yield worse results. Another common solution is to use a synonym lexicon (Faruqui et al., 2014), which exploits external prior knowledge with more interpretability but does not take into account the extent of spatial distance in the context.
In this work, we unify the two paradigms above within one model to enhance static embeddings. On the one hand, we follow the Context-to-Vec paradigm in using contextualized representations of center words and their context words as references for static embeddings. On the other hand, we propose a graph-based semi-supervised post-processing method that uses a synonym lexicon as prior knowledge, leveraging proximal word clustering signals and incorporating distribution probabilities. The overall training pipeline is shown in Fig.1. The pipeline is divided into two separate phases: the first phase follows the Context-to-Vec paradigm by distilling contextualized information into static embeddings, while the second phase fine-tunes the word embeddings based on graph topology. To validate our proposed methods, we evaluate several intrinsic and extrinsic tasks on public benchmarks. The experimental results demonstrate that our models significantly outperform traditional word embeddings and other distilled word vectors on word similarity, word analogy, and word concept categorization tasks. Besides, our models moderately outperform the baselines on all downstream clustering tasks.
To our knowledge, we are the first to train static word vectors by using more contextual knowledge in both training and post-processing phases.

Related Work
Word Embeddings. For traditional static word embeddings, Skip-gram and CBOW are two models based on distributed word-context pairs (Mikolov et al., 2013). The former uses center words to predict contextual words, while the latter uses contextual words to predict center words. GloVe is a log-bilinear regression model that leverages global co-occurrence statistics of a corpus (Pennington et al., 2014); FASTTEXT takes subword information into account by incorporating character n-grams into the Skip-gram model (Bojanowski et al., 2017). Contextualized word embeddings (Peters et al., 2018; Devlin et al., 2018), meanwhile, have been widely used in modern NLP. These embeddings are generated by language models such as LSTMs and Transformers (Vaswani et al., 2017) instead of a lookup table. This paradigm can generally integrate useful sentential information into word representations.
Context-to-Vec. The fusion of contextualized and static embeddings is a newly emerged paradigm in recent years. For instance, Vashishth et al. (2018) propose SynGCN, which uses a GCN to calculate context word embeddings based on syntax structures; Bommasani et al. (2020) introduce a static version of BERT embeddings to represent static embeddings; Wang et al. (2021) enhance the Skip-gram model by distilling contextual information from BERT. Our work also follows this paradigm but introduces more context constraints.
Post-processing Embeddings. Post-processing has been used to improve trained word embeddings. Typically, Faruqui et al. (2014) use synonym lexicons to constrain the semantic range; Artetxe et al. (2018) propose a method based on singular value decomposition. Similar to these techniques, our post-processing method is easy to deploy and can be applied to any static embeddings. The difference is that we not only take advantage of the additional knowledge, but also consider the distance weights of the word vectors, overcoming the limitations of existing methods with better interpretability.

Embedding Representations
As shown in Fig.2, our proposed framework consists of four basic components. Formally, given a sentence $s = \{w_1, w_2, ..., w_n\}$ $(w_i \in D)$, our objective is to model the relationship between the center word $w_i$ and its context words $\{w_{i-w_s}, ..., w_{i-1}, w_{i+1}, ..., w_{i+w_s}\}$.
Contextualized Embedding Module. To incorporate contextualized information, an embedding $u_i$ of the center word $w_i$ needs to be generated from a pre-trained language model (Fig.2(a)). Taking the BERT model as an example, the center word $w_i$ is first transformed into a latent vector $h_i$, then $h_i$ is fed to a bidirectional Transformer for self-attention interaction. Finally, the output representation $o_i \in \mathbb{R}^d$ is linearly mapped to $u_i \in \mathbb{R}^{d_{emb}}$ through a linear layer as:

$$u_i = \mathrm{Linear}(o_i) = W_o\, o_i, \quad o_i = \mathrm{SA}(h_i) \tag{1}$$

where $W_o \in \mathbb{R}^{d_{emb} \times d}$ denotes model parameters, $\mathrm{Linear}(*)$ denotes a linear mapping layer, and $\mathrm{SA}(*)$ denotes self-attention. In practice, the size of $o_i$ is $d = 768$, and the size of $u_i$ is $d_{emb} = 300$.
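As a minimal sketch of this projection step (random weights stand in for a trained model; the dimensions follow the text, $d = 768$ and $d_{emb} = 300$), the mapping from $o_i$ to $u_i$ is a single matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)

d, d_emb = 768, 300          # encoder output size and target static embedding size
o_i = rng.normal(size=d)     # stand-in for the contextualized output of the center word
W_o = rng.normal(size=(d_emb, d)) / np.sqrt(d)  # projection parameters, W_o in R^{d_emb x d}

u_i = W_o @ o_i              # u_i in R^{d_emb}: the projected contextualized embedding
print(u_i.shape)             # (300,)
```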
The $h_i$ here is the sum of the token embedding $E_{w_i}$ and the positional embedding $PE_{w_i}$:

$$h_i = E_{w_i} + PE_{w_i} \tag{2}$$

Static Embedding Module. The Skip-gram model (Fig.2(b)) is used as the static embedding module. Our method does not directly fit the Skip-gram model by replacing an embedding table, although the original Skip-gram uses an embedding table of center words as the final embedding. Instead, to make the context words predictable and to enable negative sampling from the vocabulary, contextualized representations are used for the center words, while an embedding table of the context words is used for the output static embedding.

Heuristic Semantic Equivalence
As mentioned above, a key issue for the Context-to-Vec paradigm is to bridge the gap between contextualized and static word vectors. To this end, the main intuition is to find key equivalent semantic connections between contextualized vectors and static vectors. We adopt the following heuristics:

Heuristic 1: For a given sentence, the contextualized embedding representation of a center word can be semantically equivalent to the static embedding of the center word in the same context.
According to Heuristic 1, in order to model the center word $w_i$ and its context words $w_{i+j}$ (indices below 0 or beyond the sentence length are ignored), a primary training target is to maximize the probability of the context words $w_{i+j}$ ($|j| \in [1, w_s]$) in the Skip-gram model:

$$P(w_{i+j} \mid w_i) = \frac{\exp(u_{i+j}^{\top} u_i)}{\sum_{w_k \in D} \exp(u_k^{\top} u_i)} \tag{3}$$

where $u_i$ is the contextualized representation of the center word, and $u_k$ is the static embedding of a word $w_k$, generated by a static embedding table of size $d_{emb} = 300$.
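A toy numerical sketch of this objective (not the authors' implementation; the vocabulary size, random vectors, and the chosen context-word index are all illustrative): the contextualized center vector scores every word in the vocabulary against the static table, and a true context word's probability is its softmax share.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d_emb = 1000, 300

u_center = rng.normal(size=d_emb)                    # contextualized embedding of the center word
static_table = rng.normal(size=(vocab_size, d_emb))  # static embedding table being trained

scores = static_table @ u_center                     # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                                 # softmax over the vocabulary

context_id = 42                                      # hypothetical index of a true context word
loss = -np.log(probs[context_id])                    # maximizing P(w_{i+j}|w_i) = minimizing this NLL
```

In practice the full softmax over the vocabulary is avoided via negative sampling, as described in the Training Objectives section.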
For Heuristic 1, the contextualized word embedding of any center word is essentially used as a reference for the corresponding static word embedding. Such a source of information enhancement implicitly contains the context of the contextualized embedding, but explicitly ignores the contextual information that is easily accessible. Hence, we propose:

Heuristic 2: Inspired by the idea of Skip-gram-like modeling, the contextualized embedding representations for the context words of a center word can also be semantically equivalent to the static embedding of the center word.
To model this semantic relationship, we introduce a Tied Contextualized Attention module (Fig.2(d)) for explicitly attending to contextual signals, which complements Heuristic 1 by incorporating more linguistic knowledge into the static embedding. In particular, assume that the center word $w_i$ in the contextualized embedding module corresponds to the contextual vocabulary $\{w_{i-w'_s}, ..., w_{i-1}, w_{i+1}, ..., w_{i+w'_s}\}$; then the output contextual attention vector can be computed as:

$$V_{att} = \phi\big( U( \lambda_1 W_1 V_{center},\ \lambda_2 W_2\, \tau(V_{c\text{-}words}) ) \big) \tag{4}$$

where $V_{center}$ denotes the embedding representation of the center word, which enters as a residual connection, and $V_{c\text{-}words}$ denotes the embedding representations of the corresponding context words. $\phi$ is an optional nonlinear function, $U(*)$ is a merge operation, and $\tau$ is an average pooling operation. $W_1 \in \mathbb{R}^{d \times d_{emb}}$ and $W_2 \in \mathbb{R}^{d \times d_{emb}}$ are trainable parameters, in which $W_2$ denotes the weight assignment of each context vector.
Since each $o_k$ has similar linguistic properties, the weight $W_2$ can be shared, and we name this module the Tied Contextualized Attention mechanism. Therefore, the weighted average of the linear transformations of all context vectors reduces to a single weighted linear transformation of the average of all vectors, as shown in Eq.4. This weight-sharing mechanism helps speed up computation.
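The reduction behind the tied attention rests on linearity alone: applying the shared weight to each context vector and then averaging equals applying the weight once to the average, so a single matrix product suffices. A quick numerical check with random stand-in vectors (the shapes here are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_emb, n_ctx = 768, 300, 10   # encoder size, embedding size, 2*w'_s context vectors

W2 = rng.normal(size=(d_emb, d))          # shared (tied) attention weight
ctx = rng.normal(size=(n_ctx, d))         # contextualized vectors o_k of the context words

avg_of_transformed = np.mean([W2 @ o for o in ctx], axis=0)  # transform each, then pool
transformed_avg = W2 @ ctx.mean(axis=0)                      # pool first, transform once

print(np.allclose(avg_of_transformed, transformed_avg))      # the two coincide
```

The pooled form performs one matrix-vector product instead of `n_ctx` of them, which is where the speed-up comes from.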
In practice, to reduce complexity, the weight parameters $\lambda_1$ and $\lambda_2$ are set equal; the $u_i$ in Eq.1 can be directly used as $V_{center}$; and the value of $w'_s$ is the same as that of $w_s$, e.g., 5.

Training Objectives
The modular design requires our model to satisfy multiple loss constraints simultaneously, allowing static embeddings to introduce as much contextual information as possible. Given a training corpus with $N$ sentences $s_c = \{w_1, w_2, ..., w_{n_c}\}$ $(c \in [1, N])$, our loss functions can be described as follows.
Semantic Loss. As illustrated by Heuristic 1, one of our key objectives is to learn the semantic similarity between the contextualized embedding and the static embedding of the center word. To speed up computation, the inner product of the normalized vectors can be used as the loss $L_1$:

$$L_1 = -\log \sigma\big( \bar{u}_i^{\top} \bar{v}_i \big) \tag{5}$$

where $\sigma$ is the sigmoid function, and $\bar{u}_i$ and $\bar{v}_i$ are the normalized contextualized and static embeddings of the center word, respectively.
Contextualized Loss. As described in Heuristic 2, the contextualized embeddings for the context words of the center word are explicitly introduced to further enhance the static embedding; thus the Contextualized Loss $L_2$ is expressed as:

$$L_2 = -\log \sigma\big( \bar{V}_{att}^{\top} \bar{v}_i \big) \tag{6}$$

Contrastive Negative Loss. Negative noisy samples (Fig.2(c)) improve robustness and effectively avoid the computational bottleneck of the full softmax. This trick is common in NLP. Our Contrastive Negative Loss $L_3$ is calculated as:

$$L_3 = -\sum_{m=1}^{k} \mathbb{E}_{w_m^{neg} \sim P(w)} \log \sigma\big( -v_{w_m^{neg}}^{\top} u_i \big) \tag{7}$$

where $w_m^{neg}$ denotes a negative sample, $k$ is the number of negative samples, and $P(w)$ is a noise distribution.
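Under the usual sigmoid-of-dot-product reading of these losses, they can be sketched as follows (random stand-in vectors throughout; variable names such as `v_att` and the choice of 5 negatives are illustrative, not taken from the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
d_emb, k = 300, 5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unit(v):
    return v / np.linalg.norm(v)   # L2-normalize, so the dot product is a cosine

u_ctxual = rng.normal(size=d_emb)   # contextualized embedding of the center word
v_static = rng.normal(size=d_emb)   # static embedding of the same center word
v_att = rng.normal(size=d_emb)      # tied contextualized attention vector (Heuristic 2)
negs = rng.normal(size=(k, d_emb))  # k negative samples drawn from a noise distribution

L1 = -np.log(sigmoid(unit(u_ctxual) @ unit(v_static)))      # semantic loss
L2 = -np.log(sigmoid(unit(v_att) @ unit(v_static)))         # contextualized loss
L3 = -sum(np.log(sigmoid(-(n @ u_ctxual))) for n in negs)   # contrastive negative loss
```

Each term is a negative log-likelihood, so all three losses are positive and are driven toward zero as the corresponding agreements (or disagreements, for the negatives) improve.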
Joint Loss. The final training objective is a joint loss $L$ over the multiple tasks:

$$L = \eta_1 L_1 + \eta_2 L_2 + \eta_3 L_3 \tag{8}$$

where each hyperparameter $\eta_i$ denotes a weight.

Figure 3: A word graph diagram with edges between related words. The dashed edges indicate the corresponding edge relationships between observed word vectors (white nodes) and inferred word vectors (colored nodes). The solid edges indicate the relationship between the word to be refined (green node) and its corresponding synonyms (orange nodes).

Graph-based Post-retrofitting
In the post-processing stage, we propose a new semi-supervised retrofitting method for static word embeddings based on graph topology (Xia et al., 2022; Wu et al., 2021, 2020). This method overcomes the limitations of previous work by 1) using a synonym lexicon as prior external knowledge: since both contextualized embeddings and static embeddings are trained in a self-supervised manner, the word features originate only from within the sequence and no external knowledge is considered; and 2) converting the Euclidean distances among words into a probability distribution (McInnes et al., 2018), based on the property that the trained static word vectors are mapped into a latent Euclidean space and remain fixed.

Word Graph Representation. Suppose that $V = \{w_1, ..., w_n\}$ is a vocabulary (i.e., a collection of word types). We represent the semantic relations among the words in $V$ as an undirected graph $(V, E)$, with each word type as a vertex and edges $(w_i, w_j) \in E$ as the semantic relations of interest. These relations may vary across semantic lexicons. Matrix $\hat{Q}$ represents the set of trained word vectors $\hat{q}_i \in \mathbb{R}^{Dim}$, in which $\hat{q}_i$ corresponds to the word vector of each word $w_i$ in $V$.
Our objective is to learn a set of refined word vectors, denoted as matrix $Q = (q_1, ..., q_n)$, with each vector made close both to its observed counterpart and to its adjacent vertices according to the probability distribution. A word graph with such edge connectivity is shown in Fig.3, and it can be interpreted as a Markov random field (Li, 1994).
Retrofitting Objective. To refine each word vector toward its observed value $\hat{q}_i$ (the trained vector of $w_i$) and its neighbors $q_j$ ($(i,j) \in E$), the objective is to minimize:

$$\Psi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \beta_i \sum_{j:(i,j) \in E} \gamma_{ij} \lVert q_i - q_j \rVert^2 \Big]$$

where $\alpha_i$, $\beta_i$, and $\gamma_{ij}$ control the relative strengths of the associations. Since $\Psi$ is convex in $Q$, we can use an efficient iterative update algorithm. The vectors in $Q$ are initialized to be equal to the vectors in $\hat{Q}$. Assuming that $w_i$ has $m$ adjacent edges corresponding to $m$ synonyms, we take the first-order derivative of $\Psi$ with respect to $q_i$ and equate it to zero, yielding the following online update:

$$q_i = \frac{\alpha_i \hat{q}_i + \beta_i \sum_{j:(i,j) \in E} \gamma_{ij}\, q_j}{\alpha_i + \beta_i \sum_{j:(i,j) \in E} \gamma_{ij}}$$

By default, $\alpha_i$ and $\beta_i$ both take the value 0.5, and $\gamma_{ij}$ can be expressed as:

$$\gamma_{ij} = C_\nu \Big( 1 + \frac{d_{ij}^2}{\sigma \nu} \Big)^{-\frac{\nu + 1}{2}}$$

in which $\sigma$ is a scale parameter, $\nu$ is a positive real parameter, and $C_\nu$ is the normalization factor of $\nu$ (with $\Gamma(*)$ denoting the gamma function):

$$C_\nu = \frac{\Gamma\big(\frac{\nu+1}{2}\big)}{\sqrt{\nu \pi}\, \Gamma\big(\frac{\nu}{2}\big)}$$

and $d_{ij}$ is the Euclidean distance between the feature vectors across all $Dim$ dimensions:

$$d_{ij} = \sqrt{\sum_{l=1}^{Dim} (q_{il} - q_{jl})^2}$$

Through the above process, the distance distribution is first converted into a probability distribution, and the original word graph is thereby represented as a weighted graph. This retrofitting method is modular and can be applied to any static embeddings.
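A compact sketch of this iterative update (a toy example with hypothetical vectors and synonym edges; the Student-t-style distance kernel in `gamma` is one plausible instantiation of the weighting described above, and the default iteration count is an assumption):

```python
import numpy as np

def retrofit(Q_hat, edges, alpha=0.5, beta=0.5, sigma=1.0, nu=1.0, iters=10):
    """Refine word vectors toward their synonym neighbors.

    Q_hat: (n, dim) trained (observed) word vectors, kept fixed.
    edges: dict mapping a word index to the list of its synonym indices.
    """
    Q = Q_hat.copy()  # initialize refined vectors to the observed ones
    for _ in range(iters):
        for i, nbrs in edges.items():
            if not nbrs:
                continue
            # distance-based weights: closer synonyms pull harder
            d = np.linalg.norm(Q[nbrs] - Q[i], axis=1)
            gamma = (1.0 + d**2 / (sigma * nu)) ** (-(nu + 1) / 2)
            # closed-form minimizer of the local objective for q_i
            num = alpha * Q_hat[i] + beta * (gamma[:, None] * Q[nbrs]).sum(axis=0)
            den = alpha + beta * gamma.sum()
            Q[i] = num / den
    return Q

# toy example: word 0 has synonyms 1 and 2
rng = np.random.default_rng(4)
Q_hat = rng.normal(size=(3, 8))
Q = retrofit(Q_hat, {0: [1, 2]})
# the refined vector 0 moves toward its synonyms but stays anchored near its original position
```

Because each update is a convex combination of the observed vector and the weighted mean of its neighbors, words with no synonym edges are left untouched, which matches the modular, post-hoc nature of the method.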

Experiments
We use Wikipedia to train static embeddings. The cleaned corpus has about 57 million sentences and 1.1 billion words, with a vocabulary of 150k word types. Sentences of 10 to 40 tokens in length were selected during training.

Table 1: Results on word similarity and analogy tasks. Ours (preliminary): without post-processing; Ours (+postprocess): with post-processing. The best results are bolded, and the second-best underlined.
Extrinsic Tasks. The CONLL-2000 shared task (Sang and Buchholz, 2000) is used for chunking, with F1-score as the evaluation metric; OntoNotes 4.0 (Weischedel et al., 2011) is used for NER, with F1-score as the evaluation metric; and the WSJ portion of the Penn Treebank (Marcus et al., 1993) is used for POS tagging, with token-level accuracy as the evaluation metric. These tasks are reimplemented with the open-source tool NCRF++ (Yang and Zhang, 2018).

Baselines
As shown in Table 1, the baselines fall into three categories. For the first category (Static), static embeddings come from a lookup table. Note that Skip-gram(context) denotes the results from the context word embeddings. For the second category (Contextualized), static embeddings are derived from contextualized word embedding models (i.e., BERT, ELMo, GPT2, and XLNet) for lexical-semantic tasks. The models with _token use the mean-pooled subword token embeddings as static embeddings; the models with _word take every single word as a sentence and output its word representation as a static embedding; the models with _avg take the average of the outputs over the training corpus. For the last category (Context-to-Vec), contextualized information is integrated into Skip-gram embeddings. Among these models, ContextLSTM (Melamud et al., 2016) learns the context embeddings using a single-layer bi-LSTM; SynGCN (Vashishth et al., 2018) uses a GCN to calculate context word embeddings based on syntax structures; BERT+Skip-gram (Wang et al., 2021) enhances the Skip-gram model by adding contextual syntactic information from BERT, and is our primary baseline.

Quantitative Comparison
Word Similarity and Analogy. Table 1 shows the experimental results of intrinsic tasks. Overall, the models that integrate contextualized information into static embeddings (Context-to-Vec) perform better than other types (Contextualized / Static).
Our results outperform baselines across the board.
For a fair comparison, the backbone of our model is BERT, as in the main baseline (BERT+Skip-gram) (Wang et al., 2021). Within the Context-to-Vec category, our models perform best on all word similarity datasets. Our base model without post-processing obtains an average absolute improvement of about +23.8% (+13.2) and a relative improvement of +4.4% (+2.9) over the main baseline. Post-processing further enhances performance, with a +25.6% (+14.2) absolute increase and a +5.8% (+3.8) relative increase over the main baseline, and a +1.4% (+1.0) relative increase over our base model (w/o post-processing). It is worth mentioning that the main baseline does not outperform BERT_avg in the Contextualized group on the RG65 dataset, but our model makes up this shortfall, which indicates that our model is better at capturing contextual correlates of synonymy.
For the word analogy task, our performance is basically on par with the baselines. Overall, we obtain the best score (+0.5) on the Google dataset, though without a significant improvement. Although we do not obtain the best score across all baselines on the SemEval dataset, our model performs better than the main baseline. Across datasets, especially on the word similarity tasks, the improvement of our preliminary model on WS353, SimLex, and RG65 (+4.1, +5.5, and +5.7, respectively) is significantly larger than on the other datasets. For example, the improvement of the main baseline on the WS353R (relatedness) subset and the full WS353 set is far greater than that on the WS353S (similarity) subset, whereas our model bridges this gap on the WS353 set while also slightly improving performance on both WS353S and WS353R.
Word Concept Categorization. Word concept categorization is another important intrinsic evaluation. We use 4 commonly used datasets, as shown in Table 2. Overall, our model without post-processing outperforms the baselines by a large margin, giving the best performance and obtaining an average performance gain of +5.2% (+5.1) compared to the main baseline. In particular, the largest increases are observed on ESSLI(N) (+7.5) and ESSLI(V) (+3.8). With post-processing, our model obtains further improvements (+3.3 vs. +5.1). The experimental results show the advantage of integrating contextualized and word co-occurrence information, which excels at grouping nominal concepts into natural categories.
Extrinsic Tasks. Extrinsic tasks reflect the effectiveness of the embedded information through downstream tasks. We conduct extrinsic evaluations on chunking, NER, and POS tagging, as shown in Table 3. We select representatives from the Static, Contextualized, and Context-to-Vec groups, respectively. Although the improvement is less pronounced than in the intrinsic evaluations, our performance is better than the baselines, demonstrating the superiority of our model. The primary baseline BERT+Skip-gram obtains the second-best average score, but does not excel in the chunking task. In contrast, our model not only outperforms all baselines moderately on average, but also performs best on every individual task.

Ablation and Analysis
Post-processing Schemes. From Table 1, we can initially see that the post-processing method has a positive impact. For a further quantitative analysis, we compare more related methods, as shown in Table 4. In this ablation experiment, the comparison baseline is our trained original word vectors (w/o retrofitting), and the other compared methods include the singular value decomposition-based method (Artetxe et al., 2018) and the synonym-based constraint method (Faruqui et al., 2014). From the results, we can see that the other post-processing schemes improve the word vectors to some extent, but do not perform better on all datasets. Our proposed post-processing scheme performs best across the board, which shows that converting the distance distribution into a probability distribution is more effective.

Nearest Neighbors. To further understand the results, we show the nearest neighbors of the words "light" and "while" based on cosine similarity, as shown in Table 5. For the noun "light", the other methods generate more noisy and irrelevant words, especially the static embeddings. In contrast, the Context-to-Vec approaches (Ours & BERT+Skip-gram) capture the key meaning and generate cleaner results, which are directly semantically related to "light". For the word "while", the static approaches tend to return words that merely co-occur with "while", whereas the Context-to-Vec approaches return conjunctions closer in meaning to "while", such as "whilst", "whereas", and "although", which demonstrates the advantage of using contextualization to resolve lexical ambiguity.
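Nearest-neighbor lists like those in Table 5 are produced by ranking cosine similarity over the embedding matrix. A minimal sketch (the vocabulary and random embeddings here are stand-ins, so the returned neighbors are not meaningful; only the mechanics are shown):

```python
import numpy as np

def nearest_neighbors(word, vocab, emb, top_k=3):
    """Return the top_k words most cosine-similar to `word`."""
    idx = vocab.index(word)
    # row-normalize so that a dot product equals cosine similarity
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    order = np.argsort(-sims)                 # descending similarity
    return [vocab[j] for j in order if j != idx][:top_k]

vocab = ["light", "lamp", "dark", "while", "whilst"]
rng = np.random.default_rng(5)
emb = rng.normal(size=(len(vocab), 16))       # stand-in embedding matrix
print(nearest_neighbors("light", vocab, emb))
```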
Word Pairs Visualization. Fig.4 shows the 3D visualization of gender-related word pairs based on t-SNE (Van der Maaten and Hinton, 2008). These word pairs differ only by gender, e.g., "nephew vs. niece" and "policeman vs. policewoman". From the topology of the visualized vectors, the spatial connectivity of the word pairs in Skip-gram and GloVe is rather inconsistent, which means that static word vectors are less capable of capturing gender analogies. In contrast, for vectors based on contextualized embeddings, such as BERT_avg, SynGCN, BERT+Skip-gram, and our model, the outputs are more consistent. In particular, our outputs are highly consistent across these instances, which illustrates that our model captures relational analogies better than the baselines and underscores the importance of contextualized information grounded in semantic knowledge.

Conclusion
We considered improving word embeddings by integrating more contextual information from existing pre-trained models into the Skip-gram framework. In addition, based on the inherent properties of static embeddings, we proposed a graph-based post-retrofitting method that employs prior synonym knowledge and a weighted probability distribution.
The experimental results show the superiority of our proposed methods, which gives the best results on a range of intrinsic and extrinsic tasks compared to baselines. In future work, we will consider prior knowledge directly during training to avoid a multistage process.