Denoising Word Embeddings by Averaging in a Shared Space

We introduce a new approach for smoothing and improving the quality of word embeddings. We consider a method of fusing word embeddings that were trained on the same corpus but with different initializations. We project all the models into a shared vector space using an efficient implementation of the Generalized Procrustes Analysis (GPA) procedure, previously used in multilingual word translation. Our word representations demonstrate consistent improvements over the raw models, as well as over their simplistic average, on a range of tasks. As the new representations are more stable and reliable, there is a noticeable improvement in rare-word evaluations.


Introduction
Continuous (non-contextualized) word embeddings were introduced several years ago and have become a standard building block for NLP tasks. These models provide efficient ways to learn word representations in a fully self-supervised manner from text corpora, solely based on word co-occurrence statistics. A wide variety of methods now exist for generating word embeddings, with prominent methods including word2vec (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017). Recently, contextualized embeddings (Peters et al., 2018; Devlin et al., 2019) have replaced non-contextualized embeddings in many settings. Yet, the latter remain the standard choice for typical lexical-semantic tasks, e.g., semantic similarity (Hill et al., 2015), word analogy (Jurgens et al., 2012), relation classification (Barkan et al., 2020a), and paraphrase identification (Meged et al., 2020). These tasks consider the generic meanings of lexical items, given out of context, hence the use of non-contextualized embeddings is appropriate. Notably, FastText was shown to yield state-of-the-art results in most of these tasks (Bojanowski et al., 2017).
While word embedding methods have proved to be powerful, they suffer from a certain level of noise, introduced by several randomized steps in the embedding generation process, including embedding initialization, negative sampling, subsampling, and mini-batch ordering. Consequently, different runs yield different embedding geometries, of varying quality. This random noise may harm most severely the representation of rare words, for which the actual data signal is rather weak (Barkan et al., 2020b).
In this paper, we propose denoising word embedding models by generating multiple model versions, each created with a different random seed. The resulting representations for each word are then fused effectively, in order to obtain a model with a reduced level of noise. Note, however, that simple averaging of the original word vectors is problematic, since each training session of the algorithm produces embeddings in a different space. In fact, the objective scores of word2vec, GloVe, and FastText are all invariant to multiplying all the word embeddings by an orthogonal matrix; hence, the algorithm output involves an arbitrary rotation of the embedding space.
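This rotation invariance is easy to check numerically. The toy NumPy snippet below (dimensions and data are illustrative, not from the paper) confirms that multiplying all embeddings by one orthogonal matrix preserves every inner product, and hence the training objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy embedding matrix: n = 5 words, d = 4 dimensions.
X = rng.normal(size=(5, 4))

# A random orthogonal matrix Q (QR decomposition of a Gaussian matrix).
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))

# Rotating every embedding by Q leaves all inner products -- and hence
# all cosine similarities and the training objective -- unchanged.
X_rot = X @ Q
assert np.allclose(X @ X.T, X_rot @ X_rot.T)
```

Since only inner products enter the objective, any run of the training algorithm is free to output an arbitrarily rotated copy of the "same" embedding space.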
To address this issue, we were inspired by recent approaches originally proposed for aligning multilingual embeddings (Chen and Cardie, 2018; Kementchedjhieva et al., 2018; Alaux et al., 2019; Jawanpuria et al., 2019; Taitelbaum et al., 2019). To obtain such alignments, these methods simultaneously project the original language-specific embeddings into a shared space, while enforcing (or at least encouraging) transitive orthogonal transformations. In our (monolingual) setting, we propose a related technique to project the different embedding versions into a shared space, while optimizing the projection towards obtaining an improved fused representation. We show that this results in improved performance on a range of lexical-semantic tasks, with notable improvements for rare words, as well as on several sentence-level downstream tasks.

Word Averaging in a Shared Space
Assume we are given an ensemble of k pre-trained word embedding sets, of the same word vocabulary of size n and the same dimensionality d. In our setting, these sets are obtained by training the same embedding model using different random parameter initializations. Our goal is to fuse the k embedding sets into a single "average" embedding that is hopefully more robust and would yield better performance on various tasks. Since each embedding set has its own space, we project the k embedding spaces into a shared space, in which we induce averaged embeddings based on a mean squared error minimization objective.
Let x_{i,t} \in \mathbb{R}^d be the dense representation of the t-th word in the i-th embedding set. We model the mapping from the i-th set to the shared space by an orthogonal matrix denoted by T_i. Denote the sought shared-space representation of the t-th word by y_t \in \mathbb{R}^d. Our goal is to find a set of transformations T = \{T_1, ..., T_k\} and target word embeddings y = \{y_1, ..., y_n\} in the shared space that minimize the following mean-squared error:

S(T, y) = \sum_{i=1}^{k} \sum_{t=1}^{n} \| T_i x_{i,t} - y_t \|^2.    (1)

For this objective, it is easy to show that, for a fixed set of transformations T_1, ..., T_k, the optimal shared-space representation is:

y_t = \frac{1}{k} \sum_{i=1}^{k} T_i x_{i,t}.    (2)

Hence, solving the optimization problem amounts to finding the k optimal transformations. In the case where k = 2, the optimal T can be obtained in closed form using the Procrustes Analysis (PA) procedure (Schönemann, 1966), which has been employed in recent bilingual word translation methods (Xing et al., 2015; Artetxe et al., 2016; Hamilton et al., 2016; Artetxe et al., 2017a,b; Conneau et al., 2017; Artetxe et al., 2018a,b). In our setting, to obtain an improved embedding, we wish to average more than two embedding sets. However, if k > 2 there is no closed-form solution to (1) and thus we need to find a solution using an iterative optimization process. To that end, we follow several works that suggested employing the Generalized Procrustes Analysis (GPA) procedure, which is an extension of PA to multi-set alignment (Gower, 1975; Kementchedjhieva et al., 2018). GPA consists of an alternating minimization procedure, in which we iterate between finding the orthogonal transformations and computing the shared space. The optimal transformation from the i-th embedding space to the shared space is found by minimizing the score

S(T_i) = \sum_{t=1}^{n} \| T_i x_{i,t} - y_t \|^2,    (3)

whose minimum is given in closed form by the PA procedure. The updated transformations are then used to recompute the shared-space representations via (2). At each step of the iterative GPA algorithm, the score (1) monotonically decreases until it converges to a local minimum point.
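The alternating scheme above can be sketched in a few lines of NumPy (the function names and structure are ours, a toy illustration rather than the paper's implementation):

```python
import numpy as np

def procrustes(X, Y):
    """Closed-form PA step: the orthogonal T minimizing ||T X - Y||_F^2,
    where X and Y are d x n matrices with one word vector per column."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

def gpa_average(X_sets, n_iter=20):
    """Alternating minimization of the GPA objective.
    X_sets: list of k arrays, each d x n (one column per word).
    Returns the transformations T and the shared-space average Y."""
    k, d = len(X_sets), X_sets[0].shape[0]
    T = [np.eye(d) for _ in range(k)]
    for _ in range(n_iter):
        # Optimal shared space for the current transformations (eq. 2).
        Y = sum(T[i] @ X_sets[i] for i in range(k)) / k
        # Closed-form PA update for each transformation (eq. 3).
        T = [procrustes(X_sets[i], Y) for i in range(k)]
    Y = sum(T[i] @ X_sets[i] for i in range(k)) / k
    return T, Y
```

Each half-step solves its subproblem exactly, so the objective (1) can only decrease across iterations, which is what guarantees convergence to a local minimum.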
Algorithm 1: Shared Space Embedding Averaging (SSEA)
1: Input: Ensemble of k word embedding sets.
2: Task: Find the optimal average embedding.
3: Preprocessing: compute the cross-correlation matrices C_{ij} = \sum_t x_{j,t} x_{i,t}^\top for all pairs (i, j).
4: Initialize T_i = I for i = 1, ..., k.
5: while not converged do
6:   for i = 1, ..., k do
7:     M_i \leftarrow \frac{1}{k} \sum_j T_j C_{ij}
8:     Compute the SVD M_i = U \Sigma V^\top and set T_i \leftarrow U V^\top
9:   end for
10: end while
11: Compute the average embedding: y_t = \frac{1}{k} \sum_i T_i x_{i,t}.

For large vocabularies, GPA is inefficient because, in each iteration, computing the SVD requires summing over all the vocabulary words. To circumvent this computational cost, we adopt the optimization procedure of Taitelbaum et al. (2019), which we apply within each iteration. Instead of summing over the whole vocabulary, let C_{ij} = \sum_t x_{j,t} x_{i,t}^\top be the cross-correlation matrix for a pair (i, j) of original embedding spaces, which can be computed once, for all pairs of spaces, in a preprocessing step. Given the matrices C_{ij}, the computational complexity of the iterative averaging algorithm is independent of the vocabulary size, allowing us to compute the SVD efficiently. The resulting algorithm, termed Shared Space Embedding Averaging (SSEA), is presented in Algorithm 1.¹
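The efficient variant can be sketched as follows: the d × d cross-correlation matrices are computed once, after which each iteration's cost is independent of the vocabulary size (a minimal NumPy sketch of our reading of the algorithm, not the released PyTorch implementation):

```python
import numpy as np

def ssea(X_sets, n_iter=20):
    """SSEA sketch. Each X in X_sets is d x n (one column per word).
    The d x d cross-correlation matrices C_ij = sum_t x_{j,t} x_{i,t}^T
    are precomputed once, so the per-iteration SVDs never touch the
    n-word vocabulary again."""
    k, d = len(X_sets), X_sets[0].shape[0]
    C = {(i, j): X_sets[j] @ X_sets[i].T for i in range(k) for j in range(k)}
    T = [np.eye(d) for _ in range(k)]
    for _ in range(n_iter):
        for i in range(k):
            # M_i = Y X_i^T, expressed through the precomputed C matrices.
            M = sum(T[j] @ C[(i, j)] for j in range(k)) / k
            U, _, Vt = np.linalg.svd(M)
            T[i] = U @ Vt
    # One final pass over the vocabulary builds the average embedding.
    return sum(T[i] @ X_sets[i] for i in range(k)) / k
```

Each iteration costs O(k^2 d^3) regardless of the vocabulary size n; only the preprocessing and the final averaging pass scale with n.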

Experimental Setup and Results
This section presents our evaluation protocol, datasets, data preparation, hyperparameter configuration and results.

Implementation Details and Data
We trained word2vec (Mikolov et al., 2013a), FastText (Bojanowski et al., 2017) and GloVe (Pennington et al., 2014) embeddings. For word2vec we used the skip-gram model with negative sampling, which was shown to be advantageous on the evaluated tasks (Levy et al., 2015). We trained each of the models on the November 2019 dump of Wikipedia articles k = 30 times, with different random seeds, and used the default reported hyperparameters: we set the embedding dimension to d = 200, considered each word within a maximal window of c_max = 5, used a subsampling threshold of \rho = 10^{-5}, and used 5 negative examples for every positive example. In order to keep a large number of rare words in the corpus, no preprocessing was applied to the data, yielding a vocabulary size of 1.5 \times 10^6. We then applied the SSEA algorithm to the embedding sets to obtain the average embedding. The original embedding sets and averaged embeddings were centered around the zero vector and normalized to unit vectors.
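The final post-processing step (centering, then length-normalizing) might look as follows in NumPy; the exact order of operations in the paper's pipeline is our assumption:

```python
import numpy as np

def center_and_normalize(E):
    """Center an (n_words x d) embedding matrix around the zero vector,
    then scale each row (word vector) to unit length."""
    E = E - E.mean(axis=0, keepdims=True)
    return E / np.linalg.norm(E, axis=1, keepdims=True)
```

This is applied identically to the original embedding sets and to the SSEA-averaged embeddings, so the comparisons below are on a common footing.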

Improved Embedding Stability
We next analyze how our method improves embedding quality and consistency, notably for rare words. To that end, for any two embedding sets, u and v, we can find the optimal mapping Q between them using the PA algorithm and compute its mean squared error (MSE), \frac{1}{n} \sum_{t=1}^{n} \| Q u_t - v_t \|^2. We define the stability of an embedding algorithm as the average MSE (over 10 random pairs of samples) between two instances of it. This score measures the similarity between the geometries of random instances generated by a particular embedding method, and thus reflects the consistency and stability of that method. The scores of the different models are depicted in Table 1. As observed, after applying SSEA the average MSE drops by an order of magnitude, indicating much better stability of the obtained embeddings.

¹ The algorithm demonstration code is available at github.com/aviclu/SSEA. In practice, we utilized an efficient PyTorch implementation based on Taitelbaum et al. (2019).
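The stability score can be computed directly from two embedding matrices; the sketch below (NumPy, our function name) fits the PA rotation in closed form and returns the mean squared discrepancy:

```python
import numpy as np

def stability_mse(U, V):
    """Average mapping discrepancy between two embedding sets U, V
    (each n x d, one word vector per row): fit the closed-form PA
    rotation Q minimizing sum_t ||Q u_t - v_t||^2, return the MSE."""
    A, _, Bt = np.linalg.svd(V.T @ U)
    Q = A @ Bt
    diff = U @ Q.T - V
    return np.mean(np.sum(diff ** 2, axis=1))
```

If V is an exact rotation of U the score is (numerically) zero, so any residual measures genuine geometric disagreement between the two training runs rather than the arbitrary rotation.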
We can perform a similar analysis for each word separately. A consistent embedding of the t-th word in both sets u and v should result in a small mapping discrepancy \| Q u_t - v_t \|^2. Figure 1 depicts the MSE for the models and their computed SSEA versions, as a function of the word's frequency in the corpus. The denoised version of each model is marked with a 'D-' prefix. For clarity of presentation, we did not include the results for GloVe (which are similar to word2vec). As expected, embedding stability always increases (MSE decreases) with word frequency. SSEA is notably more stable across the frequency range, with the error minimized early on and reduced most drastically for low frequencies.

Comparison of Methods
We next compare our denoised model, denoted with a 'D-' prefix, with the original embedding models. As an additional baseline, we also considered the naïve averaged embedding model, denoted with an 'A-' prefix, where for every word we computed the simplistic mean embedding across all original spaces. Note that we did not compare against other proposed embeddings or meta-embedding learning methods; rather, we restricted our analysis to empirically verifying our embedding aggregation method and validating the assumptions behind our empirical analysis.
Results. The results of the lexical-semantic tasks are depicted in Table 2, averaged over 30 runs for each method. Our method obtained better performance than the other methods, most substantially for FastText embeddings. As shown, the naïve averaging performed poorly, which highlights the fact that simply averaging different embedding spaces does not improve word representation quality. The most notable performance gain was on the rare-words task, in line with the analysis in Fig. 1, suggesting that for rare words the raw embedding vectors fit the data less accurately.

Evaluations On Downstream Tasks
For completeness, we next show the relative advantage of our denoising method when applied to several sentence-level downstream benchmarks. While contextualized embeddings dominate a wide range of sentence- and document-level NLP tasks (Peters et al., 2018; Devlin et al., 2019; Caciularu et al., 2021), we assessed the relative advantage of our denoising method when utilizing (non-contextualized) word embeddings in sentence- and document-level settings. We applied the exact procedure proposed in Li et al. (2017) and Rogers et al. (2018) as an effective benchmark for the quality of static embedding models. We first used sequence labeling tasks. Morphological and syntactic performance was evaluated using part-of-speech tagging (POS) and chunking (CHK). Named entity recognition (NER) and multiway classification of semantic relation classes (RE) were used for evaluating semantic information at the word level. For the POS, NER and CHK sequence labeling tasks, we used the CoNLL 2003 dataset (Sang and Meulder, 2003), and for the RE task, we used the SemEval 2010 Task 8 dataset (Hendrickx et al., 2010). The neural network models employed for these downstream tasks are fully described in Rogers et al. (2018). Next, we evaluated the following semantic-level tasks: document-level polarity classification (PC), using the Stanford IMDB movie review dataset (Maas et al., 2011); sentence-level sentiment polarity classification (SEN), using the MR dataset of short movie reviews (Pang and Lee, 2005); and classification of subjectivity and objectivity (SUB), which uses Rotten Tomatoes user review snippets against official movie plot summaries (Pang and Lee, 2004). Similarly to the performance results in Table 2, the current results show that the suggested denoised embeddings obtained better overall performance than the other methods, most substantially for FastText embeddings.

Related Work
A similar situation of aligning different word embeddings into a shared space occurs in multilingual word translation tasks, which are based on distinct monolingual word embeddings. Word translation is performed by transforming each language's word embeddings into a shared space by an orthogonal matrix, creating a "universal language" that is useful for the word translation process. Our setting can be viewed analogously, treating each embedding set as a different language, where our goal is to find the shared space in which embedding averaging is meaningful.
The main challenge in multilingual word translation is to obtain a reliable multi-way word correspondence, in either a supervised or unsupervised manner. One problem is that standard dictionaries contain multiple senses for words, which is problematic for bilingual translation and is further amplified in a multilingual setting. In our case of embedding averaging, the mapping problem vanishes, since we are addressing a single language and the word correspondences hold trivially among different embeddings of the same word. Thus, in our setting, there is neither the problem of wrong word correspondences nor the issue of different word translations due to multiple word senses. Studies have shown that for the multilingual translation problem, enforcing the transformation to be strictly orthogonal is too restrictive, and performance can be improved by using orthogonalization as a regularizer (Chen and Cardie, 2018) that yields matrices that are close to orthogonal. In our much simpler setting of a single language, with a trivial identity word correspondence, enforcing the orthogonality constraint is reasonable.
Another related problem is meta-embedding (Yin and Schütze, 2016), which aims to fuse information from different embedding models. Various methods have been proposed for embedding fusion, such as concatenation, simple averaging, weighted averaging (Coates and Bollegala, 2018; Kiela et al., 2018) and autoencoding (Bollegala and Bao, 2018). Some of these methods (concatenation and autoencoding) are not scalable when the goal is to fuse many sets, while others (simple averaging) yield inferior results, as described in the above works. Note that our method is not intended to compete with meta-embedding, but rather to complement it.
An additional related work is the recent method of Muromägi et al. (2017). Similarly to our work, they proposed a method based on the Procrustes Analysis procedure for aligning and averaging sets of word embedding models. However, the mapping algorithm they used is much more computationally demanding, as it requires going over all the dictionary words in every iteration. Instead, we propose an efficient optimization algorithm, which requires going over the vocabulary only once, in a preprocessing step, and is theoretically guaranteed to converge to a local minimum point. While their work focuses on the Estonian language, we evaluate our approach on English data and on a range of different downstream tasks. We show that our method significantly improves rare-word representations, which is beneficial for small or domain-specific corpora.

Conclusions
We presented a novel technique for creating better word representations by training an embedding model several times and deriving an averaged representation. The resulting word representations proved to be more stable and reliable than the raw embeddings. Our method exhibits performance gains on lexical-semantic tasks, notably for rare words, confirming our analytical assumptions. This suggests that our method may be particularly useful for training embedding models in low-resource settings. An appealing direction for future research is to extend our approach to improving sentence-level representations, by fusing several contextualized embedding models.