Benchmarking Meta-embeddings: What Works and What Does Not

In the last few years, several methods have been proposed to build meta-embeddings. The general aim was to obtain new representations integrating complementary knowledge from different source pre-trained embeddings, thereby improving their overall quality. However, previous meta-embeddings have been evaluated using a variety of methods and datasets, which makes it difficult to draw meaningful conclusions regarding the merits of each approach. In this paper we propose a unified common framework, including both intrinsic and extrinsic tasks, for a fair and objective meta-embeddings evaluation. Furthermore, we present a new method to generate meta-embeddings, outperforming previous work on a large number of intrinsic evaluation benchmarks. Our evaluation framework also allows us to conclude that previous extrinsic evaluations of meta-embeddings have been overestimated.

Following the hypothesis that different knowledge sources may contain complementary semantic information (Goikoetxea et al., 2016), meta-embeddings (Yin and Schütze, 2016) aim to obtain an ensemble of distinct word embeddings, each trained using different methods and resources, to produce a word representation with an improved overall quality.
The main challenge when generating meta-embeddings is preserving the information encoded in the source embeddings, and many different methods have been proposed to deal with the task. Concatenation (Goikoetxea et al., 2016) and averaging (Coates and Bollegala, 2018) are two very strong baselines, but much more complex methods based on linear transformations and supervised neural models have also been proposed (Bollegala et al., 2018; Bollegala and Bao, 2018; Yin and Schütze, 2016).
When it comes to evaluating meta-embeddings, there is no consensus on either evaluation tasks or methodology. Meta-embeddings are evaluated in a wide range of tasks (Schnabel et al., 2015; Bakarov, 2018), ranging from intrinsic (i.e. word similarity, word analogy) to extrinsic tasks such as short text classification (Bollegala and Bao, 2018; Bollegala et al., 2018), common-sense stories, Named Entity Recognition (O'Neill and Bollegala, 2020) or Semantic Textual Similarity (García-Ferrero et al., 2020). Furthermore, different evaluation methodologies have been applied. For example, Yin and Schütze (2016) discard the words in the datasets which are not represented in the meta-embedding model, while Speer and Lowry-Duda (2017) use various strategies to minimize the number of out-of-vocabulary (OOV) words. To make things more complicated, previous meta-embedding approaches require some ad-hoc pre-processing to tune multiple filtering criteria and parameters according to the source embeddings used (Bollegala et al., 2018; Bollegala and Bao, 2018; Yin and Schütze, 2016), which has a significant effect on the final evaluation results. In summary, this lack of consistency in evaluation tasks and methodologies, together with ad-hoc hyper-parameter tuning, makes it very hard to objectively compare the proposed methods. Thus, to the best of our knowledge, and despite the existence of multiple works addressing this task, a unified and comprehensive evaluation of meta-embeddings has not yet been carried out. In fact, the lack of such a unified and comprehensive evaluation framework has arguably caused erroneous assumptions and an overestimation of the performance of meta-embeddings for extrinsic tasks.
An additional issue is that most previous work has focused on combining word embeddings generated from similar sources and algorithms, for instance, combining Word2vec CBOW (Mikolov et al., 2013a) with GloVe (Pennington et al., 2014) embeddings. We empirically show that, since these embeddings encode very similar knowledge, combining them does not produce a significant gain. Instead, the best meta-embeddings are obtained by combining embeddings trained with different algorithms and resources, for example, by leveraging vectors induced from text corpora together with other embeddings obtained from knowledge bases.
In this paper we present a new method to generate meta-embeddings that outperform previous approaches on a large number of intrinsic benchmarks. Other contributions include:
1. We empirically demonstrate that our method generates better meta-embeddings thanks to decreasing the information loss during the embedding combination. Our approach does not rely on hyper-parameter tuning.
2. We generate meta-embeddings using a wide range of source embeddings trained with very different algorithms and resources. Our experiments show that the best meta-embeddings are obtained when combining embeddings that encode complementary knowledge.
3. A unified and comprehensive benchmarking framework to facilitate a fair and objective evaluation of embeddings in both intrinsic and extrinsic settings.
4. We report the largest meta-embedding extrinsic evaluation performed so far, showing that meta-embedding performance in these tasks has been overestimated by previous work.
The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 focuses on the evaluation frameworks used by previous works and presents our own proposal. In Section 4 we describe our approach for creating meta-embeddings, with Section 5 describing the source word embeddings explored and reporting our experimental results in Section 6. Finally, Section 7 presents some concluding remarks and our future work. Our code and meta-embeddings are publicly available.

Related work
Previous research has shown that word embeddings created using different methods and resources present significant variations in quality. For instance, Hill et al. (2014) show that word embeddings trained from monolingual or bilingual corpora capture different nearest neighbours.
The term meta-embedding was coined by Yin and Schütze (2016). They showed how to combine five different pre-trained word embeddings using a small neural network for improving the accuracy of cross-domain part-of-speech (POS) tagging. Following this, Bollegala et al. (2018) propose an unsupervised locally linear method for learning meta-embeddings from a given set of pre-trained source embeddings while Bollegala and Bao (2018) apply three types of autoencoders for the purpose of learning meta-embeddings.
[Table 1 (fragment): Yin and Schütze (2016): Sim. (5), An. (1), POS (1); Li et al. (2020): MT (…)]
[…] such as SVD (Yin and Schütze, 2016), PCA (Ghannay et al., 2016) or DRA (Raunak, 2017). In this line of work, Numberbatch claims to be the best meta-embedding model so far, by combining knowledge from a variety of embeddings obtained from different corpora and knowledge bases such as ConceptNet.
Methods such as MUSE (Lample et al., 2018) and VecMap (Artetxe et al., 2018) project embeddings of two different languages to a shared common space by means of a bilingual dictionary (Mikolov et al., 2013b). This requires minimal bilingual supervision while still leveraging large amounts of monolingual corpora with very competitive results (Artetxe et al., 2016, 2018). […] He et al. (2020) to generate meta-embeddings. This usually involves mapping all the source embeddings to a common vector space followed by averaging. We extend this idea by proposing a multiple-step algorithm that: (i) normalizes the source embeddings; (ii) maps them to the same vector space; (iii) handles the OOV words; and (iv) generates the final meta-embedding. An ablation study confirms that these steps increase the performance of the generated meta-embeddings in both intrinsic and extrinsic tasks.
Another recent research line tries to dynamically generate meta-embeddings for specific tasks (He et al., 2020;Kiela et al., 2018;O'Neill and Bollegala, 2020). These methods extend already existing algorithms to generate meta-embeddings by learning task specific weights. Instead, the focus of our research is to generate the best general purpose meta-embedding that can be applied to any task.

Evaluation Framework
As mentioned earlier, several methods to generate meta-embeddings have been previously proposed and evaluated on many different benchmarks, as shown by Table 1. Moreover, ad-hoc decisions (not always explicitly mentioned) to evaluate the embeddings caused large variations in the results. Let us consider, for example, the problem of out-of-vocabulary (OOV) words.
Two popular techniques are used to address OOV words. Table 2 shows the accuracy of FastText embeddings in the Google Analogy dataset using the two approaches. The first one uses the average of all the embeddings as a representation for unknown words (With OOV). The second approach simply removes from the dataset the examples containing unknown words (Without OOV). Additionally, the dataset is usually pre-processed. A common approach lowercases all the words and removes non-English characters (Clean dataset) to reduce the number of unknown words. The words in the embedding can also be lowercased (Lowercase embeddings). Another popular practice to evaluate analogy consists of trimming the vocabulary of the embedding to the k most popular words. As an example, trimming the vocabulary to the 100,000 most popular English words also speeds up the computations (Trim vocabulary). These changes in the pre-processing of the very same embeddings cause the results to vary from 39.5% accuracy to 84.1%. Obviously, without a common evaluation framework, the different embeddings and meta-embeddings cannot be compared objectively.
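The effect of the two OOV policies can be illustrated with a minimal sketch. It assumes a toy dictionary embedding and the common 3CosAdd analogy solver (b - a + c), neither of which is specified in the text above; the function name, the toy vectors and the questions are hypothetical.

```python
import numpy as np

def analogy_accuracy(emb, questions, oov="mean"):
    """Analogy accuracy (3CosAdd, an assumed solver) under two OOV policies.

    emb: dict mapping word -> vector
    questions: list of (a, b, c, expected) tuples, read as a : b :: c : expected
    oov: "mean" represents unknown words by the average of all vectors
         ("With OOV"); "drop" removes questions with unknown words
         ("Without OOV").
    """
    words = list(emb)
    index = {w: i for i, w in enumerate(words)}
    matrix = np.stack([np.asarray(emb[w], dtype=float) for w in words])
    mean_vec = matrix.mean(axis=0)
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    if oov == "drop":
        questions = [q for q in questions if all(w in index for w in q)]
    if not questions:
        return 0.0

    correct = 0
    for a, b, c, expected in questions:
        va, vb, vc = (matrix[index[w]] if w in index else mean_vec
                      for w in (a, b, c))
        target = vb - va + vc
        scores = unit @ target  # ranking by cosine similarity to the target
        for w in (a, b, c):     # the query words cannot be the answer
            if w in index:
                scores[index[w]] = -np.inf
        correct += int(words[int(np.argmax(scores))] == expected)
    return correct / len(questions)

# Hypothetical toy data: the second question contains unknown words.
toy_emb = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.1, -1.0],
}
toy_questions = [
    ("man", "woman", "king", "queen"),
    ("man", "woman", "prince", "princess"),
]
acc_with_oov = analogy_accuracy(toy_emb, toy_questions, oov="mean")
acc_without_oov = analogy_accuracy(toy_emb, toy_questions, oov="drop")
```

On this toy data the same embedding is scored differently depending only on the OOV policy, which is the kind of variation reported in Table 2.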
This lack of evaluation consistency led us to propose a unified evaluation framework that encompasses a wide range of tasks and datasets to evaluate meta-embeddings. In order to make the evaluation as simple and unified as possible, we chose two existing out-of-the-box frameworks: Word embeddings benchmarks (Jastrzebski et al. […] (Mikolov et al., 2013a), MSR Analogy (Mikolov et al., 2013c), SemEval2012 (Jurgens et al., 2012)) and (iii) Word categorization (AP (Almuhareb and Poesio, 2005), BLESS (Baroni and Lenci, 2011), Battig (Battig and Montague, 1969), ESSLI (McRae et al., 2005)). We use the provided script for evaluating embeddings on all the tasks without lowercasing them.
It should be taken into account that, for Word analogy, smaller vocabularies usually obtain better results. This particularly hurts the performance of those meta-embeddings that were generated using many source embeddings, resulting in a meta-embedding with a vocabulary of more than 4 million words. Thus, in order to ensure a fair evaluation regardless of the number of words in the vocabulary, we trim the vocabulary of all the embeddings and meta-embeddings to the 200,000 most popular English words according to Google's Trillion Word Corpus.
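Trimming to the k most frequent words can be sketched as below. The frequency-ranked word list is assumed to be available as a plain Python list; the function name and the toy data are hypothetical.

```python
def trim_vocabulary(emb, words_by_frequency, k):
    """Keep only the embedding entries that are among the k most frequent words.

    emb: dict mapping word -> vector
    words_by_frequency: words sorted from most to least frequent (assumed input)
    """
    keep = set(words_by_frequency[:k])
    return {word: vec for word, vec in emb.items() if word in keep}

# Hypothetical toy data: "zyzzyva" falls outside the top-3 words and is dropped.
toy_vectors = {"the": [0.1, 0.2], "cat": [0.3, 0.1], "zyzzyva": [0.9, 0.9]}
toy_ranking = ["the", "of", "cat", "zyzzyva"]
trimmed = trim_vocabulary(toy_vectors, toy_ranking, k=3)
```

Words in the ranking that the embedding does not contain (here "of") are simply ignored, so the trimmed vocabulary can be smaller than k.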

Our Method
Our meta-embedding generation approach consists of two main steps: (i) pre-processing of the source embeddings and (ii) generation of the meta-embedding by averaging. Our method can combine any number of word embeddings as long as there is some common vocabulary shared between them. The resulting meta-embedding vocabulary will be the union of the vocabularies of the source word embeddings used.

Word embeddings pre-processing
Word embeddings generated with different sources or techniques can result in very different vector spaces and vocabularies. Before aligning the vector spaces, a harmonization pre-processing step is needed. Thus, we translate, scale, rotate and match the vocabularies of the source embeddings.
1) Mean centering and scaling: Following Artetxe et al. (2018), we first normalize the length of the source embeddings. We then mean center each dimension, and we normalize them again by length. This translates all the source embeddings to the origin and scales them to have the same length.
2) Aligning the vector spaces: We align the vector spaces of the source embeddings using VecMap (Artetxe et al., 2016). VecMap learns word embedding mappings using an orthogonal transformation. Orthogonality allows monolingual invariance during the mapping, preserving vector dot products between word vectors. Monolingual invariance ensures that no information is lost during the mapping step, which is desirable for our aim of generating meta-embeddings. In our experiments we align the source embeddings by projecting them to the vector space of one particular source embedding involved in the construction of the meta-embeddings.
3) OOV generation: Different word embeddings have different vocabularies. When combining two word embeddings we can distinguish two sets of words: those for which we have a representation in both embeddings and those for which one of the embeddings has no representation. We call the latter "OOV words". We unify the vocabulary of the source embeddings by creating new approximate representations for the OOV words.
The process is as follows. Given two source embeddings E1 and E2 where for a word W only E1 has a representation, we generate a new approximation for the OOV word in E2 by examining the most similar words from the common vocabulary of E1 and E2. First, using the cosine similarity as distance metric, we select the k (ranging from 2 to 50) nearest neighbours of the word W in E1 that also appear in the common vocabulary with E2 (see footnote 8). For each k, we calculate a candidate representation of the OOV word in both E2 and E1 as a weighted average of the selected k nearest neighbours in their corresponding spaces. We use the cosine similarity from the nearest neighbours in E1 to W as weights. Finally, the selected representation of the OOV word in E2 is the one corresponding to the closest candidate to W in E1.
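The OOV generation step can be sketched as follows. This is one reading of the description above, not the released implementation: it assumes that the common vocabulary is given as two row-aligned matrices, that one candidate is built per value of k, and that negative cosine similarities are clipped to zero before being used as weights (the text does not say how negative weights are handled). All names and the toy data are hypothetical.

```python
import numpy as np

def approximate_oov(w_e1, common_e1, common_e2, k_values=range(2, 51)):
    """Approximate in E2 a word W that only E1 represents.

    w_e1: vector of W in E1, shape (d1,)
    common_e1, common_e2: matrices with the shared vocabulary in E1 and E2;
        row i is the same word in both, shapes (n, d1) and (n, d2)
    Returns the E2 candidate whose E1 counterpart is closest to W, or None.
    """
    unit_w = w_e1 / np.linalg.norm(w_e1)
    unit_c1 = common_e1 / np.linalg.norm(common_e1, axis=1, keepdims=True)
    sims = unit_c1 @ unit_w            # cosine similarity of each shared word to W
    order = np.argsort(-sims)          # most similar shared words first

    best_score, best_vec = -np.inf, None
    for k in k_values:
        if k > len(order):
            break
        idx = order[:k]
        # Assumption: negative similarities are clipped so weights stay >= 0.
        weights = np.clip(sims[idx], 0.0, None)
        if weights.sum() == 0.0:
            continue
        weights = weights / weights.sum()
        cand_e1 = weights @ common_e1[idx]   # candidate in E1, used for scoring
        cand_e2 = weights @ common_e2[idx]   # candidate in E2, the actual output
        norm = np.linalg.norm(cand_e1)
        if norm == 0.0:
            continue
        score = float(cand_e1 @ unit_w / norm)
        if score > best_score:
            best_score, best_vec = score, cand_e2
    return best_vec

# Hypothetical toy data: E2 is E1 rotated by 90 degrees.
common_e1 = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])
common_e2 = common_e1 @ rotation
w_e1 = np.array([1.0, 0.05])
w_e2_approx = approximate_oov(w_e1, common_e1, common_e2)
```

In the toy example the approximation can be compared with `w_e1 @ rotation`, the vector W would have in E2 if it were known; in practice no such reference exists, which is why candidates are scored in E1.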

Meta-embedding generation
We combine the harmonized source embeddings by averaging them. In our experiments we demonstrate that, thanks to the pre-processing steps described above, averaging effectively combines multiple source embeddings, resulting in representations as good as the ones generated by concatenation without increasing their dimensionality.
Footnote 8: For computational efficiency we limit the maximum k to 50. In our experiments the optimal k is usually smaller than 20.
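A minimal sketch of normalization, orthogonal mapping and averaging is given below. It is not VecMap itself: it assumes that all sources have the same dimensionality and already share a row-aligned vocabulary, and it uses the closed-form orthogonal Procrustes solution with every shared word as the dictionary. Function names and the random toy data are hypothetical.

```python
import numpy as np

def length_normalize(X):
    """Scale every row (word vector) to unit Euclidean length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0          # leave all-zero rows untouched
    return X / norms

def normalize(X):
    """Length normalize, mean center each dimension, length normalize again."""
    X = length_normalize(X)
    X = X - X.mean(axis=0, keepdims=True)
    return length_normalize(X)

def orthogonal_map(src, tgt):
    """Orthogonal matrix W minimizing ||src @ W - tgt|| (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def average_meta_embedding(sources):
    """Average row-aligned sources after mapping them to the first one's space."""
    normed = [normalize(X) for X in sources]
    anchor = normed[0]
    mapped = [anchor] + [X @ orthogonal_map(X, anchor) for X in normed[1:]]
    return np.mean(mapped, axis=0)

# Hypothetical toy data: the second source is a rotated copy of the first.
rng = np.random.default_rng(0)
source_a = rng.normal(size=(50, 5))
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # random orthogonal matrix
source_b = source_a @ q
mapping = orthogonal_map(normalize(source_b), normalize(source_a))
meta = average_meta_embedding([source_a, source_b])
```

Because the second source differs from the first only by a rotation, the mapping undoes it and the average coincides with the normalized first source. Averaging the two sources without mapping would instead mix unrelated coordinate systems.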

Word embeddings
This section describes the source word embeddings used to generate our meta-embeddings. We choose these pre-trained embeddings for two main reasons. They have been trained using very diverse algorithms and resources, and they obtain good performance on our evaluation framework when tested individually. That is, they may encode high quality complementary knowledge.
Using WordNet (Miller, 1992), RWSGwn (UKB) (Goikoetxea et al., 2015) combines random walks over WordNet with the skip-gram model. We have used the vectors trained using WordNet3.0 plus gloss relations. JOINTChyb (J) (Goikoetxea et al., 2018) combines Random Walks over multilingual WordNets and bilingual corpora as input for a modified skip-gram model that forces equivalent terms in different languages to come closer during training. We used the English-Spanish bilingual embeddings publicly available.
Using the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013), Attract Repel (AR) (Mrkšić et al., 2017) improves word embeddings by injecting synonymy and antonymy constraints extracted from monolingual and cross-lingual lexical resources. We used the English vocabulary from the four-lingual (English, German, Italian, Russian) vector space. Paragram (P) (Wieting et al., 2015) are pre-trained word vectors learned using word paraphrase pairs from PPDB using a modification of the skip-gram objective function. The hyper-parameters were tuned using the wordsim-353 dataset. The word embeddings of the default model are initialized with GloVe word vectors.

Experiments
We evaluate all the word embeddings in a wide range of intrinsic and extrinsic evaluation tasks, which compose the evaluation framework described in Section 3.

Intrinsic evaluation results
First we evaluate the source embeddings that we will later use for meta-embedding generation. Table 3 shows the averaged results of the Categorization (C), Word Similarity (WS) and Analogy (A) datasets. We report the average cluster purity score of the Categorization datasets, the average Spearman correlation in the WS datasets, and the average score in the Word Analogy datasets (Spearman correlation for the SemEval2012 dataset and accuracy for Google Analogy and MSR). The results show that FastText achieves the best performance on the Analogy datasets and Numberbatch on Categorization and Word Similarity. As expected, on average Numberbatch obtains the best results on the intrinsic evaluation tasks.
We start generating meta-embeddings with our proposed method by combining pairs of source embeddings. Table 4 shows the average score in the intrinsic evaluation benchmark of different pairs of source embeddings. For each source class type (Text Corpora, WordNet, PPDB and ConceptNet), we combine the best embeddings of each class with the best embeddings of the other classes. Within the same class we combine the first and second best embeddings.
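For reference, the cluster purity score reported for the Categorization datasets can be computed as in the sketch below. It uses the standard definition of purity (each cluster contributes the size of its majority gold class); the clustering step itself is omitted, and the function name and toy labels are hypothetical.

```python
from collections import Counter

def cluster_purity(gold_labels, cluster_ids):
    """Fraction of items that belong to the majority gold class of their cluster."""
    clusters = {}
    for gold, cid in zip(gold_labels, cluster_ids):
        clusters.setdefault(cid, []).append(gold)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(gold_labels)

# Hypothetical toy data: two clusters, each with one misplaced word.
toy_gold = ["animal", "animal", "fruit", "fruit", "fruit", "tool"]
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_purity = cluster_purity(toy_gold, toy_clusters)
```

Purity ranges from just above 0 to 1, and reaches 1 only when every cluster contains words of a single gold category.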
The results show that combining embeddings of different classes obtains better results most of the time than combining embeddings based on the same information type. That is, two embeddings generated using similar sources do not contain complementary knowledge, and their combination does not result in better performance. In our experiments, the best results are achieved when combining source embeddings generated using very different resources, such as text and knowledge bases. These combinations produce a meta-embedding that encodes the complementary knowledge of the source embeddings, resulting in an improved performance. Also note that the meta-embeddings combining text (FT) with PPDB (P), and text (FT) with ConceptNet (N), outperform the results of Numberbatch (N) alone.
We generate our best meta-embeddings by combining the best source embeddings created using large text corpora (FT), WordNet (J), PPDB (P) and ConceptNet (Numberbatch) (hereinafter FJNP). This combination maximizes the complementary knowledge encoded in the meta-embedding. We compare our method with 3 baselines using the same source embeddings:
(i) Concatenation (CONC+): Concatenation is a very strong baseline in meta-embedding generation. It allows combining multiple embeddings without any information loss. However, this comes at a high cost, as the meta-embedding dimensionality is increased dramatically. We standardize the source embeddings using the approach described in Section 4.1.
(ii) AutoEncoders (Bollegala and Bao, 2018): Autoencoders are an unsupervised learning method that first compresses the input into a space of latent variables and then reconstructs the input based on the information encoded in these latent variables. It aims to learn meta-embeddings by reconstructing multiple source embeddings. This method comes in three flavours: DAEME, CAEME and AAEME. We used the last one because it obtains the best results. We applied the default parameters and enabled the option to generate OOV word representations.
(iii) Locally Linear Meta-Embedding Learning (LLE) (Bollegala et al., 2018): This approach consists of two steps. In the reconstruction step, the embeddings of each word are represented by the linear weighted combination of the embeddings of its nearest neighbours. In the projection step, the meta-embedding of each word is computed such that the nearest neighbours in the source embedding spaces are embedded closely to each other in the meta-embedding space. We tested this method with the same parameters used in the original paper. Note that the code provided by the authors generates meta-embeddings using the intersection of the vocabulary of the source embeddings. This results in a small vocabulary that severely hurts its performance in some tasks.
Table 5: Comparison of our meta-embedding method, baselines and prior work in the intrinsic evaluation.
Table 5 reports the results for our method and the baselines. The overall performance of our method is slightly better than concatenation (improved with our standardization method), mostly due to the good results in Categorization. In any case, the most important point here is that our method, unlike concatenation (CONC+), does not increase the final dimensionality of the meta-embeddings. Furthermore, our technique clearly outperforms the meta-embeddings generated by autoencoders and LLE, and all the embeddings listed in Table 3, including Numberbatch, which is itself a meta-embedding. To the best of our knowledge, these are the best results published using these intrinsic benchmarks.

Extrinsic evaluation results
We compare our meta-embeddings with the same source embeddings and baselines used in the intrinsic evaluation (subsection 6.1). We test the same combination of embeddings that provides the best results in the intrinsic evaluation (FJNP). For brevity we report the GLUE Score calculated as proposed by the authors (Wang et al., 2019b). We are aware that, for the GLUE benchmark, (static) word embeddings are outperformed by contextual representations such as those obtained by BERT (Devlin et al., 2019). Thus, word embeddings may be better suited for other tasks such as unsupervised machine translation (Artetxe et al., 2019), inferring high-quality embeddings for rare words (Schick and Schütze, 2020), unsupervised word alignment (Jalili Sabet et al., 2020) or knowledge base queries (Dufter et al., 2021). However, we can use the GLUE benchmark as part of an objective and unified framework to evaluate word embeddings. In this sense, future research can also use exactly the same setting and methodology to evaluate new word embeddings and meta-embeddings. Table 6 presents the results of the extrinsic evaluation. Interestingly, FastText achieves the best results, outperforming every single meta-embedding in every task. In fact, Numberbatch and AAEME fail on the extrinsic evaluation achieving very low results compared with the source word embeddings.
Previous research in meta-embedding generation has limited the extrinsic evaluation to very few tasks that are formulated closely to the intrinsic evaluation, such as short text classification (Bollegala and Bao, 2018; Bollegala et al., 2018) or common-sense stories. Other approaches combine meta-embeddings with contextual representations with the aim of achieving SOTA results for tasks such as STS or POS tagging (García-Ferrero et al., 2020). While those previous works assume that meta-embeddings might be helpful for such extrinsic evaluation tasks, our results show that, when evaluating on ten challenging tasks, FastText is indeed a very strong baseline that is not improved by any meta-embedding proposed to date. These results suggest that meta-embeddings generated using complementary knowledge from WordNet, ConceptNet or PPDB help to improve performance for intrinsic tasks, but that this is not the case for extrinsic evaluations using GLUE.

Ablation study
We perform an ablation study to determine which steps of our method contribute the most. For the ablation study we use the best meta-embedding in the intrinsic and extrinsic evaluation tasks. We do this by skipping a different step of the method each time. For -OOV, we do not apply the technique to obtain representations for the OOV words; we just average the available representations for a given word. For -Vecmap, the source embeddings are not mapped to a common vector space. The results reported in Table 7 show that the normalization and the mapping steps provide most of the performance. If we average embeddings that have not been normalized, the difference in scale and in the centroid of the vector space can cause some embeddings to take higher importance in the meta-embeddings. Averaging word embeddings that have not been mapped to the same vector space can cause vectors to cancel each other.
With respect to OOV, the results are mixed. This step increases the performance in the categorization and word similarity tasks, but it hurts the performance on the analogy and extrinsic tasks. This is caused by two factors. First, since all the embeddings have been normalized and mapped to the same vector space, the average of the available representations is already a good approximation for OOV words. If the source embeddings had a representation for the OOV words, it would be close to the ones already available.
Additionally, a larger vocabulary is not beneficial for every task. Consider the example in Table 2, where a much larger vocabulary obtains worse results in the Word Analogy task. We illustrate this by counting the number of nearest neighbours of the word "love" with a cosine similarity greater than 0.85 in the meta-embeddings. Table 8 shows the most similar words when using and not using the OOV algorithm (27 and only 4 words respectively). Generating a meta-embedding containing the union of the vocabularies of all the source embeddings may be useful for some tasks, such as word similarity. However, for tasks such as word analogy, reducing the final vocabulary to the set of most common words is the best approach.
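The neighbour count used in this comparison can be reproduced with a sketch like the one below. It assumes a dictionary embedding; the function name and the toy vectors are hypothetical and unrelated to the vectors behind Table 8.

```python
import numpy as np

def neighbours_above(emb, query, threshold=0.85):
    """Words whose cosine similarity to `query` exceeds `threshold`, best first."""
    words = [w for w in emb if w != query]
    matrix = np.stack([np.asarray(emb[w], dtype=float) for w in words])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q = np.asarray(emb[query], dtype=float)
    q = q / np.linalg.norm(q)
    sims = matrix @ q
    hits = [(w, float(s)) for w, s in zip(words, sims) if s > threshold]
    return sorted(hits, key=lambda pair: -pair[1])

# Hypothetical toy vectors.
toy_space = {
    "love": [1.0, 0.0],
    "loves": [0.95, 0.1],
    "adore": [0.9, 0.3],
    "hate": [0.2, 1.0],
}
close_to_love = neighbours_above(toy_space, "love")
```

A denser neighbourhood around a query word means more near-ties when ranking candidates, which is one way a larger vocabulary can hurt analogy accuracy.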

Conclusions
We have presented a meta-embedding generation method that improves over previous approaches. Moreover, our method does not rely on hyper-parameter tuning and generates general-purpose meta-embeddings that can be used for any task. We also propose a comprehensive and unified evaluation framework for evaluating meta-embeddings. This framework allows a fair and objective comparison of different meta-embedding generation approaches using the same settings and methodology. Using this framework we demonstrate that combining embeddings that encode the most complementary knowledge produces better meta-embeddings. In fact, the meta-embeddings that encode in the same vector space the knowledge from large text corpora, WordNet, PPDB and ConceptNet achieve the best published results in the intrinsic evaluation benchmarks. Interestingly, and contrary to what previous research suggested, we empirically demonstrate that, when evaluating on a large set of extrinsic tasks, meta-embeddings are not helpful for improving the results of the source embeddings. We plan to investigate the performance of our approach in a cross-lingual setting for under-resourced languages. We suspect that the performance of under-resourced language embeddings can be improved by combining them with embeddings from a resource-rich language.

A Meta-embedding Generation Algorithm Illustrated
In this section we illustrate our meta-embedding generation algorithm using two sample embeddings with 3-dimensional vectors and vocabularies of 1000 words each (Figure 1). The vocabularies of the two embeddings have 791 words in common, and each embedding has 209 unique words for which the other embedding does not have a representation (OOV words). The resulting meta-embedding vocabulary will be the union of the vocabularies, 1209 words (791 + 209 + 209). Our approach to generate meta-embeddings consists of two main steps: (i) pre-processing of the source embeddings and (ii) generation of the meta-embedding by averaging.
Figure 1: Step 0 Source embeddings at the start of the embedding generation process

A.1 Word embeddings pre-processing
Word embeddings generated with different sources or techniques can result in very different vector spaces and vocabularies. Before aligning the vector spaces, a harmonization pre-processing step is needed. Thus, we translate, scale, rotate and match the vocabularies of the source embeddings.
1) Mean centering and scaling: Following Artetxe et al. (2018), we first length normalize the source embeddings (Figure 2). We then mean center each dimension (Figure 3), and we length normalize them again (Figure 4). This translates all the source embeddings to the origin and scales them to have the same length.
2) Aligning the vector spaces: We align the vector spaces of the source embeddings using VecMap (Artetxe et al., 2016) (Figure 5). VecMap learns word embedding mappings using an orthogonal transformation. Orthogonality allows monolingual invariance during the mapping, preserving vector dot products between word vectors. Monolingual invariance ensures no information loss during the mapping step, which is desirable for our aim of generating meta-embeddings. In our experiments we align the source embeddings by projecting them to the vector space of one particular source embedding involved in the construction of the meta-embeddings.
3) OOV generation: Different word embeddings have different vocabularies. When combining two word embeddings we can distinguish two sets of words: those for which we have a representation in both embeddings and those for which one of the embeddings has no representation. We call the latter "OOV words". We unify the vocabulary of the source embeddings by creating new approximate representations for the OOV words (Figure 6). The process is as follows. Given two source embeddings E1 and E2 where for a word W only E1 has a representation, we generate a new approximation for the OOV word in E2 by examining the most similar words from the common vocabulary of E1 and E2. First, using the cosine similarity as distance metric, we select the k (ranging from 2 to 50) nearest neighbours of the word W in E1 that also appear in the common vocabulary with E2. For each k, we calculate a candidate representation of the OOV word in both E2 and E1 as a weighted average of the selected k nearest neighbours in their corresponding spaces. We use the cosine similarity from the nearest neighbours in E1 to W as weights. Finally, the selected representation of the OOV word in E2 is the one corresponding to the closest candidate to W in E1.

A.2 Meta-embedding generation
We combine the harmonized source embeddings by averaging them (Figure 7). We empirically demonstrate that, thanks to the pre-processing steps, averaging effectively combines multiple source embeddings, resulting in representations as good as the ones generated by embedding concatenation without increasing their dimensionality.

B Computing infrastructure
We run all the experiments on a Linux system with an Intel Xeon E5-2620 v4 CPU, 1024GB of RAM and an Nvidia Titan V GPU. To reproduce the generation of the FJNP meta-embedding with a reasonable run-time (less than 24 hours) we recommend using at least a quad-core CPU, 32GB of RAM and a 2GB GPU with CUDA support (GPU is optional but highly recommended). The intrinsic evaluation framework can be run in less than one hour on a system with enough primary memory to load a full embedding/meta-embedding (8GB). The extrinsic evaluation framework will run in less than 24 hours on a system with a reasonably modern CPU and enough primary memory to load the full embedding/meta-embedding and the bag-of-words model (8GB). The extrinsic evaluation can be sped up with an 8GB GPU with CUDA and FP16 support.
[Figure caption: Step 6 Meta-embedding generation by averaging]