Combining Static Word Embeddings and Contextual Representations for Bilingual Lexicon Induction

Bilingual Lexicon Induction (BLI) aims to map words in one language to their translations in another, typically by learning linear projections to align monolingual word representation spaces. Two classes of word representations have been explored for BLI: static word embeddings and contextual representations, but no prior work has combined the two. In this paper, we propose a simple yet effective mechanism that combines static word embeddings and contextual representations to utilize the advantages of both paradigms. We test the combination mechanism on various language pairs under the supervised and unsupervised BLI benchmark settings. Experiments show that our mechanism consistently improves performance over robust BLI baselines on all language pairs, by an average of 3.2 points in the supervised setting and 3.1 points in the unsupervised setting.

Most work on BLI learns a mapping between two static word embedding spaces pretrained on large monolingual corpora (Ruder et al., 2019). Both linear mapping (Mikolov et al., 2013; Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017) and non-linear mapping (Mohiuddin et al., 2020) methods have been studied to align the two spaces. Recently, beyond the static word embeddings, contextual representations have also been used for BLI, owing to the significant progress on cross-lingual applications (Aldarmaki and Diab, 2019; Schuster et al., 2019). Although both the static word embeddings and the contextual representations exhibit properties suited for alignment, no prior work combines the two paradigms.
On the one hand, static word embeddings have been widely used for BLI, but a single embedding mapping function cannot ensure that, in all conditions, the words of a translation pair are nearest neighbors in the mapped common space. On the other hand, contextual representations contain rich semantic information beneficial for alignment, but the dynamic contexts of word tokens pose a challenge for aligning word types.
In this paper, we propose a combination mechanism that utilizes the static word embeddings and the contextual representations simultaneously. The combination mechanism consists of two parts. The first part is the unified word representations, in which a spring network uses the contextual representations to pull the static word embeddings to better positions in the unified space for easier alignment. The spring network and the unified word representations are trained via a contrastive loss that encourages the words of a translation pair to move closer in the unified space and pushes the words of non-translation pairs farther apart. The second part is a weighted interpolation between the word similarity in the unified word representation space and the word similarity in the contextual representation space.
We test the proposed combination mechanism in both the supervised BLI setting, which can utilize a bilingual dictionary as the training set, and the unsupervised BLI setting, which does not allow using any parallel resources as a supervision signal.
On BLI benchmark sets of multiple language pairs, our combination mechanism performs significantly better than systems using only the static word embeddings or only the contextual representations. Our mechanism improves over robust BLI baselines on all language pairs, achieving an average improvement of 3.2 points in the supervised setting and 3.1 points in the unsupervised setting.

Background
Early work on Bilingual Lexicon Induction (BLI) dates back several decades, including feature-based retrieval (Fung and Yee, 1998), the distributional hypothesis (Rapp, 1999; Vulić and Moens, 2013), and decipherment (Ravi and Knight, 2011). Following Mikolov et al. (2013), who pioneered the embedding-based BLI method, word-representation-based methods became the dominant approach, and can be categorized into two classes: static word embedding based methods and contextual representation based methods.
• Static Word Embedding Based Method Word embeddings of different languages are pre-trained on large monolingual corpora independently. A mapping function is then applied to align the embedding spaces of the two languages (Mikolov et al., 2013; Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017).
We follow the robust BLI system VecMap (Artetxe et al., 2018a,b), which maps both the source space and the target space into a third common space. Let E_x and E_y be the word embedding matrices in the two languages for a given bilingual dictionary, such that their i-th rows are the embeddings of the words of the i-th translation pair in the dictionary. The training objective is to find mapping functions W_x and W_y such that

    W_x*, W_y* = arg max_{W_x, W_y ∈ M_d(R)} Σ_i cos(E_x^i W_x, E_y^i W_y),    (1)

where d is the dimension of the embeddings and M_d(R) is the space of d × d matrices of real numbers. The optimal W_x and W_y maximize the cosine similarity between the words of each translation pair in the mapped common space. In the unsupervised version, where no bilingual dictionary is given, an artificial dictionary is initialized and iteratively updated by training W_x and W_y according to equation (1) (Artetxe et al., 2018b).
Both mapping functions are constrained to be orthogonal during training by setting W_x = U and W_y = V, where U Σ V^T is the singular value decomposition of E_x^T E_y. This orthogonality constraint is based on the assumption that the source embedding space and the target embedding space are isometric, which is a particularly strong assumption that does not hold in all conditions (Zhang et al., 2017b; Søgaard et al., 2018). To depart from the isometry assumption, Patra et al. (2019) use a semi-supervised technique that leverages both a seed dictionary and a larger set of unaligned word embeddings, while Mohiuddin et al. (2020) use a non-linear mapping function that is not constrained to be orthogonal.
We propose another way to relax the isometry assumption: combining the contextual representations with the word embeddings to compensate for the limitations of this overly strong assumption.
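As a concrete sketch of the orthogonal mapping above, the SVD-based solution can be written in a few lines of numpy. This is a toy illustration of the closed-form step, not the full VecMap pipeline (normalization, re-weighting, and self-learning are omitted), and the variable names are ours:

```python
import numpy as np

def orthogonal_mappings(X, Y):
    """Orthogonal W_x, W_y maximizing the summed cosine similarity of
    row-aligned dictionary pairs, via the SVD solution used by VecMap:
    U S V^T = X^T Y, then W_x = U and W_y = V.
    X, Y: (n_pairs, d) embedding matrices with L2-normalized rows."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U, Vt.T  # W_x, W_y

# Toy usage: two small "spaces" that are exact rotations of each other.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # a random orthogonal matrix
Y = X @ R
Wx, Wy = orthogonal_mappings(X, Y)
# When the spaces really are isometric, mapped pairs coincide exactly.
print(np.allclose(X @ Wx, Y @ Wy))  # True
```

When the isometry assumption fails, X @ Wx and Y @ Wy no longer coincide, which is precisely the gap the spring network in Section 3 is designed to close.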

• Contextual Representation Based Method
Contextual representations can be obtained through multilingual pre-training, which encodes a whole sentence and outputs a contextual representation for each word (Devlin et al., 2019; Lample and Conneau, 2019). Due to the rich context information contained in the contextual representations, there have been endeavors to align them across languages (Schuster et al., 2019; Aldarmaki and Diab, 2019; Wang et al., 2020; Kulshreshtha et al., 2020; Cao et al., 2020).
Since a word may appear in different sentences with different contexts, Schuster et al. (2019) use an average anchor to summarize the multiple contexts of a word type and align the anchors of different languages, while other works aim to align each individual context representation based on parallel corpora: learning an alignment on sentence-level representations and applying the learned mapping to word-level contextual representations (Aldarmaki and Diab, 2019), using word alignments in parallel corpora to learn the mapping for word contextual representations (Wang et al., 2020), and directly minimizing the distance between the two contextual representations of an aligned word pair in parallel corpora, without any mapping (Cao et al., 2020).

Figure 1: Illustration of the proposed combination mechanism. (A) is the static word embedding space, where e_x and e_y are the source and target embeddings, respectively. (B) is the unified word representation space, which consists of the mapped word embeddings pulled by a spring network F_x/F_y with the contextual representations as input; we depict only two springs for illustration. (C) is the mapped contextual representation space. (D) is the original contextual representation space, where a_x and a_y are the source and target contextual representations, i.e., the average anchors, respectively. In the similarity interpolation shown at the bottom, u_x and u_y are the unified word representations in the two languages, a_x and a_y are the mapped contextual representations, cos denotes the cosine similarity function, and λ denotes the interpolation weight.
We adopt the average anchor method for the contextual representations (Schuster et al., 2019), which does not depend on parallel corpora. Let the contextual representation of a source word x in context c_i be denoted r_{x,c_i}. If x appears a total of p times in the source corpus, the average anchor for x across all contexts is

    a_x = (1/p) Σ_{i=1}^{p} r_{x,c_i}.    (2)

Similar to the mapping for the static word embeddings, we conduct a mapping for the average anchors. Let A_x and A_y be the matrices of average anchors in the two languages, row-aligned with the word pairs of a given bilingual dictionary. The mapping functions V_x and V_y are optimized by maximizing cos(A_x V_x, A_y V_y), where A_x and A_y are fixed, V_x, V_y ∈ M_{d'}(R), and d' is the dimension of the contextual representations.

Besides the above methods, another line of work extracts word alignments from pseudo-parallel corpora for BLI, where the pseudo-parallel corpora are built by either unsupervised machine translation (Artetxe et al., 2019) or unsupervised bitext mining (Shi et al., 2021). Both methods require significant computational overhead or use monolingual corpora that are orders of magnitude larger than ours, and are beyond the scope of this paper, which focuses on representation-based methods.
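The average-anchor computation of equation (2) is a simple mean over a word type's contextual representations. A minimal numpy sketch, with hypothetical 4-dimensional toy contexts standing in for real encoder outputs:

```python
import numpy as np

def average_anchor(contextual_reps):
    """Average anchor a_x for one word type: the mean of its contextual
    representations r_{x,c_i} over the p contexts it appears in."""
    R = np.stack(contextual_reps)  # shape (p, d)
    return R.mean(axis=0)          # shape (d,)

# Toy usage: three hypothetical contexts of one word.
contexts = [np.array([1.0, 0.0, 0.0, 0.0]),
            np.array([0.0, 1.0, 0.0, 0.0]),
            np.array([0.0, 0.0, 1.0, 0.0])]
a_x = average_anchor(contexts)  # approximately [1/3, 1/3, 1/3, 0]
```

Stacking one such anchor per dictionary word gives the matrix A_x used in the anchor mapping.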

Proposed Combination Mechanism for BLI
Since no existing work combines the static word embeddings and the contextual representations, we propose the combination mechanism illustrated in Figure 1. The mechanism first builds a unified word representation space that unifies the static word embeddings and the contextual representations, and then performs a similarity interpolation between the unified space and the contextual space.

The Unified Word Representations
As shown in Figure 1, the original word embedding space (A) is mapped to (B) through the mapping functions. Since the mapping functions are orthogonal, (A) is simply rotated into (B). Because the spaces of the two languages are not necessarily isometric everywhere, the words of certain translation pairs remain far from each other after rotation. To pull the words of a translation pair closer together, we propose a spring network that pulls the mapped embedding points to better positions, such that the words of a translation pair become nearest neighbors of each other. Since the contextual representations contain rich context information that can serve as a flexible adjustment, the spring network takes the contextual representations as input and outputs offsets for the word embeddings. Specifically, in the unified word representations, the mapped word embeddings are pulled to new positions by offsets produced by the spring network with the contextual representations as input:

    U_x = E_x + γ_1 ⊙ F_x(A_x),  U_y = E_y + γ_2 ⊙ F_y(A_y),    (3)

where U_x and U_y are the unified word representations, E_x and E_y are the mapped word embeddings, F_x and F_y are the spring networks, and γ_1 and γ_2 are weight vectors that element-wise multiply each row of the spring network output. Taking the source side as an example, the mapped word embedding matrix E_x is added to a weighted offset produced by the spring network F_x from the contextual representation (i.e., average anchor) matrix A_x.
The Spring Network stacks two feedforward layers with Tanh activations on top of the contextual representation matrices. The first layer transforms the contextual representation dimension d' down to the word embedding dimension d. Equations (4-5) list the network structure for both sides:

    A^1_{x/y} = φ(θ^1_{x/y}(A_{x/y})),    (4)
    A^2_{x/y} = φ(θ^2_{x/y}(A^1_{x/y})),    (5)

where φ denotes the Tanh activation and θ denotes a feedforward layer. A^2_{x/y} is the output of the spring network, and serves as the offset that compensates for the deviation between the words of each translation pair in the mapped word embedding space.
Since we use cross-lingual pre-training (Lample and Conneau, 2019) to generate the contextual representations, which are actually contextual representations of BPE units (Sennrich et al., 2016), we have to form the contextual representations at the word level. Suppose a word x consists of q BPEs and appears p times in the monolingual corpus. Then the word-level contextual representation is

    a_x = (1/p) Σ_{i=1}^{p} (1/q) Σ_{j=1}^{q} r_{x,c_i,j},

where r_{x,c_i,j} denotes the representation of the j-th BPE of x in the i-th context c_i. That is, a_x first averages the q BPE representations, then averages over the p contexts. After this cascaded averaging, a_x constitutes one row of A_x.
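Putting the pieces above together, the spring network and the unified representations can be sketched in numpy with toy dimensions. This is a forward-pass illustration only: biases and training are omitted, and the parameter names (W1, W2, gamma) are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d_ctx, d_emb, n_words = 8, 4, 10  # toy sizes; the paper uses 1024 and 300

# Hypothetical parameters of the two feedforward layers (theta^1, theta^2).
W1 = rng.normal(scale=0.1, size=(d_ctx, d_emb))
W2 = rng.normal(scale=0.1, size=(d_emb, d_emb))
gamma = np.full(d_emb, 0.1)  # weight vector, multiplied element-wise per row

def spring(A):
    """Two feedforward layers with Tanh activations (equations 4-5):
    A^1 = tanh(A W1), A^2 = tanh(A^1 W2). The first layer maps the
    contextual dimension down to the embedding dimension; A^2 is the
    offset added to the mapped embeddings."""
    return np.tanh(np.tanh(A @ W1) @ W2)

A_x = rng.normal(size=(n_words, d_ctx))   # average anchors (one row per word)
E_x = rng.normal(size=(n_words, d_emb))   # mapped static embeddings
U_x = E_x + gamma * spring(A_x)           # unified word representations
```

Because Tanh is bounded, the offset on each coordinate is at most the corresponding entry of gamma, so the spring can only nudge, not arbitrarily move, each embedding.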
Contrastive Training is used to train the spring networks F_x and F_y, with the pre-trained mapped word embeddings and the contextual representations fixed in the unified space. Through the spring adjustment, training encourages parallel words to move closer and drives non-parallel words farther apart. There are two scenarios: supervised contrastive training and unsupervised contrastive training.
• In the supervised contrastive training, given a bilingual dictionary with I translation pairs, the contrastive loss is

    L = Σ_{i=1}^{I} Σ_{j=1}^{J} ( cos(u_x^i, u_ȳ^j) − cos(u_x^i, u_y^i) ),    (6)

where u_x^i and u_y^i are the unified representations corresponding to the i-th entry of the given bilingual dictionary.
In equation (6), (u_x^i, u_y^i) is the positive translation pair according to the given dictionary, and the cosine similarity of this pair is maximized during training, while (u_x^i, u_ȳ^j) is a negative pair in which ȳ is not aligned to x; its cosine similarity is minimized during training.
We select J negative pairs for each source word x. In the implementation, we use the J-best outputs of the current model, excluding the correct translation, as the negative pairs. To keep the balance between positive and negative pairs, the positive pair is copied J times to pair with the negatives.
During inference, we select y* = arg max_y cos(u_x, u_y) as the translation of x.
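The supervised contrastive term can be sketched for a single dictionary entry as follows, under the reading described above: each of the J negative cosines is pushed down against a copy of the positive cosine. This is our minimal numpy reading of the objective, not the authors' training code:

```python
import numpy as np

def cos_sim(a, b):
    """Plain cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_term(u_x, u_y_pos, u_y_negs):
    """Per-entry contrastive term: the positive pair's cosine enters with
    a negative sign (so minimizing the loss maximizes it), and each of the
    J negative pairs' cosines enters with a positive sign (so minimizing
    the loss pushes them down). The positive term is repeated once per
    negative, matching the J-times copying described in the text."""
    return sum(cos_sim(u_x, u_n) - cos_sim(u_x, u_y_pos) for u_n in u_y_negs)

# Toy usage: a perfect positive and two orthogonal/opposite negatives.
u_x = np.array([1.0, 0.0])
loss = contrastive_term(u_x, np.array([1.0, 0.0]),
                        [np.array([0.0, 1.0]), np.array([0.0, -1.0])])
# loss = (0 - 1) + (0 - 1) = -2.0
```

Summing this term over the I dictionary entries gives the full loss of equation (6).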
• In the unsupervised contrastive training, no bilingual dictionary is given. The contrastive loss is the same as in the supervised case, except that the dictionary must be induced. We initialize the bilingual dictionary using the output of the static-word-embedding-based unsupervised method, and iteratively update it: the model trained in the last iteration finds new translations for the given source words and composes a new dictionary, which is then used to train a new model. This process iterates until the dictionary no longer changes.
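The unsupervised iteration above amounts to a fixed-point loop. In the sketch below, `train` and `induce` are placeholder callables standing in for the actual spring-network training and translation-induction steps (both are assumptions for illustration):

```python
def unsupervised_training(induce, train, init_dictionary, max_iters=50):
    """Iterative self-learning loop sketched from the text: start from the
    dictionary induced by the static-embedding unsupervised method, train
    on it, re-induce translations with the trained model, and stop once
    the dictionary reaches a fixed point (or the iteration cap)."""
    dictionary = init_dictionary
    model = None
    for _ in range(max_iters):
        model = train(dictionary)          # train spring nets on current dict
        new_dictionary = induce(model)     # re-translate the source words
        if new_dictionary == dictionary:   # fixed point: dictionary is stable
            break
        dictionary = new_dictionary
    return model, dictionary
```

In practice the dictionary is a mapping from source words to their current best translations, and `induce` applies the arg-max inference rule of the previous section.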

Similarity Interpolation
The similarity interpolation is used at inference time. As shown in Figure 1, both the unified word representation space and the mapped contextual representation space yield cosine similarities between words. Given a source word x, we interpolate the two similarities:

    S = cos(u_x, u_y) + λ cos(a_x, a_y),    (7)

where λ is the weight and a_{x/y} is the mapped contextual representation, pre-trained as introduced in the background section on contextual representation based methods. We select the y with the maximal S as the translation of x.
In the supervised setting, λ is tuned on a validation set of translation pairs. In the unsupervised setting, λ is tuned by an unsupervised procedure: once the source-to-target and target-to-source models have been trained, each word x in the validation set is aligned to a y based on equation (7), and y is then aligned back based on the inverse version of equation (7). We select the λ with the highest accuracy of this back alignment to x.
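The interpolated score and the unsupervised back-alignment tuning of λ can be sketched as below. The scoring callables `score_xy` and `score_yx` are placeholders for the trained forward and backward models, and the search grid mirrors the [0.05, 0.3] range with step 0.01 reported in the appendix:

```python
import numpy as np

def interpolated_score(u_x, u_y, a_x, a_y, lam):
    """Equation-(7)-style score: similarity in the unified space plus a
    lambda-weighted similarity in the mapped contextual space."""
    c = lambda p, q: float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
    return c(u_x, u_y) + lam * c(a_x, a_y)

def tune_lambda_unsupervised(score_xy, score_yx, src_words, candidates):
    """Back-alignment tuning sketch: for each lambda, align x -> y with the
    forward scorer and y -> x with the backward scorer, keeping the lambda
    with the highest round-trip accuracy. score_xy/score_yx are assumed
    callables (x, y, lam) -> float."""
    best_lam, best_acc = None, -1.0
    for lam in np.arange(0.05, 0.31, 0.01):
        hits = 0
        for x in src_words:
            y = max(candidates, key=lambda cand: score_xy(x, cand, lam))
            x_back = max(src_words, key=lambda xb: score_yx(y, xb, lam))
            hits += (x_back == x)
        acc = hits / len(src_words)
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam
```

Ties are broken toward the smallest λ here; any deterministic tie-breaking rule would do for the sketch.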

Data
We need monolingual corpora to compute the contextual representations. Unfortunately, most existing BLI datasets distribute pre-trained word embeddings alone, but not the monolingual corpora used to train them. For that reason, we use WikiExtractor 2 to extract plain text from Wikipedia dumps, and preprocess the resulting corpora using standard Moses (Koehn et al., 2007) tools by applying sentence splitting, punctuation normalization, tokenization, and lowercasing. On these corpora, we use the cross-lingual pre-training system XLM (Lample and Conneau, 2019) 3 to compute the contextual representations.
Meanwhile, we also use these corpora to train the static word embeddings with fastText 4, ensuring that both the contextual representations and the static word embeddings come from the same data. We use the bilingual dictionaries released by the Muse project 5 in our experiments. Since some words in these dictionaries do not necessarily appear in our monolingual corpora, we recompose the training, validation, and test sets such that all words in these sets are included in our monolingual corpora. In the end, we have 5000 entries with unique source words in the training set, and 1500 entries with unique source words in each of the validation and test sets, for all language pairs.

Baseline Systems
Baseline systems are divided into the two tasks below. We run the released code of each baseline system in our experiments.

Table 1: P@1 on all language pairs. "Unified" denotes our unified word representation based method, which computes cos(u_x, u_y); "Contextual" denotes the contextual representation based method, which computes cos(a_x, a_y); "Interpolation" denotes our similarity interpolation, which computes cos(u_x, u_y) + λ cos(a_x, a_y). The subscript "VecMap" denotes that our method is based on the work of Artetxe et al. (2018a); the subscript "RCSLS" denotes that our method is based on the RCSLS criterion in training (Joulin et al., 2018). In unsupervised BLI, our method has no subscript, meaning the default "VecMap" (Artetxe et al., 2018b) is used.

Supervised BLI task, which is allowed to use bilingual dictionaries for training and validation. The baseline systems are:
• BLISS 8 : Patra et al. (2019) use a semi-supervised method that leverages both the bilingual dictionary and a larger set of unaligned word embeddings.
Unsupervised BLI task, which is not allowed to use any parallel resources for training and validation. The baseline systems are:
• Muse: Unsupervised Muse (Conneau et al., 2017) uses adversarial training and iterative Procrustes refinement.
• VecMap: Artetxe et al. (2018b) use careful initialization, robust self-learning procedure, and symmetric re-weighting to improve the unsupervised mapping result.

Experimental Settings
We use fastText to train the word embeddings for BLI. The dimension of the word embeddings is 300. The contextual representations are extracted from XLM, and their dimension is 1024. For each word type, we randomly select ten sentences containing the word from the monolingual corpora and average them to obtain the contextual representation. The influence of the number of selected sentences per word type is reported in section 4.5.2. For the spring network, we use ten negative pairs per source word in the supervised contrastive training, and one negative pair per source word in the unsupervised contrastive training. All inferences in our experiments, including all baseline systems, use CSLS, introduced in Conneau et al. (2017). The results are evaluated by Precision@1 (P@1).

8 https://github.com/joelmoniz/BLISS
9 https://github.com/taasnim/unsup-word-translation/

Table 1 summarizes the main results of the supervised and unsupervised BLI tasks on all test sets. In both tasks, our proposed methods achieve significant improvements: on average 3.2 points higher than the strongest baseline RCSLS in the supervised task, and on average 3.1 points higher than the strong baselines VecMap and Ad. in the unsupervised task.
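For reference, CSLS (Conneau et al., 2017), used for all inferences above, rescales cosine similarity by each point's mean similarity to its k nearest cross-lingual neighbors, which corrects for hubness. A compact numpy sketch (our own vectorized rendering, not the Muse implementation):

```python
import numpy as np

def csls_scores(X, Y, k=10):
    """CSLS score matrix between row-normalized embedding matrices
    X (n_x, d) and Y (n_y, d):
        CSLS(x, y) = 2 cos(x, y) - r_Y(x) - r_X(y),
    where r_Y(x) is the mean cosine of x to its k nearest targets and
    r_X(y) the mean cosine of y to its k nearest sources."""
    sims = X @ Y.T                                     # cosine (rows normalized)
    r_x = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # mean top-k per source
    r_y = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # mean top-k per target
    return 2 * sims - r_x[:, None] - r_y[None, :]
```

Translation is then read off as the arg-max of each row of the score matrix, exactly as with plain cosine.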

Main Results
In the supervised task, we have two independent bases on which to build our proposed methods: VecMap (Artetxe et al., 2018a) and RCSLS (Joulin et al., 2018). They are the preprocessing steps that align the static word embeddings and align the contextual representations in the two languages. Our methods build upon these alignments, and further train the spring networks and the unified word representations for the combination. The performances of our methods with these two bases are reported in Table 1 with the corresponding subscripts. Table 1 shows that with VecMap as the basis, our method improves 3.6 points over the corresponding VecMap baseline; with RCSLS as the basis, it improves 3.2 points over the corresponding RCSLS baseline. Among our methods, "Unified" achieves around 2 points of improvement over the corresponding baselines. Although "Contextual" obtains inferior performance, it is complementary to "Unified": when the two are combined through the interpolation, performance improves further, achieving the best result among all systems. This shows that our combination mechanism effectively utilizes the merits of both the static word embeddings and the contextual representations.
In the unsupervised task, we also achieve significant improvements over the baselines. "Unified" is 1.5 points better than the VecMap baseline. "Contextual" is inferior to the other methods, but it provides useful complements to "Unified", resulting in a final 3.1-point improvement through interpolation.
In summary, our combination mechanism consistently improves performance for both distant language pairs, such as EN-AR and EN-ZH, and closely related European language pairs.

XLM vs. mBART
Our results in Table 1 are based on XLM for obtaining the contextual representations. In this section, we also use mBART (Liu et al., 2020) for comparison with XLM; Table 2 shows the results. XLM pre-trains a Transformer encoder through the masking mechanism, while mBART pre-trains a full Transformer encoder-decoder through multilingual denoising. For the BLI task, we obtain the contextual representations from the encoder. Table 2 shows that XLM and mBART achieve similar BLI performance, since only the encoder is used. In some directions mBART performs slightly better than XLM, while in other directions XLM is slightly better. On average, XLM ties with mBART in the supervised task, and is slightly better in the unsupervised task.

The Randomness of Contexts
The contextual representations are derived from randomly sampled sentences of the monolingual corpora. We study whether this random derivation affects performance. First, we run 5 trials of randomly selecting 10 sentences to gather contexts for each word type in the "Unified" setting. Figure 2 shows that the performance is stable across the 5 trials. Second, we try randomly selecting 1-100 sentences to gather contexts for each word type. Figure 3 shows that selecting only 1 sentence drags the performance down to the baseline, which indicates that a single sentence is too random to gather enough information for BLI.
We only present the studies on EN-ES due to space limits. Studies on other language pairs can be found in the appendix.

The Influence of Selecting Encoder Layer
In the above experiments, we derive the contextual representations from the first layer of the encoder of XLM/mBART. In this section, we show how the performance changes when we vary the layer in the "Unified" setting. Figure 4 shows that as the layer goes higher, the performance drops. Please refer to the appendix for the performance on other language pairs.

Results of using WaCKy Corpora
The WaCKy corpora were introduced for BLI by Dinu et al. (2014), but only word embeddings trained on them are provided in that work. To obtain the contextual representations, we retrieve the WaCKy corpora from BUCC 10 and use the corresponding dictionaries with the same training, validation, and test split. We use mBART instead of XLM to compute the contextual representations in this task. Table 3 shows that our combination mechanism is robust on this dataset: both "Unified" and "Interpolation" perform better than the baselines, and "Interpolation" achieves significant improvements in the supervised setting.

Discussion
The static word embeddings in our paper are trained by skip-gram or CBOW, while the word embeddings from XLM/mBART are trained with the pre-training objectives. The different training objectives result in quite different word embeddings, which show remarkably different behavior for BLI. Using VecMap, the word embeddings from XLM/mBART perform on average around 30 points lower than the static word embeddings used in our paper. For a fair comparison, we also test fastText with dimension 1024 and word2vec with dimension 300; both perform remarkably better than the word embeddings from XLM/mBART. Plugging the word embeddings from XLM/mBART into our combination approach in place of the fastText static embeddings also yields much worse performance. This indicates that static word embeddings trained by skip-gram or CBOW are more suitable for BLI and for our combination approach.
In addition, regarding the asymmetry in Figure 1, we also tried a spring function that takes the static word embeddings as input, but obtained much worse results. This indicates that a spring function conditioned on the static space may not be helpful for BLI. This observation may also explain why symmetrizing Figure 1, by building two unified spaces to combine, performs slightly worse than the asymmetric version in this paper: the spring function conditioned on the static space must be introduced to maintain the symmetry, yet it is not helpful for the combination.

Conclusion
Most BLI systems use either static word embeddings or contextual representations, but no prior work combines both. In this paper, we propose a combination mechanism consisting of unified word representations and a similarity interpolation. The unified word representations use a spring network to pull the static word embeddings by offsets produced from the contextual representations, composing a unified space in which parallel words are nearest neighbors of each other. The similarity interpolation is then applied to interpolate the similarities in the unified space and in the contextual representation space. BLI experiments on multiple language pairs show that our combination mechanism utilizes the merits of both the static word embeddings and the contextual representations, achieving significant improvements over robust baseline systems in both the supervised and unsupervised BLI tasks.
Regarding the hyperparameter λ in the similarity interpolation, we search for the optimal value in [0.05, 0.3] with a step size of 0.01. We found λ = 0.1 to be superior on the validation sets in all supervised settings. The optimal value of λ found by the unsupervised tuning procedure (introduced in section 3.2) in the unsupervised settings is shown in Table 4. Within the search range, the performance has low variance and remains better than the baselines.

C The Influence of Selecting Encoder Layer in Other Language Pairs
We report the influence of the encoder layer on EN-AR, EN-ZH, EN-DE, and EN-FR in Figure 5. It shows that as we select higher layers for deriving the contextual representations, the performance decreases. This trend holds for most language pairs, except that in EN-ZH it is not significant.

D The Randomness of Contexts in Other Language Pairs
We report the randomness analyses on EN-AR, EN-ZH, EN-DE, and EN-FR in Figure 6. Running 5 trials of selecting 10 random sentences to gather contexts yields stable performance in all language pairs. In most cases, using only 1 sentence to compute the contextual representation drags the performance down, which indicates that a single sentence is inadequate for gathering contexts.

E Performances on Words with Different Frequencies
We use the WaCKy corpora and the dictionaries provided by BUCC2020 12 to study the performance on words of different frequencies. The provided dictionaries are divided into groups of high-frequency, mid-frequency, and low-frequency words. We test our combination mechanism on these three groups respectively. The results are presented in Table 5 and Table 6: our combination mechanism is effective on all three groups.