Beyond Offline Mapping: Learning Cross-lingual Word Embeddings through Context Anchoring

Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have this limitation, while requiring a weak seed dictionary (e.g., a list of identical words) as the only form of supervision. Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them. To that end, we use an extension of skip-gram that leverages translated context words as anchor points, and incorporates self-learning and iterative restarts to reduce the dependency on the initial dictionary. Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.


Introduction
Cross-lingual word embeddings (CLWEs) represent words from two or more languages in a shared space, so that semantically similar words in different languages are close to each other. Early work focused on jointly learning CLWEs in two languages, relying on strong cross-lingual supervision in the form of parallel corpora (Luong et al., 2015) or bilingual dictionaries (Gouws and Søgaard, 2015; Duong et al., 2016). However, these approaches were later superseded by offline mapping methods, which separately train word embeddings in different languages and align them in an unsupervised manner through self-learning (Artetxe et al., 2018; Hoshen and Wolf, 2018) or adversarial training (Zhang et al., 2017; Conneau et al., 2018a).
Despite the advantage of not requiring any parallel resources, mapping methods critically rely on the underlying embeddings having a similar structure, which is known as the isometry assumption. Several authors have observed that this assumption does not generally hold, severely hindering the performance of these methods (Søgaard et al., 2018; Nakashole and Flauger, 2018; Patra et al., 2019). In later work, Ormazabal et al. (2019) showed that this issue arises from trying to align separately trained embeddings, as joint learning methods are not susceptible to it.
In this paper, we propose an alternative approach that does not have this limitation, but can still work without any parallel resources. The core idea of our method is to fix the target language embeddings, and learn aligned embeddings for the source language from scratch. This prevents structural mismatches that result from independently training embeddings in different languages, as the learning of the source embeddings is tailored to each particular set of target embeddings. For that purpose, we use an extension of skip-gram that leverages translated context words as anchor points. So as to translate the context words, we start with a weak initial dictionary, which is iteratively improved through self-learning, and we further incorporate a restarting procedure to make our method more robust. Thanks to this, our approach can effectively work without any human-crafted bilingual resources, relying on simple heuristics (automatically generated lists of numerals or identical words) or an existing unsupervised mapping method to build the initial dictionary. Our experiments confirm the effectiveness of our approach, outperforming previous mapping methods on bilingual dictionary induction and obtaining competitive results on zero-shot cross-lingual transfer learning on XNLI.


Related work
Word embeddings. Embedding methods learn static word representations based on co-occurrence statistics from a corpus. Most approaches use two different matrices to represent the words and the contexts, which are known as the input and output vectors, respectively (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017). The output vectors play an auxiliary role, being discarded after training. Our method takes advantage of this fact, leveraging translated output vectors as anchor points to learn cross-lingual embeddings.
To that end, we build on the Skip-Gram with Negative Sampling (SGNS) algorithm (Mikolov et al., 2013), which trains a binary classifier to distinguish whether each output word co-occurs with the given input word in the training corpus or was instead sampled from a noise distribution.
Mapping CLWE methods. Offline mapping methods separately train word embeddings for each language, and then learn a mapping to align them into a shared space. Most of these methods align the embeddings through a linear map, often enforcing orthogonality constraints, and, as such, they rely on the assumption that the geometric structure of the separately learned embeddings is similar. This assumption has been shown to fail under unfavorable conditions, severely hindering the performance of these methods (Søgaard et al., 2018). Existing attempts to mitigate this issue include learning non-linear maps in a latent space (Mohiuddin et al., 2020), employing maps that are only locally linear (Nakashole, 2018), or learning a separate map for each word (Glavaš and Vulić, 2020). However, all these methods are supervised, and have the same fundamental limitation of aligning a set of separately trained embeddings (Ormazabal et al., 2019).
Self-learning. While early mapping methods relied on a bilingual dictionary to learn the alignment, this requirement was alleviated thanks to self-learning, which iteratively re-induces the dictionary during training. This enabled learning CLWEs in a semi-supervised fashion starting from a weak initial dictionary (Artetxe et al., 2017), or in a completely unsupervised manner when combined with adversarial training (Conneau et al., 2018a) or initialization heuristics (Artetxe et al., 2018; Hoshen and Wolf, 2018). Our proposed method also incorporates a self-learning procedure, showing that this technique can also be effective with non-mapping methods.
Joint CLWE methods. Before the popularization of offline mapping, most CLWE methods extended monolingual embedding algorithms by either incorporating an explicit cross-lingual term in their learning objective, or directly replacing words with their translation equivalents in the training corpus. For that purpose, these methods relied on some form of cross-lingual supervision, ranging from bilingual dictionaries (Gouws and Søgaard, 2015; Duong et al., 2016) to parallel or document-aligned corpora (Luong et al., 2015; Vulić and Moens, 2016). More recently, positive results were reported learning regular word embeddings over concatenated monolingual corpora in different languages, relying on identical words as anchor points, and this approach was further improved by applying a conventional mapping method afterwards. As shown later in our experiments, our approach outperforms this combined approach by a large margin.
Freezing. Artetxe et al. (2020) showed that it is possible to transfer an English transformer to a new language by freezing all the inner parameters of the network and learning a new set of embeddings for the new language through masked language modeling. This works because the frozen transformer parameters constrain the resulting representations to be aligned with English. Similarly, our proposed approach uses frozen output vectors in the target language as anchor points to learn aligned embeddings in the source language.

Proposed method
Let x_i and x̃_i be the input and output vectors of the ith word in the source language, and y_j and ỹ_j be their analogues in the target language. In addition, let D be a bilingual dictionary, where D(i) = j denotes that the ith word in the source language is translated as the jth word in the target language. Our approach first learns the target language embeddings {y_i} and {ỹ_i} monolingually using regular SGNS. Having done that, we learn the source language embeddings {x_i} and {x̃_i}, constraining them to be aligned with the target language embeddings according to the dictionary D. For that purpose, we propose an extension of SGNS that replaces the output vectors in the source language with their translation equivalents in the target language, which act as anchor points (§3.1). So as to make our method more robust to a weak initial dictionary, we incorporate a self-learning procedure that re-estimates the dictionary during training (§3.2), and perform iterative restarts (§3.3). Algorithm 1 summarizes our method.

SGNS with cross-lingual anchoring
Given a pair of words (w_i, w_j) co-occurring in the source language corpus, we define a generalized SGNS objective as follows:

$$L(w_i, w_j) = -\log \sigma\big(x_{w_i} \cdot \mathrm{ctx}(w_j)\big) - \sum_{t=1}^{k} \mathbb{E}_{w_n \sim P_n(w)} \Big[\log \sigma\big(-x_{w_i} \cdot \mathrm{ctx}(w_n)\big)\Big]$$

where k is the number of negative samples, P_n(w) is the noise distribution, and ctx(w_t) is a function that returns the output vector to be used for w_t. In regular SGNS, this function would simply return the output vector of the corresponding word, so that ctx(w_t) = x̃_{w_t}. In contrast, our approach replaces it with its counterpart in the target language if w_t is in the dictionary:

$$\mathrm{ctx}(w_t) = \begin{cases} \tilde{y}_{D(w_t)} & \text{if } w_t \in D \\ \tilde{x}_{w_t} & \text{otherwise} \end{cases}$$

During training, the replaced vectors {ỹ_i} are kept frozen, acting as anchor points so that the resulting embeddings {x_i} are aligned with their counterparts {y_i} in the target language.
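To make the anchoring concrete, the following is a minimal Python sketch of the generalized objective for a single training pair. It is illustrative only: names such as anchored_sgns_loss and y_out_frozen are assumptions and do not come from the released implementation, and the function returns the loss value rather than performing the actual gradient update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def anchored_sgns_loss(x_in, w_ctx, neg_samples, x_out, y_out_frozen, dictionary):
    """Generalized SGNS loss with cross-lingual anchoring (illustrative sketch).

    x_in          -- input vector of the centre word in the source language
    w_ctx         -- index of the observed context word
    neg_samples   -- indices of k words drawn from the noise distribution P_n
    x_out         -- trainable source-language output vectors (x-tilde)
    y_out_frozen  -- frozen target-language output vectors (y-tilde)
    dictionary    -- dict mapping source word index -> target word index (D)
    """
    def ctx(w):
        # Replace the output vector with its frozen target-language
        # counterpart whenever the word appears in the dictionary.
        return y_out_frozen[dictionary[w]] if w in dictionary else x_out[w]

    loss = -np.log(sigmoid(x_in @ ctx(w_ctx)))        # positive example
    for w_n in neg_samples:                           # negative samples
        loss -= np.log(sigmoid(-x_in @ ctx(w_n)))
    return loss
```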

Self-learning
As shown later in our experiments, the performance of our basic method is largely dependent on the quality of the bilingual dictionary itself. However, this is no different from conventional mapping methods, which also rely on a bilingual dictionary to align separately trained embeddings in different languages. So as to overcome this issue, modern mapping approaches rely on self-learning, which alternates between aligning the embeddings and re-inducing the dictionary in an iterative fashion (Artetxe et al., 2017).
We adopt a similar strategy, and re-induce the dictionary D a total of K times during training, where K is a hyperparameter. To that end, we first obtain the translations for each source word using CSLS retrieval (Conneau et al., 2018a):

$$D(i) = \operatorname*{argmax}_{j} \; \mathrm{CSLS}(x_i, y_j)$$

Having done that, we discard all entries that do not satisfy the following cyclic consistency condition:

$$i = \operatorname*{argmax}_{i'} \; \mathrm{CSLS}(x_{i'}, y_{D(i)})$$
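The following is a minimal numpy sketch of this re-induction step. It assumes the two embedding matrices fit in memory and uses 10 nearest neighbours for the CSLS penalty term; the neighbourhood size and the function name csls_induce are illustrative assumptions rather than details taken from the implementation.

```python
import numpy as np

def csls_induce(X, Y, knn=10):
    """Induce a dictionary with CSLS retrieval and keep only the entries that
    satisfy cyclic (mutual) consistency. Illustrative sketch."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sims = X @ Y.T                                               # cosine similarities
    # Average similarity to the knn nearest cross-lingual neighbours
    r_src = np.mean(np.sort(sims, axis=1)[:, -knn:], axis=1)     # r(x_i)
    r_trg = np.mean(np.sort(sims, axis=0)[-knn:, :], axis=0)     # r(y_j)
    csls = 2 * sims - r_src[:, None] - r_trg[None, :]
    fwd = csls.argmax(axis=1)   # best target word for each source word
    bwd = csls.argmax(axis=0)   # best source word for each target word
    # Cyclic consistency: keep (i, j) only if j's best source word is i
    return {i: j for i, j in enumerate(fwd) if bwd[j] == i}
```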

Iterative restarts
While self-learning is able to improve a weak initial dictionary throughout training, the method is still susceptible to poor local optima. This can be further exacerbated by the learning rate decay commonly used with SGNS, which makes it increasingly difficult to recover from a poor solution as training progresses. So as to overcome this issue, we sequentially run the entire SGNS training R times, where R is a hyperparameter of the method. We use the output from the previous run as the initial dictionary, but all the other parameters are reset and the full training process is run from scratch.
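Putting the three components together, the overall procedure (cf. Algorithm 1) can be sketched roughly as follows. This is a schematic reconstruction under stated assumptions: init_embeddings and run_anchored_sgns are hypothetical stand-ins for the actual SGNS training code, and the dictionary is assumed to be re-induced from the input vectors of both languages using the csls_induce sketch above.

```python
def train_with_restarts(corpus_src, Y_in, Y_out_frozen, seed_dict, R=3, K=50):
    """Schematic outer loop: R full training runs (iterative restarts), each
    re-inducing the dictionary K times (self-learning). Only the induced
    dictionary is carried over between restarts; the source embeddings and
    the learning rate schedule are reset every time."""
    dictionary = seed_dict
    for restart in range(R):
        X_in, X_out = init_embeddings(corpus_src)        # fresh source vectors
        for step in range(K):
            # Train on the next 1/K fraction of the corpus, with the current
            # dictionary providing frozen target-language anchors.
            X_in, X_out = run_anchored_sgns(corpus_src, X_in, X_out,
                                            Y_out_frozen, dictionary,
                                            fraction=1.0 / K)
            # Self-learning: re-induce the dictionary with CSLS retrieval
            # and the cyclic consistency filter.
            dictionary = csls_induce(X_in, Y_in)
    return X_in, dictionary
```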

Experimental setup
We next describe the systems explored in our experiments ( §4.1), the data and procedure used to train them ( §4.2), and the evaluation tasks ( §4.3).

Systems
We compare three model families in our experiments:

Offline mapping. This approach learns monolingual embeddings in each of the languages separately, which are then mapped into a common space through a linear transformation. We experiment with three popular methods from the literature: MUSE (Conneau et al., 2018a), ICP (Hoshen and Wolf, 2018) and VecMap (Artetxe et al., 2018). We use the original implementation of each method in their unsupervised mode with default hyperparameters.
Joint learning + offline mapping. This approach jointly learns word embeddings for two languages over their concatenated monolingual corpora, where identical words act as anchor points. Having done that, the vocabulary is partitioned into one shared and two language-specific subsets, which are further aligned through an offline mapping method. We use the joint align implementation from the authors with default hyperparameters, which relies on fastText for the joint learning step and MUSE for the mapping step (the original implementation only supports the supervised mode with RCSLS mapping, so we modified it to use MUSE in the unsupervised setting as described in the original paper).

Cross-lingual anchoring. Our proposed method, described in Section 3. We explore three alternatives to obtain the initial dictionary: (i) identical words, where D(i) = j if the ith source word and the jth target word are identically spelled, (ii) numerals, a subset of the former where identical words are further restricted to be sequences of digits, and (iii) unsupervised mapping, where we use the baseline VecMap system described above to induce the initial dictionary (using CSLS retrieval and the cyclic consistency restriction described in §3.2). The first two variants make assumptions on the writing system of the different languages, which is usually regarded as a weak form of supervision (Artetxe et al., 2017; Søgaard et al., 2018), whereas the latter is strictly unsupervised, yet dependent on an additional system from a different family.
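As a simple illustration of the first two initialization heuristics, the sketch below builds a seed dictionary from identically spelled words, optionally restricted to digit sequences. The function name and vocabulary handling are assumptions for illustration, not a description of the actual implementation.

```python
def identical_seed_dictionary(src_vocab, trg_vocab, numerals_only=False):
    """Build a weak seed dictionary from identically spelled words, or only
    from digit sequences when numerals_only is set. Vocabularies are assumed
    to be lists of word strings indexed by their embedding row."""
    trg_index = {w: j for j, w in enumerate(trg_vocab)}
    seed = {}
    for i, w in enumerate(src_vocab):
        if numerals_only and not w.isdigit():
            continue
        if w in trg_index:
            seed[i] = trg_index[w]   # D(i) = j for identically spelled words
    return seed
```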

Data and training details
We learn CLWEs between English and six other languages: German, Spanish, French, Finnish, Russian and Chinese. Following common practice, we use Wikipedia as our training corpus (extracted from the February 2019 dump using the WikiExtractor tool), which we preprocessed using standard Moses scripts, and restrict our vocabulary to the most frequent 200K tokens per language. In the case of Chinese, word segmentation was done using the Stanford Segmenter. Table 1 summarizes the statistics of the resulting corpora, while Table 2 reports the sizes of the initial dictionaries derived from these corpora for our proposed method.
For joint align, we directly run the official implementation over our tokenized corpus as described above. All the other systems take monolingual embeddings as input, which we learn using the SGNS implementation in word2vec, with 10 negative samples, a sub-sampling threshold of 1e-5, 300 dimensions, and 10 epochs (note that joint align also learns 300-dimensional vectors, but runs fastText with default hyperparameters under the hood). For our proposed method, we set English as the target language, fix the corresponding monolingual embeddings, and learn aligned embeddings in the source language using our extension of SGNS (§3). (In our preliminary experiments, we observed our proposed method to be quite sensitive to which language is the source and which one is the target. We find the language with the largest corpus to perform best as the target, presumably because the corresponding monolingual embeddings are better estimated, so it is more appropriate to fix them and learn aligned embeddings for the other language. Following this observation, we set English as the target language for all pairs, as it is the language with the largest corpus.) We set the number of restarts R to 3, the number of re-inductions per restart K to 50, and the number of epochs to 10 · #trg_sents / #src_sents, which ensures that the source language gets a similar number of updates to the 10 epochs done for English (for a fair comparison, we also tried using the same number of epochs for the baseline systems, but this performed worse than the reported numbers with 10 epochs). For all the other hyperparameters, we use the same values as for the monolingual embeddings. We made all of our development decisions based on preliminary experiments on English-Finnish, without any systematic hyperparameter exploration. Our implementation runs on CPU, except for the dictionary re-induction steps, which run on a single GPU for around one hour in total.

Evaluation tasks
As described next, we evaluate our method on two tasks: Bilingual Lexicon Induction (BLI) and Cross-lingual Natural Language Inference (XNLI).
BLI. Following common practice, we induce a bilingual dictionary through CSLS retrieval (Conneau et al., 2018a) for each set of cross-lingual embeddings, and evaluate the precision at 1 (P@1) with respect to the gold standard test dictionary from the MUSE dataset (Conneau et al., 2018a). For the few out-of-vocabulary source words, we revert to copying as a back-off strategy, so our reported numbers are directly comparable to prior work in terms of coverage.
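For reference, a minimal sketch of this evaluation protocol could look as follows. It implements CSLS retrieval with precision at 1; the gold dictionary format is an assumption, and the copying back-off for out-of-vocabulary words mentioned above is omitted.

```python
import numpy as np

def bli_precision_at_1(X_src, Y_trg, gold, knn=10):
    """P@1 for bilingual lexicon induction with CSLS retrieval (sketch).

    gold maps each evaluated source word index to the set of acceptable
    target word indices from the test dictionary."""
    X = X_src / np.linalg.norm(X_src, axis=1, keepdims=True)
    Y = Y_trg / np.linalg.norm(Y_trg, axis=1, keepdims=True)
    sims = X @ Y.T
    r_src = np.mean(np.sort(sims, axis=1)[:, -knn:], axis=1)
    r_trg = np.mean(np.sort(sims, axis=0)[-knn:, :], axis=0)
    csls = 2 * sims - r_src[:, None] - r_trg[None, :]
    hits = sum(1 for i, refs in gold.items() if csls[i].argmax() in refs)
    return hits / len(gold)
```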
XNLI. We train an English natural language inference model on MultiNLI (Williams et al., 2018), and evaluate the zero-shot cross-lingual transfer performance on the XNLI test set (Conneau et al., 2018b) for the subset of our languages covered by it.
To that end, we follow the evaluation setup of Glavaš et al. (2019).

Results
We next discuss our main results on BLI (§5.1) and XNLI (§5.2), followed by our ablation study (§5.3) and error analysis (§5.4) on BLI.

BLI
Table 3 summarizes our main BLI results. We observe that our method obtains the best results in all directions (matched by VecMap in Russian-English), outperforming the strongest baseline by 2.4 points on average for the mapping-based initialization. Our improvements are more pronounced in the backward direction (3.1 points on average) but still substantial in the forward direction (1.7 points on average). It is worth noting that some systems fail to converge to a good solution for the most challenging language pairs. This includes our proposed method in the case of Chinese-English when using the numeral-based initialization, which we attribute to the smaller size of the initial dictionary (only 244 entries, see Table 2). Other than that, we observe that our approach obtains very similar results regardless of the initial dictionary. Quite remarkably, the variant using VecMap for initialization (mapping init) is substantially stronger than VecMap itself despite not using any additional training signal. So as to put our results into perspective, Table 4 compares them to previous numbers reported in the literature. Note that the numbers are comparable in terms of coverage and all systems use Wikipedia as the training corpus, although they might differ on the specific dump used and the preprocessing details (in particular, most mapping methods use the official Wikipedia embeddings from fastText; unfortunately, the preprocessed corpus used to train these embeddings is not public, so works that explore other approaches, like ours, need to use their own preprocessed copy of Wikipedia). As can be seen, our approach obtains the best results by a substantial margin. Even stronger results have been reported using unsupervised machine translation instead of direct retrieval with CLWEs; note, however, that such methods still rely on cross-lingual embeddings to build the underlying phrase table, so our improvements should be largely orthogonal to theirs.

XNLI
We report our XNLI results in Table 5. We observe that our method is competitive with the baseline mapping systems, achieving the best results on 3 out of the 5 transfer languages by a small margin. Nevertheless, it significantly lags behind MUSE on Chinese, even if the exact same set of cross-lingual embeddings performed better than MUSE at BLI. While striking, similar discrepancies between BLI and XNLI performance were also observed in previous studies (Glavaš et al., 2019). Finally, we observe that the initial dictionary has a negligible impact on the performance of our proposed method, which supports the idea that our approach converges to a similar solution given any reasonable initialization.

Basic method (identical init)    53.9
  + self-learning                66.9
  + iterative restarts           67.3
Basic method (numeral init)       2.6
  + self-learning                53.9
  + iterative restarts           61.0
Basic method (mapping init)      67.5
  + self-learning                67.5
  + iterative restarts           67.5

Table 6: Ablation results on BLI (average P@1)

Ablation study
So as to understand the role of self-learning and the iterative restarts in our approach, we perform an ablation study and report our results in Table 6. We observe that the contribution of these components is greatly dependent on the initial dictionary. For the numeral initialization, the basic method works poorly, and both extensions bring large improvements. In contrast, the identical initialization does not benefit from iterative restarts, but self-learning still plays a major role. In the case of the mapping-based initialization, the basic method is already very competitive. This suggests that both the self-learning and the iterative restarts are helpful to make the method more robust to a weak initialization, and have a minor impact otherwise.
In order to better understand the underlying learning dynamics, we analyze the learning curves for Finnish-English in Figure 1. We observe that, when the initial dictionary is strong, our method surpasses the baseline and stabilizes early. In contrast, convergence is much slower when using the weak numeral-based initialization, and the iterative restarts are critical to escape poor local optima.

Error analysis
So as to better understand where our improvements in BLI are coming from, we perform an error analysis on the Spanish-English direction. To that end, we manually inspect the 69 instances for which our method (with mapping-based initialization) produced a correct translation while VecMap failed according to the gold standard, as well as the 26 instances for which the opposite was true. We then categorize these errors into several types, which are summarized in Table 7.
We observe that, in 52.6% of the 95 analyzed instances, the translation produced by our method is identical to the source word, while this percentage goes down to 4.2% for VecMap. This tendency of our approach to copy its input is striking, as the model has no notion of the words being identically spelled. A large portion of these cases correspond to named entities, where copying is the right behavior, while VecMap outputs a different proper noun. There are also some instances where the input word is in the target language, which can be considered an artifact of the dataset, but copying also seems the most reasonable behavior in these cases. Finally, there are also a few cases where the input word is present in the target vocabulary, which is selected by our method and counted as an error. Once again, we consider these to be an artifact of the dataset, as copying seems a reasonable choice if the input word is considered to be part of the target language vocabulary. The remaining cases where neither method copies mostly correspond to common errors, where one of the systems (most often VecMap) outputs a semantically related but incorrect translation. However, there are also a few instances where both translations are correct, but one of them is missing in the gold standard.
With the aim of understanding the impact of identical words in our original results, we re-evaluated the systems using a filtered version of the MUSE gold standard dictionaries, where we removed all source words that were included in the set of candidate translations. In order to be fair, we filtered out identical words from the output of the system, reverting to the second highest-ranked translation whenever the first one is identical to the source word. The results for the strongest system in each family are shown in Table 8. Even if the margin of improvement is reduced compared to the original evaluation, our method still outperforms the strongest baseline by an average of 1.1 points. It is also worth noting that joint align, which shares a portion of the vocabulary for both languages (and will thus translate all words in the shared vocabulary identically), suffers a large drop in performance. This highlights the importance of accompanying quantitative BLI evaluation with an error analysis as urged by previous studies (Kementchedjhieva et al., 2019).

Conclusions and future work
Our approach for learning CLWEs addresses the main limitations of both offline mapping and joint learning methods. Different from mapping approaches, it does not suffer from structural mismatches arising from independently training embeddings in different languages, as it works by constraining the learning of the source embeddings so they are aligned with the target ones. At the same time, unlike previous joint methods, our system can work without any parallel resources, relying on numerals, identical words or an existing mapping method for the initialization. We achieve this by combining cross-lingual anchoring with self-learning and iterative restarts. While recent research on CLWEs has been dominated by mapping approaches, our work shows that the fundamental techniques that popularized these methods (e.g., the use of self-learning to relax the need for cross-lingual supervision) can also be effective beyond this paradigm.
Despite its simplicity, our experiments on BLI show the superiority of our method when compared to previous mapping systems. We complement these results with additional experiments on a downstream task, where our method obtains competitive results, as well as an ablation study and a systematic error analysis. We identify a striking tendency of our method to translate words identically, even if it has no notion of the words being identically spelled. Thanks to this, our method is particularly strong at translating named entities, but we show that our improvements are not limited to this phenomenon. These insights confirm the value of accompanying quantitative results on BLI with qualitative evaluation (Kementchedjhieva et al., 2019) and/or other tasks (Glavaš et al., 2019).
In the future, we would like to further explore CLWE methods that go beyond the currently dominant mapping paradigm. In particular, we would like to remove the requirement of a seed dictionary altogether by using adversarial learning, and explore more elaborate context translation and dictionary re-induction schemes.