Improving Word Translation via Two-Stage Contrastive Learning

Word translation or bilingual lexicon induction (BLI) is a key cross-lingual task, aiming to bridge the lexical gap between different languages. In this work, we propose a robust and effective two-stage contrastive learning framework for the BLI task. In Stage C1, we refine standard cross-lingual linear maps between static word embeddings (WEs) via a contrastive learning objective; we also show how to integrate contrastive learning into the self-learning procedure for even more refined cross-lingual maps. In Stage C2, we conduct BLI-oriented contrastive fine-tuning of mBERT, unlocking its word translation capability; we also show that static WEs induced from the 'C2-tuned' mBERT complement the static WEs from Stage C1. Comprehensive experiments on standard BLI datasets, spanning diverse languages and experimental setups, demonstrate substantial gains achieved by our framework. While the BLI method from Stage C1 already surpasses all state-of-the-art BLI methods in our comparison, even stronger improvements are achieved with the full two-stage framework: e.g., we report gains for 112/112 BLI setups, spanning 28 language pairs.


Introduction and Motivation
Bilingual lexicon induction (BLI) or word translation is one of the seminal and long-standing tasks in multilingual NLP (Rapp, 1995; Gaussier et al., 2004; Heyman et al., 2017; Shi et al., 2021, inter alia). Its main goal is learning translation correspondences across languages, with applications of BLI ranging from language learning and acquisition (Yuan et al., 2020; Akyurek and Andreas, 2021) to machine translation (Qi et al., 2018; Duan et al., 2020; Chronopoulou et al., 2021) and the development of language technology in low-resource languages and domains (Irvine and Callison-Burch, 2017; Heyman et al., 2018). A large body of recent BLI work has focused on so-called mapping-based methods (Mikolov et al., 2013; Artetxe et al., 2018; Ruder et al., 2019).¹ Such methods are particularly suitable for low-resource languages and weakly supervised learning setups: they support BLI with as few as a few thousand word translation pairs (e.g., 1k or at most 5k) as the only bilingual supervision (Ruder et al., 2019).² Unlike for many other tasks in multilingual NLP (Doddapaneni et al., 2021; Chau and Smith, 2021; Ansell et al., 2021), state-of-the-art (SotA) BLI results are still achieved via static word embeddings (WEs) (Vulić et al., 2020b; Liu et al., 2021b). A typical modus operandi of mapping-based approaches is to first train monolingual WEs independently on monolingual corpora and then map them to a shared cross-lingual space via linear (Mikolov et al., 2013; Glavaš et al., 2019) or non-linear mapping functions (Mohiuddin et al., 2020). In order to achieve even better results, many BLI methods also apply a self-learning loop where training dictionaries are iteratively (and gradually) refined, and improved mappings are then learned in each iteration (Artetxe et al., 2018; Karan et al., 2020). However, there is still ample room for improvement, especially for lower-resource languages and dissimilar language pairs (Vulić et al., 2019; Nasution et al., 2021).

¹ They are also referred to as projection-based or alignment-based methods (Glavaš et al., 2019; Ruder et al., 2019).

² In the extreme, fully unsupervised mapping-based BLI methods can leverage monolingual data only, without any bilingual supervision (Lample et al., 2018; Artetxe et al., 2018; Hoshen and Wolf, 2018; Mohiuddin and Joty, 2019; Ren et al., 2020, inter alia). However, comparative empirical analyses (Vulić et al., 2019) show that, with all other components equal, using seed sets of only 500-1,000 translation pairs always outperforms fully unsupervised BLI methods. Therefore, in this work we focus on the more pragmatic (weakly) supervised BLI setup (Artetxe et al., 2020); we assume the existence of at least 1,000 seed translations per language pair.
On the other hand, another line of recent research has demonstrated that a wealth of lexical semantic information is encoded in large multilingual pretrained language models (LMs) such as mBERT (Devlin et al., 2019). However, 1) it is not straightforward to transform these LMs into multilingual lexical encoders (Liu et al., 2021b); 2) it is not trivial to extract word-level information from them (Vulić et al., 2020b, 2021); and 3) word representations extracted from these LMs still cannot surpass static WEs in the BLI task (Vulić et al., 2020b; Zhang et al., 2021). Motivated by these insights, in this work we investigate the following research questions: (RQ1) Can we further improve (weakly supervised) mapping-based BLI methods based on static WEs? (RQ2) How can we extract more useful cross-lingual word representations from pretrained multilingual LMs such as mBERT or mT5? (RQ3) Is it possible to boost BLI by combining cross-lingual representations based on static WEs with those extracted from multilingual LMs?

Inspired by the wide success of contrastive learning techniques in sentence-level representation learning (Reimers and Gurevych, 2019; Carlsson et al., 2021; Gao et al., 2021), we propose a two-stage contrastive learning framework for effective word translation in (weakly) supervised setups; it leverages and combines multilingual knowledge from static WEs and pretrained multilingual LMs. Stage C1 operates solely on static WEs: in short, it is a mapping-based approach with self-learning, where in each step we additionally fine-tune linear maps with a contrastive objective that operates on gradually refined positive examples (i.e., true translation pairs) and hard negative samples. Stage C2 fine-tunes a pretrained multilingual LM (e.g., mBERT), again with a contrastive learning objective, using positive examples as well as negative examples extracted from the output of C1.
Finally, we extract word representations from the multilingual LM fine-tuned in Stage C2, and combine them with static cross-lingual WEs from Stage C1; the combined representations are then used for BLI.
We run a comprehensive set of BLI experiments on the standard BLI benchmark (Glavaš et al., 2019), comprising 8 diverse languages, in several setups. Our results indicate large gains over state-of-the-art BLI models: e.g., ≈+8 Precision@1 points on average, +10 points for many language pairs, gains in 107/112 BLI setups already after Stage C1 (cf. RQ1), and in all 112/112 BLI setups after Stage C2 (cf. RQ2 and RQ3). Moreover, our findings also extend to BLI for lower-resource languages from another BLI benchmark (Vulić et al., 2019). Finally, as hinted in recent work (Zhang et al., 2021), our findings validate that multilingual lexical knowledge in LMs, when exposed and extracted as in our contrastive learning framework, can complement the knowledge in static cross-lingual WEs (RQ3) and benefit BLI. We release the code and share the data at: https://github.com/cambridgeltl/ContrastiveBLI.

Methodology
Preliminaries and Task Formulation. In BLI, we assume two vocabularies X = {w^x_1, ..., w^x_|X|} and Y = {w^y_1, ..., w^y_|Y|} associated with two respective languages L_x and L_y. We also assume that each vocabulary word is assigned its (static) type-level word embedding (WE); that is, the respective WE matrices for the two vocabularies are X ∈ R^{|X|×d} and Y ∈ R^{|Y|×d}. Each WE is a d-dimensional row vector, with typical values d=300 for static WEs such as fastText (Bojanowski et al., 2017) and d=768 for mBERT.³ We also assume a set of seed translation pairs D_0 = {(w^x_{m_1}, w^y_{n_1}), ..., (w^x_{m_|D_0|}, w^y_{n_|D_0|})} for training (Mikolov et al., 2013; Glavaš et al., 2019), where 1 ≤ m_i ≤ |X| and 1 ≤ n_i ≤ |Y|. Typical values for the seed dictionary size |D_0| are 5k and 1k pairs (Vulić et al., 2019), often referred to as the supervised (5k) and semi-supervised or weakly supervised (1k) settings (Artetxe et al., 2018). Given another test lexicon of source words paired with their gold translations, the goal is to retrieve, for each source word, its correct translation from L_y's vocabulary Y, and evaluate it against the gold L_y translation w^y_{g_i} from the pair.

Method in a Nutshell. We propose a novel two-stage contrastive learning (CL) method, with both stages C1 and C2 realised via contrastive learning objectives (see Figure 1). Stage C1 (§2.1) operates solely on static WEs, and can be seen as a contrastive extension of mapping-based BLI approaches with static WEs. In practice, we blend contrastive learning with the standard SotA mapping-based framework with self-learning, VecMap (Artetxe et al., 2018), with some modifications. Since Stage C1 operates solely on static WEs in exactly the same BLI setup as prior work, it can be evaluated independently. In Stage C2 (§2.2), we propose to leverage pretrained multilingual LMs for BLI: we contrastively fine-tune them for BLI and extract static 'decontextualised' WEs from the tuned LMs. These LM-based WEs can be combined with the WEs obtained in Stage C1 (§2.3).
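The evaluation protocol described above (nearest-neighbour retrieval in a shared cross-lingual space, scored as Precision@1) can be sketched in a few lines. This is a minimal illustration under our own naming (`precision_at_1` is not from the paper), using plain cosine retrieval rather than CSLS:

```python
import numpy as np

def precision_at_1(src_emb, tgt_emb, gold):
    """src_emb: (n_test, d) mapped source WEs; tgt_emb: (|Y|, d) mapped
    target WEs; gold: gold target index for each test word."""
    # l2-normalise so that the dot product equals cosine similarity
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    pred = (src @ tgt.T).argmax(axis=1)   # nearest neighbour per query
    return float((pred == np.asarray(gold)).mean())
```

CSLS-based retrieval (see Appendix A.2) replaces the plain similarity matrix with locally scaled scores, but the argmax-and-compare structure stays the same.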

Stage C1
Stage C1 is based on the VecMap framework (Artetxe et al., 2018), which features 1) dual linear mapping, where two separate linear transformation matrices map the respective source and target WEs to a shared cross-lingual space; and 2) a self-learning procedure that, in each iteration i, refines the training dictionary and iteratively improves the mapping. We extend and refine VecMap's self-learning for supervised and semi-supervised settings via CL.
Initial Advanced Mapping. After ℓ2-normalising the word embeddings,⁴ the two mapping matrices, denoted W_x for the source language L_x and W_y for L_y, are computed via the Advanced Mapping (AM) procedure based on the training dictionary, as fully described in Appendix A.1. While VecMap leverages whitening, orthogonal mapping, re-weighting and de-whitening operations to derive mapped WEs, we compute W_x and W_y such that a one-off matrix multiplication produces the same result (see Appendix A.1 for the details).
Contrastive Fine-Tuning. At each iteration i, after the initial AM step, the two mapping matrices W_x and W_y are further contrastively fine-tuned via the InfoNCE loss (Oord et al., 2018), a standard and robust choice of loss function in CL research (Musgrave et al., 2020; Liu et al., 2021c,b). The core idea is to 'attract' aligned WEs of positive examples (i.e., true translation pairs) coming from the dictionary D_{i−1}, and 'repel' hard negative samples, that is, words which are semantically similar but do not constitute a word translation pair.

Algorithm 1 Stage C1: Self-Learning
1: Require: X, Y, D_0, D_add = ∅
2: for i = 1:N_iter do
3:   W_x, W_y ← Initial AM using D_{i−1}
4:   D_CL ← D_0 (supervised) or D_{i−1} (semi-supervised)
5:   for j = 1:N_CL do
6:     Retrieve D̂ for the pairs from D_CL
7:     W_x, W_y ← Optimise the contrastive loss
8:   Compute new D_add
9:   Update D_i = D_0 ∪ D_add
10: return W_x, W_y
These hard negative samples are extracted as follows. Suppose (w^x_{m_i}, w^y_{n_i}) is a translation pair in the current dictionary D_{i−1}, with its constituent words associated with static WEs x_{m_i}, y_{n_i} ∈ R^{1×d}. We then retrieve the nearest neighbours of y_{n_i}W_y from XW_x and derive ŵ^x_{m_i} ⊂ X (w^x_{m_i} excluded), a set of hard negative samples of size N_neg. In a similar (symmetric) manner, we also derive the set of negatives ŵ^y_{n_i} ⊂ Y (w^y_{n_i} excluded). We use D̂ to denote the collection of all hard negative sets over all training pairs in the current iteration i. We then fine-tune W_x and W_y by optimising the InfoNCE contrastive objective, where τ denotes a standard temperature parameter. The objective, formulated for a single positive example, spans all positive examples from the current dictionary, along with the respective sets of negative examples computed as described above.
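A hedged PyTorch sketch of this step (function names and batching are ours; the paper's exact formulation may differ in details such as similarity scaling): for each positive pair we mine the N_neg nearest non-translation words in the mapped space and score them against the positive with InfoNCE:

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(src_mapped, tgt_mapped, pairs, n_neg):
    """pairs: (B, 2) tensor of (source idx, target idx). For each pair,
    return the indices of the n_neg source words closest to the mapped
    target WE, excluding the true translation (cf. Stage C1)."""
    sims = tgt_mapped[pairs[:, 1]] @ src_mapped.t()        # (B, |X|)
    sims[torch.arange(pairs.size(0)), pairs[:, 0]] = -1e9  # mask the positive
    return sims.topk(n_neg, dim=1).indices                 # (B, n_neg)

def info_nce(anchor, positive, negatives, tau=1.0):
    """InfoNCE over one row per anchor: logit 0 is the positive pair."""
    pos = F.cosine_similarity(anchor, positive) / tau                       # (B,)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=2) / tau  # (B, n_neg)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positive is index 0
    return F.cross_entropy(logits, labels)
```

In Stage C1, the anchors and positives would be the mapped WEs x_{m_i}W_x and y_{n_i}W_y of a training pair, with gradients flowing only into the two linear maps.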
Self-Learning. The application of (a) the initial mapping via AM and (b) contrastive fine-tuning can be repeated iteratively. Such self-learning loops typically yield more robust and better-performing BLI methods (Artetxe et al., 2018; Vulić et al., 2019). At each iteration i, a set of automatically extracted high-confidence translation pairs D_add is added to the seed dictionary D_0, and the dictionary D_i = D_0 ∪ D_add is then used in iteration i + 1.
Our dictionary augmentation method slightly deviates from the one used by VecMap. We leverage the N_freq most frequent source and target vocabulary words, and conduct forward and backward dictionary induction (Artetxe et al., 2018). Unlike VecMap, we do not add stochasticity to the process, and simply select the top N_aug high-confidence word pairs from forward (i.e., source-to-target) induction and another N_aug pairs from the backward induction. In practice, we retrieve the 2×N_aug pairs with the highest Cross-domain Similarity Local Scaling (CSLS) scores (Lample et al., 2018),⁵ remove duplicate pairs and those that contradict the ground truth in D_0, and then add the rest to D_add.
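A simplified sketch of this augmentation step (our own code; plain dot-product similarity over ℓ2-normalised WEs stands in for CSLS, and the 'contradiction' filter is approximated by excluding any candidate that reuses a source or target word from the seed dictionary):

```python
import numpy as np

def augment_dictionary(src_mapped, tgt_mapped, d0, n_aug):
    """Forward (source->target) and backward induction over mapped,
    l2-normalised WEs; keep the n_aug highest-confidence pairs per
    direction, drop duplicates and pairs clashing with the seed
    dictionary d0 (a set of (i, j) index tuples)."""
    sims = src_mapped @ tgt_mapped.T
    fwd = [(i, int(sims[i].argmax()), sims[i].max()) for i in range(sims.shape[0])]
    bwd = [(int(sims[:, j].argmax()), j, sims[:, j].max()) for j in range(sims.shape[1])]
    top = sorted(fwd, key=lambda t: -t[2])[:n_aug] + sorted(bwd, key=lambda t: -t[2])[:n_aug]
    seen_src = {i for i, _ in d0}
    seen_tgt = {j for _, j in d0}
    d_add = set()
    for i, j, _ in top:
        if (i, j) in d0:                    # already a seed pair
            continue
        if i in seen_src or j in seen_tgt:  # clashes with the ground truth
            continue
        d_add.add((i, j))
    return d_add
```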
For the initial AM step, we always use the augmented dictionary D_0 ∪ D_add; the same augmented dictionary is used for contrastive fine-tuning in weakly supervised setups.⁶ We repeat the self-learning loop N_iter times: in each iteration, we optimise the contrastive loss N_CL times; that is, we go N_CL times over all the positive pairs from the training dictionary at that iteration. N_iter and N_CL are tunable hyper-parameters. Self-learning in Stage C1 is summarised in Algorithm 1.

Stage C2
Previous work has tried to prompt off-the-shelf multilingual LMs for word translation knowledge via masked natural language templates (Gonen et al., 2020), by averaging over their contextual encodings in a large corpus (Vulić et al., 2020b; Zhang et al., 2021), or by extracting type-level WEs from the LMs directly, without context (Vulić et al., 2020a, 2021). However, even sophisticated templates and WE extraction strategies still typically result in BLI performance inferior to fastText (Vulić et al., 2021).
(BLI-Oriented) Contrastive Fine-Tuning. Here, we propose to fine-tune off-the-shelf multilingual LMs relying on the supervised BLI signal: the aim is to expose type-level word translation knowledge directly from the LM, without any external corpora. In practice, we first prepare a dictionary of positive examples for contrastive fine-tuning: (a) D_CL = D_0 when |D_0| spans 5k pairs; or (b) when |D_0|=1k, we add to D_0 the N_aug=4k automatically extracted highest-confidence pairs from Stage C1 (based on their CSLS scores, not present in D_0), so that D_CL spans 1k + 4k word pairs. We then extract N_neg hard negatives in the same way as in §2.1, relying on the shared cross-lingual space derived as the output of Stage C1. Our hypothesis is that the difficult task of discerning between true translation pairs and highly similar non-translations as hard negatives, formulated within a contrastive learning objective, will enable mBERT to expose its word translation knowledge, and complement the knowledge already available after Stage C1.

⁵ Further details on the CSLS similarity and its relationship to cosine similarity are available in Appendix A.2.

⁶ When starting with 5k pairs, we leverage only D_0 for contrastive fine-tuning, as D_add might deteriorate the quality of the 5k-pairs seed dictionary due to potentially noisy input.
Throughout this work, we assume the use of the pretrained mBERT base model with 12 Transformer layers and 768-dimensional embeddings. Each raw input word w is tokenised, via mBERT's dedicated tokeniser, into a sequence of the form [CLS] t_1 ... t_M [SEP], where t_1, ..., t_M are w's subword tokens. The sequence is then passed through mBERT as the encoder, whose encoding function we denote f_θ(·): it extracts the representation of the [CLS] token in the last Transformer layer as the representation of the input word w. The full set of mBERT's parameters θ then gets contrastively fine-tuned in Stage C2, again relying on the InfoNCE CL loss. The type-level WE for each input word w is then obtained simply as f_θ(w), where θ now refers to the parameters of the 'BLI-tuned' mBERT model.

Combining the Output of C1 and C2
In order to combine the output WEs from Stage C1 and the mBERT-based WEs from Stage C2, we also need to map them into a 'shared' space: in other words, for each word w, its C1 WE and its C2 WE can be seen as two different views of the same data point. We thus learn an additional linear orthogonal mapping from the C1-induced cross-lingual WE space into the C2-induced cross-lingual WE space. It transforms ℓ2-normed 300-dim C1-induced cross-lingual WEs into 768-dim cross-lingual WEs. Learning the linear map W ∈ R^{d_1×d_2}, where in our case d_1=300 and d_2=768, is formulated as a Generalised Procrustes problem (Schönemann, 1966; Viklands, 2006) operating on all (i.e., both L_x and L_y) words from the seed translation dictionary D_0.⁷ It is important to note that in this case we do not use word translation pairs (w^x_{m_i}, w^y_{n_i}) directly to learn the mapping; rather, each word w^x_{m_i} and w^y_{n_i} is duplicated to create training pairs (w^x_{m_i}, w^x_{m_i}) and (w^y_{n_i}, w^y_{n_i}), where the left word in each pair is assigned its WE from C1, and the right word is assigned its WE from C2.

⁷ Technical details of the learning procedure are described in Appendix A.3.
Unless noted otherwise, the final representation of an input word w is then a linear combination of (a) its C1-based vector v_w mapped to a 768-dim representation via W, and (b) its 768-dim encoding f_θ(w) from BLI-tuned mBERT, where the two terms are weighted by a tunable interpolation hyper-parameter λ.

Hyper-parameters. In Stage C1, when |D_0|=5k, the hyper-parameter values are N_iter=2, N_CL=200, N_neg=150, N_freq=60k, N_aug=10k; the SGD optimiser is used, with a learning rate of 1.5 and γ=0.99. When |D_0|=1k, the values are N_iter=3, N_CL=50, N_neg=60, N_freq=20k, and N_aug=6k; SGD with a learning rate of 2.0, γ=1.0. In both cases, τ=1.0, dropout is 0, and the batch size for contrastive learning is always equal to the size of the current dictionary |D_CL| (i.e., |D_0| in the 5k case, or |D_0 ∪ D_add|, which varies over iterations, in the 1k case; see §2.1). In Stage C2, N_neg=28 and the maximum sequence length is 6. We use AdamW (Loshchilov and Hutter, 2019) with a learning rate of 2e−5 and a weight decay of 0.01. We fine-tune mBERT for 5 epochs, with a batch size of 100; the dropout rate is 0.1 and τ=0.1. Unless noted otherwise, λ is fixed to 0.2.
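The combination step can be sketched as follows, assuming the C1→C2 map W has already been learned (Appendix A.3). We let λ weight the mBERT-based view here, which is one plausible reading of the later discussion of λ (§4); the paper's exact weighting and normalisation conventions may differ:

```python
import numpy as np

def l2(v):
    """l2-normalise a single vector."""
    return v / np.linalg.norm(v)

def combine(c1_vec, c2_vec, W, lam=0.2):
    """c1_vec: 300-dim C1 WE; c2_vec: 768-dim WE from BLI-tuned mBERT;
    W: (300, 768) orthogonal map from the C1 into the C2 space.
    lam weights the mBERT-based view (our convention, see lead-in)."""
    return (1.0 - lam) * l2(c1_vec @ W) + lam * l2(c2_vec)
```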
Baseline Models. Our BLI method is evaluated against four strong SotA BLI models from the recent literature, all with publicly available implementations. We provide brief summaries here. RCSLS (Joulin et al., 2018) optimises a relaxed CSLS loss and learns a non-orthogonal mapping; it has been established as a strong BLI model in empirical comparative analyses, as its objective function is directly 'BLI-oriented' (Glavaš et al., 2019). VecMap's core components (Artetxe et al., 2018) have been outlined in §2.1. LNMap (Mohiuddin et al., 2020) non-linearly maps the original static WEs into two latent semantic spaces learned via non-linear autoencoders, and then learns another non-linear mapping between the latent autoencoder-based spaces. FIPP (Sachidananda et al., 2021), in brief, first finds common (i.e., isomorphic) geometric structures in the monolingual WE spaces of both languages, and then aligns the Gram matrices of the WEs found in those common structures.
For all baselines, we have verified that the hyper-parameter values suggested in their respective repositories yield (near-)optimal BLI performance. Unless noted otherwise, we run VecMap, LNMap, and FIPP with their own self-learning procedures.

Model Variants. We denote the full two-stage BLI model as C2 (Mod), where Mod refers to the actual model/method used to derive the shared cross-lingual space used by Stage C2. For instance, C2 (C1) refers to the model variant which relies on our Stage C1, while C2 (RCSLS) relies on RCSLS as the base method. We also evaluate the BLI performance of our Stage C1 method alone.
Multilingual LMs. We adopt mBERT as the default pretrained multilingual LM in Stage C2. Our supplementary experiments also cover the 1280-dim XLM model¹³ (Lample and Conneau, 2019) and the 512-dim mT5-small model (Xue et al., 2021).¹⁴ For clarity, we use C2 [LM] to denote C2 (C1) obtained with different LMs; when [LM] is not specified, mBERT is used. We adopt a smaller batch size of 50 for C2 [XLM] owing to GPU memory limits, and train C2 [mT5] with a larger learning rate of 6e−4 for 6 epochs, since we found it much harder to train than C2 [mBERT].

Results and Discussion
The main results are provided in Table 1, while the full results for each individual language pair, along with results using cosine similarity as the word retrieval function, are provided in Appendix E. The main findings are discussed in what follows.
Stage C1 versus Baselines. First, we note that there is no single strongest baseline among the four SotA BLI methods. For instance, RCSLS and VecMap are slightly better than LNMap and FIPP with 5k supervision pairs, while FIPP and VecMap come forth as the stronger baselines with 1k supervision. There are some score fluctuations over individual language pairs, but the average performance of all baseline models lies within a relatively narrow interval: the averages of all four baselines are within 3 P@1 points with 5k pairs (i.e., ranging from 38.22 to 41.22), and VecMap, FIPP, and LNMap are within 2 points with 1k pairs. Strikingly, contrastive learning in Stage C1 already yields substantial gains over all four SotA BLI models, typically much larger than the detected variation between the baselines. C1 improves over all baselines in 51/56 BLI setups in the 5k case, and in all 56/56 BLI setups when D_0 spans 1k pairs. The average gains with the C1 variant are ≈5 P@1 points over the SotA baselines with 5k pairs, and ≈6 P@1 points with 1k pairs (ignoring RCSLS in the 1k scenario). Note that all the models in comparison, each currently considered SotA in the BLI task, use exactly the same monolingual WEs and leverage exactly the same amount of bilingual supervision. The gains achieved with our Stage C1 thus strongly indicate the potential and usefulness of word-level contrastive fine-tuning when learning linear cross-lingual maps with static WEs (see RQ1 from §1).

The full two-stage C2 (C1) model brings further average gains over C1 alone in both the 5k and 1k setups (3 P@1 points in the latter), and we observe gains for all language pairs in both translation directions, rendering Stage C2 universally useful. These gains indicate that mBERT does contain word translation knowledge in its parameters. However, the model must be fine-tuned (i.e., transformed) to 'unlock' this knowledge: this is done through the BLI-guided contrastive fine-tuning procedure (see §2.2). Our findings thus further confirm the 'rewiring hypothesis' from prior work (Vulić et al., 2021; Liu et al., 2021b; Gao et al., 2021), here validated for the BLI task (see RQ2 from §1): task-relevant knowledge at the sentence and word level can be 'rewired'/exposed from off-the-shelf LMs, even with very limited task supervision, e.g., with only 1k or 5k word translation pairs as in our experiments.

¹³ We pick the XLM large model pretrained on 100 languages with the masked language modeling (MLM) objective.

¹⁴ We also tested XLM-R base, but in our preliminary experiments it showed inferior BLI performance.

[Table 1 notes: L→* and *→L denote the average BLI scores of setups where L is the source and the target language, respectively. The word similarity measure is CSLS (see §3). Underlined scores are the peak scores among methods relying solely on static fastText WEs; bold scores denote the highest scores overall (i.e., when the use of word translation knowledge exposed from mBERT is allowed). ⁺ RCSLS is always used without self-learning. ˣ We report VecMap with self-learning in the 1k-pairs scenario, and its variant without self-learning when using 5k pairs of supervision, as it performs better than the variant with self-learning.]
Performance over Languages. The absolute BLI scores naturally depend on the actual source and target languages: e.g., the lowest absolute performance is observed for morphologically rich (HR, RU, FI, TR) and non-Indo-European (FI, TR) languages. However, both the C1 and C2 (C1) model variants offer wide and substantial performance gains for all language pairs, irrespective of the starting absolute score. This further suggests the wide applicability and robustness of our BLI method.

Further Discussion
Evaluation on Lower-Resource Languages. The results on lower-resource languages from the PanLex-BLI dataset are presented in Table 2. Overall, they further confirm the efficacy of C2 (C1), with gains observed even for typologically distant language pairs (e.g., HE→BG and EU→ET).
Usefulness of Stage C2? The results in Table 1 confirm the effectiveness of our two-stage C2 (C1) BLI method (see RQ3 in §1). However, Stage C2 is in fact independent of Stage C1, and can thus also be combined with other standard BLI methods. We therefore validate whether the exposed mBERT-based translation knowledge can also aid other BLI methods: instead of drawing positive and negative samples from Stage C1 (§2.2) and combining C2 WEs with WEs from C1 (§2.3), we replace C1 with our baseline models. The results of the C2 (RCSLS) and C2 (VecMap) BLI variants for a selection of language pairs are provided in Table 3.
The gains achieved with all C2 (·) variants clearly indicate that Stage C2 produces WEs which aid all BLI methods. In fact, combining it with RCSLS and VecMap yields even larger relative gains over the base models than combining it with our Stage C1. However, since Stage C1 (as the base model) performs better than RCSLS and VecMap, the final absolute scores with C2 (C1) still outperform C2 (RCSLS) and C2 (VecMap).
Different Multilingual LMs? We also compare variants of C2 (C1) based on XLM and mT5 in place of mBERT (see §3) on eight language pairs; mBERT remains our default choice.

Combining C1 and C2? The usefulness of combining the representations from the two stages is measured by varying the value of λ for several BLI setups. The plots in Figure 2 indicate that Stage C1 contributes more to the performance, with slight gains achieved when allowing the 'influx' of mBERT knowledge (e.g., λ in the [0.0, 0.3] interval). While mBERT-based WEs are not sufficient as standalone representations for BLI, they appear even more useful in the combined model for the lower-resource languages of PanLex-BLI, with a steeper increase in performance and peak scores achieved at larger values of λ.
Ablation Study. The results, summarised in Table 5, display several interesting trends. First, both CL and self-learning are key components in the 1k setups: removing either yields substantial drops. In the 5k setups, self-learning becomes less important, and removing it yields only negligible drops, while CL remains a crucial component (see also Appendix F). Further, Table 5 complements the results from Figure 2 and again indicates that, while Stage C2 indeed boosts the word translation capacity of mBERT, using mBERT features alone is still not sufficient to achieve competitive BLI scores. After all, pretrained LMs are contextualised encoders designed for (long) sequences rather than individual words or tokens. Finally, Table 5 shows the importance of fine-tuning mBERT before combining it with C1-based WEs (§2.3): directly adding WEs extracted from off-the-shelf mBERT does not yield any benefits (see the scores for the C1+mBERT variant, where λ is also 0.2).
The Impact of Contrastive Fine-Tuning on mBERT's representation space for two language pairs is illustrated by a t-SNE plot in Figure 3. The semantic space of off-the-shelf mBERT displays a clear separation of language-specific subspaces (Libovický et al., 2020; Dufter and Schütze, 2020), which makes it unsuitable for the BLI task. On the other hand, contrastive fine-tuning reshapes the subspaces towards a shared (cross-lingual) space, the effects of which are then also reflected in mBERT's improved BLI capability (see Table 5 again).
To understand the role of CL in Stage C1, we visualise static WEs mapped by C1 without CL (i.e., AM+SL; see §2.1) and by the complete Stage C1, respectively. Figure 4 shows that C1 without CL already learns a sensible cross-lingual space. However, we note that advanced mapping (AM) in C1 without CL learns a (near-)orthogonal map, which might result in mismatches, especially for dissimilar language pairs. For TR-HR, the plot reveals a gap between the C1-aligned WE spaces, although the final BLI performance still improves: this might be due to the 'repelling' of negatives from each other during CL.
Finally, we direct interested readers to Appendix G where we present some qualitative translation examples.

Related Work
This work is related to three topics, each with a large body of work; we can thus provide only a condensed summary of the most relevant research.
Mapping-Based BLI. These BLI methods are highly popular due to their reduced bilingual supervision requirements; consequently, they are applicable to low-resource languages and domains.

Contrastive Learning in NLP aims to learn a semantic space in which embeddings of similar text inputs are close to each other, while 'repelling' dissimilar ones. It has shown promising performance for training generic sentence encoders (Giorgi et al., 2021; Carlsson et al., 2021; Liu et al., 2021a; Gao et al., 2021); related work (e.g., Liu et al., 2021c) applies similar techniques to phrase and word-in-context representation learning. The success of these methods suggests that LMs store a wealth of lexical knowledge; yet, as we confirm here for BLI, fine-tuning is typically needed to expose it.

Conclusion
We have proposed a simple yet extremely effective and robust two-stage contrastive learning framework for improving bilingual lexicon induction (BLI). In Stage C1, we tune cross-lingual linear mappings between static word embeddings with a contrastive objective, achieving substantial gains in 107 out of 112 BLI setups on the standard BLI benchmark. In Stage C2, we further propose a contrastive fine-tuning procedure that harvests cross-lingual lexical knowledge from multilingual pretrained language models. The representations from this process, when combined with Stage C1 embeddings, result in further boosts in BLI performance, with gains in all 112 setups. We have also conducted a series of finer-grained evaluations, analyses and ablation studies.

Ethics Statement
Our research aims to support the delivery of truly multilingual language technology, also for under-resourced languages and cultures, by bridging the lexical gap between languages, groups and cultures. As a key task in cross-lingual NLP, bilingual lexicon induction or word translation has broad applications in, e.g., machine translation and language acquisition, and can potentially help protect endangered languages. Furthermore, compared with many previous studies, we stress the importance of diversity: our experiments cover various language families and include six lower-resource languages from the PanLex-BLI dataset. Hoping that our work can contribute to extending modern NLP techniques to lower-resource and under-represented languages, we focus on semi-supervised settings and achieve significant improvements with self-learning techniques.
The two BLI datasets we use are both publicly available. To the best of our knowledge, the data (i.e., word translation pairs) do not contain any sensitive information and pose no foreseeable risk.

A.1 Advanced Mapping (AM) in Stage C1
Suppose X_D, Y_D ∈ R^{|D|×d} are the source and target embedding matrices corresponding to the training dictionary D. X_D and Y_D are first whitened, and singular value decomposition (SVD) is conducted on the whitened embeddings; W_x and W_y are then derived by composing the whitening, orthogonal mapping, re-weighting and de-whitening operations into single matrices.
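A simplified numpy sketch of the AM pipeline named above (whitening, orthogonal mapping via SVD, re-weighting, de-whitening), composed into the one-off matrices W_x and W_y; this is our own reconstruction of the VecMap-style steps, and the exact composition and re-weighting exponent in the paper may differ:

```python
import numpy as np

def advanced_mapping(xd, yd, reweight=0.5):
    """Simplified VecMap-style mapping computed from dictionary-aligned WE
    matrices xd, yd (|D| x d each). Returns (wx, wy) so that a one-off
    multiplication X @ wx, Y @ wy maps the full spaces."""
    def whitener(m):
        # (m^T m)^(-1/2), computed via SVD of the Gram matrix
        u, s, vt = np.linalg.svd(m.T @ m)
        return vt.T @ np.diag(1.0 / np.sqrt(s)) @ vt
    w1x, w1y = whitener(xd), whitener(yd)
    xw, yw = xd @ w1x, yd @ w1y                 # whitened embeddings
    u, s, vt = np.linalg.svd(xw.T @ yw)         # orthogonal maps u and vt^T
    sw = np.diag(s ** reweight)                 # re-weighting by s^reweight
    # de-whitening mirrors the whitening transform in the rotated space
    wx = w1x @ u @ sw @ u.T @ np.linalg.inv(w1x) @ u
    wy = w1y @ vt.T @ sw @ vt @ np.linalg.inv(w1y) @ vt.T
    return wx, wy
```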

A.2 Word Similarity/Retrieval Measures
Given two word embeddings x ∈ X and y ∈ Y, their similarity can be defined as the cosine similarity m(x, y) = cos(x, y). For the FIPP model, we instead calculate the dot product m(x, y) = x^T y without normalisation, as this generally produces better BLI scores with FIPP. For simple nearest-neighbour (NN) BLI with cosine (or dot product) similarity, we retrieve the word with the highest similarity score from the entire target-language vocabulary of size 200k and mark it as the translation of the input/query word in the source language.
For the Cross-domain Similarity Local Scaling (CSLS) measure, a CSLS score is defined as CSLS(x, y) = 2m(x, y) − r_X(y) − r_Y(x), where r_X(y) is the average m(·, ·) score of y and its k-NNs (k = 10) in X, and r_Y(x) is the average m(·, ·) score of x and its k-NNs (k = 10) in Y. Note that when using CSLS scores to retrieve the translation of x in Y, the term r_Y(x) can be ignored, as it is constant for all y; we can similarly ignore r_X(y) when doing BLI in the opposite direction.
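The CSLS-based retrieval above can be sketched in a few lines of NumPy. The function name `csls_retrieve` is ours, and the embeddings are assumed to be L2-normalised so that the matrix product gives cosine similarities; as noted above, the constant term r_Y(x) is dropped for argmax retrieval.

```python
import numpy as np

def csls_retrieve(X, Y, k=10):
    """For each row of X (source WEs), return the index of its CSLS-best
    row in Y (target WEs). X and Y are L2-normalised, so X @ Y.T is cosine."""
    sim = X @ Y.T                          # m(x, y) for all source-target pairs
    # r_X(y): average similarity of each target word to its k NNs in X
    knn_y = np.sort(sim, axis=0)[-k:, :]   # top-k source neighbours per target
    r_x = knn_y.mean(axis=0)               # shape: (|Y|,)
    # r_Y(x) is constant per query word and can be ignored for retrieval
    csls = 2 * sim - r_x[None, :]
    return csls.argmax(axis=1)
```

The r_X(y) penalty demotes "hubs", i.e., target words that are near neighbours of many source words at once.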

A.3 Generalised Procrustes in Stage C2
We consider the following Procrustes problem: W* = arg min_W ||XW − Y||_F subject to W W^T = I, where X ∈ R^{n×d_1} is a C1-induced cross-lingual space spanning all source and target words in the training set D, Y ∈ R^{n×d_2} is a C2-induced space representing all mBERT-encoded vectors corresponding to the same words from X, and W ∈ R^{d_1×d_2}, d_1 ≤ d_2. The classical Orthogonal Procrustes Problem assumes that d_1 = d_2 and that W is an orthogonal (i.e., square) matrix; its optimal solution is given by U V^T, where U S V^T is the full singular value decomposition (SVD) of X^T Y. In our experiments, we need to address the case d_1 < d_2 when mapping 300-dimensional static fastText WEs to the 768-dimensional space of mBERT-based WEs. It is easy to show that if U [S, 0] V^T = X^T Y (again the full SVD), then the optimal W is U [I, 0] V^T (which degrades to the Orthogonal Procrustes solution when d_1 = d_2). Below, we provide a simple proof.
Since W W^T = I implies ||XW||_F = ||X||_F, minimising ||XW − Y||_F^2 = ||X||_F^2 + ||Y||_F^2 − 2⟨XW, Y⟩_F is equivalent to maximising

⟨XW, Y⟩_F = ⟨W, X^T Y⟩_F = ⟨W, U [S, 0] V^T⟩_F = ⟨U^T W V, [S, 0]⟩_F = ⟨[S, 0], Ω⟩_F, (14)

where Ω := U^T W V. In the formula above, ||·||_F and ⟨·, ·⟩_F are the Frobenius norm and Frobenius inner product, and we leverage their properties throughout the proof. Note that S is a diagonal matrix with non-negative elements and that Ω has orthonormal rows (Ω Ω^T = U^T W V V^T W^T U = I), so every entry of Ω is at most 1 in absolute value; the maximum is thus achieved when Ω = [I, 0] and W = U [I, 0] V^T.
Note that the Procrustes mapping over word embedding matrices keeps word similarities on both sides intact: since W W^T = I, cos(x_i W, x_j W) = cos(x_i, x_j).
We also note, although it is irrelevant to our own experiments, that the above derivation cannot address d_1 > d_2 scenarios: in that case W W^T cannot be a full-rank matrix, and thus W W^T ≠ I.
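A minimal NumPy sketch of the rectangular (d_1 ≤ d_2) Procrustes solution derived above; the function name `procrustes_rect` is ours, and the toy dimensions stand in for the 300- and 768-dimensional spaces:

```python
import numpy as np

def procrustes_rect(X, Y):
    """Solve min ||X W - Y||_F over W (d1 x d2) with W W^T = I, d1 <= d2.
    Optimal W = U [I, 0] V^T, where U [S, 0] V^T is the full SVD of X^T Y."""
    d1, d2 = X.shape[1], Y.shape[1]
    U, _, Vt = np.linalg.svd(X.T @ Y)   # full SVD: U is d1 x d1, Vt is d2 x d2
    return U @ np.eye(d1, d2) @ Vt      # np.eye(d1, d2) is the padded identity [I, 0]

# Toy stand-ins for the C1 (low-dim) and C2 (high-dim) spaces
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((100, 5)), rng.standard_normal((100, 8))
W = procrustes_rect(X, Y)
# Orthonormal rows, so cosine similarities within X are preserved under W
assert np.allclose(W @ W.T, np.eye(5), atol=1e-10)
```

Because W W^T = I, mapping X by W preserves pairwise cosine similarities, matching the note above.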
B Reproducibility Checklist
• Baseline BLI Models: All models are accessible online as publicly available GitHub repositories.
• Runtime: The training process (excluding data loading and evaluation) typically takes 650 seconds for Stage C1 (seed dictionary of 5k pairs, 2 self-learning iterations) and 200 seconds for Stage C1 (1k pairs, 3 self-learning iterations) on a single GPU. Stage C2 runs for ≈ 500 seconds on two GPUs (TITAN X).
• Robustness and Randomness: Our improvement is robust, as both C1 and C2 outperform existing SotA methods in all 112 BLI setups by a considerable margin. We regard C1 as a deterministic algorithm, since we use no dropout and a batch size equal to the size of the whole training dictionary (no randomness from shuffling). For C2, to control for randomness, we fix the random seed to 33 over all runs and setups.

C Visualisation of mBERT-Based Word Representations
To illustrate the impact of the proposed BLI-oriented fine-tuning of mBERT in Stage C2 on its representation space, we visualise the 768-dimensional mBERT word representations (i.e., mBERT-encoded word features alone, without the infusion of C1-aligned static WEs). We encode BLI test sets (i.e., these sets include 2k source-target word pairs unseen during C2 fine-tuning), before and after fine-tuning, relying on 1k training samples as the seed dictionary D_0. Here, we provide comparative t-SNE visualisations of source and target mBERT-based decontextualised word representations (see §2.2) for six language pairs from the BLI dataset of Glavaš et al. (2019): EN-IT, FI-RU, EN-HR, HR-RU, DE-TR, and IT-FR; two additional visualisations are available in the main paper (for RU-IT and TR-HR, see Figure 3 in §4.1). As visible in all the figures below, before the BLI-oriented fine-tuning in Stage C2, there is an obvious separation between mBERT's representation subspaces of the two languages. This undesired property is mitigated, to a considerable extent, by the fine-tuning procedure in Stage C2.

D Visualisation of fastText Word Representations
To show the impact of contrastive tuning in Stage C1, we provide t-SNE plots of 300-dimensional C1-aligned fastText embeddings with and without contrastive tuning (see §2.1), for the same six language pairs as in Appendix C. The C1 w/o CL alignment consists of advanced mapping and self-learning loops, and has already been discussed in our ablation study (see §4.1). As in Appendix C, the linear maps are learned on 1k seed translation pairs, and our plots only cover the BLI test sets.

E Appendix: Full BLI Results
Complete results on the BLI dataset of Glavaš et al. (2019), per each language pair and also including NN-based BLI scores, are provided in Tables 7-8. They can be seen as an expanded variant of Table 1 in the main paper.

F Appendix: Full Ablation Study
Complete results of the ablation study, over all languages in the evaluation set of Glavaš et al. (2019), are available in Table 9; they can be seen as additional evidence supporting the claims from the main paper (see §4.1).

G Appendix: Translation Examples
We showcase translation examples of both C1 alignment (see §2.1) and C2 alignment (see §2.2) in HR→EN and IT→EN word translation scenarios. In order to gain insight into the effectiveness of contrastive learning, we adopt C1 w/o CL as a baseline (also used in Table 5). All three models (i.e., C1 w/o CL, C1 and C2) are learned with 5k seed training word pairs, and we report top five predictions via Nearest Neighbor (NN) retrieval (for simplicity) on the BLI test sets. We consider both SUCCESS and FAIL examples in terms of BLI-oriented contrastive fine-tuning, where 'SUCCESS' represents the cases where at least one of C1 and C2 predicts the correct answer when the baseline fails, and 'FAIL' denotes the scenarios where the baseline succeeds but both C1 and C2 make wrong predictions. Here, we show some statistics for each language pair: (1) HR-EN sees 284 SUCCESS samples and 79 FAIL ones; (2) IT-EN has 165 SUCCESS data points, but only 27 FAIL ones.

Table 10: Translation examples on HR-EN and IT-EN. We include ground truth translation pairs and show top five predictions (in the "Top Five Predictions" column, ranked left to right from the first to the fifth item) via NN retrieval for each of the three methods, i.e., C1 w/o CL (Baseline), C1 and C2 (C1).