On Bilingual Lexicon Induction with Large Language Models

Bilingual Lexicon Induction (BLI) is a core task in multilingual NLP that still, to a large extent, relies on calculating cross-lingual word representations. Inspired by the global paradigm shift in NLP towards Large Language Models (LLMs), we examine the potential of the latest generation of LLMs for the development of bilingual lexicons. We ask the following research question: Is it possible to prompt and fine-tune multilingual LLMs (mLLMs) for BLI, and how does this approach compare against and complement current BLI approaches? To this end, we systematically study 1) zero-shot prompting for unsupervised BLI and 2) few-shot in-context prompting with a set of seed translation pairs, both without any LLM fine-tuning, as well as 3) standard BLI-oriented fine-tuning of smaller LLMs. We experiment with 18 open-source text-to-text mLLMs of different sizes (from 0.3B to 13B parameters) on two standard BLI benchmarks covering a range of typologically diverse languages. Our work is the first to demonstrate strong BLI capabilities of text-to-text mLLMs. The results reveal that few-shot prompting with in-context examples from nearest neighbours achieves the best performance, establishing new state-of-the-art BLI scores for many language pairs. We also conduct a series of in-depth analyses and ablation studies, providing further insights into BLI with (m)LLMs along with their limitations.


Introduction and Motivation
Bilingual Lexicon Induction (BLI), also known as word translation, is a fundamental research topic in multilingual NLP that aims to bridge the lexical gap between languages (Ruder et al., 2019). It has a wide range of applications such as machine translation (Artetxe et al., 2018b; Marchisio et al., 2020; Chronopoulou et al., 2021) and cross-lingual transfer learning, especially for low-resource languages (Sun et al., 2021; Zhou et al., 2021; Wang et al., 2022). Over the past decade, state-of-the-art (SotA) BLI approaches have been predominantly supported by learning a cross-lingual word embedding (CLWE) space, with which BLI is tackled via nearest neighbour retrieval (Artetxe et al., 2018a; Heyman et al., 2019; Peng et al., 2021; Li et al., 2022a; Marchisio et al., 2022, inter alia).
Meanwhile, autoregressive text-to-text large language models (LLMs) have emerged as the cornerstone of cutting-edge NLP research (Raffel et al., 2020; Brown et al., 2020; Ouyang et al., 2022; Chowdhery et al., 2022). For example, multilingual LLMs (mLLMs) have shown (sentence-level) machine translation capabilities (Vilar et al., 2022; Briakou et al., 2023), although they have not been pretrained for machine translation in a supervised manner. Motivated by the recent remarkable success of (m)LLMs, in this work we investigate 1) the potential of prompting and fine-tuning mLLMs for BLI and 2) how their capabilities compare against and complement current BLI approaches. We focus on how to expose word-level bilingual knowledge and elicit word translations from multilingual LLMs. To the best of our knowledge, we are the first to leverage autoregressive mLLMs for BLI. We systematically study zero-shot and few-shot prompting for BLI with off-the-shelf encoder-decoder and decoder-only autoregressive mLLMs (Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020), respectively. In the few-shot scenario, we propose to incorporate in-context examples from nearest neighbours into the prompts to boost BLI performance. In order to guide the mLLMs' generation, we hand-craft 'mask-filling-style' and 'GPT-style' templates catering to the characteristics of different LLMs and conduct extensive template search for BLI. In addition to providing a complete and effective pipeline for BLI via prompting off-the-shelf mLLMs, we also investigate BLI-oriented fine-tuning with the LLMs' own pretraining objectives, aiming at specialising mLLMs into 'few-shot word translators'.
We conduct extensive experiments on two standard BLI benchmarks, XLING (Glavaš et al., 2019) and PanLex-BLI (Vulić et al., 2019), investigating the word translation capabilities of off-the-shelf mLLMs (we adopt 18 models from 5 LLM families) in various BLI setups. Our comprehensive comparisons between mLLMs confirm, as expected, that 1) different LLM families display varying word translation capabilities and 2) stronger BLI performance tends to be associated with larger model sizes. To demonstrate the effectiveness of our prompt-based approach, we benchmark our method against two SotA CLWE-based baselines. Notably, our approach with LLaMA 13B outperforms the CLWE-based SotA on the XLING dataset by a considerable margin, establishing new SotA results on many language pairs in all BLI setups. Meanwhile, we also identify two limitations of BLI with mLLMs: 1) they are less competitive on the PanLex-BLI benchmark for lower-resource languages; 2) CLWE-based approaches usually support more languages than mLLMs. Finally, we run a series of insightful ablations and discuss the usefulness of BLI-oriented fine-tuning. In short, our work validates the BLI capabilities of mLLMs and proposes new methodology for BLI. We hope that the combination of our comprehensive analyses and discussions, including on limitations, will pave the way for the development of stronger BLI systems in the future. Our code is publicly available at github.com/cambridgeltl/prompt4bli.

Related Work
Bilingual Lexicon Induction. Over the past decade, predominant BLI approaches have relied on the calculation of cross-lingual word embeddings (CLWEs) where, in the most popular BLI variant, two transformation functions are learned to respectively map source and target monolingual static word embedding spaces into a shared cross-lingual space (Xing et al., 2015; Lample et al., 2018; Joulin et al., 2018; Artetxe et al., 2018a; Alvarez-Melis and Jaakkola, 2018; Patra et al., 2019; Mohiuddin et al., 2020; Glavaš and Vulić, 2020; Peng et al., 2021; Li et al., 2022a; Marchisio et al., 2022). Relying on the learned CLWE space, BLI is then conducted via nearest neighbour retrieval. A detailed overview of different BLI principles can be found, e.g., in the work of Ruder et al. (2019).
More recently, researchers have attempted BLI by leveraging encoder-only multilingual masked language models (mMLMs) such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), whose neural architecture consists of only Transformer encoders (Vaswani et al., 2017). Gonen et al. (2020) prompt mBERT with templates where the target word is replaced with a '<mask>' token, and the language modelling head of mBERT outputs a subword token to fill the mask. This method is theoretically flawed because it cannot address the cases where the target word comprises two or more subword tokens. Therefore, Gonen et al. (2020) only evaluate BLI on a small set of 'toy' examples rather than standard BLI datasets. In terms of performance, this method lags far behind traditional BLI approaches. A more successful way of leveraging mMLMs is to extract decontextualised word representations from them (Zhang et al., 2021). The strongest CLWEs for BLI so far are learned via a two-stage contrastive approach combining both static (e.g., fastText) and mMLM-extracted features (Li et al., 2022a).

Text-to-Text LLMs. Autoregressive LLMs have established new state-of-the-art results on many NLP tasks. The prominent model groups include 1) encoder-decoder LLMs such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020); 2) OpenAI's decoder-only GPT series such as GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), and InstructGPT (Ouyang et al., 2022); 3) other GPT-like LLMs with specific improvements such as Chinchilla (Hoffmann et al., 2022), PaLM (Chowdhery et al., 2022), and the LLaMA LLM series (Touvron et al., 2023).

Methodology
BLI Task: Preliminaries and Terminology. Assuming a bilingual scenario with a source language L^x and a target language L^y, with their respective vocabularies denoted as X and Y, the BLI task is typically formulated as a standard information retrieval task (Gaussier et al., 2004; Glavaš et al., 2019). The goal is to rank the words from Y with respect to their similarity to the input source word w^x. The vocabulary size for each language is typically set to 200K (Li et al., 2022a), covering the most frequent 200K word types in each language. A bilingual lexicon then comprises a set of one-to-one source and target word translation pairs (Mikolov et al., 2013), and we denote a word pair as π = (w^x, w^y), where w^x ∈ X and w^y ∈ Y.
We assume a set D_S of N available seed translation pairs, constituting the so-called seed dictionary, which is used as the training set. Depending on the number of training pairs, the task is usually referred to in the literature as supervised BLI (typically, N ≥ 5K), semi-supervised BLI (e.g., 0 < N ≤ 1K), or unsupervised BLI (N = 0) (Artetxe et al., 2018a; Zhao et al., 2020; Li et al., 2022a). For convenience, we also refer to the unsupervised setup as zero-shot BLI (N = 0) and denote the setup with a handful of seed translation pairs as few-shot BLI (N > 0), corresponding to how we prompt mLLMs for BLI (we describe zero-shot and few-shot prompts for BLI later in §3). A test set D_T, where D_S ∩ D_T = ∅, is used for evaluation.
In some cases, a source word may have more than one ground-truth translation (i.e., there exist two or more word pairs in a BLI dictionary that share the same source word). Following previous work (Lample et al., 2018; Glavaš et al., 2019; Li et al., 2022a), we consider a prediction correct as long as it matches any of the ground-truth translations. The BLI scores are reported based on the standard Precision@K (P@K) BLI measure, where K denotes the length of the ranked list.
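To make this evaluation convention concrete, here is a minimal sketch of P@1 scoring under the 'any gold translation counts' rule (a simplified illustration; the function and variable names are ours, and the toy word pairs are made up):

```python
from collections import defaultdict

def precision_at_1(test_pairs, predictions):
    """Compute P@1: a prediction is correct if it matches ANY ground-truth translation
    of the source word. test_pairs is a list of (source, target) tuples from D_T;
    predictions maps each source word to its single top-ranked target-language word."""
    gold = defaultdict(set)
    for w_x, w_y in test_pairs:
        gold[w_x].add(w_y)                       # collect all gold translations per source word
    hits = sum(1 for w_x in gold if predictions.get(w_x) in gold[w_x])
    return hits / len(gold)                      # averaged over unique source words

# Toy usage with made-up word pairs:
test_pairs = [("hund", "dog"), ("hund", "hound"), ("katze", "cat")]
predictions = {"hund": "hound", "katze": "cat"}
print(precision_at_1(test_pairs, predictions))   # 1.0
```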

Prompting Multilingual LLMs for BLI
This study employs five families of mainstream multilingual text-to-text LLMs (mLLMs): mT5, mT0, XGLM, mGPT, and LLaMA. Based on their model structures, we group these models into two categories; in what follows, we briefly introduce each of them and showcase some simple templates used for 'BLI-prompting' the LLMs.
The first category includes mT5 and mT0, two encoder-decoder LLM families that leverage the full Transformer architecture (Vaswani et al., 2017). Each model family comes in five different sizes, and we evaluate all ten of these models.
• mT5 (Xue et al., 2021) is pretrained on the mC4 dataset covering 101 languages. The LLM leverages a span-corruption objective that reconstructs consecutive spans of dropped-out tokens replaced with special mask tokens.
• mT0 (Muennighoff et al., 2022) is a multitask-finetuned mLLM based on instruction fine-tuning of the original mT5 model. The fine-tuning is conducted with English prompts on mT0's xP3 dataset spanning 46 languages.

For these two encoder-decoder mLLM families, we aim to derive prompts such that the first word of the output sequence serves as the guess for w^y. Catering to its span-corruption objective, for mT5 we design mask-filling-style English templates where a '<mask>' token is used as the placeholder for the target word. Here is an example template: 'The L^x word w^x in L^y is <mask>.', where L^x, L^y, and w^x are placeholders for the source language, the target language, and the input source word, respectively. When a prompt based on this template is fed into mT5, its decoder will output a sequence to fill the mask. Since mT0 is based on mT5, we found that mask-filling-style prompts are also applicable to mT0. However, unlike mT5, the instruction-tuned mT0 also fits templates without the '<mask>' token. For simplicity, we will denote all such templates without any '<mask>' tokens as 'GPT-style templates'.
The second model category comprises XGLM, mGPT, and LLaMA, three decoder-only LLM families pretrained with causal LM losses. Our experiments involve five XGLM and two LLaMA models whose sizes are no larger than 13B parameters, while mGPT releases only one model of size 1.4B. Unlike encoder-decoder LLMs for conditional generation, the decoder-only causal LLMs first repeat the input sequence in their output, and we construct prompts that induce the LLMs to produce w^y immediately after the repeated input sequence.
• XGLM (Lin et al., 2022) offers multilingual LLMs similar to GPT-3 (Brown et al., 2020) and is reported to outperform GPT-3 of comparable size on a series of tasks. The work builds a CC100-XL dataset based on Conneau et al. (2020) and Wenzek et al. (2020), and XGLM is pretrained on a subset of it covering 30 languages.
• LLaMA (Touvron et al., 2023) is a recently released SotA LLM family trained on trillions of tokens exclusively from publicly available datasets; it supports 20 languages. LLaMA also features an efficient implementation and adopts a series of recent improvements to normalisation, activation functions, and positional embeddings.
Our decoder-only LLMs solely leverage the GPT-style prompts introduced above for mT0, since their tokenisers usually do not support '<mask>' tokens.
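To make the two template styles concrete, the following is a minimal sketch of how such zero-shot prompts could be assembled (the template strings mirror the example above; the helper names are ours, and '<extra_id_0>' is the sentinel that mT5/mT0 use in place of '<mask>', cf. Appendix C):

```python
def mask_filling_prompt(src_lang, tgt_lang, src_word):
    # Mask-filling-style template for mT5/mT0: a sentinel token marks the target word.
    return f"The {src_lang} word {src_word} in {tgt_lang} is <extra_id_0>."

def gpt_style_prompt(src_lang, tgt_lang, src_word):
    # GPT-style template (no mask token) for mT0 and decoder-only LLMs, which are
    # expected to continue the sentence with the target-language translation.
    return f"The {src_lang} word {src_word} in {tgt_lang} is"

print(mask_filling_prompt("German", "French", "Hund"))
# -> The German word Hund in French is <extra_id_0>.
print(gpt_style_prompt("German", "French", "Hund"))
# -> The German word Hund in French is
```

The best-performing template for each LLM is selected via the template search described below and listed in Table 12 (Appendix C).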

Retrieval-Augmented In-Context Learning
In §3.1, we presented some simple zero-shot prompts (i.e., prompts without in-context examples) for BLI. However, recent work highlights the few-shot capabilities of modern LLMs (Brown et al., 2020). Therefore, we also investigate few-shot templates for improved BLI performance. We propose to retrieve the nearest neighbours of a source word and use them to construct in-context examples that boost BLI performance. More specifically, given D_S and an input source word w^x, we extract n word pairs (w^x_i, w^y_i) ∈ D_S, 1 ≤ i ≤ n, such that w^x_i, 1 ≤ i ≤ n, are the n nearest neighbours of w^x in the auxiliary static monolingual word embedding space of X. This auxiliary space is based on pretrained fastText word embeddings (Bojanowski et al., 2017), and we use the cosine similarity measure for the retrieval. We again design mask-filling-style and GPT-style few-shot templates for the mLLMs, as discussed in §3.1. Similar to zero-shot prompts, for few-shot prompts we also extract the first word, after removing special tokens (e.g., start-of-sentence, padding, and '<mask>' tokens) and the repeated input sequence (for decoder-only models), as the prediction of w^y.
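A minimal sketch of this retrieval step and of assembling the resulting few-shot prompt is given below (we assume the fastText source-language vectors are already loaded into a word-to-vector dictionary; all helper names are illustrative, and the exact template again follows Appendix C):

```python
import numpy as np

def nearest_neighbour_examples(src_word, seed_dict, src_vectors, n=5):
    """Retrieve n in-context pairs (w^x_i, w^y_i) from the seed dictionary D_S whose
    source words are the nearest neighbours of src_word in the (monolingual) fastText
    space of the source language, ranked by cosine similarity."""
    query = src_vectors[src_word]
    query = query / np.linalg.norm(query)
    scored = []
    for w_x, w_y in seed_dict:                     # seed_dict: list of (source, target) pairs
        if w_x == src_word or w_x not in src_vectors:
            continue
        vec = src_vectors[w_x]
        sim = float(np.dot(query, vec / np.linalg.norm(vec)))
        scored.append((sim, w_x, w_y))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(w_x, w_y) for _, w_x, w_y in scored[:n]]

def few_shot_prompt(src_lang, tgt_lang, src_word, examples):
    # Prepend the retrieved in-context word pairs, then ask for the new source word.
    demos = " ".join(f"The {src_lang} word {w_x} in {tgt_lang} is {w_y}."
                     for w_x, w_y in examples)
    return f"{demos} The {src_lang} word {src_word} in {tgt_lang} is"
```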

Template Design and BLI Inference
Template Design. We hand-craft a total of 102 English zero-shot and few-shot templates, listed in Tables 10 and 11 of Appendix C, respectively. A small set of basic templates is fully manually designed, and additional variants are then created by modifying or replacing the punctuation (see the tables). For each LLM, we search for its best zero-shot template and its best few-shot template on a randomly chosen language pair (German, French) and fix these template choices for experiments on all other language pairs. The best template choices for each LLM are provided in Table 12 (Appendix C).

BLI Inference. At inference, we adopt beam search for both encoder-decoder and decoder-only LLMs and have the generator return the full final beam of candidate sequences ranked by their sequence scores. For each input prompt corresponding to w^x, we iterate through the returned set of sequences, and for each sequence we extract the word after removing any redundant prefix content, as described in §3.2. The first extracted word that appears in the target vocabulary is returned as our prediction of w^y.
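For illustration, here is a minimal sketch of this inference loop with the Hugging Face transformers library, shown for an encoder-decoder model (the model ID and hyperparameter values follow §4, but the word-extraction details are simplified relative to the full pipeline; the function name is ours):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-large").eval()

def translate_word(prompt, target_vocab, num_beams=5, max_new_tokens=5):
    """Generate with beam search and return the first candidate word (iterating over
    the beam ranked by sequence score) that appears in the 200K target vocabulary."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            num_beams=num_beams,
            num_return_sequences=num_beams,   # return the whole ranked beam
            max_new_tokens=max_new_tokens,
        )
    for seq in tokenizer.batch_decode(outputs, skip_special_tokens=True):
        # mT5's decoder output may still start with the span sentinel; strip it
        # explicitly before taking the first word of the candidate sequence.
        words = seq.replace("<extra_id_0>", " ").strip().split()
        if words and words[0] in target_vocab:
            return words[0]
    return None                               # no beam candidate found in the target vocabulary
```

For decoder-only models, the repeated input prompt would additionally be stripped from each decoded sequence before extracting the first word.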

BLI-Oriented Fine-Tuning
This work predominantly focuses on 'learning-less' experiments, i.e., zero-shot and few-shot in-context setups with off-the-shelf mLLMs for BLI without any fine-tuning. As a side experiment, we also aim to fine-tune smaller-scale mLLMs, specialising them into few-shot word translators that take our few-shot prompts as input. Our training set is still D_S, but we now exclude retrieving an input w^x itself as an in-context example. We combine the D_S of all language pairs, and with this data we fine-tune encoder-decoder mLLMs with mT5's span-corruption loss and decoder-only LLMs with the standard causal LM objective.
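As a rough illustration of the decoder-only case, a fine-tuning instance can be formed by concatenating a few-shot prompt with its gold target word and training with the standard causal LM loss; the sketch below reuses the helper functions from the §3.2 sketch, and the construction details (e.g., how the labels are set) are our simplifying assumptions:

```python
def build_causal_lm_instance(src_lang, tgt_lang, pair, seed_dict, src_vectors, n=5):
    """Create one templated training example for causal LM fine-tuning: a few-shot
    prompt for (w^x, w^y), with in-context examples retrieved from D_S while
    excluding w^x itself, followed by the gold translation w^y."""
    w_x, w_y = pair
    examples = nearest_neighbour_examples(w_x, seed_dict, src_vectors, n=n)
    prompt = few_shot_prompt(src_lang, tgt_lang, w_x, examples)
    return prompt + " " + w_y   # labels mirror the input tokens, as in standard causal LM training
```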

Experimental Setup
Training and Evaluation Data. Our experiments adopt two standard and publicly available BLI datasets, also used in a body of very recent BLI research (Vulić et al., 2020; Sachidananda et al., 2021; Aboagye et al., 2022; Li et al., 2022a,b; Vulić et al., 2023). 1) XLING (Glavaš et al., 2019) provides BLI dictionaries covering 8 languages and 56 BLI directions. Among these 8 languages, 5 are supported by all of our mLLMs: English (EN), French (FR), German (DE), Italian (IT), and Russian (RU). Therefore, §5 mainly focuses on and reports results for all 20 = 5 × 4 BLI directions spanning these 5 languages. For each language pair, XLING provides a test set D_T of 2K translation pairs. It also provides training sets D_S of 5K and 1K translation pairs, where the former is a superset of the latter. For brevity, we denote the cases |D_S| = 5K, |D_S| = 1K, and |D_S| = 0 as the 5K setup, 1K setup, and unsupervised setup, respectively. 2) PanLex-BLI (Vulić et al., 2019) offers BLI lexicons spanning 15 lower-resource languages and all 210 BLI directions. We select three languages that are supported by most of our mLLMs: Bulgarian (BG), Catalan (CA), and Hungarian (HU). The test set size of PanLex-BLI is also 2K; under the lower-resource assumption, we only focus on the unsupervised and 1K BLI setups.
Main Experiments. In our main experiments, we prompt 18 off-the-shelf models from the 5 mLLM families mentioned in §3.1 for BLI without any fine-tuning and systematically evaluate their BLI performance in three different BLI setups on the XLING and PanLex-BLI datasets introduced above. In the 5K and 1K setups, 5-shot in-context learning is adopted for our mLLMs, while in the unsupervised setup, zero-shot prompts are used. We compare the BLI scores of different mLLMs from the perspectives of LLM family and model size, and we also benchmark their performance against two SotA CLWE-based baselines, introduced later. Selected results are summarised in §5.1, while full and detailed BLI scores are reported in Appendix D.
Side Experiments. We conduct a series of additional experiments to further understand the BLI capabilities of mLLMs. 1) We investigate how the BLI performance is related to the number of in-context examples (5K and 1K setups). 2) As an ablation study, we validate the usefulness of our proposed in-context examples extracted from nearest neighbours by comparing them against randomly sampled in-context examples. 3) Finally, we fine-tune some of our relatively smaller-scale LLMs, including mT5 base, mT5 large, XGLM 564M, and XGLM 1.7B, on our 5-shot templated BLI data (XLING) and further study the effectiveness of our BLI-oriented fine-tuning (5K and 1K setups). The training set includes all XLING language pairs, where the 5K and 1K setups have 271,754 and 55,228 training instances, respectively.
Hyperparameters. We first introduce our hyperparameters for BLI inference. In our main experiments, we adopt n = 5, while in side experiments we further investigate and compare different numbers of in-context examples n. Concerning the generation of output sequences, we adopt a beam size of 5 for all LLMs, and the maximum sequence length is 5 for encoder-decoder models and 5 plus the input sequence length for decoder-only models, which first repeat the input sequence before generating new content. For encoder-decoder LLMs, we use an evaluation batch size of 100 for smaller models and 8 for larger models, as listed in Table 8 (Appendix B). Since the pretraining of decoder-only LLMs usually does not see padding tokens, we adopt a batch size of 1 for them. Following prior work (Li et al., 2022a,b), all our hyperparameters are tuned on (German, French), a randomly selected language pair.
For 'BLI-oriented' fine-tuning, we use the XLING data combining all language pairs, and the batch size is 16 for XGLM 1.7B and 32 for mT5 base,large and XGLM 564M .
Baselines. We adopt the following two SotA CLWE-based approaches as our baselines; both are open-source. We follow their originally suggested hyperparameter choices for the 5K (supervised), 1K (semi-supervised), and unsupervised BLI setups, respectively, and we re-verify that the recommended hyperparameters are (near-)optimal. Cross-domain Similarity Local Scaling (CSLS) retrieval (Lample et al., 2018) is adopted as recommended by the baselines (a short sketch of CSLS is given after the baseline descriptions below).
• VECMAP (Artetxe et al., 2018a) is one of the most representative BLI approaches based on static CLWEs. It induces fastText-based CLWEs in various BLI supervision setups and is notable for its effective self-learning mechanism, especially in weakly supervised and unsupervised BLI setups.
• CONTRASTIVEBLI (Li et al., 2022a) refines CLWEs with a two-stage contrastive learning procedure and reports the currently highest CLWE-based BLI scores on XLING and PanLex-BLI in the 5K and 1K BLI setups.
We adopt its strongest CLWEs, derived with both fastText and mBERT (Devlin et al., 2019). CONTRASTIVEBLI does not support unsupervised BLI.

BLI Evaluation. Following previous work, we report the standard Precision@1 (P@1) scores both for our methods and for the baselines. P@1 is the most authoritative metric for BLI; other measures such as P@5 and Mean Reciprocal Rank (MRR) show similar trends (Lample et al., 2018; Li et al., 2022a).
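For reference, a minimal numpy sketch of CSLS retrieval as used by the baselines is given below, following the formulation of Lample et al. (2018); the matrix names are ours, and k = 10 is the commonly used neighbourhood size:

```python
import numpy as np

def csls_scores(X, Y, k=10):
    """CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y), where r_T(x) and r_S(y) are the mean
    cosine similarities of x and y to their k nearest neighbours in the other space.
    X: (n_src, d) source CLWEs; Y: (n_tgt, d) target CLWEs; rows are L2-normalised."""
    cos = X @ Y.T                                        # pairwise cosine similarities
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)    # r_T(x): mean sim. of x's k NNs in Y
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)    # r_S(y): mean sim. of y's k NNs in X
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# BLI retrieval: for each source word, pick the target word with the highest CSLS score.
# predictions = csls_scores(X, Y).argmax(axis=1)
```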

Results and Discussion

Main Results
Comparison between mLLMs. We compare the average BLI scores over the 20 XLING BLI directions for all 18 models from our 5 LLM families in Figure 1. In all (5K, 1K, and zero-shot) BLI setups, the same general trends are observed. 1) As expected, within the same mLLM family, larger models usually present stronger BLI capabilities, although exceptions exist (e.g., XGLM 7.5B underperforms XGLM 4.5B). 2) For encoder-decoder models, we find that mT5 outperforms mT0, showing that the instruction fine-tuning of mT0 does not benefit BLI in our experimental setups. 3) LLaMA models achieve the strongest BLI performance among our 5 model families.
XLING: Main Results. In Table 1, we report the BLI performance of the strongest single model from each LLM family (full results for each of the 18 LLMs are available in Appendix D). Our results on the 20 XLING BLI directions reaffirm the leading position of LLaMA, where the LLaMA 13B variant achieves the highest overall average BLI scores in all BLI setups, also outperforming CONTRASTIVEBLI, the previous CLWE-based SotA on the same dataset. We speculate that LLMs are particularly adept at few-shot learning: LLaMA 13B outperforms VECMAP by circa 10 P@1 points in the few-shot setups, but only by about 2 points in the zero-shot setup. It is also worth mentioning that in the 5K and 1K setups, mT5 xxl and XGLM 4.5B beat VECMAP although they underperform CONTRASTIVEBLI; however, in the zero-shot setup they still cannot match VECMAP.

PanLex-BLI: Main Results. We present our results on lower-resource languages from PanLex-BLI in Table 2. We again provide only the strongest model of each LLM family, while the full results for all 18 LLMs are available in Appendix D. This time, LLaMA 13B still outperforms the other LLMs, but for XGLM we report XGLM 7.5B, which here beats XGLM 4.5B. Unlike for XLING, we find that traditional CLWE-based approaches still outperform LLM-elicited BLI in general. This may reveal that current SotA mLLMs (size ≤ 13B) still lack strong word translation capabilities for a large number of languages and language pairs, even for those they currently cover.
Put simply, while current mLLMs do exhibit strong performance for arguably high-resource languages (from XLING), they still have deficiencies with lower-resource languages as well as limited portability to the much larger number of languages currently covered by more traditional BLI approaches (Li et al., 2022a). We leave the investigation of larger mLLMs (e.g., LLaMA 30B) for BLI with lower-resource languages and languages unseen by the mLLMs to future work.
Statistical Significance. We conduct χ² tests comparing LLaMA 13B against the strongest single baseline in each BLI setup (i.e., CONTRASTIVEBLI in the few-shot setups and VECMAP in the zero-shot setup) on the average BLI performance over the 20 XLING and 6 PanLex-BLI directions, respectively, and we estimate the p-values as follows. 1) On XLING, p is 2.8e-23 in the 5K setup, 8.5e-32 in the 1K setup, and 4.3e-9 in the zero-shot setup. 2) For PanLex-BLI, p is 1e-4 in the 1K setup and 1.9e-35 in the zero-shot setup. The p-values show that our main findings are clearly statistically significant (by convention, p < 0.05: statistically significant; p < 1e-3: statistically highly significant).
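To illustrate how such a test can be run, below is a small sketch with scipy. We assume, as a simplification, that the test is built from a 2×2 contingency table of correct vs. incorrect predictions for the two systems; the counts are purely illustrative and are not taken from our results:

```python
from scipy.stats import chi2_contingency

# Rows: systems (e.g., LLaMA 13B vs. the strongest baseline); columns: numbers of
# correct and incorrect predictions over the evaluated BLI directions (made-up counts).
table = [[27480, 12520],   # hypothetical system A: correct, incorrect
         [23680, 16320]]   # hypothetical system B: correct, incorrect
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}")
```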

Further Analyses
n-Shot Prompting. To better understand the influence of the number of in-context examples, we pick mT5 large (an encoder-decoder LLM) and LLaMA 13B (a decoder-only LLM) and run experiments ranging from 0-shot to 10-shot. Figure 2 depicts their average BLI scores on the 20 XLING BLI directions in the 5K and 1K setups, respectively. The results clearly demonstrate the usefulness of in-context learning. Even with only one in-context example (one-shot), the same model variant outperforms its zero-shot results by ∼5 P@1 points. However, with higher values of n (i.e., n ≥ 5), the gains become saturated.

Ablation Study. Table 3 provides ablation results, averaged over the 20 XLING BLI directions, for only the best LLM from each LLM family; full results for all LLMs are available in Appendix D. The results demonstrate that the nearest neighbour-based 'NN (*K)' scores of every LLM outperform the corresponding scores obtained with random in-context examples, validating the usefulness of retrieving in-context examples from nearest neighbours.

BLI-Oriented Fine-Tuning. The fine-tuning experiments, due to computational constraints, are conducted on four relatively smaller-scale LLMs. Table 5 reports each model's average performance on the 20 XLING BLI directions before and after fine-tuning. We run the fine-tuning experiments three times with different random seeds and report both mean scores and standard deviations. We only observe salient gains for the mT5 models and XGLM 564M in the 5K setup. For XGLM 1.7B in both setups, and for all models in the 1K setup, the gains are smaller or even non-existent. Even in the 5K setup, the tuned mT5 base still cannot match the off-the-shelf mT5 large, and the tuned mT5 large underperforms mT5 xl. A natural next step for future work is parameter-efficient specialisation, e.g., via prompt tuning (Lester et al., 2021), adapters (Li et al., 2020, 2023), or LoRA (Hu et al., 2022).
Templates. We now additionally provide some preliminary findings from our template search. 1) Models from the same mLLM family tend to prefer the same template. For example, Table 12 (Appendix C) shows that all five XGLM models prefer the same best zero-shot template and four of them share one best few-shot template. This phenomenon is, to some extent, also seen for mT5 (zero-shot and few-shot), mT0 (zero-shot and few-shot), and LLaMA (few-shot). This is likely due to the same training data, training strategy, and model architecture being adopted for all models within the same LLM family. 2) As already mentioned in §3.1, mT0 is compatible with both mask-filling-style and GPT-style templates: Table 12 shows that some mT0 models prefer templates with '<mask>' and others do not. 3) Among the GPT-style templates, decoder-only models all prefer sentence-completion templates, while some of the instruction-tuned mT0 models prefer questions ending with '?'.

Further Discussion
Few-Shot Learning for BLI. Few-shot prompting with in-context examples yields consistent gains over zero-shot prompting (cf. Tables 13 and 14). However, we again note that this might hold only for high-resource languages such as the ones covered in XLING.
Impact Statement. Here, we discuss the potential impact of our study on the following two aspects.

Conclusion
This paper presents the first study on bilingual lexicon induction (BLI) with multilingual text-to-text large language models (mLLMs). We develop the methodology to prompt mLLMs for BLI, conduct extensive template search, and systematically experiment with 5 representative mLLM families (18 models) on a variety of zero-shot and few-shot BLI tasks. Relying on off-the-shelf mLLMs, our experiments on the standard XLING dataset show strong performance in all BLI setups, where our proposed few-shot prompting with in-context examples from nearest neighbours outperforms the strongest CLWE-based SotA by a considerable margin. However, our study also points out that prompting-based methods still need to be successfully extended to lower-resource languages. Finally, we conduct a series of in-depth analyses covering variants of our few-shot prompting and preliminary investigations of BLI-oriented fine-tuning. We hope that our key findings and comprehensive analyses pave the way for the development of stronger mLLM-based BLI systems in the future.

Limitations
First, most recently released state-of-the-art mLLMs are still unable to support as many languages as static word embeddings, which currently limits their wider portability. For instance, LLaMA supports 20 languages and XGLM supports 30 languages, while fastText provides pretrained static WEs for 294 languages that can be used for the induction of static CLWEs. Intuitively, this is because training LLMs that support more languages would require higher computational costs (with more training data and typically larger model sizes).
We hope that researchers in the future can pretrain and release mLLMs that support a larger set of linguistically diverse languages, which could extend the success of our approach to more languages and language families.

Second, our work did not investigate open-source LLMs with more than 13B parameters, due to the large number of experiments conducted combined with our limited computing resources, and we did not evaluate any closed-source LLMs. Quite a few tech companies and AI research labs have been training LLMs with 100+B and even 500+B parameters. We encourage interested readers who have access to adequate computing resources or specific closed-source LLMs to take a step further and investigate whether larger LLMs can provide even stronger BLI performance than reported in this particular work, following the recipe presented here.
Third, as also discussed in other BLI work (Li et al., 2022b), existing BLI datasets do not control for synonymy and polysemy to a sufficient level of detail. In fact, when constructing BLI datasets, it is very difficult to collect all correct translations for each source word. Therefore, one limitation of BLI evaluation is that it cannot give credit to correct answers that are not included in the ground-truth translation set, and evaluation is typically conducted out of context. Constructing finer-grained BLI datasets with the help of qualified annotators (e.g., linguists, typologists, and bilingual speakers) is beyond the scope of this work.

B Reproducibility Checklist
• BLI Data: We adopt two publicly available BLI datasets.

• Static Word Embeddings: Following the datasets' own recommendations and other previous work, we use the XLING-preprocessed fastText WEs trained on Wikipedia for the XLING data and fastText WEs trained on Common Crawl + Wikipedia for PanLex-BLI; the WEs are trimmed to the most frequent 200K words for each language. For fair comparison, we use the same set of fastText WEs both for the retrieval of nearest neighbours (in-context examples) and for the CLWE-based baselines.
• Pretrained LLMs and Parameter Counts: All the LLMs used in our experiments are publicly available from the huggingface.co model hub. We summarise their model identifiers and model sizes in Table 7. Please refer to each LLM's own copyright and licence before downloading, using, fine-tuning, or redistributing any LLM.
• Computing Infrastructure: We have run our code on Wilkes3, a GPU cluster hosted by Research Computing Services at the University of Cambridge, where each run leverages a single Nvidia 80GB A100 GPU and 32 CPU cores.
• Runtime (Wall Time): We report the average inference time on a single BLI direction (i.e., circa 2,000 word pairs in an XLING test set; the time required for loading the LLM and the dataset is not included) for each LLM in Table 8. The per-epoch training time for BLI-oriented fine-tuning is provided in Table 9.
• Significance: We discuss the significance of our main results and ablation results in the last paragraph of §5.1 and in Table 4, respectively, which demonstrate that our findings are statistically significant.
• Randomness: Our main experiments are completely deterministic since we rely on off-the-shelf LLMs without any fine-tuning, nearest neighbour retrieval for in-context examples (a deterministic retrieval algorithm), and the deterministic beam search.The randomness only exists in two parts of our side analysis.First, we use random in-context examples in our ablation study, and we verify our findings with statistical tests in Table 4. Second, the fine-tuning experiments do have randomness, and we run fine-tuning three times for each model, reporting both average BLI performance and the standard deviation.
• Carbon Footprint: All the experiments involved in this project, including hyperparameter tuning, template search, BLI inference, and BLI-oriented fine-tuning of our LLMs, consume circa 1,650 A100 GPU hours. Based on a publicly available 'machine learning emissions calculator' (Luccioni et al., 2019) and our computational infrastructure, we estimate that our work causes the emission of circa 200 kg of CO2 equivalents.

C Templates
We summarise all our zero-shot templates in Table 10 and all our few-shot templates in Table 11.

D Full BLI Results
Here we present our full results on both XLING and PanLex-BLI. Tables 13, 14, and 16 report our results on all 56 XLING BLI directions in the 5K, 1K, and zero-shot (unsupervised) BLI setups, respectively. Tables 15 and 17 report results for the PanLex-BLI lower-resource languages (6 BLI directions) in the 1K and zero-shot (unsupervised) BLI setups. Note that an (m)LLM usually cannot support every language, and we use '-' to denote this scenario. Throughout this paper, the expression 'a language is not supported by an LLM' means that the language is not used for pretraining the LLM, even if the LLM's tokeniser may still be able to tokenise many input sentences in that language. Table 18 shows the full ablation results for each of our 18 mLLMs.

E Translation Examples
To illustrate how few-shot learning improves BLI, we present some of our BLI results with LLaMA 13B in Table 19 comparing five-shot and zero-shot prompting on HR→EN and IT→EN BLI test sets.

Table 1: Main results on 20 XLING BLI directions in 5K, 1K, and zero-shot (unsupervised) setups. Off-the-shelf mLLMs are used without any fine-tuning. Average P@1×100% scores of each language going to and from the other 4 languages are reported. '-': CONTRASTIVEBLI does not support unsupervised BLI.

Table 2: Main results on 6 PanLex-BLI BLI directions in 1K and zero-shot (unsupervised) setups. Off-the-shelf mLLMs are used without any fine-tuning. P@1×100% scores are reported. '-': 1) a language is not supported by the LLM; 2) CONTRASTIVEBLI does not support unsupervised BLI.

Figure 2: BLI scores averaged over 20 BLI directions from XLING with respect to the number of in-context examples n (0 to 10), with mT5 large and LLaMA 13B in both 5K and 1K BLI setups.

Table 3: Ablation results. Averaged BLI scores (P@1×100%) on 20 XLING BLI directions. Rows 1-2: 5-shot prompting with in-context examples extracted from nearest neighbours (NN) in D_S of size 5K and 1K. Rows 3-4: 5-shot prompting with random in-context examples from D_S of size 5K and 1K. Row 5: zero-shot prompting without any in-context examples.

Table 4: Statistical significance associated with Table 3. We conduct χ² tests and report p-values.

Table 18: Full ablation results for each of our 18 mLLMs.

Our main results demonstrate that few-shot learning derives consistent gains over zero-shot prompting. For instance, HR→EN and IT→EN saw 345 and 272 cases in their test sets, respectively, where few-shot learning makes the correct prediction but zero-shot learning fails (positive cases). There are only 85 and 87 cases where zero-shot prompting beats few-shot prompting (negative cases). We present 8 positive examples and 4 negative examples for each of HR→EN and IT→EN, comparing five-shot (5K setup) and zero-shot results with LLaMA 13B, in Table 19 (Appendix E). For instance, 'gušter (HR) → lizard (EN)' and 'sezam (HR) → sesame (EN)' are two positive cases; their in-context examples are five different animal names and five plant names, which may help LLaMA 13B narrow down the scope of the target word to animal and plant names, respectively. Similarly, 'valcer (HR) → waltz (EN)' (a positive case) is associated with five in-context examples related to either music or dance. However, few-shot learning does not always help. For example, for 'eventuale (IT) → eventual (EN)' and 'scopre (IT) → discovers (EN)', the LLM seems to make a mistake by directly copying one of the words provided in the in-context examples, whereas zero-shot prompting predicts the correct answers.

1) On future BLI research. Our work minimises the technical gap between BLI and prompt-based learning and opens up new possibilities for BLI research. In fact, LLM prompting provides a generic and straightforward way of leveraging external knowledge for BLI. While we have demonstrated the effectiveness of in-context word translation examples, external information such as word definitions, parts of speech, spelling, and sentence translation pairs can also be integrated into text prompts. 2) On NMT and other related fields. Recent work has incorporated word translation pairs into text templates to prompt LLMs for sentence-level neural machine translation (NMT) and demonstrates that the bilingual lexical 'hints' lead to significant gains in NMT (Ghazvininejad et al., 2023; Jones et al., 2023). While a ground-truth bilingual dictionary can be leveraged, BLI is able to provide word translations for language pairs and words not covered in existing bilingual lexica. Our work can thus provide strong word translation pairs for lexicon-enhanced MT, and the improved MT may further benefit, e.g., the field of cross-lingual transfer learning via TRANSLATE-TRAIN/TEST approaches (Conneau et al., 2018; Li et al., 2023).

Table 6: Languages covered in our experiments with their ISO 639-1 codes and the mLLM families that support each language, categorised by language family. IE = Indo-European.

Table 7: LLMs used in our experiments with their huggingface.co model IDs and model sizes.

Table 8: Inference time (in seconds) of each LLM with 0-shot and 5-shot prompts, respectively.

Table 9: Per-epoch training time (in minutes) of each LLM with 5-shot prompts in the 5K and 1K setups, respectively.

Table 11: Our 36 templates for few-shot prompting. For simplicity, we present only two in-context examples in each template. These include 12 mask-filling-style templates (template IDs: 67∼78) and 24 GPT-style templates (template IDs: 79∼102). In our experiments, the '<mask>' token is '<extra_id_0>' for mT5 and mT0.

Table 12: The best template for each LLM, respectively for zero-shot and few-shot prompting. For simplicity, we present only two in-context examples in each template in the few-shot setup.

Table 13: Full 5-shot BLI results (P@1×100%) on 56 XLING BLI directions with 5K seed translation pairs. Off-the-shelf LLMs are used without any fine-tuning. '-': a language in the pair is not supported by the LLM.

Table 14: Full 5-shot BLI results (P@1×100%) on 56 XLING BLI directions with 1K seed translation pairs. Off-the-shelf LLMs are used without any fine-tuning. '-': a language in the pair is not supported by the LLM.

Table 15: 5-shot BLI results (P@1×100%) on PanLex-BLI with 1K seed translation pairs. Off-the-shelf LLMs are used without any fine-tuning. '-': a language in the pair is not supported by the LLM.

Table 19: Translation examples on HR→EN and IT→EN. We include ground truth translation pairs and show the predictions derived from zero-shot and five-shot prompting with LLaMA 13B.