Towards Effective Disambiguation for Machine Translation with Large Language Models

Resolving semantic ambiguity has long been recognised as a central challenge in the field of Machine Translation. Recent work on benchmarking translation performance on ambiguous sentences has exposed the limitations of conventional Neural Machine Translation (NMT) systems, which fail to handle many such cases. Large language models (LLMs) have emerged as a promising alternative, demonstrating comparable performance to traditional NMT models while introducing new paradigms for controlling the target outputs. In this paper, we study the capabilities of LLMs to translate “ambiguous sentences” - i.e. those containing highly polysemous words and/or rare word senses. We also propose two ways to improve their disambiguation capabilities, through a) in-context learning and b) fine-tuning on carefully curated ambiguous datasets. Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions. Our research provides valuable insights into effectively adapting LLMs to become better disambiguators during Machine Translation. We release our curated disambiguation corpora and resources at https://data.statmt.org/ambiguous-europarl.


Introduction
While the field of NMT has advanced rapidly in recent times, the disambiguation and translation of ambiguous words remain an open challenge. Notably, Campolungo et al. (2022) created a benchmark named DiBiMT to study the behaviour of state-of-the-art (SOTA) NMT systems when translating sentences with ambiguous words (leaderboard: https://nlp.uniroma1.it/dibimt/public/leaderboard). They reported that even the best-performing commercial NMT systems yielded accurate translations only 50-60% of the time (subsequent iterations of these commercial models have improved, but large margins still remain), while other open-source multilingual models like mBART50 (Tang et al., 2021) and M2M100 (Fan et al., 2021) performed much worse. This was found to be due to biases against rare and polysemous word senses inherited during pretraining. Table 1 shows an example from the DiBiMT benchmark where DeepL mistranslates an ambiguous word while BLOOMZ resolves the word to its correct in-context meaning.

Source: The horse had a blaze between its eyes.
DeepL: 那匹马的两眼之间有一团火焰。 (There is a flame between the horse's eyes.)
BLOOMZ 176B: 这匹马的眼睛之间有一道白线。 (There is a white line between the horse's eyes.)

Table 1: An example of English-to-Chinese translation involving the ambiguous term "blaze". For BLOOMZ, we use 1-shot prompting to obtain the translation.
In this paper, we explore whether LLMs can perform better in scenarios where such ambiguous texts need to be translated. The motivation is that while NMT models can learn biases from noisy or narrow-domain parallel data, hurting their ability to detect and translate rare word senses, LLMs can be pretrained on a much wider variety of monolingual text, though they may also prefer fluency over accuracy. LLMs have shown many emergent abilities due to scale (Brown et al., 2020; Chowdhery et al., 2022; Wei et al., 2022a) and, moreover, have demonstrated great potential for Machine Translation (MT) (Vilar et al., 2023; Zhang et al., 2023).
We comprehensively examine how these trends extend to the specific task of translating ambiguous sentences. We select a diverse set of foundation and instruction-tuned LLMs, of different sizes and with varying combinations of languages in the pretraining data. We then compare how the LLMs match up against several widely used NMT models on the DiBiMT test set, which covers translation from English into five languages: Spanish, Italian, German, Russian, and Chinese. We find that, with only 1-shot in-context learning (Brown et al., 2020), LLMs (in particular, BLOOMZ 176B and LLaMA 65B) outperform Google Translate and achieve similar if not better performance compared to the top MT system (DeepL), setting a new SOTA in two of the five languages we tested. Furthermore, we propose two methods for adapting LLMs to ambiguous translation: 1) in-context learning with sentences containing the same word sense, and 2) fine-tuning on curated ambiguous parallel corpora. We show that these methods are highly effective and can further improve performance by up to 15 points in DiBiMT accuracy in the best case.
Our work thus makes three key contributions:
1. We evaluate the performance of LLMs against top-performing NMT systems on the challenging task of translating ambiguous sentences. We report SOTA scores on two of the five languages tested, and comparable performance otherwise.
2. We show that in-context learning with similar sentences and targeted disambiguation fine-tuning surpass the performance of naive prompting of LLMs.
3. We conclude our work by evaluating LLMs on the FLORES-200 test sets, confirming that trends in disambiguation accuracy correlate strongly with overall MT quality.

Ambiguity in machine translation
Resolving ambiguity in the source sentence was historically framed as one of the most fundamental challenges in MT (Weaver, 1952). In an effort to address this challenge, traditional works integrating Word Sense Disambiguation (WSD) into Statistical Machine Translation (Carpuat and Wu, 2007; Chan et al., 2007) were followed by those integrating it into NMT architectures in various ad-hoc ways (Choi et al., 2017; Liu et al., 2018; Pu et al., 2018). Later, with the introduction of the Transformer (Vaswani et al., 2017), it was shown that higher-layer encoder representations are robust enough to handle disambiguation (Tang et al., 2019) without any explicit handling of word senses. However, more recent research creating challenging evaluation benchmarks has called the purported abilities of NMT systems into question once again. Following the proposal of the MuCoW benchmark for testing WMT19 (Raganato et al., 2019) and WMT20 (Scherrer et al., 2020) systems, Raganato et al. (2020) showed how Transformer-based NMT models, in general, underperform when translating rare word senses. Campolungo et al. (2022) experimented with SOTA commercial systems (Google Translate, DeepL) and open-source systems (mBART50, M2M100, OPUS-NMT (Tiedemann and Thottingal, 2020), etc.) and arrived at the same conclusion when they proposed the DiBiMT benchmark for evaluating MT systems between English and five languages (Spanish, Italian, German, Russian, and Chinese). They found similar biases against low-frequency and highly polysemous word senses. They also noted that the accuracies of these systems were much lower than the then-SOTA WSD system, ESCHER (Barba et al., 2021), indicating significant room for improvement. In this work, we explore whether foundation and instruction-tuned LLMs can bridge this gap with minimal supervision (i.e. few-shot prompting).

LLMs and translation
Previous research has found that LLMs can perform machine translation without being specifically fine-tuned (Radford et al., 2019). To elicit a translation, research in this direction follows the paradigm of LLM prompting:
1. Zero-shot prompting, where an LLM is directly asked to translate a source input into the target language (Radford et al., 2019).
2. Few-shot prompting, also called in-context learning, where an LLM is supplied with demonstrations of input-output pairs from the same task it is performing, before being queried with a new input (Brown et al., 2020).
3. Chain-of-thought (CoT) prompting, where an LLM is first prompted to reason about the input and gather relevant knowledge before responding to a specific task (Wei et al., 2022b; Kojima et al., 2022).
Besides such training-free approaches, another route is instruction tuning, which optimizes an LLM on a mixed range of downstream tasks and fine-tunes the model to understand and respond to user intentions expressed in natural language (Wei et al., 2021).
It has been observed that LLMs might not surpass Transformer models trained solely to translate, especially for non-English and low-resource translation directions (Vilar et al., 2023; Hendy et al., 2023). Nevertheless, LLMs have been shown to achieve superiority in tasks requiring in-depth understanding and manipulation of text, primarily due to pretraining on very large corpora. For example, without fine-tuning, LLMs are good at adapting to word alignments (Moslem et al., 2023), translation evaluation (Kocmi and Federmann, 2023), idiom translation (Raunak et al., 2023), iterative refinement (Chen et al., 2023), and interactive translation via CoT (Pilault et al., 2023; He et al., 2023). Most related to our work is Pilault et al. (2023)'s proposal of using interactive question answering as a CoT process for LLMs to disambiguate source words. While that is an interesting approach, we aim to generate translations in a single pass by leveraging SOTA WSD systems to provide contexts that guide LLMs to disambiguate better.

Preliminaries
A word sense is a concept in a Knowledge Base (in this work, BabelNet (Navigli et al., 2021)) that denotes a distinct meaning of a word in the context of a sentence. The polysemy degree of an ambiguous word is the total count of all possible senses that the word can have. The sense frequency is the occurrence count of a particular sense in a disambiguated training corpus.
In this work, we define an ambiguous word as a polysemous term with multiple possible, and likely related, meanings, with the correct sense inferable only from the sentence-level context. For brevity and ease of explanation, we refer to a sentence containing an ambiguous word as an "ambiguous sentence". By design, the DiBiMT test set (Campolungo et al., 2022) contains exactly one ambiguous word per sentence.
Word Sense Disambiguation (WSD) is the task of linking an ambiguous word in a sentence to its appropriate word sense in the Knowledge Base. In this work we use ESCHER-WSD (Barba et al., 2021), a high-performing WSD system that achieved the SOTA for English.
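The two corpus statistics defined above can be estimated directly from a disambiguated corpus. Below is a minimal sketch, assuming sentences are represented as (word, sense ID) pairs produced by a WSD system; the data and sense IDs are toy placeholders, and the polysemy degree here is the corpus-observed count of distinct senses rather than the full BabelNet inventory.

```python
from collections import Counter, defaultdict

# Toy disambiguated corpus: each sentence is a list of (word, sense_id) pairs.
corpus = [
    [("blaze", "bn:fire"), ("horse", "bn:animal")],
    [("blaze", "bn:marking"), ("eyes", "bn:organ")],
    [("blaze", "bn:fire")],
]

sense_freq = Counter()            # sense -> occurrence count in the corpus
word_senses = defaultdict(set)    # word  -> distinct senses observed

for sentence in corpus:
    for word, sense in sentence:
        sense_freq[sense] += 1
        word_senses[word].add(sense)

# Corpus-estimated polysemy degree: number of distinct senses per word.
polysemy_degree = {w: len(s) for w, s in word_senses.items()}
print(polysemy_degree["blaze"])   # 2 (senses "bn:fire" and "bn:marking")
print(sense_freq["bn:fire"])      # 2 (occurs in sentences 1 and 3)
```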

k-shot prompting
Given a test sentence X and a Large Language Model to prompt for translations, we construct a query with k demonstrations, i.e. parallel sentence pairs {(X_1, Y_1), (X_2, Y_2), ..., (X_k, Y_k)}, as examples, followed by the test sentence. As shown in Figure 1, for foundation LLMs we frame the prompt as a text-completion task, while for instruction-tuned LLMs (like BLOOMZ) we structure the last phrase as a question, in order to conform to the latter's question-answering format. In the naive setting, we choose our demonstrations randomly from the development set.
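A minimal sketch of this prompt construction follows; the exact wording of the template is our assumption, since Figure 1 is not reproduced here.

```python
def build_kshot_prompt(demos, source, src_lang="English", tgt_lang="Spanish",
                       instruction_style=False):
    """Assemble a k-shot translation prompt from parallel demonstrations.

    Foundation LLMs get a plain completion-style prompt ending in an
    empty target line; instruction-tuned LLMs get a final question
    instead. The template text itself is illustrative.
    """
    lines = []
    for src, tgt in demos:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    if instruction_style:
        lines.append(f"What is the {tgt_lang} translation of: {source}")
    else:
        lines.append(f"{src_lang}: {source}")
        lines.append(f"{tgt_lang}:")
    return "\n".join(lines)
```

For 0-shot prompting, `demos` is simply an empty list.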

In-Context Learning with similar ambiguous contexts
LLMs can effectively acquire knowledge relevant to the test domain through prompting, a process known as in-context learning (ICL). We leverage ICL to help LLMs ingest information on the translation of ambiguous sentences by providing related sense translations as examples in the prompt. To achieve this, we first identify the most polysemous word in the test sentence by disambiguating it with a WSD system, and then calculate the polysemy degree of all disambiguated senses with respect to a large development set. We choose the most polysemous word sense and search for other occurrences of the same sense in the same development set. Finally, we randomly sample k source-target pairs containing that sense to use as demonstrations in k-shot prompting, instead of using random pairs. This technique returned enough examples for our purposes in most cases: for 5-shot prompting, given a corpus of 1.8M sentences, we found all 5 matches 92.5% of the time.
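The selection procedure above can be sketched as follows. The data structures (`dev_index`, `polysemy_degree`) are hypothetical names for the precomputed artifacts the text describes, not structures specified by the paper.

```python
import random


def pick_similar_demos(test_senses, dev_index, polysemy_degree, k=5, seed=0):
    """Select k demonstrations sharing a word sense with the test sentence.

    test_senses     : sense IDs found in the test sentence by a WSD system
    dev_index       : sense ID -> list of (source, target) dev-set pairs
    polysemy_degree : sense ID -> polysemy degree of its lemma
    """
    rng = random.Random(seed)
    # Target the most polysemous sense present in the test sentence.
    target = max(test_senses, key=lambda s: polysemy_degree.get(s, 0))
    pool = dev_index.get(target, [])
    if len(pool) >= k:
        return rng.sample(pool, k)
    return pool  # fall back to fewer demonstrations when matches are scarce
```

The fallback branch reflects the paper's observation that all k matches are found only most (92.5%) of the time.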

Low-rank fine-tuning
Apart from injecting related examples through prompting, another conventional approach is to optimize the model parameters for disambiguation, in a domain-adaptation fashion. Considering the computational cost, we experiment with instruction fine-tuning via low-rank adaptation (LoRA). This technique appends trainable low-rank decomposition matrices to the large weight matrices of an LLM, which remain frozen during fine-tuning (Hu et al., 2021). By sacrificing a little performance, this method achieves great parameter efficiency. We aim to adapt LLMs to perform the translation task specifically. To maximise an LLM's capability to disambiguate while translating, we follow a careful data selection procedure to identify the most ambiguous sentences in our corpus.
Given the size of LLMs, it would be infeasible to fine-tune them on a large parallel corpus, so we curate a smaller dataset suited to the ambiguous translation task. We want a balanced mix of sentences with highly polysemous words as well as sentences with rare senses of a given word. This ensures that fine-tuning reduces both the polysemy-degree-related and sense-frequency-related biases discovered by Campolungo et al. (2022), and consequently maximises disambiguation performance. We therefore sort our corpora in two ways: first, by the maximum polysemy degree (greatest first) and second, by the minimum sense frequency (rarest first) of all word senses in a given sentence, disambiguated with ESCHER-WSD. We take the top N/2 sentences from each list and interleave them to create our final fine-tuning corpus of size N.
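The two-way sort-and-interleave procedure might look like the following sketch; the dictionary names are illustrative, and the deduplication step is our assumption for handling sentences that rank highly on both criteria.

```python
def curate_ambiguous_corpus(sentences, max_polysemy, min_sense_freq, n):
    """Build a fine-tuning corpus of up to n sentences balancing two biases.

    max_polysemy[s]   : highest polysemy degree among senses in sentence s
    min_sense_freq[s] : lowest sense frequency among senses in sentence s
    (Both would come from a WSD pass, e.g. with ESCHER-WSD.)
    """
    by_polysemy = sorted(sentences, key=lambda s: -max_polysemy[s])
    by_rarity = sorted(sentences, key=lambda s: min_sense_freq[s])
    corpus, seen = [], set()
    # Interleave the two rankings, skipping duplicates, until n sentences.
    for a, b in zip(by_polysemy, by_rarity):
        for s in (a, b):
            if s not in seen:
                seen.add(s)
                corpus.append(s)
            if len(corpus) == n:
                return corpus
    return corpus
```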
Once the data is chosen, we follow the fine-tuning paradigm of Alpaca (Taori et al., 2023).

We structure our experiments around three research questions:
1. RQ1: How do LLMs compare against SOTA NMT systems when translating ambiguous sentences? (Section 4.3)
2. RQ2: What techniques improve LLM performance over few-shot prompting on this task? (Section 4.4)
3. RQ3: How do these disambiguation-adapted LLMs fare in terms of overall translation quality? (Section 4.5)

Models
To ensure reproducibility, we pick four well-known and high-performing open-source LLMs, of which we sample seven versions for experimentation:
• BLOOM (Scao et al., 2022): A fully open-source, multilingual foundation LLM that supports 46 languages. To establish the range of its capabilities, we explore both the smallest (7.1B) and the largest (176B) versions.
• BLOOMZ (Muennighoff et al., 2023): The instruction-tuned variant of BLOOM, fine-tuned on the multilingual xP3 dataset. We again evaluate the 7.1B and 176B versions.
• LLaMA (Touvron et al., 2023a): The popular LLM trained by Meta AI, on gigantic datasets ranging up to 1.5T tokens. We evaluate the smallest (7B) and the largest (65B) versions.
• Alpaca (Taori et al., 2023): LLaMA 7B instruction-tuned on an English-only instruction-following dataset.
To effectively position these open-source LLMs against traditional NMT systems, we compare them with the best-performing and most widely used commercial and open-source models:
1. DeepL Translator: a SOTA commercial NMT system (accessed on 24th July 2023).
2. Google Translate: probably the most widely used commercial NMT system (accessed on 24th July 2023).
3. OPUS-NMT (Tiedemann and Thottingal, 2020): open-source bilingual Transformer models trained on the OPUS parallel data collection.
4. mBART50 (Tang et al., 2021): multilingual NMT models pretrained on monolingual corpora from 50 languages, and fine-tuned on the translation task. We report the performance of both the English-to-many and many-to-many fine-tuned models.
5. M2M100 (Fan et al., 2021): a massive multilingual NMT model trained on 2200 language directions to support many-to-many translation among 100 languages in total. We compare both the base (418M) and the large (1.2B) versions.
6. NLLB-200 (NLLB Team et al., 2022): the current SOTA for many low-resource pairs, scaling to 200 languages. We experiment with all its variants: the largest is a mixture-of-experts (MoE) model with 54B parameters, and we also benchmark the smaller checkpoints at 1.3B and 3.3B, as well as distilled versions at 0.6B and 1.3B.
We take the results for mBART50, M2M100, and OPUS directly from the DiBiMT leaderboard. We use Hugging Face for accessing and running inference with all other models, except for Google Translate and DeepL, which are accessed via their respective APIs. Despite their presence on the leaderboard, we re-evaluate these commercial systems since they are constantly updated.

System                     En-Es    En-It
Similar-contexts dev. set  1.81M    1.73M
Fine-tuning corpus         100K     100K

Experimental settings
Datasets. In this study, we use the DiBiMT test set for evaluation and measure accuracy across all five language directions: English to Spanish, Italian, Chinese, Russian, and German.
For validation, we use the development set of FLORES-200 (NLLB Team et al., 2022) in our default setting. To search for similar ambiguous contexts (Section 3.3), we require a larger development set, both to find relevant examples and to accurately estimate polysemy degree. Hence, we use the Europarl corpus (Koehn, 2005), disambiguated with ESCHER-WSD. We use the same disambiguated corpus for fine-tuning; however, we first follow the filtering procedure described in Section 3.4 to create a small corpus of ambiguous sentences. Validation during fine-tuning uses 500 randomly sampled sentences from this corpus, and the rest is used for training. We detail the data statistics for these experiments in Table 2.
LLM prompting setup. Due to memory constraints, and to compare all models fairly, we load LLMs in 8-bit precision and use a batch size of 1. For generation, we set both beam size and temperature to 1. To prevent repetition in the LLM output, we set no_repeat_ngram_size to 4. From the LLM's response, we extract the text before the first newline character as the output translation.
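In Hugging Face transformers terms, the decoding setup above corresponds roughly to the following sketch; model loading is omitted, and `max_new_tokens` is an assumed cap not stated for this setting.

```python
# Keyword arguments matching the generation settings described above,
# in the style accepted by transformers' model.generate().
GEN_KWARGS = dict(
    num_beams=1,              # beam size 1
    temperature=1.0,          # temperature 1
    no_repeat_ngram_size=4,   # block repeated 4-grams
    max_new_tokens=150,       # assumed cap on generation length
)


def postprocess(response: str) -> str:
    """Keep only the text before the first newline as the translation."""
    return response.split("\n", 1)[0].strip()
```

In practice these kwargs would be passed as `model.generate(**inputs, **GEN_KWARGS)` on a checkpoint loaded in 8-bit.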
LoRA fine-tuning. We inject LoRA modules into all query, key, and value matrices. We set the rank to 8, alpha to 8, and dropout to 0.05. For training, we set the effective batch size to 32, the learning rate to 3e-4, and the maximum sequence length to 256. The total training budget is 5 epochs, and we pick the best model checkpoint based on cross-entropy loss on the validation set. The training data is shuffled after every epoch. Inference uses a beam size of 3 and a maximum generation length of 150.
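With the Hugging Face peft library, this configuration might be expressed as below. Note that the `target_modules` names are model-specific assumptions: LLaMA-style checkpoints expose separate q_proj/k_proj/v_proj matrices, whereas BLOOM fuses them into a single query_key_value projection, so the list must be adjusted per model.

```python
# Configuration sketch matching the LoRA hyperparameters above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                    # LoRA rank
    lora_alpha=8,           # scaling factor alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],  # LLaMA-style naming
    task_type="CAUSAL_LM",
)
```

This config is then applied with `peft.get_peft_model(model, lora_config)` before training, leaving the base weights frozen.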

LLMs vs NMT systems on DiBiMT
We show the results of our experiments in Table 3.

Table 3: Accuracies on the DiBiMT test set for well-known NMT systems and LLMs, using naive k-shot prompting. For Alpaca, we can only use 0-shot prompting, since its prompt template permits only an Instruction and an input sentence. We highlight the top 3 scores per language in bold, with the best underlined as well, the 2nd best as is, and the 3rd best italicized. We indicate the unseen-language (i.e. not intentionally included in pretraining) scores for each model with †.

For the purposes of the subsequent discussion, we note that LLaMA was not intentionally trained on Chinese, which is thus an 'unseen' language for it. Similarly, for BLOOM, Chinese and Spanish are 'seen' and the rest are 'unseen'. We share our key observations below:

1. LLMs usually match or beat massive MT models on seen languages.
Except for the very rich-resourced En-De, where supervised MT systems appear to have an edge, LLaMA 65B mostly matches the SOTA NMT systems (namely DeepL and NLLB-200). Furthermore, BLOOMZ sets a new SOTA in its seen languages, Spanish and Chinese, outperforming DeepL by margins of 7.3% and 12.2% respectively. These improvements over such strong, supervised, massive NMT systems are particularly remarkable since our corresponding setup for inferencing the LLMs is quite cheap: as noted previously, this is only naive few-shot prompting of an 8-bit quantized model, with a beam size of 1.
2. LLMs perform relatively worse on unseen languages, but can still be much better than some supervised MT models. Relative to its seen languages, LLaMA underperforms in translation to Chinese. Similarly, BLOOM performs worse on its unseen languages of German, Italian, and Russian. Still, the LLMs yield reasonable performance that is much better than some supervised NMT systems: for example, BLOOMZ-7B achieves 40.68% accuracy in English-Italian, which is about 35.9% more than OPUS, 52.8% more than mBART50, and 75% more than M2M100.
3. Scale helps improve performance on ambiguity translation. Continuing from the last point, similar to NMT models that improve with scale (e.g. NLLB-200), we observe that LLMs too perform consistently better at ambiguous translation when scaled up to their larger variants. This applies to translation into both seen and unseen languages. That said, the lighter models, such as LLaMA 7B or BLOOM 7B, also perform quite well; in many cases, 1-shot prompting of these LLMs is almost as good as NLLB translations.
4. LLM performance improves on average with more demonstrations, but not uniformly. On average, we observe that 5-shot prompting works best, followed by 3-shot and then 1-shot, though some outliers exist for LLaMA 7B. Moreover, looking at individual language pairs, the improvement trend is not uniform, and a 3-shot translation can outperform a 5-shot one. This aligns with the finding of Zhang et al. (2023), who reach the same conclusion regarding overall MT quality. Nonetheless, as we show in Section 4.4.1, accuracy does improve significantly when we provide relevant and helpful examples, suggesting that the quality of demonstrations matters more than their quantity.
Interestingly, we observe that 1-shot prompting of a general-purpose instruction-tuned LLM like BLOOMZ often significantly outperforms 5-shot prompting of BLOOM, even on the very specific task of ambiguity translation. For Alpaca 7B, while we are unable to try few-shot prompting due to restrictions in its prompt template, we observe that zero-shot prompting works reasonably well, matching some supervised MT systems.
We note that we could not obtain the same result with foundation LLMs like BLOOM 176B and LLaMA 7B because, as expected, zero-shot prompting of these models yielded hallucinations in many cases.
Lastly, we include a qualitative comparison of DeepL and BLOOMZ 176B translations for the En-Zh pair in Table 4. We observe that while DeepL tends to translate literally, BLOOMZ generates contextual translations more often. However, sometimes both models fail to produce the correct translation (note the last example), which highlights the challenging nature of the benchmark.

Adapting LLMs for ambiguous MT
We now experiment with two proposed strategies to tune LLMs to disambiguate better and improve performance on the ambiguous translation task.

Improving In-Context Learning by leveraging similar ambiguous contexts
Rather than selecting our examples randomly as in the naive setting, we employ the data selection procedure described in Section 3.3 to find examples that contain the same word sense as the most polysemous sense in the test sentence. We report our scores in Table 5, and our findings below:

1. Similar contexts yield larger improvements as the example count increases. For 1-shot prompting, similar contexts perform comparably to, or slightly better than, random examples. However, the gains increase substantially as we move to 3-shot and 5-shot prompting. Intuitively, 1-shot prompting likely just guides the LLM towards generating a reasonable translation, whereas with more relevant examples, the model learns to disambiguate better and translate in context accordingly.
2. Larger models see greater and more consistent gains than smaller LLMs. Compared to LLaMA 7B, the other LLMs (LLaMA 65B, BLOOM 176B, and BLOOMZ 176B) yield much larger accuracy improvements on a more uniform basis. This is probably because scaling up allows LLMs to model polysemous words better in their semantic space, facilitating effective in-context learning of disambiguation capabilities.

Fine-tuning with ambiguous corpora
We fine-tune Alpaca 7B, BLOOM 7B, and BLOOMZ 7B in the En-Es and En-It directions using the data described in Section 4.2. We show the results of prompting these fine-tuned LLMs in Table 6, and make the following observations:

1. Fine-tuning generally improves performance. Fine-tuned LLMs significantly outperform their non-fine-tuned counterparts in most cases. The biggest improvement is observed for BLOOM 7B in En-It, where accuracy increases by as much as 47.73%, indicating the effectiveness of our method. The only exception is when the LLM is already strong, such as BLOOMZ 7B on En-Es, where the improvements are marginal. Even so, BLOOMZ still gains significantly from fine-tuning on the En-It pair, where it was originally weaker due to Italian being an unseen language during pretraining.
2. Accuracy peaks within the first few epochs. We plot the DiBiMT accuracy versus epoch curves in Figure 2, where performance is evaluated after each epoch. In all cases, accuracy peaks between the 1st and the 3rd epoch, after which it mostly plateaus or dips slightly, suggesting that one does not need to fine-tune these LLMs for long.
3. Fine-tuning improves LLM performance up to about 36K-63K training samples. We now address the question of how many training samples are needed to fine-tune these LLMs for optimal performance. We plot accuracy against corpus size in Figure 3, where corpus size is measured in parallel sentences. Accuracy increases non-monotonically with corpus size, peaking anywhere between 36K and 63K training samples, depending on the pre-existing capabilities of the LLM. For a raw foundation LLM like BLOOM 7B, relatively more fine-tuning data (54K-63K) appears beneficial. Alpaca 7B, which was instruction-tuned on an English-only dataset, also benefits from further fine-tuning, especially for En-Es, where accuracy peaks after 63K training samples. However, for a powerful LLM like BLOOMZ, which was instruction-tuned on a large multilingual dataset like xP3 (Muennighoff et al., 2023), fine-tuning on smaller datasets (at most 36K sentences, in our case) appears sufficient.

Overall MT performance of disambiguation-adapted LLMs
Lastly, for completeness, we also evaluate the overall translation quality of the key LLMs used in this work, since we are interested in how well the reported disambiguation accuracies extend to overall MT performance. When choosing our test set, we want to ensure it was released recently (ideally within the last year) to minimise the chance of its inclusion in the pretraining corpora of the LLMs. We therefore choose FLORES-200 (NLLB Team et al., 2022) as our test set, since it satisfies this criterion and supports all our languages of evaluation. We use spBLEU (Goyal et al., 2022), chrF++ (Popović, 2017), and COMET22 (Rei et al., 2022) as our evaluation metrics.
We observe trends similar to those in our DiBiMT experiments. BLOOM 176B performs well when translating seen languages, performing comparably to NLLB-200 in English-Spanish and outperforming it in English-Chinese. This is particularly the case for COMET22 scores, a metric that has shown high correlation with human evaluation, ranking second in the WMT22 Metrics shared task (Freitag et al., 2022). For the other languages, LLaMA 65B usually performs better than BLOOMZ, but in the 1-shot prompting setup it is unable to beat the NLLB-200 54B MoE. It is possible that higher-shot prompting, or any of the disambiguation accuracy-enhancing techniques proposed earlier, would improve overall MT quality.
Rather than verifying this by re-running all our experiments on FLORES-200, we address a broader question: how well does disambiguation accuracy on DiBiMT correlate with standard MT metrics? We conduct Pearson's correlation tests (Benesty et al., 2009) between the accuracy metric and spBLEU, chrF++, and COMET22 respectively. We report our results in Table 8: all MT quality metrics correlate positively with accuracy, with p-values of the two-sided alternative hypothesis well below 0.05 in all cases. We find that spBLEU and COMET22 exhibit higher correlations than chrF++; we hypothesise that the character-level chrF++ is less sensitive to word-level senses. Overall, the results in Table 8 suggest that the significant accuracy improvements noted earlier do not come at the cost of translation quality and could, in turn, improve general MT metric scores too.
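The correlation test itself is straightforward to reproduce with scipy; the scores below are toy placeholders rather than the paper's actual system-level results.

```python
# Pearson correlation between per-system DiBiMT accuracy and an MT
# metric, using scipy.stats.pearsonr (two-sided test by default).
from scipy.stats import pearsonr

dibimt_accuracy = [30.1, 42.5, 48.0, 55.3, 61.2]  # toy system scores
spbleu = [18.4, 24.9, 27.1, 30.6, 33.8]           # toy spBLEU scores

r, p_value = pearsonr(dibimt_accuracy, spbleu)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")
```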

Conclusion
In this work, we studied the capabilities of LLMs to handle ambiguity during machine translation. We chose seven versions of widely used foundation and instruction-tuned LLMs and compared their accuracy with SOTA commercial and open-source NMT systems on the DiBiMT translation benchmark. Out of five language directions, we report scores comparable to the SOTA on two (En-Ru, En-It) and set a new SOTA on two others (En-Zh, En-Es). We then presented two techniques that significantly improve disambiguation accuracy: in-context learning with similar contexts, and fine-tuning on an ambiguous corpus. We ended the paper with an evaluation of overall MT quality. We hope the methods and findings shared in this work can guide future research on ambiguity in translation.
Limitations
In this work, we attempted to note overall trends in LLM performance compared to conventional NMT systems and, based on our results, suggested methods that generally improve performance. That said, there are exceptions to these trends: prompting with similar contexts can at times degrade performance, and so can increasing the number of demonstrations (see Table 5). There is some consistency here too, in that these observations mostly apply to smaller LLMs (such as LLaMA 7B), while larger LLMs benefit more significantly. Also, as noted in Section 4.4.1, in a small percentage of cases (7.5%) we are unable to find 5 matches when attempting 5-shot prompting with similar contexts. In such cases, it might be worthwhile, from a performance perspective, to fall back to random demonstrations; since we are interested in verifying the utility of similar contexts, and since only a few cases are affected, we do not explore this. Finally, a last avenue for improvement would be to explore newer models such as LLaMA 2 (Touvron et al., 2023b) and Falcon 180B (Almazrouei et al., 2023), which had not been released at the time of our initial experimentation. Given their superior performance on other tasks, we expect these LLMs would perform as well as, if not better than, the ones used in this work.

Figure 2 :
Figure 2: DiBiMT accuracy at the end of every epoch, for the LoRA fine-tuned LLMs

Table 2 :
Statistics of data used in our experiments, in terms of parallel sentence count.

Table 4 :
Manual inspection of English-to-Chinese translations focused on the disambiguation of "head", corresponding to the first five test instances in DiBiMT. The baselines are DeepL and BLOOMZ 176B, the highest-performing NMT system and LLM for this pair (from Table 3). The reported annotations are obtained from a native Chinese speaker who was invited to label the sense of the translated ambiguous word.

Table 5 :
1-shot, 3-shot, and 5-shot results for En-Es and En-It prompting with randomised examples (Rand.) versus similar contexts (Sim.). The best-performing NMT systems from Table 3, i.e. DeepL and NLLB-200, are chosen as baselines. For each LLM and setting, the better-performing of Rand. and Sim. is highlighted in bold. The overall best score (among all LLMs) is underlined as well, while the best NMT system is also italicized.

Table 6 :
DiBiMT accuracy scores after fine-tuning Alpaca 7B, BLOOM 7B, and BLOOMZ 7B on the En-Es and En-It pairs. The "Best Loss" row refers to the checkpoint with the best cross-entropy loss. The "Best Acc." row refers to the checkpoint with the best DiBiMT accuracy, from evaluating the checkpoint after each epoch.

Table 7 :
FLORES-200 results for 1-shot prompting of some key LLMs used in this work, compared with the NLLB-200 baseline. As before, we indicate unseen-language results with †. We observe trends across all standard MT metrics similar to those observed with DiBiMT accuracy.