In-context Examples Selection for Machine Translation

Large-scale generative models show an impressive ability to perform a wide range of Natural Language Processing (NLP) tasks using in-context learning, where a few examples are used to describe a task to the model. For Machine Translation (MT), these examples are typically randomly sampled from the development dataset with a distribution similar to the evaluation set. However, it is unclear how the choice of these in-context examples and their ordering impacts the output translation quality. In this work, we aim to understand the properties of good in-context examples for MT in both in-domain and out-of-domain settings. We show that the translation quality and the domain of the in-context examples matter and that a noisy, unrelated 1-shot example can have a catastrophic impact on output quality. While concatenating multiple random examples reduces the effect of noise, a single good prompt optimized to maximize translation quality on the development dataset can elicit learned information from the pre-trained language model. Adding similar examples based on an n-gram overlap with the test source significantly and consistently improves the translation quality of the outputs, outperforming a strong kNN-MT baseline in 2 out of 4 out-of-domain datasets.


Introduction
In-context learning (Brown et al., 2020) has recently received a lot of attention from the NLP research community due to its remarkable ability to utilize only a few input-output examples to perform many NLP tasks (Liu et al., 2021). For example, Lin et al. (2021) demonstrate that a 7.5B multilingual generative model, XGLM, outperforms a supervised sequence-to-sequence baseline in 45 translation directions on the FLORES-101 machine translation benchmark (Goyal et al., 2022) using just 32 randomly sampled translation examples as demonstrations. While these results are compelling, recent work has also shown that the performance and capability of a pre-trained language model (PLM) can be highly sensitive to many factors, such as the choice of in-context examples (Liu et al., 2022b), their ordering (Lu et al., 2022), and the template (Jiang et al., 2020).
Typically, in-context learning for MT uses examples that are randomly sampled from a small development set that resembles the domain of the test dataset. The effect of the aforementioned factors (such as the choice of the examples) on the translation quality of the PLM hence remains unclear and unexplored. Yet another crucial gap in the current literature on in-context learning for MT is the effect of the domain of the in-context examples on translation quality, since out-of-domain generalization is a known and important challenge in MT (Koehn and Knowles, 2017).
In this work, we systematically analyze how factors such as the choice and number of few-shot in-context examples and their ordering impact MT output quality. We show that while a noisy, unrelated 1-shot example can have a significantly adverse effect on translation quality, a single prompt optimized to maximize translation quality on a development set can sufficiently elicit task-based information from the PLM. Our analysis thus demonstrates the importance of selecting good examples for MT and raises the question: what are the properties of good in-context examples for MT? In that direction, our findings suggest that a well-formed, meaning-equivalent translation example results in higher-quality translation than randomly selected in-context examples.
Motivated by the use of Translation Memory in Computer-Aided Translation (Yamada, 2011) and its usage in computational approaches to Machine Translation (Somers, 1999; Koehn and Senellart, 2010; Khandelwal et al., 2020, inter alia), we retrieve examples similar to the test source via BM25, an efficient unsupervised retriever, from a datastore of source texts paired with their corresponding translations, providing additional context to the model. We propose a novel in-context example selection and re-ranking strategy to maximize the coverage of the source n-grams in the retrieved examples. Experiments on WMT'19 English↔German and English↔Russian datasets show that our proposed strategy consistently improves translation quality over outputs generated using BM25-retrieved examples. Combining the optimized 1-shot task-level example with example-specific in-context examples using a simple concatenation strategy further improves translation quality, outperforming state-of-the-art inference-adapted nearest-neighbor MT models (kNN-MT) on two out-of-domain datasets (Medical and IT) while being memory and compute efficient, as our approach does not require constructing and querying a dense token-level datastore.

Background: In-context Learning
Generating translations from large-scale multilingual language models like mGPT (Shliazhko et al., 2022), XGLM (Lin et al., 2021), or AlexaTM 20B (Soltan et al., 2022) requires conditioning the decoder-only language model on in-context parallel examples. These examples serve two purposes: a) providing the model with the format and knowledge of the task (task-level) and b) guiding the output generation by providing useful information about the unseen source sentence (example-specific). This is different from standard sequence-to-sequence models, where the task is always known and the model learns generalizable patterns from input-output examples to perform the task (in this case, translation) on unseen source text. In this work, we aim to better understand the impact of prompt selection on the translation quality of the outputs. Given a training dataset consisting of n parallel examples, D = {x_i, y_i}_{i=1}^{n}, and a test source x_j, we select a subset of m informative samples to form a prompt which provides task-level and/or example-specific information, as discussed below.

Task-level In-context Examples
A good task-level in-context example should be able to elicit information learned during pre-training from the PLM. One way to measure the efficacy of an example as a prompt is to compute the translation quality of the outputs generated when the PLM is prompted with that example. Hence, we select the task-level prompt as follows: for a given example sampled from the training dataset, (x_i, y_i) ∈ D_S, we create a prompt, x^p_i, by concatenating the example {(x_i, y_i)} to each source in the development set. The system outputs are then generated using Equation 1. We then rank examples from D_S as task-level prompts by the BLEU of the generated outputs against the references on this held-out development set, D_dev = {X, Y}.
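The selection procedure above can be sketched as follows. The helper names (`translate`, `metric`) and their signatures are illustrative assumptions, not part of the paper's implementation:

```python
def select_task_level_prompt(candidates, dev_set, translate, metric):
    """Rank candidate 1-shot examples by the corpus-level quality of the
    outputs they elicit on a held-out development set (a sketch of the
    task-level selection described above).

    candidates: list of (src, tgt) training pairs
    dev_set:    list of (src, ref) development pairs
    translate:  fn(example, src) -> PLM hypothesis (assumed helper)
    metric:     fn(hypotheses, references) -> corpus score, e.g. BLEU
    """
    scored = []
    for example in candidates:
        hyps = [translate(example, src) for src, _ in dev_set]
        refs = [ref for _, ref in dev_set]
        scored.append((metric(hyps, refs), example))
    # Highest development-set score first.
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[0][1]
```

In practice, `translate` would wrap greedy decoding from the PLM and `metric` would be corpus BLEU (e.g., sacreBLEU), as described in the paper.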

Example-specific In-context Examples
Prior work on retrieving good example-specific in-context prompts for tasks other than MT (like question answering or knowledge retrieval) either trains a dense retriever (Rubin et al., 2021) or utilizes samples that are closer to the test source in the embedding space of a PLM like BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), or XLNet (Liu et al., 2022b). While contextual models can generate a global sentence representation, they overlook rare lexicons, which can be important for generating translations in unseen domains like medical or IT (Wrzalik and Krechel, 2021). For MT, however, overlapping n-grams between the source and the retrieved sentences ensure informativeness, as the target associated with a retrieved sentence is likely to include partial translations of the source. We can thus use BM25 as an efficient unsupervised method to retrieve similar examples. However, as the examples are scored independently and BM25 favors rare word matches (Robertson et al., 2009), the top retrieved candidates might not cover all the terms in the source text (Figure 1). Given that the context window of the PLM is usually limited (∼3096 tokens, 16-20 examples), maximizing the coverage of all the terms found in the test input might be favorable. Hence, we propose to re-rank the top 100 candidates retrieved by BM25 using Algorithm 1. We extract all the word n-grams and their counts from the test source, x^s_j, and from the sources of the BM25-retrieved examples, {P_j(x_i)}_{1}^{k} (lines 2-4). Let S and Q denote the set of source n-grams and the n-grams from a BM25-retrieved example, respectively. We compute a recall-based (R) n-gram overlap score between S and Q (line 7). The example with the maximum score is then added to the set of selected prompts, and the n-grams from the test source that it covers are down-weighted by a factor λ for the next iteration of selection (line 14). For example, setting λ = 0 will select, in the subsequent iteration, the example that covers the n-grams from the test source that have not already been encountered. This process is repeated over the retrieved pool until a set score threshold is reached.
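A minimal sketch of this greedy, recall-based re-ranking is given below. The scoring is a simplification of Algorithm 1 (uniform initial n-gram weights, and no stopping threshold beyond a zero score), so treat it as illustrative rather than a faithful reimplementation:

```python
from collections import Counter

def ngrams(tokens, max_n=4):
    """All word n-grams of a token list up to length max_n, with counts."""
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams[tuple(tokens[i:i + n])] += 1
    return grams

def rerank(source, candidates, lam=0.1, max_examples=16):
    """Greedily select BM25 candidates (src, tgt) that maximize a
    recall-based overlap with the test source's n-grams. N-grams already
    covered by a selected example are down-weighted by `lam`."""
    weights = {g: 1.0 for g in ngrams(source.split())}
    pool = [(cand, ngrams(cand[0].split())) for cand in candidates]
    selected = []

    def score(grams):
        # Weighted fraction of source n-grams present in the candidate.
        return sum(w for g, w in weights.items() if g in grams) / len(weights)

    while pool and len(selected) < max_examples:
        best = max(pool, key=lambda item: score(item[1]))
        if score(best[1]) == 0:
            break
        selected.append(best[0])
        pool.remove(best)
        for g in best[1]:
            if g in weights:
                weights[g] *= lam  # down-weight covered n-grams
    return selected
```

With λ = 0, an n-gram contributes to the score only until some selected example covers it, which matches the special case discussed above.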
Figure 1 shows the top-100 candidates retrieved via BM25 for the input: "Welche Risiken sind mit Poulvac FluFend H5N3 RG verbunden?". The top few candidates all provide the same information to the PLM, i.e., the translation of the phrase "Poulvac FluFend H5N3 RG". Examples including the other terms from the input text ("Welche Risiken sind mit verbunden?") are ranked lower.

Algorithm 1: An N-gram Recall-based Strategy to Re-rank In-context Examples

Datasets and Evaluation Metric
We perform our in-domain evaluation on the WMT'19 German (de) ↔ English (en) and WMT'19 Russian (ru) ↔ English (en) datasets (Barrault et al., 2019). For the out-of-domain evaluation, we use the multi-domain dataset from Aharoni and Goldberg (2020) covering the Medical, Law, IT, and Koran domains. The dataset statistics are reported in the Appendix (Table 8). Following Ng et al. (2019), we normalize punctuation using Moses (Koehn et al., 2007) and remove, from the in-domain datasets, sentences longer than 250 tokens and sentence pairs with a source/target length ratio exceeding 1.5. The detokenized, length-truncated model-generated outputs are evaluated using sacreBLEU (Papineni et al., 2002; Post, 2018). The PLM outputs are truncated to twice the source length, as preliminary analysis suggested degeneration in a few (∼10-20) examples.
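The filtering criteria above amount to a simple predicate over token counts. The function below is a sketch under the assumption of whitespace tokenization:

```python
def keep_pair(src, tgt, max_len=250, max_ratio=1.5):
    """Filtering applied to the in-domain bitext (following Ng et al., 2019):
    drop sentences longer than 250 tokens and pairs whose source/target
    length ratio exceeds 1.5."""
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0 or s > max_len or t > max_len:
        return False
    return max(s, t) / min(s, t) <= max_ratio
```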

Experimental Conditions
Language Model We use the publicly available checkpoint of XGLM 7.5B, a decoder-only multilingual language model (Lin et al., 2021) with 32 layers and a hidden dimension of 4096, for all our experiments.

Baselines and Comparisons
We consider the following comparisons:
• Random: p random few-shot examples sampled from the training dataset (number of trials = 3).
• Task-level: top-p examples that achieve the highest BLEU on the development set ( § 3.1).
• Retrieved In-context (BM25): q max examples retrieved via BM25, since, unlike task-level examples, there is no guarantee that exactly q similar examples will be found in the training dataset for each input.
• Retrieved Re-ranked In-context (R-BM25): q max re-ranked examples using our proposed approach as detailed in § 3.2.
We also compare our results with the state-of-the-art nearest neighbor-based approach for out-of-domain evaluation, kNN-MT (Khandelwal et al., 2020). We use λ = 0.1, threshold = 1.0, and order the examples according to their similarity to the source, with the most similar examples on the left, in all our experiments (Appendix Tables 9, 10). Concatenating the task-level prompt to R-BM25 consistently achieves the best BLEU scores across the board. p and q_max are the number of task-level and example-specific prompts, respectively.

In-domain Evaluation
A single task-level prompt is competitive with 16 random few-shot examples. Our experiments suggest that it is possible to elicit task-level knowledge from the large-scale language model using a single prompt, as opposed to 16 random few-shot examples, when translating into English (Table 2). Using a single task-level prompt (optimized on the development set) improves BLEU over using 16 random few-shot examples for 2 out of 4 translation directions (De-En, Ru-En). We hypothesize that when translating out of English, the model still benefits from exposure to multiple and diverse random few-shot examples, as the target-side language model is relatively weaker.
Multiple example-specific prompts are required to improve translation quality over a single task-level prompt. Using a single task-level prompt (p = 1) attains higher BLEU than using a single example-specific prompt (q = 1; BM25, R-BM25) across the board. By contrast, using up to 16 BM25 prompts (q_max = 16) significantly improves output quality over task-level prompts, with an average gain of 1.41 BLEU.
Re-ranking BM25-retrieved examples improves BLEU. Our proposed re-ranking strategy consistently improves BLEU over BM25 for both values of q_max = {1, 16}, showing that both the order and the choice of the in-context examples matter. Task-level and R-BM25 examples provide complementary advantages, as combining them using a simple concatenation strategy improves output quality over using either alone. We leave the exploration of optimizing the number and the joint ordering of task-level and example-specific prompts to future work.

Out-of-domain Evaluation
As XGLM is trained on monolingual Common Crawl snapshots, translation in any domain and language could be considered an out-of-domain task. However, we hypothesize that translation in specific domains like medical, law, or IT could still be challenging for the PLM, as the model is less likely to have observed sufficient monolingual data for these specialized domains, in contrast to the news text found in WMT. Examples from these domains require translating rare terminology and carry domain-specific idiosyncrasies, which are known to pose a challenge even for a well-trained supervised neural MT model (Koehn and Knowles, 2017). Hence, we also evaluate the PLM under these specialized out-of-domain scenarios.

Domain of few-shot in-context examples matters. Task-level in-context examples drawn from the domain of evaluation, i.e., domain-specific examples, obtain higher BLEU scores on average across the board than examples from the more distant WMT corpus, as expected (Table 3), in both the 1-shot (p = 1: +1.4) and 16-shot (p = 16: +2.7) settings.
Example-specific prompts significantly improve translation quality over task-level prompts. Unlike the in-domain evaluation, retrieved and re-ranked example-specific prompts (R-BM25) improve the translation quality significantly across the board, with up to a 23 BLEU gain in the Law domain using just a single example as a prompt over a task-level prompt. This can be attributed to the high lexical overlap in the examples retrieved from the training data for these domains (Table 6).

Task-level and R-BM25 prompts are complementary. Both task-level and R-BM25 prompts provide supporting information for a given test source sentence, as concatenating these sets of prompts improves output quality over using either method independently, outperforming a strong kNN-MT baseline on 2 out of 4 domains (Medical and IT). Where kNN-MT relies on token-level nearest-neighbor inference, with representations extracted over bitext by a strong supervised MT model, to reach the reported translation quality, our approach only uses sentence-level unsupervised retrieval (BM25) to provide additional context for the unseen source to a multilingual PLM that has not been trained with any known parallel supervision, reaching better or comparable translation quality. Hence, our results motivate further analysis of the translation abilities of retrieval-augmented PLMs on new domains and language pairs. Our manual analysis suggests that the higher gain obtained in the IT domain (+0.86) with both task-level and example-specific prompts can be explained by the observation that, for 100 test source sentences, no training examples have any lexical overlap with the test source. The task-level prompt can still elicit learned information from the PLM for these inputs, compared to using no examples.

Task-level Example Selection
Choice of Few-shot Examples We show the distribution of output quality as measured by BLEU when using 100 different examples as 1-shot prompts in Figure 2. Across all four language pairs, there is a large variation in BLEU scores (up to 20 BLEU), where noisy or unrelated prompts can lead to significantly worse output quality. Given that most existing parallel corpora are web-crawled and that bitext quality can vary significantly across language pairs (Kreutzer et al., 2022), choosing 1-shot prompts at random can hence adversely impact output quality. A caveat of selecting examples in this fashion could be that we might still be underestimating the PLM's performance, as a larger pool size could result in better output quality. We study the impact of using a larger pool size in Table 4, where increasing the number of examples from 100 to 1000 only leads to a gain of 0.5 points in the maximum BLEU. From the same table, we also observe that for any subset of 100 random few-shot examples, we can extract a task-level prompt (BLEU: 36) with a small standard deviation in overall output quality (0.18).

Properties of good Task-level prompts Our manual analysis of the best task-level prompts suggests that any well-formed, meaning-equivalent translation (Vyas et al., 2018; Briakou and Carpuat, 2020) can make a good task-level prompt (see examples in Appendix Table 11). To quantify the meaning equivalence of the 1-best task-level prompt against random 1-shot examples, we report in Table 5 the percentage of aligned words between the source and reference translation ("% Aligned words") using fastAlign (Dyer et al., 2013) and the log probability of generating the reference translation conditioned on the source using a pre-trained multilingual NMT model, Prism-src (Thompson and Post, 2020; Agrawal et al., 2021). Across all language pairs and both metrics, task-level examples achieve higher semantic similarity scores.

Table 7: BLEU over all 24 permutations of 3 seeds of 4 randomly selected and top 4 task-level prompts.

Informativeness of BM25 Examples
To understand the benefit of retrieved examples in the out-of-domain evaluation, we measure the lexical overlap between the test input (x, y) and the prompts (I_x, I_y) using BLEU (Avg. BLEU(I_x, x), Avg. BLEU(I_y, y)), where I_x and I_y are the sources and target translations of the retrieved in-context examples. We also report the correlation against the output translation quality, BLEU(ŷ, y).
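As a rough illustration, this overlap statistic can be approximated with an averaged n-gram precision in place of sentence-level BLEU (no brevity penalty or smoothing, so the numbers will differ from sacreBLEU):

```python
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of hyp against ref (token lists)."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((hyp_ngrams & ref_ngrams).values())
    return overlap / max(sum(hyp_ngrams.values()), 1)

def avg_overlap(prompts, text, max_n=2):
    """Average n-gram overlap between each retrieved prompt side and the
    test input -- a simplified stand-in for the BLEU-based analysis."""
    text_tok = text.split()
    scores = []
    for p in prompts:
        p_tok = p.split()
        precs = [ngram_precision(p_tok, text_tok, n) for n in range(1, max_n + 1)]
        scores.append(sum(precs) / len(precs))
    return sum(scores) / max(len(scores), 1)
```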
Table 6 shows that the source lexical overlap is a good indicator of the informativeness of a prompt for 3 out of 4 domains, with Koran as an exception.
For Koran, while the retrieved sentences have a high overlap with the source (36.03), the target associated with the prompts (I y ) does not get high BLEU with the reference (10.36) compared to other domains.We hypothesize that this might be due to a bias in the reference translations towards a particular output style.We provide examples of this phenomenon in the Appendix Section F.

Size of the Datastore
Figure 3 shows BLEU when varying the size of the datastore used to retrieve similar in-context examples via BM25 on the Medical dataset. As the size of the datastore increases, the likelihood of retrieving a more similar example increases. However, when only a smaller in-domain datastore is available, similar output quality can be achieved by using multiple in-context examples, as they provide better coverage of the source terms: BLEU at q_max = 16 with a datastore of 100k examples matches BLEU at q_max = 1 with a datastore twice that size (200k).

Related Work
The selection of in-context examples and their impact on downstream NLP task performance has been studied in prior work for tasks other than MT (Liu et al., 2022b; Lu et al., 2022; Jiang et al., 2020; Min et al., 2022; Zemlyanskiy et al., 2022; Rubin et al., 2021; Liu et al., 2022a). Garcia and Firat (2022) use natural language prompts to control the target language in multilingual MT and investigate the effect of scale, number of languages, and their similarity on this phenomenon. Wang et al. (2022) utilize BM25-retrieved training examples in a supervised fashion to learn from similar examples during training. Contrary to prior work, we utilize similar examples to form a textual prompt which guides the generation of a translation during inference. Prior work on domain adaptation for MT uses domain-specific bilingual or monolingual datasets to improve the translation quality of a neural sequence-to-sequence MT model either during training (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016; Wang et al., 2017) or inference (Zheng et al., 2021; Khandelwal et al., 2020; Martins et al., 2022). Similar to past work, our work utilizes out-of-domain bitext during inference, but instead adapts a PLM to unseen domains. However, our approach does not rely on creating a domain-specific token-level datastore and is hence more compute and memory efficient.
Several concurrent works investigate in-context learning for MT: Zhang et al. (2023) study prompting strategies for MT and examine several factors that could impact translation quality. Garcia et al. (2023) show the effectiveness of using few-shot examples to control translation formality and also corroborate our finding that the quality of the few-shot in-context examples matters. Ghazvininejad et al. (2023) provide control hints to large language models via bilingual dictionaries to improve the translation of rare words. Our work provides both supporting and complementary evidence to these studies by a) contributing a systematic analysis showing that the impact of the ordering of the demonstration examples on translation quality depends on the nature and quality of the examples, and b) proposing a novel recall-based re-ranking approach that overcomes the limitations of BM25-based retrieval for in-context example selection and optimizes the selection of multiple prompts for MT. To the best of our knowledge, ours is the first work to jointly optimize the selection of multiple prompts for MT, either by combining task-level and example-specific prompts or by directly optimizing the joint utility of multiple example-specific prompts through maximizing the coverage of the selected n-grams.

Conclusion
We investigate the choice of in-context examples for MT in both in-domain and out-of-domain settings. We propose a novel recall-based re-ranking approach that utilizes similar training examples as prompts and show its efficacy across multiple datasets and domains. Our findings show that task-level prompts can provide a complementary advantage to example-specific prompts, outperforming a strong kNN-MT baseline in 2 out of 4 out-of-domain datasets while being memory and compute efficient. Our manual analysis of the generated outputs reveals that the PLM can mimic the style of the provided in-context examples and can be used for template-based translation synthesis. These results open up future research on generating diverse and style-specific outputs for MT.

Limitations
We note a few limitations of our work: a) while we systematically investigate the choice of in-context examples in both in- and out-of-domain settings for higher-resource language pairs (English-German, English-Russian), it is unclear how the in-context ability of the PLM varies for lower-resourced language pairs; b) we only experimented with one pre-trained language model, XGLM. Our preliminary experiments suggested that XGLM-7.5B results in better translation quality than Bloom-7B (Scao et al., 2022) under the same settings. However, further investigation is required to understand how these results vary across different model scales; c) we analyze different orderings for the few-shot task-level prompts but only examine a limited set of orderings (most similar to the left or right) for the example-specific prompts. As the PLM is shown to be sensitive to the ordering of these in-context examples, it remains an open question how to best combine the information from multiple example-specific prompts, with prompt ensembling being a viable option, which we leave to future work.

C Results using Second Metric: Comet
We report translation quality using Comet (Rei et al., 2020) in Tables 14 and 15. We use the eamt22-cometinho-da model (Rei et al., 2022) to generate the scores, as it was shown to achieve higher correlations with human judgments than lexical overlap metrics while being computationally efficient. Our re-ranking strategy (with q_max = 16) consistently performs the best across the board except for Koran, outperforming strong kNN-MT baselines on the multi-domain test set in 3 out of 4 settings. Adding a task-level prompt to 16 R-BM25 prompts via concatenation further improves quality in 5 out of 8 settings.

D Hyperparameter Search D.1 Order of BM25 Retrieved Examples
We report BLEU when using two different orderings of example-specific prompts on the development set for the medical domain. Ordering the examples with the most similar examples on the left attains higher BLEU than the right-to-left order. We note that the trend could vary depending on the noise in the training dataset, the degree of similarity, and the number of retrieved examples. We leave further exploration of the ordering of example-specific prompts to future work.

E Example Task-Level Prompts
Table 11 shows the best task-level in-context example selected by our method described in § 3.1 and the respective BLEU scores on the development set for the German-English and Russian-English tasks.

F Output Analysis
We report two interesting findings when prompting the PLM with task-level and example-specific prompts:

Stylistic Outputs One advantage of using a single task-level in-context example to prompt the PLM is that it allows us to systematically study how the choice of prompt influences the style of the generated translation. Table 12 illustrates one such example: we can observe that as the prompt includes a contraction ("we are" vs. "we're"), the outputs generated by the PLM also include contractions and can be incorrectly penalized by BLEU while being meaning equivalent.
Template-based MT Template-based translation in the medical, legal, IT, or e-commerce domains can be preferable, as it reduces the risk of errors in automatically generated translations. We present some examples in Table 13.

Figure 1 :
Figure 1: Our proposed strategy can cover all the terms from the input text, "Welche Risiken sind mit Poulvac FluFend H5N3 RG verbunden?", in this case with just two examples.
On the other hand, our proposed re-ranking strategy can cover all the terms from the input text, in this case, with just the top-2 examples.

Figure 2 :
Figure 2: BLEU distribution on the WMT'18 test set for 100 randomly sampled 1-shot prompts from the training dataset. The same set of 100 random 1-shot prompts is used for the x→y and y→x translation directions.

Table 6 :
Correlation between the degree of overlap, as measured by BLEU, and the translation quality of the outputs, BLEU(ŷ, y), across different domains when using the top-1 prompt retrieved via BM25. I_x and I_y are the sources and the reference translations in the BM25 examples, respectively.

Figure 3 :
Figure 3: BLEU on the Medical domain when varying the data store size and the number of BM25 examples.

Table 1 :
In-context Examples for Machine Translation.
Source: Welche Risiken sind mit Poulvac FluFend H5N3 RG verbunden?
Template: {Source text} = {Target text}
Example-Specific: Welche Risiken sind mit Sebivo verbunden? = What are the risks associated with Sebivo?
Task-Level: Bei PROMESS1 werden drei Hauptziele verfolgt. = PROMESS1 has three main objectives.
Formally, given k in-context examples {x_i, y_i}_{i=1}^{k}, the prefix input or prompt, x^p_j, is generated by concatenating the demonstration examples {(x_i, y_i)}_{i=1}^{k} to the test input, x^s_j, according to a template, P (see Table 1). The output, ŷ, is then generated via the PLM with parameters θ using greedy decoding:

ŷ = argmax_y p_θ(y | x^p_j)   (1)

Prior work (Zhang et al., 2022) suggests that in classification tasks, the in-context examples provide information about the task (the distribution of the input text, the label space, and the format of the task) and that the model does not rely on these examples to generate the final output. However, that analysis is limited to a) classification tasks and b) randomly sampled in-context examples. Prior work has also shown that the order of these in-context examples can lead to high variance in downstream performance (Zhang et al., 2022). However, less is understood about how these factors impact text generation tasks like MT. Do we need multiple in-context examples? What makes good in-context examples for MT? How sensitive is the model to the order of the prompts?
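Concretely, prompt construction under the "{Source text} = {Target text}" template reduces to string concatenation. The separator and newline joining below are assumptions about the exact formatting:

```python
def build_prompt(examples, test_source, sep=" = "):
    """Concatenate k demonstration pairs and the test source using the
    '{Source text} = {Target text}' template described above."""
    lines = [f"{src}{sep}{tgt}" for src, tgt in examples]
    # The test source is appended with a trailing separator so that the
    # PLM continues the pattern by generating its translation.
    lines.append(f"{test_source}{sep}")
    return "\n".join(lines)
```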

Table 2 :
Results on the WMT'19 test sets. Tables 2 and 3 summarize the main results for the in-domain and the out-of-domain evaluations.

Table 3 :
Results on the Multi-Domain Test Set: Prompting XGLM with R-BM25 in-context examples outperforms kNN-MT on 2 out of 4 domains.

Table 10 :
BLEU using different values of λ and threshold on the Medical Development Set (q max = 16).

Table 11 :
Best task-level prompt For De-En and Ru-En Language Pairs according to the BLEU score on the development set.
Table 13 shows how the PLM can seamlessly use retrieved prompts to synthesize a translation from the template provided.

Prompt: Wegen des heißen Sommers fangen wir erst spät an. = Because of the hot summer, we're late getting started.
Source: Ja, ich bin sehr zufrieden mit dem Auftritt.
Reference: Yes, I am very happy with the performance.
PLM Output: Yes, I'm very satisfied with the performance.
Source: Es ist eine andere Unternehmenskultur.
Reference: It is a different corporate culture.
PLM Output: It's a different corporate culture.

Table 12 :
Outputs mimic the style of the prompt.

Prompt: Zeigt die aktuelle Datei mit Opera an. = View the current file with Opera.
Source: Zeigt die aktuelle Datei mit Lynx an (Textbasierter Browser).
PLM Output: View the current file with Lynx (Text-based browser).

Table 13 :
Outputs follow the template of the prompt.

Table 15 :
Comet Scores on the Multi-Domain Test Set.