Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning

Large language models (LLMs) are a promising avenue for machine translation (MT). However, current LLM-based MT systems are brittle: their effectiveness highly depends on the choice of few-shot examples and they often require extra post-processing due to overgeneration. Alternatives such as finetuning on translation instructions are computationally expensive and may weaken in-context learning capabilities, due to overspecialization. In this paper, we provide a closer look at this problem. We start by showing that adapter-based finetuning with LoRA matches the performance of traditional finetuning while reducing the number of training parameters by a factor of 50. This method also outperforms few-shot prompting and eliminates the need for post-processing or in-context examples. However, we show that finetuning generally degrades few-shot performance, hindering adaptation capabilities. Finally, to obtain the best of both worlds, we propose a simple approach that incorporates few-shot examples during finetuning. Experiments on 10 language pairs show that our proposed approach recovers the original few-shot capabilities while keeping the added benefits of finetuning.


Introduction
Large language models (LLMs) have shown remarkable performance on a wide range of NLP tasks by leveraging in-context learning (Brown et al., 2020). In particular, when provided with few-shot examples, these models have demonstrated impressive capabilities for performing machine translation (MT) without requiring explicit supervision on parallel data (Garcia et al., 2023). However, this approach exhibits several drawbacks: performance is highly dependent on the quality of the examples (Vilar et al., 2022), outputs are plagued by overgeneration (Bawden and Yvon, 2023), and inference costs are greatly increased by processing all input pairs. When parallel data is available, LLMs can alternatively be finetuned on translation instructions (Li et al., 2023). This method generally outperforms few-shot prompting and eliminates the need for in-context examples. However, it remains unclear whether finetuned models can benefit from the desirable properties of in-context learning, such as on-the-fly domain adaptation (Agrawal et al., 2022). Additionally, traditional finetuning (Devlin et al., 2019; Radford et al., 2018) incurs a high computational overhead due to the cost of updating all the model weights.
In this paper, we provide a closer examination of the impact of finetuning and few-shot prompting for adapting LLMs to perform translation. Our experiments encompass 10 language pairs on general and specific domains, comprising over 100,000 generated translations (§2). Our main findings are:
• We show that finetuning with adapters (Houlsby et al., 2019; Hu et al., 2022) is a very effective method to steer LLMs for translation (§3.1). This method matches the performance of traditional finetuning at a fraction of the computational cost, by training 50 times fewer parameters. It also achieves better translation quality than in-context learning and eliminates the need for post-processing the generated outputs and selecting in-context examples.
• We show that finetuning large language models degrades their few-shot performance, limiting their adaptation capabilities (§3.2). In particular, we show that finetuned LLMs perform poorly in domain adaptation scenarios when provided with in-context examples.
• To address this issue, we propose a simple approach that introduces few-shot examples during finetuning (§4). Our results show that we can recover few-shot capabilities while retaining the benefits of finetuning.

Experimental Setup
In our experiments, we use LLaMA 7B and 13B (Touvron et al., 2023) as backbone language models and finetune them with the standard cross entropy loss.
We train our models on general-domain OPUS (Tiedemann, 2012) data from the Europarl, Globalvoices, Paracrawl, Tilde, Ubuntu, and Wikipedia domains. We consider the languages Dutch (nl), French (fr), German (de), Portuguese (pt) and Russian (ru), both from and into English (en). To ensure the quality of the training records, we first apply Bicleaner (Ramírez-Sánchez et al., 2020) using a threshold of 0.85 and then filter the remaining pairs, ensuring both language directions have a COMETKiwi (Rei et al., 2022b) score above 0.8. Finally, we sample 250K records for each language pair. During training, we uniformly sample from the data to ensure each language pair is seen a similar number of times. We perform validation on the Flores-200 development set for the language pairs in the training data.
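The filtering pipeline above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the record field names (`bicleaner`, `kiwi_src_tgt`, `kiwi_tgt_src`) are hypothetical, and we assume the Bicleaner and COMETKiwi scores have been computed beforehand.

```python
import random

# Thresholds from the paper; field names below are hypothetical, and we
# assume the Bicleaner/COMETKiwi scores were computed in a previous step.
BICLEANER_MIN = 0.85
KIWI_MIN = 0.8
SAMPLE_SIZE = 250_000

def filter_and_sample(records, sample_size=SAMPLE_SIZE, seed=0):
    """Keep pairs passing both quality filters, then sample a fixed number."""
    kept = [
        r for r in records
        if r["bicleaner"] >= BICLEANER_MIN   # Bicleaner threshold of 0.85
        and r["kiwi_src_tgt"] > KIWI_MIN     # COMETKiwi above 0.8 ...
        and r["kiwi_tgt_src"] > KIWI_MIN     # ... in both directions
    ]
    rng = random.Random(seed)
    return rng.sample(kept, min(sample_size, len(kept)))
```

Whether the Bicleaner threshold is inclusive is our assumption; the paper only states "a threshold of 0.85".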
For in-domain evaluation, we consider the Flores-200 (NLLB Team et al., 2022) test dataset on all the translation directions included during training, as well as the WMT22 test sets for the language pairs considered in our training data. Regarding data for specialized domains, we consider the Medical and Law domains from Aharoni and Goldberg (2020), the TICO dataset (Anastasopoulos et al., 2020), and WMT Chat (Farinha et al., 2022). We evaluate our models in zero- and five-shot settings, uniformly sampling for each test sentence five independent few-shot examples from the respective development set.
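For concreteness, the few-shot sampling used at evaluation time could look like the sketch below. The function and field names are our own, and we assume "independent" means the five examples within a prompt are drawn without replacement.

```python
import random

def sample_shots(dev_set, n_shots=5, seed=0):
    """Draw few-shot examples for one test sentence uniformly from the
    corresponding development set (names here are illustrative)."""
    rng = random.Random(seed)
    return rng.sample(dev_set, n_shots)
```

In practice, a fresh seed (or a shared RNG) per test sentence keeps the five-example draws independent across the test set.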
We refer the reader to Appendix A for full details on hyperparameters and instruction formats used in the following experiments.

Finetuning LLMs on MT instructions
In this section, we investigate the performance of LLMs finetuned on machine translation instructions in relation to few-shot prompting with a pretrained language model. Note that, throughout this section, we always analyse few-shot prompting for the pretrained model. We deem that this offers a fairer comparison to finetuning on translation instructions, since both methods have access to training examples.
Nevertheless, we also provide the results for zero-shot translation with the pretrained model in Appendix G. Similar to the findings in Bawden and Yvon (2023), zero-shot performance is far behind few-shot performance, in particular for out-of-English language pairs, likely due to the prevalence of English data during the pretraining of the LLaMA models.

Efficient finetuning with LoRA
We start by studying parameter-efficient training with low-rank adaptation (LoRA) (Hu et al., 2022) and compare it with traditional finetuning. In Figure 1, we observe that LoRA performs comparably to traditional finetuning while training 50 times fewer parameters. We also see that both LoRA and traditional finetuning outperform the pretrained model with few-shot prompts; the latter is consistent with the findings in Li et al. (2023), which show that finetuning leads to better translations than few-shot prompting of pretrained language models. As a general trend, all methods exhibit better translation quality when translating into English, following recent trends in the literature (Arivazhagan et al., 2019; Vilar et al., 2022).
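A back-of-the-envelope calculation shows where a roughly 50-fold reduction can come from. The sketch below assumes, as in the GPT-2 configurations of Hu et al. (2022) that the appendix says the setup follows, that LoRA factors are added only to the attention query and value projections; the LLaMA 7B dimensions are public, but the exact set of adapted modules is our assumption.

```python
# LoRA adds factors A (r x d) and B (d x r) next to a frozen d x d weight,
# so each adapted matrix trains 2*d*r parameters instead of d*d.
d_model = 4096   # LLaMA 7B hidden size
n_layers = 32    # LLaMA 7B transformer layers
r = 256          # best LoRA rank reported in the paper's appendix

# Assumption: only the query and value projections are adapted, as in the
# GPT-2 configurations of Hu et al. (2022).
trainable = n_layers * 2 * (2 * d_model * r)
total = 6.7e9    # approximate parameter count of LLaMA 7B

print(f"trainable: {trainable / 1e6:.0f}M, reduction: {total / trainable:.0f}x")
# -> trainable: 134M, reduction: 50x
```

Under these assumptions the per-model reduction lands at about 50x, matching the factor reported above.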
We also find that finetuning with LoRA requires a very small number of translations to reach the reported performance, as shown in Figure 2. In particular, it outperforms the few-shot pretrained model with as few as 2,000 training examples.
Considering the high computational costs of full finetuning compared to parameter-efficient finetuning and the negligible degradation obtained with the LoRA-based model, we use LoRA in subsequent experiments.

Few-shot prompting of finetuned models
We now direct our attention to comparing zero- and five-shot performance. We argue that, even when an LLM can achieve high zero-shot translation quality, few-shot capabilities can be very beneficial for efficient adaptation. As shown by Agrawal et al. (2022), LLMs can leverage a very small pool of few-shot examples to perform translation in new domains.
In the leftmost plots of Figure 3, we examine the zero- and few-shot performance of our finetuned models on general domains. Few-shot performance degrades and is surpassed by zero-shot performance, suggesting that the finetuning procedure is hindering the in-context learning abilities. In order to further study this phenomenon, we evaluate the above models on specialized domains.
General-domain examples may be of little help for a model already trained on that domain. In specialized domains, on the contrary, examples should bring domain-specific information about the properties of the translation, such as style and register, and thus help the model achieve better performance.
In the rightmost plots of Figure 3, we observe that the above issue occurs consistently in all domains, with a larger degradation in performance. This finding further supports our hypothesis that finetuning can degrade the performance of few-shot prompting.

Finetuning with few-shot examples
In order to recover few-shot performance, we introduce instructions with few-shot examples into the training process: namely, we finetune on data that contains both zero-shot and few-shot instructions. Following Min et al. (2022), we uniformly sample between 0 and 5 few-shot examples for each training example from an example pool previously separated from the training data (see Appendix C for an alternative mixing strategy). From there, we build an instruction prompt with the training example and the selected examples and proceed with training.
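The mixing procedure can be sketched as below. The template string is a simplified stand-in for the formats compared in Appendix A, and the pool and field names are illustrative, not the paper's exact data layout.

```python
import random

# Simplified stand-in for the instruction templates compared in Appendix A.
TEMPLATE = "Translate the source text from {src} to {tgt}.\nSource: {text}\nTarget:"

def build_training_prompt(example, pool, rng, max_shots=5):
    """Uniformly pick 0..5 demonstrations from a held-out pool and prepend
    them to the instruction for this training example."""
    k = rng.randint(0, max_shots)
    shots = rng.sample(pool, k)
    demo = "".join(
        f"Source: {s['source']}\nTarget: {s['target']}\n" for s in shots
    )
    return demo + TEMPLATE.format(
        src=example["src_lang"], tgt=example["tgt_lang"], text=example["source"]
    )
```

Training then proceeds with the usual cross-entropy loss on the target translation appended after "Target:".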
In Figure 3, we observe that the models trained with in-context examples recover their few-shot capabilities, both for the general and the specialized domains. The few-shot performance is on par with or above the zero-shot performance, further suggesting that the models are extracting helpful information from the examples. In Appendix D, we present a set of examples that highlight these gains.

Analysis on output format
We also analyze whether finetuned models continue to generate content after the desired translation. This issue is present in pretrained LLM outputs and requires post-processing of the generated text, deleting all words generated after the first newline.
In Figure 4, we show the length of the tokenized outputs for the 7B models; the 13B models follow a similar distribution. We observe that the length distribution of the outputs generated by both finetuned models matches the distribution of the references. This shows that the finetuned models no longer overgenerate. We also found that these models no longer delimit their output with the newline symbol and instead produce the end-of-sequence token, removing the need for post-processing and increasing computational efficiency. In Appendix F, we provide a set of examples that illustrate these findings.
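The post-processing step needed for the pretrained model (but not for the finetuned ones) amounts to truncating at the first newline; a minimal sketch:

```python
def truncate_at_newline(generation: str) -> str:
    """Keep only the text before the first newline, discarding any
    continuation a pretrained model appends after the translation."""
    return generation.split("\n", 1)[0].strip()

# A pretrained model may keep generating after the translation ends:
raw = "Le chat dort.\nTranslate the source text from English to French."
clean = truncate_at_newline(raw)  # -> "Le chat dort."
```

Finetuned models emit the end-of-sequence token on their own, so this step (and the extra tokens it discards) can be skipped entirely.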

Influence of in-context examples
In order to obtain a more fine-grained analysis of the gains obtained by adding in-context examples, we analyzed the difference in COMET scores for each source sentence when prompting the 7B finetuned models with and without examples.
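Given two aligned lists of precomputed segment-level COMET scores (hypothetical inputs; COMET scoring itself is done separately), the per-sentence analysis reduces to simple differences:

```python
def score_deltas(few_shot_scores, zero_shot_scores):
    """Per-sentence COMET difference; delta > 0 means the few-shot prompt
    produced a higher-scored translation for that source sentence."""
    assert len(few_shot_scores) == len(zero_shot_scores)
    return [f - z for f, z in zip(few_shot_scores, zero_shot_scores)]

# Toy scores: one small gain, one large drop (e.g. a few-shot-induced
# hallucination), and one tie.
deltas = score_deltas([0.85, 0.40, 0.90], [0.80, 0.88, 0.90])
```

Plotting a histogram of these deltas yields distributions like those in Figure 5: most mass slightly above zero, with long tails on both sides.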
In Figure 5, we observe that the distributions have a high concentration of points slightly above 0. However, we also observe very large tails, in particular for out-of-English language pairs. We manually inspected the examples with the highest differences and found that introducing examples can fix the model generating in the wrong language, supporting the findings in Bawden and Yvon (2023). Surprisingly, we also discovered examples where the model correctly generated a translation in a zero-shot scenario and inserting in-context examples led to hallucinated content.
In Table 1, we see that the models finetuned without examples have higher hallucination rates than their respective counterparts, further showing their degradation in few-shot performance. Through a manual inspection of the obtained outputs, we observed that the models generate hallucinations of different categories. In particular, they generate both detached (fully and strongly) and oscillatory hallucinations, and can also generate off-target translations. One common case is that the models copy from the instruction (either from the source or from the examples).
The models finetuned with few-shot examples exhibit lower hallucination rates, suggesting that the training procedure reduced the prevalence of this issue. In particular, these models no longer copy from the instruction. However, they still produce hallucinations, and their impact remains serious. As such, we believe this motivates further study of the influence of in-context examples on the generated output.

Conclusion
In this paper, we provide a study on finetuning and few-shot prompting for adapting LLMs for translation. We show that adapter-based finetuning matches the performance of traditional finetuning while training 50 times fewer parameters. Additionally, finetuning with adapters outperforms few-shot prompting of large language models and eliminates the need for output post-processing and in-context examples.
In addition, we show that finetuned models exhibit poor performance when prompted with in-context examples. To address this, we propose a simple approach that incorporates few-shot examples during finetuning, which recovers the original few-shot capabilities while retaining the benefits of finetuning.

Limitations
In this paper, we focus on English-centric high-resource language pairs. It remains an open question how these findings generalize to non-English language pairs or to low-resource settings.
We also do not perform a human assessment of the quality of the translations, due to the time and cost of performing such a study. Instead, we base our evaluation on COMET, a state-of-the-art metric for MT evaluation, and provide results for other metrics in Appendix G.

Ethics Statement
This paper is based on large language models. These models can encompass several risks, which are discussed in detail in Brown et al. (2020) and Chowdhery et al. (2022). Namely, they are trained on large web corpora, which can contain toxic content (Gehman et al., 2020), and have a high energy consumption, in particular during training (Strubell et al., 2019).
Additionally, our evaluation is based on automatic metrics finetuned on human preferences. In such cases, annotators may not consider better alternatives when evaluating generated text and may wrongfully classify the text as high quality (Bansal et al., 2021).
Table 2 :
Zero-shot instruction template: "Translate the source text from X to Y. Source: ... Target:"

A Details on experimental setup
A.1 Instruction format
The training data for finetuning without few-shot examples follows the template shown in Table 2. The same format is used when testing all models in a zero-shot setting.
We treat the few-shot instruction template as a hyperparameter and experiment with three different formats, as shown in Table 3. Our first template follows recent trends in the literature and repeats the zero-shot instruction for each example (Vilar et al., 2022). However, in our experiments, we found that pretrained language models pick up on the repeating pattern and continue to generate more examples after the target translation. In order to circumvent this issue in the finetuned models, we designed the two remaining templates with separate example sections. Our goal was to better separate the examples from the input and thus reduce the propensity for overgeneration. We found that all templates lead to overgeneration with the pretrained model and that none suffered from this issue when the model is finetuned.
In order to select the template format for our remaining experiments, we test them by finetuning the LLaMA 7B model with examples and choosing the template with the highest average COMET score on the languages in the validation set. In order to collect examples for few-shot prompting on the validation set, we sampled from the validation set, ensuring the predicted example was not among the in-context examples.
In Table 4, we observe that the templates lead to very similar results, suggesting that the finetuning procedure is not very sensitive to the template used. Nevertheless, their ranking is consistent across metrics, with the second one obtaining the best scores. As such, we use it when prompting models in a few-shot scenario.

A.2 Training hyperparameters
In order to choose the best hyperparameters for both finetuning approaches, we perform a hyperparameter search. We only consider zero-shot translation and use the template format in Table 2. We find the best configuration based on the average COMET score on all language pairs in the validation set. Table 5 specifies the hyperparameters explored when training LLaMA 7B with traditional finetuning. We first chose the learning rate and weight decay, while not using warm-up steps. We then tuned the scheduler and warm-up steps. Our final configuration has a learning rate of 1e-6, no weight decay, and a constant learning-rate scheduler with no warm-up steps.
Table 6 details the hyperparameters explored when finetuning with LoRA. We based our experiments on the best configurations for the GPT-2 models trained in Hu et al. (2022). Initial experiments with lower r values led to an underfitting model, so our configurations focused on increasing model capacity, with higher r values, while keeping regularization through label smoothing and weight decay. In Table 7, we present the results for all the runs. We saw very little variation in the obtained scores. We adopted the best configuration, with an r value of 256, weight decay of 0.0, and label smoothing of 0.001.
Regarding the 13B models, we used the same hyperparameters as in the 7B models.

B Analysis on Chinese language pairs
In this section, we explore the results for the language pairs including Chinese with the LLaMA 7B model, in order to study whether our previous results hold. We start by investigating whether LoRA is still competitive with full finetuning. In Figure 6, we observe that LoRA performs comparably to the finetuned model and outperforms the pretrained LLM, following the trend of the other language pairs (see Section 3.1).
We also investigate the performance of the models finetuned with and without examples in Figure 7.
In Figure 8, we compare the training mixes by finetuning LLaMA 7B. We see that the results are very similar for both configurations, with the alternative configuration (Unbalanced) obtaining slightly lower results. As such, we adopted the method described in Section 4 for mixing few-shot examples during finetuning.

D Examples of domain adaptation
In this section, we provide examples of translations where the LLaMA 7B model trained with few-shot examples was able to absorb domain knowledge from the examples in the prompt.
In the first example from Table 8, we see that the model correctly translates the term GVO as GMOs (Genetically Modified Organisms), instead of adopting the acronym from the source sentence. In the second example, the model is able to correctly order the words in the translation.

E Analysis on the distributions of COMET score differences
We also provide a more in-depth analysis of the distributions of COMET score differences, with a focus on the examples with the highest differences.
In Figure 9, we observe that the distributions for the LLaMA 7B model finetuned without in-context examples also have large tails, similar to the results of the model finetuned with in-context examples (see Section 4.2).
We also analyzed whether the same long tails appear in the specialized domains. In Figure 10, we observe that this is indeed the case. The distributions of the differences are centered around zero and have extreme values on both sides for all domains and finetuned models.
Finally, we show several examples where few-shot prompting either helped or degraded the model's performance. In Table 9, prompting the model with few-shot examples fixed generation in the wrong language. In Table 10, introducing in-context examples in the model's prompt led to hallucinated content.

F Examples of generated outputs
In this section, we present translations where prompting the pretrained LLaMA 7B model leads to overgeneration, while both 7B finetuned models correctly stop after the translation. In Table 11, we see that, although all models generated the same translation, the pretrained model continued generating, repeating the prompt and the translation, while both finetuned models correctly stopped generating tokens.

G Results with all evaluation metrics
We provide the evaluation for the models considered in this paper using three other MT evaluation metrics: BLEU (Papineni et al., 2002), chrF (Popović, 2015) and COMETKiwi (Rei et al., 2022b).
In Figure 11, we show the comparison between both finetuning approaches with the LLaMA 7B model. The results are consistent across all metrics, with the LoRA model performing similarly to the fully finetuned model and outperforming the pretrained model.
In Figures 12, 13 and 14, we compare the models finetuned with and without few-shot examples using these metrics. For the lexical metrics, the degradation in few-shot performance is not visible on the 13B models. However, these metrics may not be reliable for evaluating translations from LLMs (Hendy et al., 2023), as LLMs tend to produce less literal translations, which are poorly captured by lexical overlap with the reference.

Table 8 (first example, de-en):
Source: "bezeichnet "genetisch veränderte Futtermittel" Futtermittel, die GVO enthalten, daraus bestehen oder hergestellt werden;"
Reference: ""genetically modified feed" means feed containing, consisting of or produced from GMOs;"
Zero-shot translation: ""Genetically modified feed" means feed containing GVO, derived from GVO or produced from GVO;"
Few-shot translation: ""genetically modified feed" means feed containing, consisting of or produced from GMOs;"
In Tables 12, 13, 14, 15, 16, 17 and 18, we also provide the exact scores for all metrics in tabular format.

Figure 4 :
Figure 4: Length of the tokenized outputs when translating the Flores-200 test set for the 7B models.

Figure 6 :
Figure 6: COMET scores on the Chinese language pairs of the Flores-200 test set by LLaMA 7B trained with full finetuning and LoRA.

Figure 7 :
Figure 7: COMET scores for Chinese language pairs by the 7B finetuned models on zero-shot and five-shot scenarios for the Flores-200 test set.
In Figure 7, we observe a similar trend to the results above. The model finetuned without few-shot examples exhibits a performance degradation, while the model finetuned with few-shot examples obtains higher performance with few-shot prompting, indicating it is extracting helpful information from the examples in the prompt.

C Experiments with more zero-shot data

We explored an alternative method for combining few-shot examples during finetuning. Instead of uniformly sampling between 0 and 5 examples, we built a training mixture where 50% of the training examples were zero-shot and the remaining ones had between 1 and 5 uniformly sampled examples.

Figure 8 :
Figure 8: COMET scores for zero-shot and five-shot translation by finetuning the LLaMA 7B model with the two methods for combining few-shot examples. Balanced is the method described in Section 4 and Unbalanced is the alternative method described in Appendix C.

Figure 9 :
Figure 9: Difference in COMET scores for zero- vs few-shot translations by the LLaMA 7B FT w/o few-shot model on Flores-200 (∆ > 0 means that the translation with few-shot examples was scored higher than the translation without examples).

Figure 10 :
Figure 10: Difference in COMET scores for translations obtained with zero-and few-shot prompting for all domains for the finetuned LLaMA 7B models.

Table 1 :
Hallucination rates for the finetuned models on each evaluation dataset, considering all language pairs.

Table 3 :
Prompting templates for finetuning with in-context examples.

Table 4 :
Scores for the few-shot formats on the Flores-200 validation set.

Table 7 :
Scores for the LoRA hyperparameters on the Flores-200 validation set.

Table 8 :
Examples of translations where the LLaMA 7B finetuned with few-shot examples was able to extract domain information from the examples in the prompt.

Table 9 :
Examples of translations by the 7B FT w/o few-shot model where adding examples corrected the language in which the model was generating.

Table 10 :
Examples of translations by the 7B FT w/o few-shot model where adding examples introduced a hallucination.

Table 11 :
Examples of translations where finetuning the LLaMA 7B model eliminated the overgeneration in the outputs.

Table 12 :
Scores for the 7B pretrained model and both 7B finetuned models on the Flores-200 test set.

Figure 13: BLEU scores for zero-shot and five-shot translation by models finetuned with and without few-shot examples. Scores are averaged across all language pairs. "FT w/o few-shot" refers to finetuning with translation instructions, as in Section 3. "FT w/ few-shot" refers to finetuning with few-shot examples, detailed in Section 4.

Figure 14: chrF scores for zero-shot and five-shot translation by models finetuned with and without few-shot examples. Scores are averaged across all language pairs. "FT w/o few-shot" refers to finetuning with translation instructions, as in Section 3. "FT w/ few-shot" refers to finetuning with few-shot examples, detailed in Section 4.
Figure 12: COMETKiwi scores for zero-shot and five-shot translation by models finetuned with and without few-shot examples. Scores are averaged across all language pairs. "FT w/o few-shot" refers to finetuning with translation instructions, as in Section 3. "FT w/ few-shot" refers to finetuning with few-shot examples, detailed in Section 4.