Augmenting Large Language Model Translators via Translation Memories

Using translation memories (TMs) as prompts is a promising approach to in-context learning of machine translation models. In this work, we take a step towards prompting large language models (LLMs) with TMs and making them better translators. We find that the ability of LLMs to "understand" prompts is indeed helpful for making better use of TMs. Experiments show that the results of a pre-trained LLM translator can be greatly improved by using high-quality TM-based prompts. These results are even comparable to those of state-of-the-art NMT systems that have access to large-scale in-domain bilingual data and are well tuned on the downstream tasks.


Introduction
Marrying the world of translation memory (TM) and the world of neural machine translation (NMT) is a challenging but interesting problem in natural language processing (NLP). Previous work along this line of research either requires architecture changes of NMT models and/or additional training (Gu et al., 2018; Bulté and Tezcan, 2019; Xu et al., 2020; Hossain et al., 2020; He et al., 2021), or constructs a translation knowledge base from the TM (Zhang et al., 2018; Khandelwal et al., 2021; Meng et al., 2022).
More recently, researchers have become aware of the strength of prompting techniques for one-shot/few-shot machine translation (Vilar et al., 2022; Agrawal et al., 2022; Zhang et al., 2023). In particular, Reheman et al. (2023) investigated one-shot learning methods for NMT by simply viewing TMs as prompts. The result of their work is a stronger NMT system that works in the same way as usual but can be prompted when TMs are available. Interestingly, they found that the ability of NMT models to "understand" prompts plays an important role in this type of system: prompts are still difficult to use if the NMT system is weak.
In this work, we take a step forward. We treat large language models (LLMs) as machine translation systems and prompt them with TMs (see Table 1 for a comparison of different methods). This is in part motivated by recent developments of LLMs: one of the most powerful properties of LLMs is their ability to understand and respond to complex instructions and questions (Ouyang et al., 2022; Thoppilan et al., 2022). We show that this ability is crucial for in-context learning of TM-based prompts, and that LLM-based translation systems can be greatly improved by using simple instruction-like prompts. To this end, we propose Translation Memory Prompting for large Language Models, namely TMPLM, a simple but effective approach to injecting TMs into LLM translators.
We experiment with our method on a GPT-based LLM (text-davinci-003 * ). On translation tasks ranging over multiple languages and domains, TM-based prompting improves the LLM-based translation system by 20 to 30 BLEU points, showing better performance than a well-tuned, large-scale, in-domain NMT system on most of the tasks. We also compare different kinds of prompt templates and discuss some interesting issues, such as the role of prompting in treating LLMs as translators.
f(·): What is the translation of "x" from src-lang to tgt-lang? Only translation results are required.

f_ref(·): If the translation of "x_tm^1" from src-lang to tgt-lang is "y_tm^1" and the translation of "x_tm^2" from src-lang to tgt-lang is "y_tm^2", then what is the translation of "x" from src-lang to tgt-lang? Only translation results are required.

Here x (in red) stands for the sentence that needs to be translated; x_tm (in blue) and y_tm (in green) stand for the source and target sentences found in the TM, respectively. Both src-lang and tgt-lang are to be replaced by the names of the source and target languages.

Prompting Methods
A TM is a database that contains the bilingual translation history of professional translators. It is usually used to help translate a test sentence by providing similar sentence pairs, which may contain translation hints, such as similar sentence patterns, phrases, lexicons, terminology, or other translation knowledge. Either an NMT model or an LLM needs to dig out those hints and ignore the irrelevant content. This motivates us to investigate prompting LLMs with TMs, benefiting from their remarkable ability to "understand" prompts.

Suppose we have a TM database that retains a collection of sentence pairs. Given a source-language sentence x, the database returns the k most similar sentences X_tm = {x_tm^1, ..., x_tm^k} along with their corresponding translations Y_tm = {y_tm^1, ..., y_tm^k}. Now suppose we have a pre-trained translation model (either an NMT model or an LLM) that takes x in some way and outputs a translation y, written as

y = Trans(f(x))

where Trans(·) denotes the translation model, and f(·) denotes a template by which we represent x as the input of Trans(·). For example, if Trans(·) is a standard NMT model, f(·) can simply be the identity mapping. We then wish to use this model to generate a new translation y′ by considering (X_tm, Y_tm) as instances for reference. This can be written as

y′ = Trans(f_ref(x, X_tm, Y_tm))

Here f_ref(·) is a new template that represents both x and the retrieved pairs as the input of Trans(·). In this work, we focus on the case in which a powerful generative LLM (such as ChatGPT) is used to perform translation. The input of Trans(·) could be an instruction or question-like text, and so we can design f_ref(·) in many different ways.
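The templates f(·) and f_ref(·) can be sketched as plain string builders. The wording below follows the instruction-style template of Figure 1; the function signatures are illustrative, not the paper's code.

```python
# Illustrative string builders for the templates f(.) and f_ref(.).
# Wording follows the instruction-style template of Figure 1;
# function names and signatures are hypothetical.

def f(x: str, src_lang: str, tgt_lang: str) -> str:
    """Template f(.): translate x without any TM context."""
    return (f'What is the translation of "{x}" from {src_lang} to {tgt_lang}? '
            'Only translation results are required.')

def f_ref(x: str, tm_pairs: list, src_lang: str, tgt_lang: str) -> str:
    """Template f_ref(.): prepend the k retrieved TM pairs (X_tm, Y_tm)."""
    clauses = " and ".join(
        f'the translation of "{x_tm}" from {src_lang} to {tgt_lang} is "{y_tm}"'
        for x_tm, y_tm in tm_pairs
    )
    return (f'If {clauses}, then what is the translation of "{x}" '
            f'from {src_lang} to {tgt_lang}? Only translation results are required.')
```

The resulting string is then passed as-is to the LLM, i.e., y′ = Trans(f_ref(x, X_tm, Y_tm)).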
In Figure 1, we present two types of templates: the instruction-style template and the code-style template. These designs come from a consideration of the human instruction tuning and the code training used in developing davinci-003. For a more extensive discussion of template design, see Appendix B.2.
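One plausible rendering of the code-style idea is to format the TM pairs as completed assignment-like lines and leave the query's target side open, so that a code-trained model fills it in like code completion. The exact layout used in the paper may differ, so treat this sketch as an assumption:

```python
def code_style_template(x, tm_pairs, src_lang, tgt_lang):
    # Each retrieved TM pair becomes a completed "assignment"; the final
    # line leaves the target side empty for the LLM to complete.
    # This layout is an assumption, not the paper's exact template.
    lines = [f'{src_lang}: "{s}" => {tgt_lang}: "{t}"' for s, t in tm_pairs]
    lines.append(f'{src_lang}: "{x}" => {tgt_lang}:')
    return "\n".join(lines)
```

A strong, regular line structure of this kind also provides the boundary signal discussed in Appendix A.5, which helps the model know where to stop.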
It is worth emphasizing that, while we restrict ourselves to TM-based prompts in experiments, we can apply this general approach to deal with other knowledge about translation. As a simple example, we can extend (X_tm, Y_tm) to term or phrase translations. Also, when some MT systems are available, we can make use of automatic translations from other systems to define prompts.

Data and LLM Setup
We tested our method (denoted by TMPLM) on three widely used TM datasets: DGT-TM (Steinberger et al., 2012), JRC-Acquis (JRC-A) (Steinberger et al., 2006), and the multi-domain dataset described in Aharoni and Goldberg (2020). To ensure a fair comparison, we adopted the same pre-processing steps as outlined in Reheman et al. (2023). For LLMs, we chose the davinci-003 model developed by OpenAI because it is currently one of the state-of-the-art generative LLMs. The model was configured with default values for all parameters, except that the sampling temperature was set to 0. In the experiments, we used the code-style template and set k to 5 by default. The quality of translations was mainly evaluated using multi-bleu.perl from Moses † . In addition, following the recommendation to use neural network-based metrics in machine translation evaluation (Freitag et al., 2022), we also used COMET-22 ‡ (wmt22-COMET-da) (Rei et al., 2022) as a complementary evaluation. See Appendixes A.3 and A.4 for more details about data processing.
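The decoding setup above amounts to a fixed parameter set per request. The sketch below builds those parameters (the model name is the one used in the paper; the `max_tokens` cap is our assumption) and notes where the legacy Completions endpoint would be called.

```python
def completion_params(prompt: str) -> dict:
    """Request parameters used for every call in this setup."""
    return {
        "model": "text-davinci-003",  # now-deprecated model used in the paper
        "prompt": prompt,
        "temperature": 0,             # sampling temperature 0, as described above
        "max_tokens": 256,            # assumed cap; not specified in the paper
    }

# With the `openai` Python package, these parameters would be sent to the
# legacy Completions endpoint, e.g.:
#   client.completions.create(**completion_params(prompt))
```

Setting the temperature to 0 makes decoding greedy, so repeated runs on the same prompt are reproducible.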

Baselines
We re-implemented Reheman et al. (2023)'s method, which augments NMT systems via TM-based one-shot learning. For NMT systems, we chose two champion models in WMT: Facebook's WMT19 en ↔ de models (Ng et al., 2019) and the WMT21 multilingual models (Tran et al., 2021). These WMT models were all trained on large-scale bilingual data and are improved by a series of techniques, such as back-translation and fine-tuning. As a second baseline, we chose the kNN-MT model (Khandelwal et al., 2021) because it is a very strong model for combining TM and NMT.

† http://www.statmt.org/moses/
‡ https://github.com/Unbabel/COMET

Translation Quality
Main Results. Table 2 shows BLEU scores on the DGT-TM and JRC-A datasets. We see, first of all, that TMPLM achieves the best result among all the systems. When TMs are not involved, the performance of the LLM is about 10 BLEU points lower than that of the NMT baselines. But, when armed with TMs, the LLM obtains very large BLEU improvements. The few-shot learning + LLM system even outperforms the strong NMT+TM baseline on all of the test sets. Also, by comparing the results of the WMT19 200M and WMT21 4B models, we see that larger models make better use of TMs (see Section 3.4 for more discussion). Besides, one-shot learning already gives satisfactory results for TMPLM, indicating that the most similar TM provides the most helpful translation hints. In Appendix B.4 we will see that few-shot learning yields BLEU gains in a long-tail manner.
Multi-language Experiments. We test TMPLM on more languages, running our system on data of 7 extra language pairs (i.e., 14 directions) from JRC-Acquis. From Figure 2, we see consistent improvements over all the language pairs. Even for non-English tasks, TMPLM can still achieve significant BLEU improvements. See Table 8 in Appendix B.3 for complete experimental results.
Multi-domain Experiments. Table 3 shows BLEU results on the multi-domain dataset. Again, the TMPLM system is robust to the domain shift. It performs best on three of the four domains.

Language Understanding Matters Most
We then investigate an interesting issue: what kind of ability enables large models to make better use of TM-based prompts? There are possibly three candidates: the abilities of translation, logical reasoning, and language understanding. However, as seen from Table 2, the baseline LLMs are not strong translation systems, and their BLEU scores are generally 10 points lower than those of the NMT systems. The translation ability of LLMs thus does not turn out to be important in TM-based prompting. Note that davinci-003 is a successor of GPT-3 and is trained on additional large-scale code data. It has been pointed out that training LLMs on code data can lead to a strong ability of logical reasoning (Liang et al., 2022). As seen in Figure 3 (a), however, there is no big difference between davinci-003 and GPT-3 in BLEU performance. On the other hand, davinci-003 has a significant ability to deal with instructions because it is tuned using human feedback on instructions. Such a property makes davinci-003 a better text processor, and thus a stronger translation system that works with various prompts. Therefore, it is the ability of language understanding that boosts LLMs' translation performance when prompted with TMs.

Template Styles
In Figure 3 (b), we compare the performance of the code-style and instruction-style templates on the DGT-TM en-de and de-en tasks. For systems without TMs, the instruction-style template shows similar performance to the code-style template. However, when TMs are used, the code-style template is better in most cases. In Appendix B.2, we test more templates and see a similar phenomenon: simpler templates work better.

Prompting with randomly selected demonstrations
We also compare the performance of TMPLM with conventional few-shot prompting, i.e., prompting LLM translators with randomly selected high-quality demonstrations (Vilar et al., 2022; Agrawal et al., 2022; Zhang et al., 2023; Moslem et al., 2023; Hendy et al., 2023). We conduct experiments on the DGT-TM dataset, with demonstrations randomly selected from the TM database of the DGT-TM dataset (in-domain) and from newstest2017 (out-of-domain), respectively. In Figure 4, we see that TMPLM exceeds conventional few-shot prompting by about 30 BLEU points, indicating that the LLM benefits much more from retrieved TMs than from random demonstrations. This also confirms that TMs carry useful translation hints, as explained in Section 2.

Combining TMs and NMT results
To examine the impact of high-quality translations on prompting LLMs, we replace the retrieved TM with the translation result of the WMT19 and WMT21 NMT systems when the TM's similarity is not high enough. We conduct experiments on the DGT-TM de → en data and the IT data in the multi-domain dataset because sentence similarity is distributed differently on them (see Appendix A.1). In Figure 5, we can see that performance declines as more NMT translation results replace TM results in prompting. This demonstrates that the quality of translations plays an important role in prompting LLMs. We also see that the performance on DGT-TM declines faster than that on the IT domain. We attribute this to the better translation quality of the NMT models on the DGT-TM dataset.
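One way to read the trade-off above is as a per-sentence switch: when the best TM's FMS falls below a threshold, the demonstration pair is replaced by an NMT hypothesis. The helper below is a hypothetical sketch of that switch (`nmt_translate` stands in for the WMT19/WMT21 systems); using the NMT hypothesis for the test sentence itself, rather than for the TM source, is our assumption.

```python
def build_demo(x, x_tm, y_tm, fms_score, threshold, nmt_translate):
    """Return the (source, target) demonstration pair used in the prompt.

    x           -- the sentence to be translated
    (x_tm, y_tm)-- the best retrieved TM pair
    fms_score   -- its Fuzzy Match Score against x
    threshold   -- the FMS cutoff (the x-axis of Figure 5)
    """
    if fms_score >= threshold:
        return (x_tm, y_tm)        # TM is similar enough: keep it
    # Below the threshold, fall back to an NMT hypothesis (one plausible
    # reading of the setup described above).
    return (x, nmt_translate(x))
```

At threshold 0 every demonstration comes from the TM; at threshold 1 every demonstration is an NMT output, matching the two endpoints described in Figure 5.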
An interesting finding is that prompting LLMs with NMT results cannot surpass the NMT system itself, while the BLEU scores of prompting LLMs with TMs are always better than those of the TMs alone. This indicates that LLMs indeed process the prompt text rather than simply copying it to the output.

Conclusion
We have proposed TMPLM, an in-context learning method that prompts LLMs with TMs. By incorporating TMs into tailored templates, LLMs with TMPLM outperform state-of-the-art NMT models with TM prompting. We have also demonstrated that the ability of language understanding plays an important role in prompting LLMs with TMs.

Limitations
The similarity of retrieved TMs is an important factor influencing the translation quality of TMPLM. However, high-similarity TMs are not always available in practical applications. It is worth studying methods to make use of relatively low-similarity translations in LLM-based translation systems.

A.1 Retrieval of Similar Sentences
Following Reheman et al. (2023), we adopt a word-level fuzzy matching strategy, with numbers and punctuation marks removed. Specifically, we first use the search engine Apache Lucene (Bialecki et al., 2012) to acquire the top-500 similar TMs from the TM database, and then rerank them by the length-normalized Levenshtein Distance, given by

FMS(x, x_tm) = 1 - LD(x, x_tm) / max(|x|, |x_tm|)

where FMS(·, ·) denotes the Fuzzy Match Score, LD(·, ·) denotes the word-level Levenshtein Distance, and |·| denotes the length of a sentence.
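The reranking score can be implemented directly from this definition. The sketch below computes a word-level Levenshtein distance after stripping numbers and punctuation; the Lucene retrieval stage is omitted, and the tokenization regex is a simplification of whatever pre-processing the release scripts use.

```python
import re

def _words(sentence: str) -> list:
    # Keep only alphabetic word tokens: numbers and punctuation are removed
    # before matching, as described above.
    return re.findall(r"[^\W\d_]+", sentence.lower())

def levenshtein(a: list, b: list) -> int:
    """Word-level edit distance LD(., .) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return prev[-1]

def fms(x: str, x_tm: str) -> float:
    """Fuzzy Match Score: 1 - LD(x, x_tm) / max(|x|, |x_tm|)."""
    a, b = _words(x), _words(x_tm)
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

Identical sentences score 1.0, and entirely dissimilar sentences approach 0.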

A.2 Details of Datasets
Datasets and their language directions used in our experiments are listed here.
The statistics of these TMs and the corresponding similarity ratios of the retrieved sentences under the FMS metric are shown in Table 4.

A.3 Data Pre-processing
For the DGT-TM, JRC-A and multi-domain datasets, we clean the data using the scripts provided by Reheman et al. (2023)'s work. To construct the test set and TM database for the DGT-TM and JRC-A datasets, we process each language direction separately. Specifically, we randomly extract 3,000 sentence pairs from each dataset as the test set, and use the remaining sentence pairs as the TM database. For the multi-domain dataset, we use its original test set as our test set and its original training set as the TM database. We use the FMS algorithm on the split data to obtain the TMs corresponding to the test set. In particular, for the few-shot experiments, we retrieve the k most similar sentence pairs from the TM database for each test sentence.
Finally, we replace the escaped characters in the dataset and use the Moses decoder detokenizer to recover the tokenized data before feeding it to the davinci-003 system.
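The test/TM split described above can be sketched as follows; this is a minimal stand-in for the release scripts, with a fixed seed added for reproducibility (the actual scripts may split differently).

```python
import random

def split_tm(pairs, test_size=3000, seed=0):
    """Randomly split bilingual pairs into a test set and a TM database.

    pairs     -- list of (source, target) sentence pairs
    test_size -- 3,000 in the setup described above
    """
    rng = random.Random(seed)          # fixed seed: reproducible split
    idx = list(range(len(pairs)))
    rng.shuffle(idx)
    test = [pairs[i] for i in idx[:test_size]]
    tm_db = [pairs[i] for i in idx[test_size:]]
    return test, tm_db
```

Retrieval (Appendix A.1) is then run against `tm_db` only, so no test sentence can retrieve itself.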
A.4 Data Post-processing

davinci-003 always generates redundant symbols at the beginning and end of sentences, including '"', '\n', '[', ']', and other escaped characters. The occurrences of these characters are regular and can be removed uniformly by scripts. Consequently, before scoring, we use the NiuTrans (Xiao et al., 2012) word segmentation tool for Chinese and the Moses decoder's tokenizer.perl for all other languages. Finally, we use multi-bleu.perl for scoring.
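Because the redundant symbols appear only at the sentence boundaries, the cleanup reduces to stripping a fixed character set. The sketch below uses exactly the characters listed above; real model outputs may need additional rules.

```python
def clean_output(text: str) -> str:
    """Strip redundant boundary symbols ('"', newline, '[', ']') that the
    model emits around a hypothesis. Character set per the description
    above; extend as needed for other escaped characters."""
    return text.strip().strip('"\n[]').strip()
```

Hypotheses are cleaned this way before tokenization and BLEU scoring.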

A.5 More Prompt Templates
We try a large number of prompt templates, as shown in Table 5. Unless otherwise specified, the instruction-style template with TM is #1 and without TM is #2, and the code-style template with TM is #17 and without TM is #18. In particular, in the multi-language experiments, we use the instruction-style template. The template for all of the few-shot experiments is obtained by increasing the number of TMs in #17.
Punctuation has a significant impact on the generation results. For example, using template #13, if the source sentence ends with ':', the model continues generating words and does not stop within an appropriate number of decoding steps. Meanwhile, although many templates have a similar form, their performance still differs. We believe that adding a strong boundary signal to the templates helps the model know where to end.

B.1 Evaluation by COMET-22
In addition to the BLEU scores, we also provide the COMET-22 scores, as seen in Table 6 and Table 7. We can see that, despite the LLM's poor zero-shot performance, prompting the LLM with a few TMs achieves significant improvement. On the other hand, the few-shot learning + LLM system can still outperform the strong NMT+TM baseline in most cases.

B.2 Performance of Different Prompt Templates
In order to explore the effect of different prompt templates on the performance of davinci-003, we use 20 prompt templates in the de → en direction of the DGT-TM dataset for experiments. As seen from Table 5, the code-style template is better than the instruction-style template in most cases.

B.3 Experiments on More languages
We perform multi-lingual experiments on the JRC-A dataset; in these experiments, we use the instruction-style template, as in the multi-language experiments of Section 3. Substantial BLEU improvements are obtained on these datasets.

B.4 Impact of k
To explore the effect of k on the performance of davinci-003 in the few-shot experiments, we conduct experiments with k from 1 to 9 in both directions of the DGT-TM dataset. Figure 6 shows a long-tail performance gain as k increases.

B.5 Impact of Orders of TM results
To observe the effect of the similarity order of TMs in the prompt template on few-shot performance, we construct two types of prompt templates on the DGT-TM dataset with a few-shot sample size of 5. In one, the TMs are arranged in descending order of similarity, so the TM adjacent to the sentence to be translated has the lowest similarity. In the other, the TMs are arranged in ascending order of similarity, so the TM adjacent to the sentence to be translated has the highest similarity. The results are shown in Table 10.
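The two orderings compared above can be sketched as a single sort over the retrieved demonstrations (a hypothetical helper; the prompt is then built by concatenating the returned pairs, with the last element adjacent to the test sentence).

```python
def order_demos(demos, ascending=True):
    """Order (x_tm, y_tm, fms) triples for the prompt.

    ascending=True  -> similarity increases toward the test sentence,
                       so the adjacent TM is the MOST similar.
    ascending=False -> similarity decreases toward the test sentence,
                       so the adjacent TM is the LEAST similar.
    The last element of the returned list is the one placed adjacent
    to the sentence to be translated.
    """
    return sorted(demos, key=lambda d: d[2], reverse=not ascending)
```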

B.6 Performance on the WMT Datasets
We conduct experiments in the WMT14 en → de and WMT19 de → en directions. We use the same method as on the multi-domain dataset to process these two benchmarks. It is worth noting that the data obtained on these two benchmarks has low TM similarity, as shown in Table 11, which also reports the performance of the LLM and the baseline models on these datasets.

B.7 Performance of Different Sized Models
Moreover, we conduct experiments using "small" models such as text-curie-001 and text-babbage-001. Their performance is far behind davinci-003, and their outputs sometimes contain empty lines. We attribute this to the lack of emergent abilities in smaller models (Wei et al., 2022). The results are shown in Table 12.
Table 1 :
Methods of using TM for better MT. w/o-arch-change = without architecture changes or additional training, w/o-base = without constructing a translation knowledge base from the TM, and few-shot = few-shot learning.

Figure 1 :
Figure 1: Two styles of template. f(·) denotes a template by which we represent the input sentence as the input of the translation model (such as the LLM in this figure). f_ref(·) is a new template involving outputs of a TM (k = 2 in this example). x in red stands for the sentence that needs to be translated. x_tm in blue and y_tm in green stand for the source and target sentences found in the TM, respectively. Both src-lang and tgt-lang need to be replaced by the names of the source and target languages.

Figure 2 :
Figure 2: Comparison of the LLM without and with TMs (one-shot) on 8 language pairs from JRC-A. Points in deep and light color stand for the BLEU scores of the LLM without and with TM, respectively.

Figure 3 :
Figure 3: Experiments on two factors: different LLMs and different template styles.

Figure 4 :
Figure 4: BLEU scores of different prompting strategies on the DGT-TM dataset. In-domain and out-of-domain represent demonstrations randomly selected from the TM database of the DGT-TM dataset and from newstest2017, respectively. TM represents the top-k similar translation memories (i.e., demonstrations) retrieved from the TM database of the DGT-TM dataset.

Figure 5 :
Figure 5: BLEU scores as functions of the threshold for using similar sentences in TMs on the DGT-TM and IT domain data. The left y-axis represents the BLEU scores of prompting LLMs with the translation results from NMT systems, and the x-axis represents the similarity (i.e., the FMS in Appendix A.1) threshold by which we trade off between using TMs and NMT results as prompts (1 means that we only use TMs as prompts, and 0 means that we only use NMT outputs as prompts). Deep and light red curves represent the performance of the LLMs when working with the WMT19 200M and WMT21 4B systems. Blue curves represent the proportion of TMs used (see the right y-axis).
Figure 6 :
Figure 6: BLEU scores of different k on the DGT-TM dataset.

Table 2 :
BLEU scores of NMT models and LLMs on the DGT-TM and JRC-A datasets. WMT19 200M indicates the WMT19 champion models (Ng et al., 2019), containing 200 million parameters. WMT21 4B indicates the WMT21 champion models (Tran et al., 2021), trained on multi-language-pair data and containing 4 billion parameters. One-shot and few-shot represent the results of TMPLM with k = 1 and k = 5, respectively. The BLEU improvements are reported in subscripts. See Table 6 for the COMET-22 version.

Table 3 :
Comparison of the NMT models and the kNN-MT model on the multi-domain dataset by BLEU. The COMET-22 version can be found in Table 7.

Table 4 :
TMs and proportions of the retrieved sentences in different ranges of FMS.

Table 6 :
COMET-22 scores of NMT models and LLMs on the DGT-TM and JRC-A datasets.

Table 7 :
COMET-22 scores of NMT models and LLMs on the multi-domain dataset.

Table 8 :
Experiment results on 8 language-pairs from JRC-A.

Table 9 :
Performance of replacing the low-matching part of TMs at different FMS thresholds with translation results from NMT. For example, FMS 0.2 in the first row means that TMs with FMS less than 0.2 are replaced by NMT translation results.

Table 10 :
The performance comparison of different templates constructed based on the similarity order of TMs, with a few-shot sample size of 5.

Table 11 :
Comparison of performance on WMT dataset.

Table 12 :
Comparison of performance with different size models on DGT-TM de → en.