ParroT: Translating During Chat Using Large Language Models

Large language models (LLMs) like ChatGPT and GPT-4 have exhibited remarkable abilities on a wide range of natural language processing (NLP) tasks, including various machine translation abilities accomplished during chat. However, these models are only accessible through restricted APIs, which creates barriers to new research and advancements in the field. Therefore, we propose the ParroT framework to enhance and regulate the translation abilities during chat based on open-source LLMs (i.e., LLaMA-7b, BLOOMZ-7b1-mt) and human-written translation and evaluation data. Specifically, ParroT reformulates translation data into the instruction-following style, and introduces a "Hint" field for incorporating extra requirements to regulate the translation process. Accordingly, we propose three instruction types for finetuning ParroT models, including translation instruction, contrastive instruction, and error-guided instruction. We can finetune either the full models or partial parameters via low-rank adaptation (LoRA). Experiments on Flores subsets and WMT22 test sets suggest that translation instruction improves the translation performance of vanilla LLMs significantly, while error-guided instruction can lead to a further improvement, which demonstrates the importance of learning from low-quality translations annotated by humans. Meanwhile, the ParroT models can also preserve their ability on general tasks when the Alpaca multi-task dataset is involved in finetuning. Please refer to our GitHub project for more implementation details: https://github.com/wxjiao/ParroT


Introduction
Large language models (LLMs), designed in the instruction-following format, such as ChatGPT and GPT-4 (OpenAI, 2023), have garnered considerable interest due to their remarkable abilities in comprehending instructions and generating human-like responses. These versatile models can efficiently perform a wide range of natural language processing (NLP) tasks.
Machine translation, a quintessential NLP task, faces both challenges and opportunities presented by the emergence of LLMs. Traditional machine translation encompasses several sub-tasks (Farhad et al., 2021), such as bilingual translation (Vaswani et al., 2017), multilingual translation (Johnson et al., 2017; Jiao et al., 2022), terminology translation (Wang et al., 2022; Hou et al., 2022), quality estimation (Rei et al., 2020), and automatic post-editing (Pal et al., 2016), among others. These tasks are typically addressed by individual models with limited cross-task interaction. However, current LLMs have the potential to revolutionize this inefficient approach and redefine the machine translation paradigm. On one hand, LLMs can leverage the benefits of various sub-tasks and seamlessly transition between them using only natural language instructions. For instance, if a user is dissatisfied with a translation result, they can request the LLM to refine the translation implicitly (i.e., through automatic post-editing) or explicitly, by imposing constraints on specific entities (i.e., terminology translation). On the other hand, LLMs are expected to enhance the explainability of machine translation, ultimately leading to further improvements in translation quality. For example, users may want LLMs to compare two translations of a sentence (i.e., quality estimation) and provide an explanation for the discrepancies (i.e., error analysis), which can then be addressed in a targeted manner by the LLM itself. However, superior LLMs like ChatGPT and GPT-4 are only accessible through restricted APIs, which creates barriers to new research and advancements in the field. Therefore, developing comprehensive machine translation abilities upon open-source LLMs has become a critical and challenging research problem.
In this paper, we propose the ParroT framework to enhance and regulate the translation abilities of LLMs during chat by leveraging existing human-written translation and feedback data. To be compatible with chat, our framework reformulates translation data into the instruction-following style (Taori et al., 2023), and introduces a "Hint" field for incorporating extra requirements to guide the translation process. Accordingly, we propose three distinct instruction types: (1) Translation Instruction, which asks LLMs to generate translations based on source sentences. (2) Contrastive Instruction, which asks LLMs to generate the translations of two different systems with the preferred one first. (3) Error-Guided Instruction, which asks LLMs to generate translations with human-annotated errors as the hint. The first instruction guarantees the basic translation ability of LLMs, while the latter two regulate the LLMs to align with human feedback (Ouyang et al., 2022; Liu et al., 2023). We adopt the open-source LLaMA (Touvron et al., 2023) and BLOOM (Scao et al., 2022) models, and conduct instruction tuning on previous WMT validation data and Multidimensional Quality Metrics (MQM) human evaluation data. The resulting ParroT models are evaluated on Flores subsets and WMT22 test sets.
Our main findings are summarized below:
• Translation instruction, as expected, can improve the translation performance of LLMs significantly, especially for directions from English to other languages.
• Error-guided instruction can further improve the performance when asking ParroT to generate translations with no error, indicating the importance of learning from low-quality translations annotated by humans.
• Parameter efficient finetuning with low-rank adaptation (LoRA, Hu et al., 2022) can prevent LLMs from overfitting, which achieves better performance on dominant languages but slows down the learning from other languages.
• We demonstrate the potential of automatic evaluation tools (i.e., COMET) in providing quality information about translations when constructing error-guided instructions for directions that lack human annotation data.

Instruction Pool
In this section, we introduce the three distinct instruction types: translation instruction, contrastive instruction, and error-guided instruction. The first instruction guarantees the basic translation ability of LLMs, while the latter two regulate the LLMs to align with human-written translation and feedback.

Translation Instruction
Like traditional translation systems, we rely on bilingual sentence pairs to establish the basic translation ability of LLMs. We follow Stanford Alpaca (Taori et al., 2023) in transforming bilingual sentence pairs into the instruction-following format, named translation instruction, for finetuning. Table 1 presents an example of the translation instruction, which includes a preface fixed for all tasks, an "### Instruction:" to describe the translation task (e.g., stating the language pair), an "### Input:" with the source sentence, and a "### Response:" with the target sentence to be generated. To ensure the high quality of sentence pairs, we use human-written translations rather than public training data, which could be noisy.
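The formatting described above can be sketched as follows. This is an illustrative reconstruction: the preface wording follows the Stanford Alpaca release, and the instruction phrasing and helper name are assumptions, not the paper's exact code.

```python
# Sketch: formatting a bilingual pair into the Alpaca-style translation
# instruction. The preface is fixed for all tasks.
PREFACE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request."
)

def make_translation_instruction(src_lang, tgt_lang, src_text, tgt_text):
    """Build one training example in the instruction-following format."""
    return (
        f"{PREFACE}\n\n"
        f"### Instruction:\nTranslate the following sentences from "
        f"{src_lang} to {tgt_lang}.\n\n"
        f"### Input:\n{src_text}\n\n"
        f"### Response:{tgt_text}"
    )

example = make_translation_instruction(
    "Chinese", "English",
    "市场经营秩序总体平稳。",
    "The market order is stable overall.",
)
print(example)
```

During finetuning, the loss is typically computed only on the tokens after "### Response:", so the model learns to continue the prompt with the target translation.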

Contrastive Instruction
Besides the basic translation ability, we also want LLMs to understand the relative quality differences between translations. In this way, we may improve the quality of translations by asking LLMs to output the preferred ones. To realize this goal, we need multiple different translations for each source sentence, which can be acquired from the systems submitted to WMT competitions. Meanwhile, the human evaluation results of these systems also provide scores that reflect the quality differences. As shown in Table 1, we form the response by concatenating two translations (e.g., linked by "rather than"), in which the first translation has the higher quality score. Meanwhile, we indicate that the first translation is preferred in the "### Hint:" field. Essentially, the second translation acts as a negative sample for this sentence pair, which explains the name contrastive instruction.
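Assembling a contrastive example from two scored system outputs could look like the sketch below. The exact hint wording is an assumption; the paper only specifies that the preferred translation comes first and that the pair is linked by "rather than".

```python
# Sketch: build a contrastive instruction from two system translations
# and their human evaluation scores (higher score = preferred).
def make_contrastive_instruction(src_lang, tgt_lang, src_text,
                                 translations_with_scores):
    # Sort so the higher-scoring (preferred) translation comes first.
    ranked = sorted(translations_with_scores, key=lambda t: -t[1])
    better, worse = ranked[0][0], ranked[1][0]
    return (
        f"### Instruction:\nTranslate the following sentences from "
        f"{src_lang} to {tgt_lang}.\n\n"
        f"### Input:\n{src_text}\n\n"
        f"### Hint: We prefer to translate it to\n\n"
        f"### Response:{better} rather than {worse}"
    )

ex = make_contrastive_instruction(
    "German", "English", "Das Haus ist alt.",
    [("The home is ancient.", 78.5), ("The house is old.", 92.0)],
)
print(ex)
```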

Error-Guided Instruction
The potential problem with contrastive instruction is that it only tells the LLMs that the two translations have quality differences, but does not clarify which kinds of translation errors lead to such differences. However, we want LLMs to learn the correspondence between errors and translations. With such a deeper understanding of translation errors, we may ask LLMs to produce translations with no errors so as to improve the quality.
Therefore, we propose error-guided instruction. As shown in Table 1, we use the translation with errors annotated by "<v></v>" spans to form the response. Similar to contrastive instruction, we adopt the "### Hint:" field to indicate the error types. This kind of fine-grained error annotation also comes from the human evaluation data.
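An error-guided example can be assembled as sketched below. The hint phrasing follows the "A translation with … errors could be" pattern shown in Table 1; the helper name and the example sentences are illustrative assumptions.

```python
# Sketch: build an error-guided instruction. Error spans in the target
# are wrapped with "<v></v>" and the error type appears in the hint.
def make_error_guided_instruction(src_text, annotated_translation,
                                  error_type=None):
    hint = (f"A translation with {error_type} errors could be"
            if error_type else "A translation with no errors could be")
    return (
        "### Instruction:\nTranslate the following sentences from "
        "Chinese to English.\n\n"
        f"### Input:\n{src_text}\n\n"
        f"### Hint: {hint}\n\n"
        f"### Response:{annotated_translation}"
    )

ex = make_error_guided_instruction(
    "市场经营秩序总体平稳。",
    "The market order is <v>basically</v> stable overall.",
    "minor fluency/grammar",
)
print(ex)
```

At inference time, the same hint field with "no errors" can then be used to steer the model toward error-free output.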

Training Data
Alpaca Data. This dataset was built by the Stanford Alpaca project (Taori et al., 2023) and contains 52.0K multi-task instruction-following examples for tuning the LLaMA (Touvron et al., 2023) models. We call these data general instructions, which help the resulting ParroT models maintain their capabilities on general tasks.
WMT Validation Data. We use human-written validation data from previous WMT competitions rather than public training data to avoid introducing noise into instruction tuning. In this version, we use the newstest2017-2020 of the Chinese⇔English (i.e., Zh⇔En) and German⇔English (i.e., De⇔En) tasks, which consist of 51.2K sentence pairs across all four directions. These sentence pairs are formed into translation instructions.
MQM Human Evaluation Data. Our human feedback data come from the Multidimensional Quality Metrics (MQM) datasets (Freitag et al., 2021), which annotate the different translation errors (e.g., major accuracy/mistranslation, minor fluency/grammar) of top WMT systems. Due to its higher reliability than Direct Assessment, MQM was introduced to WMT competitions starting from WMT20, but is only provided for a few language pairs. In this version, we use the MQM data for the WMT20 En⇒De and Zh⇒En submissions. These data are formed into contrastive instructions (i.e., 20K) based on the quality scores and error-guided instructions (i.e., 26K) based on the error annotations, respectively.
Automatically Assessed Data. Although the Direct Assessment (DA) data of WMT systems provide scores for language directions that lack MQM data (i.e., De⇒En, En⇒Zh), we find the DA scores to be very unreliable, as they can be quite different for two similar translations. Instead, we opt for automatic evaluation metrics like COMET to score the translations of WMT systems. We also heuristically determine a rough error level for each translation based on the COMET score, namely, Major Error: [0, 85]; Minor Error: (85, 90]; No Error: (90, 100]. This decision comes in part from the observation that top commercial systems achieve COMET scores of nearly 90 on the Flores subsets (Table 3). Finally, we obtain 24K contrastive instructions and 29K error-guided instructions.
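The COMET-to-error-level heuristic above amounts to a simple threshold lookup, using the paper's stated boundaries:

```python
# Map a COMET score (0-100 scale) to a rough error level, following the
# thresholds: Major Error [0, 85]; Minor Error (85, 90]; No Error (90, 100].
def error_level_from_comet(score):
    if score <= 85:
        return "Major Error"
    if score <= 90:
        return "Minor Error"
    return "No Error"

for s in (81.2, 87.5, 93.0):
    print(s, "->", error_level_from_comet(s))
```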
Note: To obtain a set of diverse instructions, we use the three instructions in Jiao et al. (2023), including the one in Table 1, as seeds and ask GPT-4 (OpenAI, 2023) to paraphrase them. In total, we have 33 different instructions that are randomly combined with the training examples.

Model Training
We conduct our experiments with HuggingFace Transformers on open-source LLMs from both the LLaMA family (Touvron et al., 2023) and the BLOOM family (Scao et al., 2022). Specifically, we choose LLaMA-7b and BLOOMZ-7b1-mt with matched parameter counts, and also include LLaMA-13b and BLOOMZ-560m to study the effect of model size. We finetune them into the following variants:
• Alpaca, a reimplementation of the Stanford Alpaca model, finetuned only on the Alpaca multi-task dataset.
• ParroT-T, finetuned on the Alpaca multi-task dataset and only the translation instructions from WMT validation data.
• ParroT, finetuned on the Alpaca multi-task dataset, and all the three types of instructions introduced above.
The hyper-parameters for finetuning are basically consistent with Stanford Alpaca (Taori et al., 2023). We finetune the Alpaca and ParroT-T models for 3 epochs on the corresponding data combination.
For ParroT and ParroT-LoRA, we finetune for 1.5 epochs to maintain a similar number of training steps as ParroT-T. We conduct finetuning on 8 Nvidia A100 GPUs and utilize DeepSpeed ZeRO stage 3 for model parallelism.
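The low-rank adaptation used by the -LoRA variants can be illustrated with a minimal numerical sketch. The dimensions and rank below are arbitrary choices for illustration, not the paper's configuration:

```python
import numpy as np

# Minimal sketch of the LoRA idea: the frozen weight W is augmented with
# a trainable low-rank update B @ A, so only r * (d_in + d_out)
# parameters are tuned instead of d_out * d_in.
d_out, d_in, r = 64, 64, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

x = rng.standard_normal(d_in)
y = (W + B @ A) @ x                     # adapted forward pass

# With B initialized to zero, the adapted model starts out identical to
# the pretrained one; only A and B receive gradient updates.
print("trainable:", A.size + B.size, "vs full:", W.size)
```

The small trainable budget (512 vs 4096 parameters here) is what gives LoRA its regularizing effect, consistent with the overfitting discussion in the results section.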

Evaluation
Test Data. We evaluate the translation performance of LLMs on two sources of test sets:
• Flores Subset: This dataset is a subset of the Flores benchmark, in which 50 sentences are sampled for German, English, Romanian, and Chinese, respectively, for evaluating the translation performance of ChatGPT (Jiao et al., 2023).
For models based on BLOOM, we only evaluate on the WMT22 test sets, since the Flores benchmark was used in the development of the BLOOMZ models.

Ablation Study
Before diving into more experiments, we investigate some factors that may affect the translation performance of LLMs. By default, we conduct the ablation studies on the Flores En⇒De subset with the Alpaca model based on LLaMA-7b.
Prompt Format. In the Alpaca multi-task dataset, about 60% of the examples contain an empty "### Input:", which results in two different prompt formats during finetuning, i.e., prompt-input and prompt-no-input. During inference, prompt-no-input is used, combining the instruction and input to fill the "### Instruction:" field, which introduces an inconsistency between finetuning and inference. Therefore, we study whether this operation causes any performance variation.
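The two prompt formats in question can be written out as below. The template wording follows the Stanford Alpaca release; the example instruction and source sentence are illustrative.

```python
# The two Alpaca prompt templates. With prompt-no-input, the translation
# instruction and the source sentence are merged into a single field.
PROMPT_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

instruction = "Translate the following sentences from Chinese to English."
source = "市场经营秩序总体平稳。"

with_input = PROMPT_INPUT.format(instruction=instruction, input=source)
no_input = PROMPT_NO_INPUT.format(instruction=f"{instruction}\n{source}")
print(no_input)
```

The ablation asks whether feeding the source through "### Input:" (matching the translation-instruction finetuning format) beats merging it into "### Instruction:".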
Instruction Variation. Recent studies (Jiao et al., 2023; Zhang et al., 2023) suggest that LLMs are sensitive to task instructions, which can vary the translation performance considerably. We conduct a brief study on this by comparing the TP1 and TP3 instructions in Jiao et al. (2023). TP1 is the one presented in Table 1, while TP3 is "Please provide the [TGT] translation for the following sentences.", which was demonstrated to be the better choice when tested on ChatGPT.
Search Algorithm. In machine translation, beam search (Sutskever et al., 2014; Freitag and Al-Onaizan, 2017; Vaswani et al., 2017) has been the standard search algorithm for inference. However, beam search incurs high computation costs, which can become infeasible with LLMs, since it can easily induce out-of-memory (OOM) issues. As a result, more efficient search algorithms such as sampling may have to be the choice. We therefore compare the sampling strategy (Taori et al., 2023) and the beam search strategy with a beam size of 4 for this factor.
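The two decoding strategies might be configured as below. Parameter names follow HuggingFace Transformers conventions; only the beam size (4) is taken from the text, while the sampling hyper-parameters are illustrative assumptions.

```python
# Hypothetical decoding settings to pass to model.generate(**cfg).
beam_search = {"num_beams": 4, "do_sample": False, "max_new_tokens": 256}
sampling = {"do_sample": True, "temperature": 0.7, "top_p": 0.9,
            "max_new_tokens": 256}

# Beam search keeps num_beams partial hypotheses alive at every step,
# multiplying activation memory and risking OOM with 7B+ models;
# sampling decodes a single hypothesis.
print(beam_search["num_beams"], "beams vs. 1 sampled hypothesis")
```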
Table 2 presents the results of these ablation studies. We have the following observations: (1) prompt-input performs slightly better than prompt-no-input, though the gap is marginal; (2) the TP1 instruction works better on Alpaca than TP3, which differs from the finding on ChatGPT; (3) generally, beam search outperforms sampling significantly, especially in terms of BLEU score. Therefore, we use prompt-input + TP1 + beam search as the default setting for inference.

Main Results
Table 3 and Table 4 present the translation performance of the LLaMA and BLOOM models on the test sets. For the Flores subsets, we include the baseline results reported in Jiao et al. (2023).
Instruction tuning exploits the potential of vanilla LLMs for machine translation. Table 3 shows that the vanilla LLaMA-7b model, without any further training, performs badly on the Flores subsets. By inspecting the outputs, we find that the vanilla LLaMA-7b model tends to generate very long sentences (e.g., copying the instructions, continuing text expansion), which makes the generated text unfaithful to the source sentences and also ungrammatical. One reason could be the long-context modeling during pretraining. Another is that we use the Alpaca inference format, which is basically a zero-shot setting that provides no guidance for translation. Tuning LLaMA-7b on the Alpaca multi-task dataset (i.e., Alpaca) can ameliorate this issue, resulting in complete generations of proper length. We find that Alpaca performs much better on translation, which may benefit from the 0.5% of translation instructions in the Alpaca multi-task dataset. However, the best performance is mainly observed on high-resource directions like De⇒En, since English is the dominant language of the Alpaca dataset. Further introducing a small amount of translation instructions (i.e., ParroT-T) in the four language directions can significantly improve the performance, especially for En⇒Zh, as Chinese was unseen in the pretraining of the LLaMA models (Touvron et al., 2023). The findings for these LLaMA-based models are also consistent on the WMT22 test sets.
Learning from low-quality translations annotated by humans is also important. While presenting high-quality bilingual pairs to LLMs is important, as discussed above, we argue that low-quality translations annotated by humans also bring benefits. As shown in Table 3, without a hint in inference, ParroT outperforms ParroT-T slightly on translation directions from English to other languages (i.e., En⇒De, En⇒Zh). However, when asking ParroT to generate translations with no error, the performance is significantly improved across translation directions and test sets. We speculate that ParroT does learn the relationship between errors and translations through error-guided instruction, such that it can avoid translation errors as much as possible when the hint of no error is provided. Somewhat unexpectedly, when asking ParroT to generate preferred translations, the performance drops considerably. As stated in Section 2.3, contrastive instruction only indicates that two translations may have quality differences but does not state why, which is difficult for LLMs to identify by themselves. A previous study by Min et al. (2022) also suggests that it is easier for LLMs to learn instruction formats than input-response patterns, which may explain the phenomenon here.
Parameter-efficient finetuning may prevent LLMs from overfitting. We also try low-rank adaptation (LoRA, Hu et al., 2022) to finetune partial parameters of LLMs for efficiency. Experimental results in Table 3 show that Alpaca-LoRA outperforms its full-model counterpart noticeably. We speculate that LoRA can prevent LLMs from overfitting the small Alpaca multi-task dataset, leading to stronger generalization ability. However, applying LoRA to ParroT exhibits distinct behaviors for high-resource and low-resource translation directions. Specifically, ParroT-LoRA outperforms the corresponding full model ParroT on De⇒En but performs much worse on the other directions. It seems that the small amount of tunable parameters also hinders the learning of instructions from other translation directions. Clearly, the hyper-parameters of LoRA should be properly adjusted to better learn from more instruction data.
LLM families and sizes also matter. For both the LLaMA and BLOOM families, larger models achieve much better translation performance after instruction tuning. Our ParroT framework proves effective across all the models. Comparing the two LLM families, the ParroT model based on BLOOMZ-7b1-mt performs much better on the Zh⇒En and En⇒Zh directions than those based on LLaMA-7b, which mainly results from the better modeling of Chinese during the pretraining of BLOOM.
Automatic evaluation tools can be effective in constructing error-guided instructions. In Section 3.1, we construct automatically assessed data for De⇒En and En⇒Zh, which are not provided with MQM data. As shown in Table 3 and Table 4, we observe considerable improvements from error-guided instruction on these two translation directions. This demonstrates the potential of automatic evaluation tools (i.e., COMET) in providing quality information about translations for directions that lack human annotation data.

Analysis
We conduct more analyses to understand the effects of our instruction types. By default, we use the model variants based on LLaMA-7b and the Flores subsets.
Effectiveness of Error-Guided Instruction. To understand how error-guided instruction works, we investigate the behavior of ParroT when asking it to generate translations with varied error levels as hints, as shown in Table 5. For qualitative analysis, we show an example from the Flores Zh⇒En subset in Table 6, in which we highlight all errors in each translation. Compared to the no-error level, the minor and major error levels tend to produce more over-translations and mistranslations. It is also important to point out that the no-error level does not guarantee that completely correct translations will be generated, especially for named entities, which we attribute to the underexplored translation abilities of current LLMs.
Failure of Contrastive Instruction. We try to understand why contrastive instruction does not work. By examining the responses of ParroT when asking it to generate preferred translations, we observe significant differences in lexical choices between the "preferred" and "unpreferred" (i.e., the second translation in the response) translations. Surprisingly, as shown in Table 7, the "unpreferred" translations obtain a much higher BLEU score, but the situation is different for the COMET score. This indicates that ParroT attempts to identify the quality differences between the first and second translations in the contrastive instructions through lexical choices, which is a low-level pattern for reflecting translation quality. One potential reason is that the WMT systems are so competitive with each other that the quality differences between them are too subtle for the LLM to learn effectively. We will investigate contrastive instruction further in future work.

Related Work
LLMs for MT. With their increasing capacity, LLMs have become good few-shot learners (Brown et al., 2020; Lin et al., 2022) on various NLP tasks, including machine translation. A number of recent studies focus on how to prompt LLMs for machine translation, including prompt template comparison (Zhang et al., 2023), few-shot example selection (Agrawal et al., 2022; Vilar et al., 2022), domain adaptation (Moslem et al., 2023), and rare word translation (Ghazvininejad et al., 2023). In contrast, our ParroT framework aims to develop instant translation capability for chatbots without few-shot examples. This is consistent with the behavior of ChatGPT and GPT-4 (OpenAI, 2023), which exhibit excellent translation ability (Jiao et al., 2023; Bang et al., 2023; He et al., 2023; Liang et al., 2023) during chat.
Instruction Tuning. To eliminate the reliance on few-shot examples, recent studies also try to finetune LLMs on a small amount of instruction data covering different NLP tasks, making the LLMs zero-shot learners (Mishra et al., 2022; Wei et al., 2022). With the emergence of various powerful open-source LLMs such as BLOOM (Scao et al., 2022) and LLaMA (Touvron et al., 2023), there has been a boom in creating instruction data and tuning customized chatbots, for example, Alpaca (Taori et al., 2023), Vicuna, WizardLM (Xu et al., 2023), and the like. However, most of these studies focus on developing chatbots capable of general NLP tasks, while we pay more attention to machine translation. More importantly, apart from the instructions built from parallel translation data, we also transform human feedback data into instructions and demonstrate their effectiveness in improving translation performance.

Conclusion
We propose ParroT to enhance and regulate the translation abilities of LLMs during chat, based on open-source LLMs and human-written translation and feedback data. We reformulate translation data into the instruction-following style, and introduce a "Hint" field for incorporating extra requirements to regulate the translation process. Accordingly, we propose three instruction types for finetuning ParroT models, i.e., translation instruction, contrastive instruction, and error-guided instruction. Experiments on Flores subsets and WMT22 test sets suggest that translation instruction improves the translation performance of vanilla LLMs significantly, while error-guided instruction can lead to further improvement, demonstrating the importance of learning from low-quality translations annotated by humans. While we only use three instruction types in this paper, it is natural to extend ParroT to other hints (e.g., entity alignments), which we leave for future exploration.

Limitations
This work performs a preliminary exploration of instant translation capability for chatbots, which can be further improved in the following aspects:
• Instruction Variants: Presently, the instructions only support the translation of incoming sentences. It may be beneficial for chatbots to also translate previous chat records when users struggle to comprehend responses in foreign languages.
• Contrastive Translations: In this study, we did not observe performance improvements related to contrastive instructions, possibly due to incorrect instruction formatting. By exploring alternative formats, such as automatic post-editing (APE), we could potentially capitalize on the advantages of contrastive translations.
• LoRA Effectiveness: The current analysis did not reveal consistent performance improvements when using LoRA as compared to full-model training. It may be necessary to adjust the number of tunable parameters according to the dataset size for better results.

A Flores Full Sets
We adopted the subset of Flores in order to directly compare with the results of ChatGPT and commercial MT systems from the previous evaluation report (Jiao et al., 2023), as stated in Section 3.3. However, for a more convincing evaluation, we also test the LLMs on the full Flores sets (i.e., with 1012 sentences), and list the results in Table 8. The trend of performance across systems still holds, and our "ParroT + Infer w/ No Err" setting performs the best.

B General Tasks
In our experiments, we include the Alpaca multi-task dataset in finetuning ParroT to preserve its ability on general tasks. We present two examples in Table 9, from question answering and code generation, respectively. The responses of both Alpaca and ParroT are correct, although the expressions differ. For a comprehensive comparison, a quantitative evaluation would be preferable, which we leave for future work.

Figure 1: Framework of ParroT. Hints are (optional) extra requirements to regulate the translation process.
(Table 6 content, flattened in extraction; the reference translation is truncated.)
Reference (truncated): … castles in the Loire Valley, the Rhine Valley, or taking a cruise to interesting cities on the Danube or boating along the Erie Canal.
None (81.2): For example, visiting the castles in the Rhine Valley and the Lahn Valley, or taking a cruise along the Rhine River to explore the cities along the river, or taking a boat trip along the Illi River.
No Err. (81.4): For example, visiting the castles in the Rhine Valley and the Ruhr Valley, or taking a cruise to visit the cities along the Rhine River, or taking a boat ride along the Illi River.
Minor Err. (78.1): For example, visiting the castles in the Rhine Valley and the Ruhr Valley, or taking a cruise to visit the cities along the Rhine River, or taking a boat to play in the cities along the Elbe River.
Major Err. (65.3): For example, visiting the fortress in the valley of the River Volga and the River Levan, or taking a cruise to visit the cities along the River Volga, or taking a boat to play in the cities along the River Volga.

Table 1: Instruction types for finetuning ParroT models.
(Table 1 error-guided examples, flattened in extraction; the source sentence is truncated.)
Source (truncated): … 品价格基本稳定，未发现严重违法违规行为，市场经营秩序总体平稳。

### Hint: A translation with major accuracy/mistranslation errors could be
### Response: The results of the inspection indicate the sufficient supply of living necessities <v>on marketing</v> including cereals and oils, meat, fruits, vegetables, eggs and milk, and the basically stabilized commodity price. The inspection hasn't found serious violation of laws and regulations. The market order is stable on an overall basis.

### Hint: A translation with minor fluency/grammar errors could be
### Response: The results of the <v>inspection</v> indicate the sufficient supply of living necessities on marketing including cereals and oils, meat, fruits, vegetables, eggs and milk, and the basically stabilized commodity price. The inspection hasn't found serious violation of laws and regulations. The market order is stable on an overall basis.

Table 2: Ablation study of key factors on the Flores En⇒De subset with Alpaca based on LLaMA-7b.

Table 3: Translation performance of LLaMA models on Flores subsets and WMT22 test sets.

Table 4: Translation performance of BLOOM models on WMT22 test sets.

Table 5: Effects of error levels as hints during inference. Red: improvement; Green: degradation.

Table 6: Examples from the Flores Zh⇒En subset when translated with varied error levels. Red: translation errors.

Table 7: Effects of preference as hints during inference. Red: improvement; Green: degradation.