Towards Making the Most of ChatGPT for Machine Translation

ChatGPT shows remarkable capabilities for machine translation (MT). Several prior studies have shown that it achieves comparable results to commercial systems for high-resource languages, but lags behind in complex tasks, e.g., translation of low-resource and distant language pairs. However, they usually adopt simple prompts which cannot fully elicit the capability of ChatGPT. In this paper, we aim to further mine ChatGPT's translation ability by revisiting several aspects: temperature, task information, and domain information, and correspondingly propose an optimal temperature setting and two (simple but effective) prompts: Task-Specific Prompts (TSP) and Domain-Specific Prompts (DSP). We show that: 1) The performance of ChatGPT depends largely on temperature, and a lower temperature usually achieves better performance; 2) Emphasizing the task information can further improve ChatGPT's performance, particularly in complex MT tasks; 3) Introducing domain information can elicit ChatGPT's generalization ability and improve its performance in the specific domain; 4) ChatGPT tends to generate hallucinations for non-English-centric MT tasks, which can be partially addressed by our proposed prompts but still needs to be highlighted for the MT/NLP community. We also explore the effects of advanced in-context learning strategies and find a (negative but interesting) observation: the powerful chain-of-thought prompt leads to word-by-word translation behavior, thus bringing significant translation degradation.


Introduction
Recently, the emergence of ChatGPT has brought remarkable influence on natural language processing (NLP) tasks. ChatGPT is a large-scale language model developed by OpenAI, based on InstructGPT (Ouyang et al., 2022a), that has been trained to follow instructions with human feedback. ChatGPT possesses diverse NLP abilities, including question answering, dialogue generation, code debugging, generation evaluation, and so on (Qin et al., 2023; Zhong et al., 2023; Wang et al., 2023a; Kocmi and Federmann, 2023; Lu et al., 2023b; Wang et al., 2023b). We are particularly interested in how well ChatGPT can perform on the machine translation task.
Previous studies (Jiao et al., 2023; Hendy et al., 2023) on translation tasks have found that ChatGPT performs competitively with commercial translation products (e.g., Google Translate and Microsoft Translator) on high-resource languages, but has limited capabilities for low-resource and distant languages. However, they only adopt simple prompts and basic settings regardless of the significant influence of the prompts' quality (Zhou et al., 2022), which may limit ChatGPT's performance. In this paper, we aim to further elicit the capability of ChatGPT by revisiting the following three aspects and correspondingly propose an optimal temperature setting and two simple but effective prompts: Task-Specific Prompts (TSP) and Domain-Specific Prompts (DSP).
Temperature. Temperature is an important parameter that ensures ChatGPT generates varied responses to human queries. Basically, decoding with higher temperatures displays greater linguistic variety, while lower temperatures generate grammatically correct and deterministic text (Ippolito et al., 2019). However, for tasks with a high degree of certainty, such as machine translation, we argue that diverse generation may impede translation quality. We evaluate the performance of ChatGPT at different temperatures to verify this effect and find the optimal temperature setting for the following experiments.
Task Information. ChatGPT is fine-tuned on high-quality chat datasets and is thus essentially a conversational system, which has a certain distance from a translation system; we argue that this task inconsistency will limit its translation ability to a certain degree. In response to this problem, we propose Task-Specific Prompts (TSP) to further emphasize the task information and bridge the gap between the two tasks, i.e., conversation and translation.

Domain Information. Compared with traditional machine translation systems, ChatGPT can incorporate additional information, like human interactions, through the input prompts (Dong et al., 2023). We argue that such flexible interaction may alleviate some classical MT challenges, e.g., cross-domain generalization (Koehn and Knowles, 2017). We therefore propose Domain-Specific Prompts (DSP) to introduce domain navigation information and elicit ChatGPT's generalization ability across different domains.
Through extensive experiments, we find that: ChatGPT's performance largely depends on the temperature, especially for difficult language pairs. Generally, setting a lower temperature results in higher performance.
Emphasizing the task information in prompts can further improve ChatGPT's performance, especially in complex tasks.
Introducing the correct domain information consistently improves ChatGPT's performance, while wrong domain information leads to significant performance degradation.
When tackling non-English-centric tasks (both the input and expected output are non-English), ChatGPT may generate hallucinations, which deserve more attention from the MT/NLP community.
Furthermore, we explore the effects of several advanced in-context learning strategies (Brown et al., 2020b). Specifically, we investigate ChatGPT's few-shot in-context learning (ICL) and chain-of-thought (CoT) (Wei et al., 2022c; Kojima et al., 2022) abilities on MT tasks. Experimental results show that few-shot ICL can further improve ChatGPT's performance, which is identical to the findings of Hendy et al. (2023), and we also find a negative but interesting observation: CoT leads to word-by-word translation behavior, thus bringing significant translation degradation. We also call for improving ICL and CoT for MT upon ChatGPT by incorporating the philosophy of example-based and statistical MT (Nagao, 1984; Koehn, 2009).
The remainder of this paper is organized as follows. We present the evaluation settings in Section 2. In Section 3, we revisit the performance of ChatGPT from three aspects (temperature, task, and domain information) and show the zero-shot translation performance of ChatGPT with our proposed advanced prompt recipes. Section 4 summarizes the few-shot in-context learning and chain-of-thought results. Section 5 reviews related work, and Section 6 presents conclusions.

Evaluation Setting
We provide a brief introduction of the evaluation setting, which mainly includes the used models, test set, and evaluation metrics.
Models. We mainly compare ChatGPT with the commercial translation product Google Translator, which supports translation in 133 languages. By default, the results in this paper come from the gpt-3.5-turbo-0301 model, which powers ChatGPT.
Data. For multilingual translation and in-context learning, we evaluate the performance of the models on the Flores-200 (Goyal et al., 2022) test sets, which consist of 1012 sentences translated into 204 languages. To evaluate the effect of cross-domain translation, we adopt the test sets of the WMT19 Biomedical (Bawden et al., 2019) and News (Barrault et al., 2019) Translation Tasks and the WMT22 E-Commerce task (Kocmi et al., 2022). Table 1 lists the statistics of these test sets. We test all samples through the OpenAI API.
Metric. The translation metrics shared task (Freitag et al., 2022) recommends using neural network-based metrics since they have demonstrated a high correlation with human evaluation and are resilient to domain shift. Hence, we adopt the widely used COMET (Rei et al., 2020) as our primary metric and use the default parameters of "comet-compare" for the significance test. Specifically, we use the reference-based metric COMET-20 (wmt20-COMET-da). Additionally, we also report BLEU scores (Papineni et al., 2002) and ChrF (Popović, 2015) using SacreBLEU (Post, 2018) for completeness, but notably, we mainly analyze the performance in terms of the model-based metric COMET.

Table 2: The templates of translation prompts, where [TGT] represents the target language.

Method: ChatGPT
"role": "user", "content": "Please provide the [TGT] translation for the following sentence:"

Method: ChatGPT + TSP
"role": "system", "content": "You are a machine translation system."
"role": "user", "content": "Please provide the [TGT] translation for the following sentence:"
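As a rough illustration of what the ChrF metric measures, the following is a simplified sentence-level sketch; the official implementation in SacreBLEU differs in corpus-level aggregation and whitespace handling, so this is illustrative only.

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level ChrF: character n-gram F-score (beta=2 favors recall)."""
    def ngrams(text, n):
        text = text.replace(" ", "")  # compare character n-grams, ignoring spaces
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # sentence too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p, r = sum(precisions) / len(precisions), sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A perfect hypothesis scores 1.0 and a fully disjoint one scores 0.0, with partial overlap falling in between.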

Zero-Shot Translation
In this section, we explore the performance of ChatGPT from three aspects: TEMPERATURE, TASK INFORMATION, and DOMAIN INFORMATION, and correspondingly propose an optimal temperature setting and two simple and effective prompts to improve ChatGPT's performance.

The Effect of Temperature
ChatGPT is a chatting machine designed to provide fluent and diverse responses to a wide range of human requests. It is intuitive that the diversity of responses may hinder its performance on tasks with a high degree of certainty, such as machine translation, to some extent.
To investigate the influence of diversity, we compare the performance of ChatGPT under different temperature settings (0, 0.2, 0.4, 0.6, 0.8, and 1) across three translation directions: English⇒Romanian, English⇒Chinese, and English⇒German. The relationship between temperature and the performance of ChatGPT is shown in Figures 1 and 2.
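The sweep above can be sketched as follows, assuming the OpenAI chat-completions request format; build_request is a hypothetical helper for illustration, not part of any library, and the actual API call is shown only as a comment.

```python
# Build one chat-completion payload per temperature setting.
def build_request(source_sentence: str, target_lang: str, temperature: float) -> dict:
    return {
        "model": "gpt-3.5-turbo-0301",
        "temperature": temperature,  # 0 = most deterministic, 1 = most diverse
        "messages": [{
            "role": "user",
            "content": f"Please provide the {target_lang} translation "
                       f"for the following sentence: {source_sentence}",
        }],
    }

requests = [build_request("The weather is nice today.", "German", t)
            for t in (0, 0.2, 0.4, 0.6, 0.8, 1.0)]
# Each payload would then be sent through the OpenAI API, e.g.:
# response = openai.ChatCompletion.create(**requests[0])
```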
Results. Figures 1 and 2 show that ChatGPT's performance largely depends on the value of the temperature: as the temperature rises, there is a clear degradation in both COMET and BLEU scores. Furthermore, it is noteworthy that ChatGPT's sensitivity to the temperature varies depending on the language pair: the impact of temperature is relatively small when translating to high-resource languages, e.g., German, while for complex languages, e.g., Chinese, there is a large degradation in performance (−4.3 COMET points and −3.7 BLEU points for Chinese) when the temperature changes from 0 to 1. We speculate that the huge resource variance in the training data leads to differences in the model's confidence across languages, which partially explains the different performances. In the following experiments, we adopt T = 0 as our default setting to make the most of ChatGPT and ensure the stability of generation.

Table 3: Performance with different prompts on 4 language pairs from Flores-200. "TSP" denotes our proposed task-specific prompting method. The best scores across different systems are marked in bold and the best scores of ChatGPT are underlined. Notably, we set the temperature to 0 for ChatGPT in this experiment. Our TSP method consistently boosts the performance of ChatGPT in most settings. Shadowed areas denote difficult English-centric translation tasks; green areas denote non-English-centric translation tasks. "†" indicates a statistically significant difference from the ChatGPT baseline (p < 0.05).

The Effect of Task Information
Previous studies (Jiao et al., 2023; Hendy et al., 2023) have shown that ChatGPT can achieve exceptional performance in conversational-domain translation, which is attributed to its ability to generate more natural and diverse spoken language. However, given that ChatGPT is deliberately designed as a general task solver (Qin et al., 2023), a task gap arises when asking ChatGPT to act as a specific task engine. This task inconsistency may limit ChatGPT's effectiveness in translation tasks beyond the spoken domain.
To bridge the task gap and generate more translation-like sentences, we propose Task-Specific Prompts (TSP) to emphasize the translation task information. Specifically, we prepend the sentence "You are a machine translation system." to the best translation template in Jiao et al. (2023) and adopt it to query ChatGPT. The templates of the prompts are presented in Table 2, where [TGT] represents the target language of the translation.
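In API terms, TSP amounts to prepending a system message to the user-turn translation request. A minimal sketch, assuming the chat message format shown in Table 2 (tsp_messages is an illustrative helper, not from any library):

```python
# Construct the TSP message list: a task-framing system turn plus the query.
def tsp_messages(source_sentence: str, target_lang: str) -> list:
    return [
        {"role": "system", "content": "You are a machine translation system."},
        {"role": "user",
         "content": f"Please provide the {target_lang} translation "
                    f"for the following sentence: {source_sentence}"},
    ]

messages = tsp_messages("Hello, world.", "German")
```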
We compare the performance of various models on four language pairs, covering eight distinct translation directions. These languages comprise 1) German, one of the most frequent non-English languages in the GPT training data; 2) Romanian, a less frequently encountered non-English language in the GPT training data; and 3) Chinese, a large-scale language with a script distinct from English. We also adopt Chinese-Romanian as a non-English-centric use case. Table 3 lists the full results, where we list both English-centric and non-English-centric language directions (marked in green), and, among the English-centric directions, we highlight the difficult pairs (EN-ZH and EN-RO, shadowed) in terms of their resources and language distance.

English-Centric Language Pairs
We first consider the performance of ChatGPT on English-centric translation language pairs. Specifically, we conduct experiments on three language pairs: German⇔English (high-resource), Romanian⇔English (low-resource), and Chinese⇔English (distant language).
Results. Our results presented in Table 3 show that our TSP method achieves comparable COMET scores to Google Translator and even outperforms it on some language pairs, e.g., English⇒Romanian (92.9 vs. 91.6). We also observe that our TSP method consistently improves the performance of vanilla ChatGPT, especially when translating to low-resource or distant languages. Specifically, our TSP method brings +0.8 and +0.5 COMET score improvements for English⇒Chinese and English⇒Romanian, respectively, and +0.2 on average when translating to English. We speculate that high-resource training data can help the model better understand the specific task from a few task-related navigations, thereby reducing the need for additional task-specific information. Although our proposed TSP consistently improves the performance in terms of the semantic metric, i.e., COMET, notably, we have not consistently bridged the task gap in terms of lexical metrics (BLEU and ChrF), which is consistent with similar findings from Vilar et al. (2022) on the PaLM-540B model.

Non-English-Centric Language Pairs
We also evaluate the performance of ChatGPT on non-English-centric language pairs, since the pretraining process was dominated by English tokens and the multilingual MT community argues this may harm non-English-centric performance (Costa-jussà et al., 2022; Zan et al., 2022a, 2023). We have an important finding: when tackling non-English-centric MT language pairs, ChatGPT tends to generate translation hallucinations, that is, some unrelated information following certain patterns is appended to the translation, such as "Translation may vary depending on context", which greatly affects MT performance. We use a post-processing method to remove this irrelevant information from the generated text. Specifically, we summarize some templates of irrelevant sentences and remove them from the generated texts. Some templates are shown in Table 4, and the number of post-processed sentences is presented in Figure 3.
Results. Figure 3 shows that a lower temperature can reduce the number of hallucinations (especially in distant languages, e.g., Chinese), and our TSP method can further reduce their number, which suggests that our method helps ChatGPT better serve as a machine translation system. The full results on Romanian⇔Chinese are listed in Table 3. As seen, our TSP method can only slightly improve ChatGPT's performance, which could be due to the difficulty in both understanding and generating these language pairs. Meanwhile, our post-editing approach can only roughly remove the hallucination patterns; the NLP/MT community should pay more attention to potential hallucinations when using ChatGPT to tackle non-English text.
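The template-based post-editing step can be sketched as a simple pattern-stripping pass. The pattern list below is a hypothetical stand-in for the templates summarized in Table 4, not the paper's exact list:

```python
import re

# Boilerplate phrases that ChatGPT sometimes appends after the translation
# (illustrative examples only; the paper's full template list is in Table 4).
HALLUCINATION_PATTERNS = [
    r"Translation may vary depending on context\.?",
    r"Note:.*$",
]

def post_process(translation: str) -> str:
    """Remove known hallucination templates, keeping only the translation."""
    for pattern in HALLUCINATION_PATTERNS:
        translation = re.sub(pattern, "", translation, flags=re.IGNORECASE)
    return translation.strip()
```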
The subsequent experiments will use ChatGPT with TSP as the default setting.

The Effect of Domain Information
Compared with traditional machine translation systems, ChatGPT can incorporate additional information through the prompts to further improve its performance. While previous studies have shown that ChatGPT has robust translation capabilities (Hendy et al., 2023), we believe that we can further enhance its performance by incorporating domain-specific guidance.
To this end, we propose Domain-Specific Prompts (DSP) that introduce the domain information of the translated sentences in the prompts to facilitate ChatGPT's generalization. Specifically, we ask ChatGPT with the prompt "You are a machine translation system that translates sentences in the [DOM] domain", as shown in Table 5. Here, [DOM] represents the correct domain of the translated sentence, while [FDOM] represents a wrong domain, which is used to verify whether the improvement comes from domain information. For example, for a biomedical sentence, [DOM] is biomedical, while [FDOM] can be any field except biomedical.

Method: ChatGPT
"role": "system", "content": "You are a machine translation system."
"role": "user", "content": "Please provide the [TGT] translation for the following sentence:"

Method: ChatGPT + DSP
"role": "system", "content": "You are a machine translation system that translates sentences in the [DOM] domain."
"role": "user", "content": "Please provide the [TGT] translation for the following sentence:"

Method: ChatGPT + F-DSP
"role": "system", "content": "You are a machine translation system that translates sentences in the [FDOM] domain."
"role": "user", "content": "Please provide the [TGT] translation for the following sentence:"

We evaluate our method on the WMT19 Bio and News datasets following Jiao et al. (2023), which allows us to examine the impact of domain bias. For example, the WMT19 Bio test set comprises Medline abstracts that require domain-specific knowledge, while the WMT19 News dataset features news-style texts that are significantly different from dialogues. To further prove the effectiveness of our method, we also evaluate it on the WMT22 English-Chinese E-Commerce test set, which is less likely to overlap with the GPT training data.
Results. The results are listed in Table 6. Obviously, the original ChatGPT does not perform as well as Google Translator in terms of both COMET and lexical metrics (e.g., BLEU). However, our DSP method can consistently improve the performance of ChatGPT in terms of COMET score and even outperforms Google Translator on two datasets (WMT19 Bio Chinese⇒English and WMT19 News English⇒Chinese). This finding indicates that our method can further improve the generalization ability of ChatGPT and narrow the gap with one of the most advanced commercial systems, Google Translator. Nonetheless, our method's impact on BLEU is inconsistent, and ChatGPT still lags significantly behind Google Translator's performance.
To verify that the observed improvement is indeed due to the introduction of domain information, we deliberately provide incorrect domain information for each sentence, namely F-DSP, to probe the improvement brought by the DSP strategy. Specifically, we swap the domain information between the biomedical and news sentences. We expect the wrong domain guidance (F-DSP) to under-perform DSP, and even perform worse than vanilla ChatGPT. The results of these experiments are shown in the last row of Table 6, which clearly shows a consistent degradation in COMET, proving that domain information is the key to the success of our method.
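Programmatically, DSP and F-DSP differ only in the domain string injected into the system turn. A minimal sketch of the Table 5 templates (dsp_messages is an illustrative helper):

```python
# Build a domain-aware prompt; passing the wrong domain yields F-DSP.
def dsp_messages(source_sentence: str, target_lang: str, domain: str) -> list:
    system = (f"You are a machine translation system that translates "
              f"sentences in the {domain} domain.")
    return [
        {"role": "system", "content": system},
        {"role": "user",
         "content": f"Please provide the {target_lang} translation "
                    f"for the following sentence: {source_sentence}"},
    ]

dsp = dsp_messages("Patients received 5 mg daily.", "Chinese", "biomedical")  # DSP
f_dsp = dsp_messages("Patients received 5 mg daily.", "Chinese", "news")      # F-DSP
```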
All the above DSP and F-DSP results confirm the importance of domain-specific prompting guidance in using ChatGPT for MT tasks.

Few-shot Machine Translation
In this section, we briefly explore the effects of advanced in-context learning (ICL) strategies. Specifically, we investigate ChatGPT's few-shot ICL and Chain-of-Thought (CoT) abilities on MT tasks.

Few-Shot In-Context Learning
In-context learning (Brown et al., 2020b) has shown remarkable ability on many NLP tasks (Liu et al., 2023). To further explore the capabilities of ChatGPT, we conduct experiments with different sample selection strategies. Specifically, we evaluate few-shot machine translation performance in three directions from Flores-200: English⇒Chinese, English⇒Romanian, and English⇒German. We conduct experiments with randomly sampled and TopK-sampled (Liu et al., 2022) demonstrations from the development sets in the 1-shot and 3-shot settings.
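A TopK-style selector picks the development-set examples most similar to the test input as demonstrations. The sketch below uses word-overlap (Jaccard) similarity as a rough stand-in; Liu et al. (2022) use sentence embeddings instead, so this is only an illustration of the idea:

```python
# Select the k most similar demonstration examples for a given test source.
def topk_demonstrations(test_source: str, pool: list, k: int = 3) -> list:
    def similarity(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)  # Jaccard word overlap
    return sorted(pool,
                  key=lambda ex: similarity(test_source, ex["source"]),
                  reverse=True)[:k]
```

The selected pairs would then be prepended to the translation prompt as few-shot demonstrations.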
Results. Our results are listed in Table 7. As seen, in-context learning with random examples consistently improves performance in both the lexical metric (BLEU) and the COMET score compared to the zero-shot approach, and increasing the number of shots leads to further improvement, which is consistent with previous findings (Hendy et al., 2023). The advanced sample-selection strategy TopK, which chooses examples similar to the test sample as demonstrations, can further improve the performance, even outperforming Google Translator on some language pairs, e.g., English⇒Romanian (94.0 vs. 91.6) and English⇒Chinese (68.8 vs. 68.5).

Table 6: Performance of ChatGPT on translation robustness, i.e., different domains. "DSP" denotes our proposed domain-specific prompting method, while "F-DSP" denotes false domain-specific prompting, i.e., we specify wrong/unrelated domain information in the prompt. The results in green denote that "DSP" improves ChatGPT by a clear margin (0.5 (↑) score), while the red results denote significant performance drops caused by "F-DSP". "†" indicates a statistically significant difference from the ChatGPT baseline (p < 0.05).

We encouragingly find that the advanced sample-selection strategy for in-context learning on MT tasks with ChatGPT is remarkably similar to the design philosophy of example-based machine translation (EBMT, Nagao, 1984), which is often characterized by its use of a bilingual corpus as its main knowledge base at run-time. It is worth designing better ICL strategies inspired by EBMT in future work.

Chain-of-Thought
Chain-of-Thought (CoT) prompting (Wei et al., 2022c) has been demonstrated to be effective in eliciting the reasoning ability of large language models. Previous studies have shown that CoT can improve ChatGPT's performance on natural language understanding tasks (Zhong et al., 2023), but its influence on machine translation tasks has hardly been investigated.
To investigate this further, we randomly select 20 samples from the test set and adopt the zero-shot CoT technique (Kojima et al., 2022) and a 1-shot CoT technique. Specifically, as shown in Table 8, for zero-shot CoT, we use the prompt "Please provide the [TGT] translation for the following sentence step by step" to elicit step-by-step translation. We also append the sentence "and then provide the complete sentence:" to the end of the prompt to ensure that ChatGPT generates the complete translation. For 1-shot CoT, we provide manual intermediate reasoning steps inspired by zero-shot CoT, as shown in Table 8. Here, [S] and [T] represent the corresponding source and target sentences in the demonstration, respectively, and [S_i] and [T_i] are the i-th matching tokens in the source and target sentences.
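The zero-shot CoT prompt can be assembled the same way as the TSP prompt, with the step-by-step instruction spliced in. A minimal sketch of the Table 8 template (zero_shot_cot_messages is an illustrative helper):

```python
# Build the zero-shot CoT translation prompt from Table 8.
def zero_shot_cot_messages(source_sentence: str, target_lang: str) -> list:
    return [
        {"role": "system", "content": "You are a machine translation system."},
        {"role": "user",
         "content": f"Please provide the {target_lang} translation for the "
                    f"following sentence step by step and then provide the "
                    f"complete sentence: {source_sentence}"},
    ]

cot = zero_shot_cot_messages("Wie geht es dir?", "English")
```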

Method: Zero-Shot CoT
"role": "system", "content": "You are a machine translation system."
"role": "user", "content": "Please provide the German translation for the following sentence step by step and then provide the complete sentence:"

Method: 1-Shot CoT
"role": "system", "content": "You are a machine translation system."
"role": "user", "content": "Please provide the German translation for the following sentence step by step and then provide the complete sentence: [S] 1.

We looked in detail at the sentences generated by the different prompts, presented in Table 10, and we have a negative but interesting observation: the CoT prompt leads to word-by-word translation behavior, which is the main reason for the significant translation degradation.
More CoT variants designed with different principles, inspired by the philosophy of statistical MT (Zens et al., 2002; Koehn, 2009), will be explored in the future, for example, word-by-word translation and then reordering (Du and Way, 2017; Ding et al., 2020), phrase-to-phrase translation and then reordering (Feng et al., 2018; Ding et al., 2021), and structure-to-structure translation (Kaplan et al., 1989).
Related Work

Traditionally, pre-trained language models (PLMs) can achieve remarkable performance on various natural language processing (NLP) tasks through fine-tuning on specific tasks. But with scaling up and the development of LLMs (Brown et al., 2020a; Ouyang et al., 2022b), decoder-only LLMs exhibit remarkable zero-shot and few-shot abilities, denoted emergent abilities (Wei et al., 2022b), and achieve comparable results with other LLMs on NLU and conditional NLG tasks. Especially the emergence of ChatGPT, developed by OpenAI, takes LLMs a big step forward in both academia and industry. ChatGPT possesses diverse NLP abilities and can generate human-like responses through instruction tuning (Wei et al., 2022a) and the Reinforcement Learning from Human Feedback (RLHF) technique (Ouyang et al., 2022b).
ChatGPT for Machine Translation. The abilities of ChatGPT have been widely studied in various domains (Qin et al., 2023; Zhong et al., 2023), but its ability on machine translation tasks has not been fully investigated. Jiao et al. (2023) and Hendy et al. (2023) first evaluated the performance of ChatGPT for machine translation; they found that ChatGPT performs competitively with commercial translation products on high-resource European languages but lags behind significantly on low-resource or distant languages. However, they usually adopt simple prompts and basic settings that cannot fully exploit the capabilities of ChatGPT. We first show that ChatGPT can achieve comparable results with proper settings and investigate how to make the most of ChatGPT for machine translation.
Subsequent work follows ours to further explore the performance of ChatGPT: Gao et al. (2023) and Lu et al. (2023a) introduce new information (e.g., POS tags or multilingual dictionaries), and He et al. (2023) propose a CoT-like framework to generate human-like translations.

Conclusion
In this paper, we investigate how to further mine ChatGPT's translation ability from three perspectives, namely temperature, task, and domain information, and correspondingly propose an optimal temperature setting and two simple but effective prompts. We empirically demonstrate that there is a high correlation between temperature and ChatGPT's performance, and a lower temperature usually achieves better performance. Experimental results across various language pairs and domains prove the effectiveness of our proposed prompts. We further explore the effectiveness of advanced in-context learning strategies for ChatGPT; we find that the few-shot in-context learning method can consistently improve ChatGPT's performance, while conventional Chain-of-Thought (CoT) prompting degrades its performance because of its word-by-word translation behavior.
In future work, besides the aforementioned explorations (EBMT-inspired prompts designing, statistical MT-inspired chain-of-thought designing), we would like to investigate how to further elicit the ability of ChatGPT by designing more effective prompts (e.g., design human-like CoT to navigate the LLMs, and better demonstration selection algorithms in few-shot ICL) and investigate the ability of ChatGPT for more MT settings (e.g., document translation).

Limitations
Our work has several potential limitations. First, we only propose some simple prompts that have not been carefully designed to investigate the capabilities of ChatGPT, which may not sufficiently elicit its power. Second, we have not fully studied the performance of ChatGPT in few-shot scenarios, especially the effect of Chain-of-Thought prompting in machine translation. In future work, we would like to design different types of prompts to further improve ChatGPT's performance on machine translation and conduct more in-depth analyses and discussions.

Figure 2 :
Figure 2: The relationship between temperature and ChatGPT's performance (in terms of BLEU scores) when translating from English to other languages.

Figure 3 :
Figure 3: Number of post-edited sentences in non-English-centric language pairs, where a higher value means the translation contains more hallucinations. RO represents the translation for ZH⇒RO, while ZH represents the translation for RO⇒ZH.

Table 1 :
Data statistics and descriptions.

Table 4 :
Some templates of irrelevant information in generated sentences for Chinese⇔Romanian. A semicolon is used to separate different templates. [Ro] represents the sentence in Romanian, while [Zh] represents that in Chinese.

Table 5 :
The templates of DSP and F-DSP prompts, where [DOM] represents the correct domain of the translated sentence and [FDOM] represents a wrong domain.

Table 7 :
Few-shot translation performance of ChatGPT on Flores-200. In the random-sampling few-shot prompting setting, we randomly sample 1 or 3 examples from the development set with 3 runs. The best scores across different systems are marked in bold and the best scores of ChatGPT are underlined.

Table 8 :
The templates of Zero-Shot CoT and 1-shot CoT.[S_n] represents the n-th token in source demonstration [S], [T_n] represents the n-th token in target demonstration [T].

Table 9 :
Performance of ChatGPT equipped with CoT prompting methods on randomly selected 20 samples from English⇒German and English⇒Chinese.