Leveraging GPT-4 for Automatic Translation Post-Editing

While Neural Machine Translation (NMT) represents the leading approach to Machine Translation (MT), the outputs of NMT models still require translation post-editing to rectify errors and enhance quality under critical settings. In this work, we formalize the task of direct translation post-editing with Large Language Models (LLMs) and explore the use of GPT-4 to automatically post-edit NMT outputs across several language pairs. Our results demonstrate that GPT-4 is adept at translation post-editing, producing meaningful and trustworthy edits that improve the general quality of translations and remove several classes of major translation errors. In particular, human evaluations assessing edit trustworthiness show that GPT-4 exhibits a large improvement over the prior state-of-the-art LLM. Notably, we improve upon state-of-the-art performance on the WMT-22 English-Chinese, English-German, Chinese-English and German-English language pairs using GPT-4 based post-editing, as evaluated by state-of-the-art MT quality metrics. However, we also show that GPT-4 can produce hallucinated edits, urging caution in its use as an expert translation post-editor.


Introduction
State-of-the-art Neural Machine Translation (NMT) models, trained on web-mined parallel corpora, suffer from reliability problems even for high-resource language pairs, despite high average-case performance (He et al., 2020; Gupta et al., 2020; Sun et al., 2020; Wang et al., 2021; He et al., 2021; Raunak et al., 2022). Post-editing neural machine translations therefore remains an important exercise for their use in critical settings. As such, a relevant question to ask is whether Large Language Models (LLMs) such as GPT-3, GPT-4, PaLM and PaLM2 (Brown et al., 2020; OpenAI, 2023; Anil et al., 2023; Chowdhery et al., 2022), which have demonstrated a wide range of general-purpose reasoning as well as knowledge-based capabilities, could be leveraged for the task of translation post-editing.
Post-editing of translations obtained from MT models is a staple task across the translation and localization industry, with higher-quality translations obtained from NMT models leading to reduced post-editing time (Zouhar et al., 2021). However, a number of prior works have demonstrated that the parallel data and model training artifacts in NMT can manifest as catastrophic outputs in rare cases, and the detection of such egregious model behaviors remains a challenging task (Raunak et al., 2021; Tang et al., 2022; Guerreiro et al., 2023b; Xu et al., 2023; Guerreiro et al., 2023a). LLM-based automatic translation post-editing could aid in detecting and fixing such errors to ensure greater reliability of NMT outputs. Besides alleviating reliability problems in NMT, two further reasons make leveraging LLMs for post-editing opportune: the advanced multilingual understanding capabilities of the latest LLMs (OpenAI, 2023) and, potentially, their ability to apply desirable knowledge-based or culture-specific customization to translations (Baker, 2009).
In this work, we explore the efficacy of state-of-the-art LLMs such as GPT-4 on the task of translation post-editing in a natural setting, i.e., without any quality-estimation or error-detection step applied to the translations. We formalize the task of translation post-editing with LLMs and posit a set of research questions to quantify their utility for the goal of improving translations obtained from NMT models. We also demonstrate gains in translation quality across a number of language pairs on the WMT-22 benchmark (Kocmi et al., 2022), achieving state-of-the-art translation performance on the WMT-22 English-Chinese, English-German, Chinese-English and German-English language pairs using GPT-4 based post-editing, as evaluated by state-of-the-art MT quality metrics.

The Translation Post-editing Task
Task Definition: We formalize the post-editing task in a generative setting as follows: given a Source (S) and a Translation (T), propose improvements over the translation (E) and generate the translation with the proposed improvements (T′), i.e., (E, T′) = LLM(S, T).

arXiv:2305.14878v1 [cs.CL] 24 May 2023
Example of Post-editing using GPT-4 (Verbatim Outputs under the CoT Setting)

Source: They were addressed to her son, who has autism and lives in a private care facility, she said. But instead of her son's name inside when you opened them, the letters said Dear Maine's Department of Health and Human Services - in Cincinnati, she told local media.

Translation: Sie waren an ihren Sohn gerichtet, der Autismus hat und in einer privaten Pflegeeinrichtung lebt, sagte sie. Aber anstelle des Namens ihres Sohnes im Inneren, als Sie sie öffneten, hieß es in den Briefen Dear Maine Dear Maine's Department of Health and Human ServicesServices - in Cincinnati, sagte sie den lokalen Medien.

Proposed Improvements: 1. Remove the repetition of "Dear Maine" in the German translation. 2. Correct the translation of "Dear Maine's Department of Health and Human Services" to "Sehr geehrtes Department of Health and Human Services von Maine". 3. Replace "ServicesServices" with "Services". 4. Add a comma after "Cincinnati" for better sentence structure.

Improved Translation: Sie waren an ihren Sohn gerichtet, der Autismus hat und in einer privaten Pflegeeinrichtung lebt, sagte sie. Aber anstelle des Namens ihres Sohnes im Inneren, als Sie sie öffneten, hieß es in den Briefen "Sehr geehrtes Department of Health and Human Services von Maine" - in Cincinnati, sagte sie den lokalen Medien.

Under this task setting, E represents the improvements or edits that are verbalized by an LLM. Note that in the absence of E, the task is to simply generate the improved translation without any intermediate reasoning chain or Chain of Thought (CoT) (Wei et al., 2022b; Kojima et al., 2022). Throughout this work, we refer to the post-editing task in the above zero-shot CoT setting as post-editing with CoT, and to the setting without E as post-editing without CoT. Table 1 shows an input-output example for the post-editing task under the CoT setting. Additionally, throughout this work, we refer to Z as the zero-shot translation of a given source obtained from the LLM that is employed for the post-editing task. Further, through this formalization, we explore the following research questions:

Nature of Post-Edited Translations: LLMs can generate high-quality, state-of-the-art translations (Hendy et al., 2023) across a number of language pairs in a zero-shot setting. As such, it is quite conceivable that during post-editing, LLMs could generate translations that are incognizant of the provided translation. Hence, the first question is whether, during translation post-editing, LLMs generate improved translations from scratch (i.e., only based on the source S) or actually edit the translations provided as per the instructions. Answering this question serves as a sanity check for further exploring the utility of LLMs as translation post-editors.
General Quality Improvements: Do LLMs lead to general quality improvements as measured by state-of-the-art MT quality metrics? An affirmative answer to this question would enable the use of LLMs as a way to detect reliability issues in existing translations. Another related question is whether the post-editing chain of thought is helpful towards translation quality improvements. Even though zero-shot chain of thought has been demonstrated to be effective across reasoning tasks, the translation post-editing task might not require the same degree of variable computation that makes it effective.
Editing Human Annotated Error Spans: Do LLMs modify human-annotated translation error spans during the post-editing task? A high frequency of modifications made to the human-annotated error spans would signify a greater correlation with human judgment in evaluating translation quality.
Fidelity of Proposed Edits: Do the proposed improvements actually appear in the improved translation produced by LLMs? It is quite conceivable that LLMs might make edit proposals or produce a chain of thought that is not realized in the final post-edited translation produced by the same model (Ye and Durrett, 2022). If the post-editing explanation is a desideratum of the translation post-editing process, then it becomes critical to examine the fidelity of the proposed edits in addition to the final translation quality.
Through the above questions, we study the efficacy of the translation post-editing capabilities of LLMs. In the next section, we describe our experimental settings.

Experimental Settings
Datasets: We experiment with the WMT-22 News translation task datasets (Kocmi et al., 2022) as well as with WMT-20 and WMT-21 News translation task submissions annotated with MQM errors (Freitag et al., 2021). For the post-editing experiments pertaining to the MQM-annotated WMT-20 and WMT-21 system outputs, we experiment with samples that have a Major error as an annotation, whereas we experiment with the full WMT-22 datasets throughout. We use the latest WMT-22 test sets for the majority of our experiments, the curation of which falls beyond the training cut-off date for GPT-4 and the other LLMs under investigation.
Large Language Models: We experiment with gpt-3.5-turbo and GPT-4. These models represent the most capable publicly available LLMs (Liang et al., 2022). We use a prompt that describes the system role as a translation post-editor and, under the CoT setting, instructs the LLM to propose improvements to the provided translation (T) of a given source (S) before producing the final post-edited translation (T′).
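The prompting setup can be sketched as follows. The prompt wording and the parsing conventions here are our illustrative assumptions, not the paper's verbatim prompts; the sketch only mirrors the structure described above (a post-editor system role, with and without the edit-proposal step, and extraction of E and T′ from a CoT response).

```python
def build_prompt(source: str, translation: str, cot: bool = True) -> str:
    """Build a post-editing prompt (hypothetical wording) for an LLM
    acting as a translation post-editor."""
    header = "You are a translation post-editor. "
    if cot:
        return (header
                + "First list proposed improvements to the translation, "
                  "then produce the improved translation.\n"
                + f"Source: {source}\nTranslation: {translation}\n"
                + "Proposed Improvements:")
    return (header
            + "Produce an improved version of the translation.\n"
            + f"Source: {source}\nTranslation: {translation}\n"
            + "Improved Translation:")


def parse_cot_output(output: str):
    """Split a CoT response into the edit list E and the final translation T'.
    Assumes the model emits a numbered edit list followed by the marker
    'Improved Translation:'."""
    edits_part, _, improved = output.partition("Improved Translation:")
    edits = [line.strip() for line in edits_part.splitlines()
             if line.strip() and line.strip()[0].isdigit()]
    return edits, improved.strip()
```

In the direct (non-CoT) setting, the model's completion is taken as T′ directly; in the CoT setting, `parse_cot_output` separates the verbalized edits E from T′.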

Metrics and Evaluation
For each of the four research questions posed, we use the metrics highlighted in Table 2. We explain the justification of these measurements in the relevant following sections. For general quality measurements, we use four COMET (Rei et al., 2020) models: the reference-free COMET-QE (wmt20-comet-qe-da) and COMET-KIWI (wmt22-cometkiwi-da) Quality Estimation models, and the reference-based COMET-20 (wmt20-comet-da) and COMET-22 (wmt22-comet-da) models. We use the Translation Edit Rate (TER) implementation from Post (2018).

Results and Measurements
To answer the above questions, we experiment under two settings: for the WMT-20 and WMT-21 systems, we take the translations provided by the different NMT systems as the initial translation upon which the post-editing step is applied. For WMT-22, we use the translations provided by Microsoft-Translator as the initial translation upon which the post-editing step is applied.

Nature of Post-Edited Translations
To measure whether the final translations leverage the generation capabilities of LLMs or adhere to editing the initial translations provided, we compute the Translation Edit Rate (TER) (Snover et al., 2006) of the post-edited translation against the zero-shot translation obtained using the same LLM, and compare it with the TER of the post-edited translation against the initial translation. A higher value of the difference between the two measurements would imply that the final translation is closer to the initial translation and that the LLM adheres to the task of editing the translation. To quantify this, we experiment on 10 different English-to-German NMT systems from the WMT-20 and WMT-21 Shared Tasks on MT. The Direct-Assessment and MQM based evaluation of these systems is described in Freitag et al. (2021).
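As a concrete sketch of this measurement: TER proper (Snover et al., 2006) also allows block shifts, which plain edit distance does not, so the helper below uses word-level Levenshtein distance over the reference length as a simplified stand-in, purely to illustrate how TER(T′, Z) and TER(T′, T) are compared.

```python
def word_edit_rate(hyp: str, ref: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.
    A simplified stand-in for TER: real TER additionally allows block shifts."""
    h, r = hyp.split(), ref.split()
    # Standard dynamic-programming edit distance over word tokens,
    # kept to a single rolling row for O(len(r)) memory.
    d = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        prev, d[0] = d[0], i
        for j, rw in enumerate(r, 1):
            cur = min(d[j] + 1,          # delete hw
                      d[j - 1] + 1,      # insert rw
                      prev + (hw != rw)) # substitute / match
            prev, d[j] = d[j], cur
    return d[len(r)] / max(len(r), 1)


def adherence_gap(post_edited: str, zero_shot: str, initial: str) -> float:
    """TER(T', Z) - TER(T', T): positive values mean the post-edited output
    stays closer to the initial translation than to the LLM's own zero-shot
    translation, i.e. the model is editing rather than retranslating."""
    return word_edit_rate(post_edited, zero_shot) - word_edit_rate(post_edited, initial)
```

A positive `adherence_gap`, averaged over a test set, corresponds to the "LLM adheres to editing" outcome described above.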
WMT-20 Systems (En-De): Table 3 describes our results on five WMT-20 systems. We find that the post-edited translations (in the default CoT setting) are closer to the initial translations than to the zero-shot translations obtained from the same LLM (gpt-3.5-turbo).
Discussion: We find that the above results hold true across different metrics such as edit distance, BLEU (Post, 2018) and ChrF (Popović, 2015). Further, our results also show a peculiar side-effect of the post-editing task under the CoT setting: post-editing a system translation might end up producing a lower-quality final translation if the initial translation is lower in quality than the zero-shot translation of the LLM under consideration.

Impact of CoT
In the next section, we evaluate GPT-4 under both the direct and CoT post-editing settings in terms of general quality improvements.

General Quality Improvements
We compare the translation quality of the post-edited translation with that of the initial translation, using both reference-free and reference-based state-of-the-art neural quality metrics for MT (Rei et al., 2020).
Results: Tables 7, 8, 10 and 11 provide the results of the experiments done on the WMT-22 test sets. Throughout, we find that post-editing under both the CoT and direct settings leads to improvements over the high-quality initial translations obtained through MS-Translator. Further, direct post-editing of MS-Translator outputs with GPT-4 consistently surpasses the WMT-22-Best translation system quality.
Gains vs. Initial System Quality: We observe two different trends in translation quality: for the more recent systems (outputs on WMT-22), only GPT-4 leads to quality gains in the CoT based post-editing setting, whereas for older systems (e.g., outputs on WMT-20, Table 9), both gpt-3.5-turbo and GPT-4 lead to quality improvements.

Analyzing the Distribution of Quality Gains
In Figure 1, we compare the distribution of the gains in COMET-KIWI on the En-De WMT-22 dataset and find that GPT-4 is better at abstaining from proposing any edits if the initial translation is already fluent and adequate.
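This distributional analysis reduces to inspecting the sign of per-segment score deltas. The sketch below assumes segment-level quality scores (e.g. COMET-KIWI) are available for both the initial and post-edited translations; the function names are ours.

```python
def quality_deltas(initial_scores, post_scores):
    """Segment-level score differences (post-edited minus initial)."""
    return [p - i for i, p in zip(initial_scores, post_scores)]


def non_degradation_rate(initial_scores, post_scores) -> float:
    """Fraction of segments whose post-edited score is no worse than the
    initial translation's score. A high value indicates the post-editor
    abstains from harmful edits on already-good translations."""
    deltas = quality_deltas(initial_scores, post_scores)
    return sum(d >= 0 for d in deltas) / max(len(deltas), 1)
```

The histogram in Figure 1 is simply the distribution of `quality_deltas`; the 75-percent abstention figure reported for GPT-4 corresponds to `non_degradation_rate` over the En-De WMT-22 segments.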

Edits On Human Annotated Error Spans
We use the MQM error-span-annotated system outputs provided by Freitag et al. (2021) and measure whether the post-edited translation modifies the annotated translation error span. Results: Tables 12 and 13 report the results obtained on 10 NMT system outputs from WMT-20 and WMT-21. We find that gpt-3.5-turbo produces high E3S rates with gains in general quality as well (measured through COMET-KIWI), signifying that it is able to remove the undesirable artifacts (spans) present in the translations. We find that GPT-4 obtains lower E3S values than gpt-3.5-turbo.
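The E3S acronym is not expanded in this excerpt; a minimal sketch of the span-level measurement, under our assumption that an error span counts as edited when its annotated surface form no longer appears verbatim in the post-edited translation, is:

```python
def span_modified(error_span: str, initial: str, post_edited: str) -> bool:
    """Return True if an MQM-annotated error span that occurs in the initial
    translation no longer appears verbatim in the post-edited translation.
    (Assumption: verbatim substring matching; real MQM spans carry offsets.)"""
    return error_span in initial and error_span not in post_edited


def e3s_rate(samples) -> float:
    """Fraction of annotated error spans modified by the post-editor.
    `samples` is an iterable of (error_span, initial, post_edited) triples."""
    samples = list(samples)
    hits = sum(span_modified(*s) for s in samples)
    return hits / max(len(samples), 1)
```

A rate above 0.5, as reported for both models, means more than half of the Major-error spans were touched during post-editing.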

Fidelity of the Proposed Edits
In a practical setting, the edits (E) produced in the post-editing task might be useful to illustrate the changes made by the LLM in the post-edited translation. Therefore, the fidelity of the proposed edits is useful not only in helping the model leverage variable compute (Wei et al., 2022b) prior to producing the final improved translation, but also in imparting more trust in the LLM-based post-editing process. Thereby, the question of whether the proposed edits are present in the final improved translation or are hallucinated by the model is of significant practical interest. As such, we quantify this property using the Edit Realization Rate (ERR), which measures: of the edits (E) proposed by the LLM in the CoT post-editing setting, how many were actually realized in the improved translation? Since we do not have any ground-truth data to quantify this, we use human evaluation to measure this property and to differentiate between different LLM variants.

ERR Evaluation
We ask a human annotator (native in the target language) to label 50 post-editing samples for both En-De (from the OPPO WMT-20 system) and De-En (from the WMT-22 test set), generated by both gpt-3.5-turbo and GPT-4. The annotator is asked to identify whether all the proposed edits were realized in the final translation or not; hence, a binary score is produced for each sample. We do not observe a significant difference in the De-En setting, whereas for En-De we notice a gap of over 30 percent in terms of ERR, with GPT-4 producing edits with higher fidelity. Qualitatively, we find that a number of edits proposed by gpt-3.5-turbo for En-De could be deemed hallucinations, while in general the edits proposed by GPT-4 do pertain to the edits made in the actual improved translations. An instance of this difference is illustrated in Table 14. However, our current ERR quantification is limited both in sample size (50) and in coverage (language pairs), and we leave further analysis to future work.
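The annotation protocol above reduces to a simple aggregation; this sketch assumes one binary label per sample (1 if the annotator judged every proposed edit to be realized in the final translation, 0 otherwise), with our own function names.

```python
def err(labels) -> float:
    """Edit Realization Rate: share of samples where every proposed edit
    was judged by a human annotator to appear in the final translation."""
    labels = list(labels)
    return sum(labels) / max(len(labels), 1)


def err_gap(labels_a, labels_b) -> float:
    """Percentage-point ERR difference between two models labeled on the
    same post-editing samples (positive favors model A)."""
    return 100.0 * (err(labels_a) - err(labels_b))
```

The reported "gap of over 30 percent" for En-De corresponds to `err_gap` between GPT-4 and gpt-3.5-turbo over the 50 labeled samples.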
Our results on the WMT-22 test sets and the ERR measurements show consistent gains in quality produced by GPT-4 over gpt-3.5-turbo. Thereby, a useful question to explore is whether GPT-4 shows emergent cross-lingual reasoning abilities. Recent debates on emergent abilities (Schaeffer et al., 2023; Srivastava et al., 2022) have posited them as an artifact of the metric against which model performance is evaluated, rather than a sharp gain in the underlying capabilities (Wei et al., 2022a). However, we find that on the post-editing task under investigation, under the discontinuous metric of ERR, GPT-4 does show an emergent capability on En-De when compared to the prior generation of GPT models. We further test this hypothesis by experimenting on the multilingual Grade School Mathematics (MGSM) dataset (Shi et al., 2023).

Results on MGSM:
We present the results on the MGSM benchmark in Table 15. The results show that GPT-4 exhibits significantly better performance on this reasoning task than the prior generations of GPT models. We find the gains obtained by GPT-4 on this task to be consistent with our results on post-editing in the zero-shot CoT setting. However, note that the MGSM results are contaminated by the inclusion of the GSM-8K training set in the GPT-4 training corpus; hence, they should not be interpreted as results on a pure zero-shot task. We leave further analysis of our hypothesis to future work.

Discussion
Quality Gains Across Language Pairs: We also report the GPT-4 post-editing quality gains, under the CoT setting over MS Translator, for several other language pairs in Table 16. The results show that post-editing leads to consistent gains in translation quality across language pairs.

Utility of the Chain of Thought: From our results, the inclusion of the edit proposal (CoT) step can be detrimental to the quality of the post-edited translations, but useful in constraining the post-edited outputs to the initial translations. Therefore, the necessity of the variable computation leveraged by the zero-shot chain-of-thought step is questionable for the post-editing task, even though the edit artifacts produced by GPT-4 might themselves be valuable for further research.

Related Work
Automatic Post-editing of Translations: A long line of prior work has tried to build neural models for the translation post-editing task (Vu and Haffari, 2018; Shterionov et al., 2020; Chatterjee, 2019; Góis et al., 2020; Correia and Martins, 2019a; Voita et al., 2019; Chollampatt et al., 2020; do Carmo et al., 2021). Shterionov et al. (2020) presented a comprehensive road-map for APE, highlighting the challenges and potential directions for future research. Chatterjee (2019) explored the use of deep learning techniques for APE and proposed novel architectures to improve the quality of post-edited translations, while Góis et al. (2020) focused on learning strategies for APE and investigated the use of automatic ordering techniques to refine translations. Correia and Martins (2019b) proposed a simple yet effective neural model for APE using transfer learning, demonstrating promising results. Voita et al. (2019) introduced a context-aware approach to APE, incorporating source context information into the neural model to generate more accurate post-edits. Chollampatt et al. (2020) explored the use of APE to improve the overall translation quality of NMT models, investigating the effects of varying training data sizes, artificial training data, and domain specificity on the APE task.
In a comprehensive review, do Carmo et al. (2021) provided an overview of various techniques and approaches in the field of APE, covering both traditional and neural-based methods. Their work summarized the advancements made in the area, highlighting the strengths and limitations of different approaches.
Overall, these studies, among others, have contributed significantly to the development of neural models for APE, exploring different architectures, learning strategies, and contextual information integration to improve the quality of post-edited translations. However, to the best of our knowledge, we present the first work that investigates using GPT-4 for automatic post-editing of translations. Our work is also related to a number of works exploring the use of LLMs for translation (Hendy et al., 2023; Gao et al., 2023; Lu et al., 2023; Vilar et al., 2022; Garcia et al., 2023).

Conclusion
We demonstrated promising results on post-editing using GPT-4, achieving state-of-the-art translation performance on the WMT-22 English-Chinese, English-German, Chinese-English and German-English language pairs using GPT-4 based post-editing. We formalized a clear framework to make post-editing using LLMs amenable to further study. We will explore the posited questions with additional experiments in the immediate future.
Illustration: Edit Realization Rate Difference between GPT-Turbo and GPT-4

Source: Stephanie Lay said she found the piles of letters from the insurance company stashed in her mailbox between Thursday and Monday.

Translation: Stephanie Lay sagte, sie habe die Briefstapel der Versicherung zwischen Donnerstag und Montag in ihrem Briefkasten versteckt gefunden.

gpt-3.5-turbo Proposed Improvements: 1. The word "stashed" is not adequately translated in the German text. 2. The word "mailbox" is not translated correctly in the German text. 3. The word "piles" is not translated correctly in the German text. 4. The word "found" is not translated adequately in the German text. 5. The word "between" is not translated correctly in the German text.

GPT-4
Proposed Improvements: 1. Replace "Briefstapel der Versicherung" with "Stapel von Briefen der Versicherungsgesellschaft" to better convey the meaning of "piles of letters from the insurance company". 2. Remove "versteckt" as it is not supported in the input and is not necessary for the meaning of the sentence.

Improved Translation:
Stephanie Lay sagte, sie habe die Stapel von Briefen der Versicherungsgesellschaft zwischen Donnerstag und Montag in ihrem Briefkasten gefunden.

The example shows an instance of the Proposed Edits (E) and the Improved Translation (T′) obtained using gpt-3.5-turbo and GPT-4. In the above example, the ERR for the given sample was deemed to be 0 for gpt-3.5-turbo and 1 for GPT-4, since a proposed edit was not realized in the final translation produced by gpt-3.5-turbo during post-editing. We find that GPT-4's edit proposals are included in the final translation with far greater frequency. We quantify this property in Section 4.
Table 15: GPT-4 shows a large improvement over the prior generation of GPT models (Davinci-002, GPT-3.5-Turbo) and PaLM on the MGSM benchmark, in both zero-shot and few-shot settings. We add our results with the caveat that GPT-4 has been trained on the training set of the GSM8K benchmark. In the GPT-4 Few-Shot setting, only the native-language prompts are supplied without any system-level instructions, while for text-davinci-003 and PaLM, the best results are reported from Shi et al. (2023).

Figure 1 :
Figure 1: Post-editing Quality Analysis: (Top) gpt-3.5-turbo and (Bottom) GPT-4. The X-axis represents the difference in the segment-level COMET-KIWI scores between the initial translation and the post-edited output. We find that GPT-4 shows a higher rate of abstention in the CoT post-editing setting, with 75 percent of the GPT-4 edits not leading to any degradation in segment-level COMET-KIWI scores.

Table 1 :
Post-editing Task Example: GPT-4 produces edits that result in improving the general quality of the translation as well as in removing undesirable artifacts across a range of NMT systems, as quantified in Section 3.

Table 2 :
Measuring Post-editing Efficacy: Given the Source S, Translation T and the Post-Edited Output T′, we explore the four research questions in Section 2 through experiments on the corresponding datasets, using the measurements described in Section 4.

Table 6: WMT-22 Zh-En: The post-edited translations (T′) are closer to the initial translations (T) than to the zero-shot translations (Z) in the CoT setting; however, in the direct setting the opposite holds true, albeit with a smaller magnitude.

Table 5 describes our results on WMT-22 En-Zh and Table 6 describes our results on Zh-En with post-editing using GPT-4. We find that CoT constrains the final translations to be closer to the initial translation. In the direct post-editing setting, the final translation is closer to the zero-shot translation, even though the TER difference in the direct setting is much smaller than the difference in the CoT setting.

Table 7: General Quality Improvements on WMT-22 De-En: The + sign reflects that the post-editing is applied on the initial translations produced by the given System. MS-Translator + GPT-4 shows better performance than GPT-4-Zero-Shot.

Table 10 :
General Quality Improvements on WMT-22 En-De: The + sign reflects that the post-editing is applied on the initial translations produced by the given System.

Table 11 :
General Quality Improvements on WMT-22 En-Zh: The + sign reflects that the post-editing is applied on the initial translations produced by the given System.

Table 12 :
Edit Efficacy over Erroneous Spans with gpt-3.5-turbo: On both WMT-20 and WMT-21 systems, post-editing with gpt-3.5-turbo modifies more than half of the erroneous spans.

Table 13 :
Edit Efficacy over Erroneous Spans with GPT-4: Post-editing with GPT-4 modifies more than half of the erroneous spans; however, the E3S rate is lower than with gpt-3.5-turbo.

Table 16 :
General Quality Improvements on WMT-22 Test Sets: The + sign reflects that the post-editing is applied on the initial translations produced by the given System.The post-editing is applied in the CoT setting throughout the results in this table.