Do GPTs Produce Less Literal Translations?

Large Language Models (LLMs) such as GPT-3 have emerged as general-purpose language models capable of addressing many natural language generation or understanding tasks. On the task of Machine Translation (MT), multiple works have investigated few-shot prompting mechanisms to elicit better translations from LLMs. However, there has been relatively little investigation on how such translations differ qualitatively from the translations generated by standard Neural Machine Translation (NMT) models. In this work, we investigate these differences in terms of the literalness of translations produced by the two systems. Using literalness measures involving word alignment and monotonicity, we find that translations out of English (E-X) from GPTs tend to be less literal, while exhibiting similar or better scores on MT quality metrics. We demonstrate that this finding is borne out in human evaluations as well. We then show that these differences are especially pronounced when translating sentences that contain idiomatic expressions.


Introduction
Despite being trained only with a language-modeling objective, with no explicit supervision from aligned parallel data (Briakou et al., 2023), LLMs such as GPT-3 or PaLM (Brown et al., 2020; Chowdhery et al., 2022) achieve close to state-of-the-art translation performance under few-shot prompting (Vilar et al., 2022; Hendy et al., 2023). Work investigating the output of these models has noted that their gains in performance are not visible under older surface-based metrics such as BLEU (Papineni et al., 2002a), which typically show large losses against NMT systems. This raises a question: how do these LLM translations differ qualitatively from those of traditional NMT systems?
We explore this question using the property of translation literalness. Machine translation systems have long been noted for their tendency to produce overly-literal translations (Dankers et al., 2022b), and we have observed anecdotally that LLMs seem less susceptible to this problem (Table 1). We investigate whether these observations can be validated quantitatively. First, we use measures based on word alignment and monotonicity to quantify whether LLMs produce less literal translations than NMT systems, and ground these numbers in human evaluation (§ 2). Next, we look specifically at idioms, comparing how literally they are translated under both natural and synthetic data settings (§ 3).

Source: He survived by the skin of his teeth.
NMT: Il a survécu par la peau de ses dents.
GPT-3: Il a survécu de justesse.

Table 1: An example where GPT-3 produces a more natural (non-literal) translation of an English idiom. When word-aligning these sentences, the source word skin remains unaligned for the GPT-3 translation.
Our investigations focus on translation between English and three typologically diverse languages: German, Chinese, and Russian. Our findings are summarized as follows: (1) we find that translations from two LLMs of the GPT series are indeed generally less literal than those of their NMT counterparts when translating out of English, and (2) this is particularly true for sentences with idiomatic expressions.

Quantifying Translation Literalness
We compare state-of-the-art NMT systems against the most capable publicly-accessible GPT models (at the time of writing) across measures designed to capture differences in translation literalness. We conduct both automatic metric-based and human evaluations. We explain the evaluation and experimental details below.
Datasets We use the official WMT21 En-De, De-En, En-Ru and Ru-En News Translation test sets for evaluation (Barrault et al., 2021).

Measures of Quality
We use COMET-QE (Rei et al., 2020), specifically the wmt20-comet-qe-da model, as the Quality Estimation (QE) measure (Fomicheva et al., 2020) to quantify the fluency and adequacy of translations. Using QE as a metric has the advantage of precluding any reference bias, which has been shown to be detrimental in estimating LLM output quality in related sequence transduction tasks (Goyal et al., 2022). On the other hand, COMET-QE suffers from an apparent blindness to copy errors, i.e., cases in which the model produces output in the source language (He et al., 2022). To mitigate this, we apply a language identifier (Joulin et al., 2017) to the translation output and set the translation to null if the detected language is the same as the source language. We therefore name this metric COMET-QE + LID.
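The LID filtering step can be sketched as follows. This is a minimal sketch, not our exact pipeline: `detect_language` stands in for any language identifier (the paper uses fastText's; Joulin et al., 2017), and the COMET-QE scoring call itself is omitted.

```python
def filter_copy_errors(sources, translations, src_lang, detect_language):
    """Null out translations that merely copy the source language.

    `detect_language` is any callable mapping text -> language code
    (e.g., a fastText language identifier). A nulled (empty) translation
    then receives a very low COMET-QE score, penalizing copy errors.
    """
    filtered = []
    for src, hyp in zip(sources, translations):
        if detect_language(hyp) == src_lang:
            filtered.append("")  # treat a source-language copy as a null translation
        else:
            filtered.append(hyp)
    return filtered
```

The filtered hypotheses are then scored against the sources with COMET-QE as usual.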

Measures of Translation Literalness
No existing metric is known to correlate strongly with translation literalness. We propose and consider two automatic measures at the corpus level:

1. Unaligned Source Words (USW): Two translations with very similar fluency and adequacy can be differentiated in terms of their literalness by computing the word-to-word alignment between the source and the translation, then measuring the number of source words left unaligned. When controlled for quality, a less literal translation is likely to contain more unaligned source words (as suggested in Figure 1).

2. Translation Non-Monotonicity (NM): Another measure of literalness is how closely the translation tracks the word order of the source. We use the non-monotonicity metric proposed by Schioppa et al. (2021), which computes the deviation from the diagonal in the word-to-word alignment. This can also be interpreted as (normalized) alignment crossings, which have been shown to correlate with translation non-literalness (Schaeffer and Carl, 2014).
We use the multilingual-BERT-based awesome-align toolkit (Devlin et al., 2019; Dou and Neubig, 2021) to obtain the word-to-word alignments between the source and the translation. Table 2 presents an illustration of translations with different USW and NM scores, obtained from different systems.
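Given word-to-word alignments (e.g., from awesome-align), the two measures can be computed roughly as below. This is a sketch of our reading of the metrics: USW counts source positions absent from the alignment, and NM is taken here as alignment crossings normalized by the number of alignment pairs; the exact normalization in Schioppa et al. (2021) may differ.

```python
from itertools import combinations

def unaligned_source_words(n_src_tokens, alignment):
    """Fraction of source token positions absent from the alignment.

    `alignment` is a set of (src_idx, tgt_idx) pairs.
    """
    aligned = {i for i, _ in alignment}
    return (n_src_tokens - len(aligned)) / n_src_tokens

def non_monotonicity(alignment):
    """Alignment crossings, normalized by the number of alignment pairs.

    Two pairs (i1, j1) and (i2, j2) cross when the source and target
    orders disagree, i.e. (i1 - i2) * (j1 - j2) < 0.
    """
    pairs = sorted(alignment)
    if len(pairs) < 2:
        return 0.0
    crossings = sum(
        1 for (i1, j1), (i2, j2) in combinations(pairs, 2)
        if (i1 - i2) * (j1 - j2) < 0
    )
    return crossings / len(pairs)
```

A perfectly monotone alignment such as {(0, 0), (1, 1), (2, 2)} yields an NM of 0, while a fully reversed one such as {(0, 1), (1, 0)} yields 0.5 under this normalization.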

Systems Under Evaluation
We experiment with the four systems (two NMT, two LLM) below:

1. WMT-21-SOTA: The Facebook multilingual system (Tran et al., 2021), which won the WMT-21 News Translation task (Barrault et al., 2021) and thereby represents the strongest NMT system on the WMT'21 test sets.

2. Microsoft-Translator: MS-Translator is one of the strongest publicly available commercial NMT systems (Raunak et al., 2022).

3-4. text-davinci-002 and text-davinci-003: the two most capable publicly-accessible GPT models at the time of writing.

For both GPT models, we randomly select eight samples from the corresponding WMT-21 development set and use them in the prompt as demonstrations for obtaining all translations from GPTs.
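The few-shot setup can be sketched as follows. The source/target line template shown here is an assumption for illustration; the exact prompt format used with the GPT models is not reproduced in this sketch.

```python
import random

def build_prompt(dev_set, test_source, src_lang, tgt_lang, k=8, seed=0):
    """Build a k-shot translation prompt from development-set pairs.

    `dev_set` is a list of (source, target) tuples; k examples are
    sampled at random and prepended as demonstrations, followed by
    the test source with an empty target slot for the model to fill.
    """
    rng = random.Random(seed)
    demos = rng.sample(dev_set, k)
    lines = []
    for src, tgt in demos:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {test_source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)
```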

Results
We compare the performance of the four systems on the WMT-21 test sets; Figure 1 shows the results. A key observation is that while the GPT-based translations achieve higher COMET-QE+LID scores than Microsoft Translator across the language pairs (except En-Ru), they also consistently obtain a considerably higher number of unaligned source words. This result also holds for the comparison between the WMT-21-SOTA and GPT systems. Further, GPT translations consistently show higher non-monotonicity for E→X translations. However, this is not the case for translations into English, where the multilingual WMT-21-SOTA system obtains very close non-monotonicity measurements. The combined interpretation of these measurements suggests that GPTs do produce less literal E→X translations.

Human Evaluation
We verify the conclusion from Figure 1 by conducting a human evaluation of translation literalness on 6 WMT-22 language pairs: En-De, En-Ru, En-Zh and De-En, Ru-En, Zh-En. For each language pair, we randomly sample 100 source-translation pairs, with translations obtained from MS-Translator (a strong commercial NMT system) and text-davinci-003 (a strong commercial LLM) (Hendy et al., 2023). We used zero-shot text-davinci-003 translations for the human evaluations in order to eliminate any biases introduced by specific demonstration examples. In each case, we ask a human annotator (a bilingual speaker for Zh-En; a target-language native who is also bilingual otherwise) to annotate the 100 translations from both GPT and MS-Translator and select which of the two translations is more literal. The human annotation interface is described in Appendix A. The results in Table 3 show that the annotators rate the GPT translations as less literal.

Experiments on Best WMT-22 NMT Systems
Further, we also experiment with the WMT-Best systems from the WMT-22 General Machine Translation task (Kocmi et al., 2022). We evaluate USW and NM on De-En, Ja-En, En-Zh and Zh-En, since on each of these language pairs text-davinci-003's few-shot performance is very close to that of the WMT-Best system as measured by COMET-22 (Rei et al., 2022), based on the evaluation in Hendy et al. (2023). We report our results in Table 4, which shows our prior findings replicated across the language pairs. For example, text-davinci-003, despite obtaining a 0.2 to 0.6 higher COMET-22 score than the best WMT systems on these language pairs, consistently obtains a higher USW score and a higher NM score in all but one comparison (NM for En-De). Note that the NM score differences for Chinese and Japanese are larger in magnitude because alignment deviations are measured over character-level alignments. We refer the reader to Hendy et al. (2023) for further details of the quality comparison.

Effects On Figurative Compositionality
In this section, we explore whether the less literal nature of E→X translations produced by GPT models can be leveraged to generate higher-quality translations for certain inputs. We refer to the phenomenon of composing the non-compositional meanings of idioms (Dankers et al., 2022a) with the meanings of the compositional constituents of a sentence as figurative compositionality. A model exhibiting greater figurative compositionality would thus be able to abstract the meaning of the idiomatic expression in the source sentence and express it in the target language non-literally, either through a paraphrase of the idiom's meaning or through an equivalent idiom in the target language. Note that greater non-literalness does not imply better figurative compositionality: non-literalness in a translation could arise from translation variations other than the desired figurative translation.

Translation with Idiomatic Datasets
In this section, we quantify the differences in the translation of sentences with idioms between traditional NMT systems and a GPT model. No English-centric parallel corpora dedicated to sentences with idioms exist, so we experiment with monolingual (English) sentences containing idioms. The translations are generated with the same prompt as in Section 2. The datasets with natural idiomatic sentences are enumerated below:

• MAGPIE (Haagsma et al., 2020) contains a set of sentences annotated for idiomaticity, alongside a confidence score. We use the sentences from the news domain that are marked as idiomatic with 100% annotator confidence (totalling 3,666 sentences).
• EPIE (Saxena and Paul, 2020) contains idioms and example sentences demonstrating their usage. We use the sentences available for static idioms (totalling 1,046 sentences).
• The PIE dataset (Zhou et al., 2021) contains idioms along with their usage. We randomly sample 1K sentences from the corpus.
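Selecting these idiomatic subsets involves simple filtering; the sketch below shows the MAGPIE case under assumed conventions. The JSONL layout and the field names (`label`, `confidence`, `genre`, `sentence`) are hypothetical illustrations, not the documented schema of the released corpus.

```python
import json

def select_magpie_news(path):
    """Keep news-domain instances marked idiomatic with full annotator confidence.

    Assumes (hypothetically) one JSON object per line with fields
    `label`, `confidence`, `genre`, and `sentence`.
    """
    selected = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            if (ex.get("label") == "idiomatic"
                    and ex.get("confidence", 0.0) >= 1.0
                    and ex.get("genre") == "news"):
                selected.append(ex["sentence"])
    return selected
```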

Results
The results are presented in Table 5. We find that text-davinci-002 produces better-quality translations than the WMT'21 SOTA system, with a greater number of unaligned source words as well as higher non-monotonicity.
Further Analysis: Note that directly attributing the gain in translation quality to better translation of idioms specifically is challenging. Moreover, similarity-based quality metrics such as COMET-QE might themselves penalize non-literalness, even though they are less likely to do so than surface-level metrics such as BLEU or ChrF (Papineni et al., 2002b; Popović, 2015). Therefore, while a natural monolingual dataset presents a useful testbed for investigating figurative compositionality, an explicit comparison of figurative compositionality between systems is very difficult. We therefore also conduct experiments on synthetic data, where we explicitly control the fine-grained attributes of the input sentences by allocating most of the variation among the input sentences to certain constituent expressions during synthetic data generation.

Synthetic Experiments
For our next experiments, we generate synthetic English sentences, each containing expressions of specific type(s): (i) names, (ii) random descriptive phrases, and (iii) idioms. We prompt text-davinci-002 in a zero-shot manner, asking it to generate a sentence with different instantiations of each of these types (details are in Appendix B). We then translate these sentences using the different systems, in order to investigate the relative effects on our literalness metrics between systems and across expression types. In each of the control experiments, we translate the synthetic English sentences to German.

Synthetic Dataset 1: As described, we generate sentences containing expressions of the three types, namely named entities (e.g., Jessica Alba), random descriptive phrases (e.g., large cake on plate) and idioms (e.g., a shot in the dark). Expression sources as well as further data generation details are presented in Appendix B. Results are in Table 6.

Synthetic Dataset 2: We additionally generate sentences containing two idioms each (Appendix B). Results are in Table 7.
Results: Table 6 shows that the percentage of unaligned source words is highest in the case of idioms, followed by random descriptive phrases and named entities. The results are consistent with the hypothesis that the explored GPT models produce less literal E→X translations, since sentences with named entities or descriptive phrases admit literal translations as acceptable, unlike sentences with idioms. text-davinci-002 obtains a much higher COMET-QE score for translations of sentences with idioms, yet also obtains a higher percentage of unaligned source words. Similarly, the difference in non-monotonicity scores is considerably higher for idioms. These results provide some evidence that the improved results of the GPT model, together with the lower literalness numbers, stem from correct translation of idiomatic expressions. Table 7 shows that this effect only increases with the number of idioms.

Discussion
In our experiments across different NMT systems and GPT models, we find evidence that GPTs produce translations with greater non-literalness for E→X in general. There could be a number of potential causes for this; we list two plausible hypotheses below:

Parallel Data Bias: NMT models are trained on parallel data, which often contains very literal web-collected outputs. Some of this may even be the output of previous-generation MT systems, which is prevalent on the web and hard to detect. In addition, even high-quality target text in parallel data contains artifacts that distinguish it from text originally written in that language, i.e., the 'translationese' effect (Gellerstam, 2005). These factors likely contribute to making NMT translations comparatively more literal.
Language Modeling Bias: Translation capability in GPTs arises in the absence of any explicit supervision for the task during pre-training. The computational mechanism that GPTs leverage to produce translations might therefore differ from that of NMT models, endowing them with greater abstractive abilities. This could have a measurable manifestation in the translations produced, e.g., in their literalness.
Differences in E→X and X→E: In E→X, we consistently find that GPT translations of similar quality are less literal, while in the X→E direction we observe a few anomalies. For X→E, in Figure 1, GPTs obtain higher non-literalness measures in all but one comparison (WMT-21-SOTA vs. GPTs for De-En). In contrast, we saw no anomalies in the trend for the E→X directions.

Variations in Experimental Setup
We also experimented with variants of USW and NM that discard alignments involving stopwords. All of our findings remain the same, with relatively minor changes in magnitudes but no changes in system rankings. Similarly, we observed a greater tendency towards non-literalness in GPT translations in both few-shot and zero-shot settings, when compared across a range of NMT systems.

Summary and Conclusion
We investigated how translations obtained from LLMs of the GPT family differ qualitatively from those of NMT systems by quantifying the property of translation literalness. We find that for E→X translations, there is a greater tendency towards non-literalness in GPT translations. In particular, this tendency becomes evident in the GPT systems' ability to translate idioms figuratively.
Acknowledgments

We thank Hitokazu Matsushita for help in conducting human evaluations.

Limitations
Measurement of translation literalness is neither well studied nor well understood. We rely on a combined interpretation of multiple measurements to investigate our hypothesis and its implications. This limits the extent to which we can make strong claims, since in the absence of a highly correlated metric for translation literalness it is hard to compare systems. We can only claim that our investigation indicates a tendency towards non-literalness in GPT translations; a stronger result would have been preferable to further disambiguate the translation characteristics. Further, we only compare GPT translations in the standard zero-shot and few-shot settings, and it is quite conceivable that more specific and verbose instructions could steer the LLMs to produce translations with different characteristics.

A Human Annotation Interface
We use the annotation interface shown in Figure 2, in which the annotators are asked to judge which of the two translations is more literal. The bilingual and native-speaker annotators were recruited in-house.

B Synthetic Dataset Details
Synthetic Dataset 1: For each of the three expression types, 100 synthetic sentences are generated. Figures 3, 4, and 5 present examples. The sources of the named entity and descriptive phrase expressions are the MultiNERD (Tedeschi and Navigli, 2022) and PhraseCut (Wu et al., 2020) datasets, respectively.
Prompt: Q: Generate a sentence containing the idiom: a short fuse, in the form of a news article sentence. \n A:
Output: The man was known to have a short fuse, and often exploded into fits of anger without warning.

Prompt: Q: Generate a sentence containing the phrase: white chair, in the form of a news article sentence. \n A:
Output: The white chair was found to be comfortable by the majority of the participants in the study.

Synthetic Dataset 2: Figure 6 presents an example of the prompt as well as a generated synthetic sentence containing two idioms.

Prompt: Q: Generate a sentence using the two idioms: off the wall, claim to fame in the form of a news article sentence. \n A:
Output: The company's off-the-wall marketing campaign was its claim to fame.

C Alignments and Literalness

Schaeffer and Carl (2014) find that more alignment crossings between the source and the translation (which is what the non-monotonicity metric measures) are proportional to the extra cognitive effort (measured using the gaze time of human translators) required by human translators to process non-literal translations. This links alignment crossings (the non-monotonicity measure is normalized alignment crossings) with greater non-literalness.