Large Language Models Effectively Leverage Document-level Context for Literary Translation, but Critical Errors Persist

Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets. However, their ability to translate paragraphs and documents remains unexplored because evaluation in these settings is costly and difficult. We show through a rigorous human evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an entire literary paragraph (e.g., from a novel) at once results in higher-quality translations than standard sentence-by-sentence translation across 18 linguistically-diverse language pairs (e.g., translating into and out of Japanese, Polish, and English). Our evaluation, which took approximately 350 hours of effort for annotation and analysis, is conducted by hiring translators fluent in both the source and target language and asking them to provide both span-level error annotations as well as preference judgments of which system’s translations are better. We observe that discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. With that said, critical errors still abound, including occasional content omissions, and a human translator’s intervention remains necessary to ensure that the author’s voice remains intact. We publicly release our dataset and error annotations to spur future research on the evaluation of document-level literary translation.


Introduction
Separate text from context and all that remains is a con.

Stewart Stafford
Large language models (LLMs) such as ChatGPT (OpenAI, 2022) demonstrate remarkable performance as stand-alone translation systems, rivaling and sometimes surpassing commercial models on sentence-level benchmarks (Vilar et al., 2022; Hendy et al., 2023; Jiao et al., 2023). Furthermore, LLMs are increasingly being deployed for document-level translation (Book Maker, 2023; Pawlak, 2023), a scenario for which there are currently no reliable automatic evaluation methods. In this paper, we hire human translators to conduct a rigorous fine-grained evaluation of GPT-3.5's ability to translate paragraph-level texts from literary works across 18 different language pairs. Our results (Figure 4) demonstrate that GPT-3.5 effectively leverages discourse-level context to produce higher-quality translations than when translating sentences in isolation. Our dataset and error annotations are available at https://github.com/marzenakrp/LiteraryTranslation.

[Figure 1: A plot of the total number of errors annotated in sentence-level (SENT) and paragraph-level (PARA) translations produced by GPT-3.5 across 18 different language pairs. In all cases, PARA produces fewer errors than SENT, which demonstrates that GPT-3.5 takes advantage of discourse context during translation.]
Why literary texts? Translating works of literature poses unique challenges due to the intricate nature of creative work and the importance of capturing the author's voice and contextual nuances. Translators thus apply a wide range of translation techniques (Chesterman, 1997; Molina and Hurtado Albir, 2004), from simple shifts in grammatical categories to more complex stylistic or content-based rearrangements that often cross sentence boundaries. Translators may also merge or split sentences, or even entire paragraphs, which renders the traditional sentence-level pipeline insufficient for capturing the full scope of the original text (Toral and Way, 2015; Taivalkoski-Shilov, 2019b); at least 55% of the reference target paragraphs used in our study split or merge sentences from the source text (measured with an automatic sentence tokenizer). Taken together, these properties make literary texts a good testbed for document-level machine translation. In our work, we focus on the paragraph as a minimal discourse-level unit, broadly defining a paragraph as a distinct passage within the novel that focuses on a single theme.
Why human evaluation? The absence of rigorous document-level evaluations of LLM translators is striking, but also somewhat understandable given the unreliability of automatic metrics and the difficulty of properly conducting human evaluations (Castilho, 2021). Furthermore, evaluations of LLM translators are especially difficult due to data contamination (Aiyappa et al., 2023), as it is unclear whether the models are pretrained on existing benchmarks (e.g., from WMT). We fill this gap by first collecting paragraphs from recently-published literary translations. Then, we provide human translators with two candidate machine translations of a given source paragraph and ask them to (1) mark error spans and categorize them based on a predefined schema inspired by MQM (Lommel et al., 2014; Freitag et al., 2021), (2) make preference judgments of which of the two translations is of higher quality, and (3) provide free-form justifications of their preference judgments. In total, we collect such annotations on 720 pairs of translated paragraphs across 18 different language pairs (using the three diverse target languages of English, Japanese, and Polish), which we then leverage for a fine-grained analysis of the behavior of different LLM translation methods.
How do we use LLMs to translate paragraphs? We use three strategies to generate the paragraph-level translations for our evaluations, all of which rely on few-shot prompting with GPT-3.5: (1) translating each sentence in the paragraph in isolation from the others (SENT); (2) translating each sentence in the paragraph when provided with the rest of the paragraph as context (PARA_SENT); and (3) translating the entire paragraph in one shot (PARA), not sentence by sentence. Finally, we also compare these methods to Google Translate (GTR).
LLMs produce better translations when provided with paragraph-level context: Our evaluations reveal that using GPT-3.5 to translate complete paragraphs (PARA) yields translations of significantly higher quality than both the sentence-by-sentence GPT-3.5 methods and Google Translate. Our detailed analysis of annotated translation errors and free-form comments shows that paragraph-level translations exhibit increased coherence, better preservation of literary style, and improved handling of context-dependent expressions (see Figure 2). That said, we also observe that PARA still makes numerous critical mistranslations and other errors across different language pairs, which shows that LLM-based translators still have significant room for improvement, particularly when applied to contextually-rich literary texts.

Background
Before describing our dataset and evaluation, we first contextualize our work within both the existing body of research on document-level machine translation as well as recent papers on translation via large language models.

Existing approaches to document-level translation
Before the rise of neural machine translation, several attempts were made to incorporate discourse-level phenomena into statistical machine translation systems (Hardmeier, 2012; Carpuat and Simard, 2012; Hardmeier et al., 2013; Ding et al., 2014). Neural MT systems condition sentence-by-sentence translation on discourse-level context via concatenation models (Tiedemann and Scherrer, 2017; Jean et al., 2017; Agrawal et al., 2018; Junczys-Dowmunt, 2019; Lopes et al., 2020), hierarchical models (Miculicich et al., 2018; Tan et al., 2019; Zheng et al., 2020), multi-pass models (Mansimov et al., 2021), dynamic context models (Kang et al., 2020), multi-source models (Zhang et al., 2018; Feng et al., 2022), and transfer learning approaches. Despite sometimes obtaining clear gains from discourse-level context (Voita et al., 2019), the machine translation community has not made much progress on this problem, particularly for non-English language pairs, due largely to the scarcity of parallel document-level corpora. This problem has been partially addressed by introducing a pivot language (Cohn and Lapata, 2007; Utiyama and Isahara, 2007), but this approach can also lead to substantial information loss.

[Figure 2: An example of paragraph-level (PARA) and sentence-level (SENT) translations of the same Japanese source paragraph (from Convenience Store Woman by Sayaka Murata) into English. Sentence-level translation results in a range of erroneous translations, from worse word choice ("understood" vs. "right away") to incorrect pronouns ("he" vs. "I").
GPT-3.5 SENT: (...) "Ah, and one pack of cigarettes, number five." "Understood." Quickly remove the Marlboro Light Menthol and scan it at the register. "Please confirm your age with a touch." The man's gaze shifted quickly to the showcase where the fast food was lined up while he was touching the screen, and he stopped his finger movement.
GPT-3.5 PARA: (...) "Ah, and one pack of cigarettes, number five." "Right away." I quickly pulled out a Marlboro Light Menthol and scanned it at the register. "Please touch the screen for age verification." His gaze shifted to the showcase with the fast food as he touched the screen, and I stopped my finger's movement.]

Translation with large language models
Many recent studies explore the potential that LLMs hold for translation, an especially attractive prospect given that training or fine-tuning on large parallel corpora is not necessary (that said, parallel data is almost certainly included, at least for high-resource languages, in LLM pretraining data). These works span paragraph-level post-editing with LLMs, translating sentence-level inputs (Vilar et al., 2022; Jiao et al., 2023), and analyzing hallucinations in LLM-generated translations (Guerreiro et al., 2023). Studies on prompt engineering for translation conclude that simple sentence-level English prompt templates are effective for paragraph translations (Zhang et al., 2023). Other findings include that automatically-generated dictionaries assist translation (Ghazvininejad et al., 2023), and that example quality outweighs lexico-semantic proximity to the input (Vilar et al., 2022). To the best of our knowledge, the only work other than ours that evaluates LLMs for paragraph-level translation is that of Hendy et al. (2023), which focuses on automatic evaluation of context-aware sentence-by-sentence translation. Unlike Hendy et al. (2023), we perform a fine-grained human evaluation of paragraph-level translation, which sheds more light on the concrete strengths and weaknesses of LLM translators in this setting.

Data & methods
Our work differs from existing research on translating with large language models in two key ways: we focus on (1) literary texts and (2) translation at the paragraph level. In this section, we describe and motivate the paragraph-level translation dataset used in our study, which covers eighteen unique language pairs (three target languages) and is sourced from recently-published novels. Then, we outline the different ways in which we leverage GPT-3.5 to translate these paragraphs at both the sentence and paragraph levels.

Dataset collection
Literary texts (e.g., novels or short stories) pose unique challenges for translators due to their complex nature. Translators must interpret and honor the author's voice with no objective reality to measure against, which can result in several equally valid translations (Sager, 1998). For machine translation systems, these challenges exacerbate the need for discourse-level context: an author's intended meaning or style is often unclear from just a single sentence.

[Example paragraph from an English source novel in our dataset: "Oh, poor old thing, she has a nervous bladder!" exclaimed someone's chubby mother. "Is that a Persian rug?" Whose mother was it? Unclear. No one would cop to it, of course. We canceled the performance. "Admit it, that was your mother," said a kid named Rafe to a kid named Sukey, when the parents had filed out. Some of their goblets, highball glasses, and beer bottles were completely empty. Drained. Those parents were in a hurry, then. "No way," said Sukey firmly, and shook her head. "Then who is your mother? The one with the big ass? Or the one with the clubfoot?" "Neither," said Sukey. "So fuck you."]
Selecting paragraphs from novels: How good are machines at translating literary paragraphs? To answer this question, we extract 20 paragraphs (dialogues and narrative texts) each from 18 recently-published translations of novels, and we manually align these paragraphs with corresponding paragraphs in the source novel (see Table 1). In most cases, we purchased the source ebook and its corresponding translation before extracting aligned paragraphs; for a few books, we utilized Amazon's free preview functionality. The target language of each translation is English, Polish, or Japanese (6 books for each), and we consider eight different source languages. Almost all of the translations were published after 2021 (see Table 2), which is important to avoid data contamination with the pretraining data of large language models. In sum, we obtain 360 aligned source-target paragraphs, which we use for all of the experiments described in the rest of the paper.
Paragraph length: All paragraphs consist of at least two sentences, and the majority are between four and nine sentences long (mean=7.45, std=4.14). A paragraph with fewer sentences is not necessarily short: for example, in the German novel "An Inventory of Losses," sentences can be as long as 70 to 80 words, with the longest reaching 117 words. As automatic sentence tokenizers are not always reliable for all of the languages considered in our study, we manually perform sentence tokenization to enable a direct comparison of sentence- and paragraph-level translation systems. For more details about the dataset statistics, including token and sentence counts, see Table 8 and Table 9.
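As a rough illustration, the split/merge statistic reported in the introduction can be approximated by comparing sentence counts on both sides of each aligned paragraph pair. The sketch below uses NLTK's punkt tokenizer and is illustrative only: punkt covers languages such as Polish, German, Czech, French, and Russian, but not Japanese or Chinese, which require dedicated segmenters (and, as noted above, we ultimately tokenized manually).

```python
# A minimal sketch: flag aligned paragraph pairs whose sentence counts
# differ, indicating that the translator split or merged sentences.
# Assumes NLTK's punkt models; Japanese/Chinese need other tools.
import nltk

nltk.download("punkt", quiet=True)

def splits_or_merges(src_paragraph: str, tgt_paragraph: str,
                     src_lang: str = "german",
                     tgt_lang: str = "english") -> bool:
    n_src = len(nltk.sent_tokenize(src_paragraph, language=src_lang))
    n_tgt = len(nltk.sent_tokenize(tgt_paragraph, language=tgt_lang))
    return n_src != n_tgt

def split_merge_rate(pairs, src_lang, tgt_lang):
    """Fraction of aligned pairs with mismatched sentence counts."""
    flags = [splits_or_merges(s, t, src_lang, tgt_lang) for s, t in pairs]
    return sum(flags) / len(flags)
```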
Target language selection: We select English, Japanese, and Polish as the target languages of our study, as these languages differ considerably in many linguistic aspects. English is an analytic language that is widely spoken and extensively studied in the field of natural language processing, and it serves as the primary pretraining language of most large language models, including GPT-3.5. (As of 2020, the reported distribution of the languages featured in this study within the GPT-3 training data was: English 92.647% (1st), French 1.818% (2nd), German 1.469% (3rd), Russian 0.188% (9th), Polish 0.155% (11th), Japanese 0.111% (15th), Chinese 0.099% (17th), and Czech 0.071% (18th); see https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_word_count.csv. The current GPT-3.5 text-davinci-003 model is reported to incorporate data up to June 2021, and it is unclear what texts or languages were added to the original training data; see https://platform.openai.com/docs/models/gpt-3-5.)

[Table 2: Details of the translated novels used in our study. In cases where the same novel is used for multiple target languages (e.g., "The Years"), identical source paragraphs are extracted to enable comparisons across language pairs. These novels exhibit distinct differences beyond just their source languages. For instance, "What Can You See From Here" presents a philosophical exploration of life and death, while "Sword of Destiny" is a fantasy story that is part of "The Witcher" saga.]
In contrast, both Japanese and Polish are comparatively under-explored. Japanese is an agglutinative language that employs three distinct writing systems: Kanji, Hiragana, and Katakana. As a high-context language, Japanese requires a profound comprehension of context and cultural nuances to translate, rendering it a compelling choice for testing the limits of LLMs' translation capabilities. Polish, on the other hand, is a fusional language characterized by a rich morphological system. Its complex word forms, grammatical gender, conjugation, and declension make it an apt choice for testing the accuracy and robustness of LLMs. (The first author is fluent in all three target languages.)

Source language selection: As source languages, we select English (en), Polish (pl), Russian (ru), Czech (cs), French (fr), German (de), Japanese (ja), and Chinese (zh). These languages belong to a diverse array of language families (Indo-European Romance, Germanic, and Slavic; Sino-Tibetan; and Japonic), each with distinctive morphological traits (fusional, agglutinative, and analytic). Moreover, they employ a variety of writing systems such as the Latin alphabet, the Cyrillic alphabet, Hanzi, and Kanji/Hiragana/Katakana
(see Table 7 in Appendix A for details). Finally, we carefully select source-target language pairs to ensure that our study encompasses both linguistically similar and dissimilar languages. For example, we paired cs-pl, as these languages are characterized by only 10% lexical distance (i.e., the percentage of non-cognates in the language pair) and have similar syntactic structures (Jágrová and Avgustinova, 2023). Conversely, we also include ja-pl, as the two languages have very little lexical overlap, vastly different grammars, and utilize distinct writing systems.

Translation with large language models
In this paper, we focus on translating the literary paragraphs in our dataset using large language models. More specifically, we use the GPT-3.5 text-davinci-003 checkpoint, which has been further tuned to follow instructions based on human feedback (Ouyang et al., 2022). Hendy et al. (2023) demonstrate that GPT-3.5 produces translations of reasonable quality, though their focus was mostly at the sentence level. Since many LLMs, including GPT-3.5, are only accessible via black-box APIs, we adapt the model for translation via in-context learning (Brown et al., 2020).
Demonstration examples: We use few-shot prompting, in which the model is provided with a prompt consisting of five demonstrations. We manually curate the five demonstrations from literary texts for each of the 18 language pairs, resulting in 90 total demonstration examples. These demonstrations are sourced from novels that are not part of our translation dataset, resulting in potential differences in topic and style (see Table 10 in Appendix A for details). We further ensure that each set of five demonstrations includes both dialogues and narrative texts.
Prompting for translation: We consider the following three prompting strategies for GPT-3.5, which allow us to compare the model's ability to translate with and without discourse-level context (see Table 3 for templates and Appendix B for the exact prompts; a minimal sketch of how such prompts can be assembled appears after this list):

• GPT-3.5 sentence-level translation without context (SENT): Each sentence of the paragraph is translated in isolation from the others. To maintain consistency, we provide the same five sentence-level examples in each prompt for the given source-target language pair (these are sampled from the demonstrations for paragraph-level translation). To ensure consistent quotation mark usage and enable a fair comparison with paragraph-level translations, quotation marks in sentence-level translations were manually adjusted.

• GPT-3.5 sentence-level translation with context (PARA_SENT): Each sentence of the paragraph is translated in context. The model is provided with the entire source paragraph as input, where the sentence to be translated is wrapped in <translate> and </translate> tags, in addition to a partially-translated target paragraph. The demonstrations in the prompt also contain <translate> and </translate> tags wrapped around one sentence per demonstration; for each demonstration, a sentence in a different position was chosen (e.g., from the beginning, middle, and end of the paragraph).
• GPT-3.5 paragraph-level translation (PARA): The entire source paragraph is passed into the model, and the output target paragraph is generated conditioned on this input (i.e., without any sentence tokenization). Demonstrations in the prompt are also paragraphs of translations from the respective source language into the target language in question. Because the examples for the PARA and PARA_SENT configurations are necessarily lengthier, the GPT-3.5 maximum context size sometimes prevents us from including all five examples within the prompt; consequently, around 10% of the data was translated using four or fewer examples. (Initially, we experimented with GPT-3 by translating between two non-English languages using English as a pivot, as it is the primary language of the model; the model had access to the source text and its English translation. After manual evaluation and comparison to translations without a pivot language, we found no significant benefit in using English as the pivot, so we translate paragraphs directly into the target language. Refer to Appendix D for details and results of this preliminary study.)
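Below is the minimal sketch referenced above of how few-shot translation prompts can be assembled and sent to text-davinci-003 via the legacy OpenAI completions endpoint. The template wording here is illustrative, not the exact prompt from Table 3 or Appendix B; the decoding settings match those reported in Appendix D (top-p=1.0, temperature=0.3).

```python
# Illustrative few-shot prompt assembly for the SENT and PARA setups.
# Each demo is a (source, translation) pair: single sentences for SENT,
# whole paragraphs for PARA.
import openai

def build_prompt(demos, source_text, src_lang, tgt_lang):
    blocks = [
        f"Original text in {src_lang}: {src}\n"
        f"Translation into {tgt_lang}: {tgt}"
        for src, tgt in demos
    ]
    blocks.append(f"Original text in {src_lang}: {source_text}\n"
                  f"Translation into {tgt_lang}:")
    return "\n\n".join(blocks)

def translate(demos, source_text, src_lang, tgt_lang):
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=build_prompt(demos, source_text, src_lang, tgt_lang),
        top_p=1.0,
        temperature=0.3,
        max_tokens=1024,  # illustrative budget for a paragraph
    )
    return response["choices"][0]["text"].strip()
```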
Using Google Translate (GTR) as a baseline: In order to compare commercial-grade translation systems to LLM translators, we also translate all paragraphs in our dataset using Google Translate (all paragraphs were translated in January 2023 using the Google Translate API). We opt for an off-the-shelf commercial system instead of a state-of-the-art system from, e.g., WMT competitions for two primary reasons. First, our experiments focus on literary translations; given that WMT systems are predominantly evaluated on the news domain, it is uncertain which system would perform best, and some language pairs may not even be supported. Second, our central question revolves around LLMs' ability to incorporate contextual information, rather than merely comparing their performance with state-of-the-art translation systems. We employ GTR as a reasonably robust baseline to assess the extent to which context can enhance MT quality, rather than asserting that LLMs outperform all traditional MT systems.
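For reference, the GTR baseline can be reproduced with the Google Cloud Translation API (v2 client); the sketch below omits authentication setup, and the helper name is ours.

```python
# A sketch of the GTR baseline via the Google Cloud Translation API.
from google.cloud import translate_v2 as translate

client = translate.Client()

def google_translate(paragraph: str, src: str, tgt: str) -> str:
    result = client.translate(
        paragraph,
        source_language=src,
        target_language=tgt,
        format_="text",  # plain text; avoids HTML entity escaping
    )
    return result["translatedText"]
```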

Evaluating document-level literary translation
How do we compare the translation quality of the systems described above? Automatic metrics such as BLEURT and COMET are untested on document-level inputs as well as literary text, and as such we do not consider them reliable, although we do report them in Section 5.1. (Automatic metrics developed specifically for document-level MT are also insufficient, as they either work best with one-to-one sentence-level alignments (Vernikos et al., 2022; Hendy et al., 2023) or are available only for English.) Human evaluation is equally problematic, as direct assessments of translation quality (e.g., "rate the quality of this translation from 0-100") suffer from calibration issues that are exacerbated with longer texts (Karpinska et al., 2021). Thus, we opt for a human evaluation inspired by Multidimensional Quality Metrics (MQM; Lommel et al., 2014), in which annotators mark and classify error spans within the translation. Specifically, for each of the 18 language pairs studied in this work, we hire translators to identify all span-level errors in two competing translations. For each evaluated pair, the annotators are also asked to choose the better translation and provide a free-form rationale. For each source paragraph, the translators make three binary judgments of which translation is higher quality: SENT vs. PARA, PARA_SENT vs. PARA, and GTR vs. PARA.
Recruiting annotators: As our task is complex and requires fluency in both the source and target language, we hire translators to provide the annotations. We recruit 13 translators, each of whom is a native speaker of English, Polish, or Japanese, through the Upwork freelancing platform (https://www.upwork.com/). The annotators for Czech-Polish and Russian-English were native speakers of the respective source languages and highly proficient in their respective target languages; they collaborated with native speakers of the target languages, who possessed a basic understanding of the source language, to complete their annotations. One translator, hired directly, was a bilingual speaker of English and Polish with advanced knowledge of German; as such, she performed the pl-en, de-en, and de-pl evaluations. Evaluation of ja-pl, pl-ja, and pl-en texts was done by the first author in collaboration with native speakers of Polish/Japanese to avoid any potential bias. Each translator was paid $2 per evaluated pair of candidate translations, with an additional $5 bonus to cover the time spent familiarizing themselves with the instructions. We asked them to compare three pairs of system translations (PARA vs. SENT, PARA vs. PARA_SENT, PARA vs. GTR) for 10 paragraphs per language pair; as such, 180 total source paragraphs were used in our evaluations. Altogether, we paid approximately $12 per hour, with a total cost of $955.

[Figure 3: A description of the annotation process for a pair of candidate translations given a source paragraph. Note that our hired translators go through this pipeline for three different pairs per source paragraph, comparing PARA with SENT, PARA_SENT, and GTR.]
Annotation task: First, we tasked the hired translators with annotating a subset of MQM translation errors identified through a pilot analysis and annotation of the systems' outputs. Translators were presented with guidelines in their native language, and the annotation task was performed using the Label Studio annotation tool (Tkachenko et al., 2020-2022). Specifically, we ask them to highlight spans within the candidate translations that contain errors belonging to any of the following error categories:

• mistranslation: accuracy errors that occur when the wrong target word or phrase is chosen to represent content from the source text. In addition to canonical mistranslations, we also include overly literal translation errors that occur when systems translate word-by-word into the target language even though the result is nonsensical. (We note that mistranslations in literary text are often not as grave as, for instance, in news articles. Human translators hold poetic license, which allows them to change some details to make the text more enjoyable for the reader. Is changing "bonito" into "tuna" incorrect? Or can it be perceived as a way to accommodate an English-speaking readership that is likely more familiar with the latter?)
• grammar: grammatical errors, such as errors in conjugation or declension, wrong prepositions, etc.
• untranslated: words or phrases that should have been translated into the target language but were either left in the source language or just transliterated into the target language.
• inconsistency: use of different terms to refer to the same entity, or different words where the same word should be used for stylistic reasons (e.g., "Kasia" and "Kate," "coat" and "jacket," or "bad" and "awful");

• register: a clear violation in the use of formal and informal language within the same text, annotated only in Japanese. We only annotate cases where the level of formality changes abruptly within the same paragraph; even if a given character might be more likely to use formal language, consistent use of informal language is not considered an error, as this cannot be fully determined from the paragraph context.

• format: incorrect usage of punctuation (e.g., "." instead of "。").
After the span-level annotation is complete, we then ask the translators to further identify whether either of the candidate translations contains significant content additions or omissions in relation to the source text (this task was simplified to a binary choice, i.e., whether or not there were serious omissions/additions; due to time restrictions, we did not ask the annotators to annotate them further). Finally, they are asked to choose the better translation and provide a justification for their choice in two to five sentences. We instruct them to additionally mark whether their chosen translation is significantly superior, or if the decision was difficult because both translations are of roughly comparable quality (see Figure 3 and Appendix C for details).
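To make the annotation schema concrete, the sketch below shows one way to represent a single annotated comparison; the field names are our own illustration, not the exact schema of the released dataset.

```python
# Illustrative record structure for one pairwise comparison.
from dataclasses import dataclass, field
from typing import List, Literal

ErrorCategory = Literal["mistranslation", "grammar", "untranslated",
                        "inconsistency", "register", "format"]

@dataclass
class ErrorSpan:
    start: int               # character offset into the translation
    end: int
    category: ErrorCategory

@dataclass
class ComparisonJudgment:
    source_paragraph: str
    system_a: str                         # e.g., "PARA"
    system_b: str                         # e.g., "SENT"
    errors_a: List[ErrorSpan] = field(default_factory=list)
    errors_b: List[ErrorSpan] = field(default_factory=list)
    omission_or_addition_a: bool = False  # binary judgment only
    omission_or_addition_b: bool = False
    preferred: str = ""                   # "A" or "B"
    clearly_better: bool = True           # False if the call was hard
    rationale: str = ""                   # 2-5 sentence justification
```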

Results
In this section, we compare our different literary translation methodologies using both automatic metrics and aggregate statistics from the human evaluations. Overall, we observe that the PARA configuration outperforms competing methods across all evaluations and language pairs. These results demonstrate that GPT-3.5 effectively leverages paragraph-level context to produce better translations than sentence-level methods, and also that the less efficient sentence-by-sentence translation with context (PARA_SENT) is unnecessary to achieve high translation quality.

Automatic metrics favor PARA
We assess the translations from all four systems using the reference-based COMET (Rei et al., 2022), BLEURT (Sellam et al., 2020), and BERTSCORE metrics, as well as the reference-free COMET-QE (Rei et al., 2021) metric. (We use the newest wmt22-comet-da checkpoints for COMET, BLEURT-20 checkpoints for BLEURT, wmt20-comet-qe-da checkpoints for COMET-QE, and the HuggingFace implementation, which employs roberta-large, for BERTSCORE.) Although these metrics were not explicitly designed for evaluating paragraph-level outputs and their results should be interpreted with caution, they prove more reliable than string-based metrics like BLEU, especially for literary translations. Table 4 shows the effectiveness of the PARA translation method: a statistical analysis with linear mixed-effects models (Baayen et al., 2008) demonstrates that PARA significantly outperforms SENT and GTR based on COMET, BLEURT, and COMET-QE scores (p<.001), and surpasses GTR based on the BERTSCORE results (p<.001). We present more details of this analysis in Appendix E.
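As an illustration, paragraph-level COMET and BERTSCORE scores can be computed as follows (a sketch using the unbabel-comet and bert-score packages with the checkpoints listed above; scores on literary, paragraph-length inputs should be interpreted with caution).

```python
# Sketch: score paragraph-level translations with COMET and BERTScore.
from comet import download_model, load_from_checkpoint
from bert_score import score as bertscore

def comet_scores(sources, hypotheses, references):
    model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    data = [{"src": s, "mt": h, "ref": r}
            for s, h, r in zip(sources, hypotheses, references)]
    return model.predict(data, batch_size=8, gpus=0).scores

def bertscore_f1(hypotheses, references, lang="en"):
    # bert-score defaults to roberta-large for English
    _, _, f1 = bertscore(hypotheses, references, lang=lang)
    return f1.tolist()
```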

Human evaluation also favors PARA
Figure 5 contains human preference results comparing PARA to SENT, PARA to PARA_SENT, and PARA to GTR, aggregated across all 18 language pairs studied in this paper (i.e., 180 votes per system comparison). Table 11 breaks down these results for each language pair, and we observe the same trends for the vast majority of pairs. Overall, the translators significantly favored PARA translations over the alternatives (p<.001, binomial test). Table 5 contains specific information about grammar and mistranslation errors split across the three target languages (see Table 6 and Table 12 for details), which we use to discuss the three comparison settings in more detail below.

[Figure 4: The distribution of translator preference judgments between sentence-level translation (SENT) and paragraph-level translation (PARA). PARA is preferred (i.e., more votes) in every language pair except de-ja.]
PARA is clearly better than SENT: PARA is preferred by translators over SENT at a rate of 71.1% (p<.001, 95% CI [0.639, 0.776]). Additionally, when translators preferred PARA, they were usually confident in the decision (i.e., it was clearly better than SENT); even if we exclude all "unsure" votes, the preference for PARA translations remains significant at 78.5% (p<.001, 95% CI [0.695, 0.859]). The only language pair in which SENT is favored over PARA is de-ja (see Figure 4). This result may be attributed to the fact that the German novel An Inventory of Losses by Judith Schalansky, used for this language pair, contains the longest sentences in our dataset (on average 45 tokens per sentence), which means that the intra-sentence context is likely more informative than in other books (see Table 8). Overall, SENT translations contain 29.5% more mistranslations, 65.4% more grammatical errors, over 12 times more inconsistency errors, and three times more register errors (see Table 5).
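The preference test above is straightforward to reproduce: 71.1% of 180 votes corresponds to 128 votes for PARA, and an exact binomial test with a Clopper-Pearson interval recovers the reported statistics (a sketch using scipy).

```python
# Sketch: exact binomial test for the PARA vs. SENT preference votes.
from scipy.stats import binomtest

result = binomtest(k=128, n=180, p=0.5, alternative="two-sided")
print(result.pvalue)   # well below .001
print(result.proportion_ci(confidence_level=0.95, method="exact"))
# approximately (0.639, 0.776), matching the reported 95% CI
```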
PARA is clearly better than GTR: PARA translations are overwhelmingly preferred over those from Google Translate (GTR), with an 82.8% preference rate. In the fr-ja, pl-ja, zh-ja, and cs-pl language pairs, PARA received all ten votes over GTR. Part of this advantage may be attributed to GTR sometimes using English as a pivot language, which can result in information loss; our Czech translator observed that mistakes in GTR translations suggest the text was first translated into English. Overall, GTR translations contain 57.7% more mistranslations, 37.3% more grammatical errors, over twice as many inconsistency errors, and ten times more register errors than PARA (see Table 5). Additionally, GTR produced 125 format errors while PARA produced perfect outputs in this regard. Finally, it is worth noting that GTR left fewer words untranslated, though PARA's count is inflated by the fact that in one German text, the word "Bauer" ("farmer") was left untranslated 14 times in the PARA translation.

PARA is slightly preferred over PARA_SENT: Our evaluations show that PARA is better than PARA_SENT, but the gap is smaller than it is for the other two methods. PARA is still preferred at a 66.1% rate (see Table 5). While PARA_SENT leaves around 22% more words untranslated, it appears to leverage the context and even occasionally selects better equivalents in the target language, as evidenced by translator comments. One major issue with PARA_SENT is that it occasionally repeats sentences, whereas PARA never does so.
What do translators think about PARA? To wrap up this section, we provide a qualitative analysis of the free-form comments written by translators to justify their preference judgments. Overall, the translators praise PARA for its "more skillful use of rhetoric devices" and for "surpass[ing] SENT as a literary rendition." They also mention that PARA "uses more of a poetic license but this makes it stylistically much smoother than SENT." Furthermore, translators state that PARA "clearly better reflects the content and style of the original" when compared to GTR, and that it "stays consistent within the paragraph." Inevitably, translations are not flawless, and there are instances where both compared systems fall short, as highlighted by one of the translators when assessing PARA against SENT: "Nightmare, a mistake upon mistake (...) Despite all these mistakes, I can understand the [PARA] translation better but they are equally miserable."

Analyzing translation errors
The aggregate statistics from the previous section confirm that PARA-level translation via GPT-3.5 is the strongest literary translator of the methods we study. Translations produced by PARA are favored by both automatic metrics and human translators, and PARA makes fewer errors than competing methods. In this section, we dive deeper into the specific types of errors made within each high-level category (e.g., grammar, mistranslation), and we present examples of errors associated with a lack of contextual understanding that are made by SENT and GTR but fixed by PARA.

Language-specific grammatical errors
We begin by analyzing the types of grammatical errors made by the studied translation methods in all three target languages.

English: Perhaps not surprisingly, translations into English contain notably fewer grammatical mistakes than those into Japanese or Polish (see Table 5). The most prominent mistakes in English are incorrect articles, which are most frequent in the outputs of SENT and GTR. This is to be expected, as the choice between the definite and indefinite article in English depends heavily on the context. Other mistakes include wrong or omitted prepositions, wrong parts of speech, and incorrect word order (see Table 6).
Japanese: Translations into Japanese contain considerably more mistakes. Most notably, the systems struggle with the correct choice of particle: PARA and SENT produce twice as many mistakes in this regard as PARA_SENT and GTR (see Table 6). Other mistakes include incorrect tense, incorrect verb finite form within the sentence, and incorrect word order, the latter of which is much more frequent in GTR than in any of the GPT-3.5 translations.
Polish: GPT-3.5 exhibits more difficulty with Polish than with Japanese, as evidenced by 55 vs. 42 errors for PARA, 86 vs. 50 for SENT, and 64 vs. 37 for PARA_SENT (see Table 5). We notice that GPT-3.5 translations frequently use incorrect gender, case, or prepositions (see Table 6). We also observe instances in which GPT-3.5 alters the gender of a noun, such as producing grilla, a non-existent feminine form, in place of the masculine grill, while accurately modifying all adjectives and verbs to match the novel feminine noun. (It is worth noting that grilla can also be the genitive form of the masculine noun grill; however, the agreement of surrounding verbs and adjectives with the feminine noun suggests that the system likely treated the word as feminine.) In contrast, the performance of GTR is comparable for Polish and Japanese in terms of grammar, with 59 and 63 errors respectively. Intriguingly, GTR seems to struggle with Polish aspect, producing 12 errors in this category, in contrast to 1 error in both PARA and PARA_SENT and 5 errors in SENT (see Table 6). In summary, although GPT-3.5 is primarily trained on English, it is competitive with GTR in Polish and Japanese grammar proficiency. In fact, PARA generates the fewest grammatical errors of any system, with a total of 97 for both languages, in contrast to 136 errors made by SENT, 101 by PARA_SENT, and 122 by GTR (see Table 5). That said, none of these systems delivers translations devoid of grammatical inaccuracies, even for English.

[Figure 6: Quantification of mistranslations resulting from missing or misinterpreted paragraph-level context in PARA, SENT, PARA_SENT, and GTR systems, organized by the target language (Japanese, Polish, and English).]

Context-related errors
We manually classify all annotated mistranslations (2324 instances) into subcategories, several of which include instances where the absence of discourse-level context is clearly a contributing factor (see Table 12 for detailed classification). We also further analyze the translations in terms of content-related issues. Overall, we observe that context is indeed incorporated into the translations for both PARA and PARA_SENT outputs, which results in fewer context-dependent issues (see Figure 6).
(1) Russian source (from The Story of a Life), translated into English:
a. (SENT) The wind would start to rustle in the bare trees and then fall silent, just as I listened to the flow of the night. But he didn't leave, he was here.
b. (PARA) The wind would start to rustle in the bare trees, then die down, just like me, listening to the flow of the night. But it didn't go away, it was still here.

In Russian, nouns have grammatical gender. "Wind" in the first sentence of the source text is a masculine noun, so it is later referred to with a masculine pronoun in (1). Without access to the context, the SENT model incorrectly translates this pronoun as "he" in English (1a), while the PARA translation correctly renders it as "it" (1b).
Although both Russian and Polish nouns possess grammatical gender, "paper" in (2) is feminine in Russian and referred to as "she," whereas it is a masculine noun in Polish and should be referred to as "he," as in (2b). The absence of context in SENT leads to an incorrect translation in (2a).
Cultural nuances: Assigning appropriate pronouns without context becomes even more challenging when translating from languages like Japanese, in which speakers frequently refer to the listener (or themselves) in the third person rather than using second-person pronouns such as "you" in English. From the context of the conversation in (3), a Japanese listener can easily infer that "Furukura-san" or "Miss Furukura" in the last source sentence is used instead of the second-person "you," as per Japanese convention (note that the gender of neither character is apparent from the fragment alone). Translating this sentence without context into English, a language in which third-person reference is not common, results in a confusing translation (3a) that implies that the speaker refers to some other "Furukura" rather than their listener. However, when translating the sentence in context, the model correctly changes "Furukura" into "you" (3b), which makes it clear whom the speaker refers to in English.
Ellipsis: Another example where context helps is the translation of elliptical constructions. Czech uses the same collocation as English, "do the dishes" (4), which is invalid in Polish. Hence, the ellipses in the last two sentences in (4) require broader context to be translated correctly. PARA does this properly, translating both as "wash" (4b), while SENT unsurprisingly fails to choose the correct collocation (4a).
Subject ellipsis: Similarly, context may be needed to attribute a state or an action to the correct character due to subject ellipsis. This is an obvious issue for languages like Japanese, which tend to omit the subject of the sentence and do not encode any relevant information in the verb form, but it can also arise in English. Consider the following example:

(5) When we were done, the lipstick went back into some mother's Fendi handbag. We watched her apply it, unaware.

From the second sentence alone it is not clear who is "unaware" (5): the mother or the "we" (referring to children) watching her. Only from the broader context can we confidently deduce that it is in fact the mother, not the children, who is "unaware." PARA (5b) correctly attributes the state of being "unaware" to the mother, which is exhibited by its usage of the singular feminine form of the adjective. In contrast, SENT (5a) mistranslates it using the plural masculine form of the adjective "unaware," which implies that it refers to "we" rather than the "mother."

Consistency: Context is sometimes critical for preserving the overall consistency of the text. The simplest cases include referring to the same entity, a place or a person, in the same way. More interesting cases pertain to style and can enhance the reader's experience. The German source in (6) translates into English as "To forget everything is bad, certainly. Worse still is to forget nothing" (excerpt taken from the official English translation by Jackie Smith, 2020). It is arguably important for the translation to repeat the same word as an equivalent of the German "schlimm" ("bad"). PARA does this well, translating both instances as "warui," or "bad" (6b), in the exact same way as the human Japanese translator. SENT, on the other hand, uses two different words, "tragic" and "bad" (6a), which, while technically correct, omits the intentional repetition that is meant to introduce an unexpected conclusion.
Polysemy: The absence of context makes it difficult to interpret words or expressions that have multiple meanings in the source language. The ambiguity in (7) stems from the multiple meanings of the Russian noun "forma," which can mean either "shape" or "uniform." Since one can be "in shape" as well as "in a uniform," it is unclear from the sentence alone which meaning the author intended. From the preceding context, it is clear that "everything went well" for the narrator, who mastered the art of "book'n'grill," a unique form of expression exclusive to this fictional world. Based on this, we can infer that in this instance the term "forma" signifies "shape," as in (7b), rather than "uniform," as in (7a).
Appropriateness: Finally, context may help to choose the more appropriate equivalent for the given situation. The conversation in (8) is between a clerk and a customer. The Japanese expression かしこまりました (8) is an honorific that literally means "understood." However, when choosing the best equivalent, the translator needs to consider the situation at hand to best reflect its meaning in the target language. "Understood" in SENT (8a) is technically correct, but it is an unfortunate word choice for the clerk to employ. On the other hand, "right away" in PARA (8b) fits much better in the context of this conversation. Had this been a series of commands (e.g., in a military context), "understood" would be the more favorable option.

Limitations
So far, we have shown that GPT-3.5 leverages paragraph-level context to produce translations that are better than those produced by its sentence-level counterpart (SENT vs. PARA). However, there are still many issues with PARA's translations. From the annotations and translators' comments, we observe that PARA suffers from occasional omissions of content from the source paragraph. SENT and GTR are certainly not free of this problem either, but omission appears to be more prominent in PARA translations (see Appendix C).
Moreover, PARA still makes a sizeable number of mistranslations and grammatical errors, though fewer than SENT or GTR. We observe that PARA occasionally merges sentences with two distinct subjects, attributing all states and/or actions to one of them. Very rarely, we also find cases where context possibly confuses the model, resulting in an incorrect translation. Example (9) illustrates this issue: in the French text, the narrator wonders whether the brand of the desk was Ruhlman or Leleu, with both proper nouns possibly referring to a person. In the last sentence, the French text uses "il," or "he" (9), as desk is a masculine noun in French ("le bureau"). PARA, however, appears to be confused by the two preceding names and incorrectly translates the singular pronoun as 彼ら, or "they." Finally, we observe (very few) cases where the paragraph-level translation disregards the context. Most representative of this class of errors is when the model struggles to translate from Japanese in cases where the subject is omitted. Example (10) illustrates this issue:
(10b) GPT-3.5 PARA (English): Miho is now married and has bought an old house in her hometown, where her friends often gather. Though she often finds it a chore to work tomorrow, it is her only connection to the world outside the convenience store, and a valuable opportunity to interact with other "normal thirty-something women" her age, so she tries to accept Miho's invitations as often as possible.
Both the Polish (10a) and English (10b) PARA translations of the same source text (10) share a common issue. The narrator begins the paragraph by talking about Miho and then proceeds to describe her own (the narrator's) feelings about the situation, although the gender of the narrator is never revealed in the Japanese text. The second sentence should be written from a first-person perspective, particularly since it directly references Miho towards the end. However, both the Polish and English translations produced by PARA are confused by this: by using the third-person perspective ("she," "her"), both translations incorrectly imply that Miho is the subject of the second sentence. SENT and GTR translate this passage accurately, albeit with some clumsy phrasing.
GPT-4 does not magically solve all of these issues! Our preliminary experiments indicate that GPT-4 (OpenAI, 2023) sometimes generates better paragraph-level translations than GPT-3.5. For instance, it seems to have a better grasp of the inverted word order in German, though no broader conclusions should be made without further testing. Nevertheless, it does not resolve all of the issues discussed in our paper. Mistranslations and grammatical errors are still abundant across many language pairs. GPT-4 produces the following translation when fed the previous example paragraph (10) as input; note that all of the issues remain. (Although the given paragraph is already comprehensible for a human reader, we also attempted to enhance the translation by incorporating three additional preceding paragraphs as context. Intriguingly, when provided with this extended context, both GPT-3.5 and GPT-4 generated accurate translations.)
(11) Miho is now married and has bought a used single-family home in her hometown where her friends often gather. Although she sometimes finds it a drag to work a part-time job the next day, she makes an effort to respond to Miho's invitations because it's a valuable opportunity to interact with "normal" women in their thirties like herself, apart from her convenience store job.
PARA translations hold the potential to captivate readers, especially if LLMs continue to improve at their current pace. Indeed, some of our translators mentioned that they genuinely enjoyed the task, though integrating these paragraphs into a coherent novel still poses a considerable challenge. With all that said, literary translation involves more than just overall "correctness" or mere entertainment value. A translation that is perfectly "correct" and enjoyable might still fail to convey the author's intentions or the meaning skillfully hidden behind a simple phrase. Our fr-en translator shares her thoughts on this matter: "Both translations [SENT and PARA] translate the words without the feeling; the original author's voice is lost."
-FRENCH TO ENGLISH TRANSLATOR

Conclusion
In this paper, we demonstrate that LLMs leverage paragraph-level context to produce translations that are more coherent and enjoyable than sentence-by-sentence translation while containing fewer mistranslations and grammatical issues. Our evaluations reveal that professional translators prefer paragraph-level translations to both sentence-level translations produced by the same language model and those generated by an off-the-shelf commercial system (GTR). We release our dataset and error annotations to help facilitate the development of new evaluation methodologies and automatic metrics for document-level machine translation. Finally, a full-length novel extends far beyond the confines of paragraph-level translation. In future work, we will focus on integrating individual paragraphs into cohesive chapters, which can then be expanded to encompass the entire novel.

Ethical considerations
The rise of large language models has also brought many ethical concerns to the forefront of NLP research (Blodgett et al., 2020; Bender et al., 2021). LLMs encode biases and exhibit toxicity, and these behaviors can be exacerbated by unconstrained prompting (Gehman et al., 2020; Costa-jussà et al., 2022). Further ethical concerns arise in the context of machine translation, particularly literary translation, where multiple stakeholders (the author, the translator, and the audience) are involved (Taivalkoski-Shilov, 2019a). Low-quality output can influence the perception of the author's work, impair the reader's linguistic abilities, and hinder the transfer of ideas to the target language, while over-reliance on machine translation can threaten the role of human translators (Drugan, 2013; Ning and Domínguez, 2016; Taivalkoski-Shilov, 2019a). On the other hand, machine translation employed responsibly as an auxiliary tool holds the potential to alleviate the translator's cognitive burden (O'Brien, 2012) and make the author's work accessible to a broader audience more swiftly (Besacier, 2014). Contrary to the claims made by Eloundou et al. (2023), we do not view large language models as a substitute for human translators, but rather as a means to assist translators in their work.

B Prompt Examples
Here we present examples of the prompts employed for translation with GPT-3.5. The prompt wording for SENT, PARA_SENT, and PARA, with one demonstration each, is presented in Figure 8.
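For readers without access to Figure 8, the snippet below gives an illustrative (not verbatim) PARA_SENT-style prompt skeleton with a single demonstration; the <translate> tags mark the sentence to be translated, as described in Section 3.2.

```python
# Illustrative PARA_SENT prompt skeleton (one demonstration). The
# wording is a sketch; the exact prompts we used are in Figure 8.
PARA_SENT_PROMPT = """\
Original text in {src_lang}:
{demo_prefix} <translate>{demo_sentence}</translate> {demo_suffix}

Translation into {tgt_lang} (translate only the marked sentence):
{demo_partial_translation}{demo_sentence_translation}

Original text in {src_lang}:
{prefix} <translate>{sentence}</translate> {suffix}

Translation into {tgt_lang} (translate only the marked sentence):
{partial_translation}"""
```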

C Human Evaluation
Omissions: One thing we ought to discuss is the omission issue. Upon examining translations and annotator feedback, we observe that PARA occasionally omits details that are crucial to the storyline. Preliminary investigation indicates that PARA translations are more prone to omissions compared to SENT and GTR. Although PARA_SENT appears to mitigate this problem to some extent, it still results in a higher number of omissions than the sentence-level approach, while at the same time introducing some repetition issues.

D Pivot Translation

A preliminary study of pivot translation was done by the authors on all 20 passages for every language pair that did not include translation from or into English. During the PARA_PIVOT translation process, the model utilized both the source text and its corresponding English translation (text-davinci-003, top-p=1.0, temp=0.3). This approach has the potential to mitigate the limitations associated with pivoting translations, where some information may be lost. For example, when translating from Czech into Polish, both languages encode gender information in past-tense verbs; English does not, so this information is lost and will most likely result in an erroneous translation. Indeed, we notice that adding the source text helps the model to overcome this shortcoming; however, we do not observe a clear gain from using English as a pivot language. Consider the following example:
In each instance, the emphasized verbs could potentially be mistranslated when translated through English as the pivot language, as the speaker's gender information would be lost. For instance, the term "washed" remains unchanged regardless of the speaker's gender, with such details only encoded in the source (Czech) and target (Polish) languages. In this case, all verbs were translated accurately with respect to grammatical gender, implying that incorporating the source language into the pivot pipeline does indeed improve the translation. However, PARA_PIVOT still selects less suitable verbs for the specific instances, resulting in slightly more errors in this particular paragraph.
The only pair where pivoting seems to help is pl-ja. While it is unclear why this happens, it is possible that this outcome is due to the specifics of the Polish novel employed for the translation. Sword of Destiny by Andrzej Sapkowski uses a very specific language with many archaic expressions. It is possible that translating into English, a language the GPT models were trained on, helps the model deal with these difficult phrases.
Since we do not observe any apparent gain from performing the translation via English as a pivot language (p=0.62, 95% CI [0.448, 0.591]), and doing so reduces the number of examples one can fit into the prompt, we continue our experiments with direct translation.