Challenges in Context-Aware Neural Machine Translation

Context-aware neural machine translation involves leveraging information beyond sentence-level context to resolve inter-sentential discourse dependencies and improve document-level translation quality, and has given rise to a number of recent techniques. However, despite well-reasoned intuitions, most context-aware translation models show only modest improvements over sentence-level systems. In this work, we investigate several challenges that impede progress within this field, relating to discourse phenomena, context usage, model architectures, and document-level evaluation. To address these problems, we propose a more realistic setting for document-level translation, called paragraph-to-paragraph (para2para) translation, and collect a new dataset of Chinese-English novels to promote future research.


Introduction
Neural machine translation (NMT) has garnered considerable scientific interest and commercial success in recent years, with current state-of-the-art systems approaching or exceeding human quality for a few resource-rich languages when translating individual sentences (Wu et al., 2016; Hassan et al., 2018; Yang et al., 2020). Despite the strong empirical performance of such systems, the independence assumption that underlies sentence-level NMT raises several issues. Certain textual elements, such as coreference (Guillou and Hardmeier, 2016), lexical cohesion (Carpuat, 2009), or lexical disambiguation (Rios Gonzales et al., 2017), are impossible to translate correctly without access to linguistic cues that exist beyond the present sentence (Sim Smith, 2017). When evaluating documents rather than individual sentences, the adequacy and fluency of professional human translation continues to surpass that of MT systems (Läubli et al., 2018), underscoring the need to incorporate long-range context.
Despite some efforts to meaningfully exploit inter-sentential information, many context-aware (or, interchangeably, document-level) NMT systems show only meager gains on both sentence-level and document-level translation metrics (Tiedemann and Scherrer, 2017; Miculicich et al., 2018; Müller et al., 2018; Tu et al., 2018; Maruf et al., 2019; Lupo et al., 2022a,b; Wu et al., 2022). Performance improvements over sentence-level baselines on overall translation accuracy, pronoun resolution, or lexical cohesion become less pronounced when context-aware systems are trained in realistic, high-resource settings (Lopes et al., 2020), casting doubt on the efficacy of such approaches.
In this paper, we conduct a thorough empirical analysis and present key obstacles that hinder progress in this domain:
1. Existing document-level corpora contain only a sparse set of discourse phenomena that require inter-sentential context to be accurately translated.
2. Though context is necessary for pronoun resolution and named entity consistency, it is less helpful for tense and discourse markers.
3. The sentence-level Transformer baseline already performs on par with concatenation-based NMT settings.
4. Advanced model architectures do not meaningfully improve document-level translation on existing document-level datasets.
5. Current metrics designed for document-level translation evaluation do not adequately measure document-level translation quality.
The above findings suggest that paragraph-to-paragraph (PARA2PARA) translation, wherein a document is translated at the granularity of paragraphs, may serve as a more suitable and realistic setting for document-level translation, which in practice is unencumbered by sentence-level alignments. To this end, we develop and release a new paragraph-aligned Chinese-English dataset, consisting of 10,545 parallel paragraphs harvested from 6 novels in the public domain, in order to spur future research.

Background
The high-level objective of sentence-level machine translation is to model the sentence-level conditional probability P(y|x), in which the source and target sentences x = (x_1, ..., x_M), y = (y_1, ..., y_N) are textual sequences of respective lengths M and N. Under the dominant paradigm of neural machine translation (Sutskever et al., 2014), the conditional probability P_θ(y|x) is typically decomposed into the following auto-regressive formulation (with θ denoting parameterized weights):

P_θ(y|x) = ∏_{n=1}^{N} P_θ(y_n | x, y_{<n})    (1)

Equation 1 implies that when predicting the target token y_n, the model can only access the current source sentence x, as well as all previously translated tokens y_{<n} in the current target sentence.
Translating sentences in a document in such an isolated fashion, without any extra-sentential information that lies beyond sentence boundaries, has been found to produce syntactically valid, but semantically inconsistent text (Läubli et al., 2018).
To remedy this, context-aware neural machine translation considers a document D that comprises a set of logically cohesive source sentences X = {x^1, x^2, ..., x^d} and a parallel set of target sentences Y = {y^1, y^2, ..., y^d}. Under a left-to-right translation schema, the model computes the probability of translating the source sentence x^i conditioned on the context C_i, wherein 1 ≤ i ≤ d:

P_θ(y^i | x^i, C_i) = ∏_{n=1}^{N} P_θ(y^i_n | x^i, C_i, y^i_{<n})    (2)

In practice, there are multiple ways to formulate C_i. Setting C_i = ∅ reduces to the sentence-level case (Equation 1). Throughout this paper, we explore two concatenation-based setups first presented by Tiedemann and Scherrer (2017). The one-to-two (1-2) setup prepends the preceding target sentence to the current target sentence (C_i = {y^{i-1}}), denoting sentence boundaries with a <SEP> token.
The two-to-two (2-2) setup incorporates additional context from the previous source sentence (C_i = {x^{i-1}, y^{i-1}}). The target context is integrated in the same manner as in one-to-two.
In order to investigate the importance of context after the current sentence, we also explore a three-to-one (3-1) setting, wherein we introduce additional source-side context by concatenating the previous and subsequent source sentences to the current one (C_i = {x^{i-1}, x^{i+1}}), and do not incorporate any target context.
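To make the setups concrete, the construction of training pairs under each concatenation schema can be sketched as follows. This is a hypothetical preprocessing sketch; the function and variable names are ours, not from the original systems:

```python
# Illustrative sketch: building concatenation-based training inputs for the
# 1-2, 2-2, and 3-1 settings of Tiedemann and Scherrer (2017).
# <SEP> marks sentence boundaries within a concatenated sequence.
SEP = " <SEP> "

def make_example(src_sents, tgt_sents, i, setting):
    """Return the (source input, target output) pair for sentence i."""
    if setting == "1-1":  # sentence-level baseline: C_i is empty
        return src_sents[i], tgt_sents[i]
    if setting == "1-2":  # prepend the previous target sentence
        tgt_ctx = tgt_sents[i - 1] if i > 0 else ""
        return src_sents[i], (tgt_ctx + SEP if tgt_ctx else "") + tgt_sents[i]
    if setting == "2-2":  # additionally prepend the previous source sentence
        src_ctx = src_sents[i - 1] if i > 0 else ""
        tgt_ctx = tgt_sents[i - 1] if i > 0 else ""
        return ((src_ctx + SEP if src_ctx else "") + src_sents[i],
                (tgt_ctx + SEP if tgt_ctx else "") + tgt_sents[i])
    if setting == "3-1":  # previous and next source sentences, no target context
        prev = src_sents[i - 1] + SEP if i > 0 else ""
        nxt = SEP + src_sents[i + 1] if i + 1 < len(src_sents) else ""
        return prev + src_sents[i] + nxt, tgt_sents[i]
    raise ValueError(setting)
```

At inference time, only the portion of the output after the final <SEP> token is kept as the translation of the current sentence.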

Model Architectures
Recent progress in context-aware NMT generally falls along two lines: multi-encoder approaches and concatenation-based ones (Kim et al., 2019).
Under the first taxonomy, additional sentences are encoded separately, such that the model learns an internal representation of context sentences independently from the current sentence. The integration of the context and current sentences can occur either prior to being fed into the decoder (Maruf and Haffari, 2018; Voita et al., 2018; Miculicich et al., 2018; Zhang et al., 2018; Maruf et al., 2019), or within the decoder itself (Bawden et al., 2018; Cao and Xiong, 2018; Kuang and Xiong, 2018; Stojanovski and Fraser, 2018; Tu et al., 2018; Zhang et al., 2018). The effectiveness of these multi-encoder paradigms is subject to debate; in a standardized analysis, Li et al. (2020) find that rather than effectively harnessing inter-sentential information, the context encoder functions more as a noise generator that provides richer self-training signals, since even the inclusion of random contextual input can yield substantial translation improvement. In addition, Sun et al. (2022) find that BLEU-score improvements from context-aware approaches often diminish with larger training datasets or thorough baseline tuning.
On the other hand, concatenation-based NMT approaches are conceptually simpler and have been found to perform on par with or better than multi-encoder systems (Lopes et al., 2020; Ma et al., 2021). Under this paradigm, context sentences are appended to the current sentence, with special tokens to mark sentence boundaries, and the concatenated sequence is passed as input through the encoder-decoder architecture (Ma et al., 2020).

Datasets
Until recently, the bulk of context-aware NMT research has focused on document-level, sentence-aligned parallel datasets. The most commonly used corpora, including IWSLT-17 (Cettolo et al., 2012), NewsCom (Tiedemann, 2012), Europarl (Koehn, 2005), and OpenSubtitles (Lison et al., 2018), are sourced from news articles or parliamentary proceedings. Such datasets often contain a volume of sentences sufficient for training sentence-level NMT systems, yet the number of documents remains comparatively limited. In an attempt to address the scarcity of document-level training data, recent works have developed datasets that are specifically tailored for context-aware NMT. Jiang et al. (2023) curated Bilingual Web Books (BWB), a document-level parallel corpus consisting of 9.6 million sentences and 196 thousand documents (chapters) sourced from English translations of Chinese web novels. Thai et al. (2022) introduced PAR3, a multilingual dataset of non-English novels from the public domain, which is aligned at the paragraph level based on both human and automatic translations. Using automatic sentence alignments, Al Ghussin et al. (2023) extracted parallel paragraphs from Paracrawl (Bañón et al., 2020), which consists of crawled webpages.

Evaluation
In addition to metrics that evaluate sentence-level translation quality, e.g., BLEU (Papineni et al., 2002) and COMET (Rei et al., 2020), a number of automatic metrics designed specifically for document-level MT have recently been proposed. Jiang et al. (2022) introduced BlonDe, a document-level automatic metric that calculates the similarity-based F1 measure of discourse-related spans across four categories. Vernikos et al. (2022) show that pre-trained metrics, such as COMET, can be extended to incorporate context for document-level evaluation. To measure the influence of context usage in context-aware NMT models, Fernandes et al. (2021) proposed Context-aware Cross Mutual Information (CXMI), a language-agnostic indicator that draws from cross-mutual information.
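At its core, CXMI compares the likelihood a model assigns to reference translations with and without access to context. The following is a simplified sketch, assuming the per-example log-probabilities have already been computed by a context-agnostic and a context-aware model; it is not the authors' implementation:

```python
def cxmi(logprobs_without_ctx, logprobs_with_ctx):
    """Estimate CXMI as the average gain in target log-likelihood from context.

    logprobs_without_ctx: per-example log p(y | x) from a context-agnostic model.
    logprobs_with_ctx:    per-example log p(y | x, C) from a context-aware model.
    Higher CXMI indicates the model benefits more from the provided context.
    """
    assert len(logprobs_without_ctx) == len(logprobs_with_ctx)
    n = len(logprobs_with_ctx)
    return sum(lw - lo for lo, lw in
               zip(logprobs_without_ctx, logprobs_with_ctx)) / n
```

A CXMI near zero suggests the model is ignoring the context entirely; the per-example differences can also be inspected to locate context-sensitive sentences.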
Another approach to document-level MT evaluation uses hand-crafted contrastive evaluation sets to gauge a model's capacity for capturing inter-sentential discourse phenomena, including ContraPro (Müller et al., 2018) for English-to-German, Bawden (Bawden et al., 2018) for English-to-French, and Voita (Voita et al., 2019) for English-to-Russian translation. Though targeted, these test sets tend to be small, and are restricted to a particular language pair and discourse phenomenon.

Challenges
We identify key obstacles that account for the lack of progress in this field, based on a careful empirical analysis over a range of language pairs, model architectures, concatenation schemas, and document-level phenomena.

Contextual sparsity is a bottleneck to document-level neural machine translation.

This bottleneck manifests in two forms (Lupo et al., 2022a). First, the majority of words within a sentence can be accurately translated without access to inter-sentential information; context thus provides only a weak training signal, and its presence has not been found to substantially boost translation performance. Second, only a few words in neighboring sentences may actually contribute to the disambiguation of current tokens at translation time.
We investigate contextual sparsity via a fine-grained analysis on the BWB (Jiang et al., 2022) test set, which has been manually tagged with specific discourse-level phenomena. Specifically, we use it to probe NMT models' ability to exploit long-range context by analyzing the frequency of particular discourse phenomena that can only be resolved with context.
For the manual analysis, we randomly sample 200 discourse-annotated instances from the test set and ask bilingual annotators who are fluent in Chinese and English to identify and count instances that contain a particular context-dependent discourse phenomenon. Annotators are asked to discern whether the following document-level discourse phenomena exist in each sentence pair:
• Pronoun Ellipsis: The pronoun is dropped in Chinese, but must be included in the English translation.
• Lexical Cohesion: The same named entity must be translated consistently across the current sentence and context sentences.
• Tense: Tense information that can be omitted in Chinese, but must be inferred from context to be correctly translated in English.
• Ambiguity: Instances in which an ambiguous word or phrase in the current sentence requires context to be correctly translated.
• Discourse Marker: A discourse marker, e.g., while, as long as, else, that is not explicit in Chinese, but must be pragmatically inferred and present in English.

Table 1 indicates that lexical cohesion (83.2%) and pronoun ellipsis (53.8%) constitute the majority of discourse phenomena found in the 119 sentences that require inter-sentential signals for correct translation. In contrast, the other categories (tense, 4.2%; ambiguity, 9.2%; discourse marker, 16.8%) occur much less frequently.
We next examine how far the useful context tends to be from the cross-lingually ambiguous sentence.Taking d as the sentence distance, the majority of discourse phenomena can be disambiguated based on the nearest context sentence (d=1).Specifically, the necessary information for tense, ambiguity, and discourse markers can almost always be found by d=1, whereas relevant context for pronoun ellipses and lexical cohesion tends to be more spread out.Hardly any useful information can be found in very distant context (d>3).
A significant fraction (40.5%) of sentences in the sampled test set can be translated independently, i.e., without access to inter-sentential information. Correspondingly, we notice that many sentences across document-level data are not laden with discourse-level phenomena, but are rather simple constructions. Figure 1 indicates that the majority of sentences in BWB and IWSLT-17 are relatively short, ranging from 20-50 characters (Chinese) or 10-30 words (French and German).

Context does not help disambiguate certain discourse phenomena.
An implicit assumption in context-aware NMT is that, given the proper context, the model will leverage it to resolve any potential discourse ambiguities. To this end, we investigate different types of discourse phenomena on the BWB test set and show that this premise does not always hold; while pronoun resolution and named entity consistency are often better resolved with the incorporation of context, tense and discourse markers are relatively insensitive to context and yield meager improvement.

Pronoun Resolution
We examine two types of pronoun translation: pronoun ellipsis and anaphoric resolution.

Pronoun ellipsis. As Chinese is a pro-drop language, pronouns can be freely omitted and are implicitly inferred from the surrounding context. In contrast, a grammatical and comprehensible translation into English requires that the pronoun be made explicit. To test concatenation-based NMT systems' ability to resolve Chinese-English pronoun ellipsis, we conduct inference on a subset of BWB that contains 519 instances of pronoun ellipsis. Table 2 indicates that the disambiguation of pronoun ellipsis is particularly responsive to context. Incorporating a single target-side context sentence (the 1-2 setting) improves the BlonDe F1-score from 55.88 to 63.91; adding another source-side context sentence (the 2-2 setting) marginally improves this to 65.91. In this scenario, more source-side context may carry useful information, as the 3-1 setting performs the best overall on BlonDe (66.06).

Anaphoric resolution. Resolving an anaphoric pronoun in a grammatically gendered target language requires disambiguating its referent. For example, when translating into German, the English pronoun it can become either es, sie, or er, depending on the grammatical gender of its referent. Thus, we also conducted experiments from English to German (En→De) and French (En→Fr), both grammatically gendered languages, and evaluated on the contrastive sets ContraPro (Müller et al., 2018) and Bawden (Bawden et al., 2018), respectively. While Table 2 shows steady improvement for anaphoric resolution on ContraPro, the 1-2 concatenation-based model curiously exhibits a slight dip compared to its sentence-level counterpart on Bawden. We hypothesize that the small size (200 examples) of the Bawden dataset causes the significant variance in the results.

Named Entities
Named entities (real-world objects denoted by proper names) are domain-specific and low-frequency, and thus tend to be absent from bilingual dictionaries (Modrzejewski et al., 2020). Their translations are often either inconsistent (e.g., different target translations for the same source phrase) or inaccurate (with regard to some target reference). In this section, we examine named entity consistency and accuracy on the annotated BWB test set.

Consistency. We extract 780 examples (705 person entities, 75 non-person entities) to construct a consistency test subset. Each instance includes a sentence with a named entity that is also mentioned in the preceding sentence. We then measure how often different context-aware translation models translate the entity consistently across the two consecutive sentences.
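The consistency measurement can be sketched as a simple match rate over consecutive sentences. This is a hypothetical illustration with our own names; exact surface-form matching is an assumption on our part rather than a description of the paper's exact procedure:

```python
def consistency_rate(examples):
    """Fraction of examples where an entity is translated identically in the
    context sentence and in the current sentence.

    examples: list of (entity_translation_in_context, entity_translation_in_current)
              string pairs, one per test instance.
    """
    consistent = sum(1 for ctx_ent, cur_ent in examples if ctx_ent == cur_ent)
    return consistent / len(examples)
```

In practice, the entity spans would first have to be located in the system output (e.g., via alignment or string search), which is itself a source of noise.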
According to Table 3, this task proves to be challenging (no system achieves above-random performance), but the presence of context facilitates consistency, as each context-aware setting performs better than the 1-1 baseline on person entities (32.34%). Adding target-side context (the 1-2 and 2-2 settings) appears strictly more helpful. By contrast, source-side context (the 3-1 setting) results in only marginal gains relative to the baseline.

Accuracy. To explore how often named entities are accurately translated, we next examine the 1,734 person entities in the BWB test set. Surprisingly, the sentence-level model is better than context-aware models at correctly translating named entities, with the best accuracy of 54.55% (Table 3). While context is important for ensuring named entity consistency, these findings suggest that adding context may introduce additional noise and does not necessarily lead to more accurate translations. We hypothesize that dependency on context may hurt downstream performance when the NMT model tries to be consistent with the context translation, resulting in a propagation of errors down the sequence.
In addition, when comparing all the results using the entity category in BlonDe across the three language pairs in Table 5 and Table 6, it becomes clear that additional context does not meaningfully increase the accuracy of named entity translation.

Discourse Marker and Tense
Discourse markers. The omission of discourse markers (DMs), particles that signal the type of coherence relation between two segments, must often be pragmatically inferred in translation. As Table 4 shows, the sentence-level (1-1) baseline performs the best across discourse markers in aggregate, and across the cause and condition categories. The incorporation of context does not significantly improve the accuracy of discourse marker translation; interestingly, the 3-1 setting fares poorly, with the lowest performance across all categories except contrast DMs.
Tense. Tense consistency is another extra-sentential phenomenon that requires context for disambiguation, particularly when translating from an analytic source language (e.g., Chinese) to a synthetic target language (e.g., English), wherein tense must be made explicit. (In analytic languages, concepts are conveyed through root/stem words with few affixes; synthetic languages use numerous affixes to combine multiple concepts into single words, incurring a higher morpheme-to-word ratio (O'Grady et al., 1997).) From experimental results on the BWB (Table 5) and IWSLT (Table 6) data, there is minimal variance across all translation settings in the BlonDe scores for tense and DM, suggesting that context is not particularly conducive for any language pair. (We train on the IWSLT dataset in the reverse direction, Fr→En and De→En, in order to evaluate with BlonDe.) Tense is generally consistently resolvable, with all models surpassing 70 on Zh→En. As expected, translating from French, a more synthetic language, yields marginally higher BlonDe scores, at over 75. One reason that the BlonDe score for tense may be relatively inflexible across language pairs is that most sentences from the corpora generally adhere to a particular tense, such as the past tense in literature, thus diminishing the necessity of context.

Is source or target context more helpful?
Fernandes et al. (2021) find that concatenation-based context-aware NMT models lean on target context more than source context, and that incorporating more context sentences on either side often leads to diminishing returns in performance.
However, according to Tables 2-6, this is not universally the case; the effectiveness of target-side versus source-side context largely depends on the language pair. Though target-side context often helps with translation consistency, such as preserving grammatical formality across sentences, it does not necessarily guarantee better translation quality than source-side context (e.g., the 3-1 setting performs best on pronoun translation for French and German according to Table 6, and on pronoun ellipsis for Chinese in Table 2).

The context-agnostic baseline performs comparably to context-aware settings.
Owing to the sparsity of discourse phenomena in training data and the inability of sentence-level metrics to capture document-level attributes, context-aware models do not exhibit a meaningful improvement over context-agnostic models at the sentence level.
In terms of document-level improvement, the sentence-level baseline even outperforms context-aware models in select instances, such as when translating named entities (53.17% on Zh, 65.11% on De). There are no notable differences in handling tense and discourse markers across contextual settings, which aligns with our earlier observations on those phenomena. These results demonstrate that on commonly used datasets, context-aware models also do not significantly improve document-level translation over a sentence-level Transformer baseline.

Advanced model architectures do not meaningfully improve performance.
Motivated by the limitations of the self-attention mechanism in long-range dependency modeling (Tay et al., 2022), recent work has proposed more advanced architectures to better leverage contextual signals in translation (Lupo et al., 2022b; Sun et al., 2022; Wu et al., 2022, 2023). The hypothesis is that, because long-range sequence architectures can effectively model longer context windows, they are better equipped to handle the lengthier nature of document-level translation.
To test this theory, we replace the Transformer (XFMR) attention mechanism with the recently introduced MEGA architecture (Ma et al., 2023), which overcomes several limitations of the Transformer in long-range sequence modeling. As Table 6 shows, MEGA always performs better than XFMR across both the 1-1 and 3-1 settings on the sentence-level metrics, BLEU and COMET. At the document level, MEGA has the highest overall BlonDe F1-score when translating from both German (53.37 vs. 52.88) and French (49.22 vs. 48.23). While MEGA tends to outscore XFMR on the pronoun and entity categories, there is no significant improvement, if any, for tense and discourse marker. Furthermore, MEGA usually starts from a higher sentence-level baseline (except on pronoun resolution for Fr→En); when moving from the sentence-level to the contextual 3-1 setting, MEGA does not show higher relative gains than XFMR.
One potential explanation for MEGA's better performance on automatic metrics is that it is a stronger model and better at translation overall (Ma et al., 2023), rather than that it leverages context in a more useful manner. The lack of improvement in particular discourse categories does not necessarily indicate that existing context-aware models are incapable of handling long-range discourse phenomena. Rather, it suggests that current data may not sufficiently capture the complexities of such situations. As discussed, discourse phenomena are sparse; some of them cannot be resolved even with the necessary context. This finding aligns with similar work (Sun et al., 2022; Post and Junczys-Dowmunt, 2023), which also proposes that, on existing datasets and under current experimental settings that use sentence-level alignments, the standard Transformer model remains adequate for document-level translation.
There is a need for an appropriate document-level translation metric.
Though BLEU and COMET are both widely used for sentence-level machine translation, they primarily assess sentence-level translation quality and do not adequately encapsulate discourse-level considerations. Contrastive sets are a more discourse-oriented means of evaluating document-level translation quality, but they too have shortcomings. First, contrastive sets do not generalize beyond a particular discourse phenomenon and language pair, and their curation is both time- and labor-intensive. Furthermore, contrastive sets evaluate in a discriminative manner (by asking the model to rank and choose between correct and incorrect translation pairs), which is at odds with, and does not gauge, the MT model's generative capacity. Post and Junczys-Dowmunt (2023) (concurrent work) propose a generative version of contrastive evaluation, and find that this paradigm can make finer-grained distinctions between document-level NMT systems.
The recently proposed BlonDe score (Jiang et al., 2022), which calculates similarity measures over discourse-related spans in different categories, is a first step toward better automatic document-level evaluation. However, BlonDe requires the source language's data to be annotated with discourse-level phenomena, and its applicability is restricted to language pairs in which the target language is English.
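As a rough illustration of the idea behind BlonDe's per-category scores, a span-level F1 over discourse-related spans might look like the following. This is a toy sketch that substitutes exact matching for BlonDe's similarity-based matching, and the names are ours:

```python
from collections import Counter

def span_f1(sys_spans, ref_spans):
    """F1 over discourse-related spans extracted from system and reference.

    sys_spans / ref_spans: lists of span strings (e.g., pronouns or entity
    mentions) for one category. Multiset intersection handles repeated spans.
    """
    if not sys_spans or not ref_spans:
        return 0.0
    matched = sum((Counter(sys_spans) & Counter(ref_spans)).values())
    precision = matched / len(sys_spans)
    recall = matched / len(ref_spans)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

BlonDe additionally weights and combines the category scores; the sketch above covers only a single category with exact-match credit.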
Finally, incorporating pre-trained models into metrics is another promising direction. To this end, Vernikos et al. (2022) present a novel approach for extending pre-trained metrics such as COMET to incorporate context for document-level evaluation, and report better correlation with human preference than BlonDe. Nevertheless, the incorporation of pre-trained models raises the issue of metric interpretability, yielding opaque numbers without meaningful linguistic explanations. Thus, we note the need to develop more robust, automatic, and interpretable document-level translation metrics.

PARA2PARA Translation
A recurrent theme throughout our analyses is that existing datasets are not conducive to meaningful context usage in document-level translation.The majority of datasets used in the literature of document-level NMT are aligned at the sentence level, which is artificial in design and not reflective of how documents are translated in practice.
As such, paragraph-level parallel data (Figure 2) may be better suited for document-level NMT, providing richer contextual training signals. Recent work has turned toward literary translation as a challenging, realistic setting for document-level translation (Zhang and Liu, 2020; Thai et al., 2022; Karpinska and Iyyer, 2023), given that literary texts typically contain complex discourse structures that mandate a document-level frame of reference. As Figure 2 illustrates, sentence alignment is not always feasible when translating literature. Karpinska and Iyyer (2023) find that, when the paragraph is taken as the minimal discourse-level unit, language models can effectively exploit document-level context and produce fewer discourse-level translation errors according to human evaluation.
To promote future research on document-level translation in a realistic setting, we collect professional English and Chinese translations of classic novels, and format the data by manually correcting paragraph-level alignments. The PARA2PARA dataset consists of 10,545 parallel paragraphs across six novels from the public domain. To our knowledge, the only other paragraph-aligned parallel dataset sourced from the literary domain is PAR3 (Thai et al., 2022), which uses Google Translate and fine-tuned GPT-3 (Brown et al., 2020) to produce automatic translations; in contrast, the source and target paragraphs in our dataset are culled from professional translations. (Another distinction is that the Zh-En split in PAR3 is sourced from ancient novels written in Classical Chinese, which differs from the modern language, and consists of 1,320 paragraphs.)
We then benchmark the dataset under two experimental settings for Zh→En translation: i) a standard closed-domain setup, in which both the training and testing data are sourced from the same novels; ii) a more challenging open-domain setup, wherein two novels are held out and used only as the test set. We experiment with training a Transformer-based model on PARA2PARA data from scratch (NONE), as well as with pre-trained baselines, in which the model is first trained on the sentence-level WMT17 Zh-En dataset (Bojar et al., 2017) before further fine-tuning on the PARA2PARA data, using the following backbone architectures:
• XFMR Big (Vaswani et al., 2017), the Transformer-BIG.
• LIGHTCONV Big (Wu et al., 2019), which replaces the self-attention modules in the Transformer-BIG with fixed convolutions.

Table 7 shows preliminary baseline results on BLEU, BlonDe, and COMET (example translations are in Appendix B.2). In the NONE setting, the Transformer's relatively low performance and incoherent output underscore the difficulty of training from scratch on the PARA2PARA corpus, which stems from two factors: the inherent difficulty of training on paragraph-level, longer-sequence data, and the limited dataset size (especially relative to that of sentence-level MT datasets). To disentangle the two factors, we report additional baselines that leverage pre-training to offset the issue of limited data; all of them exhibit a marked performance improvement over the NONE setting, attesting to the challenging nature of paragraph-to-paragraph translation.
In the closed-domain setting, LIGHTCONV Big yields the highest score across all three metrics. Open-domain results are mixed: as expected, scores are lower across the board, as this setting is more challenging. XFMR Big achieves the best BLEU and discourse-marker F1-score on BlonDe, although all pre-trained baselines perform similarly. LIGHTCONV Big performs the best on the pronoun, entity, and tense categories of BlonDe and has the highest COMET score.

Conclusion
Despite machine-human parity at the sentence level, NMT still lags behind human translation on long collections of text, motivating the need for context-aware systems that leverage signals beyond the current sentence boundary. In this work, we highlight and discuss key obstacles that hinder momentum in context-aware NMT. We find that training signals that improve document-level discourse phenomena occur infrequently in surrounding context, and that most sentences can be accurately translated in isolation. Another challenge is that context benefits the resolution of some discourse phenomena more than others. A context-agnostic Transformer baseline is already competitive with context-aware settings, and replacing the Transformer's self-attention mechanism with a more complex long-range mechanism does not significantly improve translation performance. We also note the need for a generalizable document-level translation metric. Finally, we make the case for paragraph-aligned translation, and release a new PARA2PARA dataset, alongside baseline results, to encourage further efforts in this direction.

Limitations
Several limitations restrict the scope of this work. To begin, our choice of languages (English, Chinese, French, German) is non-exhaustive, and it is possible that our findings would not generalize to scenarios involving low-resource languages or distant language pairs. In particular, a significant portion of our investigation into discourse relations that require context for proper disambiguation targets the Chinese-English BWB test set, which is the only public dataset that has been manually annotated with this type of information. Some of the discourse phenomena that we consider may not occur as frequently in other languages. While this work is a preliminary step that sheds light on the current nature of the data that drives context-aware neural machine translation, future directions could entail extending similar analyses to other languages or discourse phenomena (e.g., the disambiguation of deixis when translating from Russian to English (Voita et al., 2019)).
Another restriction is that this work only examines concatenation-based architectures, which tend to be conceptually simple and effective, and have hence seen widespread adoption in recent years (Fernandes et al., 2021). While evidence on the purported advantages of multi-encoder NMT models is mixed (Li et al., 2020), for comprehensiveness, it would be insightful to examine whether they behave differently from concatenation-based systems under our experimental setup. Other potential avenues for exploration entail loss-based approaches to context-aware neural machine translation, such as context discounting (Lupo et al., 2022b) or contrastive learning-based schemes (Hwang et al., 2021).
Lastly, although the PARA2PARA dataset may present a more natural setting for context-aware translation, it is considerably smaller than other document-level datasets. Given that the small scale of training data is a prevalent issue in context-aware neural machine translation (Sun et al., 2022), future efforts could focus on expanding this dataset (as it is easier to source paragraph-aligned parallel translations in the wild than sentence-level ones) or on moving beyond the literary domain.

A.1 Training
We train all models with the fairseq framework (Ott et al., 2019). Following Vaswani et al. (2017) and Fernandes et al. (2021), we use the Adam optimizer with β1 = 0.9 and β2 = 0.98, dropout set to 0.3, and an inverse square root learning rate scheduler with an initial value of 10^-4 and 4,000 warm-up steps. We run inference on the validation set and save the checkpoint with the best BLEU score. We compute all BLEU scores using the sacreBLEU toolkit (Post, 2018). Wherever possible, we report the average and standard deviation across three randomly seeded runs.
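For reference, the inverse square root schedule can be sketched as follows. This is a minimal re-implementation, not fairseq's exact code; the `init_lr` warm-up floor is an assumed detail:

```python
import math

def inverse_sqrt_lr(step: int, peak_lr: float = 1e-4,
                    warmup_steps: int = 4000, init_lr: float = 1e-7) -> float:
    """Linear warm-up to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        # Linearly interpolate from init_lr up to peak_lr during warm-up.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # After warm-up, decay so that lr(warmup_steps) == peak_lr.
    return peak_lr * math.sqrt(warmup_steps / step)

print(inverse_sqrt_lr(4000))   # the peak value, 1e-4
print(inverse_sqrt_lr(16000))  # decayed to half the peak, 5e-5
```

The schedule rises linearly for the first 4,000 updates and then decays with the square root of the step count, so quadrupling the step count halves the learning rate.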

A.2 Models
Transformer The Transformer (Vaswani et al., 2017) is an encoder-decoder architecture that relies on a self-attention mechanism, in which every position of a sequence attends to every other in order to compute a representation of that sequence. An n-length output sequence of d-dimensional representations Y ∈ R^{n×d} is computed from an input sequence of d-dimensional representations X ∈ R^{n×d} as

Y = softmax(QK^T / √d_k) V, where Q = XW_Q, K = XW_K, V = XW_V,

with learned projections W_Q, W_K ∈ R^{d×d_k} and W_V ∈ R^{d×d}.

MEGA The recently introduced MEGA (Moving Average Equipped Gated Attention) architecture (Ma et al., 2023) addresses two limitations of the traditional Transformer that have long resulted in sub-optimal performance on long-sequence tasks: a weak inductive bias and quadratic computational complexity. The mechanism applies a multi-dimensional, damped exponential moving average (EMA) (Hunter, 1986) to single-head gated attention in order to incorporate inductive biases. MEGA serves as a drop-in replacement for the Transformer's attention mechanism; full details can be found in Ma et al. (2023). MEGA is of comparable size to the Transformer, with 6 encoder and 6 decoder layers, a model dimension of 512, and an FFN hidden dimension of 1024, alongside an additional shared representation dimension (128), value sequence dimension (1024), and EMA dimension (16).
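As a concrete reference, single-head scaled dot-product self-attention can be sketched in NumPy as follows; the weight matrices here are randomly initialized purely for illustration:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X: np.ndarray, W_q, W_k, W_v) -> np.ndarray:
    """Y = softmax(Q K^T / sqrt(d_k)) V, with Q = X W_q, K = X W_k, V = X W_v."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n, n): every position attends to every position
    return softmax(scores) @ V       # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d = 5, 16                         # toy sequence length and model dimension
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Y = self_attention(X, W_q, W_k, W_v)
print(Y.shape)  # (5, 16): one d-dimensional representation per position
```

Each row of the softmaxed score matrix is a probability distribution over all positions, which is what lets every token condition on the full sequence.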
In total, the Transformer architecture is around 65M parameters; the MEGA architecture is around 67M parameters.
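The damped EMA at the core of MEGA follows the recurrence h_t = α·x_t + (1 − α·δ)·h_{t−1}. A one-dimensional sketch is below; in MEGA proper, α and δ are learned and the recurrence runs over an expanded multi-dimensional representation:

```python
def damped_ema(xs, alpha: float, delta: float):
    """h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1}, with h_0 = 0.

    alpha controls how much of the new input enters the average;
    delta damps the contribution of the previous hidden state.
    """
    h, out = 0.0, []
    for x in xs:
        h = alpha * x + (1.0 - alpha * delta) * h
        out.append(h)
    return out

# With delta = 1 this reduces to a standard exponential moving average.
print(damped_ema([1.0, 1.0, 1.0], alpha=0.5, delta=1.0))  # [0.5, 0.75, 0.875]
```

Because h_t depends on all earlier inputs with exponentially decaying weights, the EMA injects a recency-biased, position-aware inductive bias that plain dot-product attention lacks.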

A.3 Data
For the En↔Fr and En↔De language pairs, we train on the IWSLT17 (Cettolo et al., 2012) datasets, which contain document-level transcriptions and translations culled from TED talks. The test sets from 2011-2014 are used for validation, and the 2015 test set is held out for inference. For Zh→En, we use the BWB (Jiang et al., 2023) dataset, which consists of Chinese web novels.
Data for each language pair is encoded and vectorized with byte-pair encoding (Sennrich et al., 2016) using the SentencePiece (Kudo and Richardson, 2018) framework. We use a 32K joint vocabulary for Zh→En and a 20K vocabulary for the other language pairs. Full corpus statistics are in Table 8.

Following standard practice, the model is evaluated in a discriminative manner: rather than generating translated sequences, the model is provided with the previous sentence as context, and is asked to choose the current sentence with the correct pronoun over the incorrect alternatives.
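The discriminative protocol can be sketched as follows, where `score` stands in for the model's log-probability of a candidate sentence given its context; the `toy_score` function below is a hypothetical stand-in, not the actual NMT model:

```python
from typing import Callable

def contrastive_accuracy(examples, score: Callable[[str, str], float]) -> float:
    """Each example is (context, correct_candidate, [incorrect_candidates]).

    The model 'chooses' correctly when the correct candidate receives a
    strictly higher score than every incorrect one.
    """
    hits = 0
    for context, correct, incorrect in examples:
        best_wrong = max(score(context, c) for c in incorrect)
        hits += score(context, correct) > best_wrong
    return hits / len(examples)

# Toy stand-in scorer: rewards pronoun agreement with the context.
def toy_score(context: str, candidate: str) -> float:
    return 1.0 if ("she" in candidate) == ("Maria" in context) else 0.0

examples = [
    ("Maria left early.", "Then she went home.", ["Then he went home."]),
    ("Tom left early.",   "Then he went home.",  ["Then she went home."]),
]
print(contrastive_accuracy(examples, toy_score))  # 1.0
```

No generation is involved: the metric only asks whether the model assigns more probability mass to the pronoun-correct sentence than to its minimally contrasting variants.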

B.1 Data and Preprocessing
We gather the Chinese and English versions of six novels in the public domain, which are freely available online (Table 11). Prior to the tokenization step, we normalize punctuation and segment Chinese sentences using the open-source Jieba package. English sentences are tokenized using the Moses toolkit (Koehn et al., 2007). We employ byte-pair encoding (Sennrich et al., 2016) for subword tokenization.
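A minimal sketch of the punctuation-normalization step is below, mapping full-width Chinese punctuation to ASCII equivalents; the mapping table shown is illustrative, not the exact one used in our pipeline:

```python
# Illustrative full-width -> ASCII punctuation mapping (an assumption,
# not the exact table used in the preprocessing pipeline).
FULLWIDTH_PUNCT = str.maketrans({
    "，": ",", "。": ".", "！": "!", "？": "?",
    "：": ":", "；": ";", "“": '"', "”": '"',
    "（": "(", "）": ")", "、": ",",
})

def normalize_punctuation(text: str) -> str:
    """Replace full-width punctuation so both sides share one symbol set."""
    return text.translate(FULLWIDTH_PUNCT)

print(normalize_punctuation("你好，世界！"))  # 你好,世界!
```

Normalizing punctuation before segmentation keeps the Chinese and English sides of the corpus consistent, which reduces spurious subword types during BPE training.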
In the open-domain setting, A Tale of Two Cities and Twenty Thousand Leagues Under the Seas are withheld as the test set.

B.2 Translation Examples
Translation examples on the PARA2PARA dataset are in Figure 3.

B.3 LLM Evaluations
Large Language Models (LLMs), e.g., ChatGPT (OpenAI, 2022), have recently accrued a great deal of mainstream and scientific interest, as they maintain considerable fluency, consistency, and coherence across multiple NLP tasks, including document-level NMT (Wang et al., 2023) (concurrent work). To investigate how LLMs fare on the PARA2PARA dataset, we also obtain translations using GPT-3.5 (gpt-3.5-turbo), a commercial, black-box LLM. Table 12 shows GPT-3.5's performance alongside that of the three pre-trained baselines, for reference. This experiment is similar to that of Karpinska and Iyyer (2023), who test GPT-3.5 on paragraphs from recently published literary translations and show that while LLMs can provide better paragraph-level translation (as they are better equipped to handle long context), there remain critical translation errors that a human translator would avoid. Given that OpenAI has not disclosed the composition of ChatGPT's training data, there may be data leakage from pre-training (especially as our dataset is sourced from public-domain data). Thus, we do not believe these results represent a fair comparison with the pre-trained baselines; we report them for the sake of comprehensiveness.

B.4 Pre-trained baseline performance
To investigate how fine-tuning on the PARA2PARA dataset affects the baselines' performance, we evaluate the pre-trained baselines on the same test set without any training on the PARA2PARA corpus.
As Table 13 illustrates, all three baselines exhibit significantly worse performance across the board (✗), and improve after fine-tuning (✓).
[Figure 3 content: an English source paragraph from Twenty Thousand Leagues Under the Seas ("These thoughts were clearly readable on my face; but Captain Nemo remained content with inviting me to follow him, and I did so like a man resigned to the worst. We arrived at the dining room, where we found breakfast served. 'Professor Aronnax,' the captain told me, 'I beg you to share my breakfast without formality. We can chat while we eat...'"), its reference translation, and outputs from the Open Domain Base, Open Domain Base+LightConv-WMT17, and Closed Domain Base systems.]

4.1 Discourse phenomena are sparse in surrounding context.

Frequency of context-dependent discourse phenomena in a sample of 200 instances from the BWB test set, and the percentage of cases in which the relevant context is found at a distance of d = 1, 2, 3, or > 3 sentences.

Figure 1: Sentence length distributions on test sets.
[Figure 2 content: an English source paragraph from Rebecca (Daphne du Maurier) ("...go back again, for the past is still too close. The things we have tried to forget would stir again, and that sense of fear, building up the blind unreasoning panic, now mercifully stilled, might once again become a living companion...") and the corresponding human-translated paragraph in Chinese.]

Figure 2: An example of paragraph-to-paragraph translation. Aligned sentences are underlined in the same color. Highlighted parts are added by translators and do not have a corresponding source segment.

Figure 3: An example of PARA2PARA translation across open-domain and closed-domain settings.

When translating into languages that contain grammatical gender, anaphoric pronouns form another instance of cross-lingual ambiguity.

Table 2: BLONDE evaluation of pronoun translation on the BWB test subset, and accuracy for anaphoric pronoun resolution on CONTRAPRO and BAWDEN. The 3-1 setting requires the surrounding context sentences and therefore cannot be applied to the contrastive sets.

Table 3: Named entity analysis for consistency and accuracy on relevant samples from the BWB test set.

Table 4: Accuracy across discourse marker categories and concatenation settings on the BWB test set.

cally richer ones. Following Jiang et al. (2022), we separate DMs into five categories: contrast, cause, condition, conjunction, and (a-)synchronous, and examine how different context-aware settings fare with each discourse relation. As Table

Table 6: Automatic metric results on IWSLT-17 (Fr→En and De→En) on different architectures (XFMR and MEGA).

Table 8: Sentence counts across parallel datasets.

Table 9: Annotated paragraphs from the BWB test set.

Table 10: Examples from the ContraPro and Bawden contrastive evaluation sets. Highlighted pronouns in the current sentence require the preceding context sentence for proper disambiguation.