Diverse Pretrained Context Encodings Improve Document Translation

We propose a new architecture for adapting a sentence-level sequence-to-sequence transformer by incorporating multiple pre-trained document context signals and assess the impact on translation performance of (1) different pretraining approaches for generating these signals, (2) the quantity of parallel data for which document context is available, and (3) conditioning on source, target, or source and target contexts. Experiments on the NIST Chinese-English, and IWSLT and WMT English-German tasks support four general conclusions: that using pre-trained context representations markedly improves sample efficiency, that adequate parallel data resources are crucial for learning to use document context, that jointly conditioning on multiple context representations outperforms any single representation, and that source context is more valuable for translation performance than target side context. Our best multi-context model consistently outperforms the best existing context-aware transformers.


Introduction
Generating an adequate translation for a sentence often requires understanding the context in which the sentence occurs (and in which its translation will occur). Although single-sentence translation models demonstrate remarkable performance (Chen et al., 2018; Vaswani et al., 2017; Bahdanau et al., 2015), extra-sentential information can be necessary to make correct decisions about lexical choice, tense, pronominal usage, and stylistic features, and therefore designing models capable of using this information is a necessary step towards fully automatic high-quality translation. A series of papers have developed architectures that permit the broader translation model to condition on extra-sentential context (Miculicich et al., 2018), to operate jointly on multiple sentences at once (Junczys-Dowmunt, 2019), or to condition indirectly on target-side document context using Bayes' rule (Yu et al., 2020b).
While noteworthy progress has been made at modeling monolingual documents (Brown et al., 2020), progress on document translation has been less remarkable, and continues to be hampered by the limited quantities of parallel document data relative to the massive quantities of monolingual document data. One recurring strategy for dealing with this data scarcity, and the basis for this work, is to adapt a sentence-level sequence-to-sequence model by making additional document context available in a second stage of training (Maruf et al., 2019; Miculicich et al., 2018; Haffari and Maruf, 2018). This two-stage training approach provides an inductive bias that encourages the learner to explain translation decisions preferentially in terms of the current sentence being translated, while allowing these decisions to be modulated at the margins by document context. However, a weakness of this approach is that the conditional dependence of a translation on its surrounding context given the source sentence is weak, and learning good context representations purely on the basis of scarce parallel document data is challenging.
A recent strategy for making better use of document context in translation is to use pretrained BERT representations of the context, rather than learning them from scratch. Our key innovation in this paper is an architecture for two-stage training that enables jointly conditioning on multiple context types, including both the source and target language context. Practically, we can construct a weak context representation from a variety of different contextual signals, and these are merged with the source sentence encoder's representation at each layer in the transformer. To examine the potential of this architecture, we explore two high-level research questions. First, using source language context, we explore the relative impact of different kinds of pretraining objectives on the performance obtained (BERT and PEGASUS), the amount of parallel document training data required, and the size of surrounding context. Second, recognizing that maintaining consistency in translation would seem to benefit from larger contexts in the target language, we compare the impact of source language context, target language context, and context containing both.
Our main findings are (1) that multiple kinds of source language context improve the performance of document translation over existing contextual representations, especially those that do not use pretrained context representations; (2) that although fine-tuning using pretrained contextual representations improves performance, performance is still strongly determined by the availability of contextual parallel data; and (3) that while both source and target language context provide benefit, source language context is more valuable, unless the quality of the target language context translations is extremely high.

Model Description
Our architecture is designed to incorporate multiple sources of external embeddings into a pretrained sequence-to-sequence transformer model. We do this by creating a new attention block for each embedding we wish to incorporate and stacking these blocks. We then insert this attention stack as a branching path in each layer of the encoder and decoder. The outputs of the new and original paths are averaged before being passed to the feed forward block at the end of the layer. Details are discussed below (§2.4), and the architecture is shown in Figure 1.
The model design follows the adapter pattern (Gamma et al., 1995). The interface between the external model and translation model takes the form of an attention block which learns to perform the adaptation. The independence between the models means that different input data can be provided to each, which makes extra information available during the translation process. In this work, we leverage this technique to: (1) enhance a sentence-level model with additional source embeddings; (2) convert a sentence-level model to a document-level model by providing contextual embeddings. Like BERT-fused, we use pretrained masked language models to generate the external embeddings.

Pre-Trained Models
We use two kinds of pretrained models: BERT (Devlin et al., 2019) and PEGASUS. Although similar in architecture, we conjecture that these models will capture different signals on account of their different training objectives.
BERT is trained with a masked word objective and a two sentence similarity classification task. During training, it is provided with two sentences that may or may not be adjacent, with some of their words masked or corrupted. BERT predicts the correct words and determines whether the two sentences form a contiguous sequence. Intuitively, BERT provides rich word-in-context embeddings. In terms of machine translation, it's reasonable to postulate that BERT would provide superior representations of the source sentence and reasonable near sentence context modulation. On the other hand, we expect it to fail to provide contextual conditioning when the two sentences are not adjacent. This shortcoming is where PEGASUS comes in.
PEGASUS is trained with a masked sentence objective. During training, it is given a document that has had random sentences replaced by a mask token. Its task is to decode the masked sentences in the same order they appear in the document. As a result, PEGASUS excels at summarization tasks, which require taking many sentences and compressing them into a representation from which another sentence can be generated. In terms of providing context for document translation, we conjecture that PEGASUS will be able to discover signals across longer ranges that modulate output.

Embedding Notation
To keep track of the type of embeddings being incorporated in a particular configuration, we use the notational convention Model Side (Inputs).
• Model: B for BERT, P for PEGASUS, and D for Document Transformer.
• Side: s for the source and t for the target language.
• Inputs: c for the current source (or target), i.e., x i , p for the previous source (target), and n for the next one. Note that 3p means the three previous sources (targets), (x i−3 , x i−2 , x i−1 ).
• When multiple embeddings are used, we include a ⇒ to indicate the order of attention operations.
We can thus represent the previously proposed BERT-fused document model as B s (p,c), since it passes the previous and current source sentences as input to BERT.
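To make the Inputs part of the notation concrete, the following Python sketch expands a specification such as 3p,c,3n into the sentence positions it selects around position i. The helper name `context_indices` and the choice to drop positions that fall outside the document are assumptions made for this illustration; the paper does not say how document boundaries are handled.

```python
import re


def context_indices(spec, i, num_sentences):
    """Expand an Inputs spec such as "3p,c,3n" into sentence indices around i.

    Hypothetical helper: positions outside the document are dropped, which is
    an assumption of this sketch rather than something stated in the paper.
    """
    indices = []
    for token in spec.split(","):
        match = re.fullmatch(r"(\d*)([pcn])", token.strip())
        if match is None:
            raise ValueError(f"unrecognized input token: {token!r}")
        count = int(match.group(1) or 1)
        kind = match.group(2)
        if kind == "p":      # previous sentences, oldest first
            indices.extend(range(i - count, i))
        elif kind == "c":    # the current sentence
            indices.append(i)
        else:                # next sentences
            indices.extend(range(i + 1, i + 1 + count))
    return [j for j in indices if 0 <= j < num_sentences]
```

Under these assumptions, `context_indices("3p,c,3n", 5, 10)` selects positions 2 through 8, while `context_indices("p,c", 0, 10)` selects only position 0 because the first sentence has no predecessor.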

Enhanced Models
The core of this work is to understand the benefits that adding a diverse set of external embeddings has on the quality of document translation. To this effect, we introduce two new models that leverage the output from both BERT and PEGASUS: Multi-source, B s (c) ⇒ P s (c), and Multi-context, B s (p,c) ⇒ B s (c,n) ⇒ P s (3p,c,3n).
There are a few ways to integrate the output of external models into a transformer layer. We could stack them vertically after the self-attention block, or we could place them horizontally and average all of their outputs together like MAT (Fan et al., 2020). Our preliminary experiments show that the parallel attention stack, depicted in Figure 1, works best. Therefore, we adopt this architecture in our experiments.

Parallel Attention Stack
If we let A = B s (p,c), B = B s (c,n), and C = P s (3p,c,3n) refer to the output of the external pretrained models computed once per translation example, then the Multi-context encoder layer is defined as The intermediate outputs of the attention stack are S a ⇒ S b ⇒ S . To reproduce BERT-fused, we remove S a and S b from the stack and set S directly to AttnBlock(A, A, E −1 ). We use attention block to refer to the attention, layer normalization, and residual operations, (Fan et al., 2020) is defined as where u ∼ Uniform(0, 1) and 1 is the indicator function.
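The parallel attention stack can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that are not taken from the paper: attention is single-head with no learned projections, all vectors share one width, the argument order (keys, values, queries) is inferred from the expression AttnBlock(A, A, E −1 ), each block in the stack is assumed to query with the previous block's output, and the feed forward block that follows the averaging step is omitted.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(keys, values, queries):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def layer_norm(vec, eps=1e-6):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def attn_block(keys, values, queries):
    """Attention, residual connection, and layer normalization (no learned weights)."""
    attended = attention(keys, values, queries)
    return [layer_norm([a + q for a, q in zip(att, qry)])
            for att, qry in zip(attended, queries)]


def multi_context_layer(prev, contexts):
    """Average the original self-attention path with the external attention stack.

    `prev` is the previous layer's output; `contexts` is an ordered list such as
    [A, B, C]. The result is what would be passed to the feed forward block.
    """
    original = attn_block(prev, prev, prev)      # original self-attention path
    branch = prev
    for ctx in contexts:                         # S_a => S_b => S (assumed chaining)
        branch = attn_block(ctx, ctx, branch)
    return [[(r + s) / 2 for r, s in zip(rv, sv)]
            for rv, sv in zip(original, branch)]
```

Passing a single context, `multi_context_layer(prev, [A])`, corresponds to the BERT-fused reduction described above, in which S is set directly from A.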

Experiment Setup

Datasets
We evaluate our model on three translation tasks: the NIST Open MT Chinese-English task, the IWSLT'14 English-German translation task, and the WMT'14 English-German news translation task. Table 1 provides a breakdown of the type, quantity, and relevance of the data used in the various dataset treatments. NIST provides the largest amount of in-domain contextualized sentence pairs. IWSLT'14 and WMT'14 are almost an order of magnitude smaller. See Appendix A for preprocessing details.
NIST Chinese-English is comprised of LDC distributed news articles and broadcast transcripts. We use the MT06 dataset as the validation set and MT03, MT04, MT05, and MT08 as test sets. The validation set contains 1,649 sentences and the test set 5,146 sentences. Chinese sentences are frequently underspecified with respect to grammatical features that are obligatory in English (e.g., number for nouns, tense on verbs, and dropped arguments), making it a common language pair to study for document translation.
IWSLT'14 English-German is a corpus of translated TED and TEDx talks. Following prior work, we use the combination of dev2010, dev2012, tst2010, tst2011, and tst2012 as the test set, which contains 6,750 sentences. We randomly selected 10 documents from the training data for validation. We perform a data augmentation experiment with this dataset by additionally including news commentary v15. We denote this treatment as IWSLT+ and consider this to be out-of-domain data augmentation.
Figure 1 (legend): A partially shaded box indicates data while full shading is used for operations. A dashed border means the data is constant for a given translation. We use dashed arrows for residual connections and blue arrows to indicate embeddings that originate from outside the transformer model.

WMT'14 English-German is a collection of web data, news commentary, and news articles. We use newstest2013 for validation and newstest2014 as the test set. For the document data, we use the original WMT'14 news commentary v9 dataset. We run two document augmentation experiments on this dataset. The first, denoted as WMT+, replaces news commentary v9 with the newer news commentary v15 dataset. The second augmentation experiment, denoted as WMT++, builds on the first by additionally incorporating the Tilde Rapid 2019 corpus. The Rapid corpus is comprised of European Commission press releases and the language style is quite different from the style used in the News Commentary data. For this reason, we consider Rapid to be out-of-domain data for this task.

Training
We construct enhanced models with additional attention blocks and restore all previously trained parameters. We randomly initialize the newly added parameters and exclusively update these during training. For a given dataset, we train a model on all the training data it is compatible with. This means that for document-level models, only document data is used, while for sentence-level models both document and sentence data are used. In our work, this distinction only matters for the WMT'14 dataset, where there is a large disparity between the two types of data.
Transformer models are trained on sentence pair data to convergence. For NIST and IWSLT'14 we use transformer base while for WMT'14 we use transformer big. We use the following vari-

Table 1 (column headers only): Dataset, In Domain, Out Domain.

Evaluation
To reduce the variance of our results and help with reproducibility, we use checkpoint averaging. We select the ten contiguous checkpoints with the highest average validation BLEU. We do this at two critical points: (1) with the transformer models used to bootstrap enhanced models; (2) before calculating the validation and test BLEU scores we report. We use the sacreBLEU script (Post, 2018).
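The checkpoint selection and averaging step can be sketched as below. This is an illustration only: the function names are hypothetical, checkpoints are represented as plain dictionaries mapping a parameter name to a list of floats, and ties between windows are broken in favor of the earliest window, which is an assumption the paper does not address.

```python
def best_contiguous_window(valid_bleu, k=10):
    """Return (start, mean) for the k contiguous checkpoints with the highest mean BLEU."""
    if k <= 0 or len(valid_bleu) < k:
        raise ValueError("need at least k checkpoints")
    best_start, best_mean = 0, sum(valid_bleu[:k]) / k
    for start in range(1, len(valid_bleu) - k + 1):
        mean = sum(valid_bleu[start:start + k]) / k
        if mean > best_mean:  # ties keep the earliest window (assumption)
            best_start, best_mean = start, mean
    return best_start, best_mean


def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints (name -> list of floats)."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }
```

A typical use would be to call `best_contiguous_window` on the per-checkpoint validation BLEU scores and then pass the selected slice of checkpoints to `average_checkpoints`.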

Results
In this section, we present our main results and explore the importance of each component in the multi-context model. Additionally, we investigate the performance impact of document-level parallel data scarcity, the value of source-side versus target-side context, and the importance of target context quality. Table 2 compares our Multi-source and Multi-context models to baselines from related prior work: transformer (Vaswani et al., 2017), document transformer, and the BERT-fused model for machine translation. We see that a multi-embedding model outperforms all the single embedding models in each of the datasets we try. However, the best multi-embedding configuration varies by dataset. We find that incorporating target-side context does not improve performance beyond using source-side context alone. We present our ablation studies in the subsequent sections to shed further light on the causes of this pattern of results. To preserve the value of the test set, we report results on the validation set for these experiments.

Source Context vs. Target Context
In some language pairs, the source language is underspecified with respect to the obligatory information that must be given in the target language. For example, in English every inflected verb must have tense and this is generally not overtly marked in Chinese. In these situations, being able to condition on prior translation decisions would be valuable. However, in practice, the target context is only available post translation, meaning there is a risk of cascading errors. In this section, we seek to answer two questions: (1) how does the quality of target context affect document-level translation; (2) does incorporating high-quality target context into source-only models add value.
To answer the first question, we evaluate the target context model P t (3p,3n) using various translations as context. Table 3 shows the BLEU scores achieved by the target context models on the validation set. The lowest quality context comes from using the output of the baseline transformer model to furnish the context (valid BLEU of 48.76); the middle level comes from a model that conditions on three views of source context (valid BLEU of 52.8); and the third is an oracle experiment that uses a human reference translation. We see that the BLEU score improves as the quality of the target context improves; however, the impact is still less than that of the Multi-context source model, even in the oracle case!
Table 3: The value of using context on the target side of a translation is dependent on its quality. We test this in the limit by providing oracle context, which uses one of the references as context. We report BLEU scores on the validation set. The numbers in the second column are the BLEU scores of the translations used as the context, indicating the quality of the context.
Next, we explore whether leveraging both source and target context works better than only using source context. To control for the confounding factor of target context quality, we remove one of the references from the validation dataset and use it only as context. We believe this provides an upper bound on the effect of target context for two reasons: (1) it's reasonable to assume that, at some point, machine translation will be capable of generating human quality translations; (2) even when this occurs, we will not have access to the style of a specific translator ahead of time. For these reasons, we calculate BLEU scores using only the three remaining references. We can see in Table 4 that adding human quality target context to Multi-context only produces a 0.14 BLEU improvement. This challenges the notion that target context can add more value than source context alone.
Table 4: We remove one of the references from the validation dataset and use it to provide target context only. The numbers are lower compared to other tables because the BLEU score is calculated w.r.t. three references instead of four. Using human level target context offers little value over using source context alone.

Context Ablation
To assess the importance of the various embeddings incorporated in the Multi-context model, we perform an ablation study by adding one component at a time until we reach its full complexity. Table 5 shows the study results. We can see that much of the improvement comes from the stronger sentence-level model produced by adding BERT's encoding of the source sentence: a full 2.25 BLEU improvement. The benefit of providing contextual embeddings is more incremental, yet consistent. Adding the previous sentence gives us 0.44 BLEU, adding additional depth provides another 0.49, and including the next sentence adds 0.37. Finally, adding PEGASUS' contextual embedding on top of all this results in a boost of 0.49. Holistically, we can assign 2.45 BLEU to source embedding enrichment and 1.59 to contextual representations.
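As a quick consistency check on these figures, the short Python sketch below sums the per-step gains and the two holistic shares. It also compares both totals with the gap between the baseline and Multi-context validation scores quoted elsewhere in the paper (48.76 and 52.80), on the assumption that the ablation starts from that baseline. How the 2.45/1.59 split was derived is not stated in this section; the sketch only checks that the breakdowns add up to the same total.

```python
# Per-step validation BLEU gains quoted in the ablation paragraph.
step_gains = {
    "BERT encoding of the source sentence": 2.25,
    "previous sentence": 0.44,
    "additional depth": 0.49,
    "next sentence": 0.37,
    "PEGASUS contextual embedding": 0.49,
}

# Sum of the incremental gains.
total_from_steps = round(sum(step_gains.values()), 2)

# Sum of the two holistic shares (source enrichment + contextual representations).
total_from_shares = round(2.45 + 1.59, 2)

# Gap between the quoted Multi-context and baseline validation BLEU scores
# (assumes the ablation starts from the baseline transformer).
total_from_endpoints = round(52.80 - 48.76, 2)
```

All three totals come to 4.04 BLEU, so the incremental and holistic accounts describe the same overall improvement.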

Data Scarcity
NIST is a high-resource document dataset containing over 1.4M contextualized sentence pairs. In this section, we investigate to what extent the quantities of parallel documents affect the performance of our models. To do so, we retrain enhanced models with subsets of the NIST training dataset. It is important to note that the underlying sentence transformer model was not retrained in these experiments, meaning that these experiments simulate adding document context to a strong baseline as done in Lopes et al. (2020). Figure 2 shows the BLEU scores of different models on the NIST validation set with respect to the number of contextualized sentences used for training. We can see that it requires an example pool size over 300K before these models outperform the baseline. We conjecture that sufficient contextualized sentence pairs are crucial for document-level models to achieve good performance, which would explain why these models don't perform well on the IWSLT'14 and WMT'14 datasets.
Table 5: We perform ablation experiments on the NIST validation dataset to better understand the performance increase of the Multi-context model. We conclude that, in this document rich environment, multiple sources of embedding enrichment and document context contribute to performance. Adding additional parameters also helps but we only see this when going from one to two blocks. Parameter control experiments are shown in light grey. Final row: B s (p,c) ⇒ B s (c,n) ⇒ P s (3p,c,3n), 52.80 BLEU.
Further, this pattern of results helps shed light on the inconsistent findings in the literature regarding the effectiveness of document context models. A few works (Kim et al., 2019; Lopes et al., 2020) have found that the benefit provided by many document context models can be explained away by factors other than contextual conditioning. We can now see from Figure 2 that these experiments were done in the low data regime. The randomly initialized context model needs around 600K training examples before it significantly outperforms the baseline, while the pretrained contextual models reduce this to about 300K. It is important to note that none of the contextual models we tried outperformed the baseline below this point. This indicates that data quantity is not the only factor that matters but it is a prerequisite for the current class of document context architectures.

Document Data Augmentation
We further validate our hypothesis about the importance of sufficient contextualized data by experimenting with document data augmentation, this time drawing data from different domains. We augment the IWSLT dataset with news commentary v15, an additional 345K document context sentence pairs, and repeat the IWSLT experiments. During training, we sample from the datasets such that each batch contains roughly 50% of the original IWSLT data. To ensure a fair comparison, we first finetune the baseline transformer model on the new data, which improves its performance by 1.61 BLEU. We use this stronger baseline as the foundation for the other models and show the results in Table 6. Although Multi-context edges ahead of Multi-source, the significance lies in the relative impact additional document data has on the two classes of models. The average improvement of the sentence-level models is 1.58 versus the 1.98 experienced by the document models. Huo et al. (2020) observed a similar phenomenon when using synthetic document augmentation. This further emphasizes the importance of using sufficient contextualized data when comparing the impact of various document-level architectures, even when the contextualized data is drawn from a new domain.
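The batch mixing described above can be sketched as follows. This is a minimal illustration, assuming each example is drawn independently and with replacement, so that a batch contains the original data only in expectation at the requested proportion; the function name and arguments are hypothetical and the paper does not describe its sampler in this detail.

```python
import random


def sample_mixed_batch(primary, augment, batch_size, primary_fraction=0.5, rng=None):
    """Draw a batch where each example comes from `primary` with probability
    `primary_fraction` and from `augment` otherwise (sampling with replacement)."""
    rng = rng or random.Random()
    batch = []
    for _ in range(batch_size):
        source = primary if rng.random() < primary_fraction else augment
        batch.append(rng.choice(source))
    return batch
```

With `primary_fraction=0.5` roughly half of each batch comes from the original dataset; other proportions, such as a 70%/30% mix of two corpora, follow by changing that argument.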

Three Stage Training
WMT'14 offers an opportunity to combine the insights gained from the aforementioned experiments. This dataset provides large quantities of sentence pair data and a small amount of document pair data. Not surprisingly, both BERT-fused and Multi-context struggle in this environment. On the other hand, Multi-source benefits from the abundance of sentence pair data.
Figure 2 (title: Impact of Data Scarcity, NIST; series: Multi-context, BERT-fused, Multi-source, Doc Transformer, Baseline): Document context models require sufficient contextualized training data in order to be effective. We simulate data scarcity on the NIST dataset by randomly sampling a subset of the data and using it to train the various models. In order to outperform the baseline, pretrained models
Table 6: Model performance before and after document data augmentation. We see that most of the improvement is coming from source embedding enrichment. Data augmentation is required for document-level models to additionally learn to leverage contextual information. The document-level models benefit significantly more from additional document data than the sentence-level models.
In order to make the most of the training data, we add a third stage to our training regime. As before, in stage one, we train the transformer model with the sentence pair data. In stage two, we train the Multi-source model also using the sentence pair data. In stage three, we add an additional P s (3p,3n) attention block to the Multi-source model and train it with document data. We perform two document augmentation experiments. In the first, we replace news commentary v9 with v15. In the second, we train on a mix of news commentary v15 and Tilde Rapid 2019. The optimal mix was 70% and 30% respectively, which we found by tuning on the validation dataset. For each of the augmentation experiments, we created new Multi-source baselines by fine-tuning the original baseline on the new data.
When training these new baselines we only updated the parameters in the B s (c) and P s (c) attention blocks. In contrast, when training the treatment models, we froze these blocks and only updated the parameters in the P s (3p,3n) block. In this way, both the new baselines and treatments started from the same pretrained Multi-source model, were exposed to the same data, and had only the parameters under investigation updated. We see in Table 7 that this method can be used to provide the document-level model with a much stronger sentence-level model to start from. As we saw in the previous data augmentation experiments (§4.4), document augmentation helps the document-level model more than the sentence-level model. It is interesting to note that out-of-domain document data helps the document-level model yet hurts the sentence-level model.
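The selective updating used across the three stages can be sketched as a masked gradient step. This is an illustration under stated assumptions: parameters are grouped in a dictionary keyed by hypothetical group names, plain SGD stands in for whatever optimizer was actually used, and the stage-two trainable set is inferred from the statement that only newly added parameters are updated.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update to the groups named in `trainable`; copy the rest unchanged."""
    return {
        name: [w - lr * g for w, g in zip(weights, grads[name])]
        if name in trainable else list(weights)
        for name, weights in params.items()
    }


# Hypothetical parameter-group names for the blocks discussed above.
STAGES = {
    "stage 1 (sentence pairs)": {"transformer"},
    "stage 2 (sentence pairs)": {"B_s(c)", "P_s(c)"},
    "stage 3 (document data)": {"P_s(3p,3n)"},
}
```

For the treatment models, every update would use the stage-three set, so the transformer and the B s (c) and P s (c) blocks stay frozen while only the P s (3p,3n) block changes.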

Related Work
This work is closely related to two lines of research: document-level neural machine translation and representation learning via language modeling.
Earlier work in document machine translation exploits the context by taking a concatenated string of adjacent source sentences as the input of neural sequence-to-sequence models (Tiedemann and Scherrer, 2017). Follow-up work adds additional context layers to the neural sequence-to-sequence models in order to have a better encoding of the context information (Miculicich et al., 2018, inter alia). These approaches vary in terms of whether to incorporate the source-side context (Bawden et al., 2018; Miculicich et al., 2018) or target-side context (Tu et al., 2018), and whether to condition on a few adjacent sentences (Jean et al., 2017; Wang et al., 2017; Tu et al., 2018; Voita et al., 2018; Miculicich et al., 2018) or the full document (Haffari and Maruf, 2018; Maruf et al., 2019). Our work is similar to this line of research since we have also introduced additional attention components to the transformer. However, our model is different from theirs in that the context encoders were pretrained with a masked language model objective.
Table 7 (WMT'14): Results from using a three-stage training approach. When there is large disparity between the amount of sentence pair data and document data, this method enables training new attention blocks with the maximum amount of available data given their input restrictions.
There has also been work on leveraging monolingual documents to improve document-level machine translation. Junczys-Dowmunt (2019) creates synthetic parallel documents generated by backtranslation (Sennrich et al., 2016; Edunov et al., 2018) and uses the combination of the original and the synthetic parallel documents to train the document translation models. Voita et al. (2019) train a post-editing model from monolingual documents to post-edit sentence-level translations into document-level translations. Yu et al. (2020b,a) use Bayes' rule to combine a monolingual document language model probability with sentence translation probabilities.
Finally, large-scale representation learning with language modeling has achieved success in improving systems in language understanding, leading to state-of-the-art results on a wide range of tasks (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2018; McCann et al., 2017; Chronopoulou et al., 2019; Lample and Conneau, 2019; Brown et al., 2020). These representations have also been used to improve text generation tasks, such as sentence-level machine translation (Song et al., 2019; Edunov et al., 2019) and summarization (Zhang et al., 2019; Dong et al., 2019), and to repurpose unconditional language generation (Ziegler et al., 2019; de Oliveira and Rodrigo, 2019). Our work is closely related to prior work in which pretrained large-scale language models are applied to document-level machine translation tasks. We advance this line of reasoning by designing an architecture that uses composition to incorporate multiple pretrained models at once. It also enables conditioning on different inputs to the same pretrained model, enabling us to circumvent BERT's two sentence embedding limit.

Conclusion
We have introduced an architecture and training regimen that enables incorporating representations from multiple pretrained masked language models into a transformer model. We show that this technique can be used to create a substantially stronger sentence-level model and, with sufficient document data, further upgraded to a document-level model that conditions on contextual information. Through ablations and other experiments, we establish document augmentation and multi-stage training as effective strategies for training a document-level model when faced with data scarcity. We also find that source-side context is sufficient for these models, with target context adding little additional value.

A.1 Text
We perform text normalization on the datasets before tokenization.
• All languages - Unicode canonicalization (NFKD form), replacement of common multiple encoding errors present in training data, standardization of quotation marks into "directional" variants.
• English - Replace non-American spelling variants with American spellings using the aspell library. Punctuation was split from English words using a purpose-built library.
• Chinese - Convert any traditional Chinese characters into simplified forms and segment into word-like units using the Jieba segmentation tool.
• English & German for WMT'14 - Lowercase first word of sentence unless it was in a whitelist of proper nouns and common abbreviations.
• Chinese & English for NIST -Lowercase all words.
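Two of the steps listed above, Unicode canonicalization and lowercasing, can be sketched with the Python standard library. This is an illustration only, assuming the canonicalization step refers to Unicode NFKD normalization; the function name is hypothetical, and the encoding-error repair, quotation mark standardization, spelling replacement, and segmentation steps are not covered.

```python
import unicodedata


def normalize_text(text, lowercase=False):
    """Apply NFKD normalization and, optionally, lowercase the result."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.lower() if lowercase else normalized
```

NFKD replaces compatibility characters with their plain equivalents, so a full-width Latin letter or a typographic ligature is mapped to ordinary ASCII letters before tokenization.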

A.2 Tokenization
We encode text into sub-word units using the sentencepiece tool (Kudo and Richardson, 2018). When generating our own subword segmentation, we used the algorithm from Kudo (2018) with a minimum character coverage of 0.9995. Other than for BERT, we use TensorFlow SentencepieceTokenizer for tokenization given a sentencepiece model.
• BERT (all) -Used vocabulary provided with download and TensorFlow BertTokenizer.
• PEGASUS large & EN small -Used sentencepiece model provided with PEGASUS large download.
• PEGASUS Zh small - Generated subword vocabulary of 34K tokens from the NIST dataset.

Using a global batch size of 32 and a beam width of 5, the following are the number of samples per second our models and key baselines managed during inference:
• Transformer: 11.94
• BERT-fused: 7.37
• Multi-source: 5.45
• Multi-context: 4.80

C Qualitative Analysis
We manually inspected the translation outputs from the Multi-source model and Multi-context model and have found that the Multi-context model indeed does better in terms of maintaining the consistency of lexical usage across sentences. Unlike English, Chinese does not mark nouns for plural vs singular nor verbs for tense. Therefore, this needs to be inferred from context to generate accurate English translations. It is not possible for a sentence-level MT system to capture this information when the relevant context is not in the current sentence. Tables 8, 9, and 10 provide various examples where the sentence-level model cannot know this information and the document-level model is able to correctly condition on it.

Reference:
Mr. Jin said that Wang Xuan started to focus on mentoring young people when he was in his 50s. He constantly stressed that he wanted to pave the way for young people and that he wanted to be their stepping stone.
Multi-source: Mr. Chin says that when he was in his fifties, Wang began to pay attention to cultivating young people. He has always stressed that to pave the way, he must be willing to serve as a ladder for young people.
Multi-context: Mr. Jin said that when he was in his fifties, Wang Xuan began to pay attention to cultivating young people. He always stressed that he wanted to pave the way, to be willing to serve as a ladder, and to give young people a way.

Example 2 Consistency of Proper Noun
Source: 巴政府是决不会让这种企图得逞的。

Reference:
The Pakistani government will never allow such attempt to materialize.
Multi-source: The Palestinian government will never let this attempt succeed.
Multi-context: The Pakistani government will never let this attempt succeed.
Table 9: The proper noun 巴政府 is ambiguous since 巴 could be short for 巴西 (Brazil), 巴勒斯坦 (Palestine), or 巴基斯坦 (Pakistan). The model has to refer to the context to know that 巴 refers to Pakistan in this instance since this is where the entire article takes place.

Reference:
On that day ten years ago, when this teacher was taking a nap at home during noontime break, the telephone rang suddenly.
Multi-source: That was ten years ago. When this teacher was taking advantage of his lunch break, he was sleeping at home. Suddenly, the phone rang.
Multi-context: One day ten years ago, when this teacher was taking advantage of her lunch break, she was sleeping at home. Suddenly, the telephone rang.
Table 10: This is a story about a mother. The pronouns she/her have been used across the document. One cannot infer the gender of the teacher from the source sentence alone. Thus, the context model has to refer to the other sentences in order to get this correct.