Breaking the Corpus Bottleneck for Context-Aware Neural Machine Translation with Cross-Task Pre-training

Context-aware neural machine translation (NMT) remains challenging due to the lack of large-scale document-level parallel corpora. To break the corpus bottleneck, in this paper we aim to improve context-aware NMT by taking advantage of both large-scale sentence-level parallel datasets and source-side monolingual documents. To this end, we propose two pre-training tasks. One learns to translate a sentence from the source language to the target language on the sentence-level parallel dataset, while the other learns to restore a document from a deliberately noised version to the original on the monolingual documents. Importantly, the two pre-training tasks are learned jointly and simultaneously by the same model, which is thereafter fine-tuned on scale-limited parallel documents from both sentence-level and document-level perspectives. Experimental results on four translation tasks show that our approach significantly improves translation performance. One nice property of our approach is that the fine-tuned model can be used to translate both sentences and documents.


Introduction
Document-level context-aware neural machine translation (NMT) aims to translate sentences in a document under the guidance of document-level context. Recent years have witnessed great improvement in context-aware NMT, with extensive attempts at effectively leveraging document-level context (Tiedemann and Scherrer, 2017; Maruf and Haffari, 2018; Maruf et al., 2019, to name a few). However, the performance of context-aware NMT is still limited by the size of document-level parallel datasets. On the one hand, unlike sentence-level translation models, which can be well trained on large-scale sentence-level parallel datasets, the translation models of context-aware NMT may be insufficiently trained. On the other hand, with only scale-limited source-side documents, the context encoders may fail to effectively extract useful context from the whole document. In contrast, large-scale parallel sentence corpora, and especially monolingual document corpora, are much easier to find. In this paper, our goal is to break the corpus bottleneck for context-aware NMT by leveraging both large-scale sentence-level parallel datasets and monolingual documents. Specifically, we aim to use the former to boost the performance of translation models and the latter to enhance the context encoders' capability of capturing useful context information.

* Corresponding author: Junhui Li.
Footnote 1: Unless otherwise specified, monolingual documents are all source-side throughout this paper.
There have been several attempts to boost context-aware NMT performance in scenarios where the document-level parallel dataset is scale-limited, or even not available. On the one hand, a sentence-level parallel dataset is a natural resource to use. For example, prior work proposes a two-stage training strategy for context-aware NMT that pre-trains the model on a sentence-level parallel dataset. On the other hand, Junczys-Dowmunt (2019) leverages large-scale source-side monolingual documents by simply concatenating the sentences within a document into a long sequence and exploring multi-task training via the BERT objective (Devlin et al., 2019) on the encoder. However, since different models are usually required to model sentences and documents, it is challenging to effectively exploit both in a single model.
In order to effectively and simultaneously model both the sentence-level parallel dataset and monolingual documents, in this paper we propose a novel cross-task pre-training approach. As shown in Figure 1, we define two pre-training tasks. One learns to translate a sentence from the source language to the target language, while the other learns to restore a document from a deliberately noised version to the original. Importantly, the two pre-training tasks are learned jointly and synchronously by the same model. We then use the document-level parallel dataset to fine-tune the pre-trained models. As in pre-training, we can fine-tune the models from both sentence-level and document-level perspectives. Experimental results on four document-level translation tasks show that our approach significantly improves translation performance, suggesting its effectiveness in modeling both the sentence-level parallel dataset and monolingual documents. One nice property of our approach is that the fine-tuned models can be used to translate both sentences and documents.

Figure 1: Illustration of the proposed cross-task pre-training (upper) and fine-tuning with two perspectives (below).

Cross-Task Pre-training
In the following, we first describe our pre-training tasks, which are defined on the sentence-level parallel dataset and large-scale monolingual documents (Section 2.1). Then we detail our model, which accommodates these pre-training tasks (Section 2.2). Finally, we present our joint pre-training (Section 2.3).

Pre-training Tasks
We define two pre-training tasks: one on the sentence-level parallel dataset and the other on monolingual documents.

Sentence-level Translation
Given a large-scale sentence-level parallel dataset, our pre-training task is straightforward, i.e., sentence-level translation.

Document-level Restoration
Given monolingual documents, our pre-training task is to restore a document from a noised version. To this end, we deliberately corrupt documents with the following two pre-training objectives, which are inspired by both the gap sentence objective and the masked language model objective (Devlin et al., 2019).
• Context-Aware Gap Sentence Restoration (CA-GSR). Given a document S with N sentences, we randomly select M sentences as gap sentences and replace them with a mask token [MASK1] to inform the model. The gap sentence ratio is therefore M/N. For each selected gap sentence, we use its left and right neighbours as input while the gap sentence serves as output. To mimic the document-level translation task, the first and the last sentences are never selected, and no two consecutive sentences are both selected.
• Context-Aware Masked Sentence Restoration (CA-MSR). Given a sentence X, we follow BERT and randomly select 15% of its tokens. Each selected token is (1) replaced by a mask token [MASK2] 80% of the time, (2) replaced by a random token 10% of the time, or (3) left unchanged 10% of the time. For a sentence, we use its masked version as input while the original X serves as output.
Both CA-GSR and CA-MSR are applied simultaneously, with the noised document as context. For convenience of presentation, we use a concrete example to illustrate the input and output of our document-level restoration task. As shown in Figure 2, let us assume that a document X contains 6 sentences and that the third and fifth sentences (i.e., X3 and X5) are selected as gap sentences while the others are not. On the one hand, for a sentence that is not selected as a gap sentence, e.g., X1, we use its masked version as input and try to predict the original sentence X1. On the other hand, for a gap sentence, e.g., X3, we concatenate its left and right neighbouring sentences with the separator [MASK1] and try to predict the gap sentence X3. As shown in Figure 2, sentences S1 to S6 constitute the document-level input S, while sentences T1 to T6 make up the output T. Note that we include neither the gap sentences themselves nor their masked versions in S, so that the document context contains no obvious hints for generating the gap sentences.
Overall, the pre-training task of document-level restoration is to predict the target output T given the source input S. This is the same as the task of document-level translation, except that in restoration S and T are in the same language while in translation they are in different languages.
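To make the construction of a restoration instance concrete, the following is a minimal Python sketch of how a pair (S, T) could be built from one tokenized document. It is an illustration under stated assumptions, not our actual preprocessing code: the function names (`select_gaps`, `mask_sentence`, `build_instance`), the greedy selection procedure, and the choice to use the masked form of the neighbouring sentences are all hypothetical.

```python
import random

MASK1 = "[MASK1]"  # sentence-level separator used for gap sentences
MASK2 = "[MASK2]"  # token-level mask used by CA-MSR


def select_gaps(n_sents, ratio, rng):
    """Choose gap sentences: never the first or last, never two in a row."""
    wanted = int(n_sents * ratio)
    candidates = list(range(1, n_sents - 1))
    rng.shuffle(candidates)
    gaps = set()
    for idx in candidates:
        if len(gaps) == wanted:
            break
        if idx - 1 not in gaps and idx + 1 not in gaps:
            gaps.add(idx)
    return gaps


def mask_sentence(tokens, vocab, rng, rate=0.15):
    """BERT-style noise: 80% [MASK2], 10% random token, 10% unchanged."""
    noised = list(tokens)
    k = round(rate * len(tokens))
    for pos in rng.sample(range(len(tokens)), k):
        roll = rng.random()
        if roll < 0.8:
            noised[pos] = MASK2
        elif roll < 0.9:
            noised[pos] = rng.choice(vocab)
        # otherwise the selected token is left unchanged
    return noised


def build_instance(doc, vocab, ratio=0.2, seed=0):
    """Return (S, T, gaps) for one document given as a list of token lists."""
    rng = random.Random(seed)
    gaps = select_gaps(len(doc), ratio, rng)
    masked = [mask_sentence(sent, vocab, rng) for sent in doc]
    source = []
    for i in range(len(doc)):
        if i in gaps:
            # Assumption: the neighbours appear in their masked form.
            # Neighbours of a gap sentence are never gaps themselves.
            source.append(masked[i - 1] + [MASK1] + masked[i + 1])
        else:
            source.append(masked[i])
    target = [list(sent) for sent in doc]
    return source, target, gaps
```

In this sketch the target side is always the original document, so the only difference between a gap sentence and an ordinary sentence lies in what the source side exposes.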

Joint Modeling of Pre-training Tasks
We use the same model to handle the above two pre-training tasks. Since the task of document-level restoration is more complicated than the task of sentence-level translation, we first describe the model for document-level restoration (Section 2.2.1). Then we apply the model to sentence-level translation (Section 2.2.2).

Context-Aware Modeling for Document-Level Restoration
We define some notation before describing our model. Given a document-level source input $S = (S_1, \cdots, S_N)$ and target output $T = (T_1, \cdots, T_N)$ with $N$ sentence pairs, we assume each source sentence $S_i = (s_{i,1}, \cdots, s_{i,n})$ consists of $n$ words. We use $d_m$ as the size of embeddings and hidden states throughout the entire model. Figure 3 shows our context-aware model. It contains two parts, namely a global context encoder and a seq2seq model augmented by context representation. Note that for document-level restoration, we take documents as input units.
Global Context Encoder. For the $i$-th input sentence $S_i$ in document $S$, the global context encoder aims to extract useful global context for every word $s_{i,j}$ in it. As shown in Figure 3(a), the encoder consists of a stack of $N_g$ identical encoder layers. Each encoder layer consists of four major sub-layers: a self-attention sub-layer, a sentence representation sub-layer, a global context attention sub-layer and a feed-forward sub-layer.
In the $k$-th encoder layer, the self-attention sub-layer takes $A_i^{(k)} \in \mathbb{R}^{n \times d_m}$ as input and computes a new sequence $B_i^{(k)}$ of the same length via a multi-head attention function:

$$B_i^{(k)} = \mathrm{MultiHead}\left(q = A_i^{(k)}, k = A_i^{(k)}, v = A_i^{(k)}\right),$$

where the output $B_i^{(k)}$ is in $\mathbb{R}^{n \times d_m}$ (see footnote 3), and $q$, $k$, $v$ represent the query and key-value pairs in the attention mechanism, respectively. For the first encoder layer, $A_i^{(1)}$ is the sum of the word embeddings and position embeddings of $S_i$, while for the other layers, $A_i^{(k)}$ is the output of the preceding encoder layer.
In the $k$-th encoder layer, the sentence representation sub-layer takes $B_i^{(k)}$ as input and computes a vector to represent the sentence through a linear combination of its hidden states with a vector of weights $\alpha_i^{(k)}$, where $\alpha_i^{(k)}$ is an $n$-sized vector. The representation vector of sentence $S_i$ is then the weighted sum of its hidden states:

$$C_i^{(k)} = \alpha_i^{(k)} B_i^{(k)} = \sum_{j=1}^{n} \alpha_{i,j}^{(k)} B_{i,j}^{(k)},$$

where $C_i^{(k)}$ is a $d_m$-sized vector. We then stack the vectors of all sentences in $S$ into $C^{(k)}$, i.e., $C^{(k)} = [C_1^{(k)}; \cdots; C_N^{(k)}]$. Note that $C^{(k)} \in \mathbb{R}^{N \times d_m}$ is at the document level and represents the global context.
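The sentence representation sub-layer can be illustrated with a short NumPy sketch. It is only a sketch under an explicit assumption: the weights are produced here by a softmax over a dot product with a single scoring vector `w`, which is a hypothetical choice made for illustration and not necessarily the scoring function used in our model. The weighted sum and the stacking into a document-level matrix follow the description above.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def sentence_vector(B_i, w):
    """B_i: (n, d_m) hidden states; w: (d_m,) hypothetical scoring vector."""
    alpha = softmax(B_i @ w)  # (n,) one weight per word
    return alpha @ B_i        # (d_m,) weighted sum of hidden states


def global_context(B_list, w):
    """Stack the N sentence vectors into a matrix of shape (N, d_m)."""
    return np.stack([sentence_vector(B_i, w) for B_i in B_list])
```

Because each sentence collapses to one $d_m$-sized vector, sentences of different lengths can be stacked without padding.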
In the $k$-th encoder layer, the global context attention sub-layer extracts useful global context for each word $s_{i,j}$ in $S_i$. This is also done via a multi-head attention function, with $B_i^{(k)}$ as the query and the global context $C^{(k)}$ as the key-value pairs:

$$\mathrm{MultiHead}\left(q = B_i^{(k)}, k = C^{(k)}, v = C^{(k)}\right).$$

In the $k$-th encoder layer, the feed-forward sub-layer is applied to each position separately and identically by two linear transformations with a ReLU activation in between:

$$\mathrm{FFN}(x) = \max(0, x W_{F1} + b_{F1}) W_{F2} + b_{F2},$$

where $W_{F1}, W_{F2} \in \mathbb{R}^{d_m \times d_m}$ and $b_{F1}, b_{F2} \in \mathbb{R}^{d_m}$ are model parameters.

Footnote 3: The actual output of this sub-layer is $\mathrm{LayerNorm}(B_i^{(k)} + A_i^{(k)})$, where LayerNorm is the layer normalization function. For simplicity, we do not include the residual addition and layer normalization functions in our sub-layers. Note that the sentence representation sub-layer is the only exception, as it has neither residual addition nor layer normalization.

Figure 3: Illustration of the proposed context-aware model. Note that 1) we share the two sub-layers of self-attention and feed-forward between the global context encoder and the sentence encoder; 2) the model uses the same vocabulary for the tasks in pre-training and fine-tuning since we share the vocabulary for the source and target languages; 3) we use (b) for sentence-level translation and turn off the gate mechanism.
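As a rough illustration of the global context attention sub-layer, the sketch below lets every word of a sentence attend over the stacked sentence vectors of the document. It is deliberately simplified: it uses a single head and omits the learned query, key and value projections, the residual connection and layer normalization, so it should be read as an assumption-laden toy version rather than the sub-layer itself. The function name `global_context_attention` is hypothetical.

```python
import numpy as np


def global_context_attention(B_i, C):
    """Single-head scaled dot-product attention without learned projections.

    B_i: (n, d_m) word states of one sentence, used as queries.
    C:   (N, d_m) stacked sentence vectors, used as keys and values.
    """
    d_m = B_i.shape[-1]
    scores = B_i @ C.T / np.sqrt(d_m)                     # (n, N)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ C                                    # (n, d_m)
```

Each output row is a convex combination of the sentence vectors, so every word receives its own view of the document-level context.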
We denote by $G_i \in \mathbb{R}^{n \times d_m}$ the final output of the global context encoder, i.e., the output of its last ($N_g$-th) layer. That is to say, $G_i$ represents the context representation for sentence $S_i$.

Context-Aware Model. As shown in Figure 3(b), the seq2seq model is very similar to the standard Transformer, except that it is now equipped with the context representation obtained by the global context encoder. For sentence $S_i$, we denote the sentence encoder output as $H_i \in \mathbb{R}^{n \times d_m}$. To leverage its context representation $G_i$, we define a gate to linearly combine the two kinds of representation via:

$$H'_i = \lambda \odot H_i + (1 - \lambda) \odot G_i, \quad (6)$$

where the gating weight is computed by

$$\lambda = \mathrm{sigmoid}\left([H_i, G_i] W_G\right),$$

where $W_G \in \mathbb{R}^{2d_m \times d_m}$ are model parameters.

Then we use $H'_i$ to replace $H_i$ as the input to the decoder. We point out that the global context encoder and the sentence encoder share the self-attention sub-layer and the feed-forward sub-layer. That is to say, compared to the standard Transformer, the only new parameters we introduce are those of the sentence representation sub-layers, the global context attention sub-layers, and the gate mechanism that combines the two kinds of representation in Eq. 6.
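The gate can be sketched in a few lines of NumPy. The sketch assumes a sigmoid gate computed from the concatenation of the sentence encoder output and the context representation, which is consistent with the $2d_m \times d_m$ shape of the gate parameters; which of the two representations is weighted by the gate and which by its complement is an assumption of this illustration. The `use_gate` flag is a hypothetical switch that mirrors turning the gate off for sentence-level translation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_combine(H_i, G_i, W_G, use_gate=True):
    """Combine sentence states H_i and context states G_i, both (n, d_m).

    W_G has shape (2 * d_m, d_m). With use_gate=False the context is
    ignored and H_i is passed to the decoder unchanged.
    """
    if not use_gate:
        return H_i
    lam = sigmoid(np.concatenate([H_i, G_i], axis=-1) @ W_G)  # (n, d_m)
    return lam * H_i + (1.0 - lam) * G_i
```

With all-zero gate parameters the gate equals 0.5 everywhere, so the output is simply the average of the two representations; training moves the gate away from this neutral point.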

Adapting Context-Aware Model to Sentence-Level Translation
In the first pre-training task, sentence-level translation is context-agnostic and does not require the global context encoder. Therefore, it only uses the sentence encoder and decoder, as shown in Figure 3(b). Moreover, we turn off the gate mechanism by feeding the sentence encoder output $H_i$ directly to the decoder. Since the sentence encoder and the global context encoder share the self-attention and feed-forward sub-layers, updating the model by sentence-level translation has a direct impact on the global context encoder too.

Joint Pre-training Process
As shown in our experimentation, we share the same vocabulary for pre-training tasks. To train the above two pre-training tasks with a single model, we follow the strategy used in Johnson et al. (2017) and add a preceding language tag to each source and target sentence.
Our joint pre-training on the two tasks falls into the paradigm of multi-task learning (MTL). During training, we take turns loading the training data of the two pre-training tasks. For example, we update the model parameters on a batch of training instances from the first task, then update them on a batch of training instances from the other task, and the process repeats.
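The alternating schedule and the language tags can be sketched as follows. This is a minimal illustration, not our training code: the helper names (`add_language_tag`, `alternate_batches`) and the tag string in the usage note are hypothetical, and real batches would hold tensors rather than the placeholder objects used here.

```python
from itertools import cycle


def add_language_tag(tokens, tag):
    """Prepend a language tag token to a tokenized sentence."""
    return [tag] + list(tokens)


def alternate_batches(task_batches, num_steps):
    """Yield (task_name, batch) pairs, switching task at every update step.

    task_batches maps a task name to a list of batches; each list is
    cycled so that a smaller task is simply revisited more often.
    """
    iterators = {name: cycle(batches) for name, batches in task_batches.items()}
    order = cycle(task_batches.keys())
    for _ in range(num_steps):
        name = next(order)
        yield name, next(iterators[name])
```

For instance, `add_language_tag(["hello", "world"], "<en>")` prepends a hypothetical English tag, and iterating over `alternate_batches({"sent": sent_batches, "doc": doc_batches}, num_steps)` interleaves one update per task in turn.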
Fine-tuning on Document-Level Parallel Dataset

Fine-tuning Tasks
Similar to the pre-training tasks, we define the following two fine-tuning tasks, from the sentence-level and document-level perspectives, respectively.
Sentence-level Translation. We first extract sentence pairs from the document-level parallel dataset for fine-tuning. This fine-tuning task enables the fine-tuned model to translate sentences. In fine-tuning, this task is processed in the same way as the sentence-level translation task in pre-training.
Document-level Translation. Given a parallel document (X, Y) with N sentence pairs (Xi, Yi), i = 1, ..., N, this fine-tuning task is to translate the source document X into the target document Y. In fine-tuning, this task takes parallel documents as input units and is processed in the same way as the document-level restoration task in pre-training.

Fine-tuning Process
The fine-tuning process is similar to the pre-training process in Section 2.3. Specifically, we add a preceding language tag to each sentence, and we alternately load batches of the two fine-tuning tasks.

Experimentation
To test the effect of our approach in leveraging the sentence-level parallel dataset and monolingual documents, we carry out experiments on Chinese-to-English (ZH-EN) and English-to-German (EN-DE) translation.

Experimental Settings
Pre-training data settings. The ZH-EN sentence-level parallel dataset contains 2.0M sentence pairs with 54.8M Chinese words and 60.8M English words (see footnote 4). We use the WMT14 EN-DE translation dataset as the EN-DE sentence-level parallel dataset, which consists of 4.4M sentence pairs. We use Chinese Gigaword (LDC2009T27) and English Gigaword (LDC2012T21) as the monolingual document datasets for ZH-EN and EN-DE translation, respectively. For efficient training, we split long documents into sub-documents with at most 30 sentences. We have 2.6M (7.3M) sub-documents with 24M (102M) sentences in total for Chinese (English). From the monolingual documents, we prepare training instances for the document-level restoration task and set the gap sentence ratio to 20%.

Footnote 4: The ZH-EN dataset consists of LDC2002E18, LDC2003E07, LDC2003E14, the news part of LDC2004T08, LDC2002T01, LDC2004T07, LDC2005T06, LDC2005T10, LDC2009T02.
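Splitting long documents can be sketched as below. The sketch assumes that a document is cut into consecutive, non-overlapping chunks in sentence order; the text above only fixes the maximum of 30 sentences per sub-document, so the chunking strategy and the function name `split_document` are assumptions made for illustration.

```python
def split_document(sentences, max_sents=30):
    """Cut a document into consecutive sub-documents of at most max_sents."""
    return [
        sentences[start:start + max_sents]
        for start in range(0, len(sentences), max_sents)
    ]
```

Under this scheme only the last sub-document of a document can be shorter than the limit.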
All Chinese sentences are segmented by Jieba, while all English and German sentences are tokenized by the Moses scripts (Koehn et al., 2007). For ZH-EN (EN-DE) translation, we merge the source and target sentences of the parallel dataset with the monolingual documents and segment words into sub-words by a BPE model with 30K (25K) operations (Sennrich et al., 2016).
Fine-tuning data settings. For ZH-EN, we have one translation task in the news domain. The document-level parallel training set includes 41K documents with 780K sentence pairs. We use the NIST MT 2006 dataset as the development set, and combine the NIST MT 2002, 2003, 2004, 2005 and 2008 datasets as the test set.
For EN-DE, we test three translation tasks in domains of TED talks, News-Commentary and Europarl.
• TED, which is from the IWSLT 2017 MT track (Cettolo et al., 2012). We combine test2016 and test2017 as our test set and use the rest as the development set.
• News, which is from the News Commentary v11 corpus. We use news-test2015 and news-test2016 as the development set and test set, respectively.
• Europarl, which is extracted from Europarl v7. The training, development and test sets are obtained by randomly splitting the corpus.
All of the above EN-DE document-level parallel datasets are downloaded from Maruf et al. (2019). As with the pre-training datasets, the pre-processing steps consist of word segmentation, tokenization and splitting of long documents. We then segment the words into sub-words using the BPE models trained on the pre-training datasets. See Appendix A for more statistics of the fine-tuning datasets.
Model settings. We use OpenNMT (Klein et al., 2017) as the implementation of Transformer and implement our models based on it. For all translation models, the numbers of layers in the context encoder, sentence encoder and decoder (i.e., $N_g$, $N_e$, and $N_d$ in Figure 3) are set to 6. The hidden size and the filter size are set to 512 and 2048, respectively. The number of heads in multi-head attention is 8 and the dropout rate is 0.1. In pre-training, we train the models for 500K steps on four V100 GPUs with a batch size of 8192. We use Adam (Kingma and Ba, 2015) with β1 = 0.9 and β2 = 0.98 for optimization, with a learning rate of 1 and 16K warm-up steps. In fine-tuning, we fine-tune the models for 200K steps on a single V100 GPU with a batch size of 8192, a learning rate of 0.3, and 4K warm-up steps. At inference, we set the beam size to 5.
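To give a sense of what a learning rate of 1 with 16K warm-up steps means in practice, the sketch below evaluates the inverse-square-root ("Noam") schedule of the original Transformer. That this particular schedule is used is an assumption of the illustration, since the settings above state only the base learning rate and the number of warm-up steps; the function name `noam_lr` and the reading of 16K as 16,000 are likewise assumptions.

```python
def noam_lr(step, base_lr=1.0, d_model=512, warmup=16000):
    """Inverse-square-root schedule with linear warm-up (assumed, see lead-in)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return base_lr * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

Under this assumption the effective learning rate rises linearly during warm-up, peaks at the last warm-up step, and then decays with the inverse square root of the step count, so the nominal value of 1 is only a scaling factor.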

Experimental Results
Main results. Table 1 shows the performance of our approach, where Ours-sent and Ours-doc indicate the performance achieved by our approach when we use sentences or documents as input units, respectively. In the scenario where neither the sentence-level parallel dataset nor monolingual documents are used, we directly train our models from scratch with the two fine-tuning tasks on the fine-tuning datasets. #2 and #3 in the table show that our model is capable of translating both sentences and documents. Interestingly, when we use sentences as translation units, our models (i.e., #2 Ours-sent) outperform the sentence-level Transformer baseline (i.e., #1, which uses sentences as input units in both training and inference) on all translation tasks, with average improvements of 1.36 BLEU and 1.72 Meteor. Moreover, when we use documents as translation units, our models (i.e., #3 Ours-doc) achieve further improvement by modeling document-level context. Compared to previous studies, our approach surpasses all context-aware baselines on the ZH-EN and EN-DE (TED) tasks and achieves the state of the art on average.
In the scenario where both the sentence-level parallel dataset and monolingual documents are used, similar performance trends also hold. For example, #5 Ours-sent significantly exceeds the Transformer baseline.

Ablation study. We take ZH-EN and EN-DE (News) translation as representatives to study the effect of leveraging the sentence-level parallel dataset and monolingual documents. Table 2 compares the performance on the test sets of ZH-EN and EN-DE (News) translation in different scenarios. From it, we have the following observations.
• Using either the sentence-level parallel dataset or monolingual documents helps translation for both the Transformer baselines and our context-aware models. However, in the presence of the sentence-level parallel dataset, the Transformer baselines fail to achieve higher performance with monolingual documents, as we observe performance drops from 46.99 BLEU to 46.30 on ZH-EN, and from 26.89 to 26.80 on EN-DE. In contrast, our models achieve the highest performance by leveraging the two resources. This suggests the effectiveness of our approach in employing the two resources.
• It is not surprising that the improvement is mainly contributed by the sentence-level parallel dataset, as the translation model is more important than the context encoder.
• Finally, our approach consistently outperforms the sentence-level Transformer in all scenarios. Encouragingly, the performance gap becomes even larger on ZH-EN when more resources are used.

Table 4: Accuracy (%) of discourse phenomena.

Discussion
Next, we use ZH-EN translation to further analyze how our approach affects translation performance. See Appendix B for parameter analysis and statistics of the pre-trained models.

Effect of Joint Fine-tuning
In Section 3 we alternate sentence-level translation and document-level translation in fine-tuning. Here we investigate the effect of including sentence-level translation as a fine-tuning task. Table 3 compares the performance with respect to different fine-tuning strategies and different input units at inference. When we use documents as input units at inference, the joint fine-tuning strategy provides no advantage. However, when the input units are sentences, the joint fine-tuning strategy outperforms the one that does not include sentence-level translation in fine-tuning.

Analysis of Discourse Phenomena
We also want to examine whether the proposed approach actually learns to utilize document context to resolve discourse inconsistencies. Following Voita et al. (2019b) and Zheng et al. (2020), we train the model on the same datasets and evaluate discourse phenomena on the English-Russian contrastive test sets of Voita et al. (2019b).
There are four test sets in the suite, covering deixis, lexical consistency, and ellipsis (inflection and verb phrase). Each test set contains groups of contrastive examples consisting of a positive translation with the correct discourse phenomenon and negative translations with incorrect phenomena. The goal is to determine whether a model is more likely to generate the correct translation than the incorrect variations. We summarize the results in Table 4, which shows that in different scenarios our models are better at resolving discourse inconsistencies than the context-agnostic baselines.
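The accuracy on a contrastive test set can be computed as in the following sketch. It assumes that each group has already been scored by the model, for example with sentence log-probabilities, and counts a group as correct only when the positive translation scores strictly higher than every negative one. The input format and the function name `contrastive_accuracy` are hypothetical.

```python
def contrastive_accuracy(groups):
    """Accuracy (%) over contrastive groups.

    groups: list of (positive_score, negative_scores) pairs, where the
    scores are model scores such as log-probabilities (higher is better).
    """
    correct = sum(
        1 for pos, negs in groups if all(pos > neg for neg in negs)
    )
    return 100.0 * correct / len(groups)
```

Since only the ranking inside each group matters, the scores do not need to be comparable across groups or length-normalized for this metric.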

Pronoun Translation
We follow Miculicich et al. (2018) and Tan et al. (2019) and evaluate coreference and anaphora using a reference-based metric, the accuracy of pronoun translation (Werlen and Popescu-Belis, 2017). Table 5 lists the performance of pronoun translation. From it we observe that our proposed approach improves the accuracy of pronoun translation.

Effect of Gap Sentence Ratio
An important hyper-parameter in the pre-training task of document-level restoration is the gap sentence ratio. A low ratio makes document-level restoration less challenging, while a high ratio causes more overlap in the global context. Table 6 shows that we achieve the best performance when the ratio is set to 20%.

Effect of Pre-training Objectives
As shown in Figure 2, we include two pre-training objectives in document-level restoration, i.e., CA-GSR and CA-MSR. To investigate the effect of CA-GSR, we use CA-MSR as the only objective in this pre-training task. In this way, S3 and S5 in Figure 2(a), for example, will be the masked versions of X3 and X5, respectively. Table 7 compares the performance when the pre-training task uses the CA-MSR objective alone or the combination of CA-GSR and CA-MSR. It shows that the combined objective achieves better performance than using CA-MSR alone.

Related Work
We describe related studies in the following two perspectives.

Context-Aware NMT
Cache/memory-based approaches (Tu et al., 2018; Kuang et al., 2018; Maruf and Haffari, 2018; Wang et al., 2017) store word/sentence translations from previous sentences for future sentence translation. Various approaches with an extra context encoder have been proposed to model either local context, e.g., previous sentences (Wang et al., 2017; Bawden et al., 2018; Voita et al., 2018, 2019b; Yang et al., 2019; Huo et al., 2020), or the entire document (Maruf and Haffari, 2018; Mace and Servan, 2019; Maruf et al., 2019; Tan et al., 2019; Zheng et al., 2020; Kang et al., 2020). Besides, there have been several attempts to improve context-aware NMT with monolingual document data. To make translations more coherent within a document, Voita et al. (2019a) propose DocRepair, which is trained on monolingual target-language documents to correct inconsistencies in sentence-level translations, while Yu et al. (2020) train a context-aware language model to rerank sentence-level translations. Finally, Junczys-Dowmunt (2019) uses source-side monolingual documents to explore multi-task training via the BERT objective on the encoder. That work simply concatenates the sentences within a document into a long sequence, which is different from our approach.

Pre-training for Document-Level NMT
While there are substantial studies on improving sentence-level NMT with pre-training, we limit ourselves here to pre-training for document-level (context-aware) NMT. BART is a denoising auto-encoder model which learns to reconstruct the original document from a noised version. Inspired by BART, mBART is a model trained on a mixed corpus containing monolingual documents of different languages. Both BART and mBART concatenate the sentences in one document into a long sequence, and thus fall into a standard sequence-to-sequence (seq2seq) framework. This is very different from our cross-task pre-training, in which we combine both context-agnostic learning and context-aware learning in a single model.

Conclusion
In order to leverage both large-scale sentence-level parallel datasets and source-side monolingual documents for context-aware NMT, in this paper we have proposed a novel cross-task pre-training approach, which simultaneously learns to translate a sentence from the source language to the target language and to restore a document from a deliberately noised version to the original. We then fine-tune the pre-trained models on a document-level parallel dataset from both sentence-level and document-level perspectives. Experimental results on multiple document-level translation tasks have demonstrated the effectiveness of our approach. Finally, we also provide insights into how context-aware NMT benefits from our approach.