DOCmT5: Document-Level Pretraining of Multilingual Language Models

In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence language model pretrained with large-scale parallel documents. While previous approaches have focused on leveraging sentence-level parallel data, we aim to build a general-purpose pretrained model that can understand and generate long documents. We propose a simple and effective pretraining objective, Document reordering Machine Translation (DrMT), in which the model must translate input documents whose sentences have been shuffled and whose spans have been masked. DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks, including over 12 BLEU points for seen-language-pair document-level MT, over 7 BLEU points for unseen-language-pair document-level MT, and over 3 ROUGE-1 points for seen-language-pair cross-lingual summarization. We achieve state-of-the-art (SOTA) results on the WMT20 De-En and IWSLT15 Zh-En document translation tasks. We also conduct extensive analysis of various factors in document pretraining, including (1) the effects of pretraining data quality and (2) the effects of combining mono-lingual and cross-lingual pretraining. We plan to make our model checkpoints publicly available.


Introduction
Multilingual pretrained language models have been useful for a wide variety of NLP tasks. Pretraining on large-scale multilingual corpora facilitates transfer across languages and benefits low-resource languages.
Previously, sentence-level or word-level cross-lingual objectives have been considered for pretraining large language models (LLMs), but little effort has been put into document-level objectives for pretraining. In this work, we propose a multilingual sequence-to-sequence language model pretrained with cross-lingual, structure-aware, document-level objectives. DOCmT5 is built on top of mT5 (Xue et al., 2021) and is further trained with parallel documents across multiple language pairs. To encourage the model to gain a deep understanding of document structure and cross-lingual relationships, we consider a challenging translation scenario as a second-stage pretraining task: the input sentences are shuffled into a random order and random spans are masked. To translate the input document effectively, the model needs to reconstruct the document in its original order, which teaches it sentence relationships, and also recover the masked spans. This objective is effective on document-level generation tasks such as machine translation and cross-lingual summarization, outperforming previous best systems.
To enable cross-lingual pretraining at a large scale, we created a synthetic parallel document corpus. To avoid expensive human annotation, we use off-the-shelf neural machine translation (NMT) models to translate the documents in the mC4 corpus (Xue et al., 2021) into English. In our experimental results, this corpus is more effective for pretraining than existing large-scale automatically aligned corpora (e.g., CCAligned (El-Kishky et al., 2020)).
We also conduct extensive ablation studies and provide insights on document-level pretraining. We show that simple document-level pretraining is more useful than sentence-level pretraining for generative tasks. We also show that data quality matters when performing multilingual document pretraining. Finally, we don't observe improvements from combining mono-lingual and cross-lingual objectives when evaluating on two document-level translation tasks.
In summary, this paper makes the following contributions:
• We build a state-of-the-art multilingual document-level sequence-to-sequence language model pretrained with a structure-aware cross-lingual objective.

Figure 1: Overview of our proposed Document-Reordering Machine Translation (DrMT) pretraining. For each input document, the sentences are shuffled into a random order and then randomly selected spans are masked. The prediction target of DOCmT5 is to generate the translation of the input document.
• Our proposed model achieves strong results on cross-lingual summarization and document-level machine translation for seen and unseen language pairs, including SOTA on the WMT20 De-En and IWSLT2015 Zh-En tasks.
• We also conduct extensive experiments to study what works and what doesn't work in document-level multilingual pretraining.

Multilingual Pretraining
Multilingual pretrained models provide a set of parameters that can be quickly finetuned for different downstream tasks (Ruder et al., 2021). Popular examples include mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), which pretrain with a masked language modeling objective using only monolingual data, and mT5 (Xue et al., 2021) and mBART, which use a sequence-to-sequence language model and pretrain on large-scale mono-lingual corpora across many languages. Our proposed model uses mT5 as a backbone and further utilizes pseudo-parallel documents to learn better cross-lingual representations.
To capture cross-lingual information, translation language modeling (Conneau and Lample, 2019) and its variants (VECO (Luo et al., 2021), ERNIE-M (Ouyang et al., 2021)) were proposed to leverage sentence-level parallel data. AMBER (Hu et al., 2021) uses two explicit alignment objectives that align representations at the word and sentence level. HICTL pretrains on parallel sentences with word- and sentence-level contrastive losses. mBART50 (Tang et al., 2021), mT6 (Chi et al., 2021) and nmT5 (Kale et al., 2021) focus on a second stage of pretraining using large-scale sentence-level translation data. Our model goes beyond the sentence and focuses on document-level understanding.
While sentence-level pretraining has received a lot of attention, document-level pretraining has been under-studied. Unicoder (Huang et al., 2019) replaces alternating sentences in a document with translations and pretrains with masked language modeling. MARGE adopts the retriever-generator paradigm and pretrains with an unsupervised translation objective on automatically retrieved documents. M2M100 pretrains sequence-to-sequence language models on automatically mined parallel sentences and documents. Our model considers a challenging supervised translation objective on parallel documents.

Multilingual Parallel Data Sources
OPUS-100 (Aharoni et al., 2019; Zhang et al., 2020a) is collected from a variety of domains and is human labeled, but it is at the sentence level. ML50 (Tang et al., 2021) is collected from different machine translation challenges and other publicly available corpora such as OPUS, but most of the data is at the sentence level. CCMatrix (Schwenk et al., 2021b) and WikiMatrix (Schwenk et al., 2021a) use multilingual sentence embeddings to automatically mine parallel sentences. Perhaps the closest to our proposed corpus is CCAligned (El-Kishky et al., 2020), which is also automatically mined but whose quality is in question (Kreutzer et al., 2021). Our MTmC4 corpus does not require human annotation and was instead produced by NMT models.

Document-level Machine Translation
There are different ways to incorporate document context into a translation model. To name a few, previous works have explored concatenation-based methods (Tiedemann and Scherrer, 2017; Junczys-Dowmunt, 2019; Sun et al., 2020; Lopes et al., 2020), multi-source context encoders (Zhang et al., 2018; Jean et al., 2017), and hierarchical networks (Zheng et al., 2020; Zhang et al., 2020b). This line of research focuses on architectural modifications of neural translation models. We focus instead on designing a generalized pretraining objective; furthermore, our model can be finetuned for various downstream tasks (e.g., summarization) without task-specific changes.

mC4
For pretraining, we use mC4 (Xue et al., 2021), a large-scale corpus extracted from Common Crawl that covers over 100 languages.

MTmC4: Creating Parallel Documents with mC4
To create large-scale parallel documents, we take mC4 as a starting point and use in-house NMT models to translate documents from 25 languages into English. Each sentence in each document is translated independently. For each language, we sample up to 1 million documents from mC4. Detailed data statistics for all the languages can be found in Table 2.
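The corpus construction above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: `translate_sentence` stands in for the in-house NMT models, which are not public, and any sentence-level MT system could be substituted.

```python
import random

def build_mtmc4(documents, translate_sentence, max_docs=1_000_000, seed=0):
    """Create pseudo-parallel documents: sample up to max_docs documents
    per language and translate each sentence independently into English.

    documents: list of documents, each a list of sentence strings.
    translate_sentence: stand-in for a sentence-level NMT model.
    """
    rng = random.Random(seed)
    sampled = documents if len(documents) <= max_docs else rng.sample(documents, max_docs)
    pairs = []
    for doc in sampled:
        # Each sentence is translated independently, as in MTmC4.
        translation = [translate_sentence(s) for s in doc]
        pairs.append((doc, translation))
    return pairs
```

The output is a list of (source document, pseudo-translation) pairs, one such corpus per source language.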

Document Reordering Machine Translation (DrMT)

We start by introducing two related pretraining objectives:
• NMT Pretraining: Tang et al. (2021) and Kale et al. (2021) proposed to perform a second stage of pretraining using sentence-level MT data. The objective here is to perform sentence-level translation without any other changes to the input.
• Monolingual Document Reordering (Dr) Pretraining: This objective, proposed by mBART, changes the order of the sentences in each document. This is then followed by the original span corruption objective in T5. The decoder is required to generate the original document in order.
We combine these two objectives and propose DrMT. In DrMT, we introduce two types of noise on the input: (i) sentences in the document are randomly shuffled and (ii) randomly sampled spans are masked. In order to translate the content correctly, the model first needs to decipher the corrupted document, which forces it to gain a deep understanding of the document structure. More formally, suppose we have $N$ language pairs, each with a set of parallel documents; the whole collection of document pairs is $\mathcal{D} = \{\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_N\}$, and a pair $(x, y)$ is an instance in one of the document sets $\mathcal{D}_i$. The overall learning objective is to maximize the likelihood of $y$ given the corrupted input $g(x)$, where $g$ denotes the shuffle-and-mask corruption:

$$\mathcal{L} = \sum_{i=1}^{N} \sum_{(x, y) \in \mathcal{D}_i} \log P(y \mid g(x)) \quad (1)$$
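The corruption function described above can be sketched as follows. This is an illustrative simplification, assuming whitespace tokenization and T5-style sentinel tokens; the paper specifies 15% of tokens masked with a mean span length of 3, but the exact sampling procedure is not ours.

```python
import random

def drmt_corrupt(src_sentences, mask_ratio=0.15, mean_span_len=3, seed=0):
    """Sketch of DrMT input corruption: (i) shuffle the sentences of the
    source document, (ii) replace randomly sampled token spans with
    T5-style sentinel tokens (<extra_id_n>)."""
    rng = random.Random(seed)
    sentences = list(src_sentences)
    rng.shuffle(sentences)                       # (i) sentence shuffling
    tokens = " ".join(sentences).split()
    budget = max(1, int(len(tokens) * mask_ratio))  # tokens left to mask
    out, i, sid = [], 0, 0
    while i < len(tokens):
        remaining = len(tokens) - i
        # Mask with probability mask_ratio; force masking near the end
        # so the token budget is always spent.
        if budget > 0 and (rng.random() < mask_ratio or remaining <= budget):
            span = min(mean_span_len, budget, remaining)
            out.append(f"<extra_id_{sid}>")      # (ii) span masking
            sid += 1
            i += span
            budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```

A DrMT training pair is then (drmt_corrupt(x), y), where y is the clean translation of the uncorrupted document x.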

DOCmT5
We use mT5 as the backbone model. mT5 is a sequence-to-sequence language model pretrained with the span corruption objective, in which random spans in the input are masked and the decoder is required to reconstruct the masked spans (see Raffel et al. (2020) and Xue et al. (2021) for further details). Our system, DOCmT5, incorporates a second-stage pretraining with a structure-aware cross-lingual objective (Section 3.2) on pseudo-parallel documents. Detailed comparisons with previous multilingual language models can be found in Table 1. We provide two variants of DOCmT5, in both Base and Large model settings:
• DOCmT5-5 This model is pretrained with 5 languages: {De, Ru, Tr, Vi, Es}. For all of the pretraining objective baselines in this paper, we pretrain with this set of languages unless specified otherwise.
• DOCmT5-25 This model is pretrained with 25 languages. We show the full list of languages and their sizes in Table 2.

Implementation Details
We use the mT5-Base1 and mT5-Large2 checkpoints at 1M steps as our pretrained models. We perform a second stage of pretraining for an additional 0.5M steps using batches of 256 examples, each of max length 1024. The learning rate follows an inverse square root scheduler as defined in T5, with the learning rate set to 1/√n, where n is the number of training steps. We use the same span corruption objective as T5, with 15% of random tokens masked and an average noise span length of 3. For finetuning, we use a constant learning rate of 0.001 and a dropout rate of 0.1 for all tasks until convergence. We adopt greedy decoding during inference.

1 https://console.cloud.google.com/storage/browser/t5-data/pretrained_models/mt5/base/
2 https://console.cloud.google.com/storage/browser/t5-data/pretrained_models/mt5/large/
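The learning-rate schedule above can be written out concretely. The paper states lr = 1/√n; T5's published implementation additionally clips the step count to a warmup constant (lr = 1/√max(n, k), with k = 10^4), which we assume here.

```python
import math

def inv_sqrt_lr(step, warmup=10_000):
    """Inverse square root learning-rate schedule as defined in T5.

    Returns 1/sqrt(max(step, warmup)): constant at 1/sqrt(warmup) during
    warmup, then decaying as 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup))
```

For example, the rate is 0.01 throughout the first 10k steps and decays to 0.001 by step 1M.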

Baselines
• Second-Stage Pretraining on 5 Languages Language models pretrained with huge numbers of languages suffer from the curse of multilinguality. In order to make a fair comparison, we create a strong mT5 baseline by continuing to pretrain on the same 5 languages of mC4 as in DOCmT5-5, for the same number of steps, using the original span corruption objective of mT5. Models pretrained with this objective are denoted cont-5langs.

• Monolingual Document Reordering (Dr) We briefly describe this objective in Section 3.2. We use the mC4 corpus for this pretraining objective. Models pretrained with this objective are denoted Dr (Document Reordering).
• Document TLM (DocTLM) In Conneau and Lample (2019), the authors propose the translation language modeling (TLM) objective, which concatenates parallel sentences and applies masked language modeling to learn cross-lingual knowledge. Here we extend it to the document level by concatenating parallel documents. Instead of masking single tokens, we follow the span corruption objective in T5 and mask consecutive spans. The models are pretrained with this objective on MTmC4.
• Document NMT (DocNMT) We consider standard document-level machine translation as a pretraining objective: the source document is the input and the target translation is the output. We use MTmC4 for this pretraining objective.
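The two cross-lingual baselines differ only in how a training example is built from a parallel document pair. A minimal sketch, with a toy single-span corruption standing in for T5's multi-span routine (names and the fixed span position are illustrative, not from the paper):

```python
def span_corrupt(text, span=(1, 3)):
    """Toy single-span corruption: mask tokens span[0]:span[1] with a
    sentinel. Real T5 masks many random spans covering ~15% of tokens."""
    toks = text.split()
    inp = toks[:span[0]] + ["<extra_id_0>"] + toks[span[1]:]
    tgt = ["<extra_id_0>"] + toks[span[0]:span[1]] + ["<extra_id_1>"]
    return " ".join(inp), " ".join(tgt)

def doc_tlm_example(src_doc, tgt_doc):
    """DocTLM: concatenate the parallel documents into one sequence,
    then apply span corruption; the target is the masked spans."""
    return span_corrupt(src_doc + " " + tgt_doc)

def doc_nmt_example(src_doc, tgt_doc):
    """DocNMT: plain document-level translation; input is the source
    document, target is its translation."""
    return src_doc, tgt_doc
```

DrMT combines the two: it corrupts the source side as in DocTLM but keeps the full translation as the target, as in DocNMT.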

Cross-Lingual Summarization
We evaluate DOCmT5 on cross-lingual summarization, as it is challenging for the model to summarize a long document and translate the salient information at the same time. We use Wikilingua, a cross-lingual summarization dataset in which a document from an arbitrary language must be summarized in English. We adopt the GEM (Gehrmann et al., 2021) version, where the data is re-split to avoid train-test overlap between languages. We use a special prefix for cross-lingual summarization: "Summarize X to Y", where X and Y are the source and target language names respectively.

Results on Seen Language Pairs
We show the finetuning results for language pairs that are included in the second stage of pretraining in Table 3. We use the same four languages as in Wikilingua's original release: {Es, Ru, Tr, Vi}.
The Dr objective brings substantial improvements over cont-5langs in all four languages, justifying the importance of structure-aware objectives. As for cross-lingual objectives, DocTLM is better than DocNMT in all languages except Russian. DOCmT5-5 substantially outperforms DocNMT and DocTLM, showing that our proposed pretraining objective leads to improved cross-lingual learning. The results of DOCmT5-25 are inferior to DOCmT5-5, possibly due to capacity dilution (Arivazhagan et al., 2019). As we increase the capacity, we see that DOCmT5-25-Large outperforms DOCmT5-5-Large. DOCmT5-25-Large is the best overall model, outperforming the strong prior system, mBART.

Results on Unseen Language Pairs
We show the finetuning results for language pairs that are not in the second stage of pretraining in Table 4. We use three languages: {Fr, Id, Hi}.3 Once again, we see that the Dr objective brings substantial improvements over cont-5langs. Surprisingly, without directly pretraining on the same language pairs, DOCmT5-5 leads to substantial improvements over strong baselines. This shows that our pretraining objectives are able to generalize to other languages. DOCmT5-25 pretrains on French and Hindi but not Indonesian, and hence we observe improvements in average results over DOCmT5-5. The improvements of DOCmT5 are less substantial and sometimes even hurt performance in high-resource languages. DOCmT5-25-Large obtains the best results in all 3 languages except French.

Document-Level Machine Translation
We evaluate DOCmT5 on document translation. We split each document into chunks with a max length of 512 tokens. During inference, the decoded chunks are concatenated together to form the final document. We use the prefix "Translate X to Y" for translation, where X and Y are the source and target language names respectively.
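The chunked inference procedure above can be sketched as follows. Whitespace tokenization and the prefix placement are illustrative simplifications of the actual subword pipeline, and `translate_chunk` stands in for a decoding call to the model.

```python
def translate_document(document, translate_chunk, max_len=512,
                       src="German", tgt="English"):
    """Chunked document translation: split the document into chunks of
    at most max_len tokens, prepend the task prefix, translate each
    chunk, and concatenate the decoded outputs in order."""
    tokens = document.split()
    outputs = []
    for i in range(0, len(tokens), max_len):
        chunk = " ".join(tokens[i:i + max_len])
        outputs.append(translate_chunk(f"Translate {src} to {tgt}: {chunk}"))
    return " ".join(outputs)
```

Because each chunk is decoded independently, any cross-chunk coherence must come from what the model learned during document-level pretraining.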

Seen Language Pair: WMT20 De-En
WMT20 De-En is a document-level machine translation task. We use parallel training data from WMT20 without using additional monolingual data. From the results in Table 5, we see that Dr provides large gains. DocNMT outperforms DocTLM, probably because DocNMT is closer to the document-level translation task.

3 We choose French to study the transfer ability of cross-lingual models on a high-resource, same-script (Latin) language. Indonesian is for studying a high-resource, different-script language. Hindi is for studying a low-resource, different-script language.

Table 6: Unseen language pair results on IWSLT 2015 Zh-En. Chinese is in the second-stage pretraining language set of DOCmT5-25 but not in that of DOCmT5-5. DOCmT5-25-Large achieves SOTA.

DOCmT5-5 once again outperforms Dr and the other strong cross-lingual baselines. DOCmT5-5 is better than DOCmT5-25, again because of capacity dilution, as noted in Aharoni et al. (2019). As expected, DOCmT5-5-Large outperforms DOCmT5-5 and, to the best of our knowledge, achieves SOTA. Note that previous systems use one or more of the following techniques: additional monolingual data, back-translation, ensembling, or re-ranking tailored to a single translation pair.

Unseen Language Pair: IWSLT 2015 Zh-En

We use IWSLT 2015 Zh-En, another document-level machine translation task, to examine the multilingual transferability of DOCmT5 when the target transfer language (Chinese in this case) has a very different script. Chinese is only in the first-stage pretraining of mT5 and not in our second-stage pretraining. We use parallel training data from IWSLT15 without using additional monolingual data. Following HAN (Werlen et al., 2018), we use the 2010-2013 TED talks as the test set. The results are in Table 6. DOCmT5-5 outperforms the strong cross-lingual and mono-lingual baselines, demonstrating impressive transfer capability. DOCmT5-25 includes Chinese as one of its second-stage pretraining languages and therefore obtains better numbers than DOCmT5-5. Unsurprisingly, large models are better than their corresponding base models. To the best of our knowledge, DOCmT5-25-Large achieves SOTA on this task. We qualitatively analyze the translations of different systems in Appendix A.

Document Translation Without Finetuning
We further show that DOCmT5 is able to perform document translation without finetuning, i.e., we evaluate the model right after second-stage pretraining without any finetuning on task-specific data. We show the results in Table 7. While the monolingual pretrained models completely fail to produce meaningful translations, DOCmT5-5 achieves over 20 BLEU points on De-En and 15 on Ru-En. Not surprisingly, DOCmT5-5-Large further improves these to over 35 and 29 respectively. DOCmT5-25 includes Pl-En and Ja-En in the second-stage pretraining and therefore obtains competitive results on these two language pairs with either the base or large model. Although DOCmT5-5 is not pretrained on Pl-En, the large model obtains over 14 BLEU on this task. One hypothesis is that Polish uses the Latin script and shares common subwords with German and Spanish, allowing our model to transfer knowledge across languages. On the other hand, the DOCmT5-5-Base model fails to produce meaningful translations for Pl-En. This shows the importance of model size when performing multilingual pretraining. The best model is DocNMT, which obtains over 40 BLEU points on both De-En and Ru-En, outperforming DOCmT5-5 and DOCmT5-25. This is reasonable, because DOCmT5 shuffles documents in pretraining, which is misaligned with the document translation task inputs. The impressive performance of both DocNMT and DOCmT5 shows that our MTmC4 corpus is of very high quality and is likely better than the parallel data provided by the specific tasks in question. Further analysis of the quality of this data will be an interesting avenue for future work.

Are Document-level Models Better Than Sentence-level Models?
To demonstrate the benefits of pretraining with longer context, we pretrain mT5 using translation language modeling (TLM) on five languages ({De, Es, Tr, Vi, Ru}) with two different input formats. In DocTLM, we concatenate the parallel documents into a single training sequence. For SenTLM, we break the document down into individual sentences, find the alignments in the parallel document pair, and concatenate each aligned sentence pair into a training sequence. We finetune these second-stage pretrained models on Wikilingua and WMT20 De-En. The results are shown in Figure 2 and Table 8. Document-level models offer small improvements on summarization and very significant improvements on document-level translation, showing that the longer context is indeed useful.

Effect of Data Quality in Second-stage Pretraining
In our experiments, we observe big differences between different parallel corpora. We compare against the CCAligned corpus, a large automatically mined corpus from Common Crawl which has been found to be very noisy (Kreutzer et al., 2021). In contrast, MTmC4 is produced using high-quality translation systems. We pretrain mT5-Base on five languages ({De, Es, Tr, Vi, Ru}) with these two corpora using DocNMT and DocTLM. We show the Wikilingua results in Figure 3 and the WMT20 De-En results in Figure 4. Using our curated MTmC4 is consistently better, regardless of pretraining objective or task.

Does Combining Mono-Lingual and Cross-Lingual Pretraining Help?

Here we examine whether combining mono-lingual and cross-lingual objectives helps. We try two different continual pretraining strategies for combining Dr and DrMT, using five languages: {De, Ru, Tr, Vi, Es}. (i) Dr then DrMT: we first pretrain with Dr on mC4 for 0.5M steps and then pretrain with DrMT on MTmC4 for 0.5M steps. (ii) Dr + DrMT: we mix the two objectives at a 50/50 ratio and pretrain for 0.5M steps. In Table 9, we show that (i) slightly improves over DrMT alone on both tasks, while (ii) slightly improves on WMT20 De-En but seems to hurt performance on IWSLT15 Zh-En.

How Many Pretraining Steps Are Required for DrMT?
To answer this question, we take different pretraining checkpoints of DOCmT5-5 and DOCmT5-25 and finetune them on WMT20 De-En. The results are shown in Figure 5. After 50k steps of pretraining with DrMT, both systems outperform cont-5langs. After 300k steps, both systems roughly converge and perform similarly.

Conclusion
In this paper, we presented DOCmT5, a novel document-level multilingual pretrained model. Our proposed objective, DrMT, is simple and effective, and leads to large gains over strong baselines (e.g., mBART and MARGE) on cross-lingual summarization and document-level translation. DOCmT5 achieves SOTA on two competitive document-level translation tasks: WMT20 De-En and IWSLT15 Zh-En. We further analyzed various factors that contribute to successful document-level pretraining. We plan to release the pretrained model to facilitate future work on document-level language understanding.

Appendices A Analysis of Document Translation
We take a deeper look at the translations produced by various systems to understand what makes DOCmT5 better. We show an example in Figure 6, comparing the best system (DOCmT5-25-Large) with the cont-5langs baseline. We observe that DOCmT5 handles verb tenses better than the baseline, producing more coherent sentences (red-colored text). Additionally, DOCmT5 handles compositional sentences more elegantly, instead of just using "and" (blue-colored text). Finally, we observe that cont-5langs often makes minor translation mistakes, while DOCmT5 makes far fewer of them.
And when I was 11 years old, I remember waking up one morning to the sound of a happy voice in the house. My father was listening to the BBC on his little gray radio. He had a smile on his face, which is rare, because most of the news was depressing. "The Taliban are gone!" My father shouted. I didn't know what that meant, but I could see that my father was very, very happy. "You can go to a real school now," he said. And I will never forget that morning. A real school. When I was six years old, the Taliban occupied Afghanistan and made it illegal for girls to go to school. So for the next five years, I was a woman in a man's suit and went to a secret school with my sister, who was not allowed to go out alone. This was the only way we were educated. We had to go in different directions every day so no one would suspect where we were going. ...
And I was 11 years old, and I remember awakefully waking up in the morning and hearing the familiar sound. My father was listening to the BBC news on his little radio. He was smiling, and it was rare, because most of the news was going to frustrate him. "Taliban go." The father went out. I don't know what that meant, but I can see that the father was very, very happy. "You can go to a real school now." He said. I'll never forget that morning. A real school. And I was six years old, and Taliban took Afghanistan and banned girls' schooling. So five years after that, my chick went to a secret school with my sister. And she wasn't allowed to go on a trip. It was the only way that we were educated. We walked on different roads every day so that nobody could suspect where we were. ...
When I was 11, I remember waking up one morning to the sound of joy in my house. My father was listening to BBC News on his small, gray radio. There was a big smile on his face which was unusual then, because the news mostly depressed him. "The Taliban are gone!" my father shouted. I didn't know what it meant, but I could see that my father was very, very happy. "You can go to a real school now," he said. A morning that I will never forget. A real school. You see, I was six when the Taliban took over Afghanistan and made it illegal for girls to go to school. So for the next five years, I dressed as a boy to escort my older sister, who was no longer allowed to be outside alone, to a secret school. It was the only way we both could be educated. Each day, we took a different route so that no one would suspect where we were going. ...

Panels: Source Document · DOCmT5-25 translation · mT5 translation · Target Translation

Figure 6: A comparison example of Zh-En document translation. DOCmT5 is able to produce consistent verb tenses while the mT5 baseline fails. DOCmT5 also produces longer and more coherent sentences. Best viewed in color.