TRIP: Accelerating Document-level Multilingual Pre-training via Triangular Document-level Pre-training on Parallel Data Triplets



Introduction
Conventional multilingual pre-training has achieved promising results on machine translation (Liu et al., 2020) and cross-lingual classification (Xue et al., 2021). These pre-training paradigms usually rely on monolingual corpora in many different languages, with denoising objectives such as sentence permutation and span masking (Liu et al., 2020; Lewis et al., 2020b). Following the argument that the purely unsupervised scenario is not strictly realistic for cross-lingual learning (Artetxe et al., 2020), multilingual pre-training advanced into a supervised setting that uses sentence-level bilingual translation pairs (Chi et al., 2021; Reid and Artetxe, 2022) to provide a stronger pre-training signal. Among these pioneering works, document-level multilingual pre-training with parallel data remains an understudied topic. This direction is particularly significant for tasks that necessitate contextual comprehension, such as document-level machine translation and cross-lingual summarization. As a workaround, DOCmT5 (Lee et al., 2022) resorts to synthetic bilingual translation pairs to scale up document-level multilingual pre-training.
In addition to the lack of study of document-level multilingual pre-training with parallel data, prior works have also overlooked the value of trilingual parallel data for multilingual pre-training. Compared to bilingual parallel data, trilingual parallel data is expected to better capture linguistic clues and coherence across languages, such as past tense and gendered expressions, which can strengthen pre-training for document-level cross-lingual understanding and help resolve cross-lingual ambiguities.
To this end, we present TRIP, a document-level multilingual pre-training method that uses trilingual parallel corpora. Because there is no publicly available document-level trilingual corpus, we propose a novel method to construct trilingual document pairs from document-level bilingual corpora. Subsequently, we augment conventional multilingual pre-training by (i) Grafting two documents in two different languages into one mixed document, and (ii) predicting the remaining language as the reference translation. We conduct experiments on document-level machine translation on TED Talks (Cettolo et al., 2015), the News benchmark (News-commentary), and Europarl (Koehn, 2005), and on cross-lingual abstractive summarization on Wikilingua (Ladhak et al., 2020; Gehrmann et al., 2021). We find that TRIP clearly improves over previous multilingual pre-training paradigms that use monolingual and bilingual objectives (Lee et al., 2022), and achieves strong SOTA results on both tasks.
In summary, we make three key contributions:
• TRIP proposes a novel trilingual pre-training objective through Grafting for multilingual pre-training, along with a novel method to construct trilingual data from parallel corpora.
• TRIP yields SOTA scores on both multilingual document-level MT and cross-lingual abstractive summarization.
• We conduct in-depth analyses of document-level cross-lingual understanding and compare TRIP to commercial systems.

Triangular Document-level Pre-training
We start by introducing the conventional methodologies previously used by monolingual and bilingual objectives for multilingual pre-training:
• Denoising Pre-training: Sentence permutation (Liu et al., 2020) and span corruption (Xue et al., 2021) are effective denoising pre-training objectives for document-level multilingual pre-training.
• Translation Pre-training: Making use of sentence-level translation pairs is a bilingual pre-training strategy for multilingual models (Kale et al., 2021; Tang et al., 2021).

Constructing a Trilingual Objective
In comparison, TRIP is the first in the field to introduce a trilingual objective for multilingual pre-training.
The core of making better use of trilingual data is Grafting: splitting two documents that are written in two different languages but share the same meaning in half, and concatenating one half from each to form a new document that retains the same meaning but is written in two different languages. TRIP then applies sentence permutation and span corruption to the Grafted documents.
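To make the pipeline concrete, below is a minimal Python sketch of this noise procedure. All function names are hypothetical, documents are represented as sentence-aligned lists of sentence strings of equal length, and the span corruption is simplified to random token masking rather than the sentinel-based span replacement used by mT5-style models.

```python
import random

def graft(doc_a, doc_b):
    """Grafting: keep the first half of doc_a and the second half of
    doc_b; because the documents are parallel, the result retains the
    same meaning but is written in two languages."""
    half = len(doc_a) // 2
    return doc_a[:half] + doc_b[half:]

def corrupt_spans(sentences, mask_token="<mask>", ratio=0.15):
    """Simplified span corruption: mask a random subset of tokens."""
    corrupted = []
    for sent in sentences:
        tokens = sent.split()
        n_mask = int(len(tokens) * ratio)
        for i in random.sample(range(len(tokens)), n_mask):
            tokens[i] = mask_token
        corrupted.append(" ".join(tokens))
    return corrupted

def trip_noise(doc_a, doc_b):
    """The noise function T: Grafting, then sentence permutation,
    then (simplified) span corruption."""
    grafted = graft(doc_a, doc_b)
    random.shuffle(grafted)        # sentence permutation
    return corrupt_spans(grafted)  # span corruption

# Toy usage: four parallel sentences per document in two languages;
# the third language's document serves as the reference target.
zh_doc = ["句子一 。", "句子二 。", "句子三 。", "句子四 。"]
en_doc = ["Sentence one .", "Sentence two .", "Sentence three .", "Sentence four ."]
model_input = trip_noise(zh_doc, en_doc)
```

With three languages, each of the three possible input combinations leaves one language's document as the target, matching the three Grafting cases shown in Figure 1.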
Conventional monolingual and bilingual pre-training objectives (Liu et al., 2020; Reid and Artetxe, 2022) overlooked the value of exploiting such linguistic clues across languages. In contrast, TRIP fuses authentic trilingual data, in which linguistic clues such as past tense and gendered nouns are usually preserved.
Figure 1 illustrates how TRIP makes use of linguistic clues in trilingual data. Given three documents with the same meaning written in Chinese, Japanese, and English, two of the documents are split and concatenated. The concatenation is randomly permuted at the sentence level, and the remaining unchanged document is used as the reference translation. Here, Chinese is tenseless, and TRIP effectively fuses the past-tense clues carried by the Japanese and English text into the Chinese text to resolve cross-lingual ambiguities.
Table 1 presents the characteristics of TRIP compared to existing methods. We report whether each model uses trilingual document pairs for pre-training, and whether document-level tasks such as document-level machine translation or abstractive summarization are reported in its original paper. To the best of our knowledge, this is the first paper in our field to mine and use trilingual document pairs for multilingual pre-training, and the first work that features Grafting.
More formally, we first denote by N the number of training document pairs in trilingual translation triplets (x_1, x_2, x_3) in a pre-training corpus D. Given a Seq2Seq generation model (Sutskever et al., 2014) with parameters θ, TRIP optimizes the likelihood

$$\mathcal{L} = \sum_{n=1}^{N} \log P\big(x_3^{(n)} \mid \mathcal{T}(x_1^{(n)}, x_2^{(n)}); \theta\big),$$

where each of the three languages can in turn serve as the target (Figure 1), and where we define T as a novel operation that takes two documents in different languages as input and applies the operations in sequence: splitting in half and concatenation (Grafting), followed by sentence permutation and span corruption.

Creating Trilingual Document Pairs

As there is no public corpus with trilingual document pairs, TRIP creates MTDD (Microsoft Trilingual Document Dataset), a high-quality trilingual parallel corpus with document translation pairs across 67 languages, 4,422 bilingual directions, and 99,628 trilingual combinations. The corpus is sourced from high-quality news documents scraped from an in-house website and timestamped from April 2021 to July 2022. The procedure consists of two steps: (i) creating bilingual document pairs and (ii) creating trilingual document pairs from the bilingual document pairs. To obtain bilingual document pairs, we follow ParaCrawl (Bañón et al., 2020) and translate all documents into English using a lightweight word-based machine translation model. The resulting translations are used for pairing only: documents are paired and thresholded with similarity scores such as tf-idf computed on their English translations (Bañón et al., 2020). To improve efficiency, we attempt to pair documents only if they are timestamped within a small window such as one week; the motivation is that news with the same content in different languages is, with high probability, reported within a small timestamp window. The resulting document pairs are further thresholded and filtered with LASER (Artetxe and Schwenk, 2018), a multilingual sentence representation.
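As an illustration of step (i), here is a minimal sketch of the time-windowed tf-idf pairing described above. The document representation, field names, and similarity threshold are our assumptions, and the additional LASER filtering is omitted.

```python
from datetime import timedelta
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pair_documents(foreign_docs, english_docs, window_days=7, threshold=0.5):
    """Pair foreign documents (already machine-translated into English)
    with English documents when they fall inside a small timestamp
    window and their tf-idf cosine similarity clears a threshold.
    Each doc is a dict: {"id", "english_text", "date"} (hypothetical)."""
    texts = [d["english_text"] for d in foreign_docs + english_docs]
    vec = TfidfVectorizer().fit(texts)
    f_mat = vec.transform([d["english_text"] for d in foreign_docs])
    e_mat = vec.transform([d["english_text"] for d in english_docs])
    sims = cosine_similarity(f_mat, e_mat)

    pairs = []
    for i, f in enumerate(foreign_docs):
        for j, e in enumerate(english_docs):
            # Only consider candidates published within the time window.
            if abs(f["date"] - e["date"]) <= timedelta(days=window_days) \
                    and sims[i, j] >= threshold:
                pairs.append((f["id"], e["id"], sims[i, j]))
    return pairs
```

The time window keeps the candidate set small, so the quadratic pairing loop only runs over documents published close together.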
Given the bilingual data constructed as above, we follow previous works (Bañón et al., 2020; El-Kishky et al., 2020) that leverage URL addresses for constructing bilingual data. In contrast, we use URL addresses to construct trilingual data pairs by matching and linking, as sketched below. Figure 2 gives a detailed illustration.
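The following is a minimal sketch of one plausible reading of this matching-and-linking step: if the same document URL appears in two bilingual pairs with partners in two different languages, the three documents form a trilingual triplet. The pair representation is our assumption.

```python
from collections import defaultdict

def build_triplets(bilingual_pairs):
    """Link bilingual document pairs that share a URL into trilingual
    triplets. Each pair is (url_a, lang_a, url_b, lang_b); when one
    document URL has partners in two different languages, the three
    documents form a triplet."""
    partners = defaultdict(dict)   # url -> {partner_lang: partner_url}
    for url_a, lang_a, url_b, lang_b in bilingual_pairs:
        partners[url_a][lang_b] = url_b
        partners[url_b][lang_a] = url_a

    triplets = []
    for url, by_lang in partners.items():
        langs = sorted(by_lang)
        for i in range(len(langs)):
            for j in range(i + 1, len(langs)):
                # Duplicates across pivot documents can be
                # deduplicated afterwards.
                triplets.append((url, by_lang[langs[i]], by_lang[langs[j]]))
    return triplets
```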
For space reasons, statistics illustrating the scale of MTDD are given in Table 10 in Appendix D. We also note that MTmC4 (Lee et al., 2022), used by DOCmT5, would be less suitable for our experiments because (i) MTmC4 is composed of synthetic data that could be of lower quality, (ii) MTmC4 is not publicly available at the time of writing, and (iii) MTmC4 can lead to potential data leakage for the test sets on TED Talks.

TRIP Pre-training
Model Configuration We use a Transformer architecture composed of 24 Transformer encoder layers and 12 interleaved decoder layers. It has an embedding size of 1024 and a dropout rate of 0.1, and the feed-forward network has a size of 4096 with 16 attention heads. For parameter initialization, we follow Ma et al. (2021) and Yang et al. (2021) and train a sentence-level MT system. The motivation is that previous studies have shown that hybrid training of sentence-level and document-level MT can improve document-level translation (Sun et al., 2022). We call this the Baseline Model in the remainder of this paper.
Data and Pre-processing As described in Section 2, we create a trilingual document-level corpus, MTDD, for TRIP pre-training with trilingual document pairs. We create a list of keywords to automatically clean and remove noisy text such as claims and advertisements. We follow Ma et al. (2021) in using SentencePiece (Kudo and Richardson, 2018) for tokenization, and we use the same SentencePiece model as Yang et al. (2021). Following previous works, we prefix the inputs with a language tag that indicates the target language of the generation, for both pre-training and fine-tuning.
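As a minimal example of the target-language tag prefix (the exact tag format is our assumption; the convention varies across multilingual models):

```python
def add_target_tag(source_text, target_lang):
    """Prefix the input with a tag naming the target language of the
    generation, used in both pre-training and fine-tuning."""
    return f"<{target_lang}> {source_text}"

add_target_tag("这是一个例子。", "en")  # -> "<en> 这是一个例子。"
```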
Training Details We use the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9 and β2 = 0.98 for our multilingual pre-training. The learning rate is set to 1e-5 with 4,000 warmup steps. We use label-smoothed cross-entropy as the translation loss, with a label smoothing ratio of 0.1. All of our pre-training is conducted on 16 NVIDIA V100 GPUs with a batch size of 512 tokens per GPU. To simulate a larger batch size, we update the model every 128 steps. For the Grafting operation T defined for TRIP, we split the documents 50% by 50%.

Table 2: Results for document-level MT on TED Talks in the direction of (X → En). We report d-BLEU scores for all results. †: scores are taken from the official papers for these models. -: the scores are not reported or the language is not supported. The Baseline Model refers to the model described in Section 3.1. The Baseline Model+ represents a document-level model continually pre-trained with the bilingual data in MTDD. For a fair comparison, the trilingual data used by TRIP are constructed from these bilingual data. We perturb them with sentence permutation and span corruption as the noise functions, with no use of trilingual data.
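A minimal PyTorch sketch of the training recipe described above. Only the Adam betas, learning rate, warmup steps, label smoothing ratio, and the 128-step update interval come from the text; the inverse-square-root decay after warmup and the batch field names are our assumptions.

```python
import torch

def build_optimizer(model, lr=1e-5, warmup_steps=4000):
    """Adam with the betas from the paper, plus linear warmup followed
    by inverse-sqrt decay (a common default; our assumption)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 betas=(0.9, 0.98))
    def inv_sqrt(step):
        step = max(step, 1)
        return min(step / warmup_steps, (warmup_steps / step) ** 0.5)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, inv_sqrt)
    return optimizer, scheduler

# Label-smoothed cross-entropy with the ratio from the paper.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# Update the model every 128 steps (gradient accumulation) to
# simulate a larger batch than 512 tokens per GPU.
ACCUM_STEPS = 128

def train_step(model, batch, optimizer, scheduler, step):
    logits = model(batch["input_ids"])            # (B, T, V)
    loss = criterion(logits.transpose(1, 2),      # (B, V, T) for CE
                     batch["labels"]) / ACCUM_STEPS
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()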

TED Talks
Experimental Settings Following DOCmT5, we use the IWSLT15 campaign for the evaluation of TED Talks. Prior systems report scores on only one or two translation directions (Lee et al., 2022; Sun et al., 2022), and DOCmT5 supports only translation into English (X → En) and evaluates only on (Zh → En); we report more language directions. Following DOCmT5, we split all documents into chunks of at most 512 tokens for all train/dev/test sets during training and inference. We use the official parallel training data from IWSLT15 without any additional monolingual data, with the official 2010 dev set and the 2010-2013 test sets for evaluation (Lee et al., 2022). We compute d-BLEU (Papineni et al., 2002; Liu et al., 2020; Bao et al., 2021), a BLEU score computed over documents, and use SacreBLEU for evaluation.

Baseline Systems We report strong baselines evaluated at both the sentence and document level, including the SOTA models DOCmT5† (Lee et al., 2022), M2M-100 (Fan et al., 2022), mBART (Liu et al., 2020), HAN† (Miculicich et al., 2018), and MARGE† (Lewis et al., 2020a), as well as the Baseline Model that we use to initialize the weights for TRIP. †: the scores are taken from existing papers. We also compare to the Baseline Model+, a document-level model pre-trained continually on top of the Baseline Model with the bilingual data used to construct the trilingual data in MTDD. We do not compare to PARADISE (Reid and Artetxe, 2022), a pre-trained model that uses dictionary denoising on monolingual data, as its weights are not publicly available so far. During our trials, we found that monolingual dictionary denoising can degrade document-level systems; we think it may better serve sentence-level tasks such as sentence-level MT and cross-lingual classification, as conducted in its original paper. See Appendix C for the number of model parameters.
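For concreteness, a minimal sketch of our reading of the d-BLEU metric used above: each document's sentences are concatenated into a single line, then standard corpus-level BLEU is applied. The helper name is ours; sacrebleu's corpus_bleu is the real library call.

```python
import sacrebleu

def d_bleu(sys_docs, ref_docs):
    """d-BLEU: BLEU computed over whole documents. Each document is a
    list of sentence strings; sentences are joined into one line."""
    sys_lines = [" ".join(doc) for doc in sys_docs]
    ref_lines = [" ".join(doc) for doc in ref_docs]
    return sacrebleu.corpus_bleu(sys_lines, [ref_lines]).score
```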
Results Table 2 presents the evaluation results for TED Talks in the (X → En) directions. TRIP clearly surpasses the baselines: it outperforms the Baseline Model fine-tuned at the document level by an average of 1.87 d-BLEU points, and the Baseline Model fine-tuned at the sentence level by an average of 1.01 d-BLEU points. We postulate that the Baseline Model fine-tuned at the document level is no better than its sentence-level counterpart because of the long input problem (Koehn and Knowles, 2017), and because the Baseline Model itself is pre-trained at the sentence level. TRIP also beats the prior SOTA system DOCmT5. For space reasons, we present the evaluations in the (X → X) directions in Appendix A; they show that TRIP effectively improves language pairs that are unseen during pre-training. We further find that (i) the Baseline Model+ clearly surpasses the Baseline Model and (ii) TRIP clearly surpasses the Baseline Model+. This indicates two points: (i) the bilingual data in MTDD used to construct the trilingual data are of high quality, and (ii) the trilingual objective with the Grafting mechanism is superior to conventional bilingual objectives for multilingual pre-training.

News
Experimental Settings For evaluation on the News benchmark, we follow Sun et al. (2022) and use News Commentary v11 as the training set. For Cs and De, we use newstest2015 as the dev set, and newstest2016/newstest2019 as the respective test sets. For Fr, we use newstest2013 as the dev set and newstest2015 as the test set. For Zh, we use newstest2019 as the dev set and newstest2020 as the test set. We use the same dataset pre-processing and evaluation metric as for TED Talks.
Baseline Systems As the weights for DOCmT5 are not available at the time of writing, we compare our system to various strong baselines: M2M-100, mBART, the Baseline Model, and the Baseline Model+. The scores are obtained by fine-tuning the official checkpoints.
Results Table 3 shows clear and consistent improvements with TRIP, by up to 3.11 d-BLEU points (from 36.38 to 39.49) for (Fr → En) compared to the Baseline Model.

Europarl
Experimental Settings For the Europarl dataset (Koehn, 2005), we follow Sun et al. (2022) and use Europarl-v7. We experiment with the (X → En) setting, testing nine languages: Da, De, El, Es, Fr, It, Nl, Pt, and Sv. Like previous works (Bao et al., 2021; Sun et al., 2022), the dataset is randomly partitioned into train/dev/test divisions. Additionally, we split by English document IDs to avoid information leakage, as sketched below.
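A minimal sketch of splitting by English document IDs; the hash-based assignment and the split fractions are our assumptions. Hashing each ID deterministically keeps every language version of a document in the same split, so no test content leaks into training.

```python
import hashlib

def split_by_doc_id(doc_ids, dev_frac=0.01, test_frac=0.01):
    """Deterministically assign each English document ID to a split,
    so that all language versions of one document land together."""
    splits = {"train": [], "dev": [], "test": []}
    for doc_id in doc_ids:
        # Stable hash in [0, 10000), independent of Python's seed.
        h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % 10000
        if h < dev_frac * 10000:
            splits["dev"].append(doc_id)
        elif h < (dev_frac + test_frac) * 10000:
            splits["test"].append(doc_id)
        else:
            splits["train"].append(doc_id)
    return splits
```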
Baseline Systems As the weights for DOCmT5 are not available at the time of writing, we compare our system to various strong baselines: M2M-100, mBART, the Baseline Model, and the Baseline Model+. The scores are obtained by fine-tuning the official checkpoints.
Results Table 4 shows that the improvements with TRIP are consistent in all directions and surpass all the strong baselines, validating TRIP's effectiveness.

Coherence and Consistency Evaluation
BlonDe Evaluation Figure 3 depicts evaluations on TED Talks with BlonDe scores (Jiang et al., 2022), an evaluation metric designed for document-level MT that considers coherence and consistency issues requiring the model to resolve cross-lingual ambiguities. TRIP achieves consistent improvements in all directions on TED Talks, meaning that it generates more coherent and consistent translations than the baseline does. As discussed in Section 2, we postulate that these improvements are attributable to the Grafting mechanism, which resolves cross-lingual ambiguities by exploiting useful linguistic clues in trilingual data; this improves coherence and consistency as reflected in the BlonDe scores. We present case studies below for further analysis of coherence and consistency issues.
Case Study Table 5 presents three case studies that compare the outputs of TRIP and the baseline systems. We highlight correct translations in aqua and wrong translations in hot pink. In addition to the Baseline Models, we also present outputs from the popular commercial translation systems Google Translate, Microsoft Translator, and DeepL Translate. The cases demonstrate that TRIP is the best in terms of three characteristics respectively: (i) tense consistency across sentences (Jiang et al., 2022; Sun et al., 2022), (ii) noun-related issues (Jiang et al., 2022) such as singular/plural consistency and attaching the definite article 'the' to a previously mentioned object ('light'), and (iii) conjunction presence, which signals the relationship between sentences and makes the translation natural and fluent (Xiong et al., 2019; Sun et al., 2022). While some translations in the third case are acceptable, a missing coordinating conjunction does not precisely capture the relationship between sentences and can make the translation less fluent. TRIP is the best among the systems at resolving cross-lingual ambiguities. This observation highlights the necessity of translating with document-level contexts, and aligns with the BlonDe measurements reported above.

Large Language Models
Table 7 compares TRIP to the popular ChatGPT (GPT-3.5-TURBO) on TED Talks. We use the prompt: "Translate the following text into English:". The results indicate that ChatGPT still lags behind the supervised system TRIP on document-level MT. This conclusion aligns with a previous study on sentence-level MT (Zhu et al., 2023), and we postulate that the reason is that ChatGPT fails to handle contexts perfectly for document-level MT.
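For reference, a minimal sketch of how such a query could be issued with the openai Python package (pre-1.0 API, matching the May 2023 snapshot). Only the prompt string comes from the text above; the decoding settings are our assumptions.

```python
import openai  # openai<1.0 style API

def chatgpt_translate(document, model="gpt-3.5-turbo"):
    """Query ChatGPT with the paper's prompt for document-level MT."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Translate the following text into English:\n\n"
                       + document,
        }],
        temperature=0,  # deterministic decoding; our assumption
    )
    return response["choices"][0]["message"]["content"]
```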

Cross-lingual Abstractive Summarization
Experimental Settings We follow the same setting used by DOCmT5 (Lee et al., 2022) to evaluate cross-lingual abstractive summarization on the Wikilingua benchmark (Ladhak et al., 2020). The only difference is that DOCmT5 uses a special prefix "Summarize X to Y", where X and Y are the source and target language tags for summarization, following mT5, while we simply put a target language tag as the prefix.
Baseline Systems We report the scores for DOCmT5 taken from Lee et al. (2022), and we use prior SOTA scores from the official GEM benchmark (Gehrmann et al., 2021) for mT5 and ByT5 (Xue et al., 2022). We also employ mBART and the Baseline Models as baselines. See Appendix C for the number of model parameters.
Results Table 6 demonstrates that TRIP clearly exceeds previous SOTA systems in several directions, by up to 8.9 ROUGE-L points for (Hi → En) compared to DOCmT5. We thus conclude that TRIP is an effective multilingual pre-training framework for cross-lingual abstractive summarization. We postulate that the improvement is attributable to the trilingual pre-training objective overlooked by previous works such as DOCmT5. We also find that using bilingual data for the Baseline Model+ seems less beneficial on Wikilingua for cross-lingual abstractive summarization, and TRIP clearly surpasses the Baseline Model+. This observation again indicates that the trilingual objective with the Grafting mechanism is superior to conventional bilingual objectives for multilingual pre-training.

Table 7: Comparison of TRIP to ChatGPT on document-level machine translation on TED Talks in the direction of (X → En). The results were snapshotted in May 2023 and can be subject to change.
Case Study Table 8 in the Appendix shows three case studies in which TRIP outputs better cross-lingual abstractive summarization. For space reasons, we leave further details to Appendix B.

Multilingual Pre-training
Multilingual pre-training has achieved great success. Previous works can be categorized into two streams: monolingual pre-training (Conneau et al., 2020; Liu et al., 2020; Xue et al., 2021) and bilingual pre-training (Huang et al., 2019; Chi et al., 2021; Ouyang et al., 2021; Tang et al., 2021; Reid and Artetxe, 2022; Lee et al., 2022). Monolingual pre-training uses monolingual corpora in many different languages, perturbs the inputs with sentence permutation (Liu et al., 2020) and span corruption (Xue et al., 2021), and requires the model to reconstruct the original input; Reid and Artetxe (2022) also propose dictionary denoising on monolingual data. For bilingual pre-training, Tang et al. (2021) use clean sentence-level bilingual translation pairs on pre-trained models to improve MT, and Chi et al. (2021) extend mT5 with objectives such as translation span corruption. DOCmT5 (Lee et al., 2022) creates synthetic bilingual translation pairs and uses sentence permutation for document-level multilingual pre-training.

Document-level Cross-lingual Tasks
Document-level MT and cross-lingual abstractive summarization are the two document-level cross-lingual tasks that we investigate in this paper.
Document-level MT (Miculicich et al., 2018; Maruf et al., 2019, 2021; Lu et al., 2022) is a challenging translation task, possibly due to the long input problem (Pouget-Abadie et al., 2014; Koehn and Knowles, 2017) when directly modelling long documents, and due to the necessity of understanding contexts (Voita et al., 2018, 2019). Therefore, many works focus on using sentence-level models with a smaller contextual window to simulate document-level MT (Zheng et al., 2020; Chen et al., 2020). This paper follows the more challenging setting (Bao et al., 2021; Lee et al., 2022) that directly optimizes a document-level model with a longer context window; this provides a richer source of context but is also a double-edged sword, as it can be harder due to the long input problem.
Abstractive summarization is a generation task that requires an understanding of texts (Chopra et al., 2016; Fan et al., 2018). We focus on a cross-lingual setting where the source and target are written in different languages (Ladhak et al., 2020).

Conclusions
We present a novel sequence-to-sequence multilingual document-level pre-training methodology called TRIP, which is the first in our field to propose a trilingual objective for multilingual pre-training, through Grafting. We also propose a novel method to construct high-quality trilingual document pairs. Experimental results indicate that TRIP achieves competitive SOTA scores on both multilingual document-level machine translation and cross-lingual abstractive summarization. Future work could extend TRIP to include polygonal parallel translation pairs in multilingual pre-training. We plan to release the model checkpoints and a manually annotated benchmark created from our document-level corpus MTDD to facilitate future research on multilingual document-level MT.

Limitations
TRIP TRIP leverages high-quality document-level trilingual translation pairs for pre-training multilingual models. It is usually harder to collect high-quality trilingual data than to collect the monolingual data in different languages used by conventional methods. While we could possibly relax the quality bar for the data, additional experiments would be needed to verify this view.
MTDD We create MTDD, a corpus composed of trilingual document pairs. It could be further extended to include polygonal parallel document pairs to provide a stronger signal for multilingual pre-training. We leave this to future work.
Large Language Models Large language models (LLMs) such as ChatGPT have shown good translation abilities (Lu et al., 2023), while they still lag behind supervised systems (Jiao et al., 2023; Zhu et al., 2023). We conduct only a limited comparison to them, as they have far more parameters than the systems described in this work.

Ethics Statement
We honour and support the EMNLP Code of Ethics. The datasets used in this work are well-known and widely used, and the dataset pre-processing does not use any external textual resource. We also curate a corpus for pre-training language models. Although we have made our best efforts to reduce potentially offensive and toxic data, the models remain subject to generating offensive content; these issues are commonly known to exist for such models. Any content generated does not reflect the view of the authors.

Source (Google-translated)
For some, past lives are not a place to go but part of what makes you who you are. Throughout the history of humanity, many cultures have put reincarnation at the center of their beliefs. Although Islam and Christianity do not believe in reincarnation, Hinduists, some Jews, and some Buddhists do. It's best to explore your own path rather than devote yourself entirely to a religion (because religions can be too restrictive). Find your own spiritual truth.

Reference
Become spiritual.
Baseline Model (Document-level)
Understand that some people believe in reincarnation. Find your own way.

TRIP
Explore your own spiritual journey.

A Unseen (X→X) Language Pairs on MT

Figure 4 reports the performance on TED Talks in the (X→X) directions with our TRIP checkpoint pre-trained only in the (X→En) directions on our corpus. Rows represent the translation source language and columns the translation target language. TRIP clearly improves most of these translation directions, which are unseen during pre-training. This indicates that TRIP can generalize its cross-lingual understanding ability to unseen language pairs, in line with the findings reported in Lee et al. (2022).

B Case Study on Summarization
Table 8 shows that TRIP outputs better summarization by (i) precisely capturing the context in Case 1, (ii) outputting consistent nouns, i.e., 'messages' instead of 'settings', in Case 2, and (iii) producing concise and accurate summarization in Case 3. This highlights that TRIP captures better cross-lingual understanding than the baseline system and effectively mitigates cross-lingual ambiguities.

C Number of Model Parameters
Table 9 presents the number of model parameters for the pre-trained models used in our experiments. For the scores of ByT5 presented in Table 6, we report, for space reasons, the maximum score for each direction among ByT5-Small, ByT5-Base, and ByT5-Large. See https://gem-benchmark.com/results for the tailored scores.

D MTDD Corpus Scale
Table 10 presents statistics for the top-12 English-centric bilingual directions to illustrate the scale of MTDD. The total size of the data is about 40 GB for the bilingual data and about 80 GB for the trilingual data after applying Grafting.

Figure 1: Overview of Triangular Document-level Pre-training (TRIP). We select three languages for demonstration. Here, Chinese is tenseless, while Japanese and English carry past tense as linguistic clues that can resolve cross-lingual ambiguities. We present three Grafting cases, representing three different language combinations. For each trilingual pair, two languages serve as the input and the remaining one as the reference translation. We define a novel symbol T that denotes a noise function combining operations in sequence: splitting in half and concatenation (Grafting), followed by sentence permutation and span corruption. $Z_n$, $J_n$, and $E_n$ for n = {1, 2, 3, 4} denote four sentences written in the three languages; $\tilde{Z}_n$, $\tilde{J}_n$, and $\tilde{E}_n$ denote corrupted sentences.

Figure 2: Illustration of the URL matching mechanism used to create trilingual document pairs from bilingual data. In this case, we construct trilingual data by successfully matching the URL address of the Chinese document.

Figure 3: BlonDe scores on the TED Talks evaluated with TRIP and the Baseline Model (Document-level).

Figure 4: Results on TED Talks in the (X→X) directions with our TRIP checkpoint pre-trained in the (X→En) directions only. Each cell lists the TRIP score first and the Baseline Model score second. Rows represent the source languages and columns the target languages. We highlight cells in aqua when TRIP wins (the darker shade when printed in B&W) and in hot pink (the lighter shade when printed in B&W) when the Baseline Model wins.

Table 1: Comparison of various multilingual pre-training methods. Partial support is denoted by an intermediate value; for example, mT5 uses span corruption alone without sentence permutation, so it receives an intermediate value in the Denoising Pre-training column. The Denoising Pre-training and Translation Pre-training columns refer to the pre-training objectives introduced at the start of Section 2.

Table 3: Results for document-level MT on the News benchmark in the direction of (X → En).

Table 4: Results for document-level machine translation on Europarl in the direction of (X → En).
Case 1: Tense Consistency (Jiang et al., 2022; Sun et al., 2022)
Source: ......，但是这是一个大致的抽象的讨论，当某些间隙的时候，奥克塔维奥说，"保罗，也许我们可以观看TEDTalk。" TEDTalk用简单的方式就讲明了，......
Reference: ..., But it was a fairly abstract discussion, and at some point when there was a pause, Octavio said, "Paul, maybe we could watch the TEDTalk." So the TEDTalk laid out in very simple terms, ...
Google Translate: ..., But it's a roughly abstract discussion when at some point Octavio said, "Paul, maybe we can watch the TEDTalk." The TEDTalk said it in a simple way, ...
Microsoft Translator: ..., But it's a roughly abstract discussion when, at certain intervals, Octavio said, "Paul, maybe we can watch TEDTalk." TEDTalk explains it in a simple way, ...
But it was a sort of abstract discussion, and at some point in the intermission, Octavio said, "Paul, maybe we can watch the TEDTalk." And the TEDTalk made it clear, ...

Case 2: Noun-related Issues (Jiang et al., 2022; Sun et al., 2022)
Source: ......，当光在西红柿上走过时，它一直在闪耀。它并没有变暗。为什么？因为西红柿熟了，并且光在西红柿内部反射，......
Reference: ..., as the light washes over the tomato, It continues to glow. It doesn't become dark. Why is that? Because the tomato is actually ripe, and the light is bouncing around inside the tomato, ...
Google Translate: ..., as the light passed over the tomatoes, It kept shining. It didn't get darker. Why? Because the tomatoes are ripe, and light is reflected inside the tomatoes, ...
Microsoft Translator: ..., as the light walks over the tomatoes, It keeps shining. It didn't darken. Why? Because the tomatoes are ripe, and light is reflected inside the tomatoes, ...
DeepL Translate: ..., as the light traveled over the tomatoes, it kept shining. It doesn't dim. Why? Because the tomatoes are ripe and the light is reflecting inside the tomatoes, ...
Baseline Model (Sentence-level): ..., as the light goes over the tomato, It's always glowing. It's not darkening. Why? Because the tomato is ripe, and light is reflected inside the tomato, ...
Baseline Model (Document-level): ..., as the light passes over the tomato, It keeps flashing. It doesn't get darker. Why? Because the tomatoes are ripe, and the light is is reflected inside the tomato, ...
TRIP: ..., as the light passes over the tomato, It's flashing all the time. It's not getting darker. Why? Because the tomato is ripe, and the light is reflected inside the tomato, ...

Case 3: Conjunction Presence (Xiong et al., 2019; Sun et al., 2022)
Source: ......，我想提醒大家，我已经谈论了很多前人的事情。我还想考虑一下，民主会是什么样子，或者是已经是什么样子的可能性如果我们可以让更多的母亲参与进来，......
Reference: ..., I want to suggest to you that I've been talking a lot about the fathers. And I want to think about the possibilities of what democracy might look like, or might have looked like, if we had more involved the mothers, ...
Google Translate: ..., I want to remind everyone that I've talked a lot about my predecessors. I also want to think about what democracy would look like, or is it already What the possibilities look like if we could get more mothers involved, ...
Microsoft Translator: ..., I want to remind you that I have talked a lot about my predecessors. I would also like to consider what democracy would look like, or already be What kind of possibilities if we can involve more mothers, ...
DeepL Translate: ..., I want to remind you that I've talked about a lot of things that have come before. I also want to consider the possibility of what democracy would look like, or what it already looks like if we could get more mothers involved in, ...
Baseline Model (Sentence-level): ..., I want to remind you that I've talked about a lot of my predecessors. I also want to think about what democracy might look like, or what democracy might look like if we could get more mothers involved, ...
Baseline Model (Document-level): ..., I'd like to remind you that I've talked about a lot of things before. I'd also like to think about the possibilities of what democracy might look like, or what it might be like, if we could get more mothers to participate, ...
TRIP: ..., I want to remind you that I've talked a lot about the past. And I want to think about the possibilities of what democracy might look like, or already looks like, if we can get more mothers involved, ...

Table 5: Cases from TED Talks demonstrating that TRIP better captures tense consistency, noun-related issues, and conjunction presence. We highlight correct translations in aqua (the darker shade when printed in B&W) and mistakes in hot pink (the lighter shade when printed in B&W). Google Translate: https://translate.google.com/, Microsoft Translator: https://www.bing.com/translator, DeepL Translate: https://www.deepl.com/translator. Snapshotted on 15th June 2023; outputs can be subject to change.
Reference: Open your iPhone's messages. Tap Edit. Select each conversation you wish to delete. Tap Delete.
Baseline Model (Document-level): Open your iPhone's Settings. Tap Messages. Tap Delete Messages.
TRIP: Open Messages. Tap the Messages tab. Tap Delete. Tap Delete to confirm.

Table 8: Three case studies from Wikilingua (Tr → En) demonstrating that TRIP outputs better summarization.

Table 9: Comparison of the number of parameters for the pre-trained models used in our experiments. *: these models all use the model architecture of mT5-Large, and we report the number of model parameters taken from the original mT5 paper (Xue et al., 2021).

Table 10: A list of the top-12 language directions (in ISO codes) for the high-quality bilingual pre-training data, illustrating the scale of MTDD.