Target-Side Augmentation for Document-Level Machine Translation

Document-level machine translation faces the challenge of data sparsity due to its long inputs and the relatively small amount of training data, which increases the risk of learning spurious patterns. To address this challenge, we propose a target-side augmentation method, introducing a data augmentation (DA) model to generate many potential translations for each source document. By learning from this wider range of translations, an MT model can learn a smoothed distribution, thereby reducing the risk of data sparsity. We demonstrate that the DA model, which estimates the posterior distribution, largely improves MT performance, outperforming the previous best system by 2.30 s-BLEU on News and achieving new state-of-the-art results on the News and Europarl benchmarks.


Introduction
Document-level machine translation (Gong et al., 2011; Hardmeier et al., 2013; Werlen et al., 2018; Maruf et al., 2019; Bao et al., 2021; Feng et al., 2022) has received increasing research attention. It addresses the limitations of sentence-level MT by considering cross-sentence co-references and discourse information, and can therefore be more useful in practical settings. Document-level MT presents several unique technical challenges, including significantly longer inputs (Bao et al., 2021) and relatively smaller training data compared to sentence-level MT (Junczys-Dowmunt, 2019; Liu et al., 2020; Sun et al., 2022). The combination of these challenges leads to increased data sparsity (Gao et al., 2014; Koehn and Knowles, 2017; Liu et al., 2020), which raises the risk of learning spurious patterns in the training data (Belkin et al., 2019; Savoldi et al., 2021) and hinders generalization (Li et al., 2021; Dankers et al., 2022).
To address these issues, we propose a target-side data augmentation method that aims to reduce sparsity by automatically smoothing the training distribution. The main idea is to train the document MT model with many plausible potential translations, rather than forcing it to fit a single human translation for each source document. This allows the model to learn more robust and generalizable patterns, rather than being overly reliant on features of particular training samples. Specifically, we introduce a data augmentation (DA) model to generate possible translations to guide MT model training. As shown in Figure 1, the DA model is trained to understand the relationship between the source and possible translations based on one observed translation (Step 1), and then used to sample a set of potentially plausible translations (Step 2). These translations are fed to the MT model for training, smoothing the distribution of target translations (Step 3).
We use standard document-level MT models, including Transformer (Vaswani et al., 2017) and G-Transformer (Bao et al., 2021), for both our DA and MT models. For the DA model, in order to effectively capture the posterior target distribution given a reference target, we concatenate each source sentence with a latent token sequence as the new input, where the latent tokens are sampled from the observed translation. A challenge for the DA model is that having the reference translation in the input can potentially decrease diversity. To address this issue, we introduce an intermediate latent variable on the encoder side, using rules to generate n-gram samples, so that posterior sampling (Wang and Park, 2020) can be leveraged to yield diverse translations.
Figure 1: Illustration of target-side data augmentation (DA) using a very simple example (panels: Step 1, DA model training; Step 2, target-side data augmentation; Step 3, MT model training). A DA model is trained to estimate the distribution of possible translations y given a source x_i and an observed target y_i, and the MT model is trained on the translations ŷ_j sampled from the DA model for each source x_i. Effectively training the DA model with the target y_i, which is also a conditional input, can be challenging, but it becomes achievable after introducing an intermediate latent variable between the translation y and the condition y_i.
Experiments on TED, News, and Europarl show that our method significantly improves strong document-level baselines, achieving state-of-the-art results on News and Europarl. Further analysis shows that high diversity among generated translations and their low deviation from the gold translation are the keys to improved performance. To our knowledge, we are the first to use target-side augmentation to enrich output variety for document-level machine translation.

Related Work
Data augmentation (DA) increases training data by synthesizing new data (Van Dyk and Meng, 2001; Shorten and Khoshgoftaar, 2019; Shorten et al., 2021; Li et al., 2022). In neural machine translation (NMT), the most commonly used data augmentation techniques are source-side augmentations, including easy data augmentation (EDA) (Wei and Zou, 2019), subword regularization (Kudo, 2018), and back-translation (Sennrich et al., 2016a), which generates pseudo sources for monolingual targets, enabling the use of widely available monolingual data. These methods generate more source-target pairs with different silver source sentences for the same gold target translation. In contrast, target-side augmentation is more challenging: approaches like EDA are not effective on the target side because they corrupt the target sequence, degrading the autoregressive modeling of the target language.
Previous approaches to target-side data augmentation in NMT fall into three categories. The first is based on self-training (Bogoychev and Sennrich, 2019; He et al., 2019; Zoph et al., 2020), which generates pseudo translations for monolingual source text using a trained model. The second category uses either a pre-trained language model (Fadaee et al., 2017; Wu et al., 2019) or a pre-trained generative model (Raffel et al., 2020; Khayrallah et al., 2020) to generate synonyms for words or paraphrases of the target text. The third category relies on reinforcement learning (Norouzi et al., 2016; Wang et al., 2018), introducing a reward function to evaluate the quality of translation candidates and to regularize the likelihood objective. To explore possible candidates, these methods sample from the model distribution or add random noise. Unlike these approaches, our method is a target-side data augmentation technique that is trained with supervised learning and does not rely on external data or large-scale pre-training. More importantly, we generate document-level rather than word-, phrase-, or sentence-level alternatives.
Previous target-side input augmentation (Xie et al., 2022) appears similar to our target-side augmentation. However, beyond the superficial similarity, they are quite different. Consider the token prediction P(y_i|x, y_{<i}). Target-side input augmentation augments the condition y_{<i} to increase the model's robustness to the conditions, which is more akin to source-side augmentation on the condition x. In comparison, target-side augmentation augments the target y_i, providing the model with completely new training targets.
Paraphrase models. Our approach generates various translations for each source text, each of which can be viewed as a paraphrase of the target. Unlike previous methods that leverage paraphrase models for improving MT (Madnani et al., 2007; Hu et al., 2019; Khayrallah et al., 2020), our DA model exploits the parallel corpus and does not depend on external paraphrase data, similar to Thompson and Post (2020). Moreover, it takes the source text into account when modeling the target distribution. More importantly, while most paraphrase models operate at the sentence level, our DA model can generate translations at the document level.
Conditional auto-encoder. The DA model can also be seen as a conditional denoising autoencoder (c-DAE), where the latent variable is a noised version of the ground-truth target and the model is trained to reconstruct the ground-truth target from the noisy latent sequence. c-DAE is similar to the conditional variational autoencoder (c-VAE) (Zhang et al., 2016; Pagnoni et al., 2018), which learns a latent variable and generates diverse translations by sampling from it. However, there are two key differences between c-VAE and our DA model. First, c-VAE learns both the prior and posterior distributions of the latent variable, while the DA model directly uses predefined rules to generate the latent variable. Second, c-VAE models the prior distribution of the target, while the DA model estimates the posterior distribution.
Sequence-level knowledge distillation. Our DA-MT process is also remotely similar in form to sequence-level knowledge distillation (SKD) (Ba and Caruana, 2014; Hinton et al., 2015; Gou et al., 2021; Kim and Rush, 2016; Gordon and Duh, 2019; Lin et al., 2020), which learns the data distribution with a large teacher and distills the knowledge into a small student by training the student on sequences generated by the teacher. However, our method differs from SKD in three aspects. First, SKD aims to compress knowledge from a large teacher into a small student, while we use a model of the same or smaller size as the DA model, and the knowledge source is the training data rather than a big teacher. Second, the teacher in SKD estimates the prior distribution of the target given the source, while our DA model estimates the posterior distribution of the target given the source and an observed target. Third, SKD generates one sequence for each source, while we generate multiple diverse translations with controlled latent variables.

Target-Side Augmentation
The overall framework is shown in Figure 1. Formally, denote the training data as D = {(x_i, y_i)}_{i=1}^{N}, where (x_i, y_i) is the i-th source-target pair and N is the number of pairs. We train a data augmentation (DA) model (Section 3.1) to generate samples with new target translations (Section 3.2), which are then used to train the MT model (Section 3.3).

The Data Augmentation Model
We learn the posterior distribution P_da(y|x_i, y_i) from the parallel corpus by introducing a latent variable z:

$$P_{da}(y \mid x_i, y_i) = \sum_{z \in Z_i} P_\phi(y \mid x_i, z)\, P_\alpha(z \mid y_i), \quad (1)$$

where z is the latent variable that controls the translation output, Z_i denotes the possible space of z, φ denotes the parameters of the DA model, and α denotes the hyper-parameters determining the distribution of z given y_i.
The space Z_i of possible z is exponentially large compared to the number of tokens of the target, making the sum over Z_i in Eq. 1 intractable. We thus consider a Monte Carlo approximation, sampling a group of instances from P_α(z|y_i) and calculating the sample mean

$$P_{da}(y \mid x_i, y_i) \approx \frac{1}{|\hat{Z}_i|} \sum_{z \in \hat{Z}_i} P_\phi(y \mid x_i, z), \quad (2)$$

where Ẑ_i denotes the set of sampled instances.
There are many possible choices for the latent variable, such as a continuous vector or a categorical discrete variable, which could be either learned by the model or predefined by rules. Here, we simply represent the latent variable as a sequence of tokens and use predefined rules to generate the sequence, so that the latent variable can easily be incorporated into the input of a seq2seq model without the need for additional parameters.
Specifically, we set the value of the latent variable z to be a group of n-grams sampled from the observed translation y_i and concatenate x_i and z into a sequence of tokens. We assume that the generated translations y should be consistent with the observed translation y_i on these n-grams. To this end, we define α as the ratio of tokens in y_i that are observable through z, which we call the observed ratio. For a target with |y_i| tokens, we uniformly sample n-grams from y_i to cover α × |y_i| tokens, where each n-gram has a random length in {1, 2, 3}.
For example, given α = 0.1 and a target y_i with 20 tokens, we can sample one 2-gram or two uni-grams from the target to reach 2 (= 0.1 × 20) tokens.
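As a concrete illustration, the following is a minimal sketch of one way to implement this sampling rule; the function name, the overlap handling, and the retry cap are our own assumptions rather than details given in the paper.

```python
import random

def sample_latent_ngrams(target_tokens, alpha, max_n=3, max_attempts=100):
    """Sample non-overlapping n-grams from the observed translation so that
    roughly alpha * len(target_tokens) tokens are exposed through the latent z."""
    budget = max(1, round(alpha * len(target_tokens)))
    covered = set()
    ngrams = []  # list of (start_position, tokens)
    attempts = 0
    while budget > 0 and attempts < max_attempts:
        attempts += 1
        n = random.randint(1, min(max_n, budget))       # random length in {1..3}
        start = random.randrange(0, len(target_tokens) - n + 1)
        span = range(start, start + n)
        if any(i in covered for i in span):
            continue  # skip overlapping samples (simplifying assumption)
        ngrams.append((start, target_tokens[start:start + n]))
        covered.update(span)
        budget -= n
    ngrams.sort(key=lambda item: item[0])               # keep target order
    return [tokens for _, tokens in ngrams]

# Example: alpha = 0.1 over a 20-token target exposes about 2 tokens.
target = ("most free societies accept such limits as reasonable , "
          "but the law has recently become more restrictive .").split()
print(sample_latent_ngrams(target, alpha=0.1))
```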
Training. Given a sample (x_i, y_i), the training loss is rewritten as

$$\mathcal{L}_{da} = -\log P_{da}(y_i \mid x_i, y_i) = -\log \frac{1}{|\hat{Z}_i|} \sum_{z \in \hat{Z}_i} P_\phi(y_i \mid x_i, z) \le -\frac{1}{|\hat{Z}_i|} \sum_{z \in \hat{Z}_i} \log P_\phi(y_i \mid x_i, z), \quad (3)$$

where the upper bound of the loss is provided by Jensen's inequality. The upper bound sums log probabilities, which can be seen as the sum of the standard negative log-likelihood (NLL) losses of each (x_i, z, y_i). As a result, when we optimize this upper bound as an alternative to optimizing L_da, the DA model is trained using the standard NLL loss but with |Ẑ_i| times more training instances.
Discussion. As shown in Figure 1, given a sample (x_i, y_i), we adopt a new estimation method using the posterior distribution P_da(y|x_i, y_i) for our DA model. The basic intuition is that by conditioning on both the source x_i and the observed translation y_i, the DA model can estimate the data distribution P_data(y|x_i) more accurately than an MT model. Logically, an MT model learns a prior distribution P_mt(y|x_i), which estimates the data distribution P_data(y|x_i) for modeling translation probabilities. This prior distribution works well when the corpus is large. However, when the corpus is sparse in comparison to the data space, the learned distribution overfits the sparsely distributed samples, resulting in poor generalization to unseen targets.

The Data Augmentation Process
The detailed data augmentation process is shown in Figure 2 and the corresponding algorithm is shown in Algorithm 1. Below we use one training example to illustrate.
DA model training. We represent the latent variable z as a sequence of tokens and concatenate z to the source, so that a general seq2seq model can be used to model the posterior distribution. Compared to general MT models, the only difference is the structure of the input.
Specifically, as step B in the figure shows, for a given sample (x_i, y_i) from the parallel data, we sample a number of n-grams from y_i and extend the input to (x_i, z), where the number is determined by the length of y_i. Take the target sentence "most free societies accept such limits as reasonable , but the law has recently become more restrictive ." as an example. We sample "societies" and "has recently" from the target and concatenate them to the end of the source sentence to form the first input sequence. We then sample "the law" and "as reasonable" to form the second input sequence. These new input sequences pair with the original target sequence to form new parallel data. By generating different input sequences, we augment the data multiple times.
Algorithm 1 Target-side data augmentation.
For each pair (x_i, y_i) in the parallel data:
    add (x_i, y_i) to the output          ▷ Add the gold pair
    for j ← 1 to M do
        α ∼ Beta(a, b)                    ▷ Sample an observed ratio
        z_j ∼ P_α(z|y_i)                  ▷ Sample a latent value
        ŷ_j ∼ P_φ(y|x_i, z_j)             ▷ Sample a translation
        add (x_i, ŷ_j) to the output

Target-side data augmentation. Using the "C. Extended Input" part separated from the extended data of step B, we generate new translations by running beam search with the trained DA model, obtaining one new translation for each extended input sequence. Here, we reuse the z sampled in step B; we could also sample new z for inference, which does not show an obvious difference in MT performance. By pairing the new translations with the original source sequence, we obtain "E. Augmented Data". The details are described in Algorithm 1, which takes the original parallel data as input and outputs the augmented data.
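Below is a minimal Python sketch of the augmentation loop in Algorithm 1, reusing the sample_latent_ngrams helper sketched above; the da_model.beam_search call, the <sep> separator token, and the data layout are hypothetical stand-ins rather than the authors' implementation.

```python
import random

def target_side_augment(parallel_data, da_model, M, a=2.0, b=3.0):
    """Augment each (source, target) pair with M translations sampled from the
    DA model, which estimates P(y | x, z) with z drawn from the gold target."""
    augmented = []
    for source_tokens, target_tokens in parallel_data:
        augmented.append((source_tokens, target_tokens))      # keep the gold pair
        for _ in range(M):
            alpha = random.betavariate(a, b)                  # observed ratio
            ngrams = sample_latent_ngrams(target_tokens, alpha)
            # Extended input: source followed by the sampled n-grams,
            # separated by a special token (the separator is an assumption).
            extended_input = list(source_tokens)
            for gram in ngrams:
                extended_input += ["<sep>"] + list(gram)
            # Hypothetical DA-model call returning one beam-search translation.
            new_target = da_model.beam_search(extended_input, beam_size=5)
            augmented.append((source_tokens, new_target))
    return augmented
```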

The MT Model
We use Transformer (Vaswani et al., 2017) and G-Transformer (Bao et al., 2021) as the baseline MT models. The Transformer baseline models sentence-level translation and translates a document sentence by sentence, while G-Transformer models whole-document translation and directly translates a source document into the corresponding target document. G-Transformer, a recent state-of-the-art document MT model, improves the naive self-attention in Transformer with group attention (Appendix A) for long-document modeling.
Baseline Training. The baseline methods are trained on the original training dataset D with the standard NLL loss

$$\mathcal{L}_{baseline} = -\sum_{(x_i, y_i) \in D} \log P_\theta(y_i \mid x_i). \quad (4)$$

Augmentation Training. For our target-side augmentation method, we force the MT model to match the posterior distribution estimated by the DA model

$$\mathcal{L}_{mt} = -\sum_{(x_i, y_i) \in D} \mathbb{E}_{y \in Y_i} \left[ P_{da}(y \mid x_i, y_i) \log P_\theta(y \mid x_i) \right], \quad (5)$$

where Y_i is the set of possible translations of x_i.
We approximate the expectation over Y_i using a Monte Carlo method. Specifically, for each sample (x_i, y_i), we first sample z_j from P_α(z|y_i) and then run beam search with the DA model, taking x_i and z_j as its input, obtaining a feasible translation.
Repeating the process M times, we obtain a set of possible translations, as step D in Figure 2 and Algorithm 1 in Section 3.2 illustrate.
Subsequently, the loss function for the MT model is rewritten as follows, approximating the expectation with the average NLL loss over the sampled translations

$$\mathcal{L}_{mt} \approx -\sum_{(x_i, y_i) \in D} \frac{1}{|\hat{Y}_i|} \sum_{\hat{y}_j \in \hat{Y}_i} \log P_\theta(\hat{y}_j \mid x_i), \quad (7)$$

where θ denotes the parameters of the MT model and Ŷ_i denotes the sampled translations for x_i. The number |Ŷ_i| could differ for each sample, but for simplicity we choose a fixed number M in our experiments.
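In effect, the objective in Eq. 7 amounts to averaging the standard NLL over the sampled translations of each source, i.e., training with NLL on the expanded dataset. The sketch below illustrates this for one source document; the mt_model.nll interface is a hypothetical stand-in.

```python
def augmented_mt_loss(mt_model, source_tokens, sampled_targets):
    """Average NLL over the sampled translations of one source (Eq. 7).
    `sampled_targets` holds the |Y_hat| translations drawn from the DA model,
    optionally including the gold translation."""
    losses = [mt_model.nll(source_tokens, y_hat) for y_hat in sampled_targets]
    return sum(losses) / len(losses)
```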

Experiments
Datasets. We experiment on three benchmark datasets - TED, News, and Europarl (Maruf et al., 2019) - representing different domains and data scales for English-German (En-De) translation.
The detailed statistics are displayed in Table 1, and detailed descriptions are in Appendix B.1.
Metrics. We follow Liu et al. (2020) in using the sentence-level BLEU score (s-BLEU) and document-level BLEU score (d-BLEU) as the major metrics. We further define two metrics, Deviation and Diversity, to measure the quality of the translations generated by the DA model for analysis. The detailed descriptions and definitions are in Appendix B.2.
Baselines. We apply target-side augmentation to two baselines: the sentence-level Transformer (Vaswani et al., 2017) and the document-level G-Transformer (Bao et al., 2021). We further combine back-translation and target-side augmentation and apply it to both baselines.
Training Settings. For both Transformer and G-Transformer, we generate M new translations (9 for TED and News, and 3 for Europarl) for each sentence, augmenting the data to M + 1 times its original size. For the back-translation baselines, where the training data have already been doubled, we further augment the data 4 times for TED and News and 1 time for Europarl, so that the total is still 10 times for TED and News and 4 times for Europarl.
We obtain the translations by sampling the latent z with an observed ratio drawn from a Beta distribution Beta(2, 3) and running beam search with a beam size of 5. We run each main experiment three times and report the median. More details are described in Appendix B.3.

Main Results
As shown in Table 2, target-side augmentation significantly improves all the baselines. In particular, it improves G-Transformer (fnt.) by 1.75 s-BLEU on average over the three benchmarks, where the improvement on News reaches 2.94 s-BLEU. With the augmented data generated by the DA model, the gap between G-Transformer (rnd.) and G-Transformer (fnt.) narrows from 1.26 s-BLEU on average to 0.18, suggesting that fine-tuning on a sentence MT model might not be necessary when augmented data is used. For the Transformer baseline, target-side augmentation improves performance by 1.33 s-BLEU on average. These results demonstrate that target-side augmentation can significantly improve the baseline models, especially on small datasets.
Compared with previous work, G-Transformer (fnt.) + target-side augmentation outperforms the previous best system SMDT, which references retrieved similar translations, by a margin of 1.40 s-BLEU on average. It outperforms the previously competitive RecurrentMem, which gives the best score on TED, by a margin of 1.58 s-BLEU on average. Compared with MultiResolution, which is also a data augmentation approach that increases the training data by splitting the documents into different resolutions (e.g., 1, 2, 4, or 8 sentences per training instance), target-side augmentation obtains higher performance by a margin of 1.72 s-BLEU on average. With target-side augmentation, G-Transformer (fnt.) achieves the best-reported s-BLEU on all three datasets.
Compared to the pre-training setting, target-side augmentation with G-Transformer (fnt.) outperforms Flat-Transformer+BERT and G-Transformer+BERT, which are fine-tuned on pre-trained BERT, by margins of 1.46 and 0.70 s-BLEU, respectively, averaged over the three benchmarks, where the margins on News reach 3.54 and 1.92, respectively. The score on the larger Europarl dataset even exceeds that of the strong large-scale pre-training baseline G-Transformer+mBART, suggesting the effectiveness of target-side augmentation for both small and large datasets.
Back-translation does not improve performance on TED and Europarl by an adequate margin, but improves performance on News significantly, compared to the Transformer and G-Transformer baselines. On top of the enhanced baselines, target-side augmentation further improves performance on News to a new level, reaching the highest s-BLEU and d-BLEU scores of 28.69 and 30.41, respectively. The results demonstrate that target-side augmentation complements the back-translation technique, and a combination of the two may be the best choice in practice.

Posterior vs Prior Distribution
We first compare the MT performance of using a posterior distribution P(y|x_i, y_i) in the DA model (Eq. 5 in Section 3.3) against using the prior distribution P(y|x_i). As shown in Table 3, when using prior-based augmentation, performance improves by 0.64 s-BLEU on average compared to using the original data. When the DA model instead estimates the posterior distribution, performance improves by 1.75 s-BLEU on average, which is larger than the improvement obtained with the prior distribution. The results suggest that using a DA model (even with a simple prior distribution) to augment the target sequence is effective, and that the posterior distribution gives a further significant boost.
Generated Translations. We evaluate the distribution of the generated translations, as shown in Table 4. Using the prior distribution, we obtain translations with higher Diversity than with the posterior distribution.
However, higher Diversity does not necessarily lead to better performance if the generated translations are not consistent with the target distribution. As the Deviation column shows, the translations sampled from the posterior distribution have a much smaller Deviation than those from the prior distribution, confirming that a DA model estimating the posterior distribution can generate translations more similar to the gold target.
Accuracy of Estimated Distribution. As more direct evidence supporting the DA model with a posterior distribution, we evaluate the perplexity (PPL) of the model on a multiple-reference dataset, where a better model is expected to give a lower PPL on the references (Appendix C.1). As shown in the PPL column of Table 4, we obtain an average PPL (per token) of 7.00 for the posterior and 8.68 for the prior distribution, with the former being 19.4% lower than the latter, confirming our hypothesis that the posterior distribution estimates the data distribution P_data(y|x_i) more accurately.
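For reference, the per-token PPL on a multi-reference set could be computed along the following lines; this is a sketch with a hypothetical log_prob interface, since the paper does not describe its exact evaluation code.

```python
import math

def per_token_ppl(model, items):
    """items: iterable of (source, reference) pairs, where reference is a list
    of tokens; model.log_prob returns the total natural-log probability of the
    reference given the source."""
    total_logprob, total_tokens = 0.0, 0
    for source, reference in items:
        total_logprob += model.log_prob(source, reference)
        total_tokens += len(reference)
    return math.exp(-total_logprob / total_tokens)
```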

Sampling of Latent z
Scale. The sampling scale |Ŷ| in Eq. 7 is an important factor influencing model performance. Theoretically, the larger the scale, the more accurate the approximation. Figure 3 evaluates performance at different scales of generated translations. The overall trends confirm the theoretical expectation that performance improves as the scale increases. At the same time, the contribution of the gold translation drops as the scale increases, suggesting that with more generated translations, the gold translation provides less additional information. In addition, the performance at scales ×1 and ×9 differs by 0.75 s-BLEU, suggesting that the MT model requires sufficient samples from the DA model to match its distribution. In practice, we need to balance the performance gain against the training cost when deciding on a suitable sampling scale.
Observed Ratio. Using the observed ratio (α in Eq. 1), we can control the amount of information provided by the latent variable z. This ratio influences the quality of the generated translations. As Figure 4a shows, a higher observed ratio produces translations with a lower Deviation from the gold reference, showing a monotonically decreasing curve. In comparison, the Diversity of the generated translations shows a convex curve, with low values when the observed ratio is small or large and high values in the middle. The Diversity of the generated translations reflects the degree of smoothness of the augmented dataset, which has a direct influence on model performance.
As Figure 4b shows, the MT model obtains the best performance around a ratio of 0.4, where Deviation and Diversity are balanced. When the ratio increases further, performance goes down. Comparing the MT models trained with and without the gold translation, we see that the performance gap between the two settings closes when the observed ratio is larger than 0.6, where the generated translations have low Deviation from the gold translations.
Diversity can be further enhanced by mixing generated translations from different observed ratios. Therefore, instead of using a fixed ratio, we sample the ratio from a predefined Beta distribution. In Figure 4c, we compare performance under different Beta distributions. The performance on TED peaks at Beta(1, 1) but does not show a significant difference compared to the other two, while the performance on News peaks at Beta(2, 3), a unimodal distribution with its mode between ratios 0.3 and 0.4 and a shape similar to the Diversity curve in Figure 4a. Compared to Beta(2, 2), which is also unimodal but with its mode at a ratio of 0.5, the performance with Beta(2, 3) is higher by 0.66 s-BLEU.

Granularity of N-grams. The granularity of the n-grams determines how much ordering information between tokens is observable through the latent z (in comparison, the observed ratio determines how many tokens are observed). We evaluate different ranges of n-grams, where we sample n-grams whose length is uniformly drawn from the range. As Figure 5 shows, the performance peaks at [1, 2] for TED and [1, 3] for News. However, the differences are relatively small, showing that performance is not sensitive to the token order of the original reference. A possible reason is that the DA model can reconstruct the order from the semantic information provided by the source sentence.

Different Augmentation Methods
Source-side and Both-side Augmentation. We compare target-side augmentation with source-side and both-side augmentation by applying the DA model to the source and to both sides. As Table 5 shows, source-side augmentation improves the baseline by 1.12 s-BLEU on average over TED and News but is still significantly lower than target-side augmentation, which improves the baseline by 2.17 s-BLEU on average. Combining the generated data from both the source-side and target-side augmentations, we obtain an improvement of 2.42 s-BLEU on average; that is, the source-side augmented data further enhance target-side augmentation by 0.25 s-BLEU on average. These results suggest that the DA model is effective for source-side augmentation but even more so for target-side augmentation.

Table 6: Target-side augmentation vs. paraphraser on sentence-level MT, evaluated on IWSLT14 German-English (De-En). ♢ - nucleus sampling with p = 0.95.
Paraphrasing. Target-side augmentation augments the parallel data with new translations, which can be seen as paraphrases of the original gold translation. Such paraphrasing can also be achieved by external paraphrasers. We compare target-side augmentation with a pre-trained T5 paraphraser on a sentence-level MT task, using the settings described in Appendix C.3.
As shown in Table 6, the T5 paraphraser performs below the Transformer baseline on both the dev and test sets, while target-side augmentation outperforms the baseline by 1.57 and 1.55 on dev and test, respectively. The results demonstrate that a DA model is effective for sentence MT whereas a paraphraser may not be, possibly because the paraphraser lacks the source-side translation information.
In particular, the paraphrases generated by the T5 paraphraser have a Diversity of 40.24, which is close to the Diversity of 37.30 from the DA model. However, when we compare the translations by calculating their perplexity (PPL) under the baseline Transformer, we get a PPL of 3.40 for the T5 paraphraser but 1.89 for the DA model. The results suggest that, compared to an external paraphraser, the DA model generates translations more consistent with the distribution of the gold targets.

Further Analysis
Size of the DA Model. Conditioning on an observed translation simplifies the DA model's task of predicting the target. As a result, the generated translations are less sensitive to the capacity of the DA model. Results with different sizes of DA models confirm this hypothesis and suggest that MT performance improves even with much smaller DA models. The details are in Appendix C.2.
Case Study. We list several word-, phrase-, and sentence-level cases of German-English translations, and two documents of English-German translations, demonstrating the diversity of the translations generated by the DA model. The details are shown in Appendix C.4.

Conclusion
We investigated a target-side data augmentation method, which introduces a DA model to generate many possible translations and trains an MT model on these smoothed targets. Experiments show that our target-side augmentation method reduces the effect of data sparsity, achieving strong improvements upon the baselines and new state-of-the-art results on News and Europarl. Analysis suggests that a balance between high Diversity and low Deviation is the key to the improvements. To our knowledge, we are the first to apply target-side augmentation in the context of document-level MT.

Limitations
Long documents intuitively have more possible translations than short documents, so a dynamic number of generated translations may be a better choice when augmenting the data, balancing the training cost and the performance gain. Another potential solution is to sample a few translations and force the MT model to match the dynamic distribution of the DA model using these translations as decoder input, similar to Khayrallah et al. (2020). Such dynamic sampling and matching could potentially increase training efficiency. We do not investigate these solutions in this paper and leave the exploration of this topic to future work.
Target-side augmentation can potentially be applied to other seq2seq tasks where data sparsity is a problem. Due to space limitations, we leave investigations of other tasks for future work.

A G-Transformer
G-Transformer (Bao et al., 2021) has an encoder-decoder architecture involving two types of multi-head attention: one attends over the global document (global attention), while the other attends within the local sentence (group attention).
Global Attention. The global attention is simply standard multi-head attention, which attends to the whole document:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V,$$

where the matrix inputs Q, K, and V are the query, key, and value for calculating the attention.

Group Attention. The group attention differentiates the sentences in a document by assigning a group tag (Bao and Zhang, 2021, 2023; Bao et al., 2023) to each sentence. The group tag is a number identifying a specific sentence, allocated in the order of the sentences: the group tag for the first sentence is 1, for the second sentence 2, and so on.
The group-tag sequences are used to calculate an attention mask that prevents cross-sentential attention:

$$\mathrm{GroupAttn}(Q, K, V, G_Q, G_K) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}} + M(G_Q, G_K)\right) V,$$

where G_Q and G_K are the group-tag sequences for the query and key. The function M(G_Q, G_K) calculates the attention mask: for a group tag in G_Q and a group tag in G_K, it returns a large negative number if the two tags are different and 0 otherwise.
Combined Attention. The two multi-head attentions are combined using a gate-sum module

$$g = \sigma\big([H_{group}; H_{global}]\, W + b\big), \qquad H = g \odot H_{group} + (1 - g) \odot H_{global},$$

where W and b are trainable parameters, σ is the sigmoid function, and ⊙ denotes element-wise multiplication. G-Transformer uses group attention in the lower layers and combined attention in the top 2 layers.
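The sketch below illustrates the group-attention mask M(G_Q, G_K) and the gate-sum combination using plain Python lists; the scalar gate and the parameter layout are simplifications of our own rather than the released G-Transformer code.

```python
import math

NEG_INF = -1e9

def group_attention_mask(group_tags_q, group_tags_k):
    """M(G_Q, G_K): 0 where query and key belong to the same sentence,
    a large negative number otherwise (blocks cross-sentential attention)."""
    return [[0.0 if gq == gk else NEG_INF for gk in group_tags_k]
            for gq in group_tags_q]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_sum(h_group, h_global, w, b):
    """Combine group and global attention outputs for one position:
    g = sigmoid([h_group; h_global] . w + b); h = g*h_group + (1-g)*h_global.
    A scalar gate is used here for simplicity; the model may gate element-wise."""
    concat = h_group + h_global                    # vector concatenation
    g = sigmoid(sum(c * wi for c, wi in zip(concat, w)) + b)
    return [g * a + (1.0 - g) * o for a, o in zip(h_group, h_global)]
```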

B.1 Datasets
The three benchmark datasets are as follows. TED is a corpus from IWSLT2017, which contains the transcriptions of TED talks; each talk corresponds to a document. The sentences in the source and target documents are aligned for translation. We use tst2016-2017 for testing and the rest for development.
News is a corpus mainly from News Commentary v11, where the sentences are also aligned between the source and target documents. We use newstest2016 for testing and newstest2015 for development. In addition, we use newstest2021 from WMT21 (Farhad et al., 2021), which has three references for each source, to evaluate the quality of the estimated data distribution.
Europarl is a corpus extracted from Europarl v7, where the train, development, and test sets are randomly split.
We pre-process the data by tokenizing and truecasing the sentences using the MOSES tools (Koehn et al., 2007), followed by BPE (Sennrich et al., 2016b) with 30,000 merge operations.

B.2 Metrics
The sentence-level BLEU score (s-BLEU) and document-level BLEU score (d-BLEU) are described as follows.
s-BLEU is calculated over aligned sentence pairs between the hypothesis and reference documents, which is essentially the same as the BLEU score (Papineni et al., 2002) used for sentence-level NMT models.
d-BLEU is calculated over document pairs, taking each document as a whole word sequence and computing the BLEU score between the hypothesis and reference documents.
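As an illustration, both scores can be computed with the sacrebleu package (our tooling choice; the paper does not name its scorer): s-BLEU scores aligned sentence pairs, while d-BLEU concatenates each document into one segment.

```python
import sacrebleu

def s_bleu(hyp_docs, ref_docs):
    """Corpus BLEU over sentence pairs; each document is a list of sentences."""
    hyps = [sent for doc in hyp_docs for sent in doc]
    refs = [sent for doc in ref_docs for sent in doc]
    return sacrebleu.corpus_bleu(hyps, [refs]).score

def d_bleu(hyp_docs, ref_docs):
    """Corpus BLEU over documents; each document is scored as one long segment."""
    hyps = [" ".join(doc) for doc in hyp_docs]
    refs = [" ".join(doc) for doc in ref_docs]
    return sacrebleu.corpus_bleu(hyps, [refs]).score
```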
For analysis, we measure the Deviation and Diversity of generated translations.
Deviation is simply defined as the distance to a perfect s-BLEU score:

$$\mathrm{Deviation}(\hat{y}, y) = 100 - \text{s-BLEU}(\hat{y}, y), \quad (11)$$

where ŷ is a generated translation and y is the gold translation.

accept such limits as reasonable
1) consider these restrictions useful  2) regard such restrictions as reasonable  3) take these constraints as certain

passiv bewegte ohren sobald der kopf etwas tut .
ears that move passively when the head goes .
1) ears moving passively when the head does something .  2) passively moving ears once the head goes .

an object constructed out of wood and cloth with movement built into it to persuade you to believe that it has life
1) an object made out of wood and cloth , with movement built in to persuade you to believe that 's alive .  2) an object built out of wood and cloth with movement to perpetuate you to believe it 's alive .  3) a wooden and cloth object with movement built in to make you believe that it 's alive .

Sentence
sie lebt nur dann wenn man sie dazu bringt .
it only lives because you make it .
1) it only lives when you get it to do .  2) it lives only as you make it .  3) it only lives because you get them to do it .

in jedem moment auf der bühne rackert sich die puppe ab .
so every moment it 's on the stage , it 's making the struggle .
1) at every moment on the stage , it 's making the struggle of puppet .  2) every moment on the stage it reckers down the puppet .  3) so every moment it 's on the stage , the puppet is racking off .

er demonstriert anhand einer schockierenden geschichte von der toxinbelastung auf einem japanischen fischmarkt , wie gifte den weg vom anfang der ozeanischen nahrungskette bis in unseren körper finden .
he shows how toxins at the bottom of the ocean food chain find their way into our bodies , with a shocking story of toxic contamination from a japanese fish market .
1) he demos through a shocking story of toxic burden on a japanese fish market , how poisoning their way from the beginning of the ocean food chain into our bodies .  2) he demos through a shocking story of toxin impact on a japanese fish market , how poised the way from the ocean food chain to our bodies .  3) he demos through a shocking story of toxin contamination at a japanese fish market , with how toxins find the way from the beginning of the ocean food chain to our bodies .

Table 8: Translations generated by the DA model on IWSLT14 German-English. For each item, the gold reference is followed by numbered alternatives from the DA model; German lines are the source sentences.

For the paraphraser comparison, we generate 6 translations for each source sentence without using the document context. It is worth noting that, unlike the previous paraphrasing augmentation method (Khayrallah et al., 2020), where the MT model learns the paraphraser's distribution directly, we use the sampled text output to train the MT models.

C.4 Case Study
Our case study demonstrates that the DA model generates diverse translations at the word, phrase, and sentence levels. Several cases for German-English translation are listed in Table 8.
We further list two document-level translations in Table 9, which give a direct sense of how target-side augmentation improves MT performance.

Figure 2: The detailed data augmentation process, where the parallel data is augmented multiple times.

Figure 3: Impact of the sampling scale for z, trained on G-Transformer (fnt.) and evaluated in s-BLEU on News. (gen+gold) - trained on both generated and gold translations. (gen only) - trained on generated translations only.

Figure 4: Impact of the observed ratio for z, trained on G-Transformer (fnt.) and evaluated in s-BLEU. Beta(a, b) - the function curves are shown in Appendix B.3.

Figure 5: Impact of the granularity of n-grams, trained on G-Transformer (fnt.) and evaluated in s-BLEU.

Figure 6: The probability density function of Beta(a, b) distributions.

Table 1: Statistics of the three benchmark datasets (TED, News, Europarl).

Table 2: Main results evaluated on English-German document-level translation, where "*" indicates a significant improvement upon the baseline with p < 0.01. (rnd.) - parameters are randomly initialized. (fnt.) - parameters are initialized using a trained sentence model. ♢ - we adjust the hyper-parameters for augmented datasets. ♡ - we augment the training data by back-translating each target to a new source instead of introducing additional monolingual targets.

Table 3: MT performance with prior/posterior-based DA models, evaluated in s-BLEU.

Table 4: Quality of generated translations and accuracy of the estimated distributions from the DA model, evaluated on News.