G-Transformer for Document-Level Machine Translation

Document-level MT models are still far from satisfactory. Existing work extends the translation unit from a single sentence to multiple sentences. However, studies show that when we further enlarge the translation unit to a whole document, supervised training of Transformer can fail. In this paper, we find that such failure is not caused by overfitting, but by sticking around local minima during training. Our analysis shows that the increased complexity of target-to-source attention is a reason for the failure. As a solution, we propose G-Transformer, introducing a locality assumption as an inductive bias into Transformer, reducing the hypothesis space of the attention from target to source. Experiments show that G-Transformer converges faster and more stably than Transformer, achieving new state-of-the-art BLEU scores for both non-pretraining and pre-training settings on three benchmark datasets.


Introduction
Document-level machine translation (MT) has received increasing research attention (Gong et al., 2011; Hardmeier et al., 2013; Garcia et al., 2015; Miculicich et al., 2018a; Maruf et al., 2019). It is a more practically useful task compared to sentence-level MT because typical inputs in MT applications are text documents rather than individual sentences. A salient difference between document-level MT and sentence-level MT is that for the former, a much larger inter-sentential context should be considered when translating each sentence, which includes discourse phenomena such as anaphora and lexical cohesion. Studies show that human translators consider such contexts when conducting document translation (Hardmeier, 2014; Läubli et al., 2018). Although neural models achieve competitive performance on sentence-level MT, the performance of document-level MT is still far from satisfactory.

Existing methods can be mainly classified into two categories. The first category translates a document sentence by sentence using a sequence-to-sequence neural model (Zhang et al., 2018; Miculicich et al., 2018b; Maruf et al., 2019; Zheng et al., 2020). Document-level context is integrated into sentence translation by introducing an additional context encoder. The structure of such a model is shown in Figure 1(a). These methods suffer from two limitations. First, the context needs to be encoded separately for translating each sentence, which adds to the runtime complexity. Second, more importantly, information exchange cannot be made between the current sentence and its document context in the same encoding module.
The second category extends the translation unit from a single sentence to multiple sentences (Tiedemann and Scherrer, 2017; Agrawal et al., 2018) and to the whole document (Junczys-Dowmunt, 2019). Recently, it has been shown that when the translation unit increases from one sentence to four sentences, the performance improves (Scherrer et al., 2019). However, when the whole document is encoded as a single unit for sequence-to-sequence translation, direct supervised training has been shown to fail. As a solution, either large-scale pre-training or data augmentation (Junczys-Dowmunt, 2019) has been used, leading to improved performance. These methods are shown in Figure 1(b). One limitation of such methods is that they require much more training time due to the necessity of data augmentation.
Intuitively, encoding the whole input document as a single unit allows the best integration of context information when translating the current sentence. However, little work has been done investigating the underlying reason why it is difficult to train such a document-level NMT model. One remote clue is that as the input sequence grows larger, the input becomes more sparse (Pouget-Abadie et al., 2014; Koehn and Knowles, 2017). To gain more understanding, we conduct dedicated experiments on the influence of input length, data scale, and model size for Transformer (Section 3), finding that a Transformer model can fail to converge when trained with long sequences, small datasets, or a big model size. We further find that for the failed cases, the model gets stuck at local minima during training. In such situations, the attention weights from the decoder to the encoder are flat, with large entropy values. This may be because larger input sequences increase the challenge of focusing on a local span to translate when generating each target word. In other words, the hypothesis space for target-to-source attention is increased.
Given the above observations, we investigate a novel extension of Transformer, restricting self-attention and target-to-source attention to a local context using a guidance mechanism. As shown in Figure 1(c), while we still encode the input document as a single unit, group tags (e.g., 1, 2, 3) are assigned to sentences to differentiate their positions. Target-to-source attention is guided by matching the tag of the target sentence to the tags of the source sentences when translating each sentence, so that the hypothesis space of attention is reduced. Intuitively, the group tags serve as a constraint on attention, which is useful for differentiating the current sentence from its context sentences. Our model, named G-Transformer, can thus be viewed as a combination of the methods in Figure 1(a) and Figure 1(b), which fully separate and fully integrate, respectively, a sentence being translated and its document-level context. We evaluate our model on three commonly used document-level MT datasets for English-German translation, covering the domains of TED talks, News, and Europarl, from small to large. Experiments show that G-Transformer converges faster and more stably than Transformer under different settings, obtaining state-of-the-art results under both non-pretraining and pre-training settings. To our knowledge, we are the first to realize a truly document-by-document translation model. We release our code and model at https://github.com/baoguangsheng/g-transformer.

Experimental Settings
We evaluate Transformer and G-Transformer on the widely adopted benchmark datasets (Maruf et al., 2019), including three domains for English-German (En-De) translation.
TED. The corpus consists of transcriptions of TED talks from IWSLT 2017. Each talk is used as a document, aligned at the sentence level. tst2016-2017 is used for testing, and the rest for development.
News. This corpus uses News Commentary v11 for training, which is document-delimited and sentence-aligned. newstest2015 is used for development, and newstest2016 for testing.
Europarl. The corpus is extracted from Europarl v7, where sentences are segmented and aligned using additional information. The train, dev and test sets are randomly split from the corpus.
The detailed statistics of these corpora are shown in Table 1. We pre-process the documents by splitting them into instances of up to 512 tokens, taking a sentence as one instance by itself if its length exceeds 512 tokens. We tokenize and truecase the sentences with the MOSES (Koehn et al., 2007) tools, applying BPE (Sennrich et al., 2016) with 30,000 merge operations.
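This splitting step can be sketched as follows. The function name and the greedy packing strategy are our own illustration, assuming documents arrive as lists of tokenized sentences:

```python
def split_document(sentences, max_tokens=512):
    """Greedily pack tokenized sentences into instances of up to
    `max_tokens` tokens; an over-long sentence becomes an instance
    by itself, mirroring the preprocessing described above."""
    instances, current, count = [], [], 0
    for sent in sentences:
        n = len(sent)
        if n > max_tokens:
            # Flush whatever we have, then emit the long sentence alone.
            if current:
                instances.append(current)
                current, count = [], 0
            instances.append([sent])
            continue
        if count + n > max_tokens:
            instances.append(current)
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        instances.append(current)
    return instances
```

Two short sentences that fit within the budget end up in one instance, while an over-long sentence is kept on its own.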
We consider three standard model configurations.
Large Model. We use the same settings as the BART large model, which involves 12 layers, 16 heads, 1024-dimension outputs, and 4096-dimension hidden vectors.
We use s-BLEU and d-BLEU as the metrics. Detailed descriptions are given in Appendix A.

Transformer and Long Inputs
We empirically study Transformer (see Appendix B) on the datasets. We run each experiment five times using different random seeds, reporting the average score for comparison.

Failure Reproduction
Input Length. We use the Base model and a fixed dataset for this comparison. We split both the training and testing documents of the Europarl dataset into instances with input lengths of 64, 128, 256, 512, and 1024 tokens, respectively. For a fair comparison, we remove training documents with a length of less than 768 tokens, which may favour small input lengths. The results are shown in Figure 2a. When the input length increases from 256 tokens to 512 tokens, the BLEU score drops dramatically from 30.5 to 2.3, indicating failed training with 512 and 1024 tokens. This demonstrates the difficulty Transformer has in dealing with long inputs.
Data Scale. We use the Base model and a fixed input length of 512 tokens. For each setting, we randomly sample a training dataset of the expected size from the full Europarl dataset. The results are shown in Figure 2b. The performance increases sharply when the data scale increases from 20K to 40K. When the data scale is equal to or less than 20K, the BLEU scores are under 3, which is unreasonably low, indicating that with a fixed model size and input length, a smaller dataset can also cause training to fail. For data scales above 40K, the BLEU scores show a wide dynamic range, suggesting that the training process is unstable.
Model Size. We test Transformer with different model sizes, using the full Europarl dataset and a fixed input length of 512 tokens. Transformer-Base can be trained successfully, giving a reasonable BLEU score. However, the training of the Big and Large models fails, resulting in very low BLEU scores under 3. It demonstrates that an increased model size can also cause failure with a fixed input length and data scale.
The results confirm the intuition that the performance drops with longer inputs, smaller datasets, or bigger models. However, the BLEU scores show a strong discontinuity with the change of input length, data scale, or model size, falling into two discrete clusters: one of successfully trained cases with d-BLEU scores above 10, and the other of failed cases with d-BLEU scores under 3.

Failure Analysis
Training Convergence. Looking into the failed models, we find that they share a similar pattern in their loss curves. As the example of the model trained on 20K instances in Figure 3a shows, although the training loss continually decreases during training, the validation loss sticks at a level of 7, reaching its minimum value at around 9K training steps. In comparison, the successfully trained models share another pattern. Taking the model trained on 40K instances as an example, the loss curves demonstrate two stages, as shown in Figure 3b. In the first stage, the validation loss, similar to the failed cases, converges toward the level of 7. In the second stage, after 13K training steps, the validation loss falls suddenly, indicating that the model may have escaped successfully from local minima. From the two stages of the learning curve, we conclude that the real problem, contradicting our first intuition, is not overfitting but local minima.

Attention Distribution. We further look into the attention distribution of the failed models, observing that the attention from target to source is widely spread over all tokens. As Figure 4a shows, the distribution entropy is high, at about 8.14 bits on validation. In contrast, as shown in Figure 4b, the successfully trained model has a much lower attention entropy of about 6.0 bits on validation. Furthermore, we can see that before 13K training steps, the entropy sticks at a plateau, consistent with the observation of the local minima in Figure 3b. It indicates that the early stage of the training process is difficult for Transformer. Figure 5 shows the self-attention distributions of the successfully trained models. The attention entropy of both the encoder and the decoder drops fast at the beginning, leading to a shrinkage of the attention range. But then the attention entropy gradually increases, indicating an expansion of the attention range.
Such back-and-forth oscillation of the attention range may also result in unstable training and slow down the training process.
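The entropy values used in this analysis can be computed directly from attention weight vectors; a minimal sketch (the function name is ours):

```python
import math

def attention_entropy_bits(weights):
    """Shannon entropy (in bits) of one attention distribution.
    A flat distribution over a long input gives high entropy;
    attention focused on a few tokens gives low entropy."""
    return -sum(p * math.log2(p) for p in weights if p > 0)

# A flat distribution over 512 tokens has log2(512) = 9 bits of entropy,
flat = [1 / 512] * 512
# while attention concentrated on three tokens stays near 1 bit.
focused = [0.7, 0.2, 0.1] + [0.0] * 509
```

In the experiments above, entropies near the flat-distribution value indicate the failed, widely spread attention pattern.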

Conclusion
The above experiments show that training failure of Transformer can be caused by local minima. Additionally, the oscillation of the attention range may make it worse. During training, the attention module needs to identify relevant tokens in the whole sequence to attend to. Assuming that the sequence length is N, the complexity of the attention distribution increases as N grows from sentence level to document level.
We propose to use the locality properties (Rizzi, 2013; Hardmeier, 2014; Jawahar et al., 2019) of both the language itself and the translation task as a constraint in Transformer, regulating the hypothesis space of the self-attention and target-to-source attention using a simple group-tag method. Source group tags can be assigned deterministically, while target tags are assigned dynamically according to whether a generated sentence is complete. Starting from 1, each target word copies the group tag from its predecessor unless the previous token is </s>, in which case the tag increases by 1. The tags serve as a locality constraint, encouraging target-to-source attention to concentrate on the current source sentence being translated.
Formally, for a source document X and a target document Y, the probability model of Transformer can be written as

P(Y | X; θ) = ∏_{t=1}^{|Y|} P(y_t | y_{<t}, X; θ),

and G-Transformer extends it by having

P(Y | X, G_X, G_Y; θ) = ∏_{t=1}^{|Y|} P(y_t | y_{<t}, X, G_X, G_Y; θ),

where G_X and G_Y denote the two sequences of group tags, in which every token belonging to sent_k, the k-th sentence of X or Y, receives tag k. An example is shown in Figure 6. Group tags influence the auto-regressive translation process by interfering with the attention mechanism, which we show in the next section. In G-Transformer, we use the group-tag sequences G_X and G_Y for representing the alignment between X and Y, and for generating the localized contextual representation of X and Y.
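The tag-assignment rule described above can be sketched in a few lines, assuming tokens are plain strings and sentences are delimited by <s> and </s> (the function name is ours):

```python
def assign_group_tags(tokens, eos="</s>"):
    """Assign a sentence-index group tag to every token: tags start
    at 1 and increase by 1 after each </s>, so all tokens of the
    k-th sentence share tag k."""
    tags, k = [], 1
    for tok in tokens:
        tags.append(k)
        if tok == eos:
            k += 1  # the next token opens a new sentence
    return tags
```

The same rule serves both for deterministic source tags and, applied incrementally over generated tokens, for dynamic target tags.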

Group Attention
An attention module can be seen as a function mapping a query and a set of key-value pairs to an output (Vaswani et al., 2017). The query, key, value, and output are all vectors. The output is computed by summing the values with corresponding attention weights, which are calculated by matching the query against the keys. Formally, given a set of queries, keys, and values, we pack them into matrices Q, K, and V, respectively, and compute the matrix of outputs

Attention(Q, K, V) = softmax(QK^T / √d_k) V,  (Eq 4)

where d_k is the dimension of the key vectors. Attention allows a model to focus on different positions. Further, multi-head attention (MHA) allows a model to gather information from different representation subspaces

MHA(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),  (Eq 5)

where W^O, W_i^Q, W_i^K, and W_i^V are parameter matrices. We update Eq 4 using group tags, naming it group attention (GroupAttn). In addition to the inputs Q, K, and V, two sequences of group-tag inputs are involved, where G_Q corresponds to Q and G_K corresponds to K

GroupAttn(Q, K, V, G_Q, G_K) = softmax(QK^T / √d_k + M(G_Q, G_K)) V,  (Eq 6)

where the function M(·) works as an attention mask, excluding all tokens outside the sentence. Specifically, M(·) gives a big negative number γ to make softmax close to 0 for the tokens whose group tag differs from that of the current token

M(G_Q, G_K) = γ · sign(|G_Q I_K^T − I_Q G_K^T|),  (Eq 7)

where I_K and I_Q are constant vectors with value 1 on all dimensions, I_K having dimension equal to the length of G_K and I_Q having dimension equal to the length of G_Q. The constant value γ can typically be −1e8. Similar to Eq 5, we use group multi-head attention

GroupMHA(Q, K, V, G_Q, G_K) = Concat(head_1, ..., head_h) W^O, where head_i = GroupAttn(QW_i^Q, KW_i^K, VW_i^V, G_Q, G_K),  (Eq 8)

and the projections W^O, W_i^Q, W_i^K, and W_i^V are parameter matrices.

Encoder. For each layer, a group multi-head attention module is used for self-attention, assigning the same group-tag sequence to the query and the key, i.e., G_Q = G_K = G_X.
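The group mask and group attention can be sketched with NumPy as follows. This is a single-head, unbatched simplification of ours, assuming the common masked-softmax formulation in which a large negative constant is added to the logits of tokens with a different group tag:

```python
import numpy as np

def group_attention_mask(g_q, g_k, gamma=-1e8):
    """M(G_Q, G_K): 0 where query and key share a group tag,
    a large negative constant gamma elsewhere, so softmax assigns
    ~0 weight to tokens outside the current sentence."""
    gq = np.asarray(g_q)[:, None]   # broadcast tags along keys
    gk = np.asarray(g_k)[None, :]   # broadcast tags along queries
    return np.where(gq == gk, 0.0, gamma)

def group_attention(Q, K, V, g_q, g_k):
    """GroupAttn: scaled dot-product attention with the group mask
    added to the logits before the softmax."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + group_attention_mask(g_q, g_k)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

With identical queries and keys and tags [1, 1, 2, 2], each token attends uniformly over its own sentence and gives (numerically) zero weight to the other sentence.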
Decoder. We use one group multi-head attention module for self-attention and another for cross-attention. Similar to the encoder, we assign the same group-tag sequence to the query and key of the self-attention, i.e., G_Q = G_K = G_Y, but use different group-tag sequences for cross-attention, i.e., G_Q = G_Y and G_K = G_X.

Complexity. Consider a document with M sentences and N tokens, where each sentence contains N/M tokens on average. The complexities of both the self-attention and cross-attention in Transformer are O(N^2). In contrast, the complexity of group attention in G-Transformer is O(N^2/M), given that the attention is restricted to a local sentence. Theoretically, since the average sentence length N/M tends to be constant, the time and memory complexities of group attention are approximately O(N), making training and inference on very long inputs feasible.
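The N^2/M claim can be sanity-checked by counting unmasked query-key pairs (the helper name is ours):

```python
def attended_pairs(n_tokens, n_sentences):
    """Count query-key pairs attended under full attention (N^2)
    versus group attention (M sentences of N/M tokens each, hence
    M * (N/M)^2 = N^2 / M pairs)."""
    per_sent = n_tokens // n_sentences
    full = n_tokens ** 2
    grouped = n_sentences * per_sent ** 2
    return full, grouped
```

For a 512-token document of 16 sentences, group attention touches exactly 1/16 of the pairs that full attention does.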

Combined Attention
We use only group attention on the lower layers for local sentence representation, and combined attention on the top layers for integrating local and global context information. We use the standard multi-head attention in Eq 5 for global context, naming it global multi-head attention (GlobalMHA)

H_local = GroupMHA(Q, K, V, G_Q, G_K), H_global = GlobalMHA(Q, K, V).  (Eq 9)

Group multi-head attention in Eq 8 and global multi-head attention are combined using a gate-sum module (Zhang et al., 2016; Tu et al., 2017)

g = sigmoid([H_local; H_global] W + b), H = g ⊙ H_local + (1 − g) ⊙ H_global,  (Eq 10)

where W and b are linear projection parameters, and ⊙ denotes element-wise multiplication.
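A minimal sketch of one common gate-sum formulation, assuming the gate is computed from the concatenation of the two attention outputs (the exact projection layout is our assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_sum(h_local, h_global, W, b):
    """Combine local (group) and global attention outputs with a
    learned gate: g = sigmoid([H_local; H_global] W + b), then
    H = g * H_local + (1 - g) * H_global (element-wise)."""
    g = sigmoid(np.concatenate([h_local, h_global], axis=-1) @ W + b)
    return g * h_local + (1.0 - g) * h_global
```

With zero-initialized W and b the gate is 0.5 everywhere, so the module starts as a plain average of the local and global representations.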
Previous study (Jawahar et al., 2019) shows that the lower layers of Transformer capture more local syntactic relations, while the higher layers represent longer-distance relations. Based on these findings, we use combined attention only on the top layers for integrating local and global context. By this design, on the lower layers, the sentences are isolated from each other, while on the top layers, cross-sentence interactions are enabled. Our experiments show that the top 2 layers with global attention are sufficient for document-level NMT, and more layers neither help nor harm the performance.

Inference
During decoding, we generate the group-tag sequence G_Y according to the predicted tokens, starting with 1 at the first <s> and increasing by 1 after each </s>. We use beam search and apply a maximum length constraint on each sentence. We generate the whole document from start to end in one beam search process, using a default beam size of 5.

G-Transformer Results
We compare G-Transformer with Transformer baselines and previous document-level NMT models under both non-pretraining and pre-training settings. Detailed descriptions of these training settings are in Appendix C.1. We conduct statistical significance tests following Collins et al. (2005).

Results on Non-pretraining Settings
As shown in Table 2, the sentence-level Transformer outperforms previous document-level models on News and Europarl. Compared to this strong baseline, our randomly initialized model of G-Transformer improves the s-BLEU by 0.81 point on the large dataset Europarl. The results on the small datasets TED and News are worse, indicating overfitting with long inputs. When G-Transformer is trained by fine-tuning the sentence-level Transformer, the performance improves on the three datasets by 0.3, 0.33, and 1.02 s-BLEU points, respectively.
Different from the document-level Transformer baseline, G-Transformer can be successfully trained on the small TED and News datasets. On Europarl, G-Transformer outperforms Transformer by 0.77 d-BLEU point, and G-Transformer fine-tuned on the sentence-level Transformer enlarges the gap to 0.98 d-BLEU point.
G-Transformer outperforms previous document-level MT models on News and Europarl by a significant margin. Compared to the best recent model, Hybrid-Context, G-Transformer improves the s-BLEU on Europarl by 1.99. These results suggest that, in contrast to previous short-context models, a sequence-to-sequence model taking the whole document as input is a promising direction.

Results on Pre-training Settings
There is relatively little existing work on document-level MT using pre-training. Although Flat-Transformer+BERT gives state-of-the-art scores on TED and Europarl, its score on News is worse than the previous non-pretraining model HAN (Miculicich et al., 2018b). G-Transformer+BERT improves the scores by margins of 0.20, 1.62, and 0.47 s-BLEU points on TED, News, and Europarl, respectively. It shows that with a better contextual representation, we can further improve document-level MT under pre-training settings.
We further build much stronger Transformer baselines by fine-tuning on mBART25. Taking advantage of sequence-to-sequence pre-training, the sentence-level Transformer gives much better s-BLEUs of 27.78, 29.90, and 31.87, respectively. G-Transformer fine-tuned on mBART25 improves the performance by 0.28, 0.44, and 0.87 s-BLEU, respectively. Compared to the document-level Transformer baseline, G-Transformer gives 1.74, 1.22, and 0.31 higher d-BLEU points, respectively. It demonstrates that even with a well-trained sequence-to-sequence model, the locality bias can still enhance performance.

Convergence
We evaluate G-Transformer and Transformer on various input lengths, data scales, and model sizes to better understand to what extent it solves the convergence problem of Transformer.

Input Length. The results are shown in Figure 7a. Unlike Transformer, which fails to train on long inputs, G-Transformer shows stable scores for inputs of 512 and 1024 tokens, suggesting that with the help of the locality bias, a long input does not noticeably impact performance.
Data Scale. As shown in Figure 7b, G-Transformer overall has a smooth performance curve across data scales from 1.25K to 160K. The variances of the scores are much lower than Transformer's, indicating stable training of G-Transformer. Additionally, G-Transformer outperforms Transformer by a large margin in all settings.
Model Size. Unlike Transformer, which fails to train under the Big and Large model settings, G-Transformer shows stable scores across model sizes. As shown in Appendix C.2, although the performance on the small TED and News datasets drops substantially for the Big and Large models, the performance on the large Europarl dataset only decreases by 0.10 d-BLEU point for the Big model and 0.66 for the Large model.
Loss. Looking into the training process of the above experiments, we see that both the training and validation losses of G-Transformer converge much faster than Transformer's, taking almost half the time to reach the same level of loss. Furthermore, the validation loss of G-Transformer converges to much lower values. These observations demonstrate that G-Transformer converges faster and better.
Table 3: Impact of source-side and target-side context, reported in s-BLEU. Here, fnt. denotes the model fine-tuned on sentence-level Transformer.

Attention Distribution. Benefiting from the separate group attention and global attention, G-Transformer avoids the oscillation of attention range that occurs in Transformer. As shown in Figure 8a, Transformer sticks at the plateau area for about 13K training steps, while G-Transformer shows a quick and monotonic convergence, reaching the stable level in about 1/4 of the time that Transformer takes. From Figure 8b, we can see that G-Transformer also has a smooth and stable curve for the convergence of the self-attention distribution. These observations imply that the potential conflict between the local sentence and the document context can be mitigated by G-Transformer.

Discussion of G-Transformer
Document Context. We study the contribution of the source-side and target-side context by gradually removing the cross-sentential attention in Eq 10 from the encoder and the decoder. The results are shown in Table 3. We take the G-Transformer fine-tuned on the sentence-level Transformer as our starting point. When we disable the target-side context, the performance decreases by 0.14 s-BLEU point on average, which indicates that the target-side context does not impact translation performance significantly. When we further remove the source-side context, the performance decreases by 0.49, 0.83, and 0.77 s-BLEU point on TED, News, and Europarl, respectively, which indicates that the source-side context is relatively more important for document-level MT.
To further understand the impact of the source-side context, we conduct an automatic evaluation of discourse phenomena which rely on source context. We use the human-labeled evaluation set (Voita et al., 2019b) on English-Russian (En-Ru) for deixis and ellipsis. We follow the Transformer concat baseline (Voita et al., 2019b) and use both 6M sentence pairs and 1.5M document pairs from OpenSubtitles2018 (Lison et al., 2018) to train our model. The results are shown in Table 4. G-Transformer outperforms the Transformer baseline concat (Voita et al., 2019b) by a large margin on three discourse features, indicating a better leverage of the source-side context. Compared to the previous model LSTM-T, G-Transformer achieves better ellipsis scores on both infl. and VP. However, the score on deixis is still lower, which indicates a potential direction for further study.

Word-dropout. As shown in Table 5, word-dropout (Appendix C.1) contributes about 0.37 d-BLEU on average. Its contribution to TED and News is notable, at 0.35 and 0.58 d-BLEU, respectively. However, for the large Europarl dataset, the contribution drops to 0.17, suggesting that with sufficient data, word-dropout may not be necessary.
Locality Bias. In G-Transformer, we introduce a locality bias into the language modeling of the source and target, and a locality bias into the translation between source and target. We try to understand these biases by removing them from G-Transformer. When all the biases are removed, the model downgrades to a document-level Transformer. The results are shown in Table 5. Relatively speaking, the contribution of the language locality bias is about 1.78 d-BLEU on average, while the translation locality bias contributes about 14.68 d-BLEU on average, showing a critical impact on model convergence on small datasets. These results suggest that the locality bias may be the key to training whole-document MT models, especially when the data is insufficient.
Combined Attention. In G-Transformer, we enable only the top K layers with combined attention. On Europarl, G-Transformer gives 33.75, 33.87, and 33.84 d-BLEU with the top 1, 2, and 3 layers using combined attention, respectively, showing that K = 2 is sufficient. Furthermore, we study the effect of group and global attention separately. As shown in Table 6, when we replace the combined attention on the top 2 layers with group attention, the performance drops by 0.22, 0.09, and 0.75 d-BLEU on TED, News, and Europarl, respectively. When we replace the combined attention with global attention, the performance decrease is enlarged to 0.84, 0.69, and 1.00 d-BLEU, respectively. These results demonstrate the necessity of combined attention for integrating local and global context information.

Related Work
The unit of translation has evolved from word (Brown et al., 1993; Vogel et al., 1996) to phrase (Koehn et al., 2003; Chiang, 2005, 2007) and further to sentence (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014) in the MT literature. The trend shows that larger units of translation, when represented properly, can lead to improved translation quality. One line of document-level MT extends the translation unit to multiple sentences (Tiedemann and Scherrer, 2017; Agrawal et al., 2018; Ma et al., 2020). However, these approaches are limited to a short context of at most four sentences. Recent studies extend the translation unit to the whole document (Junczys-Dowmunt, 2019), using large augmented datasets or pre-trained models. It has been shown that Transformer trained directly on a document-level dataset can fail, resulting in unreasonably low BLEU scores. Following these studies, we also model translation on the whole document. We solve the training challenge using a novel locality bias with group tags. Another line of work performs document-level machine translation sentence by sentence, using additional components to represent the context (Maruf and Haffari, 2018; Zheng et al., 2020; Zhang et al., 2018; Miculicich et al., 2018b; Maruf et al., 2019; Yang et al., 2019). Different from these approaches, G-Transformer uses a generic design for both the source and the context, translating the whole document in one beam search instead of sentence by sentence. Some methods use a two-pass strategy, generating sentence translations first and then integrating context information through a post-editing model (Voita et al., 2019a; Yu et al., 2020). In contrast, G-Transformer uses a single model, which reduces the complexity of both training and inference.
The locality bias we introduce to G-Transformer differs from the ones in Longformer (Beltagy et al., 2020) and Reformer (Kitaev et al., 2020) in the sense that we discuss locality in the context of representing the alignment between source sentences and target sentences in document-level MT. Specifically, Longformer introduces locality only to self-attention, while G-Transformer also introduces locality to cross-attention, which is shown to be the key to the success of G-Transformer. Reformer, basically the same as Transformer, searches for attention targets in the whole sequence, while G-Transformer mainly restricts the attention to a local sentence. In addition, the motivations are different. While Longformer and Reformer focus on time and memory complexity, we focus on attention patterns in cases where a translation model fails to converge during training.

Conclusion
We investigated the main reasons for Transformer training failure in document-level MT, finding that target-to-source attention is a key factor. Based on this observation, we designed a simple extension of the standard Transformer architecture, using group tags for attention guiding. Experiments show that the resulting G-Transformer converges fast and stably on small and large data, giving state-of-the-art results compared to existing models under both pre-training and random initialization settings.

We apply word-dropout (Bowman et al., 2016) with a probability of 0.3 on both the source and the target inputs.
Fine-tuned on Sentence-Level Transformer. We use the parameters of an existing sentence-level Transformer to initialize G-Transformer. We copy the parameters of the multi-head attention in Transformer to the group multi-head attention in G-Transformer, leaving the global multi-head attention and the gates randomly initialized. For the global multi-head attention and the gates, we use a learning rate of 5e-4, while for the other components, we use a smaller learning rate of 1e-4. All the parameters are jointly trained using the Adam optimizer with 4000 warmup steps. We apply word-dropout with a probability of 0.1 on both the source and the target inputs.
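The two-learning-rate scheme can be sketched as optimizer parameter groups, in the style accepted by common deep-learning optimizers. The module-name substrings used to identify the freshly initialized components are hypothetical:

```python
def build_param_groups(named_params,
                       new_modules=("global_attn", "gate"),
                       lr_new=5e-4, lr_rest=1e-4):
    """Split parameters into two optimizer groups: freshly initialized
    modules (global attention, gates) get the larger learning rate
    5e-4; parameters copied from the pre-trained model get 1e-4."""
    new, rest = [], []
    for name, param in named_params:
        if any(key in name for key in new_modules):
            new.append(param)
        else:
            rest.append(param)
    return [{"params": new, "lr": lr_new},
            {"params": rest, "lr": lr_rest}]
```

The returned list can be passed directly to an optimizer constructor that supports per-group learning rates.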
Fine-tuned on mBART25. Similar to the fine-tuning on the sentence-level Transformer, we also copy parameters from mBART25 to G-Transformer, leaving the global multi-head attention and the gates randomly initialized. We follow the original settings to train the model, using the Adam optimizer with a learning rate of 3e-5 and 2500 warmup steps. Here, we do not apply word-dropout, which empirically hurts performance.

C.2 Results on Model Size
As shown in Table 7, G-Transformer has relatively stable performance across model sizes. When increasing the model size from Base to Big, the performance drops by about 0.24, 1.33, and 0.14 s-BLEU points, respectively. Moving further to the Large model, the performance drops by about 17.06, 8.54, and 0.53 s-BLEU points, respectively. Although the performance drop on the small datasets is large due to overfitting with larger models, the drop on the large Europarl dataset is relatively small, indicating stable training across model sizes.