Rethinking Document-level Neural Machine Translation

This paper does not aim to introduce a novel model for document-level neural machine translation. Instead, we return to the original Transformer model and seek to answer the following question: Is the capacity of current models strong enough for document-level translation? Interestingly, we observe that the original Transformer with appropriate training techniques can achieve strong results for document translation, even on documents of up to 2,000 words. We evaluate this model and several recent approaches on nine document-level datasets and two sentence-level datasets across six languages. Experiments show that document-level Transformer models outperform sentence-level ones and many previous methods on a comprehensive set of metrics, including BLEU, four lexical indices, three newly proposed assistant linguistic indicators, and human evaluation.


Introduction
Neural machine translation (Bahdanau et al., 2015; Wu et al., 2016; Vaswani et al., 2017) has achieved great progress and reached near human-level performance. However, most current sequence-to-sequence NMT models translate sentences individually. In such cases, discourse phenomena such as pronominal anaphora, lexical consistency, and document coherence, which depend on long-range context extending beyond a few previous sentences, are neglected (Bawden et al., 2017). As a result, Läubli et al. (2018) find that human raters still show a markedly stronger preference for human translations when evaluating at the level of documents.
To address this, recent studies come up with different structures in order to include discourse information, namely, introducing adjacent sentences into the encoder or decoder as document contexts. Experimental results show effective improvements on universal translation metrics like BLEU (Papineni et al., 2002) and document-level linguistic indices (Tiedemann and Scherrer, 2017; Bawden et al., 2017; Werlen and Popescu-Belis, 2017; Müller et al., 2018; Voita et al., 2019). Unlike previous work, this paper does not aim at introducing a novel model. Instead, we hope to answer the following question: Is the basic sequence-to-sequence model strong enough to directly handle document-level translation? To this end, we head back to the original Transformer (Vaswani et al., 2017) and conduct literal document-to-document (Doc2Doc) training.
Though many studies report negative results for naive Doc2Doc translation, we successfully activate it with Multi-resolutional Training, which involves multiple levels of sequences. It turns out that end-to-end document translation is not only feasible but also stronger than sentence-level models and previous studies. Furthermore, if assisted by an extra sentence-level corpus, which can be obtained much more easily, the model significantly improves performance and achieves state-of-the-art results. It is worth noting that our method does not change the model architecture and needs no extra parameters.
Our experiments are conducted on nine document-level datasets, including TED (ZH-EN, EN-DE), News (EN-DE, ES-EN, FR-EN, RU-EN), Europarl (EN-DE), Subtitles (EN-RU), and a newly constructed News dataset (ZH-EN). Additionally, two sentence-level datasets are adopted in further experiments: Wikipedia (EN-DE) and WMT (ZH-EN). Experimental results show that our strategy outperforms previous methods on a comprehensive set of metrics, including BLEU, four lexical indices, three newly proposed assistant linguistic indicators, and human evaluation. Besides serving as evidence of improvement, our newly proposed document-level datasets and metrics can also be a useful contribution to the community.

Re-examining Recent DNMT Studies
For DNMT, though many improvements have been reported, a couple of studies have challenged these results (Kim et al., 2019; Jwalapuram et al., 2020). We also find that some previous gains should, to some extent, be attributed to overfitting. The most commonly used datasets in previous work are News Commentary and TED Talks, which contain only 200 thousand sentences. The small scale of these datasets gives rise to frequent overfitting, not to mention that the distribution of the test set is highly similar to that of the training set. Some work even conducts an unfair comparison, with dropout=0.1 for sentence-level models and dropout=0.2 for document-level models (Maruf et al., 2019; Zheng et al., 2020). As a result, regularization and overfitting on small datasets make the improvements not solid enough.
To verify our assumption, we retrain sentence-level models with different hyperparameters. We follow the datasets provided by Maruf et al. (2019) and Zheng et al. (2020), including TED (ZH-EN/EN-DE), News (EN-DE), and Europarl (EN-DE), as well as all the model architecture settings they adopt, including a four-layer Transformer base version.
As the results show, though state-of-the-art results are yet to be obtained, the gap between sentence and document models has been largely narrowed. As for Europarl, a much higher baseline has been easily achieved, which also makes other reported improvements less solid.
Our results show that preceding experiments lack comparison with a strong baseline. A substantial proportion of the improvements may come from the regularization effect of the models, since they bring in extra parameters for context encoders or hierarchical attention weights. However, such regularization can also be achieved in sentence-level models and is not targeted at improving document coherence. Essentially, the small scale of the related datasets and the identically distributed test sets make the improvements questionable. Kim et al. (2019) draw the same conclusion: well-regularized or pre-trained sentence-level models can beat document-level models in the same settings. They find that most improvements are not from coreference or lexical choice but are "not interpretable". Similarly, Jwalapuram et al. (2020) adopt a wide evaluation and find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena. Other work likewise finds that the extra context encoders act more like a noise generator and that the BLEU improvements mainly come from robust training instead of the leverage of contextual information. All three studies appeal for stronger baselines for a fair comparison.
We suggest that the current research tendency in DNMT should be reviewed, since it is hard to tell whether the improvements are targeted at document coherence or are just ordinary regularization, especially when complicated modules are introduced. Therefore, as a simpler alternative, we return to the original, concise approach, using an end-to-end training framework to cope with document translation.
In this section, we attempt to analyze the different training patterns for DNMT. Firstly, let us formulate the problem.
Let D_x = {x^(1), x^(2), ..., x^(M)} be a source-language document containing M source sentences. The goal of document-level NMT is to translate the document D_x in language x into a document D_y = {y^(1), y^(2), ..., y^(N)} in language y. We use L_y^(i) to denote the length of sentence y^(i). Previous work translates a document sentence by sentence, regarding DNMT as a step-by-step sentence generation problem (Doc2Sent):

P(D_y | D_x) = ∏_{i=1}^{N} ∏_{j=1}^{L_y^(i)} P(y_j^(i) | S^(i), T^(i), y_{<j}^(i))

where S^(i) is the context on the source side, which depends on the model architecture and comprises only two or three sentences in many works. Most current work focuses on S^(i), utilizing hierarchical attention or extra encoders. T^(i) is the context on the target side, which is involved by only a couple of works; they usually make use of a topic model or word cache to form T^(i).
Different from Doc2Sent, we propose to resolve document translation in an end-to-end, document-to-document (Doc2Doc) pattern:

P(D_y | D_x) = ∏_{i=1}^{N} ∏_{j=1}^{L_y^(i)} P(y_j^(i) | D_x, y^(<i), y_{<j}^(i))

where D_x is the complete context on the source side, and y^(<i) is the complete historical context on the target side.

Why Do We Dive into Doc2Doc?
Full Source Context: First, many Doc2Sent studies show that adding more context sentences can harm the results (Miculicich et al., 2018; Tu et al., 2018). Therefore, many Doc2Sent works are more of "a couple of sentences to sentence" models, since they only involve two or three preceding sentences as context. However, broader contexts provide more information, which should theoretically lead to more improvements. Thus, we revisit involving the full context and choose Doc2Doc, as it is required to take account of all the source-side context.

Full Target Context: Second, many Doc2Sent works abandon the target-side historical context, and some even claim that it is harmful to translation quality (Wang et al., 2017; Tu et al., 2018). However, once the cross-sentence language model is discarded, problems such as tense mismatch (especially when the source language is tenseless, like Chinese) may occur. Therefore, we revisit involving the full context and choose Doc2Doc, as it treats the whole document as a sequence and can naturally take advantage of all the target-side historical context.
Relaxed Training: Third, Doc2Sent restricts the training scenario. Previous work focuses on adjusting the model structure to feed in preceding source sentences, so the training data has to be in the form of consecutive sentences to fit the model's input interface. As a result, it is hard to use large amounts of piecemeal parallel sentences. Such a rigid form of training data also greatly limits the model's potential, because the scale of parallel sentences can be tens of times that of parallel documents. On the contrary, Doc2Doc can naturally absorb all kinds of sequences, including sentences and documents.
Simplicity: Last, Doc2Sent inevitably introduces extra model modules with extra parameters in order to capture contextual information. This complicates the model architecture, making it hard to extend or generalize. On the contrary, Doc2Doc does not change the model structure and brings in no additional parameters.

Multi-resolutional Doc2Doc NMT
Although Doc2Doc seems more concise and promising in multiple respects, it is not widely recognized. Prior studies conduct experiments by directly feeding whole documents into the model; we refer to this as Single-resolutional Training (denoted SR Doc2Doc). Their experiments report extremely negative results unless the model is pre-trained in advance: the model either suffers a large drop in performance or does not work at all. As pointed out by Koehn and Knowles (2017), one of the six challenges in neural machine translation is the dramatic drop of quality as the length of the sentences increases.
However, we find that Doc2Doc can be activated on any dataset and obtain better results than Doc2Sent models as long as we employ Multi-resolutional Training, which mixes documents with shorter segments such as sentences or paragraphs (denoted MR Doc2Doc). Specifically, we evenly split each document into k parts for multiple values of k ∈ {1, 2, 4, 8, ...} and collect all the resulting sequences together. For example, a document containing eight sentences is split into two four-sentence segments, four two-sentence segments, and eight single-sentence segments. Finally, fifteen sequences (15 = 1 + 2 + 4 + 8) are gathered and fed into sequence-to-sequence training.
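The splitting scheme above can be sketched in a few lines. This is a minimal sketch: the function name and the rounding used for uneven splits are our own choices, not prescribed by the method.

```python
def multi_resolution_segments(sentences, max_k=None):
    """Split a document into progressively finer, evenly sized segments.

    For k in {1, 2, 4, 8, ...} (up to the number of sentences), the
    document is cut into k consecutive parts; the parts from every
    resolution are collected together as training sequences.
    """
    n = len(sentences)
    segments = []
    k = 1
    while k <= n and (max_k is None or k <= max_k):
        size = n / k  # target chunk length at this resolution
        for j in range(k):
            # Round fractional boundaries so uneven documents still split.
            chunk = sentences[round(j * size):round((j + 1) * size)]
            if chunk:
                segments.append(" ".join(chunk))
        k *= 2
    return segments
```

For an eight-sentence document this yields the fifteen sequences described above: one full document, two halves, four quarters, and eight single sentences.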
In this way, the model can acquire the ability to translate long documents, since it is assisted by easier and shorter segments. As a result, multi-resolutional Doc2Doc is able to translate all forms of sequences, including extremely long ones such as a document with more than 2,000 tokens, as well as shorter ones like sentences. In the following sections, we conduct the same experiments as the aforementioned studies by translating whole documents directly and atomically. Lastly, we propose a new document-level dataset in this paper, whose source, scale, and benchmark will be illustrated in the subsequent sections.
For sentences without any ending symbol inside documents, periods are manually added. For our Doc2Doc experiments, the development and test sets are documents merged by sentences. We list all the detailed information of used datasets in Table 2, including languages, scales, and downloading URLs for reproducibility.

Models
For the model setting, we follow the base version of the Transformer (Vaswani et al., 2017): 6 layers for both the encoder and decoder, a model dimension of 512, 2048 dimensions for the feed-forward layers, and 8 attention heads. For all experiments, we use subword units (Sennrich et al., 2016) with 32K merge operations on both sides and cut out tokens appearing fewer than five times. The models are trained with a batch size of 32,000 tokens on 8 Tesla V100 GPUs. Parameters are optimized with the Adam optimizer (Kingma and Ba, 2015), with β1 = 0.9, β2 = 0.98, and ε = 10^-9. The learning rate is scheduled according to the method proposed by Vaswani et al. (2017), with warmup_steps = 4000. Label smoothing (Szegedy et al., 2016) is also adopted. We set dropout=0.3 for small datasets like TED and News, and dropout=0.1 for larger datasets like Europarl, unless stated otherwise.

Table 3: Experiment results of document translation. "-" means not provided. Besides baselines cited from previous papers, we also re-implement our strong baseline with the best hyper-parameters (dropout, as in Section 2) on the development sets. "++" indicates using an additional sentence corpus. From the upper part, though SR Doc2Doc yields disappointing translations and even fails on TED, MR Doc2Doc achieves much better results, proving the feasibility of Doc2Doc. From the lower part, an extra sentence-level corpus can activate SR Doc2Doc and boost MR Doc2Doc, yielding the best results.
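The learning-rate schedule referenced above (Vaswani et al., 2017) combines a linear warmup with inverse-square-root decay and can be written as a one-liner:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Schedule from Vaswani et al. (2017): the rate grows linearly for
    `warmup_steps` steps, peaks, then decays proportionally to step^-0.5."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

With d_model = 512 and warmup_steps = 4000, the peak rate at step 4000 is about 7e-4, matching the settings used in this paper.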

Evaluation
For inference, we generate hypotheses with a beam size of 5. Following previous related work, we adopt tokenized case-insensitive BLEU (Papineni et al., 2002). Specifically, we follow prior work in calculating sentence-level BLEU (denoted s-BLEU) and document-level BLEU (denoted d-BLEU), respectively. For d-BLEU, the scored object is either the concatenation of generated sentences or the directly generated documents. Since our documents are generated atomically and are hard to split into sentences, we only report d-BLEU for Doc2Doc.

Doc2Doc matters. We also compare MR Doc2Doc to an intuitive baseline: MR Doc2Sent. The latter is trained in a typical Doc2Sent way: the source is the whole past context, and the target is the current sentence. From the experimental results, we can see that Doc2Doc outperforms it due to its much broader context; a language model can effectively improve translation performance (Sun et al., 2021).
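The only difference between s-BLEU and d-BLEU is the unit being scored. A minimal sketch of d-BLEU follows: a plain uniform-weight corpus BLEU applied to concatenated documents. Real evaluations use tokenized case-insensitive BLEU as described above, so this simplified implementation is illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Plain corpus-level BLEU: uniform n-gram weights plus brevity penalty."""
    p_num = [0] * max_n
    p_den = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            p_num[n - 1] += sum(min(c, r[g]) for g, c in h.items())  # clipped matches
            p_den[n - 1] += sum(h.values())
    if any(num == 0 for num in p_num):
        return 0.0
    log_p = sum(math.log(num / den) for num, den in zip(p_num, p_den)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_p)

def d_bleu(doc_hyps, doc_refs):
    """d-BLEU: concatenate each document's sentences into one token
    sequence and score the concatenations as single units."""
    flat = lambda doc: [tok for sent in doc for tok in sent]
    return corpus_bleu([flat(d) for d in doc_hyps], [flat(d) for d in doc_refs])
```

Because d-BLEU only needs document boundaries, it applies equally to sentence-by-sentence outputs (concatenated afterwards) and to atomically generated documents.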
To show the universality of MR Doc2Doc, we also conduct experiments on other language pairs: Spanish, French, and Russian to English. The results show that the gains hold across these pairs. It is worth noting that all our results are obtained without any adjustment of the model architecture or any extra parameters.

Additional Sentence Corpus Helps
Furthermore, introducing an extra sentence-level corpus is also an effective technique. This can be regarded as another form of multi-resolutional training, as it supplements more sentence-level information. This strategy makes an impact in two ways: activating SR Doc2Doc and boosting MR Doc2Doc.
We merge the datasets mentioned above with Wikipedia (EN-DE) and WMT (ZH-EN), two out-of-domain sentence-level datasets, for these experiments. 2 As is shown in the lower part of Table 3, on the one hand, SR Doc2Doc models are activated and reach levels comparable with Sent2Sent models when assisted with additional sentences. On the other hand, MR Doc2Doc obtains the best results on all datasets and further widens the gap with the boost from the sentence corpus. Even out-of-domain sentences can enhance the model's ability to translate documents, which again proves the importance of multi-resolutional assistance.
In addition, as analyzed in the previous section, Doc2Sent models are not compatible with sentence-level corpora, since their input interface is specially designed for consecutive sentences. However, Doc2Doc models can naturally draw on the merits of any parallel pairs, including piecemeal sentences. Considering that the amount of parallel sentence-level data is much larger than that of document-level data, MR Doc2Doc has powerful application potential compared with Doc2Sent.

Improved Discourse Coherence
Beyond BLEU, we must verify whether Doc2Doc truly learns to utilize the context to resolve discourse inconsistencies. We use the contrastive test sets proposed by Voita et al. (2019), which cover deixis, lexical consistency, ellipsis (inflection), and ellipsis (verb phrase) on English-Russian. Each instance contains one positive translation and a few negative ones, differing in only one specific word. With force decoding, if the score of the positive candidate is the highest, the instance is counted as correct.
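The contrastive scoring protocol can be sketched as follows. Here `score_fn` stands in for the force-decoding log-probability assigned by the model; the function names are ours, not from the test suite.

```python
def contrastive_accuracy(instances, score_fn):
    """Fraction of instances whose positive candidate outscores every
    negative one.

    instances: iterable of (source, positive, negatives) triples
    score_fn:  callable (source, candidate) -> model log-probability
    """
    instances = list(instances)
    correct = 0
    for src, positive, negatives in instances:
        pos_score = score_fn(src, positive)
        # Correct only if the positive beats all contrastive variants.
        if all(pos_score > score_fn(src, neg) for neg in negatives):
            correct += 1
    return correct / len(instances)
```

Because scoring needs only sequence log-probabilities, the same harness applies unchanged to Sent2Sent, Doc2Sent, and Doc2Doc models.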
As is shown in Table 5, MR Doc2Doc achieves significant improvements and obtains the best results, which proves that MR Doc2Doc indeed captures contextual information well and maintains cross-sentence coherence.

Strong Context Sensibility
Li et al. (2020) find that the performance of previous context-aware systems does not decrease with intentionally incorrect context, and they question whether the context encoders actually use the context. To verify whether Doc2Doc truly takes advantage of the contextual information in the document, we also deliberately run inference with wrong context. If the model neglected discourse dependency, there would be no difference in performance. Specifically, we first randomly shuffle the sentence order inside each document, marking it as Local Shuffle. Furthermore, we randomly swap sentences among all the documents to make the context even more disordered, marking it as Global Shuffle. As is shown in Table 6, the misleading context results in a significant BLEU drop for the Doc2Doc model. Moreover, Global Shuffle does more harm than Local Shuffle, showing that more chaotic contexts cause more damage; after all, Local Shuffle still preserves some general information, such as topic or tense. These experiments prove that the model makes use of the context.
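The two perturbations can be sketched as below. Helper names are hypothetical; a corpus is represented as a list of documents, each a list of sentences.

```python
import random

def local_shuffle(docs, seed=0):
    """Shuffle the sentence order within each document independently."""
    rng = random.Random(seed)
    out = []
    for doc in docs:
        doc = list(doc)
        rng.shuffle(doc)
        out.append(doc)
    return out

def global_shuffle(docs, seed=0):
    """Swap sentences across documents: pool every sentence in the
    corpus, shuffle the pool, then refill each document to its
    original length."""
    rng = random.Random(seed)
    pool = [s for doc in docs for s in doc]
    rng.shuffle(pool)
    out, i = [], 0
    for doc in docs:
        out.append(pool[i:i + len(doc)])
        i += len(doc)
    return out
```

Local Shuffle keeps each document's own sentences (so topic and tense survive), while Global Shuffle mixes sentences from unrelated documents, which matches the larger BLEU drop it causes.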

Compatible with Sentences
We also analyze how performance varies with sequence length. Taking Europarl as an example, we randomly split documents into shorter paragraphs of different lengths and evaluate them with our models, as shown in Figure 1. Obviously, the model trained only on the sentence-level corpus suffers a severe drop when translating long sequences, while the model trained only on the document-level corpus shows the opposite behavior, which reveals the importance of data distribution. However, the model trained with our multi-resolutional strategy copes well with all situations, breaking the limitation of sequence length in translation. By conducting MR Doc2Doc, we obtain an all-in-one model capable of translating sequences of any length, avoiding the need to deploy separate systems for sentences and documents.

Further Evidence with Newly Proposed Datasets and Metrics
To further verify our conclusions and push the development of this field, we also contribute a new dataset along with new metrics. Specifically, we propose a package consisting of a large and diverse parallel document corpus, three deliberately designed metrics, and correspondingly constructed test sets 3 . On the one hand, they make our conclusions more solid. On the other hand, they may benefit future research by expanding the comparison scenarios.

Parallel Document Corpus
We crawl bilingual news corpus from two websites 4 5 with both English and Chinese content provided.
The detailed cleaning procedure is in Appendix B.
Finally, 1.39 million parallel sentences within almost 60 thousand parallel documents are collected. The corpus contains large-scale data with internal dependency in different lengths and diverse domains, including politics, finance, health, culture, etc. We name it PDC (Parallel Document Corpus).

Metrics
To inspect the coherence improvement, we sum up three common linguistic features in document corpora that the Sent2Sent model cannot handle: Tense Consistency (TC): If the source language is tenseless (e.g., Chinese), it is hard for Sent2Sent models to maintain tense consistency.
Conjunction Presence (CP): Traditional models ignore cross-sentence dependencies, so sentence-level translation may cause conjunctions like "And" to be missing.
Pronoun Translation (PT): In pro-drop languages such as Chinese and Japanese, pronouns are frequently omitted. When translating from a pro-drop language into a non-pro-drop language (e.g., Chinese-to-English), the invisible dropped pronouns may be missing from the translation (Wang et al., 2016a,b, 2018a).
Afterward, we collect documents that contain abundant verbs in the past tense, conjunctions, and pronouns, as test sets. These words, as well as their positions, are labeled. Some cases are in Appendix C.
For each word-position pair <w, p>, we check whether w appears in the generated document within a rough span around the expected position, and we calculate the appearance percentage as the evaluation score. Specifically:

score = (1/n) Σ_{i=1}^{n} (1/|W_i|) Σ_{w_ij ∈ W_i} 1[ w_ij appears in y_i within [α_i · p_ij − d, α_i · p_ij + d] ]

where n is the number of sequences in the test set, W_i is the labeled word set of sequence i, w_ij is a labeled word, y_i is output i, p_ij is the labeled position of w_ij in reference i, α_i is the length ratio between translation and reference, and d is the span radius. We set d = 20 in this paper and calculate the geometric mean of the three metrics as an overall score, denoted TCP.
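A sketch of this span-matching score under the definitions above. Tokenized outputs are assumed, and the function name is ours; the equation was reconstructed from the variable descriptions, so treat this as illustrative.

```python
def span_match_score(outputs, labels, ratios, d=20):
    """Fraction of labeled word-position pairs <w, p> whose word appears
    in the output within a window of radius d around the length-scaled
    reference position.

    outputs: list of token lists, one per generated sequence
    labels:  list of lists of (word, position) pairs per reference
    ratios:  per-sequence output/reference length ratio alpha_i
    """
    hits, total = 0, 0
    for y, pairs, alpha in zip(outputs, labels, ratios):
        for w, p in pairs:
            total += 1
            center = int(alpha * p)  # expected position in the output
            lo, hi = max(0, center - d), center + d + 1
            if w in y[lo:hi]:
                hits += 1
    return hits / total if total else 0.0
```

Running this separately over the tense, conjunction, and pronoun label sets gives TC, CP, and PT; their geometric mean gives TCP.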

Test Sets
Using the aforementioned coherence indices for filtering, the test sets are built from websites entirely different from those of the training corpus to avoid overfitting. Meanwhile, to alleviate the bias of human translation, the English documents are selected as the reference and manually translated into Chinese documents as the source. Finally, a total of nearly five thousand sentences in 148 documents is obtained.

Benchmark
Basic experiments with Sent2Sent and Doc2Doc are conducted on our new dataset, along with the full WMT ZH-EN corpus, a sentence-level dataset containing around 20 million pairs. 6 We use WMT newstest2019 as the development set and evaluate the models with our new test sets as well as the new metrics. The results are shown in Table 7.

BLEU: In terms of BLEU, MR Doc2Doc outperforms Sent2Sent, illustrating the positive effect of long-range context. Moreover, with an extra sentence-level corpus, Doc2Doc again shows significant improvements.
Fine-grained Metrics: Our metrics show much clearer improvements. Considering the usage of contextual information, tense consistency is better guaranteed with Doc2Doc. Meanwhile, Doc2Doc is much more capable of translating the invisible pronouns by capturing original referent beyond the current sentence. Finally, the conjunction presence shows the same tendency.
Human Evaluation: Human evaluation is also conducted to illustrate the reliability of our metrics. One-fifth of the translated documents are sampled and scored by linguistics experts from 1 to 5, according to not only translation quality but also translation consistency (Sun et al., 2020). As is shown in Table 7, human evaluation shows a strong correlation with TCP. More specifically, the Pearson Correlation Coefficient (PCC) between human scores and TCP is higher than that with BLEU (97.9 vs. 94.1). Table 8 shows an example of document translation. The Sent2Sent model neglects the cross-sentence context and mistranslates the ambiguous word, which leads to a confusing reading experience. In contrast, the Doc2Doc model can grasp the full picture of the historical context and make accurate decisions.

Case Study
Source: 与大多数欧洲人一样, 德国总理对美国总统的"美国优先"民族主义难以掩饰不屑。... 但她已进入第四个、也必定是最后一个总理任期。

Sent2Sent: Like most Europeans, the German chancellor has struggled to hide his disdain for the US president's "America First" nationalism. ... But she has entered a fourth and surely last term as prime minister.

Doc2Doc: Like most Europeans, the German chancellor's disdain for the US president's "America First" nationalism is hard to hide. ... But she has entered her fourth and certainly final term as chancellor.

Table 8: Coherence problem in document translation. Without discourse context, the Chinese word "总理" is usually translated as "prime minister", while in the context of "German", it should be translated as "chancellor".
Also, we manually switch the context information on the source side to test the model's sensitivity, as shown in Table 9. It turns out that Doc2Doc is able to adapt to different contexts.

Limitation
Though multi-resolutional Doc2Doc achieves direct document translation and obtains better results, one big challenge remains: efficiency. The computation cost of self-attention in the Transformer grows quadratically with sequence length. As we feed the entire document into the model, memory usage becomes a bottleneck for larger model deployment, and inference speed may suffer if no parallelization is applied. Recently, many studies have focused on improving efficiency for long-sequence processing (Correia et al., 2019; Kitaev et al., 2020; Wu et al., 2020; Beltagy et al., 2019; Rae et al., 2020). We leave reducing the computation cost to future work.

Related Work
Document-level neural machine translation is an important task and has been abundantly studied with multiple datasets as well as methods.
The mainstream research in this field focuses on model architecture improvements. Specifically, several recent attempts extend the Sent2Sent approach to a Doc2Sent-like one (e.g., Wang et al., 2017). Tu et al. (2018) propose to augment NMT models with a cache-like memory network, which generates the translation depending on the decoder history retrieved from the memory.
Besides, some works address the problem in other ways. Jean and Cho (2019) propose a regularization term, using a multi-level pairwise ranking loss, to encourage the model to focus more on the additional context. There are also works sharing similar ideas with ours. Tiedemann and Scherrer (2017) and Bawden et al. (2017) explore concatenating two consecutive sentences and generating two sentences directly; in contrast, we leverage much longer information and capture the full context. Junczys-Dowmunt (2019) cuts documents into long segments and feeds them into training, similar to BERT (Devlin et al., 2019). There are at least three main differences. Firstly, they need to add specific boundary tokens between sentences, while we directly translate the original documents without any additional processing. Secondly, we propose a novel multi-resolutional training paradigm that shows consistent improvements over regular training. Thirdly, for extremely long documents, they restrict the segment length to 1,000 tokens or apply truncation, while we preserve entire documents and achieve literal document-to-document training and inference.
Finally, our work is also related to a series of studies in long sequence generation, such as GPT (Radford, 2018), GPT-2 (Radford et al., 2019), and Transformer-XL (Dai et al., 2019). All suggest that deep neural generation models have the potential to process long-range sequences well.

Conclusion
In this paper, we try to answer the question of whether document-to-document translation works. Naive Doc2Doc can fail in multiple scenarios; however, with the multi-resolutional training proposed in this paper, it can be successfully activated. Unlike traditional methods that modify the model architecture, our approach introduces no extra parameters. A comprehensive set of experiments on various metrics shows the advantages of MR Doc2Doc. In addition, we contribute a new document-level dataset as well as three new metrics to the community.

A Oversampling Illustration
When combining document-level datasets with sentence-level datasets (especially out-of-domain corpora), we employ oversampling for the non-MR settings. This keeps their data ratio the same as in the MR setting and helps their performance. Since the data size of MR is around 6 times that of non-MR (≈ log2 64), as shown in Table 10, we mainly oversample 6 times. The contrastive experiments are in Table 11. We attribute the improvements to the reduced proportion of out-of-domain data.

C Cases of Our Test Sets
Apart from the statistics in the main paper, we also provide some cases from our test sets to illustrate the value of the test sets and metrics, as shown in Table 12.