On Search Strategies for Document-Level Neural Machine Translation

Compared to sentence-level systems, document-level neural machine translation (NMT) models produce more consistent output across a document and are better able to resolve ambiguities within the input. There are many works on document-level NMT, mostly focusing on modifying the model architecture or training strategy to better accommodate the additional context input. In most works, however, the question of how to perform search with the trained model is scarcely discussed, sometimes not mentioned at all. In this work, we aim to answer the question of how to best utilize a context-aware translation model in decoding. We start with the most popular document-level NMT approach and compare different decoding schemes, some from the literature and others proposed by us. In the comparison, we use both standard automatic metrics and specific linguistic phenomena on three standard document-level translation benchmarks. We find that most commonly used decoding strategies perform similarly to each other and that higher-quality context information has the potential to further improve the translation.


Introduction
Neural machine translation (NMT) (Bahdanau et al., 2014; Vaswani et al., 2017) is widely adopted and produces excellent translations for many domains and language pairs. However, when these automatic translations are evaluated on the document level, they reveal shortcomings when it comes to consistency in style, entity translation, or correct inference of gender, among other things (Läubli et al., 2018; Müller et al., 2018; Thai et al., 2022). Document-level NMT aims to resolve these shortcomings by taking the context of a sentence into account during translation. There exist many works on the topic of document-level NMT, proposing various changes to the standard transformer (Vaswani et al., 2017) architecture and training criteria to improve context incorporation and consequently translation quality. However, while the modeling and training aspects are covered in great detail in these works, the exact decoding strategy is often not clearly described and sometimes not mentioned at all.
In this work, we set out to answer the question of which decoding strategy is most beneficial for document-level NMT systems. We compare all commonly used strategies, as well as some additional ones, on three standard document-level translation benchmarks. We find that most of the analyzed decoding strategies perform similarly to each other. Also, higher-quality context information can lead to better translations in certain scenarios.

Related Work
The earliest approaches to document-level NMT simply concatenate consecutive sentences without any further changes to the architecture compared to sentence-level systems (Tiedemann and Scherrer, 2017; Agrawal et al., 2018). Later, some changes were made to the vanilla transformer architecture, like segment embeddings (Ma et al., 2020) or attention masking (Zhang et al., 2020; Petrick et al., 2022), and a move was made towards translating longer segments (Junczys-Dowmunt, 2019; Liu et al., 2020; Zheng et al., 2021; Bao et al., 2021; Sun et al., 2022). Other works employ a separate encoder to include the additional context on the source side (Jean et al., 2017; Bawden et al., 2018; Zhang et al., 2018; Voita et al., 2018) or make use of the context in a post-editing fashion (Voita et al., 2019; Xiong et al., 2019). Further approaches include the usage of a cache (Wang et al., 2017; Maruf and Haffari, 2018; Tu et al., 2018) or hierarchical attention networks (Miculicich et al., 2018; Maruf et al., 2019; Wong et al., 2020). Recently, several works have concluded that the simple concatenation approach used with the vanilla transformer architecture performs as well as, if not better than, more complicated approaches that modify the model structure (Sun et al., 2022; Majumde et al., 2022). Since we also observed this in our internal comparisons, we decided to focus on this simple approach for our analysis in this work.
Several works have argued that the improvements seen in automatic metric scores for document-level NMT systems stem from regularization effects rather than from utilizing the additional context information (Kim et al., 2019; Li et al., 2020; Nguyen et al., 2021). In order to better assess the improvements gained by document-level NMT, several targeted test suites have been released (Müller et al., 2018; Bawden et al., 2018; Voita et al., 2019; Jwalapuram et al., 2019). However, all of these are based on just scoring contrastive examples without actually translating anything. Recently, Jiang et al. (2022) and Currey et al. (2022) have released frameworks that allow scoring MT systems on their ability to generate contextually correct translations.

Search Strategies
Training a document-level NMT system that takes the last k sentences as context is straightforward using the standard concatenation strategy (Tiedemann and Scherrer, 2017). Given document-level training data (F_n, E_n), n = 1, ..., N, where (F_n, E_n) denotes the n-th source-target sentence pair, during training we optimize the parameters Θ of the model towards

  Θ̂ = argmax_Θ Σ_{n=1}^{N} log p_Θ(E_{n-k}^n | F_{n-k}^n)

Here, E_{n-k}^n denotes the concatenation of the sentences E_{n-k}, ..., E_n.
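As an illustration, building the concatenated training pairs can be sketched as follows. The function name and the literal `" <sep> "` separator string are our own choices for this sketch; a real pipeline would operate on tokenized data.

```python
def make_training_examples(src_sents, tgt_sents, k, sep=" <sep> "):
    """Build concatenated (F_{n-k}^n, E_{n-k}^n) training pairs.

    For each position n, the k preceding sentences are prepended as
    context on both the source and the target side, truncated at the
    document start. `sep` is a hypothetical separator token.
    """
    examples = []
    for n in range(len(src_sents)):
        lo = max(0, n - k)
        examples.append((sep.join(src_sents[lo:n + 1]),
                         sep.join(tgt_sents[lo:n + 1])))
    return examples
```

Near the document start, fewer than k context sentences are available, so the window is simply truncated rather than padded.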
During search, given a document F_1^M, we want to find the best translation Ê_1^M according to the model. Of course, exact search cannot be performed, and different works have used different methods to generate a translation:

full segment (Liu et al., 2020; Bao et al., 2021; Sun et al., 2022): we split the document into non-overlapping parts and translate each part separately using

  Ê_{i-k}^i = argmax_{E_{i-k}^i} p_Θ(E_{i-k}^i | F_{i-k}^i)    (1)

which is approximated using standard beam search on the token level.
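The non-overlapping split behind the full segment strategy can be sketched as follows, with m denoting the number of sentences per part (the helper name is ours):

```python
def full_segment_split(doc_sents, m):
    """Split a document into non-overlapping parts of up to m sentences.

    For the 'full segment' strategy, each part is translated in one
    beam-search pass and kept whole; the last part may be shorter.
    """
    return [doc_sents[i:i + m] for i in range(0, len(doc_sents), m)]
```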
last sentence (Bawden et al., 2018; Agrawal et al., 2018; Zhang et al., 2020; Petrick et al., 2022; Majumde et al., 2022): we split the document into overlapping parts ..., F_{i-k}^i, F_{i-k+1}^{i+1}, ... and translate each part separately using Equation 1. From each translated part, we choose only the last sentence to get one translation for every sentence in the document.
first sentence (Zhang et al., 2020): similar to last sentence, but from each translated part we choose only the first sentence to get one translation for every sentence in the document.
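The window bookkeeping shared by the last sentence and first sentence strategies can be sketched as follows; only the window construction and the sentence selection are shown, and the helper names are ours:

```python
def sliding_windows(doc_sents, k):
    """Overlapping windows F_{i-k}^i: one window per sentence position i,
    truncated at the document start."""
    return [doc_sents[max(0, i - k):i + 1] for i in range(len(doc_sents))]

def pick_sentence(translated_windows, position="last"):
    """Keep one sentence per translated window: the final sentence for the
    'last sentence' strategy, the first one for 'first sentence'."""
    idx = -1 if position == "last" else 0
    return [w[idx] for w in translated_windows]
```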

2-pass: we first translate every sentence independently without any context and then re-translate each sentence using the first-pass translations of the surrounding sentences as context.

doc-trans: we translate the document sentence by sentence from left to right, using the model's own translations of the preceding sentences as target-side context:

  Ê_i = argmax_{E_i} p_Θ(E_i | F_{i-k}^i, Ê_{i-k}^{i-1})

doc-trans (beam): similar to doc-trans, but instead of keeping just the best context Ê_1^{i-1}, we keep the top-h candidates and prune them after each step i, analogous to beam search on the token level. We use h = 12 for all our experiments, the same as our token-level beam size.
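A minimal sketch of the greedy doc-trans loop. Here `translate(src_window, tgt_context)` stands in for a hypothetical beam-search wrapper that returns only the translation of the last source sentence in the window; the function signature is our own assumption.

```python
def doc_trans(translate, src_sents, k):
    """Greedy 'doc-trans' decoding: translate sentence by sentence,
    feeding the model's own previous outputs back in as target-side
    context. `translate` is a hypothetical beam-search wrapper."""
    hyps = []
    for i in range(len(src_sents)):
        lo = max(0, i - k)
        hyps.append(translate(src_sents[lo:i + 1], hyps[lo:i]))
    return hyps
```

The beam variant would carry the top-h partial document translations through this loop instead of a single list `hyps`.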
cheating: this is just used as a tool for analysis. The translation of each sentence F_i is created using the true target reference E_1^M as context:

  Ê_i = argmax_{E_i} p_Θ(E_i | F_{i-k}^i, E_{i-k}^{i-1})

no context: this is also just used as a tool for analysis. The translation of each sentence F_i is created using no context information at all:

  Ê_i = argmax_{E_i} p_Θ(E_i | F_i)

Table 1: Computational cost of decoding (= number of forward passes through the decoder) for each of the search strategies described above. h denotes the sentence-level beam size.
The different search strategies also have different computational costs associated with them. The biggest factor in the decoding cost is the number of forward passes through the model, specifically the decoder. We list the computational costs of the different decoding approaches in Table 1, under the assumption that the document consists of N sentences with average sentence length L and that the model uses k − 1 sentences as context. Please note that the decoding time might scale differently than the cost in the table, since it heavily depends on the available hardware. For example, doc-trans and doc-trans (beam) might have the same decoding time if enough computational resources are available, since the additional computations in doc-trans (beam) can all be done in parallel.
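As a rough illustration of why the strategies differ in cost, one can count generated target tokens as a crude proxy for decoder forward passes. The sketch below is our own simplification under the stated assumptions and does not reproduce Table 1; the strategy names and the parameter w (sentences per translated window) are ours.

```python
def decoding_cost(strategy, N, L, w, h=1):
    """Generated-target-token count as a crude proxy for decoding cost,
    for a document of N sentences with average length L, where each
    translated window covers w sentences. Illustrative only."""
    if strategy == "full_segment":            # every sentence decoded exactly once
        return N * L
    if strategy in ("last_sentence", "first_sentence"):
        return N * w * L                      # each window re-decodes w sentences
    if strategy == "doc_trans":               # previous hypotheses are fixed context
        return N * L
    if strategy == "doc_trans_beam":          # h sentence-level hypotheses per step
        return h * N * L
    raise ValueError(f"unknown strategy: {strategy}")
```

Under this toy model, the overlapping-window strategies pay a factor of w over full segment, and the beam variant a further factor of h, which matches the intuition behind Table 1 even though caching and parallelism change the actual wall-clock picture.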

Experiments
We perform experiments on three document-level translation benchmarks, called NEWS (En→De), TED (En→It) and OS (En→De). For the details regarding data conditions and preparation, as well as model training, we refer to Appendix A. For the context-aware systems, we concatenate 3 adjacent sentences (i.e., k = 3) using a special token <sep>. For the two En→De tasks, we also evaluate the systems on the ContraPro test set (Müller et al., 2018). Instead of scoring and ranking the contrastive examples in ContraPro, as the authors originally envisioned, we translate the source side to calculate BLEU and TER, as well as to score the pronoun translations according to Section 4.1. We cannot evaluate the full segment search strategy on ContraPro, because the sentences are not adjacent, since they come from different documents.

Evaluating Pronoun Translation
As further analysis, we measure how well ambiguous pronouns are handled when translating from English to German. Regarding gender, the English third-person pronoun 'it' (and its other forms) can be translated to the German words 'er', 'sie' or 'es', depending on which noun it refers to. Ambiguities in formality, on the other hand, come from second-person pronouns, which can be rendered with a formal or an informal German pronoun. We compare the pronouns generated by the system against whether a formal or informal pronoun appears in the reference.
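A toy version of this pronoun evaluation is sketched below, assuming the ambiguous pronoun can be read off directly from each sentence; a real evaluation would need word alignments to link each pronoun to its English source. All function names are ours.

```python
import re

GENDERED = {"er", "sie", "es"}  # German translations of the English 'it'

def pronoun_labels(sentences):
    """Extract the first ambiguous third-person pronoun from each German
    sentence; None if no such pronoun occurs (toy stand-in for the
    alignment-based matching a real evaluation would need)."""
    labels = []
    for s in sentences:
        hits = [t for t in re.findall(r"\w+", s.lower()) if t in GENDERED]
        labels.append(hits[0] if hits else None)
    return labels

def macro_f1(hyp_labels, ref_labels):
    """Macro-averaged F1 over the pronoun classes, comparing hypothesis
    labels against reference labels."""
    f1s = []
    for c in sorted(GENDERED):
        tp = sum(h == c and r == c for h, r in zip(hyp_labels, ref_labels))
        fp = sum(h == c and r != c for h, r in zip(hyp_labels, ref_labels))
        fn = sum(h != c and r == c for h, r in zip(hyp_labels, ref_labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```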

Perplexities
First, we compare the perplexities of the hypotheses from the different search strategies, which are listed in Table 2. The first thing to note is that the reference has a much higher perplexity than all hypotheses, which is commonly seen for NMT systems. All document-level search strategies result in different hypotheses, which nevertheless have similar perplexity scores. Surprisingly, the cheating setting generates the worst translations perplexity-wise, even worse than using no context. This might be related to the observation that the reference has a worse perplexity than any hypothesis, which is a modeling error rather than a search error.
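For reference, the corpus-level perplexity behind such a comparison can be computed from the per-token log-probabilities the model assigns to each hypothesis; the helper below is a minimal sketch assuming natural-log probabilities.

```python
import math

def corpus_perplexity(token_logprobs):
    """Corpus-level perplexity from per-sentence lists of per-token
    natural-log probabilities: exp of the negative mean log-probability
    over all tokens."""
    flat = [lp for sent in token_logprobs for lp in sent]
    return math.exp(-sum(flat) / len(flat))
```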

Automatic Metrics
Next, we evaluate the hypotheses based on the common automatic metrics BLEU and TER. The results are shown in Table 3. The hypotheses created with no context seem to have the same quality as the sentence-level baseline. Surprisingly, the true reference as context does not improve performance on the NEWS and TED test sets. This indicates that the improvements seen on these test sets for the document-level system might not be related to better context incorporation. In contrast, the OS system creates the best hypotheses with the true reference as context. All the actual decoding strategies give similar performance in terms of BLEU and TER, with 2-pass decoding being a little bit behind.
A special case is the first sentence strategy, which performs quite well on the standard test sets but poorly on ContraPro. This is because ContraPro is designed such that the left-side context is more important for translation than the right side. Finally, we analyze the quality of the pronoun translations as discussed in Section 4.1. In principle, we could calculate the F1 score for both gender and formality on all En→De test sets. However, we discard the cases where one or more classes have less than 100 examples. This leaves us with the three test sets depicted in Table 4. As a sanity check, we also report the ContraPro accuracies calculated from scoring the contrastive references as described in Müller et al. (2018). They are 48.2/45.8 for the sentence-level and 68.2/82.2 for the document-level system on NEWS/OS, respectively. That means that with just scoring, we overestimate the capabilities of the system, but the trend is still consistent. Using the true reference leads to the best results in all cases. no context and first sentence leave us with sentence-level performance on the gender tasks, while all other decoding strategies perform similarly. For formality, none of the methods significantly outperforms the sentence-level system, although the cheating experiment shows that the system could do better if better context information were provided. This might be because segments of 3 sentences are too short to reliably detect whether a setting is formal or informal without access to the true reference.

Conclusion
In this work, we analyze decoding strategies for document-level NMT systems. Using the most popular document-level translation approach, we compare different search strategies found in the literature against methods developed by us. We find that most of the commonly used decoding strategies result in similar performance, both in terms of common automatic metrics and on specific pronoun evaluation tasks. We therefore conclude that it is important to include the context information during decoding, but the exact way in which this is done is less important. Also, we find that document-level systems could actually profit from higher-quality context information in situations where this context is most relevant for translation.

Limitations

In this work, we only consider the most popular document-level NMT approach, which uses the vanilla transformer architecture and training criterion. Other approaches exist, which might exhibit a different behavior in decoding. Two out of the three document-level translation tasks we use in this work are low-resource, with less than 500k sentence pairs as training data. We chose these tasks due to computational limitations and to be better comparable to other works, but higher-resource scenarios are more realistic for actual applications. We limit the analysis of pronoun translation to the English-German language pair. Also, there are other aspects of document-level NMT, like consistent translation of entities, which we did not consider in our analysis.

A Appendix
For the NEWS En→De task, the parallel training data (around 300k sentence pairs, news domain) comes from the NewsCommentaryV14 corpus. As validation/test set, we use the newstest2015/newstest2018 test sets from the WMT news translation tasks (Farhad et al., 2021). For the TED En→It task, the parallel training data (around 200k sentence pairs, scientific-talks domain) comes from the IWSLT17 Multilingual Task (Cettolo et al., 2017). As validation set, we use the concatenation of IWSLT17.TED.dev2010 and IWSLT17.TED.tst2010, and as test set, we use IWSLT17.TED.tst2017.mltlng. For the OS En→De task, the parallel training data (around 22.5M sentence pairs, subtitle domain) comes from the OpenSubtitlesV2018 corpus (Lison et al., 2018). We use the same train/validation/test splits as Huo et al. (2020) and additionally remove all segments that are used in the ContraPro test suite (Müller et al., 2018) from the training data. The data statistics for all tasks can be found in Table 5. Since the original release of ContraPro only provides left-side context, we extract the right-side context ourselves from OpenSubtitlesV2018 based on the meta-information of the segments.
We tokenize the data using byte-pair encoding (Sennrich et al., 2016; Kudo, 2018) with 15k joint merge operations (32k for OS En→De). The models are implemented using the fairseq toolkit (Ott et al., 2019), following the transformer base architecture (Vaswani et al., 2017) with dropout 0.3 and label smoothing 0.2 for NEWS En→De and TED En→It, and dropout 0.1 and label smoothing 0.1 for OS En→De. This results in models with ca. 51M parameters for NEWS and TED and ca. 60M parameters for OS, for both the sentence-level and the document-level systems. All systems are trained until the validation perplexity no longer improves, and the best checkpoint is selected using validation perplexity as well. Training took around 24h for NEWS and TED and around 96h for OS on a single NVIDIA GeForce RTX 2080 Ti graphics card. Due to computational limitations, we report results only for a single run. For the generation of segments (see Section 3), we use beam search on the token level with beam size 12 and length normalization. To calculate BLEU (Papineni et al., 2002) and TER (Snover et al., 2006), we use SacreBLEU (Post, 2018).

Table 2 :
Perplexity values on the test set for different search strategies.

Table 5 :
Data statistics for the different document-level translation tasks.