Capturing document context inside sentence-level neural machine translation models with self-training

Neural machine translation (NMT) has arguably achieved human-level parity when trained and evaluated at the sentence level. Document-level neural machine translation has received less attention and lags behind its sentence-level counterpart. The majority of the proposed document-level approaches investigate ways of conditioning the model on several source or target sentences to capture document context. These approaches require training a specialized NMT model from scratch on parallel document-level corpora. We propose an approach that does not require training a specialized model on parallel document-level corpora and is applied to a trained sentence-level NMT model at decoding time. We process the document from left to right multiple times and self-train the sentence-level model on pairs of source sentences and generated translations. Our approach reinforces the choices made by the model, thus making it more likely that the same choices will be made in other sentences in the document. We evaluate our approach on three document-level datasets: NIST Chinese-English, WMT19 Chinese-English and OpenSubtitles English-Russian. We demonstrate that our approach achieves a higher BLEU score and higher human preference than the baseline. Qualitative analysis of our approach shows that the choices made by the model are consistent across the document.


Introduction
Neural machine translation (NMT) (Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013; Bahdanau et al., 2014) has achieved great success, arguably reaching the level of human parity (Hassan et al., 2018) on Chinese-to-English news translation, which has led to its popularity and adoption in academia and industry. These models are predominantly trained and evaluated on sentence-level parallel corpora. Document-level machine translation, which requires capturing the context to accurately translate sentences, has recently been gaining popularity and was selected as one of the main tasks in the premier machine translation conference WMT19 (Barrault et al., 2019).
A straightforward solution of translating documents by translating sentences in isolation leads to inconsistent but syntactically valid text. The inconsistency is the result of the model not being able to resolve ambiguity with consistent choices across the document. For example, the recent NMT system that achieved human parity (Hassan et al., 2018) inconsistently used three different names "Twitter Move Car", "WeChat mobile", "WeChat move" when referring to the same entity (Sennrich, 2018).
We propose a way of incorporating context into a trained sentence-level neural machine translation model at decoding time. We process each document monotonically from left to right one sentence at a time and self-train the sentence-level NMT model on its own generated translation. This procedure reinforces choices made by the model and hence increases the chance of making the same choices in the remaining sentences in the document. Our approach does not require training a separate context-conditional model on parallel document-level data and allows us to capture context in documents using a trained sentence-level model.
Algorithm 1: Document-level NMT with self-training at decoding time. Input: document $D = (X_1, \ldots, X_n)$, pretrained sentence-level NMT model $f(\theta)$, learning rate $\alpha$ and decay prior $\lambda$.
Our key contribution is the first document-level neural machine translation approach that does not require training a context-conditional model on document data. We show how to adapt a trained sentence-level neural machine translation model to capture context in the document during decoding. We evaluate our proposed approach and demonstrate improvements, measured by BLEU score and preferences of human annotators, on several document-level machine translation tasks including the NIST Chinese-English, WMT19 Chinese-English and OpenSubtitles English-Russian datasets. We qualitatively analyze the decoded sentences produced using our approach and show that they indeed capture the context.

Proposed Approach
We translate a document $D$ consisting of $n$ source sentences $X_1, X_2, \ldots, X_n$ into the target language, given a well-trained sentence-level neural machine translation model $f_\theta$. The sentence-level model parametrizes a conditional distribution $p(Y \mid X) = \prod_{t=1}^{T} p(y_t \mid Y_{<t}, X)$ of each target word $y_t$ given the preceding words $Y_{<t}$ and the source sentence $X$. Decoding is done by approximately finding $\arg\max_Y p(Y \mid X)$ using greedy decoding or beam search. The model $f$ is typically a recurrent neural network with attention (Bahdanau et al., 2014) or a Transformer model (Vaswani et al., 2017) with parameters $\theta$.
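The factorization and greedy decoding described above can be illustrated with a minimal sketch. The function `step_fn` and the toy distribution `toy_step` are hypothetical stand-ins for a real NMT model's next-token distribution; they are assumptions made for illustration and do not come from the paper.

```python
import math


def greedy_decode(step_fn, source, eos, max_len=50):
    """Approximate argmax_Y p(Y | X) by taking the most probable token at each step.

    step_fn(source, prefix) is a hypothetical callable returning a dict
    {token: p(token | prefix, source)}.
    """
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        dist = step_fn(source, prefix)
        token = max(dist, key=dist.get)
        # Chain rule: log p(Y | X) = sum_t log p(y_t | Y_<t, X)
        log_prob += math.log(dist[token])
        prefix.append(token)
        if token == eos:
            break
    return prefix, log_prob


def toy_step(source, prefix):
    """Toy next-token distribution: 'translate' by upper-casing, then stop."""
    if len(prefix) < len(source):
        return {source[len(prefix)].upper(): 0.9, "<eos>": 0.1}
    return {"<eos>": 1.0}


tokens, score = greedy_decode(toy_step, ["a", "b"], eos="<eos>")
```

Beam search follows the same chain-rule scoring but keeps several prefixes per step instead of one.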

Self-training
We start by translating the first source sentence $X_1$ in the document $D$ into the target sentence $Y_1$. We then self-train the model on the sentence pair $(X_1, Y_1)$, which maximizes the log probabilities of each word in the generated sentence $Y_1$ given the source sentence $X_1$. The self-training procedure runs a fixed number of gradient descent steps with weight decay. Weight decay keeps the updated values of the weights closer to their original values. We repeat the same update process for the remaining sentences in the document. The detailed implementation of the self-training procedure during decoding is shown in Algorithm 1.
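The procedure above can be sketched on a toy model. This is an illustration under stated assumptions, not the implementation from Algorithm 1: the "model" is a hypothetical word-by-word table of logits `W[x][y]` rather than a Transformer, and the exact form of the decay term (a pull toward the original weights `W0`) is an assumption based on the description that weight decay keeps weights close to their original values.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def translate(W, src):
    """Greedy word-by-word 'translation': most probable target id per source id."""
    out = []
    for x in src:
        p = softmax(W[x])
        out.append(max(range(len(p)), key=p.__getitem__))
    return out


def self_train_step(W, W0, src, tgt, lr, decay):
    """One gradient step on -sum_t log p(y_t | x_t), with decay toward W0 (assumed form)."""
    grad = [[0.0] * len(row) for row in W]
    for x, y in zip(src, tgt):
        p = softmax(W[x])
        for j in range(len(p)):
            grad[x][j] += p[j] - (1.0 if j == y else 0.0)
    return [[w - lr * (g + decay * (w - w0)) for w, g, w0 in zip(rw, rg, r0)]
            for rw, rg, r0 in zip(W, grad, W0)]


def translate_document(W0, document, lr=0.5, decay=0.2, steps=2):
    """Translate sentences left to right, self-training on each generated pair."""
    W = [row[:] for row in W0]  # keep the original weights untouched
    translations = []
    for src in document:
        tgt = translate(W, src)
        translations.append(tgt)
        for _ in range(steps):
            W = self_train_step(W, W0, src, tgt, lr, decay)
    return translations, W


W0 = [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0]]
translations, W = translate_document(W0, [[0, 1], [0]])
```

After decoding, the probability of the choice made for source word 0 is higher under `W` than under `W0`, which is the reinforcement effect the section describes.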

Multi-pass self-training
Since the document is processed in left-to-right, monotonic order, our self-training procedure does not incorporate the choices of the model yet to be made on unprocessed sentences. In order to leverage global information from the full document and to further reinforce the choices made by the model across all generated sentences, we propose multi-pass document decoding with self-training. Specifically, we process the document multiple times monotonically from left to right while continuing self-training of the model.
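A minimal sketch of the multi-pass loop follows. The callables `translate` and `update`, the class `ToyModel`, and the choice to return the translations of the final pass are all assumptions made for illustration; the toy scores are invented and only show how a later pass can revise an early choice.

```python
def multipass_decode(document, translate, update, num_passes=2):
    """Process the document left to right num_passes times, self-training after each sentence.

    translate(src) and update(src, tgt) are hypothetical callables wrapping a model.
    Returns the translations produced in the final pass (an assumption).
    """
    translations = []
    for _ in range(num_passes):
        translations = []
        for src in document:
            tgt = translate(src)
            update(src, tgt)  # reinforce the choice just made
            translations.append(tgt)
    return translations


class ToyModel:
    """Toy stand-in: each 'sentence' is a dict of candidate renderings and base scores."""

    def __init__(self):
        self.bonus = {}  # learned preference per candidate, stands in for weight updates

    def translate(self, src):
        return max(src, key=lambda c: src[c] + self.bonus.get(c, 0.0))

    def update(self, src, tgt):
        self.bonus[tgt] = self.bonus.get(tgt, 0.0) + 1.0


doc = [{"du": 0.6, "doyle": 0.4}, {"doyle": 2.0, "du": 0.5}, {"doyle": 1.9, "du": 0.5}]
one = ToyModel()
single_pass = multipass_decode(doc, one.translate, one.update, num_passes=1)
two = ToyModel()
double_pass = multipass_decode(doc, two.translate, two.update, num_passes=2)
```

In this toy setting the single pass yields an inconsistent first sentence, while the second pass revisits it with the preferences accumulated from the later sentences.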

Oracle self-training to upper bound performance
Since generated sentences are likely to contain some errors, our self-training procedure can reinforce those errors and thus potentially hurt the performance of the model on unprocessed sentences in the document. In order to isolate the effect of imperfect translations and estimate the upper bound of performance, we evaluate our self-training procedure with ground-truth translations as targets, which we call oracle self-training. Oracle self-training is similar to the dynamic evaluation approach introduced in language modeling (Mikolov, 2012; Graves, 2013; Krause et al., 2018), where the input text to the language model is the target used to train the neural language model during evaluation. We do not use the oracle in multi-pass self-training since this would make it equivalent to memorizing the correct translation for each sentence in the document and regenerating it again.

Related Work
Although there have been some attempts at tackling document-level neural machine translation (for example, see the proceedings of the Workshop on Discourse in Machine Translation (Popescu-Belis et al., 2019)), it has received considerably less attention than sentence-level neural machine translation.
Prior document-level NMT approaches (Jean et al., 2017; Wang et al., 2017; Kuang et al., 2017; Tiedemann and Scherrer, 2017; Maruf and Haffari, 2018; Agrawal et al., 2018; Zhang et al., 2018; Miculicich et al., 2018) proposed different ways of conditioning NMT models on several source sentences in the document. Perhaps the closest of those document NMT approaches to our work is the approach by Kuang et al. (2017), who train an NMT model with a separate non-parametric cache (Kuhn and Mori, 1990) that incorporates topic information about the document. Recent approaches (Jean et al., 2019; Junczys-Dowmunt, 2019; Voita et al., 2019a) use only partially available parallel document data or monolingual document data. These approaches proposed to fill in missing context in the documents with random or generated sentences. Another line of document-level NMT work (Xiong et al., 2018; Voita et al., 2019b) proposed a two-pass document decoding model inspired by the deliberation network (Xia et al., 2017) in order to incorporate target-side document context.
Recently, Yu et al. (2019) proposed a novel beam search method that incorporates document context inside a noisy channel model (Shannon, 1948; Yu et al., 2017; Yee et al., 2019). Similar to our work, their approach does not require training context-conditional models on parallel document corpora, but it relies on a separate target-to-source NMT model and an unconditional language model to re-rank hypotheses of the source-to-target NMT model.
Closest to our work is the dynamic evaluation approach proposed by Mikolov (2012) and further extended by Graves (2013) and Krause et al. (2018), where a neural language model is trained at evaluation time. However, unlike language modeling, where inputs are ground-truth targets used both during training and evaluation, in machine translation ground-truth translations are not available at decoding time in practical settings. The general idea of storing memories in the weights of the neural network rather than storing memories as copies of neural network activations, which is behind our approach and dynamic evaluation, goes back to work from the 1970s and 1980s on associative memory models (Willshaw et al., 1969; Kohonen, 1972; Anderson and Hinton, 1981; Hopfield, 1982) and to more recent work on fast weights (Ba et al., 2016).
Our work belongs to the broad category of self-training or pseudo-labelling approaches (Scudder, 1965; Lee, 2013) proposed to annotate unlabeled data to train supervised classifiers. Self-training has been successfully applied to NLP tasks such as word-sense disambiguation (Yarowsky, 1995) and parsing (McClosky et al., 2006; Reichart and Rappoport, 2007; Huang and Harper, 2009). Self-training has also been used to label monolingual data to improve the performance of sentence-level statistical and neural machine translation models (Ueffing, 2006; Zhang and Zong, 2016). Recently, He et al. (2019) proposed a noisy version of self-training and showed improvement over classical self-training on machine translation and text summarization tasks. Backtranslation (Sennrich et al., 2016a) is another popular pseudo-labelling technique that utilizes target-side monolingual data to improve the performance of NMT models.

Datasets
We use the NIST Chinese-English (Zh-En), the WMT19 Chinese-English (Zh-En) and the OpenSubtitles English-Russian (En-Ru) datasets in our experiments.
The NIST training set consists of 1.5M sentence pairs from LDC-distributed news. We use the MT06 set as the validation set. We use the MT03, MT04, MT05 and MT08 sets as held-out test sets. The MT06 validation set consists of 1649 sentences with 21 sentences per document on average. MT03, MT04, MT05 and MT08 consist of 919, 1788, 1082 and 1357 sentences with 9, 9, 11 and 13 sentences on average per document, respectively. We follow previous work (Zhang et al., 2018) when preprocessing the NIST dataset. We preprocess the NIST dataset with punctuation normalization, tokenization, and lowercasing. Sentences are encoded using byte-pair encoding (Sennrich et al., 2016b) with source and target vocabularies of roughly 32K tokens. We use the case-insensitive multi-bleu.perl script with 4 reference files to evaluate the model.
The WMT19 dataset includes the UN corpus, CWMT, and news commentary. We filter the training data by removing duplicate sentences and sentences longer than 250 words. The resulting training set consists of 18M sentence pairs. We use newsdev2017 as a validation set and use newstest2017, newstest2018 and newstest2019 as held-out test sets. newsdev2017, newstest2017, newstest2018 and newstest2019 consist of a total of 2002, 2001, 3981 and 2000 sentences with an average of 14, 12, 15 and 12 sentences per document, respectively. We similarly follow previous work (Xia et al., 2019) when preprocessing the dataset. Chinese sentences are preprocessed by segmenting and normalizing punctuation. English sentences are preprocessed by tokenizing and truecasing. We learn a byte-pair encoding (Sennrich et al., 2016b) with source and target vocabularies of roughly 32K tokens. We use sacreBLEU (Post, 2018) for evaluation.
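The filtering step mentioned above can be sketched as follows. The function name `filter_parallel_corpus` is hypothetical, and two details are assumptions not stated in the text: duplicates are detected as exact matches of the (source, target) pair, and length is measured in whitespace-separated tokens on either side, which presumes the Chinese side has already been segmented.

```python
def filter_parallel_corpus(pairs, max_words=250):
    """Drop exact duplicate pairs and pairs where either side exceeds max_words tokens."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # duplicate sentence pair
        seen.add((src, tgt))
        if len(src.split()) > max_words or len(tgt.split()) > max_words:
            continue  # overly long sentence
        kept.append((src, tgt))
    return kept


pairs = [
    ("a b", "c d"),
    ("a b", "c d"),                    # duplicate, removed
    (" ".join(["w"] * 251), "y"),      # 251 tokens, removed
    ("e", "f"),
]
filtered = filter_parallel_corpus(pairs)
```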
The OpenSubtitles English-Russian dataset, consisting of movie and TV subtitles, was prepared by Voita et al. (2019b) and is available at https://github.com/lena-voita/good-translation-wrong-in-context. The training dataset consists of 6M parallel sentence pairs. We use the context-aware sets provided by the authors, consisting of 10000 documents in each of the validation and test sets. Due to the way the dataset is processed, each document contains only 4 sentences. The dataset is preprocessed by tokenizing and lowercasing. We use byte-pair encoding (Sennrich et al., 2016b) to prepare source and target vocabularies of roughly 32K tokens. We use the multi-bleu.perl script for evaluation.

Hyperparameters
We train a Transformer (Vaswani et al., 2017) on all datasets. Following previous work (Zhang et al., 2018; Voita et al., 2019b; Xia et al., 2019), we use the Transformer base configuration (transformer_base) on the NIST Zh-En and the OpenSubtitles En-Ru datasets and the Transformer big configuration (transformer_big) on the WMT19 Zh-En dataset. Transformer base consists of 6 layers, 512 hidden units and 8 attention heads. Transformer big consists of 6 layers, 1024 hidden units and 16 attention heads. We use a dropout rate (Srivastava et al., 2014) of 0.1 and label smoothing to regularize our models. We train our models with the Adam optimizer (Kingma and Ba, 2014) using the same warm-up learning rate schedule as in (Vaswani et al., 2017). During decoding we use beam search with beam size 4 and length penalty 0.6. We additionally train backtranslated models (Sennrich et al., 2016a) on the NIST Zh-En and the OpenSubtitles En-Ru datasets. We use the publicly available English Gigaword dataset (Graff et al., 2003) to create synthetic parallel data for the NIST Zh-En dataset and use synthetic parallel data provided by Voita et al. (2019a) for the OpenSubtitles En-Ru dataset. When training backtranslated models, we oversample the original parallel data to make the ratio of synthetic data to original data equal to 1 (Edunov et al., 2018). We tune the number of update steps, learning rate, decay rate, and number of passes over the document of our self-training approach with a random search on a validation set. For the random search we use the range $(5 \times 10^{-5}, 5 \times 10^{-1})$ for the learning rate, the range $(0.001, 0.999)$ for the decay rate, $\{2, 4, 8\}$ for the number of update steps and $\{2, 4\}$ for the number of passes over the document. We found that the best performing models required a small number of update steps (either 2 or 4) with a relatively large learning rate (approximately 0.005 to 0.01) and a small decay rate (approximately 0.2 to 0.5). We use the Tensor2Tensor library (Vaswani et al., 2018) to train baseline models and to implement our method.
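The random search over self-training hyperparameters can be sketched as below, using the ranges given in the text. Several details are assumptions: the learning rate is sampled log-uniformly (the text gives only the range), the decay rate is sampled uniformly, and `evaluate` is a hypothetical callable that would return validation BLEU for a configuration. The stand-in objective at the end is purely illustrative.

```python
import math
import random


def sample_config(rng):
    """Draw one self-training configuration from the search ranges."""
    log_lo, log_hi = math.log10(5e-5), math.log10(5e-1)
    return {
        "learning_rate": 10 ** rng.uniform(log_lo, log_hi),  # log-uniform (assumption)
        "decay_rate": rng.uniform(0.001, 0.999),             # uniform (assumption)
        "update_steps": rng.choice([2, 4, 8]),
        "num_passes": rng.choice([2, 4]),
    }


def random_search(evaluate, num_trials, seed=0):
    """Return (best_score, best_config) over num_trials random configurations."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        config = sample_config(rng)
        score = evaluate(config)  # hypothetical: validation BLEU with this configuration
        if best is None or score > best[0]:
            best = (score, config)
    return best


# Stand-in objective for illustration only; a real run would decode the validation set.
best_score, best_config = random_search(
    lambda c: -abs(c["learning_rate"] - 0.01), num_trials=20
)
```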

Results
We present translation quality results measured by BLEU on the NIST dataset on [...] (2017) and Kuang et al. (2017) and is comparable to the document-level model proposed by Zhang et al. (2018). Backtranslation further improves the results of our sentence-level model, leading to a higher BLEU score compared to the Document Transformer (Zhang et al., 2018).
In Table 2, we show a detailed study of the effects of multi-pass self-training and oracle self-training on BLEU scores on the NIST evaluation sets. First, multiple decoding passes over the document give an additional average improvement of 0.25-0.45 BLEU points compared to a single decoding pass over the document. Using the oracle self-training procedure gives an average of 0.86 and 1.63 BLEU improvement over our non-backtranslated and backtranslated sentence-level baseline models, respectively. Compared to using translations generated by the model, oracle self-training gives an improvement of 0.3 and 0.7 BLEU points for non-backtranslated and backtranslated models, respectively.
The results on the WMT19 evaluation sets are presented in Table 3. Compared to the NIST dataset, our self-training procedure shows an improvement of 0.1 BLEU over a sentence-level baseline model. Oracle self-training outperforms sentence-level baselines by a significant margin of 2.5 BLEU. We hypothesize that such a large gap between the performance of oracle and non-oracle self-training is due to the more challenging nature of the WMT dataset, which is reflected in the worse performance of the sentence-level baseline on WMT compared to NIST. We investigate this claim by measuring the relationship between the BLEU achieved by self-training and the relative quality of the sentence-level model on the NIST dataset. Figure 1 shows that the BLEU difference between self-training and sentence-level models monotonically increases as the quality of the sentence-level model gets better on the NIST dataset. This implies that we can expect a larger improvement from applying self-training as we improve the sentence-level model on the WMT dataset. Preliminary experiments on training backtranslated models did not improve results on the WMT dataset. We leave further investigation of ways to improve the sentence-level model on the WMT dataset for future work.
Table 5: Human evaluation results on the NIST Zh-En and the OpenSubtitles En-Ru datasets. "Total" denotes the total number of annotations collected from humans. "Self-train" denotes the number of times evaluators preferred documents by the self-training approach. "Baseline" denotes the number of times evaluators preferred documents by the sentence-level baseline.
The results on the OpenSubtitles evaluation sets are in Table 4. Our self-training and oracle self-training approaches give performance improvements of 0.1 and 0.3 BLEU, respectively. We hypothesize that the small improvement of self-training is due to the relatively small number of sentences in the documents in the OpenSubtitles dataset. We validate this claim by varying the number of sentences in the document used for self-training on the NIST dataset. Figure 2 shows that the self-training approach achieves a higher BLEU improvement as we increase the number of sentences in documents used for self-training.

Human Evaluation
We conduct a human evaluation study on the NIST Zh-En and the OpenSubtitles En-Ru datasets. For both datasets we sample 50 documents from the test set where translated documents generated by the self-training approach are not exact copies of the translated documents generated by the sentence-level baseline model. For the NIST Zh-En dataset we present reference documents, translated documents generated by the sentence-level baseline, and translated documents generated by the self-training approach to 4 native English speakers. For the OpenSubtitles En-Ru dataset we follow a similar setup, where we present reference documents, translated documents generated by the sentence-level baseline, and translated documents generated by the self-training approach to 4 native Russian speakers. All translated documents are presented in random order with no indication of which approach was used to generate them. We highlight the differences between translated documents when presenting them to human evaluators. The human evaluators are asked to pick one of two translations as their preferred option for each document. We ask the human evaluators to consider fluency, idiomaticity and correctness of the translation relative to the reference when entering their preferred choices.
We collect a total of 200 annotations for 50 documents from all 4 human evaluators and show results in Table 5. For both datasets, human evaluators prefer translated documents generated by the self-training approach to translated documents generated by the sentence-level model. For NIST Zh-En, 122 out of 200 annotations indicate a preference towards translations generated by the self-training approach. For OpenSubtitles En-Ru, 118 out of 200 annotations similarly show a preference towards translations generated by our self-training approach. This is a statistically significant preference (p < 0.05) according to a two-sided binomial test. When aggregated for each document by majority vote, for NIST Zh-En, translations generated by the self-training approach are considered better in 25 documents, worse in 12 documents, and the same in 13 documents. For OpenSubtitles En-Ru, translations generated by the self-training approach are considered better in 23 documents, worse in 15 documents, and the same in 12 documents. The agreement between annotators for NIST Zh-En and OpenSubtitles En-Ru is κ = 0.293 and κ = 0.320, respectively, according to Fleiss' kappa (Fleiss, 1971). For both datasets, the inter-annotator agreement rate is considered fair.

Ref: we are actively seeking a local partner to set up a joint fund company , " duchateau said . duchateau said that the chinese market still has ample potentials .
Baseline: we are actively looking for a local partner to establish a joint venture fund company , " doyle said . du said that there is still a lot of room for the chinese market .
Ours: we are actively looking for a local partner to establish a joint venture fund company , " doyle said . doyle said that there is still great room for the chinese market .

Ref: in may this year , 13 pilots with china eastern airlines wuhan company in succession handed in their resignations , which were rejected by the company . soon afterwards , the pilots applied one after another at the beginning of june to the labor dispute arbitration commission of hubei province for labor arbitration , requesting for a ruling that their labor relationship with china eastern airlines wuhan company be terminated .
Baseline: in may this year , 13 pilots of china eastern 's wuhan company submitted their resignations one after another , but the company refused . the pilot then applied for labor arbitration with the hubei province labor dispute arbitration committee in early june , requesting the ruling to terminate the labor relationship with the wuhan company of china eastern airlines .
Ours: in may this year , 13 pilots of china eastern 's wuhan company submitted their resignations one after another , but the company refused . subsequently , in early june , the pilots successively applied for labor arbitration with the hubei province labor dispute arbitration committee , requesting that the labor relationship with china eastern airlines be terminated .
Table 6: Four reference documents together with translations generated by the baseline sentence-level model and by our self-training approach. The first two documents are taken from the OpenSubtitles English-Russian dataset and the second two documents are taken from the NIST Chinese-English dataset.
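The two-sided binomial test on the annotation counts reported in the human evaluation (122 and 118 preferences out of 200) can be checked with a short sketch. The function `binomial_two_sided_p` is a hypothetical helper; it assumes a null preference probability of 0.5 and uses the common convention of summing the probabilities of all outcomes that are no more likely than the observed one, which may differ from the exact variant the authors used.

```python
from math import comb


def binomial_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial test p-value under null success probability p."""
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k) for k in range(trials + 1)]
    observed = pmf[successes]
    # Sum the probability of every outcome at most as likely as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))


p_nist = binomial_two_sided_p(122, 200)            # NIST Zh-En annotations
p_opensubtitles = binomial_two_sided_p(118, 200)   # OpenSubtitles En-Ru annotations
```

Both values fall below the 0.05 threshold stated in the text, under the assumption that the 200 annotations are treated as independent trials.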

Qualitative Results
In Table 6, we show four reference document pairs together with translated documents generated by the baseline sentence-level model and by our self-training approach. We emphasize the underlined words in all documents.
In the first two examples we emphasize the gender of the person marked on verbs and adjectives in translated Russian sentences. In the first example, the baseline sentence-level model inconsistently produces different gender markings on the underlined verb сказал (masculine told) and underlined adjective сильной (feminine strong). The self-training approach correctly generates a translation with consistent male gender markings on both the underlined verb сказал and the underlined adjective сильным. Similarly, in the second example, the baseline model inconsistently produces different gender markings on the underlined verbs приглашена (feminine invited) and поругался (masculine fought). Self-training consistently generates female gender markings on both the underlined verbs приглашена (feminine invited) and поссорилась (feminine fought).
In the third example, we emphasize the underlined named entity in reference and generated translations. The baseline sentence-level model inconsistently generates the names "doyle" and "du" when referring to the same entity across two sentences in the same document. The self-training approach consistently uses the name "doyle" across two sentences when referring to the same entity. In the fourth example, we emphasize the plurality of the underlined words. The baseline model inconsistently generates both singular and plural forms when referring to the same noun in consecutive sentences. Self-training generates the noun "pilots" in the correct plural form in both sentences.

Conclusion
In this paper, we propose a way of incorporating the document context inside a trained sentence-level neural machine translation model using self-training. We process documents from left to right multiple times and self-train the sentence-level NMT model on the pair of source sentence and generated target sentence. This reinforces the choices made by the NMT model, thus making it more likely that the choices will be repeated in the rest of the document.
We demonstrate the feasibility of our approach on three machine translation datasets: NIST Zh-En, WMT19 Zh-En and OpenSubtitles En-Ru. We show that self-training improves sentence-level baselines by up to 0.93 BLEU. We also conduct a human evaluation study and show a strong preference of the annotators for the translated documents generated by our self-training approach. Our analysis demonstrates that self-training achieves a higher improvement on longer documents and when using better sentence-level models.
In this work, we only use self-training on source-to-target NMT models in order to capture the target-side document context. One extension could investigate the application of self-training on both target-to-source and source-to-target sentence-level models to incorporate both source and target document context into generated translations. Overall, we hope that our work will motivate novel approaches to making trained sentence-level models better suited for document translation at decoding time.

Figure 1: Relationship between relative performance of the sentence-level model and BLEU difference of self-training on the NIST dataset.

Figure 2: Relationship between number of sentences and BLEU improvement of self-training on the NIST dataset.

Table 1: The first four rows show the performance of the previous document-level NMT models from (Wang et al., 2017; Kuang et al., 2017; Zhang et al., 2018). The last four rows show the performance of our baseline sentence-level Transformer models with and without self-training. BT: backtranslation.

Table 2: Ablation study on NIST evaluation sets measuring the effect of multiple decoding passes and of the oracle on the self-training procedure. BT: backtranslation. ST: self-training.

Table 3: Results on WMT19 Chinese-English evaluation sets. The first row shows the performance of the Transformer Big model by (Xia et al., 2019). All models were trained without additional monolingual data and without pretraining. ST: self-training.

Table 4: Results on OpenSubtitles English-Russian evaluation sets. ST: self-training.