Context-aware Neural Machine Translation with Mini-batch Embedding

It is crucial to provide inter-sentence context to Neural Machine Translation (NMT) models for higher-quality translation. With the aim of using a simple approach to incorporate inter-sentence information, we propose mini-batch embedding (MBE) as a way to represent the features of sentences in a mini-batch. We construct a mini-batch by choosing sentences from the same document, and thus the MBE is expected to carry contextual information across sentences. We incorporate MBE in an NMT model, and our experiments show that the proposed method consistently outperforms strong baselines in translation quality and adjusts writing style and terminology to fit the document's context.

Most of the previous studies have considered only a few previous context sentences. Several methods, such as the cache-based network, consider long-range context but heavily modify the standard NMT models and require additional training/decoding steps. Our goal is to make a simple but effective context-aware NMT model, which does not require heavy modification to standard NMT models and can handle a wider inter-sentence context. To this end, we propose a method to create an embedding that represents the contextual information of a document. To create this embedding, we focused on the mini-batch, which is commonly used in NMT training and decoding for efficient GPU computation. We modified the mini-batch creation algorithm to choose sentences from a single document and created an embedding that represents the features of the mini-batch. We call this embedding mini-batch embedding (MBE) and incorporate it in the NMT model to exploit contextual information across the sentences in the mini-batch.
Our main contributions can be summarized as follows: (i) We introduce mini-batch embedding to represent the features of sentences in a mini-batch. (ii) We incorporate mini-batch embedding in NMT to achieve simple context-aware translation and find that our approach improves translation performance by up to 1.9 BLEU points.

Neural Machine Translation
The current NMT model $f(\cdot)$ generates a sequence of target sentence tokens $y = (y_1, \ldots, y_t)$ given a sequence of source sentence tokens $x = (x_1, \ldots, x_s)$: $y = f(x; \theta)$, where $\theta$ is a set of model parameters and $s$ and $t$ are the numbers of source and target sentence tokens. The model parameters are trained by minimizing the loss function

$$\mathcal{L}_{\mathrm{NMT}}(\theta) = - \sum_{(x, y) \in D} \log p(y \mid x; \theta), \quad (1)$$

where $D$ is a set of bilingual sentence pairs. Since the model only uses a single sentence as its input, it does not consider the inter-sentence context.
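As a minimal illustration of this sentence-level objective, the negative log-likelihood can be computed in PyTorch roughly as follows. This is a sketch, not the paper's implementation; the model interface and the `pad_id` value are assumptions.

```python
import torch.nn.functional as F

def sentence_level_nll(model, src_tokens, tgt_in, tgt_out, pad_id=1):
    # model is assumed to map (src_tokens, tgt_in) to logits of shape
    # (batch, tgt_len, vocab); pad_id marks padding positions in tgt_out.
    logits = model(src_tokens, tgt_in)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * tgt_len, vocab)
        tgt_out.reshape(-1),                  # (batch * tgt_len,)
        ignore_index=pad_id,
    )
```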

Context-aware NMT with Mini-batch Embedding
To exploit the inter-sentence context in NMT with a simple modification, we propose mini-batch embedding (MBE) to represent the features of sentences in the mini-batch. Figure 1 shows an overview of how we create mini-batch embedding and incorporate it in the NMT model.

Mini-batch Embedding
Each element $(x_i, y_i)$ of a mini-batch $B = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ is a pair of source/target sentences. Normally, we randomly select these pairs from all of the training data to create a mini-batch $B$. However, for our method, we choose sentence pairs from the same document to create a mini-batch. Let $g_{\mathrm{enc}}(\cdot)$ be a single Transformer encoder layer. We first compute sentence-wise contextualized embeddings $E_i = (e_{i,1}, \ldots, e_{i,s_i})$ as $E_i = g_{\mathrm{enc}}(x_i; \phi)$, where $s_i$ is the number of tokens in $x_i$ and $\phi$ are the model parameters. MBE $z \in \mathbb{R}^{e}$ is computed as

$$v_i = \frac{1}{s_i} \sum_{j=1}^{s_i} e_{i,j}, \qquad z = \frac{1}{n} \sum_{i=1}^{n} v_i,$$

where $e$ is the hidden dimension of the NMT model. We use mean pooling to make both the sentence embeddings $v_i$ and the MBE $z$. By adopting this procedure, we expect MBE $z$ to capture inter-sentence context features, which is desirable for context-aware NMT. Note that we ignore the order of sentences in a document. This is a beneficial trait because the method is also applicable to corpora that have document boundaries but lack in-document sentence order, such as ParaCrawl (Esplà et al., 2019).
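A minimal PyTorch sketch of how $z$ could be computed is shown below. The class name, the shared embedding table, and the padding handling are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MiniBatchEmbedder(nn.Module):
    """Computes a mini-batch embedding z by two levels of mean pooling (a sketch)."""

    def __init__(self, d_model=1024, n_heads=16, d_ff=4096, dropout=0.3, vocab=32000, pad_id=1):
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(vocab, d_model, padding_idx=pad_id)  # assumed vocabulary/pad id
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, d_ff, dropout, batch_first=True
        )

    def forward(self, src_tokens):
        # src_tokens: (n, max_len) token ids of the n source sentences in the mini-batch
        pad_mask = src_tokens.eq(self.pad_id)                              # (n, max_len)
        E = self.encoder_layer(self.embed(src_tokens),
                               src_key_padding_mask=pad_mask)             # (n, max_len, e)
        # mean pooling over non-pad tokens -> sentence vectors v_i
        keep = (~pad_mask).unsqueeze(-1).float()
        v = (E * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)        # (n, e)
        # mean pooling over sentences -> mini-batch embedding z
        return v.mean(dim=0)                                              # (e,)
```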

Learning NMT with Mini-batch Embedding
To use inter-sentence information, we modify the NMT model by adding MBE to its input:

$$y = f(x, z; \theta, \phi).$$

We concatenate MBE to the input word embeddings, and the model uses MBE as the first input token (Fig. 1). The encoder/decoder thus takes $s + 1$ and $t + 1$ embeddings, respectively. The Transformer encoder layer for MBE is jointly trained with the NMT model by modifying the loss function in Eq. (1) to

$$\mathcal{L}_{\mathrm{NMT}}(\theta, \phi) = - \sum_{B \in D'} \sum_{(x, y) \in B} \log p(y \mid x, z_B; \theta, \phi),$$

where $D'$ is a set of mini-batches created from $D$ and $z_B$ is the MBE of mini-batch $B$.
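One way to realize this prepending in PyTorch is sketched below; the tensor shapes and the note about attention masks are our assumptions.

```python
import torch

def prepend_mbe(token_embeddings, z):
    # token_embeddings: (batch, seq_len, e) embedded source (or target) tokens
    # z: (e,) mini-batch embedding shared by all sentences in the mini-batch
    batch = token_embeddings.size(0)
    z_tok = z.view(1, 1, -1).expand(batch, 1, -1)        # one extra "token" per sentence
    # Note: any padding/attention masks must also be extended by one position.
    return torch.cat([z_tok, token_embeddings], dim=1)   # (batch, seq_len + 1, e)
```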

Mini-batch Embedding Gate
The MBE may degrade the translation performance when the NMT model does not need any context information to translate the mini-batch or the MBE fails to contain important information for translation. To deal with such cases, we aim to make the model estimate how important MBE is for each mini-batch. Thus we added a mini-batch embedding gate to determine MBE's importance.
In this setting, we prepared two types of mini-batches for training: (i) sentences from the same document and (ii) sentences from different documents. We then trained a binary classifier that predicts whether the sentences in the mini-batch are selected from the same document:

$$P(d \mid z) = \mathrm{softmax}(W z),$$

where $W \in \mathbb{R}^{2 \times e}$ is a parameter matrix and $d$ is a binary value that takes 1 if the sentences in the mini-batch are selected from the same document.
To train the classifier, we minimize the loss function

$$\mathcal{L}_{\mathrm{cls}}(\psi) = - \sum_{B \in D'} \log P(d_B \mid z_B; \psi),$$

where $\psi$ is a set of parameters for the classifier. For training, we mix the two types of mini-batches at the same ratio.
Concretely, we jointly minimize the NMT and classifier loss functions:

$$\mathcal{L}(\theta, \phi, \psi) = \mathcal{L}_{\mathrm{NMT}}(\theta, \phi) + \lambda \mathcal{L}_{\mathrm{cls}}(\psi),$$

where $\lambda$ is a hyperparameter used to control the weight of the classifier loss. We use the value predicted by the classifier as a gate. Our new weighted MBE is $\tilde{z} = \alpha z$, where $\alpha = P(d = 1 \mid z)$, and we replace $z$ with $\tilde{z}$ in the encoder/decoder inputs.

Figure 1: Overview of context-aware NMT with mini-batch embedding. $x_i$ is a sequence of source tokens, $B$ is a mini-batch that has $n$ sentences, $E_i$ is the sentence-wise contextualized embeddings computed by a Transformer encoder, $v_i$ is a sentence vector, and $z$ is the mini-batch embedding. We pad short sentences with a special <pad> token to adjust their length to the longest sentence in the mini-batch.
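A minimal PyTorch sketch of the gate and the joint objective, under our reading of the equations above, is given below; the class name and the choice of class index 1 for "same document" are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MBEGate(nn.Module):
    """Binary classifier predicting whether a mini-batch comes from one document (a sketch)."""

    def __init__(self, d_model=1024):
        super().__init__()
        self.W = nn.Linear(d_model, 2, bias=False)   # W in R^{2 x e}

    def forward(self, z, d=None):
        logits = self.W(z)                           # (2,)
        alpha = F.softmax(logits, dim=-1)[1]         # alpha = P(d = 1 | z)
        z_tilde = alpha * z                          # gated mini-batch embedding
        cls_loss = None
        if d is not None:                            # d = 1 if sentences share a document
            cls_loss = F.cross_entropy(
                logits.unsqueeze(0),
                torch.tensor([d], device=z.device),
            )
        return z_tilde, cls_loss

# Joint objective with lambda = 1.0, as used in the paper's experiments:
# total_loss = nmt_loss + 1.0 * cls_loss
```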

Compared Models
We used four settings as baselines: Baseline 6 Enc-Layers is the original Transformer NMT model with six encoder-decoder layers. Baseline 7 Enc-Layers resembles Baseline 6 Enc-Layers, but the number of encoder layers was changed to seven. Since our MBE model requires an additional Transformer encoder layer, this model has a comparable number of parameters to the following MBE models. 2-to-1 is the context-aware translation model proposed by Tiedemann and Scherrer (2017) that translates a pair of previous and current source sentences into a target sentence. The two source sentences are concatenated with a special sentence boundary token. This method is known as a strong baseline for context-aware NMT (Bawden et al., 2018; Voita et al., 2018). Other settings are identical to those of Baseline 6 Enc-Layers. DocRepair is another recent context-aware translation model, which uses two-step decoding (Voita et al., 2019). The first step generates a 1-best translation with a sentence-level NMT model given a single sentence. The second step generates a document-level translation given the 1-best translations of four consecutive sentences concatenated with a special token.
We compared our proposed methods under the following settings: MBE Enc resembles Baseline 6 Enc-Layers but uses MBE only in the encoder; MBE Enc/Dec uses MBE in both the encoder and the decoder; and MBE Enc w/o Gate is MBE Enc without the mini-batch embedding gate.

Experimental Settings
Datasets/Evaluation We trained Japanese-English NMT models. As training data, we used the JParaCrawl corpus (Morishita et al., 2020). JParaCrawl was created by crawling the web and aligning parallel sentences, and each sentence-pair has a URL from which the sentences were taken.
In this experiment, we regarded the sentences from the same URL as a document. We used several test sets with document boundaries: (i) scientific paper excerpts (ASPEC (Nakazawa et al., 2016)), (ii) news (news-dev2020 from the WMT20 news translation shared task), and (iii) TED talks (tst2012 from the IWSLT translation shared task (Cettolo et al., 2012)). As a dev set to tune the NMT model, we used the ASPEC dev split. See Section A.1 in the Appendix for corpus statistics and detailed preprocessing steps.
To evaluate the translation performance, we used sacreBLEU (Post, 2018) and report the BLEU scores (Papineni et al., 2002).
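For reference, scoring with sacreBLEU's Python API can be done roughly as follows; the file names are placeholders and not part of the paper's setup.

```python
import sacrebleu

# hyp.txt / ref.txt: one detokenized sentence per line (placeholder file names)
with open("hyp.txt", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("ref.txt", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(bleu.score)
```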

Experimental Results and Analysis
Translation Performance Table 1 summarizes the model performance on several test sets. See Table 3 in the Appendix for the dev set performance.
The results show that the scores of the proposed methods surpass the baseline as well as the stronger baselines that use seven encoder layers or the existing context-aware models. Figure 2 shows an example translation of a sentence from the scientific paper excerpts (ASPEC test set). In this example, the word "mentions" is translated in two ways: the baseline system translated it with a colloquial Japanese expression, whereas the proposed method translated it with an expression that suits usage in scientific papers. This shows that MBE can change the writing style to one that is more appropriate for scientific papers than the baseline's. Figure 3 shows another example, from TED talks (tst2012), of how our model can change the translation of the word "you". Our method translated this word with a friendlier Japanese pronoun than the baseline output. In this document, "he" is a friendly old man, and thus the MBE output is more appropriate for this context.

Note that we sorted the sentences in a document by their length when splitting the document into several mini-batches to maintain training efficiency. Since the method focuses more on writing style and wording, we do not keep the original order of the sentences. In this case, a mini-batch can be smaller than 3,000 tokens, since we did not want to mix sentences from different documents.
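The document-constrained mini-batch construction described in the note above can be sketched as follows; the function and argument names are ours, and the exact token-counting rule is an assumption.

```python
def make_document_minibatches(documents, max_tokens=3000):
    """documents: list of documents, each a list of (src_tokens, tgt_tokens) pairs."""
    batches = []
    for doc in documents:
        # sort within the document by source length to keep batches efficient
        doc = sorted(doc, key=lambda pair: len(pair[0]))
        batch, n_tokens = [], 0
        for src, tgt in doc:
            cost = max(len(src), len(tgt))
            if batch and n_tokens + cost > max_tokens:
                batches.append(batch)
                batch, n_tokens = [], 0
            batch.append((src, tgt))
            n_tokens += cost
        if batch:  # a batch never spans two documents, so it may be smaller than max_tokens
            batches.append(batch)
    return batches
```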

Translation Examples
These examples show that our method improved the writing style to fit the context and chose the appropriate word for the context. This indicates that MBE helped the NMT model by providing context information across the mini-batch.
Effect of Decoding Batch-size In the previous section, we discussed the translation performance given a document, which means that the sentences of the entire document are in a mini-batch. However, in practice, we sometimes have to translate only a part of a document. To check the robustness of the model in such situations, we decoded the test set while limiting the number of sentences in a mini-batch. Figure 4 shows the experimental results. The baseline model scores are identical to those in Table 1, since the model is unaffected by mini-batch size. Our MBE models achieve better performance when given a larger context, and they reach comparable or better scores than the baseline model even when given a single sentence or a smaller context. However, the model without the MBE gate (MBE Enc w/o Gate) showed a drastic drop in performance when translating a single sentence. This shows that the gate properly weighs the importance of MBE and improves performance.

Related Work
Context-aware NMT Tiedemann and Scherrer (2017) proposed a 2-to-1 (or 2-to-2) method that concatenates two source sentences and generates one (or two) target sentences. This is a simple model, but it only considers one previous sentence, while our method can make use of larger contexts. Junczys-Dowmunt (2019) extended the 2-to-2 method to document-to-document translation by concatenating all sentences in a document. Although they showed that the method is effective, it requires a heavy computational cost since the NMT model has to handle very long input and output sequences. Voita et al. (2019) proposed a method called DocRepair, one of the most recent context-aware NMT methods, which employs two decoding steps. It first translates each sentence with a sentence-level NMT model, and then the concatenated output is fed to a document-level model that outputs a document-level translation. Although this is a promising method, it requires training three sequence-to-sequence models for a single translation direction and needs two decoding steps, which slows down the translation. Our method has an advantage in that it only trains a single model and uses single-step decoding, which requires only a small computational cost.

Figure 2: Example translation of a sentence from the ASPEC test set. Source: "The paper mentions the reliability assurance test and application technologies." The figure compares the reference with the Baseline 6 Enc-Layers and MBE Enc/Dec outputs in Japanese.

NMT with Tags
We used MBE as the first input of the encoder/decoder. Our approach is similar to work that uses special tags to control or provide additional information to NMT (Johnson et al., 2016; Takeno et al., 2017; Caswell et al., 2019). Johnson et al. (2016) added tags to a source sentence to indicate the target language in multilingual NMT models. Takeno et al. (2017) proposed a method that controls the target length or the domain by adding a tag to the decoder inputs. Caswell et al. (2019) used a tag to mark synthetic corpora created by back-translation (Sennrich et al., 2016). Our work differs from these previous studies in that the tag (MBE) is automatically generated from the sentences in a mini-batch and a gate controls its importance.

Conclusion
We proposed mini-batch embedding (MBE), a simple but effective method to represent contextual information across the sentences of a document. We incorporated MBE in the NMT model, which enabled it to outperform competitive baselines. We found that our NMT model could choose words and a writing style appropriate for the document context. An analysis showed that our model's performance improves with a larger context, while it still achieves comparable or even better performance than the baseline when translating a single sentence. Our future work includes applying MBE to other applications and improving the method of generating embeddings from a mini-batch.

A Detailed Experimental Settings
In this section, we describe more detailed experimental settings.

A.1 Data/Evaluation
The numbers of sentences and documents contained in the train/dev/test sets are shown in Table 2. We tokenized the sentences into subwords with sentencepiece (Kudo, 2018; Kudo and Richardson, 2018) and set the vocabulary size to 32k for each language. For the training set, we removed sentences whose length exceeded 250 subword tokens. For the DocRepair method, we used the JParaCrawl corpus as both the monolingual and bilingual document-level training data.
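The subword preprocessing described above can be reproduced roughly as follows; the file names are placeholders, and only standard sentencepiece options are used.

```python
import sentencepiece as spm

# train a 32k subword model per language (placeholder file names)
spm.SentencePieceTrainer.train(
    input="train.ja", model_prefix="spm.ja", vocab_size=32000
)
sp = spm.SentencePieceProcessor(model_file="spm.ja.model")

# drop training sentences longer than 250 subword tokens
with open("train.ja", encoding="utf-8") as f:
    kept = [line for line in f if len(sp.encode(line.strip())) <= 250]
```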

A.2 Model Configurations
We used the Transformer model as the NMT model (Vaswani et al., 2017). Our hyperparameters are based on the "big" settings defined by Vaswani et al. (2017): six encoder/decoder layers, 16 attention heads, and 1,024 dimensions for all of the hidden states except the feed-forward network hidden states, which have 4,096 dimensions. We used dropout with a probability of 0.3 (Srivastava et al., 2014). As an optimizer, we used Adam with $\alpha = 0.001$, $\beta_1 = 0.9$, and $\beta_2 = 0.98$ (Kingma and Ba, 2015). An inverse square-root decay learning rate schedule was used with a linear warm-up of 4,000 steps (Vaswani et al., 2017). We clipped the gradients so that their norm did not exceed 1.0. For the MBE experiments, we set the classifier loss weight $\lambda$ to 1.0 and set the per-GPU batch size to 3,000 tokens. Since large-batch training can reduce training time (Ott et al., 2018), we accumulated about 280k tokens per update. Based on dev set perplexity, we trained the model for 24,000 iterations. We saved the model every 200 iterations and averaged the last eight model checkpoints for decoding. We normalized the candidate translation scores by their length and carried out a beam search with a beam size of six.

Our implementation is based on fairseq (Ott et al., 2019). We used mixed-precision training (Micikevicius et al., 2018) to reduce memory consumption and training time. All experiments were run on eight NVIDIA Tesla V100 GPUs with 32 GB of memory. Since we did not conduct a hyperparameter search, almost all of the settings were borrowed from Morishita et al. (2020).

DocRepair requires the training of three sequence-to-sequence models: (1) an NMT model that translates language X to Y; (2) an NMT model that translates in the reverse direction to create round-trip translations; and (3) a sequence-to-sequence model that converts 1-best translations into a document-level translation. We used the "Baseline 7 Enc-Layers" models for both (1) and (2), and newly trained a Transformer model for (3).

Table 3 shows the number of parameters for each model, the training speed, and the BLEU scores on the dev set. The scores show the same tendency as on the test set (Table 1).
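For clarity, the learning-rate schedule described above (linear warm-up followed by inverse square-root decay) corresponds to roughly the following function; this is a sketch of the standard schedule, not the paper's code.

```python
def learning_rate(step, base_lr=1e-3, warmup_steps=4000):
    # linear warm-up to base_lr, then inverse square-root decay (Vaswani et al., 2017)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (warmup_steps / step) ** 0.5
```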

B Additional Experimental Results
The DocRepair method requires two translation models (the English-to-Japanese model and the sentence-to-document repair model), and thus its number of model parameters is larger than that of the other models. Although it also requires a Japanese-to-English translation model to create round-trip translation data for training, those model parameters are not included in the table.
Since our MBE implementation was still in the experimental phase, the training speed was slower than that of the baselines, which were fully optimized by fairseq developers. We can further improve our implementation for faster computation, but we leave this for future work.