Encouraging Lexical Translation Consistency for Document-Level Neural Machine Translation

Recently a number of approaches have been proposed to improve translation performance for document-level neural machine translation (NMT). However, few focus on lexical translation consistency. In this paper we apply "one translation per discourse" in NMT, aiming to encourage lexical translation consistency for document-level NMT. This is done by first obtaining a word link for each source word in a document, which tells the positions where the source word appears. Then we encourage the translations of the words within a link to be consistent in two ways: on the one hand, when encoding sentences within a document we properly share the context information of those words; on the other hand, we propose an auxiliary loss function to constrain their translations to be consistent. Experimental results on Chinese↔English and English→French translation tasks show that our approach not only achieves state-of-the-art BLEU scores, but also greatly improves lexical consistency in translation.


Introduction
Unlike sentence-level neural machine translation (NMT), document-level NMT needs to not only model intra-sentence dependencies, but also consider a wide variety of inter-sentence discourse phenomena, such as coreference, lexical cohesion, semantic coherence, and discourse relations. Motivated by the success of "one translation per discourse" in statistical machine translation (SMT) (Merkel, 1996; Carpuat, 2009; Türe et al., 2012; Guillou, 2013; Al Khotaba and Al Tarawneh, 2015), in this paper our goal is to encourage lexical translation consistency for document-level NMT. Figure 1 shows an example of an input document and its output translated by a state-of-the-art sentence-level NMT system. The technical term 房地产业/fang_di_chan_ye, occurring four times within the document, surprisingly obtains different translations, while in the reference (human translation) it is translated consistently. Such inconsistent translations tend to confuse readers in some cases.

* Corresponding author: Junhui Li.
Recent years have witnessed an increasing interest in document-level NMT, but most previous studies explore various context-aware models for better incorporating document-level context to improve translation performance without handling a specific discourse phenomenon (Maruf and Haffari 2018; Miculicich et al. 2018; Maruf et al. 2019, to name a few). As a way to encourage lexical translation consistency, Kuang et al. (2017) and Tu et al. (2018) cache recently translated words and/or their translations for translating future sentences. However, cache-based approaches may potentially guide the translation of future sentences in a wrong way, since the cached translations could be incorrect. Rather than explicitly presenting lexical translations used in previous sentences as in cache-based approaches, in this paper we aim to improve lexical translation consistency in a softer way: we encourage the translations of the same word at different positions of a document to be consistent. Specifically, we first obtain a word link for each source word in a document, if it has one, which tells the positions at which the source word appears. To encourage translation consistency for the words within a link, we exchange their context information when encoding sentences in a document. Moreover, we propose an auxiliary loss function to constrain the translations of these words to be consistent.
Overall, we make the following contributions.
• We propose a metric to properly measure lexical translation consistency, and provide a detailed study of lexical translation consistency in both Chinese↔English translation directions.
• We propose a novel approach to improve lexical translation consistency for document-level NMT. One nice property of our approach is that our models can synchronously translate the sentences in a document, rather than translating them one by one as in cache-based approaches.
• Experimental results show that our approach outperforms various context-aware NMT models in BLEU. More importantly, our approach greatly improves lexical translation consistency.

Motivation
Given a parallel document pair (S, T), a source-side word w (stemmed to eliminate morphological differences if necessary) is a word of our interest if it is a non-stop word and occurs two or more times in S. For w, we conjecture that the translations of w in T (also stemmed if necessary) tend to be the same. As shown in Figure 1, source word 房地产业/fang_di_chan_ye is consistently translated into (the) real estate sector in the reference translation.
Lexical Translation Consistency Metric. To properly evaluate lexical translation consistency, we propose the lexical translation consistency ratio (LTCR), which is based on word alignment. Let us assume that source word w appears k times in S. Based on the word alignment between S and T, we obtain its k translations, 1 i.e., (t_1, ..., t_k), where t_i may consist of zero, one, or more words. Then we define the metric for word w as:

$$\mathrm{LTCR}(w) = \frac{\sum_{1 \le i < j \le k} \mathbb{1}(t_i = t_j)}{C_k^2},$$

where the denominator $C_k^2$ denotes the number of translation pairs drawn from (t_1, ..., t_k), and the function $\mathbb{1}(t_i = t_j)$ returns 1 if t_i is the same as t_j, and 0 otherwise. The metric measures how often two translations of w within a document are the same: the higher the metric value, the more likely w is translated consistently. Taking source word 房地产业/fang_di_chan_ye in Figure 1 as an example, its LTCR is 100% for the reference translation and 0% for sentence-level NMT.

1 To obtain translations, we filter out determiners.
Above we calculated LTCR for a single word in a document. Likewise, we can apply the metric to all source words of our interest in a parallel document pair, or in a document-level parallel dataset, by summing up all these words' corresponding numerators and denominators, respectively.
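For concreteness, the metric above can be sketched in a few lines of code (a minimal sketch; the function and variable names are ours, and the determiner filtering and word alignment steps are omitted):

```python
from itertools import combinations

def ltcr(translations):
    """LTCR counts for one source word in one document.

    `translations` is the list (t_1, ..., t_k) of the word's k aligned
    translations; each t_i is a (possibly empty) tuple of target words.
    Returns (number of equal pairs, C(k, 2)).
    """
    pairs = list(combinations(translations, 2))
    equal = sum(1 for ti, tj in pairs if ti == tj)
    return equal, len(pairs)

def corpus_ltcr(per_word_translations):
    """Corpus-level LTCR: sum numerators and denominators separately."""
    num = den = 0
    for translations in per_word_translations:
        e, p = ltcr(translations)
        num += e
        den += p
    return num / den if den else 0.0
```

For the example in Figure 1, four identical translations of 房地产业 give 6 equal pairs out of C(4, 2) = 6, i.e., LTCR = 100%, while four distinct translations give 0%.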
Statistics on Reference Translation and Automatic Translation. To better understand lexical consistency in translation, we take a manually word-aligned Chinese-English (ZH-EN) document-level parallel corpus (LDC2015T06) as representative to study how consistent lexical translation is in ZH→EN and EN→ZH translation. The corpus consists of 268 documents with 6741 sentences in total, from domains including broadcast, newswire, and web data. Moreover, for sentence-level NMT translation we perform word alignment to obtain word-level translations. 2 Table 1 compares the lexical translation consistency in ZH→EN and EN→ZH translation of LDC2015T06. From it, we observe that although translation diversity is usually encouraged, LTCR still reaches 74.24% and 63.11% in ZH→EN and EN→ZH reference translation, respectively. This confirms our conjecture that the translations of the same source word tend to be consistent. We also note that consistency differs among word types: for example, the consistency for nouns is much higher than that of other word types in both translation directions. Unfortunately, the consistency in automatic translation is much lower than in reference translation, indicating that there is much room to improve lexical consistency in document-level machine translation. Finally, it also shows the percentages of words of our interest.

Encouraging Lexical Translation Consistency via Word Links
As our goal is to encourage lexical consistency in document-level translation, we first obtain word links, each of which tells the positions at which a word appears in a document (Section 3.1). To encourage translation consistency among the words in the same link, on the one hand we exchange their information when encoding sentences within a document (Section 3.2); on the other hand, we propose an auxiliary loss function to constrain the translations of these words to be consistent (Section 3.3).

Obtaining Word Links
We define some notations before describing our approach. Given a document-level parallel pair with N sentence pairs, we assume that each source sentence $S_i = (s_{i,j})|_{j=1}^{n}$ consists of n words. Given document S, we use V to denote the collection of words of our interest in S, i.e., non-stop words that appear two or more times.
For word s_{i,j}, if it exists in V, we maintain a link list $L_{i,j} = (a_{i,j,k}, b_{i,j,k}, m_{i,j,k})|_{k=1}^{K}$ with K triples, which tells the other K positions where s_{i,j} appears. 3 Specifically, in a triple (a, b, m), a and b indicate the sentence index and word index of a position, respectively, while m ∈ {0, 1} is a padding mask indicating whether (a, b) is a real position or a padded one. For cases where s_{i,j} appears more than K times in S, we choose the K closest occurrences to construct its word link. 4

3 We do not include s_{i,j} itself in L_{i,j}.
4 According to our preliminary experimentation, the effect of different ways of choosing the K positions is negligible.
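The link construction above can be sketched as follows (all names are ours; the exact tie-breaking used to pick the K closest positions is an assumption, since the paper notes the choice matters little):

```python
from collections import defaultdict

def build_word_links(doc, interest, K=6):
    """Build a K-entry link list for each word of interest in `doc`.

    doc: list of sentences, each a list of (stemmed) tokens.
    interest: set of non-stop words occurring >= 2 times in the document.
    Returns {(i, j): [(a, b, m), ...]} where (a, b) is the sentence/word
    index of a linked occurrence and m marks real (1) vs. padded (0)
    entries. The word itself is excluded from its own link list; if it
    occurs more than K + 1 times, only the K closest occurrences are kept.
    """
    positions = defaultdict(list)
    for i, sent in enumerate(doc):
        for j, w in enumerate(sent):
            if w in interest:
                positions[w].append((i, j))
    links = {}
    for w, occ in positions.items():
        for i, j in occ:
            others = [(a, b) for (a, b) in occ if (a, b) != (i, j)]
            # keep the K closest positions (by sentence, then word distance)
            others.sort(key=lambda ab: (abs(ab[0] - i), abs(ab[1] - j)))
            trip = [(a, b, 1) for a, b in others[:K]]
            trip += [(0, 0, 0)] * (K - len(trip))  # pad with masked entries
            links[(i, j)] = trip
    return links
```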

Encoding Documents with Word Links
Now each word of our interest in a document is equipped with a word link. In encoding, we take documents as input units by synchronously encoding sentences within a document. Figure 2 shows our encoder layer which encodes documents with word links.

Sentence Position Embedding
Since the words in a link list may appear in different sentences, a Transformer encoder cannot distinguish the sentence positions of the linked words from that of the current word. Therefore, we introduce sentence position embedding to distinguish the positions of these words.
Formally, given the i-th sentence Si in S, we project each word si,j into a word embedding ei,j ∈ R^d, an (intra-sentence) position embedding pej ∈ R^d, and a sentence position embedding spei ∈ R^d, where d is the size of the embeddings and hidden states throughout the entire model. Then, we perform an addition operation to unify them into a single input, i.e., ei,j + pej + spei. Note that both the word embeddings and the sentence position embeddings are trainable parameters, while the (intra-sentence) position embeddings are sinusoidal (Vaswani et al., 2017).
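The input construction can be sketched as below (a NumPy sketch with our own names; in the actual model the word and sentence position embeddings are trainable parameters, so here they are simply passed in as arrays):

```python
import numpy as np

def sinusoidal_pe(n_pos, d):
    """Standard sinusoidal position embeddings (Vaswani et al., 2017)."""
    pe = np.zeros((n_pos, d))
    pos = np.arange(n_pos)[:, None]
    div = np.exp(np.arange(0, d, 2) * (-np.log(10000.0) / d))
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return pe

def encoder_input(word_emb, sent_pos_emb, sent_index, d):
    """Input for sentence S_i: e_{i,j} + pe_j + spe_i.

    word_emb: (n, d) word embeddings of the sentence.
    sent_pos_emb: (num_sents, d) sentence position embedding table.
    """
    n = word_emb.shape[0]
    pe = sinusoidal_pe(n, d)        # intra-sentence positions
    spe = sent_pos_emb[sent_index]  # shared by all words in S_i
    return word_emb + pe + spe
```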

Encoder
As shown in Figure 2, the encoder consists of M identical encoder layers, each of which consists of three sub-layers: a self-attention sub-layer, a word-link-attention sub-layer, and a feed-forward sub-layer. Next we use sentence $S_i = (s_{i,j})|_{j=1}^{n}$ to illustrate the encoding process.
Self-Attention Sub-Layer. In the m-th encoder layer, this sub-layer takes $A_i^{(m)} \in \mathbb{R}^{n \times d}$ as input and computes a new sequence $B_i^{(m)}$ of the same length via a multi-head attention function:

$$B_i^{(m)} = \mathrm{LayerNorm}\left(A_i^{(m)} + \mathrm{MultiHead}\left(A_i^{(m)}, A_i^{(m)}, A_i^{(m)}\right)\right),$$

where LayerNorm is the layer normalization function (Ba et al., 2016), and the output $B_i^{(m)}$ is of shape $\mathbb{R}^{n \times d}$. For the first encoder layer, $A_i^{(1)}$ is the input of the encoder, while for the other layers, $A_i^{(m)}$ is the output of the (m − 1)-th encoder layer.
Word-Link-Attention Sub-Layer. Let $B_i^{(m)}$ denote the output of the self-attention sub-layer of the m-th layer. Assume that word s_{i,j} in sentence Si is of our interest and has a word link list L_{i,j}. We use the list to index the states of its K linked words from the self-attention outputs, and use $C_{i,j}^{(m)} \in \mathbb{R}^{K \times d}$ to denote the indexed states. This sub-layer then uses another multi-head attention function to exchange information among the linked words:

$$D_{i,j}^{(m)} = \mathrm{LayerNorm}\left(B_{i,j}^{(m)} + \mathrm{MultiHead}\left(B_{i,j}^{(m)}, C_{i,j}^{(m)}, C_{i,j}^{(m)}\right)\right).$$

If s_{i,j} is out of our interest and does not have a word link list, we simply set $D_{i,j}^{(m)} = B_{i,j}^{(m)}$.

Feed-Forward Sub-Layer. In the m-th encoder layer, this sub-layer is applied to each position separately and identically via two linear transformations with a ReLU activation in between.
The output of the final layer, i.e., $E_i^{(M)}$, will be used as the output of the encoder.
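A single-head, unbatched sketch of the gather-then-attend step may help clarify the word-link-attention sub-layer (the names are ours; the real sub-layer uses multi-head attention with learned projections and layer normalization, both omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_link_attention(B, links, position):
    """Single-head sketch of the word-link-attention sub-layer.

    B: dict sentence index -> (n, d) self-attention outputs.
    links: [(a, b, m), ...] link list of the word at `position` = (i, j);
    masked (m = 0) entries are excluded from attention. Assumes at least
    one real linked position. Returns the updated state for that word.
    """
    i, j = position
    q = B[i][j]  # query: the current word's state
    # gather the states of the linked words across sentences
    C = np.stack([B[a][b] for a, b, m in links if m])
    d = q.shape[-1]
    attn = softmax(C @ q / np.sqrt(d))
    out = attn @ C
    # residual connection; layer normalization omitted for brevity
    return q + out
```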

Consistency Constraint Loss
After encoding sentences within a document, we extract useful information from document-level context via the deliberately obtained word links. We expect the extracted information to make the translations of the same words more consistent, i.e., to bring the states of the same words within a document closer. Let us assume that word s_{x,y}, i.e., the y-th word in the x-th sentence, is in the word-link list of word s_{i,j}. We use $E^{(M)}_{i,j}$ and $E^{(M)}_{x,y}$ to denote their hidden states from our encoder with the word-link-attention sub-layer. Meanwhile, we use $\bar{E}^{(M)}_{i,j}$ and $\bar{E}^{(M)}_{x,y}$ to denote their hidden states from a vanilla Transformer encoder, i.e., an encoder without the word-link-attention sub-layer. 5 Since our encoder has exchanged context information between s_{i,j} and s_{x,y} while the vanilla encoder has not, we expect the two states $E^{(M)}_{i,j}$ and $E^{(M)}_{x,y}$ to be closer than $\bar{E}^{(M)}_{i,j}$ and $\bar{E}^{(M)}_{x,y}$.

To encourage our encoder to generate closer hidden states for a pair of linked words than the vanilla encoder, we follow previous work on visual-semantic embedding (Kiros et al., 2014) and define a consistency constraint loss. In practice, similar to Chen et al. (2020), we introduce a small neural network projection head that maps the representations, i.e., the $E^{(M)}$'s and $\bar{E}^{(M)}$'s, 6 to a space where the consistency constraint loss is applied during training. We use an MLP with one hidden layer to obtain Z and $\bar{Z}$, i.e.,

$$Z = g(E^{(M)}) = W^{(2)}\,\sigma\!\left(W^{(1)} E^{(M)}\right),$$

where σ is a ReLU non-linearity and $W^{(1)}, W^{(2)} \in \mathbb{R}^{d \times d}$ are model parameters; $\bar{Z}$ is obtained from $\bar{E}^{(M)}$ in the same way. As shown in Appendix C, we find it beneficial to define the consistency constraint loss on the Z, $\bar{Z}$'s rather than on the E, $\bar{E}$'s.

5 According to Section 3.2, our encoder returns the $E^{(M)}$'s; we use the $\bar{E}^{(M)}$'s to denote the outputs of the corresponding vanilla encoder.
6 Although the $\bar{E}^{(M)}$'s are not directly used to train the model, they are in the same semantic space as the $E^{(M)}$'s. See Appendix E for a performance comparison using E and $\bar{E}$.
After that, the consistency constraint loss is defined as follows:

$$J_{CC}(\theta) = \sum_{i,j}\sum_{k=1}^{K} m_{i,j,k}\,\max\!\left(0,\; \gamma + D\!\left(Z_{i,j},\, Z_{a_{i,j,k},\,b_{i,j,k}}\right) - D\!\left(\bar{Z}_{i,j},\, \bar{Z}_{a_{i,j,k},\,b_{i,j,k}}\right)\right),$$

where θ are the parameters of our model, D is a distance function, i.e., the cosine distance between two vectors, γ is a margin, and $a_{i,j,k}$ and $b_{i,j,k}$ denote the sentence and word indexes of word s_{i,j}'s k-th linked word, respectively. 7 Finally, the joint objective function of our model, J(θ), is defined as:

$$J(\theta) = J_{NMT}(\theta) + \alpha\, J_{CC}(\theta),$$

where α determines the contribution of the consistency constraint loss, and $J_{NMT}(\theta)$ is the cross-entropy loss function, i.e.,

$$J_{NMT}(\theta) = -\sum_{(S,\,T)} \log P(T \mid S;\, \theta).$$
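The margin-based constraint described above can be sketched as follows (a plain NumPy sketch with our own names; the actual loss is computed over batched projected states during training, and padded link entries are skipped via the mask m):

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def consistency_loss(Z, Z_bar, links, gamma=0.2):
    """Margin loss encouraging linked states to be closer under our
    encoder (Z) than under the vanilla encoder (Z_bar).

    Z, Z_bar: dict (i, j) -> projected state g(E) for each linked word.
    links: {(i, j): [(a, b, m), ...]} word-link lists with padding masks.
    """
    loss = 0.0
    for (i, j), lst in links.items():
        for a, b, m in lst:
            if not m:
                continue  # skip padded (fake) link entries
            d_ours = cosine_distance(Z[(i, j)], Z[(a, b)])
            d_vanilla = cosine_distance(Z_bar[(i, j)], Z_bar[(a, b)])
            loss += max(0.0, gamma + d_ours - d_vanilla)
    return loss
```

The loss is zero whenever our encoder's states are already closer than the vanilla encoder's by at least the margin γ.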

Experimentation
To verify the effectiveness of our proposed approach, we carry out experiments on ZH↔EN translation tasks in two different domains: news and TED talks. Inspired by the conclusion of Guillou (2013) that lexical consistency is encouraged in English-French human translation, we also validate our approach on EN→FR translation.

Experimental Setup
Datasets. For ZH↔EN (News), the training data is composed from LDC. We use the NIST2006 dataset as the development set and combine the NIST2002, 2003, 2004, and 2005 datasets as the test set. For ZH↔EN (TED), the dataset is from the IWSLT 2014 and 2015 evaluations (Cettolo et al., 2012, 2015). We use dev2010 as the development set and combine tst2010-2013 as the test set. For both ZH↔EN translation directions, every source sentence has one translation reference. For EN→FR, we use the IWSLT 2015 (Cettolo et al., 2015) evaluation as training data. For development and testing, we use dev2010 as the development set and combine tst2010-2013 as the test set; every source sentence has one translation reference.
See Appendix A for more statistics and preprocessing of the experimental datasets.
Training Strategy. To compute the consistency constraint loss JCC(θ), sentences are required to be encoded twice: once with the word-link-attention sub-layer and once without it. Therefore, including this loss function from the beginning may break the balance between optimizing the encoder and the decoder, and make it hard for training to converge properly. To alleviate this problem, we divide the whole training process into two stages. In the first stage, we train the models to convergence with the cross-entropy loss JNMT(θ) only, while in the second stage, we add the consistency constraint loss JCC(θ) and train the models with the joint loss. The second training stage acts like fine-tuning, in which we use a smaller learning rate and fewer training steps.
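The two-stage schedule can be sketched as follows (hypothetical helper names; the joint objective matches Section 3.3, while the Noam-style warm-up shape is our assumption, as the paper only specifies the base rates and warm-up steps, see Appendix B):

```python
def joint_loss(j_nmt, j_cc, stage, alpha=0.01):
    """Two-stage objective: cross entropy only in stage 1,
    joint loss J_NMT + alpha * J_CC in stage 2 (fine-tuning)."""
    return j_nmt if stage == 1 else j_nmt + alpha * j_cc

def lr_for_step(step, stage, d=512):
    """Hypothetical Noam-style schedule per stage: stage 2 uses a
    halved base rate (0.5) and fewer warm-up steps (4K vs. 8K)."""
    base, warmup = (1.0, 8000) if stage == 1 else (0.5, 4000)
    return base * (d ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)
```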
Model Setting. We use OpenNMT (Klein et al., 2017) as the implementation of the Transformer and extend it. For the number of linked words with the current word, we set K = 6. The margin size γ in the consistency constraint loss is set to 0.2 while the weight α in joint objective function is set to 0.01. Other model settings are in Appendix B.
Evaluation. For all translation tasks, we report case-insensitive BLEU score as calculated by the multi-bleu.perl script.

Experimental Result
Besides the sentence-level Transformer, we also compare our approach to three previous Transformer-based context-aware NMT models, including HAN (Miculicich et al., 2018) and MCN (Zheng et al., 2020). 10 For fair comparison, we run their source code with our model settings. Note that the above context-aware NMT models aim to improve translation accuracy (i.e., BLEU) without focusing on resolving a particular discourse phenomenon.
Chinese-English Translation. Table 2 lists the performance of ZH→EN translation on both News and TED talk domains. From the table, we have the following observations.
• Exchanging information among words within word links (i.e., +word-link) achieves significant improvement in BLEU over the (sentence-level) Transformer, suggesting that extracting information from document-level context via our deliberately designed word links is effective. On top of the +word-link setting, constraining the translations of words within a link (i.e., +CC-loss) to be consistent with our proposed loss function achieves a further significant improvement in BLEU. Compared to Transformer, our approach gains +2.23 and +2.05 BLEU on the two domains, respectively.

10 MCN: https://github.com/Blickwinkel1107/making-the-most-of-context-nmt
• In terms of LTCR, both +word-link and +CC-loss greatly improve lexical translation consistency. For example, with +word-link +CC-loss our approach achieves +7.74% and +10.28% LTCR on the two domains, respectively.
• Though the three previous context-aware NMT models significantly outperform Transformer in terms of BLEU, their LTCR performance is very close to that of Transformer, suggesting that these models have a very limited effect in encouraging lexical translation consistency. Compared to these models, our approach achieves better performance in BLEU and, more importantly, greatly improves performance in LTCR.
• With the word-link attention sub-layer, our approach introduces an additional 10.87% of parameters and has a number of parameters similar to the previous context-aware NMT models.
English-French Translation.

Discussion
Next, we take ZH→EN translation on news domain as a representative to discuss how our proposed approach improves translation performance. See Appendix for more discussion.

Effect of Hyper-parameter K
Among the words of our interest, the valid lengths of their word links differ greatly. As shown in Table 5, about 79.68% of the words of our interest have a word link whose valid length is 6 or less. A significant hyper-parameter in our proposed model is K, i.e., the number of words in every word link (Section 3.1). A low value makes the information exchange among sentences within a document insufficient, while a high value increases the cost of computation. We compare the performance and training time for five different values of K. Note that our model is equivalent to the sentence-level Transformer when K is 0. Figure 3 shows the performance over different values of K. When K increases from 0 to 6, we observe consistent improvement in both BLEU and LTCR. The performance tends to be stable at K = 6, since no further improvement is achieved by increasing K to 8. Meanwhile, increasing K slightly slows down training: compared to Transformer (i.e., K = 0, 12700 toks/sec), our approach with K = 6 (7800 toks/sec) trains about 39% slower, with the extra time consumed by the word-link attention sub-layers and the computation of the consistency constraint loss.

Table 6: Performance comparison when the linked list contains the positions of words with the same stem, or random positions.

Effect of Random Linked Word Positions
As shown in Section 3.1, the word link of word si,j contains the other positions where si,j appears. To validate that the achieved improvement indeed comes from exchanging information among words with the same stem, we perform a contrastive experiment by replacing the positions in word links with random positions. Note that in this setting it does not make sense to apply the consistency constraint loss (+CC-loss), since the linked words are random. Table 6 compares the performance. On the one hand, replacing the words in word links with random words still achieves +0.49 BLEU over Transformer, suggesting that even randomly exchanging information across sentences is helpful. On the other hand, using random linked words brings no LTCR improvement over Transformer. This in turn suggests that the BLEU improvement achieved by our approach is mainly contributed by improved lexical translation consistency.

Performance on LDC2015T06
In Section 2 we used the manually word-aligned document-level parallel corpus LDC2015T06 to analyze lexical consistency in translation. Table 7 compares the LTCR performance of our approach to those of the gold (reference) and sentence-level NMT scenarios. It shows that our approach (e.g., +word-link +CC-loss) achieves higher LTCR than Transformer over all POS tags, especially for nouns. Meanwhile, the gap behind reference translation suggests that there is still room for further improvement.

Pronoun Translation
We follow Miculicich et al. (2018) and Tan et al. (2019) to evaluate coreference and anaphora using the reference-based metric accuracy of pronoun translation (Werlen and Popescu-Belis, 2017). Table 8 lists the performance of pronoun translation. From it we observe that our approach also improves pronoun translation, with exchanging context information among linked words (i.e., +word-link) contributing more than the consistency constraint loss (i.e., +CC-loss).

Human Evaluation
We conduct a human evaluation on 500 sentences randomly selected from our test set. Let us assume that the i-th sentence Si in a document-level parallel pair (S, T) is selected. We then provide two annotators with a group of source sentences and translations, i.e., (Si−2, Si−1, Si, Si+1, Si+2) and (Ti−2, Ti−1, ?, Ti+1, Ti+2), where ? is Si's translation by either our approach or the sentence-level Transformer. The translations are presented in random order with no indication of which model each is from. Following Voita et al. (2019a), the task is to pick one of three options: (1) the first translation is better; (2) the second translation is better; or (3) the translations are of equal quality. The annotators are asked to avoid the third option if they can give preference to one of the translations. Table 9 shows the human evaluation results. On average the annotators mark 47% of cases as having equal quality. Among the remaining cases, our approach outperforms Transformer in 64%, suggesting that overall the annotators have a strong preference for our approach over Transformer.

Effect of Sentence Position Embedding
As shown in Section 3.2.1, we introduce sentence position embedding (SPE) to indicate the sentence position of words. To analyze its effect on our proposed approach, we perform a contrastive experiment. Table 10 compares the performance. SPE slightly improves BLEU (+0.49) and LTCR (+0.84%) over the word-link Transformer without SPE, suggesting that SPE is helpful for document-level NMT. We will explore it further in future work.

Analysis of Exchanging Information among Linked Words
As shown in Section 3.2.2, we use the multi-head attention function to exchange information among linked words. To validate the effectiveness of this method, we perform a contrastive experiment by replacing the multi-head attention function in Eq. 3 with the average pooling function in Eq. 8. Table 11 lists the translation performance when we use the different functions to exchange information among linked words. From it we observe that the multi-head attention function performs better. This in turn may suggest that simply averaging the hidden states of linked words dilutes the cross-sentence information.

Related Work
There has been substantial work in SMT that either encourages or enforces lexical translation consistency. For example, Xiao et al. (2011) and Garcia et al. (2014, 2017) propose post-editing approaches to re-translate source words that have been translated differently within a document. Tiedemann (2010a,b) and Gong et al. (2011) propose cache-based approaches to remember translation history. Discriminative learning approaches have also been proposed to fix lexical translation inconsistency. Besides, Carpuat (2009) and Türe et al. (2012) demonstrate that applying the "one translation per discourse" constraint in SMT leads to better translation quality.
Moving to NMT, most document-level NMT studies have proposed various context-aware models to leverage either local context, e.g., previous sentences (Wang et al., 2017; Bawden et al., 2018; Voita et al., 2018, 2019b; Yang et al., 2019), or the entire document (Maruf and Haffari, 2018; Mace and Servan, 2019; Maruf et al., 2019; Tan et al., 2019; Zheng et al., 2020; Kang et al., 2020). However, different from ours, these studies aim to improve translation accuracy without handling a specific discourse phenomenon. Kuang et al. (2017) and Tu et al. (2018) cache recently translated words and/or their translations, which could be used to increase lexical consistency when translating future sentences. However, cache-based approaches require translating sentences in a document one by one and may potentially guide the translation of future sentences in a wrong way, since the cached translations could be incorrect. Experimental results in related studies (Miculicich et al., 2018) have shown that the BLEU improvement of cache-based approaches over the (sentence-level) Transformer is limited. Our approach is different from cache-based approaches in that we translate sentences within a document synchronously, and, more importantly, it does not explicitly suggest any translation.
There also exist many NMT studies that aim to resolve discourse phenomena in post-processing. For example, to make the translation outputs of a document more coherent, Voita et al. (2019a) propose DocRepair, trained on monolingual target-language documents, to correct inconsistencies in sentence-level translation, while Yu et al. (2020) train a context-aware language model to re-rank sentence-level translation candidates.

Conclusion
In this paper, we apply "one translation per discourse" in NMT and propose an approach to encourage lexical translation consistency. This is done by first obtaining a word link for each source word in a document, which tells the positions at which the source word appears. Then we encourage the translations of words within a link to be consistent, both by exchanging their context information in encoding and by using an auxiliary loss to constrain their translations to be consistent. Experimental results on Chinese↔English and English→French translation tasks show that our approach not only achieves higher BLEU scores than various context-aware NMT models, but also greatly improves lexical translation consistency.

A Experimental Datasets
For ZH↔EN on news domain, the training data set consists of LDC2002T01, LDC2004T07, LDC2005T06, LDC2005T10, LDC2009T02, LDC2009T15, and LDC2010T03. Table 12 summarizes statistics of the translation tasks. Note that we split long documents in training datasets into sub-documents with at most 20 sentences for efficient training. Table 13 presents the percentage of words of our interest against all source-side words in the five translation tasks. It shows that the percentage of words of our interest varies across different translation tasks.
For ZH↔EN, the English sentences are tokenized and lowercased by the Moses toolkit (Koehn et al., 2007), 11 while the Chinese sentences are segmented by Jieba. 12 For News (TED), we segment the source and target sentences into sub-words by a BPE model with 32K (21K) merge operations (Sennrich et al., 2016).
For EN→FR, all English and French sentences are tokenized and lowercased by Moses toolkit, we use BPE with 32K merged operations to segment words into sub-word units.

B Model Settings
For all translation models, the hidden size and the filter size are set to 512 and 2048, respectively. The number of heads in multi-head attention is set to 8. The dropout rate is 0.1. For models on ZH↔EN, the numbers of layers in the encoder and the decoder are set to 6, while for models on EN→FR, we change the numbers to 4. We train the models on two V100 GPUs with batch size 4096 and use Adam with β1 = 0.9, β2 = 0.98 for optimization (Kingma and Ba, 2015). In the first training stage, we train the models for 150K steps with 8K warm-up steps and a learning rate of 1.0, while in the second training stage, we continue to train the models for 50K steps with 4K warm-up steps and a learning rate of 0.5. In inference, we set the beam size to 5.

C Effect of Non-linear Projection Head
We take ZH→EN translation on the news domain as an example to study the importance of including a projection head, i.e., g(·). Figure 4 shows LTCR and BLEU scores using three different architectures for the head: (1) identity mapping; (2) linear projection; and (3) the default non-linear projection with one additional hidden layer (and ReLU activation). We observe that a non-linear projection is better than a linear projection (+0.46 BLEU and +0.32% LTCR), and much better than no projection (+0.58 BLEU and +0.31% LTCR).

D Effect of the Percentage of Words of Our Interest

We study whether our approach performs better, i.e., yields more BLEU improvement over Transformer, when there are more words of our interest in a document. To this end, we divide all documents in the test set into three subsets with different percentages of words of our interest:
• <=20%, which includes 137 documents with 1,449 sentences;
• 20∼40%, which includes 362 documents with 3,606 sentences;
• >40%, which includes 10 documents with 91 sentences.
As shown in Table 14, we observe that our approach indeed achieves more improvement over documents with higher percentages of words of our interest. For example, when the percentage is bigger than 40%, we achieve +4.31 BLEU gain.

Word-Link NMT

#1: ( international ) ciq sign memorandum on implementation of animal and plant health measures
#2: ... and the state ministry of agriculture of chile signed a memorandum on the implementation of health ...
#3: under the memorandum , the two sides will , in accordance with the rules of the agreement and the standards ...
#4: the memorandum stipulates that both sides should strictly implement inspection and quarantine of animals ...
#5: the memorandum was signed by the deputy director of the state administration of quality and inspection of ...

Sentence-Level NMT

#1: ( international ) zhongji signed memorandum on measures to implement animal and plants
#2: ... inspection general of china and the chilean ministry of agriculture signed a memorandum here on ...
#3: under the mou , both sides will , in accordance with the rules of the agreement and the standards developed by ...
#4: the mou stipulates that the two sides should strictly adhere to the protocol or the agreed inspection and ...
#5: the memorandum was signed by the deputy director-general of the state of quality inspection and inspection of ...

Reference

#1: ( international ) china and chile sign memorandum on application of animal and plant sanitary measures
#2: ... national quality inspection bureau and the ministry of agriculture of chile signed here a memorandum ...
#3: according to the memorandum , china and chile will formulate inspection and quarantine requirements for the ...
#4: the memorandum also stipulates that both sides should conduct inspection and quarantine of the imported and ...
#5: ... quality inspection bureau , and barrera , acting minister of agriculture of chile , signed the memorandum .

Figure 5: An example of document-level Chinese-English translation from our test set.

E Performance Comparison Using E and Ē

Table 15 lists the performance. It is not surprising that the performance of using $\bar{E}$ as the encoder output is lower than that of using E, since the former does not use any contextual information. This suggests that although $\bar{E}$ is not directly used to train the model, it is in the same semantic space as E.

F Qualitative Analysis
We use an example to illustrate how the word-link method helps translation (Figure 5). From it we observe that our proposed approach (Word-Link NMT) can effectively alleviate the translation inconsistency issue in document-level NMT: source word 备忘录/bei_wang_lu is consistently translated as memorandum by our model, while the sentence-level baseline alternates between memorandum and mou.