Document Graph for Neural Machine Translation

Previous works have shown that contextual information can improve the performance of neural machine translation (NMT). However, most existing document-level NMT methods fail to leverage contexts beyond a small number of previous sentences. How to make use of the whole document as global context remains a challenge. To address this issue, we hypothesize that a document can be represented as a graph that connects relevant contexts regardless of their distances. We employ several types of relations, including adjacency, syntactic dependency, lexical consistency, and coreference, to construct the document graph. Then, we incorporate both source and target graphs into the conventional Transformer architecture with graph convolutional networks. Experiments on various NMT benchmarks, including IWSLT English-French and Chinese-English, WMT English-German, and Opensubtitle English-Russian, demonstrate that using document graphs can significantly improve translation quality. Extensive analysis verifies that the document graph is beneficial for capturing discourse phenomena.


Introduction
Although neural machine translation (NMT) has achieved great success on sentence-level translation tasks, many studies have pointed out that translation mistakes become more noticeable at the document level (Wang et al., 2017; Tiedemann and Scherrer, 2017; Miculicich et al., 2018; Kuang et al., 2018; Voita et al., 2018; Läubli et al., 2018; Voita et al., 2019b; Kim et al., 2019). These studies showed that such mistakes can be alleviated by feeding contexts into context-agnostic NMT models.
Previous works have explored various methods to integrate context information into NMT models.
They usually take a limited number of previous sentences as contexts and learn context-aware representations using hierarchical networks (Miculicich et al., 2018; Wang et al., 2017; Tan et al., 2019) or extra context encoders (Jean et al., 2015). Different from these representation-based approaches, Kuang et al. (2018) propose using a cache to memorize context information, which can be either history hidden states or lexicons. To keep track of the most recent contexts, the cache is updated when new translations are generated; therefore, long-distance contexts are likely to be erased.
How to use long-distance contexts has been drawing attention in recent years. Approaches such as treating the whole document as a long sentence (Junczys-Dowmunt, 2019) or using memory and hierarchical structures (Maruf and Haffari, 2018; Maruf et al., 2019; Tan et al., 2019) have been proposed to take global contexts into consideration. However, Kim et al. (2019) point out that not all the words in a document are beneficial to context integration, suggesting that it is essential for each word to focus on its own relevant context.
To address this problem, we propose building a document graph for each document, where each word is connected to the words that have a direct influence on its translation. Figure 1 shows an example of a document graph. Explicitly, a document graph is defined as a directed graph where: (1) each node represents a word in the document; and (2) each edge represents one of the following relations between words: (a) adjacency; (b) syntactic dependency; (c) lexical consistency; or (d) coreference.
We apply a Graph Convolutional Network (GCN) to the document graph to obtain a document-level contextual representation for each word, which is fed into the conventional TRANSFORMER model (Vaswani et al., 2017) via additional attention and gating mechanisms. We evaluate our model on four translation benchmarks: IWSLT English-French (En-Fr) and Chinese-English (Zh-En), Opensubtitle English-Russian (En-Ru), and WMT English-German (En-De). Experimental results demonstrate that our approach is consistently superior to previous works (Miculicich et al., 2018; Macé and Servan, 2019; Tan et al., 2019; Maruf et al., 2019) on all language pairs.
Contributions of this work are summarized as follows:
• We represent a document as a graph that connects relevant contexts regardless of their distances. To the best of our knowledge, this is the first work to introduce such graphs into document-level neural machine translation.
• We investigate several relations between words to construct document graphs and verify their effectiveness in experiments.
• We propose a graph encoder that learns graph representations with GCN layers and an attention mechanism to combine representations from different sources.
• We propose a context integration method and examine the proposed graph model in different context-aware MT architectures.

Approach
In this section, we introduce the proposed document graph and the model for leveraging contextual information from documents. First, we present a definition of the problem. Then, the construction and representation learning of document graphs are explained in Section 2.2 and Section 2.3, respectively. Finally, we describe the method of integrating document graphs and the model architectures that we use to examine the integration. (Linguistic annotations for graph construction are obtained with CoreNLP, https://corenlp.run/.)

Problem Definition
Document-level NMT learns to translate a document in a source language into a document in a target language. Formally, a source document is a sequence of M sentences X = [X_1, ..., X_m, ..., X_M], and its translation is Y = [Y_1, ..., Y_m, ..., Y_M]. Given the source document to translate, we assume that there is a pair of hidden source and target graphs G_{X,Ŷ} = (G_X, G_Ŷ) (called document graphs and defined in Section 2.2) that helps generate the target document. Therefore, the translation probability from X to Y can be represented as

    P(Y | X) = Σ_{G_{X,Ŷ}} P(Y | X, G_{X,Ŷ}) P(G_{X,Ŷ} | X).    (1)

Equation (1) is computationally intractable. Therefore, instead of considering all possible graph pairs, we sample only one pair of graphs according to the source document, resulting in the simplified

    P(Y | X) ≈ P(Y | X, G_{X,Ŷ}).    (2)

The construction of source and target graphs is described in Section 2.2.
The translation of a document is further decomposed into translations of each sentence with document graphs as context:

    P(Y | X, G_{X,Ŷ}) = ∏_{m=1}^{M} P(Y_m | X_m, Y_{<m}, G_{X,Ŷ}).    (3)

Graph Construction
Graphs used in this paper are directed and can be represented as G = (V, E), where V is a set of nodes and E is a set of edges; an edge e = (u, v) with u, v ∈ V denotes a directed connection from node u to node v.
Our graph contains both word-level and sentence-level nodes. Given a document X = [...; x^m_1, ..., x^m_{I_m}; ...], where x^m_i is the ith (1 ≤ i ≤ I_m) word in the mth (1 ≤ m ≤ M) sentence, we construct a document graph with Σ_{m=1}^{M} I_m word-level nodes and M sentence-level nodes. Each word-level node x^m_i in the mth sentence is directly connected to the sentence-level node S_m. Edges between word-level nodes are determined by intra-sentential and inter-sentential relations. Figure 1 shows an example document graph. Note that not all edges are depicted, for simplicity.
Intra-sentential Relations provide links between words within a sentence X_m = x^m_1, ..., x^m_{I_m}. These links are relatively local yet informative and help capture the structure and meaning of the sentence. In this paper, we consider two kinds of intra-sentential relations:
• Adjacency provides a local lexicalized context that can be obtained without resorting to external resources and has been proven beneficial to sentence modeling (Xu et al., 2019). For each word x^m_i, we add two edges (x^m_i, x^m_{i+1}) and (x^m_i, x^m_{i−1}), i.e., links from the current word to its adjacent words.
• Dependency directly models syntactic and semantic relations between two words in a sentence. Dependency relations not only carry linguistic meaning but also allow connections between words at a longer distance. Previous practice has shown that dependency relations enhance representation learning of words (Strubell et al., 2018; Lin et al., 2019). Given a dependency tree of the sentence, we add a graph edge (x^m_i, x^m_j) for each dependency arc between the words x^m_i and x^m_j.

Inter-sentential Relations provide links from one sentence X_m = x^m_1, ..., x^m_{I_m} to a following sentence X_n = x^n_1, ..., x^n_{I_n}. These relations carry discourse information, which is important for capturing document phenomena in document-level NMT (Tiedemann and Scherrer, 2017; Voita et al., 2018). Accordingly, we consider two kinds of relations in our document graph:
• Lexical consistency considers repeated and similar words across sentences in the document, which reflects the cohesion of lexical choices. In this paper, we add edges (x^m_i, x^n_j) between such words; namely, exactly repeated words and words with the same lemma in the two sentences are connected in the graph.
• Coreference is a common phenomenon in documents and exists when referring back to someone or something previously mentioned. It helps understand the logic and structure of the document and resolve ambiguities. In this paper, we add a graph edge (x^m_i, x^n_j) if x^m_i is a referent of x^n_j, as given by coreference resolution.
Inter-sentential relations can also hold between words within the same sentence, i.e., when m = n.
Source and Target Graphs In this paper, we construct a source graph directly from a source document using the method mentioned above. The target graph is built incrementally during inference, i.e., translations of previous sentences in the same document are used as target context. For simplicity, each target context sentence is treated as a fully connected graph and encoded independently by the graph encoder.
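The construction described above can be sketched as follows. This is an illustrative re-implementation, not the authors' code: the input format (word/lemma pairs, dependency arcs, coreference links, e.g., as produced by a tool like CoreNLP) and all function names are our assumptions.

```python
from collections import defaultdict

def build_document_graph(doc, dep_arcs, corefs):
    """Build a directed document graph as edge sets keyed by edge type.

    doc      : list of sentences; each sentence is a list of (word, lemma) pairs
    dep_arcs : list of (m, head_idx, dep_idx) dependency arcs (hypothetical format)
    corefs   : list of ((m, i), (n, j)) coreference links between mentions
    Nodes are ("w", m, i) for word i of sentence m and ("s", m) for sentence nodes.
    """
    edges = defaultdict(set)

    for m, sent in enumerate(doc):
        for i, (word, lemma) in enumerate(sent):
            # every word-level node is linked to its sentence-level node
            edges["sent"].add((("w", m, i), ("s", m)))
            # adjacency: edges from the current word to its neighbors
            if i + 1 < len(sent):
                edges["adj"].add((("w", m, i), ("w", m, i + 1)))
            if i > 0:
                edges["adj"].add((("w", m, i), ("w", m, i - 1)))

    # syntactic dependency arcs inside a sentence
    for m, head, dep in dep_arcs:
        edges["dep"].add((("w", m, dep), ("w", m, head)))

    # lexical consistency: identical words or shared lemmas across sentences
    for m, sent_m in enumerate(doc):
        for n in range(m + 1, len(doc)):
            for i, (w_i, l_i) in enumerate(sent_m):
                for j, (w_j, l_j) in enumerate(doc[n]):
                    if w_i == w_j or l_i == l_j:
                        edges["lex"].add((("w", m, i), ("w", n, j)))

    # coreference links between mentions
    for (m, i), (n, j) in corefs:
        edges["coref"].add((("w", m, i), ("w", n, j)))

    return edges
```

The quadratic lexical-consistency scan is kept deliberately simple here; an index keyed by lemma would make it linear in practice.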

Document Graph Encoder
As the document is projected into a document graph, a flexible graph encoder is required to encode this complex structure. Previous studies have verified that GCNs can be applied to encode linguistic structures such as dependency trees (Bastings et al., 2017; Koncel-Kedziorski et al., 2019). In this paper, we follow these practices and use stacked GCN layers as the encoder of the document graph, taking edge directions into consideration.
Figure 3: Illustration of the examined architectures. The context information is integrated with a Context-Attn mechanism. Hyb-integration adds the Context-Attn inside each encoder layer; Post- and Pre-integration aggregate after and before the encoder, respectively. N in this paper is 6. We apply source context only to the encoder and target context only to the decoder, when the contexts are available; otherwise, we follow the settings of existing works. We share the graph encoder for both source and target graphs. Details are shown in the Supplementary.

Graph Convolutional Networks GCNs are neural networks that operate on graphs and aggregate information from the immediate neighbors of each node. Information from longer-distance nodes is covered by stacking GCN layers. Formally, given a graph G = (V, E), the GCN first projects the nodes V into representations H^0 ∈ R^{I×d}, where d is the hidden size and I = |V|. Node representations H^l at the lth layer are updated as

    H^{l+1} = σ(D^{-1} A H^l W^{l+1} + B^{l+1}),    (5)

where σ is the sigmoid function, W^{l+1} ∈ R^{d×d} and B^{l+1} ∈ R^d are learnable parameters, and A ∈ R^{I×I} is an adjacency matrix that stores edge information:

    A_{u,v} = 1 if (u, v) ∈ E, else 0.

The degree matrix D ∈ R^{I×I} weights the expected importance of the current node based on the number of its input nodes and can be calculated from the adjacency matrix:

    D_{u,u} = Σ_v A_{u,v}.

Fusion of Edge Information Equation (5) only considers input features. To fully use direction information in the graph, we apply a GCN to each type of edge:

    H^{l+1}_t = σ(D_t^{-1} A_t H^l W_t^{l+1} + B_t^{l+1}),

where t ∈ {in, out, self} represents one of the edge types, i.e., input edges, output edges, or a specific type of self-loop edges. We assume that the contributions of the representations learned from different kinds of edges should differ. We therefore apply a type-attention mechanism, which works better than a linear combination in our experiments (reported in Section 2 of the Supplementary), to combine these representations of different edge types:

    H^{l+1} = Σ_t α_t H^{l+1}_t,

where the α_t are attention weights given by a dot-product attention algorithm (Vaswani et al., 2017).
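The directed GCN update and the type-attention fusion can be sketched as follows. This is an illustrative numpy sketch, not the authors' implementation: the degree normalization, the query vector used to score each edge type, and all names are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcn_layer(H, A, W, B):
    """One directed GCN layer: H' = sigmoid(D^-1 A H W + B).

    H: (I, d) node representations; A: (I, I) binary adjacency matrix for one
    edge type; D^-1 normalizes each node by the number of its incoming edges.
    """
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0  # nodes without edges of this type stay unnormalized
    return sigmoid((A / deg) @ H @ W + B)

def fuse_edge_types(H, adjs, params, q):
    """Run one GCN per edge type and fuse the outputs with attention weights.

    adjs   : {"in"/"out"/"self": adjacency matrix}
    params : matching {edge_type: (W, B)} parameter pairs
    q      : (d,) query vector scoring each type's output (our simplification
             of the paper's dot-product type-attention)
    """
    types = sorted(adjs)
    outs = [gcn_layer(H, adjs[t], *params[t]) for t in types]
    # softmax over per-type scores gives the attention weights alpha_t
    scores = np.array([out.mean(axis=0) @ q for out in outs])
    alphas = np.exp(scores - scores.max())
    alphas = alphas / alphas.sum()
    return sum(a * out for a, out in zip(alphas, outs))
```

Stacking two such layers, as in the paper's configuration, amounts to calling `fuse_edge_types` twice on the same adjacency matrices.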
Sentence Embedding After the GCN, we extract the sentence-level nodes S_m as the context representation. Since the GCN ignores explicit positional information between sentences, we add a sentence embedding before integrating the context representation into the encoder or decoder. Figure 2 shows our graph encoder.
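The sentence-embedding step can be sketched as below; we assume (our reading, not stated explicitly in the paper) that the embedding is simply added to each sentence-level node, analogous to positional embeddings in the Transformer, and the names are illustrative.

```python
import numpy as np

def add_sentence_embedding(sent_nodes, pos_table):
    """Add a learned sentence-position embedding to each sentence-level node.

    sent_nodes : (M, d) representations S_1..S_M extracted after the GCN
    pos_table  : (max_sents, d) learnable embedding table (hypothetical)
    """
    M = sent_nodes.shape[0]
    return sent_nodes + pos_table[:M]
```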

Integration of Context Representation
The context representation H_G from the document graph encoder is treated as a memory and queried by an attention mechanism:

    C = Context-Attn(H, H_G, H_G),    (10)

where Context-Attn is a multi-head attention function (Vaswani et al., 2017) and H denotes the hidden states being contextualized. Instead of using the standard residual connection in this sublayer, we adopt a gating mechanism to dynamically control the influence of the context information:

    λ = σ(W_a H + W_c C),
    H̃ = λ ⊙ H + (1 − λ) ⊙ C,

where λ are gating weights, σ(·) denotes the sigmoid function, and W_a and W_c are trainable parameters. In the rest of this paper, we use Context-Attn to denote both the attention and the gated residual mechanism.
In this paper, the Context-Attn sublayer is used in three different ways, as shown in Figure 3:
• Hyb-integration: integrates the contextual information with an additional Context-Attn layer inside each encoder layer.
• Post-integration: aggregates the contextual information by adding a Context-Attn layer after the encoder (Tan et al., 2019; Miculicich et al., 2018; Maruf et al., 2019).
• Pre-integration: interpolates the context representation before the encoder, which can be considered a hierarchical embedding (Ma et al., 2020).
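A single-head sketch of the Context-Attn sublayer with its gated residual might look like the following; the multi-head projections and layer normalization of the real model are omitted, and the function and parameter names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_attn(H, H_G, W_a, W_c):
    """Single-head Context-Attn with a gated residual.

    H   : (I, d) current-sentence states, used as queries
    H_G : (J, d) context representations from the graph encoder (keys/values)
    The gate lambda = sigmoid(H W_a + C W_c) mixes H with the attended context C.
    """
    d = H.shape[1]
    C = softmax(H @ H_G.T / np.sqrt(d)) @ H_G   # attention over the context memory
    lam = sigmoid(H @ W_a + C @ W_c)            # element-wise gating weights
    return lam * H + (1.0 - lam) * C
```

With zero gate parameters the gate is 0.5 everywhere, i.e., an even mix of the sentence states and the attended context, which makes the mechanism easy to check by hand.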

Experiments
Data We evaluate our approach on translation benchmarks of different corpus sizes: (1) the IWSLT En-Fr and Zh-En translation tasks (Cettolo et al., 2012), with around 200K sentence pairs for training. Following convention (Wang et al., 2017; Miculicich et al., 2018), both language pairs take dev2010 as the development set; tst2010 is used for testing on En-Fr and tst2010–tst2013 on Zh-En. (2) The Opensubtitle2018 En-Ru translation corpus released by Voita et al. (2018), which contains 6M sentence pairs for training, among which 1.5M sentence pairs have context sentences.
(3) We adopt the WMT19 document-level corpus published by Scherrer et al. (2019) for the En-De translation task. This data contains 2.9M parallel sentences with document boundaries and 10.3M back-translated sentence pairs. All data are tokenized and segmented into subword units using byte-pair encoding (Sennrich et al., 2016). We apply 32k merge operations for each language on the En-Fr, En-Ru, and En-De tasks, and 30k for the Zh-En task. As a node in a document graph represents a word rather than its subwords, we average the embeddings of the subwords as the embedding of the node. 4-gram BLEU (Papineni et al., 2002) is used as the evaluation metric.
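The subword-averaging step can be sketched as follows (a minimal illustration; the function and argument names are ours):

```python
import numpy as np

def node_embeddings(subword_embs, word_spans):
    """Average subword embeddings to get one embedding per word-level node.

    subword_embs : (num_subwords, d) array of subword embeddings
    word_spans   : list of (start, end) subword index ranges, one per word
    """
    return np.stack([subword_embs[s:e].mean(axis=0) for s, e in word_spans])
```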
Models and Baselines Models are trained in two stages (Jean et al., 2015): conventional sentence-level TRANSFORMER models (denoted as BASE) are first trained with configurations following previous works (Miculicich et al., 2018; Voita et al., 2019b; Vaswani et al., 2017). Then, we fix the sentence-level model parameters and only train the additional parameters introduced by our methods. We set the number of layers of the document graph encoder to 2 and share their parameters.
To compare our graph-based method with prior works, we reimplement several document-level baselines on the TRANSFORMER architecture and replace their context modules with ours (please refer to the Supplementary for details):
• CTX employs an additional encoder to learn context representations, which are then integrated by cross-attention mechanisms.
• HAN (Miculicich et al., 2018) uses a hierarchical attention mechanism with two levels (word and sentence) of abstraction to incorporate context information from both source and target documents.

Overall Results Table 1 shows the overall results on the four translation tasks. We find that systems with document graphs achieve the best performance among all context-aware systems on all language pairs, with comparable or better training speed. This verifies our hypothesis that document graphs are beneficial for modeling and leveraging context. With target graphs, the translation quality in terms of BLEU is slightly improved, which shows the positive effect of the target context to some extent. Compared with the corresponding baseline models, our model has a comparable or smaller number of parameters, indicating that the improvements of our method are not due to parameter increments.

Ablation Study
(Table 1 caption: †/‡ denote statistically significant improvement (Koehn, 2004) over the best baseline model with context on each task at p < 0.05/0.01, respectively. The models in bold are selected to merge with our document graph methods. "Para." and "Speed" indicate the model size (M = million) and training speed (tokens/second), respectively. * denotes that the model considers the target context.)

Edge Relations To investigate the influence of graph construction, we first inspect each kind of edge relation individually by constructing graphs using only one of them. Table 2 shows that each kind of relation by itself improves the translation quality over the BASE model, which demonstrates the effectiveness of each selected intra-sentential and inter-sentential relation. Combining relations further improves the system, which achieves the best performance when all relations are considered. These results indicate that the relations selected in this paper are complementary to each other.
Word-level vs. Sentence-level Nodes We further examine the influence of context information at different levels (word- and sentence-level). In this experiment, we try to use the representations of word-level nodes as context. To achieve better performance, only words in the current sentence are selected. The results are shown in Table 3. We find that using only the representations of sentence-level nodes as context (i.e., the default setting) achieves comparable BLEU scores with a faster training speed.

Sentence Embedding Table 4 shows the influence of sentence embedding. We find that using sentence embedding slightly improves the performance (+0.2 BLEU). This is because our graphs are directed, so positional information is preserved to some extent.

Analysis In this section, we analyze: (1) context distance and its influence; (2) the accuracy of the dependency tree; (3) changes in discourse phenomena of translations; and (4) a case study.

Figure 4a shows the influence of context distance on translation quality. We find that HAN performs worse when increasing the number of context sentences. One possible reason is that sequential structures introduce not only long-distance context but also more irrelevant information. By contrast, our model keeps improving as more context is considered. This suggests that graphs help the model focus on relevant contexts regardless of their distance. SELECTIVE achieves a lower performance than our model, and the gap becomes larger with longer context, which we surmise is because its attention mechanism has difficulty differentiating the usefulness of context. This also indicates that prior knowledge indeed helps to select relevant context.

Figure 4b shows evaluation results on different document lengths, i.e., the number of sentences in a document. We find that models considering global context (SELECTIVE and OUR) achieve better results than HAN. OUR is consistently better than SELECTIVE as well, especially on shorter and longer documents. These results suggest that global context is beneficial to document-level NMT and that appropriate consideration of global context is essential.

(Figure 5 caption: Influence of dependency-tree accuracy on the En-Fr translation task. We examine the three integration methods described in Section 2.4. We treat the conversion of k-best results from a constituency parser as dependency trees with decreasing accuracy.)

Figure 5 illustrates the influence of the accuracy of dependency trees during inference. Best denotes the best result from the dependency parser; 1 to 5 denote dependency trees converted from the 5-best constituency trees, in decreasing order of accuracy. We find that the performance of our system with the Post and Hyb methods slightly decreases when parsing accuracy becomes lower.
However, the Pre method is more robust to parsing accuracy. We attribute this to the fact that integrating the document graph before the encoder gives the model more opportunity to resist the noise.

Discourse Phenomena
We also examine whether our approach helps capture discourse phenomena by evaluating our model on the Consistency test set (Voita et al., 2019a) and the Discourse test set (Bawden et al., 2018). (More detailed reports on these tasks are presented in the Supplementary.)

Test sets The Consistency test set contains three types of tasks on En-Ru: 1) Dex. checks the translation of deictic words or phrases. 2) Lex. focuses on the translation consistency of reiterative phrases. 3) Ell. tests whether models correctly predict elided verb phrases or the morphology of words. The Discourse test set consists of two probing tasks on En-Fr: 1) Coref. tests whether the gender of an anaphoric pronoun (it or they) is coherent with the previous sentence. 2) Cohe. is a set of ambiguous examples whose correct translations rely on the context.

Results on Discourse Phenomena As shown in Table 5, all the context-aware models comprehensively improve performance on discourse phenomena over the context-agnostic BASE model. Results on the NOISE model (Li et al., 2020) indicate that the improvement is not merely due to robust training. Compared to prior context-aware models, our model achieves the best accuracy on all tasks. Especially on the Lex., Coref., and Cohe. tasks, our model outperforms the others by over two points. Note that on the ellipsis task, graph edges are usually missing for elided verb phrases. For example, given the following source sentence and its context (Voita et al., 2019b), the verbs "told" and "did" are not directly connected in our graph but are indirectly connected via the coreference relation of their neighbors "Nick" and "he":

Context: Nick told you what happened, right?
Source: Yeah, he did.

Hence, our approach is still slightly better than the best prior method, SELECTIVE. Directly linking such words may bring further improvements, which we leave for future work.
Analysis on Graphs We further conduct experiments to figure out the influence of graphs on the discourse phenomena, as shown in Table 5. We find that our model with only source graphs (i.e., w/o TGT-G) is consistently better than the BASE model on all tasks. Target graphs further improve it to achieve the best performance, indicating the importance of target graphs in document-level translation. Both types of relations, INTER and INTRA, make significant contributions as well. Their combination brings significant improvement, verifying that they are complementary to some extent. We also find that, compared to INTRA relations, INTER relations contribute more to all tasks except the Ell. task. We attribute this to the fact that our document graph contains inter-sentential relations, i.e., lexical consistency and coreference, which directly link relevant contexts for reiterative and deictic words.

Case-Study
To verify long-distance consistency, we perform a case study on the Zh-En task. Table 6 shows an example where a named entity "米格尔" (miguel) repeatedly appears at different positions in the document. We first find that both document-level NMT systems, i.e., HAN and OUR, generate more consistent translations of the entity than the context-agnostic BASE model. Compared with the HAN model, OUR consistently translates "米格尔" into "miguel", suggesting a more effective capability of handling consistency in long-distance context.

Related work
In recent years, a variety of studies have worked on improving document-level machine translation with context. Most of them focus on using a limited number of previous sentences. One typical approach is to equip conventional sentence-level NMT with an additional encoder to learn context representations, which are then integrated into the encoder and/or decoder (Jean et al., 2015; Voita et al., 2018). Wang et al. (2017) and Miculicich et al. (2018) adopted hierarchical mechanisms to integrate contexts into NMT models. Kuang et al. (2018) used cache-based methods to memorize historical translations, which are then used in the following decoding steps.
Recently, several studies have endeavoured to consider the full document context. Macé and Servan (2019) averaged the word embeddings of a document to serve directly as the global context. Maruf and Haffari (2018) applied a memory network to remember the hidden states of the document, which are then attended to by the decoder. Maruf et al. (2019) first selected relevant sentences as contexts and then attended to words in these sentences. Tan et al. (2019) learned global context-aware representations by first using a sentence encoder followed by a document encoder. Junczys-Dowmunt (2019) considered the global context by simply concatenating all the sentences in a document. Zheng et al. (2020) used an additional attention layer to obtain a representation mixed from the current sentence and the whole document. Kang et al. (2020) dynamically selected the relevant context from the whole document via a reinforcement learning method.
Unlike previous approaches, we represent document-level global context as a graph, which is encoded by a graph encoder and integrated into conventional NMT via attention and gating mechanisms.

Conclusion
In this paper, we propose a graph-based approach for document-level translation, which leverages both source and target contexts. Graphs are constructed according to inter-sentential and intrasentential relations. We employ a GCN-based graph encoder to learn the graph representations, which are then fed into the NMT model via attention and gating mechanisms. Experiments on four translation tasks and several existing architectures show the proposed approach consistently improves translation quality across different language pairs. Further analyses demonstrate the effectiveness of graphs and the capability of leveraging long-distance context. In the future, we would like to enrich the types of relations to cover more document phenomena.
Data The statistics of the datasets are reported in Table 7. For Chinese, we segment the data with the jieba toolkit, and we use the Moses tokenizer.pl for the other languages. WMT19 and Opensubtitle were already pre-processed by Scherrer et al. (2019) and Voita et al. (2018), respectively.
Settings We incorporate the proposed approach into the widely used context-agnostic TRANSFORMER framework (Vaswani et al., 2017) using the FAIRSEQ toolkit (Ott et al., 2019). The models are trained on V100 GPUs. The conventional context-agnostic TRANSFORMER models are trained with the BASE settings. For the IWSLT and Opensubtitle benchmarks, we train the context-agnostic model with 0.2 dropout. The learning rate is set to 0.0007 with 4k warm-up steps. We set the dropout of the document graph encoder to 0.2, which is tuned on the validation set. We use approximately 16,000 tokens in a mini-batch for En-Fr, Zh-En, and En-Ru, and 32,000 for En-De.
In decoding, the beam size is set to 4. Following the setting of previous work Miculicich et al., 2018;Voita et al., 2019b), we set the hyper-parameter α of length penalty to 0.6 for En-Fr, En-De, 0.5 for En-Ru and 1 for Zh-En.

B Ablation Study
Graph Encoder We extend the GCN-based graph encoder with an attention mechanism to combine different representations, which differs from the gate-based method in previous work (Bastings et al., 2017). Table 8 shows that the attention-based aggregation works better in our model. We presume this is because the attention mechanism balances the contributions of different representations. Table 9 shows the influence of the graph encoder with various numbers of layers. We find that stacking two graph encoder layers and sharing their parameters obtains the best performance; further increasing the number of layers does not lead to improvement. This finding is consistent with existing works (Bastings et al., 2017). As shown in Table 10, we also investigate the traditional TF-IDF construction method; the result indicates that our method is not limited to the examined relations but also works with other graph construction methods.

Graph Contribution
We evaluate the performance of the context from each side. As seen in Table 11, using only the source-side or target-side graph shows comparable performance; using both source and target contexts further improves the translation quality.

B.1 Discourse Phenomena
Test sets The Consistency test set (Voita et al., 2019b) contains four tasks on En-Ru: 1) Deixis aims to detect deictic words or phrases whose denotation depends on the context. 2) Lex.C is a lexical cohesion task, which focuses on the reiteration of named entities. 3) Ell.inf tests the model on words whose morphological form depends on the context. 4) Ell.VP tests whether the model can correctly predict elided verb phrases in Russian. The Discourse test set (Bawden et al., 2018) consists of two probing tasks on En-Fr: 1) Coref. tests anaphoric pronouns (it or they) whose gender must be coherent with the previous sentence. 2) Coh. is a set of ambiguous examples whose correct translations rely on the context. The difference between the Cor. and Sem. settings is whether the given context is correct or not.

We do not modify the basic architecture of the existing works but replace their context encoder with our graph encoder. Note that the Unified method does not add context on the target side; therefore, we modified its decoder when integrating the target graph.