Measuring and Increasing Context Usage in Context-Aware Machine Translation

Recent work in neural machine translation has demonstrated both the necessity and feasibility of using inter-sentential context, i.e., context from sentences other than the one currently being translated. However, while many current methods present model architectures that can theoretically use this extra context, it is often unclear how much they actually utilize it at translation time. In this paper, we introduce a new metric, conditional cross-mutual information, to quantify the usage of context by these models. Using this metric, we measure how much document-level machine translation systems use particular varieties of context. We find that target context is referenced more than source context, and that including more context has a diminishing effect on results. We then introduce a new, simple training method, context-aware word dropout, to increase the usage of context by context-aware models. Experiments show that our method not only increases context usage, but also improves translation quality according to metrics such as BLEU and COMET, as well as performance on anaphoric pronoun resolution and lexical cohesion contrastive datasets.


Introduction
While neural machine translation (NMT) is reported to have achieved human parity in some domains and language pairs (Hassan et al., 2018), these claims seem overly optimistic and no longer hold under document-level evaluation (Toral et al., 2018; Läubli et al., 2018). Recent work on context-aware NMT attempts to alleviate this discrepancy by incorporating the surrounding context sentences (on either or both the source and target sides) into the translation system.1 This can be done by, for example, feeding context sentences to standard NMT models (Tiedemann and Scherrer, 2017), using separate encoders for the context, having cache-based memories (Tu et al., 2018a), or using models with hierarchical attention mechanisms (Miculicich et al., 2018; Maruf et al., 2019a); see §2 for more details. While such works report gains in translation quality compared to sentence-level baselines trained on small datasets, recent work has shown that, in more realistic high-resourced scenarios, these systems fail to outperform simpler baselines with respect to overall translation accuracy, pronoun translation, or lexical cohesion (Lopes et al., 2020).

Figure 1: Illustration of how we can measure context usage by a model q_MT as the amount of information gained when the model is given both the context C and the source X vs. when it is given only X.

1 https://github.com/neulab/contextual-mt
We hypothesize that one major reason for these lacklustre results is that models with the architectural capacity to model cross-sentential context do not necessarily learn to use it when trained with existing training paradigms. However, even quantifying a model's usage of context is an ongoing challenge; while contrastive evaluation has been proposed to measure performance on inter-sentential discourse phenomena (Müller et al., 2018; Bawden et al., 2018), this approach is confined to a narrow set of phenomena, such as pronoun translation and lexical cohesion. A toolbox to measure the impact of context in broader settings is still missing.

Source:
The Church is merciful. . . It always welcomes the misguided lamb.

Target:
Die Kirche ist barmherzig. . .
Baseline: Es heisst die fehlgeleiteten Schäflein immer willkommen.
Context-Aware: Es heisst die fehlgeleiteten Schäflein immer willkommen.
Context-Aware w/ our method: Sie heisst die fehlgeleiteten Schäflein immer willkommen.

Table 1: Example where context (in italics) is needed to correctly translate the pronoun "it". Both the sentence-level baseline and the context-aware model fail to translate it correctly, while the context-aware model trained with COWORD dropout correctly captures the context.
To address the limitations above, we take inspiration from the recent work of Bugliarello et al. (2020) and propose a new metric, conditional cross-mutual information (CXMI, §3), to measure quantitatively how much context-aware models actually use the provided context by comparing the model distributions over a dataset with and without context. Figure 1 illustrates how it measures context usage. This metric applies to any probabilistic context-aware machine translation model, not only the ones used in this paper. We release a software package to encourage the use of this metric in future context-aware machine translation research. We then perform a rigorous empirical analysis of the CXMI between the context and target for different context sizes, and between source and target context. We find that: (1) context-aware models use some information from the context, but the amount of information used does not increase uniformly with the context size, and can even lead to a reduction in context usage; (2) target context seems to be used more by models than source context.
Given the findings, we next consider how to encourage models to use more context. Specifically, we introduce a simple but effective variation of word dropout (Sennrich et al., 2016a) for context-aware machine translation, dubbed COWORD dropout ( §4). Put simply, we randomly drop words from the current source sentence by replacing them with a placeholder token. Intuitively, this encourages the model to use extra-sentential information to compensate for the missing information in the current source sentence. We show that models trained with COWORD dropout not only increase context usage compared to models trained without it but also improve the quality of translation, both according to standard evaluation metrics (BLEU and COMET) and according to contrastive evaluation based on inter-sentential discourse phenomena such as anaphoric pronoun resolution and lexical cohesion ( §4.2, Table 1).

Context-Aware Neural Machine Translation
We are interested in learning a system that translates documents consisting of multiple sentences between two languages. More formally, given a corpus of parallel documents in two languages, where each document is a sequence of source and target sentences, D = {(x^(1), y^(1)), ..., (x^(K), y^(K))}, we are interested in learning the mapping between the two languages. We consider a typical (auto-regressive) neural machine translation system q_θ parameterized by θ. The probability of translating x^(i) into y^(i) given the context C^(i) of the sentence is

p(y^(i) | x^(i), C^(i)) = ∏_t q_θ(y_t^(i) | y_{<t}^(i), x^(i), C^(i)),

where y_t^(i) represents the t-th token of sentence y^(i). This context can take various forms. On one end, we have the case where no context is passed, C^(i) = ∅, and the problem reduces to sentence-level translation. On the other end, we have the case where all the source sentences and all the previously generated target sentences are passed as context, C^(i) = {x^(1), ..., x^(K), y^(1), ..., y^(i−1)}.
As mentioned, there are many architectural approaches to leveraging context (see §5 for a more complete review), and the methods that we present in this paper are compatible with most architectures because they do not specify how the model q_θ uses the context. In experiments, we focus mostly on the simpler approach of concatenating the context to the current sentences (Tiedemann and Scherrer, 2017). Recent work by Lopes et al. (2020) has shown that, given enough data (either through pretraining or larger contextual datasets), this simple approach tends to be competitive with or even outperform its more complex counterparts.


Measuring Context Usage
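As an illustration, the concatenation approach can be sketched as follows; the `<brk>` separator token and the function name are illustrative, not the exact tokens used by any particular implementation:

```python
def concat_context(context_sents, current_sent, sep="<brk>"):
    """Join k context sentences and the current sentence into a single
    input sequence, separated by a special break token, so that a
    standard sentence-level NMT model can consume them."""
    parts = list(context_sents) + [current_sent]
    return f" {sep} ".join(parts)

# Example: one previous source sentence as context.
inp = concat_context(["The Church is merciful ."],
                     "It always welcomes the misguided lamb .")
# inp == "The Church is merciful . <brk> It always welcomes the misguided lamb ."
```

The same joining can be applied on the target side, prepending previously generated target sentences before the current target prefix.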

Conditional Cross-Mutual Information
While context-aware model architectures allow the use of context, they do not ensure that contextual information is actually used: models could rely only on the current source sentence and/or the previously generated target words of the same sentence when generating the output.
Contrastive evaluation, where models are assessed based on their ability to distinguish correct translations from contrastive ones, is a common way to assess the ability of context-aware models to capture specific discourse phenomena that require inter-sentential context, such as anaphora resolution (Müller et al., 2018) and lexical cohesion (Bawden et al., 2018). However, these methods only provide an indirect measure of context usage with respect to a limited number of phenomena and can fail to capture other, unknown ways in which the model might be using context. Kim et al. (2019) showed that most improvements to translation quality are due to non-interpretable usages of context, such as the introduction of noise that acts as a regularizer on the encoder/decoder. This problem is further exacerbated by the fact that there is no clear definition of what "context usage" entails.
In a different context, Bugliarello et al. (2020) introduced cross-mutual information (XMI) to measure the "difficulty" of translating between different language pairs in sentence-level neural machine translation. Given a language model q_LM for a target sentence Y and a translation model q_MT for translating from X to Y, XMI is defined as

XMI(X → Y) = H_{q_LM}(Y) − H_{q_MT}(Y | X),

where H_{q_LM}(Y) denotes the cross-entropy of the target sentence Y under the language model q_LM and H_{q_MT}(Y | X) the conditional cross-entropy of Y given X under the translation model q_MT. This allows us to measure how much information the source sentence gives us about the target sentence (an analogue of mutual information for cross-entropy). In the case where q_LM and q_MT perfectly model the underlying probabilities, we would have XMI(X → Y) = MI(X; Y), the true mutual information.
Taking inspiration from the above, we propose Conditional Cross-Mutual Information (CXMI), a new measure of the influence of context on a model's predictions. This is done by considering an additional variable for the context C and measuring how much information the context C provides about the target Y given the source X. This can then be formulated as

CXMI(C → Y | X) = H_{q_MTA}(Y | X) − H_{q_MTC}(Y | X, C),

where H_{q_MTA} is the cross-entropy under a context-agnostic machine translation model and H_{q_MTC} the cross-entropy under a context-aware machine translation model. This quantity can be estimated (see Appendix A for a more formal derivation) over a held-out test set with N sentence pairs and their respective contexts as

CXMI(C → Y | X) ≈ −(1/N) Σ_{i=1}^{N} log [ q_MTA(y^(i) | x^(i)) / q_MTC(y^(i) | x^(i), c^(i)) ].

While q_MTA and q_MTC can, in theory, be any models, we are interested in removing any confounding factors other than the context that might lead to instability in the estimates of the distributions. For example, if q_MTA and q_MTC used completely different models, it would not be clear whether the difference in the probability estimates is due to the introduction of context or due to other extraneous factors such as differences in architectures, training regimens, or random seeds. To address this we consider a single model, q_MT, that is able to translate both with and without context (more on how this is achieved in §3.2). We can then set the context-agnostic model and the contextual model to be the same model, q_MTA = q_MTC = q_MT. This way we attribute the information gain to the introduction of context. Throughout the rest of this work, when we reference "context usage" we will mean precisely this information gain (or loss).
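The estimator above is straightforward to compute once we have per-sentence log-probabilities from the model with and without context. A minimal sketch (function name is ours; log-probabilities are in nats):

```python
def cxmi(logp_ctx, logp_noctx):
    """Estimate CXMI over N held-out examples.

    logp_ctx[i]   : log q_MT(y_i | x_i, c_i)  (context given)
    logp_noctx[i] : log q_MT(y_i | x_i)       (context withheld)

    CXMI = H_noctx - H_ctx, i.e. the average information gained
    from conditioning on the context; positive values mean the
    model's predictions improve when context is provided.
    """
    n = len(logp_ctx)
    h_ctx = -sum(logp_ctx) / n      # cross-entropy with context
    h_noctx = -sum(logp_noctx) / n  # cross-entropy without context
    return h_noctx - h_ctx
```

Per-sample CXMI (used later in the correlation analysis) is simply the summand `logp_ctx[i] - logp_noctx[i]` before averaging.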

Experiments
Data We experiment with a document-level translation task by training models on the IWSLT2017 (Cettolo et al., 2012) dataset for the language pairs EN → DE and EN → FR (with approximately 200K sentences for both pairs). We use the 2011-2014 test sets as validation sets and the 2015 test set as our test set. To address the concerns raised by Lopes et al. (2020) that gains in performance are due to the use of small training corpora and weak baselines, we use Paracrawl (Esplà et al., 2019) and perform some data cleaning based on language identification tools, creating a pretraining dataset of around 82M and 104M sentence pairs for EN → DE and EN → FR respectively.
All data is encoded/vectorized with byte-pair encoding (Sennrich et al., 2016b) using the SentencePiece framework (Kudo and Richardson, 2018). For the non-pretrained case, we use a 20K vocabulary shared across source/target, while for the pretrained case we use a 32K vocabulary.
Besides translation quality, we also evaluate our models on two contrastive datasets for different discourse phenomena to better assess their ability to capture context (more on this in §4.2): the large-scale anaphoric pronoun resolution dataset of Müller et al. (2018) and the anaphora and lexical cohesion dataset of Bawden et al. (2018).

Models and Optimization For all our experiments, we consider an encoder-decoder Transformer architecture (Vaswani et al., 2017). In particular, we train the transformer small configuration (hidden size of 512, feed-forward size of 1024, 6 layers, 8 attention heads). For the pretrained setup, we also pretrain a transformer large architecture (hidden size of 1024, feed-forward size of 4096, 6 layers, 16 attention heads) and subsequently fine-tune it on the IWSLT2017 datasets. As in Vaswani et al. (2017), we train using the Adam optimizer with β1 = 0.9 and β2 = 0.98 and an inverse square root learning rate scheduler, with initial values of 10^−4 and 5 × 10^−4 for the pretrained and non-pretrained cases respectively, and with a linear warm-up over the first 4000 steps. We train the models with early stopping on validation perplexity.
We train all our models on top of the Fairseq framework (Ott et al., 2019).
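The learning rate schedule described above can be sketched as follows. This assumes the standard Fairseq-style inverse square root schedule with linear warm-up, which may differ in detail from the authors' exact configuration:

```python
def inverse_sqrt_lr(step, base_lr=5e-4, warmup_steps=4000):
    """Linear warm-up to base_lr over warmup_steps, then decay
    proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (warmup_steps ** 0.5) / (step ** 0.5)

# The peak is reached at step 4000; by step 16000 the rate has halved.
```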

What Context Matters?
To assess the relative importance of different context sizes on both the source and target side, we start by considering two models, one for source-side context and one for target-side context, that receive context of size k: the source-side model receives C^(i) = {x^(i−k), ..., x^(i−1)} and the target-side model C^(i) = {y^(i−k), ..., y^(i−1)}. During training, k is selected uniformly at random from {1, ..., 4} for every example. This way the model is trained to translate the same source both without context and with different context sizes, and is thus able to translate based on any context size in that interval. Figure 2 shows the CXMI values computed over the test set as a function of the context size for both the source-side and target-side contextual models, for both the non-pretrained and pretrained regimens, on the EN → DE language pair. Results for the EN → FR language pair are similar and can be found in Appendix B.
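The dynamic context-size sampling during training can be sketched as follows (the helper name is ours; `doc` is the list of sentences of one document and `i` the index of the current sentence):

```python
import random

def sample_training_context(doc, i, max_k=4, rng=random):
    """Pick k uniformly from {1, ..., max_k} and return the k previous
    sentences of the document as context (fewer near the document
    start, where less history is available)."""
    k = rng.randint(1, max_k)  # inclusive on both ends
    return doc[max(0, i - k):i]
```

At test time the same model can then be queried with any fixed context size in that interval, which is what makes the CXMI-vs-context-size curves comparable.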
For the non-pretrained case, for both the source and target context, the biggest jump in context usage occurs when we increase the context size from 0 to 1. After that, increasing the context size leads to diminishing increases in context usage, and even to reduced context usage for the source-side context. Interestingly, when the model is stronger, as in the pretrained case, we can see that it leverages target-side context even better than the non-pretrained model, with a similar trend of diminishing increases in context usage in both regimes. However, this is not the case for the source-side context: the pretrained model seems barely able to use the contextual information on this side.
Overall, for this regime, we can conclude that having a context size of one or two previous sentences on both sides is beneficial to the model, and that target-side context is used slightly more than source-side context. This appears to corroborate the findings of Bawden et al. (2018) that target-side context is more effective than source context.

Table 2: Correlation between per-sample CXMI and contrastive evaluation outcomes. Bold values mean the correlation is statistically significant with p < 0.01.

Does CXMI Really Measure Context Usage?
To assert that CXMI correlates with interpretable measures of context usage, we perform a correlation analysis against performance on the contrastive datasets mentioned above. In these datasets, usage of context is evident when the model picks the right answer given the context and fails to do so when no context is given. Table 2 thus shows the point-biserial correlation coefficient 3 between the per-sample CXMI and a binary variable that takes the value 1 if the contextual model picks the correct translation while the non-contextual model picks an incorrect one, for different context sizes, on the pretrained model. We can see a statistically significant correlation between the two, which strengthens the notion that CXMI captures previous measures of context usage to some extent.
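For reference, the point-biserial correlation is simply the Pearson correlation between a continuous variable and a 0/1 variable; a self-contained sketch (equivalent to `scipy.stats.pointbiserialr`, but implemented directly here):

```python
import math

def point_biserial(cont, binary):
    """Point-biserial correlation between a continuous variable
    (here: per-sample CXMI) and a 0/1 variable (here: whether the
    contextual model succeeds where the non-contextual one fails)."""
    n = len(cont)
    ones = [c for c, b in zip(cont, binary) if b == 1]
    zeros = [c for c, b in zip(cont, binary) if b == 0]
    m1, m0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    mean = sum(cont) / n
    s = math.sqrt(sum((c - mean) ** 2 for c in cont) / n)  # population std
    return (m1 - m0) / s * math.sqrt(len(ones) * len(zeros) / n ** 2)
```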

Context-aware Word Dropout
Motivated by the above results demonstrating the limited context usage of models trained with the standard MLE training paradigm, particularly with respect to more distant context, we now ask the question: "Is it possible to modify the training methodology to increase context usage by the model?" As an answer, we extend a popular regularization technique used in sentence-level machine translation, word dropout (Sennrich et al., 2016a), to the context-aware setting. The idea behind context-aware word (COWORD) dropout is to model the translation probability between x^(i) and y^(i) as

p(y^(i) | x^(i), C^(i)) = ∏_t q_θ(y_t^(i) | y_{<t}^(i), x̃^(i), C^(i)),

where x̃^(i) is a perturbed version of the current source sentence, generated by randomly dropping tokens and replacing them with a mask token with dropout probability p:

x̃_t^(i) = <mask> with probability p, and x̃_t^(i) = x_t^(i) otherwise.

In the case where no context is passed, C^(i) = ∅, COWORD dropout reduces to word dropout. The intuition behind this perturbation is that, by dropping information from the current source but not from the context, we increase the relative reliability of the context C^(i), therefore providing the inductive bias that the context is important for the translation. We will see in §4.2 that this inductive bias is beneficial and that COWORD dropout not only improves performance but also increases context usage.
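The perturbation itself is a one-liner over the current source tokens; a minimal sketch (the mask token string is illustrative), applied only to the current sentence while the context tokens are left untouched:

```python
import random

def coword_dropout(src_tokens, p, mask="<mask>", rng=random):
    """Replace each token of the CURRENT source sentence with a mask
    token with probability p. Context sentences are not perturbed,
    which biases the model toward relying on them."""
    return [mask if rng.random() < p else tok for tok in src_tokens]
```

During training this is applied freshly at every epoch; at test time it is disabled (p = 0).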

Experiments
Setup As in §3.2, we consider transformer models trained on the IWSLT2017 for both EN → DE and EN → FR, both from scratch and pretrained using the procedure previously described. In particular, due to findings in the previous section, we consider models with either only target-side context or both source-side and target-side context.

Context Usage
To assess if our proposed regularization technique, COWORD dropout, increases context usage, we train a model using the same dynamic context size setting as in §3.2. Figure 3 plots the CXMI values on the test set as a function of the target context size as we increase the dropout value p. We see that increasing this value consistently increases context usage according to CXMI across different context sizes. Note that, at test time, COWORD dropout is disabled, which means that it provides an inductive bias only during training, and the models learn to use more context by themselves. Table 3 illustrates some examples where COWORD dropout increased the per-sample CXMI significantly. While the model only has access to target context, we present the source context for clarity. In the first example, while the source is a complete sentence, the target is only a fragment of one, so the context helps complete it. In the other two examples shown, we can see that context helps disambiguate the gender of the German translation of the English pronoun "it". Interestingly, the words that use context the most according to CXMI match very closely those that native speakers annotated.
Translation Quality To evaluate whether the increased usage of context correlates with better machine translation quality, based on the previous experiments on context usage and values of COWORD dropout, we consider three models trained with fixed-size context:
• a baseline that has no context, reducing to a sentence-level model, i.e., C^(i) = ∅;
• a one-to-two model with the previous target sentence as context, i.e., C^(i) = {y^(i−1)};
• a two-to-two model with the previous source and target sentences as context, i.e., C^(i) = {x^(i−1), y^(i−1)}.
In addition, to explore the benefits of COWORD dropout in other architectures, we also train a one-to-two multi-encoder (Jean et al., 2017) transformer small model (more details in Appendix C). For all models with target context, when decoding, we use the previously decoded sentences as target context. Table 4 shows the performance across three different seeds of the baseline and contextual models for both the non-pretrained and pretrained settings, with increasing values of COWORD dropout p. We also run the baseline with COWORD dropout (which, as noted previously, reduces to word dropout) to ensure that the improvements are not merely due to regularization effects on the current source/target. We report the standard BLEU score (Papineni et al., 2002) calculated using sacreBLEU (Post, 2018) and COMET, a more accurate evaluation metric based on multilingual embeddings (Rei et al., 2020).
For the non-pretrained case, we can see that a COWORD dropout value p > 0 consistently improves the performance of the contextual models compared both to models trained with p = 0 and to the sentence-level baseline with the same word dropout values. For the pretrained case, the improvements are not as noticeable, although models trained with COWORD dropout still always outperform models trained without it. This is perhaps a reflection of the general trend that better models are harder to improve.

Table 5: Results on IWSLT2017 for a multi-encoder 1-to-2 model with different probabilities for COWORD dropout. Averaged across three runs for each method.

Table 5 shows that COWORD dropout is also helpful for the multi-encoder model, yielding significant improvements. This suggests that the method can benefit context-aware architectures other than concatenation-based ones.
Discourse Phenomena While automatic metrics such as BLEU and COMET allow us to measure translation quality, they mostly target sentence-level quality and do not specifically focus on phenomena that require context-awareness. Contrastive datasets, as described in §3.2, allow us to measure the performance of context-aware models on specific discourse phenomena by comparing the probability of the correct translation against contrastive translations. Models that capture the targeted discourse phenomena well will consistently rank the correct translation higher than the contrastive ones. While there is a disconnect between translation (done via decoding) and contrastive evaluation, the latter is currently the best way to measure a model's performance on context-aware discourse phenomena.

Table 7: Results on anaphoric pronoun resolution and lexical cohesion contrastive datasets for the multi-encoder 1-to-2 model with different probabilities for COWORD dropout. Averaged across three runs for each method.

Table 6 shows the average performance on the contrastive datasets of the baseline and contextual models for both the non-pretrained and pretrained settings, with increasing values of COWORD dropout p. We can see that, in general, increasing COWORD dropout leads to improved performance, particularly in the non-pretrained case. This gain is particularly clear for pronoun resolution and the EN → DE language pair; we hypothesise that the weaker trend for EN → FR is due to the small size of its contrastive sets, which leads to high variance. Table 7 similarly shows that COWORD dropout improves the performance of the multi-encoder model across all phenomena, which again shows that our proposed regularization method benefits multiple context-aware architectures.
Curiously, when these models are trained without COWORD dropout, they achieve performance similar to the sentence-level baseline, while when dropout is applied, they are able to effectively start using context.
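Contrastive evaluation as described above reduces to checking whether the model scores the correct translation above all contrastive variants; a minimal sketch with a generic scoring function (the function names are ours):

```python
def contrastive_accuracy(examples, score):
    """Fraction of examples where the model assigns a strictly higher
    score (e.g. log-probability) to the correct translation than to
    every contrastive variant.

    examples: iterable of (source, correct, [contrastive, ...]) tuples
    score:    callable (source, target) -> float
    """
    hits = 0
    total = 0
    for src, correct, contrastives in examples:
        total += 1
        if score(src, correct) > max(score(src, c) for c in contrastives):
            hits += 1
    return hits / total
```

In practice `score` would be the (context-aware or context-agnostic) model's conditional log-probability of the target; comparing the two settings on the same examples yields the binary success variable used in the correlation analysis of §3.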

Related Work
Context-aware Machine Translation Many works in the literature try to incorporate context into NMT systems. Tiedemann and Scherrer (2017) first proposed the simple approach of concatenating the previous sentences on both the source and target side to the system's input; Jean et al. (2017) and Bawden et al. (2018), among others, used an additional context-specific encoder to extract contextual features from the previous sentences; Maruf and Haffari (2018) and Tu et al. (2018b) used memory components to store and retrieve document-level context. For a more detailed overview, Maruf et al. (2019b) extensively describe the different approaches and how they leverage context. While these models lead to improvements with small training sets, Lopes et al. (2020) showed that the improvements are negligible compared with the concatenation baseline when using larger datasets. Importantly, however, both our CXMI metric for measuring context usage and our proposed COWORD dropout regularization method can in principle be applied to any of the above-mentioned methods.
Evaluation In terms of evaluation, most previous work focuses on measuring a system's performance on contrastive datasets for specific inter-sentential discourse phenomena. Müller et al. (2018) built a large-scale dataset for anaphoric pronoun resolution, Bawden et al. (2018) manually created a dataset for both pronoun resolution and lexical choice, and Voita et al. (2019) created a dataset that targets deixis, ellipsis, and lexical cohesion. Stojanovski et al. (2020) showed through adversarial attacks that models that do well on other contrastive datasets rely on surface heuristics, and created a contrastive dataset to address this. In contrast, our CXMI metric is phenomenon-agnostic and can be measured with respect to all phenomena that require context in translation. Bugliarello et al. (2020) first proposed cross-mutual information (XMI) in the context of measuring the difficulty of translating between languages. Our work differs in that we propose a conditional version of XMI, where the source is always observed, and we use it to assess the information gain of context rather than the difficulty of translating between different languages.

Implications and Future Work
We introduce a new, architecture-agnostic metric to measure how much context-aware machine translation models use context, and propose a simple regularization technique to increase their context usage. Our methods are in principle applicable to almost all recently proposed context-aware models, and future work should measure exactly how much those models leverage context and whether COWORD dropout also improves context usage and performance for them.
We also hope this work motivates exploring (C)XMI for use cases other than context-aware machine translation, wherever measuring the relevance or usage of a model's inputs is of interest. It could, for example, be used in conditional language modelling to analyse how the inputs being conditioned on are used by the model.

A Estimating CXMI
Let S denote a random variable over source sentences, T a random variable over target sentences, and C a random variable over possible contexts. We assume these random variables are distributed according to some true, unknown distribution p(s, t, c). The cross-entropy between the true distribution p and a probabilistic context-aware neural translation model q_MTC(t | s, c) is defined as

H_{q_MTC}(T | S, C) = − Σ_{s ∈ V*_S} Σ_{t ∈ V*_T} Σ_{c ∈ V*_C} p(s, t, c) log q_MTC(t | s, c),

where V*_S, V*_T, V*_C represent the spaces of possible source sentences, target sentences, and contexts respectively. Since we do not know the true distribution p, we cannot compute this quantity exactly. However, given a dataset of samples {(s^(i), t^(i), c^(i))}_{i=1}^{N} assumed to be drawn from p, we can estimate this quantity using the Monte Carlo estimator

Ĥ_{q_MTC}(T | S, C) = −(1/N) Σ_{i=1}^{N} log q_MTC(t^(i) | s^(i), c^(i)).

If we consider the marginal p(s, t) = Σ_{c ∈ V*_C} p(s, t, c), we can by a similar argument obtain an estimate of the cross-entropy for a context-agnostic neural translation model q_MTA:

Ĥ_{q_MTA}(T | S) = −(1/N) Σ_{i=1}^{N} log q_MTA(t^(i) | s^(i)).

This leads trivially to the estimator for the conditional cross-mutual information:

CXMI(C → T | S) ≈ Ĥ_{q_MTA}(T | S) − Ĥ_{q_MTC}(T | S, C).

B CXMI for EN → FR

C Multi-Encoder
For the multi-encoder model, we take the approach of initializing a separate transformer encoder for the context, with input-output embeddings shared with the original encoder (or with the decoder in the case of target context). The tokens in the current sentence attend to the context by means of cross-attention. There are several other ways of formulating multi-encoder context-aware systems, and exploring them is left for future research.
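As a rough illustration of the cross-attention step, the following sketches single-head scaled dot-product attention without the learned projection matrices a real transformer layer would include:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention: tokens of the
    current sentence (queries) attend over context tokens (keys/values).

    queries: (n_cur, d) array; keys, values: (n_ctx, d) arrays.
    Returns a (n_cur, d) array of context-informed representations.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (n_cur, n_ctx)
    # numerically stable softmax over the context dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values
```

If all context keys are identical, the attention weights are uniform and each output row is the mean of the value vectors, which is a convenient sanity check.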