Analyzing the Source and Target Contributions to Predictions in Neural Machine Translation

In Neural Machine Translation (and, more generally, conditional language modeling), the generation of a target token is influenced by two types of context: the source and the prefix of the target sequence. While many attempts to understand the internal workings of NMT models have been made, none of them explicitly evaluates the relative source and target contributions to a generation decision. We argue that this relative contribution can be evaluated by adopting a variant of Layerwise Relevance Propagation (LRP). Its underlying ‘conservation principle’ makes relevance propagation unique: unlike other methods, it evaluates not an abstract quantity reflecting token importance, but the proportion of each token’s influence. We extend LRP to the Transformer and conduct an analysis of NMT models which explicitly evaluates the relative source and target contributions to the generation process. We analyze changes in these contributions when conditioning on different types of prefixes, when varying the training objective or the amount of training data, and during the training process. We find that models trained with more data tend to rely on source information more and to have sharper token contributions, and that the training process is non-monotonic, with several stages of different nature.

Unfortunately, although a lot of work on model analysis has been done, the question of how NMT predictions are formed remains largely open. Namely, the generation of a target token is defined by two types of context, source and target, but there is no method which explicitly evaluates the relative contribution of source and target to a given prediction. The ability to measure this relative contribution is important for model understanding, since previous work showed that NMT models often fail to effectively control information flow from source and target contexts. For example, adding context gates to dynamically control the influence of source and target leads to improvement for both RNN (Tu et al., 2017; Wang et al., 2018) and Transformer (Li et al., 2020) models. A more popular example is a model's tendency to generate hallucinations (fluent but inadequate translations); it is usually attributed to the inappropriately strong influence of target context. Several works observed that, when hallucinating, a model fails to properly use the source: it produces a deficient attention matrix, where almost all the probability mass is concentrated on uninformative source tokens (EOS and punctuation) (Lee et al., 2018; Berard et al., 2019).
We argue that a natural way to estimate how the source and target contexts contribute to generation is to apply Layerwise Relevance Propagation (LRP) (Bach et al., 2015) to NMT models. LRP redistributes the information used for a prediction between all input elements, keeping the total contribution constant. This 'conservation principle' makes relevance propagation unique: unlike other methods that estimate the influence of individual tokens (Alvarez-Melis and Jaakkola, 2017; He et al., 2019a; Ma et al., 2018), LRP evaluates not an abstract quantity reflecting a token's importance, but the proportion of each token's influence.
We extend one of the LRP variants to the Transformer and conduct the first analysis of NMT models which explicitly evaluates the relative source and target contributions to the generation process. We analyze changes in these contributions when conditioning on different types of prefixes (reference, generated by a model or random translations), when varying the training objective or the amount of training data, and during the training process. We show that models suffering from exposure bias are more prone to over-relying on target history (and hence to hallucinating) than the ones where the exposure bias is mitigated. When comparing models trained with different amounts of data, we find that extra training data teaches a model to rely on source information more heavily and to be more confident in the choice of important tokens. When analyzing the training process, we find that changes in training are non-monotonic and form several distinct stages (e.g., stages changing direction from decreasing influence of source to increasing).
Our key contributions are as follows:
• we show how to use LRP to evaluate the relative contribution of source and target to NMT predictions;
• we analyze how the contribution of source and target changes when conditioning on different types of prefixes: reference, generated by a model or random translations;
• by looking at the contributions when conditioning on random prefixes, we observe that models suffering from exposure bias are more prone to over-relying on target history (and hence to hallucinating);
• we find that (i) with more data, models rely on source information more and have sharper token contributions, (ii) the training process is non-monotonic with several distinct stages.

Layer-wise Relevance Propagation
Layer-wise relevance propagation is a framework which decomposes the prediction of a deep neural network computed over an instance, e.g. an image or sentence, into relevance scores for single input dimensions of the sample such as subpixels of an image or neurons of input token embeddings. The original LRP version was developed for computer vision models (Bach et al., 2015) and is not directly applicable to the Transformer (e.g., to the attention layers). In this section, we explain the general idea behind LRP, specify which of the existing LRP variants we use, and show how to extend LRP to the NMT Transformer model. 2

General Idea: Conservation Principle
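In its general form, LRP assumes that the model can be decomposed into several layers of computation. The first layer is the input (for example, the pixels of an image or the tokens of a sentence); the last layer is the real-valued prediction output of the model $f$. The $l$-th layer is modeled as a vector $x^{(l)} = (x_i^{(l)})_{i=1}^{V(l)}$ with dimensionality $V(l)$. Layerwise relevance propagation assumes that we have a relevance score $R_j^{(l+1)}$ for each dimension $x_j^{(l+1)}$ of the vector at layer $l+1$. The idea is to find a relevance score $R_i^{(l)}$ for each dimension $x_i^{(l)}$ of the previous layer $l$ such that the following holds:
$$ f = \dots = \sum_j R_j^{(l+1)} = \sum_i R_i^{(l)} = \dots = \sum_i R_i^{(1)}. \tag{1} $$
This equation represents a conservation principle, which LRP exploits to back-propagate the prediction. Intuitively, this means that the total contribution of neurons at each layer is constant.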

Redistribution Rules
Assume that we know the relevance $R_j^{(l+1)}$ of a neuron $j$ at network layer $l+1$ for the prediction $f(x)$. Then we would like to decompose this relevance into messages $R_{i \leftarrow j}^{(l,l+1)}$ sent from the neuron $j$ at layer $l+1$ to each of its input neurons $i$ at layer $l$. For the conservation principle to hold, these messages have to satisfy the constraint:
$$ \sum_i R_{i \leftarrow j}^{(l,l+1)} = R_j^{(l+1)}. \tag{2} $$
Then we can define the relevance of a neuron $i$ at layer $l$ by summing all messages from neurons at layer $l+1$:
$$ R_i^{(l)} = \sum_j R_{i \leftarrow j}^{(l,l+1)}. \tag{3} $$
Equations (2) and (3) define the propagation of relevance from layer $l+1$ to layer $l$. The only thing that is missing is specific formulas for computing the messages $R_{i \leftarrow j}^{(l,l+1)}$. Usually, the message $R_{i \leftarrow j}^{(l,l+1)}$ has the following structure:
$$ R_{i \leftarrow j}^{(l,l+1)} = v_{ij} R_j^{(l+1)}, \qquad \sum_i v_{ij} = 1. \tag{4} $$
Several versions of LRP satisfying equation (4) (and, therefore, the conservation principle) have been introduced: LRP-ε, LRP-αβ and LRP-γ (Bach et al., 2015; Binder et al., 2016; Montavon et al., 2019). We use LRP-αβ (Bach et al., 2015; Binder et al., 2016), which defines relevances at each step in such a way that they are positive.
2 Previous work applying one of the LRP variants to NMT (Ding et al., 2017; Voita et al., 2019) does not describe extensions beyond the original LRP rules (Bach et al., 2015).
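Rule for relevance propagation: the αβ-rule. Let us consider the simplest case of linear layers with non-linear activation functions, namely
$$ z_{ij} = x_i^{(l)} w_{ij}, \qquad z_j = \sum_i z_{ij} + b_j, \qquad x_j^{(l+1)} = g(z_j), $$
where $w_{ij}$ is a weight connecting the neuron $x_i^{(l)}$ to the neuron $x_j^{(l+1)}$, $b_j$ is a bias term, and $g$ is a non-linear activation function. Let
$$ z_j^{+} = \sum_i z_{ij}^{+} + b_j^{+}, \qquad z_j^{-} = \sum_i z_{ij}^{-} + b_j^{-}, $$
where $(\cdot)^{+} = \max(0, \cdot)$ and $(\cdot)^{-} = \min(0, \cdot)$. Then the αβ-rule (Bach et al., 2015; Binder et al., 2016) is given by the equation
$$ R_{i \leftarrow j}^{(l,l+1)} = R_j^{(l+1)} \cdot \left( \alpha \cdot \frac{z_{ij}^{+}}{z_j^{+}} + \beta \cdot \frac{z_{ij}^{-}}{z_j^{-}} \right), \tag{5} $$
where $\alpha + \beta = 1$. Note that all terms in the brackets are always positive: negative signs of $z_j^{-}$ and $z_{ij}^{-}$ cancel out when evaluating the ratio. This propagation method allows us to manually control the importance of positive and negative evidence by choosing different $\alpha$ and $\beta$. For example, $\alpha = \beta = \frac{1}{2}$ treats positive and negative contributions as equally important, while $\alpha = 1$, $\beta = 0$ considers only positive contributions. In our experiments, both versions lead to the same observations. Note that (5) is directly applicable to all layers for which there exist functions $g_j$ and $h_{ij}$ such that
$$ x_j^{(l+1)} = g_j\Big( \sum_i h_{ij}\big(x_i^{(l)}\big) \Big). \tag{6} $$
These layers include linear, convolutional and max-pooling operations. Additionally, pointwise monotonic activation functions $g_j$ (e.g., ReLU) are ignored by LRP (Bach et al., 2015).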
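The αβ-rule can be illustrated on a single linear layer. The following is a minimal sketch rather than the implementation used in the experiments: the function names, the toy weights and the treatment of an empty positive or negative part (the corresponding ratio is set to zero) are assumptions made for illustration. Biases are set to zero in the example, so that no relevance is absorbed by the bias terms and the total relevance is conserved exactly.

```python
def alpha_beta_messages(x, w, b, relevance_out, alpha=1.0):
    """Messages R[i][j] from output neuron j to input neuron i (alpha-beta rule).

    x: n input activations; w: n x m weights (w[i][j]); b: m biases;
    relevance_out: m relevance scores at layer l+1.
    """
    beta = 1.0 - alpha  # the rule requires alpha + beta = 1
    n, m = len(x), len(b)
    messages = [[0.0] * m for _ in range(n)]
    for j in range(m):
        z = [x[i] * w[i][j] for i in range(n)]
        # Positive and negative parts of the pre-activation, bias included.
        z_pos = sum(max(0.0, v) for v in z) + max(0.0, b[j])
        z_neg = sum(min(0.0, v) for v in z) + min(0.0, b[j])
        for i in range(n):
            # Assumption: if a part is empty, its ratio is treated as zero.
            pos = max(0.0, z[i]) / z_pos if z_pos > 0 else 0.0
            neg = min(0.0, z[i]) / z_neg if z_neg < 0 else 0.0
            messages[i][j] = relevance_out[j] * (alpha * pos + beta * neg)
    return messages


def input_relevance(messages):
    """Relevance of each input neuron: the sum of the messages it receives."""
    return [sum(row) for row in messages]


# Toy layer with 3 inputs and 2 outputs (hypothetical numbers, zero biases).
x = [1.0, 2.0, -1.0]
w = [[0.5, -1.0], [0.25, 0.5], [1.0, 0.5]]
b = [0.0, 0.0]
relevance_out = [0.6, 0.4]

r_positive_only = input_relevance(alpha_beta_messages(x, w, b, relevance_out, alpha=1.0))
r_balanced = input_relevance(alpha_beta_messages(x, w, b, relevance_out, alpha=0.5))
```

In both settings the input relevances sum to the total output relevance (here 1), which is the conservation principle at the level of one layer; with α = 1 the third input, which contributes only negative evidence, receives no relevance.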

Propagating relevance through attention layers.
For the structures that do not fit the form (6), the weighting $v_{ij}$ can be obtained by performing a first-order Taylor expansion of a neuron $x_j^{(l+1)}$ (Bach et al., 2015; Binder et al., 2016).
For attention layers in the Transformer, we extend the approach by Binder et al. (2016). Namely, let $x_j^{(l+1)} = f(x^{(l)})$, where $f(x^{(l)}) = f(x_1^{(l)}, \dots, x_n^{(l)})$. Then by Taylor expansion at some point $\tilde{x} = (\tilde{x}_1, \dots, \tilde{x}_n)$, we get
$$ f(x^{(l)}) \approx f(\tilde{x}) + \sum_i \frac{\partial f}{\partial x_i^{(l)}}(\tilde{x}) \, \big(x_i^{(l)} - \tilde{x}_i\big). $$
Elements of the sum can be assigned to incoming neurons, and the zero-order term can be redistributed equally between them. This leads to the following decomposition:
$$ z_{ij} = \frac{1}{n} f(\tilde{x}) + \frac{\partial f}{\partial x_i^{(l)}}(\tilde{x}) \, \big(x_i^{(l)} - \tilde{x}_i\big). \tag{7} $$
We use the zero vector in place of $\tilde{x}$. Equation (7), along with the standard redistribution rules (5), defines relevance propagation for complex non-linear layers. In the Transformer, we apply equation (7) to the softmax operations in the attention layers; all other operations inside the attention layers are linear functions, and the rule (5) can be used.
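To make the Taylor-based decomposition concrete for the softmax, the sketch below expands each softmax output around the zero vector, assigns every first-order term to its input, and splits the zero-order term equally between the n inputs. It is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, and the closed forms used (softmax at zero equals 1/n, and its derivative with respect to input i for output j equals (1/n)(δ_ij − 1/n)) are standard properties of the softmax.

```python
def softmax_taylor_weights(x):
    """Contributions z[i][j] of input i to softmax output j, expanded at zero."""
    n = len(x)
    f0 = 1.0 / n  # softmax_j(0) is the same for every output j
    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # d softmax_j / d x_i evaluated at the zero vector.
            grad = f0 * ((1.0 if i == j else 0.0) - f0)
            # Equal share of the zero-order term plus the first-order term.
            z[i][j] = f0 / n + grad * x[i]
    return z


# Hypothetical attention logits for one query.
logits = [0.2, -0.1, 0.5]
z = softmax_taylor_weights(logits)
column_sums = [sum(z[i][j] for i in range(len(logits))) for j in range(len(logits))]
```

Each column sum is the first-order approximation of the corresponding softmax output, so the columns together still sum to 1. The resulting z values can then play the role of the contributions in the positive and negative parts of the standard redistribution rule.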

LRP for Conditional Language Models
Given a source sequence $x = (x_1, \dots, x_S)$ and a target sequence $y = (y_1, \dots, y_T)$, standard autoregressive NMT models (or, more broadly, conditional language models) are trained to predict words in the target sequence, word by word. Formally, at each generation step such models predict $p(y_t | x_{1:S}, y_{1:t-1})$, relying on both source tokens $x_{1:S}$ and already generated target tokens $y_{1:t-1}$. Using LRP, we evaluate the relative contribution of all tokens, source and target, to the current prediction.
Propagating through decoder and encoder. At first glance, it can be unclear how to apply a layerwise method to an architecture that is not completely layered (such as an encoder-decoder). This, however, is rather straightforward and is done in two steps:
1. total relevance is propagated through the decoder. Since the decoder uses representations from the final encoder layer, part of the relevance 'leaks' to the encoder; this happens at each decoder layer;
2. relevance leaked to the encoder is propagated through the encoder layers.
The total contribution of neurons in each decoder layer is not preserved (part of the relevance leaks to the encoder), but the total contribution of all tokens, across the source and the target prefix, remains equal to the model prediction.
We evaluate relevance of input neurons to the top-1 logit predicted by a model. Then token relevance (or its contribution) is the sum of relevances of its neurons.
Notation. Without loss of generality, we can assume that the total relevance for each prediction equals 1. Let us denote by $R_t(x_i)$ and $R_t(y_j)$ the contribution of source token $x_i$ and target token $y_j$ to the prediction at generation step $t$, respectively. Then source and target contributions are defined as
$$ R_t(\text{source}) = \sum_i R_t(x_i), \qquad R_t(\text{target}) = \sum_{j=1}^{t-1} R_t(y_j). $$

Experimental setting
Model. We follow the setup of the Transformer base model (Vaswani et al., 2017) with the standard training setting. More details on hyperparameters and the optimizer can be found in the appendix.
Data. We use random subsets of the WMT14 En-Fr dataset of different size: 1m, 2.5m, 5m, 10m, 20m, 30m sentence pairs. In Sections 4 and 7, we report results for the model trained on the 1m subset. In Section 6, we show how the results depend on the amount of training data.
Evaluating LRP. The αβ-LRP we use requires choosing values for α and β such that α + β = 1. We tried treating positive and negative contributions as equally important (α = β = 1/2) and considering only positive contributions (α = 1, β = 0). The observed patterns in behavior were the same for these two versions. In the main text, we use α = 1; in the appendix, we provide results for α = β = 1/2.
Figure 1: (a) contribution of the whole source at each generation step; (b) total contribution of source tokens at each position to the whole target sentence.
Reporting results. All presented results are averaged over an evaluation dataset of 1000 sentence pairs. In each evaluation dataset, all examples have the same number of tokens in the source, as well as in the target (e.g., 20 source and 23 target tokens; the exact number for each experiment is clear from the results).

Getting Acquainted
In this section, we explain general patterns in model behavior and illustrate the usage of LRP by evaluating different statistics within a single model. Later, we will show how these results change when varying the amount of training data (Section 6) and during model training (Section 7).

Changes in contributions
Here we evaluate changes in the source contribution during generation, and in the contributions of source tokens at different positions to the entire output.
Source → target(k). For each generation step $t$, we evaluate the total contribution of source $R_t(\text{source})$. Note that this is equivalent to evaluating the total contribution of the prefix: since the total relevance equals 1, the prefix contribution is $1 - R_t(\text{source})$. Results are shown in Figure 1(a). We see that, during the generation process, the influence of source decreases (or, equivalently, the influence of the prefix increases). This is expected: with a longer prefix, the model has less uncertainty in deciding which source tokens to use, but needs to control more for fluency. There is also a large drop of source influence for the last token: apparently, to generate the EOS token, the model relies on the prefix much more than when generating other tokens.
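As a small illustration of this statistic, the sketch below computes the source share at each generation step from token-level relevances. The function name and the relevance values are hypothetical; the only assumptions are that token relevances are already available (e.g., as sums over the neurons of each token) and that they are normalized so that each step sums to 1.

```python
def source_contribution(source_relevance, prefix_relevance):
    """Share of the source after normalizing the total relevance of a step to 1."""
    source_total = sum(source_relevance)
    total = source_total + sum(prefix_relevance)
    return source_total / total


# Hypothetical token relevances for two generation steps:
# (relevances of the source tokens, relevances of the prefix tokens).
steps = [
    ([0.5, 0.3, 0.1], [0.1]),
    ([0.2, 0.3, 0.1], [0.1, 0.3]),
]
per_step_source = [source_contribution(src, pre) for src, pre in steps]
# The prefix share is the complement, because each step sums to 1.
per_step_prefix = [1.0 - r for r in per_step_source]
```

In this toy example the source share drops from the first step to the second, which is the kind of trend plotted per target position.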
Source(k) → target. Now we want to understand if there is a tendency to use source tokens at certain positions more than tokens at the others. For each source token position $k$, we evaluate its total contribution to the whole target sequence.
To eliminate the effect of decreasing source influence during generation, at each step $t$ we normalize source contributions $R_t(x_k)$ over the total contribution of source at this step, $R_t(\text{source})$. Formally, for the $k$-th token we evaluate
$$ \sum_{t=1}^{T} \frac{R_t(x_k)}{R_t(\text{source})}. $$
For convenience, we multiply the result by $\frac{S}{T}$: this makes the average total contribution of each token equal to 1. Figure 1(b) shows that, on average, source tokens at earlier positions influence translations more than tokens at later ones. This may be because the alignment between English and French is roughly monotonic. We leave for future work investigating the changes in this behavior for language pairs with more complex alignment (e.g., English-Japanese).
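The position-wise statistic can be sketched as follows: at each step the source relevances are divided by their sum, the normalized values are accumulated per source position, and the result is rescaled by S/T so that the average over positions equals 1. The function name and the numbers are hypothetical.

```python
def source_position_contributions(relevance):
    """relevance[t][k] is the relevance of source token k at generation step t."""
    num_steps = len(relevance)
    num_source = len(relevance[0])
    scores = [0.0] * num_source
    for step in relevance:
        step_total = sum(step)  # total source contribution at this step
        for k, r in enumerate(step):
            scores[k] += r / step_total
    # Rescale by S / T so that the average over source positions is 1.
    return [s * num_source / num_steps for s in scores]


# Hypothetical relevances: 2 generation steps, 3 source tokens.
relevance = [
    [0.6, 0.2, 0.1],
    [0.1, 0.1, 0.2],
]
scores = source_position_contributions(relevance)
```

Values above 1 then mark positions that are used more than average, and values below 1 mark positions that are used less.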

Entropy of contributions
Now let us look at how 'sharp' contributions of source or target tokens are at different generation steps. For each step $t$, we evaluate the entropy of the (normalized) source or target contributions: $\{R_t(x_i) / R_t(\text{source})\}_{i=1}^{S}$ or $\{R_t(y_j) / R_t(\text{target})\}_{j=1}^{t-1}$.
Entropy of source contributions. Figure 2(a) shows that during generation, entropy increases until approximately 2/3 of the translation is generated, then decreases when generating the remaining part. Interestingly, for the last punctuation mark and the EOS token, entropy of source contributions is very high: the decision to complete the sentence requires broader context.
Entropy of target contributions. Figure 2(b) shows that entropy of target contributions is higher for longer prefixes. This means that the model does use longer contexts in a non-trivial way.
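The entropy statistic can be sketched as below. The function name is hypothetical, the relevances are assumed to be non-negative (as with α = 1), and the natural logarithm is an assumption, since the base of the logarithm is not specified in the text.

```python
import math


def contribution_entropy(relevance):
    """Entropy (in nats) of token contributions normalized to sum to 1."""
    total = sum(relevance)
    probs = [r / total for r in relevance]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical contributions of four tokens at one generation step.
focused = contribution_entropy([0.85, 0.05, 0.05, 0.05])
diffuse = contribution_entropy([0.25, 0.25, 0.25, 0.25])
```

Lower entropy corresponds to sharper, more focused contributions; the uniform case gives the maximum value, log of the number of tokens.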

Reference, Model and Random Prefixes
Let us now look at how model behavior changes when feeding different types of prefixes: prefixes of reference translations, translations generated by the model, and random sentences in the target language. As in previous experiments, we evaluate relevance for the top-1 logit predicted by the model.
Reference vs model prefixes. When feeding model-generated prefixes, the model uses source more (Figure 3(a)) and has more focused source contributions (lower entropy in Figure 3(b)) than when generating the reference. This may be because model-generated translations are 'easier' than references. For example, beam search translations contain fewer rare tokens (Burlot and Yvon, 2018; Ott et al., 2018), are simpler syntactically (Burlot and Yvon, 2018) and, according to the fuzzy reordering score (Talbot et al., 2011), model translations have significantly less reordering compared to the real parallel sentences (Zhou et al., 2020). As we see from our experiments, these simpler model-generated prefixes allow the model to rely on the source more and to be more confident when choosing relevant source tokens.
Reference vs random prefixes. Results for random sentence prefixes are given in Figures 3(c) and 3(d). The reaction to random prefixes helps us study the self-recovery ability of NMT models. Previous work has found that models can fall into a hallucination mode where "the decoder ignores context from the encoder and samples from its language model" (Koehn and Knowles, 2017; Lee et al., 2018). In contrast, He et al. (2019b) found that a language model is able to recover from artificially distorted history input and generate reasonable samples.
Our results show evidence for both. At the beginning of the generation process, the model tends to rely more on the source context when given a random prefix compared to the reference prefix, indicating a self-recovery mode. However, when the prefix becomes longer, the model choice shifts towards ignoring the source and relying more on the target: Figure 3(c) shows a large drop of source influence for later positions. Figure 3(d) also shows that with a random prefix, the entropy of source contributions is high and is roughly constant.

Exposure Bias and Source Contributions
The results in the previous section agree with some observations made in previous work studying self-recovery and hallucinations. In this section, we illustrate more explicitly how our methodology can be used to shed light on the effects of exposure bias and training objectives.
Wang and Sennrich (2020) empirically link the hallucination mode to exposure bias (Ranzato et al., 2016), i.e. the mismatch between the gold history seen at training time and the (potentially erroneous) model-generated prefixes at test time. The authors hypothesize that exposure bias leads to an over-reliance on target history, and show that Minimum Risk Training (MRT), which does not suffer from exposure bias, reduces hallucinations. However, they did not directly measure this over-reliance on target history. Our method is able to directly test whether there is indeed an over-reliance on the target history with MLE-trained models, and a more robust inclusion of source context with MRT. We also consider a simpler heuristic, word dropout, which we hypothesize to have a similar effect.
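Minimum Risk Training (MRT) minimises the expected loss ('risk') with respect to the posterior distribution:
$$ \mathcal{R}(\theta) = \sum_{(x,y)} \sum_{\tilde{y} \in \mathcal{Y}(x)} P(\tilde{y} | x, \theta) \, \Delta(\tilde{y}, y), $$
where $\mathcal{Y}(x)$ is a set of candidate translations for $x$, and $\Delta(\tilde{y}, y)$ is the discrepancy between the model prediction $\tilde{y}$ and the gold translation $y$ (e.g., a negative smoothed sentence-level BLEU). More details on the method can be found in Shen et al. (2016) or Edunov et al. (2018); training details for our models are in the appendix.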
Word Dropout is a simple data augmentation technique. During training, it replaces some of the tokens with a special token (e.g., UNK) or a random token (in our experiments, we replace 10% of the tokens with random ones). When used on the target side, it may serve as the simplest way to alleviate exposure bias: it exposes a model to something other than gold prefixes. This is not true when used on the source side, but for analysis, we consider both variants.
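A minimal sketch of word dropout with random replacement is given below. It assumes that each token is replaced independently with probability 0.1 by a token drawn uniformly from the vocabulary; the exact sampling scheme behind "10% of the tokens" is not specified in the text, and the function name and toy vocabulary are hypothetical.

```python
import random


def word_dropout(tokens, vocab, rate=0.1, rng=None):
    """Replace each token with a random vocabulary item with probability `rate`."""
    rng = rng or random.Random()
    return [rng.choice(vocab) if rng.random() < rate else tok for tok in tokens]


# Hypothetical target-side prefix and vocabulary.
vocab = ["le", "chat", "noir", "dort", "ici"]
prefix = ["le", "chat", "dort"]
noisy_prefix = word_dropout(prefix, vocab, rate=0.1, rng=random.Random(0))
```

Applied to target prefixes during training, this exposes the model to histories that differ from the gold ones; applied to source sentences, it only perturbs the input.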

Experiments
We consider two types of prefixes: model-generated and random. Random prefixes are our main interest here. We feed prefixes that are fluent but unrelated to the source and look at whether a model is likely to fall into a language modeling regime, i.e., to what extent it ignores the source. For model-generated prefixes, we do not expect to see large differences in contributions: this mode is 'easy' for the model and the source contributions are high (see Section 4.3). The results are shown in Figures 4 and 5.
Model-generated prefixes. MRT causes more prominent changes in contributions (Figure 4). We see the largest difference in the beginning and the end of the generation process, which may be expected when comparing models trained with token-level and sequence-level objectives. The direction of change, i.e. decreasing influence of source, is rather unexpected; we leave a detailed investigation of this behavior to future work. For word dropout, changes in the amount of contributions are less noticeable; we see, however, that target-side word dropout makes the model more confident in the choice of relevant source tokens (Figure 4(b)).
Random prefixes. We see that, among all models, the MRT model has the highest influence of source (Figure 5(a)) and the most focused source contributions (Figure 5(b)). This agrees with our expectations: by construction, MRT removes exposure bias completely. Therefore, it is confused by random prefixes less than other models. Additionally, this also links to Wang and Sennrich (2020), who showed that MRT reduces hallucinations. When using word dropout, both its variants also increase the influence of source, but to a much lesser extent (Figure 5(a)). As expected, since target-side word dropout slightly reduces exposure bias (in contrast to source-side word dropout), it leads to a larger increase of source influence.
Experiments in this section highlight that the methodology we propose can be applied to study exposure bias, robustness, and hallucinations, both in machine translation and more broadly for other language generation tasks. In this work, however, we want to illustrate more broadly the potential of this approach. In the following, we will compare models trained with varying amounts of data and will look into the training process.

Data Amount
In this section, we show how the results from Section 4 change when increasing the amount of training data. The observed patterns are the same when evaluating on datasets with reference translations or the ones generated by the corresponding model (in each case, all sentences in the evaluation dataset have the same length). In the main text, we show figures for references.
More data ⇒ higher source contribution. Figure 6(a) shows the source contribution at each generation step. We can see that, generally, models trained with more data rely on source more heavily. Surprisingly, this increase is not spread evenly across positions: at approximately 80% of the target length, models trained with more data use source more, but at the last positions, they switch to more actively using the prefix.
More data ⇒ more focused contributions. Figure 6(b) shows that at each generation step, entropy of source contributions decreases with more data. This means that with more training data, the model becomes more confident in the choice of important tokens. In the appendix, we show that this is also the case for target contributions.

Training Stages
Now we turn to analyzing the training process of an NMT model. Specifically, we look at the changes in how the predictions are formed: changes in the amount of source/target contributions and in the entropy of these contributions. Our findings are summarized in Figure 7. In the following, we explain them in more detail. In Section 7.1, we draw connections between our training stages (shown in Figure 7) and the ones found in previous work focused on validating the lottery ticket hypothesis.
Contributions converge early. First, we evaluate how fast the contributions converge, i.e., how quickly a model understands which tokens are the most important for prediction. For this, at each generation step $t$ we evaluate the KL divergence in token influence distributions $(R_t(x_1), \dots, R_t(x_S), R_t(y_1), \dots, R_t(y_{t-1}))$ from the final converged model to the model in training. Figure 8(a) shows that contributions converge early. After approximately 12k batches, the model is very close to its final state in the choice of tokens to rely on for a prediction.
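The convergence measure can be sketched as follows. The direction of the divergence is an assumption here: the sketch computes KL(final ‖ checkpoint), i.e., it takes the distribution of the converged model as the reference. Function names and relevance values are hypothetical.

```python
import math


def normalize(relevance):
    """Turn non-negative token relevances into a distribution."""
    total = sum(relevance)
    return [r / total for r in relevance]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is clipped from below to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical token influence distributions over source and prefix tokens
# at one generation step, for the final model and two training checkpoints.
final_model = normalize([0.4, 0.3, 0.1, 0.2])
early_checkpoint = normalize([0.25, 0.25, 0.25, 0.25])
late_checkpoint = normalize([0.38, 0.31, 0.11, 0.2])

kl_early = kl_divergence(final_model, early_checkpoint)
kl_late = kl_divergence(final_model, late_checkpoint)
```

A value close to zero means that the checkpoint already relies on nearly the same tokens as the converged model.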
Changes in training are not monotonic. Figures 8(b-d) show how the amount of source contribution and the entropy of source and target contributions change in training. We see that all three figures have the same distinct stages (shown with vertical lines). First, source influence decreases, and both source and target contributions become more focused. In this stage, most of the change happens (Figure 8(a)). In the second stage, the model also undergoes substantial change, but all processes change their direction: source influence increases and the model learns to rely on broader context (entropy is increasing). Finally, in the third stage, the direction of changes remains the same, but very little is going on: the model slowly converges. These three stages correspond to the first three stages shown in Figure 7; at this point, the model trained on 1m sentence pairs converges. With more data (e.g., 20m sentence pairs), we further observed the next stage (the last one in Figure 7), where the entropy of both source and target contributions is decreasing again. However, this last stage is much slower than the third, and the final state does not differ much from the end of the third stage.
Early positions change more. Figures 9(a-b) show how source contributions and their entropy change for each target position. We see that earlier positions are the ones that change most actively: at these positions, we see the largest decrease at the first stage and the largest following increase at the subsequent stages. If we look at how accuracy for each position changes in training (Figure 10), we see that at the end of the first stage, early tokens have the highest accuracy. This is not surprising: one could expect early positions to train faster because they are observed more frequently in training. Previously, such intuition motivated the usage of sentence length as one of the criteria for curriculum learning (e.g., Kocmi and Bojar (2017)).

Relation to Previous Work
Interestingly, our stages in Figure 7 agree with the ones found by Frankle et al. (2020) for ResNet-20 trained on CIFAR-10 when investigating, among other things, the lottery ticket hypothesis (Frankle and Carbin, 2019). Their stages were defined based on the changes in gradient magnitude, in the weight space, in the performance, and in the effectiveness of rewinding in search of the 'winning' subnetwork (for more details on the lottery ticket hypothesis, see the works cited above). Comparing the stages by Frankle et al. (2020) with ours, we see that (1) their relative sizes in the corresponding timelines match well, (2) the rewinding starts to be effective at the third stage; for our model, this is when the contributions have almost converged. In future work, it would be interesting to further investigate this relation.

Additional Related Work
To estimate the influence of source on an NMT prediction, Ma et al. (2018) trained an NMT model with an auxiliary second decoder where the encoder context vector was masked. Then the source influence was measured as the KL divergence between predictions of the two decoders. However, the ability of an auxiliary decoder to generate a similar distribution is not equivalent to the main model not using source. More recently, as a measure of individual token importance, He et al. (2019a) used Integrated Gradients (Sundararajan et al., 2017).
In machine translation, LRP was previously used for visualization (Ding et al., 2017) and to find the most important attention heads in the Transformer's encoder (Voita et al., 2019). Similar to our work, Voita et al. (2019) evaluated LRP on average over a dataset (and not for a single prediction) to extract patterns in model behaviour. Both works used the more popular ε-LRP, while for our analysis, the αβ-LRP was more suitable (Section 2). For language modeling, Calvillo and Crocker (2018) use LRP to evaluate relevance of neurons in RNNs for a small synthetic setting.

Conclusions
We show how to use LRP to evaluate the relative contributions of source and target to NMT predictions. We illustrate the potential of this approach by analyzing changes in these contributions when conditioning on different types of prefixes (references, model predictions or random translations), when varying training objectives or the amount of training data, and during the training process. Some of our findings are: (1) models trained with more data rely on source information more and have sharper token contributions; (2) the training process is non-monotonic with several distinct stages. These stages agree with the ones found in previous work focused on validating the lottery ticket hypothesis, which suggests future investigation of this connection. Additionally, we show that models suffering from exposure bias are more prone to over-relying on target history (and hence to hallucinating) than the ones where the exposure bias is mitigated. In future work, our methodology can be used to measure the effects of different and novel training regimes on the balance of source and target contributions.

We use random subsets of the WMT14 En-Fr dataset: http://www.statmt.org/wmt14/translation-task.html. Sentences were encoded using byte-pair encoding (Sennrich et al., 2016), with source and target vocabularies of about 32000 tokens. Translation pairs were batched together by approximate sequence length. Each training batch contained a set of translation pairs containing approximately 16000 source tokens for the 1m subsample and 32000 for larger datasets.

A.2 Model parameters
We follow the setup of the Transformer base model (Vaswani et al., 2017). More precisely, the number of layers in the encoder and in the decoder is $N = 6$. We employ $h = 8$ parallel attention layers, or heads. The dimensionality of input and output is $d_{model} = 512$, and the inner layer of the feed-forward networks has dimensionality $d_{ff} = 2048$. We use regularization as described in Vaswani et al. (2017).
We train models until convergence and average the 5 latest checkpoints. The approximate numbers of training batches are 57k for the 1m dataset, 220k for the 2.5m dataset and 600k for the rest.

B.1 Background
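Minimum Risk Training (MRT) minimises the expected loss ('risk') with respect to the posterior distribution:
$$ \mathcal{R}(\theta) = \sum_{(x,y)} \mathbb{E}_{\tilde{y} \sim P(\tilde{y} | x, \theta)} \, \Delta(\tilde{y}, y) = \sum_{(x,y)} \sum_{\tilde{y} \in \mathcal{Y}(x)} P(\tilde{y} | x, \theta) \, \Delta(\tilde{y}, y), $$
where $\mathcal{Y}(x)$ is the set of all possible candidate translations for $x$, and $\Delta(\tilde{y}, y)$ is the discrepancy between the model prediction $\tilde{y}$ and the gold translation $y$.
Since the search space $\mathcal{Y}(x)$ is exponential, in practice it is common to use only a subset of the full space. Formally, instead of $\mathcal{Y}(x)$ we use $\mathcal{S}(x) \subset \mathcal{Y}(x)$, where $\mathcal{S}(x)$ is obtained by sampling several translations. The probabilities $P(\tilde{y} | x, \theta)$ are replaced with $\tilde{P}$, which is renormalized over the subset $\mathcal{S}(x)$:
$$ \tilde{P}(\tilde{y} | x, \theta, \alpha) = \frac{P(\tilde{y} | x, \theta)^{\alpha}}{\sum_{y' \in \mathcal{S}(x)} P(y' | x, \theta)^{\alpha}}. $$
The hyperparameter $\alpha$ is used to control the sharpness of the distribution.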
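The sampled version of the objective can be sketched for a single sentence pair as follows. This is an illustration under assumptions, not the training code: the function names, candidate log-probabilities and costs are hypothetical, and the renormalization raises each candidate probability to the power α and divides by the sum over the sampled subset, computed in log space for numerical stability.

```python
import math


def renormalized_distribution(log_probs, alpha):
    """Distribution over sampled candidates, proportional to P(candidate)^alpha."""
    scaled = [alpha * lp for lp in log_probs]
    shift = max(scaled)  # subtract the maximum before exponentiating
    weights = [math.exp(s - shift) for s in scaled]
    norm = sum(weights)
    return [w / norm for w in weights]


def expected_risk(log_probs, costs, alpha):
    """Expected cost of the candidates under the renormalized distribution."""
    q = renormalized_distribution(log_probs, alpha)
    return sum(qi * c for qi, c in zip(q, costs))


# Hypothetical candidates: model log-probabilities and costs, where a cost is
# a negative sentence-level BLEU (lower is better).
log_probs = [-2.0, -5.0, -9.0]
costs = [-0.8, -0.5, -0.2]

risk_flat = expected_risk(log_probs, costs, alpha=0.0)
risk_sharp = expected_risk(log_probs, costs, alpha=1.0)
```

With α = 0 the candidates are weighted uniformly, while larger α concentrates the mass on the candidates the model already prefers; a small value such as α = 0.005 keeps the distribution close to uniform.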

B.2 Experimental setting
To choose the setting, we mostly relied on previous work (Shen et al., 2016; Edunov et al., 2018). The model is pre-trained with the token-level MLE objective and then fine-tuned with MRT; the fine-tuning stage is approximately one epoch.
Candidate translations. The translations are sampled using standard random sampling without temperature. Following Shen et al. (2016), we take a large number of candidates; specifically, we use 50 translations and add a reference to the subset. While Edunov et al. (2018) report that adding the reference to the set of candidates hurts quality, in preliminary experiments we found that this was not the case for our setting.
Measure of discrepancy. The measure of discrepancy, $\Delta(\tilde{y}, y)$, is a negative smoothed sentence-level BLEU.
Batch size. On average, the number of examples (where an example is a translation pair along with all candidates) is the same as in training of the baseline models. This is achieved by accumulating gradients for several steps and making an update.
Other parameters. Following Wang and Sennrich (2020), we set α = 0.005 and the learning rate to 0.00001.

C.1 Data Amount
When varying the amount of data, Figure 11 shows changes in the influence of source tokens at different positions on the whole output, and Figure 12 shows changes in the entropy of target contributions.