Measuring and Improving Faithfulness of Attention in Neural Machine Translation

While the attention heatmaps produced by neural machine translation (NMT) models seem insightful, there is little evidence that they reflect a model’s true internal reasoning. We provide a measure of faithfulness for NMT based on a variety of stress tests in which the attention weights that are crucial for prediction are perturbed; if the learned weights are a faithful explanation of the predictions, the model should alter its predictions under these tests. We show that our proposed faithfulness measure for NMT models can be improved using a novel differentiable objective that rewards faithful behaviour by the model through probability divergence. Our experimental results on multiple language pairs show that our objective function is effective in increasing faithfulness and can lead to a useful analysis of NMT model behaviour and more trustworthy attention heatmaps. Our proposed objective improves faithfulness without reducing translation quality; it also has a useful regularization effect on the NMT model and can even improve translation quality in some cases.


Introduction
How trustworthy are our neural models? This question has led to a wide variety of contemporary NLP research focusing on (a) different axes of interpretability, including plausibility (or, interchangeably, human-interpretability) (Herman, 2017; Lage et al., 2019) and faithfulness (Lipton, 2018; Jacovi and Goldberg, 2020b), (b) interpretation of the neural model components (Belinkov et al., 2017; Dalvi et al., 2017; Vig and Belinkov, 2019), (c) explaining the decisions made by neural models to humans (using explanations, highlights, rationales, etc.) (Ribeiro et al., 2016; Ding et al., 2017; Ghaeini et al., 2018; Bastings et al., 2019; Jain et al., 2020), and (d) evaluating different explanation methods from different perspectives (Samek et al., 2016; Mohseni and Ragan, 2018; Poerner et al., 2018; Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Li et al., 2020). All of these approaches make NLP neural models more trustworthy. In this work, we focus on faithfulness, which intuitively is the extent to which an explanation accurately represents the true reasoning behind a prediction. It is particularly important for NLP practitioners who wish to debug their neural models and improve them. Faults of a neural model cannot be identified if the neural model does not provide a faithful and trustworthy description of what it is doing.
Figure 1: An example of unfaithful attention weights produced during a Cs-En translation (source: "je to moorův zákon za posledních sto let"; output: "it's moore's law for the last century"; attention weights range from 0.00 to 1.00). Note that in the left attention heatmap, the attention is on the word sto while the decoder generates century. However, in the right heatmap, sto is not attended to at all but century is still produced as the output. This shows unfaithful behavior.
However, the formal definition of faithfulness and the proper approach to its evaluation are still contested in the literature. Jacovi and Goldberg (2020b) emphasize distinguishing faithfulness from human-interpretability in interpretability research by providing several clarifications about the terminology used by researchers. They describe the following conditions on the evaluation of how well a research project tackles the notion of faithfulness:
• Be explicit: provide a measurable evaluation of faithfulness.
• Human judgements are not relevant because we are interested in model internals.
• Do not match against gold labels (e.g. AER) because the faithfulness of both correct and incorrect decisions made by the model is equally important.
• No model is "inherently" faithful. We need to measure faithfulness not as a binary aspect of a model (it is faithful or not) but rather as a gray-scale measure.
• A more faithful system is a necessary but not sufficient condition for model interpretation by humans, cf. Jacovi and Goldberg (2020a).
Aligned with these criteria, we study faithfulness of NLP neural models, specifically NMT models. We provide a faithfulness measure that is computed based on a variety of stress tests where attention weights that are crucial for prediction are perturbed. We expect a faithful model to change its prediction under such tests (Figure 1). We quantify faithfulness based on how often the model outputs change. The proposed metric is defined in terms of discrete changes in the output, so it is not differentiable and cannot simply be included in the NMT loss function and optimized. We therefore propose a novel differentiable objective based on probability divergence and study its effect on the discrete faithfulness measure. Our findings show that our objective is effective in increasing faithfulness and can lead to a useful analysis of NMT model behaviour and more trustworthy attention heatmaps. We assert that faithfulness is a good property to have in a model whether or not it will be useful for downstream interpretation. A model that is faithful can be trusted more as a component in a larger end-to-end neural model.

Contributions
We seek to improve faithfulness of NMT models. To this end, we make the following contributions in this work:
• We propose a measure for quantifying faithfulness in NMT.
• We introduce a novel learning objective based on probability divergence that rewards faithful behavior and which can be included in the training objective for NMT.
• We provide empirical evidence that we can improve faithfulness in an NMT model. Our approach results in a more faithful NMT model while producing better BLEU scores.
We chose to study the impact of faithfulness in NMT because it is under-studied in terms of interpretability. Most previous work has focused on document or sentence-based classification tasks where attention models are not as directly useful as in NMT models. Attention is also more challenging in terms of faithfulness in the context of NMT models due to the substantial impact of the decoder component. While Transformers (Vaswani et al., 2017) generally produce better NMT models, they rely on multiple heads for attention. Defining an overall faithfulness measure in this case is challenging, as different heads possibly have different faithfulness. Before addressing this more complicated problem, we first focus on the simpler single-head attention models. However, we expect larger and overparameterized models to get worse in terms of faithfulness because the language model in the decoder gets stronger at guessing the next word which, as we shall discuss in more detail later, tends to make attention less faithful.

Faithfulness in NMT Models
Intuitively, a faithful explanation should reflect the true internal reasoning of the model. Although there is no formal definition for faithfulness, a common approach in the community is to design stress tests that perturb model parameters chosen in such a way that the model's decision should change if the model is faithful (Jacovi and Goldberg, 2020b). A common stress test is the erasure test, in which the most relevant part of the input is removed (Arras et al., 2017). In the context of NMT, at decoding time step $t$ the attention component assigns attention weights $\alpha_t$, attending to the source word at position $m_t = \arg\max_i \alpha_t[i]$ (or the k-best attended-to words in the source). These weights are often implicitly or explicitly regarded as an interpretation for the model's prediction at time step $t$ (Tu et al., 2016; Mi et al., 2016; Lee et al., 2017; Ding et al., 2017; Ghaeini et al., 2018). The erasure stress test for evaluating the faithfulness offered by $\alpha_t$ is done by setting $\alpha_t[m_t]$ to zero and observing whether or not the output changes. It is worth noting that erasure is only one of the possible stress tests for evaluating faithfulness. Passing more stress tests implies a more faithful model, as it is properly reacting to more adversaries by changing its decision. In this paper we consider three intuitive stress tests:
ZeroOutMax (Arras et al., 2017): Here we remove attention from the most important token according to the attention weights by setting $\alpha_t[m_t] = 0$.
Uniform (Moradi et al., 2019): In this stress test all attention weights are set to be equal, $\alpha_t = \frac{1}{m}\mathbf{1}$, where $m$ is the length of the source sentence. This is to confuse the model about which part of the input is the most important one.
RandomPermute (Jain and Wallace, 2019): In this stress test we randomly permute attention weights several times until a change in the model output is observed. We ensure that $m_t$, the most important token according to attention, is always changed. We set $\alpha'_t = \mathrm{random\_permute}(\alpha_t)$ such that $\arg\max_i \alpha'_t[i] \neq m_t$.
Many prior studies of attention (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019) have used a binary measure: either attention is faithful or it is not. These studies typically are about whether attention has the potential to be useful in terms of accuracy and faithful in terms of model behaviour. In many cases, especially in the case of NMT models, attention is clearly useful and by and large it must be faithful. The question is whether we can measure and improve its faithfulness. It is more natural to have a gray-scale notion of faithfulness for evaluation (Jacovi and Goldberg, 2020b). Following this reasoning, we define $F(M)$, the faithfulness of attention heatmaps in model $M$, as:
$$F(M) = \frac{\#\,\text{output tokens that passed the stress tests}}{\#\,\text{output tokens}} \tag{1}$$
$F(M)$ is a number between 0 and 1 measuring the fraction of output tokens during inference which passed the stress tests, i.e., they changed in the presence of adversarial attention. This metric can also be regarded as a measure of the trust we can place in the attention heatmap to fully reflect the internal reasoning of the NMT model.
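The three stress tests and the token-level faithfulness score can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, attention weights are plain lists, no renormalization is applied after zeroing (the text does not specify one), and the repeated re-decoding that RandomPermute performs until the output changes is left to the caller.

```python
import random

def argmax(weights):
    """Index of the largest weight (first one on ties)."""
    return max(range(len(weights)), key=weights.__getitem__)

def zero_out_max(alpha):
    """ZeroOutMax: remove attention from the most-attended source position."""
    perturbed = list(alpha)
    perturbed[argmax(alpha)] = 0.0
    return perturbed

def uniform(alpha):
    """Uniform: give every source position the same weight 1/m."""
    m = len(alpha)
    return [1.0 / m] * m

def random_permute(alpha, rng, max_tries=100):
    """RandomPermute: shuffle the weights so that the argmax position moves.

    Returns None when no such permutation is found, e.g. for a single
    source token or when all weights are equal.
    """
    m_t = argmax(alpha)
    for _ in range(max_tries):
        perturbed = list(alpha)
        rng.shuffle(perturbed)
        if argmax(perturbed) != m_t:
            return perturbed
    return None

def faithfulness(original_tokens, perturbed_tokens):
    """Fraction of output tokens that changed under a stress test."""
    changed = sum(o != p for o, p in zip(original_tokens, perturbed_tokens))
    return changed / len(original_tokens)
```

In use, a decoder would be run once with the learned weights and once with the perturbed weights at each time step, and the two token sequences passed to `faithfulness`.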

Approach
The conventional objective function in a sequence-to-sequence task is a cross-entropy loss $F_{acc}$:
$$F_{acc}(\theta) = -\sum_{(X,Y) \in S} \log p(Y \mid X; \theta) \tag{2}$$
where $S$ is the training data and $X$ and $Y$ are the source sentence and the correct translation, respectively. This training objective does not explicitly model the interpretability aspects (e.g. faithfulness) of the network, so they remain unoptimized during training.

Figure 2: Using the ZeroOutMax (zom), Uniform (uni), and RandomPermute (perm) stress tests, we generate adversaries to the attention weights, which the attention layer turns into a context vector. When adversarial attention weights are used, in a faithful model we expect the probability of the original output ($\hat{y}$) to drop significantly. We use this criterion to define a faithfulness objective function.

Faithfulness Objective
In an effort to develop a model that is right for the right reasons, Ross et al. (2017) change the loss function of their classifier to model both right answers and right reasons instead of only the former. They achieve this by introducing a regularizing term that tends to shrink irrelevant gradients. In a similar spirit, we change our objective to account for the NMT model's faithfulness as well as the cross-entropy score against the reference translations:
$$F(\theta) = F_{acc}(\theta) + \lambda_{faith} F_{faith}(\theta) \tag{3}$$
$F_{faith}$ is an additional component that rewards the model for having more faithful attention. The parameter $\lambda_{faith}$ regulates the trade-off between the faithfulness and accuracy objectives. Our proposed metric for faithfulness is calculated based on discrete changes in the output under adversarial attention. It is not differentiable and cannot simply be used as $F_{faith}$ to be optimized. Thus, we propose a novel differentiable objective which mimics faithful behavior, in the hope that it improves the discrete faithfulness metric. While here we intend to improve faithfulness using adversarial attention weights, it is important to note that making the model more robust is not our main goal. Robustness of a model is its resilience against adversarial input or small perturbations such as typos. Whether or not our approach results in a more robust model is a separate research question that we have not focused on.

Divergence-based Faithfulness Objective
Consider a predictive model $g_\theta$ in which an intermediate calculation is later employed to justify predictions:
$$\hat{y} = \arg\max_{y} g_\theta(y \mid x, IC(x)) \tag{4}$$
In NMT, $IC(x)$ would be the context vector calculated by the attention mechanism.
Hypothesis. If there exists an intermediate calculation $IC'(x)$ that conveys a contradictory post-hoc attention compared to $IC(x)$ yet still leads to the prediction $\hat{y}$, then $IC(x)$ cannot be regarded as faithful for predicting $\hat{y}$. If $IC(x)$ is faithful, we expect the model to diverge from predicting $\hat{y}$ when $IC'(x)$ is employed instead.
Based on our hypothesis, we propose a divergence-based objective which mimics the behavior of a faithful explanation under a stress test:
$$F_{faith}(\theta) = \log g_\theta(\hat{y} \mid x, IC'(x)) \tag{5}$$
This objective is a negative loss that should be minimized. The minimum of this objective is achieved when the probability of the original prediction approaches zero under the stress test, which is the ideal. Thus, it promotes a reduction in output probability under an adversarial intermediate calculation (Figure 2). It is worth noting that this objective can potentially be employed in any model whose outputs are modeled as soft probabilities and thus is not limited to NMT. To put the model under various stress tests, we manipulate the context vector during training by changing the attention weights and feed it to the decoder to calculate the probability. More precisely:
$$F_{faith}(\theta) = \lambda_{zom} \log g_\theta(\hat{y} \mid x, IC_{zom}(x)) + \lambda_{uni} \log g_\theta(\hat{y} \mid x, IC_{uni}(x)) + \lambda_{perm} \log g_\theta(\hat{y} \mid x, IC_{perm}(x)) \tag{6}$$
where $IC_{zom}$, $IC_{uni}$ and $IC_{perm}$ are the ZeroOutMax, Uniform and RandomPermute methods (see Sec. 2) for manipulating attention weights, respectively. The $\lambda_{\{method\}}$ parameters regulate the contribution of each objective. We use the term $F_{all}$ when all $\lambda_{\{method\}}$ in Eq. (6) are non-zero. Moreover, we use the term $F_{\{method\}}$ when $\lambda_{\{method\}}$ is set to 1 and the other regularization weights are zero.
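To make the shape of the combined objective concrete, the following Python sketch evaluates it for a single output token. It is an illustration under stated assumptions rather than the training code: the faithfulness term is assumed to be the log-probability of the model's original prediction under perturbed attention (consistent with the description that its minimum is reached as that probability approaches zero), probabilities are passed in as plain floats, and the function names are hypothetical.

```python
import math

def faithfulness_term(p_original_under_perturbation):
    """Assumed form: log-probability of the original prediction when the
    attention weights are perturbed. Lower is better; it is unbounded below
    as the probability approaches zero."""
    return math.log(p_original_under_perturbation)

def combined_objective(p_reference, p_perturbed, lambdas, lambda_faith=1.0):
    """Cross-entropy on the reference plus weighted faithfulness terms.

    p_reference: probability the model assigns to the reference token
                 with its normal attention weights.
    p_perturbed: dict mapping stress-test name to the probability of the
                 model's original prediction under that perturbation.
    lambdas:     dict mapping stress-test name to its weight.
    """
    f_acc = -math.log(p_reference)
    f_faith = sum(lambdas[m] * faithfulness_term(p_perturbed[m]) for m in lambdas)
    return f_acc + lambda_faith * f_faith

# Example weights for the three stress tests (zom, uni, perm).
lambdas = {"zom": 0.5, "uni": 0.375, "perm": 0.125}
faithful = combined_objective(0.8, {"zom": 0.1, "uni": 0.1, "perm": 0.1}, lambdas)
unfaithful = combined_objective(0.8, {"zom": 0.9, "uni": 0.9, "perm": 0.9}, lambdas)
```

With the same accuracy term, the model whose original prediction becomes unlikely under perturbed attention (`faithful`) receives the lower objective value, which is the behaviour the divergence-based term is meant to reward.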

On Attention Sparsity
Do the models trained with the faithfulness objective have sparser attention weights? Sharper attention in a model M might correlate with an intensified contribution of the most-attended source hidden state on the prediction resulting in higher faithfulness.
To measure sparseness of the attention, we take an average over the normalized entropy of the attention distribution for each output token during inference on the test data. We use normalized entropy, which lies in the range [0, 1], to account for the fact that the range of the entropy for each output token depends on the length of the corresponding source sentence:
$$\mathrm{Ent}(M) = \frac{1}{\sum_i |\hat{y}_i|} \sum_i \sum_{j=1}^{|\hat{y}_i|} \mathrm{NormEnt}(\alpha_{ij}) \tag{7}$$
$$\mathrm{NormEnt}(P) = -\frac{\sum_{k=1}^{n} P(k) \log P(k)}{\log n} \tag{8}$$
Here $\alpha_{ij}$ is the attention distribution for output token $j$ in the generated translation $\hat{y}_i$ of source sentence $i$, and $P$ is a discrete probability distribution over $n$ outcomes. In Eq. (8), low entropy indicates a sharper distribution.
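As a small worked illustration of the sparsity measure, the sketch below computes the normalized entropy of an attention distribution and averages it over output tokens. It assumes the usual normalization by the logarithm of the number of source positions, so that values fall in [0, 1]; the function names are hypothetical, and averaging over all tokens at once is one reading of the description above.

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def normalized_entropy(p):
    """Entropy divided by its maximum log(n), giving a value in [0, 1]."""
    n = len(p)
    if n < 2:
        return 0.0  # a single source position leaves nothing to spread over
    return entropy(p) / math.log(n)

def average_normalized_entropy(attention_rows):
    """Mean normalized entropy over the attention rows of all output tokens."""
    total = sum(normalized_entropy(a) for a in attention_rows)
    return total / len(attention_rows)
```

A uniform row scores 1 and a one-hot row scores 0, so lower averages indicate sharper (sparser) attention regardless of source sentence length.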
Attention Entropy Regularization. Alongside investigating the sparsity of the models trained with the faithfulness objective, we also train a model in which sparsity in attention is directly optimized. We use attention entropy regularization (Zhang et al., 2018):
$$F(\theta) = F_{acc}(\theta) + \lambda_{ent} \sum_{t} H(\alpha_t), \qquad H(\alpha_t) = -\sum_i \alpha_t[i] \log \alpha_t[i] \tag{9}$$
where the entropy of the attention weights is added to the cross-entropy loss (2) as a regularization term.

Datasets
We use the Czech-English (Cs-En) dataset from IWSLT2016 and the German-English (De-En) dataset from IWSLT2014. For the Czech-English dataset we use dev2010, tst2010, tst2011, tst2012, and tst2013 as the test data. For the German-English dataset we use dev2010, tst2010, tst2011, dev2012, and tst2012 as the test data. We use Moses (Koehn et al., 2007) to tokenize the datasets.

Architecture and Hyperparameters
We use OpenNMT (Klein et al., 2017) as our translation framework. We employ a 2-layer LSTM-based encoder-decoder (Sutskever et al., 2014; Cho et al., 2014) model with global attention (Luong et al., 2015). The dimension of the hidden states and of the word embeddings for both source and target languages is set to 500. The vocabulary size for both the source and target language is set to 50000. We remove sentences with more than 50 tokens from the training data. We use Adam (Kingma and Ba, 2014) for training our models and we set the learning rate to 0.001. Models are trained until convergence. Our models have around 82M parameters. We optimize the hyperparameters of our models using the validation set. The baseline model is trained using Eq. (2) and we call it $F_{baseline}$. $\lambda_{ent}$ in Eq. (9) is set to 0.04. We refer to the objective as $F_{all}$ when $\lambda_{zom}$, $\lambda_{uni}$, and $\lambda_{perm}$ are set to 0.5, 0.375, and 0.125, respectively. $\lambda_{faith}$ is set to 1.

Training Difficulties
Our first attempts at using the modified objective function in Eq. (3) trained poorly. We observed that it was difficult for the model to learn the faithfulness constraint without having already learned to assign a reasonable probability to correct translations. To address this problem, we first train the NMT model using the standard unmodified objective function and then fine-tune this trained model by switching the objective function to Eq. (3).
One caveat is that the value of the faithfulness loss can be arbitrarily large and interfere with learning, because the cross-entropy error can diverge to infinity.

Results and Discussion

Impact on Faithfulness
To measure the effectiveness of the proposed objectives, we choose the model with the highest faithfulness among those whose BLEU score is within 0.5 of the maximum BLEU score achieved on the validation set. The reason is that we prefer a model that is both accurate and has faithful attention-based explanations. Table 1 shows the performance of the different faithfulness objective functions when generating content words and function words across different attention manipulation methods in the Czech-English (Cs-En) and German-English (De-En) datasets.
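The selection rule above can be sketched in a few lines of Python. The checkpoint records, field names and numbers below are hypothetical and only illustrate the rule: keep the checkpoints within a BLEU margin of the best validation BLEU, then take the most faithful one.

```python
def select_checkpoint(checkpoints, bleu_margin=0.5):
    """Pick the most faithful checkpoint within `bleu_margin` of the best BLEU."""
    best_bleu = max(c["bleu"] for c in checkpoints)
    eligible = [c for c in checkpoints if c["bleu"] >= best_bleu - bleu_margin]
    return max(eligible, key=lambda c: c["faithfulness"])

# Hypothetical validation results for three checkpoints.
checkpoints = [
    {"step": 1000, "bleu": 27.0, "faithfulness": 0.70},
    {"step": 2000, "bleu": 27.4, "faithfulness": 0.85},
    {"step": 3000, "bleu": 26.0, "faithfulness": 0.95},
]
chosen = select_checkpoint(checkpoints)
```

Here the third checkpoint is the most faithful but falls more than 0.5 BLEU below the best score, so the rule selects the checkpoint at step 2000.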
Results indicate that the proposed divergence-based objective is effective in increasing the faithfulness metric. $F_{all}$ is the most effective objective for increasing faithfulness when all stress tests are included in Eq. (1). When using $F_{all}$, faithfulness of attention-based explanations for content words increases from 78% to 89%, while that for function words increases from 33% to 82% (see the All column in Table 1). In the De-En dataset, the corresponding increases are from 76% to 89% for content words and from 32% to 86% for function words. These results establish the effectiveness of our proposed objectives in increasing the faithfulness metric.
It is worth noting that the increase in faithfulness of attention-based explanations for function words is much larger than that for content words. This can be attributed to the fact that function words are mostly generated using the target-side information in the decoder (Tu et al., 2017; Moradi et al., 2019), so manipulating attention does not have much effect on generating them. However, our proposed faithfulness objective ($F_{faith}$) seems to tighten the dependence of the decoder on the attention component. This results in a much larger increase in faithfulness for function words than for content words. We also plot faithfulness over different checkpoints in Figure 3. It indicates that progress in faithfulness is much faster for function words than for content words.

Effect of Training With Single Adversary on Passing Other Stress Tests
An interesting observation in Table 1 is that training with an adversary has positive effects on the model for passing stress tests from other types of adversaries. As an example, in Table 1

POS-tag Analysis
In addition to categorizing tokens into function and content words, we also analyze the effect of our proposed objective within different universal part-of-speech (POS) tags (Petrov et al., 2012) in Table 2.
Our proposed objective increases faithfulness for each POS tag in both of our datasets. Tokens with less lexical meaning are the ones affected the most, as explained in Sec. 5.1. As expected, the punctuation (PUNC) and particle (PRT) tags benefit the most from the increase in faithfulness. Interestingly, numbers (NUM tag) have the lowest increase in faithfulness. One reason might be that they already had a high initial faithfulness, which makes a further increase less likely.

Regularization Effect
The model checkpoints used in Table 1 were selected based on maximum increase in faithfulness without sacrificing accuracy. To investigate whether the proposed objective can have a general positive side effect in terms of accuracy, we train three independent models using the $F_{baseline}$ and $F_{all}$ objectives.
To make the comparison fair, we also give the baseline model additional training steps, to isolate the benefit of adding the faithfulness objective.
Table 3: BLEU score of the baseline and the model trained with $F_{all}$. Pairwise bootstrap resampling (Koehn, 2004) resulted in a p-value < 0.01, which indicates the statistical significance of the observed difference.
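For readers unfamiliar with the significance test mentioned in the Table 3 caption, the following Python sketch shows the general idea of paired bootstrap resampling. It is a simplified stand-in rather than the evaluation script used here: it assumes per-sentence quality scores that are averaged, whereas BLEU is computed at the corpus level, and the function name and defaults are hypothetical.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system B does not beat system A.

    scores_a, scores_b: per-sentence scores of the two systems on the same
    test sentences (same order). The same resampled indices are used for
    both systems, which is what makes the test paired.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    b_not_better = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_b <= mean_a:
            b_not_better += 1
    return b_not_better / n_samples
```

The returned fraction is commonly read as an approximate p-value for the claim that system B is better than system A; a value below 0.01 would correspond to the significance level reported in the caption.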
Improved BLEU scores for the faithful model can be due to two reasons: 1) the faithfulness objective can be seen as a regularization term which prevents the model from relying too much on the target-side context and the implicit language model in the decoder, which increases the contribution of attention to the decoder and reduces some bias in the model; 2) penalizing the model for the lack of connection between justification and prediction forces the model to learn better translations by forcing it to justify each output in a right-answer-for-the-right-reason paradigm. Figure 4 shows some examples of how our proposed model can produce better translations.

src: sie drängten wasser aus dem land heraus und hinaus in den fluss
ref: they pushed water off the land and out into the river
base: they kept running water from the land and out in the river
ours: they pushed water out of the country and out in the river .

src: anstatt hunderte von kilometern entfernt im norden
ref: instead of hundreds of miles away in the north
base: instead of hundreds of miles away from north america
ours: instead of hundreds of miles away from north
Figure 4: These examples show some cases where the more faithful model trained using our faithfulness objective produces better translations compared to the baseline model. In each of these cases, perturbing the attention weights has no effect on the baseline model output. The faithful model is able to focus on the source side when needed in order to produce a more accurate translation.

Do the New Models Have Sparser Attention?
Table 4 shows the average entropy and average normalized entropy for the baseline, the proposed model ($F_{all}$), and the model trained with attention entropy regularization, respectively. Evidently, the proposed model has not increased sparsity. On the other hand, attention entropy regularization has been very effective in making attention weights sparser. But Table 1 indicates that attention entropy regularization has not been effective in increasing faithfulness. This suggests that sharper attention weights only affect the context vector and do not contribute to increased dependence of the decoder on attention.

Related Work
[...] Vig and Belinkov, 2019; Clark et al., 2019), evaluating attention as an interpretability approach has garnered a lot of interest. From the faithfulness perspective, Jain and Wallace (2019) and Serrano and Smith (2019) show that for instances in a data set there can be adversarial attention heatmaps that do not change the output of the text classifier. In other words, adversarial attention leads to no decision flip in each instance. They use this to claim that attention heatmaps are not to be trusted, i.e., that they are unfaithful. Wiegreffe and Pinter (2019) argue against per-instance modifications at test time for two reasons: 1) in classification tasks attention may not be useful, so perturbing attention is misleading. This is not true for NMT since attention is very useful in NMT. 2) they train an adversarial attention model (e.g. uniform attention) chosen to produce attention weights distant from the original attention weights while at the same time trying to minimize classification error. They show that such adversarial attention models are not as accurate as models with attention. In our work we acknowledge that attention is useful and faithful to some extent and we aim to improve faithfulness of NMT models.
While most of these works provide evidence that attention weights are not always faithful, Moradi et al. (2019) confirm similar observations on the unfaithful nature of attention in the context of NMT models. Li et al. (2020) is one of the few papers examining attention models in NMT. However, they are focused on the task of identifying relevant source words to explain the output translations selected by the NMT model. They look for optimal proxy models that agree with the NMT model such that the relevant source words picked as an explanation by a proxy model exhibit similar behaviour to the target model. They use the notion of fidelity over proxy models and evaluate several alternative proxy models using empirical risk minimization. Attention weights are evaluated alongside other proxy models for this task. In contrast, our work is about improving the faithfulness of NMT models and we focus on the internal state of the NMT model rather than proxy models. They use human references, e.g. AER, for evaluating fidelity. As discussed earlier, evaluation of faithfulness cannot involve human judgements or reference data. It is possible that our faithful NMT models are also better at fidelity, but that is an open question.
While prior works have mostly failed to explicitly distinguish faithfulness from plausibility in their arguments, Jacovi and Goldberg (2020a,b) focus on formalizing faithfulness and on addressing the evaluation of faithfulness separately from plausibility, respectively. Subramanian et al. (2020) have investigated the concept of faithfulness in neural modular networks (NMN), which are employed for modeling compositionality. They question whether the structure of the network modules faithfully describes the true abstract reasoning of the model. Similar to us, they attempt to quantify faithfulness and improve upon it. However, their contributions, such as training with an auxiliary atomic-task supervision for improved faithfulness, are specific to the context of NMNs. Pruthi et al. (2020) demonstrate that it is possible to train a model that produces a deceptive attention mask, questioning the use of attention weights as explanation from the fairness and accountability perspective. Alvarez-Melis and Jaakkola (2018) investigate interpretability methods from the robustness perspective. They attempt to quantify robustness and show that current interpretability methods cannot be considered robust.
Sparsity For Improved Interpretability. This line of work suggests making attention sparser so that the most contributing input word is more distinguishable from other input words. Martins and Astudillo (2016) and Malaviya et al. (2018) propose sparse but differentiable alternatives to the softmax function for calculating attention weights, while Zhang et al. (2018) propose sparsity regularization terms such as entropy regularization to promote sparsity in the attention.
Regularizing Explanations. Ross et al. (2017) augment the loss function of their classification model with an explanation objective to constrain input gradient explanations. Rieger et al. (2019) follow a similar spirit but use contextual decomposition (Murdoch et al., 2018) to extract explanations offered by the model. Aligning attention (as explanation) with prior knowledge has also been extensively studied. This prior knowledge can include alignment data (Mi et al., 2016), human rationales (Zhong et al., 2019), or even structural biases (Cohn et al., 2016).
Inherently Interpretable Neural Models. Contrary to post-hoc explanation methods for interpreting a neural model, Stahlberg et al. (2018) show that the NMT model can be made self-explanatory by training it to produce the discrete decisions made by the model (from which the translations can be extracted later). In other work, Lei et al. (2016) and Bastings et al. (2019) propose models in which a rationale is first selected from the input and then used for prediction.

Conclusion
We proposed a method for quantifying faithfulness of NMT models. To optimize faithfulness, we defined a novel objective function that rewards faithful behavior through probability divergence. We also showed that the additional constraint in the training objective for NMT does not harm translation quality, and in some cases we see better translations, presumably due to the regularization effect of our faithfulness objective.
Future Work. We aim to investigate and improve faithfulness of attention-based explanations in more sophisticated attention models such as Transformers (Vaswani et al., 2017). We can generalize our approach by designing explanatory modules in NMT through functionality separation (alignment, reordering, etc.) instead of relying only on attention. We also plan to investigate whether faithful models can be more useful for copy models and other applications of attention heatmaps in NMT.