Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining

To explain NLP models, a popular approach is to use importance measures, such as attention, which inform which input tokens are important for making a prediction. However, an open question is how accurately these explanations reflect a model's logic, a property called faithfulness. To answer this question, we propose Recursive ROAR, a new faithfulness metric. It works by recursively masking allegedly important tokens and then retraining the model. The principle is that this should result in worse model performance compared to masking random tokens. The result is a performance curve given a masking ratio. Furthermore, we propose a summarizing metric using relative area-between-curves (RACU), which allows for easy comparison across papers, models, and tasks. We evaluate 4 different importance measures on 8 different datasets, using both LSTM-attention models and RoBERTa models. We find that the faithfulness of importance measures is both model-dependent and task-dependent. This conclusion contradicts previous evaluations in both computer vision and faithfulness of attention literature.


Introduction
The ability to explain neural networks benefits both accountability and ethics when deploying models (Doshi-Velez et al., 2017) and helps develop a scientific understanding of what models do (Doshi-Velez and Kim, 2017). Particularly, in NLP, attention (Bahdanau et al., 2015) is often used as an explanation to provide insight into the logical process of a model (Belinkov and Glass, 2019). Such explanations indicate which input tokens are relevant for a given prediction. This type of explanation is called an importance measure.
A major challenge in the field of interpretability is ensuring that an explanation is faithful: "a faithful interpretation is one that accurately represents the reasoning process behind the model's prediction" (Jacovi and Goldberg, 2020). Unfortunately, importance measures that are claimed to have strong theoretical foundations and are widely used in practice (Bhatt et al., 2019) often later turn out to be questionable (Hooker et al., 2019; Kindermans et al., 2019; Adebayo et al., 2018; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019).
Accurately measuring if an explanation is faithful is therefore paramount. Such faithfulness metrics are difficult to develop as the models are too complex to know what the correct explanation is. Doshi-Velez and Kim (2017) say a faithfulness metric should use "some formal definition of interpretability as a proxy for explanation quality." In this work, we use the definition of faithfulness by Samek et al. (2017) and Hooker et al. (2019): if information (input tokens) is truly important, then removing it should result in a worse model performance compared to removing random information (tokens). We build upon the ROAR metric by Hooker et al. (2019), which adds that it is necessary to retrain the model after information is removed, to avoid out-of-distribution issues. Finally, the model performance is compared with removing random information.
A limitation of ROAR is that it is theoretically impossible to measure the faithfulness of an importance measure when dataset redundancies exist. For example, if two tokens are equally relevant but only one of them is identified as important, ROAR fails to remove the second token.
We propose Recursive ROAR, which solves this limitation. In addition to the Recursive ROAR metric, we introduce a summarizing metric (RACU) which aggregates the results into a scalar metric.
We hope that such a metric will make it more feasible to compare importance measures across papers.
Using the proposed faithfulness metrics, we perform a comprehensive comparative study of 4 different importance measures and two popular architectures: BiLSTM-Attention and RoBERTa (Liu et al., 2019). We use 8 different datasets which are commonly used in the faithfulness of attention literature (Jain and Wallace, 2019).
Our comparative study reveals that no importance measure is consistently better than others. Instead, we find that faithfulness is both task and model dependent. This is valuable knowledge, as although each importance measure might be equal in faithfulness, they are not equal in computational requirements or understandability to humans.
In particular, we find that attention generally provides more sparse explanations than gradient or integrated gradient. Although their faithfulness may be the same, a sparser explanation is often easier for humans to understand (Miller, 2019).
Computationally speaking, integrated gradient is approximately 50 times more expensive than the gradient method. This additional complexity is usually justified by being considered more faithful than gradient. However, our results indicate that this is rarely a worthwhile trade-off.

Related Work
Much recent work in NLP has been devoted to investigating the faithfulness of importance measures, particularly attention. In this section, we categorize these faithfulness metrics according to their underlying principle and discuss their drawbacks. ROAR (Hooker et al., 2019) and our Recursive ROAR metrics differ significantly from these approaches.
The works on attention are all based on the BiLSTM-Attention models and datasets from Jain and Wallace (2019); they are therefore highly comparable. We use the same models and datasets, while also analyzing RoBERTa.

Comparing with alternative importance measures
The idea is to compare attention with an alternative importance measure, such as gradient. The claim is that a correlation between the two would validate attention's faithfulness. Jain and Wallace (2019) specifically compare with the gradient method and the leave-one-out method. Meister et al. (2021) repeat this experiment in a broader context.
Both Jain and Wallace (2019) and Meister et al. (2021) find that there is little correlation between importance measures and interpret this as attention being not faithful.
Jain and Wallace (2019) do acknowledge the limitations of this approach, as the alternative importance measures are not themselves guaranteed to be faithful. A correlation, or lack of correlation, does therefore not inform about faithfulness. This is a criticism that we agree with and highlight here.

Mutate attention to deceive
Jain and Wallace (2019) propose that if there exist alternative attention weights that produce the same prediction, attention is unfaithful.
They implement this idea by directly mutating the attention such that there is no prediction change but a large change in attention and find that alternative attention distributions exist. Vashishth et al. (2019) and Meister et al. (2021) apply a similar method and achieve similar results.
Wiegreffe and Pinter (2019) find this analysis problematic because the attention distribution is changed directly, thereby creating an out-of-distribution issue. This means that the new attention distribution may be impossible to obtain naturally from just changing the input, and it therefore says little about the faithfulness of attention.

Optimize model to deceive
Because the mutate attention to deceive approach has been criticized for using direct mutation, an alternative idea is to learn an adversarial attention.
Wiegreffe and Pinter (2019) investigate maximizing the KL-divergence between normal attention and adversarial attention while minimizing the prediction difference between the two models. By varying the allowed prediction difference, they show that it is not possible to significantly change the attention weights without affecting performance. Importantly, Wiegreffe and Pinter (2019) only use this experiment to invalidate the mutate attention to deceive experiments, not to measure faithfulness. However, Meister et al. (2021) do use this experiment setup as a faithfulness metric. Pruthi et al. (2020) perform a similar analysis but report a contradictory finding. They find it is possible to significantly change the attention weights without affecting performance. They use this to show that attention is not faithful.
We find this approach problematic because by changing the optimization criteria the analysis is no longer about the standard BiLSTM-Attention model (Jain and Wallace, 2019), which is the subject of interest. Therefore, this analysis only works as a criticism of the mutate attention to deceive approach, not as an evaluation of faithfulness.

Known explanations in synthetic tasks
Arras et al. (2022) construct a purely synthetic task, where the true explanation is known. Evaluating importance measures against this true explanation serves as the faithfulness metric. Unfortunately, this approach cannot be used on real datasets and assumes a well-behaved model. Bastings et al. (2021), a concurrent work to ours, therefore introduce spurious correlations into real datasets, creating partially synthetic tasks. They then evaluate if importance measures can detect these correlations. They conclude, similar to us, that faithfulness is both model and task-dependent.
We believe that this approach is the most valid among the metrics mentioned in this section. However, model behavior, and thereby the explanation behavior, can be drastically different on observations with spurious correlations from those without. This method is therefore limited in scope as it can only evaluate if the importance measure can be used to detect known spurious correlations.

ROAR: RemOve And Retrain
To address the shortcomings of the current faithfulness measures as described in Section 2, we base our metric on ROAR (Hooker et al., 2019).
ROAR has been used in computer vision to evaluate the faithfulness of importance measures and to a limited extent in NLP (Pham et al., 2021). The central idea is that if information is truly important, then removing it from the dataset and retraining a model on this reduced dataset should worsen model performance. This can then be compared with an uninformative baseline, where information is removed randomly.
For example, at a step size of 10%, one can remove the top-{10%, 20%, ..., 90%} allegedly important tokens, evaluate the model performance, and compare this with removing {10%, 20%, ..., 90%} random tokens. If the importance measure is faithful, the former should result in a worse model performance than the latter.
This section covers how ROAR is adapted to an NLP context. Furthermore, we explain the dataset redundancy issue which is solved by our proposed Recursive ROAR metric. Finally, we show that Recursive ROAR is an improvement on ROAR using a synthetic task.

Adaptation to NLP
ROAR was originally proposed as a faithfulness metric in computer vision. In this context, pixels measured to be important are "removed" by replacing them with an uninformative value, such as a gray pixel (Hooker et al., 2019).
In this work, ROAR is applied to sequence classification tasks. Because these models use tokens, the uninformative value is a special [MASK] token (example in Figure 1). We choose a [MASK] token rather than removing the token to keep the sequence length, which is an information source unrelated to importance measures.
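The masking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `mask_top_tokens`, the rounding rule for the number of tokens, and the tie-breaking by position are all assumptions made for the example.

```python
MASK = "[MASK]"

def mask_top_tokens(tokens, importance, ratio):
    """Replace the top `ratio` share of tokens (by importance) with [MASK].

    The sequence length is preserved, so length stays available to the model.
    Rounding and tie-breaking are illustrative assumptions.
    """
    n_mask = round(ratio * len(tokens))
    # Indices sorted from most to least important (stable for ties).
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    chosen = set(order[:n_mask])
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
```

Replacing rather than deleting tokens means the output always has the same length as the input, which matches the motivation given above for using a [MASK] token.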

Recursive ROAR
With ROAR there are two conclusions: either 1) the importance measure is to some degree faithful, or 2) the faithfulness is unknown. The former is observed when the model's performance is statistically significantly below the random baseline. In the latter case, Hooker et al. (2019) explain that the importance measure can either be not faithful or there can be a dataset redundancy. Recursive ROAR solves this redundancy issue and thereby provides a more informative conclusion.
A dataset redundancy affects the conclusion because the model does not need to use the redundant information. A faithful importance measure would therefore not highlight redundancies as important. After the important information which the importance measure did highlight is removed and the model is retrained, the redundant information can still keep the model's performance high. An example of this issue is demonstrated in Figure 1.
We solve this issue by recursively recomputing the importance measure at each iteration of information removal. This way, if the importance measure is faithful, it would quickly mark the redundant information as important, after which it would be removed. Note that already masked tokens are kept masked. We call this Recursive ROAR and provide an example in Figure 2. Note that Recursive ROAR might not remove all redundancies unless the step size is one token. However, because ROAR requires retraining the model for every evaluation step, this is infeasible. Instead, we approximate it by removing a relative number of tokens. We discuss this more in Appendix F.
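The recursive procedure can be sketched as a loop that retrains, evaluates, recomputes the importance measure on the already-masked data, and then masks further tokens. This is a hedged sketch under stated assumptions: `train`, `importance`, and `evaluate` are hypothetical callables supplied by the user, the default of 10 steps of 10% is taken from the step size used in this paper, and masking both the training and test split with gold-label importance scores is our reading of the text.

```python
MASK = "[MASK]"

def mask_more(model, importance, examples, originals, ratio):
    """Raise each example's masked share to `ratio`, keeping earlier masks."""
    result = []
    for (tokens, label), (orig_tokens, _) in zip(examples, originals):
        target = round(ratio * len(orig_tokens))
        already = sum(tok == MASK for tok in tokens)
        # Importance is recomputed on the current (partly masked) input.
        scores = importance(model, tokens, label)
        unmasked = [i for i, tok in enumerate(tokens) if tok != MASK]
        unmasked.sort(key=lambda i: scores[i], reverse=True)
        new_tokens = list(tokens)
        for i in unmasked[:max(0, target - already)]:
            new_tokens[i] = MASK
        result.append((new_tokens, label))
    return result

def recursive_roar(train_set, test_set, train, importance, evaluate,
                   step=0.1, n_steps=10):
    """Return a list of (masking ratio, performance) pairs."""
    cur_train, cur_test = list(train_set), list(test_set)
    curve = []
    for k in range(n_steps + 1):
        model = train(cur_train)  # retrain from scratch at every ratio
        curve.append((k * step, evaluate(model, cur_test)))
        if k < n_steps:
            ratio = (k + 1) * step
            cur_train = mask_more(model, importance, cur_train, train_set, ratio)
            cur_test = mask_more(model, importance, cur_test, test_set, ratio)
    return curve
```

Non-recursive ROAR would instead compute the importance scores once, with the model trained on unmasked data, and reuse that ranking at every ratio. The random baseline can be obtained from the same loop by passing an `importance` callable that returns random scores.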

Validation on a synthetic problem
To show that Recursive ROAR provides an optimal faithfulness metric, we validate it on the same generated synthetic problem (with input x and output y) presented in the original ROAR paper (Hooker et al., 2019): Quoting Hooker et al. (2019): "All random variables were sampled from a standard normal distribution. The vectors a and d are 16 dimensional vectors that were sampled once to generate the dataset. In a only the first 4 values have nonzero values to ensure that there are exactly 4 informative features. The values z, η, and ε are sampled independently for each example." The ground truth removal order is to remove the first 4 features (the specific order does not matter) followed by the remaining irrelevant features. Note that these first 4 features are mutually redundant.
In model as the importance measure and apply ROAR and Recursive ROAR using this explanation.
Figure 3 shows that Recursive ROAR is identical to the ground truth, while ROAR is worse.

Importance Measures
In this section, we describe the importance measures that will be evaluated. We choose these explanations as they are common and computationally feasible to evaluate on every observation.
As attention does not attend to the beginning-of-sequence token, end-of-sequence token, and auxiliary sequence in paired-sequence problems, these tokens are also not considered for other importance measures. This is to ensure a fair comparison.
Attention These are the attention weights of a BiLSTM-Attention model. We repeat the definitions in Appendix C.1.
While we also look at a transformer-based model, which also has internal attention mechanisms, such models do not provide one specific way to convert attention scores into an importance measure. There are proposals to turn the many attention heads into an importance measure (Abnar and Zuidema, 2020). However, these are computationally expensive and require knowing which layer to select. Performing this analysis is a standalone research topic which we will not answer.
Gradient Let the logits be denoted as f(x). Then the gradient explanation is ∇_x f(x), where x is a one-hot encoding of the input (Baehrens et al., 2010; Li et al., 2016). To reduce away the vocabulary dimension, we use an L2-norm.
Input times Gradient This explanation is x ⊙ ∇_x f(x). Note that because x is a one-hot encoding, only one element per token will be non-zero. This non-zero element is considered as the explanation.
Integrated Gradient (IG) Sundararajan et al. (2017) argue, via axiomatic proofs, that this is more faithful than previous gradient-based methods. A disadvantage is that it is significantly more computationally intensive as it requires computing k gradients. We use k = 50 like the original paper (Sundararajan et al., 2017), and use b = 0 as is done in NLP literature (Mudrakarta et al., 2018):
\mathrm{IG}(x) = (x - b) \odot \frac{1}{k} \sum_{i=1}^{k} \nabla_{\tilde{x}_i} f(\tilde{x}_i), \qquad \tilde{x}_i = b + \frac{i}{k}(x - b) \qquad (2)
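To make the three gradient-based measures concrete, the sketch below computes them for a toy model with an analytic gradient. Everything about the model is an assumption for illustration: the scalar logit f(x) = Σ_t w · tanh(x_t E), the names `logit` and `grad_logit`, and the parameters `E` and `w` are hypothetical stand-ins for a real network and automatic differentiation. Integrated Gradient uses the standard k-step Riemann approximation with a zero baseline, and summing over the vocabulary dimension to obtain one score per token is also an assumption.

```python
import numpy as np

def logit(x, E, w):
    """Toy scalar logit f(x) = sum_t w . tanh(x_t E) (hypothetical model)."""
    return float((np.tanh(x @ E) @ w).sum())

def grad_logit(x, E, w):
    """Analytic gradient of the toy logit w.r.t. x, shape (T, V)."""
    h = np.tanh(x @ E)
    return ((1.0 - h ** 2) * w) @ E.T

def gradient_importance(x, E, w):
    """L2-norm over the vocabulary dimension, one score per token."""
    return np.linalg.norm(grad_logit(x, E, w), axis=1)

def input_times_gradient(x, E, w):
    """x is one-hot, so the sum picks the single non-zero element per token."""
    return (x * grad_logit(x, E, w)).sum(axis=1)

def integrated_gradient(x, E, w, k=50, b=None):
    """Riemann approximation with k gradient evaluations and baseline b."""
    b = np.zeros_like(x) if b is None else b
    avg = sum(grad_logit(b + (i / k) * (x - b), E, w)
              for i in range(1, k + 1)) / k
    return ((x - b) * avg).sum(axis=1)
```

The cost difference discussed in this paper is visible directly: `integrated_gradient` calls `grad_logit` k times, whereas the other two measures call it once.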

Experiments
The datasets, performance metrics, and the BiLSTM-Attention model are identical to those used in Jain and Wallace (2019) and most other literature evaluating the faithfulness of attention. In addition, we use the RoBERTa-base model with the standard fine-tuning procedure (Liu et al., 2019). Details are in Appendix C. We report model performance on the 8 studied datasets in Table 1. Below, we provide a short description of each dataset. We provide additional details in Appendix B.

Supporting experiments
In Appendix G, we compare ROAR and Recursive ROAR. These results show dataset redundancies interfere with ROAR. For example, with the Diabetes dataset, only by using Recursive ROAR can gradient be measured to be faithful.
Table 1: Model performance scores and sequence length for each dataset. Performance is averaged over 5 seeds with a 95% confidence interval. Following Jain and Wallace (2019), we report performance as macro-F1 for SST, IMDB, Anemia and Diabetes, micro-F1 for SNLI, and accuracy for bAbI. The average sequence length is for the BiLSTM-Attention model; for the RoBERTa model the number will be higher but with inputs truncated at 512 tokens.
In Appendix F, we avoid the approximation of removing a relative number of tokens at 10% increments by instead removing exactly one token in each iteration. These results show that the approximation does affect the results, but not the conclusions that can be drawn from the results.
In Appendix E, we report the sparsity of each importance measure and find that attention is significantly more sparse than other importance measures. If the faithfulness is equal, this may make it more desirable as sparse explanations are more understandable to humans (Miller, 2019).

Main experiment: Recursive ROAR
To evaluate the faithfulness of importance measures, we apply Recursive ROAR to all datasets and both models. The results are presented in Figure 4 and discussed in Section 6.
In Appendix D, we report the compute times. Because BiLSTM-Attention is a small model and RoBERTa-base is only fine-tuned, Recursive ROAR is feasible when the importance measure can be evaluated on every observation. For some importance measures, like SHAP (Lundberg and Lee, 2017), which have exponential compute complexity, ROAR would not be feasible. Additionally, for large language models, like T5 (Raffel et al., 2020), ROAR would also be difficult to apply as fine-tuning these models is generally challenging.
Performance is averaged over 5 seeds with a 95% confidence interval.

How to interpret
If the model performance of a given importance measure is below the random baseline, then this indicates a faithful importance measure. Note that "faithful" is not absolute; rather, we measure the degree of faithfulness. However, if the model performance is not statistically significantly below the random baseline, then the importance measure is not considered to be faithful. With the non-recursive ROAR metric, this latter case would be inconclusive as the faithfulness could be hidden by dataset redundancies.
Figure 4 also presents the model performance at 100% masking, which provides a lower bound for the model performance and is helpful as the datasets are often biased. These biases come from unbalanced classes or the secondary sequence for the paired-sequence tasks (Gururangan et al., 2018). For these datasets, sequence-length bias is not a concern (Appendix B.3).

Summarizing faithfulness metric
While a ROAR plot can provide valuable insights, such as "this importance measure is only faithful for the top-20% most important tokens," it does not summarize the faithfulness to a scalar metric. Such a metric is useful as it allows for easy comparisons, particularly between different papers.
To provide a scalar metric, we propose using a relative area-between-curves (RACU) metric. Intuitively, an importance measure is more faithful if it has a larger area between the random baseline curve and the importance measure curve. Additionally, when the importance measure is above the random baseline, a negative area is contributed. Finally, the metric is normalized by an upper bound, where the performance at 100% masking is achieved immediately. A visualization of this calculation can be seen in Figure 5.
Using an area-between-curves is useful because, unlike many other summarizing statistics, it is invariant to the step size used in ROAR. In this case, we have a step size of 10%. Future work may choose a smaller or larger step size depending on their computational resources.
Let r_i be the masking ratio at step i out of I total steps, in our case r = {0%, 10%, ..., 100%}. Let p_i be the model performance for a given importance measure and b_i be the random baseline performance. With this, the metric is defined in (3), and we present the results in Table 2.
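The prose description of RACU can be turned into a short computation. The sketch below is our reading of that description rather than a transcription of (3): the function name `racu`, the use of the trapezoidal rule for both areas, and taking the last baseline value as the performance at 100% masking are assumptions.

```python
import numpy as np

def racu(ratios, measure_perf, baseline_perf):
    """Relative area between the baseline curve and the importance-measure curve.

    ratios:        masking ratios r_i, increasing from 0 to 1
    measure_perf:  performance p_i when masking by the importance measure
    baseline_perf: performance b_i when masking random tokens
    """
    r = np.asarray(ratios, dtype=float)
    p = np.asarray(measure_perf, dtype=float)
    b = np.asarray(baseline_perf, dtype=float)
    floor = b[-1]                 # performance at 100% masking
    gap = b - p                   # positive where the measure beats random
    bound = b - floor             # gap if the floor were reached immediately
    dr = np.diff(r)
    area = np.sum(dr * (gap[:-1] + gap[1:]) / 2.0)       # trapezoidal rule
    upper = np.sum(dr * (bound[:-1] + bound[1:]) / 2.0)
    return area / upper
```

Under this reading, a curve identical to the random baseline scores 0, a curve sitting at the 100%-masking performance throughout scores 1, and a curve above the baseline scores below 0. If the baseline itself never rises above the 100%-masking performance, `upper` is zero and the ratio is undefined, which corresponds to the instability discussed in the Limitations section.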

Important Findings
Based on the results in Figure 4 and Table 2, we highlight the following important findings.
Faithfulness is model-dependent. In particular, the faithfulness with SNLI is highly model-dependent as seen in Table 2. Furthermore, comparing the faithfulness between the two models, the faithfulness of Gradient on IMDB and Integrated Gradient on bAbI-3 is significantly affected by the model architecture.
Faithfulness is task-dependent. For BiLSTM-Attention, in Table 2, Attention is best for SNLI while Input times Gradient and Integrated Gradient are best for SST. For RoBERTa, Integrated Gradient is best for IMDB and SNLI, while Gradient is best for bAbI-1 and bAbI-2. In fact, Integrated Gradient is worst in all bAbI tasks.
Attention can be faithful. In Table 2, Attention is among the top explanations in terms of faithfulness, except for SST. This contradicts many of the previous results mentioned in Section 2, which found attention to be unfaithful.
Because attention is computationally free and attention is more sparse (Appendix E), which is important for human understanding (Miller, 2019), attention can be an attractive explanation.
Integrated Gradient is not necessarily more faithful than Gradient or Input times Gradient. For BiLSTM-Attention, in Table 2, bAbI-2, bAbI-3, and SNLI have at least one gradient-based importance measure which is significantly more faithful than Integrated Gradient. For RoBERTa, we find the same for bAbI-2 and bAbI-3. These results contradict the claim that Integrated Gradient is theoretically superior (Sundararajan et al., 2017). This is a valuable finding, as Integrated Gradient is significantly more computationally expensive than other gradient-based importance measures.
Importance measures often work best for the top-20% most important tokens. In Figure 4, we observe that the largest drop tends to happen at about 10% or 20% tokens masked. This indicates that importance measures are best at ranking the most important tokens, while for less important tokens, they become noisy. This is particularly observed in bAbI for both models and Diabetes with the BiLSTM-Attention model.
Class leakage can cause the model performance to increase. Because the importance measures explain predictions of the target label, they can leak the target label when allegedly important tokens are masked.
Consider a sentiment classification task. If an importance measure indicates that the word "bad" is a strong indicator of negative sentiment, then in the next iteration "bad" would be masked in negative sentences. This means the presence of "bad" now leaks the true label (positive sentiment), which may increase the performance.
This issue is particularly observed with bAbI-3 using RoBERTa in Figure 4, where the performance increases slightly at 60% tokens masked. This issue affects both ROAR and Recursive ROAR (Appendix G). In fact, it likely affects most faithfulness metrics. However, Recursive ROAR can mitigate this issue to some extent. We discuss this more in Appendix A.

Conclusion
We show that Recursive ROAR is an improvement on ROAR. In a synthetic setting, Recursive ROAR matches the ground truth, while ROAR does not. Additionally, we argue why other faithfulness metrics may be either invalid or limited in scope.
We then use Recursive ROAR to measure the faithfulness of the most common importance measures, including attention. This is done on both recurrent and transformer-based neural models. In general, we find that the faithfulness of importance measures is both model-dependent and task-dependent. This means that no general recommendation can be made for NLP practitioners considering the current importance measures. Instead, it is necessary to measure the faithfulness of different importance measures given a task and a model.
Because Recursive ROAR works on real-world datasets and not just synthetic problems, we hope it can serve as a standardized benchmark for the faithfulness of importance measures in NLP.

Limitations
Recursive ROAR requires the model to be retrained. This means it is not possible to evaluate the faithfulness of a specific model instance; rather, we evaluate the faithfulness of the model architecture. The confidence intervals we provide then inform us about what can be statistically expected in terms of the faithfulness for a model instance.
The retraining dependence also means Recursive ROAR can only measure the faithfulness of a task-model combination that is feasible to train or fine-tune repeatedly and importance measures that are feasible to compute across the entire dataset.
A second category of limitation comes from the use of masking. In particular, if the dataset is heavily biased, then the performance at 100% masking will remain high. This can happen if, for example, the sequence length is a good predictor of the class. In principle, this means that no tokens are important. Therefore, we can't comment on the faithfulness of an importance measure in that context. In such a case, the faithfulness metric in (3) should become unstable (in theory division by zero, but in practice chaotic values) and result in a large confidence interval.
As discussed in the previous section, because the importance measures explain the target class, they can leak the class information when used to mask input features. This can make an importance measure appear less faithful than it actually is. However, this issue cannot make an importance measure appear more faithful than it is (see Appendix A for more discussion).
Furthermore, while we believe Recursive ROAR provides a useful metric for faithfulness, only measuring faithfulness is not enough for an explanation to be used in production settings (Doshi-Velez and Kim, 2017). In addition to faithfulness, one should also evaluate if the explanation is understandable to humans (known as human-groundedness). This is already being done to some extent but is a complex topic (Sen et al., 2020; Hase and Bansal, 2020; Prasad et al., 2021; González et al., 2021; Schuff et al., 2022; Lertvittayakumjorn and Toni, 2019; Nguyen, 2018).
Finally, Doshi-Velez and Kim (2017) argue that explanations should be tested with the final application in mind. Unfortunately, in deployment settings very little evaluation of any kind is done (Bhatt et al., 2019). However, we hope that this work can help establish a metric for faithfulness.

Impact Statement and Ethics
Interpretability itself is paramount to the ethical deployment of machine learning models, whether this is to proactively ensure that a model performs predictions that align with human values or to retroactively understand what went wrong in a model's prediction (Doshi-Velez and Kim, 2017; Doshi-Velez et al., 2017).
Providing misleading explanations can be potentially dangerous, as even wrong explanations can be very convincing. To prevent this we need accurate faithfulness metrics, which this paper hopes to provide. However, history has shown that it is notoriously difficult to develop principled faithfulness metrics (Jain and Wallace, 2019; Kindermans et al., 2019; Adebayo et al., 2018; Hooker et al., 2019).
It is always a possibility that a proposed faithfulness metric is flawed, including the one proposed here. If this is not caught it could lead to more misleading explanations. To prevent this, we try to be extra transparent about the limitations of the proposed faithfulness metric, as described in Section 8. In particular, we also advocate for testing an interpretability method in terms of the human-groundedness and application-groundedness before using it in production (Doshi-Velez and Kim, 2017).

A Explanation of class leakage
When importance measures are computed, it is the prediction of the gold label that is explained. For example, for the Gradient method, it is ∇_x f(x)_y that is computed, where x is the input and y is the gold label.
We want an importance measure for the correct label, as removing the tokens that are relevant for making a wrong prediction would help the performance of the model. If the gold label was not used, the faithfulness results would be affected by the model performance. As faithfulness and model performance should be unrelated, this is not a desired outcome.
This is a general issue with faithfulness metrics due to how importance measures are calculated in benchmark settings. This is an unfortunate gap between the benchmark setting and the practical setting where the gold label is unknown. Furthermore, it is rarely documented.
In ROAR and Recursive ROAR, this issue is expressed as an increase in the model performance. Intuitively, it should not be possible for the model performance to increase with more information removed compared to less. However, because the importance measures are w.r.t. the gold label, they can leak the gold label, which can increase the model performance.
Thought experiment. Consider the SST dataset, a binary sentiment classification task. Let's say that the "and" token has a spurious correlation with the positive label (there is some truth to this), although clearly the "and" token can appear in both negative and positive sentences.
For example, let's say that just using the "and" token provides a 60% accurate classification of positive labels. An importance measure would therefore highlight the "and" token as being important for the prediction of positive sentiment. Unfortunately, an importance measure might not consider the "and" token equally important for a negative sentiment (could be due to non-linearity). If all "and" tokens are removed from sentences with positive sentiment as the gold label, the existence of an "and" token is now a perfect predictor of negative sentiment. Hence, the model performance will increase (there will still be negative sentiment sentences without "and" tokens).
Assuming a faithful importance measure, in the next iteration of Recursive ROAR the "and" token would now be important for predicting negative sentiment and would be removed. However, this assumption is rarely completely justified, and there is also no guarantee that "and" is considered the most important for all observations. Finally, in the case where a relative number of tokens are masked, the removal of other tokens may leak the gold label.
General issue. As mentioned, the need to use the gold label is a general issue that likely extends beyond ROAR. However, because ROAR presents a more qualitative metric (Figure 4) where a curve can be observed to increase, this issue is more apparent. Had we just presented the summarizing metric (Table 2), as most faithfulness metrics do, the issue would have been hidden.

B Datasets
The datasets used in this work are listed below. All datasets are public works. No attempts have been made to identify any individuals. The use is consistent with their intended use and all tasks were already established by Jain and Wallace (2019).
The MIMIC-III dataset (Johnson et al., 2016) is an anonymized dataset of health records. To access this a HIPAA certification is required, which the first author has obtained. Additionally, the MIMIC-III data has not been shared with anyone else, including other authors of this paper.
Below, we provide more details on each dataset. In Table 3
bAbI (Weston et al., 2016) - A set of artificial text for understanding and reasoning. We use the first three tasks, which consist of questions answerable using one, two, and three sentences from a passage, respectively.

B.3 Class bias and sequence-length bias
Because Recursive ROAR masks tokens, the sequence length remains the same. At 100% masking, the only information the model has is the sequence length. To understand the relevance of the sequence length, we compare the 100% masking model performance with a basic class-majority classifier. The results in Table 4 show that the sequence length does not have much relevance. SNLI does show a significant difference, but this relates to the secondary sequence being a very good predictor on its own, not the sequence length (Gururangan et al

C Models

C.1 BiLSTM-Attention
The BiLSTM-Attention models, hyperparameters, and pre-trained word embeddings are the same as those from Jain and Wallace (2019). We repeat the configuration details in Table 5.
There are two types of models, single-sequence and paired-sequence; however, they are nearly identical. They differ only in how the context vector b is computed. In general, we refer to $x \in \mathbb{R}^{T \times V}$ as the one-hot encoding of the primary input sequence, of length $T$ and vocabulary size $V$. The logits are then $f(x)$ and the target class is denoted as $c$.

C.1.1 Single-sequence
A $d$-dimensional word embedding followed by a bidirectional LSTM (BiLSTM) encoder is used to transform the one-hot encoding into the hidden states $h_x \in \mathbb{R}^{T \times 2d}$. These hidden states are then aggregated using an additive attention layer, $h_\alpha = \sum_{i=1}^{T} \alpha_i h_{x,i}$. The attention weights $\alpha_i$ for each token are computed with additive attention, where $W$, $b$, $v$ are model parameters. Finally, $h_\alpha$ is passed through a fully-connected layer to obtain the logits $f(x)$.
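For reference, additive attention with the parameters $W$, $b$, $v$ is commonly written as below. This is a sketch of the standard form (Bahdanau et al., 2015) and is an assumption rather than a quotation of our exact formulation; the intermediate symbol $u_i$ is introduced only for illustration.

```latex
\begin{align}
u_i &= \tanh(W h_{x,i} + b), \\
\alpha_i &= \frac{\exp(v^\top u_i)}{\sum_{j=1}^{T} \exp(v^\top u_j)}, \\
h_\alpha &= \sum_{i=1}^{T} \alpha_i h_{x,i}.
\end{align}
```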

C.1.2 Paired-sequence
For paired-sequence problems, the two sequences are denoted as $x \in \mathbb{R}^{T_x \times V}$ and $y \in \mathbb{R}^{T_y \times V}$. The inputs are transformed to embeddings using the same embedding matrix, and then transformed using two separate BiLSTM encoders to get the hidden states, $h_x$ and $h_y$. Likewise, they are aggregated using additive attention, $h_\alpha = \sum_{i=1}^{T_x} \alpha_i h_{x,i}$. The attention weights $\alpha_i$ are computed with additive attention, where $W_x$, $W_y$, $v$ are model parameters. Finally, $h_\alpha$ is transformed with a dense layer.
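The aggregation step can be illustrated with a small pure-Python sketch. Two things here are assumptions and not stated above: the secondary sequence is summarized by its last hidden state, and that summary enters the score as an additive term $W_y h_{y,T_y}$ inside the tanh. All function names and toy values are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(h_x, W_x, v, offset):
    """Return (alpha, h_alpha) for hidden states h_x.

    `offset` is the vector added inside the tanh: the bias b in the
    single-sequence case, or W_y times a summary of y (assumed form)
    in the paired-sequence case.
    """
    scores = []
    for h in h_x:
        u = [math.tanh(a + o) for a, o in zip(matvec(W_x, h), offset)]
        scores.append(sum(v_k * u_k for v_k, u_k in zip(v, u)))
    shift = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - shift) for s in scores]
    total = sum(weights)
    alpha = [w / total for w in weights]
    # Weighted sum of hidden states, one output value per hidden dimension.
    h_alpha = [sum(a * h[k] for a, h in zip(alpha, h_x)) for k in range(len(h_x[0]))]
    return alpha, h_alpha


# Toy hidden states (hypothetical): T_x = 3, T_y = 2, hidden size 2.
h_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_y = [[0.5, -0.5], [0.2, 0.1]]
W_x = [[1.0, 0.0], [0.0, 1.0]]
W_y = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

query = matvec(W_y, h_y[-1])  # assumed: last hidden state of y acts as the query
alpha, h_alpha = additive_attention(h_x, W_x, v, query)
```

Passing the bias `b` as `offset` instead of `query` gives the single-sequence variant, which reflects that the two models differ only in this one term.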

C.2 RoBERTa
We use RoBERTa (Liu et al., 2019) as a transformer-based model due to its consistent convergence. Consistent convergence is helpful as ROAR and Recursive ROAR require the model to be trained many times. We use the RoBERTa-base pre-trained model and only perform fine-tuning. The hyperparameters are those used by Liu et al. (2019, Appendix C) on GLUE tasks. We list the hyperparameters in Table 6.
RoBERTa makes use of a beginning-of-sequence Note that when computing the importance measures, only the main sentence is considered. This is to be consistent with the BiLSTM-attention model.

D Compute
In this section, we document the compute times and resources used for computing the results. Unfortunately, our compute infrastructure changed during the making of this paper. The BiLSTM-attention results were computed on V100 GPUs while the RoBERTa results were computed on A100 GPUs. The A100 GPU is significantly faster than the V100 GPU, hence the compute times are not comparable across models. We could have recomputed the BiLSTM-attention results, but doing so would be a waste of resources. We report the machine specifications in Table 7.
The compute times are reported in Table 8. All compute was done using 99% hydroelectric energy.
While the totals in Table 8 may be large, in practical situations only one dataset is usually considered. Additionally, the variance in Figure 4 is quite low, making fewer seeds an option. Finally, the compute time of integrated gradient is approximately 2/3 of the total. As discussed in Section 6, this is rarely worth it. For this reason, practical settings may prefer to omit integrated gradient entirely.

E Sparsity
In this section, we analyse the sparsity of each importance measure. While none of the importance measures produce an importance of exactly zero for any token, they may have most of the importance assigned to just a few tokens.
This analysis serves two purposes: to show that masking a relative number of tokens is justified, and to test whether any importance measure is more sparse than the others.
Masking a relative number of tokens is justified. If the majority of the importance is assigned to just a few tokens (e.g. 10 tokens have 99% of the total importance scores), then it would make more sense to perform the non-approximate version of Recursive ROAR where exactly one token is masked in each iteration.
In Figure 6, we look at the sparsity considering the top-10 tokens. We find that the sparsity is not sufficiently high to justify masking exactly one token in each iteration. For completeness, we include this analysis in Appendix F.
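The sparsity statistic considered here, the share of the total importance held by the top-k tokens, can be sketched as follows. This is an illustration only; taking the absolute value of signed scores is an assumption, and the function name and toy scores are hypothetical.

```python
def top_k_importance_share(importance, k):
    """Fraction of the total absolute importance held by the k highest-scoring tokens."""
    magnitudes = sorted((abs(score) for score in importance), reverse=True)
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # no importance assigned at all
    return sum(magnitudes[:k]) / total


# Toy importance scores for a four-token observation (hypothetical values).
scores = [6.0, -2.5, 1.0, 0.5]
share_top_1 = top_k_importance_share(scores, 1)  # 6.0 out of 10.0
share_top_2 = top_k_importance_share(scores, 2)  # 8.5 out of 10.0
```

A share close to 1 for a small k would indicate that a few tokens carry almost all the importance, which is the situation where masking exactly one token per iteration would be preferable.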
There are cases where masking exactly one token in each iteration could make sense, for example, for attention in bAbI. However, as this is a comparative study among several importance measures and datasets, this is not enough.
Attention is more sparse than other importance measures. If a particular importance measure is more sparse than others, while having a similar faithfulness, then the more sparse importance measure would be preferable. This is because it is more likely to be understandable to humans (Miller, 2019).
In Figure 7, we look at the sparsity considering a relative number of tokens. We find that for some datasets, in particular bAbI, attention is the most sparse importance measure. Besides this, integrated gradient is the most sparse in nearly all cases. However, while the difference in sparsity is often statistically significant, we speculate that the difference is not large enough to matter in practical settings.

F Recursive ROAR with a stepsize of one token
To analyze the effect of masking 10%, as opposed to masking exactly one token in each iteration, we perform the Recursive ROAR experiment with exactly one token masked in each iteration. The results are in Figure 8. Because this is computationally expensive, we only do this for up to 10 tokens. This makes it harder to draw clear conclusions from this experiment, in particular because not all redundancies are removed when only masking 10 tokens.
In general, the results in Figure 8 show that the approximation of masking 10% in each iteration does affect the results. However, we can draw the same conclusions. That being said, some of the conclusions are less obvious because we only look at 10 tokens.
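The one-token-per-iteration procedure can be sketched as the loop below. This is a simplified illustration and not our implementation: `train_fn`, `importance_fn`, and `eval_fn` are hypothetical callables supplied by the user, and the toy stand-ins at the bottom only exercise the masking mechanics.

```python
MASK = "[MASK]"


def mask_top_tokens(tokens, scores, step_size):
    """Mask the `step_size` highest-scoring tokens that are not already masked."""
    candidates = [i for i, token in enumerate(tokens) if token != MASK]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    masked = list(tokens)
    for i in candidates[:step_size]:
        masked[i] = MASK
    return masked


def recursive_roar(train_set, test_set, train_fn, importance_fn, eval_fn,
                   n_steps, step_size=1):
    """Retrain, evaluate, and mask additional tokens for `n_steps` iterations.

    Datasets are lists of (tokens, label). Importance is recomputed with the
    newly retrained model in every iteration, on the already-masked data.
    """
    performance = []
    for step in range(n_steps + 1):
        model = train_fn(train_set)  # retrain from scratch on the current masking
        performance.append(eval_fn(model, test_set))
        if step < n_steps:
            train_set = [(mask_top_tokens(x, importance_fn(model, x, y), step_size), y)
                         for x, y in train_set]
            test_set = [(mask_top_tokens(x, importance_fn(model, x, y), step_size), y)
                        for x, y in test_set]
    return performance, train_set, test_set


# Toy stand-ins (hypothetical): no real model, importance equals token length,
# and "performance" is simply the share of tokens that remain unmasked.
def toy_train(dataset):
    return None


def toy_importance(model, tokens, label):
    return [len(token) for token in tokens]


def toy_eval(model, dataset):
    tokens = [token for x, _ in dataset for token in x]
    return sum(token != MASK for token in tokens) / len(tokens)


data = [(["the", "movie", "is", "great"], 1)]
performance, final_train, final_test = recursive_roar(
    data, data, toy_train, toy_importance, toy_eval, n_steps=2)
```

Setting `step_size` to a number derived from the sequence length, instead of 1, would give the approximate variant where a relative number of tokens is masked per iteration.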

F.1 The results are affected by the approximation
Looking just at RoBERTa, for Diabetes, Integrated Gradient yields 65% performance at 10% masking (approximately 51 tokens), while it yields 55% performance at 10 tokens. Similarly, for bAbI-3, Gradient yields 65% at 10% masking (approximately 30 tokens), while it yields 30% at 10 tokens. Both cases show that a lower performance is achieved earlier when masking one token in each iteration. This is to be expected, as masking one token in each iteration is more effective for removing redundancies. Were we to complete the experiment and eventually mask all tokens, the faithfulness scores could therefore be expected to be higher.

F.2 The conclusions are the same
In Section 6, we present 5 findings. Here, we briefly show that the same conclusions can be drawn from Figure 8. However, as only 10 tokens are masked, they may be less obvious and there may be less evidence.
Faithfulness is model-dependent. Yes, this is most clearly seen for IMDB, where BiLSTM-Attention achieves significantly lower performance (higher faithfulness) compared to RoBERTa.
Faithfulness is task-dependent. Yes, looking at BiLSTM-Attention, for IMDB Integrated Gradient is the worst importance measure. However, for the bAbI tasks Integrated Gradient is among the best importance measures.
Attention can be faithful. Yes, particularly for bAbI, IMDB, and Diabetes attention is faithful.
Integrated Gradient is not necessarily more faithful than Gradient or Input times Gradient. Yes, considering BiLSTM-Attention, for IMDB Integrated Gradient is significantly worse than the other explanations. For most datasets, Integrated Gradient has similar faithfulness to the other importance measures.
Importance measures often work best for the top-20% most important tokens. As Figure 8 only shows 10 tokens, which is usually below the top-20%, this is hard to comment on.
Class leakage can cause the model performance to increase. For RoBERTa, in bAbI-3, the Integrated Gradient importance measure can be seen to increase performance after 2 tokens are masked.

G ROAR vs Recursive ROAR
As an ablation study, we compare ROAR by Hooker et al. (2019) with our Recursive ROAR. Figure 9 shows the comparison for BiLSTM-Attention and Figure 10 shows the comparison for RoBERTa. Recall that for ROAR by Hooker et al. (2019) it is not possible to say that an importance measure is not faithful.
Some datasets have redundancies which affect ROAR. In particular, we find that Diabetes shows a significant difference when comparing ROAR with Recursive ROAR. This holds for both BiLSTM-Attention (Figure 9) and RoBERTa (Figure 10). For both models, Gradient and Input times Gradient become faithful with Recursive ROAR. Additionally, for RoBERTa the same is the case for Integrated Gradient. This is not surprising, as Diabetes contains incredibly long sequences and contains redundancies.
Also, for IMDB, and to a lesser extent SST, there is a clear difference between BiLSTM-Attention and RoBERTa. This too is not surprising, as sentiment can often be inferred from just a single word. However, there are likely to be many positive or negative words in each observation.
Class leakage affects both ROAR and Recursive ROAR. We observe the class leakage issue for ROAR in SNLI with BiLSTM-Attention and for the bAbI tasks with RoBERTa. We observe the issue for Recursive ROAR in bAbI with BiLSTM-Attention. The fact that the issue mostly exists with bAbI is somewhat encouraging, as the bAbI datasets are synthetic. The class leakage issue appears to affect real datasets less.

0%: The movie is great . I really liked it .
10%: The movie is [MASK] . I really liked it .
20%: The [MASK] is [MASK] . I really liked it .

Figure 1: Example of ROAR. The first sentence shows the importance of various tokens. The next two sentences demonstrate the proportion of important tokens replaced by [MASK]. Note, the second sentence is enough to infer the sentiment.
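The masking shown in this example can be reproduced with a short sketch. The importance scores below are hypothetical values chosen so that "great" and "movie" rank highest, and rounding the number of masked tokens to the nearest integer is an assumption.

```python
MASK = "[MASK]"


def roar_mask(tokens, importance, ratio):
    """Replace the top `ratio` share of tokens, ranked by importance, with [MASK]."""
    n_mask = round(ratio * len(tokens))  # assumed rounding rule
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    to_mask = set(ranked[:n_mask])
    return [MASK if i in to_mask else token for i, token in enumerate(tokens)]


tokens = "The movie is great . I really liked it .".split()
# Hypothetical importance scores, one per token.
importance = [0.02, 0.20, 0.01, 0.40, 0.0, 0.05, 0.10, 0.15, 0.07, 0.0]

masked_10 = " ".join(roar_mask(tokens, importance, 0.10))
masked_20 = " ".join(roar_mask(tokens, importance, 0.20))
```

Here the ranking is computed once and reused for every masking ratio, which is what distinguishes plain ROAR from the recursive variant, where the importance is recomputed after each retraining.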

Figure 4: Recursive ROAR results, showing model performance at x% of tokens masked. A model performance below random indicates faithfulness, while above or similar to random indicates a non-faithful importance measure. Performance is averaged over 5 seeds with a 95% confidence interval.

Figure 5: Visualization of the faithfulness calculation done in (3). The faithfulness area is the numerator in (3), while the normalizer area is the denominator. Essentially, (3) computes the relative area-between-curves (RACU) between an explanation curve and the random baseline curve.
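A numerical sketch of this ratio is given below. It rests on assumptions that should be checked against (3): both areas are computed with the trapezoidal rule, and the normalizer is taken to be the area between the random baseline curve and its own performance at 100% masking. The function names and the toy curves are hypothetical.

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise-linear curve given by the points (xs, ys)."""
    return sum(0.5 * (x1 - x0) * (y0 + y1)
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def racu(masking_ratios, measure_performance, random_performance):
    """Relative area between the random baseline curve and an explanation curve."""
    # Numerator: area between the random baseline and the importance measure.
    faithfulness_area = trapezoid_area(
        masking_ratios,
        [b - p for b, p in zip(random_performance, measure_performance)])
    # Denominator (assumed): area between the random baseline and the
    # performance reached when everything is masked.
    floor = random_performance[-1]
    normalizer_area = trapezoid_area(
        masking_ratios, [b - floor for b in random_performance])
    if normalizer_area == 0:
        raise ValueError("random baseline is flat; the ratio is undefined")
    return faithfulness_area / normalizer_area


# Toy curves (hypothetical): performance at 0%, 50%, and 100% masking.
ratios = [0.0, 0.5, 1.0]
random_curve = [0.9, 0.7, 0.5]
measure_curve = [0.9, 0.5, 0.5]
score = racu(ratios, measure_curve, random_curve)
```

Under these assumptions, a curve identical to the random baseline gives a score of zero, and a curve that lies above the baseline gives a negative score.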

Figure 7: The accumulative importance score relative to the total importance score for the top-x% number of tokens. The metric is averaged over 5 seeds with a 95% confidence interval.

Figure 8: Recursive ROAR results, showing model performance at up to 10 tokens masked. Note that because the datasets have more than 10 tokens, the conclusion one can draw from this plot may change if more tokens were considered. However, in general, a model performance below random indicates faithfulness, while above or similar to random indicates a non-faithful importance measure. Performance is averaged over 5 seeds with a 95% confidence interval.

Figure 10: ROAR and Recursive ROAR results for RoBERTa, showing model performance at x% of tokens masked. A model performance below random indicates faithfulness. For Recursive ROAR a curve above or similar to random indicates a non-faithful importance measure, while for ROAR by Hooker et al. (2019) this case is inconclusive. Performance is averaged over 5 seeds with a 95% confidence interval.

Table 3: Dataset statistics for single-sequence and paired-sequence tasks. Following Jain and Wallace (2019), we use the same BiLSTM-attention model and report performance as macro-F1 for SST, IMDB, Anemia and Diabetes, micro-F1 for SNLI, and accuracy for bAbI.

Table 4: Performance of the class-majority classifier and of the BiLSTM-Attention and RoBERTa classifiers on the 100% masked dataset. Performance is the standard metric for each dataset, that is, macro-F1 for SST, IMDB, Anemia and Diabetes, micro-F1 for SNLI, and accuracy for bAbI.

Table 7: Compute hardware used for each model. Note that the models were computed on a shared user system. Hence, we only report the resources allocated for our jobs.
Figure 6: The accumulative importance score relative to the total importance score, for the top-k number of tokens. The metric is averaged over 5 seeds with a 95% confidence interval. Note that the datasets are not equal in sequence-length; the scores are therefore hard to compare across datasets. Please refer to Table 1 for statistics on the sequence-length.
Figure 9: ROAR and Recursive ROAR results for BiLSTM-Attention, showing model performance at x% of tokens masked. A model performance below random indicates faithfulness. For Recursive ROAR a curve above or similar to random indicates a non-faithful importance measure, while for ROAR by Hooker et al. (2019) this case is inconclusive. Performance is averaged over 5 seeds with a 95% confidence interval.