Model Interpretability and Rationale Extraction by Input Mask Optimization



Introduction
Black-box machine-learning models like transformers (Vaswani et al., 2017) or convolutional neural networks (Tan and Le, 2019) are state-of-the-art in natural language processing and computer vision. Their complexity enables them to perform well on a variety of tasks, but this comes at the cost of a lack of interpretability: The question of why a model made a specific prediction cannot be answered reliably. Especially if such black-box models are used in critical real-world applications (e.g., in the medical domain), this creates a demand for methods that explain network predictions while fulfilling a variety of requirements, like being easy to implement, model and task agnostic, faithful to the inner workings of the network, and producing results that are easily interpretable for humans.
To this end, a variety of interpretability methods have been proposed (Guidotti et al., 2018; Zhang et al., 2021), but as the aforementioned requirements are often at odds, at least one of them often remains unfulfilled. Reasons for this include the reliance on complex message passing schemes that require laborious implementations (e.g., Montavon et al., 2017; Shrikumar et al., 2017), the applicability only to specific model architectures (e.g., Yuan et al., 2021; Abnar and Zuidema, 2020), or the fact that explanations often highlight individual, disconnected input features (e.g., standard gradient-based saliency), which contradicts human intuition of a sensible explanation (compare Section 2.1 for details).
As an example, in a text classification setting, interpretability methods often highlight individual words that explain the prediction, but do not include their context (Remmer, 2022), even though the context of a word is crucial in determining its meaning: The word "good" influences the prediction in a completely different way if it is preceded by the word "not", meaning that this context has an impact on the classification and should therefore be part of the rationale. Notably, this holds true even in the absence of such modifiers, since the context must be available to confirm this absence.
In this work, we propose a new method for model explainability that is able to identify parts of the input that are, on the one hand, most indicative of a class and, on the other hand, perceived as a sensible rationale by humans. Our method is applicable to all input types that define a spatial structure between individual features (e.g., texts, images) and builds on the assumption that interpretable explanations correspond to smooth and connected regions of features with respect to this spatial structure. It uses numerical optimization to mask out parts of the input that the model does not consider indicative of the class of interest, thus leaving only the parts of the input that are indicative of this class. The masking is done using gradient-based optimization combined with a new regularization scheme that enforces sufficiency, comprehensiveness, and compactness of the generated explanation (Yu et al., 2019), three criteria that have been established in the domain of rationale extraction but are less common in network interpretability methods. In this way, our method bridges the gap between model interpretability and rationale extraction, thereby showing that the latter can be performed without training a specialized model, solely on the basis of a trained classifier.

Background
Methods that explain the predictions made by black-box models to users can be broadly categorized into (i) interpretability methods that aim at creating explanations for existing classifiers after they have been trained (Section 2.1) and (ii) rationale extraction approaches that are designed to create a rationale as a model output in addition to the usual label prediction (Section 2.2). Our interpretability method relies on gradient-based input optimization, discussed in detail in Section 2.3.

Neural Network Interpretability
Interpretability methods usually assign importance scores to features or parts of a given input, indicating how relevant the respective feature is for making the prediction. Many early methods focus on convolutional neural networks and use backpropagation-like procedures to compute saliency scores for each input feature. Simonyan et al. (2013) use the network's gradient at the input image as saliency scores, while the integrated gradients method of Sundararajan et al. (2017) sums over gradients at different inputs that are created by gradually transforming a neutral input into the input of interest. The DeconvNet architecture (Zeiler and Fergus, 2014) and the guided backpropagation algorithm (Springenberg et al., 2015) again rely on a single evaluation but change the standard gradient computation to produce visually improved importance maps. Attribution methods like layer-wise relevance propagation (Bach et al., 2015) extend this idea by defining a backward pass that redistributes the total function value layer-wise backwards using a propagation rule that makes the total relevance within each layer add up to the function value that is to be explained. Deep Taylor Decomposition (Montavon et al., 2017) and DeepLIFT (Shrikumar et al., 2017) then introduced different rules for redistributing the relevance between layers. For transformer models, methods like that of Abnar and Zuidema (2020) track the attention flow through the network. This has been extended to incorporate information from attribution methods like the Deep Taylor Decomposition to more accurately identify neurons that have a strong influence on the final prediction (Chefer et al., 2021b,a).
Other well-known explainability methods rely on input perturbations. LIME (Ribeiro et al., 2016) identifies important input features by perturbing the input, observing the change in the model predictions, and fitting an interpretable model to the observed data, while other methods occlude parts of the input to detect features that are important for the classification (Zeiler and Fergus, 2014; Bazzani et al., 2016; Zhou et al., 2014; Petsiuk et al., 2018).
A further approach to model interpretability is to generate an input that maximally activates specific neurons, thereby yielding insights about the responsibilities of these neurons, as was done for CNNs by Simonyan et al. (2013). Fong and Vedaldi (2017) then used a similar idea to remove class-indicative information from input images to detect the parts of the image responsible for the classification.
The method we propose in this paper differs from standard gradient-based techniques by not relying on evaluations at a single point or at fixed perturbations, but at points that are determined by a dynamic optimization process. This control via optimization is also a key difference from methods that rely on random permutations or masking of input features. Compared to message-passing schemes and model-specific methods (like methods for transformer interpretability), relying only on the gradient makes our method applicable to models with a variety of architectures and layer types without requiring additional implementational effort.

Rationale Extraction
The task of rationale extraction, also commonly referred to as selective rationalization, is concerned with designing models that can produce human-interpretable rationales in addition to the usual model output (Lei et al., 2016), with the domain usually being textual inputs and the rationales being a subset of the input text that is determined to be responsible for the prediction. Lei et al. (2016) approached this task by developing a two-step procedure in which a proposal network extracts a rationale from the input text and a subsequent classification network only has access to the rationale to make the final prediction. By training this model end-to-end, the proposal network learns to extract the most useful text fragments from the input, which thus correspond to an explanation for the classification. Later, Yu et al. (2019) proposed three criteria that rationales should satisfy to be perceived as sensible:
• Sufficiency: The rationale alone should be sufficient to classify the sample correctly.
• Comprehensiveness: All relevant information should be contained in the rationale, meaning that the correct label cannot be inferred by just considering the words not included in the rationale.
• Compactness: The rationale should be sparse but should nevertheless consist of consecutive text fragments instead of single words.
The methods of Yu et al. (2019) enforce these criteria through regularizers and by using a complement predictor that predicts the correct label based on all words that are not part of the rationale. Training the proposal network to fool the complement predictor then enforces the comprehensiveness constraint. Other approaches extend this and extract class-dependent rationales (Chang et al., 2019) or select complete paragraphs as rationales (Chalkidis et al., 2021).
The two main differences between rationale extraction models and the interpretability methods discussed in Section 2.1 are that, on the one hand, models are explicitly trained to produce rationales instead of creating them post hoc, and, on the other hand, the focus is on creating human-interpretable rationales, while the focus for interpretability methods is often on mathematical faithfulness measures.
Our method combines the focus on faithfulness with the desire for human interpretability to create rationales that faithfully explain model predictions post hoc and correspond to human rationales, as these properties substantially enhance the usefulness of explanations for many applications.

Input Optimization
As mentioned in Section 2.1, optimization of input images for CNNs has been used to explain the responsibilities of specific neurons, but notably, the resulting images do not resemble naturally occurring images. This is caused by the huge complexity and highly nonlinear behavior of neural networks, which makes their behavior unpredictable on the out-of-domain inputs that quickly arise during the optimization. In different experiments, this has led to behaviors like making highly confident class predictions for images that resemble random noise (Nguyen et al., 2015) or predicting a completely different class after adding almost imperceptible noise to a given image (Szegedy et al., 2014). Unconstrained optimization of the input to a neural network to optimize the activation of specific neurons will therefore inevitably result in inputs that are out-of-domain, do not resemble natural images, or seem downright counter-intuitive. Different strategies for mitigating this problem in the context of input optimization exist (e.g., the use of GANs, Nguyen et al., 2016), with the most common being extensive regularization to prevent high-frequency information in images from influencing the prediction (Yosinski et al., 2015; Mahendran and Vedaldi, 2015) or using lower-resolution inputs and blurring to limit the degrees of freedom within the optimization (Fong and Vedaldi, 2017).
In this study, we use input optimization to perform model interpretability by optimizing a mask to suppress all parts of a given input that a given model does not consider indicative of the given class. Compared to Fong and Vedaldi (2017), we propose a new optimization objective as well as a new regularization scheme that allows for the creation of more detailed masks. Additionally, we expand the scope of input optimization methods from the domain of images to text processing.

MaRC
In this section, we introduce MaRC, our framework for Mask-based Rationale Creation. Section 3.1 develops the general framework. Sections 3.2 and 3.3 address the specificities of applying MaRC to texts and images, respectively.

Method
We design an interpretability method that detects parts of an input $x$ that a model $M$ considers most indicative of a specific class $c$. We assume an input $x$ with $n$ input features, each of which could be high-dimensional, e.g., token embeddings or pixels with color channels. The main idea of the approach is to detect input features that are highly indicative of class $c$ by replacing as much of the input as possible with an uninformative input $b$, i.e., an input that the model does not consider indicative of any class, while having the model assign a high score for class $c$ to the altered input. We define a mask $\lambda \in \mathbb{R}^n$, $\lambda_i \in [0, 1]$, to obtain a masked input $\tilde{x}$ in the following way:
$$\tilde{x} = \lambda \odot x + (1 - \lambda) \odot b \tag{1}$$
When $\lambda_i$ is close to 1, feature $i$ is mostly retained in $\tilde{x}$, while $\lambda_i$ values close to 0 replace feature $i$ almost completely with the uninformative $b$.
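The masking step can be illustrated with a short sketch. It assumes that the masked input is the feature-wise convex combination of the input and the uninformative input described above; the function name `apply_mask` and the toy values are hypothetical and only serve as illustration.

```python
from typing import List, Sequence


def apply_mask(
    x: Sequence[Sequence[float]],
    b: Sequence[Sequence[float]],
    lam: Sequence[float],
) -> List[List[float]]:
    """Blend every feature vector x[i] with its baseline b[i] using lam[i]."""
    if not (len(x) == len(b) == len(lam)):
        raise ValueError("x, b and lam must contain the same number of features")
    masked = []
    for x_i, b_i, lam_i in zip(x, b, lam):
        if not 0.0 <= lam_i <= 1.0:
            raise ValueError("mask values must lie in [0, 1]")
        # lam_i = 1 keeps feature i, lam_i = 0 replaces it by the baseline.
        masked.append([lam_i * xv + (1.0 - lam_i) * bv for xv, bv in zip(x_i, b_i)])
    return masked


# Three "tokens" with two-dimensional embeddings and an all-zero baseline
# (toy values, assumed for illustration only).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(apply_mask(x, b, [1.0, 0.5, 0.0]))
```

A single mask value is shared by all dimensions of a feature, so a token embedding or a pixel with several color channels is always retained or replaced as a whole.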
MaRC tackles this masking as an optimization problem: it optimizes λ to obtain rationales that fulfill the properties of sufficiency, comprehensiveness, and compactness (compare Section 2.2).It models these properties via dedicated regularizers, which we will develop step by step in the following.
Sufficiency. We want to find a mask $\lambda$ such that the probability that model $M$ assigns to the masked input $\tilde{x}$ for class $c$ is close to 1. We optimize this criterion as follows:
$$\arg\min_{\lambda} \; -L(\tilde{x}, c) + \Omega_\lambda \tag{2}$$
Here, $L(\tilde{x}, c)$ is a scoring function for $c$ under $M$ and $\Omega_\lambda$ is a sparsity regularizer that enforces the detection of the smallest set of input features that still induces a high score for $c$. An obvious choice for $L(\tilde{x}, c)$ is the log-likelihood of $c$, maximizing the probability of $c$ under $M$ and leading $\lambda$ to highlight class-discriminative information, i.e., input features that indicate only class $c$. A different choice would be the logarithm of the sigmoid of the logit for $c$, which does not suppress other classes and therefore leads $\lambda$ to highlight class-indicative information, i.e., all input features relevant for $c$, even if they are indicative of other classes as well.
In both cases, $M$ considers the masked input $\tilde{x}$ to be highly indicative of class $c$, thereby fulfilling the sufficiency criterion.
Comprehensiveness. Optimizing Equation 2 leads to sufficiency but not comprehensiveness, as only the smallest set of highly indicative input features is detected. To detect all information relevant for $c$, we introduce the complement of the rationale (Yu et al., 2019):
$$\tilde{x}^c = (1 - \lambda) \odot x + \lambda \odot b \tag{3}$$
which leaves exactly those features unmasked that were masked for the masked input $\tilde{x}$. Minimizing the score of $\tilde{x}^c$ for $c$ enforces all parts that indicate class $c$ to be masked in $\tilde{x}^c$ (meaning that they will be unmasked in $\tilde{x}$), resulting in the following optimization:
$$\arg\min_{\lambda} \; -L(\tilde{x}, c) + L(\tilde{x}^c, c) + \Omega_\lambda \tag{4}$$
This formulation combines the "deletion game" and "preservation game" that were introduced by Fong and Vedaldi (2017) but treated as separate objectives. Optimizing the mask with respect to both objectives greatly supports the detection of precise boundaries of the relevant features.
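The interplay of the sufficiency term, the complement term, and the sparsity regularizer can be made concrete with a small sketch. It is only an illustration under stated assumptions: features are scalars, the sparsity regularizer is taken to be the mean mask value scaled by a weight `alpha_lam` (the exact form is not fixed by the text above), and `toy_score` is a hypothetical stand-in for the scoring function of a real model.

```python
import math
from typing import Callable, List, Sequence


def blend(x: Sequence[float], b: Sequence[float], lam: Sequence[float]) -> List[float]:
    """Masked input: keep x where lam is 1, use the baseline b where lam is 0."""
    return [l * xi + (1.0 - l) * bi for xi, bi, l in zip(x, b, lam)]


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def objective(
    x: Sequence[float],
    b: Sequence[float],
    lam: Sequence[float],
    score: Callable[[Sequence[float]], float],
    alpha_lam: float = 1.0,
) -> float:
    """-score(masked input) + score(complement) + sparsity (to be minimized)."""
    x_masked = blend(x, b, lam)
    x_complement = blend(x, b, [1.0 - l for l in lam])
    sparsity = alpha_lam * sum(lam) / len(lam)  # assumed form of the regularizer
    return -score(x_masked) + score(x_complement) + sparsity


def toy_score(features: Sequence[float]) -> float:
    """Hypothetical class score: only feature 0 is indicative of the class."""
    return log_sigmoid(2.0 * features[0])


x = [3.0, 1.0, 1.0]
b = [0.0, 0.0, 0.0]
for name, lam in [("relevant only", [1.0, 0.0, 0.0]),
                  ("everything", [1.0, 1.0, 1.0]),
                  ("irrelevant only", [0.0, 1.0, 0.0])]:
    print(name, round(objective(x, b, lam, toy_score), 3))
```

In this toy setting, the mask that keeps only the indicative feature attains the lowest objective value: keeping everything is penalized by the sparsity term, and keeping an irrelevant feature is penalized by both score terms.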
Compactness. The original compactness criterion states that a rationale shall consist of longer but fewer meaningful spans of text. Here, we generalize this to all input types that possess a spatial structure that defines neighborhoods around input variables. The underlying assumption is that, for these types of inputs, a feature is only meaningful in the context of its neighborhood, as is the case, for example, for single words in text or individual pixels in images, so that a sensible rationale must include larger groups of closely located features. Thus, we now assume a general spatial structure on the input $x$ that defines distances $d(i, j)$ between the features $i$ and $j$, with features that are closer together having a higher chance of belonging to the same meaningful entity. We enforce the selection of larger groups of features by reparameterizing our mask, i.e., we introduce two new parameters, $w \in \mathbb{R}^n$ and $\sigma \in \mathbb{R}^n_{>0}$, from which the mask values $\lambda$ can subsequently be calculated. The optimization is then performed with respect to $w$ and $\sigma$.
The mask values $\lambda$ are mainly determined by $w$, in the sense that $w_i$ largely determines the final value of $\lambda_i$. Crucially, $w_i$ now also influences the values of $\lambda$ around $i$, so that, for example, $\lambda_{i-1}$ and $\lambda_{i+1}$ are also strongly influenced by $w_i$. $\sigma_i$ then determines the strength and extent of $w_i$'s influence on its neighbors, as it parameterizes an unnormalized Gaussian placed at position $i$, so that the influence $w_{i \to j}$ of a weight $w_i$ onto $\lambda_j$ is then given by:
$$w_{i \to j} = w_i \cdot \exp\left(-\frac{d(i, j)^2}{\sigma_i}\right) \tag{5}$$
The final value for $\lambda_j$ is then calculated as follows:
$$\lambda_j = \operatorname{sigmoid}\left(\sum_{i=1}^{n} w_{i \to j}\right) \tag{6}$$
This parameterization of $\lambda$ enforces neighboring inputs to have similar values if the corresponding $\sigma$ values are large, which also plays a key role in regularizing the optimization to avoid the issues discussed in Section 2.3. Large values for $\sigma$ are softly enforced by introducing an additional regularizer:
$$\Omega_\sigma = -\alpha_\sigma \cdot \frac{1}{n} \sum_{i=1}^{n} \log(\sigma_i) \tag{7}$$
The logarithm was chosen to enforce positive values of $\sigma_i$ while gradually discounting the effect that increases in $\sigma_i$ have on the loss function. Notably, this regularizer does not enforce large values of $\sigma$ by means of hard constraints, meaning that low values and therefore sharper boundaries between mask values for neighboring features can be optimal if the other parts of the optimization objective support this behavior. This is in contrast to Fong and Vedaldi (2017), who used a lower-resolution mask in combination with upsampling and Gaussian blur to obtain smooth masks, which does not allow for sharp masks even if they were optimal. In summary, the final optimization objective looks as follows:
$$\arg\min_{w, \sigma} \; -L(\tilde{x}, c) + L(\tilde{x}^c, c) + \Omega_\lambda + \Omega_\sigma \tag{8}$$
This objective can be optimized using stochastic gradient descent, but in practice, we found using an optimizer that incorporates momentum (e.g., Adam, Kingma and Ba, 2015) to be key for avoiding poor local optima and obtaining good results.
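The effect of the reparameterization can be sketched in a few lines. The sketch rests on assumptions about the exact functional form: the influence of $w_i$ on position $j$ is taken to be $w_i \cdot \exp(-d(i,j)^2 / \sigma_i)$, the summed influences are squashed into $[0, 1]$ by a sigmoid, and the regularizer on $\sigma$ is the negative mean logarithm scaled by a weight. The function names are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mask_from_parameters(
    w: Sequence[float],
    sigma: Sequence[float],
    distance: Optional[Callable[[int, int], float]] = None,
) -> List[float]:
    """Compute mask values from weights w and widths sigma (assumed form)."""
    if distance is None:
        distance = lambda i, j: abs(i - j)  # positions in a sequence
    n = len(w)
    lam = []
    for j in range(n):
        # Every weight w[i] contributes to position j through an
        # unnormalized Gaussian centered at i with width sigma[i].
        total = sum(w[i] * math.exp(-distance(i, j) ** 2 / sigma[i]) for i in range(n))
        lam.append(sigmoid(total))
    return lam


def sigma_regularizer(sigma: Sequence[float], alpha_sigma: float = 1.0) -> float:
    """Soft preference for large sigma values: negative mean log (assumed form)."""
    return -alpha_sigma * sum(math.log(s) for s in sigma) / len(sigma)


w = [0.0, 0.0, 4.0, 0.0, 0.0]
print([round(v, 3) for v in mask_from_parameters(w, [0.1] * 5)])  # sharp mask
print([round(v, 3) for v in mask_from_parameters(w, [4.0] * 5)])  # smooth mask
```

With small widths, only the central position receives a high mask value, whereas larger widths spread the influence of the central weight to its neighbors, which is the mechanism that favors connected spans.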

Textual Inputs
As MaRC only requires the gradient of a model prediction at the input, it can be applied to all common text processing models. In the following, we discuss specific aspects of using MaRC with state-of-the-art transformer architectures like BERT (Devlin et al., 2019).
As uninformative input b, we choose a sequence of PAD-tokens of the same length as x. During training, the model learns to treat these tokens as uninformative since they are added to inputs irrespective of their content or the desired output.
As we want importance scores for each individual word, we define n to be the number of words in the input sequence. Notably, this is different from the actual input dimension, as it is common to use WordPiece embeddings (Wu et al., 2016), which could split words into multiple input tokens. In this case, we use parameter tying to only have a single parameter for all pieces of a word representation. The distance function is then simply defined as d(i, j) = |i − j|, with i and j being the positions of the words in the text.
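Parameter tying between words and their word pieces can be sketched as follows. The sketch assumes that a mapping from token positions to word indices is available (with `None` marking special tokens) and that special tokens are simply kept unmasked; both the mapping format and the function names are hypothetical.

```python
from typing import List, Optional, Sequence


def expand_word_mask(
    word_mask: Sequence[float],
    token_to_word: Sequence[Optional[int]],
    special_value: float = 1.0,
) -> List[float]:
    """Copy each word-level mask value to all word pieces of that word."""
    return [special_value if w is None else word_mask[w] for w in token_to_word]


def tie_gradients(
    token_grads: Sequence[float],
    token_to_word: Sequence[Optional[int]],
    n_words: int,
) -> List[float]:
    """Chain rule for tied parameters: sum the gradients of all pieces of a word."""
    word_grads = [0.0] * n_words
    for grad, w in zip(token_grads, token_to_word):
        if w is not None:
            word_grads[w] += grad
    return word_grads


# Hypothetical segmentation: word 0 is split into three pieces, word 1 into one,
# and the first and last positions are special tokens.
token_to_word = [None, 0, 0, 0, 1, None]
print(expand_word_mask([0.9, 0.1], token_to_word))
print(tie_gradients([0.5, 0.1, 0.2, 0.3, -0.4, 0.9], token_to_word, n_words=2))
```

Because all pieces of a word share one parameter, a word is always masked as a whole, and its gradient is the sum of the gradients of its pieces.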
Finally, we found that introducing noise into the optimization process is beneficial for regularization (see Section 2.3 for a discussion of regularization in input optimization). Thus, for text inputs, we add Gaussian noise to the masked input $\tilde{x}$ and its complement $\tilde{x}^c$ and randomly set mask values to 0 or to 1 in each optimization step.
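The two noise sources can be sketched as below. The default values follow the settings reported in Appendix A.1 (noise standard deviation 0.03, 5% of mask values); how exactly the fixed values are chosen is not specified, so the sketch assumes that each selected position is set to 0 or 1 with equal probability. The function names are hypothetical.

```python
import random
from typing import List, Sequence


def add_gaussian_noise(
    x: Sequence[Sequence[float]], std: float, rng: random.Random
) -> List[List[float]]:
    """Add zero-mean Gaussian noise to every dimension of every feature."""
    return [[v + rng.gauss(0.0, std) for v in feature] for feature in x]


def randomly_fix_mask_values(
    lam: Sequence[float], fraction: float, rng: random.Random
) -> List[float]:
    """Set a random fraction of mask values to exactly 0 or 1."""
    n_fixed = round(fraction * len(lam))
    noisy = list(lam)
    for idx in rng.sample(range(len(lam)), n_fixed):
        # Assumption: 0 and 1 are chosen with equal probability.
        noisy[idx] = float(rng.randint(0, 1))
    return noisy


rng = random.Random(0)
lam = [0.5] * 20
print(randomly_fix_mask_values(lam, fraction=0.05, rng=rng))
print(add_gaussian_noise([[1.0, 2.0]], std=0.03, rng=rng))
```

Both perturbations would be redrawn in every optimization step, so that the optimized mask cannot rely on one specific configuration of the input.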

Image Inputs
Image inputs also fulfill the requirement of a spatial structure that is needed for our method. They also provide natural choices for uninformative inputs, as uniformly colored images can generally be assumed to be uninformative in most prediction settings. Therefore, obvious choices for $b$ would, for example, be a white image, a black image, or an image of the mean color within the given dataset. A different option is to remove usable information from the input image by blurring it and using this blurred image as uninformative input (Fong and Vedaldi, 2017). As parts of the input image could have the same color as the uninformative input (which renders the corresponding mask values meaningless) and even uniformly colored patches could be seen as informative by neural networks, we chose to alter the optimization objective to be the average over different choices for $b$, with $B$ being the set of all uninformative inputs and $\tilde{x}_b$ and $\tilde{x}^c_b$ denoting the masked input and its complement computed with the uninformative input $b$:
$$\arg\min_{w, \sigma} \; \frac{1}{|B|} \sum_{b \in B} \left[ -L(\tilde{x}_b, c) + L(\tilde{x}^c_b, c) \right] + \Omega_\lambda + \Omega_\sigma + \Omega_{NB} \tag{9}$$
As images generally have more variables and therefore more degrees of freedom in the optimization, further regularization is needed to obtain sensible optimization results. To this end, this formulation includes an additional regularizer $\Omega_{NB}$, which denotes the average squared difference between mask values that are neighboring with respect to the 8-connected grid structure of the image, weighted by a corresponding parameter $\alpha_{NB}$.
To complete the specification of the optimization problem, we define the distance function $d$ between two pixels to be the Euclidean distance between their two-dimensional position vectors in the image grid. In contrast to the textual inputs, the introduction of noise to the optimization process did not prove to be beneficial.
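The neighborhood regularizer and the pixel distance described above can be sketched as follows. The sketch assumes that every unordered pair of 8-connected neighbors is counted once when averaging; the function names and the default weight are hypothetical.

```python
import math
from typing import Sequence, Tuple


def neighbor_regularizer(mask: Sequence[Sequence[float]], alpha_nb: float = 1.0) -> float:
    """Average squared difference between 8-connected neighboring mask values."""
    rows, cols = len(mask), len(mask[0])
    # Right, down, down-right and down-left cover each neighbor pair exactly once.
    offsets = ((0, 1), (1, 0), (1, 1), (1, -1))
    total, count = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    total += (mask[r][c] - mask[rr][cc]) ** 2
                    count += 1
    return alpha_nb * total / count if count else 0.0


def pixel_distance(p: Tuple[int, int], q: Tuple[int, int]) -> float:
    """Euclidean distance between two pixel positions (row, column)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


print(neighbor_regularizer([[0.0, 1.0], [0.0, 1.0]]))
print(pixel_distance((0, 0), (3, 4)))
```

A uniform mask incurs no penalty, while every sharp transition between neighboring pixels contributes to the regularizer, so sharp boundaries remain possible but must be supported by the other terms of the objective.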

Experiments on Rationale Extraction
We evaluate MaRC on rationale extraction, a task that is concerned with predicting the correct label for a given textual input while also providing a subset of the input as a rationale for the prediction.

Data
We use the movie review dataset (Zaidan et al., 2007) with 2000 movie reviews annotated with sentiment labels (positive or negative) as well as span-level rationales. We test on the additional rationales created by DeYoung et al. (2020), which are more comprehensive and thus, on average, comprise a much larger fraction of words (31.4% compared to 7.2% for the original annotations). As our approach is designed for extracting span-level rationales and most other datasets for rationale extraction are not annotated on span-level (DeYoung et al., 2020), this is the only dataset suitable for an evaluation of MaRC. We use a standard BERT base model (Devlin et al., 2019) and train it as a standard binary classification model on the training data, thereby using only the class labels and not the annotated rationales.

Evaluation
There are two common ways of evaluating the rationales produced by different models (DeYoung et al., 2020; Atanasova et al., 2020):
1. Agreement with human annotator rationales: A strong overlap between rationales given by human annotators and rationales produced by an explainability model is a good indicator that sensible rationales have been selected. Additionally, similarity to human rationales could be considered a desirable property (depending on the use case), even if it is not perfectly in line with the actual reasoning process of the neural network.
2. Faithfulness: Ideally, the rationales produced by a model fulfill the conditions of sufficiency and comprehensiveness, meaning that they actually reveal the information that the model considered indicative for the predicted label.
Different metrics exist to evaluate the performance of approaches that produce "soft" scores (i.e., continuous values) or binary values as the output of the rationale generation. As we see use cases for both outputs, we evaluate our approach with respect to both. To create a binary mask from the continuous mask values that MaRC produces, we train a kernel regression model to predict the optimal percentage of words that need to be included in the rationale (described in Appendix A), which we do in the same way for all methods tested in this study.
To evaluate the agreement with human rationales, we calculate the token F1 score for the binary masks by using precision and recall of the "positive" class of words belonging to the rationale, while the soft-scoring models are evaluated using the mean average precision (mAP). To evaluate the agreement of larger detected spans with the spans present in the human rationales, we evaluate the IoU F1 score that counts a ground-truth span as correctly detected if there is a predicted span with an IoU of over 0.5, which again allows for the calculation of an F1 score for the "positive" class of detecting the spans. These three metrics were used in the ERASER benchmark (DeYoung et al., 2020), which also proposed metrics to evaluate sufficiency and comprehensiveness. For these metrics, we slightly deviate from their evaluation metrics by evaluating these scores for a given sample $x$ and rationale $r$ in the following way:
$$\mathrm{comp}(x, r) = \frac{1}{|I|} \sum_{i \in I} \left( M(x) - M(x \setminus r_i) \right)$$
$$\mathrm{suff}(x, r) = \frac{1}{|I|} \sum_{i \in I} \left( M(x) - M(r_i) \right)$$
Here, $M(x)$ denotes the class probability prediction (for the ground-truth class) of our model, $I$ denotes the set of evaluated thresholds $i$, $r_i$ denotes the top $(i \cdot 5)\%$ of words according to the soft rationale scores (all other words are removed), and $x \setminus r_i$ denotes sample $x$ with all words that belong to $r_i$ removed, where we "remove" words by replacing the corresponding tokens with PAD-tokens.
Therefore, the comprehensiveness score evaluates how much removing the rationale decreases the model performance (higher scores are better), while the sufficiency score evaluates how well the correct label can be predicted from the rationale alone (lower scores are better).
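The two faithfulness scores can be illustrated with a short sketch. It assumes that the scores are averaged over the thresholds 5%, 10%, ..., 95% and that the number of retained words is rounded up; the exact set of thresholds is an assumption, and `toy_model` is a hypothetical stand-in for the probability that a real classifier assigns to the ground-truth class.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple

PAD = "[PAD]"


def top_indices(scores: Sequence[float], percent: int) -> Set[int]:
    """Indices of the highest-scoring `percent` percent of words (rounded up)."""
    n = len(scores)
    k = (percent * n + 99) // 100
    ranked = sorted(range(n), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def faithfulness(
    tokens: Sequence[str],
    scores: Sequence[float],
    model: Callable[[List[str]], float],
    percents: Iterable[int] = range(5, 100, 5),  # assumed thresholds
) -> Tuple[float, float]:
    """Return (comprehensiveness, sufficiency) averaged over all thresholds."""
    thresholds = list(percents)
    full = model(list(tokens))
    comp, suff = 0.0, 0.0
    for percent in thresholds:
        keep = top_indices(scores, percent)
        rationale_only = [t if j in keep else PAD for j, t in enumerate(tokens)]
        without_rationale = [PAD if j in keep else t for j, t in enumerate(tokens)]
        comp += full - model(without_rationale)  # higher is better
        suff += full - model(rationale_only)     # lower is better
    return comp / len(thresholds), suff / len(thresholds)


def toy_model(tokens: List[str]) -> float:
    """Hypothetical classifier that only reacts to the word 'great'."""
    return 0.9 if "great" in tokens else 0.5


tokens = ["a", "great", "film"] + ["word"] * 17
scores = [1.0 if t == "great" else 0.0 for t in tokens]
print(faithfulness(tokens, scores, toy_model))
```

For this toy classifier, an attribution that ranks the decisive word first obtains a high comprehensiveness and a sufficiency of zero, whereas an attribution that ranks it last obtains the opposite.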

Results
The evaluation results are displayed in the upper part of Table 1 (see Appendix A for more details on the setup). We compare MaRC against other interpretability methods that are commonly used in the context of NLP but omit specialized rationale extraction models as they (i) usually produce binary masks, making it impossible to perform the soft-scoring evaluation, and (ii) do not produce explanations for existing models, which makes the faithfulness evaluation inapplicable. We see that MaRC achieves state-of-the-art results on all measures that evaluate agreement with human rationales, i.e., Token F1, mAP, and IoU F1, showing that, among the compared methods, MaRC is the best at obtaining rationales that match human intuition. Especially with respect to the IoU F1 score, MaRC outperforms all other methods by a large margin, even though the hyperparameters for other methods were set to explicitly support high scores in this measure (e.g., masking larger spans for occlusion and LIME). This highlights that MaRC is suitable for detecting span-level rationales in a paragraph-long text that agree with spans that humans annotate, without being trained to do so and without additional model components as in state-of-the-art rationale extraction models.
For sufficiency and comprehensiveness, MaRC also achieves strong results, being outperformed in both metrics only by Shapley value sampling. The excellent performance of this method with regard to these evaluation metrics is not surprising, though, as it is based on choosing a random permutation of input features, adding them successively to the input, and using the change in the model's output as the resulting score, a procedure that is very closely connected to the sufficiency and comprehensiveness calculations. It should be noted that multiple methods, including MaRC, achieve close to optimal results for sufficiency, as scores close to 0 indicate that very few high-scoring tokens are enough for the classifier to reproduce its prediction on the full input. Notably, MaRC can produce good results while aiming to create human-like rationales, showing that this kind of rationale to some extent corresponds to the inner workings of the neural network.
We also conduct an ablation study that tests the importance of the different parts of the optimization objective by leaving these parts out in turn and reporting the results with the altered objective. The results are displayed in the lower part of Table 1, with the "Method" column specifying which part of the objective is omitted. The full optimization objective is almost uniformly the best-performing variant, indicating that all parts are essential to achieve optimal performance.

Experiments on Image Classification
We evaluate MaRC on the task of creating rationales for classifications of ImageNet (Russakovsky et al., 2015) images. A visual comparison of masks created by MaRC and other interpretability approaches for ResNet-101 (He et al., 2016) and the vision transformer ViT-B/16 (Kolesnikov et al., 2021) is displayed in Figures 2 and 3, respectively (see Appendix C for more visualizations). We see that MaRC is able to produce sharp masks that often cover the complete object of interest in the image. For ViT-B/16, we include a visualization highlighting the distinction between class-discriminative and class-indicative information (compare Section 3.1): For Figure 3b), the softmax of the model output was used as scoring function, which leads MaRC to highlight only the head and tail of the animal, the two parts that the model uses to differentiate the correct class from the other classes. For Figure 3c), on the other hand, a mixture of the sigmoid of the class logit and the softmax of the model output was used with a ratio of 9:1, making MaRC highlight all parts in the image that indicate the ground-truth class, as long as they are not significant indicators of other classes.
We also evaluate the faithfulness of the explanations created by MaRC and a variety of other interpretability methods using the same metrics as for textual inputs. For this experiment, we use pretrained ResNet-101 and ViT-B/16 models on a random sample of 500 ImageNet validation images, with further implementation details being described in Appendix A. As shown in Table 2, MaRC is the best-performing method with respect to sufficiency for both ResNet-101 and ViT-B/16, showing that the areas that MaRC highlights are indeed the areas that allow the model to predict the correct class based solely on these regions. With respect to comprehensiveness, MaRC achieves competitive results, only falling behind model-specific methods that heavily use knowledge about the inner workings of the model and the information flow inside it (e.g., transition attention maps (TAM), Yuan et al., 2021), as well as two other methods, namely Guided Backpropagation (Springenberg et al., 2015) and Integrated Gradients (Sundararajan et al., 2017). The latter two methods often predict individual pixels that are spread over many areas of the image as the most indicative input features, indicating that the removal of key pixels at different positions of the image is a good strategy to quickly decrease the classifier performance, an approach that MaRC is actively discouraged from pursuing.

Conclusion
We propose a new method for creating explanations for neural network predictions that are faithful to the model's reasoning process as well as being sensible with respect to human judgment.We achieve state-of-the-art results on the task of rationale extraction, achieve competitive or state-of-the-art results with respect to faithfulness, and provide visually sensible explanations for classifications of images.As MaRC is model-agnostic, we believe it to be a useful tool in many areas of machine learning that include textual or image inputs.We further believe that other domains can make use of MaRC, including multimodal tasks that, for example, combine textual and image inputs, as well as other domains that fulfill the requirements on the spatial structure of the input, like auditory data.

Limitations
Compared to other interpretability methods, MaRC is able to create explanations that more closely resemble human rationales.Nevertheless, the similarity to human rationales is always limited by the inner workings of the respective neural network: If a network's reasoning does not mirror human reasoning, the resulting rationales will be incomprehensible to humans.
Additionally, rationales created by MaRC are the result of a complete input optimization process. Therefore, the rationale creation usually requires hundreds of forward passes and gradient evaluations for the respective neural network, which makes the process of creating the rationale time-consuming and therefore infeasible for many real-time applications. On modern hardware, creating a rationale for BERT base can take two to three minutes depending on the length of the input text, while ResNet-101 and ViT-B/16 are faster at about one minute.

A Experimental Details
The implementation of MaRC for the experiments conducted in this study is available at https://github.com/inas-argumentation/Explainability.

A.1 Rationale Extraction
We perform rationale detection using BERT base (uncased) (Devlin et al., 2019), which we train as a binary classifier for at most 20 epochs on the first eight folds of the movie review dataset (Zaidan et al., 2007), with the ninth and tenth fold being used for validation and testing, respectively. For the optimization, we use the Adam optimizer and achieve a 96.5% test set accuracy.
For rationale creation, the results from Table 1 as well as the example images for MaRC were created by using the optimization objective given by Equation 8 with all specifications as described in Section 3.2, hyperparameters set to α_λ = 1, α_σ = 1.2 and w and σ being uniformly initialized to 1.2 and 2, respectively. We add zero-mean Gaussian noise to x and xc (standard deviation σ = 0.03) and randomly set 5% of mask values to 0 or 1, respectively, in each optimization step. We use the log-likelihood of the respective class as scoring function. All these choices were made by using the validation split, with the measure of quality being visual coherence of the created explanation, as the data set does not offer a validation split with the same label distribution, thus making validation with respect to scores infeasible. Texts that surpass the limit of 510 input tokens for BERT base are split into multiple segments, with consecutive segments overlapping by 100 tokens, and a separate mask is predicted for each segment. The resulting masks are concatenated, with the overlapping parts being linearly blended. We proceed in the same way for all other interpretability methods.
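The segment handling for long texts described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function names are hypothetical, and the exact shape of the linear cross-fade in the overlap region is an assumption, since the text only states that overlapping parts are linearly blended.

```python
import numpy as np

def split_into_segments(n_tokens, max_len=510, overlap=100):
    """Return (start, end) index pairs that cover n_tokens with overlapping windows."""
    if n_tokens <= max_len:
        return [(0, n_tokens)]
    stride = max_len - overlap  # assumes 0 <= overlap < max_len
    segments, start = [], 0
    while True:
        end = min(start + max_len, n_tokens)
        segments.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return segments

def blend_masks(segments, masks, n_tokens):
    """Merge per-segment masks; overlapping regions are linearly cross-faded."""
    total = np.zeros(n_tokens)
    weight = np.zeros(n_tokens)
    for i, ((start, end), mask) in enumerate(zip(segments, masks)):
        w = np.ones(end - start)
        if i > 0:
            ov = segments[i - 1][1] - start  # overlap with the previous segment
            if ov > 0:
                w[:ov] = np.linspace(0.0, 1.0, ov + 2)[1:-1]  # fade in
        if i < len(segments) - 1:
            ov = end - segments[i + 1][0]  # overlap with the next segment
            if ov > 0:
                w[-ov:] = np.linspace(1.0, 0.0, ov + 2)[1:-1]  # fade out
        total[start:end] += w * np.asarray(mask, dtype=float)
        weight[start:end] += w
    return total / weight

# Usage: a text of 1000 tokens yields three segments of at most 510 tokens.
segments = split_into_segments(1000)
masks = [np.random.rand(end - start) for start, end in segments]
full_mask = blend_masks(segments, masks, 1000)
```

With a fixed stride of 410 tokens, every pair of consecutive segments shares exactly 100 tokens, and the fade-in and fade-out weights in each overlap sum to one.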
The following models and parameters were used in the method comparison:
• Occlusion (Zeiler and Fergus, 2014): We chose to mask slightly larger spans of 5 tokens as this produced smoother masks which resulted in higher IoU F1 scores. Occluded parts were replaced by PAD-tokens.
• Saliency (Simonyan et al., 2013): No special hyperparameter settings required. In each evaluation, we randomly select 5 13% of tokens and replace them as well as the next three tokens with PAD-tokens. We train a linear classifier and use the resulting weights as rationale.
• Shapley (Shapley value sampling, Castro et al., 2009): We evaluate the token contributions for 25 feature permutations per sample. Removed tokens are replaced by PAD-tokens.
We use the implementations provided by Kokhlikyan et al. (2020) for all methods. All methods have access to the ground truth label and therefore do not have to rely on a correct classifier prediction.
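To illustrate why occluding spans of 5 tokens yields smoother masks than occluding single tokens, the following library-independent sketch may help. It is not the implementation by Kokhlikyan et al. (2020) used in the experiments: the function name, the stride of one token, the averaging over all windows that cover a token, and the toy scoring function are assumptions made for illustration.

```python
import numpy as np

PAD = "[PAD]"

def occlusion_scores(tokens, score_fn, span=5):
    """Average score drop per token over all occluded windows that cover it."""
    base = score_fn(tokens)
    totals = np.zeros(len(tokens))
    counts = np.zeros(len(tokens))
    for start in range(0, max(1, len(tokens) - span + 1)):
        end = min(start + span, len(tokens))
        # Replace the current window by PAD tokens and measure the score drop.
        occluded = tokens[:start] + [PAD] * (end - start) + tokens[end:]
        drop = base - score_fn(occluded)
        totals[start:end] += drop
        counts[start:end] += 1
    return totals / counts

# Toy scoring function (hypothetical): counts occurrences of the word "great".
tokens = ["a", "truly", "great", "movie", "overall",
          "but", "long", "and", "slow", "ending"]
scores = occlusion_scores(tokens, lambda t: float(t.count("great")))
```

In this toy example, the score assigned to the neighbors of "great" decays gradually with distance, which is the smoothing effect that a larger span introduces.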
For methods that produce a score for each entry of the embedding vector, we report results for two ways of combining these scores into a single value per token: taking the vector norm (results are reported for the L1 norm, but we did not see a significant difference for the L2 norm) and summing over the scores (indicated by subscripts n and s in Table 1, respectively). In the latter case, the resulting value was negated if the target label is 0.
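The two reduction schemes can be written down compactly. The sketch below assumes that the attributions are available as an array of shape (number of tokens, embedding dimension); the function names are hypothetical.

```python
import numpy as np

def reduce_by_norm(attributions):
    """Subscript n: L1 norm over the embedding dimension, always non-negative."""
    return np.abs(attributions).sum(axis=-1)

def reduce_by_sum(attributions, target_label):
    """Subscript s: signed sum over the embedding dimension.

    The sign is flipped for target label 0, so that high values always
    indicate evidence for the target class.
    """
    scores = attributions.sum(axis=-1)
    return -scores if target_label == 0 else scores

# Usage with two tokens and a two-dimensional embedding.
attributions = np.array([[1.0, -2.0], [0.5, 0.5]])
norm_scores = reduce_by_norm(attributions)
sum_scores_pos = reduce_by_sum(attributions, target_label=1)
sum_scores_neg = reduce_by_sum(attributions, target_label=0)
```

The norm discards the sign of the attribution, whereas the sum keeps it and therefore needs the label-dependent negation.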
To evaluate the token F1 score and the IoU F1 score, we need to create a binary mask from the continuous scores produced by the different interpretability methods. We do this by selecting the top-scoring words as rationale, with the percentage of words that are selected being decided by a Nadaraya-Watson kernel regression model using an RBF kernel. The input to the kernel regression for a given sample is the percentage of words that have a score greater than a fixed threshold (a hyperparameter, here set to 0.1), while the output is the percentage of words to be selected as rationale. As we use the rationales from DeYoung et al. (2020) (who only annotated 200 samples) for our experiment, we do not have access to a separate training set to train the kernel regression, so we resort to a leave-one-out scheme to use the same set for training and testing.
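A minimal sketch of this binarization procedure is given below. The kernel bandwidth, the rounding of the selected fraction to a word count, and all function names are assumptions; only the threshold of 0.1, the RBF kernel, and the leave-one-out scheme are taken from the description above.

```python
import numpy as np

def fraction_above(scores, threshold=0.1):
    """Regression input: share of words whose score exceeds the threshold."""
    return float(np.mean(np.asarray(scores) > threshold))

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.1):
    """Kernel-weighted average of y_train using an RBF kernel."""
    weights = np.exp(-0.5 * ((x_query - x_train) / bandwidth) ** 2)
    return float(np.sum(weights * y_train) / np.sum(weights))

def leave_one_out_predictions(x, y, bandwidth=0.1):
    """Predict each sample's rationale fraction from all other samples."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    preds = np.empty(len(x))
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        preds[i] = nadaraya_watson(x[keep], y[keep], x[i], bandwidth)
    return preds

def binarize_top_fraction(scores, fraction):
    """Mark the top-scoring share of words as rationale."""
    scores = np.asarray(scores)
    k = int(round(fraction * len(scores)))
    mask = np.zeros(len(scores), dtype=bool)
    if k > 0:
        mask[np.argsort(scores)[::-1][:k]] = True
    return mask

# Usage: x holds the inputs of all annotated samples, y the human rationale fractions.
x = [0.10, 0.20, 0.30, 0.40]
y = [0.25, 0.25, 0.25, 0.25]
predicted_fractions = leave_one_out_predictions(x, y)
mask = binarize_top_fraction([0.9, 0.1, 0.5, 0.3], fraction=0.5)
```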
For the faithfulness evaluation, we note that we deviate from the common practice of evaluating the area under the curve (AUC) (e.g., used by Petsiuk et al., 2018) and instead take the average over the tested range of values. We do this to accommodate the possibility of negative scores in the sufficiency calculation, which undermine the theoretical foundation of the AUC. We also adapt the comprehensiveness calculation accordingly for consistency.
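The difference between the two aggregation schemes can be made concrete with a small numerical sketch. The fractions and score values below are invented purely for illustration and are not results from this study; the function names are hypothetical.

```python
import numpy as np

def mean_over_range(values):
    """Aggregation used here: plain average over all tested values."""
    return float(np.mean(values))

def trapezoid_auc(fractions, values):
    """Common alternative: area under the curve via the trapezoidal rule."""
    fractions = np.asarray(fractions, dtype=float)
    values = np.asarray(values, dtype=float)
    widths = np.diff(fractions)
    heights = 0.5 * (values[:-1] + values[1:])
    return float(np.sum(widths * heights))

# Hypothetical sufficiency scores, including a negative entry.
fractions = [0.0, 0.5, 1.0]
values = [0.2, -0.1, 0.5]
average_score = mean_over_range(values)
auc_score = trapezoid_auc(fractions, values)
```

The average weights every tested value equally and remains well defined for negative entries, whereas the trapezoidal rule weights the endpoints only half as much as interior points and is usually interpreted as an area, which presupposes non-negative values.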

A.2 ImageNet Explanations
We use MaRC with the optimization objective given by Equation 9. We use pretrained ResNet-101 (He et al., 2016) and vision transformer ViT-B/16 (Kolesnikov et al., 2021, input image size = 384) models and use the following hyperparameter settings for MaRC:
• ResNet-101: We set α_λ = 0.6, α_σ = 1.2, α_NB = 10 and initialize w and σ uniformly to 0.5 and 1.2, respectively. As ResNet models seem to treat uniformly colored images as uninformative, we chose B to be a set containing a black image, a white image, and an image with the mean color from the dataset. As the scoring function, we chose the log of a combination of the softmax output of c (weighted by 0.9) and the sigmoid of the logit of c (weighted by 0.1).
• ViT-B/16: We set α_λ = 0.25, α_σ = 1.2, α_NB = 10 and initialize w and σ uniformly to 0.5 and 1.2, respectively. As the vision transformer often seems to interpret the uniformly colored backgrounds as indicative of specific

Figure 1: An exemplary rationale created by MaRC for the prediction of the positive sentiment label.

Figure 2: Comparison of masks created for ResNet-101 by different explainability methods.

Figure 3: Comparison of masks created for ViT-B/16 by different explainability methods.

Table 1: Results for rationale extraction on the movie reviews dataset (DeYoung et al., 2020), including faithfulness evaluation. See Section A for an overview of the methods tested and for experimental details.

Table 2: Results for the faithfulness evaluation of different explainability methods for ResNet-101 and ViT-B/16. See Section A for an overview of the methods tested and for experimental details.