Towards Faithful Explanations for Text Classification with Robustness Improvement and Explanation Guided Training

Feature attribution methods highlight important input tokens as explanations for model predictions and have been widely applied to deep neural networks in pursuit of trustworthy AI. However, recent works show that the explanations these methods provide face challenges of faithfulness and robustness. In this paper, we propose a method with Robustness improvement and Explanation Guided training towards more faithful EXplanations (REGEX) for text classification. First, we improve model robustness with input gradient regularization and virtual adversarial training. Second, we use saliency ranking to mask noisy tokens and maximize the similarity between model attention and feature attribution, which can be seen as a self-training procedure that imports no external information. We conduct extensive experiments on six datasets with five attribution methods, and also evaluate faithfulness in the out-of-domain setting. The results show that REGEX improves the fidelity metrics of explanations in all settings and further achieves consistent gains on two randomization tests. Moreover, we show that using the highlight explanations produced by REGEX to train select-then-predict models results in task performance comparable to the end-to-end method.


Introduction
As the broad adoption of Pre-trained Language Models (PLMs) requires humans to trust their output, we need to understand the rationale behind that output and even ask how the model comes to its decision (Lipton, 2018). Recently, explanation methods for interpreting why a model makes certain decisions have been proposed and have become increasingly important. For example, feature attribution methods assign scores to tokens and highlight the important ones as explanations (Sundararajan et al., 2017; Jain et al., 2020; DeYoung et al., 2020).
However, recent studies show that these explanations face challenges of being faithful and robust (Yeh et al., 2019; Sinha et al., 2021; Ivankay et al., 2022), as illustrated in Figure 1. Faithfulness means that the explanation accurately represents the reasoning behind model predictions (Jacovi and Goldberg, 2020). Though some works use higher-order gradient information (Smilkov et al., 2017), incorporate game-theoretic notions (Hsieh et al., 2021), or learn from priors (Chrysostomou and Aletras, 2021a), how to improve the faithfulness of highlight explanations remains an open research problem. Besides, an explanation should be stable between functionally equivalent models trained from different initializations (Zafar et al., 2021). Intuitively, the potential causes of these challenges are that (i) the model itself is not robust, which mostly leads to unfaithful and fragile explanations (Alvarez-Melis and Jaakkola, 2018; Li et al., 2022), and (ii) the explanation methods themselves lack robustness to imperceptible perturbations of the input (Ghorbani et al., 2019), so better explanation methods need to be developed. In this paper, we focus on the former and argue that there are connections between model robustness and explainability; progress on one may represent progress on both.

Figure 1: Visualization of positive and negative highlights produced by post-hoc explanation methods (e.g., feature attribution). These explanations suffer from unfaithfulness problems (e.g., the same model framework A and A' yields different attributions) and can be further fooled by adversarial manipulation without changing the model output (Ghorbani et al., 2019) (see §4.4).
To this end, we propose a method with Robustness improvement and Explanation Guided training to improve the faithfulness of EXplanations (REGEX) while preserving task performance for text classification. First, we apply input gradient regularization and virtual adversarial training to improve model robustness. While previous works found that these mechanisms can improve the adversarial robustness and interpretability of deep neural networks (Ross and Doshi-Velez, 2018; Li et al., 2022), to the best of our knowledge their effect on the faithfulness of model explanations has not been explored. Second, our method leverages token attributions aggregated by the explanation method, which provides a local linear approximation of the model's behaviour (Baehrens et al., 2010). We mask input tokens with low feature attribution scores to generate perturbed text and then maximize the similarity between the new attention and the attribution scores. Furthermore, we minimize the Kullback-Leibler (KL) divergence between the model attention on the original input and the attributions. The main idea is to let the attention distribution of the model learn from input importance during training, reducing the effect of noisy information.
To verify the effectiveness of REGEX, we consider a variety of classification tasks across six datasets with five attribution methods. Additionally, we conduct extensive empirical studies to examine the faithfulness of the five feature attribution approaches in out-of-domain settings. The results show that REGEX improves the faithfulness of highlight explanations, measured by sufficiency and comprehensiveness (DeYoung et al., 2020), in all settings while outperforming or performing comparably to the baseline, and further achieves consistent gains on two randomization tests. Moreover, we show that using the explanations output by REGEX to train select-then-predict models results in task performance comparable to the end-to-end method, where the former trains an independent classifier using only the rationales extracted by the pre-trained extractor (Jain et al., 2020). Considering that neural network models may be the primary source of fragile explanations (Ju et al., 2022; Tang et al., 2022), our work can be seen as a step towards understanding the connection between explainability and robustness, two desiderata of trustworthy AI. The main contributions of this paper can be summarized as follows:
• We explore how to improve the faithfulness of highlight explanations generated by feature attributions in text classification tasks.
• We propose an explanation guided training mechanism towards faithful attributions, which encourages the model to learn from input importance during training to reduce the effect of noisy tokens.

Related Work

Model Robustness Adversarial training and related techniques have been shown to improve model robustness (Hendrycks et al., 2020; Wang et al., 2021). However, as debugging tools for black-box models, explanation methods also lack robustness to imperceptible and targeted perturbations of the input (Heo et al., 2019; Camburu et al., 2019; Meister et al., 2021; Hsieh et al., 2021). Since significantly different explanations can be produced for similar models (Zafar et al., 2021), eliciting more reliable explanations is a promising direction towards interpretation robustness. Different from Camburu et al. (2020), which addresses the inconsistency of explanations, we investigate the connection between model robustness and the faithfulness of explanations.

Explanation Faithfulness
The faithfulness of explanations is important for NLP tasks, especially when humans refer to model decisions (Kindermans et al., 2017; Girardi et al., 2018). Jacovi and Goldberg (2020) first propose to evaluate the faithfulness of Natural Language Processing (NLP) methods by separating the definitions of faithfulness and plausibility, and provide guidelines on how the evaluation of explanation methods should be conducted. Recently, some works have focused on faithfulness measurements for NLP model explanations and on improving the faithfulness of specific explanations (Wiegreffe et al., 2021; Yin et al., 2021; Chrysostomou and Aletras, 2021b; Bastings et al., 2022). Among them, Ding and Koehn (2021) propose two consistency tests intended to measure whether post-hoc explanations remain consistent across similar models.

Figure 2: The overall framework of the proposed REGEX method. REGEX consists of two components, for robustness improvement and explanation guided training respectively. For the latter, we iteratively mask input tokens with low attribution scores and then minimize the KL divergence between the attention of the masked input and the feature attributions.
Incorporate Explanations into Learning
While most previous explanation methods have been developed for explaining deep neural networks, some works explore the potential to leverage these explanations to help build better models (Liu and Avci, 2019; Rieger et al., 2020; Jayaram and Allaway, 2021; Ju et al., 2021; Bhat et al., 2021; Han and Tsvetkov, 2021; Ismail et al., 2021; Chrysostomou and Aletras, 2021a; Stacey et al., 2022; Ye and Durrett, 2022). For example, Hase and Bansal (2021) propose a framework to understand the role of explanations in learning, and find that explanations are best used in a retrieval-based modeling approach. Similarly, Adebayo et al. (2022) investigate whether post-hoc explanations effectively detect model reliance on spurious training signals, and the answer appears to be negative. While effectively incorporating explanations remains an open problem, we focus on using model explanations in a self-training way to improve their faithfulness.

Problem Formulation
First, we consider the setting of a multi-class text classification problem with n input examples. The input space embedded into vectors is X ⊆ R^{l×d} and the output space is Y. A neural classifier f_θ : X → Y, parameterized by θ, maps an example x = (x_1, ⋯, x_l) ∈ X, where l is the sequence length, to an output class. The network is optimized by minimizing the cross-entropy loss L over the training set:

$$\mathcal{L}_{CE}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) \tag{1}$$

Then, given an input and its prediction f_θ(x_i) = y_i, the goal of feature attribution is to assign each token a normalized score, which can then be used to extract a compact set of relevant sub-sequences with respect to the prediction. Formally, an attribution of the prediction at input x_i is a vector a_i = (a_{i1}, ⋯, a_{il}), where a_{ij} is the attribution of token x_{ij}. We denote the set of extracted tokens (i.e., highlight explanations or rationales), obtained by taking the top-k values of a_i from x_i, as r_i, and use r̄_i = x_i \ r_i as the complementary set of irrelevant tokens.
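To make the notation concrete, the top-k rationale extraction can be sketched as follows; this is a minimal illustration with a hypothetical helper name, not the paper's released code.

```python
def extract_rationale(tokens, attributions, k):
    """Take the top-k tokens by attribution score as the rationale r_i,
    keeping the original token order; the remaining tokens form the
    complementary set of irrelevant tokens."""
    order = sorted(range(len(tokens)), key=lambda i: -attributions[i])
    top = set(order[:k])
    r = [t for i, t in enumerate(tokens) if i in top]
    r_bar = [t for i, t in enumerate(tokens) if i not in top]
    return r, r_bar

# Toy example: the two highest-scored tokens form the rationale.
tokens = ["the", "movie", "was", "wonderful", "overall"]
scores = [0.05, 0.10, 0.02, 0.90, 0.15]
r, r_bar = extract_rationale(tokens, scores, k=2)
# r -> ["wonderful", "overall"]; r_bar -> ["the", "movie", "was"]
```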

Robustness Improvement
Adversarial attacks are inputs intentionally constructed to mislead neural networks (Szegedy et al., 2013; Goodfellow et al., 2015). Given f_θ and an input x ∈ X with label y ∈ Y, an adversarial example x_adv satisfies:

$$x_{adv} = x + \epsilon, \quad f_\theta(x_{adv}) \neq y \tag{2}$$

where ϵ is the worst-case perturbation. Several defense methods have been proposed to increase the robustness of deep neural networks to adversarial attacks. We adopt two popular ones: virtual adversarial training (Miyato et al., 2015), which leverages a regularization loss to promote the smoothness of the model distribution, and input gradient regularization (Ross and Doshi-Velez, 2018), which regularizes the gradient of the cross-entropy loss.
Note that the methods used to improve robustness are not limited to these techniques. As shown in Figure 2, we aim to improve the robustness of deep neural networks intrinsically. Instead of adopting an adversarial training objective, we follow Jiang et al. (2019) and regularize the standard objective using virtual adversarial training (Miyato et al., 2018):

$$\mathcal{L}_{VAT}(\theta) = \max_{\|\delta\| \leq \epsilon} D_{KL}\left(p_\theta(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x + \delta)\right) \tag{3}$$

The goal of this approach is to enhance label smoothness in the embedding neighborhood. Specifically, we run additional projected gradient steps to find the perturbation δ that violates local smoothness, maximizing the adversarial loss. On the other hand, input gradient regularization trains neural networks by minimizing not just the "energy" of the network but also the rate of change of that energy with respect to the input features (Drucker and LeCun, 1992). The goal is to ensure that if any input changes slightly, the KL divergence between the predictions and the labels will not change significantly. Formally, it takes the original loss term and penalizes the ℓ2 norm of its gradient with respect to the input:

$$\mathcal{L}_{IGR}(\theta) = \left\| \nabla_x \mathcal{L}_{CE}(\theta) \right\|_2^2 \tag{4}$$

It can also be interpreted as applying a particular projection to the Jacobian of the logits and regularizing it (Ross and Doshi-Velez, 2018).
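A compact PyTorch sketch of the two regularizers, under our reading of the cited papers; the function names and the single ascent step for the virtual adversarial direction are simplifications for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def igr_loss(model, x, y, lam=0.01):
    """Input gradient regularization (Ross and Doshi-Velez, 2018):
    cross-entropy plus an l2 penalty on d(loss)/d(input), discouraging
    predictions that change sharply under small input perturbations."""
    x = x.clone().requires_grad_(True)
    ce = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(ce, x, create_graph=True)
    return ce + lam * grad.pow(2).sum()

def vat_loss(model, x, eps=1e-5, xi=1e-3):
    """Virtual adversarial loss (Miyato et al., 2018), one ascent step:
    find the direction that most changes the output distribution, then
    penalize the KL divergence at that worst-case perturbation."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=-1)          # reference distribution
    d = xi * F.normalize(torch.randn_like(x), dim=-1)
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + d), dim=-1), p, reduction="batchmean")
    g, = torch.autograd.grad(kl, d)              # ascent direction
    delta = eps * F.normalize(g, dim=-1)
    return F.kl_div(F.log_softmax(model(x + delta), dim=-1), p,
                    reduction="batchmean")
```

In REGEX both terms would be added to the standard cross-entropy objective; the paper runs C = 2 ascent iterations rather than one (Appendix B).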

Explanation Guided Training
If post-hoc explanations faithfully quantify model predictions, irrelevant tokens should have low feature attribution scores (Ismail et al., 2021). Based on this intuition, we leverage existing explanations to guide the model to reduce the feature attribution scores of irrelevant tokens without sacrificing model performance. Concretely, we propose the Explanation Guided Training (EGT) mechanism. Instead of using the saliency method (i.e., the gradient of the target class with respect to the input) (Simonyan et al., 2014), we apply the Integrated Gradients (IG) method (Sundararajan et al., 2017), which is more faithful by axiomatic arguments, to calculate token importance. We do not assume that IG is fully faithful, and we also experiment with other attribution methods in §5.1. IG integrates the gradient along the path from an uninformative baseline to the original input. The baseline input is one that yields a high-entropy prediction representing uncertainty. As IG takes a straight path between baseline and input, it requires computing gradients several times. The motivation for using a path integral rather than the vanilla gradient is that the gradient may be saturated around the input, which the path integral alleviates. Formally, given an input x and baseline x′, the integrated gradient along the i-th dimension is defined as:

$$IG_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial f_\theta\left(x' + \alpha(x - x')\right)}{\partial x_i} \, d\alpha \tag{5}$$

where ∂f_θ(x)/∂x_i is the gradient of f along the i-th dimension at x, which is the concatenated embedding of the input sequence, and the attribution of each token is the sum of the attributions of its embedding dimensions. Note that we attribute the model output with respect to the ground-truth labels during training.
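The path integral is approximated in practice with a Riemann sum over interpolation steps. The following sketch is our own illustration, using a toy linear model for which IG is exact:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=50):
    """Midpoint-rule approximation of the integrated gradient:
    IG_i = (x_i - x'_i) * mean_k grad_f(x' + alpha_k * (x - x'))_i."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# For a linear model f(x) = w.x the gradient is constant, so IG is exact
# and satisfies completeness: sum_i IG_i = f(x) - f(baseline).
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
attr = integrated_gradients(lambda z: w, x, baseline)
# attr == w * x
```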
After calculating each token's importance score by ℓ2 aggregation over embedding dimensions, we sort the tokens of x by these scores and mask the bottom K% according to that sorting. We define the sorting function s(·) and the masking function m(·): s_i(x) is the i-th smallest element of x, and m_K(s(x), x) replaces all x_i ∈ {s_i(x)}_{i=1}^{⌈Kl⌉} with a mask distribution, i.e., m_K(s(x), x) removes the K% lowest-scored features from x based on the order provided by s(x). During training, we generate a new input x̃ for each example x by masking the features with low attribution scores:

$$\tilde{x} = m_K(s(a), x) \tag{6}$$

x̃ is then passed through the network, which yields attention scores att(x̃). Following Jain et al. (2020), the attention scores are taken as the mean self-attention weights induced from the first token index to all other indices. We then maximize the similarity between att(x) and att(x̃) to ensure that the model produces similar output probability distributions over labels for both masked and unmasked inputs. The optimization objective for EGT is:

$$\mathcal{L}_{EGT} = D_{KL}\left(att(x) \,\|\, a\right) + D_{KL}\left(att(\tilde{x}) \,\|\, a\right) \tag{7}$$

where D_KL is the KL divergence between two distributions. The motivation behind the two KL divergence terms is to encourage the model to focus on highly salient words and ignore low-salience words during training, and to generate similar outputs for the original input x and the masked input x̃, which can be seen as a special adversarial example. On the other hand, as the computation of the masked input is batch-wise, the model should iteratively learn to assign low gradient values to tokens irrelevant to the predicted label.
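The masking and KL steps can be sketched as follows; the helper names, the mask-token replacement, and the exact KL direction are our assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def mask_low_attribution(input_ids, attributions, mask_id, ratio=0.15):
    """Build the perturbed input: replace the bottom `ratio` fraction of
    tokens, ranked by attribution score, with a mask token."""
    l = input_ids.size(-1)
    k = max(1, int(ratio * l))
    low = torch.argsort(attributions, dim=-1)[..., :k]  # lowest-scored indices
    masked = input_ids.clone()
    masked.scatter_(-1, low, mask_id)
    return masked

def egt_loss(att_x, att_x_masked, attributions, eps=1e-12):
    """Two KL terms pulling the attention distributions of the original and
    masked inputs towards the (normalized) attribution distribution.
    F.kl_div(log_a, q) computes KL(q || a), i.e. D_KL(attention || a)."""
    a = F.softmax(attributions, dim=-1)
    log_a = torch.log(a + eps)
    return (F.kl_div(log_a, att_x, reduction="batchmean")
            + F.kl_div(log_a, att_x_masked, reduction="batchmean"))
```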

Training
We define the final weighted loss as:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{CE} + \lambda_2 \mathcal{L}_{VAT} + \lambda_3 \mathcal{L}_{IGR} + \lambda_4 \mathcal{L}_{EGT} \tag{8}$$

where λ1, λ2, λ3 and λ4 are weighting hyper-parameters (values are given in Appendix B).

Erasure-based Faithfulness Evaluation
To evaluate post-hoc explanations, we adopt sufficiency, which measures the degree to which the highlight explanation is adequate for the model to make its prediction, and comprehensiveness, which measures the influence of the explanation on the prediction (DeYoung et al., 2020). These two metrics are commonly used to evaluate faithfulness since they require no re-training; the main idea is to estimate the effect of changing parts of the input on the model output. Let p_θ(y_j | x_i) be the output probability of the j-th class for the i-th example, with rationale r_i extracted according to the attribution scores. Formally, the normalized sufficiency we use is:

$$S(x_i, y_j, r_i) = 1 - \max\left(0,\, p_\theta(y_j \mid x_i) - p_\theta(y_j \mid r_i)\right)$$
$$\text{NormSuff}(x_i, y_j, r_i) = \max\left(0,\, \frac{S(x_i, y_j, r_i) - S(x_i, y_j, 0)}{1 - S(x_i, y_j, 0)}\right)$$

where higher sufficiency values are better, as we normalize and rescale it between 0 and 1, and S(x_i, y_j, 0) is the sufficiency of a baseline input with an empty rationale. Similarly, we define normalized comprehensiveness as:

$$\text{NormComp}(x_i, y_j, r_i) = \max\left(0,\, \frac{p_\theta(y_j \mid x_i) - p_\theta(y_j \mid x_i \setminus r_i)}{p_\theta(y_j \mid x_i)}\right)$$

where higher comprehensiveness values are better. As choosing the appropriate rationale length is dataset dependent, we use the Area Over the Perturbation Curve (AOPC) for both sufficiency and comprehensiveness: we define bins of tokens to be erased and average the measures across bins. Here, we place the top 1%, 5%, 10%, 20%, and 50% of tokens into bins in order of decreasing attribution scores.
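Under the formulation above (which follows the normalized metrics of Chrysostomou and Aletras, 2022), the per-example computation can be sketched as follows; the function names are ours:

```python
def norm_sufficiency(p_full, p_rationale, p_zero):
    """Normalized sufficiency: 1 - max(0, p(y|x) - p(y|r)), rescaled so the
    empty-rationale baseline maps to 0. Higher is better."""
    s = 1.0 - max(0.0, p_full - p_rationale)
    s0 = 1.0 - max(0.0, p_full - p_zero)
    if s0 >= 1.0:  # degenerate case: the baseline is already fully sufficient
        return 1.0
    return max(0.0, (s - s0) / (1.0 - s0))

def norm_comprehensiveness(p_full, p_without_rationale):
    """Normalized comprehensiveness: probability drop when the rationale is
    erased, scaled by the original probability. Higher is better."""
    return max(0.0, (p_full - p_without_rationale) / p_full)

def aopc(values):
    """Area Over the Perturbation Curve: average of a metric across the
    rationale-length bins (here 1%, 5%, 10%, 20%, 50%)."""
    return sum(values) / len(values)

# A rationale preserving the full prediction is maximally sufficient:
suff = norm_sufficiency(0.9, 0.9, 0.5)   # -> 1.0
comp = norm_comprehensiveness(0.8, 0.2)  # ~ 0.75
```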

Experiments
We conduct experiments on six datasets under in-domain/out-of-domain settings: SST (Socher et al., 2013), IMDB (Maas et al., 2011), Yelp (Zhang et al., 2015), and AmazDigiMu/AmazPantry/AmazInstr (Ni et al., 2019) (see details in Appendix A). The baseline is a text classification model fine-tuned on the training set, while the same pre-trained language model is used for REGEX. In other words, the baseline is optimized by Eqn. 1 without the robustness improvement and explanation guided training mechanisms.

Post-hoc Explanation Methods
We consider five feature attribution methods and a random attribution method:

Post-hoc Explanations Faithfulness
We conduct experiments on the faithfulness metrics (i.e., normalized sufficiency and normalized comprehensiveness) to compare the fidelity of different post-hoc explanation methods between the baseline and REGEX models. We extract a rationale r from a model by selecting the top-k most important tokens as measured by these post-hoc explanation methods. Following Chrysostomou and Aletras (2022), we also evaluate explanation faithfulness in out-of-domain settings without retraining models (i.e., zero-shot), using their settings with six dataset pairs and a random attribution baseline.
Specifically, the model is first trained on the source dataset, and then we evaluate its performance on the test set of the target dataset.
As shown in Table 1, REGEX improves explanation faithfulness with all five attribution methods by a large margin under most in- and out-of-domain settings. Among them, scaled attention and DeepLift perform better than the others. For example, REGEX surpasses the baseline in the sufficiency metric for explanations extracted by DeepLift under all scenarios, while comprehensiveness decreases when the model is trained on AmazDigiMu and tested on AmazInstr. This shows that REGEX improves the fidelity of post-hoc explanations as measured by sufficiency and comprehensiveness. Nevertheless, we observe a decrease in the comprehensiveness metric for attention and IG on specific datasets. For example, consistent with the uncertainty of attention as an interpretability method (Serrano and Smith, 2019), the fidelity metrics of attention attribution are inferior to the baseline on all three Amazon Reviews datasets.
Overall, feature attribution approaches outperform random attribution in both in- and out-of-domain settings in most cases. Moreover, the results show that post-hoc explanation sufficiency and comprehensiveness are higher on in-domain test sets than out-of-domain, except for the Yelp dataset. On the other hand, as shown in Table 2, REGEX improves performance or achieves task performance similar to the baseline on most out-of-domain datasets.

Quantitative Evaluation by FRESH Method
We further compare the average macro F1 of the FRESH classifier (Jain et al., 2020) across five random seeds in the in- and out-of-domain settings. In short, FRESH is a select-then-predict framework: an extractor is first trained with labels induced by arbitrary feature importance scores over input tokens; then an independent classifier is trained exclusively on the rationales provided by the extractor, which are assumed to be inherently faithful. Here, rationales consisting of the top-k most important tokens are used as the classifier's input for training and testing.
As shown in Table 2, the two best methods are DeepLift and scaled attention, which achieve performance similar to the original full-text input model in both in- and out-of-domain settings, consistent with the faithfulness evaluation. For example, the FRESH classifier using the DeepLift attribution method is higher than the baseline and outperforms the model with full text input (97.1 vs. 96.9) on the Yelp dataset. This also illustrates that performance depends on the choice of feature attribution method.

Explanation Robustness
Following Zafar et al. (2021), we test the implementation invariance of feature attributions with the Untrained Model Test (UIT) and the Different Initialization Test (DIT), which measure consistency by calculating the Jaccard similarity between feature attributions generated by the post-hoc explanation method. We use Jaccard similarity for explanations extracted as the top 25% most important tokens using the scaled attention method; the more similar two attributions are, the higher the Jaccard metric. We compare REGEX and the baseline by comparing two identical models trained from different initializations. #Untrained is an untrained model whose fully connected layers attached on top of the Transformer encoders are randomly initialized. As shown in Table 3, REGEX achieves improved performance over the baseline. For example, REGEX gets 0.56 while the baseline gets 0.36 for Init#3 and Init#4. As expected, the similarity between explanations of the trained and untrained models is low, e.g., 0.12 between Init#4 and #Untrained. This shows that improving the faithfulness of explanations can strengthen interpretation robustness. However, the overall similarity between the two feature attributions remains low, as 50% of the similarity comparisons are below 0.5.
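The Jaccard@25% computation is straightforward to sketch (our own illustration, not the evaluation script):

```python
def jaccard_top_fraction(attr_a, attr_b, fraction=0.25):
    """Jaccard similarity between the top-`fraction` token sets of two
    attribution vectors, as used in the UIT/DIT consistency tests."""
    k = max(1, int(fraction * len(attr_a)))
    def top(a):
        return set(sorted(range(len(a)), key=lambda i: -a[i])[:k])
    sa, sb = top(attr_a), top(attr_b)
    return len(sa & sb) / len(sa | sb)

# Two 8-token attributions that agree on one of their two top tokens:
a = [0.9, 0.1, 0.8, 0.2, 0.7, 0.05, 0.6, 0.3]
b = [0.85, 0.2, 0.1, 0.9, 0.6, 0.05, 0.7, 0.3]
sim = jaccard_top_fraction(a, b)   # top sets {0, 2} and {0, 3} -> 1/3
```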

Ablation Study
We perform ablation studies to explore the effect of robustness improvement and explanation guided training on the faithfulness evaluations, shown in Table 4 (full results are in Appendix E). The performance of the attention method varies more across different hyper-parameters. In Figure 3, we compare different values of λ4 in Eqn. 8 and observe that all methods achieve the best sufficiency at 0.01 and the best comprehensiveness at 0.001. In Figure 4, we compare different mask ratios (§3.3) and find that a mask ratio between 0.15 and 0.2 is useful, as larger values can introduce noise. The choice of aggregation method and feature attribution method in §3.3 has a large effect on the faithfulness evaluation. We find that for most attribution methods, ℓ2 aggregation yields higher fidelity. For example, Saliency with ℓ2 aggregation is better than Saliency with mean aggregation, with a larger sufficiency improvement (0.70 vs. 0.55). Though there is no single best method for explanation guided training, gradient-based methods (e.g., IG, 0.71) may be good choices, in line with Atanasova et al. (2020).
…is the fact that the wonderful RAYMOND MASSEY is relegated to the last twenty or so minutes in the trial scene. … David NIVEN and KIM HUNTER are wonderfully cast as the young lovers…. French accented MARIUS GORING is a delight (he even gets in a remark about Technicolor) as the heavenly messenger sent to reclaim Niven when his wartime death goes unreported due to an oversight. Seeing this tonight on TCM for the first time in twenty or so years, I think it's a supreme example of what a wonderful year 1946 was for films. The Technicolor photography, somewhat subdued and not garish at all, is excellent and the way it shifts into B&W for the heavenly sequences is done with great imagination and effectiveness….

Qualitative Analysis
Table 5 presents two randomly chosen examples from the test set of the IMDB dataset. For example, the top-k important tokens returned by REGEX for the first example are wonderfully, wonderful, wonderful, excellent and great. We observe that these highlight explanations seem intuitive to humans and reasonably plausible. Though faithfulness and plausibility are not necessarily correlated (Jacovi and Goldberg, 2020), we find that the highlights extracted by REGEX contain more sentiment-related words, which should be helpful for review-based text classification.

Conclusion
We explore whether the fidelity of explanations can be further optimized and propose an explanation guided training mechanism. Extensive empirical studies are conducted on six datasets in both in- and out-of-domain settings. Results show that our method REGEX improves both the fidelity metrics and the performance of select-then-predict models. The analysis of explanation robustness further shows that the consistency of explanations is improved. These observations suggest that considering model robustness yields more faithful explanations. In the future, we would like to investigate more PLM architectures and faithfulness metrics under a standard evaluation protocol.
Possible limitations include the limited PLM architectures and sizes studied (although we include additional results with RoBERTa in Appendix D) and the fact that the faithfulness evaluation metrics are not necessarily comprehensive. We also focus only on text classification tasks.
As a result, we do not investigate other language classification tasks (e.g., natural language inference and question answering) or text generation tasks. If we could intrinsically know or derive golden faithful explanations (Bastings et al., 2022; Lindner et al., 2023), the interplay of model robustness and explainability could alternatively be investigated by revealing the internal reasoning processes. Future work could also include human studies (e.g., evaluating whether explanations help users choose the more robust of two models) and improving robustness in more diverse ways (e.g., model distillation and data augmentation).
Our findings are also in line with Tang et al. (2022) and Logic Trap 3 of Ju et al. (2022), which claim that the model's reasoning process is changed rather than the attribution method being unreliable. Different from these two works, which perturb output probabilities or change information flow, we view our results as complementary to their conclusion by sourcing the improvement of faithfulness. Although we show the link between robustness and faithfulness empirically, future work could strengthen the conclusions through discussion on a more conceptual and theoretical level. From a theoretical perspective, one possible reason is that the gradient of a robust model is more aligned with the normal direction to the closest decision boundaries (Wang et al., 2022). In the future, we would like to analyze the relationship between robustness and explainability from a geometric perspective.
Furthermore, we do not exhaustively experiment with all possible evaluation settings of interest, even at the scale of our experiments. For example, saliency guided training (Ismail et al., 2021) could have been used as another baseline. We hope this work inspires future research that develops more effective strategies to make explanations reliable and investigates how our findings translate to large language models, such as the GPT-3 model family; with the emergent capabilities of these models, the fidelity of their explanations or rationales will have societal impacts on the accountability of NLP systems.

A Dataset
We consider six datasets to evaluate explanations and the data statistics are as follows.

B Experiment Settings
We use spaCy to pre-tokenize sentences and apply the BERT-base model to encode text (Devlin et al., 2019). We use the AdamW optimizer with batch sizes of 8, 16, 32 and 64 for model training. The initial learning rate is 1 × 10−5 for fine-tuning BERT parameters and 1 × 10−4 for the classification layer. The maximum sequence length, the dropout rate, the gradient accumulation steps, the training epochs and the hidden size d are set to 256, 0.1, 10%, 10 and 768 respectively. We clip the gradient norm at 1.0. The learning parameters are selected based on the best performance on the development set. Our models are trained on NVIDIA Tesla A100 40GB GPUs (PyTorch, Huggingface/Transformers and Captum). Following Jiang et al. (2019), we set the perturbation size ϵ = 1 × 10−5, the step size η = 1 × 10−3, the ascent iteration steps C = 2 and the variance of the normal distribution σ = 1 × 10−5. The weight parameters λ1, λ2, λ3 and λ4 are set to 1.0, 0.01, 0.5 and 0.01 respectively. The mask ratio K is set to 0.15. The number of steps used by the approximation method in IG is 50, and we use a zero scalar corresponding to each input tensor as the IG baseline. The parameters are selected based on the development set. For the baseline and FRESH models, we use the same Transformer-based models as mentioned previously to encode tokens, and we choose the rationale length following Chrysostomou and Aletras (2022). The model is trained for 10 epochs, and we keep the best models with respect to macro F1 scores on the development sets.
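For reference, the hyper-parameters above can be collected into a single configuration sketch; the key names are ours, while the values are as reported in this appendix.

```python
# Training configuration reported in Appendix B (key names are illustrative).
CONFIG = {
    "lr_encoder": 1e-5,       # fine-tuning BERT parameters
    "lr_classifier": 1e-4,    # classification layer
    "max_seq_len": 256,
    "dropout": 0.1,
    "epochs": 10,
    "hidden_size": 768,
    "grad_clip": 1.0,
    "vat_eps": 1e-5,          # perturbation size epsilon
    "vat_step": 1e-3,         # step size eta
    "vat_iters": 2,           # ascent iteration steps C
    "vat_sigma": 1e-5,        # variance of the normal distribution
    "lambdas": (1.0, 0.01, 0.5, 0.01),  # lambda_1 .. lambda_4
    "mask_ratio": 0.15,       # K
    "ig_steps": 50,
}
```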

C Text Classification to Attacks
We conduct behavioral testing with CHECKLIST (Ribeiro et al., 2020) and TextAttack (Morris et al., 2020). The attack success rate, which is used to evaluate the effectiveness of the attacks, is 3.23%.

D UIT and DIT with Larger Pre-trained Language Model
To further verify the effect of model scale on the results, we conducted experiments on the robustness of explanations with the pre-trained language model RoBERTa (Liu et al., 2019), including UIT and DIT. The experimental results are shown in Table 8. We have two findings: (1) model size has a certain positive effect on the stability of explanations, with the Jaccard similarity improved under both REGEX and the baseline, although the improvement is not significant; (2) REGEX can still improve performance under larger pre-trained models, which further strengthens our findings.

E Full Results
Table 7 presents the Full-text F1 of the variants in the ablation study. Table 9 lists the full results for FRESH (select-then-predict) models. Table 10 lists the full results of the ablation study. From these results, we further find that the sufficiency of the extracted explanations when using only one robustness training method (either virtual adversarial training or input gradient regularization) is inferior to the sufficiency when using no robustness training. We speculate that there are several reasons: (1) the two mechanisms are related, i.e., removing one has a larger impact than removing both simultaneously; (2) the results have variance despite the adoption of the AOPC metric, not to mention that the sufficiency metrics suffer from out-of-distribution challenges; (3) these ablation experiments are on models trained and tested on SST; future work could perform a more detailed ablation analysis on other datasets (such as in out-of-domain settings).
Random: Token importance is assigned at random. Attention (α) (Jain et al., 2020): Normalized attention scores are used to calculate token importance. DeepLift (Shrikumar et al., 2017): The difference between each neuron's activation and a reference vector is used to rank words.

Table 2 :
Average macro F1 results of Full-text and FRESH models with a prescribed rationale length, REGEX vs. baseline (shown in brackets, averaged across 5 seeds). The reference performance (Full-text F1) is from the BERT-base model fine-tuned on the full text. Full results are in Appendix E. Bold numbers represent the results of the best FRESH model trained with rationales from the REGEX model among the five attribution methods.

Table 3 :
Jaccard@25% between the feature attributions (REGEX vs. baseline, here we use scaled attention) for models with same architecture, with same data, and same learning schedule, except for randomly initial parameters.

Table 4 :
Ablation study with different aggregation methods and feature attribution methods in §3.3.

Label: Positive  Prediction: Positive  Dataset: IMDB  ID: Test 1364

…but pompous horror icon Christopher Lee squirming in the midst of it all (the gracefully-aged star has pathetically asserted a number of times in interviews that he hasn't appeared in horror-oriented fare since his last picture for Hammer Films back in 1976!). Anyway, this film should have borne the subtitle "Your Movie Is A Turd", being astoundingly inept in all departments (beginning with the all-important werewolf make-up)! The plot (and dialogue) is not only terrible, but it has the limpest connection with Dante's film; strangely enough, the author of the original novel, Gary Brandner, co-wrote this himself! Still, one of the undeniable highlights (er...low points) of the film is the pointless elliptical editing.

Label: Negative  Prediction: Negative  Dataset: IMDB  ID: Test 1373

Table 5 :
We randomly pick two examples from the test set of the IMDB dataset and highlight the top-k important tokens using the DeepLift method (REGEX vs. baseline).

Table 6 :
Attack results of REGEX and baseline by CHECKLIST attack recipe.

Table 7 :
Macro F1 and standard deviations with different aggregation methods and feature attribution methods in §3.3.

Table 10 :
Full results of ablation study with different aggregation methods and feature attribution methods in §3.3.