Learning to Explain: Generating Stable Explanations Fast

The importance of explaining the outcome of a machine learning model, especially a black-box model, is widely acknowledged. Recent approaches explain an outcome by identifying the contributions of input features to this outcome. In environments involving large black-box models or complex inputs, this leads to computationally demanding algorithms. Further, these algorithms often suffer from low stability, with explanations varying significantly across similar examples. In this paper, we propose a Learning to Explain (L2E) approach that learns the behaviour of an underlying explanation algorithm simultaneously from all training examples. Once the explanation algorithm is distilled into an explainer network, it can be used to explain new instances. Our experiments on three classification tasks, which compare our approach to six explanation algorithms, show that L2E is between 5 and 7.5 × 10^4 times faster than these algorithms, while generating more stable explanations, and having comparable faithfulness to the black-box model.


Introduction
Explaining the mechanisms and reasoning behind the outcome of complex machine learning models, such as deep neural networks (DNNs), is crucial. Such explanations can shed light on the potential flaws and biases within these powerful and widely applicable models, e.g., in medical diagnosis (Caruana et al., 2015) and judicial systems (Rich, 2016).
Existing explainability methods mostly produce explanations, or rationales (DeYoung et al., 2020), which identify the attributions of features in an input example, e.g., whether they contribute positively or negatively to the prediction of an outcome. For text classifiers, this means identifying words or phrases in an input document that account for a prediction.

[Figure 1: Two semantically similar documents classified as Sci/Tech by the black-box, with different explanations from the baseline explainer A. Top: "Novell's Microsoft attack completes Linux conversion: Novell Inc. has completed its conversion to Linux by launching an attack on Microsoft Corp., claiming that the company has stifled software innovation and that the market will abandon Microsoft Windows at some point in the future." Bottom: "Microsoft expands Windows update release: Microsoft Corp. is starting to ramp up distribution of its massive security update for the Windows XP operating system, but analysts say they still expect the company to move at a relatively slow pace to avoid widespread glitches."]

Current approaches are typically computationally demanding, requiring expensive operations, such as consulting a black-box model multiple times (Zeiler and Fergus, 2014), or generating samples to learn an approximate but explainable transparent model (Ribeiro et al., 2016). This computational demand reduces the utility of these explanation algorithms, especially for large black-box models, long documents and real-time scenarios. Further, these algorithms generate explanations for different examples independently, which may lead to the generation of different explanations for similar examples. For example, a black-box predicts with similar confidence (99% and 98%) that the topic of the two semantically similar documents in Figure 1 is Sci/Tech. However, even though the words 'Microsoft' and 'Windows' appear in both documents, the baseline explainer A deems 'Windows' to be important for the top document, and 'Microsoft' for the bottom document (that is, masking these words results in a significant drop in the black-box's confidence).
In this paper, we present a learning to explain (L2E) approach that efficiently learns the commonalities of the explanation process across different examples. This, in turn, leads to explanations that exhibit stability, i.e., important words are chosen consistently, without loss of faithfulness to the underlying black-box. Given a set of examples paired with their explanations produced by an existing method, e.g., LIME (Ribeiro et al., 2016), our approach uses a DNN to learn the explanation algorithm. DNNs are Turing complete (Pérez et al., 2019; Montufar et al., 2014); therefore, given enough training data and learning capacity, they should be able to learn existing explanation algorithms. This is akin to Knowledge Distillation (Hinton et al., 2015), where a teacher, or in our case a teacher algorithm, distils knowledge into a student network.
Our contributions are: (i) the L2E framework, which is general, and can successfully learn to produce explanations from several teacher explainers; (ii) two learning formulations, i.e., Ranking and Sequence Labelling, to enable L2E to circumvent the high variance of non-discrete teacher explanations via discretization; (iii) an experimental setup to compare L2E against six popular explanation algorithms, and a comprehensive evaluation to investigate the stability and faithfulness of L2E on three text classification tasks; (iv) a methodology that employs human rationales as proxies for the ground-truth explanations of a black-box model. The core of this method is a modified training protocol whereby the model makes neutral predictions if human rationales are absent.

Related Work
We consider two main approaches to explanation generation: algorithmic and model-based.
Algorithmic Approaches. These approaches can be broadly categorized into gradient-based, attention-based and perturbation-based methods.
Gradient-based methods (Simonyan et al., 2013; Sundararajan et al., 2017; Shrikumar et al., 2017; Erion et al., 2019) and backpropagation-based methods (Bach et al., 2015) require access to the black-box, and are mostly applied to models with differentiable functions. Further, they may be sensitive to randomized model initializations or permuted data labels (Adebayo et al., 2018), which is undesirable. These methods can be computationally heavy in the case of complex black-box models (Wu and Ong, 2021), e.g., BERT (Devlin et al., 2018).
Perturbation-based methods approximate feature importance by observing changes in a model's outcome after a feature is changed. They either consider changes in performance directly as an indicator of feature importance (Martens and Provost, 2014; Zeiler and Fergus, 2014; Schwab and Karlen, 2019), or they employ a higher-order approximation of the decision boundary (Ribeiro et al., 2016; Lundberg and Lee, 2017). Perturbation-based methods are typically computationally inefficient for explaining high-dimensional data, and they suffer from high variance due to perturbation randomness (Slack et al., 2020; Chen et al., 2019).

Model-based Approaches. These approaches train the explainer with an objective function to improve efficiency at test time. The closest work to ours is by Schwab and Karlen (2019), who train an explainer using a causality-based explanation algorithm. However, these approaches do not learn from arbitrary algorithms or discretize feature weights; the high variation of continuous weights may impair the ability to capture the commonalities in an explanation algorithm. Other work discretizes the weights produced by an existing method, but uses these weights to build a faithful classifier for an underlying black-box model, rather than using them to explain the model directly.
Other works train a classifier and an explainer jointly in order to incorporate explainability directly into the classifier (Lei et al., 2016;Camburu et al., 2018). Unlike these approaches, we do not change the classifier or require an expensive process to collect human rationales, as done in (Camburu et al., 2018). Lastly, a few works use information-theoretic objectives to train an explainer directly from the underlying classifier (Chen et al., 2018;Bang et al., 2019). These explainers require careful training to select a low number of important features (Paranjape et al., 2020); hence, some input features do not have attributions.
Goodness of Explanations. Researchers have quantified the goodness of an explanation in different ways, such as brevity, alignment to human rationales, contrastiveness and stability.
According to Atanasova et al. (2020), only a few algorithmic explanation methods produce stable explanations (Robnik-Šikonja and Bohanec, 2018), e.g., LIME (Ribeiro et al., 2016). To the best of our knowledge, we are the first to explore the stability of explanations in model-based approaches.

Learning to Explain (L2E)
L2E can be applied to any Natural Language Processing task to which an underlying feature-based explanation algorithm can be applied, such as Natural Language Inference and Question Answering (Wang et al., 2020). In this paper, we focus on explaining text classification models.
Our setup requires two inputs: (i) a black-box text classification model ŷ = f_θ(x), which assigns document x to a label ŷ ∈ Y, where Y is the label set; and (ii) an explanation algorithm A(x, ŷ, f_θ) → w, which generates an explanation w ∈ ℝ^{|x|} for the class of document x obtained by the black-box f_θ(x). A can be any off-the-shelf explanation algorithm, and w_i can be thought of as the importance weight of x_i, the i-th token of the document.
The main idea of L2E is to train a separate explanation model g_φ(x) to predict the explanation generated by A(.) for f_θ(.) (Figures 2a and 2b). Intuitively, our approach distils the explanation algorithm A into the explanation model g_φ. As confirmed by our experiments (§4.5), this has several benefits. Firstly, it leads to stable explanations, as g_φ can capture A's common patterns when generating explanations for different documents. Secondly, it speeds up the explanation generation process compared to many existing explanation algorithms, which rely on computationally heavy operations, such as consulting the black-box model multiple times, e.g., Occlusion (Zeiler and Fergus, 2014), or sampling, e.g., LIME (Ribeiro et al., 2016). Our approach, which learns one model from the explanations of all training data, takes advantage of the computations done by A, and generates more stable explanations faster.

Algorithm 1: Learning to Explain (L2E)
 1: D: a training set of documents
 2: f_θ: the original deep NN model
 3: g_φ: the explainer deep NN model
 4: A: the underlying explanation method
 5: procedure TrainExplainer(D, f_θ)
 6:   Z ← ∅
 7:   for each input x ∈ D do
 8:     ŷ ← f_θ(x)
 9:     w ← A(x, ŷ, f_θ)
10:     Z ← Z ∪ {(x, ŷ, w)}
11:   end for
12:   initialize φ randomly
13:   t ← 0
14:   while a stopping condition is not met do
15:     randomly pick (x_t, ŷ_t, w_t) ∈ Z
16:     φ ← φ − η ∇_φ L(g_φ(x_t, ŷ_t), w_t)   ▷ gradient step on the loss
17:     t ← t + 1
18:   end while
19:   return the explanation model g_φ
20: end procedure
Our approach to train the explanation model g φ φ φ is summarized in Algorithm 1. First, the algorithm generates training data in the form of triplets (x x x,ŷ, w w w) (lines 7-11), and then it trains the explanation model using supervised learning (lines 14-18). At test time, the trained model is deployed to generate explanations for unseen documents.
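As a concrete illustration, the two phases of Algorithm 1 can be sketched in Python, with the black-box f_theta, the teacher explainer A and the student g treated as opaque callables. All names here are illustrative stand-ins, not the authors' implementation:

```python
import random

def build_training_set(D, f_theta, A):
    """Phase 1 (lines 7-11): label each document with the black-box
    prediction and the teacher explanation, yielding triplets (x, y_hat, w)."""
    return [(x, f_theta(x), A(x, f_theta(x), f_theta)) for x in D]

def train_explainer(D, f_theta, A, g, loss_fn, sgd_step, n_steps=1000, seed=0):
    """Phase 2 (lines 12-19): fit the student explainer g by repeatedly
    sampling a triplet and taking one supervised gradient step."""
    Z = build_training_set(D, f_theta, A)
    rng = random.Random(seed)
    for _ in range(n_steps):
        x, y_hat, w = rng.choice(Z)
        # sgd_step applies one update of g's parameters on this triplet's loss.
        sgd_step(g, loss_fn(g(x, y_hat), w))
    return g
```

In practice g would be the Transformer encoder of §4.2 and sgd_step an optimizer step in a deep-learning framework; here they are abstract callables so the control flow of Algorithm 1 stands out.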
A crucial component in training the explanation model under supervised learning is the loss function L(g_φ(x_t, ŷ), w). It penalizes the deviation of the predicted explanation g_φ(x_t, ŷ) from the ground-truth explanation w. This loss function is determined by our supervised learning formulation.
Given that w is a continuous-valued vector, learning the model g_φ may be cast as a multivariate regression problem. However, the continuous feature attributions generated by existing explanation algorithms could be sensitive to initializations (Slack et al., 2020). Further, manually annotated rationales (highlighting important words in a document) are sufficient for people to understand and perform a classification task (Zaidan et al., 2007). So, instead of a regression formulation, we consider two supervised learning formulations for discretized outputs: Ranking and Sequence Labeling.
Ranking Formulation. In this formulation, the explanation model aims to learn the ranking of the document tokens from their importance weights. That is, we consider the ordering of the token weights induced by w, and train the explanation model g_φ such that it induces the same ordering. Specifically, the loss function is as follows:

  L(g_φ(x, ŷ), w) = Σ_{1≤i<j≤|x|} max(0, 1 − g_{φ,k*}(x, ŷ) + g_{φ,k̄}(x, ŷ))

where g_{φ,k}(x, ŷ) is the score of the k-th token predicted by the explanation model, k* = argmax_{k∈{i,j}} |w_k|, and k̄ is the other index in the pair. In other words, each pair of token weights is compared, and the parameters are learnt such that a token with a high importance weight under A also gets a high score under g_φ.

[Figure 2: (a) Pipeline of our L2E method; dashed arrows represent offline processes. (b) Detailed input and output for the sequence labeling formulation of our explanation model on the example 'This is a great movie'; the red '+' label indicates that g_φ considers 'great' to be more important than other words in the prediction ŷ.]
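The pairwise comparison can be sketched as a margin-based hinge loss over all token pairs. The following pure-Python sketch assumes a margin of 1; the exact margin and loss form used in training are assumptions for illustration:

```python
def pairwise_ranking_loss(scores, weights, margin=1.0):
    """For every token pair (i, j), the token with the larger teacher
    weight magnitude |w_k| should receive the higher student score."""
    n = len(scores)
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # k_star: index the teacher deems more important; k_bar: the other.
            k_star, k_bar = (i, j) if abs(weights[i]) >= abs(weights[j]) else (j, i)
            loss += max(0.0, margin - (scores[k_star] - scores[k_bar]))
    return loss
```

The double loop makes the quadratic cost in document length explicit, which motivates the complexity discussion below.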
Sequence Labeling Formulation. Here, explanation generation is treated as a sequence labeling problem, where the continuous importance weights are discretized according to a heuristic h: the importance weights are partitioned along two dimensions, high/low and positive/neutral/negative, according to the mean value of the positive/negative weights from the baseline explanation method A. Thus, the labels are recoded to {high negative, low negative, neutral, low positive, high positive}. The explanation model g_φ is then trained to predict the label of each token according to the following loss function:

  L(g_φ(x, ŷ), w) = − Σ_{i=1}^{|x|} log g_{φ,i}(x, ŷ)[h(w_i)]

where g_{φ,i}(x, ŷ) is the predicted distribution over the labels of the i-th token of the document, g_{φ,i}(x, ŷ)[h(w_i)] is the probability it assigns to label h(w_i), and h(w_i) is the discrete label produced using the discretization heuristic h.
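The discretization heuristic h can be sketched as follows; the treatment of zero weights and of values exactly at the thresholds is an assumption for illustration:

```python
def discretize(weights):
    """Map continuous teacher weights to five labels, using the mean of
    the positive weights and the mean of the negative weights as the
    high/low thresholds."""
    pos = [w for w in weights if w > 0]
    neg = [w for w in weights if w < 0]
    pos_mean = sum(pos) / len(pos) if pos else 0.0
    neg_mean = sum(neg) / len(neg) if neg else 0.0
    labels = []
    for w in weights:
        if w > 0:
            labels.append("high_positive" if w >= pos_mean else "low_positive")
        elif w < 0:
            labels.append("high_negative" if w <= neg_mean else "low_negative")
        else:
            labels.append("neutral")
    return labels
```

For example, with weights [0.8, 0.2, 0.0, -0.1, -0.7] the positive mean is 0.5 and the negative mean is -0.4, so the tokens are labelled high_positive, low_positive, neutral, low_negative and high_negative respectively.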
Owing to the quadratic complexity of the Ranking formulation, compared to the linear complexity of Sequence Labeling, we recommend using Ranking when the input is short, and a fine-grained order of feature attributions is required. Otherwise, the Sequence Labeling formulation is a better option.

Tasks and Black-Box Models (f θ θ θ )
We conduct experiments on three classification tasks; each task has a different black-box classifier chosen based on the best accuracy on the selected dataset as reported in the literature. Dataset statistics are reported in Appendix A.
• Topic Classification. The AG corpus (Zhang et al., 2015) comprises news articles on multiple topics. We separate 10% of the training documents for the dev set. The black-box classifier is a fine-tuned BERT model (Devlin et al., 2018) with 12 hidden layers and 12 attention heads. It achieves a 92.6% test accuracy.
• Sentiment Analysis. The SST dataset (Socher et al., 2013) comprises movie reviews with positive and negative sentiments. The black-box classifier is a distilled BERT model with 6 layers and 12 attention heads from Hugging Face. It achieves a 90% test accuracy.
• Linguistic Acceptability. The CoLA dataset (Warstadt et al., 2019) contains sentences that are deemed acceptable or unacceptable in terms of their grammatical correctness. The black-box classifier is a fine-tuned ALBERT model (Lan et al., 2020) with 12 attention heads and 12 layers. It achieves a 74% test accuracy.

Explanation Models
We use a Transformer encoder (Vaswani et al., 2017) with 4 blocks and 4 attention heads as g_φ. All models are trained with a Stochastic Gradient Descent optimizer and a fixed learning rate (1e-4) until convergence. To account for differences in model convergence, we train all models with three random parameter initializations and report the average values of their performance metrics.
We condition the explainer model g_φ on the label ŷ predicted by the underlying black-box model f_θ by appending ŷ to the start and the end of the input document before passing it to g_φ (Figure 2a). Thus, g_φ can leverage the predicted label in the attention computation. For the sequence labeling formulation, we also introduce a softmax layer on top to produce the labeling distribution over the discrete labels for each token, as detailed in Figure 2b.
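A minimal sketch of this conditioning step; the bracketed special-token format is an assumption for illustration, not the paper's exact tokenization:

```python
def condition_on_label(tokens, y_hat):
    """Prepend and append the predicted label as a special token, so the
    explainer's attention layers can condition every token on y_hat."""
    label_tok = f"[{y_hat.upper()}]"
    return [label_tok] + tokens + [label_tok]
```

For instance, a positive-sentiment prediction turns ['great', 'movie'] into ['[POSITIVE]', 'great', 'movie', '[POSITIVE]'] before the sequence enters g_φ.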

Performance Metrics
Faithfulness. A standard approach to evaluate the faithfulness of an explanation to a black-box classification model is to measure the degree of agreement between the prediction given the full document and the prediction given the explanation (Ribeiro et al., 2016). However, the aim of L2E is to approximate an existing explanation method A, which constitutes a layer of separation from the original black-box f_θ. Hence, we provide two faithfulness evaluations for our approach when the ground-truth explanation is unavailable:
• Prediction based. We measure the agreement between (a) the predictions of the black-box model f_θ when the explanations generated by g_φ are given as input, and (b) f_θ's predictions when A's explanations are given as input (instead of using the full document).
• Confidence based. We adopt the ∆log-odds(x) metric used by Schwab and Karlen (2019), which measures the difference in the confidence of the black-box model f_θ in a prediction before and after masking the words in an explanation:
  ∆log-odds(x) = log-odds(Pr(ŷ | f_θ(x))) − log-odds(Pr(ŷ | f_θ(x̃)))

where ŷ is the predicted output of f_θ(x), log-odds(Pr) = log(Pr / (1 − Pr)), and x̃ is a version of input x where the tokens in the explanation are masked out. We expect a high ∆log-odds value if we mask positive important words in x̃, and a low value if we mask unimportant or negative important words.
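The metric is straightforward to compute from the black-box's class probabilities, assuming they lie strictly between 0 and 1:

```python
import math

def delta_log_odds(prob_full, prob_masked):
    """Difference in log-odds of the predicted class between the full
    document and the document with the explanation words masked."""
    log_odds = lambda p: math.log(p / (1.0 - p))
    return log_odds(prob_full) - log_odds(prob_masked)
```

For example, if masking the explanation drops the black-box's confidence from 0.9 to 0.5, the metric is log(9) ≈ 2.20, indicating the masked words were important positive evidence.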
We report the average of each of these metrics across the test documents.
Stability. We employ Intersection over Union (IoU) to measure explanation stability across similar instances. Specifically, for each test instance x, we select its nearest neighbours N(x) according to one of two pairwise document similarity metrics: semantic similarity (cosine of their BERT representations) and lexical similarity (ratio of overlapping n-grams). Details appear in Appendix C. IoU(x, N(x)) then measures the consistency of the explanations of x and those of its neighbours:

  IoU(x, N(x)) = (1 / |N(x)|) Σ_{x′∈N(x)} (1 / |L|) Σ_{ℓ∈L} |v_x^ℓ ∩ v_{x′}^ℓ| / |v_x^ℓ ∪ v_{x′}^ℓ|

where L is the discretized label set in the Sequence Labeling formulation or the top-K words in the Ranking formulation, and v_x^ℓ is the set of tokens with label ℓ in the predicted explanation g_φ(x, ŷ).
We report the average of IoU(x x x, N (x x x)) across documents in the test set.

Results and Discussion
We start by investigating the faithfulness of an explanation model to the black-box model f_θ. Once faithfulness has been established, we investigate stability and speed compared to the underlying explanation methods A. We also include a Random baseline, which displays the performance obtained by randomly selecting the same number K of words as we select from the explanations produced by L2E and A in each row of the tables.

Faithfulness. For the Ranking formulation of L2E, we select the top 30% of the important words in each test sample; we select 30% to ensure sufficient important words are selected in each dataset given their average document length, and we use the same percentage in the Stability evaluation. For the Sequence Labeling formulation, we select the same number of positive/negative words identified by L2E and A. Table 1 shows the Prediction-based agreement between the black-box model f_θ and our method L2E, between f_θ and the underlying explainer A, and between L2E and A. We see that the explanations generated by L2E are equally predictive of the output class as those generated by A in both the Ranking and the Sequence Labeling formulations. We also note that the L2E version that learns with the Ranking formulation is often less faithful to the black-box model f_θ than A, though not significantly so, compared to the version that learns with the Sequence Labeling formulation; statistical significance (α < 0.05) was measured by performing the Wilcoxon Signed-Rank Test (Woolson, 2007) followed by a sequential Holm-Bonferroni correction (Holm, 1979; Abdi, 2010). For example, the percentage agreement of L2E-Ranking is lower than that of Occlusion for the three datasets, while the agreement of L2E-SequenceLabeling is higher than that of Occlusion for these datasets. Interestingly, when the baseline explanation algorithm does not perform well, e.g., Kernel SHAP on SST, L2E is still able to find words that are predictive of the output of f_θ. In such circumstances, the agreement between L2E and A is quite low ("Both" is 58% and 51% for Ranking and Sequence Labeling respectively). The low performance of Kernel SHAP may be attributed to insufficient samples (10^3 in this case) in the kernel computation for SST, while L2E could still utilize all the samples during training.

Table 2 presents the ∆log-odds results for positive explanation words in the Sequence Labeling formulation. Similar results are observed for negative explanation words in the same formulation, and for top important words in the Ranking formulation; these results appear in Appendix D. They are obtained by randomly selecting 100 documents in the test set, and masking the same number of important words in each document based on the explanations generated by L2E and by A.
We observe that some baselines have inconsistent faithfulness across datasets. For example, LRP and Deep SHAP perform worse than Kernel SHAP for the News dataset, but better for SST. We also note that, when one baseline performs worse than the other baselines, e.g., Kernel SHAP for SST, our method L2E still performs significantly better than that baseline. This result demonstrates that our model can learn important words that yield more faithful explanations than those learned by the teacher explainer. Interestingly, none of the results for the CoLA dataset, from the baseline A or L2E, significantly outperforms the Random baseline. This flags a drawback of evaluating explanation faithfulness on short documents.

Stability. For each test document, we consider the top-3 similar documents in the test set, and report the average IoU as explained in §4.4. Table 3 shows the results obtained using semantic similarity for the baseline A and L2E. Similar results with lexical similarity appear in Appendix C. From Table 3, we see that, in most cases, our method statistically significantly outperforms the baseline for all three datasets. For both formulations, Ranking and Sequence Labeling, L2E achieves a higher stability than the baseline A, even in cases where A's IoU is comparable to that of the Random baseline, e.g., Gradient for SST and CoLA. These results show that learning the explanation process across different examples, as done by L2E, can capture more commonalities (higher stability) than generating explanations individually (baselines). Overall, the LIME baseline performs consistently better than most baselines in terms of faithfulness and stability across the three datasets. Therefore, L2E also performs better when it learns from LIME than when it learns from other baselines.
Computational Efficiency. We now compare the efficiency of L2E against that of the baseline explanation algorithms A when generating explanations for test documents. In our experiments, the black-box is a transformer-based model comprising L layers, H attention heads and D embedding dimensions. The complexity of this model when predicting a document of size N is then O(L × N × D × (D + N + H)) (Gu et al., 2020). Various factors contribute to the computational demands of existing explanation algorithms (details in Appendix B), and make the complexity of these algorithms grow with the size of the black-box model. These factors include the size of the input document (Occlusion), the sample size (LIME, Kernel SHAP and Deep SHAP), etc. In contrast, L2E is a distillation of any explanation algorithm, employing a smaller architecture than the black-box, e.g., fewer layers and attention heads, and lower embedding dimensions. Figure 3 shows the inference time of L2E-SequenceLabeling compared to that of the baseline explainers for the IMDB-R dataset. We only show the results obtained with Sequence Labeling, since the inference time of L2E models is independent of the learning formulation. As seen in Figure 3, L2E requires statistically significantly less time than any of the six baseline explanation algorithms for IMDB-R. Similar patterns were observed for the other three datasets (Appendix E).
Finally, L2E only needs a forward pass through the explainer DNN. Compared with Gradient and LRP, which require only one backpropagation through the black-box DNN, L2E is respectively 5 and 10 times faster for all datasets (all black-box sizes appear in §4.1 and Appendix F).

Evaluation with Human Rationales
Evaluation of explanation methods for DNNs is challenging, as ground-truth explanations are often unavailable. In this section, we propose to address this issue using the IMDB-R dataset (Zaidan et al., 2007), which contains movie reviews x together with their sentiment y, as well as rationales r annotated by people for the sentiment label. Our use of rationales for evaluating explanations is related to that in (Osman et al., 2020), where synthetic data are generated from a priori fixed rationales. Specifically, we generate new data by assigning a 'neutral' label to an example where the human rationales are masked. We then use both the original data (without masking) and the new data to train the black-box model, where the training protocol forces the classifier to make a 'neutral' prediction when the human rationales are removed from the review. More formally, we maximize the following training objective:

  Σ_{(x, r, y)∈D} [ log Pr(y = f_θ(x)) + log Pr(NEUTRAL = f_θ(x − r)) ]

where x − r denotes the input x with the rationale words r masked out, NEUTRAL is an extra label, and D is the training data.
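The data-augmentation step behind this objective can be sketched as follows; the mask token string and the (tokens, rationale indices, label) dataset layout are assumptions for illustration:

```python
MASK = "[MASK]"

def augment_with_neutral(dataset):
    """For each (tokens, rationale_idx, label) example, keep the original
    example and add a copy with the rationale tokens masked out and the
    label replaced by the extra NEUTRAL class."""
    augmented = []
    for tokens, rationale_idx, label in dataset:
        augmented.append((tokens, label))
        masked = [MASK if i in rationale_idx else t for i, t in enumerate(tokens)]
        augmented.append((masked, "NEUTRAL"))
    return augmented
```

Training the classifier on the augmented set then realizes both terms of the objective: the original copy drives the first log-likelihood term, and the masked copy drives the NEUTRAL term.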
Our classifier achieves an accuracy of 83.83% on the training set, 79.68% on the validation set and 74.5% on the test set. Due to the large document sizes (Table 6 in Appendix A) and the quadratic time complexity of the Ranking formulation as a function of document size, we only train L2E with the Sequence Labeling formulation; we use lexical similarity to measure IoU, due to the time-consuming computation of semantic similarity with BERT. Details about the dataset, the classifier and the explainer's architecture appear in Appendix F.
The faithfulness and stability of the explanation methods are evaluated as follows.

Faithfulness. We select the top-K important words generated by an explanation method and compute the precision, recall and F1 against the human-annotated rationales. It is worth noting that our L2E explainer is not supervised by human rationales directly. Instead, we use the same experimental setup as in Section 4.5 to ensure the L2E explainer is learning from the baseline algorithms rather than the human rationales. Table 4 displays the average values over all test instances. As noted by Carton et al. (2020), the rationales in the original dataset are not exhaustively identified by human annotators. We therefore expect to observe a lower precision than recall, since the black-box model might still be able to utilize words that were not annotated in addition to the words annotated by a human. The results in Table 4 align with this hypothesis. For instance, apart from LRP for the positive reviews and Kernel SHAP for both types of reviews, all baselines and the corresponding L2E models have higher recall than precision. Furthermore, L2E significantly outperforms the corresponding baseline A in most cases for both positive and negative reviews, except when compared with LIME's precision. This observation indicates that learning the explanations of multiple examples together, as done by L2E, achieves high faithfulness to human rationales, as well as to the black-box model.

Stability. Similarly to the results in §4.5, as seen in Table 5, L2E yields more stable explanations than the corresponding baselines. The best stability, obtained with L2E (58.6 ± 0.27) by filtering non-annotated words when learning from Occlusion, is comparable to that of the human rationales. This is due to the high recall (92 and 82 for positive and negative reviews respectively in Table 4) in the explanations produced by L2E, which indicates they have high overlap with the human rationales.
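The token-level precision, recall and F1 against human rationales described above can be computed over word sets; this is a minimal sketch, and the zero-division conventions are assumptions:

```python
def rationale_prf(predicted, gold):
    """Precision/recall/F1 of the predicted important words against the
    human-annotated rationale words (both given as sets of tokens)."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

These per-document scores are then averaged over all correctly classified test instances, as in Table 4.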
Further, when measuring the IoU values, the L2E explanations of similar examples have the same intersection with the human rationales, but a lower union. This result indicates that people favour stable rationales in similar documents, and reinforces our findings regarding the greater consistency of the explanations produced by L2E compared to the baselines. LRP has been shown to exhibit explanation continuity (Montavon et al., 2018), whereby the explanations of two nearly equivalent instances are also nearly equivalent. However, we do not observe such a pattern in our experiments. We hypothesize that perturbed instances used as neighbours, as done by Montavon et al. (2018), do not necessarily follow the distribution of the data. Instead, we posit that finding similar examples within a dataset, as done in our experiments, is a better proxy for stability evaluation.

Conclusions and Future Work
We have presented a Learning to Explain (L2E) approach to learn the commonalities of the explanation generation processes across different examples. We have further proposed Ranking and Sequence Labeling formulations to effectively learn the explainer model by discretizing feature weights produced by existing explanation algorithms.
Our experimental results show that our method can generate more stable explanations (i.e., they do not vary much across similar documents) than those generated by the explainer baselines, while maintaining the same level of faithfulness to the underlying black-box model as the baseline algorithms. Moreover, our L2E approach produces explanations between 5 and 7.5 × 10^4 times faster than the six baselines, making it suitable for long documents and very large black-box models.
Our L2E approach trains an explainer, itself a black-box, to mimic the behaviour of an explanation method for an existing black-box model. A key challenge lies in the variation in the convergence status of such an explainer for different initializations. To mitigate this problem, we evaluate the performance of our explainer by averaging over three different initializations.
The L2E approach opens up the possibility of distilling multiple explanation algorithms into one model. Although we focused on the stability, faithfulness and efficiency aspects of explanation generation, there are further desirable properties, e.g., transparency, comprehensibility and novelty (Robnik-Šikonja and Bohanec, 2018). Devising model-based explanation methods and their evaluation with these desiderata are interesting directions for future research.

Table 7: Intersection over union (IoU) using lexical similarity (measured according to overlapping n-grams); bold indicates statistical significance.

Appendix D Faithfulness
We present the negative explanation words of the Sequence Labeling formulation in Table 8 and the top important words of the Ranking formulation in Table 9.

Appendix F Dataset, Classifier and Explainer Details

The reviews are split into training, validation and test sets, with each set having an even distribution of positive and negative reviews. We also remove 8 very long documents from the training set for the sake of CUDA memory. For each example in the training and validation sets, we construct a new example by masking the rationales, i.e., we replace each word in the rationale with a mask token, and assign this new example to a third label, i.e., neutral, so as to ensure the classifier 'pays attention' to the rationale. The final dataset split appears in Table 6.
The classifier is trained by fine-tuning the last layer of a pre-trained Longformer (Beltagy et al., 2020) with 12 layers and 12 attention heads from Hugging Face. It achieves 83.83%/79.68%/74.5% accuracy for the training/validation/test sets respectively after 40 epochs. The statistics of our experiment are measured on test examples that are predicted correctly by the classifier. For each L2E explainer that learns from a baseline explanation method, we use a Longformer with 4 layers and 4 attention heads.

Appendix G Precision and Recall on Positive Reviews
We plot the precision versus recall for all the L2E-A pairs on the IMDB-R dataset in Figure 7. The results show that, in most cases, L2E performs better than A in terms of faithfulness to the underlying black-box and alignment with the human rationales.