ferret: a Framework for Benchmarking Explainers on Transformers

As Transformers are increasingly relied upon to solve complex NLP problems, there is an increased need for their decisions to be humanly interpretable. While several explainable AI (XAI) techniques for interpreting the outputs of transformer-based models have been proposed, there is still a lack of easy access to using and comparing them. We introduce ferret, a Python library to simplify the use and comparison of XAI methods on transformer-based classifiers. With ferret, users can visualize and compare explanations of transformer-based model outputs using state-of-the-art XAI methods on any free text or on existing XAI corpora. Moreover, users can also evaluate ad-hoc XAI metrics to select the most faithful and plausible explanations. To align with the recently consolidated practice of sharing and using transformer-based models from Hugging Face, ferret interfaces directly with its Python library. In this paper, we showcase ferret to benchmark XAI methods used on transformers for sentiment analysis and hate speech detection. We show how specific methods provide consistently better explanations and are preferable in the context of transformer models.

Despite the shared objective, each method has peculiar configurations and explanation artifacts. For example, consider the class of feature importance methods (Danilevsky et al., 2020). LIME (Ribeiro et al., 2016) estimates word importance by learning a regression model and presenting the user with the learned weights. Simonyan et al. (2014a), instead, impute word contributions by measuring gradients with respect to the model's loss, i.e., how sensitive the model is to each input component. Current evidence shows that these differences are more than subtle, motivating the adoption of certain methods over others (Attanasio et al., 2022).
Motivated by renewed requirements in regulation policies and social decision-making (European Commission, 2016; Goodman and Flaxman, 2017), the issue of evaluating and benchmarking explanations has gained interest. Recent works have discussed the properties of faithfulness, plausibility (Jacovi and Goldberg, 2020), and simulatability (Hase and Bansal, 2020) of explanations. Others introduced new diagnostic benchmarks, datasets, and metrics to compare different interpretability methods (Atanasova et al., 2020; DeYoung et al., 2020; Attanasio et al., 2022). However, these focused efforts show a common limitation: the authors devise and develop their study in technical isolation, i.e., under no unified framework that allows either testing new explainers, new evaluation metrics, or datasets, ultimately hindering solid benchmarking. In other words, it is unnecessarily hard to answer paramount questions such as: Given all explanation methods applicable to my use case, which one should I choose? Which method is more reliable? Can I trust it?
We introduce ferret, an open-source Python library to benchmark interpretability approaches. With ferret, we provide a principled evaluation framework combining state-of-the-art interpretability methods, metrics, and datasets with an easy-to-use, extensible, and transformers-ready (Wolf et al., 2020) interface.
Contributions. ferret is the first interpretability tool to offer an Evaluation API taking as input Hugging Face model names and free text or interpretability corpora. ferret is under active development: we release our code and documentation under the MIT license at https://github.com/g8a9/ferret.

Library Design

ferret builds on four core principles.
1. Built-in Post-hoc Interpretability. We include four state-of-the-art post-hoc feature importance methods and three interpretability corpora. Ready-to-use methods allow users to explain any text with an arbitrary model; annotated datasets provide valuable test cases for new interpretability methods and metrics.
To the best of our knowledge, ferret is the first to provide integrated access to XAI datasets, methods, and a full-fledged evaluation suite.
2. Unified Faithfulness and Plausibility Evaluation. We propose a unified API to evaluate explanations. We currently support six state-of-the-art metrics along the principles of faithfulness and plausibility (Jacovi and Goldberg, 2020).
3. Transformers-ready. ferret offers a direct interface with models from the Hugging Face Hub. Users can load models using standard naming conventions and explain them with the built-in methods effortlessly. Figure 1 shows the essential code to classify and explain a string with a pre-existing Hugging Face model and evaluate the resulting explanations.
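As a rough sketch of that workflow (the Hub identifier for the sentiment model of Barbieri et al. (2021) and the exact method names are assumptions and may differ from the released API and from Figure 1):

    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from ferret import Benchmark

    # Assumed Hub identifier for the XLM-RoBERTa sentiment model (Barbieri et al., 2021).
    name = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
    model = AutoModelForSequenceClassification.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)

    bench = Benchmark(model, tokenizer)
    # Explain a single string with all built-in explainers for the chosen class...
    explanations = bench.explain("Great movie for a great nap!", target=2)
    # ...and score every explanation with the faithfulness and plausibility metrics.
    evaluations = bench.evaluate_explanations(explanations, target=2)
    bench.show_evaluation_table(evaluations)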
4. Modularity and Abstraction. ferret comprises three core modules, implementing the Explainers, Evaluation, and Datasets APIs. Each module exposes an abstract interface to foster new development. For example, users can subclass BaseExplainer or BaseEvaluator to include a new feature importance method or a new evaluation metric, respectively.
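As an illustrative sketch of this extension point (the import path and the name of the method to override are assumptions and may differ in the released library):

    import numpy as np
    from ferret.explainers import BaseExplainer  # assumed module path

    class RandomBaselineExplainer(BaseExplainer):
        """Toy explainer: assigns a random importance score to each token."""

        NAME = "random-baseline"

        def compute_feature_importance(self, text, target=1, **kwargs):
            # The overridden method name is an assumption about ferret's interface.
            tokens = self.tokenizer.tokenize(text)
            return np.random.uniform(-1.0, 1.0, size=len(tokens))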

Explainer API
We focus on the widely adopted family of post-hoc feature attribution methods (Danilevsky et al., 2020). That is, given a model, a target class, and a prediction, ferret lets users measure how much each token contributed to that prediction. We integrate four renowned methods: Gradient (Simonyan et al., 2014b), also known as Saliency, and Integrated Gradient (Sundararajan et al., 2017); SHAP (Lundberg and Lee, 2017) as a Shapley value-based method; and LIME (Ribeiro et al., 2016) as a representative of local surrogate methods.
We build on open-source libraries and streamline their interaction with Hugging Face models and paradigms. We report the supported configurations and functionalities in Appendix A.
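For instance, a single explainer can plausibly be used outside the Benchmark wrapper; the class name and call signature below are assumptions about the Explainer API rather than documented usage:

    from ferret import SHAPExplainer  # assumed import

    explainer = SHAPExplainer(model, tokenizer)
    # Attribute the prediction for an assumed class index 2 to each input token.
    explanation = explainer("Great movie for a great nap!", target=2)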

Dataset API
To foster streamlined, accessible evaluation on independently released XAI datasets, we provide a convenient Dataset API. It enables users to load XAI datasets, explain individual samples or subsets of samples, and evaluate the resulting explanations.
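A hedged sketch of that workflow, reusing the Benchmark instance from the earlier sketch (the dataset identifier and the method names load_dataset, evaluate_samples, and show_samples_evaluation_table are assumptions):

    data = bench.load_dataset("hatexplain")  # assumed dataset identifier
    # Explain and evaluate a handful of rationale-annotated samples in one call.
    sample_evaluations = bench.evaluate_samples(data, [0, 1, 2])
    bench.show_samples_evaluation_table(sample_evaluations)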
Currently, ferret includes three classification-oriented datasets annotated with human rationales, i.e., annotations highlighting the most relevant words, phrases, or sentences a human annotator attributed to a given class label (DeYoung et al., 2020; Wiegreffe and Marasovic, 2021).
Stanford Sentiment Treebank (SST) (Socher et al., 2013). A sentiment classification dataset of 9,620 movie reviews annotated with binary sentiment labels, including human annotations for word phrases of the parse trees. We extract human rationales from these annotations following the heuristic approach proposed in Carton et al. (2020).
These three datasets provide an initial example of what an integrated approach can offer to researchers and practitioners. Please refer to Appendix B for an overview of new datasets we are actively integrating.

Evaluation API
We evaluate explanations on the faithfulness and plausibility properties (Jacovi and Goldberg, 2020;DeYoung et al., 2020). Specifically, ferret implements three state-of-the-art metrics to measure faithfulness and three for plausibility.
Faithfulness. Faithfulness evaluates how accurately the explanation reflects the inner working of the model (Jacovi and Goldberg, 2020).
Comprehensiveness (↑) evaluates whether the explanation captures the tokens the model used to make the prediction. We measure it by removing the tokens highlighted by the explainer and observing the change in probability as follows.
Let $x$ be a sentence and let $f(x)_j$ be the prediction probability of the model $f$ for a target class $j$. Let $r_j$ be a discrete explanation, or rationale, indicating the set of tokens supporting the prediction $f(x)_j$. Comprehensiveness is defined as $f(x)_j - f(x \setminus r_j)_j$, where $x \setminus r_j$ is the sentence $x$ with the tokens in $r_j$ removed. A high comprehensiveness value indicates that the tokens in $r_j$ are relevant for the prediction.
While comprehensiveness is defined for discrete explanations, feature attribution methods assign a continuous score to each token. We hence identify $r_j$ as follows. First, we filter out tokens with a negative contribution (i.e., tokens that pull the prediction away from the chosen label). Then, we compute the metric multiple times, considering the $k$% most important tokens, with $k$ ranging from 10% to 100% (step of 10%). Finally, we aggregate the comprehensiveness scores by averaging them, obtaining the Area Over the Perturbation Curve (AOPC) (DeYoung et al., 2020).
Sufficiency (↓) captures whether the tokens in the explanation are sufficient for the model to make the prediction (DeYoung et al., 2020). It is measured as $f(x)_j - f(r_j)_j$, where $f(r_j)_j$ is the prediction probability when only the tokens in $r_j$ are kept. A low score indicates that the tokens in $r_j$ are indeed the ones driving the prediction. As for comprehensiveness, we compute the AOPC by varying the number of relevant tokens in $r_j$.
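As a rough illustration of these two metrics, the sketch below computes AOPC comprehensiveness and sufficiency from continuous attribution scores; the predict_proba helper is a hypothetical stand-in for a model forward pass and is not part of ferret's API:

    import numpy as np

    def aopc(tokens, scores, target, predict_proba, mode="comprehensiveness"):
        """Average comprehensiveness or sufficiency over k = 10%, ..., 100%."""
        full_prob = predict_proba(tokens)[target]
        # Keep only tokens with a positive contribution, ranked by importance.
        order = [i for i in np.argsort(scores)[::-1] if scores[i] > 0]
        deltas = []
        for k in np.arange(0.1, 1.01, 0.1):
            top = set(order[: max(1, int(round(k * len(order))))])
            if mode == "comprehensiveness":
                # Remove the rationale tokens and observe the probability drop.
                kept = [t for i, t in enumerate(tokens) if i not in top]
            else:
                # Sufficiency: keep only the rationale tokens.
                kept = [t for i, t in enumerate(tokens) if i in top]
            deltas.append(full_prob - predict_proba(kept)[target])
        return float(np.mean(deltas))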
Correlation with Leave-One-Out scores (↑). We first compute leave-one-out (LOO) scores by omitting tokens, one at a time, and measuring the difference in the model prediction. LOO scores represent a simple measure of individual feature importance under the linearity assumption (Jacovi and Goldberg, 2020). We then measure the Kendall rank correlation coefficient $\tau$ between the explanation and the LOO importance scores (Jain and Wallace, 2019). The closer the correlation $\tau$ is to 1, the more faithful the explanation is to LOO.
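A companion sketch for the LOO correlation, reusing the same hypothetical predict_proba helper:

    from scipy.stats import kendalltau

    def loo_correlation(tokens, scores, target, predict_proba):
        full_prob = predict_proba(tokens)[target]
        # LOO importance: drop in target probability when each token is removed.
        loo = [
            full_prob - predict_proba(tokens[:i] + tokens[i + 1:])[target]
            for i in range(len(tokens))
        ]
        tau, _ = kendalltau(scores, loo)
        return tau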
Plausibility. Plausibility reflects how well explanations align with human reasoning and is assessed by comparing explanations with human rationales (DeYoung et al., 2020).
We integrate into ferret three plausibility measures of the ERASER benchmark (DeYoung et al., 2020): Intersection-Over-Union (IOU) at the token level, token-level F1 scores, and Area Under the Precision-Recall curve (AUPRC).
The first two are defined for discrete explanations. Given the human and the predicted rationale, IOU (↑) quantifies the overlap between the tokens they cover divided by the size of their union. Token-level F1 scores (↑) are derived by computing precision and recall at the token level. Following DeYoung et al. (2020) and Mathew et al. (2021), we derive discrete explanations by selecting the top K tokens with positive influence, where K is the average length of the human rationales for the dataset. While intuitive, IOU and token-level F1 rely on a single threshold to derive rationales. Moreover, they do not consider the tokens' relative ranking and degree of importance. We therefore also integrate the AUPRC (↑), defined for explanations with continuous scores (DeYoung et al., 2020). It is computed by varying a threshold over the token importance scores, using the human rationale as ground truth.
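For illustration, the sketch below mirrors these definitions (not ferret's internal implementation), assuming a binary human rationale mask and continuous attribution scores as NumPy arrays:

    import numpy as np
    from sklearn.metrics import average_precision_score

    def discretize(scores, k):
        # Top-K tokens with positive influence (K = average human rationale length).
        top = np.argsort(scores)[::-1][:k]
        mask = np.zeros(len(scores), dtype=int)
        mask[[i for i in top if scores[i] > 0]] = 1
        return mask

    def token_iou_f1(pred_mask, human_mask):
        pred, human = set(np.flatnonzero(pred_mask)), set(np.flatnonzero(human_mask))
        inter, union = len(pred & human), len(pred | human)
        iou = inter / union if union else 0.0
        precision = inter / len(pred) if pred else 0.0
        recall = inter / len(human) if human else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return iou, f1

    def token_auprc(scores, human_mask):
        # AUPRC over continuous scores, using the human rationale as ground truth.
        return average_precision_score(human_mask, scores)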

Transformers-ready Interface
ferret is deeply integrated with Hugging Face interfaces. Users working with their standard models and tokenizers can easily integrate our library for diagnostic purposes. The contact point is the main Benchmark class. It receives any Hugging Face model and tokenizer and uses them to classify, run explanation methods and seamlessly evaluate the explanations.
Similarly, our Dataset API leverages Hugging Face's datasets library to retrieve data and human rationales.

Case Studies
ferret is ready for real-world tasks. In the following, we describe how users can leverage ferret to identify reliable explainers in sentiment analysis and hate speech detection tasks. Our running examples use an XLM-RoBERTa model fine-tuned for sentiment analysis (Barbieri et al., 2021) and a BERT model fine-tuned for hate speech detection (Mathew et al., 2021).

A Faithful Error Analysis
Explanations on individual instances are often used for model debugging and error analysis (Vig, 2019; Feng et al., 2018). However, different explanations can lead users to different conclusions, hindering a solid understanding of the model's flaws. We show how practitioners can alleviate this issue by including ferret in their pipeline. Figure 2 shows explanations and faithfulness metrics computed on the sentence "Great movie for a great nap!" for the "Positive" class label. Note that the model wrongly classifies the text as positive.
Faithfulness metrics show that SHAP adheres best to the model's inner workings since it returns the most comprehensive and relevant explanations. Indeed, SHAP retrieves the highest number of tokens the model used to make the prediction (aopc_compr (↑) = 0.41), and those tokens largely suffice to drive the prediction (aopc_suff (↓) = 0.09). Further, taucorr_loo (↑) = 0.43 indicates that SHAP explanations capture the most important tokens for the prediction under the linearity assumption. Although Integrated Gradient (x Input) shows a higher taucorr_loo, it does not provide comprehensive and sufficient explanations. Similarly, Gradient and Integrated Gradient show poor sufficiency and comprehensiveness, respectively. LIME and Gradient (x Input) do not return trustworthy explanations according to any of the faithfulness metrics.
Once SHAP has been identified as the best explainer, its explanations enable researchers to thoroughly investigate possible recurring patterns or detect model biases. In this case, the explanations shed light on a type of lexical overfitting: the word "great" skews the prediction toward the positive label regardless of the surrounding context and semantics.

Dataset-level Assessment
While instance-level explanations might help with sensitive examples, a legitimate question remains: how does my explainer behave across multiple texts, potentially the whole corpus? With ferret, users can easily produce and aggregate evaluation metrics across multiple dataset samples.
We describe how to choose the explainer that returns the most plausible and faithful explanations for the HateXplain dataset. For demonstration purposes, we focus only on a sample of the dataset. We discuss the issue of computational costs and requirements in Appendix D. Figure 3 (Appendix C) shows the metrics averaged across ten samples with the "hate speech" label. Results suggest again that SHAP yields the most faithful explanations. SHAP and Gradient achieve the best comprehensiveness and sufficiency scores, but SHAP outperforms all explainers for the τ correlation with LOO (taucorr_loo (↑) = 0.41). Gradient provides the most plausible explanations, followed by SHAP.
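A hedged sketch of such a dataset-level pass, building on the earlier Benchmark sketch; the attribute names explainer_name and metric_scores describe an assumed layout of ferret's evaluation objects, not its documented one:

    import pandas as pd

    def evaluate_corpus(bench, samples):
        """Average evaluation metrics per explainer over (text, target) pairs."""
        rows = []
        for text, target in samples:
            explanations = bench.explain(text, target=target)
            evaluations = bench.evaluate_explanations(explanations, target=target)
            for ev in evaluations:
                # Assumed layout: explainer name plus a metric-name -> score mapping.
                rows.append({"explainer": ev.explainer_name, **ev.metric_scores})
        return pd.DataFrame(rows).groupby("explainer").mean()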

Related Work
While many open-source frameworks explain black-box models, virtually none implement explicit evaluation paradigms on top of transformers (Wolf et al., 2020). We hence review the literature on tools and libraries that offer a subset of ferret's functionalities, namely the option to use multiple XAI methods and datasets, an evaluation API, transformer-readiness, and built-in visualization.
Tools for Post-Hoc XAI. Toolkits for post-hoc interpretability offer built-in methods to explain model predictions, typically through a code interface. ferret builds on and extends this idea into a unified framework to generate, evaluate, and compare explanations, with support for several XAI datasets. Moreover, ferret's explainers are integrated with the principles and conventions of transformers (Wolf et al., 2020).
Captum (Kokhlikyan et al., 2020) is a Python library supporting many interpretability methods. However, the library lacks integration with the Hugging Face Hub and offers no evaluation procedures. In contrast, ferret allows users to load models directly from the Hugging Face Hub using standard naming conventions and explain them with the built-in methods.
AllenNLP Interpret (Wallace et al., 2019b) provides interpretability methods based on gradients and adversarial attacks for AllenNLP models (Gardner et al., 2018). We borrow the modular and extensible design and extend it to a wider set of explainers, providing an evaluation paradigm and supporting the widely used transformers library.
Transformers-Interpret leverages Captum to explain Transformer models, but it supports only a limited number of methods.
Thermostat (Feldhus et al., 2021) allows users to interact with pre-computed feature attribution scores for a limited number of models within the Hugging Face Hub. Although collecting static feature attribution scores is a promising way to save redundant computation time, it is a different objective from ours: ferret enables on-the-fly interpretability benchmarking on any model and piece of text. In this regard, integrating Thermostat within ferret would be an interesting direction. Malandri et al. (2022) introduce ContrXT, a global contrastive explainer for two arbitrary text classifiers trained on the same target class set. Differently, ferret explains local predictions and supports multiple comparisons.
Visualization. Most studies that develop visualization tools to investigate the relationships among the input, the model, and the output focus either on specific NLP models (NLPVis (Liu et al., 2018), Seq2Seq-Vis (Strobelt et al., 2018)) or on specific explainers (BertViz (Vig, 2019), ELI5 (https://github.com/TeamHG-Memex/eli5)). LIT (Tenney et al., 2020) streamlines exploration and analysis across different models; however, it acts mainly as a graphical browser interface. ferret provides a Python interface that is easy to integrate into pre-existing pipelines.
Evaluation. Although prior works introduced diagnostic properties for XAI techniques, evaluating them in practice remains challenging. Studies either concentrate on specific model architectures (Lertvittayakumjorn and Toni, 2019; Arras et al., 2019; DeYoung et al., 2020), one particular dataset (Guan et al., 2019; Arras et al., 2019), or a single group of explainability methods (Robnik-Šikonja and Bohanec, 2018; Adebayo et al., 2018). Hence, providing a generally applicable and automated tool for choosing the most suitable method is crucial. To this end, Atanasova et al. (2020) present a comparative study of XAI techniques across three application tasks and model architectures. To the best of our knowledge, we are the first to present a user-friendly Python interface to interpret, visualize, and empirically evaluate models directly from the Hugging Face Hub across several metrics. We extend previous work from DeYoung et al. (2020), who developed ERASER, a benchmark for evaluating rationales for NLP models, by offering a unified interface for evaluation and visual comparison of explanations at the instance and dataset level. Table 2 summarizes the features we discussed and compares ferret with similar frameworks.

Conclusions
We introduced ferret, a novel Python library to benchmark explainability techniques. ferret is a first-of-its-kind melting pot of several XAI facets that have long talked about each other but never met: users can explain predictions using state-of-the-art post-hoc explainability techniques, evaluate explanations on several metrics for faithfulness and plausibility, and easily interact with datasets annotated with human rationales. The code supports Hugging Face models and conventions, facilitating the introduction of ferret into existing NLP pipelines.
In addition, the modular and abstract programmatic design will encourage open-source contributions from the community. We are currently developing and enhancing the library to set an example, adding support for new methods, metrics, and datasets. All stable releases are distributed via PyPI (https://pypi.org/project/ferret-xai/) and through our demo.

Ethics Statement
ferret's primary goal is to facilitate the comparison of methods that are otherwise frequently tested in isolation. Nonetheless, we cannot assume the metrics we currently implement provide a full, exhaustive picture, and we are working towards enlarging this set accordingly.
Further, interpretability is much broader than post-hoc feature attribution. We focus on this family of approaches for their wide adoption and intuitiveness.
Similarly, the evaluation measures we integrate are based on removal-based criteria. Prior works pointed out their limitations, specifically the problem of erased inputs falling out of the model input distribution (Hooker et al., 2019).