ProtoTEx: Explaining Model Decisions with Prototype Tensors

We present ProtoTEx, a novel white-box NLP classification architecture based on prototype networks (Li et al., 2018). ProtoTEx faithfully explains model decisions based on prototype tensors that encode latent clusters of training examples. At inference time, classification decisions are based on the distances between the input text and the prototype tensors, explained via the training examples most similar to the most influential prototypes. We also describe a novel interleaved training algorithm that effectively handles classes characterized only by the absence of indicative features. On a propaganda detection task, ProtoTEx accuracy matches BART-large and exceeds BERT-large, with the added benefit of providing faithful explanations. A user study also shows that prototype-based explanations help non-experts to better recognize propaganda in online news.


Introduction
Neural models for NLP have yielded significant gains in predictive accuracy across a wide range of tasks. However, these state-of-the-art models are typically less interpretable than simpler, traditional models, such as decision trees or nearest-neighbor approaches. In general, less interpretable models can be more difficult for people to use, trust, and adopt in practice. Consequently, there is growing interest in going beyond simple "black-box" model accuracy to instead design models that are both highly accurate and human-interpretable.
While much research on white-box explainable models focuses on attributing parts of the input (e.g., word sequences) to a model's prediction (Xu et al., 2015; Lei et al., 2016; Bastings et al., 2019; Jain et al., 2020; Glockner et al., 2020), there is much debate around their faithfulness and reliability (Serrano and Smith, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; Pruthi et al., 2020). Additionally, while such local explanations (if faithful) can be extremely useful in more intuitive tasks such as sentiment classification, that may not be the case for difficult tasks where human judgments may require a high degree of training or domain expertise. In such cases, understanding how models make their decisions for a particular input based on its training data can be insightful, especially for engaging with users to develop an intuition of the model's decision-making process.
In this paper, we propose the Prototype Tensor Explainability Network (PROTOTEX) to faithfully explain classification decisions in the tradition of case-based reasoning (Kolodner, 1992). Our novel white-box NLP architecture augments prototype classification networks (Li et al., 2018) with large-scale pretrained transformer language models. Through a novel training regime, the network learns a set of prototype tensors that encode latent clusters of training examples. At inference time, classification decisions are entirely based on similarity to prototypes. This enables model predictions to be faithfully explained via these prototypes, directly in terms of similar training examples (i.e., those most similar to the top-matched prototypes). We build upon state-of-the-art NLP neural architectures to augment their accuracy with faithful and human-interpretable explanations. Figure 1 shows an example of PROTOTEX on the task of propaganda detection (Da San Martino et al., 2019).
Another contribution of PROTOTEX concerns effective modeling of positive vs. negative classes in the presence of asymmetry. In typical binary classification (e.g., sentiment detection), the presence of positive vs. negative language can be used to distinguish classes. However, with a task such as Web search, what most distinguishes relevant vs. irrelevant search results is the presence vs. absence of relevant content. Having the absence (rather than presence) of certain features most clearly distinguish a class complicates both predicting it and explaining these predictions to users. To address this, we introduce a single negative prototype for representing the negative class, learned via a novel training regime. We show that including this negative prototype significantly improves results. While our model is largely agnostic to the prediction task, we evaluate PROTOTEX on a sentence-level binary propaganda detection task (Da San Martino et al., 2019). Recent work on explainable fact-checking (Kotonya and Toni, 2020a) has provided explanations via attention (Popat et al., 2018; Shu et al., 2019), rule discovery (Gad-Elrab et al., 2019), and summarization (Atanasova et al., 2020; Kotonya and Toni, 2020a,b), but not prototypes. Better explanations could enable support for human fact-checkers (Nakov et al., 2021).
We show that PROTOTEX provides faithful explanations without reducing classification accuracy, which remains comparable to that of the underlying encoder, BART-large (Lewis et al., 2020), and superior to that of BERT-large (Devlin et al., 2019), with the added benefit of faithful explanations in the spirit of case-based reasoning. Furthermore, to the best of our knowledge, ours is the first work in NLP that examines the utility of global case-based explanations for non-expert users in model understanding and downstream task accuracy.

Related work
Explainable classification Unlike post-hoc analysis approaches to explainability (Ribeiro et al., 2016; Sundararajan et al., 2017), prototype classification networks (Li et al., 2018; Chen et al., 2019; Hase et al., 2019) are white-box models with explainability built in via case-based reasoning (Kolodner, 1992) rather than extractive rationales (Lei et al., 2016; Bastings et al., 2019; Jain et al., 2020; Glockner et al., 2020). They are the neural variant of prototype classifiers (Bien and Tibshirani, 2011; Kim et al., 2014), which predict based on similar known instances. Contemporary work (Rajagopal et al., 2021) also stressed the importance of "global" explainability through training examples, yet in their approach the similar training examples are not directly integrated into the decision itself; in contrast, we do so via learned prototypes to provide more transparency.
Our work builds on Li et al. (2018), which we describe in Section 3.1. Later work (Chen et al., 2019; Hase et al., 2019) enables prototype learning over partial images. In NLP, Guu et al. (2018) retrieved prototype examples from the training data for edit-based natural language generation. Hase and Bansal (2020) examined a variant of Chen et al. (2019)'s approach, among others; unlike our work, they used feature activations to obtain explanations, similar to post-hoc approaches, and did not handle the absence of relevant content.
Evaluating explainability Explainability is a multi-faceted problem. HCI concerns include: a) For whom are we designing the explanations? b) What goals are they trying to achieve? c) How can we best convey information without imposing excessive cognitive load? and d) Can explainable systems foster more effective human+AI partnerships (Amershi et al., 2019; Wickramasinghe et al., 2020; Wang et al., 2019; Liao et al., 2020; Wang et al., 2021; Bansal et al., 2021)? On the other hand, algorithmic concerns include generating faithful and trustworthy explanations (Jacovi and Goldberg, 2020), local vs. global explanations, and post-hoc vs. self-explanations (Danilevsky et al., 2020).
Human+AI fake news detection While explainable fact-checking (Kotonya and Toni, 2020a) could better support human-in-the-loop fact-checking (Nakov et al., 2021; Demartini et al., 2020), studies rarely assess the human+AI team in combination (Nguyen et al., 2018). In fact, human+AI teams often under-perform the human or AI working alone (Bansal et al., 2021), emphasizing the need to carefully baseline performance.
Propaganda detection (Da San Martino et al., 2019) constitutes a form of disinformation detection. Because propaganda detection is a hard task for non-expert users, and state-of-the-art models are not accurate enough for practical use, explainability may promote adoption of computational propaganda detection systems (Da San Martino et al., 2021).

Methodology
We adopt prototype classification networks (Li et al., 2018), first proposed for vision tasks, as the foundation for our prototype modeling work (Section 3.1). We design a novel interleaved training procedure, as well as a new batching process, to (a) incorporate large-scale pretrained language models, and (b) address asymmetry within classification tasks where some classes can only be predicted by the absence of characteristics indicative of other classes. Classification is performed via a linear model that takes as input the distances to the prototype tensors. As such, the network is a white-box model whose global explanation is attained by directly linking the model to learned clusters of the training data.
As shown in Figure 1, the input is first encoded into a latent representation. This representation is fed through a prototype layer, where each unit of that layer is a learned prototype tensor that represents a cluster of training examples through loss terms L_p1 and L_p2 (specified by Equations 2 and 3 below).
For each prototype j, the prototype layer calculates the squared L2 distance between its representation p_j and that of the input x_i, i.e., ||x_i − p_j||_2^2. The output of the prototype layer, a matrix of L2 distances, is then fed into a linear layer; this layer learns a weight matrix of dimension K × m for K classes and m prototypes, where the K weights learned for each prototype indicate that prototype's relative affinity to each of the K classes. Classification is performed via softmax.
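As an illustration, the distance computation and classification step can be sketched as follows (a minimal NumPy sketch for a single input; function and variable names are ours, not from the released implementation):

```python
import numpy as np

def prototype_forward(x, prototypes, W):
    """Forward pass of the prototype layer followed by the linear layer.

    x          : (d,)   encoded input representation
    prototypes : (m, d) m learned prototype tensors
    W          : (K, m) class-affinity weights for K classes and m prototypes
    Returns class probabilities of shape (K,).
    """
    # Squared L2 distance between the input and every prototype: ||x - p_j||^2
    dists = np.sum((prototypes - x) ** 2, axis=1)   # shape (m,)
    # Linear layer over the distance vector (bias fixed to 0, as in our settings)
    logits = W @ dists                              # shape (K,)
    # Softmax over classes (numerically stabilized)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

Because the logits depend on the input only through its distances to the prototypes, every prediction can be traced back to the prototypes, and hence to nearby training examples.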
The total loss is a weighted sum of three terms:

  L = L_ce + λ_1 · L_p1 + λ_2 · L_p2   (1)

with hyperparameters λ, standard classification cross-entropy loss L_ce, and two prototype loss terms, L_p1 and L_p2.

L_p1 minimizes the average squared distance between each of the m prototypes and its nearest encoded input, encouraging each learned prototype representation to be similar to at least one training example:

  L_p1 = (1/m) · Σ_j min_i ||x_i − p_j||_2^2   (2)

L_p2 encourages training examples to cluster around prototypes in the latent space by minimizing the average squared distance between every encoded input and its nearest prototype:

  L_p2 = (1/n) · Σ_i min_j ||x_i − p_j||_2^2   (3)

Li et al. (2018) used convolutional autoencoders to represent input images. However, in the context of NLP, convolutional neural networks do not have sufficient representational power (Elbayad et al., 2018), and transformer-based language models, which are pretrained on large amounts of data, have consistently performed better in recent research. Thus, to encode inputs, we experiment with two such models: BERT (Devlin et al., 2019), a masked language model, and BART (Lewis et al., 2020), a sequence-to-sequence autoencoder.

During inference, we rank the prototypes by proximity to the test example. Thus, for a test example, we can obtain the training examples closest to the prototypes most influential to the classification decision. Jacovi and Goldberg (2020) define faithfulness as "how accurately [explanations] reflects the true reasoning process of the model." Since prototypes are directly linked to the model predictions via a linear classification layer, explanations derived from the prototypes are faithful by design. We also provide a mathematical intuition of how prototype layers relate to soft clustering (which is inherently interpretable) in Appendix A.1.
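The two prototype loss terms can be sketched as follows (a NumPy sketch of Equations 2 and 3 under our reading of Li et al. (2018); the actual model computes these over minibatches of encoder outputs):

```python
import numpy as np

def prototype_losses(X, P):
    """Prototype loss terms L_p1 and L_p2 (Eqs. 2 and 3).

    X : (n, d) encoded training examples
    P : (m, d) prototype tensors
    """
    # Pairwise squared L2 distances, shape (n, m)
    D = np.sum((X[:, None, :] - P[None, :, :]) ** 2, axis=-1)
    # L_p1: each prototype should be close to at least one training example
    # (min over examples, averaged over prototypes)
    l_p1 = D.min(axis=0).mean()
    # L_p2: each training example should be close to at least one prototype
    # (min over prototypes, averaged over examples)
    l_p2 = D.min(axis=1).mean()
    return l_p1, l_p2
```

The total loss of Equation 1 would then combine these terms with the cross-entropy loss using the λ weights.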

Handling asymmetry: negative prototype
Section 1 noted a challenge in effectively modeling positive vs. negative classes in the presence of asymmetry. With detection tasks (e.g., finding relevant documents (Kutlu et al., 2020) or propaganda (Da San Martino et al., 2019)), the negative class may be most distinguished by the lack of positive features (rather than the presence of negative ones). If a document is relevant only if it contains relevant content, how can one show the lack of such content? This poses a challenge both in classifying negative instances and in explaining such classification decisions on the basis of missing features. For propaganda, Da San Martino et al. (2019) side-step the issue by only providing rationales for positive instances. For relevance, Kutlu et al. (2020) define a negative rationale as summarizing the instance, to succinctly show it is not germane to the positive class. However, if we conceptualize the positive class as a specific foreground to be distinguished from a more general background, such "summary" negative rationales drawn from the background distribution are likely to provide only weak, noisy evidence for the negative class.
We investigate the potential value of including or excluding a single negative prototype to model this "background" negative class, and design an interleaved training procedure to learn this prototype.

Training
We present two algorithms for training. The vanilla one, which we call SIMPLEPROTOTEX, does not interleave the training of positive and negative prototypes. It is illustrated in Algorithm 1.
One of our contributions is the design of an iterative, interleaved approach to training that balances competing loss terms: encouraging each learned prototype to be similar to at least one training example (L_p1) and encouraging training examples to cluster around prototypes (L_p2). We perform each type of representation update separately to ensure that we progressively push the prototypes and the encoded training examples closer to one another. We illustrate this process in Algorithm 2. We initialize prototypes with Xavier initialization, which allows the prototype tensors to start blind (thus unbiased) with respect to the training data and discover novel patterns or clusters on their own. After initialization, in each iteration, we first update the prototype tensors to move them closer to at least one training example (henceforth the δ loop). Then, in a separate training iteration, we update the representations of the training examples to push them closer to the nearest prototype tensor (henceforth the γ loop). Since prototypes themselves do not have directly trainable parameters, we train the classification layer together with the encoder representations during the γ loop. We further separate the training of the positive and negative prototypes in order to push the negative "background" examples to form their own cluster. To this end, we perform class-level masking by setting the distances between examples and prototypes of different classes to infinity.
Finally, we perform instance normalization (Ulyanov et al., 2016) over all distances in order to achieve segregation among the prototypes (namely, so that prototypes of the same class do not rely solely on a handful of examples). We discuss the effects of instance normalization in Section 4.2.
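The class-level masking and instance normalization steps described above can be sketched as follows (an illustrative NumPy sketch; function names are ours):

```python
import numpy as np

def class_masked_distances(D, y_ex, y_proto):
    """Class-level masking used in the interleaved (delta/gamma) training loops.

    D       : (n, m) example-to-prototype squared distances
    y_ex    : (n,)   class label of each training example
    y_proto : (m,)   class assignment of each prototype
    Distances between examples and prototypes of *different* classes are set
    to infinity, so positive and negative prototypes are trained separately.
    """
    mask = y_ex[:, None] != y_proto[None, :]
    D = D.copy()
    D[mask] = np.inf
    return D

def instance_normalize(D, eps=1e-5):
    """Per-example normalization of the distance vector (cf. Ulyanov et al.,
    2016), discouraging prototypes of the same class from relying on only a
    handful of examples."""
    mu = D.mean(axis=1, keepdims=True)
    sigma = D.std(axis=1, keepdims=True) + eps
    return (D - mu) / sigma
```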

Experiments
Task We evaluate on a binary sentence-level classification task: predicting whether or not each sentence contains propaganda. We adopt the propaganda detection dataset of Da San Martino et al. (2019); see analysis of prototypes in Section 4.2.

Models and Settings
Hyperparameters are tuned on the validation data. All neural models are optimized with the AdamW algorithm (Loshchilov and Hutter, 2019), with a learning rate of 3e-5 and a batch size of 20. We use early stopping (Fomin et al., 2020) based on Macro-F1 on the validation data. We further perform upsampling within each batch to balance the number of examples in the positive and negative classes.
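The within-batch balancing can be sketched as follows (an illustrative sketch; this exact routine is our assumption, not the released code):

```python
import random

def balance_batch(batch, label_key="label", seed=0):
    """Upsample the minority class within a batch so that positive and
    negative examples are equally represented."""
    rng = random.Random(seed)
    pos = [ex for ex in batch if ex[label_key] == 1]
    neg = [ex for ex in batch if ex[label_key] == 0]
    small, large = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not small:  # nothing to upsample from
        return list(batch)
    # Duplicate randomly chosen minority-class examples to match the majority
    upsampled = small + [rng.choice(small) for _ in range(len(large) - len(small))]
    return large + upsampled
```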
Prototype Models PROTOTEX can be used with different underlying encoders, on top of which the interpretability components are added. Empirically, we found that BART performed better on classification, so we adopt it. We empirically determine the optimal number of prototypes to be 20, with one negative prototype. We set δ = 1, λ = 2, and γ_1 = γ_2 = 0.9.
To achieve maximum transparency, we set the bias term in the linear layer to 0 so that all information flows through the prototypes. Additionally, we compare to SIMPLEPROTOTEX, which trains without the negative prototype.
Baselines As a strong black-box benchmark, we use pretrained LMs without prototypes. BERT-large (Devlin et al., 2019): we use a simple linear layer over the output of the CLS token from the BERT encoder for classification.
BART-large (Lewis et al., 2020): we use the eos token's representation from the BART encoder as input to the linear layer of the model.
We also include a random baseline and a case-based reasoning K-Nearest-Neighbor (KNN-BART) baseline with the BART-large encoder.

Classification Results
Table 1 shows F1 scores achieved by the models. (Early experiments showed no difference between including vs. excluding the bias term when using instance normalization.)

Figure 3: For each subcategory of propaganda (and the class), the fraction of validation examples from that subcategory that are associated with each prototype, where "association" means being the closest prototype for that example. We see that PROTOTEX learns prototypes that "focus" differently on the subcategories.
Among the black-box baselines, the BART-large encoder representation significantly outperformed BERT-large (p < 0.05, bootstrap test (Berg-Kirkpatrick et al., 2012)). PROTOTEX performed on par with its underlying encoder BART, showing that PROTOTEX's explainability came at no cost in classification performance. It also substantially outperforms the KNN-BART baseline.
Figure 2 shows F1 scores for the examples pertaining to each subclass labeled by Da San Martino et al. (2019). We can see that model performance is relatively consistent across subclasses. The two subclasses that are most difficult for the model are "Reductio ad Hitlerum" and "Appeal to Authority".
In Figure 3, we visualize how different prototypes "focus" on each subclass differently. We also see that negative examples are associated only with the negative prototype, and vice versa.

Negative Prototype. Using a negative prototype far exceeds the SIMPLEPROTOTEX results that lack it (by roughly 10%). Without a negative prototype, the only way to classify the negative class would be via a negative correlation on the distance between the test input and the learned prototypes. The negative prototype simplifies discrimination by dissociating the classification of the negative class from that of the positive class.

Instance Normalization. As shown in Table 1, normalization boosts classification performance. We also observe its benefit for explainability.

Human Evaluation
PROTOTEX is designed to provide faithful case-based explanations (as shown in Table 2) for its classification decisions. Given the set of top prototypes most influential in predicting the class for a given example, we hypothesize that these top prototypes will be representative of the example and its label. We carry out two user studies to assess the utility of these prototype-based explanations for non-expert end users. Specifically, we examine whether model explanations help non-expert users to: 1) better recognize propaganda in online news; and 2) better understand model behavior.
We obtain 540 user responses, based on 20 test-set examples, balancing gold labels and model predictions to include 5 examples from each group: true positives, false negatives, true negatives, and false positives. To simplify propaganda definitions for non-experts, we pick only four types of propaganda and provide participants with definitions and examples for each type: Appeal to Authority, Exaggeration or Minimisation, Loaded Language, and Doubt. We select these categories because they cover the majority of the examples in the test set.
For each example, we select the top-5 prototypes that most influenced the model's prediction. We then represent each prototype by its closest training example in the embedding space. As in case-based reasoning, we explain model decisions to participants by showing, for each test example, the five training examples that best represent the evidence (prototypes) consulted by the model in making its prediction. Participants are primed that the model is wrong in 50% of the cases (to prevent over-trust).
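The explanation-retrieval step (top prototypes, each represented by its nearest training example) can be sketched as follows (a NumPy sketch; names are ours, not from the released code):

```python
import numpy as np

def explain(x, P, X_train, k=5):
    """Retrieve case-based explanations for one test input.

    x       : (d,)   encoded test example
    P       : (m, d) learned prototype tensors
    X_train : (n, d) encoded training examples
    Returns indices of the training examples that represent the k prototypes
    closest to the test input.
    """
    # Rank prototypes by squared-distance proximity to the test example
    proto_order = np.argsort(np.sum((P - x) ** 2, axis=1))[:k]
    # Represent each selected prototype by its nearest training example
    d_train = np.sum((X_train[:, None, :] - P[None, proto_order, :]) ** 2, axis=-1)
    return d_train.argmin(axis=0)
```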

Recognizing Propaganda
In this first Likert-scale rating task, participants are asked whether the test example contains propaganda. Options included: definitely, probably, probably not, definitely not, or "I have no idea (completely unsure how to respond)". We compare four study conditions: No Explanation (Baseline), in which we show only the test example that needs to be classified; Random Examples, in which we show randomly selected training examples as "explanation"; Explanation Only (EO), in which we show the PROTOTEX prototype examples; and Model prediction + Explanation (ME), in which we additionally show the model's prediction.

In the baseline condition, participants were able to correctly predict the presence of propaganda in 59% of the cases. In the second baseline condition, when we provide random examples as "explanation", accuracy drops to 44%. We also measure how varying model accuracy impacts the effect of model explanations, comparing four model accuracy conditions: 0% (always incorrect), 50%, 75%, and 100% (always correct). When the model is always wrong, explanations reduce human performance below both baselines (38% in the EO condition, 26% in ME). At 50% model accuracy, human performance is higher than in the "random" condition, but lower than the baseline. At 75%, the ME condition outperforms the baseline (67%). Finally, at 100% model accuracy, both model conditions improve the accuracy of the human annotation, with the ME condition reaching 84%. Our sample size of 540 exceeds the 70 necessary to achieve statistical power for between-subject studies (Bojko, 2013).

Results from this experiment demonstrate that case-based explanations can improve human performance compared to a random baseline.However, the utility of the explanations is a function of the model accuracy.

Model Understanding
The second user task investigates model understanding via simulatability (Hase et al., 2019): can the participant predict the model's decision given the most important evidence consulted by the model? Specifically, we show five training examples to the user, either Random Examples (RE) or PROTOTEX Examples (PE) (i.e., the same training examples used in the EO condition above). We ask participants to predict the model's decision using the same 5-point Likert scale as earlier.
Results Per Figure 6a, PROTOTEX's explanations help users predict model behavior better than random examples: 50% correct user assessment for PE vs. 43.3% for RE. In 23.3% of the RE cases, users are unable to make a prediction, vs. 8% for PE. Random guessing would be 40% accurate on a five-way rating task with 2 positive, 1 neutral, and 2 negative options (§5.1).
In Figure 6b, we can see that users are better at assessing the model prediction when the model is right (57%) than when it is wrong (43%). Additionally, fewer users report being unable to identify the model's prediction when the model is correct (3.33%) vs. when it is not (13.3%).

Conclusion
PROTOTEX is a novel approach to faithfully explain classification decisions by directly connecting model decisions with training examples via learned prototypes. PROTOTEX builds upon the state of the art in NLP: it integrates an underlying transformer encoder with prototype classification networks, and uses a novel, interleaved training algorithm for prototype learning. On the challenging propaganda detection task, PROTOTEX performed on par in classification with its underlying encoder (BART-large) and exceeded BERT-large, with the added benefit of providing faithful model explanations via prototypes. Our pilot human evaluation study shows that the additional input provided by PROTOTEX contains relevant information for the task and can improve annotation performance, provided sufficient model accuracy. We further demonstrate that explanations help non-expert users better understand and simulate model predictions.

Ethical Statement
For annotation, we source participants from Amazon Mechanical Turk, restricted to the United States, paying $10/hour based on average task time. We did not reject any work, but we excluded data from participants who failed an attention check.

Figure 1: PROTOTEX architecture along with a use case demonstration. Pink/green dots denote training examples, which are clustered around positive prototypes (blue) and a single negative prototype (black). Dotted lines represent distances. In this use case, the user gives PROTOTEX an input, which produces a prediction while retrieving a set of highest-ranked examples that directly influenced the model decision. In this diagram, by using "overly aggressive or grandstanding", the input creates propaganda via exaggeration (Da San Martino et al., 2019). PROTOTEX learns to identify sentences that contain propaganda phrases. In example 1, using "Russian" to describe Manafort (a former American political consultant) constitutes propaganda, as does using "witch hunt" in example 2. Exposure to similar examples helps users build an intuition for the language used in propaganda.

Base architecture

PROTOTEX is based on Li et al. (2018)'s Prototype Classification Network, and we integrate pretrained language model encoders into this framework. Their architecture is based on learning prototype tensors that serve to represent latent clusters of similar training examples (as identified by the model).

Figure 2: Macro-F1 score of PROTOTEX when predicting examples belonging to each propaganda subclass. The black line corresponds to the number of examples in the test set. Classes are ordered by F1.

Figure 4: Number of unique 5-nearest training examples to each prototype (blue+red), and the number of examples associated with only one prototype (blue only). Without normalization, very few examples (out of 100) are close to all prototypes; with normalization, we observe more diversity: different training examples are near different prototypes.

Figure 5: Accuracy of human annotations when provided with PROTOTEX explanations or PROTOTEX explanations + prediction. Model Performance: the accuracy of the model generating the explanations. Baseline: annotation accuracy without explanations. Random: randomly selected examples as explanation.

Figure 6: Model simulatability. User assessment of the model prediction: a) comparing PROTOTEX-selected training examples vs. random examples; b) comparing examples where the model prediction is accurate vs. examples where it is wrong.

Since prototypes occupy the same space as encoded inputs, we can directly measure the distance between prototypes and encoded train or test instances. At inference time, prototypes closer to the encoded test example become more "activated", receiving larger weights from the prototype layer output. The model prediction is thus the weighted affinity of each prototype to the test example, where each prototype has K weights over the possible class assignments. In the context of classification in NLP, we operationalize case-based reasoning (Kolodner, 1992) by providing similar training examples. Once the model is trained, for each prototype we rank the training examples by proximity in the latent space.
Algorithm 2: Decoupled training for prototypes and classification, which enables the learning of the negative prototype.


Table 2: Examples of similar sentences identified by our model. The input sentence uses the phrase "feeding frenzy", which is an example of propaganda phrasing. The model identifies training examples that also contain propaganda phrases, as highlighted. Note that the model does not produce the highlights shown here; the highlights are also not part of our human evaluation.