Human Rationales as Attribution Priors for Explainable Stance Detection

As NLP systems become better at detecting opinions and beliefs from text, it is important to ensure not only that models are accurate but also that they arrive at their predictions in ways that align with human reasoning. In this work, we present a method for imparting human-like rationalization to a stance detection model using crowdsourced annotations on a small fraction of the training data. We show that in a data-scarce setting, our approach can improve the reasoning of a state-of-the-art classifier—particularly for inputs containing challenging phenomena such as sarcasm—at no cost in predictive performance. Furthermore, we demonstrate that attention weights surpass a leading attribution method in providing faithful explanations of our model’s predictions, thus serving as a computationally cheap and reliable source of attributions for our model.


Introduction
Stance detection, automatically identifying the position on a topic taken by a text (Mohammad et al., 2017), allows readers to glean valuable information from news articles and social media, such as whether the writing is politically slanted. Due to the sensitive nature of many topics (e.g., political ideologies and religious beliefs), it is crucial that stance models are transparent and rationalize their predictions in human-like ways. Furthermore, their reasoning must remain human-like even when they are tasked with generalizing to a new, unseen test topic.
A model's rationale for a specific input can be extracted in the form of feature attributions, which quantify the influence that each input feature exerts on the model's predictions. Methods for obtaining these attributions vary by faithfulness (the extent to which they accurately measure the importance of each feature). Attributions can be incorporated into the training process via attribution priors, a powerful framework for imparting domain knowledge to models (Erion et al., 2019, 2020). In this work, we propose a prior that penalizes a stance classifier for producing attributions that deviate from human judgements of word importance (i.e., which words or phrases in a text are most indicative of stance). Notably, our method is model-agnostic and can be used with any differentiable feature attribution technique.
Table 1: Model reasoning as explained by mean attention weights (MAW). The baseline is trained using only cross-entropy loss, while the second model is trained using our proposed attribution prior.
Argument: The better question is, though, how many people who have used marijuana DO NOT use other illegal drugs. I think the answer would surprise you. Correlation is not causation -the sooner you learn that, the sooner you can get to the real root of our opiate addition problem.
Baseline prediction: AGAINST (wrong)
Our Method prediction: FOR (correct)
We train and evaluate our model on the VAST (VAried Stance Topics) dataset, whose test set covers topics that are either absent from (zero-shot) or scarce in (few-shot) the training data (Allaway and McKeown, 2020). To construct our attribution prior, we crowdsource word importance annotations on a small subset (∼500 examples) of the training data, as well as on a sample of the test data for evaluation. Additionally, to assess how our method might fare in realistic, resource-limited scenarios, we experiment using not only the complete VAST train set, but also reduced versions of it.
As the attribution method for our prior, we choose mean attention weights (MAW), a method that is computationally cheap in comparison to popular alternatives. Although numerous recent publications have demonstrated that attention weights do not in general provide faithful explanations of model behavior, we find strong evidence that the attributions offered by MAW are more faithful to our model's predictions than Gradient × Input (GI), a leading attribution method for transformer models.
Our contributions are as follows: (1) we propose a method that, in a simulated data-scarce setting, improves the reasoning of a state-of-the-art stance detection model without compromising its performance and (2) we show that MAW is not only simpler and computationally cheaper than GI, but also more faithful to our models. Our data and models are available at https://github.com/SahilJ97/Explainable-Stance-Detection.

Related Work
While early work on stance detection focused primarily on ideological debates (Walker et al., 2012; Hasan and Ng, 2014; Abbott et al., 2016), recent datasets have also begun to include more political topics, such as elections (Mohammad et al., 2016; Vamvas and Sennrich, 2020; Lai et al., 2020) and referendums (Taulé et al., 2017; Tsakalidis et al., 2018). This reflects a growing interest in developing models to understand public opinion on a range of topics. However, to be used in real-world scenarios, such models must also exhibit good generalization ability and a degree of transparency. Recent work has focused on the generalization ability of stance detection models: across topics (Augenstein et al., 2016; Xu et al., 2018; Allaway and McKeown, 2020; Zhang et al., 2020; Allaway et al., 2021), languages (Vamvas and Sennrich, 2020), and even label sets and genres (Schiller et al., 2020; Hardalov et al., 2021). In contrast, our work focuses on the reasoning of models during topic generalization.
Many attribution methods have been used to extract rationales from text classifiers, including activation-based (Atanasova et al., 2020), perturbation-based (Ribeiro et al., 2016), gradient-based, and attention-based (Abnar and Zuidema, 2020; Wu and Ong, 2021) methods. Furthermore, numerous works incorporate these model attributions into the training process. Liu and Avci (2019) train a toxicity classifier using an attribution prior based on Integrated Gradients, a gradient-based method of feature attribution (Sundararajan et al., 2017). Zhong et al. (2019) directly train an attention mechanism for relation extraction. Previous studies have used word importance annotations much like ours to supervise attention with the goal of improving predictive performance (Pruthi et al., 2020; Kanchinadam et al., 2020). Our work, in contrast, focuses on verifiably improving the reasoning of a classifier. While there exist models whose reasoning paths are naturally transparent and easy to train, such as select-predict (Jacovi and Goldberg, 2021) and rationale-augmented (Zaidan et al., 2007; Zhang et al., 2016) models, our framework makes no assumptions about model architecture, and can thus be applied to a state-of-the-art stance classifier.
A recent survey of explainability (i.e., attribution) methods proposed five diagnostic properties for comparing techniques, including faithfulness, defined as a measure of how true attributions are to the inner workings of a model (Atanasova et al., 2020). Their experiments with transformer models show that gradient-based methods (e.g., GI) score high in faithfulness. In fact, similar assessments of faithfulness have found that attention does not provide faithful explanations (Jain and Wallace, 2019; DeYoung et al., 2020), especially compared to GI (Wu and Ong, 2021). However, recent studies have argued for a more nuanced understanding of faithfulness that prioritizes 'explainable enough' (Wiegreffe and Pinter, 2019; Jacovi and Goldberg, 2020) and allows for the 'best' technique to vary by model, task, or input (Ghorbani et al., 2018). Our work presents a scenario in which attention is in fact more faithful than GI.

Crowdsourcing Annotation
In order to study model rationales for stance detection across many topics, we annotate a portion of the recently proposed VAST (Allaway and McKeown, 2020) dataset with human rationales. VAST is composed of comments (referred to here as arguments) from a portion of The New York Times. The stance topics in the dataset were extracted automatically (e.g., by identifying important noun-phrases) and then validated (or corrected) using crowdsourcing. Crowdsourcing was also used to assign a label to each example: "pro" (for), "con" (against), or "neutral." For annotation, we randomly select from the train set 700 non-neutral examples whose topics were validated by annotators. We also select 75 such examples from the test set. We do not annotate examples (142 in training, 32 in test) for which we judge the topic to be unclear (e.g., "problem"). Furthermore, in the training sample, we identify 14 examples with incorrect stance labels.
We collect our annotations via Amazon Mechanical Turk (MTurk), a popular crowdsourcing platform. For each HIT (task), workers are asked to (1) classify the stance of an argument with respect to a topic and (2) select the k most important words in the argument (for each example, we provide an acceptable range of values for k). A word is considered to be important if masking it would make (1) more difficult. A detailed illustration of our MTurk HIT is provided in Appendix A.
To ensure the quality of our annotations, we first publish a "qualification" HIT consisting of a single example. Then, three qualified workers annotate each example in our subsets. We also use (1) above to check for worker quality in the training subset only. In particular, for 74 samples in the training subset, at least two annotators disagree with the gold stance label. The authors inspect each of these examples and either flip the label (35 examples) or discard the example. The Cohen's κ on these decisions is 0.392.

Oracle Attributions
Computation and Processing: The results of our crowdsourced tasks are used to compute oracle attributions for each of the annotated arguments.
We disregard a worker's annotations for an example if their stance classification (see (1) above) disagrees with the label. Our oracle attributions are a weighted sum of annotator responses, with worker quality score (WQS) as the weighting factor. WQS measures an annotator's word-level agreement with their peers (Dumitrache et al., 2018). The average WQS for our annotators is 0.58. Note that these oracle scores are later normalized (§4.2).
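The aggregation described above can be sketched as follows. This is a minimal illustration only: it assumes each annotator's response is a binary vector over the argument's tokens and that WQS values have already been computed. The function name and argument layout are hypothetical and not taken from the released code.

```python
def oracle_attributions(selections, wqs, stances, gold_stance):
    """WQS-weighted sum of binary word selections (illustrative sketch).

    selections:  one 0/1 list per annotator, each of length m
    wqs:         worker quality score per annotator (assumed precomputed)
    stances:     stance label chosen by each annotator
    gold_stance: the label of the example
    """
    m = len(selections[0])
    scores = [0.0] * m
    for sel, weight, stance in zip(selections, wqs, stances):
        if stance != gold_stance:
            continue  # disregard annotators whose stance disagrees with the label
        for j, picked in enumerate(sel):
            scores[j] += weight * picked
    return scores
```

The returned scores are unnormalized; normalization is applied later, when the prior loss is computed.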
Analysis: We examine the processed oracle attributions and observe that annotators mark an average of 26% of the tokens in an argument as important. Surprisingly, words from the topic are only selected for 51% (267/519) of the examples while an average of 44% of important words are stopwords. For example, a worker selected "no one here" in the sentence "I know of no one here who is even remotely excited about the olympics" (see Table 5). This shows that human word importance judgements cannot be approximated simply by selecting the topic or the words most similar to the topic. Additionally, we find that on average only 10% (9%) of important words are positive (negative) sentiment-bearing, as identified by the MPQA lexicon (Wilson et al., 2017). This further highlights the complexity of human word importance judgements for stance detection, since sentiment has only a minor role in determining stance.

Methods
We propose a model for stance detection that uses a BERT-based encoder (§4.1) trained with an additional loss term (§4.2) designed to impose a prior based on human rationales. Define $D$ as a dataset with $N$ examples, each consisting of an argument $d_i$, a topic $t_i$, and a stance label $y_i \in \{0, 1, -1\}$. In addition, let $M_\theta$ be some model with parameters $\theta$. Then we can define for each example $x$ a set of oracle attributions $s$, model attributions $a$, and penalty weights $\gamma$ (the contribution of each token to our loss term). Our prior loss term encourages the model to produce, for each example, attributions $a$ that are very "similar" to $s$ (§4.2-4.3).

Base Architecture
Our base architecture builds on the baseline model BERT-joint introduced by Allaway and McKeown (2020), which jointly embeds a topic and document using BERT (Devlin et al., 2019), thereby conditioning the topic representation on the document and vice versa. Our model differs from BERT-joint in two ways. First, rather than fixing BERT, we fine-tune its weights during training, thus allowing the transformer to update its attention heads. This is necessary in order to accommodate our choice of attribution method (MAW). Second, rather than removing stopwords from the input, we pass to BERT the full input sequence ([CLS] document [SEP] topic [SEP]) and compute the final representation used for stance classification by taking the mean hidden state over all non-stopwords. We do this because our oracle attributions cover all words in the argument, not just the non-stopwords.
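The pooling step (mean hidden state over non-stopwords) can be illustrated with a small NumPy sketch. The function name, the boolean stopword mask supplied by the caller, and the fallback for an all-stopword input are assumptions made for illustration; the source does not specify how such edge cases are handled.

```python
import numpy as np

def mean_over_non_stopwords(hidden, is_stopword):
    """Average final-layer hidden states over non-stopword positions.

    hidden:      array-like of shape (seq_len, dim)
    is_stopword: array-like of seq_len booleans (True = stopword)
    """
    hidden = np.asarray(hidden, dtype=float)
    keep = ~np.asarray(is_stopword, dtype=bool)
    if not keep.any():
        # Assumed fallback: if every token is a stopword, average everything.
        return hidden.mean(axis=0)
    return hidden[keep].mean(axis=0)
```

Because the stopwords still pass through the encoder, they can receive attention and attributions even though they are left out of the pooled representation.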

Prior Loss
Our example-level rationale loss function $\Omega$ is a weighted mean square error between normalized model attributions and normalized oracle attributions. Formally, for an example $x = (d, t, y)$, let $m$ be the length of the argument $d$ and let $\theta$ denote the parameters of our model. Let $x$ be a word-importance-annotated example with penalty weights $\gamma = (\gamma_1, ..., \gamma_m)$, oracle attributions $s = (s_1, ..., s_m)$, and model attributions $a = (a_1, ..., a_m)$. Let $\hat{a}_j$ denote the normalized attribution score for argument token $j$ and, similarly, let $\hat{s}_j$ denote the normalized oracle score for argument token $j$.
Then $\Omega$ is defined as follows:
$$\Omega(\theta; x) = \frac{1}{m} \sum_{j=1}^{m} \gamma_j (\hat{a}_j - \hat{s}_j)^2$$
Intuitively, the square error associated with the attribution on the argument's $j$-th token is weighted by penalty weight $\gamma_j$. For an example $x$ that is not endowed with oracle attributions, we define $\Omega(\theta; x)$ to be 0. Our complete loss function is the sum of the stance classification loss $L_c$ and the scaled average prior loss across examples:
$$L(\theta; D) = L_c(\theta; D) + \frac{\lambda}{|D|} \sum_{x \in D} \Omega(\theta; x)$$
where $D$ is a dataset, $L_c$ is the cross-entropy loss, and $\lambda > 0$ is a hyperparameter.
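A minimal sketch of the loss described above is given here in plain Python. Two details are assumptions rather than facts from the text: normalization is taken to mean dividing each score by the sum of scores, and the "mean" is taken over the m argument tokens. All function names are hypothetical.

```python
def normalize(scores):
    """Scale non-negative scores to sum to 1 (assumed normalization)."""
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else [0.0] * len(scores)

def rationale_loss(model_attr, oracle_attr, penalty):
    """Penalty-weighted mean square error between normalized attributions."""
    a_hat = normalize(model_attr)
    s_hat = normalize(oracle_attr)
    m = len(a_hat)
    return sum(g * (a - s) ** 2 for g, a, s in zip(penalty, a_hat, s_hat)) / m

def total_loss(class_loss, rationale_losses, lam):
    """Classification loss plus lam times the average rationale loss.

    rationale_losses holds one value per example; examples without
    oracle attributions contribute 0.
    """
    return class_loss + lam * sum(rationale_losses) / len(rationale_losses)
```

Since unannotated examples contribute 0 to the average, the prior term shrinks as the annotated fraction falls, which is consistent with the large values of λ reported in the experiments.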

Penalty Weights
In order to exclude certain tokens (punctuation and numerals) from our rationale loss function, and to potentially assign non-uniform influence to the remaining tokens, we introduce the notion of penalty weights. The penalty weight of a token specifies its contribution to the rationale loss function. In our experiments, we focus primarily on binary penalty weights, where tokens that are punctuation marks or numerals receive a score of 0 and all other tokens receive a score of 1. However, we also experiment with tf-idf penalty weights, where the score of a non-punctuation, non-numeral token is its tf-idf with respect to the training set.
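Binary penalty weights can be sketched as below. How punctuation and numerals are detected is an assumption made for illustration (ASCII punctuation from the standard library, and digit strings with optional commas or periods); the actual tokenization and rules may differ.

```python
import string

def binary_penalty_weights(tokens):
    """Return 0.0 for punctuation marks and numerals, 1.0 for all other tokens."""
    weights = []
    for tok in tokens:
        # A token counts as punctuation if every character is ASCII punctuation.
        is_punct = all(ch in string.punctuation for ch in tok)
        # A token counts as a numeral if only digits remain after dropping , and .
        is_numeral = tok.replace(",", "").replace(".", "").isdigit()
        weights.append(0.0 if (is_punct or is_numeral) else 1.0)
    return weights
```

A tf-idf variant would replace the 1.0 with the token's tf-idf value while keeping the 0.0 entries unchanged.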

Feature Attribution
Our prior allows for any choice of attribution method. We select Mean Attention Weights (MAW) because of its extremely low computational cost in comparison with other methods, most of which require backpropagation and/or multiple forward passes (Atanasova et al., 2020). In MAW, the attribution score of token $j$ is the mean, taken across all tokens, layers, and attention heads, of attention weights $\alpha_{ij}$, that is, all attention weights associated with the key at index $j$. Informally, MAW measures how much attention each token receives (from other tokens as well as from itself). Our framework assumes that attribution scores are magnitudes (i.e., unsigned). Thus, we implicitly take the absolute value of our MAW attributions.
We also compare MAW with an additional attribution method, Gradient × Input (GI) (Wu and Ong, 2021), for evaluation. Let $e_j = (e_{j1}, ..., e_{jh})$ be the input embedding of the $j$th token of the argument for some example $x$. We then define the GI attribution score of token $j$ from the gradient of $f_c$ with respect to $e_j$ multiplied by $e_j$, aggregated across output classes $c$, where $f_c$ denotes the component of the model's output function corresponding to class $c$. Intuitively, GI measures the sensitivity of the model to perturbations of $e_j$, which in theory measures the dependence of the model's prediction on token $j$. We choose to aggregate across output classes because all output neurons contribute to the model's decision; a negative contribution to a non-predicted class is just as important as a positive contribution of equal magnitude to the predicted class (Bach et al., 2015). We select GI as a benchmark method because of its high performance in assessments of faithfulness for transformer-based models across several domains (Atanasova et al., 2020; Wu and Ong, 2021).
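The MAW computation can be illustrated with a short NumPy sketch. It assumes the attention weights for a single example have been stacked into one array of shape (layers, heads, seq_len, seq_len), with the query index before the key index; the function name is hypothetical.

```python
import numpy as np

def mean_attention_weights(attn):
    """Mean attention received by each key token.

    attn: array-like of shape (layers, heads, seq_len, seq_len), where
          attn[l, h, i, j] is the attention query token i pays to key token j.
    Returns an array of seq_len unsigned scores, one per key token j.
    """
    attn = np.asarray(attn, dtype=float)
    # Average over layers (axis 0), heads (axis 1), and query tokens (axis 2).
    return np.abs(attn.mean(axis=(0, 1, 2)))
```

With Hugging Face models, per-layer attention tensors can typically be requested by passing `output_attentions=True` to the forward call and stacking the returned tuple. Whether the authors obtain them this way is not stated in the text.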

Models
We train a stance model (prior-bin:gold) with our proposed attribution prior, using binary penalty weights (§4.3), our crowdsourced oracle attributions (§3.2), and MAW to extract model attributions (§4.4). We compare this model to one that shares its architecture but is trained without prior loss (base). In addition, we compare with two baselines proposed for VAST: BERT-joint, which is our architecture (§4.1) without fine-tuning and with additional data pre-processing, and TGA Net, a modification of BERT-joint that uses unsupervised clustering and attention to improve performance on unseen topics (Allaway and McKeown, 2020). We tune λ using a manual hyperparameter search. We find that because only a small fraction of examples are endowed with oracle attributions, the coefficient applied to our prior loss term must be quite large: λ = 49152 in the full and reduced 25 settings and λ = 16384 in the reduced 10 setting. Our models are implemented in PyTorch and optimized using Adam for 20 epochs with a batch size of 32 and a fixed learning rate of $10^{-5}$. We use a maximum sequence length of 250 for arguments and 10 for topics. All models use bert-base-uncased from Huggingface. Results are averaged across three random seeds unless otherwise specified.

Results: Stance Prediction
We evaluate our models using macro-averaged F1 on both the few-shot and zero-shot subsets (§3.1) of the VAST test set (see Table 3). We see that across training settings, prior-bin:gold and base achieve comparable results and outperform the baselines proposed for VAST. We also conduct an ablation on the method for computing penalty weights in the prior loss (§4.3) in the data-scarce reduced 25 setting. Specifically, we experiment with prior-tfidf:gold (tf-idf penalty weights and crowdsourced oracle attributions) and prior-bin:tfidf (binary penalty weights and tf-idf values as pseudo-oracle attributions instead of our crowdsourced labels). Both these methods perform worse than prior-bin:gold and base, achieving 0.661 and 0.655 macro-F1 respectively. This result aligns with our observations about human word importance annotations (§3.2), namely that human rationales are complex and do not necessarily parallel notions of word importance derived through tf-idf. Therefore, our stance prediction results show that human word importance annotations are necessary in order to obtain strong results using our proposed attribution prior.
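For reference, macro-averaged F1 can be computed as in the plain-Python sketch below. The function name and the label strings in the usage note are illustrative, and assigning an F1 of 0 to a class with no gold or predicted instances is an assumed convention.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); assumed to be 0 when the class never occurs.
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

For example, `macro_f1(gold, pred, ["pro", "con", "neutral"])` gives every class equal weight regardless of how many examples it has.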

Analysis of Rationales
In addition to evaluating our models' predictions, we also assess the quality of their reasoning. In order to do this, we first analyze the relative reliability of explanations obtained from MAW and GI. We then use our findings to evaluate rationale quality via two separate mechanisms: human raters and our rationale loss function (Ω).

Faithfulness of Attributions:
The faithfulness of an attribution method is the extent to which it accurately reflects a model's reasoning (Herman, 2017). Although Atanasova et al. (2020) propose five diagnostic properties for explainability techniques, we only consider faithfulness, as we find that the other four properties are either non-meaningful or inapplicable in the case of our methods (see Appendix B).
Table 3: F1 results on the test set for all three versions of the train set. Columns: Pro, Con, and Avg for each of All, Zero-Shot, and Few-Shot. Avg refers to the macro-average across all three classes (Pro, Con, and Neutral). † marks results reported in Allaway and McKeown (2020). Differences between base and prior-bin:gold are not statistically significant (p < .05).
Our faithfulness analysis considers only the reduced 25 setting, as we are interested in improvements under data-scarcity and believe the faithfulness of MAW and GI to be relatively constant across all three data settings. To gauge the faithfulness of a feature attribution method, we use the diagnostic employed by Atanasova et al. (2020). Namely, for all ψ ∈ {0, 10, . . . , 100}, we mask the most important (as determined by the attribution method) ψ% of tokens in each input example and compute the resulting macro-F1 across all examples. The area under this threshold-performance curve (AUC-TP) gives us an inverse measure of faithfulness; intuitively, if an attribution method is faithful, then model performance relies predominantly on the most important tokens as suggested by that method, resulting in a low AUC-TP. As a baseline, we also compute a threshold-performance curve using random masking (equivalent to assessing random attributions). We find that MAW surpasses GI in faithfulness for both prior-bin:gold and base and considerably outperforms random attributions (see Figure 1, Table 4). This indicates that overall, MAW attributions are faithful to our model, and can therefore be trusted, for our purposes, as explanations of model reasoning. In other words, we can justifiably interpret MAW attributions as rationales.
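The masking diagnostic can be sketched in plain Python as follows. The helper names, the mask token string, the rounding used to turn a percentage into a token count, and the trapezoidal rule for the area are all assumptions for illustration; the text specifies only the thresholds and the use of macro-F1.

```python
def mask_top_tokens(tokens, attributions, percent, mask_token="[MASK]"):
    """Replace the `percent`% highest-attribution tokens with mask_token."""
    n_mask = round(len(tokens) * percent / 100)
    # Token indices sorted from most to least important.
    ranked = sorted(range(len(tokens)), key=lambda j: attributions[j], reverse=True)
    to_mask = set(ranked[:n_mask])
    return [mask_token if j in to_mask else tok for j, tok in enumerate(tokens)]

def auc_tp(thresholds, scores):
    """Trapezoidal area under the threshold-performance curve."""
    area = 0.0
    for k in range(1, len(thresholds)):
        width = thresholds[k] - thresholds[k - 1]
        area += width * (scores[k] + scores[k - 1]) / 2
    return area
```

In use, one would mask every example at each threshold in {0, 10, ..., 100}, recompute macro-F1 at each threshold, and pass the two lists to `auc_tp`; a lower area suggests a more faithful attribution method.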

Table 5: Attributions and Eval Scores.
Argument: I have lived in brazil for the last five years (and off and on over the last 27 years). I know of no one here who is even remotely excited about the olympics. It would seem that people don't care. The economy is tanking and government is at a complete standstill. We have more important things on our mind right now.
oracle
prior-bin:gold: congruity: 2, sufficiency: 2, irrelevance: 0
base: congruity: 0, sufficiency: 0, irrelevance: 2
Ratings are done on a five-point Likert scale. Each example is scored by two annotators. The Krippendorff's alpha (Krippendorff, 1980) values for congruity, sufficiency, and irrelevance are 0.316, 0.311, and 0.136, respectively. Score averages are mapped to a 0-2 scale for analysis. As in our faithfulness analysis, we focus exclusively on the reduced 25 setting.
We find that prior-bin:gold outperforms base by a considerable margin for all three questions (see Table 6), demonstrating that in a data-scarce setting, our attribution prior is highly effective for conditioning model reasoning.
Additionally, we investigate whether rationale loss is a reliable proxy for rationale quality. Specifically, we compute the correlation between root rationale loss ($\sqrt{\Omega}$) and the averaged human evaluation scores. We use root loss since our rationale loss $\Omega$ is intuitively a weighted mean-square error (§4.2). We compute correlation for both MAW-based ($\sqrt{\Omega_{MAW}}$) and GI-based ($\sqrt{\Omega_{GI}}$) root rationale loss for our models. We find that both metrics correlate with human judgements; however, while $\sqrt{\Omega_{MAW}}$ is a better indicator of irrelevance, $\sqrt{\Omega_{GI}}$ is a better indicator of congruity and sufficiency (see Table 7). This suggests that in the context of attribution priors, different attribution methods are better suited for enforcing different qualities on model rationales, and attribution methods should be chosen accordingly.
Refer to Appendix C for information regarding how we generate visualizations of attributions.
Computational Evaluation of Rationales: In terms of MAW-based rationale loss ($\Omega_{MAW}$), training with our attribution prior yields a statistically significant advantage in all data settings (see Table 8). In terms of $\Omega_{GI}$, the model trained with our proposed prior performs best in the moderately data-scarce setting, but falls short of base when the complete train set is used. This may indicate that in the full setting, prior-bin:gold's rationales are poorer than those of base in terms of congruity and sufficiency. Thus, our attribution prior may have adverse effects on model reasoning when an insufficient fraction of the train set is endowed with oracle attributions.

Error Analysis
Challenging Phenomena: We also examine performance on the five challenging phenomena identified in VAST: Imp (the topic phrase is absent from the argument and the label is non-neutral), mlT (the argument appears in multiple examples, each with a different topic), mlS (the argument appears in multiple examples with different, non-neutral stance labels), Qte (the argument contains a quotation), and Sarc (the argument contains sarcasm). We find that while training with our proposed attribution prior yields comparable performance on these phenomena, it provides superior rationales for all five (see Table 9). This shows the efficacy of our method at improving rationalization for difficult examples without degrading performance.
Rationale Error Types: We analyze the errors in the rationales produced by prior-bin:gold. Specifically, we randomly sample 50 examples for which the model predicts the incorrect label and manually categorize them as: amount err (errors in the amount of words selected, i.e., selecting too few or too many), content err (errors in the content of selected words, i.e., missing negations or critical parts of phrases), complex err (failure to understand complex language, e.g., sarcasm or implicit references to the topic), and data err (errors in the data annotation, i.e., incorrect label or nonsensical topic). Semantic errors (content err and complex err) occur in 68% of the cases (32% and 36% respectively). For example, the model often fails to understand rhetorical questions or misses important negations. Additionally, we find that 46% of the rationales select too few or too many words (e.g., selecting most stopwords in the argument). Finally, we see that data err accounts for 30% of the errors. This analysis suggests that, while our attribution prior improves the rationales for semantically complex examples, semantic understanding remains a key challenge for future improvements.

Conclusion
This paper addresses two issues concerning the task of stance detection: 1) the need for models whose reasoning aligns with that of humans and 2) the need for a way to meaningfully observe the reasoning of models in the first place. We find that in a simulated data-scarce setting, our attribution prior improves model rationales using a practical volume of crowdsourced annotations. We also find that attention-based explanations, which have recently been the subject of much criticism, provide faithful explanations of our models' behavior, more so than a high-ranking alternative method.
In future work we plan to apply our method to more challenging settings, such as multilingual zero-shot stance detection. We will also further investigate the "economics" of our method (for instance, the number of annotated examples necessary to meaningfully improve model reasoning) and experiment with a broader range of attribution methods, e.g., Guided Backpropagation (Springenberg et al., 2015) and LIME (Ribeiro et al., 2016). Lastly, we hope to study how to condition model reasoning to protect against adversarial attacks.

Ethics Statement
We use a dataset collected and distributed by Allaway and McKeown (2020): https://github.com/emilyallaway/zero-shot-stance. Data was collected from publicly available comments on articles in The New York Times. No user information is retained with the comments, so the data does not contain explicit information about race, gender or ethnicity of the original authors. For the additional annotations we collect, we compensate workers at ∼$13 per hour, above the federal minimum wage in the United States (where many annotators are based).
Some of the methods we discuss are intended to provide model transparency when predicting stance labels, including on sensitive topics (e.g., religious beliefs). When using these methods to provide explanations for a prediction on a text, real-world users should be informed that the explanations are automatically generated and may not be representative of the full opinions of the text's authors.

A Crowdsourcing
We pay each worker $0.28 per HIT. We observe that workers spend an average of roughly 75 seconds on each HIT, excluding outliers (cases in which the worker took much longer than the other two workers for that HIT, presumably due to a workflow interruption). Only tokens containing at least one alphabetical character are selectable. For each HIT, workers are required to select at least round(num_selectable/11) tokens, and at most round(num_selectable/5.5) (where num_selectable is the number of selectable words in a specific HIT). See Figure 2, Figure 3,

B Other Diagnostic Properties
When assessing GI and MAW as explainability techniques for our models, we choose not to consider four of the five diagnostic properties proposed by Atanasova et al. (2020). Agreement with human annotations (HA) is not necessarily a desirable property, as it is little more than an indication of how convincing an attribution method is to humans. Note that we compute HA in the form of rationale loss ( §5.4), but do so as a way of evaluating attributions themselves, as opposed to attribution methods. Confidence Indication (CI) does not apply to MAW, as attention weights do not differ by class for a fixed input. The authors' metric for Rationale Consistency (RC) requires the assumption that models with similar reasoning paths have similar activation maps, an assumption we believe is flawed primarily on account of architectural symmetry. Lastly, we believe that the proposed metric for Dataset Consistency (DC) would not be meaningful for our dataset, as the degree of similarity between different arguments in VAST is extremely low.

C Visualizing Attributions
To visualize model attributions for figures, human evaluation, and rationale error analysis, we map the attribution score for each token to a new score of 0 (unselected), 0.5 (selected but only moderately important), or 1 (selected and very important). We perform this mapping using the following procedure, which takes parameters $k$ and $\epsilon$:
1. Rank the attribution scores for the input sequence in descending order, and let k_score be the score of the k-th item.
2. Assign all tokens with score > k_score + $\epsilon$ a new score of 1.
3. Assign all tokens with score < k_score − $\epsilon$ a new score of 0.
4. Assign all other tokens a new score of 0.5.
For an argument of length $m$, we set $k = m/8$. We let $\epsilon$ = .05 * max_att, where max_att is the maximum of the original attribution scores for the argument. We obtain these values through trial-and-error on training examples, with the subjective goal of achieving visuals that contain a meaningful number of "moderately important" and "very important" words while reflecting stratifications in the original attribution scores. We take the new score of a multi-token word to be the maximum new score over its subword tokens.
Table 10: Dev set results for various λ (evenly spaced by $2^{14}$ = 16384) in each of the data settings. λ = 0 indicates that our attribution prior was not applied. A single random seed was used. n/a indicates that the trial was not performed.
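The three-level mapping used for visualization in Appendix C can be sketched as follows. The function name is hypothetical, and rounding m/8 to an integer of at least 1 is an assumption, since the text gives only k = m/8.

```python
def bucket_attributions(scores, k=None, eps_frac=0.05):
    """Map raw attribution scores to 0, 0.5, or 1 for visualization."""
    m = len(scores)
    if k is None:
        k = max(1, round(m / 8))  # assumed rounding of k = m/8
    eps = eps_frac * max(scores)  # tolerance: .05 * max_att
    k_score = sorted(scores, reverse=True)[k - 1]  # score of the k-th ranked item
    buckets = []
    for s in scores:
        if s > k_score + eps:
            buckets.append(1.0)   # selected and very important
        elif s < k_score - eps:
            buckets.append(0.0)   # unselected
        else:
            buckets.append(0.5)   # selected but only moderately important
    return buckets
```

The score for a multi-token word would then be the maximum of the bucketed scores of its subword tokens.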

D Choosing λ
See Table 10.
F Misc.
Our model consists of 109,917,780 parameters. Training on the full train set using our proposed attribution prior takes 11 hours and 16 minutes using two Tesla T4 GPUs.
Table 12: Variance across trials for $\lambda L_p$ reported in Table 9, multiplied by $10^5$.