Enjoy the Salience: Towards Better Transformer-based Faithful Explanations with Word Salience

Pretrained transformer-based models such as BERT have demonstrated state-of-the-art predictive performance when adapted to a range of natural language processing tasks. An open problem is how to improve the faithfulness of explanations (rationales) for the predictions of these models. In this paper, we hypothesize that salient information extracted a priori from the training data can complement the task-specific information learned by the model during fine-tuning on a downstream task. In this way, we aim to help BERT not to forget to assign importance to informative input tokens when making predictions, by proposing SaLoss, an auxiliary loss function for guiding the multi-head attention mechanism during training to be close to salient information extracted a priori using TextRank. Experiments on explanation faithfulness across five datasets show that models trained with SaLoss consistently provide more faithful explanations across four different feature attribution methods compared to vanilla BERT. Using the rationales extracted from vanilla BERT and SaLoss models to train inherently faithful classifiers, we further show that the latter result in higher predictive performance in downstream tasks.


Introduction
Pretrained transformer-based (Vaswani et al., 2017) language models (LMs) such as BERT (Devlin et al., 2019) have achieved state-of-the-art results in various language understanding tasks (Wang et al., 2019a,b). Despite their success, their highly complex nature, consisting of millions of parameters, makes them difficult to interpret (Jain et al., 2020). This has motivated new research on understanding and explaining their predictions.
Previous work has explored whether LMs encode syntactic knowledge by studying their multi-head attention distributions (Clark et al., 2019; Htut et al., 2019; Voita et al., 2019). Recent studies have evaluated the faithfulness of explanations for predictions made by these models (Vashishth et al., 2019; Atanasova et al., 2020; Jain et al., 2020). In general, LMs can provide faithful explanations, particularly using attention (Jain et al., 2020), but still fall behind other simpler architectures (Atanasova et al., 2020), possibly due to increased information mixing and higher contextualization in the model (Brunner et al., 2020; Pascual et al., 2021; Tutek and Snajder, 2020). Recent studies have attempted to improve the explainability of non-transformer-based models by guiding them through an auxiliary objective towards informative input importance distributions (e.g. human or adversarial priors) (Ross et al., 2017a; Liu and Avci, 2019; Moradi et al., 2021).
In a similar direction, we propose Salient Loss (SALOSS), an auxiliary objective that allows the multi-head attention of the model to learn from salient information (i.e. token importance) during training, reducing the effects of information mixing (Pascual et al., 2021). We compute a priori token importance scores (Xu et al., 2020) using TEXTRANK (Mihalcea and Tarau, 2004), an unsupervised graph-based method, and penalize the model when the attention distribution deviates from the salience distribution. Our contributions are as follows:
• We demonstrate that models trained with SALOSS generate more faithful explanations in an input erasure evaluation.
• We finally show that rationales extracted from SALOSS models result in higher predictive performance in downstream tasks when used as the only input for training inherently faithful classifiers.

Related Work
Model Explainability  Explanations can be obtained by computing importance scores for input tokens to identify which parts of the input contributed the most towards a model's prediction (i.e. feature attribution). A common approach to attributing input importance is to measure differences in a model's prediction between keeping and omitting an input token (Robnik-Šikonja and Kononenko, 2008; Li et al., 2016b; Nguyen, 2018a). Input importance can also be obtained by calculating the gradients of a prediction with respect to the input (Kindermans et al., 2016; Li et al., 2016a; Sundararajan et al., 2017; Bastings and Filippova, 2020). We can also use sparse linear meta-models that are easier to interpret (Ribeiro et al., 2016; Lundberg and Lee, 2017). Finally, recent studies propose using feature attribution to extract a fraction of the input as a rationale and then use it to train a classifier (Jain et al., 2020; Treviso and Martins, 2020).

Faithfulness of Pretrained LM Explanations  Brunner et al. (2020) criticize the ability of attention to provide faithful explanations for the inner workings of a LM, by showing that constructed adversarial attention maps do not significantly impact predictive performance. Pruthi et al. (2020) show similar outcomes by manipulating attention to attend to uninformative tokens. Pascual et al. (2021) and Brunner et al. (2020) argue that this might be due to significant information mixing in higher layers of the model, with recent studies showing improvements in the faithfulness of attention-based explanations by addressing this (Chrysostomou and Aletras, 2021; Tutek and Snajder, 2020). Atanasova et al. (2020) evaluate the faithfulness of explanations (Jacovi and Goldberg, 2020) by removing important tokens and observing differences in prediction, showing that gradient-based approaches for transformers generally produce more faithful explanations than sparse meta-models (Ribeiro et al., 2016). However, transformer-based explanations are less faithful than those of simpler models due to their highly parameterized architecture. Atanasova et al. (2020) also show that explanation faithfulness does not correlate with how plausible an explanation is (i.e. understandable by humans), corroborating arguments made by Jacovi and Goldberg (2020). Jain et al. (2020) show that attention-based feature attributions, in general, outperform gradient-based ones.

A different branch of studies introduced adversarial auxiliary objectives to influence attention-based explanations during training (Kennedy et al., 2020; Wiegreffe and Pinter, 2019; Ross et al., 2017b; Liu and Avci, 2019). These objectives have typically been used as a tool for evaluating the faithfulness of explanations generated by attention (Kennedy et al., 2020; Wiegreffe and Pinter, 2019; Pruthi et al., 2020; Ghorbani et al., 2019), while others have used auxiliary objectives to improve the faithfulness of explanations generated by non-transformer-based models (Ross et al., 2017b; Liu and Avci, 2019; Moradi et al., 2021; Mohankumar et al., 2020; Tutek and Snajder, 2020). These auxiliary objectives guide the model using human-annotated importance scores (Liu and Avci, 2019), or allow for selective input gradient penalization (Ross et al., 2017b). Such studies illustrate the effectiveness of auxiliary objectives for improving the faithfulness of model explanations, suggesting that we can also improve explanation faithfulness in transformers using appropriate prior information.

Improving Explanation Faithfulness with Word Salience
Even though attention scores are more faithful than other feature attribution approaches (Jain et al., 2020), they usually pertain to their corresponding input tokens in context rather than individually, due to information mixing (Tutek and Snajder, 2020; Pascual et al., 2021). As such, we hypothesize that we can improve the ability of a pretrained LM to provide faithful explanations by showing the model alternative distributions of input importance (i.e. word salience). We assume that by introducing the salience distribution via an auxiliary objective (Ross et al., 2017b), we can reduce information mixing by "shifting" the model's attention to other informative tokens. In a similar direction to ours, Xu et al. (2020) showed that computing attention together with salience information from keyword extractors improves text summarization.
Computing Word Salience  We compute word salience σ using TEXTRANK (Mihalcea and Tarau, 2004), an unsupervised graph-based model for keyword extraction. TEXTRANK calculates the in-degree centrality of graph nodes iteratively based on a Markov chain, where each node is a wordpiece and each edge links wordpiece pairs within a context window (Xu et al., 2020). For each input document X, we construct an undirected graph and apply TEXTRANK to compute the local salience scores (σ_i) of its words by:

$$\sigma_i = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{\sigma_j}{|Out(V_j)|}$$

where d is the damping coefficient, and In(V_i) and Out(V_j) are the incoming and outgoing nodes, respectively. Our intuition is that by using the task-agnostic TEXTRANK, we can extract words that are important in the context of the sequence and as such offer an alternative view of token importance.

Salience Loss  We propose Salient Loss (SALOSS), an auxiliary objective which allows the model to learn to attend to more informative input tokens jointly with the task. We assume a standard text classification setting where a set of labeled documents is used for fine-tuning a pretrained LM by adding an extra output classification layer. SALOSS penalizes the model when the attention distribution (α) deviates from the word salience distribution (σ). For α we compute the average attention scores of the CLS token from the last layer (Jain et al., 2020), and we normalize the salience scores for compatibility with the KL divergence. The joint objective for adapting a LM to a downstream classification task with SALOSS is:

$$\mathcal{L} = \mathcal{L}_c + \lambda \mathcal{L}_{sal}$$

where L_c is the Cross-Entropy loss for the downstream text classification task and λ is a regularization coefficient for the proposed SALOSS (L_sal), which can be tuned on a development set. L_sal is defined as the KL divergence between α and σ:

$$\mathcal{L}_{sal} = D_{KL}(\alpha \parallel \sigma) = \sum_i \alpha_i \log \frac{\alpha_i}{\sigma_i}$$
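To make the objective concrete, below is a minimal PyTorch sketch of the joint loss under the assumptions stated above (α as the average last-layer CLS attention, σ as normalized TEXTRANK scores). Function and argument names are illustrative rather than taken from the authors' implementation, and the KL direction follows the description in the text.

```python
import torch.nn.functional as F

def saloss_objective(logits, labels, cls_attention, salience, lam=1.0):
    """Sketch of the joint objective L = L_c + lambda * L_sal.

    logits:        [batch, num_classes] classifier outputs
    labels:        [batch] gold class indices
    cls_attention: [batch, seq_len] average last-layer attention from the CLS token (alpha)
    salience:      [batch, seq_len] a priori TextRank scores (sigma)
    lam:           regularization coefficient lambda, tuned on the development set
    """
    # Task loss: standard cross-entropy for the downstream classification task.
    l_c = F.cross_entropy(logits, labels)

    # Normalize both distributions so they each sum to one over the sequence.
    alpha = cls_attention / cls_attention.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    sigma = salience / salience.sum(dim=-1, keepdim=True).clamp_min(1e-12)

    # Salience loss: KL(alpha || sigma). F.kl_div(input, target) computes
    # sum(target * (log(target) - input)) with `input` given as log-probabilities,
    # so passing input=log(sigma) and target=alpha yields KL(alpha || sigma).
    l_sal = F.kl_div(sigma.clamp_min(1e-12).log(), alpha, reduction="batchmean")

    return l_c + lam * l_sal
```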

Experimental Setup
Datasets  We consider five natural language understanding tasks (see dataset statistics in Appx. A).

Evaluating Explanation Faithfulness  We evaluate the faithfulness of model explanations using two standard approaches:
• Input Erasure: We first compute the average fraction of tokens required to be removed (in decreasing importance) to cause a change in prediction (decision flip) (Serrano and Smith, 2019; Nguyen, 2018b).
• FRESH: We also compute the predictive performance of a classifier trained on rationales extracted with feature attribution metrics (see §4) using FRESH (Jain et al., 2020). We extract rationales by: (1) selecting the top-k most important tokens (TOPK); and (2) selecting the span of length k that has the highest overall importance (CONTIGUOUS).

Feature Attribution Approaches
We opt for using the following popular metrics to allocate importance to input tokens: (1) Normalized attention scores (α); (2) Attention scores scaled by their gradient (α∇α); among others (see Appx. E).

Predictive Performance  We compare vanilla pretrained LMs (BASELINE) with models trained with our proposed objective SALOSS. Results demonstrate that models trained with our proposed salience objective achieve similar predictive performance to the BASELINE models across datasets.
Input Erasure  Table 2 shows the average fraction of input tokens required to be removed to cause a decision flip for BASELINE and SALOSS models on the test set. Results suggest that models trained with our proposed objective require a significantly lower fraction of tokens to be removed to cause a decision flip in 19 out of 20 cases (Wilcoxon Rank Sum, p < .05), with the exception of AG with α. This demonstrates that SALOSS obtains more faithful explanations in the majority of cases (Jacovi and Goldberg, 2020). For example, in EV.INF. the BASELINE approach with α requires a fraction of .25 of the tokens on average to observe a decision flip, compared to .14 with SALOSS (approximately 40 fewer tokens). We also observe that in M.RC., where α is not the most effective feature attribution method with BASELINE, it becomes the most effective with SALOSS. In fact, α is the best performing feature attribution approach across most tasks and metrics using SALOSS, indicating the effectiveness of infusing salient information.
We also performed an analysis of the differences in Part-of-Speech (PoS) tags of the rationales selected by SALOSS and BASELINE, to obtain insights into why rationales with SALOSS are more faithful than those from models trained without our proposed objective. In SST, we observe that SALOSS allocates more importance to adverbs and adjectives, which are considered important in sentiment analysis (Dragut and Fellbaum, 2014; Sharma et al., 2015). In EV.INF., we observe that SALOSS allocates importance to subordinating conjunction words such as than, which are indeed important for the task, which consists of inferring relationships (i.e. higher than). We thus hypothesize that SALOSS guides the model to other informative tokens, complementing the task-specific information learned by the model.

Table 3: F1 macro on models trained with extracted rationales (TOPK and α) using FRESH for BASELINE and SALOSS models. Bold denotes best performance in each dataset. † indicates that SALOSS rationales perform significantly better (t-test, p < .05).

Rationale Extraction
We finally compare our SALOSS models with vanilla LMs (BASELINE) on rationale extraction using FRESH (Jain et al., 2020), by measuring the predictive performance of a classifier trained on the extracted rationales. For completeness, we also include an uninformative baseline for SALOSS, which comprises a uniform distribution over the input (i.e. all inputs are assigned the same salience score). For brevity, Table 3 presents results using the best performing metric from the erasure experiments (α) with TOPK. Our approach significantly outperforms BASELINE in 2 out of 5 datasets (t-test, p < 0.05), whilst achieving comparable predictive performance on the rest. For example, in SST we observe a 3% increase in F1 using the same ratio of rationales. It is notable that in M.RC., AG and EV.INF., the performance of classifiers trained on rationales from both BASELINE and SALOSS is comparable to that with full text (1-2% lower). We assume that this is due to the nature of these tasks, which likely do not require a large part of the input to reach high performance. This highlights the effectiveness of our approach as a simple yet effective solution for improving explanation faithfulness.

Example 1
Dataset: AG   Id: test_239
[BASELINE]: NEW YORK ( Reuters ) - Shares of Google Inc. will make their Nasdaq stock market debut on Thursday after the year 's most anticipated initial public offering priced far below initial estimates , raising $1.67 billion .
[SALOSS (Ours)]: NEW YORK ( Reuters ) - Shares of Google Inc. will make their Nasdaq stock market debut on Thursday after the year 's most anticipated initial public offering priced far below initial estimates , raising $1.67 billion .

Table 4: True examples of extracted rationales from models using our proposed approach (SALOSS) and from models that do not (BASELINE).

Qualitative Analysis
In Table 4 we present examples of extracted rationales from a model trained with our proposed objective (SALOSS) and one without (BASELINE), using α∇α, to gain further insights that complement the PoS analysis. For clarity, we present rationales of the CONTIGUOUS type.
In AG we observed similar performance between models trained with SALOSS and without. Example 1 illustrates such a case, where both models predicted correctly but attended to different parts of the input. Despite being in different locations, both segments are closely associated with the label "Business". Example 2 is an instance from the SST dataset, where the SALOSS rationale points to a phrase that is more associated with the task ("a promising unusual") compared to the BASELINE. This also aligns with previous observations from the PoS analysis, that models trained with our proposed objective attend to more adjectives compared to BASELINE. Example 3 considers an instance from the EV.INF. dataset, which shows that the models trained with SALOSS and BASELINE attended to two different sections. In fact, what we observed, in agreement with the PoS analysis, is that models with SALOSS attend mostly to segments including words related to relationships, such as "significantly attenuated" in this particular example.

Conclusion
We introduced Salient Loss (SALOSS), an auxiliary objective that incorporates salient information into attention to improve the faithfulness of explanations for transformer-based predictions. We demonstrated that our approach provides more faithful explanations compared to vanilla LMs on input erasure and rationale extraction. In the future, we plan to explore additional objectives to better optimize for the contiguity of rationales.

A Datasets
For our experiments we use the following tasks (see dataset details in Table 5).

B TextRank Training
We run TEXTRANK for 10 steps, or until convergence, with a window of 4 words and a damping coefficient of 0.85, and normalize the salience scores to make them more compatible with attention distributions.
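As an illustration of this procedure, below is a minimal sketch of the local salience computation under the hyper-parameters above (window of 4, damping coefficient of 0.85, 10 iterations). The graph construction details and the final normalization are assumptions for illustration, not the authors' exact implementation.

```python
from collections import defaultdict

def textrank_salience(tokens, window=4, damping=0.85, steps=10):
    """Compute local TextRank salience scores for one tokenized document (sketch)."""
    # Build an undirected co-occurrence graph: nodes are token types and
    # edges link tokens appearing within `window` positions of each other.
    neighbours = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[j] != w:
                neighbours[w].add(tokens[j])
                neighbours[tokens[j]].add(w)

    # Iterative in-degree centrality (PageRank-style) for a fixed number of steps.
    scores = {w: 1.0 for w in neighbours}
    for _ in range(steps):
        scores = {
            w: (1.0 - damping)
            + damping * sum(scores[v] / len(neighbours[v]) for v in neighbours[w])
            for w in neighbours
        }

    # Map type-level scores back to token positions and normalize so that the
    # salience distribution sums to one (for compatibility with attention).
    sigma = [scores.get(w, 0.0) for w in tokens]
    total = sum(sigma) or 1.0
    return [s / total for s in sigma]
```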

C Model Hyper-Parameters
Table 6: Models and their hyper-parameters for each dataset, including the learning rate for the model (lr_m) and the classifier layer (lr_c), and F1 macro scores on the development set across three runs.

Each experiment is run on a single Nvidia Tesla V100 GPU.
We found that the regularization coefficient λ of our proposed objective does not significantly impact F1 macro performance. As such, since our objective is to improve faithfulness, we select λ by training models and then evaluating, on the development set, the average fraction of tokens required to cause a decision flip. We use the model with the lowest average fraction of tokens and report results on the test set.

D Further Details on Evaluating Faithfulness
Erasure (Serrano and Smith, 2019; Nguyen, 2018b): Jacovi and Goldberg (2020) propose that an appropriate measure of the faithfulness of an explanation can be obtained through input erasure (the most relevant parts of the input, according to the explanation, are removed). We therefore record the average fraction of tokens required to be removed across instances to cause a decision flip. Removal is conducted in descending token importance order at every 5% of the length of the sequence, as searching at every token is computationally expensive (Atanasova et al., 2020). Note that we conduct all experiments at the input level (i.e. by removing the token from the input sequence instead of only removing its corresponding attention weight), as we consider the scores from importance metrics to pertain to the corresponding input token, following related work (Arras et al., 2016, 2017; Nguyen, 2018a; Vashishth et al., 2019; Grimsley et al., 2020).
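A minimal sketch of this measurement is given below; `predict` is a placeholder for the fine-tuned model's prediction function and is an assumption for illustration.

```python
import math

def decision_flip_fraction(tokens, importance, predict):
    """Fraction of tokens removed (in descending importance) needed to flip the prediction (sketch).

    tokens:     list of input tokens
    importance: per-token importance scores from a feature attribution approach
    predict:    callable mapping a token list to a predicted label
    """
    original = predict(tokens)
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    step = max(1, math.ceil(0.05 * len(tokens)))  # search at every 5% of the sequence length

    for k in range(step, len(tokens) + 1, step):
        removed = set(order[:k])
        reduced = [t for i, t in enumerate(tokens) if i not in removed]
        if predict(reduced) != original:
            return k / len(tokens)
    return 1.0  # no decision flip observed

# The reported score is this fraction averaged over all test instances.
```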
FRESH (Jain et al., 2020): A pipeline composed of a support model, an extractor and a classifier: the support model is trained on the full text and allocates importance to tokens; the extractor extracts rationales according to the importance scores from the support model; and the classifier is trained only on the extracted rationales. The higher the classifier's predictive performance, the more faithful the rationales produced by the support model. Similar to Jain et al. (2020), for FRESH we extract rationales of a fixed ratio relative to the sequence length using two thresholder approaches (THRESH.), sketched after this list:
• TOPK: The top-k tokens as indicated by the corresponding importance metric, treating each word independently.
• CONTIGUOUS: The span of length k that results in the highest overall score as indicated by the importance metric.
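A minimal sketch of the two thresholders, assuming per-token importance scores and a rationale length k implied by the fixed ratio:

```python
def topk_rationale(tokens, importance, k):
    """TOPK: keep the k most important tokens, treating each token independently (sketch)."""
    # Rank indices by descending importance, keep the first k, restore original order.
    keep = sorted(sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)[:k])
    return [tokens[i] for i in keep]

def contiguous_rationale(tokens, importance, k):
    """CONTIGUOUS: keep the span of length k with the highest total importance (sketch)."""
    # Assumes 0 < k <= len(tokens).
    best_start = max(range(len(tokens) - k + 1), key=lambda s: sum(importance[s:s + k]))
    return tokens[best_start:best_start + k]
```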

E Further Details on Feature Attribution Approaches
• α: Importance rank corresponding to normalized attention scores.
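For the α∇α scores used in the main text (attention scaled by its gradient), a hedged sketch is given below. It assumes a model interface that returns the last-layer attention tensor as part of the computation graph; this interface and the head-averaging step are assumptions for illustration, not the authors' implementation.

```python
def attention_times_gradient(model, inputs, target_class):
    """Sketch of alpha * grad(alpha): scale attention by its gradient w.r.t. the target logit.

    Assumes `model(**inputs)` returns (logits, attention), where `attention` is the
    last-layer attention tensor of shape [batch, heads, seq, seq] (PyTorch) and is
    part of the computation graph leading to `logits`.
    """
    logits, attention = model(**inputs)
    attention.retain_grad()                    # keep gradients for this non-leaf tensor
    logits[:, target_class].sum().backward()   # backpropagate from the target class logit
    scaled = attention * attention.grad        # element-wise alpha * d(logit)/d(alpha)
    # Per-token importance: the CLS row (position 0), averaged over attention heads.
    return scaled[:, :, 0, :].mean(dim=1).detach()
```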

F Further Results on FRESH
In Table 7 we present results using CONTIGUOUS rationales. We can first observe that models trained on contiguous rationales extracted from models trained with SALOSS obtain comparable performance to models without (BASE.). Additionally, results show that classifier performance does not reach that obtained with TOPK rationales. We can therefore assume that TOPK rationales result in inherently faithful classifiers with higher performance. It is encouraging to notice that in the datasets where performance is comparable with our approach (AG, EV.INF., M.RC.), this is likely due to reaching close to FULL-TEXT performance. For example, classifier performance trained on CONTIGUOUS rationales from BASE. in SST is at .82 compared to .83 with SALOSS rationales. Note that Serrano and Smith (2019) show that gradient-based attention ranking metrics (α∇α) are better at providing faithful explanations compared to just using attention (α).

Table 7: F1 macro on models trained with extracted rationales (CONTIGUOUS and α) using FRESH for BASELINE and SALOSS models. Bold denotes best performance in each dataset. † indicates that SALOSS rationales perform significantly better (t-test, p < 0.05).
Results also suggest that our uninformative baseline (UNIF.) reduces the faithfulness of rationales in most cases, resulting in lower classifier performance. We hypothesize that in the cases where its performance is comparable with BASE. and SALOSS, this is due to the task being relatively easy, such that the loss function does not impact the faithfulness of rationales. We consider this an interesting area for future work.

G PoS Importance Allocation
We also conduct an analysis whereby we record the average importance scores under each Part-of-Speech (PoS) tag. We run a pretrained PoS tagger from spaCy (Honnibal et al., 2020) across the text and compute the average importance, calculated with a feature attribution approach, for each PoS tag. We therefore aim to observe differences in the allocation of importance to linguistic features between models trained without our proposed approach (BASE.) and with it (SALOSS). In Figure 1 we present the distribution of importance (calculated with α∇α) across PoS tags, on three datasets (SST, AG and EV.INF.).
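A minimal sketch of this aggregation is shown below, assuming per-token importance scores aligned with spaCy's tokenization (the models themselves operate over wordpieces, so in practice an alignment step between wordpieces and spaCy tokens would be needed):

```python
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English PoS tagger

def importance_per_pos(texts, importances):
    """Average feature-attribution importance per coarse PoS tag (sketch).

    texts:       list of raw input strings
    importances: per-instance lists of token importance scores, aligned with spaCy tokens
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for text, scores in zip(texts, importances):
        for token, score in zip(nlp(text), scores):
            sums[token.pos_] += score
            counts[token.pos_] += 1
    return {pos: sums[pos] / counts[pos] for pos in sums}
```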
Observing Figure 1a, we can see that α∇α with SALOSS places greater importance on proper nouns (PROPN), auxiliary words (AUX), pronouns (PRON) and interjections (INTJ). In comparison, the most prominent tags with BASE. are INTJ, PROPN, coordinating conjunctions (CCONJ) and nouns (NOUN). In a sentiment analysis task, it is notable that both BASE. and SALOSS place high importance on average on interjections, which typically express feelings or emotions. Both appear to highlight adjectives particularly well, which we consider more important for sentiment analysis as they name attributes of other words. On the other hand, we also observe that SALOSS places lower importance on average on CCONJ and punctuation (PUNCT) compared to BASE. This suggests that for SST, SALOSS models possibly shift their importance to word groups that are more informative for the task.
Moving on to Figure 1b, we observe a very high peak on proper nouns (PROPN) and unidentified tokens (X) with SALOSS compared to BASE. In a news classification task, proper nouns such as NATO and other organization or city names can indicate the topic of a sequence. Given that SALOSS places such great importance on proper nouns, we assume that our approach manages to shift the model's attention to tokens that are more informative for the task. However, we also observe unidentified symbols receiving large average importance scores with SALOSS. Whilst we do not study plausibility (the human understandability of explanations), we consider this a limitation and regard exploring and addressing it as an interesting direction for future work.
Finally, examining Figure 1c, we observe that both SALOSS and BASE place very high importance on particle (PART) words such as not. We consider this encouraging, as a large part of the task is to infer whether there was a significant difference or not based on an observation in the text. Additionally, we observe that SALOSS attends highly to subordinating conjunction (SCONJ) words such as than, which, if placed in the context of "significantly higher than", directly relates to our task. With SALOSS we also observe a reduction in attention to pronouns (PRON) compared to BASE, which we consider encouraging as PRON words are not directly related to the task of inferring relationships. This indicates that our proposed objective manages to guide the model's attention away from uninformative tokens such as others and punctuation, and towards token types that are more informative for the task (SCONJ, CCONJ).

H Comparing Salience Functions

Table 8 presents the average fraction of tokens required to cause a prediction switch (decision flip) when training models with SALOSS and (1) TEXTRANK; (2) CHISQUARED; (3) TFIDF. We observe that when models are regularized with TEXTRANK scores, the feature attribution approaches result in a lower average fraction of tokens to cause a prediction switch compared to the other two salience functions. We also observe that TFIDF is comparable with TEXTRANK in most cases, outperforming CHISQUARED. We hypothesize that TFIDF performs worse than TEXTRANK due to the way these two approaches compute their importance scores: the former computes them globally, whilst the latter computes them locally (at instance level), which we assume is more beneficial for explanation faithfulness.