Using Interpretation Methods for Model Enhancement

In the age of neural natural language processing, there have been many works trying to derive interpretations of neural models. Intuitively, when gold rationales are available during training, one can additionally train the model to match its interpretation with the rationales. However, this intuitive idea has not been fully explored. In this paper, we propose a framework of utilizing interpretation methods and gold rationales to enhance models. Our framework is very general in that it can incorporate various interpretation methods. Previously proposed gradient-based methods can be shown to be instances of our framework. We also propose two novel instances utilizing two other types of interpretation methods, erasure/replace-based and extractor-based methods, for model enhancement. We conduct comprehensive experiments on a variety of tasks. Experimental results show that our framework is effective in enhancing models with various interpretation methods, especially in low-resource settings, and that our two newly proposed methods outperform gradient-based methods in most settings. Code is available at https://github.com/Chord-Chen-30/UIMER.


Introduction
Deep neural networks have been extensively used to solve Natural Language Processing (NLP) tasks and reach state-of-the-art performance. Due to the black-box nature of neural models, there are many studies on how to interpret model decisions by assigning attribution scores to input tokens, i.e., how much each token in an input contributes to the final prediction. We can roughly group these interpretation methods into four categories, namely gradient-based (Ross et al., 2017; Smilkov et al., 2017), attention-based (Vashishth et al., 2019; Serrano and Smith, 2019), erasure/replace-based (Prabhakaran et al., 2019; Kim et al., 2020) and extractor-based (De Cao et al., 2020; Chan et al., 2022) methods. In some scenarios, we have access to gold rationales (input tokens critical for predicting correct outputs) during training, or have simple and fast approaches to obtaining them. In that case, it is intuitively appealing to additionally train the model such that its most attributed tokens match the gold rationales (see an example in Fig. 2). In other words, when equipped with interpretation methods, we can train the model on where to look in the input, in addition to the standard output-matching objective. This can be seen as injecting external knowledge embodied in rationales into the model and is especially beneficial in low-resource scenarios with little training data. This intuitive idea, however, has not been fully explored. There are only a few previous studies, all based on the gradient-based category of interpretation methods (Huang et al., 2021; Ghaeini et al., 2019; Liu and Avci, 2019), and they neither compare the utilization of different interpretation methods for model enhancement nor experiment on a comprehensive range of tasks.
In this paper, we first propose a framework named UIMER that Utilizes Interpretation Methods and gold Rationales to improve model performance, as illustrated in Fig. 1. Specifically, in addition to a task-specific loss, we add a new loss that aligns interpretations derived from interpretation methods with gold rationales. We also discuss how the optimization of the task-specific loss and the new loss should be coordinated. We show that previous methods utilizing gradient-based interpretation for model enhancement (Huang et al., 2021; Ghaeini et al., 2019) can be seen as instances of our framework.
We then propose two novel instances of our framework based on erasure/replace-based and extractor-based interpretation methods respectively. Specifically, in the first instance, we utilize Input Marginalization (Kim et al., 2020) as the interpretation method, which computes attribution scores by replacing tokens with a variety of strategies and measuring the impact on outputs; we design a contrastive loss over the computed attribution scores of rationales and non-rationales. In the second instance, we utilize a Rationale Extractor (De Cao et al., 2020) as the interpretation method, a neural model that is independent of the task model and trained with its own loss. We again design a contrastive loss over the attribution scores computed by the extractor and, in addition, design a new training process that alternately optimizes the task model (using the task-specific loss and our contrastive loss) and the extractor (using its own loss).
Our main contributions can be summarized as follows: (1) We propose a framework that can utilize various interpretation methods to enhance models. (2) We utilize two novel types of interpretation methods to enhance models under our framework. (3) We comprehensively evaluate our framework on diverse tasks including classification, slot filling and natural language inference. Experiments show that our framework is effective in enhancing models with various interpretation methods, especially in the low-resource setting.

Interpretation Methods
Interpretation methods aim to decipher the black box of deep neural networks and have been well studied recently. Our framework aims to utilize these various interpretation methods for model enhancement. In these interpretation methods, attribution scores are calculated to indicate the importance of input tokens. According to how attribution scores are calculated, we can generally group interpretation methods into the following four categories.
Gradient-based methods Gradient-based interpretation methods are popular and intuitive. Li et al. (2016) calculates the absolute value of the gradient w.r.t. each input token and interprets it as the sensitivity of the final decision to that input. Following that, Ghaeini et al. (2019) extends the calculation of gradients to intermediate layers of deep models. Sundararajan et al. (2017) proposes an improved method that calculates Integrated Gradients as explanations of inputs.

Attention-based methods
The attention mechanism calculates a distribution over input tokens, and some previous works use the attention weights as interpretations derived from the model (Wang et al., 2016; Ghaeini et al., 2018). However, there is no consensus on whether attention is interpretable. Jain and Wallace (2019) alters attention weights and finds no significant impact on predictions, while Vig and Belinkov (2019) finds that attention aligns most strongly with dependency relations in the middle layers of GPT-2 and is thus interpretable.
Erasure/replace-based methods The general idea is straightforward: erase or replace some words in a sentence and observe how the model's prediction changes. Li et al. (2016) proposes a method to analyze and interpret decisions of a neural model by observing the effects of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words. Kim et al. (2020) gives a new interpretation method that avoids the out-of-distribution (OOD) problem by replacing words in inputs, reaching better interpretability than traditional erasure-based methods.
Extractor-based methods An extractor-based method typically uses an extra model to extract the words that the task model pays attention to. De Cao et al. (2020) introduces Differentiable Masking, which learns to mask out subsets of the inputs while maintaining differentiability; the masking decision is made by a simple model based on intermediate hidden layers and word embedding layers of a trained model. Chen and Ji (2020) proposes the variational word mask method, which learns to restrict the flow of globally irrelevant or noisy word-level features to subsequent network layers.

Utilizing Interpretation Methods to Enhance Models
Some previous work studies whether interpretation methods can be utilized to enhance models. Du et al. (2019)

Method
We propose a framework, UIMER, that utilizes an interpretation method together with gold rationales on the training data to enhance models.
Setup Consider a training example with input x and gold output y. We also have access to gold rationales g for input x, indicating the subset of tokens that are critical for predicting the correct output y. The gold rationales g can be annotated by humans or generated automatically from external knowledge sources.
In our setup, g is encoded as a 0/1 vector, with g_i = 1 indicating that x_i is a gold rationale and g_i = 0 otherwise. Our method, however, can easily be extended to handle g encoded with real numbers indicating the importance of each token.
Given a model tasked with predicting output y from input x, an interpretation method produces attribution scores a for input x. The scores a can be defined at different levels of granularity. In the common case, a_i and x_i are in one-to-one correspondence, and a_i is a measure calculated by the interpretation method indicating how important token x_i is to the model in producing its output. For example, in gradient-based interpretation methods, a_i is usually some function of the gradient at x_i; a higher gradient magnitude implies higher importance of x_i.
Learning Objective Apart from the original task-specific learning objective L_task, our framework introduces an extra learning objective L_int that embodies the idea of aligning the attribution scores a with the gold rationales g. The overall objective on one example x takes the form

L_θ = L_task + α L_int, (1)

where θ denotes the model parameters and α is a coefficient balancing the two objectives.
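As a minimal sketch of this combined objective (all function names here are hypothetical, not taken from the authors' code):

```python
def overall_loss(task_loss, attributions, gold_rationales, alpha, int_loss_fn):
    """Sketch of Eq. 1: L_theta = L_task + alpha * L_int.

    attributions: per-token scores a from an interpretation method.
    gold_rationales: 0/1 vector g marking rationale tokens.
    int_loss_fn: any alignment loss between a and g (instance-specific:
        gradient-, IM-, or extractor-based).
    """
    return task_loss + alpha * int_loss_fn(attributions, gold_rationales)
```

Any of the three instances below only has to supply its own `int_loss_fn`; the task loss and the balancing coefficient α are shared.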
Warm-up Training L_int can be seen as measuring whether the model pays attention to the gold rationale words in the input. We deem that, compared to teaching a randomly initialized model to focus on task-specific rationale words, teaching a model that already has task knowledge is more effective, because such a model can recognize task-specific rationales with the help of that knowledge rather than by rote memorization. Thus, instead of optimizing L_θ from the very beginning, our framework requires the model to be well, or at least halfway, trained before training on the objective L_int; during warm-up training, only the task objective is optimized. The empirical results (in Sec. 4.2) support this intuition and show that warm-up training is effective.
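The warm-up schedule can be sketched as a plain training loop (a hypothetical sketch; the actual implementation is in the authors' repository):

```python
def train_with_warmup(data, epochs, warmup_epochs, alpha,
                      task_loss_fn, int_loss_fn, step_fn):
    """During the first `warmup_epochs` only the task loss is optimized;
    afterwards the full objective L_task + alpha * L_int is used."""
    for epoch in range(epochs):
        for x, y, g in data:
            loss = task_loss_fn(x, y)
            if epoch >= warmup_epochs:       # warm-up finished
                loss = loss + alpha * int_loss_fn(x, g)
            step_fn(loss)                    # backprop + optimizer step
```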
In the following subsections, we introduce three instances of our framework. The first utilizes gradient-based interpretation methods and subsumes a few previous studies as special cases. The second and third are new methods we propose, utilizing erasure/replace-based and extractor-based interpretation methods respectively.

Utilizing Gradient-Based Methods
As introduced in Sec. 2.2, some previous studies utilize gradient-based (GB) interpretation methods to enhance models. They can be seen as instances of our framework and are hence denoted UIMER-GB.
In this type of method, the attribution score a is usually defined by a function f of the gradient of the input x:

a = f(∂J/∂x),

where J usually refers to the training objective or the task model's output.
In general, L_int is defined as a constraint on the gradients of gold rationales:

L_int = D(a, g),

where D is usually a distance function measuring how closely a approaches g.
In Ghaeini et al. (2019)'s work, f sums the gradients over the input embedding dimensions and D takes the sum of the gradients of rationale words. In Huang et al. (2021)'s work, f is the L1 norm that sums the absolute values of the gradients over the input embedding dimensions, and D is designed in various ways to give rationale words higher attribution scores.
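As an illustration, here is a simplified, hypothetical choice of f and D (not the exact losses of the cited works): f is the L1 norm over embedding dimensions, and D is small when rationale tokens carry most of the attribution mass.

```python
def gradient_alignment_loss(grads, gold, eps=1e-12):
    """Sketch of a UIMER-GB style L_int.

    grads: per-token gradient vectors w.r.t. the input embeddings.
    gold:  0/1 rationale vector g.
    """
    a = [sum(abs(v) for v in gi) for gi in grads]        # f: L1 norm
    total = sum(a) + eps
    rationale_mass = sum(ai for ai, g in zip(a, gold) if g == 1)
    return 1.0 - rationale_mass / total                  # D: mass off g
```

In practice the gradients would come from automatic differentiation (e.g. the framework's autograd), and D would be one of the cited designs.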

Utilizing Erasure/Replace-Based Methods
We incorporate an erasure/replace-based interpretation method, Input Marginalization (IM) (Kim et al., 2020), into our framework in this subsection and name this instance UIMER-IM. We first define the attribution score produced by IM and then introduce how L_int is defined and calculated. Other erasure/replace-based methods can be integrated into our framework in a similar way.

Attribution Score by Input Marginalization
Define p_θ(y|x) as the probability of the gold output that the model predicts. To calculate the attribution score of token x_i, a new set of sentences S needs to be generated, with the size of S being a hyperparameter. Denote x'_u as a new sentence with x_i replaced by some other token u, and q(x'_u) as the probability of replacing x_i with u to obtain x'_u, which can be determined by different strategies:

1. BERT: Replace x_i by [MASK] and use the masked language model probability of u.

2. Prior: q(x'_u) = count(u)/N, where count(u) is the number of times token u appears in the corpus and N is the corpus size.

3. Uniform: sample the replacement token u uniformly at random.

We follow one of these strategies to sample the set S based on q(x'_u), and define a_i as

a_i = log₂ odds(p_θ(y|x)) − log₂ odds( Σ_{x'_u ∈ S} q(x'_u) p_θ(y|x'_u) ),

where odds(p) = p/(1−p). For inputs with only one gold rationale word, computing and optimizing the attribution score is easy. However, there may be more than one rationale in general, and calculating the attribution score of each rationale token becomes impractical as the input length and the number of rationales grow. Therefore, we extend the token attribution score defined in the original IM method to a multi-token attribution score that can be computed more efficiently.
Formally, for input x with more than one rationale word, denote x'_R as a new sentence in which all rationale words are replaced, and x'_N as a new sentence in which the same number of non-rationale words are replaced. One x'_R is generated by replacing one rationale word at a time using the strategies mentioned before. We denote the score of replacing x with x'_R as q(x'_R), calculated as the average of the replacement probabilities of the rationale words under a given replacement strategy. Similarly, x'_N is generated and q(x'_N) is defined. We repeat this generating process and denote the set of generated x'_R (x'_N) as S_R (S_N), with the size of S_R (S_N) being a hyperparameter. Then the attribution score a_R for the entire set of rationales and a_N for the same number of non-rationales are defined as

a_R = log₂ odds(p_θ(y|x)) − log₂ odds( Σ_{x'_R ∈ S_R} q(x'_R) p_θ(y|x'_R) ),
a_N = log₂ odds(p_θ(y|x)) − log₂ odds( Σ_{x'_N ∈ S_N} q(x'_N) p_θ(y|x'_N) ).

For a given input x and gold rationale g, with a_R and a_N defined, we expect the attribution score of rationale words to be higher than that of non-rationale words. Thus, we design a contrastive margin loss

L_int = max(0, ϵ + a_N − a_R),

where ϵ is a positive hyperparameter controlling the margin. Here a_R and a_N are not defined w.r.t. each token; they are attribution scores for multiple tokens.
Note that when calculating a_N − a_R, the term log₂(odds(p_θ(y|x))) cancels out and does not need to be computed. We choose the margin loss instead of simply maximizing a_R and minimizing a_N because in many cases non-rationale words still provide useful information and we do not want to eliminate their influence on the model.
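A small, self-contained sketch of this computation (function names hypothetical); note how the shared log₂ odds(p_θ(y|x)) term never needs to be evaluated:

```python
import math

def log2_odds(p, eps=1e-12):
    """log2(p / (1 - p)), clipped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log2(p / (1.0 - p))

def im_margin_loss(p_y_given_xr, q_r, p_y_given_xn, q_n, epsilon):
    """Sketch of the UIMER-IM margin loss max(0, eps + a_N - a_R).

    p_y_given_xr / p_y_given_xn: model probabilities p(y|x') for sentences
        with rationales / non-rationales replaced (sets S_R / S_N).
    q_r / q_n: the corresponding (normalized) replacement scores.
    Because a_R and a_N share the log2 odds(p(y|x)) term,
    a_N - a_R reduces to log2_odds(m_r) - log2_odds(m_n).
    """
    m_r = sum(q * p for q, p in zip(q_r, p_y_given_xr))
    m_n = sum(q * p for q, p in zip(q_n, p_y_given_xn))
    return max(0.0, epsilon + log2_odds(m_r) - log2_odds(m_n))
```

When replacing the rationales destroys the prediction (small m_r) while replacing non-rationales does not (large m_n), the loss is zero, which is the desired behavior.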

Utilizing Extractor-Based Methods
In this section, we incorporate DiffMask (DM) (De Cao et al., 2020), an extractor-based interpretation method, into our framework and name this instance UIMER-DM. Other extractor-based interpretation methods can be integrated into UIMER in a similar way.

Attribution Score by the Extractor
In DM, the attribution scores a are produced by a simple extractor model Ext_ϕ parameterized by ϕ:

a = Ext_ϕ(Enc(x)),

where Enc(x) refers to the encoding of input x produced by an encoder model, and a is composed of real numbers in the range 0 to 1. Here a is defined one-to-one w.r.t. each token in x. The extractor is trained with its own objective L_DM^ϕ, composed of two parts:

1. A term that encourages masking the input (or hidden states) as much as possible.

2. A term that constrains the change in the task model's prediction after it is fed the masked input (or hidden states).

Definition of L int in UIMER-DM
With attribution scores defined in DM, we define a contrastive loss L_int as follows:

L_int = Σ_{i: g_i = 1} max(0, max_{j: g_j = 0} a_j − a_i). (7)

Intuitively, we encourage the attribution score of rationale words to be higher than that of non-rationale words. The loss is zero as long as the maximum attribution score among non-rationale words is lower than the attribution score of every rationale word.
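As a sketch consistent with the description above (the exact form of Eq. 7 may differ; averaging over rationales is an added normalization choice here):

```python
def dm_contrastive_loss(attributions, gold):
    """Penalize every rationale token whose extractor score falls below
    the highest non-rationale score; zero when all rationales outscore it.

    attributions: extractor outputs a, one real number in [0, 1] per token.
    gold: 0/1 rationale vector g.
    """
    non_rationale = [a for a, g in zip(attributions, gold) if g == 0]
    rationale = [a for a, g in zip(attributions, gold) if g == 1]
    if not non_rationale or not rationale:
        return 0.0
    hardest = max(non_rationale)
    return sum(max(0.0, hardest - a) for a in rationale) / len(rationale)
```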

Training in UIMER-DM
When training UIMER-DM, there are two objectives and two sets of parameters: L_θ (Eq. 1) for the model parameters θ and L_DM^ϕ for the extractor parameters ϕ. Intuitively, the two sets of parameters should not be optimized simultaneously, because our framework requires an accurate interpretation model (the extractor here). If the extractor were trained at the same time as the model, then, since the model keeps changing, there would be no guarantee that the extractor could keep pace with it; its interpretation might not match the latest model, breaking the requirement of our framework.
We adopt the following training schedule to circumvent the problem. First, we follow the warm-up strategy and train the model. After that, we alternate between two steps: (1) optimizing the extractor parameters ϕ w.r.t. L_DM^ϕ with the model parameters θ frozen; (2) optimizing the model parameters θ w.r.t. L_θ with the extractor parameters ϕ frozen. The number of rounds of alternation and the number of epochs in each round are hyperparameters.
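The alternating schedule can be sketched as follows (the `train_*` callbacks are hypothetical stand-ins for the per-phase optimization loops):

```python
def alternate_training(train_extractor, train_model, rounds,
                       extractor_epochs, model_epochs):
    """UIMER-DM schedule after warm-up: each round first updates the
    extractor phi w.r.t. L_DM (theta frozen), then the task model theta
    w.r.t. L_theta (phi frozen)."""
    for _ in range(rounds):
        train_extractor(extractor_epochs)   # optimize phi, theta frozen
        train_model(model_epochs)           # optimize theta, phi frozen
```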

Experimental Settings
Datasets To evaluate our framework, we experiment with all the methods introduced in the previous section on three tasks: Intent Classification (IC), Slot Filling (SF) and Natural Language Inference (NLI). The three tasks take the forms of single sentence classification, sequence labeling

Main Results
We present the main results for all settings in Table 2. First, from the Mean column, we see that the gradient-based methods (UIMER-GB; Ghaeini et al. (2019); Huang et al. (2021)) outperform the base model, and all variants of our proposed UIMER-IM and UIMER-DM methods outperform both UIMER-GB and the base model.
For UIMER-IM, its variants achieve the best performance in eight of the twelve settings and the second best in the rest, suggesting its general applicability. The "+Uniform" variant of UIMER-IM clearly outperforms the other variants in the 1/3-shot settings on Intent Classification; we analyze the potential reason in App. 8.2. UIMER-IM with the "BERT (warm.)" variant brings a 14.86% gain in the NLI 100-example setting.
For UIMER-DM, it achieves the best performance in the 1/3-shot settings and is competitive in the 10-shot setting on Slot Filling, which indicates that an extra extractor might be more capable of locating rationales than other interpretation methods for structured prediction problems. Applying multi-round training to UIMER-DM clearly improves performance in almost all settings of all three tasks. We conduct an ablation study on the effect of multi-round training of UIMER-DM in Sec. 5.2.
From the results, it is also evident that in general, the performance gap between any instance method of our framework and the base model becomes larger with less data, implying the usefulness of rationales when training data cannot provide sufficient training signals for model enhancement.

Warm-up Training
We conduct an ablation study on the effect of warm-up training of our method UIMER-IM in the 1-

Multi-Round Training
In our method UIMER-DM, we propose an asynchronous training process that trains the model and the extractor alternately for multiple rounds.
We study the effectiveness of the multi-round training process on 10-shot Slot Filling; the results are shown in Fig. 4. We observe that by iteratively and alternately training the extractor and the model, performance on both the development and test sets shows an upward trend with more rounds. Note that we use fewer training epochs (around 10 for the extractor and 20 for the

Case Study
To show that our framework both enhances task performance and yields better interpretations, we conduct a case study in the 1-shot Intent Classification setting. From the first two examples in Table 3, we can see that the base model can neither predict the correct intent label of the sentence nor produce good interpretations (the attribution scores of non-rationales are higher than those of rationales); in comparison, our UIMER-IM fixes both problems. In the last two examples, we show that UIMER-DM succeeds in lowering the attribution scores of irrelevant parts of the input and producing high scores for some or all of the rationales. It can also be seen that the extractor trained on the base model in the 1-shot setting views most of the input as important, while the extractor in UIMER-DM is much more parsimonious and precise.

Relation Between Attribution Score and Performance
In this section, we study how the model performs on the test set when it succeeds or fails to give rationale words higher attribution scores. We conduct experiments on 1-shot Intent Classification and calculate the accuracy when the model gives rationale words higher or lower attribution scores than non-rationales, as shown in the first and second columns of Table 4. For methods with the DM interpretation method, a_R is calculated by averaging the attribution scores of all rationale words in x, and a_N by averaging those of all non-rationale words. We can see that when our UIMER-IM/DM method correctly recognizes rationale words, it reaches higher accuracy than the base model, which suggests that helping models pay more attention to rationales can further improve task performance.

Table 4: Performance when the model gives higher attribution scores to rationale words or not.

Conclusion
Though many interpretation methods have been studied for deep neural models, only sporadic works utilize them to enhance models. In this paper, we propose a framework that can utilize various interpretation methods to enhance models, and we propose two novel instances utilizing two additional types of interpretation methods. In addition, we discuss how the optimization of the task-specific loss and the new loss should be coordinated. Comprehensive experiments are conducted on a variety of tasks including Intent Classification, Slot Filling and Natural Language Inference. Experiments show that our framework is effective in enhancing models with various interpretation methods, especially in the low-resource setting, and that our two newly proposed methods outperform gradient-based methods in most settings.
For future work, we plan to extend our framework to utilize more forms of rationales and additional interpretation methods.

Limitations
It can be inferred from the results that the two newly introduced methods do not give the best performance in rich-resource settings. We conjecture that UIMER-IM incorporates rationale information by introducing additional similar inputs to the model, which helps when training data is scarce; when training data is sufficient for the task, the effect of supplying rationale information in this way diminishes. UIMER-DM likewise does not perform best in rich-resource settings. We attribute its middling performance to the fact that, with rich data, most knowledge can be learned implicitly by the model, so injecting gold rationales helps less.

8 Appendix

Patterns to match rationales
In our experiments, we adopt simple patterns to match rationales. For Intent Classification, a dictionary is constructed, as shown in Table 5 (upper), mapping intent labels to rationales; for Slot Filling, a group of Regular Expressions, as shown in Table 5 (lower), is used to extract rationales. Both tasks are derived from the SNIPS dataset, which contains 13084 examples, and obtaining both the dictionary and the Regular Expressions is simple and fast. Note that though the lower table may look complex, the gray parts are just Regular Expression syntax; only the black parts carry rationale information.
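For illustration, a single hypothetical rule in this spirit might look like the following (the actual 28 expressions in Table 5 differ):

```python
import re

# Hypothetical pattern: mark weather-related tokens as rationales.
WEATHER = re.compile(r"\b(weather|forecast)\b", re.IGNORECASE)

def match_rationales(sentence):
    """Return a 0/1 rationale vector g over whitespace-split tokens."""
    tokens = sentence.split()
    return [1 if WEATHER.search(tok) else 0 for tok in tokens]
```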

Why is "+Uniform" Good
In experimenting with Sec. 3.2, we found that replacing rationales/non-rationales randomly, i.e., the "Uniform" strategy, often produces better results than even the "BERT" strategy, despite the fact that, compared to replacing tokens randomly, a pretrained BERT can apparently produce more fluent sentences. Here we give an example from the Intent Classification task that possibly explains this, shown in Fig. 5.
As stated in Sec. 3.2, we aim to minimize L_int defined in Eq. 5, in which we want to lower the latter term in a_R, a weighted sum of the model's predictions of the ground-truth intent on the sentences generated by BERT. However, in this example BERT largely preserves the original meaning of the sentence, so lowering the latter term in a_R contradicts the original task. In contrast, the sentences generated by replacing the rationale randomly contain much less information about the ground-truth intent, and minimizing L_int becomes more reasonable.

Hyperparameters
For the Intent Classification and Slot Filling tasks, the learning rate for L_θ is tuned among {8e-5, 1e-4, 2e-4, 4e-4} in the 1/3/10/30-shot settings and {1e-5, 2e-5} in the full-data setting. For the Natural Language Inference task, the learning rate for L_θ is tuned among {8e-5, 1e-4, 3e-4}. In all three tasks, α is tuned in the range [0.001, 20] and ϵ in the range [0.01, 10]. We use the AdamW optimizer (Loshchilov and Hutter, 2018) and a "linear schedule with warmup" scheduler. Detailed hyperparameters are shown in Table 6.

Figure 1: Our framework illustration utilizing interpretation methods to enhance models. The dotted green line indicates how the parameters of the model are optimized.

Figure 2:

Figure 3: The curves of L_int with and without warm-up training on 1-shot Intent Classification over 4 random seeds.

Figure 4: Ablation study of multi-round training on 10-shot Slot Filling.

Figure 5: The underline marks a rationale. Replacing the rationale by BERT vs. randomly.

Table 1: Examples of rationales and how they are obtained for the tasks in our experiments (Camburu et al., 2018; Chen et al., 2019).

and sentence pair classification respectively. For Intent Classification and Slot Filling, we adopt the SNIPS (Coucke et al., 2018) dataset. For Natural Language Inference, we adopt the e-SNLI (Camburu et al., 2018) dataset. SNIPS is widely used in NLU research (Jiang et al., 2021a; Chen et al., 2019). It is collected from the SNIPS personal voice assistant. There are 13084, 700 and 700 samples in the training, development and test sets respectively, and there are 72 slot labels and 7 intent types. e-SNLI is a large dataset extended from the Stanford Natural Language Inference dataset (Bowman et al., 2015) to include human-annotated natural language explanations of the entailment relations. There are 549367, 9842, and 9824 samples in the training, development and test sets respectively.

Rationales We give examples of rationales and show how they are obtained in Table 1. For the Intent Classification task, we ask one annotator to construct a set of keywords for each intent type based on the training set. This takes less than 15 minutes. For the Slot Filling task, we use 28 regular expressions with simple patterns, referencing Jiang et al. (2021b), to match the sentences in the SNIPS dataset and regard matched tokens as rationales. The job takes less than 30 minutes (less than 1 minute each on average). The complete rationale sets for Intent Classification and Slot Filling are shown in App. 8.1. For e-SNLI, we use the explanations provided by Camburu et al. (2018).

Table 2: Evaluation of our framework on three tasks. Underlines mark the results of our UIMER-IM/DM that outperform both the base model and the UIMER-GB methods. Boldface marks the best results among all methods. The Mean column gives the average of each row.

Table 3: A case study analyzing the task performance and the quality of interpretations of the base model and our methods UIMER-IM and UIMER-DM. "Base Model + IM/DM Attr.": the attribution scores produced by the base model with the IM/DM interpretation method. UIMER-IM/DM: the attribution scores produced by our framework.

Table 6: Detailed hyperparameters.

We show the results with the unbiased estimate of standard deviation in Table 9 for all few-shot settings.

Table 5: The upper table is the dictionary we construct to match rationales for each intent type in the Intent Classification task. The lower one lists the Regular Expressions used to match rationales for the Slot Filling task. Tokens following the <rationale*> tag are annotated rationales.

Table 7: Hyperparameters for the Slot Filling task. "task trained*": the original task is first well trained, then objective (1) is optimized.

Table 9: Results with standard deviation on all few-shot settings.