Attending via both Fine-tuning and Compressing

While attention mechanisms have become a primary means of enhancing the interpretability of neural networks, their reliability and validity are still under debate. In this paper, we aim to purify attention scores to obtain a more faithful explanation of downstream models. Specifically, we propose a framework consisting of a learner and a compressor, which performs fine-tuning and compressing iteratively to enhance the performance and interpretability of the attention mechanism. The learner focuses on learning better text representations by fine-tuning to support good decisions, while the compressor compresses those representations with a Variational information bottleneck ATtention (VAT) mechanism to retain the most useful clues for explanations. Extensive experiments on eight benchmark datasets show the great advantages of our proposed approach in terms of both performance and interpretability.


Introduction
Attention mechanisms (Bahdanau et al., 2014) have achieved great success in various natural language processing (NLP) tasks. They were introduced to mimic the human eye, focusing on the important parts of the inputs when predicting labels. Existing studies show that attention mechanisms can improve not only the performance but also the interpretability of models (Mullenbach et al., 2018; Xie et al., 2017; Xu et al., 2015). Li et al. (2016) put forward the view that "attention provides an important way to explain the workings of neural models". Additionally, Wiegreffe and Pinter (2019) showed that attention mechanisms can help understand the inner workings of a model.
The basic assumption behind understanding models through attention scores is that the inputs (e.g., words) with high attention weights are essential for making decisions. However, as far as we know, this assumption has not been formally verified. Existing research (Jain and Wallace, 2019) also argues that attention is not explicable, and there is considerable controversy regarding the resulting explanations (Wiegreffe and Pinter, 2019; Jain and Wallace, 2019). Moreover, we find in our experiments that although the attention mechanism can improve performance for text classification, it may focus on irrelevant information. For example, in the sentence "A very funny movie.", the long short-term memory model with standard attention (LSTM-ATT) infers the correct sentiment label while paying more attention to the irrelevant word "movie", making the result difficult to explain.
In general, attention weights are optimized only to encode task-relevant information and are not constrained to imitate human behavior. In order to enhance the interpretability of the attention mechanism, recent studies turn to integrating human-provided explanation signals into attention models.  regularized the attention weights with a small amount of word-level annotations. Barrett et al. (2018) and Bao et al. (2018) improved the explanation of attention by aligning explanations with human-provided rationales. These methods rely on additional labour-consuming labelling to enhance explanations, which is hard to extend to other datasets or tasks.
In this paper, we aim to train a more efficient and effective interpretable attention model without any pre-defined annotations or pre-collected explanations. Specifically, we propose a framework consisting of a learner and a compressor, which enhances the performance and interpretability of the attention model for text classification. The learner learns text representations by fine-tuning the encoder. As for the compressor, we are motivated by the effectiveness of the information bottleneck (IB) (Tishby et al., 1999) for enhancing performance (Li and Eisner, 2019) and detecting important features (Bang et al., 2019; Chen and Ji, 2020; Jiang et al., 2020; Schulz et al., 2020), and present a Variational information bottleneck ATtention (VAT) mechanism that uses IB to keep the most relevant clues and forget the irrelevant ones for better attention explanations. In particular, IB is integrated into attention to minimize the mutual information (MI) with the input while preserving as much MI as possible with the output, which yields more accurate and reliable explanations by controlling the information flow.
To evaluate the effectiveness of our proposed approach, we adapt two advanced neural models (LSTM and BERT) within the framework and conduct experiments on eight benchmark datasets. The experimental results show that our adapted models outperform the standard attention-based models on all the datasets. Moreover, they exhibit great advantages with respect to interpretability in both qualitative and quantitative analyses. Specifically, we obtain significant improvements by applying our model to the semi-supervised word-level sentiment detection task, which detects sentiment words based on attention weights using only sentence-level sentiment labels. In addition, we provide case studies and text representation visualizations to gain insight into how our model works.
The main contributions of this work are summarized as follows.
• We propose a novel framework to enhance the performance and interpretability of the attention models, where a learner is used to learn good representations by fine-tuning and a compressor is used to obtain good attentive weights by compressing iteratively.
• We present a Variational information bottleneck ATtention (VAT) mechanism for the compressor, which performs compression over the text representation to keep the task-related information while reducing the irrelevant noise via the information bottleneck.
• Extensive experiments show the great advantages of our models within the proposed framework, and we perform various qualitative and quantitative analyses to shed light on why our models work well in terms of both performance and interpretability.

Related Work
In this section, we survey related attention mechanisms (Bahdanau et al., 2014) and review the most relevant studies on the information bottleneck (IB) (Tishby et al., 1999). Attention has been shown to help explain the internals of neural models (Li et al., 2016; Wiegreffe and Pinter, 2019), though its explanatory power is limited (Jain and Wallace, 2019). Many researchers have tried to improve the interpretability of attention mechanisms.  leveraged small amounts of word-level annotations to regularize attention. Kim et al. (2017) introduced a structured attention mechanism to learn attention variants from explicit probabilistic semantics. Barrett et al. (2018) and Bao et al. (2018) aligned explanations with human-provided rationales to improve the explanation of attention. Unlike these methods, which require prior attributions or human explanations, the VAT method enforces the attention to learn the vital information while filtering out the noise via IB.
A series of studies motivate us to utilize IB to improve the explanations of attention mechanisms. Li and Eisner (2019) compressed pre-trained embeddings (e.g., BERT, ELMo), retaining only the information that helps a discriminative parser, through a variational IB. Zhmoginov et al. (2019) utilized the IB approach to discover salient regions. Several works (Jiang et al., 2020; Chen et al., 2018; Guan et al., 2019; Schulz et al., 2020; Bang et al., 2019) proposed to identify vital features or attributions via IB. Moreover, Chen and Ji (2020) designed a variational mask strategy to delete useless words in the text. As far as we are aware, we are the first to incorporate IB into attention mechanisms to train more interpretable attention with better accuracy.

Our Approach
In this section, we introduce our framework consisting of a learner and a compressor with a Variational information bottleneck ATtention (VAT) mechanism. Given an attention-based neural network model, we formulate our idea within the framework of the variational information bottleneck (VIB) (Tishby et al., 1999). Our framework aims to improve the attention's interpretability along with its performance by restricting the attention to capture the crucial words while filtering out useless information.

Figure 1: The framework. The learner aims to learn a good text representation X by fine-tuning, and the compressor aims to learn good attention weights by compressing the attentive representations to capture the important words while forgetting redundant information via VAT. The blue circles mean the corresponding parameters of the modules are fixed.

Overview
Our framework is composed of a learner and a compressor, which perform fine-tuning and compressing iteratively (Figure 1). The learner aims to learn task-specific contextual word representations by fine-tuning. The compressor enforces the model to learn task-relevant information while reducing irrelevant information via IB. We run the learner and the compressor (fine-tuning and compressing) iteratively so that they improve each other.
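The alternating schedule above can be sketched with a toy coordinate-descent example. This is not the paper's training code: the model, loss, shapes, and parameter names below are all illustrative stand-ins, and the point is only the structure of the loop, in which one parameter group is updated while the other stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))          # toy inputs
y = rng.normal(size=(32,))            # toy regression targets
theta_enc = rng.normal(size=(8,))     # stands in for the encoder parameters
theta_att = rng.normal(size=(8,))     # stands in for the attention parameters

def loss(enc, att):
    # Toy objective coupling both parameter groups.
    pred = X @ (enc * att)
    return float(np.mean((pred - y) ** 2))

def grad(enc, att, wrt):
    # Gradient of the toy loss w.r.t. one parameter group.
    pred = X @ (enc * att)
    g = 2 * X.T @ (pred - y) / len(y)
    return g * (att if wrt == "enc" else enc)

lr = 0.01
history = [loss(theta_enc, theta_att)]
for it in range(20):
    for _ in range(10):               # learner phase: attention frozen
        theta_enc -= lr * grad(theta_enc, theta_att, "enc")
    for _ in range(10):               # compressor phase: encoder frozen
        theta_att -= lr * grad(theta_enc, theta_att, "att")
    history.append(loss(theta_enc, theta_att))
```

Each outer iteration corresponds to one fine-tune/compress round; the loss trace should decrease as the two groups adapt to each other.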
Learner. We adopt a basic attention-based neural network model as the learner to learn representations of the words based on the good attention weights learned by the compressor. The model is optimized with a cross-entropy loss to learn label-relevant information. In this phase, we fix the attention's parameters so that the model focuses on updating the encoder to learn word representations.
Compressor. To restrict the attention to capturing the vital information while reducing the noise, we integrate IB into the attention mechanism to compress the attentive text representation. We fix the encoder's parameters so that the model focuses on learning the attention weights based on the current representations obtained from the learner.

Basic Attention Model (Learner)
In this section, we describe our learner, an attention-based neural network model.

Figure 2: First, we obtain the input text's word representations X via an encoder trained by the learner. Then, we calculate Z by compressing the text representation R, the weighted sum of X under the attention weights α, while retaining the maximum information for judging Y by feeding Z into an MLP classifier for prediction.

First, given a text $T = \{w_1, w_2, \dots, w_{|T|}\}$, where $|T|$ is the length of the text $T$, we feed it into an encoder with a word embedding layer. We adopt LSTM and BERT models as our encoder; other models can also be applied within our framework. We obtain the context-aware word representations $x = [x_1, x_2, \dots, x_{|T|}]$, where $x_i$ is the hidden vector of the word $w_i$:
x " encoderpT, θ encoder q, where θ encoder is the parameters of the encoder. Based on the contextual word representations, attention mechanism (Bahdanau et al., 2014) 2 is utilized to capture the important parts in the text and obtain the text representation R, which is calculated as, where θ attention " tv a , W a u is the trainable parameters of the attention, which is not updated in this step to learn the word representation x based the good attention learned by the compressor. α " rα 1 , α 2 , ..., α |T | s is the attention weights. Finally, we input the text representation R into a multi-layer perceptron (MLP) to predict the probability. The cross-entropy loss is used to optimize the model.

Variational Information Bottleneck Attention (Compressor)
The learner optimizes the sentence representations by minimizing the cross-entropy loss, which does not force the model to ignore useless information. Thus, we compress the sentence representation $R$ into a latent representation $Z$ that retains the most useful information for inferring the label $Y$. We accomplish this by integrating VIB into the attention mechanism (Figure 2). To ensure that $Z$ preserves the maximum ability to predict $Y$ (maximizing $I(Z;Y)$) while carrying the least redundant information from $R$ (minimizing $I(Z;R)$), we follow standard IB theory (Tishby et al., 1999) and define the objective function as

$$\max_{\theta}\; I(Z;Y) - \beta\, I(Z;R), \tag{3}$$

where $I(\cdot\,;\cdot)$ denotes mutual information and $\beta$ is a coefficient balancing the two components. The main challenge is to estimate a lower bound for $I(Z;Y)$ and an upper bound for $I(Z;R)$; we give the main steps below and provide the detailed derivation in the supplementary materials. The joint probability $p_{\theta}(r, y, z)$ can be factored as $p(r)\,p(y \mid r)\,p_{\theta}(z \mid r)$ under the Markov assumption $Y \to R \to Z$ (i.e., $Y$ and $Z$ are independent given $R$). By replacing the conditional distribution $p_{\theta}(y \mid z)$ with a variational approximation $q_{\phi}(y \mid z)$, a simple classifier that runs on the compressed text representation $z$, we obtain a lower bound of $I(Z;Y)$, since $\mathrm{KL}[\,p_{\theta}(y \mid z)\,\|\,q_{\phi}(y \mid z)\,] \ge 0$, where $\mathrm{KL}[\cdot\,\|\,\cdot]$ represents the Kullback-Leibler divergence. Specifically, since $p(y)$ does not depend on our parameters, we regard it as constant and maximize $\mathbb{E}_{p_{\theta}(y,z)}[\log q_{\phi}(y \mid z)]$. Since we must first sample $r$ in order to sample $y, z$ from $p_{\theta}(r, y, z)$, the lower bound of $I(Z;Y)$ is computed as

$$I(Z;Y) \ge \mathbb{E}_{p(r)}\,\mathbb{E}_{p(y \mid r)}\,\mathbb{E}_{p_{\theta}(z \mid r)}\big[\log q_{\phi}(y \mid z)\big]. \tag{5}$$

We calculate the upper bound of $I(Z;R)$ by replacing the marginal $p_{\theta}(z)$ with a variational distribution $r_{\psi}(z)$; since $\mathrm{KL}[\,p_{\theta}(z)\,\|\,r_{\psi}(z)\,] \ge 0$, the upper bound of $I(Z;R)$ is computed as

$$I(Z;R) \le \mathbb{E}_{p(r)}\Big[\mathrm{KL}\big[\,p_{\theta}(z \mid r)\,\|\,r_{\psi}(z)\,\big]\Big]. \tag{7}$$

Then, we obtain the lower bound $\mathcal{L}$ of the IB objective by substituting Equations 5 and 7 into Equation 3:

$$\mathcal{L} = \mathbb{E}_{p(r)}\Big[\mathbb{E}_{p(y \mid r)}\,\mathbb{E}_{p_{\theta}(z \mid r)}\big[\log q_{\phi}(y \mid z)\big] - \beta\,\mathrm{KL}\big[\,p_{\theta}(z \mid r)\,\|\,r_{\psi}(z)\,\big]\Big]. \tag{8}$$

The first component in $\mathcal{L}$ keeps the most useful information in $p_{\theta}(z \mid r)$ for inferring $y$, while the second regularizes $p_{\theta}(z \mid r)$ toward a predefined prior distribution $r_{\psi}(z)$ (e.g., a Gaussian distribution). To compute $p_{\theta}(z \mid r)$, we adopt the reparameterization trick for multivariate Gaussians (Rezende et al., 2014), which allows gradients to flow through the parameters that derive $z$ from random noise $\epsilon$:

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$

where $\odot$ denotes element-wise multiplication. $\mu$ and $\sigma$ denote the mean and (diagonal) covariance defined by two functions of $R$, where $R = \alpha \cdot x$ is learned based on the attention. In particular, two MLPs are used to predict $\mu$ and $\sigma$. Finally, we feed $z$ into an MLP to predict $q_{\phi}(y \mid z)$ and optimize the attention's parameters via Equation 8.
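The reparameterized sampling step and the KL regularizer (for the common case where the prior is a standard Gaussian) can be sketched as follows. This is an illustrative numpy sketch under that standard-normal-prior assumption, not the paper's code; the example values of `mu` and `sigma` are arbitrary.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): sampling through which
    gradients w.r.t. mu and sigma can flow in an autodiff framework."""
    eps = rng.normal(size=mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), the compression
    penalty when the prior r_psi(z) is a standard Gaussian."""
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2, 0.0])     # toy mean predicted from R
sigma = np.array([1.0, 0.8, 1.2])   # toy std predicted from R
z = reparameterize(mu, sigma, rng)
kl = kl_to_standard_normal(mu, sigma)
```

The KL term is zero exactly when the posterior matches the prior, so minimizing it squeezes out information that `z` carries about `r`, which is the compression half of Equation 8.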

Experiment Setup
We adopt two typical neural network models, attention-based LSTM (Hochreiter and Schmidhuber, 1997) and BERT (Devlin et al., 2019), to explore our VAT algorithm.

Datasets and Baselines
Datasets To evaluate the effectiveness of our VAT model, we conduct experiments on eight benchmark datasets: IMDB (Maas et al., 2011), the Stanford Sentiment Treebank (including SST-1 and its binary version SST-2) (Socher et al., 2013), Yelp (Zhang et al., 2015), AG News (Zhang et al., 2015), TREC (Li and Roth, 2002), the subjective/objective classification dataset Subj (Pang and Lee, 2005), and Twitter (Rosenthal et al., 2014, 2015). The statistics of these datasets are shown in Table 1.

Baselines We compare our model with two kinds of models: basic models (LSTM/BERT-base) and attention-based models (LSTM/BERT-ATT). LSTM-base takes the max-pooling of the LSTM's hidden vectors as the text representation. For BERT-base, the "[CLS]" representation is used as the sentence representation. LSTM-ATT is a standard attention-based LSTM model with the same structure as the learner. BERT-ATT is obtained by replacing the LSTM encoder in LSTM-ATT with BERT. Our models are marked with VAT (LSTM-VAT, BERT-VAT); they integrate VIB into the attention-based neural models.

Implementation Details
For the LSTM-based models, we use 300-dimensional GloVe embeddings (Pennington et al., 2014) to initialize the word embeddings and fine-tune them during training. We randomly initialize all out-of-vocabulary words and weights with the uniform distribution $U(-0.1, 0.1)$. For the BERT-based models, we fine-tune the pre-trained BERT-base model.

Experiments
First, we run our models and the baselines on eight benchmark datasets and visualize the text representations to verify the effectiveness of VAT (Section 5.1). Second, to further investigate our VAT model, we adopt two popular explanation metrics for quantitative evaluation (Section 5.2). Third, we apply our models to a semi-supervised sentiment detection task to evaluate the explanations of our model (Section 5.3). Fourth, we explore the influence of our iteration strategy in Section 5.4 and provide case studies in Section 5.5. Due to space limitations, in some cases we only list the results on part of the datasets, since the conclusions are similar for the others; the complete results are presented in the supplementary materials.

Main Results
We report the accuracy of our VAT models and the baselines based on LSTM and BERT (Table 2). From these results, we make the following observations: 1) our models (LSTM/BERT-VAT) outperform all the corresponding baselines on all eight datasets, which demonstrates the effectiveness of our VAT on both LSTM- and BERT-based models; 2) compared with the attention-based models (LSTM/BERT-ATT), our models obtain better results, indicating that reducing irrelevant information in the input via VAT can improve model performance. Furthermore, we visualize the sentence representations obtained from the LSTM/BERT-ATT and -VAT models (Figure 3), randomly selecting 1000 samples from the test set of each dataset. We find that our VAT model reduces the distance between samples within a class and increases the distance between samples of different classes. For example, it is hard to separate the positive samples from the negative ones based on the representations obtained from LSTM-ATT on the IMDB dataset, while the dividing line based on our VAT is clear. These observations show that our VAT model can learn better task-specific representations by enforcing the model to reduce task-irrelevant information.

Quantitative Evaluation
In this section, we evaluate our VAT model using two metrics, AOPC and post-hoc accuracy, which are widely used for evaluating explanations (Chen and Ji, 2020). Note that the well-trained LSTM/BERT-base models are used for evaluating classification performance.
AOPC. To evaluate the faithfulness of the explanations to our models, we adopt the area over the perturbation curve (AOPC) (Nguyen, 2018; Samek et al., 2016) metric. It calculates the average change in accuracy over the test data when deleting the top-K words ranked by attention weight. The larger the AOPC value, the better the model's explanations. Table 3 displays the results with K = 5. We compare our models with random selection and the basic attention-based models. From the results, we observe that: 1) the basic attention-based models (LSTM/BERT-ATT) can find the important words in a sentence to some extent, obtaining significant improvements over random selection (Random); 2) our models (LSTM/BERT-VAT) outperform the standard attention-based models, indicating that integrating VIB into the attention mechanism helps improve the interpretability of the models by filtering out useless information; 3) the BERT model is sensitive to context: deleting words destroys the semantic information of the sentence and significantly affects the model's performance.
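The AOPC computation can be sketched as follows. This is one common formulation (average drop in the correct-class probability after deleting the top-k attended words); `predict` is a hypothetical user-supplied scoring function, and the toy model and sample below are illustrative, not from the paper.

```python
import numpy as np

def aopc(samples, predict, k=5):
    """Average change in correct-class probability after deleting the
    top-k words ranked by attention weight (a sketch of AOPC)."""
    drops = []
    for words, alpha, label in samples:
        top = set(np.argsort(alpha)[::-1][:k])          # top-k attended indices
        kept = [w for i, w in enumerate(words) if i not in top]
        drops.append(predict(words, label) - predict(kept, label))
    return float(np.mean(drops))

# Toy model: confident (0.9) when "funny" is present, otherwise 0.4.
def toy_predict(words, label):
    return 0.9 if "funny" in words else 0.4

samples = [(["a", "very", "funny", "movie"],
            np.array([0.1, 0.1, 0.7, 0.1]), 1)]
score = aopc(samples, toy_predict, k=1)   # deleting "funny" drops 0.9 -> 0.4
```

A larger score means the deleted words were more important to the prediction, i.e., the attention ranking is more faithful.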
We also explore the influence of the top-K value (Figure 4). Intuitively, the more words we delete, the more the accuracy drops. Our models reduce the accuracy more than the random and standard attention-based baselines.

Post-hoc Accuracy. We also adopt post-hoc accuracy (Chen et al., 2018) to evaluate the influence of the task-specific essential words on the performance of the LSTM-based and BERT-based models.
For each test sample, we select the top-K words based on their attention weights as the input to make a prediction and compare it with the ground truth. Table 4 presents the performance with K = 5. First, it is interesting that the post-hoc accuracy with the five most important words on the Subj dataset (89.10) is even better than that with the original sentences (89.00). Additionally, we obtain comparable results with only five words on the SST-1, SST-2, and Twitter datasets. This shows that our model can reduce noise, since most of the words are useless for prediction in some cases. Second, for the BERT-based models, the context words are also important for classification even though they may not be task-specific.
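The post-hoc accuracy procedure can be sketched as follows. Again, `predict` is a hypothetical stand-in classifier and the two toy samples are illustrative; the sketch only shows the keep-top-K-then-classify protocol.

```python
import numpy as np

def post_hoc_accuracy(samples, predict, k=5):
    """Keep only the top-k attended words of each sample (in their original
    order), classify the reduced input, and compare with the gold label."""
    correct = 0
    for words, alpha, label in samples:
        top = np.argsort(alpha)[::-1][:k]
        kept = [words[i] for i in sorted(top)]   # preserve word order
        correct += int(predict(kept) == label)
    return correct / len(samples)

# Toy classifier: positive (1) iff "funny" survives the truncation.
def toy_predict(words):
    return 1 if "funny" in words else 0

samples = [
    (["a", "very", "funny", "movie"], np.array([0.1, 0.1, 0.7, 0.1]), 1),
    (["a", "dull", "boring", "film"], np.array([0.2, 0.3, 0.4, 0.1]), 0),
]
acc = post_hoc_accuracy(samples, toy_predict, k=1)
```

High post-hoc accuracy with small K indicates that the few most-attended words already carry the label-relevant signal.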
Similarly, we investigate the influence of the top-K value for post-hoc accuracy (Figure 5). The LSTM-base model with the top-10 words selected by our LSTM-VAT model achieves results comparable to the original samples in most cases. Additionally, on the IMDB dataset, the accuracy of LSTM-base with a single word selected by our VAT model is even better than with 20 randomly selected words.

Semi-Supervised Word-Level Sentiment Detection
We perform semi-supervised word-level sentiment detection on Twitter (Rosenthal et al., 2014, 2015) to evaluate the interpretability of our VAT. This task requires detecting the sentiment words in a tweet given only the sentiment polarity of the whole tweet. In the following example from the dataset, the positive words ("good" and "fantastic") are marked in bold and the overall polarity of the tweet is positive: Good morning becky! Thursday is going to be fantastic!
We use the SemEval 2013 Twitter dataset, which contains word-level sentiment annotations, and remove the samples with neutral sentiment. We report word-level precision, recall, and F-measure for evaluating the models (Table 5), the same as . Note that we select the top-K (K = 1 and 5 here) words according to the attention weights as the sentiment words.
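The word-level evaluation can be sketched as follows: the top-K attended word positions are proposed as sentiment words and scored against the gold annotations. The toy samples below (with gold sets of word indices) are illustrative, not from SemEval 2013.

```python
import numpy as np

def word_prf(samples, k=1):
    """Corpus-level precision/recall/F1 for top-k attended words
    against gold sentiment-word indices."""
    tp = fp = fn = 0
    for words, alpha, gold in samples:              # gold: set of word indices
        pred = set(np.argsort(alpha)[::-1][:k])     # proposed sentiment words
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

samples = [
    (["good", "morning", "becky"], np.array([0.6, 0.3, 0.1]), {0}),   # hit
    (["thursday", "is", "fantastic"], np.array([0.7, 0.1, 0.2]), {2}),# miss
]
p, r, f = word_prf(samples, k=1)
```

With K = 1 the number of proposed words is fixed per tweet, so precision and recall trade off directly against how concentrated the attention is on the true sentiment words.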
We compare our VAT model with random selection and the attention-based models. The results show that attention-based models can capture the important words in the text to a certain extent. Since our VAT can reduce irrelevant information, it performs better than the standard attention model. Also, the LSTM-based models outperform the BERT-based models on this task in most cases. This is because BERT learns much semantic information from the text, and context information plays a vital role in its predictions.

Influence of Iteration
We propose to train the learner and the compressor iteratively, so that the learner optimizes the word representations based on the good attention, and the compressor optimizes the attention based on the good word representations. To take a closer look at how this works, we first report our VAT model's accuracy over different numbers of iterations (Table 6). From the results, we find that the model's performance improves at first and then converges.

Table 5: The results of semi-supervised word-level sentiment detection on Twitter.
Figure 6: Visualization of the text representations obtained from LSTM/BERT-VAT over different iterations, for (a) AG News (LSTM) and (b) AG News (BERT). We use t-SNE to project the 100-/768-dimensional feature spaces into two dimensions.

Also, we plot the change of the sentence representations over different iterations (Figure 6). Similarly, we observe that fine-tuning and compressing iteratively improves the sentence representations: samples of the same class are close together, while samples of different classes are far apart.

Case Studies
To understand why our proposed VAT model is more effective than the standard attention-based model, we visualize two examples from the LSTM-based models using attention heatmaps (Figure 7). First, the standard attention-based LSTM model focuses on the wrong words (e.g., "this", "work") even though it predicts the right sentiment, while our VAT model finds the correct words (e.g., "admired", "lot"). This indicates that integrating IB into attention can help it focus on the key words and reduce noisy information. Second, our proposed model can also improve the attention's performance by capturing the critical words accurately. For example, in the sentence "That sucks if you have to take the sats tomorrow.", our model predicts the right class label by attending to the words "sucks" and "have to".

Conclusions and Future Work
This paper proposes a VAT-based framework to improve the performance and interpretability of attention via both fine-tuning and compressing. The experimental results on eight benchmark datasets for text classification verify the effectiveness of our models within this framework. In addition, we apply the framework to sentiment detection, which further demonstrates its superiority in terms of interpretability. It is also interesting to find that training the models by fine-tuning and compressing iteratively is effective for improving the text representations. In the future, we will investigate the effectiveness of our proposed attention framework for other tasks and areas, such as machine translation and visual question answering.