Improve Interpretability of Neural Networks via Sparse Contrastive Coding

Although explainable artificial intelligence (XAI) has achieved remarkable developments in recent years, few efforts have been devoted to the following problems, namely, i) how to develop an explainable method that could explain the black-box in a model-agnostic way? and ii) how to improve the performance and interpretability of the black-box using such explanations instead of pre-collected important attributions? To explore a potential solution, we propose a model-agnostic explanation method termed Sparse Contrastive Coding (SCC) and verify its effectiveness in text classification and natural language inference. In brief, SCC explains the feature attributions which characterize the importance of words based on the hidden states of each layer of the model. With such word-level explainability, SCC adaptively divides the input sentences into foregrounds and backgrounds in terms of task relevance. By maximizing the similarity between the foregrounds and input sentences while minimizing the similarity between the backgrounds and input sentences, SCC employs a supervised contrastive learning loss to boost the interpretability and performance of the model. Extensive experiments show the superiority of our method over five state-of-the-art methods in terms of interpretability and classification measurements. The code is available at https://pengxi.me.


Introduction
Deep neural networks (DNNs) have achieved remarkable progress during the past few years. However, relying on stacking somewhat ad-hoc modules, DNNs are often referred to as "black-box" methods that lack understanding of their working mechanisms, thus increasing the risk of applying them to real-world applications (Ribeiro et al.,

Figure 1: An illustration of the major differences between model explainability and our method. i) Different from model explainability (left) which only gives post-hoc explanations, our method (right) could further improve the interpretability of a given neural network. Such an improvement does not rely on pre-defined important attributions or pre-collected explanations; ii) In addition, unlike most of the existing works which only focus on explainability itself, our method could improve the performance of black-boxes using the obtained explanations.
2016; Rudin, 2019). For example, in medical diagnosis, predictions cannot be acted upon with blind faith. Instead, doctors need to understand the reasons behind the predictions, e.g., which part of the inputs (e.g., chemical indices) the model concentrates on.
To understand the working mechanism behind DNNs, explainable artificial intelligence (Chen et al., 2018, 2020a; Lundberg and Lee, 2017) has attracted considerable attention in recent years, and one typical paradigm is explaining the black-boxes at the level of feature attributions, i.e., the importance of features w.r.t. the prediction of the network. In general, these studies could be roughly divided into two groups, i.e., interpretable models and model explainability.
To be specific, model explainability (also referred to as post-hoc explanation) (Ribeiro et al., 2016; De Cao et al., 2020; Chen et al., 2021; Sun and Lu, 2020) mainly focuses on explaining the feature attributions through visualization techniques or agent models. The major merit of post-hoc explanation is being model-agnostic, but it only offers an approximate explainability and cannot improve the interpretability or the performance of the model. On the contrary, interpretable models (Rudin, 2019; Han et al., 2021) try to explain the working mechanism through the model design. In other words, the model could explicitly explain the feature attributions by itself. However, interpretable models are model-specific and cannot be generalized to different neural networks.
Based on the above discussions and observations, this paper aims to study two less-touched problems in XAI, namely, i) how to develop an explainable method that could explain the black-box in a model-agnostic way? and ii) how to improve the performance and interpretability of the black-box using such explanations instead of pre-collected important attributions? The solution to these problems requires simultaneously enjoying the merits of model explainability and interpretable models to a certain extent. Notably, the answer would be helpful to highlight another perspective of XAI, i.e., XAI should play an important role in improving the model after understanding the model behavior. Notice that some recent studies have been conducted and have proved the effectiveness of XAI in interpretability improvement (Erion et al., 2019; Rieger et al., 2020). However, they often require collecting pre-defined feature attributions, which is labor-intensive and uneconomical.
To explore an effective solution to this problem, this paper proposes a model-agnostic explanation method dubbed Sparse Contrastive Coding (SCC). As shown in Fig. 1, SCC designs a novel sparse coding layer (SCL) which explains the word-level feature attributions based on the hidden states of each layer in the model. To make the explanation faithful and to exploit the explainability for model improvement, SCC employs a novel loss function consisting of a sparse coding loss, a contrastive coding loss, and a cross entropy loss. Specifically, the cross entropy loss is enforced between the prediction of texts masked by the feature attributions and the ground-truth to achieve word-level explainability. To make the explanation concise, the sparse coding loss enforces a sparse constraint on the feature attributions so that the foreground words are disentangled from the backgrounds. To further exploit the explainability for improving the model, the contrastive coding loss is enforced on three kinds of input divided by the feature attributions, i.e., the whole texts, foregrounds, and backgrounds. Different from the vanilla methods (He et al., 2020; Chen et al., 2020b; Lin et al., 2021, 2022; Yang et al., 2022), our contrastive coding loss works in a supervised fashion and embraces the properties of negative sample mining and auto data augmentation that could boost the interpretability and performance.
The main contributions and novelties of this paper could be summarized as follows: i) we study the two aforementioned less-touched problems in XAI, i.e., how to develop a model-agnostic method to explain a given black-box and use such explanations to improve its performance and interpretability. To the best of our knowledge, few efforts have been devoted to them so far; ii) we accordingly propose SCC, whose basic idea is disentangling the important words from inputs to improve interpretability and discrimination. Extensive experiments on six textual datasets verify the effectiveness of SCC in terms of interpretability and classification metrics.

Related Work
This work is closely related to model explainability and interpretable models which will be briefly introduced in this section.

Model Explainability
Model explainability (post-hoc) methods mainly focus on explaining models by detecting feature attributions, i.e., explaining the model by evaluating the contribution of each feature (Guan et al., 2019). For example, LIME (Ribeiro et al., 2016) learns feature attributions by using a local linear model with perturbations to approximate the black-box model. L2X (Chen et al., 2018) aims to reveal the importance of features by maximizing the mutual information between the chosen words and the outputs of the model. KernelSHAP (Lundberg and Lee, 2017) employs Shapley values to compute feature attributions. Recently, some post-hoc explanation methods enforce the model to focus on pre-defined important features using human-annotated explanations, achieving remarkable progress. For example, (Rieger et al., 2020) utilizes contextual decomposition to encode prior knowledge into explanations, and (Erion et al., 2019) calculates expected gradients to make full use of attribution priors.
Although this paper also explores interpretability based on feature attributions, it differs from the aforementioned studies in the following aspects.
On the one hand, our method does not rely on pre-defined important attributions or pre-collected explanations (Erion et al., 2019), thus enjoying a more economical solution. On the other hand, our method could improve not only the interpretability but also the performance. In contrast, existing methods may cause a performance drop due to the inconsistency between the human explanation and the model reasoning process (Jacovi and Goldberg, 2020).

Interpretable Models
Instead of generating post-hoc explanations, interpretable models aim to build module-decomposable or algorithm-transparent neural networks. For example, TELL (Xi et al., 2021) proposes an algorithm-transparent clustering network which reformulates the k-means objective as a neural layer. SENN (Alvarez-Melis and Jaakkola, 2018) designs a module-decomposable neural network by progressively stacking a set of linear classifiers. VMASK (Chen and Ji, 2020) utilizes word masks to select important features for building an interpretable neural network.
The major differences between our work and existing works are twofold. On the one hand, our method is a model-agnostic explanation method which could be applied to explain different black-boxes. In contrast, the interpretability of most existing interpretable models is limited to the original model. On the other hand, most studies achieve interpretability at the cost of performance (Rudin, 2019), whereas our method shows that interpretability could improve the model performance.

Method
This section elaborates on the proposed Sparse Contrastive Coding (SCC), which tries to seek a feasible solution to the aforementioned two problems in XAI, i.e., i) how to develop an explainable method that could explain the black-box in a model-agnostic way? and ii) how to improve the performance and interpretability of the black-box using such explanations?
As illustrated in Fig. 2, SCC explains and improves the black-boxes through three jointly optimized objectives, namely, the sparse coding loss L_sc, the contrastive coding loss L_cc, and the cross entropy loss L_ce, where the balance factor λ weighting them is simply fixed to 0.1 throughout the experiments. In the following, we will introduce how the sparse coding layer with L_sc and L_ce is built to embrace explainability in Section 3.1, and how to improve the model performance and interpretability through L_cc in Section 3.2.
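Although the combined objective is not printed here, a minimal sketch of one plausible combination is given below, assuming the three losses are summed with the single balance factor λ = 0.1 mentioned above; the exact weighting scheme is an assumption.

```python
def total_loss(l_ce, l_sc, l_cc, lam=0.1):
    # Assumed combination: cross entropy plus the two auxiliary losses,
    # both scaled by the single balance factor lambda = 0.1.
    return l_ce + lam * (l_sc + l_cc)

loss = total_loss(1.0, 2.0, 3.0)   # 1.0 + 0.1 * (2.0 + 3.0) = 1.5
```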

Explaining Model via Sparse Coding Layer
Without loss of generality, we take text classification as an evaluation task. For an input text x = {x_1, . . . , x_N}, x_i represents the embedding of the i-th word and N denotes the number of words. The neural network f_θ(·) aims at predicting the class label ỹ for x through the mapping f_θ(x).
As shown in Fig. 2(a), to explain the neural network, we design a sparse coding layer (SCL) g_φ that could measure the feature attributions based on the output of each hidden layer in the model. To be specific, let h = {h^(0), . . . , h^(L)} denote the hidden states of each layer in the neural classifier, where h^(0) = x is the word embedding layer. We identify the important words through the feature attributions M = g_φ(h). In detail, g_φ aggregates the information from each layer in a gated form (Chung et al., 2014) with three one-layer MLPs, where η is the Tanh activation function and [; ] is the concatenation operation.
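Since the exact SCL equations are not reproduced above, the following NumPy sketch illustrates one plausible gated aggregation consistent with the description: per-layer representations from a Tanh MLP, gates from a concatenation-based MLP, and a final projection to per-word attributions. The weight shapes and the specific gating form are assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

L, N, d, d_low = 4, 6, 768, 100   # layers, words, hidden dim, projected dim

# hypothetical weights for the three one-layer MLPs
W_rep  = rng.normal(0.0, 0.02, (d, d_low))        # MLP_rep:  768 -> 100
W_gate = rng.normal(0.0, 0.02, (2 * d, d_low))    # MLP_gate: concat -> 100
W_prob = rng.normal(0.0, 0.02, (d_low, 1))        # MLP_proba: 100 -> 1

h = rng.normal(0.0, 1.0, (L + 1, N, d))           # hidden states h^(0)..h^(L)

# per-layer word representations (Tanh activation eta)
rep = np.tanh(h @ W_rep)                          # (L+1, N, d_low)

# gates from the concatenation [h^(l); h^(L)], squashed by a sigmoid
top = np.broadcast_to(h[-1], h.shape)
gate = sigmoid(np.concatenate([h, top], axis=-1) @ W_gate)

agg = (gate * rep).mean(axis=0)                   # aggregate over layers
M = sigmoid(agg @ W_prob).squeeze(-1)             # (N,) attributions in (0, 1)
```

The attributions M then feed the sparsity and masking objectives described next.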
To generate an effective explanation, M is expected to have the following properties: i) removing as many irrelevant words as possible for a concise explanation, and ii) correctly selecting the relevant words for classification. To achieve the first property, we encourage maximizing the sparsity of M through the ℓ0-norm, i.e., minimizing the number of non-zero values. Although the ℓ0-norm is discontinuous and has zero derivative almost everywhere, it is exactly equivalent to the ℓ1-norm in the binary case. Based on this observation, we generate hard masks during training following the reparameterization trick (Maddison et al., 2017), where g^(l) = −log(−log u) is the l-th Gumbel random variable, u ∼ Uniform(0, 1), and τ is the softmax temperature. In this way, one could surrogate the ℓ0-norm with the ℓ1-norm to achieve sparsity. Furthermore, we introduce a balance term on M to preserve exploration in the preliminary training stage. Mathematically, the sparse coding loss L_sc for SCL combines the ℓ1 sparsity penalty with this balance term, where γ is steadily annealed from 1.0 to 0.01 with a decay of 0.099.
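A minimal sketch of the hard-mask sampling described above, using the binary-Concrete relaxation (Maddison et al., 2017) with a hard threshold; the exact parameterization used by SCC is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_mask(probs, tau=1.0):
    """Sample a hard {0,1} word mask from attribution probabilities via the
    binary-Concrete relaxation (Maddison et al., 2017)."""
    u = rng.uniform(1e-8, 1 - 1e-8, size=probs.shape)
    g = -np.log(-np.log(u))                         # Gumbel(0, 1) noise
    logits = np.log(probs + 1e-8) - np.log(1 - probs + 1e-8)
    soft = 1.0 / (1.0 + np.exp(-(logits + g) / tau))
    return (soft > 0.5).astype(float)               # hard threshold in the forward pass

probs = np.array([0.9, 0.1, 0.8, 0.05])             # hypothetical attributions
M = hard_mask(probs)

# for a binary mask, the l0 "norm" coincides with the l1 norm
assert np.abs(M).sum() == np.count_nonzero(M)
```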
To achieve the second property, i.e., selecting the most relevant words, we mask the word embeddings based on the measured feature attributions for classification by minimizing the cross entropy loss between the prediction ỹ = f_θ(M ⊙ x) and the ground-truth y, where ⊙ denotes the element-wise multiplication. As long as the prediction ỹ approximates the ground-truth, we deem that M selects the most relevant words.
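The masked cross entropy can be sketched as follows, with a toy forward function standing in for the black-box f_θ; the toy classifier is purely illustrative.

```python
import numpy as np

def masked_cross_entropy(x, M, y, forward):
    """Cross entropy on the input masked element-wise by attributions M."""
    probs = forward(M[:, None] * x)    # prediction on M ⊙ x
    return -np.log(probs[y] + 1e-12)

# toy stand-in for f_theta: mean-pool then 2-class softmax (illustrative only)
def toy_forward(x):
    logits = np.array([x.mean(), -x.mean()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = np.ones((4, 3))                  # 4 words, 3-dim embeddings
M = np.array([1.0, 0.0, 1.0, 0.0])   # keep words 0 and 2
loss = masked_cross_entropy(x, M, y=0, forward=toy_forward)
```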

Improving Model using Explainability
Most XAI studies are deemed important for trustworthy and safe AI, but they somehow ignore another important perspective, i.e., improving the interpretability and performance of the model. After understanding the working mechanism of a black-box, one highly expects not only more trustworthy predictions but also higher performance. To this end, we propose a novel contrastive coding loss which could encourage model improvement using the explainability.
In detail, we first divide the input sentence x into the foreground x^f and the background x^b through the sparse coding layer, where x^f denotes the set of important words and x^b contains all irrelevant words for classification. By passing x, x^f, and x^b through the neural network f_θ, we obtain the representations z, z^f, and z^b, respectively.
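The foreground/background division amounts to partitioning the words by the hard mask; a minimal sketch, where the example words and mask values are hypothetical:

```python
def split_by_mask(words, mask):
    """Divide a sentence into foreground (mask=1) and background (mask=0)."""
    fg = [w for w, m in zip(words, mask) if m == 1]
    bg = [w for w, m in zip(words, mask) if m == 0]
    return fg, bg

words = ["a", "gory", "slash", "flick", "silly"]
mask  = [0, 1, 0, 0, 1]
fg, bg = split_by_mask(words, mask)
# fg == ["gory", "silly"], bg == ["a", "slash", "flick"]
```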
For clarity, we first present the general form of our contrastive coding loss and then elaborate on the training details. Let t_i denote the i-th sample and t_p a positive sample for t_i. Our contrastive coding loss is given by

L_cc = Σ_i −1/|P(i)| Σ_{t_p ∈ P(i)} log( exp(t_i · t_p / τ_2) / Σ_{t_a ∈ A(i)} exp(t_i · t_a / τ_2) ),

where P(i) denotes the corresponding positive sample set of sample i, A(i) denotes all samples except sample i, and τ_2 ∈ R+ is a scalar temperature parameter.
Notably, one major difference between Eq. (10) and the vanilla contrastive loss (Khosla et al., 2020) lies in the positive/negative construction, which is non-trivial as pointed out in (Chen et al., 2020b; Khosla et al., 2020). Specifically, as shown in Fig. 2(b), our contrastive coding loss deals with three kinds of anchors, i.e., the input sentence z_i, the foreground z_i^f, and the background z_i^b. Mathematically, we have the anchor t_i ∈ {z_i, z_i^f, z_i^b} (i.e., t_i could be one of the samples z_i, z_i^f, and z_i^b), which will also determine the choice of the positive sample t_p, namely, • When the anchor is the input sentence or foreground, i.e., t_i ∈ {z_i, z_i^f}, then t_p ∈ {z_p, z_p^f}. In other words, for either of z_i and z_i^f, the objective aims to minimize its distance to the within-class samples z_p and z_p^f, while maximizing its distance to the between-class samples z_n, z_n^f, and all backgrounds {z_i^b, z_n^b, z_p^b}. The subscripts p and n denote the within-class and between-class samples of z_i selected by the classification label.
• When the anchor is the background, i.e., t_i = z_i^b, then t_p ∈ {z_p^b, z_n^b}. In other words, the objective is to ensure that the backgrounds will only contain irrelevant words by pulling all backgrounds together while pushing the other sentences and foregrounds away.
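The anchor/positive construction above plugs into the supervised contrastive form of Eq. (10); the NumPy sketch below uses a hand-specified positive set P(i), with toy features and positive sets that are purely hypothetical.

```python
import numpy as np

def contrastive_coding_loss(feats, pos_sets, tau2=1.0):
    """Supervised contrastive loss of the Eq. (10) form: for each anchor i,
    average -log softmax similarity over its positive set P(i), with the
    denominator running over A(i) (all samples except i)."""
    n = len(feats)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T / tau2
    total = 0.0
    for i in range(n):
        others = [a for a in range(n) if a != i]               # A(i)
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        for p in pos_sets[i]:
            total += -(sim[i, p] - log_denom) / len(pos_sets[i])
    return total / n

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))              # hypothetical representations
pos_sets = {0: [1], 1: [0], 2: [3], 3: [2]}  # hypothetical positive sets P(i)
loss = contrastive_coding_loss(feats, pos_sets)
```

In SCC, the rows of `feats` would hold z, z^f, and z^b representations and `pos_sets` would follow the anchor rules above.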

Discussions
With the above contrastive coding loss, one could maximize the similarity between the foregrounds and the input sentences, while minimizing the similarity between the foregrounds and backgrounds. This strategy could encourage the model to select task-relevant words and throw away irrelevant words, thus boosting the interpretability. By incorporating the explainability into the training process, our contrastive coding loss owns the following desirable properties: Negative sample mining. The foregrounds and backgrounds divided by our sparse coding layer could be regarded as augmented positive samples and negative samples. Notably, few works attempt to conduct negative data augmentation since there is no exact definition of negative data augmentation. Through our explainability paradigm, it is reasonable and natural to construct negative pairs using the task-irrelevant samples. As shown in Table 5, the ablation study verifies the effectiveness of this negative sample mining property by discarding the background contrast.
Auto data augmentation. The huge success of contrastive learning could be partially attributed to effective data augmentation techniques (He et al., 2020; Chen et al., 2020b). Most existing data augmentation methods resort to hand-crafted approaches, e.g., rotation, flipping, and so on. Different from these methods, our SCC could be regarded as providing an auto data augmentation strategy which utilizes task relevance to filter out the salient words that are negative to the input in the semantic space. In addition, it is worth pointing out the difference between our method and (Gao et al., 2021). In brief, (Gao et al., 2021) randomly removes a fixed rate of words for data augmentation, which is task-irrelevant, and the fixed parameter might lead to inferior performance. In contrast, our SCC selects salient words to augment data based on task relevance, which is adaptive to different inputs. As shown in Table 3, one could find the superiority of such an auto data augmentation.

Experiments
In this section, we carry out experiments on six widely-used textual datasets. For a comprehensive study, we compare SCC with five state-of-the-art approaches on two classification tasks (i.e., sentiment analysis and subjective/objective classification) and a natural language inference (NLI) task in terms of interpretability and classification performance metrics.

Experimental Settings
Datasets: Six widely-used datasets are used in our experiments, i.e., the YELP reviews dataset (Zhang et al., 2015), the movie reviews dataset IMDB (Maas et al., 2011), the question classification dataset TREC (Li and Roth, 2002), the subjective/objective classification dataset SUBJ (Pang and Lee, 2005), the Stanford Sentiment Treebank dataset SST-2 (Socher et al., 2013), and the NLI dataset SciTail (Khot et al., 2018). For the IMDB and SUBJ datasets, we hold out a portion of the training set as the development set. For the other datasets, we use the original data splits. The statistics of the datasets are given in Table 1.
Implementation Details: The proposed sparse coding layer consists of three one-layer MLPs as shown in Eq. 2 and Eq. 3, i.e., a representation MLP MLP_rep, a gated MLP MLP_gate, and a probability MLP MLP_proba. In detail, MLP_rep and MLP_gate project the 768-dimensional token representations into 100 dimensions, and afterwards MLP_proba outputs the 1-dimensional feature attributions. In the training stage, we optimize the sparse coding layer and the neural classifier in an end-to-end fashion with the aforementioned three objectives. In the testing stage, we only retain the neural classifier and verify its improvement.
To show that SCC could improve the interpretability and the classification performance of a given model, we apply it to two typical neural models, i.e., BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019). In detail, we implement SCC in PyTorch 1.7.1 and carry out all evaluations on Red Hat 6.4 with a Tesla P100 GPU. To optimize the networks, we adopt the AdamW optimizer (Loshchilov and Hutter, 2017) with the default parameters and set the initial learning rate to 1e-5. The softmax temperatures τ and τ_2 are both set to 1.0 and the maximal training epoch is fixed to 10 on all datasets. For fair comparisons, we adopt the best checkpoint on the validation set for all tested methods in terms of accuracy.
Baselines: To show the promising performance of SCC, we compare it with the following methods: i) L2X (Chen et al., 2018) learns the importance of features by maximizing the mutual information between the chosen words and the output of the model. ii) IBA (Schulz et al., 2020) learns the feature attributions based on the information bottleneck theory. iii) VMASK (Chen and Ji, 2020) proposes a learnable mask that removes the irrelevant words and keeps the explanation of the model. iv) SimCSE (Gao et al., 2021) proposes a simple contrastive learning framework whose performance remarkably benefits from the dropout augmentation. v) The Base model, which denotes the model trained by minimizing the cross entropy loss only. Note that L2X and IBA are proposed for generating post-hoc explanations. To investigate the effectiveness of post-hoc explanations in performance improvement, we integrate L2X and IBA into the model training stage by adding an extra word mask layer as suggested by VMASK.

Quantitative Evaluation of Interpretability
We adopt two interpretability metrics to evaluate the faithfulness and sufficiency of the model, i.e., AOPC (Nguyen, 2018) and post-hoc accuracy (Chen et al., 2018). In brief, AOPC measures the fidelity by masking the top-scored words and calculating the difference in the predicted probability, while post-hoc accuracy measures the sufficiency of interpretability by keeping the most important words.

AOPC:
We calculate the area over the perturbation curve (AOPC) to evaluate the faithfulness of explanations to models. To be specific, AOPC calculates the average change of prediction probability on the predicted class over all test data when deleting the top k words. Mathematically,

AOPC(k) = (1/M) Σ_{i=1}^{M} ( p(ŷ | x_i) − p(ŷ | x_i(k)) ),

where ŷ is the predicted label and M is the number of samples. x_i(k) is constructed by deleting the top k important words of x_i, and LIME (Ribeiro et al., 2016) is used to measure the importance of words. Higher AOPC indicates better explanations, i.e., the deleted words are crucial to the model prediction. Note that, due to the over-high computation cost of LIME (Ribeiro et al., 2016), we randomly pick 2,000 examples from YELP and IMDB, and use the whole SST2, SUBJ, and TREC in the evaluation.
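Given per-sample predicted-class probabilities before and after deleting the top-k words, AOPC reduces to an average probability drop; a minimal sketch with hypothetical probabilities:

```python
import numpy as np

def aopc(probs_full, probs_deleted):
    """Average drop in predicted-class probability after deleting the
    top-k important words, averaged over M test samples."""
    return float(np.mean(np.asarray(probs_full) - np.asarray(probs_deleted)))

# hypothetical predicted-class probabilities before/after deletion
p_full    = [0.95, 0.90, 0.85]
p_deleted = [0.40, 0.55, 0.30]
score = aopc(p_full, p_deleted)   # mean of (0.55, 0.35, 0.55) ≈ 0.483
```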
Post-hoc Accuracy: It evaluates the sufficiency of the important words for the model prediction. More specifically, we select the top v words based on the feature attributions for classification and compare the performance with the case of the whole text. Note that the importance of words is computed by the word masks (baselines) or the sparse coding layer (SCC). Mathematically, post-hoc accuracy is defined as

post-hoc-acc(v) = (1/M) Σ_{i=1}^{M} 1[ỹ_i^(v) = ỹ_i],

where ỹ_i is the predicted label of the i-th sample and ỹ_i^(v) is that of the i-th sample with the top v words. Higher values denote better explanations.
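Post-hoc accuracy can be sketched as the agreement rate between predictions from the whole text and from the top-v words; the example predictions below are hypothetical.

```python
def post_hoc_accuracy(pred_full, pred_topv):
    """Fraction of samples whose top-v-word prediction matches the
    whole-text prediction."""
    agree = sum(int(a == b) for a, b in zip(pred_full, pred_topv))
    return agree / len(pred_full)

# hypothetical predictions from the whole text vs. the top-v words
pred_full = [1, 0, 1, 1]
pred_topv = [1, 0, 0, 1]
acc = post_hoc_accuracy(pred_full, pred_topv)   # 3/4 = 0.75
```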
Results: Table 2 reports the AOPC scores when the most important 5 and 10 words are deleted (marked as AOPC-5 and AOPC-10). From the results, one could observe that: i) SCC outperforms all baselines on most datasets in terms of AOPC-5 and AOPC-10. For example, SCC surpasses the best baseline by 1.94% on the TREC dataset with BERT in terms of AOPC-5. ii) On the YELP dataset, the AOPC of Base is even better than that of L2X and IBA. This phenomenon reveals that assembling post-hoc explanation methods into neural networks is not always encouraging, and proves the effectiveness of our training framework. iii) The AOPC-10 score on TREC is extremely high because the maximum sentence length is 15 in TREC, i.e., removing the top 10 words would probably exclude most informative words. Figure 3 shows the results of post-hoc accuracy, where SCC significantly outperforms all the tested baselines in all evaluations.

Qualitative Evaluation of Interpretability
To intuitively investigate the effectiveness of our method, we first present the feature attributions of two examples randomly selected from the SST2 dataset with the BERT backbone. As shown in Figure 4, although all methods make correct semantic predictions, the interpretability is quite different in this qualitative evaluation. More specifically, SCC correctly captures the sentiment words "gory" and "silly" in the first example, while the baselines fail to capture "silly". In the second example, IBA and VMASK ignore "cute" and L2X ignores "amusing", while our method captures all three important words. To sum up, SCC could capture more precise sentiment words that indicate the same sentiment polarity as the prediction. More examples are presented in Figure 5.

Evaluation of Classification
As aforementioned, one major goal of this study is to improve the classification performance by utilizing the explainability. To investigate this capacity, we compare the tested methods on five datasets in terms of classification accuracy. As shown in Table 3, SCC significantly outperforms the baselines by a large margin on almost all datasets. It should be pointed out that SCC is even better than SimCSE (Gao et al., 2021), which is designed for representation learning rather than interpretability.

Evaluation of Natural Language Inference
Natural language inference is the task of determining whether a hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given a premise. To verify the universality of our method, we further compare the tested methods on SciTail (Khot et al., 2018) in terms of accuracy and post-hoc accuracy. As shown in Table 4, the remarkable improvement suggests that our SCC could be generalized to different tasks and further improve both the interpretability and performance of the model.

Ablation Study
To evaluate our design decisions, we conduct ablation studies on the SST2 dataset. The experiments are designed to isolate the effects of the contrastive coding loss and the sparse coding loss. Moreover, we also ablate the disentangled backgrounds from contrastive learning to verify the effectiveness of the negative sample mining as discussed in Section 3.3. As shown in Table 5, all objectives are helpful in improving the interpretability and classification performance.

Conclusion
In this paper, we show a feasible solution to two less-touched problems in XAI, i.e., how to develop a model-agnostic method to explain the black-box and utilize the explanations to improve the model performance and interpretability. We take text classification and natural language inference as evaluation tasks, and quantitatively and qualitatively show the superiority of our method in terms of interpretability and classification metrics.
In the future, we plan to explore the potential of our framework in other applications such as medical diagnosis and to extend our idea to other data domains such as images.

Limitations
The motivation of this work is to highlight another important perspective of explainable AI, i.e., increasing the trustworthiness and performance of black-box neural network models in decision making. However, we need to retrain the whole network to improve the black-boxes, which might consume a lot of energy and cause massive CO2 emissions. In addition, we acknowledge that this paper only considers word-level explainability (i.e., feature attributions), and it is unclear how to extend this idea to other forms of explainability due to the diversity and rapid development of XAI.

Figure 2: Overview of the proposed SCC. (a) The sparse coding layer (SCL) g_φ is designed to measure the feature attributions M based on the output of each hidden layer h^(i) in the model f_θ. (b) SCC contains three jointly optimized losses, namely, the sparse coding loss L_sc, the contrastive coding loss L_cc, and the cross entropy loss L_ce. Specifically, the cross entropy loss is enforced between the prediction of texts masked by the feature attributions and the ground-truth to achieve word-level explainability. The sparse coding loss is enforced on the feature attributions to distinguish the irrelevant words and make the explanation concise. The contrastive coding loss is enforced on the whole text z, the foregrounds (task-relevant words) z^f, and the backgrounds (task-irrelevant words) z^b to boost the interpretability and performance of the model. The subscripts i, p, and n denote the i-th sample and the corresponding within-class and between-class samples.

Figure 3: Post-hoc accuracy. Higher is better. The vertical axis and horizontal axis denote the post-hoc accuracy and the number of reserved important words, respectively. SCC achieves significant improvement compared with the others.

Figure 4: Qualitative evaluation. The most important four words are highlighted and the color saturation indicates the word attribution. As shown, SCC could capture more precise sentiment words that indicate the same sentiment polarity as the prediction.

Figure 5: Qualitative evaluation. The most important four words are highlighted and the color saturation indicates the word attribution.

Table 1: Summary statistics of the datasets. C is the number of classes, L is the padded sentence length, B is the training batch size, and # denotes the number of samples in the train/dev/test sets.

Table 2: AOPC scores. Higher is better. The best and second-best results are highlighted in bold and underline. SCC focuses on the words most important for prediction compared with the baselines, thanks to our sparse coding layer and three jointly learned losses.

Table 3: Classification accuracy. The top and bottom six rows denote the results based on the BERT and RoBERTa backbones, respectively. As illustrated, SCC outperforms the five baselines with two different models on all five datasets.

Table 4: NLI accuracy. SCC outperforms the four baselines in terms of post-hoc accuracy and classification accuracy.

Table 5: Ablation study. We select the top 4 words for calculating post-hoc accuracy. All loss terms play indispensable roles in SCC.