DISK-CSV: Distilling Interpretable Semantic Knowledge with a Class Semantic Vector

Neural networks (NNs) applied to natural language processing (NLP) are becoming deeper and more complex, making them increasingly difficult to understand and interpret. Even in applications of limited scope on fixed data, these complex "black boxes" create substantial challenges for debugging, understanding, and generalization. Rapid development in this field has therefore led to building more straightforward and interpretable models. We propose a new technique (DISK-CSV) to distill knowledge concurrently from any neural network architecture for text classification, captured as a lightweight interpretable/explainable classifier. Across multiple datasets, our approach achieves better performance than the target black-box. In addition, our approach provides better explanations than existing techniques.


Introduction
Deep Neural Networks (DNNs) are popular in many applications, including computer vision and natural language processing. However, two major factors still require attention. First, end-users must be able to understand the classifier's predictions to develop trust in the model, and machine learning engineers need the same understanding to refine it; this is especially true in high-stakes domains such as clinical decision support. Attention has therefore turned to methods that attempt to make a neural network explainable, for instance (Sundararajan et al., 2017) and (Ribeiro et al., 2016), which create post-hoc explanations to support explainability. Second, artificially high numbers of parameters make inference expensive: neural networks tend to be deep, with millions of parameters. For example, GPT-2 (Radford et al., 2019) has over 1.5 billion parameters. As a result, these models are compute-intensive and difficult to deploy in real-world applications. We propose here a model-agnostic interpretable knowledge distillation method for neural network text classification.
As shown in Figure 1, we learn a class semantic vector for each output class concurrently with training the black-box. We then use the semantic vectors to create a nearest-neighbor classifier (a compressed interpretable/explainable classifier) from the black-box version. Knowledge distillation refers to the process of transferring the implicit knowledge learned by a teacher model to a student model (Liu and Matwin, 2018). Dark knowledge refers to the salient information hidden in the predicted probabilities for all classes, which are more informative than the predicted classes themselves. Our contributions can be summarized as follows: • We propose a knowledge distillation method where dark knowledge can be learned concurrently by a student model, while building a black-box model.
• We propose an interpretable classifier that provides the user with an explanation for its prediction of a class label.
• We integrate a clustering technique within our interpretable classifier model.
• We provide an interactive explanation mode, where users can directly request a word or a phrase query and receive feedback.
• Our smaller model achieves even better performance than the original black-box, with drastically fewer parameters. This smaller model can be deployed as an online service for real-time applications on resource-restricted devices.

Related Work
This work connects to research on explainability and model compression.
Explainability. Most existing explainable AI (XAI) techniques for NLP text classification focus on assigning a score to each word in the document w.r.t. the predicted class, typically using gradient-based or perturbation-based methods (Arras et al., 2017; Sundararajan et al., 2017; Shrikumar et al., 2017a; Bach et al., 2015). The most popular technique for model-agnostic explanation is LIME (Ribeiro et al., 2016), which builds an interpretable classifier by approximating the target model locally with a linear model.
The main drawback of these methods is that the explanations are not faithful to the target model (Rudin, 2018). Other methods focus on constructing self-explainable networks (Bastings et al., 2019; Lei et al., 2016). These techniques produce limited explanations and, in particular, do not explain phrases. Our work differs from post-hoc and self-explainable approaches in that it learns a smaller explainable classifier concurrently with the target black-box model. Our explanations are also generated from the interpretable classifier itself, without the extra computation required by post-hoc techniques.
Model compression. A considerable body of research has been devoted to compressing large networks to accelerate inference, transfer, and storage. One of the earliest attempts focused on pruning unimportant weights (LeCun et al., 1990). Other methods focused on modifying devices to improve floating-point operations (Tang et al., 2018), while some works focused on quantizing neural networks (Wu et al., 2018). Other investigations have focused on knowledge distillation, i.e., transferring the knowledge from a larger model to a smaller one (Ba and Caruana, 2014; Hinton et al., 2015). However, the main drawbacks of the methods mentioned above are that: (1) they only work with pre-trained networks, (2) the compressed models are still treated as black-boxes, and (3) the compression techniques require another training step or additional computation, which complicates the process. In contrast, we concurrently transfer the knowledge from the black-box into a smaller interpretable model.

DISK-CSV
In a text classification task, an input sequence $x = x_1, \ldots, x_l$, $x_i \in \mathbb{R}^d$, where $l$ is the length of the input text and $d$ is the embedding dimension, is mapped to a distribution over class labels by a neural network model with parameters $\theta$ (e.g., a Long Short-Term Memory network (LSTM), a Transformer, etc.), which we denote $F(x; \theta)$. The output $y$ is a vector of class probabilities, and the predicted class $\hat{y}$ is a categorical outcome, such as an entailment decision. In this work, we are interested in learning a simpler compressed nearest-neighbor classifier (whose predictions are easy to explain) from any neural network model, concurrently with training the larger model. We refer to the large (black-box) model as $T$ and to the smaller interpretable/explainable model as $S$.
We call our method DISK-CSV: Distilling Interpretable Semantic Knowledge with a Class Semantic Vector. In the next subsections we provide the following details for DISK-CSV: (a) how to distill knowledge from $T$ into $S$; (b) how to construct interpretable representations for $S$; and (c) how to interact with the model to achieve better explainability (e.g., by clustering data, explaining essential phrases, and providing a semi-counterfactual explanation).

Neural networks learn by optimizing a loss function that reflects the true objective of the end-user. For $S$, our objective is to generalize in the same way as $T$ and to approximate an explanation for each prediction. To demonstrate the idea, we show how $S$ can be learned concurrently with an LSTM network and then discuss how this generalizes to other architectures for text classification. An LSTM processes the input word by word; at time-step $t$, the memory $c_t$ and the hidden state $h_t$ are updated. The last state $h_l \in \mathbb{R}^d$ is fed into a linear layer with parameters $W \in \mathbb{R}^{d \times k}$, which gives a probability distribution over the $k$ classes:

$$y = \mathrm{softmax}(W^{\top} h_l). \qquad (1)$$

The classifier uses the cross-entropy loss to penalize misclassification, $L_{classifier} = -\frac{1}{k} \sum_{i=1}^{k} r_i \log(y_i)$, where $r \in \mathbb{R}^k$ is the one-hot represented ground truth and $r_i$ is the target probability (0 or 1) for class $i$. The network's weights are updated via back-propagation when training the black-box.

We intend to augment neural nets that use embeddings to represent discrete variables (e.g., words) as continuous vectors, where words occurring in similar contexts have similar meanings. The simplest form of concurrent knowledge distillation is to transfer the knowledge from the embedding space of $T$ into $k$ Class Semantic Vectors (CSVs) $v_i \in v$, where the dimension of each $v_i$ equals the dimension of the embedding vector $x_i$, and $k$ is the number of target classes. In other words, for each class label, we would like to learn a vector that captures the semantic information related to that class from the embedding layer.
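As a sanity check, the output layer and classification loss described above can be computed in a few lines. The following is a minimal NumPy sketch; the dimensions ($d = 4$, $k = 3$) and the random weights are illustrative placeholders, not the paper's configuration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, r, k):
    # L_classifier = -(1/k) * sum_i r_i * log(y_i)
    return -(r * np.log(y)).sum() / k

rng = np.random.default_rng(0)
d, k = 4, 3
h_l = rng.normal(size=d)          # last hidden state of the LSTM
W = rng.normal(size=(d, k))       # output-layer weights
y = softmax(W.T @ h_l)            # probability distribution over k classes
r = np.array([0.0, 1.0, 0.0])     # one-hot ground truth
loss = cross_entropy(y, r, k)
```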
These semantic vectors have the following properties: (1) each vector $v_i$ captures/encodes the semantics of class $i$ from the black-box; (2) the vectors are used by the nearest-neighbor classifier to predict the correct class label; (3) using cosine similarity, we can compute the contribution of each word in the input to class $i$ via the corresponding $v_i$; (4) the vectors add another level of abstraction by explaining the feature importance of a phrase that extends beyond a single word; and (5) the weights of the CSVs are initialized in the same way as the embedding layer and adjusted via back-propagation. We reformulate the optimization of $T$ to update/adjust the weights of the CSVs as follows:

$$L = L_{classifier} + \lambda_1 D_{cos}(\bar{x}, v_{\hat{y}}) + \lambda_2 D_{cos}(h_l, v_{\hat{y}}) - \lambda_3 \sum_{i \neq j} D(v_i, v_j), \qquad (2)$$

where $D_{cos}$ is the cosine distance, $D$ is the pairwise Euclidean distance, $\bar{x}$ is the average embedding of the input, $\hat{y}$ is the index of the predicted class with the highest probability, and $\{\lambda_1, \lambda_2, \lambda_3\}$ weight the importance of the terms. In what follows, we discuss the new terms added to the optimization problem.
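Under our reading of Equation 2, the combined objective can be sketched in NumPy as follows. The λ values, toy dimensions, and function names are placeholders; the exact form of the separation term is our assumption based on the term descriptions below.

```python
import numpy as np

def cos_dist(a, b):
    """Cosine distance: 1 minus cosine similarity."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def disk_csv_loss(l_classifier, x_bar, h_l, csvs, y_hat,
                  lam1=1.0, lam2=1.0, lam3=0.1):
    """Combined loss: classification term plus the three CSV terms of Eq. 2."""
    v = csvs[y_hat]                       # CSV of the predicted class
    semantic = cos_dist(x_bar, v)         # capture semantics of the input
    hidden = cos_dist(h_l, v)             # distill the black-box feature vector
    # maximize pairwise Euclidean separation between CSVs (hence the minus sign)
    sep = sum(np.linalg.norm(csvs[i] - csvs[j])
              for i in range(len(csvs)) for j in range(i + 1, len(csvs)))
    return l_classifier + lam1 * semantic + lam2 * hidden - lam3 * sep

rng = np.random.default_rng(1)
d, k = 4, 3
csvs = rng.normal(size=(k, d))            # one semantic vector per class
loss = disk_csv_loss(0.9, rng.normal(size=d), rng.normal(size=d), csvs, y_hat=1)
```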
Capturing semantics: The second term of Equation 2 attempts to encode the information of semantically consistent sentences in a single CSV $v_{\hat{y}}$. An obvious way to learn semantic information is to minimize the cosine distance between the average embedding $\bar{x}$ of the input sentence $x$ and the predicted class semantic vector $v_{\hat{y}}$. This objective ensures that the vector $v_i$ captures the semantics of consistent inputs and thereby encourages semantic consistency.
Hidden knowledge extraction: The last hidden state $h_l$ in recurrent nets is typically used by the output layer as the feature vector for predicting the class label; the salient information learned by $T$ is therefore encoded in this vector. To distill this knowledge and enrich the representation of $v_i$ so that $S$ generalizes well, the third term of Equation 2 minimizes the cosine distance between the class semantic vector $v_{\hat{y}}$ and the last hidden state $h_l$. This objective allows the model $S$ to generalize similarly to the black-box $T$. The only constraint is that the dimension of $h_l$ must match that of $x_i$, i.e., $h_l \in \mathbb{R}^d$, so that the cosine distance is well-defined.
Vector separation: Our ultimate goal is to create a simple, interpretable nearest-neighbor classifier $S$ from the black-box. We therefore want the CSVs to be well separated from each other, so that the distance between them is maximal. To this end, the fourth term of Equation 2 maximizes the pairwise Euclidean distance between these vectors.

The smaller interpretable classifier S based on CSV
Our smaller model $S$ is a nearest-neighbor classifier that relies mainly on the semantic information encoded in the vectors $v$, learned via back-propagation while training $T$. The model $S$ takes the input sentence, computes the average $\bar{x}$ of the input embeddings, and then computes the cosine distance between $\bar{x}$ and each $v_i$. The target class is the index $i$ of the $v_i$ with the lowest cosine distance. Moreover, this classifier is interpretable, i.e., we can understand the mechanism by which it makes a prediction, and it can be easily transferred or stored. The smaller model extracts the semantics from the larger model concurrently with training the black-box. The algorithm is summarized in Figure 2.
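The prediction rule of $S$ is short enough to state directly. The following NumPy sketch uses toy embeddings and CSVs as stand-ins for the learned weights.

```python
import numpy as np

def cos_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(word_embeddings, csvs):
    """Nearest-neighbor prediction: the class whose CSV is closest
    (in cosine distance) to the mean embedding of the input."""
    x_bar = word_embeddings.mean(axis=0)
    dists = [cos_dist(x_bar, v) for v in csvs]
    return int(np.argmin(dists))

# toy check: CSVs aligned with the two axes, input near axis 0
csvs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc = np.array([[0.9, 0.1], [1.1, -0.1]])
pred = predict(doc, csvs)    # nearest CSV is class 0
```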
• Word feature importance: To understand the contribution of each word w.r.t. the predicted class $\hat{y}$, we rely on the semantic similarity between each $x_i$ and the nearest class vector $v_{\hat{y}}$: we compute the cosine similarity between the embedding vector of each input word and $v_{\hat{y}}$. A word with high semantic similarity to $v_{\hat{y}}$ contributes strongly to the predicted class.
• Document clustering: Every text instance is clustered around its class semantic vector by computing the mean (x-axis) and the standard deviation (y-axis) of the elements of the vector $(\bar{x} + \tanh(v_{\hat{y}}))/2$, where $v_{\hat{y}}$ is the nearest CSV to the input document. We found that the resulting 2-D points (mean, standard deviation) for instances belonging to a given class lie close to each other and far from those of instances belonging to other classes. The merit is that we do not need a separate clustering algorithm.
• End-user interaction through phrase feature importance: Word feature importance is sometimes not enough to explain a model's prediction, and the end-user may want to query the classifier with different types of questions. For example, when the model shows the feature importance (in sentiment classification) of the individual words "not," "too," and "bad," an end-user might also be interested in the importance of the phrase "not too bad," which cannot be calculated by simply merging the three individual feature importance values. Our approach can give feedback on a user's query about a phrase: to obtain the feature importance of a phrase, we average the embedding vectors of the words in the phrase and then compute the cosine similarity w.r.t. the predicted CSV $v_{\hat{y}}$.
• Semi-counterfactual explanation: Our approach can also provide a semi-counterfactual explanation, i.e., explain a semi-causal situation (what kind of features prevent the classifier from changing the prediction to another class). We provide a feature importance value w.r.t. non-predicted classes by calculating the cosine similarity between the embedding vector of each word/phrase and the class semantic vector of a non-predicted class. Through this semi-counterfactual explanation, the user can reason that "if the feature X had not occurred, the class prediction would have changed."
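All of the explanation modes above reduce to cosine computations against the CSVs. The following is a minimal NumPy sketch; the helper names are ours and the toy vectors are illustrative, not learned embeddings.

```python
import numpy as np

def cos_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def word_importance(word_emb, csv):
    """Contribution of a single word w.r.t. a class semantic vector."""
    return cos_sim(word_emb, csv)

def phrase_importance(word_embs, csv):
    """Phrase contribution: similarity of the averaged phrase embedding."""
    return cos_sim(word_embs.mean(axis=0), csv)

def cluster_point(x_bar, csv):
    """2-D clustering coordinates: mean and std of (x_bar + tanh(csv)) / 2."""
    z = (x_bar + np.tanh(csv)) / 2.0
    return z.mean(), z.std()

# semi-counterfactual use: score the same word against a NON-predicted
# class's CSV to see what holds the current prediction in place
v_pos = np.array([1.0, 0.0])     # toy CSV for the predicted class
v_neg = np.array([0.0, 1.0])     # toy CSV for a non-predicted class
w = np.array([0.8, 0.2])         # toy word embedding
```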

Generalizing to other models
Our method can be adapted to a variety of architectures such as Bi-LSTM, GRU, and RNNs, as it requires access only to the last hidden state (feature vector) and the embedding layer of the network. A further restriction is that the feature vector used in the output layer must have the same dimension as the embedding vectors. For the Transformer, to handle this dimensionality issue, we average the Transformer's representations before the output layer and use the result as the feature vector.

Datasets
The summary of the datasets is shown in Table 1. IMDB reviews were proposed by (Maas et al., 2011) for sentiment classification from movie reviews. It consists of two classes, i.e., positive and negative sentiments.
AGnews was proposed by (Zhang et al., 2015a) for researchers to test machine learning models for news classification. It consists of four classes (sports, world, business, and sci/tech).
HealthLink was constructed by Alberta Health Services, Canada. It contains text transcripts written in real time by registered nurses while talking with callers to the Tele-Health service. It consists of two classes ("go to hospital" and "home care"), each of which can be subdivided into sub-classes. This dataset is available upon request.
Transformer employs a multi-head self-attention mechanism based on scaled dot-product attention. We use only the encoder layers and average the resulting representations before the classification output layer.
IndRNN is an improvement over RNNs, where neurons in the same layer are independent of each other and connected across layers. We use the last hidden state as the feature vector.
Bi-LSTM employs an attention-based bidirectional mechanism on the LSTM network, which captures the salient semantic information (word attention) in a sentence. These attention weights enable the network to attend differently to more and less critical content when learning the representation. The last hidden state is used for classification.
Hierarchical attention provides two levels of attention mechanisms, applied at the word and sentence level. In this paper, we use a sentence-level attention mechanism applied on a Bi-LSTM. The feature vector for classification is based on aggregating the hidden representation values (following the authors' implementation).
LSTM and GRU process the input word by word, and the last hidden state is used as the feature vector for classification.

Network configuration and training
We tokenize sentences and build the vocabulary from the top N words appearing across the instances. We did not use pre-trained embeddings; the embedding layer was randomly initialized, as were the CSVs. We did not tune hyper-parameters on the validation set, as we are not aiming for state-of-the-art predictive accuracy; rather, we want to show that our method achieves similar or better performance than the black-box while providing better explanations than existing approaches. The word embedding, semantic vector, and feature vector (output-layer) dimensions are all 128. Each network is trained with the Adam optimizer (Kingma and Ba, 2017), a batch size of 64, and a learning rate of 0.0001. We also use dropout with probability 0.5.
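For reference, the configuration above can be collected in one place. This is a plain dictionary sketch; the key names are our own, not from any particular framework.

```python
# Training configuration as stated in this section
config = {
    "embedding_dim": 128,          # word embedding, CSV, and feature vector size
    "csv_dim": 128,                # must equal embedding_dim for cosine distance
    "optimizer": "Adam",
    "batch_size": 64,
    "learning_rate": 1e-4,
    "dropout": 0.5,
    "pretrained_embeddings": None, # embeddings and CSVs randomly initialized
}

# the cosine-distance terms of Eq. 2 require matching dimensions
assert config["csv_dim"] == config["embedding_dim"]
```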

Classifier performance
We trained six different model architectures on four datasets and tried different values for the weight of each proposed loss term. The results in Table 2 show that our semantic distillation approach captures more useful information from the training data than the baseline black-box: our smaller model outperforms the black-boxes on all datasets. The new optimization problem does not hurt the performance of the black-box model itself (see BBO, Black-Box with our new Objective function, in Table 2).

Explainability
In this part of our experiments, we focus on local explanations for text classification, i.e., explaining the output made by our proposed nearest-neighbor classifier using CSVs for an individual instance. Local explanations should exhibit high local fidelity, i.e., they should match the underlying model's behavior. For the black-box models, we followed the implementation proposed by the authors of each baseline. We evaluate our technique against the following methods:
• Random. A random selection of words from the input sentence.
• LIME (Ribeiro et al., 2016) is a model-agnostic approach that trains an interpretable model, such as a linear model, on instances created around the specific data point by perturbing the data. In our evaluation, we trained the linear classifier using ∼5000 samples.
We show the effectiveness of our method in explaining predictions from three architectures (Transformer, IndRNN, and hierarchical attention network) in Figures 3-8.

Automatic evaluation. We use model-agnostic evaluation metrics to demonstrate the effectiveness of our approach. (Nguyen, 2018) found that human evaluation correlates moderately with automatic evaluation metrics for local explanations. Hence, we use automatic evaluation to verify whether our explanations are faithful to what the model computes. We measure local fidelity by deleting words in order of their estimated importance for the prediction and evaluating the change in F1 score relative to the case where no word is deleted. This evaluation is similar to other metrics used for model interpretation (Nguyen, 2018; Arras et al., 2017), except that we use F1 instead of classification accuracy. Results are shown in Figures 3-4; we obtained the plots by measuring the effect of word deletions and reporting the F1 when the classifier's prediction changes. A larger drop in F1 indicates that the method identified the words contributing most to the class predicted by our classifier. Figures 3, 4, and 5 clearly show that our approach identifies the most salient features better than LIME. Note that LIME requires probabilities as the classifier's output, so we convert the outputs of our nearest-neighbor classifier into valid probabilities.

Change in log-odds. Another automatic metric for evaluating explainability methods is the change in the log-odds ratio of the output probabilities. This metric has been used for model explanations (Shrikumar et al., 2017b; 2018). We normalize the cosine distances into valid probability distributions. This metric requires no knowledge of the underlying feature representation; it requires access only to the instances.
The log-odds ratio is a fine-grained approach, as it uses actual probability values instead of the predicted label used in the previous experiment: instead of tracking the change in F1, we observe the change in probabilities. We mask the top k features ranked by semantic similarity, replacing the masked words with zero padding, then feed the input and measure the drop between the target class's probability with no word deleted and with k words removed. The results, shown in Figures 6-8, reveal the effectiveness of our approach in capturing the words that affect the classifier's prediction; our method delivers more insightful explanations than LIME.

Interactive explanations. In some cases, end-users are interested in the contribution of phrases instead of words. An end-user might also want the contribution of a word/phrase w.r.t. classes other than the predicted one (semi-counterfactual explanation). Our model's results in support of these interests are shown in Table 3: our technique identifies the contributions of phrases rather than only words, and it can provide evidence w.r.t. other classes. For example, our method recognizes that "bad cough" has a stronger semantic contribution than "cough" w.r.t. the label "going to the hospital," and it distinguishes "mild chest pain" from "chest pain." In sentiment analysis, our method understands that "good" contributes to a positive sentiment while "not good" contributes to a negative sentiment; note also that "very good" contributes more strongly to a positive sentiment than "good."

Clustering textual data. Another feature of the CSV classifier is its ability to cluster documents via CSVs without using dimensionality reduction techniques such as PCA or clustering algorithms.
Results of document clustering based on the knowledge distilled from the Transformer on IMDB and AGnews are shown in Figures 9 and 10. The clusters explain our classifier's behavior and hence provide a global explanation of the model's predictions. We also show the critical role of the pairwise Euclidean distance in our classification by clustering sentences into their predicted classes.

Parameter reduction. In Table 4, we compare the number of parameters used by our nearest-neighbor classifier and by the black-box approaches on the HealthLink data. Our compressed classifier uses fewer parameters than each black-box: it relies only on the embeddings and the CSVs, and the remaining layers are dropped. The number of parameters of the proposed classifier is the same for all architectures, because our classifier keeps the same embedding-layer and CSV sizes for each black-box architecture. Our model also reduces the inference time from 0.037-0.085 seconds to 0.007 seconds, as shown in Table 4.

Semantics. We compare our method's performance with and without capturing the semantic information (Equation 2). The results in Table 5 show the importance of encoding semantics into the class-discriminative vectors.

Analyzing words. We are interested in what kind of words contribute most to the class prediction. For this analysis, we exploit the word-level sentiment annotation (Opinion Lexicon) provided by Liu 2 to track the top 10 words whose importance was highest when predicting the sentiment class in the IMDB dataset. We evaluated the number of words contributing to each of the negative and positive sentiments on 1000 movie reviews. Table 6 shows that our approach identifies more salient words leading to correct sentiment classification, i.e., our method picks better sentiment lexicons than LIME.
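Both automatic fidelity metrics used in our evaluation can be sketched against the nearest-neighbor classifier. The following NumPy toy is illustrative only: the softmax normalization of cosine similarities is our assumption about how distances become valid probabilities, the data is synthetic, and a single-instance prediction check stands in for the dataset-level F1 measurement.

```python
import numpy as np

def cos_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predict_probs(word_embs, csvs):
    """Nearest-neighbor scores turned into probabilities (softmax over similarities)."""
    x_bar = word_embs.mean(axis=0)
    sims = np.array([cos_sim(x_bar, v) for v in csvs])
    e = np.exp(sims - sims.max())
    return e / e.sum()

def deletions_until_flip(word_embs, csvs):
    """Word-deletion fidelity: remove words in importance order until the prediction flips."""
    y_hat = int(np.argmax(predict_probs(word_embs, csvs)))
    # rank words by similarity to the predicted CSV (most important first)
    order = np.argsort([-cos_sim(w, csvs[y_hat]) for w in word_embs])
    kept = list(range(len(word_embs)))
    for n, idx in enumerate(order, start=1):
        kept.remove(int(idx))
        if not kept or int(np.argmax(predict_probs(word_embs[kept], csvs))) != y_hat:
            return n              # words deleted before the prediction changed
    return len(word_embs)

def log_odds_drop(word_embs, csvs, k):
    """Mask the top-k most important words (zero padding) and compare log-odds."""
    p0 = predict_probs(word_embs, csvs)
    y_hat = int(np.argmax(p0))
    order = np.argsort([-cos_sim(w, csvs[y_hat]) for w in word_embs])
    masked = word_embs.copy()
    masked[order[:k]] = 0.0       # zero padding replaces the masked words
    p1 = predict_probs(masked, csvs)
    odds = lambda p: np.log(p / (1.0 - p))
    return odds(p0[y_hat]) - odds(p1[y_hat])

csvs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc = np.array([[0.9, 0.1], [0.7, 0.4], [0.1, 0.6]])
```

A faithful importance ranking should make the prediction flip after few deletions and produce a large positive log-odds drop.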

Discussion
We have shown that semantic information can be extracted and used to create a simple interpretable/explainable classifier that performs better than the target black-box models. This simple classifier has the following properties: • It captures the discriminative representations encoded by the black-box and encodes them in the CSVs.
2 https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon

• For text classification, the distance between the text input and the CSV of the correct class is the lowest, and is higher for the CSVs of the incorrect classes.

Conclusion and Future Work
We have explored an approach that distills knowledge concurrently from any black-box model to produce a simple, explainable classifier. The distilled model achieves better results than the original black-box models in terms of predictive performance, and we showed that it provides better explanations than LIME. We have also proposed new types of explanations: first, a user can query with phrases of any length and receive feedback about the phrase's contribution to the classes; second, we provide word (feature) importance for non-predicted classes, which can be used as a semi-counterfactual explanation; third, we showed how to cluster the documents without employing an existing clustering method.
In future work, we would like to extend this idea to pre-trained networks, and we plan to investigate the value of counterfactual explanations more deeply.