HILDIF: Interactive Debugging of NLI Models Using Influence Functions

Biases and artifacts in training data can cause unwelcome behavior in text classifiers (such as shallow pattern matching), leading to a lack of generalizability. One solution to this problem is to include users in the loop and leverage their feedback to improve models. We propose a novel explanatory debugging pipeline called HILDIF, enabling humans to improve deep text classifiers using influence functions as an explanation method. We experiment on the Natural Language Inference (NLI) task, showing that HILDIF can effectively alleviate artifact problems in fine-tuned BERT models and result in increased model generalizability.


Introduction
Given two sentences, a premise and a hypothesis, Natural Language Inference (NLI) is the task of determining whether the premise entails the hypothesis, and it has long been considered a sign of language understanding (Condoravdi et al., 2003; Dagan et al., 2005). Although recent deep learning models have been shown to achieve good performance on various NLI datasets, they have also been shown, as in other tasks, to learn shallow heuristics. For example, a model is very likely to predict entailment for all hypotheses constructed from words in the premise (McCoy et al., 2019). A key challenge is therefore to understand when and why state-of-the-art NLI models fail and to mitigate the problems accordingly.
To expose this kind of pathology, one can use explanation techniques to understand how a black box model makes particular predictions. For instance, feature attribution methods explain a prediction by identifying the parts of the input that contribute most to it (Smilkov et al., 2017; Sundararajan et al., 2016; Ribeiro et al., 2016; Lundberg and Lee, 2017). Further, example-based methods, such as influence functions (Koh and Liang, 2017), identify the training data points that are most important for particular predictions. Existing works have proposed ways to improve models by incorporating human feedback, in response to the explanations, by: adding model constraints by fixing certain parameters (Stumpf et al., 2009; Lertvittayakumjorn et al., 2020), adding training samples (Teso and Kersting, 2019), and adjusting models' weights directly (Kulesza et al., 2015).
In this paper, we propose a novel interactive model debugging pipeline called HILDIF (Human In the Loop Debugging using Influence Functions). With the NLI task as a target, we use influence functions as an explanation method to help users understand the model's reasoning via influential training examples. Then, for each influential example shown, the user provides feedback that is used to create augmented training samples for fine tuning the model. Using HILDIF, we effectively mitigate artifact issues of BERT models (Devlin et al., 2019) trained on the MNLI dataset (Williams et al., 2018) and tested on the HANS dataset (McCoy et al., 2019), which is a known pathological setting for most deep NLI models working on English. Our code can be found at https://github.com/hugozylberajch/HILDIF.


Related Work

Influence functions identify the training data points most responsible for a particular prediction by estimating how the model would change if each training point were upweighted (Koh and Liang, 2017). They are particularly useful when feature attribution scores are not sufficient to illustrate how the model reasons. In the NLI task, for example, single input words may not suffice to explain a certain prediction, and the overall semantics and structures in the input may be needed.
Recently, Han et al. (2020) showed that influence functions can capture key fine-grained interactions among input words and detect the presence of artifacts that lead to incorrect NLI predictions.
Although very appealing, influence functions are computationally expensive. Hence, Koh and Liang (2017) reduced computational complexity by using the LInear time Stochastic Second order Algorithm (LISSA) for calculating approximations. Guo et al. (2020) proposed FASTIF, which further speeds up the calculation using the k-nearest neighbors algorithm. They also fine tuned the model with influential training samples of anchor points (i.e., some data points in the validation set) to correct model errors. We will use FASTIF as a tool to explain BERT model's predictions on the NLI task in our experiment.
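Concretely, Koh and Liang (2017) approximate the influence of upweighting a training point z on the loss at a test point z_test as:

```latex
\mathcal{I}(z, z_{\text{test}}) \;=\; -\,\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top} \, H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z, \hat{\theta})
```

where \hat{\theta} are the learned parameters, L is the loss, and H_{\hat{\theta}} is the Hessian of the empirical training loss at \hat{\theta}. The inverse-Hessian-vector product is the expensive part; LISSA approximates it stochastically, and FASTIF additionally restricts the candidate training points via k-nearest neighbors.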
Explanatory Interactive Debugging, where we improve a model by leveraging user feedback after presenting explanations for model predictions, was first introduced using simple statistical models such as Naïve Bayes models or Support Vector Machines with simple explanatory techniques (Stumpf et al., 2009). Recently, explanatory debugging has been applied to more complex models using refined interpretability methods. In FIND (Lertvittayakumjorn et al., 2020), a masking matrix is added at the end of a CNN text classifier so as to disable particular CNN filters based on human feedback in response to LRP-based explanations (Arras et al., 2016). In CAIPI (Teso and Kersting, 2019), the user investigates and corrects a LIME-based explanation (Ribeiro et al., 2016) for each prediction. Then additional training samples, created based on the correction, are used to fine tune the model. For more details on explanatory debugging, we refer interested readers to the survey by Lertvittayakumjorn and Toni (2021).
As in CAIPI, we exploit user feedback to control the generation of augmented samples for fine tuning the model. However, our explanations are influential training samples, which are more suitable for explaining NLI predictions. This is an improvement over Guo et al. (2020), who simply fine tuned the model on influential samples without involving human feedback.

HILDIF
We propose in Algorithm 1 a new pipeline called HILDIF (Human In the Loop Debugging with Influence Functions) for debugging deep text classifiers using influence functions.

Algorithm 1: HILDIF. L is a labeled training set, V is a labeled validation set, T is the number of iterations, and g is a data augmentation method.

To the best of our knowledge, this is the first interactive explanatory debugging algorithm that makes effective use of influence functions. To improve a model f using HILDIF, a set of anchor points X = (x_1, x_2, ..., x_n) is first selected from the validation dataset V, and the predictions Ŷ = (ŷ_1, ŷ_2, ..., ŷ_n) are computed using the model f. Then, for each anchor point x_i, we use FASTIF to identify p influential training samples Z_i = (z_i1, z_i2, ..., z_ip), and we define Z as the collection of Z_i for all x_i ∈ X. Next, for each pair (x_i, z_ij), i ∈ {1, ..., n}, j ∈ {1, ..., p}, the user gives a similarity score s_ij, which is used to generate synthetic data via a data augmentation function g. Finally, the model is fine tuned on the newly generated data samples.
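The loop above can be sketched as follows. The function name and the stub interfaces are our own illustration (not the authors' code): `get_influential` stands in for FASTIF, `get_similarity` for the user feedback step, and `augment` for the data augmentation function g.

```python
from typing import Callable, List, Sequence

def hildif_iteration(
    anchors: Sequence,                          # anchor points x_i from the validation set
    get_influential: Callable[[object], List],  # x_i -> p influential samples z_ij (FASTIF)
    get_similarity: Callable[[object, object], int],  # (x_i, z_ij) -> user score s_ij in 1..5
    augment: Callable[[object, int], List],     # (z_ij, s_ij) -> augmented samples
) -> List:
    """One HILDIF iteration: for every anchor point, fetch its influential
    training samples, ask the user for a similarity score, and expand each
    sample into augmented data for fine tuning."""
    augmented = []
    for x in anchors:
        for z in get_influential(x):
            s = get_similarity(x, z)            # user feedback
            augmented.extend(augment(z, s))     # e.g. 10 * s_ij copies (Section: Data Augmentation)
    return augmented
```

The returned list is then used to fine tune the model f, and the whole procedure can be repeated T times.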
Next, we explain, in detail, each step of HILDIF, including explanation generation, user feedback collection, and data augmentation.
Explanation Generation. From the validation set V, we can either select anchor points randomly or handpick some that contain particular heuristics we want to debug. After that, the user is presented with a list of the top-p most negatively influential training data points for each anchor point. These influential data points contribute to decreasing the model's loss when upweighted. Hence, fine-tuning the model using these data points should improve model performance, as studied by Guo et al. (2020). However, since HILDIF relies on FASTIF, which only approximates influence scores, we hypothesize that we can achieve better performance by asking humans to assess the relevance of the influential training samples returned by FASTIF before fine-tuning.
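Selecting the top-p points can be sketched as below; the helper name is ours, and the sign convention follows the text (a negative influence score means upweighting the point decreases the loss on the anchor point):

```python
from typing import List, Sequence

def top_negative_influences(scores: Sequence[float], p: int) -> List[int]:
    """Return indices of the p most negatively influential training points,
    i.e. the p smallest influence scores in ascending order."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[:p]
```

These indices are the training samples shown to the user for the corresponding anchor point.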
User Feedback Collection. For each anchor point x_i and corresponding influential sample z_ij, the user is asked the question: "The test case and the presented sample are: (1) Very different; (2) Different; (3) Can't decide; (4) Similar; (5) Very similar", and answers by selecting a radio button. Then z_ij obtains a similarity score s_ij from 1 to 5 based on the user's answer. Similar in this context means that both samples share the same type of heuristics or lexical artifacts.

Data Augmentation.
To create an augmented sample for the NLI task, we have to make sure that the overall semantics of the premise and the hypothesis, as well as the overall relation between the two sentences, are preserved. We therefore choose random word replacement with synonyms as well as back translation for data augmentation, since neither changes the semantics of the sentences. Moreover, we found empirically, by testing different configurations, that generating 10 × s_ij augmented samples for the influential sample z_ij yielded the best results. For instance, an influential sample with the score 3 leads to 30 augmented samples with the same label as the original sample.

Experimental Setup
Datasets and Models. We evaluate our pipeline with a pretrained BERT-base cased model. We use the MNLI dataset (Williams et al., 2018) for training and validation, and the HANS dataset, on which BERT is known to perform poorly (McCoy et al., 2019), for testing. For the MNLI training and validation sets, we merge the classes neutral and contradiction into a single non-entailment class, following the HANS dataset's setting. HANS targets three heuristics of NLI and includes examples showcasing them: Lexical Overlap, where the hypothesis is constructed with words from the premise; Constituent, where the hypothesis is a subtree of the premise's parse tree; and Subsequence, where the hypothesis is a contiguous subsequence of the premise (see Table 1 in the Appendix for some examples). NLI models almost always predict entailment for any example containing these heuristics although sometimes the correct label is non-entailment. So, our goal is to make the model better at detecting non-entailment cases while maintaining its performance on the entailment cases. For the overall performance, we chose accuracy as our evaluation metric because HANS is a balanced dataset (containing, for each subgroup of heuristics, 5,000 samples of the entailment class and 5,000 samples of the non-entailment class).

Implementation Details. All our models are implemented using the PyTorch library and trained using the AdamW optimizer. The HANS dataset is held out during training and fine-tuning and is only used for testing. For computing influence functions, we use the FASTIF algorithm and the FAISS library (Johnson et al., 2019) for k-nearest neighbors search. Finally, we ran all our experiments on a single 12GB NVIDIA Tesla K80 GPU. With this setting, computing the influence scores of 5,000 training points for a given anchor point takes approximately seven minutes. The BERT-base model is trained for two epochs on the MNLI training dataset.
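The label merge described above can be sketched as a one-line mapping (the helper name is ours):

```python
def to_hans_label(mnli_label: str) -> str:
    """Collapse MNLI's three-way labels to HANS's binary scheme:
    neutral and contradiction are both mapped to non-entailment."""
    return "entailment" if mnli_label == "entailment" else "non-entailment"
```

Applying this mapping to every MNLI training and validation example makes the label space consistent with HANS before training.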
Regarding user feedback collection, due to human resource constraints, we ran our interactive experiments with one expert user. Further experiments could be conducted with more users, and the scores for the same pair of anchor point and influential point could be aggregated in order to reduce human bias.
Comparison. We experimented with T = 1, using five anchor points with 10 and 20 influential samples each. We introduce three binary propositions that define the debugging pipeline: HS: human scoring, DA: data augmentation, and H: handpicked anchor points. Without human scoring (¬HS), every influential sample receives a score of 5. Without data augmentation (¬DA), the fine tuning is done on each influential sample only, and without handpicked anchor points (¬H), anchor points are selected randomly. Note that our handpicked anchor points were chosen among the validation samples that contain either the lexical overlap or the subsequence heuristic (see Table 2 in the Appendix). We compared the performance of eight different configurations of debugging algorithms that stem from these three binary propositions. For each configuration, we trained and improved three models using different random seeds and averaged the final performance on the test set. Note that the (¬HS, ¬DA, ¬H) configuration is the algorithm used in Guo et al. (2020), whereas the (HS, DA, H) and (HS, DA, ¬H) configurations are our HILDIF algorithm. We can see that HILDIF consistently achieved a higher accuracy in all three categories of heuristics for the non-entailment class and a slightly lower accuracy for the entailment class. In fact, we observe a trade-off between the accuracies of the two classes in all four configurations. However, HILDIF still obtained a higher overall accuracy on the HANS dataset than the baseline and the other configurations. Moreover, interactive debugging with human scores yielded better accuracies than debugging without human scores for the Lexical Overlap and Subsequence categories. Meanwhile, on the Constituent category, handpicking anchor points with the targeted heuristic led to a big jump in accuracy that outperformed the configuration with human feedback but random anchor points.
Therefore, incorporating human knowledge from the anchor-selection step onwards is also helpful when we have prior knowledge about the model's bugs. Note also, in Figure 1a, that the model accuracies on MNLI for HILDIF and the other configurations stay close to the baseline model's accuracy, as desired.

Results
The table in Figure 1c shows that fine tuning the model with augmented data samples, instead of the influential samples only, gave better results in most cases. This was likely because data augmentation could help prevent the model from overfitting the influential samples. Besides, there was little to no improvement in the model performance when we added user feedback (i.e., human scores) for random anchor points but a substantial improvement for handpicked anchor points. This can be because, during user feedback collection, most of the data samples are difficult to compare as they either satisfy several heuristics or no heuristics relevant to the NLI task. When looking at some influential samples for handpicked anchor points, most satisfy the same heuristic and, if not, they can be easily spotted by human eyes. Although we are still far from chance performance on the non-entailment class, HILDIF achieved a substantial increase in accuracy with just five anchor points.

Conclusion
We introduced HILDIF, an interactive explanatory debugging pipeline for deep text classifiers, and ran experiments on the NLI task, achieving high accuracies with MNLI-trained BERT across all categories of the pathological HANS dataset. Future work includes enhancing the data augmentation step, for instance using a variational autoencoder or a GPT-2 based generative model for synthetic data generation. Also, with more human resources, experiments can be conducted by fine tuning for more than one iteration (T > 1) with more anchor points in each iteration. Finally, it would be interesting to apply HILDIF to other text classification tasks, given that, except for the handpicked anchor points, which are chosen with knowledge of the task, every step of the pipeline is task-independent.