Faithfulness Tests for Natural Language Explanations

Explanations of neural models aim to reveal a model’s decision-making process for its predictions. However, recent work shows that current explanation methods, such as saliency maps or counterfactuals, can be misleading, as they are prone to present reasons that are unfaithful to the model’s inner workings. This work explores the challenging question of evaluating the faithfulness of natural language explanations (NLEs). To this end, we present two tests. First, we propose a counterfactual input editor for inserting reasons that lead to counterfactual predictions but are not reflected by the NLEs. Second, we reconstruct inputs from the reasons stated in the generated NLEs and check how often they lead to the same predictions. Our tests can evaluate emerging NLE models, providing a fundamental tool in the development of faithful NLEs.


Introduction
Explanations of neural models aim to uncover the reasons behind model predictions in order to provide evidence on whether the model is trustworthy. To this end, explanations have to be faithful, i.e., reflect the decision-making process of the model; otherwise, they can be harmful (Hancox-Li, 2020). However, recent studies show that explanations can often be unfaithful, concealing flaws and biases of the model. Adebayo et al. (2018) show that certain widely deployed explainability approaches that provide saliency maps (with importance scores for each part of the input, e.g., words or super-pixels) can even be independent of the training data or of the model parameters. Others also question the effectiveness and reliability of counterfactuals (Slack et al., 2021), concept activations, and training point ranking explanations (Adebayo et al., 2022).
In this work, we investigate the degree of faithfulness of natural language explanations (NLEs), which explain model predictions with free text. NLEs are not constrained to contain only input segments; thus, they provide more expressive (Camburu et al., 2021) and usually more human-readable explanations than, e.g., saliency maps (Wiegreffe and Marasovic, 2021). Evaluating the faithfulness of explanations is very challenging in general, as the ground-truth reasons used by a model for a prediction are usually unknown. Evaluating the faithfulness of NLEs is further complicated, as they consist of free text, often including words not present in the input. Thus, existing tests for evaluating other types of explanations, e.g., saliency maps, cannot be directly applied to NLEs. As a stepping stone towards evaluating how faithful NLEs are, we design two tests. Our first test investigates whether NLE models are faithful to reasons for counterfactual predictions. We introduce a counterfactual input editor that makes counterfactual interventions resulting in new instances on which the model prediction changes but the NLE does not reflect the intervention leading to the change. Our second test reconstructs an input from the reasons stated in a generated NLE and checks whether the new input leads to a different prediction. We apply our tests to four NLE models over three datasets. We aim for our tests to be an important tool for assessing the faithfulness of existing and upcoming NLE models.

Premise: Many people standing outside of a place talking to each other in front of a building that has a sign that says 'HI-POINTE.' Hypothesis: The people are having a chat before going into the work building. Prediction: neutral. NLE: Just because people are talking does not mean they are having a chat.
➜Premise: People are talking. ➜Hypothesis: They are having a chat. ✗ Prediction: entailment. NLE: People are talking is a rephrasing of they are having a chat. Unfaithfulness cause: The reasons in the NLE for the original instance lead to a different prediction.

Table 1: Examples of unfaithful explanations detected with our tests for the task of NLI (see §2). We apply the tests on an original instance (second column), which results in a new instance (third column). The parts of the input changed by the test are marked with ➜, and the intervention made by the test is in blue. ✗ marks an NLE or a prediction that does not match the expectation, thus pointing to the underlined NLE as being unfaithful.

Table 2: Results of the counterfactual test. For each model (§3), we include the results of the random baseline (Rand) and of the counterfactual editor (Edit). The "% Counter" column indicates the editor's success in finding inserts that change the model's prediction. "% Counter Unfaith" presents the percentage of instances where the inserted text was not found in the associated NLE, among the instances where the prediction was changed. "% Total Unfaith" presents the percentage of instances where the prediction was changed and the inserted text was not found in the associated NLE, among all the instances in the test set. The highest rates of success in each pair of (Rand, Edit) tests are in bold. The highest total percentage of detected unfaithful NLEs for each dataset is underlined.

2.1 The Counterfactual Test: Are NLE models faithful to reasons for counterfactual predictions?

Humans usually seek counterfactuals by looking for factors that explain why event A occurred instead of event B (Miller, 2019). Counterfactual explanations have been proposed for ML models by making interventions either on the input (Wu et al., 2021; Ross et al., 2021) or on the representation space (Jacovi et al., 2021).
For our test, we search for interventions that insert tokens into the input such that the model gives a different prediction, and we check whether the NLE reflects these tokens. Thus, we define an intervention h(x_i, y^C_i) = x'_i that, for a given counterfactual label y^C_i, generates a set of words W = {w_j} that, inserted into x_i, produces a new instance x'_i. While one can insert each word in W at a different position in x_i, here we define W to be a contiguous set of words, which is computationally less expensive. As W is the counterfactual for the change in prediction, at least one word from W should be present in the NLE e'_i generated for the counterfactual prediction:

W ∩^s e'_i ≠ ∅  (1)

where the s superscript indicates that the intersection operator is applied at the semantic level. Sample counterfactual interventions satisfying Eq. 1 are in Table 1. More examples are in Tables 4 and 5 in the Appendix.
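At the syntactic level, the check in Eq. 1 reduces to word overlap between the inserted words W and the NLE generated for the counterfactual prediction. The sketch below is our own minimal illustration; the tokenization and normalization choices are assumptions, not the exact implementation.

```python
def insert_contiguous(tokens, position, insert):
    """Insert the contiguous word list `insert` into `tokens` at `position`,
    producing the edited instance x'_i."""
    return tokens[:position] + insert + tokens[position:]

def nle_reflects_insert(nle, insert):
    """Syntactic-level version of Eq. 1: does the NLE for the counterfactual
    prediction contain at least one of the inserted words?"""
    nle_words = {w.strip(".,!?;:'\"").lower() for w in nle.split()}
    return any(w.lower() in nle_words for w in insert)
```

If the prediction flips but `nle_reflects_insert` returns False, the instance counts as unfaithful under the counterfactual test.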
To generate the input edits W , we propose an editor h as a neural model and follow Ross et al. (2021). The authors generate input edits that change the model prediction to target predictions and refer to these edits as explanations. We note that apart from the input edits, there could be confounding factors causing the change in prediction, e.g., the edits could make the model change its focus towards other parts of the input and not base its decision on the edit itself. In this work, we presume that it is still important for the NLEs to point to the edits, since the model changed its prediction when the edit was inserted. This is in accordance with the literature on counterfactual explanations, where such edits are seen as explanations (Guidotti, 2022). We also hypothesize that such confounding factors are rare, especially when insertions rather than deletions are performed. We leave such investigation for future work.
During the training of h, we mask n_1 consecutive tokens in x_i, provide as input to h the label predicted by the model, i.e., y^C_i = y_i, and use the masked tokens to supervise the generation of the masked text (corresponding to W). During inference, we provide as target labels y^C_i ∈ Y, y^C_i ≠ y_i, and we search over n_2 different positions, inserting n_3 candidate tokens at each position at a time. The training objective is the cross-entropy loss for generating the inserts.
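The inference-time search can be sketched as follows. Here, `model` and `editor_generate` are hypothetical interfaces standing in for the NLE model f and the trained editor h; they are not the authors' actual code.

```python
import random

def find_counterfactual_insert(tokens, predicted_label, labels, model,
                               editor_generate, n_positions=4, n_candidates=4,
                               seed=0):
    """Search for a contiguous insert W that flips the model's prediction.

    model(tokens) -> (label, nle); editor_generate(masked_tokens, target_label,
    n_candidates) -> list of candidate inserts (each a list of words).
    Returns (edited_tokens, insert, new_label, new_nle) or None.
    """
    rng = random.Random(seed)
    positions = rng.sample(range(len(tokens) + 1),
                           min(n_positions, len(tokens) + 1))
    for target in labels:
        if target == predicted_label:
            continue  # only counterfactual target labels y^C_i != y_i
        for pos in positions:
            masked = tokens[:pos] + ["<mask>"] + tokens[pos:]
            for insert in editor_generate(masked, target, n_candidates):
                edited = tokens[:pos] + insert + tokens[pos:]
                new_label, new_nle = model(edited)
                if new_label != predicted_label:
                    return edited, insert, new_label, new_nle
    return None
```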
We use as a metric of unfaithfulness the percentage of instances in the test set for which h finds counterfactual interventions that satisfy Eq. 1. To compute this automatically, we apply ∩^s at the syntactic level, i.e., as exact word overlap. As paraphrases of W might appear in the NLEs, we also manually verify a subset of NLEs. We leave the introduction of an automated evaluation at the semantic level for future work.
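Given per-instance outcomes of the search, the quantities reported in Table 2 follow directly. A small sketch (the function and record names are ours):

```python
def unfaithfulness_rates(records):
    """records: one (prediction_changed, insert_in_nle) pair per test instance.
    Returns (% Counter, % Counter Unfaith, % Total Unfaith), as in Table 2."""
    total = len(records)
    counter = [r for r in records if r[0]]       # prediction was flipped
    unfaith = [r for r in counter if not r[1]]   # ...and the NLE misses the insert
    pct_counter = 100.0 * len(counter) / total if total else 0.0
    pct_counter_unfaith = 100.0 * len(unfaith) / len(counter) if counter else 0.0
    pct_total_unfaith = 100.0 * len(unfaith) / total if total else 0.0
    return pct_counter, pct_counter_unfaith, pct_total_unfaith
```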
Our metric is not a complete measure of the overall faithfulness of the NLEs, as (1) we only check whether the NLEs are faithful to the reasons for counterfactual predictions, and (2) it depends on the performance of h. However, if h does not succeed in finding a significant number of counterfactual reasons not reflected in the NLEs, this can be seen as evidence for the faithfulness of the model's NLEs.

Table 3: Results for the input reconstruction test. "% Reconst" shows the percentage of instances for which we managed to form a reconstructed input. "% Total Unfaith" shows the total percentage of unfaithful NLEs found among all instances in the test set of each dataset. The highest detected percentage of unfaithful NLEs for each dataset is in bold.

2.2 The Input Reconstruction Test: Are the reasons in an NLE sufficient to lead to the same prediction as the one for which the NLE was generated?

Existing work points out that for an explanation to be faithful to the underlying model, the reasons r_i in the explanation should be sufficient for the model to make the same prediction as on the original input (Yu et al., 2019):

f(r_i) = (e'_i, y_i), r_i = R(x_i, e_i)  (2)
where R is the function that builds a new input r_i given x_i and e_i. Sufficiency has been employed to evaluate saliency explanations, where the direct mapping between tokens and saliency scores allows r_i to be easily constructed (by preserving only the top-N most salient tokens) (DeYoung et al., 2020; Atanasova et al., 2020). For NLEs, which lack such a direct mapping, designing an automated extraction R of the reasons in e_i is challenging.
Here, we propose automated agents R that are task-dependent. We build such agents for e-SNLI (Camburu et al., 2018) and ComVE (Wang et al., 2020), owing to the structure of their NLEs and the nature of these datasets; however, we could not construct an R for CoS-E (Rajani et al., 2019). For e-SNLI, a large number of NLEs follow certain templates, even though this was not imposed during the collection of the NLEs. Camburu et al. (2020) provide a list of templates covering 97.4% of the NLEs in the training set. For example, "<X> is the same as <Y>" is an NLE template for entailment. Thus, many of the generated NLEs also follow these templates. In our test, we simply use the <X> and <Y> from the templates as the reconstructed premise and hypothesis, respectively. If the NLE for the original input was faithful, then we expect the prediction for the reconstructed input to be the same as for the original.
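For e-SNLI, the reconstruction can be implemented with regular expressions over the NLE templates. The two templates below are an illustrative subset, not the full list from Camburu et al. (2020):

```python
import re

# Illustrative subset of e-SNLI entailment templates; <X> becomes the
# reconstructed premise and <Y> the reconstructed hypothesis.
TEMPLATES = [
    re.compile(r"^(?P<x>.+) is the same as (?P<y>.+?)\.?$"),
    re.compile(r"^(?P<x>.+) is a rephrasing of (?P<y>.+?)\.?$"),
]

def reconstruct_esnli(nle):
    """R for e-SNLI: extract <X> and <Y> from a templated NLE.
    Returns (premise, hypothesis), or None if no template matches."""
    for pattern in TEMPLATES:
        m = pattern.match(nle.strip())
        if m:
            return m.group("x"), m.group("y")
    return None
```

Feeding the reconstructed pair back to the model, the prediction should match the original one if the NLE is faithful.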
Given two sentences, the ComVE task is to pick the one that contradicts common sense. If the generated NLE is faithful, then replacing the correct sentence (the one that accords with common sense) with the NLE should lead the model to the same prediction.
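This reconstruction can be sketched in a few lines; the interface and names are our own assumptions:

```python
def reconstruct_comve(sent1, sent2, selected, nle):
    """R for ComVE: keep the sentence the model selected as contradicting
    common sense and replace the other (common-sense) sentence with the NLE.
    `selected` is 0 if the model picked sent1, 1 if it picked sent2."""
    if selected == 0:
        return (sent1, nle)  # sent2 was the common-sense sentence
    return (nle, sent2)      # sent1 was the common-sense sentence
```

If the NLE is faithful, the model should again select the same sentence on the reconstructed pair.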

Experiments
Following Hase et al. (2020), we experiment with four setups for NLE models, which can be grouped by whether the prediction and the NLE generation are trained with a multi-task objective using a joint model (MT) or with single-task objectives using separate models (ST). They can also be grouped by whether they generate NLEs conditioned on the predicted label (rationalizing models (Ra)) or not conditioned on it (reasoning models (Re)). The general notation f(x_i) = (e_i, y_i) used in §2 covers all four setups: in the MT setups, f = f_{p,ex} is a joint model for both task prediction and NLE generation, while in the ST setups, f_p is a model only for task prediction and f_ex a separate model only for NLE generation. The ST-Ra setup produces one NLE e_{i,j} for each y_j ∈ L. Given e_{i,j} and x_i, f_p predicts the probability of the corresponding label y_j and selects as y_i the label with the highest probability.
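The ST-Ra selection step described above can be sketched as follows; `f_ex` and `f_p_score` are hypothetical stand-ins for the fine-tuned generation and scoring models:

```python
def st_ra_predict(x, labels, f_ex, f_p_score):
    """ST-Ra: generate one NLE per candidate label with f_ex, let f_p score
    each (input, NLE, label) triple, and select the highest-scoring label."""
    nles = {y: f_ex(x, y) for y in labels}
    y_hat = max(labels, key=lambda y: f_p_score(x, nles[y], y))
    return y_hat, nles[y_hat]
```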
For both f and the editor h, we employ the pre-trained T5-base model (Raffel et al., 2020). The editor uses task-specific prefixes for insertion and NLE generation. We train both f and h for 20 epochs, evaluate them on the validation set at each epoch, and select the checkpoints with the highest success rate (see §2). We use a learning rate of 1e-4 with the Adam optimizer (Kingma and Ba, 2014). For the editor, during training, we mask n_1 consecutive tokens with one mask token, where n_1 is chosen at random in [1, 3]. During inference, we generate candidate insertions for n_2 = 4 random positions, with n_3 = 4 candidates for each position at a time. The hyper-parameters are chosen with a grid search over the validation set. For the manual evaluation, an author annotated the first 100 test instances for each model (800 in total). This evaluation was designed in accordance with related work (Camburu et al., 2018), which also evaluated 100 instances per model. We found no instances where a paraphrase of the inserted words appeared in the NLE; hence, the automatic metric can be trusted.
Baseline. For the counterfactual test, we include a random baseline as a comparison. Specifically, we insert a random adjective before a noun or a random adverb before a verb. We randomly select n_2 = 4 positions where we insert such words and, for each position at a time, consider n_3 = 4 random candidate words. The candidates are single words randomly chosen from the complete lists of adjectives and adverbs available in WordNet (Fellbaum, 2010). We identify the nouns and verbs in the text with spaCy (Honnibal et al., 2020).

Datasets. We use three popular datasets with NLEs: e-SNLI (Camburu et al., 2018), CoS-E (Rajani et al., 2019), and ComVE (Wang et al., 2020). e-SNLI contains NLEs for SNLI (Bowman et al., 2015), where, given a premise and a hypothesis, one has to predict whether they are in a relationship of entailment (the premise entails the hypothesis), contradiction (the hypothesis contradicts the premise), or neutral (neither entailment nor contradiction holds). CoS-E contains NLEs for commonsense question answering, where, given a question, one has to pick the correct answer out of three given options. ComVE contains NLEs for commonsense reasoning, where, given two sentences, one has to pick the one that violates common sense.
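The random baseline can be sketched as follows. The paper identifies nouns and verbs with spaCy and draws candidates from WordNet; to keep this sketch self-contained, we substitute a pre-tagged input and tiny illustrative word lists, which are stand-ins only.

```python
import random

# Tiny stand-in word lists; the paper samples from the full WordNet
# adjective and adverb lists instead.
ADJECTIVES = ["red", "tiny", "ragged", "semi-formal", "happy"]
ADVERBS = ["quickly", "quietly", "barely", "often"]

def random_insertions(tokens, pos_tags, n_positions=4, n_candidates=4, seed=0):
    """Yield edited token lists with a random adjective inserted before a noun
    or a random adverb inserted before a verb."""
    rng = random.Random(seed)
    slots = [(i, ADJECTIVES) for i, t in enumerate(pos_tags) if t == "NOUN"]
    slots += [(i, ADVERBS) for i, t in enumerate(pos_tags) if t == "VERB"]
    rng.shuffle(slots)
    for i, word_list in slots[:n_positions]:
        for w in rng.sample(word_list, min(n_candidates, len(word_list))):
            yield tokens[:i] + [w] + tokens[i:]
```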

Results
Counterfactual Test. Table 2 shows the results of our counterfactual test. First, we observe that when the random baseline finds words that change the prediction of the model, the words are more often not found in the corresponding NLE compared to the counterfactual editor (% Counter Unfaith). We conjecture that this is because the randomly selected words are rare for the dataset compared to the words that the editor learns to insert. Second, the counterfactual editor is better at finding words that lead to a change in the model's prediction, which in turn results in a higher percentage of unfaithful instances in general (% Total Unfaith). We also observe that the insertions W lead to counterfactual predictions for up to 55% of the instances (for ST-Ra-Edit on CoS-E). For up to 44.04% of the instances (for ST-Re-Edit on CoS-E), the editor is able to find an insertion for which the counterfactual NLE is unfaithful. Table 1, row 1, presents one such example. More examples for the random baseline can be found in Table 4, and for the counterfactual editor in Table 5.
We see that for all datasets and models, the total percentages of unfaithfulness to counterfactuals are high, between 20.15% (for ST-Re-Rand on e-SNLI) and 44.04% (for ST-Re-Edit on CoS-E). We re-emphasize that this should not be interpreted as an overall estimate of unfaithfulness, as our test is not complete (see §2).

The Input Reconstruction Test. Table 3 shows the results of the input reconstruction test. We were able to reconstruct inputs for up to 4,487 of the 10K test instances in e-SNLI, and for all test instances in ComVE. There is, again, a substantial number of unfaithful NLEs: up to 14% for e-SNLI and up to 40% for ComVE. An example is in Table 1, row 2. More examples can be found in Table 6. We also notice that this test identifies considerably more unfaithful NLEs for ComVE than for e-SNLI, while for our first test the gap was not as pronounced. This shows the utility of developing diverse faithfulness tests.
Finally, all four types of models had similar faithfulness results on all datasets and tests (task accuracy and NLE quality are given in Table 7), with no consistent ranking among them. This opposes the intuition that some configurations may be more faithful than others. For example, Camburu et al. (2018) hypothesized that ST-Re may be more faithful than MT-Re, which holds in most but not all cases, e.g., on CoS-E the editor finds more unfaithfulness for ST-Re (44.04%) than for MT-Re (42.76%). We also observe that Re models tend to be less faithful than Ra models in most cases.

Related Work
Tests for Saliency Maps. The faithfulness and, more generally, the utility of explanations have predominantly been explored for saliency maps. Comprehensiveness and sufficiency (DeYoung et al., 2020) were proposed for evaluating the faithfulness of saliency maps: they measure the decrease in a model's performance when only the most or the least important tokens are removed from the input. Madsen et al. (2022) propose another faithfulness metric for saliency maps, ROAR, obtained by masking allegedly important tokens and then retraining the model; the authors argue that masking features with high importance should result in worse model performance than masking random tokens. In addition, Yin et al. (2022) and Hsieh et al. (2021) evaluate saliency maps through adversarial input manipulations, presuming that model predictions should be more sensitive to manipulations of the more important input regions as per the saliency map. Chan et al. (2022b) provide a comparative study of faithfulness measures for saliency maps. Further faithfulness testing for saliency maps was introduced by Camburu et al. (2019). Existing studies have also pointed out that saliency maps can be manipulated to hide a classifier's biases towards dataset properties such as gender and race (Dombrowski et al., 2019; Slack et al., 2020; Anders et al., 2020). While diagnostic methods for saliency maps rely on the one-to-one correspondence between the saliency scores and the regions of the input, this correspondence is not present for NLEs, where text not in the input can be included. Thus, diagnostic methods for saliency maps are not directly applicable to NLEs. To this end, we propose diagnostic tests that can be used to evaluate the faithfulness of NLE models.

Tests for NLEs. Existing work often only looks at the plausibility of the NLEs (Rajani et al., 2019; Kayser et al., 2021; Marasović et al., 2022; Narang et al., 2020; Kayser et al., 2022; Yordanov et al., 2022). In addition, Sun et al. (2022) investigated whether the additional context available in human- and model-generated NLEs can benefit model predictions in the way it benefits human users. Differently, Hase et al. (2020) proposed to measure the utility of NLEs in terms of how well an observer can simulate a model's output given the generated NLE, where the observer can be an agent (Chan et al., 2022a) or a human (Jolly et al., 2022). The only work we are aware of that introduces sanity tests for the faithfulness of NLEs suggests that an association between labels and NLEs is necessary for faithful NLEs and proposes two pass/fail tests: (1) whether the predicted label and the generated NLE are similarly robust to noise, and (2) whether task prediction and NLE generation share the most important input tokens. Majumder et al. (2022) use these tests as a sanity check for the faithfulness of their model. Our tests are complementary to theirs and offer quantitative metrics.

Summary and Outlook
In this work, we introduced two tests to evaluate the faithfulness of NLE models. We find that all four high-level setups of NLE models are prone to generate unfaithful NLEs, reinforcing the need for proof of faithfulness. Our tests can be used to ensure the faithfulness of emerging NLE models and inspire the community to design complementary faithfulness tests.

Limitations
While our tests are an important stepping stone for evaluating the faithfulness of NLEs, they are not comprehensive. Hence, a model that would perform perfectly on our tests may still generate unfaithful NLEs.
Our first test inspects whether NLE models are faithful to reasons for counterfactual predictions. It is important to highlight that NLEs may not comprehensively capture all the underlying reasons for a model's prediction. Thus, an NLE that fails to accurately represent the reasons for counterfactual predictions may still offer faithful explanations by reflecting other relevant factors contributing to the predictions. Additionally, both the random baseline and the counterfactual editor can generate insertions that result in text lacking semantic coherence. To address this limitation, future research can explore methods to generate insertion candidates that are both semantically coherent and reveal unfaithful NLEs.
Our second test uses heuristics that are task-dependent and may not be applicable to every task. The reconstruction functions R proposed in this work are based on hand-crafted rules for the e-SNLI and ComVE datasets. However, due to the nature of the CoS-E NLEs, rule-based input reconstructions were not possible for this dataset. To address this limitation, future research could investigate automated reconstruction functions based on machine learning models, trained to generate reconstructed inputs from the generated NLEs, where a small number of annotations would be provided as training instances; for CoS-E, for example, a small set of generated NLEs could be annotated with corresponding reconstructed inputs.

Appendix Examples

Table 5 example (counterfactual editor, e-SNLI). Original: Premise: A man wearing glasses and a ragged costume is playing a Jaguar electric guitar and singing with the accompaniment of a drummer. Hypothesis: A man with glasses and a disheveled outfit is playing a guitar and singing along with a drummer. Prediction: entailment. NLE: A ragged costume is a disheveled outfit.
New instance: Premise: A man wearing glasses and a ragged costume is playing a Jaguar electric guitar and singing with the accompaniment of a drummer. ➜Hypothesis: A man with glasses and a disheveled outfit is playing a guitar and singing along with a semi-formal drummer. Prediction: neutral. ✗ NLE: Not all ragged costumes are disheveled. Unfaithfulness cause: the inserted word 'semi-formal' ∉ NLE but changed the prediction.

Table 6 examples (input reconstruction test).
ComVE: ➜Sent 1: My knee was scrapped and I put ointment on it. ➜Sent 2: Ointment is not used to scrape a knee. ✗ Prediction: second sentence. Explanation: Ointment is used to scrape a knee.
e-SNLI, ST-Re. Original: Premise: People are riding bicycles in the street, and they are all wearing helmets. Hypothesis: A group of friends are grabbing their bikes, getting ready for the morning bike ride. Prediction: contradiction. Explanation: Just because people are riding bicycles does not mean they are friends.
New instance: ➜Premise: People are riding bicycles. ➜Hypothesis: They are friends. ✗ Prediction: neutral. Explanation: People riding bicycles are not necessarily friends.
e-SNLI, ST-Ra. Original: Premise: A woman is walking her dog and using her cellphone. Hypothesis: The woman is playing a game on her cellphone. Prediction: neutral. Explanation: Just because a woman is using her cellphone does not mean she is playing a game.
New instance: ➜Premise: A woman is using her cellphone. ➜Hypothesis: She is playing a game. ✗ Prediction: contradiction. Explanation: The woman can not be using her cellphone and playing a game at the same time.

Table 6: Examples of unfaithful NLEs detected with the input reconstruction test. We apply the test on an original instance (second column), which results in a new instance (third column). The parts of the input changed by the test are marked with ➜, and the intervention made by the test is in blue. ✗ marks an NLE or a prediction that does not match the expectation, thus pointing to the underlined NLE being unfaithful. The unfaithfulness cause for these instances is that the reasons in the NLE for the original instance lead to a different prediction.