Measuring Association Between Labels and Free-Text Rationales

In interpretable NLP, we require faithful rationales that reflect the model’s decision-making process for an explained instance. While prior work focuses on extractive rationales (a subset of the input words), we investigate their less-studied counterpart: free-text natural language rationales. We demonstrate that *pipelines*, models for faithful rationalization on information-extraction style tasks, do not work as well on “reasoning” tasks requiring free-text rationales. We turn to models that *jointly* predict and rationalize, a class of widely used high-performance models for free-text rationalization. We investigate the extent to which the labels and rationales predicted by these models are associated, a necessary property of faithful explanation. Via two tests, *robustness equivalence* and *feature importance agreement*, we find that state-of-the-art T5-based joint models exhibit desirable properties for explaining commonsense question-answering and natural language inference, indicating their potential for producing faithful free-text rationales.


Introduction
Interpretable NLP aims to better understand predictive models' internals for purposes such as debugging, validating safety before deployment, or revealing unintended biases and behavior (Molnar, 2019). These objectives require faithful rationales: explanations of the model's behavior that are accurate representations of its decision process (Melis and Jaakkola, 2018).
One way towards faithfulness is to introduce architectural modifications or constraints that produce rationales with desirable properties (Andreas et al., 2016; Schwartz et al., 2018; Jiang et al., 2019, inter alia). For example, pipeline models (Figure 2) were designed for information extraction (IE) tasks for which a rationale can be extracted as a subset of the input and is sufficient to make a prediction on its own, without the rest of the input (Lei et al., 2016). Such models approach faithfulness by construction (Jain et al., 2020).

Figure 1: A categorization of interpretable NLP on an illustrative faithfulness spectrum. Two predominant forms of explanation exist that align with two predominant classes of NLP tasks. Unlike models for IE tasks, the desirable properties of interpretable models for reasoning tasks have not been explored. We investigate architectures and tests for explaining reasoning tasks.
There is a growing interest in tasks that require world and commonsense "knowledge" and "reasoning", such as commonsense question answering (CommonsenseQA; Talmor et al., 2019) and natural language inference (SNLI; Bowman et al., 2015). Here, extractive rationales necessarily fall short; rationales must instead take the form of free-text natural language to fill in the reasoning or knowledge gap (Camburu et al., 2018; Rajani et al., 2019). In Table 1, for example, the highlighted extractive rationale of the first problem instance lacks at least one reasoning step to adequately justify the answer; the natural language rationale (which is not extractive) fills in the gap.

Commonsense QA (CoS-E)
Question: While eating a hamburger with friends, what are people trying to do?
Answer choices: have fun, tasty, or indigestion
Natural language rationale: Usually a hamburger with friends indicates a good time.

Natural Language Inference (E-SNLI)
Premise: A child in a yellow plastic safety swing is laughing as a dark-haired woman stands behind her.
Hypothesis: A young mother is playing with her daughter in a swing.
Label choices: neutral, entailment, or contradiction
Natural language rationale: Child does not imply daughter and woman does not imply mother.

Table 1: Examples from the CoS-E v1.0 and E-SNLI datasets (§2). Extractive rationales annotated by humans are highlighted, while human-written free-text rationales are presented underneath the answer/label choices. These examples illustrate that the extractive rationales fail to adequately explain the correct (underlined) label.
We study two distinct model classes: self-rationalizing models, which are fully differentiable and jointly predict the task output with the rationale; and pipelines, which rationalize first and then predict the task output with a separate model. We first show that, for CommonsenseQA and SNLI, a self-rationalizing model provides rationales that better indicate the correct label than a pipeline (§3.1). Next, we show that sufficiency is not universally applicable: a natural language rationale on its own does not generally provide enough information to arrive at the correct answer (§3.2). These findings suggest that a faithful-by-construction pipeline is not an ideal approach for reasoning tasks, leading us to ask: is there a way to achieve faithful free-text rationalization with self-rationalizing models?
We note that there is currently no way to assess the relationship between a prediction and a free-text rationale within the same fully differentiable model. Jacovi and Goldberg (2020) argue for the development of evaluations that measure the extent and likelihood that a rationale is faithful in practice (illustrated in Figure 1). To this end, we propose two measurements that begin to test the extent to which predicted labels and explanations are associated within the model that produces them.
The first experiment, robustness equivalence (§4.1), analyzes whether a predicted label and generated rationale are similarly robust to noise. The second, feature importance agreement (§4.2), analyzes whether the gradient attributions of the input with respect to the predicted label are similar to those with respect to the predicted rationale. We show that a self-rationalizing finetuned variant of T5 (Raffel et al., 2020; Narang et al., 2020) demonstrates good robustness equivalence and feature importance agreement on the datasets investigated. This result motivates future work on further measurements for testing label-rationale association.

Tasks, Datasets, and Models
Before we turn to our analyses, we introduce the datasets and models used in our experiments.

Tasks and Datasets
We explore two large-scale datasets for textual reasoning tasks that contain human-written natural language rationales: E-SNLI (Camburu et al., 2018), an extension of SNLI (Bowman et al., 2015), and CoS-E (Rajani et al., 2019), an extension of CommonsenseQA (Talmor et al., 2019); both are in English. For the former, the task is to infer whether a given hypothesis sentence entails, contradicts, or is neutral towards a premise sentence. For the latter, the task is to select the correct answer to a question from 3 (v1.0) or 5 (v1.11) answer choices. We use both versions of CoS-E in our experiments (see Appendix A.2). Table 1 contains examples, and Table 7 (Appendix A.2) contains data statistics.

T5 Models All of the models in this work are based on T5, though our methods can in principle be applied to any architecture. The base version of T5 is a 220M-parameter transformer encoder-decoder (Vaswani et al., 2017). To carry out supervised finetuning, T5 is trained by maximizing the conditional likelihood of the correct text output (from annotated data) given the text input; a minimal sketch of one such training step follows.
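The following is a minimal sketch of this finetuning objective for the I→OR variant, using the CoS-E example from Table 1. The serialized input/output strings are illustrative assumptions (the exact templates appear in Table 8, Appendix A.3); the optimizer settings follow Appendix A.4.

```python
# A minimal sketch of one supervised finetuning step: standard teacher-forced
# cross-entropy on the target text, as implemented by T5ForConditionalGeneration.
# The serialization templates are assumptions, not the paper's verbatim ones.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5,
                             betas=(0.9, 0.99), eps=1e-8)  # per Appendix A.4

source = ("explain cos_e question: While eating a hamburger with friends, "
          "what are people trying to do? choice: have fun choice: tasty "
          "choice: indigestion")
target = "have fun explanation: Usually a hamburger with friends indicates a good time."

enc = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**enc, labels=labels).loss  # negative conditional log-likelihood
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # max norm 1.0 (A.4)
optimizer.step()
optimizer.zero_grad()
```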
We finetune five T5-Base models for each dataset, supervising with ground-truth labels and rationales (further details in Appendix A.3-A.4):
• I→R, which maps task inputs to rationales, without ever being exposed to task outputs.
• R→O, which maps rationales to task outputs. The only input elements this model is exposed to are the answer choices (for CoS-E).
• I→OR, which maps inputs to outputs and rationales.
• IR→O, which maps pairs of inputs and rationales to outputs.
• I→O, which maps inputs to outputs.
We provide the input-output formatting in Table 8 (Appendix A.3); a hedged sketch of plausible serializations is shown below. Using these building blocks, we can instantiate two important approaches.

Table 2: An overview of text-only datasets and rationale types (E for extractive, NL for natural language rationales) used in prior work on pipeline architectures. We focus on the two tasks we believe require a more complex notion of "reasoning" to solve: CommonsenseQA (CQA) and NLI. Unlike the other tasks in the table, prior work on rationalizing these two tasks lacks consensus on (1) the type of rationale best suited to them, and (2) the form of the model. We argue for natural language rationales, and demonstrate that pipeline models are poorly suited for CQA and SNLI given this choice. Dataset citations: Appendix A.1.
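As a complement to Table 8, here is a hedged sketch of how the five variants might serialize a CoS-E example into text-to-text pairs. The strings follow the "explanation:" convention visible in our decoded outputs but are assumptions, not the verbatim templates.

```python
# Illustrative serializations for the five T5 variants (assumptions; the
# exact templates are given in Table 8, Appendix A.3).
def serialize(variant: str, question: str, choices: list,
              label: str = None, rationale: str = None):
    choice_str = " ".join(f"choice: {c}" for c in choices)
    if variant == "I->OR":
        return (f"explain cos_e question: {question} {choice_str}",
                f"{label} explanation: {rationale}")
    if variant == "I->R":
        return f"explain cos_e question: {question} {choice_str}", rationale
    if variant == "I->O":
        return f"cos_e question: {question} {choice_str}", label
    if variant == "R->O":
        # R->O sees only the rationale plus the answer choices (Section 2).
        return f"cos_e explanation: {rationale} {choice_str}", label
    if variant == "IR->O":
        return (f"cos_e question: {question} {choice_str} "
                f"explanation: {rationale}", label)
    raise ValueError(f"unknown variant: {variant}")
```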
Pipeline Model (I→R;R→O) This architecture composes I→R with R→O, each trained entirely separately, for a total of 440M parameters. It is illustrated in Figure 2 and is faithful-by-construction (with caveats; see Jacovi and Goldberg, 2021). The vast majority of prior work using pipelines has focused only on extractive rationales (see Table 2).
Self-Rationalizing Model (I→OR) A joint, self-rationalizing model (Melis and Jaakkola, 2018), illustrated in Figure 3, predicts both a label and a rationale. This is the most common approach to free-text rationalization (Hendricks et al., 2016; Kim et al., 2018; Hancock et al., 2018; Camburu et al., 2018; Ehsan et al., 2018; Liu et al., 2019a; Wu and Mooney, 2019; Narang et al., 2020; Do et al., 2020; Tang et al., 2020), but little is understood about these models' internals. I→OR models are desirable for their ease of use, task effectiveness, parameter efficiency, and ability to generate fluent and plausible rationales. For these reasons, we expect models of this kind to play an important role in continuing research on explainable AI.
We use the I→OR variant of T5 (Narang et al., 2020). Because only one instance of T5 is used to instantiate it, the total number of parameters is half that of the pipeline. We replicate two prior findings (Tables 9-10 in Appendix B): the T5 pipeline does not perform as well as the self-rationalizing model (despite having double the parameters), and T5-Base outperforms pretrained models used in prior work.
Evaluation We do not report BLEU scores (Papineni et al., 2002), because BLEU and related metrics do not measure plausibility (Camburu et al., 2018; Kayser et al., 2021; Clinciu et al., 2021) or faithfulness (Jacovi and Goldberg, 2020). In addition to their low correlation with human scores, there can be many valid rationales for a given instance (Miller, 2019); metrics that compare generated rationales to a single ground truth do not address this and are thus a poor measure of quality.

Human simulatability (Doshi-Velez and Kim, 2017) has a rich history in machine learning interpretability as a reliable measure of rationale quality from the lens of utility to an end user (Kim et al., 2016; Chandrasekaran et al., 2018; Hase and Bansal, 2020; Yeung et al., 2020; Poursabzi-Sangdeh et al., 2021; Rajagopal et al., 2021, i.a.). Rather than computing word-level overlap with a ground-truth explanation, simulatability measures the additional predictive ability a rationale provides over the input, computed as the difference between task performance when a rationale is given as input and when it is not (IR→O minus I→O); a worked sketch follows. Historically, humans have served as the predictors, but recent work has shown that the computation of simulatability can be automated using trained models. Hase et al. (2020) demonstrate that automated metrics for simulatability have moderate to high correlation with human scores in both expert and crowdsourced settings. (Given the large scale of our analysis, with over 250K instances evaluated, an automated metric provides coverage, reproducibility, and consistency not achievable with human annotation; an author of this paper annotated 60 instances from both E-SNLI and CoS-E v1.11 and found 82.5% agreement between their simulatability score and the automated metric.) We thus use the simulatability score as a measure of rationale quality.
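As a worked illustration of the metric (the accuracy values below are hypothetical):

```python
# Simulatability as defined above: the accuracy of a proxy model that sees
# the rationale (IR->O) minus one that does not (I->O).
def simulatability(acc_ir_to_o: float, acc_i_to_o: float) -> float:
    """Simulatability = acc(IR->O) - acc(I->O), in accuracy points."""
    return acc_ir_to_o - acc_i_to_o

# e.g., if IR->O reaches 72.1% accuracy and I->O reaches 68.0%, the
# rationales contribute 4.1 points of simulatability (hypothetical values).
print(simulatability(72.1, 68.0))  # 4.1
```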

Shortcomings of Free-Text Pipelines
We first analyze "faithful-by-construction" pipeline models (I→R;R→O) for free-text rationalization with respect to two properties: the quality of generated rationales (§3.1) and the appropriateness of the sufficiency assumption (§3.2).

Joint Model Rationales are More Indicative of Labels
Rationales should be a function of the input and the predicted label. To demonstrate why this is the case, consider training an I→R model on a dataset with multiple annotation layers, e.g., OntoNotes, which contains word sense, predicate structure, and coreference annotations (Pradhan et al., 2007). Without additional task-specific input, this model would produce the same rationale regardless of the task being rationalized. Prior work has also critiqued I→R;R→O models because it is counter-intuitive to generate a rationale before deciding the label to explain (Kumar and Talukdar, 2020; Jacovi and Goldberg, 2021). The I→R model will therefore first need to implicitly predict a label. But can I→R infer the label well when it is trained without label signal? To address this question, we study whether I→OR rationales are better at predicting the gold labels than I→R rationales. We train an R→O model on ground-truth rationales (R*) and evaluate it on the following inputs:
• test set ground-truth R* rationales,
• test set rationales generated by I→OR, and
• test set rationales generated by I→R.
In Table 3, we show that I→OR rationales recover 8-9% more ground-truth (R*) performance than I→R rationales on both versions of CoS-E, and 1% more on E-SNLI. The smaller improvement for E-SNLI could be explained by the fact that E-SNLI has substantially more training examples per label than CoS-E, which helps a pipeline model learn features predictive of each label. We additionally demonstrate that I→OR rationales are higher quality than I→R's, as measured by the simulatability metric (Table 4). The fact that the pipeline's strong performance does not generalize to a complex prediction task such as CoS-E empirically demonstrates that training on the label signal O is important for generating good-quality rationales and avoiding cascading errors.

Figure 3: An example of a joint architecture (I→OR; §2) for CoS-E v1.0 with a human-written rationale. Trained on both task signal and human rationales, these models are effective at generating fluent rationales while retaining good task performance (Tables 9-10 in Appendix B).

Table 4: Simulatability scores (higher is better) of ground-truth rationales (R*) and rationales generated by two model architectures: I→OR and I→R. These results demonstrate that rationales generated as a function of the input and the predicted label (I→OR) are higher quality than those generated as a function of the input alone (I→R) across datasets (§3.1).

Table 5: A comparison of the IR→O and R→O models (§2) evaluated with ground-truth natural language rationales (R*). In some cases, accuracy improves substantially with the addition of the input, indicating that rationales are not always sufficient and pipelines are not always effective.

Sufficiency is not Universally Valid
"Faithful-by-construction" pipelines rely on the sufficiency assumption: the selected rationale must be sufficient to make the prediction without the remaining input. This assumption is suitable for IE tasks for which a subset of the input tokens is predictive of the label. Indeed, humans can serve as R→O models on certain IE tasks and make accurate predictions, validating that rationales are sufficient for these tasks (Jain et al., 2020).
To illustrate why sufficiency might not be justified for reasoning tasks, consider the example in Figure 2. The task of the R→O model is to select between the answer choices "have fun", "tasty", and "indigestion" given the rationale "Usually a hamburger with friends indicates a good time". The rationale is designed to complement the input question, but the R→O model does not see the question, changing the fundamental nature of the task it is solving. We thus wonder: does task obfuscation hurt pipelines' ability to perform the task?
We report the accuracy difference between an R→O model and a model that receives both the input and rationale (IR→O), both trained on R*, and evaluate on test set R*. (Evaluating on R* instead of generated rationales serves as an upper bound on pipeline performance, removing the confounding factor that I→R rationales can be poor; §3.1.) In Table 5, the IR→O models on CoS-E show a 5-12% increase in accuracy over R→O, indicating that the rationales are not sufficient. The difference is much smaller for E-SNLI (1%), likely because E-SNLI was collected by instructing annotators to provide self-contained rationales. However, using dataset collection to explicitly elicit sufficient rationales does not address the unnaturalness of such a task formulation (Wiegreffe and Marasović, 2021). Table 5 indicates that, especially in the case of CoS-E, sufficiency is not a valid assumption, and the use of I→R;R→O models is sub-optimal in these cases.
So far, we have highlighted shortcomings of pipelines for reasoning tasks:
• cascading errors caused by low-quality rationales that are not indicative of labels (§3.1),
• missing information due to rationales not being sufficient (§3.2), and
• double the number of parameters and more manual labor to reach performance comparable to an end-to-end (I→O) model, while still often performing worse (§2).
We next turn our focus to self-rationalizing (I→OR) models currently in widespread use, which, in contrast to pipelines, are high-performing, easy to implement via a multi-task loss, and more parameter-efficient (§2).

Analyzing Necessary Properties of Joint Models
Despite their popularity and widespread use, the extent to which self-rationalizing models exhibit faithful rationalization has not been studied; rationales generated by such models (e.g., those of Narang et al., 2020) cannot be treated as faithful explanations without further investigation. At minimum, rationales must be implicitly or explicitly tied to the model's prediction. We present two metrics to analyze the association between the mechanisms that produce labels and rationales in a multi-task I→OR model: robustness equivalence (§4.1) and feature importance agreement (§4.2). These experiments serve as a necessary sanity check for the reliability of I→OR models' explanations.

Robustness Equivalence
We aim to analyze whether predicted labels and rationales are similarly or dissimilarly robust to noise applied to the input. The former indicates that predicted labels and rationales are strongly associated, while the latter indicates the opposite. Given some amount of noise, there are four possible cases for a model's output: {l_stable, l_unstable} × {r_stable, r_unstable}, where l is a label and r is a rationale.
The case where l and r are both stable or both unstable indicates that both tasks are similarly affected by noise. The case where l is unstable but r is stable (or vice versa) is a failure case: if only one output is stable, we conclude the two generation mechanisms cannot be strongly associated within the model.
Method Following related work (Lakshmi Narayan et al., 2019; Liu et al., 2019b), we add zero-mean Gaussian noise N(0, σ²) to each input embedding in the I→OR encoder at inference time. We measure changes in label prediction as the number of predicted test set labels that flip, i.e., change from their original prediction to something else, alongside changes in accuracy of the I→OR model. We measure changes in rationale quality using simulatability (§2); we illustrate the details of the simulatability calculation in Figure 6a. We report metrics on rationales generated by I→OR under different levels of noise, controlled by σ². A minimal sketch of the perturbation follows.
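The sketch below assumes noised embeddings can be passed to the encoder through the inputs_embeds interface (an implementation assumption; the variance values mirror the σ² range in our experiments).

```python
# A minimal sketch of the robustness-equivalence perturbation: zero-mean
# Gaussian noise N(0, sigma^2) added to each input embedding of the I->OR
# encoder at inference time.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

@torch.no_grad()
def noisy_generate(text: str, sigma2: float = 15.0, max_new_tokens: int = 200):
    enc = tokenizer(text, return_tensors="pt")
    embeds = model.get_input_embeddings()(enc.input_ids)
    embeds = embeds + torch.randn_like(embeds) * sigma2 ** 0.5  # N(0, sigma^2)
    out = model.generate(inputs_embeds=embeds,
                         attention_mask=enc.attention_mask,
                         max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```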
An example of noisy outputs for the running CoS-E v1.0 example is presented in Table 6.
Results We present results on the effect of noise on labels in Figure 4 (E-SNLI and CoS-E v1.11 in Figure 8 of Appendix B). As expected, the accuracy of the I→OR model (red line) and the percent of labels in the I→OR model which have not flipped (black line) are almost identical for all three datasets. We present results on the effect of noise on rationales in Figure 5 for CoS-E (E-SNLI in Figure 9 of Appendix B).
By examining the regions of largest slope, we gain insight into model behavior. On the rationale quality measure, both versions of CoS-E's rationales reach a minimum contribution to task accuracy at σ² = 20 (Figure 5). We similarly observe the largest drop in task accuracy (Figure 4) for CoS-E v1.0 between σ² = 15 and σ² = 20. Thus, at lower noise levels (0-15) the model exhibits both stable labels and stable rationales, and at higher levels (20+) both are unstable, indicating robustness equivalence. Similar conclusions can be reached for E-SNLI and CoS-E v1.11; we conclude that the I→OR model demonstrates high label-rationale association on all three datasets as measured by robustness equivalence.

Feature Importance Agreement
If label prediction and rationale generation are associated, input tokens important for label prediction should be important for rationale generation, and vice versa. We refer to this property as feature importance agreement. To measure the extent to which I→OR models exhibit this property, we use gradient-based attribution (Baehrens et al., 2010; Simonyan et al., 2014) to identify tokens important for label prediction, and the Remove and Retrain (ROAR) occlusion method (Hooker et al., 2019) to analyze their impact on rationale generation (or vice versa).
Gradient Attribution For a predicted class p, gradient attribution is a function of the gradient of the predicted class's logit $\ell_p$ with respect to an input token embedding $\mathbf{x}^{(i)} \in \mathbb{R}^d$:

$$a(\mathbf{x}^{(i)}) = f\!\left(\frac{\partial \ell_p}{\partial \mathbf{x}^{(i)}}\right) \quad (1)$$

where the function f reduces the gradient to a scalar. Choices for f include the $L_1$ or $L_2$ norm (Atanasova et al., 2020), or an element-wise sum (Wallace et al., 2019). Intuitively, the gradient measures how much an infinitesimally small change in the input changes the predicted class's logit, using a first-order Taylor series approximation of the logit function. Such methods have been extended to sequence-output models such as neural machine translation, where attribution is computed from the gradient of the sum of the decoded output logits $\{\ell_{p_k}\}_{k=1}^{m}$ with respect to the input:

$$a(\mathbf{x}^{(i)}) = f\!\left(\frac{\partial \sum_{k=1}^{m} \ell_{p_k}}{\partial \mathbf{x}^{(i)}}\right) \quad (2)$$

The attribution of a sequence of n input token embeddings, $X \in \mathbb{R}^{n \times d}$, is a vector $a(X) = [a(\mathbf{x}^{(1)}), \ldots, a(\mathbf{x}^{(n)})] \in \mathbb{R}^n$, where $a(\mathbf{x}^{(i)})$ is shorthand for the value defined in Equation 2. By decomposing the sum in Equation 2 into two parts, one over the predicted label tokens and one over the predicted rationale tokens in the decoded output, we obtain two attribution vectors over the input tokens: one for the predicted label logits, a(X)_L, and one for the predicted rationale logits, a(X)_R, with a(X) = a(X)_L + a(X)_R.
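A minimal sketch of this attribution computation for an I→OR model follows, assuming the "<label> explanation: <rationale>" output format, with the caller supplying how many decoded tokens belong to the label, and with f taken to be the L1 norm (the choice selected in §4.2); function and variable names are illustrative.

```python
# Sketch of Equations 1-2 (PyTorch + Huggingface Transformers), splitting the
# attribution of the input tokens into a(X)_L and a(X)_R.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

def label_and_rationale_attributions(input_text, decoded_text, num_label_tokens):
    enc = tokenizer(input_text, return_tensors="pt")
    dec_ids = tokenizer(decoded_text, return_tensors="pt").input_ids
    # Embed the input ourselves so gradients accumulate on a leaf tensor.
    embeds = model.get_input_embeddings()(enc.input_ids).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc.attention_mask,
                labels=dec_ids)
    # Teacher-forced logits of the decoded tokens: l_{p_k}, k = 1..m.
    token_logits = out.logits[0].gather(-1, dec_ids[0].unsqueeze(-1)).squeeze(-1)

    # a(X)_L: gradient of the summed label-token logits, L1 norm per token.
    token_logits[:num_label_tokens].sum().backward(retain_graph=True)
    a_L = embeds.grad[0].abs().sum(-1)  # f = L1 norm over the embedding dim
    embeds.grad = None
    # a(X)_R: same computation for the rationale-token logits.
    token_logits[num_label_tokens:].sum().backward()
    a_R = embeds.grad[0].abs().sum(-1)
    return a_L, a_R  # each of shape (n,): one score per input token
```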
Reliability of Gradient Attribution Before we measure feature importance agreement, it is critical to evaluate whether the gradient-attribution scores truly capture token importance, since these methods can be unreliable for certain datasets or architectures (Kindermans et al., 2019). To validate that our attributions are reliable, we perform the ROAR test (Hooker et al., 2019). Using attribution scores, we obtain the top-k% attributed tokens for every instance and occlude them with mask tokens, following T5's pretraining procedure. We retrain a model on the occluded training set and evaluate on the occluded test set. We repeat this procedure for k ∈ {10%, 20%, 30%}, and compare the drop in performance as k increases to a baseline in which k% random tokens are dropped; a sketch of the occlusion step follows.
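The sketch below assumes one sentinel per occluded token; T5 pretraining masks spans with <extra_id_*> sentinel tokens, so per-token masking is an illustrative simplification. The scores would come from the attribution function sketched above.

```python
# A minimal sketch of the ROAR occlusion step (Hooker et al., 2019): replace
# the top-k% highest-attribution tokens with T5 sentinel masks, then retrain
# on the occluded data.
import torch

def occlude_top_k(tokens, scores, k=0.2):
    """Replace the top-k% of tokens (by attribution score) with sentinels."""
    n_mask = max(1, int(round(len(tokens) * k)))
    top = set(torch.topk(torch.as_tensor(scores, dtype=torch.float),
                         n_mask).indices.tolist())
    out, sentinel = [], 0
    for i, tok in enumerate(tokens):
        if i in top:
            out.append(f"<extra_id_{sentinel}>")
            sentinel += 1
        else:
            out.append(tok)
    return out

# e.g., occlude_top_k(["eating", "a", "hamburger"], [0.9, 0.1, 0.4], k=0.34)
# -> ['<extra_id_0>', 'a', 'hamburger']
```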
To the extent that the occluded model fails to match the random model's performance, we can attribute such degradation to the removal of tokens that the original model finds informative. A large drop in performance indicates that gradient attributions successfully identify important tokens in the input.
We first use this method to select an optimal gradient-attribution method and f function (Figure 11 in Appendix B). We find that the L1 norm (over the embedding dimension) as f outperforms the element-wise sum (which may suffer from dampened magnitudes). Unlike prior work in computer vision (Hooker et al., 2019), we find raw gradients to perform comparably to the input*gradient variant (Shrikumar et al., 2017). Thus, in subsequent experiments we compute attributions following Equation 1 with f equal to the L1 norm.
We validate that attributions from the label logits, a(X)_L, degrade label accuracy when compared to random occlusion (orange vs. blue line in Figure 7, left). The two simulatability lines (Figure 7, right) for CoS-E v1.0 have an inflection point; we illustrate how simulatability is calculated in Figure 6b. As in §4.1, we expect this is because the rationales become so noisy that IR→O ignores them and behaves like I→O. If an input attribution degrades rationale quality (as measured by simulatability) more than a random attribution, then for values of k before the inflection point its line must lie below the "random" line; for values of k after the inflection point, the "random" line should lie below the attribution line, since a noisier rationale makes IR→O behave more like I→O and pushes simulatability toward 0. Both criteria hold for attributions from the rationale logits, a(X)_R (green vs. blue line in Figure 7, right), for CoS-E v1.0 and the other datasets (see Figure 12 in Appendix B). This reliability check confirms that gradient attribution works well in our setting.

Agreement Method and Results
To measure feature importance agreement, i.e., whether tokens important for label prediction are important for rationale generation (and vice versa), we repeat the same experiment but measure performance with respect to the other output's metric. For attributions computed from the label logits, a(X)_L, we measure the effect of their occlusion on rationale quality using the simulatability score. For attributions computed from the rationale logits, a(X)_R, we measure the effect of their occlusion on label accuracy. If at least one of these values is notably different from random, we can conclude that the I→OR model displays feature importance agreement in a given direction.
Results for CoS-E v1.0 are again shown in Figure 7 (and for the other datasets in Figure 12 in Appendix B). In Figure 7 (left), we find that removing the top-k% of tokens by a(X)_R magnitude degrades label performance compared to the baseline (green vs. blue line). Intuitively, this drop is smaller than that from removing tokens by a(X)_L magnitude (orange line). In Figure 7 (right), we observe that removing the top-k% of tokens by a(X)_L consistently degrades rationale performance more than random, according to the two criteria for comparing simulatability lines (orange vs. blue line). This also holds for E-SNLI and CoS-E v1.11 (Figure 12). We conclude that the I→OR model demonstrates label-rationale association as measured by feature importance agreement on the datasets studied.

Related Work
Analysis of NLP Models Structural tests for analyzing models' internals include probing (Tenney et al., 2019). Although gradient attribution has been extensively studied in NLP, its interplay with free-text rationales has not. Prior work in vision and clinical tasks shows that saliency maps and model predictions can be independently adversarially attacked, and concludes this is due to a misalignment between the saliency map generator and the model predictor. Such methods have not been tested for models producing natural language (NL) rationales. Future work could include expanding robustness equivalence (§4.1) to model discrete edits of input words.
Analyzing Faithfulness The aim of our work is to initiate placing models that provide NL rationales on the faithfulness spectrum conceptualized by Jacovi and Goldberg (2020). Prior work proposing models (Jain et al., 2020; Schuff et al., 2020; Jacovi and Goldberg, 2021) and evaluations (DeYoung et al., 2020) of faithful explanation focuses on extractive rationales and generally relies on the sufficiency assumption. Schuff et al. (2020) propose a regularization term to couple answers and extractive explanations on HotPotQA.
Turning to exceptions that focus on natural language rationales, Latcinnik and Berant (2020) train a differentiable I→R;IR→O pipeline for CommonsenseQA, controlling the complexity of the IR→O model to increase the likelihood that the model is faithful to the rationale. Kumar and Talukdar (2020) propose an IO→R;IR→O pipeline that generates an explanation for every possible NLI label using label-specific explanation generators, an alternative solution to the problem raised in §3.1 for datasets with a small number of shared labels.

Conclusion
After demonstrating the weaknesses that pipeline models exhibit for free-text rationalization tasks, we propose two measurements of label-rationale association in self-rationalizing models. We find that on three free-text rationalization datasets for CommonsenseQA and SNLI, models based on T5 exhibit high robustness equivalence and feature importance agreement, demonstrating that they pass a necessary sanity check for generating faithful free-text rationales.
Future work can expand this analysis to more properties. We believe this research direction is important moving forward, both due to the advantages of large multi-task explanation models and as a complement to the development of interpretable architectures, which can be fickle and task-specific. Although our measurements address only necessary, not sufficient, properties, by viewing faithful interpretability as a spectrum we take a step toward quantitatively situating common models on it.

A.2 Details of Datasets
We summarize dataset statistics in Table 7. The two versions (v1.0, v1.11) of CoS-E correspond to the first and second versions of the CommonsenseQA dataset. CoS-E v1.11 has some noise in its annotations (Narang et al., 2020). This is our primary motivation for also reporting on v1.0, which we observe does not have these issues.

A.3 Details of T5
The T5 model (Raffel et al., 2020) is pretrained on a multi-task mixture of unsupervised and supervised tasks, including machine translation, question answering, abstractive summarization, and text classification. Its inputs and outputs for every task are text sequences; we provide the input-output formatting used for training and decoding our T5 models in Table 8. T5 can produce any word in the vocabulary as an answer.

A.4 Implementation Details
We use Huggingface Datasets to access all datasets, and Huggingface Transformers (Wolf et al., 2020) to access pretrained T5 weights and the tokenizer. To optimize, we use Adam with ε = 1e-8, β1 = 0.9, and β2 = 0.99. We use gradient clipping to a maximum norm of 1.0 and a dropout rate of 0.1. We train each model on an NVIDIA RTX 8000 GPU (48GB memory) for a maximum of 200 epochs with a batch size of 64 and a learning rate linearly decaying from 5e-5. Training ends if the validation set loss has not decreased for 10 epochs; early stopping occurs within 15 epochs for most models. Most CoS-E models train in less than 1 hour, and most E-SNLI models in around 30 hours. At inference time, we decode greedily until an EOS token is generated (or for at most 200 tokens). Approximating the batch-size-64 model with a batch size of 16 and 4 gradient accumulation steps on 8GB-memory cloud GPUs, we sweep starting learning rates of 1e-2, 1e-3, 1e-4, 5e-5, and 1e-5. The two largest learning rates never result in good performance. Among the smallest three rates, performance across all model variants (I→R, I→OR, R→O, I→O, and IR→O) on E-SNLI and CoS-E v1.0 never varies by more than 1.58% accuracy or 0.34 BLEU.

A.5 Note on Robustness Equivalence Convergence
Worst-case model performance under large noise values in the robustness equivalence experiments (Figures 4 and 8) reaches 0 rather than random accuracy due to the structure of the model's output. The I→OR model is trained to produce a delimiter distinguishing the label from the rationale in a long string of output tokens. When it fails to produce the delimiter under high noise, we cannot delineate the label from the rationale in an output where multiple answer choices are often mentioned, so we mark the label as incorrect.
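A sketch of the parsing logic described above, assuming the literal delimiter "explanation:" (consistent with the decoded outputs shown in Tables 11-13):

```python
# Parse a decoded I->OR output into (label, rationale); outputs that lack
# the delimiter are marked incorrect, as described above.
def parse_output(decoded: str):
    if "explanation:" not in decoded:
        return None, decoded  # no delimiter: label is counted as incorrect
    label, rationale = decoded.split("explanation:", 1)
    return label.strip(), rationale.strip()

print(parse_output("have fun explanation: having fun is the only thing ..."))
# -> ('have fun', 'having fun is the only thing ...')
```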

B Additional Results
We provide additional results that supplement the main body of the paper:
• Table 9 presents results comparing the self-rationalizing T5 model to baselines.
• Table 10 presents results comparing the self-rationalizing T5 model to its pipeline variant (from §2).

Table 11: Noised output of the I→OR model for the CoS-E v1.0 example "A man wants air conditioning while we watches the game on Saturday, where will it likely be installed?" The correct answer is "house".
σ² = 0: house explanation: a house is the only place that would have air conditioning.
σ² = 10: house explanation: a house is the only place that would have air conditioning.
σ² = 15: <extra_id_0> house explanation: a house is the only place that will have air conditioning.
σ² = 20: <extra_id_0> movie theatre explanation: movie theatre is the only option that is not a movie. 911 911 911. . .
σ² = 25: <extra_id_0> explain<extra_id_1> explain<extra_id_2> explain<extra_id_3> explain<extra_id_4> explain<extra_id_5> movie theatre<extra_id_6>. . .
σ² = 30: house of house of house of house of house of house of house of house of house of house of house of house of house . . .
σ² = 35: house of house of office office office office office office office office office office office office office office office office office office. . .

Table 12: Noised output of the I→OR model for the CoS-E v1.0 example "If a person is using a computer to talk to their granddaughter, what might the computer cause for them?" The correct answer is "happiness".
σ² = 0: stress explanation: a computer is used to communicate with a granddaughter.
σ² = 5: stress explanation: a computer is used to communicate with a granddaughter.
σ² = 10: stress explanation: a computer is used to talk to people.
σ² = 15: stress explanation: a computer is used to talk to people.
σ² = 20: <extra_id_0> is using a computer to<extra_id_1> to talk to<extra_id_2> is using a computer to talk to a person is using a computer to talk to a person. . .
σ² = 25: <extra_id_0> answer: answer: answer: answer: answer: answer: answer: answer: answer: answer: answer: answer: answer:. . .
σ² = 30: <extra_id_0> answer: answer: answer: answer: answer: answer: answer: answer: answer: answer: answer: answer: answer:. . .
σ² = 35: office of the office of the office of the office of the office of the office of the office of the office of the office of the office of the office . . .

Table 13: Noised output of the I→OR model for the CoS-E v1.0 example "When communicating with my boss, what should I do?" The correct answer is "transfer of information".
σ² = 0: transfer of information explanation: transfer of information is the only option that would be appropriate when communicating with a boss.
σ² = 5: transfer of information explanation: transfer of information is the only option that would be appropriate when communicating with a boss.
σ² = 10: transfer of information explanation: transfer of information is the only option that would be appropriate when communicating with a boss.
σ² = 15: transfer of information explanation: transfer of information is the only option that would be appropriate when communicating with my boss.
σ² = 20: transfer of information: transfer of information is the only thing that is transfer of information. transfer of information is the only thing that is. . .
σ² = 25: transfer of information: transfer of information is information. transfer of information is information. transfer of information is information.. . .
σ² = 30: i believe that is the answer of the question. argument is the answer of the question. argument is the answer of the question. argument is the answer of the question.. . .
σ² = 35: i can't handle the argument argument argument is the answer of the argument argument is the answer of the argument. . .

Figure 8: Accuracy of the I→OR model (red) and % stable labels in the I→OR model (black) show that most changes take place in the 10-20 σ² range for CoS-E and the 15-30 σ² range for E-SNLI. See §4.1.

Figure 9: Results of the rationale portion of the robustness equivalence test for E-SNLI.

Figure 10: Label- and rationale-attribution vectors for the running example from Figures 2-3. The decoded label is "have fun" and the generated rationale is "having fun is the only thing that people are trying to do". Important input terms vary across the two loss terms. For example, the predicted label term assigns high importance to the predicted answer choice, "have fun", while the explanation attends more uniformly over the input, with peaks on relevant entities and verbs such as "trying". In this example, the explanation- and label-attribution vectors are each L1-normalized in order to compare the relative importance of tokens (irrespective of gradient magnitudes).

Figure 11: Effect of various gradient attribution methods on the ROAR test at k = 10-30% occlusion for the CoS-E v1.0 validation set. We compute attributions with respect to the label logit and measure label accuracy of the resulting model after masking and re-training (see §4.2 for details). The largest drop in performance comes from the L1-norm embedding-combination method, and raw gradients are not significantly different from input*gradient. On average, input*gradient and raw gradients share 17% of tokens in the top 10%, 24% of tokens in the top 20%, and 31% of tokens in the top 30%.
Figure 12: ROAR feature importance agreement results on E-SNLI (left) and CoS-E v1.11 (right). (a) Impact of occlusion by source of attribution on label accuracy: Figure 12a shows label accuracy of the I→OR model. (b) Impact of occlusion by source of attribution on rationale quality: Figure 12b shows the quality of generated rationales from the I→OR model, with simulatability calculated as IR→O minus I→O.