REV: Information-Theoretic Evaluation of Free-Text Rationales

Generating free-text rationales is a promising step towards explainable NLP, yet evaluating such rationales remains a challenge. Existing metrics have mostly focused on measuring the association between the rationale and a given label. We argue that an ideal metric should focus on the new information uniquely provided in the rationale that is otherwise not provided in the input or the label. We investigate this research problem from an information-theoretic perspective using conditional V-information (Hewitt et al., 2021). More concretely, we propose a metric called REV (Rationale Evaluation with conditional V-information) to quantify the amount of new, label-relevant information in a rationale beyond the information already available in the input or the label. Experiments across four benchmarks with reasoning tasks, including chain-of-thought prompting, demonstrate the effectiveness of REV in evaluating rationale-label pairs compared to existing metrics. We further demonstrate that REV is consistent with human judgments on rationale evaluations and provides more sensitive measurements of new information in free-text rationales. When used alongside traditional performance metrics, REV provides deeper insights into models' reasoning and prediction processes.


Introduction
Model explanations have been indispensable for trust and interpretability in natural language processing (NLP) (Ribeiro et al., 2016, 2020; Lipton, 2018; Chen et al., 2020, 2021a). Free-text rationales, which explain a model prediction in natural language, have been especially appealing due to their flexibility in eliciting the reasoning process behind the model's decision making (Camburu et al., 2018; Narang et al., 2020; Rajani et al., 2019; Kumar and Talukdar, 2020; Brahman et al., 2021), making them closer to human explanations. However, existing metrics for free-text rationale evaluation remain narrowly focused on the extent to which a rationale can help a (proxy) model predict the label it explains (i.e., they are accuracy-based) (Hase et al., 2020). These metrics offer little understanding of the new information contained in the rationale, as added to the original input, that could explain why the label is selected, which is the very purpose a rationale is designed to serve. For instance, the two rationales r*_1 and r̂_1,a in Fig. 1 would be considered equally valuable under existing metrics, even though they supply different amounts of novel and relevant information.

In this paper, we overcome this shortcoming by introducing an automatic evaluation for free-text rationales along two dimensions: (1) whether the rationale supports (i.e., is predictive of) the intended label, and (2) how much new information it provides to justify the label, beyond what is contained in the input. For example, rationale r̂_1,b in Fig. 1 violates (1) because it is not predictive of the label "enjoy nature". Rationale r̂_1,a does support the label but contains no new information that justifies it beyond what is stated in the input x; thus, it violates (2). Rationale r*_1 satisfies both dimensions: it supports the label and does so by providing new and relevant information beyond what is in the input. Our proposed evaluation is designed to penalize both r̂_1,a and r̂_1,b, while rewarding rationales like r*_1.

We introduce REV, which adapts an information-theoretic framework from Xu et al. (2020) for evaluating free-text rationales along the two dimensions mentioned above. Specifically, REV is based on conditional V-information (Hewitt et al., 2021), which quantifies the degree of information contained in a representation beyond another (baseline) representation, accessible to a model family V. As our baseline representation, we consider any vacuous rationale which simply (and declaratively) combines an input with a given label, without providing any new information relevant to answering why the label was chosen.

Figure 1: Our evaluation framework for different free-text rationales (r). r*_1 is a human-written rationale; r̂_1,a and r̂_1,b are two generated rationales for the true label y_1. Our metric, REV, based on CVI (Hewitt et al., 2021), is able to distinguish all three rationales by measuring how much new and label-relevant information each adds over a vacuous rationale, b; performance-based evaluations can only distinguish between r̂_1,a and r̂_1,b. For an (arguably) incorrect label, y_2, REV still gives a positive score, highlighting that r̂_2 is able to provide new information for why it supports y_2. Prediction accuracy can be augmented with REV to provide a fuller interpretability of model decisions.
REV adapts conditional V-information to evaluate rationales, where we compare two representations: one from an evaluation model trained to produce the label given the input and the rationale, and the other from another evaluation model for the same task but considering only the input (disguised as a vacuous rationale). Other metrics do not take vacuous rationales into consideration, and are hence unable to measure new and label-relevant information in rationales.
In our experiments, we present evaluations with REV for rationales under two reasoning tasks, commonsense question-answering (CQA; Talmor et al., 2019) and natural language inference (NLI; Bowman et al., 2015), across four benchmarks. Several quantitative evaluations demonstrate the capabilities of REV in providing evaluations along new dimensions for free-text rationales, while also being more consistent with human judgments compared to existing metrics. We also provide comparisons demonstrating the sensitivity of REV to various degrees of input perturbations. Additionally, evaluation with REV offers insights into why rationales obtained through chain-of-thought prompting (Wei et al., 2022) do not necessarily improve prediction performance.

REV: Information-Theoretic Evaluation of Rationales
We introduce a new metric, REV (Rationale Evaluation with conditional V-information), for evaluating free-text rationales along the proposed dimensions (§2.2), based on the framework of conditional V-information (§2.1). We consider the setting where we have an input X ∈ X, a label Y ∈ Y, and a free-text rationale R ∈ R generated for label Y. A common strategy to evaluate rationale R is through an evaluator function f : Z → Y, which maps a variable Z to a label distribution. Here, Z is defined by the evaluation framework; e.g., Z can be a concatenation of X and R, or contain only X. These metrics evaluate the utility of R based on how much R helps f predict Y. The evaluator f is typically trained on a set of input, label, and rationale triples D_train = {(x_j, y_j, r_j)}, and applied to D_test = {(x_i, y_i, r_i)} for evaluation. The utility of R is formulated as the difference between the performance of the evaluator on predicting Y with R and without it, i.e.,

perf(f(X, R) → Y) − perf(f(X) → Y),    (1)
where a larger performance gap indicates a better rationale. Existing metrics (Hase et al., 2020; Wiegreffe et al., 2021) compute the performance gap based on prediction accuracies. However, accuracy-based evaluation can only indicate whether or not a rationale is predictive of a label; it cannot quantify how much new information the rationale provides to justify the label. Figure 1 illustrates this issue with an example. Here, accuracy-based evaluation can distinguish between r̂_1,a and r̂_1,b, since r̂_1,a supports y_1 and r̂_1,b does not. However, it is unable to distinguish between r*_1 and r̂_1,a (since both are predictive of y_1), despite the fact that r̂_1,a does not provide any unique and relevant information to answer why the label should be y_1. In practice, vacuous rationales such as r̂_1,a are commonly seen in model generations (Sun et al., 2022; Wiegreffe and Marasovic, 2021). This calls for an evaluation metric that can identify and penalize such vacuous rationales.

An Information-Theoretic Perspective on Rationale Evaluation
The key quantity of interest for our evaluation of rationale R is the amount of new information expressed in R (e.g., background knowledge, steps of a reasoning process) that can justify a label Y. The mutual information between R and Y, I(Y; R), can help evaluate this quantity. However, we are not interested in the information that is already captured in the input X. A vacuous rationale, such as r̂_1,a in Fig. 1, which simply combines the input X and the label Y declaratively, captures all the information in X and Y without specifying any new information that helps understand why Y has been chosen for X. We denote such rationales as B. Thus, we argue that a good evaluation metric must be able to measure the amount of new and label-relevant information contained in a rationale beyond what is contained in any vacuous rationale B that leads to the prediction of Y. The new information in R beyond what is available in B can then be grounded with conditional mutual information (Shannon, 1948):

I(Y; R | B) = I(Y; R, B) − I(Y; B),    (2)

where the difference of the two information quantities mirrors the performance gap in Equation 1. Directly computing mutual information, however, is challenging because the true distributions of the random variables are usually unknown, and we do not have unbounded computation. A recently introduced information-theoretic framework called V-information circumvents this by restricting the computation to certain predictive model families V (Xu et al., 2020). Given a model family V that maps between two random variables R and Y, V-information defines the usable information that models in V can extract from R to predict Y, i.e., I_V(R → Y).
If V generalizes to the set of all possible functions, then V-information recovers mutual information (Shannon, 1948). In practice, it is feasible to estimate the usable information from R about Y by selecting any neural model without frozen parameters as V. Our approach to evaluating rationales builds on an extension of this framework to conditional information by Hewitt et al. (2021), as described below.
Conditional V-information Following conditional mutual information in information theory (Cover and Thomas, 2006), V-information has been extended to conditional V-information (CVI; Hewitt et al., 2021). CVI quantifies the V-usable information in R about Y conditioned on a variable B, i.e.,

I_V(R → Y | B) = H_V(Y | B) − H_V(Y | B, R).

Here B is any vacuous rationale that leads to the prediction of Y. In this work, we consider B simply as the declarative combination of X and Y. The conditional V-entropies follow prior work (Xu et al., 2020; Hewitt et al., 2021; Ethayarajh et al., 2022) and are defined as

H_V(Y | B) = inf_{f ∈ V} E[−log_2 f[b](y)],    (3)
H_V(Y | B, R) = inf_{f ∈ V} E[−log_2 f[r, b](y)],    (4)

where f[b] and f[r, b] produce a probability distribution over the labels given b and [r, b] as inputs, respectively. Further, given g′, g ∈ V which optimize Equations 3 and 4 respectively, we consider pointwise CVI for individual triples (r, y, b):

−log_2 g′[b](y) + log_2 g[r, b](y).    (5)

Computing REV for Rationale Evaluation
Building on the framework of CVI, we propose a new metric, REV, for Rationale Evaluation with conditional V-information. We compute REV over a given test set, D_test = {(x_i, y_i, r_i)}, by estimating CVI over the set with evaluation models g, g′ ∈ V.
For a test example (x, y, r), the REV score, denoted REV(x, y, r), is computed based on Equation 5, where b is constructed by combining x and y. The REV score for the entire test corpus D_test is given by the average pointwise REV score:

REV_D = (1 / |D_test|) Σ_i REV(x_i, y_i, r_i).    (6)

Algorithm 1 Computing REV Scores
1: Input: evaluation models g and g′, test set D_test = {(x_i, y_i, r_i)}
2: S ← ∅
3: for i = 1, ..., |D_test| do
4:   Construct the baseline rationale b_i from x_i and y_i
5:   S.add(REV(x_i, y_i, r_i))  ▷ pointwise REV via Equation 5
6: end for
7: REV_D = mean(S)
8: Output: S, REV_D

Algorithm 1 shows the process of computing both pointwise and aggregate REV scores. The higher the REV score, the more additional (new and label-relevant) information the rationale r contains to explain the label beyond the baseline rationale b. REV(x_i, y_i, r_i) can take positive, negative, or zero values. When REV(x_i, y_i, r_i) > 0, the rationale supplies additional new information supporting the label (e.g., r*_1 in Fig. 1); when REV(x_i, y_i, r_i) = 0, the rationale provides no additional information beyond the baseline (e.g., r̂_1,a in Fig. 1); and when REV(x_i, y_i, r_i) < 0, the rationale does not support the label (e.g., r̂_1,b in Fig. 1). REV can assign a positive score to a rationale for an incorrect prediction as long as the rationale supports it and provides additional information beyond a vacuous baseline rationale (e.g., r̂_2 in Fig. 1). Thus, REV should not be seen as a replacement for prediction accuracy, but rather as an orthogonal metric for interpreting the usefulness of a generated rationale for the model decision.
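To make the computation concrete, below is a minimal sketch of pointwise (Equation 5) and corpus-level (Equation 6) REV, assuming two T5-style evaluators fine-tuned as described in §3.2. The checkpoint paths and the simple concatenation format for [r, b] are hypothetical placeholders, not the paper's exact formatting.

```python
# A minimal sketch of pointwise and corpus-level REV. The evaluator paths
# and the "[r, b]" concatenation format are hypothetical placeholders.
import math
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-large")
g = T5ForConditionalGeneration.from_pretrained("path/to/evaluator_g")              # sees [r, b]
g_prime = T5ForConditionalGeneration.from_pretrained("path/to/evaluator_g_prime")  # sees b only

@torch.no_grad()
def label_log2_prob(model, source: str, label: str) -> float:
    """log_2 p(label | source) under a seq2seq evaluator."""
    enc = tok(source, return_tensors="pt", truncation=True)
    dec = tok(label, return_tensors="pt", truncation=True)
    out = model(**enc, labels=dec.input_ids)
    # out.loss is the mean token-level negative log-likelihood in nats;
    # rescale to a total sequence log-probability in bits.
    n_tokens = dec.input_ids.size(1)
    return -out.loss.item() * n_tokens / math.log(2)

def rev_score(r: str, b: str, y: str) -> float:
    """Pointwise REV(x, y, r) = log_2 g(y | r, b) - log_2 g'(y | b)."""
    return label_log2_prob(g, f"{r} {b}", y) - label_log2_prob(g_prime, b, y)

def rev_corpus(examples):
    """Aggregate REV over a test set (Algorithm 1); examples yields (r, b, y)."""
    scores = [rev_score(r, b, y) for r, b, y in examples]
    return scores, sum(scores) / len(scores)
```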

Experimental Setup
We outline our experimental setup by describing the reasoning tasks and datasets (§3.1), followed by the task and evaluation models (§3.2), and the baseline metrics for comparison (§3.3). Additional details on the setup are provided in Appendix B.

Datasets
We explore two reasoning tasks, CommonsenseQA (CQA) and Natural Language Inference (NLI), across four datasets, all containing human-annotated free-text rationales. For the CQA task, we use ECQA (Aggarwal et al., 2021), CoS-E (v1.11; Rajani et al., 2019), and QuaRTz (Tafjord et al., 2019). For both ECQA and CoS-E, each commonsense question is paired with five candidate choices and the task is to select an answer from the candidates. ECQA contains higher-quality human-written rationales than CoS-E (Aggarwal et al., 2021; Sun et al., 2022). QuaRTz targets open-domain reasoning about textual qualitative relationships, where the task is to select an answer to a question from two options based on a textual qualitative knowledge statement (the rationale). For the NLI task, we use the e-SNLI (Camburu et al., 2018) dataset, which provides explanations for SNLI (Bowman et al., 2015); the task is, given a premise, to predict whether a hypothesis entails, contradicts, or is neutral to it. More details on the datasets are in Appendix B.1.

Task and Evaluation Models
Task models We choose T5 Large (Raffel et al., 2020) as the task model (fine-tuned on ground-truth labels and rationales) to produce generated rationale-label pairs under three settings:
• XY*→R: Given an input text and the ground-truth label, generate a rationale.
• X→YR: Given an input text, generate a label followed by a rationale. Since T5 decodes tokens sequentially, each R is generated conditioned on the predicted Y.
• X→RY: Given an input text, generate a rationale followed by a label. Here, we compute a likelihood for each candidate Y conditioned on R, and then select the most probable candidate; a sketch of this selection step follows this list. This operation can improve the model's prediction accuracy, while weakening the consistency and relevance between the generated rationales and predicted labels.
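Below is a minimal sketch of the X→RY selection step, reusing the label_log2_prob helper from the REV sketch in §2.3; the prompt format is an illustrative assumption, not the paper's exact formatting.

```python
# A sketch of the X→RY label-selection step: score each candidate answer by
# its likelihood conditioned on the input and the generated rationale, then
# pick the argmax. The prompt format here is an illustrative assumption.
def select_label(task_model, question: str, rationale: str, candidates):
    scores = {
        c: label_log2_prob(task_model, f"question: {question} rationale: {rationale}", c)
        for c in candidates
    }
    return max(scores, key=scores.get)
```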
After training, we collect three types of rationale-label pairs by applying the three task models to the test set of each dataset. In addition to these three settings, we also evaluate ground-truth labels paired with crowdsourced rationales (Y*; R*).
Constructing a Baseline with Vacuous Rationales Given an input x and a label y (ground-truth or model-generated), we construct a baseline rationale b by declaratively combining x and y into a sentence. For the CQA task, we adopt a T5-3B question-to-statement conversion model (https://github.com/jifan-chen/QA-Verification-Via-NLI) to combine each question and answer into a declarative sentence. For the NLI task, we first use a template to convert the (premise, hypothesis, label) tuple into a baseline rationale: "premise implies / contradicts / is not related to hypothesis". Then we paraphrase these templated, vacuous NLI rationales using a pre-trained paraphrasing model (https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base) in order to prevent the evaluators from learning the template patterns. Table 1 shows examples of constructed vacuous baseline rationales.
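As a small illustration, the templated NLI construction can be sketched as follows; the function name and interface are ours, and the paraphrasing step described above would be applied to the returned string.

```python
# A minimal sketch of the templated vacuous-rationale construction for NLI.
# The paraphrasing model would be applied to the returned string afterwards.
NLI_TEMPLATES = {
    "entailment": "{premise} implies {hypothesis}",
    "contradiction": "{premise} contradicts {hypothesis}",
    "neutral": "{premise} is not related to {hypothesis}",
}

def vacuous_rationale(premise: str, hypothesis: str, label: str) -> str:
    """Declaratively combine the input and label into a baseline rationale b."""
    return NLI_TEMPLATES[label].format(premise=premise, hypothesis=hypothesis)
```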
Training Evaluation Models, g and g′ We train two evaluation models, g and g′, which take [r, b] and b as inputs, respectively (see Equation 5). After training, the evaluation models are applied to evaluate a rationale-label pair (y, r) w.r.t. an input x. The rationale-label pair (y, r) can be model-generated, and the label may not be the ground truth (e.g., y_2 in Fig. 1); REV is still able to assess the rationale along the two dimensions (§1). We refer readers to Appendix B.3 for results using T5 Base, BART Large (Lewis et al., 2020), and GPT-2 Large (Radford et al., 2019) as evaluation model architectures.

Other Metrics for Rationale Evaluation
We compare with two existing automatic metrics for free-text rationale evaluation: LAS (Hase et al., 2020) and RQ (Wiegreffe et al., 2021). Analogous to our evaluation models, both approaches use proxy models; we use the same architecture (T5 Large) across metrics in our reported results.
Leakage-Adjusted Simulatability (LAS) Hase et al. (2020) evaluate the quality of free-text rationales via a proxy model, trained with the task model's outputs as labels and the original input texts combined with rationales as input sequences. The metric computes the difference between the proxy's accuracy on the predicted label when the rationale is included in the input vs. when it is not. Rationale Quality (RQ) computes a similar performance gap, but with respect to gold labels rather than the task model's predictions (see §4.3).
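A rough sketch of the LAS aggregation just described, under our reading of Hase et al. (2020): examples are grouped by whether the rationale alone "leaks" the label, and the simulatability gap is averaged within each group. The array-based interface is our own simplification.

```python
# A rough sketch of LAS aggregation (our reading of Hase et al., 2020):
# group examples by rationale-only leakage, average acc(x, r) - acc(x)
# within each group, then average the group means.
import numpy as np

def las(correct_xr, correct_x, leaked):
    """correct_xr / correct_x: 0/1 arrays of proxy correctness with and
    without the rationale; leaked: 0/1 array, rationale-only prediction correct."""
    correct_xr, correct_x, leaked = (np.asarray(a) for a in (correct_xr, correct_x, leaked))
    group_gaps = []
    for flag in (0, 1):
        mask = leaked == flag
        if mask.any():
            group_gaps.append((correct_xr[mask] - correct_x[mask]).mean())
    return float(np.mean(group_gaps))
```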

Evaluating REV
We first compare REV with existing metrics (§4.1) and human judgments (§4.2) on the ECQA dataset, and also report REV on other CQA and NLI benchmarks. We then test the sensitivity of different metrics to input perturbations (§4.3). Next, we apply REV to generations obtained via few-shot prompting (§4.4). Additional experiments are listed in Appendix C.

Comparison Between Evaluation Metrics
We compare REV with LAS and RQ in evaluating different rationale-label pairs on the ECQA dataset. In addition to XY*→R, X→YR, X→RY, and (Y*; R*), we also evaluate the vacuous baseline rationales (Y*; B) constructed with ground-truth labels. LAS, RQ, and REV are not directly comparable due to different scales and criteria (e.g., log-probability vs. accuracy); hence, our focus remains on the ranking over different sources of rationale-label pairs.
Results are shown in Figure 2 (left panel). All three metrics rank the crowdsourced rationales (Y*; R*) in ECQA the highest. While by definition REV for vacuous rationales (Y*; B) is low, both LAS and RQ scores for these rationales are quite high, showing that these metrics are incapable of measuring the amount of additional information in rationales. Intuitively, we expect weaker rationale-label consistency in the X→RY setting than in X→YR, as the labels are forcefully selected among the candidates rather than freely generated by the task model (§3.2). While REV captures this intuition and ranks X→YR higher than X→RY, LAS and RQ produce a different ranking. Qualitative results comparing all three metrics are provided in Table 4 in Appendix C.1; Table 8 qualitatively analyzes rationales with negative REV scores.
We additionally analyze REV for "input-irrelevant rationales": sentences extracted from a knowledge base that contain the ground-truth labels but do not necessarily explain the labels w.r.t. the inputs. Results in Appendix C.2 show that REV penalizes such irrelevant rationales.
Next, we apply REV to evaluate crowdsourced and model-generated rationale-label pairs ((Y*; R*), XY*→R, X→YR, X→RY) across the different datasets. For each dataset, the evaluation models are trained on the training set with gold labels and crowdsourced rationales. The results are shown in Table 2. We observe that the gold rationales in the ECQA dataset achieve a higher REV score than those in CoS-E. This observation is in line with the known quality issues of crowdsourced rationales in CoS-E (Aggarwal et al., 2021; Sun et al., 2022). Interestingly, model-generated rationales (XY*→R) achieve a higher REV score than crowdsourced rationales for CoS-E (see examples in Table 7); see Appendix C.3 for a qualitative analysis of CoS-E rationales. QuaRTz has higher-quality rationales than ECQA, CoS-E, and e-SNLI. In the case of e-SNLI, the problem is severe, as most of the crowdsourced or generated rationales do not provide reasoning but rather follow a label-specific template, e.g., "A implies (that) B" (Kumar and Talukdar, 2020; Brahman et al., 2021).

Human Evaluation
We collect crowdworker judgments via Amazon Mechanical Turk to understand how REV correlates with human judgments of rationales. We randomly sample 230 examples from the ECQA test set and ask workers to evaluate the four types of rationale-label pairs ((Y*; R*), XY*→R, X→YR, X→RY) for each example. (We do not consider (Y*; B) because we have trained workers to recognize baseline rationales as vacuous.) We present workers with a question (input text), an answer (label), and an explanation (rationale), and ask them whether the explanation justifies the answer (yes/no). If they answer yes, we further ask them to evaluate the amount of additional information supplied by the explanation toward why the answer might have been chosen for the question, choosing from none / little / some / enough, corresponding to a 4-point Likert scale (0/1/2/3). We collect 3 annotations per instance and use majority vote to decide whether the rationale can justify the label. If yes, we take the average over the 3 human-annotated scores as the amount of information; otherwise, we give a score of -1. More details of the human evaluation are in Appendix C.4.

Results are shown in the right panel of Fig. 2, where the ranking of the four types of rationale-label pairs is (Y*; R*) > XY*→R > X→YR > X→RY. While LAS and RQ rank X→RY above X→YR (see the left panel of Fig. 2), the ranking from REV is more consistent with human judgments, suggesting its effectiveness in evaluating rationales.

Is REV sensitive to input perturbations?
A robust metric should be sensitive to changes in rationale-label pairs and reflect their relationships under input perturbations. We test the sensitivity of all automatic metrics to input (X) perturbations in the task model, under two settings: X→YR and X→RY. Following Wiegreffe et al. (2021), we add zero-mean Gaussian noise N(0, σ²) to input word embeddings during inference, inducing task models to produce progressively degenerate rationales and labels; a sketch of this perturbation is given below. Results in Fig. 3 indicate that REV (b) and RQ (c) follow similar trends for X→RY. However, LAS is less sensitive to noise for both joint models, X→RY (a) and X→YR (d). Since the proxy model for LAS was trained on the task models' predicted labels and generated rationales, it can overfit to the degenerate rationale-label pairs under input perturbations, and is hence less sensitive to input noise during inference. The largest differences between REV and RQ are for X→YR.
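A minimal sketch of this perturbation follows, under the assumption that the noise is injected through inputs_embeds at generation time; this wiring is one reasonable implementation, not necessarily the paper's exact one.

```python
# A sketch of the perturbation protocol: zero-mean Gaussian noise with
# variance sigma2 is added to the task model's input word embeddings at
# inference time, and generation proceeds from the noised embeddings.
import torch

@torch.no_grad()
def generate_with_noise(model, tokenizer, text: str, sigma2: float, **gen_kwargs):
    enc = tokenizer(text, return_tensors="pt")
    embeds = model.get_input_embeddings()(enc.input_ids)
    embeds = embeds + torch.randn_like(embeds) * (sigma2 ** 0.5)
    return model.generate(inputs_embeds=embeds,
                          attention_mask=enc.attention_mask, **gen_kwargs)
```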
We observe that the task model can predict incorrect labels and then make up reasonable-sounding rationales for its wrong predictions under certain input perturbations; prior work also reports this finding (Narang et al., 2020). REV does not drop under a certain amount of input perturbation (e.g., σ² ≤ 20) in Fig. 3, likely because the generated rationales still provide new information describing both correct and incorrect labels (also see the example in Table 6). However, as the noise exceeds a certain level, REV decreases, indicating that the task model is no longer able to make up rationales for very noisy inputs. On the other hand, the behavior of RQ in Fig. 3 (e) is quite different from REV. Since RQ is computed based on gold labels (§3.3), it has reduced sensitivity to input perturbations. When the prediction accuracy decreases, the overall evaluation of RQ is dominated by the results on incorrect predictions, as shown in Fig. 3 (e). We refer readers to Table 6 in Appendix C.5 for a qualitative analysis of the sensitivity test.

Evaluating Rationales in Few-shot Prompting
We test the ability of REV to evaluate rationales generated by few-shot prompting, and use it to gain insights into the reasoning and prediction processes of large language models (e.g., GPT-3).
GPT-3 Rationales for Gold Labels Wiegreffe et al. (2022) collected 250 high-quality free-text rationales generated by few-shot prompting with GPT-3 (Brown et al., 2020) for CQA (given gold labels). Each example was assessed by 3 crowdworkers. We focus on two aspects of their annotations: "supports the gold label" and "amount of information". Crowdworkers provide a yes/no answer indicating whether a rationale supports the corresponding gold label.
Only when the answer is yes are they further asked to evaluate the amount of information contained in the rationale for justifying the label. The amount of information is roughly categorized into 3 levels: "Not Enough", "Enough", "Too Much", each annotated with a Likert-scale score. (The original human-annotated scores for the three levels are -1, 0, 1; since Wiegreffe et al. (2022) suggest "a value of 0 is preferred to a value of 1", we map the scores {-1, 0, 1} to {0, 1, 2} accordingly, and the value -1 is then given to examples annotated as not supporting the gold label.) In Fig. 4, we compare human annotation scores for amount of information (we take a majority vote to decide "supports the gold label", and average "amount of information" over the 3 workers) with the pointwise scores obtained by the three automatic metrics, LAS, RQ, and REV. For the automatic metrics, the evaluation models of REV and the proxy models of LAS and RQ are trained on the ECQA training set with gold labels and human-annotated rationales (§3.2). We observe that REV provides a finer-grained assessment of the information contained in rationales compared to LAS and RQ, which only take values in {-1, 0, 1}. When LAS and RQ are zero, it is unclear whether the rationale supports the label or not, because the model proxy may predict the label based on the input only. The judgments of REV on whether rationales support labels (REV > 0) are close to human judgments (i.e., 80% agreement). The support rates of LAS and RQ are relatively low, i.e., 35% and 23%, while a large portion (56% and 60%, respectively) corresponds to a zero LAS / RQ score.

Figure 4: Histograms of human-annotated amount of information and pointwise REV, LAS, and RQ scores on GPT-3 few-shot prompted rationales for gold labels.

Chain of Thought Rationales Wei et al. (2022) propose chain-of-thought prompting to teach large language models to produce intermediate reasoning steps (rationales) before prediction, which improves their prediction performance on a range of reasoning tasks (e.g., arithmetic and symbolic reasoning). However, the reported improvement is trivial for CQA (Wei et al., 2022). We evaluate the chain-of-thought rationales released by Wei et al. (2022) for GPT-3 and LaMDA on CQA (available at https://github.com/jasonwei20/chain-of-thought-prompting). Figure 5 shows the distributions of REV for correctly and incorrectly predicted instances from GPT-3 and LaMDA, respectively. For both GPT-3 and LaMDA, the REV distributions of correct and incorrect predictions are similar, and most instances have positive REV scores. The average REV scores over correct and incorrect predictions (blue and red dashed lines, respectively) are close, especially for GPT-3. This is consistent with our observation that most generated rationales from the two models describe their predicted labels. The prediction accuracy of GPT-3 is much higher than that of LaMDA (77% vs. 59%), while the average REV scores over all instances (gray dashed lines) are close (0.92 vs. 0.99). An insight we obtain is that the generated intermediate reasoning steps (rationales) support the models' predictions (consistent REV scores) but cannot guarantee their correctness (discrepant accuracies between GPT-3 and LaMDA). This partially explains the minor improvement from chain-of-thought prompting on CQA.

Related Work
Model rationales broadly fall into two categories: extractive rationales and free-text rationales. Extractive rationales contain some important features extracted from input texts that make models produce final predictions (Lei et al., 2016;DeYoung et al., 2020;Jain et al., 2020;Schulz et al., 2020). Free-text rationales are produced by generative models in the form of natural language. Compared to extractive rationales, free-text rationales explain model predictions in a more human-like way and fill the gap in explaining reasoning tasks (Camburu et al., 2018;Narang et al., 2020;Rajani et al., 2019;Kumar and Talukdar, 2020;Brahman et al., 2021).
Evaluation of extractive rationales has been well studied, generally from two perspectives: faithfulness and plausibility (DeYoung et al., 2020; Pruthi et al., 2022; Chan et al., 2022b). Faithfulness measures the extent to which rationales reflect the true reasoning process of models, while plausibility evaluates how convincing rationales are to humans (Jacovi and Goldberg, 2020). Other perspectives include the ability of rationales to help a student model simulate a teacher model (Pruthi et al., 2022) or to bridge the communication between a classifier and a layperson (Treviso and Martins, 2020). Existing automatic metrics for free-text rationales focus on rationale-label association, and measure the utility of a rationale based on how much it helps a model proxy predict the given label (inspired by human simulatability; Doshi-Velez and Kim, 2017) (Hase et al., 2020) or the gold label given the input (Wiegreffe et al., 2021). Chan et al. (2022a) further propose a framework to evaluate the automatic metrics themselves. However, none of these consider measuring the amount of additional new information in free-text rationales. Sun et al. (2022) conduct a human study on the additional knowledge provided by free-text rationales. Ours is the first work to propose an automatic metric that quantifies the new information in free-text rationales.

Conclusion
We introduce REV, an information-theoretic measure to evaluate the amount of new, label-relevant information in free-text rationales, beyond the information contained in the input. We empirically demonstrate the advantage of REV compared to existing metrics focusing simply on label-rationale association, and show that REV is more consistent with human judgments. REV also offers insights into evaluating rationales generated via few-shot prompting. While we recommend the usage of REV alongside traditional performance metrics, future work might explore a combined metric to measure the correctness of a prediction as well as the informativeness of the rationale towards this prediction. Ultimately, free-text rationales are for the benefit of human users and there exist multiple criteria for human utility of rationales (Joshi et al., 2023), beyond label relevance and informativeness.

Limitations
In its current formulation, REV might reward a rationale for an incorrect prediction as long as the rationale supports the prediction with relevant additional information. Additionally, our metric does not consider the factuality of rationales. Future work might explore evaluations that penalize rationales which support incorrect predictions, thus bridging predictive performance with interpretability metrics. We considered a single declarative construction for baseline rationales, and leave the analysis of how different baseline constructions impact our metric to future work. Another limitation is that the utility of REV depends on the quality of the crowdsourced rationales used to train the evaluators: building a good REV evaluator requires high-quality rationales that provide sufficient new information (e.g., commonsense knowledge) to explain the corresponding labels. The architecture of the evaluation models also has an impact on REV; using different evaluator architectures may result in varying REV scores, as discussed in Appendix B.3.

Ethics Statement
All datasets used in this work are public, and deal with situations encountered in daily life; these are the examples provided for human annotation. Generated rationales sometimes contain non-factual statements or misinformation. While it is plausible that some rationales generated by the model or some data instances might contain offensive material, to the best of our knowledge we did not encounter such examples. We did not collect any personal information (e.g. demographics and identities) of participants in any of the human evaluation experiments.

A Properties of Conditional V-information
As proved by Hewitt et al. (2021), CVI has several useful properties, including:

1. Non-Negativity: I_V(R → Y | B) ≥ 0.
2. Monotonicity: a more expressive model family can extract at least as much usable information as a less expressive one.

An implication of Monotonicity is that complex models (e.g., pre-trained language models) might do better than simpler ones (e.g., linear models) in estimating V-usable information. Since CVI measures the additional V-usable information in R about Y beyond what is already extracted from B by models in V, it grounds the goal of the proposed metric REV.

B Experimental Details

B.1 Datasets

For the CQA task, we use ECQA and CoS-E, where each commonsense question is paired with 5 candidate choices and the task is to select an answer from the candidates. ECQA contains higher-quality free-text rationales than CoS-E, in terms of comprehensiveness, coherence, non-redundancy, etc. (Aggarwal et al., 2021; Sun et al., 2022).
QuaRTz is an open-domain reasoning task about textual qualitative relationships. Each instance contains a situated qualitative question, two answer options, and a knowledge statement. The task is to select an answer to the question from the two options based on the textual qualitative knowledge. We use the knowledge statement as a free-text rationale since it explains why the answer is correct for the question. For the NLI task, we use e-SNLI (Camburu et al., 2018), an extension of SNLI (Bowman et al., 2015) with augmented free-text human-written rationales. The task is to predict the entailment relationship between a premise and a hypothesis. We use CoS-E v1.11, where each question is paired with 5 answer choices, for comparison with ECQA. Since CoS-E does not provide rationales for instances in the test set, we use the original development set as the test set and hold out 10% of the training data as the new development set. For e-SNLI, we follow Hase et al. (2020) and randomly sample 10% of the training data to form the training set for fine-tuning our models. Figure 6 shows the summary statistics of the four datasets.

B.2 Training Details

We use Huggingface Transformers (Wolf et al., 2020) to access all task and evaluation models. We train each model for up to 20 epochs with a learning rate of 5e-6 and a batch size of 8. All experiments were performed on a single NVIDIA RTX 8000 GPU. Table 3 shows the input-output formatting of the different task models for each task.
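As a concrete (hedged) illustration of this training setup, the following sketch fine-tunes a T5 model with the stated hyperparameters. The training pairs and input formatting are hypothetical placeholders, and a real run should mask label padding tokens (set them to -100) in the loss.

```python
# A hedged sketch of model fine-tuning with the stated hyperparameters
# (lr 5e-6, batch size 8, up to 20 epochs). The training pairs and
# formatting are placeholders; label pad tokens should be masked in practice.
import torch
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

pairs = [("question: ... rationale: ...", "answer_a"),   # hypothetical examples
         ("question: ... rationale: ...", "answer_b")]

def collate(batch):
    sources, targets = zip(*batch)
    enc = tok(list(sources), return_tensors="pt", padding=True, truncation=True)
    enc["labels"] = tok(list(targets), return_tensors="pt", padding=True).input_ids
    return enc

loader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate)
model.train()
for epoch in range(20):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```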

B.3 Comparison Between Evaluator Architectures
We apply REV to evaluate different types of free-text rationales w.r.t. labels on the ECQA dataset. Figure 7 shows REV scores of the four types of rationale-label pairs evaluated by four evaluator architectures. The ranking of the four groups of rationale-label pairs is consistent across the four evaluators, i.e., (Y*; R*) > XY*→R > X→YR > X→RY. This ranking is also consistent with the human evaluation in §4.2. Since ECQA contains high-quality crowdsourced rationales (Aggarwal et al., 2021), it is expected that the REV of gold rationale-label pairs (Y*; R*) is the highest. The REV of XY*→R is close to that of (Y*; R*), indicating that the task model (T5 Large) can produce good-quality rationales when it is prompted with ground-truth labels. All four evaluators agree that the generated rationales of X→YR contain more new, label-relevant information than those of X→RY.

C.1 Qualitative Analysis

Table 8 shows some examples of X→RY with negative REV scores on the ECQA dataset. When REV < 0, we observe that in most cases the rationale does not support the given label, instead indicating other labels or even something beyond the label candidates (e.g., "helicopter" in the second example), or it may simply repeat the input (e.g., the first example). The same observation holds for other types of rationale-label pairs.

C.2 Additional Analysis on Label-Related But Input-Irrelevant "Rationales"
In some cases, a rationale contains the given label and provides new information related to the label, but does not necessarily explain why the label is selected for the input. To evaluate such rationales, we randomly select 250 gold labels in ECQA and extract their related sentences from a large-scale knowledge base, GenericsKB (Bhakthavatsalam et al., 2020). Those sentences contain the labels, but may provide little or irrelevant new information to explain the labels w.r.t. the inputs. We use them as trivial rationales for evaluation. The average REV scores for these trivial rationales and their crowdsourced counterparts are 0.26 and 1.14, respectively, indicating the effectiveness of REV in identifying new and relevant information in rationales. Table 5 shows the REV scores of some examples and the corresponding crowdsourced rationales. The results show that REV can distinguish the new information in different rationales and penalize meaningless rationales. Overall, REV gives higher scores to crowdsourced rationales than to trivial sentences from GenericsKB.

C.3 Qualitative Analysis of CoS-E Rationales
Table 7 shows example REV scores for crowdsourced and model-generated (XY*→R) rationales for CoS-E. The main observation is that model-generated rationales (XY*→R) generally support the labels, though they provide limited new information, while many crowdsourced rationales in CoS-E are noisy or uninformative. Specifically, compared to the crowdsourced rationales in CoS-E, we observe that XY*→R produces better rationales that support the labels, which also corresponds to higher REV scores. However, the new information contained in those rationales is still limited (see the examples). A possible reason is that the task model (XY*→R) can hardly learn to produce more informative rationales when trained on the lower-quality rationales from CoS-E, a known quality issue reported in prior work (Aggarwal et al., 2021; Sun et al., 2022).

C.4 Human Evaluation Details
We randomly select 230 examples from the ECQA test set and conduct a human evaluation of the four types of rationale-label pairs ((Y*; R*), XY*→R, X→YR, X→RY) for each example through Amazon Mechanical Turk (AMT). We select workers located in Australia, Canada, the UK, or the US, with a past HIT approval rate of >98% and >5000 HITs approved. Each instance is assessed by 3 workers. We pay workers $0.08 for assessing each instance. Figure 8 shows the instructions we provide to workers. In Figure 9, we show three examples, illustrating when the explanation (rationale) does not justify the answer (label), when the explanation supports the answer but supplies no additional information, and when the explanation supports the answer and provides additional information. Figure 10 shows the interface of the actual HIT for human evaluation.
For each instance, we provide a question (input), an answer (label), and an explanation (rationale), and ask the workers to answer the following two questions: 1. Does the Explanation justify the given Answer? (yes or no) This question asks workers to judge whether the rationale supports the label.
2. If yes, how much additional information does the Explanation provide to justify the Answer beyond just reiterating what is stated in the Question and Answer? (No additional info, Little additional info, Some additional info, Enough additional info) We only ask this question if a worker chooses "yes" for the first question. It measures the extent to which the rationale provides additional information for justifying the label beyond restating the input and the label.
C.5 Qualitative Results of Sensitivity Test

Table 6 shows some examples from the sensitivity test in §4.3. One excerpt (σ² = 30, REV = 0) shows a degenerate output: "banishing oneself from one's own body by removing it from the body of another person. This is a way of preventing one from being harmed by other people. It is possible to get rid of people from their own bodies if they are living in the same house. That is why they have to live together."

Another excerpt (X→RY): Question: What happens when not paying attention when bathing? Choices: dry skin, using soap, reading newspaper, shampoo, slip and fall. Gold label: slip and fall. Gold rationale: Bathing is generally done in a bathroom. Bathrooms could be slippery due to water. If not careful, slipperiness can cause one to slip and fall.

Table 8: Examples of negative REV scores for rationale-label pairs from X→RY on the ECQA dataset.
- Question: What enables most people to transport themselves? Label: own cars. Generated rationale: "People who believe in god are able to transport themselves through helicopter." REV = -0.19
- Question: Where would you expect to find popcorn in a public place? Label: movie theater. Generated rationale: "Popcorn can be found in a public place." REV = -1.27
- Question: What are you usually at when you sit on a bench on a curb? Label: city. Generated rationale: "Ohio is a state in the United States. You are usually at street corner when you sit on bench on curb." REV = -0.27