ZARA: Improving Few-Shot Self-Rationalization for Small Language Models

Language models (LMs) that jointly generate end-task answers as well as free-text rationales are known as self-rationalization models. Recent works demonstrate substantial performance gains for self-rationalization by few-shot prompting LMs with rationale-augmented exemplars. However, the ability to benefit from explanations only emerges with large-scale LMs, which have poor accessibility. In this work, we explore the less-studied setting of leveraging explanations for small LMs to improve few-shot self-rationalization. We first revisit the relationship between rationales and answers. Inspired by the implicit mental process of how human beings assess explanations, we present a novel approach, Zero-shot Augmentation of Rationale-Answer pairs (ZARA), to automatically construct pseudo-parallel data for self-training by reducing the problem of plausibility judgement to natural language inference. Experimental results show ZARA achieves SOTA performance on the FEB benchmark, for both task accuracy and the explanation metric. In addition, we conduct human and quantitative evaluations validating ZARA's ability to automatically identify plausible and accurate rationale-answer pairs.


Introduction
Driven by concerns over whether the decisions made by artificial intelligence models are trustworthy, providing free-text, natural language explanations (NLEs) has drawn substantial attention in the research community (Camburu et al., 2018; Li et al., 2018; Rajani et al., 2019; Aggarwal et al., 2021; Chen et al., 2022). Compared with popular explanation techniques restricted to the input scope, e.g., attributing feature importance scores to tokens (Li et al., 2016; Godin et al., 2018) or extracting fragments of text highlights (Lei et al., 2016; Jain et al., 2020), free-text explanation is more expressive, inherently apt for human comprehension, and brings richer information beyond the input context (Camburu et al., 2018; Wiegreffe et al., 2021). Yet, the construction of NLE datasets is expensive and challenging due to quality control issues such as inconsistency and under-specification (Wiegreffe and Marasovic, 2021). This necessitates the development of interpretable NLP systems that can provide NLEs in a few-shot manner.

[Figure 1 example. Question: Which sentence is more nonsensical? (A) I drove my car to the gas station. (B) I drove my computer to the gas station. Answer: (B). Rationale: You can't drive a computer.]
Recent works (Wei et al., 2022; Wang et al., 2022b; Lampinen et al., 2022) achieve few-shot self-rationalization, i.e., jointly generating free-text explanations and end-task labels, by extending the usage of NLEs to compose chain-of-thought (CoT) input-rationale-output demonstrations for prompt-based learning. Compared with standard prompting (i.e., without rationales), prompting with rationale-augmented exemplars triggers the LM's complex reasoning ability, significantly boosting end-task performance. However, the main drawback is that only excessively large LMs (generally 100B-plus) demonstrate this ability to leverage explanations, which emerges sharply once model size is scaled sufficiently (Wei et al., 2022; Lampinen et al., 2022).
In this work, we explore the less-studied setting of improving few-shot self-rationalization relying only on affordable, small LMs (200M∼2.7B). We adopt self-training (Scudder, 1965), a simple yet effective methodology that is not practical for large LMs in most real-world scenarios. We first investigate the relationship between the generated explanations and end-task predictions, and find that plausible explanations are usually paired with correct label predictions. Namely, plausibility is a strong indicator of answer correctness. Motivated by this finding, we propose Zero-shot Augmentation of Rationale-Answer pairs (ZARA) for self-training.
Specifically, we reduce the problem of assessing rationale plausibility to the task of natural language inference (NLI), and propose a zero-shot plausibility approximator for automatic assessment of the generated rationales, without requiring any ground-truth labels or gold explanations. The approximator can be viewed as an agent for plausibility judgement. As illustrated in Figure 1, to determine the plausibility of a rationale, humans implicitly ask themselves whether they can draw a conclusion to the predicted answer by understanding the task, the input question, and the supporting rationale with their logic and reasoning. To approximate such a process explicitly, the approximator leverages textual entailment to yield a probability score indicating explanation plausibility. Connecting to the self-training paradigm, we first train a self-rationalization model by few-shot prompt-based learning with natural language prompts, and leverage the approximator to collect pseudo-parallel data, i.e., unlabeled inputs paired with high-confidence rationale-answer pairs, to create an augmented training set which is then used to learn an improved self-rationalization model.
With various small-size LMs, experiments show our approach notably improves results on the FEB benchmark (Marasovic et al., 2022), a recently proposed standardized few-shot self-rationalization benchmark, by 3.4%∼5.1% in task accuracy and 3.0%∼5.8% in the associated explanation metric. Additionally, we validate the approximator's ability with both human and quantitative evaluations. The results suggest our approximator can effectively select plausible explanations that lead to higher accuracy for end-task predictions. In summary, our main contributions are three-fold: 1. We show how to leverage explanations for small LMs by an in-depth analysis of the relationship between rationales and task labels.
2. We propose ZARA, a novel approach for small LMs to improve self-rationalization with self-training.
3. Our NLI-based approximator sheds light on the potential of automatic evaluation for explanation plausibility and post-hoc verification for label accuracy.

Background and Motivation
Given a trained self-rationalization model f_θ(·) and an input sequence x, we denote a prediction f_θ(x) = (r, â), where r is the generated free-text rationale and â is the predicted answer, typically a classification label. Note that r and â are parsed from the output sequence of f_θ(x). Evaluating a self-rationalization model requires assessing both â for end-task performance and r for the quality of the explanation. Lacking an ideal and unified automatic metric, the current gold standard for determining the quality of r is human evaluation of its plausibility (Marasović et al., 2020; Kayser et al., 2021; Wiegreffe et al., 2022; Marasovic et al., 2022). An ideal r is considered plausible if it is able to justify â, that is, it provides a logical and reasonable explanation supporting the model's prediction. However, even if r is deemed plausible by humans, â is not necessarily correct. As in the example in Table 1, commonsense suggests "bed" is likely the answer, yet the generated explanation for the corresponding prediction "couch" is still plausible. Plausibility illustrates the degree of convincement towards the model's prediction, regardless of whether the model is actually making an accurate prediction or not (Jacovi and Goldberg, 2021).

Naturally, generating plausible explanations that justify wrong answers should be much harder than justifying correct answers, since such r must pivot slightly from commonsense yet still introduce a sound reason to support the inaccurate â. We hypothesize that this circumstance, a plausible explanation for an inaccurate end-task prediction, is rare among (â, r) pairs. In other words, if r is considered plausible, it is likely that â is a correct prediction. Hence, the first research question arises: RQ1: "To what extent do plausible explanations imply correct label predictions?" And if we could verify RQ1, the follow-up question would be RQ2: "Is it possible to automatically identify plausible r and utilize (r, â) for further model improvement?" In the remainder of this work, we answer RQ1 by inspecting the interrelationship between the plausibility of r and the correctness of â (Section 4), where we show evidence supporting the linkage to RQ2. We then propose ZARA coupled with self-training to accomplish RQ2 (Section 5), improving few-shot self-rationalization models.

Datasets and Tasks
We adopt FEB (Marasovic et al., 2022), a newly proposed few-shot self-rationalization benchmark, as the dataset for experiments throughout this work. (A note on evaluation: prior work (…, 2021; Marasovic et al., 2022) only evaluates r when â = a, i.e., explanations for correctly predicted answers; this may overestimate the quality of explanations (Wiegreffe et al., 2022).) FEB consists of four sub-tasks drawn from existing English-language explainable datasets with free-text explanations: (1) Nonsensical sentence selection (COMVE; Wang et al., 2019): given two sentences, select the sentence that is less likely to make sense. (2) Offensiveness classification (SBIC; Sap et al., 2020): classify a given post as offensive or not. (3) Natural language inference (E-SNLI; Camburu et al., 2018): classify the relationship between two sequences as entailment, neutral, or contradiction. (4) Multiple-choice commonsense QA (ECQA; Aggarwal et al., 2021): given a question, select the correct answer from five choices.
The goal of each sub-task is the same: predict a label for the underlying classification task and generate a free-text explanation supporting the model's decision. Each sub-task has 60 episodes, and each episode is a train-test split with 48 training examples and 350 evaluation examples. This design with no extra validation data follows the FLEX principles (Bragg et al., 2021) for robust few-shot NLP evaluation, avoiding per-episode hyper-parameter tuning, which could considerably inflate evaluation results as shown in previous work (Gao et al., 2021). Hence, a single set of hyper-parameters is used across all episodes.

Correlation between Plausibility and Correctness
As described in Section 2, we now attempt to answer RQ1 by measuring the correlation between the plausibility of r and the correctness of â. We conduct human studies on results from a self-rationalization model (without self-training) using the FEB dataset. We adopt prompt-based fine-tuning with natural language prompts on a sequence-to-sequence language model to perform few-shot self-rationalization.
[Table 2: The mapping design for the four sub-tasks with non-cherry-picked examples; see Section 3 for a description of each sub-task. For instance, the E-SNLI mapping yields Premise: "A woman in a black mesh skirt plays acoustic guitar. The woman is wearing a black mesh, she is wearing black." and Hypothesis: "A woman is wearing black."; the ECQA mapping yields Premise: "Because state park is a [...] gardens are places with lots of trees and plants." and Hypothesis: "The answer of the question "what is a place that has a bench nestled in trees?" is state park." (template: Hypothesis: The answer of the question "[question]" is [Answer's choice]; NLI class: Entailment).]

For each episode of the sub-task, we train a self-rationalization model with the training set and generate rationale-answer pairs on the test set. We then gather all predictions from the 60 episodes and randomly select 350 examples for human studies. We present the task description, the input instance x, and the rationale-answer pair (r, â) to the annotators, and ask them to judge the plausibility of r, i.e., whether it can justify â. Following prior works (Marasović et al., 2020; Marasovic et al., 2022), annotators determine plausibility by assigning labels from {"no", "weak no", "weak yes", "yes"}. We then map the labels to plausibility scores {1, 2, 3, 4}; instances with average scores above 2.5 are deemed plausible. We provide inter-annotator agreement details in Appendix C.
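The label-to-score mapping and the 2.5 cutoff described above can be sketched as follows (the label names and threshold follow the text; the function name is our own):

```python
# Plausibility aggregation: four ordinal labels are mapped to scores 1-4,
# averaged across annotators, and compared against the 2.5 cutoff.
LABEL_TO_SCORE = {"no": 1, "weak no": 2, "weak yes": 3, "yes": 4}

def is_plausible(annotator_labels):
    """Return True when the mean annotator score exceeds 2.5."""
    scores = [LABEL_TO_SCORE[label] for label in annotator_labels]
    return sum(scores) / len(scores) > 2.5

is_plausible(["yes", "weak no", "weak yes"])  # (4 + 2 + 3) / 3 = 3.0 -> True
```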
The results are shown in Figure 2. We can observe that for all sub-tasks, explanations judged as plausible are much more likely to be paired with correctly predicted answers than implausible ones. This verifies our hypothesis (discussed in Section 2) and shows plausibility to be a strong signal for correct label predictions. Our results also align with prior work (Wiegreffe et al., 2021), which finds that self-rationalization models demonstrate high label-rationale association under robustness testing. In conclusion, identifying (r, â) pairs with plausible r holds great potential for boosting model performance, and connects us to RQ2.

Zero-Shot Augmentation of Rationale-Answer Pairs
As shown in Section 4, plausible explanations imply that the corresponding task predictions are more likely to be correct. In the following we present ZARA, an approach that automatically judges the plausibility of generated explanations and leverages high-confidence rationale-answer pairs to boost model performance via self-training.

Reduce plausibility judgement to NLI
Given a rationale-answer pair (r, â) output by a self-rationalization model, a human evaluates whether r is plausible by understanding the input context and the task objective, and applying reasoning to determine whether r justifies â. Specifically, humans implicitly form propositions from the input context and rationale by understanding the problem (the task), then perform inference, i.e., apply logic and reasoning to draw conclusions, to decide whether the propositions support the predicted answer. This mental process of assessing plausibility resembles determining the relationship between a premise and a hypothesis. Driven by this formulation, we reduce the problem of judging the plausibility of explanations to the task of natural language inference (NLI), and construct a zero-shot approximator, which leverages existing NLI models to automatically approximate human judgement of plausibility.
NLI Mapping. The formulation as NLI requires a mapping (x, r, â) → (p, h), where x, p, and h are the input instance, premise, and hypothesis, respectively. We manually create the mappings for each FEB sub-task as shown in Table 2. Constructing such mappings can be achieved with minimal effort compared with human evaluation of all r. Consider the COMVE example in Table 2: the goal is to select the nonsensical sentence from two sentences. As we can see, "i drove my computer to the gas station." is nonsensical, and the rationale justifies it by stating "you can't drive a computer.", which provides information refuting the answer sentence, resulting in a contradiction relationship between the two. Hence, the approximator can estimate the degree of plausibility by referring to the score of the contradiction class.
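As a sketch, the (x, r, â) → (p, h) mapping for two of the sub-tasks might look like the following; the templates mirror Table 2, but the function names and return conventions are illustrative assumptions, not the paper's exact implementation:

```python
# Hypothetical mapping helpers: each returns a (premise, hypothesis,
# target NLI class) triple. The target class tells the approximator which
# probability to read off as the pseudo-plausibility score.

def map_ecqa(question, rationale, answer):
    # ECQA: the rationale should entail the chosen answer.
    premise = f"Because {rationale}"
    hypothesis = f'The answer of the question "{question}" is {answer}.'
    return premise, hypothesis, "entailment"

def map_comve(rationale, selected_sentence):
    # COMVE: the rationale refutes the sentence picked as nonsensical,
    # so plausibility is read from the *contradiction* score.
    return rationale, selected_sentence, "contradiction"

p, h, target = map_comve("you can't drive a computer.",
                         "i drove my computer to the gas station.")
```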
The approximator. For the approximator, we ensemble three state-of-the-art pre-trained NLI models by averaging their output scores to decide the NLI class. Specifically, we adopt RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2020), and BART (Lewis et al., 2020), each trained on the MultiNLI corpus (Williams et al., 2018), one of the largest available NLI datasets. The approximator is zero-shot, i.e., all three models are used off-the-shelf (see Appendix A for details) without any fine-tuning on our dataset, accommodating the few-shot, data-scarce setting.
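A minimal sketch of the ensemble step, assuming each off-the-shelf MNLI model exposes a probability distribution over (entailment, neutral, contradiction); the model wrappers themselves are omitted and the numbers below are made up:

```python
from statistics import fmean

CLASSES = ("entailment", "neutral", "contradiction")

def pseudo_plausibility(per_model_probs, target_class):
    """Average the models' distributions, then read off the probability
    of the NLI class named by the task mapping (cf. Table 2)."""
    idx = CLASSES.index(target_class)
    return fmean(dist[idx] for dist in per_model_probs)

probs = [
    [0.80, 0.15, 0.05],  # e.g., RoBERTa-MNLI
    [0.70, 0.20, 0.10],  # e.g., DeBERTa-MNLI
    [0.90, 0.05, 0.05],  # e.g., BART-MNLI
]
score = pseudo_plausibility(probs, "entailment")  # mean of 0.8, 0.7, 0.9 = 0.8
```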

Self-training
In the self-training paradigm, a trained model augments its own training set by constructing pseudo-parallel data from predictions on unlabeled instances: the most confident predictions are collected as new training examples and used to re-train an improved model. Most works applying self-training focus on classification tasks (Miyato et al., 2018; Xie et al., 2020; Gera et al., 2022), with common selection strategies based on confidence scores such as probability values, e.g., finding predictions that are far from the decision boundary (Slonim et al., 2011). However, adopting self-training for self-rationalization differs from typical classification tasks in two aspects: (1) compared with fixed classification labels, the target space of neural sequence generation is much more complex; (2) selection requires considering both the task label â and the rationale r, together with their relationship. With a proxy model, i.e., the approximator, we reduce the target dimensions to fixed class labels to address the former. For the latter, we can consider only the plausibility of r, since plausible r likely implies correct â as shown in Section 4. In the following we introduce the self-training paradigm of ZARA, a train-judge-train procedure. See Figure 3 for an illustration.
Given an episode E consisting of a training split D_train and a test split D_test, where each example in E is an input-rationale-answer tuple (x, r, a), we first train a LM M_0 on D_train for self-rationalization by prompt-based fine-tuning with natural language prompts. The trained model is denoted M_1.
Next, we perform inference with M_1 on unlabeled instances x ∈ D_unlabeled, where D_unlabeled is a non-overlapping set randomly sampled from other episodes with size |D_unlabeled| = |D_test|. For each prediction, the input x and the generated rationale-answer pair (r, â) are mapped to the NLI format, i.e., (x, r, â) → (p, h), and passed to the zero-shot plausibility approximator. The approximator automatically judges the plausibility of r, and the most confident predictions are selected by a plausibility threshold α, i.e., a probability score (see Appendix B for details). This process does not require any ground-truth label or gold rationale.
The collected high-confidence (x, r, â) predictions become new instances to augment D_train. We also ensure the added instances are class-balanced by downsampling majority classes. We then re-train M_0 on the augmented training split to obtain our final self-rationalization model M_2, and evaluate on D_test.
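The judge-and-augment step above can be sketched as follows; the prediction records, threshold handling, and function name are placeholders of ours, not the paper's exact implementation:

```python
import random
from collections import defaultdict

def select_and_balance(predictions, alpha, rng=random.Random(0)):
    """Keep predictions whose pseudo-plausibility score exceeds the
    threshold alpha, then downsample majority answer classes so the
    augmented instances are class-balanced."""
    confident = [p for p in predictions if p["score"] > alpha]
    by_label = defaultdict(list)
    for p in confident:
        by_label[p["answer"]].append(p)
    if not by_label:
        return []
    n = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    return balanced

# The augmented split would then be D_train plus the selected instances,
# on which M_0 is re-trained to obtain M_2.
```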

Experiments
In this section, we discuss the experimental setup and present the results of our proposed method, ZARA, for improving few-shot self-rationalization via self-training.We also perform human and quantitative evaluations to validate the automatic plausibility assessment for our approximator.

Model
For comparison purposes, we follow FEB and use UNIFIEDQA (Khashabi et al., 2020), a T5 (Raffel et al., 2020) variant trained on a multi-task mixture of QA datasets, as our self-rationalization model for all experiments. The model performs few-shot learning via fine-tuning with natural language prompts. We experiment with three model sizes: UNIFIEDQA-base (200M), UNIFIEDQA-large (770M), and UNIFIEDQA-3B (2.7B). The analyses presented in Section 4 are conducted with UNIFIEDQA-3B. More details of the experimental setups and configurations are provided in Appendix A.

Main results
The evaluation metrics of FEB are accuracy and BERTScore (Zhang et al., 2019) for end-task labels and explanations, respectively. For each sub-task, we train 60 models (one per episode) and report the mean and standard error of accuracy/BERTScore in Table 3. We also provide statistics on the number of instances added for augmentation in Appendix D.
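The per-sub-task numbers reduce the 60 per-episode scores to a mean and a standard error of the mean; as a sketch (the formula is standard, and the three toy scores below are made up):

```python
import math
import statistics

def mean_and_stderr(episode_scores):
    """Mean and standard error of the mean over per-episode scores."""
    mean = statistics.fmean(episode_scores)
    stderr = statistics.stdev(episode_scores) / math.sqrt(len(episode_scores))
    return mean, stderr

# Toy example with three episodes instead of 60:
mean, stderr = mean_and_stderr([0.80, 0.90, 1.00])  # mean = 0.9
```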
To the best of our knowledge, we present the first results on the newly introduced FEB benchmark (besides their original approach in the paper).
We experiment with three model sizes: base, large, and 3B. In ZARA, both training stages adopt models of the same size; the original FEB baseline only involves training one model (one stage). As shown in Table 3, our method substantially outperforms the FEB baseline on all datasets. In general, COMVE, SBIC, and E-SNLI demonstrate relatively consistent improvements across model sizes. The only anomaly is ECQA. We hypothesize that since ECQA requires commonsense knowledge that lies outside the FEB training data but is originally encoded in the models' parameters, the under-parameterized models (base and large) suffer from forgetting during continued learning with the augmented data. However, the 3B model (which is still significantly smaller than most large-scale LMs) exhibits a large performance gain.

Approximator evaluation
Plausibility evaluation We conduct a human evaluation to validate our approximator. Specifically, this human evaluation can be considered a meta-evaluation (the evaluation of an evaluation): it evaluates the approximator's ability to evaluate explanations, i.e., its ability to assess plausibility.
To recap, the approximator's output probability for the corresponding NLI class (based on the mapping design in Table 2) represents an estimate of the degree of plausibility, i.e., a pseudo-plausibility score. We use the same batch of annotated data from Section 4: 350 randomly selected examples generated by the stage-one model, with human plausibility judgements {1, 2, 3, 4} mapped from {"no", "weak no", "weak yes", "yes"} and averaged across annotators.
The results are presented in Figure 4. We group the instances into four bins, each containing 25% of the data according to the percentile ranking of their pseudo-plausibility scores. In general, the median human plausibility judgement increases with higher percentile groups, especially for the COMVE and SBIC sub-tasks. Interestingly, due to the NLI-based nature of the approximator, its output (i.e., pseudo-plausibility scores) may be affected by spurious surface features learned for NLI tasks (transferred from the MultiNLI dataset), giving rise to the larger interquartile range of the top percentile group in E-SNLI. Overall, the results show our approximator is capable of reflecting human plausibility judgement.
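The binning above amounts to ranking instances by pseudo-plausibility score and cutting into four equal-sized percentile groups; a small sketch (the helper name is ours, and any remainder falls into the top bin):

```python
def quartile_bins(scores):
    """Return four lists of instance indices, ordered from the
    lowest-scoring 25% to the highest-scoring 25%."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = len(scores) // 4
    return [order[:k], order[k:2 * k], order[2 * k:3 * k], order[3 * k:]]

bins = quartile_bins([0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4])
# lowest bin holds indices 1 and 5 (scores 0.1 and 0.2)
```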
Correctness evaluation As stated in Section 4, plausible rationales likely indicate correct answer predictions. We further evaluate our approximator on this property by checking the end-task answer accuracy of the data subset selected for augmentation from the stage-one model's prediction pool. We consider three selection strategies: (1) ZARA, i.e., our proposed method, which selects confident (high-scoring) predictions; (2) Random, where the subset is selected randomly from the prediction pool; (3) Lowest, where, in contrast to ZARA, we select the subset with the lowest-ranking pseudo-plausibility scores. For each episode, the number of augmented instances for (2) and (3) is determined by (1), i.e., we randomly select n instances or select the n bottom-ranking instances, where n is the number of instances ZARA selects for augmentation. The results are shown in Figure 5. We observe that ZARA consistently outperforms Random and Lowest by substantial margins across model sizes and all four datasets, and Lowest demonstrates the poorest accuracy. This suggests our approximator is able to verify label predictions post-hoc, i.e., a high/low pseudo-plausibility score suggests the prediction is accurate/inaccurate. In conclusion, the overall evaluation results suggest our approximator can effectively extract rationale-answer pairs that are more plausible and accurate.
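The three selection strategies can be sketched over a pool of (score, is_correct) pairs; this toy representation and the function name are ours, purely for illustration:

```python
import random

def select(pool, n, strategy, rng=random.Random(0)):
    """pool: list of (pseudo_plausibility_score, is_correct) pairs."""
    ranked = sorted(pool, key=lambda item: item[0])
    if strategy == "zara":     # highest-scoring predictions
        return ranked[-n:]
    if strategy == "lowest":   # bottom-ranking predictions
        return ranked[:n]
    return rng.sample(pool, n)  # the Random baseline

pool = [(0.9, True), (0.8, True), (0.2, False), (0.1, False)]
picked = select(pool, 2, "zara")  # the two highest-scoring pairs
```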
Related Work

Few-shot self-rationalization
To provide NLEs under low supervision, Marasovic et al. (2022) propose the FEB benchmark and establish the first results by exploring natural language prompt-based fine-tuning. Wiegreffe et al. (2022) focus on improving NLEs with an overgenerate-and-filter pipeline: prompting GPT-3 with gold labels to generate explanation candidates, which are then filtered by a model trained with human annotations. Recent works (Wei et al., 2022; Wang et al., 2022b; Huang et al., 2022) leverage rationale-augmented chain-of-thought (CoT) inputs to prompt frozen large-scale LMs in few-shot. Zelikman et al. (2022) further bootstrap CoT by repeatedly fine-tuning a GPT-J model. Concurrent works (Wang et al., 2022a; Ho et al., 2022; Hsieh et al., 2023) propose pipeline frameworks that distill knowledge by prompting a large "teacher" LM to generate diverse reasoning rationales, which are then used to fine-tune a small "student" LM. In comparison, ZARA directly optimizes small LMs on downstream tasks, without access to any large LMs. A similar work by Ye and Durrett (2022) leverages NLEs to boost end-task predictions post-hoc by training a calibrator. In comparison, we directly improve self-rationalization, and our approximator does not require any further training. Moreover, all LMs used in their work are 175B.

Leveraging NLI for downstream tasks
The framework of NLI has been expanded to benefit many NLP tasks. Welleck et al. (2019) develop a dataset to improve dialogue models by framing the dialogue consistency problem as NLI. Honovich et al. (2021) and Dziri et al. (2022) use NLI to design automatic metrics evaluating the factuality of knowledge-grounded dialogue systems. Falke et al. (2019), Kryscinski et al. (2020), and Laban et al. (2022) use NLI models to detect factual errors in abstractive summarization. For question answering, Chen et al. (2021) propose a framework to verify QA systems' predictions with NLI by training models to generate premise-hypothesis pairs from QA instances. Yet, the related work that inspires us the most is by Yin et al. (2019). Driven by human reasoning, they approach text classification in zero-shot by formulating it as an entailment problem (given the input text as premise, humans mentally construct hypotheses "the text is about [label choice]" to determine the answer) and adopt out-of-the-box NLI models for predictions.

Conclusion
In this work, we first show evidence that plausible explanations imply correct end-task predictions, and leverage NLI to propose a zero-shot approximator capable of automatically identifying plausible rationales paired with correct answers from unlabeled results. By collecting such rationale-answer pairs for self-training, we can effectively improve the performance of few-shot self-rationalization for small LMs.
Moreover, we demonstrate the potential of automatic evaluation for free-text explanations. In light of this, we believe developing a supervised approximator with a unified NLI mapping schema across tasks is a promising avenue for future work.

Limitations
The success of the approximator relies on the quality of the NLI mapping. Though we showcase substantial improvements across four different tasks, if the complexity of a task makes the mapping construction non-trivial, the created mapping might not accurately reflect human plausibility judgement of the generated rationales, and the benefit of self-training cannot be guaranteed. Namely, the approximator may identify noisy instances that instead hurt model performance.

Figure 1: The role of our plausibility agent. As the main component of ZARA, the agent, i.e., the approximator, imitates how humans assess the plausibility of explanations, in an explicit fashion.

Figure 5: The average accuracy (per episode) of the subsets selected from the stage-one model's generated results by different selection strategies: (1) ZARA, (2) Random, and (3) Lowest.
[Figure 3 examples. COMVE: choice1: i drove my car to the gas station. choice2: i drove my computer to the gas station. Answer: choice2. Rationale: you can't drive a computer. SBIC: Post: just when i thought women couldn't get any stupider. Answer: offensive. Rationale: this post implies that women are stupid.]
Figure 3: Overview of the self-training paradigm for ZARA. A language model is fine-tuned to generate predictions for unlabeled instances, which are then mapped to NLI formats (e.g., Premise: Because [Rationale]). The approximator then identifies high-confidence (likely plausible) predictions as augmentation for learning a new model.