SCOTT: Self-Consistent Chain-of-Thought Distillation

Large language models (LMs) beyond a certain scale demonstrate the emergent capability of generating free-text rationales for their predictions via chain-of-thought (CoT) prompting. While CoT can yield dramatically improved performance, such gains are only observed for sufficiently large LMs. Even more concerning, there is little guarantee that the generated rationales are consistent with the LM's predictions or faithfully justify its decisions. In this work, we propose SCOTT, a faithful knowledge distillation method to learn a small, self-consistent CoT model from a teacher model that is orders of magnitude larger. To form better supervision, we elicit rationales supporting the gold answers from a large LM (teacher) by contrastive decoding, which encourages the teacher to generate tokens that become more plausible only when the answer is considered. To ensure faithful distillation, we use the teacher-generated rationales to train a student LM with a counterfactual reasoning objective, which prevents the student from ignoring the rationales and making inconsistent predictions. Experiments show that while yielding comparable performance, our method produces a more faithful model than the baselines. Further analysis shows that such a model respects its rationales more when making decisions; thus, its performance can be further improved by refining its rationales.


Introduction
Large language models (LMs) elicit strong reasoning capabilities through chain-of-thought (CoT) prompting (Wei et al., 2022b), which asks LMs to generate a free-text rationale explaining their multi-step reasoning. However, CoT prompting does not guarantee that the rationale is consistent with the prediction, rendering the rationale useless for justifying the model's behavior. In this work, we present Self-Consistent Chain-Of-Thought DisTillation (SCOTT), a knowledge distillation (KD) method for eliciting faithful CoT reasoning, where a small student model learns from a large teacher model to generate CoT rationales that are consistent with its own predictions.

Figure 1: Vacuous rationales generated by a prompted LM (GPT-3) for StrategyQA. In both types of error cases, the LM fails to give rationales consistent with the answers due to hallucination.
Existing works (Shridhar et al., 2022; Li et al., 2022a) propose learning to reason from large LMs mainly for computation efficiency or task performance. They prompt a large LM (the teacher) to generate rationales for a downstream dataset, which are then used to train a small LM (the student). However, these works neglect two issues which could undermine the faithfulness of the rationales. First, LMs are prone to hallucination, meaning they often generate text that is not grounded in the input (Maynez et al., 2020; Ji et al., 2022). The teacher may therefore fail to generate on-topic rationales that fully support the answer. In our pilot study (Figure 1) over 100 random rationales generated by GPT-3, we found that 42% of them provide no information beyond what is stated in the task input and 37% of them do not justify the answer. This inconsistency between the rationale and the answer would then be inherited by the student. Second, the student may treat rationale generation and answer prediction as two independent processes. This is due to spurious correlations between the question and the answer, which are exploited as a reasoning shortcut by the student (Branco et al., 2021). Together, the two issues would lead to an unfaithful student which learns to generate vacuous rationales and may make predictions inconsistent with them.
To address these issues, we propose to enhance the vanilla KD process from both ends. To elicit more on-topic rationales from the teacher, we leverage contrastive decoding, which aims to ground each rationale in the answer (§ 3.1). This technique encourages the teacher, during decoding, to generate tokens that are more plausible only when the answer is considered, instead of ones that are fairly plausible even without the answer. To train a faithful student, we ask the student to conduct counterfactual reasoning, i.e., to predict accordingly when the rationales lead to different answers (§ 3.2). We obtain the training data by asking the teacher to generate a rationale for a sampled incorrect answer. The reasoning shortcut between the question and the gold answer is thus removed, since the student now needs to give a different answer for the same question, according to the rationales provided during training.
We conduct experiments on several open-domain question answering tasks that require knowledge-intensive reasoning. Experiments show that: (1) Contrastive decoding leads to a more consistent teacher which generates rationales that are more supportive of the gold answers. (2) Trained on the more consistent rationale-answer pairs, the student learns to better associate answer prediction with rationale generation. (3) With counterfactual reasoning as an auxiliary training objective, the student learns not to take the reasoning shortcut and instead to respect the rationale more. (4) Despite being more faithful, our model performs comparably to the baselines. (5) An ablation study shows that although larger student models perform better, they are more prone to inconsistency; our method robustly remedies the inconsistency regardless of the size of the student model. (6) With a more faithful student, we can better improve its performance by correcting its rationale, demonstrating the utility of our method in model refinement.

Chain-of-Thought Distillation
Our goal is to 1) elicit consistent rationales, i.e., those well justifying the gold answers, from a large LM as supervision, and then 2) train a self-consistent student model to reason faithfully, i.e., to answer according to its generated rationale. We consider the task of language-based reasoning where the required knowledge is not provided in the task input. Specifically, we focus on open-domain question answering (QA), the most general setting adopted by prior works: given a question q, a QA system is asked to predict the gold answer a*. For interpretability, we also require the model to provide a free-text rationale r, which justifies its prediction. Below we give an overview of a vanilla KD framework as illustrated in Figure 2. We then discuss its limitations and propose our method in § 3.

Generating Rationale Annotation
Instead of asking humans to annotate a rationale for each question-answer tuple {q, a*}, we obtain the rationale from a teacher model automatically using in-context learning. The idea is to prompt a frozen LM as the teacher, with only a few annotated examples as a demonstration before a new instance is provided. Each example consists of a question q randomly sampled from the training set, the gold answer a*, and a human-annotated rationale r which justifies why a* is correct. The prompt p is structured in the format shown in Figure 2 (the Prompt in the left part). To obtain the rationale for a new question q, one basic strategy is greedy decoding, which selects the most plausible token at each step:

t_i = argmax_t P(t | p, q, a*, t_<i). (1)
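As a sanity check on Eq. 1, here is a minimal, self-contained sketch of greedy decoding over a toy next-token distribution; the `toy_lm` table and its tiny vocabulary are illustrative stand-ins for the teacher LM, not the paper's implementation:

```python
# Greedy decoding sketch (Eq. 1): at each step, pick the single most
# plausible next token given everything generated so far.
# `toy_lm` is a stand-in for the teacher LM's next-token distribution.

def toy_lm(prefix):
    """Return a {token: probability} map for the next token (illustrative)."""
    table = {
        (): {"Hamsters": 0.6, "The": 0.4},
        ("Hamsters",): {"are": 0.9, "<eos>": 0.1},
        ("Hamsters", "are"): {"prey": 0.7, "pets": 0.3},
        ("Hamsters", "are", "prey"): {"<eos>": 1.0},
    }
    return table.get(prefix, {"<eos>": 1.0})

def greedy_decode(max_steps=10):
    tokens = ()
    for _ in range(max_steps):
        dist = toy_lm(tokens)
        best = max(dist, key=dist.get)  # argmax over the vocabulary
        if best == "<eos>":
            break
        tokens += (best,)
    return " ".join(tokens)

print(greedy_decode())  # -> "Hamsters are prey"
```

In the actual pipeline the prefix would also contain the few-shot prompt p, the question q, and the gold answer a*; only the decoding rule is shown here.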

Training a Student Model
Now with the annotated training data {q, r, a*}, we can train a smaller model as the student. There are many ways to implement a QA model that can make a prediction as well as generate a rationale.
In this work, we focus on the self-rationalization paradigm, where the student first generates a rationale and then predicts the answer conditioned on the generated rationale. This is in contrast to related works which conduct post-rationalization, i.e., generating the rationale after the answer is predicted, or multi-task learning, which treats rationale generation as an auxiliary task besides answer prediction. The reason is that for the latter two paradigms, the generation of the rationale does not affect the decision making by design, and therefore the faithfulness of the rationale is not guaranteed in the first place. Given a question q, the student model is trained to output a sequence of rationale tokens concatenated with the answer tokens, as shown in Figure 2 (the output in the right part). One straightforward implementation is to simply fine-tune a text-to-text LM over the silver training data generated by the teacher using the standard language modeling loss over the concatenated sequence t = r ∘ a*:

L_factual = − Σ_i log P(t_i | q, t_<i), (2)

which we refer to as the factual reasoning loss.
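To make the input/output format concrete, here is a sketch of serializing one silver training instance for the self-rationalization student; the exact template strings are assumptions modeled on the paper's figures:

```python
def make_example(question, rationale, answer):
    """Serialize one teacher-generated (silver) instance for a
    text-to-text student: the target puts the rationale before the
    answer, so the answer tokens are conditioned on the rationale."""
    source = question
    target = f"{rationale} So the answer is {answer}."
    return source, target

src, tgt = make_example(
    "Do hamsters provide food for any animals?",
    "Hamsters are prey animals.",
    "yes",
)
print(src)  # Do hamsters provide food for any animals?
print(tgt)  # Hamsters are prey animals. So the answer is yes.
```

A standard sequence-to-sequence LM loss over `target` given `source` then instantiates Eq. 2.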

Distilling a Self-Consistent Student
There are two vital issues with the vanilla KD process described in the previous section. First, neural LMs are known to suffer from the issue of hallucination, meaning they often generate text that is not grounded in the input (Maynez et al., 2020; Ji et al., 2022). This would lead to the generated rationale not supporting the given answer. The inconsistency between the rationale and the answer would then be inherited by the student, which is misled to think that answer prediction is independent of rationale generation. Second, the student model would learn to predict the answer by taking a reasoning shortcut (Branco et al., 2021), without taking into account the generated rationale (even though the answer prediction is conditioned on the rationale). This is due to spurious correlations between the question and the answer, which are found in various implicit reasoning task datasets (Gururangan et al., 2018; Zellers et al., 2019; Blodgett et al., 2020).
The two issues mentioned above would result in an untrustworthy student whose generated rationales do not consistently justify its answers. To mitigate this, we propose two corresponding techniques, detailed below.

A Consistent Teacher: Contrastive Decoding
To encourage the teacher to generate a more on-topic rationale that supports the answer, our proposed method extends a prior technique called contrastive decoding for open-ended text generation (Li et al., 2022b). The core idea is to search for rationale tokens that are more plausible only when the answer is considered, instead of ones that are fairly plausible even without the answer during decoding. To implement this idea, we first model the hallucinating behavior by providing a perturbed answer a′ to the same teacher, and then obtain the plausibility growth of any token t_i given the gold answer:

G(t_i) = log P(t_i | p, q, a*, t_<i) − log P(t_i | p, q, a′, t_<i). (3)

We investigate two ways of perturbing the answer: setting a′ as an empty string or as an incorrect answer other than a*. The first way (with an empty string) punishes tokens that are generally plausible even when the gold answer a* is not considered by a hallucinating LM. The second way (with an incorrect answer) takes a step further by encouraging the teacher to generate a rationale that is more distinctive between gold and wrong answers. Figure 3 shows the generations for an example question from greedy decoding and contrastive decoding.
To strike a balance between language fluency and the grounding in a*, we incorporate the plausibility growth into Eq. 1 by aggregation as our final contrastive decoding strategy:

t_i = argmax_t [ log P(t | p, q, a*, t_<i) + G(t) ]. (4)
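A minimal sketch of one contrastive decoding step over toy next-token probabilities; the distributions, numbers, and the additive aggregation of the plausibility growth are illustrative, standing in for the teacher's actual probabilities:

```python
import math

# Toy next-token probabilities for one decoding step.
p_with_answer = {"spine": 0.30, "back": 0.45, "the": 0.25}  # prompt includes gold answer a*
p_perturbed   = {"spine": 0.05, "back": 0.40, "the": 0.55}  # prompt includes perturbed answer a'

def contrastive_step(p_ans, p_pert):
    """Pick the token maximizing log P(t | ..., a*) plus its plausibility
    growth G(t) = log P(t | ..., a*) - log P(t | ..., a')."""
    def score(t):
        growth = math.log(p_ans[t]) - math.log(p_pert[t])
        return math.log(p_ans[t]) + growth
    return max(p_ans, key=score)

print(contrastive_step(p_with_answer, p_perturbed))  # -> "spine"
# Plain greedy decoding would pick "back" (highest raw probability);
# contrastive decoding prefers "spine", which becomes much more
# plausible only once the gold answer is considered.
```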

A Faithful Student: Counterfactual Reasoning
To encourage the student to reason faithfully towards its generated rationale, we train the student to conduct counterfactual reasoning (Roese, 1997), i.e., to answer accordingly when the rationale leads to a different answer. This helps remove the reasoning shortcut between a question and the gold answer (Figure 4), since the student is now asked to answer differently for the same question.
To implement this idea, we first replace the gold answer fed to the teacher in Eq. 4 with a randomly sampled wrong answer a′ (using the same sampling strategy as in § 3.1), as if a′ were correct. We thus obtain a counterfactual rationale r′ that leads to the wrong answer a′. We then train the student to generate a′ when r′ is fed directly to the decoder via teacher forcing, where the language modeling loss is applied only to the answer tokens t_i ∈ a′:

L_counterfactual = − Σ_{t_i ∈ a′} log P(t_i | q, r′). (5)

To avoid confusing the student about the task, we indicate the training objective Eq. 2 (or Eq. 5) to the student by appending the keyword [Factual] (or [Counterfactual]) at the beginning of both the input sequence to the encoder and the output sequence to the decoder (see Figure 4 for an example input and output). The overall training loss is the sum of Eq. 2 and Eq. 5.
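The construction above can be sketched as follows; the templates and the token-level masking helper are illustrative assumptions, not the exact implementation:

```python
def make_pair(question, rationale, answer, cf_rationale, cf_answer):
    """Build one factual and one counterfactual instance. The task
    keyword is prepended to both input and output to tell the student
    which objective applies."""
    factual = (
        f"[Factual] {question}",
        f"[Factual] {rationale} So the answer is {answer}.",
    )
    counterfactual = (
        f"[Counterfactual] {question}",
        f"[Counterfactual] {cf_rationale} So the answer is {cf_answer}.",
    )
    return factual, counterfactual

def loss_mask(target_tokens, answer_tokens):
    """For the counterfactual objective (Eq. 5), apply the LM loss only
    to the answer tokens (1 = compute loss, 0 = ignore)."""
    return [1 if t in answer_tokens else 0 for t in target_tokens]

fact, cf = make_pair(
    "Do hamsters provide food for any animals?",
    "Hamsters are prey animals.", "yes",
    "Hamsters are apex predators.", "no",  # teacher rationale for sampled wrong answer
)
print(cf[1])  # [Counterfactual] Hamsters are apex predators. So the answer is no.
print(loss_mask(["So", "the", "answer", "is", "no", "."], {"no"}))
# -> [0, 0, 0, 0, 1, 0]
```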

Experiments
We aim to answer the following research questions in our experiments: (1) Can our contrastive decoding strategy lead to a more consistent teacher?
(2) Can a more consistent teacher and the counterfactual reasoning objective lead to a student that reasons more faithfully? (3) Can we have more control over a self-consistent student's predictions by modifying its generated rationales?

Evaluation Metrics
(1) To evaluate the consistency between the rationales generated by the teacher and the gold answers provided as input, we use the LAS metric (Hase et al., 2020), whose core idea is to measure how well the rationales assist a simulator in predicting the gold answers a*, computed as the difference in task performance when the rationale is provided as input vs. when it is not, namely Acc(qr → a*) − Acc(q → a*).

Figure 5: Simulatability (LAS) of the rationales generated from different teacher models, as a measurement of the consistency between the rationales and the gold answers. {Greedy, CD-Empty, CD-Wrong} refer respectively to using greedy decoding and contrastive decoding with an empty/wrong answer to obtain rationale tokens from the teacher.
(2) To evaluate the faithfulness of the rationales generated by the student, we use LAS to measure how well the rationales help a simulator predict the student's own predictions â, namely Acc(qr → â) − Acc(q → â).
We implement each simulator with a separately fine-tuned T5-large model (Raffel et al., 2020).
(3) To evaluate how well the student preserves its task performance on the downstream datasets, we use accuracy as the metric.
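Both LAS variants reduce to a difference of two simulator accuracies; a minimal sketch follows, where the simulator (a fine-tuned T5-large in the paper) is abstracted away as lists of its predictions:

```python
def accuracy(predictions, labels):
    """Fraction of predictions matching the target labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def las(preds_with_rationale, preds_without_rationale, labels):
    """LAS = Acc(qr -> a) - Acc(q -> a): how much the rationales help a
    simulator recover the target answers beyond the question alone."""
    return (accuracy(preds_with_rationale, labels)
            - accuracy(preds_without_rationale, labels))

labels  = ["yes", "no", "yes", "no"]
with_r  = ["yes", "no", "yes", "yes"]   # simulator sees question + rationale
without = ["yes", "yes", "no", "yes"]   # simulator sees question only
print(las(with_r, without, labels))  # 0.75 - 0.25 = 0.5
```

For the teacher-consistency evaluation the `labels` are the gold answers a*; for the student-faithfulness evaluation they are the student's own predictions â.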

Implementation Details
We use GPT-NeoX (Black et al., 2022), a LM with 20B parameters, as the teacher, since the model checkpoint is publicly available, which allows us to host it offline and access the token-wise probabilities required by our contrastive decoding. We then implement two teacher variants by using an empty string or a wrong answer as the perturbed answer a′ in Eq. 4, respectively. The obtained rationales are then used to fine-tune two T5-3B LMs as the students. For both variants, we train the student using the sum of the factual training loss (Eq. 2) and the counterfactual training loss (Eq. 5).

Baselines
Chain-of-Thought (CoT) Since we elicit the rationales from GPT-NeoX (with 20B parameters) (Black et al., 2022) to train the student, we prompt the same model to first explain and then predict using CoT prompting (Wei et al., 2022b).
Learn from Human To demonstrate the advantage of our automatic way of generating rationale annotations, we implement this baseline as a fine-tuned T5-3B LM over human-annotated rationales, which are expensive to obtain and can be noisy.
Learn from Greedy Decoding We implement this baseline as a fine-tuned T5-3B LM over the rationales obtained by greedy decoding using the same teacher LM as our main method. We also implement a variant that adds the counterfactual reasoning loss when fine-tuning the student, where the rationales for the wrong answers are likewise obtained by greedy decoding.
We also implement two variants of our method by training the student with the rationales obtained by contrastive decoding with empty/wrong answers, using the factual reasoning loss only. We run all experiments five times using a fixed set of random seeds and report the average results.

Main Results
Can contrastive decoding lead to a more consistent teacher? Figure 5 shows the consistency, measured by LAS, between the rationales generated by different teachers and the gold answers. Across four datasets, contrastive decoding with either empty or wrong answers yields more consistent rationales than human annotation and greedy decoding. This demonstrates the effectiveness of our contrastive decoding strategy in encouraging the teacher to generate more on-topic rationales. Moreover, using wrong answers is better than using empty strings for contrastive decoding. This shows that by contrasting with wrong answers, the teacher can generate more distinguishable rationales that lead to the gold answers, thus obtaining higher consistency. Greedy decoding yields less consistent rationales than human annotation, verifying our claim that LMs are prone to generating text not grounded in the gold answers. We also conduct a human evaluation over 100 rationales generated by different decoding strategies for StrategyQA. Annotators are asked to judge the rationales along 3 dimensions: 1) Grammaticality (Is the rationale grammatical?); 2) New Info (Does the rationale provide new information not expressed in the question?); 3) Supports Answer (Does the rationale justify the answer?). Table 1 confirms that our two contrastive decoding strategies yield more informative and on-topic rationales than greedy decoding, with slightly worse grammaticality. We list examples in Table 2 (appendix) to showcase how rationales from contrastive decoding are more consistent with the gold answers than those from greedy decoding.
Can a more consistent teacher train a more faithful student? Figure 6 (upper parts of each sub-figure) shows the faithfulness of the students, measured in LAS, on the experimented datasets. First, the CoT method often achieves much lower LAS than the KD methods across the four datasets, showing that the generated rationales do not faithfully reflect the decision making in CoT. Second, we observe that students trained with rationales from contrastive decoding, with either empty strings or wrong answers, generally achieve higher LAS scores than the baselines. Together with the observation on the consistency of the teacher (Figure 5), this validates that a more consistent teacher trains a more faithful student, and that inconsistency in the training data generated by the teacher is inherited by the student.
Can the counterfactual reasoning loss further improve faithfulness? Figure 6 shows that students fine-tuned additionally with the counterfactual training loss achieve higher faithfulness than their counterparts fine-tuned with factual training only. This validates that counterfactual reasoning can further improve the student's faithfulness: without it, the student may still treat rationale generation and answer prediction as two independent processes.
Can a faithful student still preserve its performance? Figure 6 (lower parts of each sub-figure) shows the performance of the students measured in accuracy. First, the CoT methods achieve lower accuracy than the KD methods, showing the benefit of combining the supervision from the teacher (the rationales) and the labeled datasets (the answers). Second, all the KD methods achieve comparable performance. Together with the observations on faithfulness, this demonstrates that our method improves the faithfulness of the model without hurting its performance. Note that the student which learns from human annotation achieves slightly better results than the other students. This is because the human rationales are less consistent with the answers (as evidenced in Figure 5). The student therefore learns to generate the rationales and predict the answers more independently, which allows it to exploit the spurious correlations and achieve better performance. Our further analysis (§ 4.7) shows that such performance gain is suspicious, as changing the rationales mostly does not change the student's predictions.

Ablation on the student model size
We ablate the student model size to see how its faithfulness and performance are affected. From Figure 7, we observe that larger student models achieve higher performance but lower faithfulness. This confirms that sufficient capacity is required for storing the knowledge necessary for reasoning (Wei et al., 2022a), but larger models are also better at answering the questions independently of the rationales. Still, our models are more faithful than the baselines and comparable in performance across different model sizes.

Controlling the behavior of the Student
One important utility of faithful rationales is that we can have more control over the behavior of the student by changing its rationales. If the model makes predictions consistent with its rationales, we can either impair or improve its performance by perturbing or refining its rationales. To verify this, we conduct two types of edits to the rationales generated by the student, namely perturbation and refinement, as described below. We then feed the edited rationales directly to the decoder of the student (as teacher forcing) and see if the student acts accordingly, i.e., predicts worse (or better) due to the worse (or better) rationales.
Rationale Perturbation For perturbing the rationales, we randomly replace 50% of the tokens in the generated rationales from the student and then feed the perturbed rationales r′ back to the decoder of the student. We finally calculate the performance drop (or sensitivity), i.e., Acc(qr → a*) − Acc(qr′ → a*). Figure 8 (the lower parts) shows the results on CSQA and CREAK. First, perturbing the rationales of the student fine-tuned with human annotation has little impact on its performance (down to 1.1% on CSQA), meaning that the student largely ignores the rationales when making predictions. Second, learning from rationales obtained by contrastive decoding with empty or wrong answers leads to a student that is more sensitive to rationale perturbation than learning from greedy decoding. This again verifies the necessity of having a consistent teacher in order to train a faithful student. Lastly, our counterfactual training loss further improves the sensitivity of the student, demonstrating that the student is more faithful towards its rationales.
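A sketch of the perturbation step, replacing half of the rationale tokens with random vocabulary tokens; the replacement vocabulary and the fixed seed are arbitrary choices for reproducibility, not the paper's setup:

```python
import random

def perturb_rationale(tokens, vocab, frac=0.5, seed=0):
    """Randomly replace `frac` of the rationale tokens with tokens drawn
    from `vocab`, yielding a corrupted rationale r'."""
    rng = random.Random(seed)
    tokens = list(tokens)
    idxs = rng.sample(range(len(tokens)), k=int(len(tokens) * frac))
    for i in idxs:
        tokens[i] = rng.choice(vocab)
    return tokens

rationale = "The spine is needed to support the body".split()
vocab = ["apple", "run", "blue", "seven"]  # disjoint from the rationale, so every replacement changes a token
perturbed = perturb_rationale(rationale, vocab)
changed = sum(a != b for a, b in zip(rationale, perturbed))
print(changed)  # -> 4 (half of the 8 tokens replaced)
```

Feeding such a corrupted r′ to a faithful student's decoder should measurably hurt its answer accuracy; a student that ignores its rationales is barely affected.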

Rationale Refinement
As a proxy for refinement, we obtain the oracle rationales r* automatically by asking the teacher to generate rationales for the gold answers using each compared decoding strategy. For the student trained with human annotation, we directly use the annotated rationales as the oracle. We then calculate the performance gain, i.e., Acc(qr* → a*) − Acc(qr → a*). Figure 8 (the upper parts) shows the results on CSQA and CREAK. First, we observe that the oracle human-annotated rationales do not bring as much performance gain as the machine-generated rationales do. This demonstrates that even when trained with human annotation, the student is still prone to being unfaithful to its rationales. Second, we observe that contrastive decoding (with either empty strings or wrong answers) leads to higher performance gains for the student. Adding counterfactual training further increases these gains. This demonstrates the advantage brought by our method: we can have more success in debugging a reasoning model by refining its rationales.

Related Works
Free-text Rationales A variety of datasets have been proposed to collect human-annotated rationales alongside each task instance (Camburu et al., 2018; Rajani et al., 2019; Aggarwal et al., 2021), aiming to train downstream models to explain their predictions in natural language. However, human annotation is expensive and the resulting rationales are reported to be of poor quality (Aggarwal et al., 2021; Sun et al., 2022). Our work leverages a prompted LM to obtain rationales automatically for supporting both correct and incorrect answers, using only a few annotated examples as demonstration. The rationales supporting the incorrect answers further enable the student to conduct counterfactual reasoning, which is not available from existing human annotation.
Prompted Self-Rationalization Models Recent works prompt large LMs to generate a free-text rationale before making the prediction (Nye et al., 2021; Wei et al., 2022b). However, this technique relies on extremely large LMs (with over 100B parameters) to work effectively (Wei et al., 2022b,a), which requires significant computation resources or expensive API calls (Shridhar et al., 2022). Meanwhile, the rationales generated by such models are shown to contradict the context (Ye and Durrett, 2022) and to fail to faithfully represent the underlying reasoning process (Wang et al., 2022). In contrast, our student is a smaller LM trained to be more faithful towards its generated rationales.
Knowledge Distillation Some works explore the idea of distilling rationale knowledge from a large LM into a small LM as the student.

Conclusion
This work presents a faithful KD framework for learning a small, self-consistent CoT model from a large teacher model. To ensure the student reasons faithfully, we propose (1) contrastive decoding for obtaining a consistent teacher and (2) counterfactual reasoning for teaching a faithful student. Experiments show that these two techniques jointly lead to a more faithful student compared to the baselines, while largely preserving task accuracy. Our further analysis shows that changing the rationales has a larger impact on our student's behavior, and thus we can have more success in debugging the model by refining its rationales.

Limitations
Compared to a standard knowledge distillation process, our method requires additional computation when preparing the training data and training the student. First, our contrastive decoding needs to perform one more forward pass in the teacher model than greedy decoding does, to obtain the perturbed plausibility for each generated token (Eq. 4). Second, our KD process introduces additional training data for training the student with the counterfactual reasoning objective (Eq. 5). Besides computation cost, this work focuses on improving the faithfulness of the rationales rather than performance, which is complementary to prior works that leverage rationales for improving performance only.

Ethics Statement
Our KD process leverages large LMs to obtain rationale annotations, which may expose social bias encoded in these models (Lucy and Bamman, 2021). The bias may be further inherited by the student model. Nevertheless, our method improves the faithfulness of the rationales, making the predictions of the student more accountable. Without faithful rationales, it would be unclear to users whether the model is making predictions based on some unintended bias.

ACL 2023 Responsible NLP Checklist
A. For every submission:

A1. Did you describe the limitations of your work?
In the "Limitations" section after Section 6.
A2. Did you discuss any potential risks of your work?
In the Ethics Statements section before the References section.
A3. Do the abstract and introduction summarize the paper's main claims?
In the abstract and Section 1.

B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
Section 3.2 explains how the generated rationales are used as supervision.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
Not applicable. Left blank.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
Section 4.1

B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? (Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.)

C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
Section 4.3

D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)?
Not applicable. Left blank.
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used?
Not applicable. Left blank.

D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?
Not applicable. Left blank.

D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?
Not applicable. Left blank.

Figure 2: Overview of our knowledge distillation framework for faithful reasoning. (a) Teacher: a large LM prompted to generate a consistent rationale given a question and the gold answer in the training set via contrastive decoding. (b) Student: a small LM fine-tuned to generate a rationale and then answer via counterfactual reasoning.
Figure 3: Contrastive decoding for obtaining rationales that are more grounded in the gold answers, by preferring tokens that are more plausible only when the answer is considered. (a) <perturbed_answer> = empty string: "Manual labor can cause back pain. Thus, someone with back pain would not enjoy picking strawberries." (b) <perturbed_answer> = yes: "The spine is needed to support the body. If someone has back pain, they would not be able to pick strawberries."
Figure 4: Counterfactual reasoning for teaching the student to reason faithfully, i.e., to answer differently according to the rationale.
Input: [Factual] Do black-tailed jackrabbits fear the European wildcat? Output: [Factual] The European wildcat is not a predator of the black-tailed jackrabbit. Thus, the black-tailed jackrabbit does not fear the European wildcat. So the answer is no.
Input: [Counterfactual] Do black-tailed jackrabbits fear the European wildcat? Output: [Counterfactual] The European wildcat is a predator of the black-tailed jackrabbit. Thus, the European wildcat is a threat to the black-tailed jackrabbit. So the answer is yes.

Figure 7: Faithfulness (LAS) and task performance (accuracy) of the compared methods with different student model sizes. Each model is named by the teacher it learns from and the training objective, as teacher model:training objective.

Figure 8: Performance gain (drop) of the compared methods when the oracle (perturbed) rationales are fed to the decoder of the model, on CSQA and CREAK.

A4. Have you used AI writing assistants when working on this paper?
Left blank.

B. Did you use or create scientific artifacts?
Section 4.1

B1. Did you cite the creators of artifacts you used?
Section 4.1

B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
Not applicable. Left blank.
(Cobbe et al., 2021) to learn a student model that only predicts answers from a teacher model that is augmented with rationales. Eisenstein et al. proposed to train the student to extract the sentence containing the answer, which is not applicable to reasoning tasks that require background knowledge. Shridhar et al. proposed to train the student to ask and answer sub-questions necessary for decomposing the main question, which is tailored to solving math word problems (Cobbe et al., 2021) with an equation generator guiding the student, whereas we have no such constraint. Li et al. proposed to train the student on the joint task of generating the answers and the rationales, where the rationales only act as a regularization and do not affect the student's prediction during inference. More importantly, both Shridhar et al. and Li et al. do not consider the faithfulness of the rationales, which is critical for examining the behavior of the student.
The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.

C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?
Models (language models) are fine-tuned with default hyperparameters specified by the original papers.

C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run?
Section 4.4

C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)?
Not applicable. Left blank.

D. Did you use human annotators (e.g., crowdworkers) or research with human participants?

D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.?
Not applicable. Left blank.