OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models

We conduct a thorough investigation into the reasoning capabilities of Large Language Models (LLMs), focusing specifically on the Open Pretrained Transformers (OPT) models as a representative of such models. Our study entails finetuning three different sizes of OPT on a carefully curated reasoning corpus, resulting in two sets of finetuned models: OPT-R, finetuned without explanations, and OPT-RE, finetuned with explanations. We then evaluate all models on 57 out-of-domain tasks drawn from the Super-NaturalInstructions benchmark, covering 26 distinct reasoning skills, utilizing three prompting techniques. Through a comprehensive grid of 27 configurations and 6,156 test evaluations, we investigate the dimensions of finetuning, prompting, and scale to understand the role of explanations on different reasoning skills. Our findings reveal that having explanations in the fewshot exemplar has no significant impact on the model’s performance when the model is finetuned, while positively affecting the non-finetuned counterpart. Moreover, we observe a slight yet consistent increase in classification accuracy as we incorporate explanations during prompting and finetuning, respectively. Finally, we offer insights on which reasoning skills benefit the most from incorporating explanations during finetuning and prompting, such as Numerical (+20.4%) and Analogical (+13.9%) reasoning, as well as skills that exhibit negligible or negative effects.


Introduction
Recently, there has been a surge in the release of Large Language Models (LLMs) by both industrial and academic institutions. These models range from open-source releases such as OPT (Zhang et al., 2022) and LLAMA (Touvron et al., 2023) to closed-source ones like GPT-3 (Brown et al., 2020) and PALM (Chowdhery et al., 2022). In addition, researchers have developed models that are finetuned on top of these foundational models to better follow instructions, such as OPT-IML (Iyer et al., 2022) and Alpaca (Taori et al., 2023). Despite the remarkable progress in LLMs' performance on Natural Language Processing (NLP) tasks, reasoning remains a challenging area. For example, prior work has shown that LLMs struggle with commonsense reasoning (West et al., 2022) and arithmetic reasoning (Hendrycks et al., 2021), to name a few.

Figure 1: Three-Dimensional Grid of Finetuning, Prompting, and Scale. Each dimension is represented as an axis, with three levels for each of finetuning, prompting, and scale plotted on each axis. The resulting grid consists of 27 different combinations evaluated on various reasoning tasks. Note that there is a hidden dimension, the scoring function, comprising four components, which results in a comprehensive total of 6,156 evaluations.
Recent efforts have attempted to improve the reasoning performance of LLMs by decomposing answers into step-by-step reasoning chains using in-context learning (Wei et al., 2022b; Kojima et al., 2022) or during finetuning (Chung et al., 2022; Wei et al., 2021a). While these approaches have shown some improvement on benchmarks such as GSM8K (Cobbe et al., 2021), it is not clear how those explanations affect finetuning, prompting, or their combination. Concurrent work has investigated the generalization capability of such models to reasoning skills beyond those encountered during finetuning (Yu et al., 2022), but a comprehensive evaluation of the role of explanations during finetuning and prompting with respect to reasoning skills is still lacking.

{Task Definition} Provide your answer followed by a brief reasoning.
Options: {options}
Output: The answer is {answer} because {explanation}

Figure 2: Template used during both training and inference. The model is tasked with predicting the answer followed by the explanation.
In this paper, we aim to address this gap. We investigate OPT (Zhang et al., 2022) as a representative of such models and utilize it as our base model. Through finetuning OPT on a collection of carefully curated open-source reasoning datasets that come with explanations for each instance, we evaluate its performance on 57 tasks drawn from the SUPER-NATURALINSTRUCTIONS benchmark (Wang et al., 2022), covering 26 different reasoning skills. Our experiments are structured around three key dimensions, each of which comprises three distinct components (see Figure 1). Finetuning: (1) a (vanilla) unfinetuned OPT model; (2) a finetuned OPT model without explanations (OPT-R); and (3) a finetuned OPT model with explanations (OPT-RE). Prompting: (1) zeroshot prompting; (2) fewshot prompting without explanations; and (3) fewshot prompting with explanations. Finally, Scale: (1) 1.3B; (2) 6.7B; and (3) 13B parameters. Accordingly, we create a grid of 27 different configurations, providing a detailed analysis measuring the impact of explanations during finetuning and inference across different model scales.
Our findings reveal that finetuning on reasoning datasets leads to statistically significant improvements in seven reasoning skills, including Numerical, Analogical and Reasoning on Objects, with Physical, Counting and Textual Entailment showing a significant effect only for the OPT-RE model, across both fewshot prompting conditions and model sizes, as compared to the vanilla OPT model (see Table 2). However, we also find that this approach significantly hinders the performance of three other reasoning skills (see Table 3). We also investigate the impact of incorporating explanations during fewshot prompting and find that it does not have a significant impact on the performance of the finetuned models, as measured by the variance in the difference between both prompting methods across reasoning skills for each model. However, we notice that it has a more noticeable effect on the performance of the vanilla OPT model, as shown in Table 5. Additionally, we observe a consistent increase in the average performance across all tasks from Fewshot to Fewshot-E, as well as from OPT to OPT-R to OPT-RE models, indicating that explanations do have a small effect on performance during both finetuning and prompting. Finally, Table 4 presents a summary of the results, indicating which reasoning skills demonstrate improvement due to the incorporation of explanations during either finetuning or prompting, which skills show a negative effect, and which skills exhibit negligible effects regarding explanations.

The finetuning corpus utilized to refine OPT is composed of various reasoning datasets, each of which includes a corresponding explanation or rationale for the answer. These rationales may consist of a sequence of smaller steps (i.e., chain-of-thought) or a free-form text that elucidates the reasoning behind the answer. As shown in Figure 2, we employ a uniform template for all tasks during the training process. The input to the model begins with a task definition, followed by an instruction to provide an answer and a brief reasoning. Next, we extract two random in-context examples uniformly from the training set; these remain constant throughout training for each instance. The input for the current training instance is then presented in a format specific to each task. The options for the answer are then included in the input, but not in the in-context examples (see Appendix A for further details on task-specific definitions and options). The options are pre-shuffled for each training instance. The model is finally provided with the answer prefix, "Output: The answer is", and is tasked to predict the answer, followed by an explanation if OPT-RE is being finetuned. Similarly, the in-context examples only comprise an explanation when training OPT-RE.
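The template assembly described above can be sketched as follows. The function name and data layout are illustrative assumptions, not the authors' actual code:

```python
def build_prompt(task_definition, exemplars, instance, with_explanations=False):
    """Assemble a prompt following the Figure 2 template.

    `exemplars` is a list of dicts with keys: input, answer, explanation.
    `instance` is a dict with keys: input and options (pre-shuffled).
    Per the paper, options are listed only for the current instance, and
    explanations appear in exemplars only for the OPT-RE variant.
    """
    parts = [f"{task_definition} Provide your answer followed by a brief reasoning."]
    for ex in exemplars:
        out = f"Output: The answer is {ex['answer']}"
        if with_explanations:
            out += f" because {ex['explanation']}"
        parts.append(f"{ex['input']}\n{out}")
    # Current instance ends with the answer prefix the model must continue from.
    options = " ".join(f"-{o}" for o in instance["options"])
    parts.append(f"{instance['input']}\nOptions: {options}\nOutput: The answer is")
    return "\n\n".join(parts)
```

During training, the target continuation would be the gold answer, followed by "because {explanation}" when finetuning OPT-RE.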

Reasoning Datasets with Explanations
Below is a brief description of each dataset used during finetuning. See Figure 3 for the relative size of each dataset.
AQUA-RAT The Algebra Question Answering with Rationales dataset (Ling et al., 2017) renders the task of solving algebraic word problems more feasible by dividing each problem into a series of smaller steps. The authors create a 100k-sample dataset that contains questions, answers and rationales in natural language, along with human-readable mathematical expressions that can be used to derive the final answer.
CoQA The Conversational Question Answering dataset (Reddy et al., 2019) consists of 127k questions and answers, compiled from 8k conversations about passages from seven different domains. Given a passage that contains a conversation, the model is tasked with answering a question by highlighting the corresponding evidence from the passage.
CoS-E The Common Sense Explanations dataset (Rajani et al., 2019) was created to induce commonsense reasoning in language models. In this dataset, the model is given a question and a set of choices and is tasked with selecting one of the provided choices along with providing an explanation in natural language as to why that choice is correct.
ECQA The Explanations for Commonsense Question Answering dataset (Aggarwal et al., 2021) is similar to CoS-E: the model must choose one of the provided options to answer the given question and also provide an explanation.
ESNLI The Stanford Natural Language Inference dataset with Explanations (Camburu et al., 2018) was created to train models to provide interpretable and robust explanations for their decisions. The authors extend the SNLI dataset (Bowman et al., 2015) with human-annotated explanations. As in any NLI task, the model is given a premise and a hypothesis and must determine whether the hypothesis sentence entails, contradicts, or is neutral with respect to the given premise.

GSM8K
The Grade School Math dataset (Cobbe et al., 2021) was introduced to train models to better perform multistep mathematical reasoning. It consists of 8.5k linguistically diverse grade school math word problems. The task for the model is to answer each question by performing a series of arithmetic operations to obtain a final answer, while explaining its reasoning steps.
ProofWriter The ProofWriter dataset (Tafjord et al., 2021) was designed to generate both the implications of a theory from the RuleTaker dataset (Clark et al., 2020) and the natural language proofs that support them. Specifically, given a sequence of facts and rules, the model is tasked with answering a question using "Yes", "No", or "Unknown" and providing the reasoning path by referring to the provided facts and rules. We consider the open-world assumption subset of RuleTaker with questions that require reasoning up to a depth of 5.
StrategyQA The Strategy Question Answering dataset (Geva et al., 2021) was built to improve multi-hop reasoning for questions where the required reasoning steps are implicit in the question. The task of the model is to answer the question using "Yes" or "No", then provide a strategy that explains the answer by decomposing it into a number of steps.

Finetuning Procedures
OPT The Open Pretrained Transformers (OPT) models are a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, released by Zhang et al. (2022). In this work, we use three OPT models with sizes of 1.3B, 6.7B and 13B parameters. The details of each model architecture, pre-training corpus and training configuration (e.g., weight initialization, optimizer, tokenizer, hyperparameters) can be found in Zhang et al. (2022).

Implementation Details To finetune the selected models, we utilized the metaseq implementation, since it enables higher training efficiency compared to other codebases (Zhang et al., 2022). Each model is finetuned twice for 10 epochs, once with explanations and once without (i.e., OPT-RE and OPT-R, respectively). Models are evaluated at the end of each epoch on a chosen set of SUPER-NATURALINSTRUCTIONS validation tasks, and the checkpoint with the best performance is selected for evaluation on the testing tasks. The loss is calculated only on the tokens the model is tasked to predict during inference, not on the full input, which is referred to as label loss by Iyer et al. (2022).
The samples across all datasets are shuffled during training. Further, the model is provided with two in-context examples during finetuning, in addition to the task definition, to match inference time, following Wang et al. (2022).
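The label-loss objective described above, where the loss is computed only on the tokens the model must predict rather than on the full input, can be sketched as follows. This is a minimal NumPy illustration with assumed array shapes, not the metaseq implementation:

```python
import numpy as np

def label_loss(logits, input_ids, label_mask):
    """Cross-entropy averaged over target-span tokens only.

    logits: (T, V) next-token logits; input_ids: (T,) token ids;
    label_mask: (T,) 1 where the token belongs to the target span
    (answer and, for OPT-RE, explanation), 0 on the prompt.
    """
    # Shift so position t predicts token t+1.
    shifted = logits[:-1]
    # Numerically stable log-softmax normalizer.
    m = shifted.max(axis=-1, keepdims=True)
    logZ = (m + np.log(np.exp(shifted - m).sum(axis=-1, keepdims=True))).squeeze(-1)
    token_logprob = shifted[np.arange(len(shifted)), input_ids[1:]] - logZ
    mask = label_mask[1:].astype(float)
    # Negative log-likelihood averaged over the unmasked (target) tokens.
    return -(token_logprob * mask).sum() / max(mask.sum(), 1.0)
```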
Evaluating the Models

SUPER-NATURALINSTRUCTIONS Tasks
In this study, we focus on a subset of the SUPER-NATURALINSTRUCTIONS benchmark version 2.6 (SUP-NATINST for short) proposed by Wang et al. (2022), which comprises 1,616 varied NLP tasks and includes meta-labels for each task, such as task type, domain and, more importantly for this work, the underlying reasoning skills. Specifically, we select a subset of tasks that satisfy two key criteria: (i) the task focuses on a single reasoning skill, enabling us to evaluate a specific atomic skill, and (ii) the task can be tested in classification mode, as detailed in Section 3.2. Note that there is no data contamination between the finetuning data and the evaluation benchmark.

Benchmark Splits Following the task selection process, we apply a random sampling technique to ensure diversity within the testing set. Specifically, we select a maximum of three tasks from each reasoning skill and allocate any remaining tasks to the validation set. Notably, this approach enables us to obtain a representative sample of the selected reasoning skills for testing, while also ensuring that our model's performance is not influenced by a particular subset of tasks. Table 1 shows the complete list of tasks used for evaluating our finetuned models for each reasoning skill.
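The split procedure above can be sketched as follows; the function name and data layout are our assumptions:

```python
import random

def split_tasks(tasks_by_skill, max_test_per_skill=3, seed=0):
    """Assign at most `max_test_per_skill` tasks per reasoning skill to the
    test set; remaining tasks go to the validation set."""
    rng = random.Random(seed)  # seeded for a reproducible split
    test, valid = [], []
    for skill, tasks in tasks_by_skill.items():
        shuffled = tasks[:]
        rng.shuffle(shuffled)
        test.extend(shuffled[:max_test_per_skill])
        valid.extend(shuffled[max_test_per_skill:])
    return test, valid
```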

Evaluation Setup
Earlier, we mentioned that we selected 57 tasks spanning 26 reasoning skills from SUP-NATINST to evaluate our finetuned models. To meet our criteria, as detailed in Section 3.1, each task had to fulfill two conditions. The second condition required that the task can be treated as a classification task: there is a discrete set of candidates (one of which is correct), so the highest-scoring candidate can be taken as the answer. To ensure this, we utilized a straightforward heuristic: we only sampled tasks that had no more than 10 possible candidate answers.
Classification Method To determine the correct answer, we conduct a forward pass for each potential candidate answer and utilize a scoring function to measure the likelihood that the candidate tokens follow the input, similar to Brown et al. (2020). This process is repeated four times using distinct scoring functions, as detailed in the subsequent paragraph. The highest accuracy score from the four scoring functions is considered the result of the task.
Scoring Functions This is considered the fourth dimension of this work, since we evaluate each task using four different scoring functions and take the maximum accuracy as the result. The four scoring functions used are as follows: (1) mean, which involves computing the average of the log probabilities of the candidate tokens, also referred to as token scores; (2) unconditional-norm, which computes the difference between the sum of token scores of the candidate when unconditioned by any previous tokens and the sum of candidate token scores when conditioned on the previous input; (3) suffix, which computes the sum of the conditioned candidate's token scores alone; and (4) sum, which involves calculating the sum of all the token scores passed to the model. The reason we employed different functions is that we observed significant gains in performance when using one scoring function over another for specific tasks. Therefore, in order to ensure fairness across all tasks, we selected the highest accuracy over all scoring functions for each task.
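A minimal sketch of the four scoring functions and the resulting rank classification, under our reading of the definitions above (the exact formulas are our assumptions, and the sign of unconditional-norm is chosen so that a higher score is better):

```python
import numpy as np

def score(cond, uncond, prefix, fn):
    """cond: per-token log-probs of the candidate conditioned on the input;
    uncond: log-probs of the same tokens without the input;
    prefix: log-probs of the input tokens themselves."""
    if fn == "mean":                # average candidate token score
        return float(np.mean(cond))
    if fn == "unconditional-norm":  # conditioned vs. unconditioned sums
        return float(np.sum(cond) - np.sum(uncond))
    if fn == "suffix":              # sum of conditioned candidate scores alone
        return float(np.sum(cond))
    if fn == "sum":                 # sum over all tokens passed to the model
        return float(np.sum(prefix) + np.sum(cond))
    raise ValueError(f"unknown scoring function: {fn}")

def classify(candidates, fn):
    """candidates: list of (cond, uncond, prefix) triples, one per answer
    option; returns the index of the highest-scoring candidate."""
    return int(np.argmax([score(c, u, p, fn) for c, u, p in candidates]))
```

Per the text, task accuracy would be computed once per scoring function, and the maximum over the four taken as the task's result.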

Results & Findings
In this section, we present the results and findings of our experiments. First, Figure 4 illustrates the outcome of our evaluation of the effectiveness of the finetuned models as compared to the vanilla OPT model, across three different scales, when using fewshot prompting both with and without explanations. We observe a monotonic increase in the performance of each model as we increase the scale under those two prompting conditions, which indicates a positive correlation between the model's capacity and its overall performance. However, we note that this trend does not apply to the zeroshot prompting method, since we are testing out-of-distribution tasks and the finetuned models were trained with fewshot exemplars in their context. This leads us to focus only on the fewshot prompting methods, with and without explanations, for the remainder of our evaluations. Specifically, we investigate the impact of finetuning the OPT models on reasoning datasets, as compared to the vanilla OPT model, and explore the effect of explanations during finetuning and prompting at the level of individual reasoning skills.

Model Performance for Reasoning Skills
The results reported in this and the following section are the classification accuracy of each reasoning skill across different conditions, such as model sizes and fewshot prompting methods. For the Counting skill, the OPT-RE variant outperforms both the OPT-R and OPT models, underscoring the criticality of incorporating explanations during the finetuning process for mathematical datasets. Likewise, the Physical Reasoning tasks exhibit a similar trend. On the other hand, we can see that for the Argument, Deductive Textual Entailment and Commonsense skills, the non-finetuned version outperforms considerably.

Fine-Grained Skill Analysis
Table 4 shows the classification accuracy results obtained from the three models, in relation to the reasoning skill and few-shot prompting method used. The best accuracy value for each reasoning skill is indicated in bold, and the cells are shaded with colors ranging from green to white to indicate their position in the accuracy spectrum of each reasoning skill. Skills with similar performance across different models are assigned a lighter shade of green, indicating that their color spectrum ends earlier than that of other skills where the difference in performance between models is more significant.

Table 4: Classification accuracy results achieved by different models as a function of the reasoning skill and few-shot prompting method employed. The best accuracy obtained for each reasoning skill is highlighted in bold. The cells are shaded with colors ranging from green to white to indicate their position in the accuracy spectrum. Reasoning skills with smaller variance in achieved results are assigned a lighter shade of green to convey the extent of similarity between models. The first block highlights skills where the finetuned models perform notably better than the vanilla OPT. The second block emphasizes the skills where OPT-RE outperforms other models. In contrast, the third block showcases the skills where OPT outperforms the other models. Lastly, the fourth block identifies skills where the choice of model or prompting method has little impact on the overall performance.
Explanations' Effect One of the central questions that we sought to investigate in this study is the extent to which explanations play a role in improving the reasoning capabilities of OPT models during finetuning and prompting. The results presented in Table 5 suggest that the presence or absence of explanations in the fewshot examples employed for prompting does not significantly impact the performance of the model when the model is finetuned on reasoning datasets. Concretely, in Table 5 we present the variance of the absolute accuracy difference for each model across reasoning skills, excluding the Temporal skill, which was identified as an outlier. Specifically, we compute the difference between the two corresponding columns for each model in Table 4. These values provide insights into the impact of including explanations during prompting on the performance of the models. Our findings reveal that the difference is negligible for the OPT-R and OPT-RE models, suggesting that the choice of prompting method does not significantly affect the model's accuracy. However, for the vanilla OPT model, the difference is more substantial, emphasizing the importance of employing explanations during fewshot prompting. Nevertheless, the mean performance of each model across the distinct fewshot prompting methods demonstrates a slight yet consistent increase in classification accuracy, from Fewshot to Fewshot-E (incorporating explanations), as well as from OPT to OPT-R to OPT-RE models, showing that explanations do have a small effect on performance during both finetuning and prompting.
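The Table 5 statistic described above can be sketched as follows; the data layout is an illustrative assumption:

```python
import numpy as np

def prompting_gap_variance(acc_fewshot, acc_fewshot_e, exclude=("Temporal",)):
    """Variance across reasoning skills of the absolute accuracy difference
    between Fewshot and Fewshot-E prompting, after dropping outlier skills.

    acc_fewshot, acc_fewshot_e: dicts mapping skill -> classification accuracy.
    """
    skills = [s for s in acc_fewshot if s not in exclude]
    diffs = [abs(acc_fewshot[s] - acc_fewshot_e[s]) for s in skills]
    return float(np.var(diffs))  # population variance over skills
```

A small variance (as reported for OPT-R and OPT-RE) indicates that swapping prompting methods shifts accuracy by a similar, small amount across skills.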

Related Work
Reasoning LLMs LLMs have made significant advancements in the field of NLP and related areas (Brown et al., 2020; Chowdhery et al., 2022; Chung et al., 2022), especially with the advent of the pre-train, prompt, and predict paradigm (Liu et al., 2021). This paradigm has enabled these models to solve a multitude of tasks through in-context fewshot or zeroshot learning using instructions (Wei et al., 2021b; Iyer et al., 2022). However, their reasoning abilities have been a subject of debate in recent literature (Huang and Chang, 2022; AlKhamissi et al., 2022). Several studies suggest that increasing the size of an LM trained through the same next-token prediction method can lead to the emergence of complex behaviors (Wei et al., 2022a), including reasoning. For instance, some research has demonstrated that sufficiently large LMs can use chain-of-thought prompting (Wei et al., 2022b) to simulate human-like reasoning. Other studies have shown that the addition of a simple prompt, such as "Let's think step-by-step" (Kojima et al., 2022), can elicit reasoning abilities in LLMs by generating explicit reasoning steps before decoding the final answer. However, some researchers contend that emulating the human reasoning thought process is distinct from claiming that the model can truly reason (Wei et al., 2022b).
Finetuned LLMs Concurrent studies have finetuned LLMs to follow instructions to improve their generalization ability to unseen tasks through zeroshot and fewshot learning (Iyer et al., 2022; Chung et al., 2022). However, our approach differs in that we only finetune on a selected number of open-source datasets that provide explanations for each instance. This enables us to focus on the importance of explanations during finetuning in the context of reasoning skills. While concurrent works, such as Iyer et al. (2022) and Wang et al. (2022), have experimented with different prompting methods during finetuning and inference, our study focuses primarily on evaluating the reasoning ability of the finetuned models across a set of reasoning skills. Other concurrent studies have explored the impact of finetuning on a set of held-out reasoning tasks (Yu et al., 2022), but their evaluation approach, which involves generating answers, may be influenced by various factors such as the decoding strategy, decoding parameters, and prompt templates. In contrast, we adopt a rank classification approach similar to Brown et al. (2020), which better captures the reasoning performance of the model being evaluated, in addition to covering a larger number of reasoning skills and tasks.

Conclusion
In this study, we investigated the impact of incorporating explanations during finetuning and prompting on three different sizes of the OPT model. Through a systematic and comprehensive evaluation process that considered three key dimensions, we found that while explanations did provide a small improvement in performance, the effect was not significant when they were incorporated in the in-context demonstrations during inference for the finetuned models. Additionally, our results showed that both finetuned models exhibited significant improvements in reasoning skills such as Numerical, Analogical and Reasoning on Objects. Moreover, we demonstrated that skills such as Physical, Counting, and Textual Entailment benefited from incorporating explanations during the finetuning process. Overall, our findings provide insights into the impact of incorporating explanations on the reasoning capabilities of LLMs and offer guidance on which reasoning skills would benefit most from the inclusion or exclusion of explanations during finetuning and prompting.

Limitations
While our study provides valuable insights into the impact of finetuning on reasoning performance and the role of explanations during finetuning and prompting with respect to various reasoning skills, there are several limitations to our work. Firstly, we only consider a single LLM, OPT, as our base model; our results may not generalize to other LLMs with different architectures or pretraining objectives. Secondly, we only use a limited set of reasoning datasets for finetuning due to the limited availability of open-source datasets with explanations, so our findings may not hold for models finetuned on larger closed datasets, as is common in real-world scenarios. Thirdly, our experiments only cover a limited range of model sizes due to limitations in computational budget, so our findings may not hold for much larger models. Finally, we only consider finetuning using fewshot prompting conditions in our experiments, and it is possible that our findings may not hold for models finetuned without in-context exemplars. Overall, while our study provides valuable insights into the impact of finetuning and explanations on reasoning performance, further research is needed to investigate these factors across a broader range of models, datasets, and finetuning strategies.

AQuA
Task definition: You are given an algebraic word question. Questions in this task often require executing a series of arithmetic operations to obtain a final answer. You are also given 5 answer options (associated with 'A', 'B', 'C', 'D', 'E'). Do not generate anything else apart from one of the following characters: "A", "B", "C", "D", "E" and the corresponding explanation.
Options: -A -B -C -D -E

CoQA
Task definition: You are given a passage that contains a conversation and a question. The task is to answer the question and provide an explanation that highlights the corresponding evidence in the passage.
Options: Free-form text

CoS-E
Task definition: You are given a passage that contains a sentence and a question. The task is to answer the question by selecting one of the provided choices.
Options: Select one of the provided choices

ECQA
Task definition: You are given a question that requires commonsense reasoning. The task is to answer the question by selecting one of the provided choices.
Options: One of the provided choices

ESNLI
Task definition: You will be presented with a premise and a hypothesis sentence. The task is to determine whether the hypothesis sentence entails (implies), contradicts (opposes), or is neutral with respect to the given premise sentence. Please answer with "Contradiction", "Neutral", or "Entailment".
Options: -Contradiction -Neutral -Entailment

GSM8K
Task definition: You will be presented with a passage that contains a grade school math word problem. The task is to answer the question by performing a series of arithmetic operations to obtain a final answer.
Options: Number

ProofWriter
Task definition: You are given a sequence of facts and rules followed by a question. The task is to answer the question using "Yes", "No" or "Unknown".
Options: -Yes -No -Unknown

StrategyQA
Task definition: You are given a sentence and a question. The required reasoning steps are implicit in the question. The task is to answer the question using "Yes" or "No" then provide a strategy that explains the answer by decomposing it into a number of steps.
Options: -Yes -No
Table 6: Task definition and options used for each of the finetuning reasoning datasets.

A Finetuning Task Definition and Options
Table 6 shows the task definitions and options provided as input to the template shown in Figure 2 when finetuning the OPT models on the reasoning datasets.

Figure 3: Number of samples in each dataset of the training corpus. Y-axis in log scale.

Figure 4: Results achieved across all tasks as a function of the three primary dimensions analyzed in this study: Finetuning, Prompting and Scale.

Table 5: The first column shows the variance of the absolute difference in accuracy for each model across different reasoning skills, when using the Fewshot (F) and Fewshot-E (FE) prompting methods. The second and third columns show the average performance of each model with each prompting method. Results are obtained after dropping the outlier Temporal skill.