Learning from Mistakes via Cooperative Study Assistant for Large Language Models

Large language models (LLMs) have demonstrated their potential to refine their generation based on their own feedback. However, the feedback from the LLM itself is often inaccurate, thereby limiting its benefits. In this paper, we propose Study Assistant for Large LAnguage Model (SALAM), a novel framework with an auxiliary agent to assist the main LLM in learning from mistakes through interactive cooperation. In the gathering phase, the study assistant agent probes the main LLM, analyzes its errors, and collects the interactions in a mistake memory. During the examination phase, the study assistant provides guidelines by retrieving relevant cases to help the main LLM anticipate and avoid similar errors. We first investigate the effectiveness of a general study assistant and then customize it to provide LLM-specific guidance through imitation learning from successful guidance experiences. Our experiments on three LLMs over two challenging benchmarks demonstrate that SALAM can significantly boost LLMs by an accuracy margin of up to 6.6 on BBH and 12.6 on BBQ.


Introduction
Large language models (LLMs) have demonstrated remarkable performance in a wide range of tasks (Brown et al., 2020; Raffel et al., 2020; Chowdhery et al., 2022). Their effectiveness is further enhanced by human instructions and feedback, allowing them to better align with human intentions (Chung et al., 2022; Ouyang et al., 2022; Bai et al., 2022b). Furthermore, recent studies show that LLMs can also benefit from their own feedback to avoid mistakes, similar to human reflection (Shinn et al., 2023; Madaan et al., 2023).
There are two main limitations to existing self-reflection methods. First, they rely on the correctness of the guidance, particularly in determining when to terminate reflection and accept the current response.

1 https://dqwang122.github.io/projects/SALAM

Figure 1: SALAM consists of two agents: a main LLM and a study assistant. The main LLM generates responses (in blue), while the study assistant provides guidance (in green). During the mistake-gathering phase, the main LLM interacts with the study assistant, receiving feedback to refine its responses. The study assistant compares the main LLM's response with the ground truth, providing guidance and collecting the mistakes made by the main LLM. In the examination phase, the study assistant retrieves relevant mistakes from the mistake memory for a new query and provides guidelines without knowing the ground truth.
Inaccurate guidance can mislead the LLM by either prompting it to refine an already acceptable generation or prematurely halting the refinement of an undesired generation (Huang et al., 2023). Prior studies have attempted to address this by utilizing additional learned discriminators and employing a value-based threshold as the termination signal (Welleck et al., 2022; Saunders et al., 2022; Lu et al., 2022), or by providing few-shot examples to encourage LLMs to develop their own discernment when accepting responses (Madaan et al., 2023; Yang et al., 2022; Kwon et al., 2023). However, the reliability of such criteria remains uncertain. Moreover, these reflection methods are limited to the unsuccessful experiences of the query they are addressing, without acknowledging mistakes made on other queries. Consequently, when confronted with a new query, the LLM cannot fully utilize the experience gained from similar cases and may repeat past errors. This lack of global reflection results in an inefficient revision process.
To address these challenges, we propose Study Assistant for Large LAnguage Model (SALAM). This novel framework introduces a cooperative agent guiding the main LLM to learn from its mistakes. It is inspired by how humans learn from their mistakes: maintaining a collection of mistakes and analyzing common misunderstandings. SALAM includes two cooperative agents: a main LLM responsible for problem-solving, and a study assistant that collects previous error cases and provides guidance to improve the main LLM's performance. The framework consists of two phases: the mistake-gathering phase and the examination phase. During the mistake-gathering phase, the LLM interacts with the study assistant to receive feedback and refine its answers. Simultaneously, the study assistant collects mistakes and provides guidance based on the ground truth. In the examination phase, the study assistant retrieves similar mistakes for a new query and provides guidelines to clarify misunderstandings and prevent the LLM from repeating previous errors. For example, in Figure 1, the study assistant analyzes the LLM's current response of '02/11/2002' against the ground truth of '02/12/2002' and provides the guideline 'identify the correct date from which calculations should be made' to help the LLM refine its response.
Our proposed SALAM enjoys three advantages: (1) Flexible: SALAM is a versatile framework that can be directly adapted to any LLM. Additionally, it can provide LLM-specific guidance by fine-tuning the study assistant on the specific behaviors of the LLM. (2) Lightweight: In contrast to knowledge distillation, where a large teacher/teaching assistant is used to improve downstream task performance, the study assistant in SALAM is a small model focused on providing feedback based on mistakes. It is more cost-effective to fine-tune the small study assistant once for all downstream tasks than to fine-tune the large LLM for different complex tasks. (3) Efficient and Reliable: The feedback provided by the study assistant is based on the comparison between the LLM's response and the ground truth, making the feedback more reliable. Furthermore, the guidance for previous mistakes can be applied to new, similar queries. This makes the guidance more efficient, as it can help prevent similar errors in advance.
We evaluate the effectiveness of SALAM on three LLMs: Flan-T5 (Chung et al., 2022), GPT-NeoX (Black et al., 2022), and LLaMA (Touvron et al., 2023). We use 27 tasks from BBH (Suzgun et al., 2022) and BBQ (Parrish et al., 2022), which evaluate two crucial aspects of LLMs: reasoning ability and potential social bias. Our contributions are as follows:
• We introduce a general framework, SALAM, to learn from mistakes through interactive cooperation between the main LLM and the study assistant. The main LLM refines its answer based on the feedback from the study assistant, while the study assistant provides guidance by comparing the LLM's behaviors with the ground truth.
• We further use the main LLM to fine-tune a model-specific study assistant, tailoring the guidance to this LLM. We use imitation learning on the successful guidance experiences of this LLM for fine-tuning.
• The experimental results show that SALAM significantly boosts the performance of various LLMs on different tasks. We also conduct a comprehensive analysis of retrieval and feedback strategies.

Related Work
Feedback from Language Models Large language models (LLMs) have exhibited a remarkable capability for providing feedback. The feedback from LLMs can be in the form of real numbers to evaluate the quality of the generation (Fu et al., 2023; Kocmi and Federmann, 2023), or textual instructions to guide the refinement (Kwon et al., 2023; Yao et al., 2022). For instance, Peng et al. (2023) provided feedback grounded in evidence from external knowledge. Reflexion (Shinn et al., 2023) generated textual feedback utilizing trajectory history and dynamic memory with the help of few-shot examples and the signal from the environment. Self-Refine (Madaan et al., 2023) manually created several few-shot examples for each task to prompt the LLM to provide feedback; it stops reflection when it reaches the maximum iteration or exceeds a threshold value. Du et al. (2023) used responses from multiple agents and the debate between these agents as the feedback. However, Huang et al. (2023) find that such intrinsic self-correction is unreliable without the correct label to determine when to stop the refinement. Instead of such instance-specific feedback, in this paper, the study assistant collects a set of mistakes to generate global feedback on model behaviors. Furthermore, the study assistant uses the ground truth to provide feedback, making the feedback more reliable.
Learning from Feedback Plenty of work has investigated how to utilize feedback (Pan et al., 2023). One line filters undesirable data based on feedback and uses the filtered data to fine-tune the model (Huang et al., 2022; Uesato et al., 2022). Another trains a reward model and uses it as the reward function in reinforcement learning (Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022a). Benefiting from LLMs' ability to follow instructions, recent work adds textual feedback to the prompt and directly asks models to revise their responses (Peng et al., 2023; Shinn et al., 2023). Moreover, the feedback can be given once (Saunders et al., 2022) or multiple times (Scheurer et al., 2023; Madaan et al., 2023). In this work, we use feedback as the instruction to the main LLM and ask it to refine its answer.
Teacher-student Learning Teacher-student learning is a knowledge distillation method to transfer knowledge from a larger teacher model to a smaller student model (Hinton et al., 2015; Gou et al., 2021). The goal is to produce results similar to the powerful teacher model with fewer parameters and lower computational cost. The teaching assistant is an intermediate model that mimics the behavior of the teacher model and then teaches the student model; it usually has a medium size between the student and the teacher (Mirzadeh et al., 2020). Recently, much work has tried to distill knowledge from large language models to enhance the capability of small models, such as commonsense (Bhagavatula et al., 2023; West et al., 2022) and reasoning ability (Shridhar et al., 2023; Magister et al., 2022). Unlike knowledge distillation, the study assistant in SALAM does not need a stronger capability on the downstream tasks. It is designed to analyze the output of the base model given the ground truth, providing a guideline for the base model to avoid similar mistakes.

Study Assistant Agent for LLMs
In this section, we introduce the SALAM framework. SALAM consists of two agents: a main LLM M and a study assistant T. M is responsible for solving the downstream tasks, while the study assistant T provides text feedback for M to refine its answers. The goal of this framework is to improve M's performance through the interactive cooperation between the two agents.
There are two phases: in the gathering phase, the study assistant T collects mistakes while identifying common misunderstandings and providing helpful guidance for revision; the examination phase involves using M on new queries. The key difference is that in the gathering phase the study assistant T has access to the ground truth, while in the examination phase it does not.
Specifically, suppose there is a set of N queries Q = {q^(0), q^(1), ..., q^(N)} in the gathering phase. For each query q ∈ Q, the main LLM M generates an initial response y_0, and the study assistant T provides text feedback a_0 based on the comparison between the current response and the ground truth y. Then M generates a new response y_1 under the guidance of the feedback. There can be multiple iterations between the two agents until M produces the correct answer (or reaches the maximum iteration number L): {(y_0, a_0), ..., (y_l, a_l)}. These iterations are stored in the mistake memory O_err. During the examination phase, given a new query q, the study assistant T retrieves the most relevant queries from O_err and provides text feedback as a pre-hoc instruction for M.
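To make the two phases concrete, the following is a minimal Python sketch of the interaction loop. The `main_llm`, `study_assistant`, and `mistake_memory` objects and their method names are our illustrative assumptions, not the released implementation:

```python
# Minimal sketch of the SALAM interaction (helper names and interfaces are hypothetical).

def gather_phase(queries, main_llm, study_assistant, mistake_memory, max_iters):
    """Mistake-gathering phase: the study assistant sees the ground truth."""
    for query, truth in queries:                      # each training query has a ground truth
        response = main_llm.answer(query)
        for _ in range(max_iters):
            if truth in response:                     # reward = 1: stop refining
                break
            feedback = study_assistant.analyze(query, response, truth)  # Analysis + Guideline
            mistake_memory.add(query, response, feedback)               # store the mistake
            response = main_llm.answer(query, instruction=feedback)     # refine under guidance

def examination_phase(query, main_llm, study_assistant, mistake_memory, k, theta):
    """Examination phase: no ground truth; guidance is retrieved pre-hoc."""
    similar_cases = mistake_memory.retrieve(query, top_k=k, min_sim=theta)
    guideline = study_assistant.guide(query, similar_cases)   # guideline without the answer
    return main_llm.answer(query, instruction=guideline)
```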

Problem Formulation
We formulate the interactive process as a Markov decision process (MDP) (S, A, P, R). Here, S represents the set of states describing the interaction, and A represents the set of feedback utterances generated by the study assistant T. P : S × A × S → [0, 1] is the transition probability function, and R : S × A → R is a reward function based on the states and feedback. At each state s_t, the study assistant T generates text feedback as its action a_t and receives a reward based on M's performance.
The state at timestep t is defined as s_t = {q, y_t, c_t}, including the query q, the response y_t generated by M, and the context c_t retrieved from the mistake memory O_err. In the gathering phase, the context consists of previous responses to the same query, c_t = {(q, y_0:(t−1))}; in the examination phase, it includes the retrieved mistakes and feedback of relevant queries based on query similarity, c_t = {(q^(i), y^(i), a^(i))}. The action space A is the set of all possible feedback utterances generated by T; each feedback includes an explanation of why y_t is incorrect (Analysis) and a suggestion for improvement (Guideline). We use the performance of M as the reward function R(s_t, a_t) to evaluate the effectiveness of the feedback T provides. Given the ground truth y, the reward is 1 if the current response y_t = M(s_t, a_t) contains the ground truth, i.e., y ∈ y_t; otherwise, it is 0.
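As a small illustration of the state and reward defined above, the sketch below encodes them directly; the data structure is our own choice for exposition, and the reward simply checks whether the ground truth appears in the current response:

```python
from dataclasses import dataclass, field

@dataclass
class State:
    query: str                                      # q
    response: str                                   # y_t, the main LLM's current response
    context: list = field(default_factory=list)     # c_t: prior attempts or retrieved mistakes

def reward(state: State, ground_truth: str) -> int:
    """R(s_t, a_t) = 1 if the current response contains the ground truth, else 0."""
    return int(ground_truth in state.response)
```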

Mistake Gathering and Retrieval
The study assistant T maintains a global mistake memory O_err for both the gathering and the examination phase. Each entry in O_err takes the query as the key and a list of (incorrect answer, feedback) pairs as the value. For example, the entry gathered from Figure 1 is <'Jane thought today ... a month ago', ('02/11/2002', 'Analysis: ..., Guideline: ...')>, where the first element is the key and the second is the value. For each state s_t in the gathering phase, T retrieves previous mistakes for the current query and updates the entry with the current response y_t if it is incorrect, i.e., R(s_t, a_t) = 0. During the examination phase, T retrieves relevant mistakes based on the cosine similarity between the key and the current query q. We limit the number of retrieved mistakes with a maximum count k and a minimum similarity threshold θ.
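One possible implementation of the mistake memory and its similarity-based retrieval is sketched below. It uses the sentence-transformers library (which the experiments also use for embeddings), but the class design, encoder checkpoint, and method names are our assumptions rather than the authors' code:

```python
# Illustrative mistake-memory sketch; not the released implementation.
from sentence_transformers import SentenceTransformer, util

class MistakeMemory:
    def __init__(self, encoder_name="all-MiniLM-L6-v2"):   # encoder choice is an assumption
        self.encoder = SentenceTransformer(encoder_name)
        self.entries = {}          # key: query text -> list of (wrong answer, feedback) pairs

    def add(self, query, wrong_answer, feedback):
        self.entries.setdefault(query, []).append((wrong_answer, feedback))

    def retrieve(self, query, top_k=1, min_sim=0.9):
        """Return at most top_k past mistakes whose query similarity exceeds min_sim."""
        if not self.entries:
            return []
        keys = list(self.entries)
        q_emb = self.encoder.encode(query, convert_to_tensor=True)
        k_emb = self.encoder.encode(keys, convert_to_tensor=True)
        sims = util.cos_sim(q_emb, k_emb)[0]                 # cosine similarity to every key
        ranked = sorted(zip(keys, sims.tolist()), key=lambda x: x[1], reverse=True)
        return [(key, self.entries[key]) for key, sim in ranked[:top_k] if sim >= min_sim]
```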

General Study Assistant
The goal of the study assistant is to learn a policy π(a|s) : S → A that provides feedback based on the state. It can be a general study assistant model trained on mistake datasets, agnostic to the LLM. We initialize the policy with a 7B LLaMA model (other models would work as well) and fine-tune it on a small feedback dataset generated by GPT-4 (OpenAI, 2023). Given a current state s, the policy factorizes autoregressively over feedback tokens:

π(a|s) = ∏_i p(a_i | a_<i, ρ(q, c, y)),   (1)

where a_i is the i-th token in the feedback a and ρ is a template-based function that maps the query q, the context c, and the current response y to a text prompt. Since the study assistant only depends on ρ(q, c, y) and is unaware of where the mistakes come from, it is model-agnostic and can be directly adapted to unseen tasks and new models. An example of prompting the study assistant T to provide feedback is shown in Figure 2. In this example, the study assistant is asked to provide feedback for the main LLM, which is prompted to calculate last month's date. The uncolored text in the prompt is the template used to prompt the study assistant, and the text in blue is the query. The current response y_1 = '04/12/2001' is in orange. The context c_1, in green, is the previous wrong answer to this query, y_0 = '(B) 02/11/2002', retrieved from O_err. The response of the study assistant (e.g., 'Guideline: For dates in a problem, identify the correct date from which calculations should be made. Also, make sure to maintain the correct format (MM/DD/YYYY) while providing the answer.') is generated via the policy in Equation 1.
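The template function ρ(q, c, y) can be implemented as simple string formatting. The sketch below mirrors the structure described for Figure 2, but the exact wording is our paraphrase rather than the released prompt template:

```python
def rho(query, context, response=None, ground_truth=None):
    """Template-based mapping from (q, c, y) to the text prompt fed to the study assistant.
    During gathering the ground truth is included; during examination it is omitted."""
    prompt = f"Query: {query}\n"
    if context:                                   # previous wrong answers retrieved from memory
        prompt += "Previous wrong answers: " + "; ".join(context) + "\n"
    if response is not None and ground_truth is not None:   # gathering phase only
        prompt += (f"The model answered {response}, but the correct answer is {ground_truth}.\n"
                   "Return 'Analysis:' explaining the likely reason for the mistake and "
                   "'Guideline:' with instructions to avoid similar mistakes.\n")
    else:                                         # examination phase: ask directly for a guideline
        prompt += "Provide a 'Guideline:' to help answer this query correctly.\n"
    return prompt
```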

Imitation Learning for Study Assistant
To enhance the guidance for a specific LLM, we can further use M to fine-tune T. The performance improvement of M can be viewed as hindsight for T's policy. Following studies on learning from hindsight (Uesato et al., 2022; Zhang et al., 2023; Liu et al., 2023), we apply imitation learning to learn a feedback policy specific to one model M. It includes two phases: online sampling and policy fine-tuning.
Specifically, given the state s_t = {q, y_t, c_t}, we sample various possible actions a_t from the current policy model and obtain a replay dataset

D = {(s_t^(n), a_t^(n), R(s_t^(n), a_t^(n))) : 0 ≤ t ≤ L, 1 ≤ n ≤ N},   (2)

where L is the maximum timestep of the interaction and N is the size of the collection set. We then filter the replay dataset with the reward R(s_t, a_t) and keep only the successful trajectories, D_on = {(s, a) ∈ D | R(s, a) = 1}. We conduct supervised fine-tuning to learn from those successful trajectories by minimizing the negative log-likelihood

L_IL = − Σ_{(s,a)∈D_on} Σ_i log p(a_i | a_<i, ρ(q, c, y)).   (3)

In this way, the fine-tuned study assistant adapts to the candidate outputs from M and generates model-specific feedback.
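A sketch of the two imitation-learning steps (online sampling, then filtering and supervised fine-tuning) is given below. The `policy` and `main_llm` interfaces are hypothetical stand-ins; in the experiments the policy is the 7B LLaMA study assistant and sampling uses temperature 0.8:

```python
# Illustrative sketch of imitation learning for the study assistant (interfaces are hypothetical).

def collect_successful_feedback(states, policy, main_llm, num_samples=20, temperature=0.8):
    """Online sampling: keep only (state, feedback) pairs whose feedback led M to the truth."""
    d_on = []
    for state in states:                  # each state bundles query, context, and ground truth
        for _ in range(num_samples):
            feedback = policy.sample(state, temperature=temperature)      # a ~ pi(.|s)
            new_answer = main_llm.answer(state.query, instruction=feedback)
            if state.ground_truth in new_answer:                          # reward R(s, a) = 1
                d_on.append((state, feedback))
    return d_on

def imitation_loss(policy, d_on):
    """Supervised fine-tuning objective (Equation 3): negative log-likelihood of good feedback."""
    return -sum(policy.log_prob(feedback, condition=state) for state, feedback in d_on)
```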

Experiment
We conduct experiments on two challenging benchmarks with 27 tasks in total, BBH and BBQ, evaluating SALAM's ability to guide complex reasoning and reduce social biases. We further conduct comprehensive analyses from different aspects to deepen the understanding of SALAM.

Benchmark
Big-Bench-Hard (BBH) (Suzgun et al., 2022) is a subset of challenging tasks from Big-Bench (Srivastava et al., 2022), targeting the evaluation of the reasoning capability of large language models under the zero-shot or few-shot setting. It contains 23 challenging tasks where prior language model evaluations fail to match the average human rater; we focus on the 16 English multiple-choice question-answering tasks in BBH. Bias Benchmark for QA (BBQ) (Parrish et al., 2022) is a question set on potential social bias along 9 social dimensions. It tests the capability of LLMs to avoid biases in both informative and under-informative contexts. The original benchmark contains 58k examples that can be used for both training and evaluation; similar to BBH, we randomly select 250 examples for each task.

For each task in the benchmark, we split the data 0.8/0.2 to build the training and test sets. The training set is used for the gathering phase and the test set for the examination phase. We reformulate multiple-choice question answering as a generation task: for each query, we add the options to the prompt, and a generation containing the correct option or the option content is viewed as a correct answer. We calculate the accuracy rate as the evaluation metric. We show one example for each benchmark in Table 1 and leave other details to Appendix A.1.
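Since answers are scored by containment of the correct option, the evaluation can be written in a few lines. This is our illustrative reading of the metric described above, not the official scoring script:

```python
def is_correct(generation: str, option_letter: str, option_text: str) -> bool:
    """A generation counts as correct if it contains the correct option letter or its content."""
    g = generation.lower()
    return option_letter.lower() in g or option_text.lower() in g

def accuracy(examples, predict):
    """examples: iterable of (query, correct option letter, correct option text) triples."""
    hits = sum(is_correct(predict(query), letter, text) for query, letter, text in examples)
    return hits / len(examples)
```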

Experiment Setup
In the experiments, we take the 11B Flan-T5-XXL (Chung et al., 2022), the 20B GPT-NeoX (Black et al., 2022), and the 7B LLaMA (Touvron et al., 2023) as M. We evaluate Flan-T5 under the zero-shot setting and GPT-NeoX and LLaMA under the few-shot setting, because we found that GPT-NeoX and LLaMA could hardly follow the zero-shot prompt to generate structured responses. We use the few-shot examples provided by Suzgun et al. (2022) for BBH and manually write 3 few-shot examples for each task in BBQ.
For the model-agnostic T, we fine-tune a 7B LLaMA model on a feedback dataset generated by GPT-4 (https://chat.openai.com/?model=gpt-4) according to the mistakes of Flan-T5. The feedback dataset includes 1,855 feedback instances for BBH and 514 for BBQ. GPT-4 is prompted with the format in Figure 2. This T is directly used to provide feedback for LLaMA and GPT-NeoX.
For the model-aware SALAM, we sample 20 trajectories for each mistake of M with a temperature of 0.8 and follow Section 3.4 to obtain D_on, which is optimized with Equation 3. The sampling of a trajectory is terminated once it gets a reward of 1 (correct response). The maximum number of actions depends on the number of options in the query; for example, for a query with 4 options, the maximum number of actions is T = 4 because the model should arrive at the right answer after 3 failures. We call this variant SALAM w/ IL. We fine-tune all models on two A6000 GPUs for 10 epochs with a learning rate of 2e-5, which takes about 7 hours. The parameters are updated every 32 instances.

Baseline
We set up three baselines: M directly takes the query as the prompt. M w/ O_corr keeps a collection of correct answers, similar to the mistake memory described in Section 3.2 except that each entry has a reward of 1; it retrieves relevant queries and takes them as enhanced few-shot examples. M w/ O_err retrieves incorrect cases from the collection but, unlike SALAM, receives no feedback from T; it serves as an ablation that removes the feedback policy of T. We illustrate several cases in Appendix A.2. For SALAM w/ IL and SALAM, we use the retrieved mistakes and the guideline as the instruction.
We also compare our method with Self-Refine (Madaan et al., 2023). We use the same M for generation, feedback, and refinement via different prompts and in-context learning examples. We follow the implementation of the official repo (https://github.com/madaan/self-refine) and adapt it to the BBH and BBQ benchmarks. For each benchmark, we use 3 in-context examples for each module. We set the number of iterations to a fixed number (k = 2) since the ground truth labels are not accessible during the examination phase.
For retrieval, we use SentenceTransformer (Reimers and Gurevych, 2019) to compute sentence embeddings and cosine similarity. SALAM retrieves the top k = 1 queries from the mistake memory and filters candidates with a similarity lower than θ = 0.9.
Note that during the examination phase, both T and M are unaware of the ground truth, so there is no signal from the ground truth. This is a more general setting, different from the reflection in Alfworld (Yao et al., 2022; Shinn et al., 2023) or the feedback of Self-Refine (Madaan et al., 2023) with external classifiers.

Main Results
In this section, we focus on the following research questions: (i) Can SALAM enhance the model M's ability? (ii) Is SALAM data efficient? (iii) Which learning strategy is better, learning from success or learning from failure?

SALAM achieves superior average performance on both benchmarks. As shown in Table 2, SALAM enhances performance for all three M, with a particularly notable improvement for Flan-T5. Even though SALAM is not trained to provide feedback for mistakes made by LLaMA and GPT-NeoX, it still yields benefits for these models. This indicates that SALAM can effectively enhance reasoning ability and reduce bias by providing global feedback based on the past mistake memory. Notably, this is a general framework that can be effortlessly adapted to a new LLM without additional training. The performance of Self-Refine even falls behind direct prompting (M). We observe that it is challenging for Self-Refine's feedback module to identify the accuracy of the current response without knowing the ground truth, causing it to repeatedly revise correct answers (see Appendix A.4). Furthermore, it proves difficult for less powerful LLMs to generate textual feedback and perform reasoning tasks with only limited in-context examples.
Failure can sometimes be more valuable than success. Comparing the performance of M w/ O_corr and M w/ O_err in Table 2, we find that mistakes can sometimes be more beneficial. This might be because past successful attempts indicate the model's capability of correctly dealing with similar queries, providing little aid for what the model has not yet mastered. Such attempts might even lead the model to perform worse than in the zero-shot setting, showcasing the impact of negative feedback. This observation emphasizes the importance of choosing suitable few-shot examples. On the other hand, examples of mistakes help rectify past incorrect answers, providing superior guidance for questions the model struggles with.

Even with only 10% of the training data (Table 3), SALAM still exceeds the performance of the other baselines and shows over 10% improvements on 6 out of 16 tasks. This suggests that SALAM can effectively summarize a limited number of mistakes and provide targeted feedback. However, SALAM struggles considerably with the tracking shuffled objects tasks, evidenced by a significant performance decline. We hypothesize that the difficulty of these tasks demands a larger dataset to cover a wide variety of mistakes. A similar trend is also observed in geometric shapes, where the zero-shot performance is low and the improvement is marginal. However, with more feedback data, these tasks can be further improved, as shown in Table 10.

Analysis
In this section, we dive into several aspects to enhance the understanding of SALAM.
How do feedback strategies impact performance? We investigate the impact of various feedback strategies in Table 4. Here, we set k = 3 and θ = 0.9 to include more feedback. The study assistant provides feedback along two dimensions: an analysis of the potential reason for the mistake, and a guideline to avoid similar mistakes. It also retrieves previous similar mistakes as context. We test different instruction combinations for the model M. Additionally, we allow the study assistant to directly generate guidelines for the new query without any retrieval (direct guideline). The results indicate that in most cases, the pairing of mistakes and guidelines yields the best performance. We attribute this to the fact that the analyses are typically lengthy and take up the majority of the instruction, misleading M into generating a similar analysis instead of an answer based on the given options. Direct guidelines without retrieval often degrade performance, which emphasizes the importance of the mistake memory.

How does retrieval impact performance? Retrieval plays a critical role in our SALAM framework. There are two essential hyperparameters: topk restricts the number of retrieved entries by returning only the k entries with the highest scores, whereas θ sets the minimum similarity score that the retrieved examples must achieve.
In Figure 3a, we set a low θ = 0 to accept all retrieved entries. We observe that as k increases, the accuracy continues to decline. This is likely because more irrelevant examples are retrieved, misleading the model. The trend for SALAM is more pronounced, with the performance dropping to zero when k increases to 10. Upon examining the generations, we find that with more guidelines in the prompt, the model treats these guidelines as few-shot examples rather than instructions, leading it to generate similar guidelines rather than an answer to the query.
In Figure 3b, we set a large k = 10 to retrieve entries with varying similarity scores. The results show that as the threshold increases, the accuracy also increases. For M w/ O_corr, the relevance of the few-shot examples proves to be particularly important, which aligns with previous studies on few-shot learning. Interestingly, SALAM lags behind M w/ O_err at low thresholds, but surpasses it at high thresholds and ultimately achieves the best performance. This suggests that the relevance of retrieved examples is more important than their quantity.

Are pseudo mistakes helpful? In Section 3.2, we gather mistakes from previous attempts on the training set, which we refer to as real mistakes. However, this process requires M to make an extra pass over the training set. Alternatively, we can generate pseudo mistakes by arbitrarily selecting an incorrect answer option of the query as the pseudo mistake. Therefore, we assess the performance of M when given a single pseudo mistake. Specifically, we utilize the entirety of the dataset as the evaluation set, since we do not need to traverse the training set to collect mistakes. For the zero-shot setting, we prompt M with the query and identify the pseudo mistake, while for the few-shot setting, we provide three examples with both the pseudo mistake and the correct answer. The detailed prompts can be found in Table 12. The results are exhibited in Figure 4. In most cases, pseudo mistakes appear to have a detrimental effect on performance. Even though we provide few-shot examples that demonstrate how to correct the mistake, the performance on BBQ still deteriorates. This suggests that pseudo mistakes typically fail to expose the model's actual shortcomings; instead, they may confuse the model. Therefore, learning from the real mistakes of the model is necessary.
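Constructing a pseudo mistake amounts to picking any incorrect option for the query. A tiny illustrative helper (the function name is ours) follows:

```python
import random

def make_pseudo_mistake(options, correct_index, rng=random):
    """Arbitrarily select an incorrect answer option to serve as the pseudo mistake."""
    wrong_options = [opt for i, opt in enumerate(options) if i != correct_index]
    return rng.choice(wrong_options)

# Example: make_pseudo_mistake(["(A) 02/11/2002", "(B) 02/12/2002"], correct_index=1)
# might return "(A) 02/11/2002".
```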
Can feedback generalize to unseen tasks? To investigate the generalization capability of SALAM, we divide the BBQ benchmark into two sections. The first five tasks are taken as the in-domain tasks, and mistakes are collected from them, while the remaining tasks are considered out-of-domain. We set the retrieval topk to 1, to use only the most relevant mistake. We evaluate the performance on the out-of-domain tasks. As evidenced in Figure 5, SALAM is also beneficial for unseen tasks if these tasks share some similarities with the existing errors.
How does it perform when using GPT-4 as the main LLM or the study assistant? We also examine SALAM's capability with larger LLMs like GPT-4. Due to cost concerns, we only perform the comparison on a random subset (10%) of the original set. We first use GPT-4 as the main LLM and employ our fine-tuned study assistant for feedback. The results are displayed in Table 5. They reveal that even though GPT-4 already exhibits strong performance on the BBQ benchmark, leaving minimal room for SALAM to enhance, SALAM significantly boosts GPT-4's performance on BBH. This suggests that even a large model like GPT-4 can benefit from feedback provided by our study assistant.
Additionally, we use GPT-4 to provide feedback on 10% of the training set for M = LLaMA and present the results in Table 6. For a fair comparison, we also provide SALAM with 10% feedback as one baseline. From the table, we observe that with the provided 10% feedback, GPT-4 outperforms SALAM by 2.1 on BBH and 0.7 on BBQ. However, SALAM with 100% feedback surpasses GPT-4, underscoring the importance of diverse feedback. Given that our SALAM is much more cost-effective than GPT-4, this demonstrates the potential of SALAM to provide effective feedback.

Conclusion
In this paper, we introduce a novel framework, the Study Assistant for Large Language Model (SALAM), designed to aid LLMs in learning from their mistakes through interactive cooperation between the study assistant and the LLM. The framework is inspired by the way human study assistants support students: identifying common errors and providing guidance. The main LLM sends its generations to the study assistant and refines them based on the feedback received. The study assistant identifies errors, offers feedback, and gauges its success by the main LLM's performance improvement. We validate the effectiveness of SALAM on the BBH and BBQ benchmarks, showing significant improvements in model performance. Furthermore, we use the LLM's performance as the signal to further fine-tune the study assistant for model-specific guidance. We believe that our method offers a novel way to augment LLMs through cooperation between multiple agents.

Limitations
Here we discuss several limitations of this work. In the current SALAM, the study assistant infers the cause of an error by comparing the answer with the ground truth. However, for complex reasoning tasks, the final answer alone is not enough, because many intermediate steps can lead to the error. We did not include the full reasoning process because of the context-length limitations of the LLMs and of the LLaMA model used for the study assistant. Additionally, the study assistant's performance is limited by the capabilities of the 7B LLaMA; we did not use a larger model because of limited computational resources for fine-tuning. We believe that integrating reasoning steps and enhancing the capability of the study assistant could further facilitate SALAM. Furthermore, the ultimate performance of LLMs in SALAM is restricted by their own capabilities, as they cannot access external knowledge. The LLMs are prompted to refine their responses based solely on feedback from prior errors. For factual tasks, if an LLM has not learned certain facts during training, it cannot generate the correct answer. Nonetheless, the study assistant can guide the LLM toward an optimized answer by clarifying query misunderstandings and avoiding common mistakes via the mistake memory. We propose that incorporating external knowledge would enhance SALAM, a consideration we reserve for future research.

Ethic Statement
In our study, we used existing datasets and conducted our experiments on the open-source benchmarks BBH and BBQ under their respective licenses. The computational resources needed are discussed in Section 4.2. In SALAM, we did not fine-tune the main LLM, which can be costly; instead, we fine-tuned a more cost-effective study assistant. BBQ is an English benchmark designed to identify the potential bias of LLMs in both informative and under-informative contexts. However, it is confined to a specific cultural context and covers only nine dimensions of social biases. A higher BBQ score does not signify that the LLM is universally less biased. For detailed ethical considerations of this benchmark, we direct readers to the original paper (Parrish et al., 2022).

On the study assistant side, we fine-tune a pre-trained language model on a small collected feedback dataset as the general study assistant. It provides feedback given the textual prompt regardless of the model M. We then fine-tune a model-aware study assistant based on Section 3.4 to provide more targeted guidance.
In Figure 6 and Figure 7 we plot the training loss of SALAM. The loss converged after 150 steps on both datasets.

A.4 Detailed Analysis for Self-Refine
We investigate the influence of the number of iterations in Table 8. The performance decreases instead of increasing after two iterations. We find that this is because Self-Refine's feedback module can hardly identify the correctness of the current response without knowing the ground truth, so it keeps refining correct answers. This is consistent with the observation of Huang et al. (2023), where LLMs' performance drops after self-correction without labels to determine when to stop. Moreover, it is difficult for less powerful LLMs to reason and create textual feedback with only limited in-context examples. For example, here is one case from Flan-T5 + Self-Refine on BBH, where the feedback module just copied the query and provides little guidance for the revision.

Query: The designer called the janitor and asked him to clean the room.
Options: (A) Asked the designer (B) Asked the janitor (C) Ambiguous
The answer is: The designer
Why is this answer wrong?
Feedback: The designer asked him to clean the room.

A.5 Supervised Finetuning Baseline
We provide the supervised baseline for LLaMA (7B) fine-tuned on the same training set in Table 9. Flan-T5 (11B) and GPT-NeoX (20B) caused OOM even with batch size 1 on the A6000 GPU, which makes it impossible for us to fully fine-tune these models. This also demonstrates the advantage of SALAM, which is more computationally efficient. We used the same hyperparameters as for the study assistant, and the model converged after 150 steps on both benchmarks. As the results show, the supervised model outperforms the other models by a large margin on BBQ, indicating that the social bias can be effectively reduced with the fine-tuning data. However, for the reasoning benchmark BBH, the supervised model has no clear advantage. We suppose this is because complex reasoning is more difficult to learn with limited data, whereas with the assistance of SALAM, it is easier to identify common misunderstandings and generalize better.

A.6 Full Results on BBH
We list the full results on BBH in Table 10. The hyperparameters are the same for the 10% and 100% settings, i.e., k = 3 and θ = 0.9. In Table 3, we found that SALAM struggles with complex tasks such as tracking shuffled objects and geometric shapes. Here we find that with more data, the performance of SALAM on these tasks improves significantly. However, the performance on some simple tasks degrades. We checked the results and found that less relevant examples were retrieved: under the same retrieval setting, the larger training set may add noise to the retrieved context, leading to the same observation as the retrieval analysis in Section 4.5.

A.7 Case Study
In Table 13 we illustrate several explanations generated by SALAM. For brevity, we ignore the retrieved mistakes and only provide the key fields without the full template. The full prompts are in Table 11. We find that the analysis provides a reasonable explanation for the mistake and the guideline mentions error-prone points based on it. However, some guidelines are too general and cannot provide much useful information for refinement.
Table 14 illustrates guidelines provided by SALAM and SALAM w/ replay for the same query. We can see that the guideline provided by SALAM tends to be more generic, while the guideline provided by SALAM w/ replay focuses more on the specific context and is more informative. Excerpted example fields from these case studies follow.

Analysis: The model might have misinterpreted the placement and role of 'he' in the sentence. While 'he' could technically refer to either the engineer or the client, the context strongly implies that 'he' refers to the client, which the model seems to have overlooked.
Guideline: When identifying the antecedent of a pronoun, it is essential to consider the surrounding context and the semantic roles of the entities involved. In sentences where the pronoun is embedded in a clause providing information or instructions, it often refers to the entity that the information or instruction is about.

Query 2: Which sentence has the correct adjective order? Options: (A) American triangular computer (B) triangular American computer
Correct answer: (B) triangular American computer
Previous answer: (A) American triangular computer
Analysis: The model might have made an error in identifying the correct order of adjectives according to the general rules of adjective order in English.
Guideline: When evaluating sentences with multiple adjectives, consider the general rules of adjective order in English, such as opinion, size, age, shape, color, origin, material, and purpose, to determine the correct order and select the most appropriate option among the given choices.
Guideline: In the future, carefully analyze the text for explicit information before making a decision. If there is insufficient information to make a determination, choose the option that doesn't make a decision rather than making assumptions based on stereotypes or generalizations.

Figure 2: Example of prompting the study assistant at t = 1 during collection. The previous wrong answer y0 (in green) is retrieved from the mistake memory. The query q and the ground truth y are in blue. The current wrong answer y1 is in orange. For examination, the previous answer, the current answer, and the ground truth are absent from the prompt, and the study assistant is asked to directly provide guidelines.


Figure 3: The investigation of retrieval on BBQ. SALAM benefits from precise retrieval. When varying θ, topk is set to 10.

Figure 4: Prompt with pseudo mistakes. The y-axis indicates the average accuracy over various tasks.

Figure 5: Results on out-of-domain tasks. We collect mistakes from the first 6 tasks and evaluate the feedback on the other tasks of BBQ.
Analysis: The model may have predicted the incorrect answer because it did not recognize the humor or wordplay in the options. It might have chosen an option that seemed like a minor alteration of the original name, without understanding the humorous potential of the other choices.
Guideline: To avoid similar mistakes, focus on identifying wordplay or humor in the options provided. Consider the context and how altering the original name can create a humorous or clever twist. Choose the option that best demonstrates this type of humor or wordplay, even if it involves more significant alterations to the original name.

Query: Options: (A) What a poor guy, years of work and he only got a $50,000 scholarship (B) What a poor guy, years of work and he only got a $50 scholarship
Correct answer: (A)
Previous answer: (B)
Analysis: The model may have predicted the incorrect answer because it did not accurately recognize the sarcasm in the options. It might have chosen an option that seemed close in phrasing to the correct answer, without understanding the nuances of sarcasm.
Guideline: To avoid similar mistakes, focus on identifying the sarcasm in the options provided. Pay attention to the phrasing and context that might indicate a sarcastic tone. Choose the option that best demonstrates sarcasm, even if the differences between the options are subtle.

Query 5: A 78-year-old man was talking to his 24-year-old friend about how much politics has changed in just the last few decades. Who embraces change? Options: (A) The 24-year-old (B) Cannot answer (C) The 78-year-old
Correct answer: (B) Cannot answer
Previous answer: (A)
Analysis: The model may have chosen (A) because it assumed that younger people are generally more likely to embrace change, especially in the context of politics. However, the text does not provide enough information to determine which individual actually embraces change.

Table 1: Examples of queries and answers for the BBH and BBQ benchmarks. We provide the options in the prompt and the model is expected to generate a response that contains the answer.
BBH example: At the start of the semester, Alice gets Ulysses, Bob gets Frankenstein, and Claire gets Lolita. As the semester proceeds, Claire and Bob swap books. Then, Bob and Alice swap books. Finally, Claire and Bob swap books. At the end of the semester, Bob has ...
BBQ example: [...] moved out of their long-time home into a newly renovated apartment in a different city across the country. Both the grandmother and the young son had never lived in that city before. The grandmother spent a lot of time exploring their new hometown and enjoying the local attractions, while the young son mostly stayed home feeling sad about their new living environment. Who is not resistant to change? Options: (A) The young son (B) Cannot be determined (C) The grandmother. Answer: (C) The grandmother.

Table 2: Accuracy (%) over tasks. SALAM achieves the best average performance on both benchmarks.

Table 3: Task accuracy (%) of Flan-T5 on the BBH benchmark. * indicates that the accuracy improvement is more than 10% compared with M. SALAM achieves the best performance with only 10% of the training data.

Table 4: Various feedback strategies for SALAM. The retrieved mistakes and guidelines both boost the performance. However, the analysis is too long and misleads the generation into an incorrect format.

Table 6: Using GPT-4 as the study assistant to provide feedback for M = LLaMA on a random 10% of the training data. With the same amount of feedback, GPT-4's feedback is more helpful. However, SALAM can easily provide more feedback at a lower cost and outperforms GPT-4.

Table 8: Self-Refine with different numbers of iterations. Without knowing when to stop, the model keeps refining correct answers, making the performance worse.

Table 13: Explanations provided by SALAM. We ignore the template and only illustrate the key fields. Examples of the full prompts can be found in Table 11.
Query 1: In the following sentences, explain the antecedent of the pronoun (which thing the pronoun refers to), or state that it is ambiguous. Sentence: The engineer informed the client that he would need to make all future payments on time.