ExpNote: Black-box Large Language Models are Better Task Solvers with Experience Notebook

Black-box Large Language Models (LLMs) have shown great power in solving various tasks and are considered general problem solvers. However, LLMs still fail in many specific tasks even though they understand the task instructions. In this paper, we focus on boosting the ability of black-box LLMs to solve downstream tasks. We propose ExpNote, an automated framework that helps LLMs better adapt to unfamiliar tasks by reflecting on and noting experiences from training data and retrieving them from external memory during testing. We evaluate ExpNote on multiple tasks, and the experimental results demonstrate that the proposed method significantly improves the performance of black-box LLMs. The data and code are available at https://github.com/forangel2014/ExpNote.


Introduction
Large Language Models (LLMs) have demonstrated astonishing capabilities in natural language understanding and generation (Wei et al., 2022; Huang et al., 2022; Sun et al., 2022; Bang et al., 2023). However, due to their limited parameters and context length, LLMs are not able to master all task-specific knowledge in real-world applications. As a result, LLMs may perform poorly on some specific tasks, such as inductive reasoning (Bang et al., 2023) and entity recognition (Chen et al., 2023).
Therefore, how to make LLMs adapt to downstream tasks has attracted more and more attention. Recent techniques such as prefix-tuning (Li and Liang, 2021), P-tuning (Liu et al., 2021), and LoRA (Hu et al., 2021) provide low-cost solutions for fine-tuning LLMs. However, these methods are not applicable to powerful black-box LLMs, such as ChatGPT and GPT-4 (OpenAI, 2023).
[Figure 1 example. Q: [Chris] and his son [Jack] are at a bar waiting for their drinks. [Maria] showed up and sat with her husband [Chris]. Who is Jack to Maria? Answering directly (CoT): Jack is Maria's son. Answering with experience (ExpNote): Jack is Chris's son, and Chris is Maria's husband, so Jack is Maria's stepson.]

To empower black-box LLMs on specific tasks, several works (Madaan et al., 2022; Dalvi et al., 2022) have focused on equipping LLMs with an external dynamic memory to store useful task-specific knowledge and facts. However, the knowledge and facts in such memories usually come from expert annotation or human feedback, which is very costly to obtain. On the other hand, several researchers (Shinn et al., 2023; Madaan et al., 2023; Akyürek et al., 2023) exploit the reflection ability of LLMs to generate such knowledge automatically for specific tasks. However, most reflection-based methods only empower the LLM on the same case, without the ability to generalize to other instances.

This paper therefore proposes ExpNote (Experience Notebook), a framework that empowers black-box LLMs on downstream tasks by learning and using task-specific experience automatically. We equip the LLM with a dynamic memory and design several commands to help the LLM interact with it. Specifically, in the training stage, ExpNote guides the LLM to generate task-specific experiences and store them in an external memory. In the testing stage, ExpNote uses a retriever to fetch relevant experiences from the memory; the learned experiences help the LLM solve cases that it failed to answer directly (Figure 1).
We evaluate ExpNote on multiple tasks. The results show that ExpNote empowers LLMs effectively and significantly outperforms other prompting methods such as ordinary in-context learning (CoT, Wei et al. 2022), a memory-based method (TeachMe, Dalvi et al. 2022), and a case-by-case reflection-based method (Reflexion, Shinn et al. 2023).
Moreover, we empirically compare different types of experiences to examine their effectiveness in helping LLMs adapt to unfamiliar tasks. Specifically, we compare learned task-specific experiences with the original cases, and experiences learned from positive cases (succeeded) with those from negative cases (failed). We find that prompting with experiences helps LLMs generalize to new cases better than prompting with the original cases, and that experiences from both positive and negative cases are beneficial.
The major contributions of this paper are twofold:
• We propose ExpNote, a framework that empowers LLMs on various tasks through interaction with a dynamic memory. ExpNote performs fully automated reflection, noting, and retrieval, without the need for any annotated knowledge, facts, or human feedback.
• We investigate different types of experiences and show that learned task-specific experiences help LLMs generalize better than the original cases of the task, and that experiences from both positive and negative cases are beneficial.
Related Work

Language Model Taking Action
To address the lack of knowledge, reasoning, and specific abilities in large models, many efforts have focused on utilizing external knowledge sources and tools to assist large models in completing tasks. Toolformer (Schick et al., 2023) fine-tunes on API-augmented datasets to teach LLMs to use external tools via API calls, thereby improving performance on a series of tasks. ReAct (Yao et al., 2022) shows that by using a Wikipedia search API and generating trajectories similar to human thinking, LLMs can utilize external knowledge during reasoning and provide interpretable reasoning paths. HuggingGPT (Shen et al., 2023) proposes to solve arbitrary AI tasks by using the models on Hugging Face as its toolkit.

Language Model with Dynamic Memory
Some existing works have noticed the need to equip LLMs with dynamic memory. MemPrompt (Madaan et al., 2022) retrieves stored user feedback on the intention of similar questions to enhance the current prompt for the LLM. TeachMe (Dalvi et al., 2022) allows LLMs to store missing and wrong facts during QA, corrected by user feedback. These methods created a new paradigm for boosting the ability of LLMs in a general way; however, they rely heavily on human feedback or annotated facts. REMEMBERER (Zhang et al., 2023) treats the LLM as a semi-parametric RL agent, training it to take the next action based on retrieved (observation, action, Q-value) tuples.

Language Model Reflection
Recently, some works correct the mistakes of LLMs on specific tasks by using their capacity for self-reflection. Reflexion (Shinn et al., 2023) focuses on sequential decision-making tasks: a heuristic function judges whether a trial is successful, the LLM reflects on trials judged to have failed, and the reflective information supports the LLM in improving its decision-making in the next trial. Self-Refine (Madaan et al., 2023) iteratively improves the output of a large model through its own feedback, achieving improvements on multiple generation tasks. However, these reflection methods are limited to the case at hand; the reflections are not abstract and cannot generalize to other data points.

Language Model Thought Chain
Furthermore, there have been efforts to improve the reasoning performance of LLMs by enhancing their thought chains on specific tasks. For example, DIVERSE (Li et al., 2023) generates multiple different reasoning paths and uses a validator for weighted voting to filter out incorrect answers. However, this method demands manual construction of reasoning paths for each training question-answer pair and extensive human evaluation, restricting its use on large datasets.
Drozdov et al. (2022) introduced a technique that decomposes complex questions into sub-questions and provides answers to these sub-questions, serving as hints to assist the model in reaching the final answer. Faithful CoT (Lyu et al., 2023), on the other hand, prompts a language model to translate complex queries into a reasoning chain that includes question decomposition and corresponding symbolic-language solving, thus enhancing interpretability.
These approaches offer intriguing ideas for improving the reasoning performance of LLMs but still face challenges related to the need for substantial high-quality annotations, difficulties in reusing experiences, and sample generalization.

The Framework
As shown in Figure 2, every task is formalized as a tuple (x, y), where x is the input question and y is the desired answer. For each task, we write prompts P_train and P_test that encourage the LLM to use ExpNote following the illustrations. In the training stage, the LLM is first instructed to infer the answer with ordinary CoT reasoning.
After the answer ŷ is obtained, ExpNote produces feedback F(ŷ, y) for the LLM depending on whether ŷ = y. Note that this feedback only includes a simple prompt containing the ground truth of the current question, without any additional knowledge as in TeachMe (Dalvi et al., 2022). The LLM is then supposed to reflect and store the learned experience e in the memory.
Here e is a key-value pair of learned task-specific experience, e.g., key = "daughter, married" and value = "if A has a daughter B, and A is married to C, then B is also the daughter of C" (Figure 2). This process is achieved by taking n extra actions to interact with the memory, which we describe in Sec 3.2.
In the testing stage, ExpNote uses the testing instance as the search query to retrieve k experiences from the dynamic memory. The retrieved experiences are added to the prompt for the LLM. The LLM then decides whether to refer to these experiences and finally outputs the answer.
The full examples of ExpNote on different tasks are shown in Appendix D.

Interaction Commands
ExpNote provides several commands for LLMs to interact with the memory: summarizing and applying experiences (THINK), storing experiences in the memory (NOTE), and retrieving relevant experiences for testing instances (RECALL). They are described in detail as follows:
• THINK[arg]. Inspired by ReAct (Yao et al., 2022), in both the training and testing stages, the LLM can use the THINK command to organize its current thoughts and make the next decision.
• NOTE[arg1]: arg2. In the training stage, we prompt the LLM to use the NOTE command after answering each question. NOTE stores the experience arg2 as a value in the memory under the key arg1.
• RECALL[arg]. In the testing stage, this command is automatically executed by ExpNote at the beginning of each question to recall relevant experiences. ExpNote uses a retriever to fetch up to k relevant experiences with the question arg as the search query; the content of these experiences is then added to the prompt.
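A minimal sketch of how these three commands might be dispatched. The regex grammar and the pluggable `retrieve` callback are our own assumptions, not the paper's implementation:

```python
import re
from typing import Callable, Dict, List

memory: Dict[str, str] = {}

def execute(command: str,
            retrieve: Callable[[str, Dict[str, str], int], List[str]]) -> str:
    """Dispatch one ExpNote command string emitted by the LLM."""
    if m := re.fullmatch(r"THINK\[(.*)\]", command):
        return m.group(1)                       # free-form reasoning step
    if m := re.fullmatch(r"NOTE\[(.*)\]:\s*(.*)", command):
        memory[m.group(1)] = m.group(2)         # store experience under its key
        return "noted"
    if m := re.fullmatch(r"RECALL\[(.*)\]", command):
        hits = retrieve(m.group(1), memory, 3)  # up to k = 3 experiences
        return "; ".join(hits) if hits else "No relevant experience"
    return "unknown command"
```

For example, after `NOTE[daughter, married]: if A has a daughter B and A is married to C, then B is also the daughter of C`, a later `RECALL` on a kinship question can surface that rule in the prompt.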
Baselines

We compare ExpNote with the following baselines:
• CoT (Wei et al., 2022): several cases of solving the task with Chain-of-Thought reasoning are shown to the LLM.
• TeachMe (Dalvi et al., 2022): as the core facts or human feedback for these specific tasks are hard to obtain, we adopt a commonsense knowledge base, ConceptNet (Speer et al., 2017), to serve as the memory for TeachMe.
• Reflexion (Shinn et al., 2023): the heuristic function of repetitive-action detection described in Reflexion does not work for these tasks. We therefore allow Reflexion to cheat a little: it may try again after every failed trial without being informed of the ground truth. This setting is equivalent to having a golden function that determines the success or failure of each trial with 100% accuracy.
To answer RQ2, we also implemented several variants of ExpNote:
• disabled: this variant adopts the reasoning form of ExpNote while disabling its retrieval function.

Results
As shown in Table 1, the full ExpNote method achieves the best performance on all datasets, 20.5% higher on average than the CoT method. TeachMe fails to outperform few-shot prompting, as task-specific knowledge is hard to obtain without human feedback. Compared with Reflexion, note that even though we let Reflexion cheat to identify failed trials with a 100% success rate, it still falls behind ExpNote.
Compared with the other variants of ExpNote, disabled retrieves no experience in the testing stage and thus degrades to the performance of CoT (or even worse), as expected. We also find that case performs worse than full ExpNote, although it retrieves exactly the same cases on all 4 tasks. We can therefore conclude that abstract knowledge or rules are more capable of helping LLMs generalize to testing cases. Moreover, positive and negative both fall behind the full ExpNote while still outperforming the baselines. An efficiency analysis in Appendix C shows that, on two datasets respectively, experiences from positive and from negative cases are each more efficient than the other. These results indicate that experiences learned from both positive and negative cases help the LLM generalize to test sets.
We also observe how the model's performance changes with the number of training samples, as shown in Figure 3.

Improvement Analysis
We also analyze how many cases are corrected by introducing experiences in each dataset. As shown in Figure 4, we plot the distribution of cases over 4 conditions:
• F => F: the case is answered incorrectly by disabled and also incorrectly with ExpNote.
• F => T: the case is answered incorrectly by disabled but correctly with ExpNote.
• T => T: the case is answered correctly by disabled and also correctly with ExpNote.
• T => F: the case is answered correctly by disabled but incorrectly with ExpNote.
Figure 4 shows that ExpNote helps LLMs correct a certain number of errors (the green part) at the cost of producing a few new errors (the red part) on all 4 datasets. We observe that around 50% of the answers that are incorrect with disabled (gray + green) are corrected (green) with ExpNote.
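The four conditions can be tallied with a simple helper. This is a sketch of ours; the per-case correctness vectors are hypothetical inputs, not data from the paper:

```python
from collections import Counter
from typing import List

def transition_distribution(disabled_correct: List[bool],
                            expnote_correct: List[bool]) -> Counter:
    """Count how each test case moves between disabled and full ExpNote."""
    label = {(False, False): "F=>F", (False, True): "F=>T",
             (True, True): "T=>T", (True, False): "T=>F"}
    return Counter(label[(d, e)]
                   for d, e in zip(disabled_correct, expnote_correct))
```

The green part of Figure 4 corresponds to the "F=>T" count and the red part to "T=>F".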

Conclusion
In this paper, we propose ExpNote, an automated framework that helps black-box LLMs adapt to specific downstream tasks by interacting with a dynamic memory. We carried out experiments on multiple datasets from different tasks and showed that ExpNote improves the performance of LLMs more effectively than other prompting methods. We also found that learned task-specific experiences help LLMs generalize better than the original cases of the task, and that experiences learned from both positive and negative cases are valuable.

Limitations
Although ExpNote is able to empower LLMs on various tasks, it may be less effective on case-by-case tasks such as summarization or creative writing. In these tasks, the cases share little common knowledge and few common rules, which makes it hard for ExpNote to help LLMs generalize.

Ethics Statement
This paper proposes a method for augmenting black-box LLMs. All experiments are conducted on publicly available datasets, so there is no data privacy concern. Meanwhile, this paper does not involve human annotation, so there are no related ethical concerns.

A LETS Dataset
Existing symbolic reasoning datasets, such as word sorting in BIG-bench (Srivastava et al., 2022), are designed to test the zero-shot reasoning ability of LLMs and usually lack a training set. We therefore propose LETS, a similar symbolic reasoning dataset that enables LLMs to learn and generalize.
LETS requires the language model to splice together the letters at given indexes of several words. For example, given the query Splice the 5th letter of "sleep", the 2nd letter of "official", and the 5th letter of "neglect" together, the model is supposed to output pfe as the answer.
We randomly select 100 words of length 4-10 as the vocabulary. To generate the training and testing sets, for each instance we randomly pick 3 different words from the vocabulary and randomly select an index within each word.
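The construction above can be sketched as follows. The exact query template and the naive "th" ordinal formatting are simplifications of ours, not the dataset's exact wording:

```python
import random
from typing import List, Tuple

def splice(words: List[str], indexes: List[int]) -> str:
    """LETS target: concatenate the (1-based) i-th letter of each word."""
    return "".join(word[i - 1] for word, i in zip(words, indexes))

def make_instance(vocab: List[str], rng: random.Random) -> Tuple[str, str]:
    """Generate one LETS instance from 3 distinct words with valid indexes."""
    words = rng.sample(vocab, 3)
    indexes = [rng.randint(1, len(w)) for w in words]
    parts = ", ".join(f'the {i}th letter of "{w}"'
                      for w, i in zip(words, indexes))
    return f"Splice {parts} together", splice(words, indexes)
```

For instance, `splice(["sleep", "official", "neglect"], [5, 2, 5])` returns `"pfe"`, matching the example above.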
For each task, due to the size limitations of the datasets themselves, we test all methods on 100 testing cases. In fact, much related work is also tested on samples of similar magnitude, such as TeachMe (OBQA, 500; Dalvi et al. 2022), ReAct (ALFWorld, 134; WebShop, 500; Yao et al. 2022), and Reflexion (consistent with ReAct; Shinn et al. 2023). Considering that ExpNote interacts with the environment over multiple turns for a single case, the actual number of LLM generations can be 4 to 5 times higher. We adopt a minimal training set whose size is 2:1 relative to the testing set (1:1 for the EMOJI and LETS datasets).
For all ExpNote variants, we write 2-3 ExpNote usage cases for the LLM as few-shot prompts. We choose n = 4 for training (the LLM can take 4 extra actions to THINK and NOTE after obtaining the ground truth of each case) and n = 0 for testing (the LLM cannot access the ground truth).
For the retriever, we implement a word-based retriever that matches words in the query against the key of each experience; it retrieves up to k = 3 experiences for each case in the testing stage. When ExpNote fails to retrieve any relevant experience, a failure prompt "No relevant experience" is returned to the LLM.
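A minimal sketch of such a word-based retriever; the exact tokenization and scoring are our assumptions rather than the paper's implementation:

```python
import re
from typing import Dict, List

def retrieve(query: str, memory: Dict[str, str], k: int = 3) -> List[str]:
    """Rank experiences by word overlap between the query and each key,
    returning the top-k values, or the failure prompt if nothing matches."""
    q_words = set(re.findall(r"[a-z]+", query.lower()))
    scored = []
    for key, value in memory.items():
        overlap = len(q_words & set(re.findall(r"[a-z]+", key.lower())))
        if overlap > 0:
            scored.append((overlap, value))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [v for _, v in scored[:k]] or ["No relevant experience"]
```

A dense retriever (e.g. embedding similarity) could be swapped in behind the same interface; word matching keeps the testing stage cheap and transparent.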

C Efficiency Analysis
We define the efficiency of each type of experience as

eff(type) = (Perf(type) − Perf(disabled)) / Cnt(type)    (4)

where type refers to positive or negative, Perf(·) is the corresponding variant's performance in Table 2, and Cnt(type) is the number of experiences of that type. We can then calculate the efficiency of positive and negative experiences from the data in Table 2.
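Equation (4) amounts to the accuracy gain per stored experience. As a sketch, with hypothetical numbers (not taken from Table 2):

```python
def efficiency(perf_type: float, perf_disabled: float, cnt_type: int) -> float:
    """eff(type) = (Perf(type) - Perf(disabled)) / Cnt(type), i.e. Eq. (4)."""
    return (perf_type - perf_disabled) / cnt_type

# Hypothetical illustration: a variant scoring 0.70 accuracy vs. 0.55 for
# disabled, using 30 stored experiences, gains 0.005 accuracy per experience.
```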
As shown in Table 3, on two datasets respectively, experiences from positive and from negative cases are each more efficient than the other. These results indicate that experiences learned from both positive and negative cases are useful for the LLM to generalize to test sets.

Figure 1: An illustration of how ExpNote assists the LLM in enhancing the effectiveness of task-solving. ExpNote can automatically generalize relevant experiences from other samples and apply them to specific tasks.

Figure 2: The framework of ExpNote, showing how LLMs use ExpNote to solve specific tasks, including the training (left) and testing (right) stages.
In CLUTRR, ExpNote starts training with 0 samples (equivalent to disabled) and ends with 200 samples. The performance of ExpNote on the testing set grows continually with the number of training samples, showing that ExpNote keeps learning new knowledge during the training stage.

Table 3: Efficiency of positive and negative experiences on 4 datasets.