RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs

Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics, assuming one can train downstream models to utilize the generated feedback. However, this approach does not apply to black-box or limited-access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework where the critique generator is trained to maximize the end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization and alphabetization and show relative improvements of up to 10% in multiple text-similarity metrics over other learned, retrieval-augmented or prompting-based critique generators.


Introduction
Correcting model outputs is a pressing challenge in natural language generation (Ribeiro et al., 2018; Reid and Neubig, 2022), emerging across many use-cases such as style transfer (Mallinson et al., 2020; Malmi et al., 2022), grammatical error correction (Lichtarge et al., 2019), factual error correction (Mitchell et al., 2022b), and debiasing and detoxification (Schick et al., 2021).¹ Unlike humans, who can understand natural language feedback and improve using that information, most previous work relied on sequence tagging (Reid and Neubig, 2022), retraining from scratch (Sun et al., 2019) or parameter editing (Mitchell et al., 2022a) to repair model predictions. Yet models can correct their answers given more sophisticated feedback formulated in natural language (Schick et al., 2022; Saunders et al., 2022). For example, in Fig. 1, we present sample feedback for two tasks. Both examples exemplify the case where the initial model outputs ŷ have flaws. In topic-based summarization, an automatically generated summary of a story contains factually incorrect statements such as "... he was betrayed by his father Bill ...", for which an appropriate critique is "Bill is not Michael's father". In action planning, given a goal x, the objective is to generate a set of steps y to achieve the goal. The initial sequence of actions in Fig. 1, denoted by ŷ, is missing a step. The human-written natural language critiques c describe the ways in which the ŷ's are incorrect, and y_new denotes the corrected prediction conditioned on the critique. Note that in many situations helpful critiques do not necessarily reproduce an entire answer; they may simply point out one way in which the answer could be improved.

¹ A significant portion of this work was done while Feyza was an intern at Allen AI. Correspondence to Afra Feyza Akyürek (akyurek@bu.edu).

Figure 1: Action planning (Tandon et al., 2021) and topic-based summarization (Saunders et al., 2022) tasks showcase a scenario where initial predictions by a learned model (ŷ) are incorrect. Human-written critiques (c) indicate errors in model outputs. While humans can reliably critique each other, machines lack this ability. This paper studies a multi-agent collaborative framework where one language model generates critiques to improve its peer's performance.
Researchers use crowd-sourcing to collect critiques for model outputs (Saunders et al., 2022). However, collecting feedback from humans is infeasible in an online setting where a model is required to produce a rapid stream of outputs. The goal of this paper is to shed light on whether the task of critiquing language model predictions can be effectively passed on to an external agent while keeping the language model itself intact.
Our multi-agent collaborative framework involves two language models, where one model's job is to criticize the other as the latter performs a task of interest, such as summarization. This setting comprises a task model, denoted LM task, which learns the mapping from an input x (e.g. a passage) to a ground-truth output y (e.g. a summary), and a critique model LM critique which provides natural language critiques for LM task's outputs ŷ ∼ LM task(x). The framework can additionally involve a separate model (say, LM refine) for repairing model outputs conditioned on critiques. We follow past work (Schick et al., 2022) and merge LM task and LM refine into a single model. Hence, in addition to predicting y given x, LM task is also tasked with improving its initial output conditioned on a critique ĉ sampled from LM critique(x, ŷ).
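The interaction between LM task and LM critique described above can be sketched as follows. The toy callables are invented stand-ins for the actual language models, used only to make the data flow concrete; as in the paper, a single frozen task model handles both prediction and refinement:

```python
# Minimal sketch of the two-model cascade. The frozen task model handles both
# PREDICT and REFINE (LM_task and LM_refine are merged); only LM_critique
# would be trained. The toy callables below are illustrative stand-ins.
def run_cascade(x, lm_task, lm_critique):
    y_hat = lm_task(x)                  # PREDICT: y_hat ~ LM_task(x)
    c_hat = lm_critique(x, y_hat)       # CRITIQUE: c_hat ~ LM_critique(x, y_hat)
    y_new = lm_task(x, y_hat, c_hat)    # REFINE: same frozen model, critique-conditioned
    return y_new

def toy_task(x, y_hat=None, c_hat=None):
    if c_hat is None:
        return x.upper()                # flawed initial prediction
    return y_hat.lower() if "lowercase" in c_hat else y_hat

toy_critic = lambda x, y: "lowercase the text"

print(run_cascade("hello", toy_task, toy_critic))  # hello
```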
We introduce RL4F (Reinforcement Learning for Feedback Generation), a cascade (Dohan et al., 2022) of two language models for automatic critique generation and refinement. RL4F is trained to maximize the target-task performance of LM task by learning to provide critiques for its outputs via LM critique. RL4F advances retrieval-based methods with learned critique generation. Unlike previous work which teaches LM task to read a crowd-sourced set of critiques (Schick et al., 2022; Saunders et al., 2022), RL4F learns the particular set of critiques that will steer LM task into improving its predictions without requiring any updates to LM task's parameters. Treating LM task as fixed is especially important in the era of limited-access large language models, which are costly, if not impossible, to fine-tune.
RL4F is illustrated in Fig. 2(a,c). Previous work demonstrates that language models smaller than roughly 50 billion parameters lack the ability to understand and act upon a natural language critique (Saunders et al., 2022; Bai et al., 2022). Therefore, we chose GPT-3 as the LM task model: a clear example of an inaccessible LM that shows this ability. While RL4F is general enough to accommodate an ensemble of feedback generators, in this work we focus on a single model as LM critique for simplicity.
In summary, this work²:
• Presents a reinforced critique generator which advances simple supervision in improving end-task performance without retraining the downstream model.
• Demonstrates the effectiveness of RL4F on three tasks: topic-based summarization, action planning and alphabetization (i.e. sorting a list of words alphabetically), with relative improvements of up to 10%.
• Showcases that RL4F exhibits promising scaling properties and remains useful when applied iteratively.

Related Works
Past work differs to a large extent with respect to what counts as human feedback and how it is used. In this section, after elucidating the use of human feedback in previous work, we briefly describe connections of RL4F to the parameter-efficient fine-tuning and discrete prompt learning literature.

What kind of feedback is used and where does it originate?
Human feedback on model predictions comes in different flavors. The most notable ones include (1) binary feedback, e.g. thumbs up/down and pairwise comparisons (Bai et al., 2022; Gao et al., 2022), (2) natural language critiques (Schick et al., 2022; Saunders et al., 2022; Murty et al., 2022; Chen et al., 2023) and (3) direct textual refinements to outcomes (Scheurer et al., 2022; Shi et al., 2022). Bai et al. (2022) introduce what they call Reinforcement Learning from AI Feedback (RLAIF), in which they replace human preference labels with the model's own; the model is prompted to evaluate its own predictions in light of human values and preferences. In a similar vein, Gao et al. (2022) use accuracy for extractive question answering as a reward signal when fine-tuning their policy model.
In another thread, Schick et al. (2022) use comments from forums and Wikipedia edit histories as natural language feedback. Scheurer et al. (2022) and Shi et al. (2022) collect human natural language critiques and associated refinements, and then fine-tune the task model on the refinements. Our work is similar to these in that we also use human-generated critiques in the first stage of our algorithm. Where human-written critiques are unavailable, we additionally use synthetically generated critiques.

How is feedback used?
An overwhelming majority of past work simply fine-tunes the task model using human feedback, whether it is a general-purpose language model (Bai et al., 2022) or a task-specific model (Shi et al., 2022; Gao et al., 2022; Saunders et al., 2022; Scheurer et al., 2022; Schick et al., 2022). Another line of work instead fine-tunes a separate corrector model which takes in a retrieved critique utterance to correct initial outputs. Similarly, MemPrompt retrieves from a memory of previous critiques to improve GPT-3 predictions via few-shot prompting.
Our work departs from existing work by focusing on critique generation and harnessing critiques that yield better final outcomes from LM task. Similar to Schick et al. (2022), we effectively propose a multi-agent setup by disentangling critique generation and conditional refinement. Differently, we keep the latter model intact and only train the critique generator LM critique via reinforcement learning. Moreover, we take a step forward by leveraging end-task data for the first time and directly optimizing the critique generation process to improve final task performance. In contrast to RLHF, whose policy network (LM task) is trained to maximize human alignment (Wiener, 1960), our policy network (LM critique) is trained to bootstrap the end-task success of LM task. Our proposal RL4F is orthogonal to RLHF; in fact, we use an RLHF fine-tuned checkpoint in our experiments. For further discussion, please refer to Fernandes et al. (2023), who catalogue different approaches for integrating natural language feedback into textual generations.

Adapters & Discrete Prompt Learning
A large body of existing work finds that parameter-efficient fine-tuning, often referred to as adapters (Pfeiffer et al., 2020), is as effective as full fine-tuning while being computationally cheap. RL4F can also be interpreted as an alternative "adapter" under the strict setting where only textual access to the task model is available. Furthermore, our work can be viewed from the perspective of learning discrete prompts for language models. Past work proposes to generate knowledge pieces (Liu et al., 2022) or arbitrary textual snippets (Deng et al., 2022), which are appended to the input, via reinforcement learning. These works differ from ours in that their policy is conditioned solely on the input x, whereas in our case we sample critiques of machine-generated predictions based on both x and ŷ.

Background
The problem of learning from critiques entails two major challenges: (1) generating critiques and (2) refining initial answers based on a critique. In our experiments (Section 6), GPT-3 responds very well to ground-truth critiques. This observation suggests that, given a quality critique, GPT-3 is indeed able to change a potentially erroneous prediction for the better. Hence, in this study we focus our efforts on (1). Our ultimate goal is to reach, and eventually exceed, human-level critique performance using machines.
Following Saunders et al. (2022), we identify four primary functions towards studying the problem of learning with natural language critiques.

Figure 2: a) A downstream task model takes in an input (e.g. a passage and a question) and predicts the output (e.g. a summary). b) Past work proposed using a supervised learning scheme (Saunders et al., 2022; Schick et al., 2022) or retrieval for critique generation (CRITIQUE) and refinement (REFINE). In our setting, we only train LMcritique; the parameters of the task model are left unchanged. c) RL4F uses the LMcritique produced by the training in part b. Using task data pairs (e.g. passages and summaries), we continue fine-tuning LMcritique with policy gradient such that critiques steer LMtask to produce better outputs.
First is PREDICT: the base task of modeling x → y without using critiques. As an example, if x is a passage, y is the summary (see Fig. 1). Moreover, we refer to the task of learning to generate critiques x, ŷ → c, where ŷ ∼ LM task(x), as CRITIQUE. Lastly, we call the conditional refinement objective x, ŷ, c → y REFINE, and repairing an answer without a critique x, ŷ → y DIRECTREFINE³. We use the ŷ and ĉ notation to indicate estimates of the ground truth y and c, respectively, from the corresponding model.

SUPERVISED: Supervised Learning for Critique Generation
We initialize LM critique to be a pretrained encoder-decoder model and fine-tune it to generate critiques satisfying the CRITIQUE objective x, ŷ → c using natural language critiques. Namely, if LM critique is parameterized by θ, we maximize E[log p_θ(c | x, ŷ)]. We delegate the PREDICT and REFINE tasks to GPT-3 via in-context learning. The procedure is depicted in Fig. 2a. The main difference between our implementation of SUPERVISED and that of Saunders et al. (2022) is that we rely on separate models for CRITIQUE and the rest of the tasks, while they train a single GPT-3-style model to collectively achieve PREDICT, CRITIQUE, REFINE and DIRECTREFINE, effectively merging LM critique and LM task into a single model. While this may seem parameter-efficient, our version has a few key advantages. First, leaving any LM task model intact (parameters frozen) enables us to work with models that are already deployed as LM task and those with expensive training and inference processes. Moreover, our approach refrains from disturbing the overall integrity of a general-purpose language model by conditioning it to a specific task. Lastly, training LM critique, which is multiple orders of magnitude smaller than GPT-3, is much more computationally efficient and therefore accessible to a broader range of users.
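Concretely, the supervised stage amounts to standard seq2seq fine-tuning on serialized (x, ŷ) → c pairs. The serialization template below is an assumption for illustration (the actual input formats appear in Appendix B.1):

```python
# Hypothetical serialization of a CRITIQUE training pair for seq2seq fine-tuning:
# the source string concatenates x and y_hat; the target is the critique c, so
# training maximizes log p(c | x, y_hat).
def make_critique_pair(x: str, y_hat: str, c: str) -> dict:
    source = f"input: {x} output: {y_hat}"   # template is an assumption
    return {"source": source, "target": c}

pair = make_critique_pair(
    x="put soap in dishwasher",
    y_hat="1. open dishwasher 2. close dishwasher",
    c="You forgot to put the soap in.",
)
print(pair["source"])
```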

Direct-Refinement
Madaan et al. (2023) and Chen et al. (2023) propose that using critiques from the model itself via few-shot prompting results in improved performance. On the contrary, Saunders et al. (2022) and Bai et al. (2022) argue that direct refinement (denoted DIRECTREFINE in this work), i.e. prompting a language model to directly repair its own answers without self-generated critiques, proves a stronger baseline, especially when the model is large (>50B parameters). They hypothesize that this is primarily because the model's initial answers become increasingly difficult to self-critique as model size grows. In fact, both Saunders et al. (2022) and Bai et al. (2022) showed that their largest model achieves superior end-task performance when performing DIRECTREFINE than when refining using self-generated critiques. Hence, we use Direct-Refinement as a baseline and describe how we implement it via in-context learning in Section 6, with further discussion in Appendix B.5.

RL4F: Reinforcement Learning for Feedback Generation

SUPERVISED is straightforward to implement, but it does not make use of any final-task data (x → y), which is usually more abundant than natural language critiques. Moreover, it fails to provide ground for adaptation when the critiques in the train set are generally applicable but not entirely tailored to improving a target model. We describe RL4F, where we follow supervised training with policy-gradient learning using end-task data in order to generate critiques. We assume that the task model LM task is already deployed and treat it as a fixed module. In all of our implementations, we train the natural language critique generator LM critique alone. In both SUPERVISED and RL4F, LM critique takes in the input x and an initial prediction ŷ and produces a (natural language) critique ĉ:

ĉ ∼ LM critique(x, ŷ)    (1)

Fig. 2c provides an illustration of RL4F. We implement LM task as GPT-3 given its adaptability to new tasks using few-shot prompting. Our implementation, which is primarily based on the RL4LMs library⁴ (Ramamurthy et al., 2022), will be publicly available.
Learning via Policy Gradient. We warm-start RL4F by first fine-tuning LM critique for CRITIQUE, which we defined as the supervised objective of learning to generate natural language critiques c conditioned on x, ŷ. We then continue fine-tuning the policy network (LM critique) to maximize the reward using Proximal Policy Optimization (Schulman et al., 2017). We utilize the implementation of PPO provided by Ramamurthy et al. (2022) and refer readers to that work for details of the KL-regularized PPO objective. While any policy-gradient approach could be used, e.g. REINFORCE (Williams, 1992), our initial experiments showed that PPO works best in this setting.
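To make the reward flow concrete, the toy sketch below runs a bare REINFORCE update over a tiny discrete set of candidate critiques. This is a stand-in for the KL-regularized PPO training of the full sequence policy; the candidate critiques and the reward function are invented for illustration:

```python
import math
import random

# Toy policy over two fixed candidate critiques (the real policy generates
# free-form text token by token and is trained with PPO, not plain REINFORCE).
critiques = ["The list is correctly sorted.", "The word mug is missing."]
logits = [0.0, 0.0]  # trainable parameters, one per candidate

def probs(ls):
    z = [math.exp(v) for v in ls]
    s = sum(z)
    return [v / s for v in z]

def reinforce_step(reward_fn, lr=0.5):
    p = probs(logits)
    i = random.choices(range(len(critiques)), weights=p)[0]  # sample c_hat
    r = reward_fn(critiques[i])        # reward from the refined output (toy)
    for j in range(len(logits)):       # gradient of r * log pi(c_hat)
        grad = (1.0 if j == i else 0.0) - p[j]
        logits[j] += lr * r * grad

# Reward 1 when the sampled critique is the one that fixes this example.
random.seed(0)
for _ in range(200):
    reinforce_step(lambda c: 1.0 if c == "The word mug is missing." else 0.0)

print(round(probs(logits)[1], 2))  # probability mass shifts to the useful critique
```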
Pseudocode for RL4F is provided in Algorithm 1, where we use two sets of in-context examples for prompting GPT-3. We define E to be a set of in-context-learning examples of the form (x, y) used to get GPT-3 to solve PREDICT. Similarly, E_c contains in-context examples that prompt GPT-3 to fix an initial attempt ŷ into y conditioned on the natural language critique c, which we termed REFINE; E_c = {(x, ŷ, c, y)}.

⁴ https://github.com/allenai/RL4LMs

Algorithm 1: RL4F. Pseudocode of the algorithm used to train the feedback model.

Input: task data (x, y), frozen LM task, warm-started LM critique, in-context sets E and E_c
repeat
    Sample a batch of task pairs (x, y)
    ŷ ← LM task(E, x)            ▷ PREDICT
    ĉ ∼ LM critique(x, ŷ)        ▷ CRITIQUE
    y_new ← LM task(E_c, x, ŷ, ĉ)  ▷ REFINE
    Compute the reward via Eq. (2)
    Compute the advantage estimate Ât
    Update LM critique by maximizing the PPO objective
until convergence, and return LM critique

As our reward function for the planning and summarization datasets, we opt for a lexical similarity metric, ROUGE (1/2/L) (Lin, 2004):

R(x, ŷ, ĉ) = ROUGE(y_new, y), where y_new ← LM task(E_c, x, ŷ, ĉ)    (2)

Measuring ROUGE is computationally fast, making it easy to use in an online learning setting. The reward is only collected at a terminal step, i.e. either when the end-of-sentence token is produced or the maximum number of tokens is reached.
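A simplified version of this reward can be sketched as follows. The paper uses ROUGE-1/2/L; this illustration computes only a unigram ROUGE-1 F-score, which keeps the example self-contained:

```python
from collections import Counter

# Simplified terminal reward: lexical overlap between the refined output
# y_new and the reference y. Unigram ROUGE-1 F1 only (a simplification of
# the ROUGE-1/2/L reward described in the text).
def rouge1_f(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return 2 * p * r / (p + r)

def reward(y_new: str, y: str) -> float:
    # Collected only once, at the terminal step of decoding.
    return rouge1_f(y_new, y)

print(round(reward("get a sponge and soap", "get soap and a sponge"), 2))  # 1.0
```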

Datasets

Topic-Based Summarization
Saunders et al. (2022) crowd-sourced natural language critiques for topic-based summarization. The train, validation and test sets contain 14,230, 1,150 and 2,658 tuples of (x, ŷ, ĉ), respectively. The dataset provides multiple questions for a given passage, each inquiring about a different aspect. Given a passage and a question (x), multiple summaries are sampled from the model. Human annotators provide natural language critiques for the answers along with improved summaries. One example is provided in Fig. 1 and more are available in the Appendix (Table 8).

Interscript
Interscript (Tandon et al., 2021) is an action planning dataset for everyday tasks such as "put on a costume" or "play musical chairs". Each goal x is associated with a sequence of ground-truth actions y. Along with the x, y pairs, it contains erroneous action plans ŷ and natural language critiques ĉ suggesting a fix. An example is provided in Fig. 1 for "put soap in dishwasher". Other examples of critiques are "You need to have music to play musical chairs." and "You need to pay for the costume before leaving the store". More examples are available in the Appendix (see Table 6). Interscript represents a low-resource scenario: it contains 253, 45 and 169 examples for the train, validation and test sets, where each example contains 1-4 reference texts.

Synthetic Task: Alphabetization
We synthetically generate a task for alphabetically sorting a list of words, with list lengths ranging between 3 and 12. We use lexicon #11 by Keith Vertanen (2018), which contains 43K unique English words. Given an unsorted list and a ground-truth sorting of the list, we identify 5 operations to sample an incorrect sorting of y, denoted ŷ, and an associated critique c articulating what is wrong about ŷ in natural language. One example is shown below:

x: mug, greek, book, house
y: book, greek, house, mug
ŷ: book, greek, house
c: The word mug is missing.
The operations we use for distortion are REORDER, REPLACE, ADD, REPEAT and REMOVE (REMOVE is shown above). We also leave the majority of sorted lists intact, for which the ground-truth critique is "The list is correctly sorted". We use a total of 40K examples for warm-starting LM critique on the CRITIQUE objective and another 10K, 1K and 1K examples for the PPO stage, for the train, dev and test splits, respectively. Examples showing the other operations in action and corresponding natural language critiques are provided in Appendix A.
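The distortion procedure can be sketched as below. The critique templates mirror the examples in this paper; the exact sampling details (e.g. how replacement words are chosen) are assumptions for illustration:

```python
import random

# Sketch of the synthetic corruption procedure: given a sorted list y, apply
# one operation to produce an erroneous y_hat plus a templated critique c.
def corrupt(y, op, rng=None):
    rng = rng or random.Random(0)
    y_hat = list(y)
    if op == "REMOVE":
        w = y_hat.pop(rng.randrange(len(y_hat)))
        return y_hat, f"The word {w} is missing"
    if op == "REPEAT":
        i = rng.randrange(len(y_hat))
        y_hat.insert(i + 1, y_hat[i])
        return y_hat, f"The word {y_hat[i]} is repeated"
    if op == "REORDER":
        w = y_hat.pop(rng.randrange(len(y_hat)))
        y_hat.insert(rng.randrange(len(y_hat) + 1), w)
        return y_hat, f"The word {w} is placed in an incorrect position."
    if op == "REPLACE":
        i = rng.randrange(len(y_hat))
        old, new = y_hat[i], y_hat[i][:-1] + "d"  # toy word perturbation
        y_hat[i] = new
        return y_hat, f"The word {old} is replaced with {new}"
    if op == "ADD":
        w = "hair"  # in practice, a random word not in the list
        y_hat.insert(rng.randrange(len(y_hat) + 1), w)
        return y_hat, f"The word {w} is not in the original list"
    return y_hat, "The list is correctly sorted."  # NOTHING

y_hat, c = corrupt(["book", "greek", "house", "mug"], "REMOVE")
print(y_hat, "->", c)
```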
In alphabetization, we use the inverse Levenshtein distance for the reward function R, as defined in Eq. (3), where |·| measures the length of a list. Levenshtein distance is a form of edit distance counting single-item edit operations such as removal, insertion and substitution; we count word-level rather than character-level operations. Note that the higher the inverse-Levenshtein score of a predicted ordering, the closer it is to the alphabetically sorted version. The sorted list receives the maximum reward of 1.
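A word-level Levenshtein reward can be sketched as follows. The normalization by the longer list length is one plausible choice that yields a reward of 1 exactly for a perfectly sorted list; the paper's Eq. (3) may normalize differently:

```python
def levenshtein(a, b):
    """Word-level edit distance (insert/delete/substitute whole words)."""
    m, n = len(a), len(b)
    d = list(range(n + 1))          # single-row dynamic program
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # delete a[i-1]
                       d[j - 1] + 1,                   # insert b[j-1]
                       prev + (a[i - 1] != b[j - 1]))  # substitute
            prev = cur
    return d[n]

def inverse_levenshtein(y_pred, y):
    # Plausible normalization: reward in [0, 1], equal to 1 iff lists match.
    return 1.0 - levenshtein(y_pred, y) / max(len(y_pred), len(y), 1)

print(inverse_levenshtein(["book", "greek", "mug"], ["book", "greek", "house", "mug"]))
```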

Experiments and Results
Our experiments are designed to test the effectiveness of RL4F, along with other sources of critiques, in both natural and controlled settings. In our evaluations, we test the usefulness of critiques by looking at final task performance rather than evaluating the generated critiques themselves, as multiple critiques may lead to the same correct answer.
Sampling Critiques. We sample critiques from LM critique as in Eq. (1) by first concatenating the input and the initial prediction. The specific input formats for LM critique can be found in Appendix B.1.
We initialize LM critique with pretrained T5-large, a 770M-parameter encoder-decoder language model trained on large-scale web text (Raffel et al., 2020).

Downstream Predictions
In our experiments, we consider GPT-3 as the LM task model. GPT-3 can handle a wide range of tasks with prompting, using a handful of task examples in the input and without requiring task-specific fine-tuning (Brown et al., 2020). GPT-3 is not only able to tackle numerous tasks conveniently but can also refine initial predictions when given a natural language critique. Since our setting requires the LM task model to be able to model both the main task objective x → y and the refinement objective x, ŷ, c → y, GPT-3 is a suitable candidate that can adapt to both using few-shot exemplars. The prompt template we use for the latter is shown in Fig. 3, where we provide the model with an initial attempt to the question (initial_answer) and re-sample a revised prediction conditioned on the question and critique for the summarization task. We use the code-davinci-002 checkpoint via the OpenAI API⁵ with 3, 1 and 6 hand-written in-context examples for the planning, summarization and alphabetization tasks, respectively, as we exhaust the 4096-token input limit. In action planning, instead of resampling entire plans, we prompt GPT-3 to produce an edit operation on the initial plan ŷ. The set of edit operations identified in the original dataset is Insert, Remove and Reorder, where each critique comes with a corresponding edit operation. Note that these operations can be applied to ŷ algorithmically. While Reorder and Remove are expected to refer to existing steps in ŷ, we expect Insert to introduce a novel action. Hence, we stick with a generic lexical similarity metric in calculating the reward (Eq. (2)) for this task. In summarization, we compare human-written summaries with the repaired model summaries.

Table 1: Results for action sequence generation with Interscript (Tandon et al., 2021) and topic-based summarization by Saunders et al. (2022). We evaluate the performance of different sources of natural language critiques in steering LMtask to improve its predictions. Best scores in each column are in bold. We compare our method, RL4F, to three strong baselines and human-generated critiques. Self-Refinement prompts GPT-3 to self-repair its answer. MemPrompt uses a memory that stores human-generated critiques to previous outputs. ROUGE and BERTScore are out of 100, while BLEURT can be negative or positive and should be used for comparing different methods. See Appendix B.3 for a note on standard deviations.

Table 2: Results for alphabetization (columns: Source of Critiques, Exact Match, Inverse Levenshtein). Best scores are highlighted. Initial Outputs are obtained from GPT-3 (code-davinci-002) via in-context learning. SUPERVISED critiques misguide GPT-3, hurting its initial performance, as with MemPrompt. RL4F improves over the SUPERVISED model by 27 absolute points. Self-Refinement performs around the same as RL4F; in Fig. 5 we further discuss the advantages of RL4F over Self-Refinement when we sample and refine iteratively.
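Applying the dataset's edit operations to an initial plan can be sketched as below. The operation signatures are illustrative assumptions; Interscript encodes the operations alongside each critique:

```python
# Sketch of algorithmically applying Insert / Remove / Reorder edits to an
# initial plan y_hat (a list of step strings). The argument format is a
# hypothetical encoding, not the dataset's exact one.
def apply_edit(plan, op, step=None, position=None):
    plan = list(plan)
    if op == "Insert":
        plan.insert(position, step)   # may introduce a novel step
    elif op == "Remove":
        plan.remove(step)             # drop an existing step
    elif op == "Reorder":
        plan.remove(step)             # move an existing step to a new position
        plan.insert(position, step)
    return plan

plan = ["open dishwasher", "close dishwasher"]
print(apply_edit(plan, "Insert", step="put soap in dishwasher", position=1))
```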
Baselines. We compare the effectiveness of RL4F to SUPERVISED, which is described in Section 3.1. This is the closest baseline to the approach of Saunders et al. (2022) and Schick et al. (2022) that abides by our condition that LM task should remain unchanged. We use the same set of initial predictions ŷ when comparing different critique generators.
In addition to SUPERVISED, we use a simple Direct-Refinement baseline where we ask LM task to revise the initial prediction given a fixed critique, "Improve the answer." (DIRECTREFINE). The prompt template is otherwise the same as in the other methods. We configure our in-context examples to show that not all ŷ need to be repaired; hence, LM task is free to update the prediction or leave it as is when it is correct. Despite its simplicity, Direct-Refinement has been established as a strong baseline (Saunders et al., 2022).

Figure 3: Prompt template for topic-based summarization. We ask GPT-3 to refine the initial prediction by using the critique.
Moreover, we compare to MemPrompt. In that work, the authors study a setup where LM task generates an understanding along with the target output. For example, given the question "What sounds like good?", the model generates an understanding of the question, "The question is asking for a homonym.", before answering "wood". In their critique retriever, they train a mapping from x to an understanding u. However, the understanding is redundant in certain tasks, e.g. summarization, where the question is no different from u; thus, throughout our experiments, we replace the learned retriever in MemPrompt with BM25 (Harter, 1975).
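A compact Okapi-style BM25 scorer of the kind that could replace the learned retriever might look like this. It is a standard textbook formulation scoring stored memory entries against a new input, not the exact implementation used in our experiments:

```python
import math
from collections import Counter

# Minimal Okapi BM25: score each stored document (e.g. past inputs whose
# critiques are kept in memory) against a whitespace-tokenized query.
def bm25_scores(query, docs, k1=1.5, b=0.75):
    docs_tok = [d.split() for d in docs]
    q = query.split()
    n_docs = len(docs_tok)
    avgdl = sum(len(d) for d in docs_tok) / n_docs
    df = Counter(t for d in docs_tok for t in set(d))  # document frequencies
    scores = []
    for d in docs_tok:
        tf = Counter(d)
        s = 0.0
        for t in q:
            if t not in tf:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

memory = ["how do i put soap in dishwasher", "what sounds like good"]
scores = bm25_scores("put soap in the dishwasher", memory)
print(scores)  # the first stored input scores higher
```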
Lastly, we use human-written critiques (gold feedback) for REFINE to get LM task to repair outputs, and report this as an upper bound.

Planning and Summarization
Our main results for Interscript and topic-based summarization are provided in Table 1. Given the free-form nature of the outputs, we evaluate the planning and summarization tasks using text-similarity metrics to capture semantic and lexical similarity. We utilize learned metrics such as BLEURT (Sellam et al., 2020) and BERTScore (Zhang* et al., 2020) along with ROUGE (Lin, 2004). We compare the performance achieved by using different sources of critiques to that of human-written critiques. Across all metrics, RL4F yields one of the closest outcomes to human-written critiques.⁶

Alphabetization
We initialize our LM critique using the synthetic critiques described in Section 5.3. Our results are provided in Table 2. For alphabetization we compute exact match and inverse-Levenshtein scores as defined in Eq. (3). As an additional baseline, we fine-tune davinci (Brown et al., 2020) on the same train set as RL4F.

Because of the synthetic procedure used to create the (x, ŷ, c) triplets, the generated ŷ and c do not necessarily reflect the kinds of errors that LM task would make. We observe this in the scores of SUPERVISED, which fails to improve upon the initial outputs. Nevertheless, the RL4F procedure helps the policy network capture a useful distribution of critiques, improving over SUPERVISED by more than 27 absolute points. In this simple task, the Direct-Refinement prompt also yields competitive performance. Compared to full fine-tuning, we observe that despite training substantially fewer parameters, RL4F achieves significantly better accuracy. For a comparison to concurrent work, please refer to the appendix.

Analysis
Scaling Properties. While we use T5-large as our main model to initialize LM critique in all of our experiments, we also inquire about different model sizes. In Fig. 4, we consider three model sizes for Interscript, ranging from 60M to 770M parameters. On the y-axis we report the average of three ROUGE scores for the generated plans. RL4F greatly benefits from an increase in model size, while a similar trend is non-existent for SUPERVISED.

⁶ We have identified a handful of examples where a pair of train and test examples differs by only a single concept, e.g. all occurrences of "noodle" in the train sample were replaced with "food" to produce the test sample; the goal and steps are otherwise the same. MemPrompt does exceedingly well on these 7 cases, hence performing occasionally higher, yet fails on the rest of the test/val examples.
Semantic Drift. In goal-oriented training, semantic drift occurs when the strings produced by the policy begin diverging from the initial language (Lee et al., 2019; Blank, 1999). Although RL4F does not guarantee that ĉ ∼ LM critique will be natural language, we find no sign of semantic drift in the sampled critiques with respect to fluency and naturalness. We speculate that this may be because GPT-3 responds better to natural language than to gibberish, though future work should look more closely into this to make a conclusive argument. We provide sample predictions from both models in Appendix C for all three tasks.
Iterative Improvement. In Section 6, we provide results with applying only one round of critiques in alphabetization. Past work advocated for iterative editing (Reid and Neubig, 2022; Faltings et al., 2021) as opposed to one-shot editing. In Fig. 5, we sample and apply critiques from LM critique to the ŷ's iteratively and observe whether the number of correctly sorted lists increases or decreases. Note that critiques may also lead to deteriorating performance, as we do not eliminate the solved examples and it is at LM critique's discretion to declare a solution correct, e.g. by saying "This list is correctly sorted.". In fact, when we ask GPT-3 to simply improve its predictions iteratively (via Direct-Refinement as described in Section 3.2), it occasionally scrambles an already correct ordering while not scoring any new points. In contrast, RL4F leads to up to 7 more corrections (see Fig. 5).
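The iterative procedure can be sketched as follows, with toy stand-ins for the critic and refiner (assumptions for illustration; the stop condition mirrors the task's "correctly sorted" critique):

```python
# Sketch of iterative critique-and-refine: keep sampling critiques and
# revisions until the critic declares the answer correct or a round budget
# runs out. The critic and refiner below are toy stand-ins for the models.
def iterative_refine(x, y_hat, lm_critique, lm_refine, max_rounds=5):
    for _ in range(max_rounds):
        c = lm_critique(x, y_hat)
        if c == "The list is correctly sorted.":
            break                     # critic's discretion to stop
        y_hat = lm_refine(x, y_hat, c)
    return y_hat

def critic(x, y):
    return "The list is correctly sorted." if y == sorted(y) else "The list is not sorted."

def refiner(x, y, c):
    y = list(y)
    for i in range(len(y) - 1):       # fix one adjacent pair per round
        if y[i] > y[i + 1]:
            y[i], y[i + 1] = y[i + 1], y[i]
            break
    return y

print(iterative_refine("sort the list", ["mug", "book", "greek"], critic, refiner))
```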

Conclusion
We have described a collaborative framework involving two language models, where one model, the critique generator, is trained to improve the performance of the task model. We train the former via policy-gradient learning while treating the task model as a black box. We show that RL4F leads to superior final performance across three domains compared to other strong baselines, without inducing semantic drift: the critiques remain fluent and natural. Future work might generalize the critique generator into a mixture of experts, allowing humans and other models to contribute to the critiquing procedure.

Limitations
RL4F is primarily targeted at improving final performance. While we have found that the critiques learned by RL4F remain natural, we do not introduce any explicit restraints preventing semantic drift. Though it may raise end-task performance, semantic drift would hinder interpretability. Future work might study datasets not covered in this work and quantify semantic drift, along with proposing measures to prevent it as necessary. Moreover, this work does not provide an explicit mechanism to incorporate new critique labels that might become available in the future, nor does it identify a framework that could combine critiques from multiple experts such as humans and other machines. Lastly, we limit our analysis to GPT-3 and focus on a scenario where it is inefficient or impossible to train the task model, although this may be a conservative assumption in other settings.

Acknowledgement
We thank Anna Ivanova, Jacob Andreas, Zilu Tang, Shashank Gupta and Ashish Sabharwal for their valuable feedback on earlier drafts of this paper. We also thank Rajkumar Ramamurthy and Prithviraj Ammanabrolu for helpful discussions on using their RL4LMs repository which facilitated the experiments of this work. Afra Feyza Akyürek is supported in part by the U.S. NSF grant 1838193 and DARPA HR001118S0044 (the LwLL program). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA and the U.S. Government. Ekin Akyürek is supported by an MIT-Amazon ScienceHub fellowship and by the MIT-IBM Watson AI Lab.

A Dataset Processing
Action Planning Interscript is larger, but we use only a subset, removing distractors. The scripts used for data cleaning will be released along with the codebase.
Alphabetization We sample initial predictions from GPT-3 for alphabetization, some of which comprise multiple distortions simultaneously; however, we use one-step distortions to warm-start LM critique.
Given a pair of unsorted and sorted word lists, e.g.

x: mug, greek, book, house
y: book, greek, house, mug

below are the operations we used to create our data:

REORDER
ŷ: book, house, greek, mug
c: The word greek is placed in an incorrect position.

REPLACE
ŷ: book, greek, house, mud
c: The word mug is replaced with mud.

REMOVE
ŷ: book, greek, mug
c: The word house is missing.

REPEAT
ŷ: book, house, greek, house, mug
c: The word house is repeated.

ADD
ŷ: book, hair, greek, house, mug
c: The word hair is not in the original list.

NOTHING
ŷ: book, greek, house, mug
c: The list is correctly sorted.
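A subset of these one-step distortions can be sketched as below. This is our own minimal illustration, not the released data-cleaning script: the function name and random-choice logic are assumptions, while the critique templates follow the examples above.

```python
import random

def distort(sorted_words, op, rng=random):
    """Apply one distortion to a correctly sorted list and return the
    distorted list together with its critique sentence."""
    words = list(sorted_words)
    if op == "REORDER":  # move one word to a (possibly) wrong position
        w = words.pop(rng.randrange(len(words)))
        words.insert(rng.randrange(len(words) + 1), w)
        return words, f"The word {w} is placed in an incorrect position."
    if op == "REMOVE":   # drop one word
        w = words.pop(rng.randrange(len(words)))
        return words, f"The word {w} is missing."
    if op == "REPEAT":   # duplicate one word in place
        i = rng.randrange(len(words))
        words.insert(i, words[i])
        return words, f"The word {words[i]} is repeated."
    if op == "NOTHING":  # leave the list untouched
        return words, "The list is correctly sorted."
    raise ValueError(f"unknown operation: {op}")
```

The REPLACE and ADD operations would require a word vocabulary to draw substitutes from, so they are omitted from this sketch.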

B Experiment Details
We use code-davinci-002 as GPT-3 unless otherwise specified. We compute ROUGE using the implementation in the datasets library, setting use_stemmer=True for summarization and Interscript.

B.1 Data Formats
We use the following input formats for LM critique:

• Summarization:
• Alphabetization: {unsorted_list} ||| {initial_answer}
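For illustration, the alphabetization input above might be assembled as follows. The function name and the comma-join convention are our own assumptions; only the "|||" template is from the paper.

```python
def make_critique_input(unsorted_words, initial_answer_words):
    """Build an LM_critique input string from the word lists,
    following the "{unsorted_list} ||| {initial_answer}" template."""
    unsorted_list = ", ".join(unsorted_words)
    initial_answer = ", ".join(initial_answer_words)
    return f"{unsorted_list} ||| {initial_answer}"
```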
We train separate models for each dataset and evaluate them individually. We use T5-large as provided by the transformers library.

B.2 Prompts for GPT-3
We provide the prompt templates used for Alphabetization and Interscript when prompting GPT-3 for REFINE. The template for summarization is provided in the main text. In Direct-Refinement, the templates remain the same and the critiques are replaced with "Improve the answer.". Exact prompt exemplars will be made available in the released code repository.

B.4 Hyperparameters
In all of our experiments we use temperature 0 when prompting GPT-3, except when sampling initial predictions for alphabetization, where we set it to 0.5. We provide hyperparameters for RL4LMs (Ramamurthy et al., 2022) in Table 3, Table 4 and Table 5.
B.5 Results for Self-Refine on Alphabetization
Madaan et al. (2023) and Chen et al. (2023) propose that self-generated critiques (sampling critiques simply via few-shot prompting) are useful for a range of tasks. We examine whether self-generated critiques are more useful for the Alphabetization task than the other techniques in Table 2. To do so, we curate a few-shot prompt: Below is a given list of words which are supposed to be sorted in alphabetical order.
Describe what is wrong in the provided ordering.
--- After sampling critiques using the above prompt, we follow the same refinement steps described in Section 6. We obtain 21.6% exact match accuracy using these critiques, a significant drop from code-davinci-002's initial performance. When we sample critiques and refinements from text-davinci-002, the exact match score increases notably (to 58.6%) while still falling short of the initial accuracy of 63.6%. Self-Refine may improve over initial performance when critiques are sampled from more capable models such as text-davinci-003. Nonetheless, RL4F's critique engine (initialized as T5-large) already produces useful critiques despite being dramatically smaller than text-davinci-003.
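Exact match here requires the predicted ordering to match the reference exactly. A minimal sketch of such a metric follows; the whitespace normalization is our assumption, not the paper's evaluation script.

```python
def exact_match(pred, ref):
    """True iff the predicted comma-separated list equals the reference,
    ignoring surrounding whitespace around each word."""
    norm = lambda s: [w.strip() for w in s.split(",")]
    return norm(pred) == norm(ref)

def accuracy(preds, refs):
    """Fraction of predictions that exactly match their reference."""
    return sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(refs)
```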

C Sample Predictions
Sample predictions for all three tasks are provided in Table 6, Table 7 and Table 8.

Summer is a student in Jeffery's statistics class. She is a beautiful girl with long dark hair and a sunflower dress. Every Wednesday at 12:30, she waits for Jeffery to come out of class so she can see his smiling face. However, she has never had the courage to ask him to a date. [GPT-3 ignores the critique and leaves the answer as is. It's Jeffery who is interested in asking Summer out, not the other way around.]

The answer doesn't mention Jeffery asking her to a date. It should mention that he is not sure how to ask her to a date. [This is an important detail that is confused by the model's initial prediction.]

Summer is a student in Jeffery's statistics class. She is a beautiful girl with long dark hair and a sunflower dress. Every Wednesday at 12:30, she waits for Jeffery to come out of class so she can see his smiling face. However, she has never had the courage to ask him to a date. Jeffery is not sure how to ask her to a date. [GPT-3 successfully adds that it's Jeffery who is into Summer, yet it fails to remove the previous incorrect sentence suggesting otherwise.]

Gary left the house at six am on Wednesday; headed out to endure another long workday. After he walked through the yard to the driveway, he made sure to latch the gate securely, because that's what Sarah liked. He smiled as he did so, and thought of how earlier, he had closed the shower curtain all the way (another Sarah preference), and gently lowered the toilet seat. [. . . ]

Summarize how Gary feels about his life. Answer: Gary feels that his life is boring and that he is stuck in a rut. He feels that he is a slave to his wife and his job. He wants to be free of his responsibilities and wants to be a bartender.

Gary feels that his life is boring and that he is stuck in a rut. He feels that he is a slave to his wife and his job. He wants to be free of his responsibilities and wants to be a bartender. He is angry at his wife and has a son, Steven. [Having seen the critique, GPT-3 finds out that Gary's anger is actually towards his wife and improves its answer accordingly.]

SR-25 pattern magazine: An SR-25, AR-10 or LR-308 PATTERN MAGAZINE is a type of detachable firearm magazine based on the original Armalite AR-10 "waffle" magazine design introduced in the late 1950s. [. . . ]

What does the text say about the SR-25, AR-10, or LR-308 Pattern Magazine? Answer: A detachable firearm magazine based on the original Armalite AR-10 'waffle' magazine design introduced in the late 1950s, the design has been used on many AR-10 derivatives. It was reintroduced in 2012 with their new AR-10A model.

A detachable firearm magazine based on the original Armalite AR-10 'waffle' magazine design introduced in the late 1950s, the design has been used on many AR-10 derivatives. It was reintroduced in 2012 with their new AR-10A model. [GPT-3 repeats the initial answer.]

The response is missing several important information about the magazine. The first sentence should be more specific. [Critique calls for a more detailed summary.]

A detachable firearm magazine based on the original Armalite AR-10 'waffle' magazine design introduced in the late 1950s, the design has been used on many AR-10 derivatives. It was reintroduced in 2012 with their new AR-10A model. The magazine is used in many AR-10 derivatives such as the Knight's Armament SR-25 and DPMS Panther LR-308, as well as bolt action rifles such as the Mossberg MVP.308, Ruger Precision Rifle and Q FIX. Aftermarket magazines are produced by Magpul, Lancer Systems, and others. [More comprehensive summary of magazine designs.]

D Learning Curves for Reinforcement Learning
In Fig. 6 through Fig. 9, we show how evaluation metrics progress as LM critique in RL4F is trained.