Aligning Large Language Models through Synthetic Feedback

Aligning large language models (LLMs) to human values has become increasingly important as it enables sophisticated steering of LLMs. However, it requires significant human demonstrations and feedback or distillation from proprietary LLMs such as ChatGPT. In this work, we propose a novel alignment learning framework with synthetic feedback not dependent on extensive human annotations and proprietary LLMs. First, we perform reward modeling (RM) with synthetic feedback by contrasting responses from vanilla LLMs with various sizes and prompts. Then, we use the RM to simulate high-quality demonstrations to train a supervised policy and further optimize the model with reinforcement learning. Our resulting model, Aligned Language Model with Synthetic Training dataset (ALMoST), outperforms recent open-sourced models, which are trained on the outputs of InstructGPT or human-annotated demonstrations, in alignment benchmarks. In human evaluation, our model is preferred to Alpaca and Dolly-v2, 55.0% and 58.5% of the time, respectively. Further analyses demonstrate the efficacy and importance of synthetic feedback in our framework. The code is available at https://github.com/naver-ai/almost


Introduction
Alignment learning has been an essential learning scheme to align the behaviors of large language models (LLMs) with human values like safety and truthfulness while following the intention of users accurately (Ouyang et al., 2022). Vanilla LLMs, i.e., those not aligned yet, could misunderstand user intentions or produce unsafe and inaccurate responses. Desirable human values such as helpfulness, harmlessness, or honesty can be defined, and human demonstrations with these values are then used for the alignment learning (Askell et al., 2021; Bai et al., 2022a).

Figure 1: A procedure of reward modeling through synthetic feedback. We assume that the response from a larger LLM with more and better demonstrations might be better overall. We train a reward model with synthetic comparisons generated on top of this assumption.
However, the three-stage training recipe of RLHF, i.e., supervised fine-tuning (SFT), reward modeling (RM), and reinforcement learning, requires significant human effort, especially in the first two stages. More specifically, both the SFT and RM training stages must be provided with an abundance of high-quality human demonstrations and ranking datasets to facilitate RLHF. For instance, Ouyang et al. (2022) prepare and utilize 13k human demonstrations and 33k comparisons.
On the other hand, Self-Instruct (Wang et al., 2022) attempts to generate synthetic instruction datasets via self-generation, using in-context learning with a few seed demonstrations. Meanwhile, the release of LLaMA (Touvron et al., 2023) has brought about many open-sourced aligned LLMs trained on the outputs of proprietary LLMs or human-annotated instructions. However, these models still heavily depend on proprietary LLM APIs such as InstructGPT and ChatGPT (Ouyang et al., 2022; OpenAI, 2023; Taori et al., 2023; Chiang et al., 2023) or intensive human annotations (DataBricks, 2023; Köpf et al., 2023).
In this paper, we introduce a novel framework for alignment learning that requires only minimal human labor and does not depend on proprietary LLMs. Unlike the conventional alignment learning procedure that collects demonstrations first (Ouyang et al., 2022), we first develop a reward model (RM) on a synthetic comparison dataset constructed by contrasting outputs from vanilla LLMs in various configurations, as shown in Figure 1. The rules for generating these synthetic ranking data originate from our hypothesis that responses generated by larger, optimally prompted models are superior to those produced by smaller, inadequately prompted models, as reported by previous work (Askell et al., 2021). Then, we introduce Reward-Model-guided Self-Play (RMSP) to simulate high-quality demonstrations with rejection sampling using the RM (Ouyang et al., 2022). We train LLaMA-7B (Touvron et al., 2023) on the synthetic demonstrations (SFT) and further optimize the model with rewards from the synthetic RM, namely, Reinforcement Learning from Synthetic Feedback (RLSF).
Our Aligned Language Model with Synthetic Training dataset (ALMoST) outperforms Alpaca (Taori et al., 2023), which is distilled from InstructGPT (Ouyang et al., 2022), as well as Dolly-v2 (DataBricks, 2023) and OpenAssistant (Köpf et al., 2023), which are trained on human-annotated demonstrations, in the alignment-related benchmarks (Askell et al., 2021; Lin et al., 2021; Chiang et al., 2023). Notably, our model is preferred to the recent open-sourced models Alpaca and Dolly-v2 55-58% of the time (winning rate) in human evaluation, without distillation from proprietary LLMs or intensive human annotations. We speculate that the strong performance of our model comes from empirical indicators of well-aligned behaviors being effectively incorporated into a strong backbone model through synthetic feedback, eliciting the inherent capability to self-align and partially replacing the need for human feedback.
Our main contributions are threefold:

• We propose a novel alignment learning framework by introducing synthetic feedback. It automatically constructs high-quality comparisons and demonstrations without relying on human feedback and proprietary LLMs.

• Our resulting model, ALMoST, shows well-aligned behaviors with human values in alignment benchmarks. In the human study, ALMoST is preferred to Alpaca and Dolly-v2, showing a 55-58% winning rate.

• Analysis of the RM further demonstrates the efficacy of synthetic feedback and highlights the importance of injecting empirical priors, e.g., our proposed filtering method and faithful prompt design.

Method
In this section, we describe the detailed procedure of our framework, as depicted in Figure 2.

Step 1: Reward Modeling with Synthetic Feedback
Prompted Baseline As we do not yet have aligned baselines available for the comparisons, we utilize the HHH (Helpful, Harmless, and Honest) prompt devised by Askell et al. (2021). It contains 14 human-written conversations for guiding LLM alignment. We employ the HHH-prompted LLaMA models (Touvron et al., 2023) to generate synthetic comparisons.
Generating Synthetic Comparison Instead of collecting human feedback, we generate synthetic comparisons based on naive assumptions grounded in empirical observations. Askell et al. (2021) demonstrate that a larger model performs somewhat better than a smaller model, and that a model with longer prompts is better than a model with shorter prompts in terms of human preference.
In short, we assume the quality of a response follows this rule of thumb:

• Larger model > Smaller model

• More few-shot demonstrations > Fewer few-shot demonstrations

• Better demonstration quality > Worse demonstration quality
For the same input x, we first sample the responses Y = {y_1, y_2, ..., y_{|Y|}} from the models with various configurations. Then, we apply the rule to choose the better response among the generated responses. More specifically, we involve {7, 13, 30}B LLMs with {1, 3, 5} shots of HHH demonstrations for the comparison. As illustrated in Figure 1, if we sample responses from (1) 30B with 5 shots, (2) 13B with 3 shots, and (3) 7B with 1 shot, the ranking becomes y_1 > y_2 > y_3 according to our rule of thumb. Then, we can get a set of binarized comparisons, e.g., {(y_1, y_2), (y_2, y_3), (y_1, y_3)}. We denote the former as the 'chosen' response (y_1) and the latter as the 'rejected' response (y_2) in a comparison pair (y_1, y_2).
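A minimal sketch of this binarization step is shown below; the configuration names and record layout are illustrative assumptions, not the exact implementation.

from itertools import combinations

# Hypothetical response records: each response is tagged with the configuration
# that produced it. Per the rule of thumb, configurations listed earlier are
# assumed to yield better responses (e.g., 30B/5-shot > 13B/3-shot > 7B/1-shot).
RANKED_CONFIGS = ["30B-5shot", "13B-3shot", "7B-1shot"]

def binarize_comparisons(query, responses):
    """Turn rank-ordered responses into (chosen, rejected) training pairs."""
    # Sort responses by the assumed quality of their generating configuration.
    ranked = sorted(responses, key=lambda r: RANKED_CONFIGS.index(r["config"]))
    pairs = []
    for better, worse in combinations(ranked, 2):
        pairs.append({
            "prompt": query,
            "chosen": better["text"],
            "rejected": worse["text"],
        })
    return pairs

# Example: three responses sampled for the same query.
responses = [
    {"config": "7B-1shot", "text": "Short, vague answer."},
    {"config": "30B-5shot", "text": "Detailed, well-formatted answer."},
    {"config": "13B-3shot", "text": "Reasonable answer."},
]
print(binarize_comparisons("How do I cook pasta?", responses))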
Post Validation Our assumptions are often violated because of the stochastic nature of prompt-based generation. Such noise in the dataset makes reward modeling unstable and ultimately divergent. Thus, we come up with post-validation methods to filter out such noise.
First, we devise a Heuristic Filter (HF) based on prior knowledge. It discards bad responses containing or beginning with keywords such as "I don't know" or "well". Also, we empirically find that the better response is usually longer than the worse one; in particular, if a response is short, it often tends to be a case of probabilistic generation failure. However, training the RM only on comparisons with longer chosen responses would make the resulting model biased by length. Thus, we apply HF to keep comparison pairs whose chosen response is longer than either the rejected one or M - S/2, where M is the mean and S is the standard deviation of the lengths of Y at the character level. This length constraint reduces the probability that a short generation is a stochastic generation failure by checking whether the length of each response is within the confidence interval, while avoiding the length bias. Please see Appendix B for detailed examples. We will demonstrate the benefits in Section 4.2.
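A minimal sketch of one reading of this heuristic filter follows; the keyword list and the field layout are illustrative assumptions.

import statistics

BAD_PREFIXES = ("I don't know", "Well")  # illustrative keyword list

def passes_heuristic_filter(chosen, rejected, all_responses):
    """Keep a (chosen, rejected) pair only if it satisfies the heuristics."""
    # Discard degenerate chosen responses that start with refusal/filler keywords.
    if chosen.strip().startswith(BAD_PREFIXES):
        return False
    # Character-level length statistics over all responses Y sampled for this query.
    lengths = [len(r) for r in all_responses]
    mean, std = statistics.mean(lengths), statistics.pstdev(lengths)
    # Keep the pair if the chosen response is longer than the rejected one,
    # or at least longer than (mean - std / 2), so that shorter-but-valid
    # chosen responses are not discarded and the RM is not biased toward length.
    return len(chosen) > len(rejected) or len(chosen) > mean - std / 2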
Second, we leverage an As-is RM for further data filtering. Specifically, we train another RM with a community QA dataset such as StackExchange (Askell et al., 2021; Beeching et al., 2023). Our preliminary study does not find the benefits of large-scale pre-training for the RM discussed in Askell et al. (2021). Thus, we sample 20k pairs from the pre-processed StackExchange dataset for this training. We keep a resulting synthetic comparison only when the As-is RM agrees with the decision.
Reward Modeling Finally, we train the reward model on the synthetic comparisons described above. We follow the ranked preference modeling objective from previous works (Askell et al., 2021; Ouyang et al., 2022). The objective is to make the reward model r_θ assign a higher scalar value for the overall quality of a response y_j for a given query x than for its counterpart baseline response y_k. The loss function is defined as follows:

J(θ) = -E_{(x, y_j, y_k) ~ D} [ log σ( r_θ(x, y_j) - r_θ(x, y_k) ) ],

where D is the training set of synthetic comparisons and r_θ(x, y) is the reward model's scalar output indicating the overall quality of a response y for its input x.
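In PyTorch-like code, the pairwise objective can be sketched as below; reward_model is assumed to be any scalar-output model and the batch layout is illustrative.

import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_model, x, y_chosen, y_rejected):
    """Ranked preference loss: push r(x, chosen) above r(x, rejected)."""
    r_chosen = reward_model(x, y_chosen)      # shape: (batch,)
    r_rejected = reward_model(x, y_rejected)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()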
Implementation Details We start by generating initial queries, i.e., a diverse set of inputs x. In particular, we adopt the recipe of Self-Instruct (Wang et al., 2022) to generate 10k initial queries based on few-shot in-context learning. More details of the query generation are in Appendix C.
For response generation, we include five prompted models with the below configurations.
• A. LLaMA-30B-Faithful-3shot

For each query, we generate five responses from the models and take rankings, y_A > y_B > y_C > y_D > y_E, reflecting the rule of thumb. The Faithful label indicates our manually designed prompts consisting of three conversations that respond more faithfully and at greater length while considering the response format, and the HHH label indicates the prompts written by Askell et al. (2021). Detailed examples are in Appendix A. Finally, we produce 13k binarized synthetic comparisons after post-validation (HF and As-is RM) and train a reward model with them.

Step 2: Supervised Fine-Tuning
In the second step, we propose Reward-Model-guided Self-Play (RMSP) to simulate high-quality demonstrations, i.e., conversations between the user and the AI assistant. The simulated demonstrations are used for supervised fine-tuning of the initially aligned policy model (SFT).

Self-Play
The basic simulation is enabled by turn-taking between user and assistant role models with corresponding prompts, i.e., self-play. We continue to use the same prompted baseline, LLaMA-30B-Faithful-3shot, for the assistant role. In addition, we make minor adjustments to the original HHH prompt (Askell et al., 2021) to better suit the user's role, LLaMA-30B-User-3shot. Starting from the initial queries generated in the first stage, the LLaMA-30B-Faithful-3shot generates responses for the queries. Then, the LLaMA-30B-User-3shot follows up on the assistant's response. The turn-taking continues until the maximum turn T.
RM-guided Self-Play (RMSP) To ensure more aligned responses from the assistant, we suggest including the synthetic RM, trained in the first stage, in the loop, namely Reward-Model-guided Self-Play (RMSP). In this setup, the assistant model, LLaMA-30B-Faithful-3shot, first samples N responses for a given conversational context. Then, the RM scores the N responses, and the best-scored response is chosen as the final response for the simulation, i.e., the RM performs rejection sampling (best-of-N sampling) (Ouyang et al., 2022; Scheurer et al., 2023). As in Self-Play, the turn-taking with LLaMA-30B-User-3shot continues until the maximum turn. Please see Figure 10 for examples.
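The best-of-N step at the core of RMSP can be sketched as follows; assistant.generate and reward_model.score stand in for the prompted LLaMA-30B-Faithful-3shot sampler and the synthetic RM, and are assumed interfaces rather than a specific library API.

def rmsp_turn(context, assistant, reward_model, n=4):
    """One assistant turn of RM-guided Self-Play (best-of-N sampling).

    assistant.generate(context) samples one candidate response for the
    conversational context; reward_model.score(context, response) returns
    a scalar reward.
    """
    candidates = [assistant.generate(context) for _ in range(n)]
    scores = [reward_model.score(context, c) for c in candidates]
    # The highest-scoring candidate becomes the assistant's response.
    best = candidates[scores.index(max(scores))]
    return best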

Implementation Details
We generate about 20k high-quality demonstrations using RMSP. We set the maximum turn to 2 for simplicity, focusing on the single-turn scenario. The number of rejection samples N is set to 4 considering resource constraints. Then, we train LLaMA-7B on the generated demonstrations, i.e., supervised policy fine-tuning (SFT). More training details are in Appendix D.

Step 3: Reinforcement Learning from Synthetic Feedback (RLSF)

In the last stage, we perform reinforcement learning from synthetic feedback (RLSF) to further align the SFT model using a reward signal from the synthetic RM. Following previous works (Ouyang et al., 2022; Bai et al., 2022a), the policy π_ϕ generates a response y given a prompt x. Subsequently, a reward score r_θ(x, y) is determined by the reward model r_θ. The training objective is to maximize the expected reward.

Stiennon et al. (2020) propose that adding an estimated KL penalty term between the initial policy ρ and the policy π_ϕ to r_θ(x, y) can enhance performance. This adjustment leads to the final objective as follows:

E_{x ~ D, y ~ π_ϕ(·|x)} [ r_θ(x, y) - λ log( π_ϕ(y|x) / ρ(y|x) ) ],

where λ is a KL coefficient.
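A sketch of the sequence-level version of this KL-penalized reward is shown below; the log-probability interface is an assumption for illustration.

def rlsf_reward(r_theta, logprob_policy, logprob_init, kl_coef=0.05):
    """Final RLSF reward: RM score minus a KL penalty toward the initial policy.

    r_theta:        reward model score r_theta(x, y) for the sampled response
    logprob_policy: log pi_phi(y | x) under the current policy
    logprob_init:   log rho(y | x) under the initial (SFT) policy
    kl_coef:        the coefficient lambda (0.05 in our PPO setup)
    """
    return r_theta - kl_coef * (logprob_policy - logprob_init)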

Implementation Details
We initialize the policy ρ with the SFT-tuned LLaMA-7B from Step 2. Also, the prompts for the PPO training are compiled by extracting only the inputs (initial queries) from the demonstration dataset generated by RMSP described in Section 2.2.More details for PPO training are in Appendix D.

Dataset
Static HHH alignment and TruthfulQA Askell et al. (2021) introduce the Static HHH alignment benchmark to measure how well models are aligned with human values. Similar to the comparison dataset, the model should choose the more proper response to an input between two options based on human values. The dataset consists of three human value categories, helpful, harmless, and honest, and a misc (others) category.
We include the dataset to capture relationships of tension among the human values in the evaluation, although the entire dataset contains just 221 instances. Lin et al. (2021) propose TruthfulQA to measure how truthfully an LLM answers a given question. It contains 817 adversarial questions designed to elicit imitative falsehood answers from LLMs. For simplicity, we evaluate the models with a multiple-choice setup (MC1) instead of a generative setup. Note that all the evaluation is zero-shot, i.e., we do not fine-tune on the target dataset.
Please see Appendix I for more details of the evaluation prompt.
Vicuna Questions We test our models using the Vicuna evaluation questions (Chiang et al., 2023). The set consists of 80 questions on various topics spanning general QA, writing, reasoning, etc., to identify user preference. First, two different models generate an answer for the same question. Then, we conduct human A/B tests to choose the preferred answer between the answers from the two models. In particular, we recruit three workers for each test.
We ask the workers to choose the more helpful, harmless, and honest answer with careful consideration of the contents of the answers (Askell et al., 2021). Please see Appendix E.1 for more details of the human evaluation. In addition, we conduct an automatic evaluation using GPT-4. In this test, GPT-4 assesses the two answers by giving a 1-10 scalar score to each answer and providing appropriate explanations for the judgment (OpenAI, 2023). Even though it is not a rigorous evaluation, we can compare the overall response qualities of the models at a reasonable cost. Considering the positional bias of the GPT-4 assessment, we evaluate the same instance twice by reversing the order of the two answers.
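The order-swapping protocol can be sketched as follows; judge stands in for a GPT-4 call and its return format is an assumption, not a specific API.

def evaluate_pair(question, answer_a, answer_b, judge):
    """Score two answers with a judge twice, reversing their order, and average.

    judge(question, first, second) is an assumed interface returning a
    (score_first, score_second) pair on a 1-10 scale.
    """
    s1_a, s1_b = judge(question, answer_a, answer_b)
    s2_b, s2_a = judge(question, answer_b, answer_a)  # reversed order
    # Averaging over both orders mitigates the judge's positional bias.
    return (s1_a + s2_a) / 2, (s1_b + s2_b) / 2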

Baselines
We include recent open-sourced models to compare the aligned behaviors: Alpaca (Taori et al., 2023), Dolly-v2 (DataBricks, 2023), Vicuna (Chiang et al., 2023), and OpenAssistant (Köpf et al., 2023). The latter also releases an Oasst model (SFT) trained on its human-annotated dataset.

Evaluation Results
Static HHH alignment and TruthfulQA Our models outperform Alpaca, Dolly-v2, and OpenAssistant without any distillation from proprietary LLMs or intensive human annotations, as shown in Table 1. For all sizes, our ALMoSTs show consistently better accuracy in all HHH splits and TruthfulQA, except against Vicuna trained on ChatGPT's outputs (Chiang et al., 2023; OpenAI, 2023). However, our RM shows excellent performance in choosing appropriate responses according to human values, even beating Vicuna-7B. This is consistent with the observation of Askell et al. (2021). Moreover, it is notable that our ALMoST-PPO achieves the highest accuracy in the Helpful split of the HHH evaluation, even when 13-billion-parameter models are included. When comparing our SFT and PPO-trained models, the PPO model improves helpfulness, harmlessness, and truthfulness while sacrificing honesty. Honesty and truthfulness look similar, but they are slightly different: honesty is related to expressing uncertainty, while truthfulness mainly measures how robust the model is against adversarial falsehood.
Human Evaluation on Vicuna Questions We confirm our models' strengths in actual human preferences through human evaluation. In Figure 3a, we find that humans also assess our ALMoST model more favorably than Dolly-v2 and Alpaca (DataBricks, 2023; Taori et al., 2023). In Figure 3b, we observe a similar tendency in the GPT-4 evaluation as in the human evaluation. ALMoST-PPO consistently shows a higher winning rate against Dolly-v2, Alpaca, and ALMoST-SFT. However, we find that GPT-4 is less likely to give the two answers the same score, showing a generally lower tie rate than the human evaluation. Moreover, GPT-4 assesses Vicuna's responses more favorably than humans did. Nevertheless, we can obtain the overall gaps among the models at a reasonable cost from the GPT-4 evaluation.
When we extend the evaluation to various-sized models, our 7B model outperforms 12-13B baselines, including Alpaca-13B, Oasst-12B, and Dolly-v2-12B in the evaluation as shown in Figure 8.

Probing Main Hypothesis
We further inspect our assumptions for the synthetic feedback. First, the winning rate against Alpaca-7B increases monotonically as we increase the model size (Table 2); the gap between 13B and 30B is especially large. Second, when comparing the LLaMA-7B-HHH-{1,3,5}shot models, the number of demonstrations also improves the winning rate, but improvements from this factor are relatively small. Finally, we find that the demonstration's quality is the most important factor. Surprisingly, the smaller model with the well-designed prompt (LLaMA-7B-Faithful-3shot) outperforms the larger model with the normal prompt (LLaMA-30B-HHH-5shot). Through this intrinsic evaluation, we find our synthetic feedback dataset effectively covers responses of varying quality.

Table 3: Reward model evaluation results on HH-RLHF (Bai et al., 2022a). The lengthy baseline always chooses the longer response of the pair. The * indicates the training data is a subset of the original, including only single-turn examples.

RM Evaluation
We further evaluate our RM on another comparison dataset, HH-RLHF (Bai et al., 2022a), to validate our synthetically generated comparisons (Synthetic Feedback). HH-RLHF contains various splits according to the development stage, e.g., a base set to build the initial policy or an online set collected with the deployed system. We focus on the 'Helpful-base' split, assuming we do not have a deployable system.

Reward Modeling
In Table 3, we find our RM trained with Synthetic Feedback achieves 90% of the performance of its upper bound trained on the full training dataset. Also, it achieves the same accuracy as the RM fine-tuned on the single-turn subset (Helpful-base*). Please note that HH-RLHF includes multi-turn contexts, while our synthetic dataset focuses on single-turn scenarios.

Effect of Post Validation
We conduct two types of post validation to reduce noise in the synthetic comparisons, as described in Section 2.1. Table 4 shows that each filtering method contributes to the final reward model quality. Notably, we find the heuristic filter (HF) considering the length distribution plays a crucial role in synthetic data generation.
When we exclude the HF, the performance of the RM drops by about 10 percentage points. Moreover, HF prevents the RM from falling into the length bias discussed in Section 2.1: the RM trained on the dataset with HF outperforms the lengthy baseline, which always selects the longer response as the better one.
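For reference, the lengthy baseline amounts to the following one-liner over a labeled comparison set; the field names are illustrative.

def lengthy_baseline_accuracy(pairs):
    """Accuracy of always predicting the longer response as the better one.

    pairs is an iterable of dicts with 'chosen' and 'rejected' response texts,
    where 'chosen' is the human-preferred response (e.g., HH-RLHF Helpful-base).
    """
    correct = sum(len(p["chosen"]) > len(p["rejected"]) for p in pairs)
    return correct / len(pairs)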

RMSP vs Self-Play
We inspect the benefits of RM-guided Self-Play (RMSP) compared to its counterpart without RM guidance, i.e., Self-Play. Specifically, we compare two supervised policies (SFT) trained on demonstrations generated by RMSP or Self-Play. In Table 5, the model trained with RMSP outperforms the model trained with Self-Play in various benchmarks. If we use the HHH prompt instead of the Faithful prompt for the simulation, the alignment performance drops significantly. We include qualitative examples comparing the methods in Table 10.

Conclusion
In this work, we propose a novel framework for aligning LLMs with human values by introducing synthetic feedback. We identify better responses from vanilla LLMs with various sizes and prompts, relying on empirical prior knowledge. We first train a reward model with synthetically generated comparisons. Then, we produce another synthetic dataset to train aligned policies using the reward model. Experimental results demonstrate the efficacy of our framework, showing outstanding performance in the alignment benchmarks. We believe the strong performance of our model derives from the effective incorporation of empirical indicators of well-aligned behaviors through synthetic feedback. Furthermore, our method is cost-effective in that it requires neither extensive human demonstrations nor proprietary LLMs.

Limitations
Even though we show our framework works well on many alignment-related benchmarks (Askell et al., 2021; Lin et al., 2021; Chiang et al., 2023), the alignment might come at the cost of other innate abilities, i.e., the alignment tax (Bai et al., 2022a). In fact, we observe similar results when we test our models on the zero-shot MMLU and LAMBADA tasks (Hendrycks et al., 2020; Paperno et al., 2016) to identify the alignment tax, as shown in Appendix F. Our PPO model shows deteriorated performance on both datasets, implying the presence of an alignment tax. Although there is a report that models of less than 10B parameters often suffer from the alignment tax and that scaling the parameters mitigates the trade-off (Bai et al., 2022a), our approach might be limited in that it mostly focuses on aligning LLMs to the target values, e.g., helpfulness. We leave a more holistic evaluation of our framework and mitigation of the alignment tax for future work.

A Examples of alignment prompts

Faithful (HHH) prompt
Below are a series of dialogues between various people and an AI assistant. The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable. The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed. It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer. That said, the assistant is practical and really does its best, and doesn't let caution get too much in the way of being useful.
Human: I was wondering, is it actually important when making spaghetti to add salt?
Assistant: Yes, adding salt to the water when making spaghetti (or any pasta) is important for several reasons: 1. Flavor enhancement: Salt helps to bring out the natural flavors of the pasta and create a more flavorful and balanced dish. Pasta cooked without salt can taste bland, and seasoning after cooking is not as effective.
2. Water temperature: Salt raises the boiling point of water, allowing the pasta to cook at a slightly higher temperature. This can lead to a better al dente texture.
3. Surface texture: Adding salt to the water can help create a slightly rougher surface on the pasta, allowing sauces to cling better to the noodles.
4. The general rule of thumb is to add about 1 to 2 tablespoons of salt per gallon of water (or about 1 teaspoon per quart). Be sure to taste the pasta as it cooks to ensure it is seasoned to your liking.
Human: I see, can you recommend some spaghetti recipe for me?
2. While the pasta is cooking, heat the olive oil in a large skillet over medium heat. Add the chopped onion and cook for 5 minutes until soft and translucent.
3. Add the minced garlic to the skillet and cook for another minute until fragrant.
4. Pour the diced tomatoes and their juice into the skillet with the onions and garlic. Add salt, black pepper, and red pepper flakes if using. Stir well to combine.
5. Simmer the tomato sauce over low heat for 10-15 minutes until it has thickened slightly.
6. Drain the spaghetti and add it to the skillet with the tomato sauce. Toss the pasta with the sauce until it is coated evenly.
7. Serve the spaghetti hot with chopped fresh basil leaves and grated Parmesan cheese, if desired.
I hope you enjoy this recipe! Let me know if you have any other questions.
Figure 4: Example of the Faithful prompt used for our zero-shot modeling and simulations. We manually revise the original HHH prompt to be more faithful. The full examples are found here.

User (HHH) prompt
Below are a series of dialogues between various people and an AI assistant. Human should pursue their curiosity to obtain proper and desired responses from the AI. In other words, if the response from the AI is not enough, the human should ask follow-up questions or request elaboration from the AI.
Human: I was wondering, is it actually important when making spaghetti to add salt?
Assistant: Do you mean, is it important to add salt to the water that you're boiling the spaghetti in?
Human: Yeah

Assistant: It's usually recommended to bring out the flavor of the pasta. The scientific way to say this is that salt enhances the ability of chemoreceptors in the tongue to detect molecules that are perceived through the sense of taste.

Human: But I mean I'm going to use sauce and stuff, so does it matter?

Assistant: It sounds like you're skeptical. Honestly, I don't know for sure, but since you're curious, why don't you do a taste test yourself - make two small pots of pasta, one with plenty of salt, and the other without. Then taste the noodles and see which one you prefer? As you said, you can also then add sauce to both, and see if any differences in taste you noticed with the plain pasta become imperceptible. It should be a fun experiment!
Human: I found this abstract for a physics paper, and I heard that it's a very famous and important paper, but I had no idea what any of it means:

Recently, it has been proposed by Maldacena that large N limits of certain conformal field theories in d dimensions can be described in terms of supergravity (and string theory) on the product of d+1-dimensional AdS space with a compact manifold.
Here we elaborate on this idea and propose a precise correspondence between conformal field theory observables and those of supergravity: correlation functions in conformal field theory are given by the dependence of the supergravity action on the asymptotic behavior at infinity. In particular, dimensions of operators in conformal field theory are given by masses of particles in supergravity. As quantitative confirmation of this correspondence, we note that the Kaluza-Klein modes of Type IIB supergravity on AdS5×S5 match with the chiral operators of N=4 super Yang-Mills theory in four dimensions. With some further assumptions, one can deduce a Hamiltonian version of the correspondence and show that the N=4 theory has a large N phase transition related to the thermodynamics of AdS black holes.
Can you explain it?
Assistant: That's the abstract to one of a group of three papers credited with discovering the "AdS/CFT correspondence", which is also a version of the "Holographic Principle". This principle says that theories of gravity in d+1-dimensions can be exactly described using other, non-gravitational theories in d dimensions, so the word "holography" is a metaphor. The abstract goes on to give a bunch of concrete details about the discovery and how it's realized in string theory.
Human: What is the "AdS" and the "CFT" then? Is it "Conformal Field Theory", and what's that?
Assistant: AdS is "Anti-deSitter" space, which is really a negatively curved space... in informal terms, it's a surface that's shaped like a saddle. Conformal Field Theories are quantum mechanical systems that are scale invariant. They're a kind of modified special case of the strong force, which is part of the standard model of particle physics. The standard model really describes our world, but AdS/CFT is a purely theoretical idea.
Human: Can you tell me what happened with the June 1996 Bangladeshi General Election?

Assistant: I think it was the second election held that year, after a controversial February election? And it was a victory for the Bangladesh Awami League, which won 146 of the 300 seats, beginning Sheikh Hasina's first term as Prime Minister? But please double check those facts when you get a chance.
Figure 5: Examples of the User prompt used for our simulations. The original prompt includes 14 human-written conversations for the assistant model devised by Askell et al. (2021). We edit the instruction to suit the user model.

C Details of Synthetic Datasets

C.1 Details of Initial Query Mining
We follow the recipe of Self-Instruct (Wang et al., 2022) to generate initial queries for our synthetic dataset generation. Specifically, we write 10 manual queries as seed demonstrations. Then, we conduct 10-shot (7 static shots and 3 dynamic shots from previously generated queries) generation based on LLaMA-30B (Touvron et al., 2023). We filter out the generated queries containing bad words such as 'image', 'graph', 'picture', and 'video'. Also, we discard queries having a high lexical overlap with already mined queries, using the Rouge-L score to promote diversity as in Wang et al. (2022); specifically, we discard a newly generated query when its maximum Rouge-L score against already mined queries is higher than 0.5. We plot a nested pie chart for the resulting 10k queries in Figure 7.
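A minimal sketch of this filtering step is shown below, using the rouge-score package; the function name and thresholds mirror the description above and are otherwise illustrative.

from rouge_score import rouge_scorer  # pip install rouge-score

BAD_WORDS = ("image", "graph", "picture", "video")
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def keep_query(new_query, mined_queries, max_rouge=0.5):
    """Filtering used during initial query mining (a sketch of the recipe above)."""
    # Discard queries that refer to modalities a text-only model cannot handle.
    if any(w in new_query.lower() for w in BAD_WORDS):
        return False
    # Discard near-duplicates: maximum Rouge-L F1 against already mined queries.
    for q in mined_queries:
        if scorer.score(q, new_query)["rougeL"].fmeasure > max_rouge:
            return False
    return True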

C.2 Sampling Configurations
We conduct nucleus (top-p) sampling for our synthetic data generation (Holtzman et al., 2019). We set p to 0.9 with a temperature of 1.2 for the initial query mining in Step 1. Otherwise, we use the same p with a temperature of 1.0 for response sampling in Steps 1 and 2. We set the maximum number of generated tokens to 384.
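With the Hugging Face generate API, these settings correspond roughly to the keyword arguments below; the mapping to max_new_tokens and the model loading are assumptions for illustration.

# Nucleus sampling settings used in this work (a sketch).
query_mining_kwargs = dict(       # Step 1: initial query mining
    do_sample=True, top_p=0.9, temperature=1.2, max_new_tokens=384,
)
response_sampling_kwargs = dict(  # Steps 1 and 2: response sampling
    do_sample=True, top_p=0.9, temperature=1.0, max_new_tokens=384,
)

# Example with a Hugging Face causal LM (model/tokenizer loading assumed):
# outputs = model.generate(**tokenizer(prompt, return_tensors="pt"),
#                          **response_sampling_kwargs)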

C.3 Data Statistics
Dataset Type     Usage           # instances
Demonstration    Step 2 (SFT)    19,752
Demonstration    Step 3 (RLSF)

Table 6: Statistics of the synthetic dataset generated by our framework. Each instance of the comparison dataset consists of a prompt (input query), a chosen response, and a rejected response. Each instance of the demonstration dataset consists of a prompt and response pair.

D More Training Details

D.1 RM
We train our reward model for 1 epoch with a learning rate of 1e-5, a batch size of 64, and a maximum sequence length of 1024. LLaMA-7B is employed as the initial checkpoint of the reward model (Touvron et al., 2023). We implement the training based on PyTorch Fully Sharded Data Parallel and the Transformers library (Wolf et al., 2020).

D.2 SFT
We use the same training configuration as Alpaca-7B (Touvron et al., 2023; Taori et al., 2023): a batch size of 128, a learning rate of 2e-5, and a maximum sequence length of 512 for 3 epochs of training. However, we do not follow the input template of Alpaca. Instead, we use the prefixes of Bai et al. (2022a), 'Human:' for the input query and 'Assistant:' for the assistant's response, as in the examples of Appendix A.
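The demonstration formatting can be sketched as below; the exact whitespace between turns is an assumption for illustration.

def format_demonstration(query, response):
    """Format a single-turn demonstration with the Bai et al. (2022a) prefixes."""
    return f"Human: {query}\n\nAssistant: {response}"

print(format_demonstration("Is it important to add salt when making spaghetti?",
                           "Yes, salting the water enhances the flavor..."))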

D.3 PPO
PPO models are trained over 80,000 episodes using 20,000 distinct prompts. The batch size for each iteration is 512, with a minibatch size of 32. We train on the same sample for four inner epochs. For rollouts, we limit the maximum number of tokens per model response to 128. The sampling temperature is 1. The PPO clip ratio is set to 0.2, and the discount factor is set to 1. We set a KL reward coefficient λ = 0.05. We use the AdamW optimizer with an initial learning rate of 1e-6, β1 = 0.9, and β2 = 0.95. We use a cosine LR schedule with a minimum learning rate of 8e-7. Normalization is not applied to the reward score. For PPO, we adopt the implementation of trlX.
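For reference, the hyperparameters above can be collected in a plain configuration dictionary; this is a summary only and is not the trlX configuration schema.

# PPO training hyperparameters from this section, gathered in one place.
ppo_config = {
    "total_episodes": 80_000,
    "num_prompts": 20_000,
    "batch_size": 512,
    "minibatch_size": 32,
    "inner_epochs": 4,            # PPO epochs per batch of rollouts
    "max_response_tokens": 128,
    "sampling_temperature": 1.0,
    "clip_ratio": 0.2,
    "discount_factor": 1.0,
    "kl_coef": 0.05,              # lambda in the RLSF objective
    "optimizer": "AdamW",
    "learning_rate": 1e-6,
    "adam_betas": (0.9, 0.95),
    "lr_schedule": "cosine",
    "min_learning_rate": 8e-7,
    "reward_normalization": False,
}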

E Evaluation on Vicuna Questions

E.1 Details of Human Evaluation
We recruit three different workers for each A/B test. We pay about $0.44 for each instance, i.e., a worker gets about $35 for evaluating 80 questions. As a result, we recruit 3 (workers) * 4 (tests) = 12 workers for our human evaluation. We provide detailed instructions via the interface, as in Figure 9, for the workers who participate in our human study. Specifically, we request the workers to judge the better response by relying on its actual contents, not its style (Gudibande et al., 2023). We ask the workers to choose among A, Tie (Both good), Tie (Both bad), or B while we randomize the order of the two responses. We combine the results of Tie (Both good) and Tie (Both bad) into Tie when we report the final evaluation result. We also measure inter-rater agreement among the workers to validate that the evaluation is well conducted. Table 7 shows the results. We find the agreement differs somewhat according to the pairs, but the overall agreement is rated as moderate. Moreover, we get a fair agreement between the decisions of the author of this paper and one of the workers in the ALMoST-PPO vs. Alpaca test.

Table 8: Evaluation results of zero-shot MMLU and LAMBADA to inspect the 'alignment tax' of our models. We find that ALMoST-PPO with 7B parameters experiences the phenomenon as reported in Bai et al. (2022a).

F Alignment Tax
Askell et al. (2021) and Bai et al. (2022a) demonstrate the phenomenon of the "alignment tax": aligning LLMs sacrifices some other innate abilities, resulting in weaker performance compared to unaligned vanilla models. We conduct two zero-shot NLP evaluations, Massive Multitask Language Understanding (MMLU) and LAMBADA (Hendrycks et al., 2020; Paperno et al., 2016), to investigate the alignment tax. MMLU contains 57 multiple-choice tasks in various domains requiring elementary or expert knowledge, and LAMBADA is a word prediction task that measures the linguistic abilities of the models. In Table 8, our RLSF training (PPO) deteriorates performance on both datasets compared to the vanilla LLM, LLaMA-7B, which is a similar observation to Bai et al. (2022a), who explain that models smaller than 10 billion parameters often experience severe deterioration with alignment learning.

Prompt: Can you help me figure out how to create an online store for my small business?

Shopify
-Shopify is a fully hosted online store with a variety of different packages to choose from. Pricing starts at $9/month for a basic store with 2 staff accounts, up to $299/month for a store with unlimited products and unlimited bandwidth. The Shopify dashboard makes it easy to customize and design your store, and you can accept payments via credit card, PayPal, and more.

Prompt:
Are business intelligence and analytics the same thing?
Self-Play: Yes. Business Intelligence (BI) and Analytics are two terms that are used interchangeably, and essentially mean the same thing. The two terms are often used in the context of business intelligence applications (also known as business analytics tools), which provide insights into a company's performance, and help business owners and executives make better decisions based on hard facts. These tools can provide a deeper understanding of various aspects of a business, including sales, performance, costs, profitability, and customer behavior.

RMSP: No, they are not the same.
Analytics is the discovery and communication of meaningful patterns in data. It is a multidisciplinary field that uses many tools and techniques from computer science, statistics, and mathematics. The goal of analytics is to use these techniques to find insights that are hidden in data.
Business intelligence is the process of collecting data to better understand business operations and performance. It is used to make more informed decisions, and the primary objective of BI is to allow people to more easily understand large amounts of data and see trends in business activities.
Business intelligence uses analytics as one of the techniques to help extract insights and knowledge from data. However, the two terms are not interchangeable. In summary, analytics is one of the techniques used to make better business decisions, whereas business intelligence is the process of using data and analytics to make better business decisions.
Prompt: What if the Internet had been invented during the Renaissance period?
ALMoST-7b (Ours)
During the renaissance, scholars were interested in collecting and preserving information. They would copy documents by hand, often multiple times to ensure accuracy and longevity of their work. This meant that many important works survived this way, but access was limited to a small number of people who could read them.
The internet would have drastically changed how scholarship developed during the renaissance. It would have enabled researchers and academics to share ideas quickly across borders, and make resources such as rare books available to anyone with an internet connection. While some forms of research would be lost due to the ease with which they can now be reproduced, it would have made new areas of study far more accessible than they otherwise might have been.

Vicuna-7b
It's difficult to say exactly how the Internet would have developed if it had been invented during the Renaissance period, as it would have likely been used for different purposes and in different ways than it is today. However, it's likely that the Internet would have been used primarily for communication and information sharing among a small group of people, rather than the global network it is today. During the Renaissance, communication and information sharing were important, but the technology available at the time was limited. The printing press had just been invented, and it revolutionized the spread of information, but it was still a slow and labor-intensive process. Mail service was also slow and unreliable, so people relied on word of mouth and local gatherings to share news and ideas. If the Internet had been invented during the Renaissance, it would have likely been used to facilitate communication and information sharing among the ruling classes, and possibly among scholars and scientists as well. It might have been used for things like sharing news and current events, exchanging letters and documents, and accessing information on a wide range of topics. However, it would not have the same level of interactivity and global reach that it has today.

Alpaca-7b
If the Internet had been invented during the Renaissance period, it likely would have been used primarily by academics and scholars to share ideas, research, and knowledge. Instead of being used for entertainment and communication, it would have been used primarily for the advancement of art and science.

Dolly-v2-7b
It is likely that the Internet would have been invented during the Renaissance period. In the 14th century, Johannes Gutenberg invented the movable type printing press which allowed for the wide-scale production and distribution of books. The Internet is based on the transfer of electronic data which is printed, copied, and distributed in the same way as books. In the 16th century, Francesco Colasassi built the first experimental mechanical computer. This computer used gears, rods, and springs to perform simple arithmetic operations. It was not much more than a sophisticated toy, but it demonstrated the feasibility of using mechanical machines to perform arithmetic calculations. In the 18th century, Blaise Pascal invented the first mechanical digital computer. This computer used levers, rods, and springs to perform arithmetic operations. It was not much more than a sophisticated toy, but it demonstrated the feasibility of using mechanical machines to perform arithmetic calculations. By the 20th century, electronics had developed sufficiently that it was possible to use electronics to build a digital computer. The first electronic digital computers were constructed in the 1940s during World War II. These computers were based on analog electronics and were not very powerful. In the 1970s, however, the development of integrated circuit technology allowed electronic computers to become much more powerful. These computers were based on digital electronics and were capable of performing extremely complex operations. Today, the Internet is composed of thousands of digital computers linked together to form a vast network. These computers are capable of performing extremely complex operations and are capable of communicating with each other using an incredibly fast data transmission method called light. The Internet is based on the transfer of electronic data, which is printed, copied, and distributed in the same way as books. The Internet is therefore based on the inventions made during the Renaissance period.
Figure 13: An example of responses from same-sized models in Vicuna Questions (Chiang et al., 2023).

I Evaluation Prompts
When we evaluate models on multiple-choice datasets, we utilize the corresponding prompts to elicit the proper answer option, as shown in Tables 14, 15, and 16. We compute the likelihood of each option to select the answer option. Also, we adjust the prompts according to each model's training prompts. Specifically, we utilize the FastChat framework to form the prompt template. Please see github.com/lmsys/FastChat/blob/main/fastchat/conversation.py for more details.
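The option-scoring step can be sketched as follows with a Hugging Face causal LM; the prompt template, model/tokenizer loading, and tokenization boundary handling are assumptions for illustration.

import torch

@torch.no_grad()
def choose_option(model, tokenizer, prompt, options):
    """Pick the answer option with the highest log-likelihood under the model."""
    scores = []
    for option in options:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
        logits = model(full_ids).logits
        # Log-probabilities of the option tokens, conditioned on the prompt.
        option_len = full_ids.shape[1] - prompt_ids.shape[1]
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        target = full_ids[0, 1:]
        option_logprob = log_probs[-option_len:].gather(
            1, target[-option_len:].unsqueeze(1)).sum()
        scores.append(option_logprob.item())
    return options[scores.index(max(scores))]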

Static HHH Alignment
Human: Which response is more helpful, harmless, and honest from a given context?

Figure 2: Overview of our proposed framework for alignment learning of LLMs. Step 1: We first conduct reward modeling with a synthetically generated comparison dataset (synthetic feedback). Step 2: The demonstration dataset is generated by simulation with the guidance of the reward model, and a supervised policy is trained on the synthetic demonstrations. Step 3: We further optimize the model against the reward model with reinforcement learning.

Figure 3: (a) Human evaluation and (b) GPT-4 evaluation results within 7B models on Vicuna Questions (Chiang et al., 2023). It shows the percentage of wins, ties, and losses of ALMoST-PPO against other models.

Figure 7: A nested pie chart for the initial queries in our dataset. It shows the top 20 root verbs and the corresponding top 4 nouns for each verb. It is plotted in the same way as in Wang et al. (2022).

Figure 8: All results of GPT-4 evaluation on Vicuna Questions (Chiang et al., 2023). Considering the positional bias of the GPT-4 evaluation, we ask the same instance twice by switching the position of the response pairs. Results are reported from the perspective of our PPO model against the other baselines. Consistently, our model outperforms other open-source aligned models except for Vicuna and ChatGPT.

Figure 9: The detailed instructions and corresponding web interface used in the human evaluation.

Table 2: GPT-4 evaluation results on Vicuna Questions (Chiang et al., 2023) of the prompted response generators in various configurations compared to Alpaca-7B (Taori et al., 2023). It is based on the same evaluation setup as in Section 3. W, T, and L indicate the numbers of wins, ties, and losses of each generator against Alpaca-7B.

Table 4: Ablation results of post validation in the synthetic comparison generation. The Helpful-base split is used for the RM evaluation.

Aligning LLMs with Human Values: Conditioning language models on human values (Askell et al., 2021; Korbak et al., 2023; Liu et al., 2023) was found to improve models' capabilities of generating human-aligned text.