Tuna: Instruction Tuning using Feedback from Large Language Models

Instruction tuning of open-source large language models (LLMs) like LLaMA, using direct outputs from more powerful LLMs such as Instruct-GPT and GPT-4, has proven to be a cost-effective way to align model behaviors with human preferences. However, the instruction-tuned model has only seen one response per instruction and thus lacks knowledge of potentially better responses. In this paper, we propose finetuning an instruction-tuned LLM using our novel \textit{probabilistic ranking} and \textit{contextual ranking} approaches to increase the likelihood of generating better responses. Probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher LLM. Contextual ranking, on the other hand, allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs. Furthermore, we apply probabilistic ranking and contextual ranking sequentially to the instruction-tuned LLM. The resulting model, which we call \textbf{Tuna}, consistently improves performance on Super Natural Instructions (119 test tasks), LMentry (25 test tasks), and Vicuna QA, and can even obtain better results than several strong reinforcement learning baselines. Our code and data are available at \url{https://github.com/microsoft/LMOps}.


Introduction
Large language models (LLMs) have made significant progress by scaling up model size and data size (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020; OpenAI, 2023) for unsupervised pre-training and subsequently applying reinforcement learning from human feedback (RLHF) to align model responses with human preferences (Christiano et al., 2017; Ouyang et al., 2022). More recently, instruction tuning with the Self-Instruct algorithm (Wang et al., 2022a) has emerged as a cost-effective method for aligning with human preferences. In this approach, open LLMs like LLaMA (Touvron et al., 2023) can be finetuned on instruction-following data generated by OpenAI GPT models using the Self-Instruct algorithm. The Alpaca model (Taori et al., 2023) exemplifies this technique, enabling close alignment with human preferences while reducing dependence on human-labeled data.

Figure 1: The finetuning process using probabilistic ranking (top), contextual ranking (middle), and a combination of both (bottom).
However, instruction tuning offers only a broad guideline for the base LLMs to transition from "next token prediction" to a more interactive, instruction-following style. As a result, the model may learn some superficial features or styles from the instruction data but still lacks a deeper understanding of what constitutes a preferred response. For instance, when given a question like "Give three tips for staying healthy", a base LLM may generate fluent yet undesirable continuations, while an instruction-tuned LLM could offer three general tips. Humans might prefer more detailed tips over general ones, but such tips are less likely to be sampled since they have lower likelihood under the current model distribution. This can be attributed to the fact that they are either unseen during instruction tuning or hard to sample due to exposure bias (Ranzato et al., 2015).
To address this, we propose further finetuning of an instruction-tuned LLM to discern the quality of multiple responses more precisely, using our novel probabilistic ranking (Sec. 2.2; Fig. 1 top) and contextual ranking (Sec. 2.3; Fig. 1 middle) approaches. Probabilistic ranking enables the instruction-tuned LLM to inherit the high-quality and low-quality responses as well as their relative rankings from the teacher LLM (e.g., text-davinci-003). In contrast, contextual ranking aims to re-balance the instruction-tuned model's own response distribution with the help of stronger LLMs (e.g., GPT-4), mitigating the exposure bias issue. We apply probabilistic ranking and contextual ranking sequentially to an instruction-tuned model, i.e., Alpaca (Taori et al., 2023), resulting in a model called Tuna (Sec. 2.4; Fig. 1 bottom). We evaluate Tuna on various benchmarks, including Super Natural Instructions (Wang et al., 2022b), which contains 119 diverse test tasks; LMentry (Efrat et al., 2022), comprising 25 tasks to assess the basic capabilities and robustness of LLMs; and Vicuna QA (Chiang et al., 2023), which evaluates the model's ability to answer a diverse set of questions with the assistance of GPT-4. Experimental results demonstrate that the Tuna model not only consistently outperforms the standard instruction-tuned models on all benchmarks, but also surpasses several strong RLHF baselines (Ouyang et al., 2022).
To summarize, our contributions are as follows:
• We propose probabilistic ranking and contextual ranking, which enable the instruction-tuned model to distinguish high-quality from low-quality responses and assign higher probability to the former accordingly.
• The Tuna model, obtained by sequentially applying probabilistic ranking and contextual ranking to an instruction-tuned LLM, achieves better results than several strong baselines, including RLHF models.
• Our model, data, and code will be released to facilitate future research.

Methodology
In this section, we describe how to obtain our Tuna model using the feedback from LLMs. We first describe vanilla instruction tuning. We then introduce our probabilistic ranking and contextual ranking approaches. Lastly, we describe how to integrate both ranking approaches.

Instruction Tuning
LLMs like GPT-3 (Brown et al., 2020) have been trained on a massive text corpus using maximum likelihood estimation (MLE):

$$\mathcal{L}_{\text{MLE}}(\theta) = -\sum_{t} \log p(y_t \mid y_{<t}; \theta), \quad (1)$$

where $\theta$ represents the parameters of the base model. The pre-training objective function compels the model to predict the next token $y_t$ given its prefix $y_{<t} = [y_0, y_1, \dots, y_{t-1}]$. A sufficiently-trained LLM can generate fluent continuations given almost any prefix. However, the generated continuations may not align well with human preferences.
As the primary goal of an LLM is to assist humans, it becomes essential to encourage the generation of content that follows human instructions and aligns with human preferences. The current dominant approach to enhance LLMs' instruction-following ability is called instruction tuning (Mishra et al., 2021; Wei et al., 2022; Taori et al., 2023), which finetunes the base LLMs in a supervised manner on instruction-response pairs $\{i, r\}$ (where $i$ is an instruction and $r$ is its response) using MLE:

$$\mathcal{L}_{\text{MLE}}(\theta') = -\sum_{t} \log p(r_t \mid r_{<t}, i; \theta'), \quad (2)$$

where $\theta'$ represents the parameters of the instruction-tuned model. After instruction tuning, we expect the model distribution $p(\cdot \mid i; \theta')$ to allocate higher probabilities to proper responses like $r$ rather than undesirable continuations. Note that the responses in instruction-response pairs can either be annotated by humans or generated by strong LLMs, such as Instruct-GPT or GPT-4 (Wang et al., 2022a). A prevalent and cost-effective approach for generating instruction tuning data is the Self-Instruct algorithm (Wang et al., 2022a). Specifically, it uses a strong LLM, e.g., text-davinci-003, to create instructions based on a few seed instructions, and then generates a single response for each instruction using the same LLM.
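The instruction-tuning objective is ordinary teacher-forced MLE, conditioned on the instruction. As a minimal toy sketch (not the paper's implementation; the lookup-table "model" and its probabilities are illustrative), the sequence-level negative log-likelihood can be computed from per-token conditional probabilities:

```python
import math

def sequence_nll(cond_probs, tokens):
    """Teacher-forced NLL of a response: L = -sum_t log p(y_t | y_<t).
    In instruction tuning, the prefix would also include the instruction i;
    here cond_probs is a toy lookup from prefix tuples to next-token
    distributions."""
    nll = 0.0
    for t, tok in enumerate(tokens):
        prefix = tuple(tokens[:t])
        nll -= math.log(cond_probs[prefix][tok])
    return nll

# Toy conditional distributions: p("a") = 0.5, p("b" | "a") = 0.8.
cond_probs = {
    (): {"a": 0.5, "b": 0.5},
    ("a",): {"b": 0.8, "c": 0.2},
}
loss = sequence_nll(cond_probs, ["a", "b"])  # -log(0.5) - log(0.8) = log(2.5)
```

Minimizing this loss pushes probability mass toward the observed response tokens, which is exactly what Eq. 2 does at scale.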

Probabilistic Ranking
Instruction tuning with the data generated by the Self-Instruct algorithm is essentially a form of sequence-level distillation (Kim and Rush, 2016). The rationale behind this class of distillation methods is that current commercial LLMs have significantly better capabilities than their open-source counterparts. Instead of learning from the single-response data, our probabilistic ranking approach leverages the relative rankings of multiple responses based on the teacher model's probabilities for better pseudo-label distillation (see Fig. 1 top).
Let $r$ denote the original response for instruction $i$ in the instruction tuning dataset. We query strong (teacher) LLMs, such as text-davinci-003, to generate $N$ new responses for $i$. Let $r^{(0)}, r^{(1)}, \dots, r^{(N-1)}$ denote these new responses, and $p(r^{(0)}|i), p(r^{(1)}|i), \dots, p(r^{(N-1)}|i)$ denote their probabilities. While the teacher LLMs are expected to produce responses of comparable quality on average, there will inevitably be some variation in the quality of the generated responses. This inherent variability manifests itself in various aspects, such as differences in accuracy (Wang et al., 2023a), response length, and level of detail provided (Wang et al., 2023b).
Intuitively, if a model is perfectly distilled, the relative probabilities assigned to two samples should be the same as those of the teacher model. Specifically, let $p(r^{(j)}|i; \theta')$ and $p(r^{(k)}|i; \theta')$ denote the probabilities of $r^{(j)}$ and $r^{(k)}$ w.r.t. the student model. If $p(r^{(j)}|i) > p(r^{(k)}|i)$, then $p(r^{(j)}|i; \theta') > p(r^{(k)}|i; \theta')$. We use the following length-normalized log-likelihood as the teacher model quality score to account for differences in response lengths:

$$s(i, r^{(k)}) = \frac{\log p(r^{(k)} \mid i)}{|r^{(k)}|^{\beta}}, \quad (3)$$

where $|r^{(k)}|$ is the length of $r^{(k)}$ and $\beta$ represents the length penalty.
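Eq. 3 is straightforward to compute from the per-token log-probabilities returned by the teacher. A small sketch with illustrative values (β = 1.3 as used in the experiments; the two responses are hypothetical):

```python
def quality_score(token_logprobs, beta=1.3):
    """Length-normalized log-likelihood, a sketch of Eq. 3:
    s(i, r) = sum_t log p(r_t | i, r_<t) / |r|**beta.
    A beta > 1 softens the natural advantage of very short responses."""
    return sum(token_logprobs) / (len(token_logprobs) ** beta)

# Two hypothetical responses: a short one and a longer, slightly less
# certain one per token.
short = [-0.5, -0.5]   # total log-prob -1.0 over 2 tokens
long_ = [-0.6] * 6     # total log-prob -3.6 over 6 tokens
s_short = quality_score(short)
s_long = quality_score(long_)  # higher: the length penalty rewards detail
```

With plain (β = 1) normalization the short response would win; raising β lets longer, more detailed responses score higher, matching the motivation discussed in Appendix A.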
We then rank these responses in decreasing order based on $s(i, r^{(k)})$. The resulting instruction-response pairs become $\{i, r, (r^{[0]}, \dots, r^{[N-1]})\}$, where $i, r$ are from the original instruction tuning data, and $r^{[j]}$ is considered to have better quality than $r^{[k]}$ if $j < k$. Once we obtain the ranked responses, we can encourage our model to learn from these rankings using a pairwise ranking objective, which has been successfully employed in previous work (Zhong et al., 2020; Liu et al., 2022; Zhang et al., 2022; Zhao et al., 2023). The ranking objective function is as follows:

$$s(i, r^{[k]}; \theta') = \frac{\log p(r^{[k]} \mid i; \theta')}{|r^{[k]}|^{\beta}}, \quad (4)$$

$$\mathcal{L}_{\text{rank}} = \sum_{j < k} \max\left(0,\ m - s(i, r^{[j]}; \theta') + s(i, r^{[k]}; \theta')\right), \quad (5)$$

where $s(i, r^{[k]}; \theta')$ is the length-normalized log-likelihood of $r^{[k]}$ under the student model and $m > 0$ is the margin hyper-parameter. The ranking loss, $\mathcal{L}_{\text{rank}}$, aims to teach the model to distinguish good responses from bad ones based on the teacher LLM's perspective. In addition to $\mathcal{L}_{\text{rank}}$, we also apply a cross-entropy loss on the original response as regularization:

$$\mathcal{L} = \mathcal{L}_{\text{rank}} + \lambda \mathcal{L}_{\text{MLE}}, \quad (6)$$

where $\mathcal{L}_{\text{MLE}}$ is computed on the original response $r$, and $\lambda > 0$ controls the importance of $\mathcal{L}_{\text{MLE}}$, which helps prevent over-optimization of the ranking loss.
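The pairwise margin loss can be sketched directly from the description above. This is a BRIO-style formulation written for illustration; the paper's exact form (e.g., per-pair margin scaling) may differ:

```python
def rank_loss(scores, margin=0.1):
    """Pairwise margin ranking loss over student scores for responses
    already sorted best-first. For every pair j < k (r[j] ranked above
    r[k]), penalize the student when score(r[j]) does not exceed
    score(r[k]) by at least the margin. The full objective would add
    lambda * MLE loss on the original response as regularization."""
    loss = 0.0
    n = len(scores)
    for j in range(n):
        for k in range(j + 1, n):
            loss += max(0.0, margin + scores[k] - scores[j])
    return loss

# Hypothetical length-normalized student scores for 3 ranked responses.
perfect = rank_loss([-0.2, -0.5, -0.9])   # already well separated -> 0
violated = rank_loss([-0.9, -0.5, -0.2])  # order inverted -> positive loss
```

When the student's scores respect the teacher's ranking by at least the margin, the loss vanishes; inverted pairs contribute linearly in the score gap.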
After learning with probabilistic ranking, the model can better assign probabilities to superior and inferior responses.

Contextual Ranking
During the instruction tuning or the probabilistic ranking stage, the model is finetuned to generate a good $r$ given an instruction $i$. However, given the same $i$ during inference, the model may still generate a relatively low-quality response $r'$. This is related to the exposure bias problem (Ranzato et al., 2015), where the model fails to generate $r$ due to accumulated errors during the auto-regressive generation process. To address this issue, we use our contextual ranking approach to refine the distribution of responses generated by the model itself, assigning higher probabilities to better responses with the help of strong LLMs (Fig. 1 middle), thus alleviating exposure bias (Ranzato et al., 2015).
For each instruction, we first sample $N$ responses from the instruction-tuned model itself, i.e., $r^{(0)}, r^{(1)}, \dots, r^{(N-1)} \sim p(\cdot \mid i; \theta')$. We want the samples to be diverse enough that better responses are more likely to appear among the sampled results.
To ensure diversity, we impose a constraint on the ROUGE-L (Lin, 2004) score between each pair of responses, requiring it to be less than a threshold $\tau$. If the ROUGE-L score exceeds $\tau$, we increase the sampling temperature and resample another response. If multiple trials still result in a ROUGE-L score above $\tau$, we retain the least similar response from the trials. After obtaining $N$ responses, we leverage the contextual understanding ability of commercial LLMs, such as GPT-4 (OpenAI, 2023), to rank them based on various aspects. The ranking process consists of multiple steps. First, we ask GPT-4 to assess whether the instruction requires an open-ended answer (e.g., story generation) or a close-ended answer (e.g., solving a math problem). We then request GPT-4 to generate its own response as a reference. Next, GPT-4 compares the reference response with the $N$ responses from different aspects and assigns scores to each response. For open-ended instructions, GPT-4 evaluates relevance (score 0-5), level of details/justification (score 0-5), and accuracy (score 0-5) of the model responses compared to its reference response. For close-ended instructions, the evaluation criteria are accuracy (score 0-5), level of details/justification (score 0-5), and clarity (score 0-5). Finally, GPT-4 ranks responses in decreasing order based on the sum of their scores (see Appendix E for our complete prompt). We also manually evaluated GPT-4 rankings, which achieve a strong correlation with human judgements (see Appendix G, H).
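The diversity-constrained sampling loop can be sketched as follows. Here `sampler` is a hypothetical stand-in for decoding from the model at a given temperature, and ROUGE-L is implemented as a plain LCS-based F1; the real pipeline's thresholds and retry counts may differ:

```python
def rouge_l(a, b):
    """ROUGE-L F1 between two token lists via longest common subsequence."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / n, lcs / m
    return 2 * p * r / (p + r)

def sample_diverse(sampler, n=4, tau=0.8, max_trials=3):
    """Collect n responses whose pairwise ROUGE-L stays below tau.
    On a violation, raise the temperature by 0.1 and resample; after
    max_trials failures, keep the least similar candidate seen."""
    kept, temp = [], 1.0
    while len(kept) < n:
        best, best_sim = None, None
        for _ in range(max_trials):
            cand = sampler(temp)
            sim = max((rouge_l(cand, k) for k in kept), default=0.0)
            if best is None or sim < best_sim:
                best, best_sim = cand, sim
            if sim < tau:
                break
            temp += 0.1
        kept.append(best)
    return kept

# Mock sampler that yields a duplicate first, forcing one resample.
pool = [["x", "y"], ["x", "y"], ["a", "b"]]
result = sample_diverse(lambda t: pool.pop(0), n=2, tau=0.5)
```

In the mock run, the second draw duplicates the first, so the loop bumps the temperature and accepts the third, dissimilar candidate instead.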
As in Sec. 2.2, the resulting instruction tuning dataset becomes $\{i, r, (r^{[0]}, \dots, r^{[N-1]})\}$. Note that each $r^{[k]}$, $0 \le k \le N-1$, is derived from the instruction-tuned model itself. Lastly, we use the same objective function as in Eq. 6 to encourage the model to assign higher probabilities to better responses.

Integrating Probabilistic and Contextual Ranking
Given an instruction-tuned model, there are several options for further finetuning: 1) learning with probabilistic ranking alone; 2) learning with contextual ranking alone; 3) learning with probabilistic ranking followed by contextual ranking (see Fig. 1 bottom).We refer to the models finetuned with these three methods as Tuna p , Tuna c , and Tuna, respectively.
To optimally integrate both probabilistic ranking and contextual ranking techniques, it is recommended to first obtain a Tuna p model, followed by applying contextual ranking to Tuna p 's response distribution, resulting in the Tuna model. There are two reasons for this choice. First, although it is beneficial to learn the ranking of different responses from the teacher LLM's perspective (probabilistic ranking), the model might not fully capture the teacher's ranking knowledge due to its limited capacity. Second, contextual ranking enables the model to better adapt to its own capacity by working with the model's own generations. By generating its own responses, the model can refine its understanding with the help of stronger LLMs and more effectively produce responses that are both closer to human preferences and compatible with its capacity constraints, alleviating the exposure bias issue (Ranzato et al., 2015).

Model and Data
In our experiments, we use a 7B LLaMA model (Touvron et al., 2023) as the base model. The instruction tuning data is sourced from Alpaca (Taori et al., 2023), which consists of 52K instructions paired with responses generated by text-davinci-003 using the Self-Instruct algorithm (Wang et al., 2022a). We perform instruction tuning on the 52K Alpaca data using the recommended hyperparameters, such as a learning rate of 2e-5 and the AdamW optimizer (0.9, 0.999) (Loshchilov and Hutter, 2019). For simplicity, we also refer to the instruction-tuned model as Alpaca.
For probabilistic ranking, we input the 52K instructions from the Alpaca dataset into text-davinci-003 to produce N = 4 responses per instruction along with their log-likelihoods, with an inference temperature of 1. We calculate response scores using Eq. 3 with β = 1.3, and rank the responses accordingly. Subsequently, we finetune the Alpaca model for 1 epoch with a learning rate of 1e-5, margin m = 0.1, and cross-entropy regularizer weight λ = 1.0. We denote the model trained exclusively with probabilistic ranking as Tuna p .
For contextual ranking, we sample N = 4 responses from the Alpaca model with temperature T = 1 for each instruction. To avoid similar generations, we ensure that the pairwise ROUGE-L (Lin, 2004) score between responses stays below the threshold τ; otherwise, we remove the similar response, increase the temperature by 0.1, and resample. If three trials fail to produce sufficiently unique responses, we keep the least similar one. We then employ GPT-4 to rank responses for the first 13K instructions, with the GPT-4 inference temperature set to 0. The contextual ranking prompt is shown in Table 9. The finetuning hyperparameters follow those of probabilistic ranking. We refer to the model trained on the 13K contextual ranking data of the Alpaca model as Tuna c . Furthermore, we use the 13K GPT-4 ranking data to train a proxy ranking model (PRM) based on StableLM-3B. The PRM is employed to re-rank Alpaca's responses on the full 52K instructions. We refer to the Alpaca model trained with the 52K ranking data generated entirely by the PRM as Tuna c (PRM).
Lastly, we also collect 13K GPT-4 contextual ranking data based on Tuna p 's responses instead of Alpaca's. We refer to the Tuna p model further finetuned on this data as Tuna.

Evaluation
Super Natural Instruction (Super NI) Super NI (Wang et al., 2022b) contains 119 test tasks designed to evaluate a model's cross-task generalization ability. It includes a variety of classification and generation tasks, such as textual entailment and title generation. We report both 0-shot and 2-shot performance, where 0-shot provides only an instruction (referred to as "definition" in their literature) and 2-shot offers two additional positive examples. The evaluation metric for all 119 tasks is ROUGE-L (Lin, 2004), which is strongly correlated with human evaluation, with a Pearson coefficient of 0.998 according to Wang et al. (2022b). Greedy decoding is applied during inference.
LMentry LMentry (Efrat et al., 2022) is a benchmark that primarily focuses on the accuracy and robustness aspects of LLMs' generations. It contains 25 short tasks that are trivial for humans but challenging for LLMs. The final metric is the LMentry score, which is calculated by multiplying the mean accuracy on the 25 tasks with the robustness score. The model is evaluated in a 0-shot manner, and greedy decoding is applied during inference.
Vicuna QA Vicuna QA (Chiang et al., 2023) comprises 80 test questions across 9 categories that measure an LLM's ability to generate relevant, detailed, and accurate responses, and it has been widely adopted in many works. Instead of having a ground truth for evaluation, it conducts pairwise comparisons with the help of GPT-4 (OpenAI, 2023). It prompts GPT-4 to compare the outputs of our models to those of the Alpaca model. We report the win/lose/tie rate against the Alpaca model.
Human Evaluation Additionally, we conduct human evaluations on Vicuna QA. Specifically, responses from five anonymous systems, namely Alpaca, Alpaca + PPO-sim, Tuna, Tuna p , and Tuna c , were randomly shuffled and presented to annotators, who were then asked to rank these outputs. The scoring was designed such that the i-th ranked system receives a score of 6 − i, meaning the best-ranked system receives a score of 5 and the worst-ranked system receives a score of 1. Each question was annotated by two different annotators, and the scores were averaged.
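The rank-to-score conversion is easy to make concrete. A small sketch (the two annotator rankings below are hypothetical, not the paper's actual data):

```python
def rank_to_scores(ranking):
    """Convert one annotator's ranking (best system first) into scores:
    the i-th ranked system gets 6 - i, so rank 1 -> 5 and rank 5 -> 1."""
    return {system: 6 - (i + 1) for i, system in enumerate(ranking)}

def average_scores(rankings):
    """Average per-system scores over multiple annotators' rankings."""
    totals = {}
    for ranking in rankings:
        for system, score in rank_to_scores(ranking).items():
            totals.setdefault(system, []).append(score)
    return {s: sum(v) / len(v) for s, v in totals.items()}

# Two hypothetical annotators ranking the five systems on one question.
ann1 = ["Tuna", "Tuna_p", "Tuna_c", "Alpaca+PPO-sim", "Alpaca"]
ann2 = ["Tuna_p", "Tuna", "Tuna_c", "Alpaca", "Alpaca+PPO-sim"]
avg = average_scores([ann1, ann2])  # e.g., Tuna averages (5 + 4) / 2
```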

Main Results
The main results are presented in Table 1. After instruction tuning, Alpaca demonstrates significant performance improvements over LLaMA on all three benchmarks. This highlights the successful transition from the "next token prediction" paradigm to a more interactive instruction-following paradigm. Furthermore, both contextual and probabilistic ranking enhance performance across all three benchmarks. Specifically, Tuna c exhibits more improvement on the Super NI 2-shot results, while Tuna p performs better on Super NI 0-shot and LMentry, narrowing the performance gap with much larger models like InstructGPT-175B. Since the 2-shot input is longer than the 0-shot input, we conjecture that contextual ranking might be more beneficial for longer sequence generation than probabilistic ranking. On the Vicuna QA benchmark, both Tuna p and Tuna c outperform Alpaca significantly on nearly 70% of the questions, as evaluated by GPT-4. Compared with the RLHF baselines, Tuna p and Tuna c consistently demonstrate superior performance on both the Super NI and LMentry benchmarks. However, on the Vicuna QA benchmark, their performance is marginally lower than that of the RLHF baselines. Moreover, Tuna achieves the best performance on Vicuna QA while maintaining competitive scores on Super NI and LMentry. Human results on Vicuna QA (see Table 2) also confirm that humans prefer the responses from our models.
Furthermore, Tuna c (PRM) demonstrates comparable performance to Tuna c on Vicuna QA and LMentry, but it underperforms both Tuna c and Alpaca on Super NI. This suggests that although the PRM has primarily learned ranking from the GPT-4 contextual ranking data, it also introduces some noise during the learning process. Overall, it is more effective to learn directly from GPT-4 contextual ranking data.

Ablation Study
In this subsection, we delve deeper into the performance of our approach by examining several aspects: (a) the effect of more responses in instruction tuning, (b) the order of applying the two ranking methods, (c) the influence of the cross-entropy regularization, (d) the amount of probabilistic ranking data, and (e) the risks of GPT-4 evaluation.

More Responses in Instruction Tuning
We explore whether Tuna's effectiveness is solely due to the increased response data by examining the impact of adding more responses per instruction during instruction tuning. We create a new model, Alpaca-Mul, by adding four extra responses from the probabilistic ranking dataset to the Alpaca dataset and finetuning the LLaMA model using Eq. 2. The results are presented in Table 3.
Upon evaluation on Super NI, Alpaca-Mul's performance is nearly identical to that of Alpaca but falls short when compared to the 0-shot settings of Tuna p and Tuna. On LMentry, Alpaca-Mul outperforms Alpaca, yet it still does not reach the performance levels of Tuna p and Tuna. Interestingly, on the Vicuna QA task, Alpaca-Mul slightly underperforms compared to Alpaca. These findings suggest that merely adding more responses without differentiating them does not necessarily lead to improved response generation. Overall, the results of Alpaca-Mul indicate that Tuna's superior performance cannot be solely attributed to the availability of more response data.
Integration Order An alternative approach to Tuna involves first training the Tuna c model and subsequently continuing to train it with probabilistic ranking data. The resulting model is referred to as Tuna cp .
We explore various strategies for training Tuna cp : 1) finetuning Tuna c with the first 13K probabilistic ranking data (Tuna cp -13K); 2) finetuning Tuna c with the last 39K probabilistic ranking data (Tuna cp -39K); 3) finetuning Tuna c with the full 52K probabilistic ranking data (Tuna cp -52K). Additionally, we finetune the original Alpaca model with a combination of the 13K GPT-4 contextual ranking data (generated from the Alpaca model's responses) and the last 39K probabilistic ranking data (mix-Tuna-52K). We also finetune the Alpaca model with 52K contextual ranking data (13K GPT-4 contextual ranking + 39K ranking-model-generated data) plus 52K probabilistic ranking data (mix-Tuna-104K). The training details are listed in Appendix C. The results are listed in Table 3.
None of the combination strategies consistently outperform both Tuna p and Tuna c across the Vicuna QA and Super NI benchmarks. On LMentry, however, finetuning Tuna c with probabilistic ranking data is beneficial, especially when no duplicate data is present (Tuna cp -39K), suggesting that probabilistic ranking data are beneficial when high accuracy and robustness are top priorities. Interestingly, Tuna cp is not comparable to Tuna, indicating that the order in which the model is trained with contextual and probabilistic ranking matters. One plausible explanation is that both the original Alpaca data and the probabilistic ranking data are generated by text-davinci-003, while Tuna c has significantly shifted the model distribution by re-ranking the Alpaca model's responses, making it challenging to finetune Tuna c with probabilistic ranking data again.
The Effect of the Cross-Entropy Regularizer We examine the influence of the weight λ of the cross-entropy regularizer in Eq. 6 on performance by varying λ across different values, {0, 0.1, 1, 5, 10}, while training the Tuna c model. Fig. 2 illustrates that as λ increases, the performance on accuracy-oriented benchmarks such as Super NI and LMentry improves, while the performance on open questions does not necessarily follow the same trend. On one hand, this finding suggests that with a small λ, learning with contextual ranking may induce long and detailed answers, but those answers are not always accurate. On the other hand, it implies that accuracy-oriented benchmarks and open QA benchmarks are complementary, and researchers should consider more diverse test cases to thoroughly evaluate a model (Wang et al., 2023b).

The Amount of Probabilistic Ranking Data
We investigate the impact of varying the amount of probabilistic ranking data used for finetuning the Tuna p model by testing different data sizes, i.e., {0, 13000, 24000, 52000}, where 0 corresponds to the Alpaca model. The results, shown in Fig. 3, reveal that for probabilistic ranking, 13K data points are sufficient for Super NI and LMentry, while Vicuna QA requires 24K data points. We conjecture that this saturation phenomenon can be attributed to two reasons. First, the 52K Alpaca instructions generated by the Self-Instruct algorithm are not diverse enough, as new instructions are produced by text-davinci-003 using prompt instructions sampled from a limited seed task pool. Second, instruction tuning itself may only require a limited amount of data to perform behavior cloning, as discussed in Zhou et al. (2023). Thus, we can further reduce the cost of probabilistic ranking data generation by half.
The Risks in GPT-4 Evaluation We present evidence that evaluating a model on open QA with the help of GPT-4 may be risky. Table 4 displays the ranking length of our proxy ranking model (PRM). It shows that the PRM has inherited GPT-4 ranking's bias towards longer outputs (Li et al., 2023). However, as discussed in Sec. 3.3, the data generated by the PRM is not as good as the original 13K contextual ranking data, as assessed by more targeted automatic evaluations like Super NI and LMentry. Despite the inferior quality of the PRM-generated data, the performance on Vicuna QA remains almost unaffected (see Tuna c (PRM) in Table 1). This observation suggests that evaluating LLMs on open QA with GPT-4 may not always be as accurate as it appears, echoing the findings of Wang et al. (2023b). It highlights the need for more representative test questions or additional targeted benchmarks for evaluation.
Ranking Loss Learning through re-ranking sequence-level outputs has been studied in sequence-to-sequence models (Wiseman and Rush, 2016; Edunov et al., 2018; Liu et al., 2022; Zhang et al., 2022). The BRIO and MoCa algorithms (Liu et al., 2022; Zhang et al., 2022) adopt a pairwise ranking loss to guide the model to generate summaries with higher ROUGE scores (Lin, 2004). In this paper, we use GPT-4's (OpenAI, 2023) strong contextual understanding ability and text-davinci-003's (Ouyang et al., 2022) intrinsic probability measures for ranking. In parallel with our work, Yuan et al. (2023) also propose a pairwise ranking loss for finetuning LLMs. Key differences include: 1) our pipeline finetuning strategy; 2) our focus on ranking the model's own responses; 3) our use of the original response for cross-entropy regularization, while they select the highest-reward response. Additionally, Liu et al. (2023c) also employ GPT models for finetuning BART (Lewis et al., 2019) on the summarization task.

Conclusion
In this paper, we propose to finetune an instruction-tuned LLM using our probabilistic ranking approach (Tuna p ), contextual ranking approach (Tuna c ), and a combination of both (Tuna). Our comprehensive experiments demonstrate consistent performance improvements across three benchmarks: Super Natural Instructions (119 test tasks), LMentry (25 test tasks), and Vicuna QA. Furthermore, our methods outperform popular reinforcement learning from human feedback baselines that rely on the proximal policy optimization algorithm. These findings underscore the effectiveness of our approach in enhancing the performance of instruction-tuned LLMs and pave the way for future research in this area.

Limitations
Despite the promising results achieved by our Tuna model, there are several limitations that should be acknowledged. The first limitation is GPT-4 ranking inconsistency. In our experiments, we relied on GPT-4 for contextual ranking, which may introduce bias due to inconsistency in its ranking performance. As a powerful LLM, GPT-4 is generally expected to provide accurate and reliable rankings; however, it may still be sensitive to the phrasing or structure of prompts (Dubois et al., 2023). This inconsistency may lead to suboptimal rankings and potentially affect the overall performance of the Tuna model. In future work, it would be beneficial to design more robust prompts that can mitigate the impact of GPT-4's ranking inconsistencies. Another limitation is the evaluation benchmark. In this paper, we evaluated the Tuna model on three benchmarks, which provided a diverse range of tasks and challenges. However, it is unclear how well the Tuna model would generalize to other types of tasks, domains, or languages. Further research is needed to explore the applicability of the Tuna model to a broader range of problems and settings. The last limitation is the reliance on proprietary LLMs, such as GPT-4 and text-davinci-003, for generating responses and rankings. This dependency may limit the accessibility and reproducibility of our method for researchers who do not have access to these proprietary models. Developing alternative methods that can leverage open-source LLMs or other ranking mechanisms would be a valuable direction for future research.
A The Length Penalty β for Probabilistic Ranking Data

In our preliminary experiments, we found that a length penalty of β = 1.3 is able to induce detailed responses, and we validated this choice on LIMA (Zhou et al., 2023).

Instruction: "Why is open-source important? Answer in 3-5 sentences"

Alpaca:
Open-source software is important because it allows for collaboration and innovation. It allows developers to work together to create better software, and it allows users to access the source code and make changes to the software to meet their own needs. Open-source software also encourages transparency and allows users to see how the software works and how it can be improved. Additionally, open-source software is often more secure and reliable than proprietary software, as it has been tested and reviewed by a large number of developers. Finally, open-source software is often free or low-cost, making it accessible to everyone.
Tuna c : Open-source software is important because it allows for collaboration and innovation. It allows developers to work together to improve existing software and create new software, and it allows users to view and modify the source code to customize the software to their own needs. Open-source software also encourages transparency and accountability, as anyone can view and inspect the code to ensure that it is secure and reliable. Additionally, open-source software is often more cost-effective than proprietary software, as users can access the source code and modify it to suit their own needs without having to pay for a license. Finally, open-source software is often more reliable and secure than proprietary software, as it is developed and tested by a large community of developers.
Tuna p : Open-source software is important because it allows developers to access the source code of a program and make changes to it. This allows developers to customize the software to their needs and to fix any bugs or errors that may be present. Additionally, open-source software is often free to use and modify, which makes it more accessible to those who may not be able to afford proprietary software. Open-source software also allows developers to collaborate and share ideas, which can lead to the development of better and more efficient software. Finally, open-source software can be used as a learning tool, as it allows developers to see how the software works and how it is put together. This can help them to better understand programming and coding in general.

Tuna:
Open-source software is important because it allows developers to collaborate on projects, share code, and create better products. By making the source code available to the public, developers can easily collaborate on projects, allowing them to create better products and services. This collaboration also allows developers to quickly identify and fix bugs, as well as add new features and functionality. Additionally, open-source software is often more secure than proprietary software, as it can be more easily audited and reviewed by the public. Finally, open-source software is often more cost-effective than proprietary software, as it does not require licensing fees or other costs associated with closed-source software. Open-source software is an important part of the software industry, and is used by many businesses and individuals.

Figure 2:
The effect of varying the weight λ of the cross-entropy regularization in Eq. 6 on Tuna$_c$. The win/lose/tie rate on Vicuna is computed against Alpaca.

Figure 3:
The effect of varying the number of probabilistic ranking data on Tuna$_p$.
Fourth, compare Response 4 with Response 0/1/2/3/4 and assign each response an overall score on a scale of 0 to 15, where a higher score indicates better overall quality. For an open-ended instruction, please rate based on the relevance (score 0 to 5), level of details/justification (score 0 to 5), and accuracy (score 0 to 5) of each response; for a close-ended instruction, please rate based on the accuracy (score 0 to 5), level of details/justification (score 0 to 5), and clarity (score 0 to 5) of each response. The ratings should have the format: 'Response k: [sum of the 3 individual scores you give to response k]'. Last, rank the responses in decreasing order of their overall scores. The ranking should have the format: 'rank: [i, j, k, l, m]'. If there are duplicate responses, keep only one of them in the rank; that is, the ranking may become 'rank: [i, j, k, l]', 'rank: [i, j, k]', 'rank: [i, j]', or even 'rank: [i]'.
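As a minimal illustration (not part of the paper's released code), the 'Response k: ...' scores and the 'rank: [...]' list specified above could be parsed as follows; the function name is ours, and we assume scores may appear with or without surrounding brackets:

```python
import re

def parse_rating_reply(reply):
    """Parse a rater LLM reply formatted as described above:
    per-response lines 'Response k: <score>' and a final
    'rank: [i, j, k, ...]' list in decreasing order of quality."""
    # Accept both 'Response 0: 12' and 'Response 0: [12]'.
    scores = {int(k): int(s)
              for k, s in re.findall(r"Response\s+(\d+):\s*\[?(\d+)\]?", reply)}
    match = re.search(r"rank:\s*\[([\d,\s]+)\]", reply)
    rank = [int(x) for x in match.group(1).split(",")] if match else []
    return scores, rank

scores, rank = parse_rating_reply(
    "Response 0: 12\nResponse 1: 9\nResponse 2: 12\nrank: [0, 2, 1]"
)
# scores == {0: 12, 1: 9, 2: 12}; rank == [0, 2, 1]
```

Returning an empty rank when the pattern is absent lets a caller discard malformed replies rather than crash mid-collection.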

Table 1:
Performance comparison of different models on Super NI, LMentry, and Vicuna QA. The numbers in bold indicate the top-2 results. The numbers in parentheses indicate the performance differences compared to Alpaca.

Table 2:
Human evaluation on Vicuna QA.
* denotes that the model is significantly (p < 0.01) better than Alpaca, while † denotes that Tuna is significantly (p < 0.01) better than other models.

Table 3:
Different combinations of probabilistic ranking data and contextual ranking data. The numbers in bold represent the top-2 results. The numbers in parentheses represent the performance difference compared to Alpaca.

Table 4:
The average ranking lengths of contextual ranking data, probabilistic ranking data, and the data generated by the proxy ranking model (PRM).
Super NI test tasks. Specifically, we first obtain Tuna$_p$ models trained on probabilistic ranking data scored with different values of β. Then, we compute the token-level negative log-likelihood (NLL) of the output of each LIMA instance under each Tuna$_p$ model and average it over the whole LIMA training set. The results are shown in Table 5. With β = 1.3, the model achieves the lowest NLL on the LIMA training set; thus, we set β = 1.3 in our experiments.
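The β-selection procedure above can be sketched as follows, assuming the per-token log-probabilities of each LIMA output have already been extracted under each Tuna$_p$ checkpoint; the function names and the toy numbers are illustrative, not from the paper:

```python
def avg_token_nll(token_logprobs):
    """Average negative log-likelihood per token.
    token_logprobs: one list of per-token log-probabilities per example."""
    flat = [lp for example in token_logprobs for lp in example]
    return -sum(flat) / len(flat)

def select_beta(nll_by_beta):
    """Pick the beta whose checkpoint yields the lowest average NLL."""
    return min(nll_by_beta, key=nll_by_beta.get)

# Toy log-probs standing in for LIMA outputs scored under three checkpoints:
nll_by_beta = {
    1.0: avg_token_nll([[-2.1, -1.8], [-2.4]]),
    1.3: avg_token_nll([[-1.5, -1.2], [-1.9]]),
    1.5: avg_token_nll([[-2.0, -2.2], [-2.5]]),
}
best_beta = select_beta(nll_by_beta)  # 1.3
```

Averaging over tokens rather than examples keeps long and short LIMA outputs on the same scale when comparing checkpoints.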

Table 5:
The token-level log-likelihood of the LIMA training set under Tuna$_p$ models trained with probabilistic ranking data scored with different β.

…'Response 4' in answer to the instruction. It needs to have the same format as the other responses and will be used as a reference later. Third, identify whether there are duplicate responses and keep only one of each duplicate for the following steps.

Table 12:
Example responses of different models.