Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Fine-tuning on instruction data has been widely validated as an effective practice for building chat language models like ChatGPT. Scaling the diversity and quality of such data, although straightforward in principle, stands a good chance of yielding improved performance. This paper aims to push the upper bound of open-source models further. We first provide a systematically designed, diverse, informative, large-scale dataset of instructional conversations, UltraChat, which does not involve any human queries. Our objective is to capture the breadth of interactions that a human might have with an AI assistant, and we employ a comprehensive framework to generate multi-turn conversations iteratively. UltraChat contains 1.5 million high-quality multi-turn dialogues and covers a wide range of topics and instructions. Our statistical analysis of UltraChat reveals its superiority in various key metrics, including scale, average length, diversity, and coherence, solidifying its position as a leading open-source dataset. Building upon UltraChat, we fine-tune a LLaMA model to create a powerful conversational model, UltraLM. Our evaluations indicate that UltraLM consistently outperforms other open-source models, including Vicuna, the previously recognized state-of-the-art open-source model. The dataset and the model will be publicly released\footnote{\url{https://github.com/thunlp/UltraChat}}.


Introduction
Large language models (LLMs) (Bommasani et al., 2021; Han et al., 2021; Chowdhery et al., 2022) have demonstrated exceptional generalization capability on a variety of language-related tasks. Notably, ChatGPT (OpenAI, 2022), a version of GPT-3 (Brown et al., 2020) optimized for conversation, along with GPT-4 (OpenAI, 2023), takes the user experience to another level by excelling at comprehending and generating responses in a natural and interactive manner. The introduction of
ChatGPT has spurred a surge in the adoption and implementation of general chat language models. In addition to competing models developed by large corporations, such as Bard and Claude, the open-source community is actively engaged in training similar models, aiming to democratize access to AI technology. Notable examples include Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), Koala (Geng et al., 2023), Baize (Xu et al., 2023b), and Belle (Ji et al., 2023), all demonstrating promising performance. Experimental evidence strongly suggests that chat language models can be effectively trained through instruction fine-tuning (Wei et al., 2021; Sanh et al., 2021), and that many data-efficient (Zhou et al., 2023) or computing-efficient (Hu et al., 2021; Ding et al., 2023) methods can be applied. This paper instead focuses on the "final mile" of chat language models, as evidence shows that the journey from 0 to 60 is easy, whereas progressing from 60 to 100 becomes exceedingly challenging. For instance, researchers have shown that a small, thoughtfully curated set of instructions suffices to train a model with satisfactory instruction-following capabilities. However, these approaches have yet to produce models that surpass the performance of Vicuna, the current leading open-source model, let alone outperform ChatGPT. We believe that the most straightforward factors, namely the quality and diversity of the training data, play a vital role in further improving the performance of chat language models. In other words, leveraging higher-quality and more diverse data can yield better outcomes. To this end, we present UltraChat, a million-scale multi-turn instructional conversation dataset, to facilitate the construction of more powerful chat language models. UltraChat is carefully designed to capture the breadth of interactions that a human might have with an AI assistant. Specifically, we do not use specific tasks like question answering or summarization to construct the data, but instead curate three sectors: Questions about the World, Creation and Writing, and Assistance on Existing Materials. We then employ meta-information, in-context expansion, and iterative prompting to scale up the number of instructions. To construct informative and realistic multi-turn conversations, two separate ChatGPT Turbo APIs are adopted for conversation generation, where one plays the role of the user to generate queries and the other generates responses. We instruct the user model with carefully designed prompts to mimic human user behavior and call the two APIs iteratively.
We fine-tune a LLaMA-13B model on UltraChat to produce UltraLM and compare the model to a wide range of baselines, especially the open-source ones. The evaluation shows that our model consistently outperforms other models. As reported in Table 1, UltraLM achieves the highest performance scores as independently assessed by GPT-4. Further evaluation results on challenging benchmarks and a preference study with GPT-4 on various evaluation sets also show that UltraLM surpasses all other open-source models.

Related Work
Instruction Tuning. Recent works demonstrate LLMs' powerful capabilities in following human instructions. Wei et al. (2021) pioneered instruction tuning by fine-tuning T5 (Raffel et al., 2020) on 60 NLP datasets verbalized with natural language instruction templates. The fine-tuned model exhibits a strong ability in instruction understanding and generalizes well to unseen instructions (Sanh et al., 2021; Ouyang et al., 2022). Later, Longpre et al. (2023) show the benefits of scaling the number of tasks for out-of-distribution generalization. Wei et al. (2021) also conclude that the success of instruction tuning depends on the quality of the dataset and the design of the prompts. To further regulate the tuned model's behavior, Ouyang et al. (2022) and Schulman et al. (2017) propose to employ reinforcement learning to align model behaviors with human preferences. Combined with instruction tuning, this technique can further boost model performance and has been successfully applied to LLMs such as ChatGPT.
Data Augmentation with LLMs. Collecting large-scale human-annotated instructions and their responses is time-consuming and labor-intensive. A more cost-effective and feasible approach to gathering high-quality data is to sample from well-tuned LLMs, e.g., ChatGPT and GPT-3.5. Recently, there has been a surge of interest in distilling these powerful LLMs for data augmentation. For instance, using the Self-Instruct technique (Wang et al., 2022), Alpaca (Taori et al., 2023) generates 52k high-quality instruction-response pairs based on seed tasks by "distilling" Text-Davinci-003. The trained model performs almost on par with Text-Davinci-003. The success of Alpaca has spurred numerous later efforts on data augmentation with LLMs, such as code-alpaca (Chaudhary, 2023), alpaca-cot (Si et al., 2023), GPT4ALL (Anand et al., 2023), ShareGPT (Domeccleston, 2023), Dolly-v2 (Conover et al., 2023), BELLE (Ji et al., 2023), Vicuna (Chiang et al., 2023), Koala (Geng et al., 2023), Baize (Xu et al., 2023b), etc. It has been shown that increasing the scale of data can consistently improve model performance. Besides, prompt engineering also affects data quality: CAMEL (Li et al., 2023a) designs a multi-agent role-play environment for LLMs to simulate real human conversations.

Data Construction
LLMs are believed to be better annotators than humans in many scenarios (Gilardi et al., 2023). However, employing LLMs such as ChatGPT directly for generating multi-turn conversations may yield satisfactory but less informative results, as it cannot enjoy the benefit of reinforcement learning with human feedback (RLHF) in the alignment process. Table 12 in Appendix A shows a comparison of a directly generated multi-turn dialogue and a case in UltraChat with the same opening line. Two key points can be derived to ensure the quality of the data: (1) An opening line determines the topic of the dialogue. Opening lines should be highly diverse and encompass any task that a human user may request a chat model to perform. (2) A user determines the plot of the dialogue, and the output should be tailored to the current topic with diverse language styles and requests. Therefore, unlike traditional task-specific datasets, for a comprehensive open-domain instructional chat dataset, the design of the data collection schema is crucial to capturing the breadth of interactions and ensuring data quality. UltraChat aims to cover a tremendous range of conversation data with a carefully designed tripartite schema: Questions about the World, Creation and Writing, and Assistance on Existing Materials. While ensuring data diversity mainly depends on opening-line diversity, we will first introduce the idea behind the sector design and then focus on specific measures to obtain a diverse set of opening lines and how to prompt the user model properly.

Questions about the World
The first sector focuses on querying existing information in the world, including concepts, objects, and entities that exist in the real world.This is at the core of human-AI interaction, as users often rely on AI assistants to provide quick and accurate answers to their questions.
Our approach to gathering data for this sector involves two perspectives: one centered around topics and concepts, and the other around real-world entities. Initially, we request ChatGPT to generate 30 comprehensive topics that encompass various aspects of our daily lives, as shown in Table 2. Subsequently, we delve deeper into each topic by generating 30 to 50 subtopics or related concepts. Finally, we generate 10 different questions for each subtopic or concept and additionally request ChatGPT to generate 10 more questions based on each original question. The other source of data comes from real-world objects, which are derived from Wikidata entities. These entities are further refined by considering their frequencies in Wikipedia articles, specifically focusing on the 10,000 most frequently occurring entities. For each entity, we create 5 meta-questions, followed by 10 more specific questions and 20 extended questions. The extended questions aim to maintain some similarity to the original question while exploring distinct objects or topics. To create dialogues, we filter and sample approximately 500,000 questions as opening lines. During the construction of each dialogue, we provide the user model with carefully crafted prompts that explicitly ask the model to respond in light of the ongoing dialogue history.
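To make the expansion procedure concrete, the following is a minimal sketch of the hierarchical generation loop. It is illustrative only: the `chat` helper, the model name, the prompt wording, and the line-based parsing are our assumptions, not the released UltraChat pipeline.

```python
import openai  # legacy (<1.0) OpenAI client; an assumption for illustration


def chat(prompt: str) -> str:
    """Single ChatGPT call; model choice and decoding settings are assumptions."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]


def parse_lines(text: str) -> list[str]:
    """Split a listed answer into individual items."""
    return [line.strip("-*0123456789. ").strip() for line in text.splitlines() if line.strip()]


# Topic branch: 30 topics -> 30-50 subtopics each -> 10 questions per
# subtopic -> 10 further questions derived from each original question.
opening_lines = []
for topic in parse_lines(chat("List 30 comprehensive topics covering daily life.")):
    for subtopic in parse_lines(chat(f"List 30 subtopics or related concepts under '{topic}'.")):
        for question in parse_lines(chat(f"Write 10 diverse questions about '{subtopic}'.")):
            derived = parse_lines(chat(f"Write 10 new questions based on this one: {question}"))
            opening_lines.extend([question] + derived)
```

The entity branch would follow the same pattern, replacing the topic list with the 10,000 most frequent Wikidata entities and the 5/10/20 question fan-out described above.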

Creation and Writing
The second part is concerned with the creation of new information with human-input conditions, ranging from writing emails to crafting stories and plays.This process reflects the AI's capacity to engage in original content generation alongside users and demonstrates the role of AI assistants as collaborative partners in a creative environment.
We first treat all creations as text materials and further categorize them into 20 different types, as in Table 3. Then a ChatGPT model is employed to produce a diverse range of instructions for each type of writing, approximately 80% of which are further refined by ChatGPT to generate more detailed instructions. These instructions serve as opening lines for dialogue generation. Throughout the generation process, the user prompt constantly reinforces the primary objective of the conversation, which is to generate and refine a piece of writing. This ensures that the behavior of the user model remains focused and aligned with the intended purpose.
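The generate-then-refine step for this sector could look like the sketch below, reusing the hypothetical `chat` and `parse_lines` helpers from the previous sketch. The type names shown and the prompt wording are placeholders; only the 80% refinement ratio follows the text.

```python
import random

# Stand-ins for the 20 material types listed in Table 3.
writing_types = ["email", "story", "poem", "essay", "play"]

opening_lines = []
for wtype in writing_types:
    for inst in parse_lines(chat(f"Generate diverse instructions for writing a {wtype}.")):
        if random.random() < 0.8:
            # Roughly 80% of the instructions are refined into more detailed ones.
            inst = chat(f"Rewrite this writing instruction to be more specific and detailed: {inst}")
        opening_lines.append(inst)
```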

Assistance on Existing Materials
The third sector mainly addresses the modification of existing information, encompassing tasks such as rewriting, translation, summarization, and question answering. Modifying existing materials is a crucial aspect of human-AI interaction, as it allows the AI assistant to actively engage with the user's input, transforming it in various ways as instructed by the user.
We begin by gathering text pieces from the C4 corpus. Each piece within the C4 corpus is associated with a source URL. To ensure a diverse range of text content and styles, we adopt the 20 material types outlined in the previous section and manually curate keywords for each type. We then classify the text in the corpus by matching the keywords against the corresponding URL. In total, we collect 100,000 text pieces from the C4 corpus, and for each piece, we prompt ChatGPT to generate five distinct instructions. We use a manually designed template to combine text and instructions, as depicted in Figure 4 in the appendix. Ultimately, the concatenated set of 500,000 pieces serves as the opening lines for the generated dialogues.
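The URL-based classification and the material-plus-instruction concatenation can be pictured as follows. This is a hedged sketch: the keyword lists and the template string are hypothetical stand-ins for the manually curated keywords and the Figure 4 templates.

```python
# Hypothetical keyword lists; the paper manually curates keywords for each of the 20 types.
MATERIAL_KEYWORDS = {
    "news": ["news", "press", "article"],
    "recipe": ["recipe", "cooking", "kitchen"],
    "legal": ["law", "legal", "court"],
}


def classify_by_url(url: str) -> str | None:
    """Assign a material type by matching curated keywords against the source URL."""
    url = url.lower()
    for material_type, keywords in MATERIAL_KEYWORDS.items():
        if any(kw in url for kw in keywords):
            return material_type
    return None  # unmatched pieces are discarded


def build_opening_line(material: str, instruction: str) -> str:
    """Concatenate a generated instruction with its source text; a stand-in
    for the manually designed template shown in Figure 4."""
    return f"{instruction}\n\n{material}"
```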

User Simulation and Refinement
Maintaining the desired behavior of the user model is crucial for successful automatic dialogue generation. We have observed that when the user model is provided with only the current dialogue history, it tends to assume the role of an AI assistant. This "role confounding" can significantly deteriorate the coherence of the multi-turn conversation. To address this, in addition to presenting the dialogue history, we include prompts that explicitly instruct the model to adopt various user personalities. In Sector 2, a prompt is employed to remind the model of the primary purpose of the dialogue, thereby promoting a more natural conversation flow. Once the data generation process is complete, a further filtration step is performed to ensure overall data quality. We also exclude excessively polite statements to enhance the realism of user responses.
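Putting the pieces together, the dual-API generation loop might look like the sketch below. It is a minimal illustration, assuming the legacy OpenAI chat API: the persona prompt is a much-shortened stand-in for the paper's actual user prompts, and the role-flipping trick is one straightforward way to keep the user model from drifting into the assistant role.

```python
import openai  # legacy (<1.0) OpenAI client, assumed for illustration

USER_SYSTEM_PROMPT = (
    # Hypothetical wording; the paper's prompts are more elaborate and persona-specific.
    "You are a curious human user talking to an AI assistant. "
    "Ask follow-up questions in varied styles; never answer as the assistant."
)


def call(system: str, history: list[dict]) -> str:
    """One chat completion given a system prompt and prior messages."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system}] + history,
    )
    return resp["choices"][0]["message"]["content"]


def generate_dialogue(opening_line: str, num_turns: int = 4) -> list[dict]:
    history = [{"role": "user", "content": opening_line}]
    for _ in range(num_turns):
        # Assistant model responds to the dialogue so far.
        history.append({"role": "assistant",
                        "content": call("You are a helpful assistant.", history)})
        # User model sees the same history with roles flipped, so it keeps
        # generating the user side rather than confounding roles.
        flipped = [{"role": "user" if m["role"] == "assistant" else "assistant",
                    "content": m["content"]} for m in history]
        history.append({"role": "user", "content": call(USER_SYSTEM_PROMPT, flipped)})
    return history
```

Filtering, including the removal of excessively polite user utterances, would then run over the generated dialogues.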
Data Analysis

Statistical Analysis
We conduct a statistical analysis of UltraChat and several other instruction datasets, as shown in Table 4. UltraChat stands out in terms of scale, being one of the largest publicly available datasets. Moreover, it exhibits the highest average number of turns and the longest average length per instance. While SODA (Kim et al., 2023) also features many rounds, it is primarily composed of conceptual banter rather than instructional content. Additionally, the average number of tokens per dialogue in SODA is 231.8, whereas UltraChat boasts a remarkable 1467.4 tokens. To evaluate diversity, we measure both lexical diversity and topic diversity. UltraChat outperforms previous datasets in terms of lexical diversity. In terms of topic diversity, however, UltraChat falls slightly short of GPT4ALL (Anand et al., 2023) but still surpasses the other datasets significantly. This may be attributed to the regularized embeddings resulting from the large number of tokens in each dialogue. We also conduct a coherence evaluation with ChatGPT for multi-turn datasets. Notably, UltraChat and Baize data rank the highest in terms of coherence.

Human Assessment
Setup. To better evaluate the constructed data quality, we also conduct a human assessment of UltraChat. Due to the difficulty of evaluating multi-turn dialogue and the resulting formidable cost, we sample 500 representative dialogues for human evaluation: 300 from UltraChat Sector 1, and 100 each from Sectors 2 and 3. For each round of conversation, we ask the annotators to score the assistant's response on the Helpfulness, Honesty, and Harmlessness (3H) principles (Askell et al., 2021). We also devise Coherence and Consistency criteria for evaluating the overall multi-turn dialogue quality. Coherence evaluates whether the dialogue flows logically and coherently; for this criterion, the annotators evaluate both the user's and the assistant's responses. Consistency means that the assistant's responses do not contradict each other within the same dialogue. For example, it is inconsistent if the assistant asserts that a specific event occurred in 1911 in the first round of conversation but mentions it as a 1901 event in the next round. Each metric is scored 0, 0.5, or 1, where a higher score means better quality. Therefore, for a K-round dialogue, we have 3K + 2 metric annotations.
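The bookkeeping behind the 3K + 2 scheme, the two-annotator averaging described next, and the agreement statistic reported below can be sketched as follows; the toy scores are fabricated for illustration, and scikit-learn's `cohen_kappa_score` is assumed as the agreement tool.

```python
from sklearn.metrics import cohen_kappa_score


def metric_count(num_rounds: int) -> int:
    """3 per-response scores (helpfulness, honesty, harmlessness) for each of the
    K rounds, plus dialogue-level coherence and consistency: 3K + 2 in total."""
    return 3 * num_rounds + 2


def aggregate(annotator_1: list[float], annotator_2: list[float]) -> list[float]:
    """Scores take values 0, 0.5, or 1 and are averaged across the two annotators."""
    return [(a + b) / 2 for a, b in zip(annotator_1, annotator_2)]


# Inter-annotator agreement over paired annotations; the 0/0.5/1 scores are
# treated as discrete categories. (Toy data below; the paper reports kappa = 0.358.)
kappa = cohen_kappa_score(["0", "0.5", "1", "1"], ["0", "0.5", "0.5", "1"])
```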
Annotation. Each dialogue is annotated independently by two well-trained annotators, and the score is averaged across the two. Due to the difficulty of identifying hallucinations, we allow the annotators to skip dialogues that require expert knowledge or whose validity is hard to check. Altogether, we collect 14,560 valid annotations across the single-round and multi-round metrics, with a Cohen's kappa coefficient of 0.358. The average time to annotate one dialogue is 10 minutes.

Experiments

The evaluation of the trained model is conducted in two parts. We first evaluate UltraLM on traditional benchmark datasets to delineate the knowledge scope and the multiple abilities of the language model. To better demonstrate the chat ability of language models, an automatic response quality evaluation is then performed to showcase the model's proficiency in delivering accurate and informative content during chat interactions. Note that UltraLM is trained solely on the UltraChat dataset without further fine-tuning on task-specific datasets.

Datasets.
For benchmark evaluation, we choose four datasets: ARC-Challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2021), and TruthfulQA (Lin et al., 2021), evaluating commonsense knowledge, professional knowledge, and complex reasoning and understanding abilities. Each benchmark is constructed as multiple-choice questions, so metrics are readily computable. The four datasets prove to be challenging even for the best-performing language models like ChatGPT.
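These benchmarks can be scored with EleutherAI's lm-evaluation-harness, which the paper follows for all metric calculations (see Table 6). The snippet below is a hedged sketch: the task names follow the harness's historical naming, the checkpoint path is hypothetical, and the API surface may differ across harness versions.

```python
# Assumes a pre-1.0 lm-evaluation-harness; API and task names may vary by version.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=/path/to/ultralm-13b",  # hypothetical checkpoint path
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc",
           "hendrycksTest-abstract_algebra"],  # MMLU spans 57 hendrycksTest-* tasks
    num_fewshot=0,
)
print(results["results"])
```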
For response quality evaluation, we use three datasets. We first create an evaluation set ourselves. This curated set encompasses the Vicuna benchmark as well as an additional 300 questions and instructions generated by GPT-4. The questions and instructions cover a wide range of topics, including commonsense, world knowledge, professional knowledge (specifically physics and biology), mathematics, response generation, and writing tasks at different levels of difficulty. Apart from the curated set, we also adopt AlpacaEval (Li et al., 2023b), an open-source evaluation set and leaderboard specifically designed for evaluating LLMs. The leaderboard is based on the win rate against Text-Davinci-003, automatically evaluated by GPT-4. To further compare with the state-of-the-art model WizardLM (Xu et al., 2023a), we also report comparison results obtained with GPT-4 on the released Evol-Instruct (Xu et al., 2023a) test set. Further benchmark dataset and implementation details can be found in Appendices A.2 and A.3.

Benchmark Evaluation
The benchmark results are shown in Table 6.

Independent Scoring. Given the instability of pairwise comparison, we also conduct independent quality scoring with GPT-4, as presented in Table 7. Notably, our model outperforms all open-source counterparts by a significant margin in terms of overall scores. The breakdown also provides insights into the performance of each model on specific types of questions and instructions. Generally, all models perform better on simpler questions pertaining to commonsense knowledge and general world understanding. However, more complex tasks that involve reasoning and creative writing prove to be challenging for most models. The weaker performance on mathematical problems is not surprising, though, as no mathematical problems are intentionally generated in UltraChat.
Impact of System Prompts. Using system prompts to adjust the role and response style of LLMs is a common practice. Although system prompts are not embedded in UltraLM's training data as they are for other models (Chiang et al., 2023), they still appear to have a substantial influence on the style of the generated responses. Specifically, when the model is prompted to provide a "helpful and detailed" response, it tends to generate more pertinent details. While such prompts may not improve the accuracy of an answer, they do raise overall quality through more informative responses. To illustrate this effect, we conduct an ablation comparison of UltraLM's responses. Table 9 reveals significant improvements in response quality brought by system prompts across all tasks. We also inspect the detailed evaluation for deterministic questions (commonsense and world knowledge) and find that UltraLM without a system prompt incorrectly answers only one question. Thus, the main benefit of the system prompt lies in enhanced informativeness rather than higher correctness.

Conclusion

We eagerly await the innovative research and development that will be catalyzed by our contributions to the field of AI conversational models.

Limitations
Evaluating the response quality of large language models is an extremely challenging task, and any assessment may carry biases. For a comprehensive evaluation, we compared UltraLM's performance with other baselines across various benchmarks, utilizing GPT-4 to assess response quality. Nevertheless, additional and more diverse evaluations are needed to facilitate a more thorough understanding of our model's behavior and performance. Despite demonstrating promising results in experimental settings, UltraLM is not immune to the common pitfalls of large language models, including hallucination and the potential ethical concerns associated with misuse. Additionally, the energy-intensive nature of UltraLM's training represents a limitation, particularly when compared to models employing more efficient techniques such as parameter-efficient fine-tuning. As for UltraChat, it currently contains only English data, and no explicit methodologies are incorporated for generating data that enhances the model's reasoning capabilities, representing another area for potential improvement.
While the advancements of the UltraChat dataset and UltraLM are commendable, they still face the ethical challenges that exist across the area of LLMs. For the sake of privacy protection, we do not use any online or human queries/instructions to construct UltraChat, but instead develop a framework to build scalable, diverse instructional data. Although extensive filtering operations are conducted, biased statements may still exist within the dataset. UltraLM, as described in this paper, will be one of the most potent open-source chat language models. With great power comes increased responsibility and potential for misuse. There exists a substantial risk of the technology being weaponized to spread misinformation or propaganda, or to create "deepfake" text that could mislead or manipulate public discourse. This necessitates the establishment of robust policies and comprehensive research to prevent misuse and deter malicious applications of the technology. In this regard, the model's potential applications should be carefully evaluated for possible negative consequences before deployment.

A.1 Baselines
We introduce the main open-source baseline models below.

Alpaca (Taori et al., 2023) is an instruction-following language model derived from the LLaMA (Touvron et al., 2023) model, optimized on 52,000 demonstrations of instruction data. The data is generated by the Self-Instruct approach with Text-Davinci-003.

Vicuna-13B (Chiang et al., 2023) is an open-source chat model created by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. An automatic evaluation by GPT-4 indicates that Vicuna can yield over 90% of ChatGPT's response quality. In subsequent practice, Vicuna is widely acknowledged as the state-of-the-art open-source chat model. This is evident in the Chat Arena, where a total of 13,000 anonymous votes reveal that the quality score of Vicuna-13B surpasses that of other open-source models.

Koala-13B (Geng et al., 2023) is another LLaMA-based model, fine-tuned on selected public dialogues. In existing open evaluations, Koala's performance is slightly worse than Vicuna's, but it remains a strong baseline.

Dolly-V2 (Conover et al., 2023) is based on the Pythia (Biderman et al., 2023) model and utilizes 15k human-generated instruction-following examples. The data is organized following InstructGPT (Ouyang et al., 2022), covering brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

OpenAssistant-12B (Köpf et al., 2023) is also a Pythia-based model that attempts to democratize the alignment process of LLMs. The project collects a conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees and trains a model on these manually annotated data.

WizardLM-13B (Xu et al., 2023a) is a LLaMA-based model fine-tuned on the Evol-Instruct dataset, which contains 250k instructions. The data is generated with an evolutionary strategy, where the initial instructions are rewritten by ChatGPT for several epochs to progressively increase complexity.

A.2 Evaluation Dataset
Table 11 presents the basic information of the evaluation datasets we use. Below we give a detailed description of each dataset.

The AI2 Reasoning Challenge (ARC) (Clark et al., 2018) comprises advanced science questions structured as multiple-choice questions. Each question is accompanied by 4 choices, of which only one is correct. We use the challenge partition for evaluation, which contains 1172 test examples.

HellaSwag (Zellers et al., 2019) tests commonsense inference ability by evaluating how well a language model can predict the remaining part of a sentence. Each sample has 4 candidate continuations of a given sentence, only one of which is plausible. The task is shown to be easy for humans but challenging for language models. We use the validation split as in Gao et al. (2021).

MMLU (Hendrycks et al., 2021) is a comprehensive dataset consisting of 57 tasks covering multiple types of academic knowledge and the problem-solving abilities of language models, ranging over expert fields like mathematics, history, and law.

TruthfulQA (Lin et al., 2021) assesses how well a model can identify true statements about the real world. Its purpose is to determine the risk of producing false claims or spreading misinformation. The benchmark consists of questions written in various styles, covering 38 different categories, and is designed to be challenging. It includes two evaluation tasks: a multiple-choice task and a generation task. We use the multiple-choice task on the validation split as in Gao et al. (2021).

AlpacaEval (Li et al., 2023b) is a hybrid evaluation dataset with 805 instructions in total, combining instructions from various existing evaluation sets, including Self-Instruct (Wang et al., 2022), Open Assistant (Köpf et al., 2023), the helpful evaluation released by Anthropic (Bai et al., 2022), Vicuna (Chiang et al., 2023), and Koala (Geng et al., 2023).

Evol-Instruct is a dataset released in Xu et al. (2023a). It is constructed with an evolutionary strategy by rewriting instructions through multiple rounds to obtain instructions at different complexity levels. WizardLM (Xu et al., 2023a) is trained on the training set; we evaluate our model on the test set.

Comparison Evaluation Prompt
You are a helpful, harmless and precise assistant for checking the quality of the answer. We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. Please rate the quality, helpfulness, level of detail, and harmlessness of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance. Please first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment. Please avoid giving the same scores unless the responses are at exactly the same level.

Independent Scoring Prompt
You are a helpful, harmless and precise assistant for checking the quality of the answer. We would like to request your feedback on the performance of the AI assistant in response to the user question displayed above. Please rate the quality, helpfulness, level of detail, and harmlessness of the response. The assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance. Please output "Score: [an integer number between 1 and 10]".
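In practice, this scoring prompt is sent to GPT-4 together with the question and the candidate answer, and the leading "Score:" line is parsed from the reply. The following is a minimal sketch under that assumption; the exact message layout used in the paper is shown in Figure 6, and the wrapper below only approximates it.

```python
import re

import openai  # legacy (<1.0) OpenAI client, assumed for illustration

SCORING_PROMPT = "..."  # the independent scoring prompt shown above


def score_response(question: str, answer: str) -> int:
    """Rate one answer with GPT-4 and parse the 'Score: N' line; returns -1 on failure."""
    content = (
        f"Question: {question}\n"
        f"[The Start of the AI Assistant's Answer]\n{answer}\n"
        f"[The End of the AI Assistant's Answer]\n\n{SCORING_PROMPT}"
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": content}],
        temperature=0,
    )
    match = re.search(r"Score:\s*(\d+)", resp["choices"][0]["message"]["content"])
    return int(match.group(1)) if match else -1
```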

System Prompt
You are now a helpful assistant, your answer should focus on the question, give reasonable, self-consistent, high-quality solutions to make users satisfied. Please try to make the answer more informative, detailed, and polite.

Example A: Question of World Knowledge

User: Who discovered the law of universal gravitation?

UltraLM: The law of universal gravitation was discovered by Sir Isaac Newton in the 17th century. He formulated the law while studying the planets' movement around the sun and realized that there was a quantifiable force directing them towards the sun. In 1687, he published his book titled "Mathematical Principles of Natural Philosophy," where he described the law of universal gravitation in detail, outlining how two objects experience an attraction towards each other due to their mass and the distance between them. This discovery revolutionized our understanding of the universe and ushered in the Age of Enlightenment. Today, Newton's law continues to form the foundation of modern astrophysics and space exploration.
Example B: Question of World Knowledge

User: Who was the 16th President of the United States?

UltraLM: The 16th President of the United States was Abraham Lincoln. He assumed the presidency during a very tumultuous time in American history, as the nation was torn apart by the Civil War. During his presidency, he was known for his strong leadership, his commitment to the preservation of the Union, and his support for the abolition of slavery. Some of his most notable achievements include issuing the Emancipation Proclamation, which declared that all slaves in the Confederate States were free, and delivering the Gettysburg Address, an iconic speech honoring the soldiers who died in the Civil War. Lincoln is widely regarded as one of the greatest American Presidents, remembered for his unwavering leadership during a challenging period in the nation's history.

Example C: Question of Common Knowledge
User: Which animal is known as man's best friend?

UltraLM: There are many animals that humans enjoy as companion animals, but one of the absolute standout answers to the question "which animal is known as a man's best friend?" is the domestically-bred dog. Domesticated dogs have been by humans' sides for tens of thousands of years, providing faithful companionship and help; guarding people and property; and lending emotional support in everything from long, lonely days to military deployments. Dogs come in an almost mind-boggling variety of breeds of various sizes, with different energy levels, personalities, and temperaments, meaning there's almost certainly a pup out there who will be a perfect fit for whoever needs it most.

Example D: Commonsense Reasoning
User: Which weighs more: a pound of feathers or a pound of bricks?

UltraLM: A pound of feathers and a pound of bricks both weigh exactly one pound. This is because one pound is a unit of weight, just as a foot is a unit of length or a second is a unit of time. We use the pound (or other unit of weight) to describe the weight of different objects, regardless of what that object is made of. So the answer to your question is that a pound of feathers and a pound of bricks both weigh the same amount.

Example E: Reasoning
User: "If a farmer has 17 sheep and all but 9 die, how many are left?UltraLM: The total number of sheep that are left is 9.

Example F: Reasoning
User: If a red house is made from red bricks, and a blue house is made from blue bricks, what is a greenhouse made from?

UltraLM: A greenhouse is not made from green bricks, but rather from materials such as glass or plastic that allow light to pass through and create a controlled environment for plants to grow. Greenhouses can come in a variety of colors, including green, but the color of the building itself does not affect the plants growing inside.

Example G: Creative Generation
User: Write a short dialogue between two planets.

UltraLM: Mars: "Hey there Neptune, how's it going?" Neptune: "Hi there Mars, I've been better. It's getting a bit windy around here." Mars: "Sorry to hear that. Have you tried using energy other than wind? Maybe solar power?" Neptune: "Yeah, but it's getting difficult to generate solar power in all of this mist and clouds. And we definitely can't use tides or waves here, they're too unpredictable." Mars: "Oh, I see. Well, I hope you find a solution soon. How are your oceans holding up?" Neptune: "They've been a bit choppy lately with these intense storms we've been experiencing. But I guess it comes with living in a planet like this." Mars: "Definitely not designed for the faint-hearted! Any other issues you're facing?" Neptune: "Not really. I think we have a stable temperature and atmosphere, and the wildlife here hasn't shown much complaint." Mars: "That's lucky. I envy your tranquil oceans, we've had way too much volcanic activity lately causing all sorts of disturbance in our lava streams." Neptune: "That sounds rough, better invest in those climate-controlling nanites to stabilize things maybe?" Mars: "Ha-ha, very funny. Just like we also could consider your frozen polar regions as an opportunity to experiment with our frost nanobots on." Neptune: "A fun concept! But let's stick to the task at hand, if you need any technical advice, come right ahead, I don't want your planet getting any more disturbed than it already is."

Example H: Creative Generation
User: Pretend to be a character in a post-apocalyptic world. Describe how you survive and the allies you encounter.

UltraLM: As a survivor in a post-apocalyptic world, I have learned to be resourceful, alert, and adaptable. I have had to rely on my wits and my instincts to survive, and I have been fortunate enough to encounter many allies who have helped me along the way. One of my greatest survival tools has been my ability to scavenge for resources. I have learned to identify the best places to find food, water, and shelter, and I have become skilled at navigating the rough terrain that makes up my new world. When I first emerged from my bunker after the apocalypse hit, I was alone and unsure of what to do. But as I ventured out into the wasteland, I realized that there were others like me who were struggling to survive. I quickly learned to trust my instincts and to be constantly alert for potential dangers. I have encountered wild animals, rogue survivors, and even mutated creatures over the years, and I have learned to defend myself against them. I still remember the first time I met a group of fellow survivors, and how grateful I was to have them by my side. We quickly formed a cohesive unit, and together we were able to fend off danger and defend our territory. As time passed, I encountered more and more allies who were willing to lend a hand. I found a group of farmers who taught me how to grow my own food, and a group of scientists who were working to find a cure for the lethal virus that had decimated the population. I even found a band of traveling performers who helped to lift our spirits and remind us that there was still joy to be found in this desolate world. Despite the challenges we faced, I am grateful to have these amazing people in my life. They have taught me that even in the darkest of times, there is always hope. We may not have a cure for the virus yet, but we are a determined lot, and we are all in this together. As long as we have each other, we will survive.
Example K: Professional Knowledge

User: How does superconductivity work and what are its practical applications?

UltraLM: Superconductivity is a phenomenon that allows materials to conduct electricity without resistance at extremely low temperatures. At low temperatures, the interaction between the particles in a superconducting material can produce a state called the "Bose-Einstein condensate," which leads to the loss of electrical resistance. In materials, superconductivity comes when a magnetic force interacts with an electronic current. This interplay in turn provides energy savings and speeds up current. Practical applications mainly include MRI machines, high-speed trains, magnetic levitation, and supercharging power grids that can transfer energy more efficiently. One of the main applications of superconductivity is in the Magnetic Resonance Imaging (MRI) machines used in hospitals around the world, to non-invasively scan patients for health diagnosis without exposing them to high levels of radiation. Another important practical application of superconductivity is in power grid operations, where large currents can be transmitted over long distances by highly superconducting coil systems that minimize grid energy losses, reducing environmental pollution, and by highly efficient electric motors that harness large quantities of electricity efficiently. Aside from these, future projects include superconducting propulsion on planes, and trains becoming lighter, if not magnetic, and able to move on rails with a significant increase in speed. Furthermore, the prospects of supercapacitors replacing electric vehicle batteries, being cost and energy efficient, are currently being researched. Overall, superconductivity presents various exciting opportunities in multiple industries, from high-precision engineering to reducing CO2 emissions and supporting a more environmentally friendly energy model where both cost efficiency and practicality exist hand in hand.

Example L: Mathematics
User: "What is the definition of a limit in calculus?UltraLM: In calculus, a limit is a value that a function approaches as its input approaches a certain value.The limit is a measure of how close the function is to that value.In other words, a limit is the value that a function "approaches" as its input gets "closer and closer" to a certain value.For example, consider the function f (x) = 1 x .As x approaches 0, the function "approaches" +∞, and the limit of the function as x approaches 0 is +∞.Formally, we can define the limit of a function f (x) as x approaches a certain value c as lim x→c f (x) = L, where L is a real number.If the limit exists, we say that the function has a limit at c and the value of the limit is L. If the limit does not exist, we say that the function has a removable discontinuity at c.

Figure 1: Construction process of UltraChat. The three sectors of data are derived from different meta-information.

Figure 3: Comparison between UltraLM and WizardLM-13B on the Evol-Instruct test set. The scores are obtained by pairwise scoring with GPT-4, and WizardLM's scores are taken as 100%.

Figure 4: Manually designed templates for concatenating existing materials and generated instructions.
[The Start of Assistant 1's Answer] [The End of Assistant 1's Answer] [The Start of Assistant 2's Answer] [The End of Assistant 2's Answer]

Figure 5: Prompt for automatic comparison evaluation.
[The Start of the AI Assistant's Answer] [The End of the AI Assistant's Answer]

Figure 6: Prompt for automatic independent scoring evaluation.

Figure 7: System prompt used for generating responses in automatic evaluation.

Table 4: Statistics of existing instruction datasets. Lexical diversity is calculated by averaging the MTLD score (McCarthy and Jarvis, 2010) over each utterance with LexicalRichness. 10,000 samples are randomly drawn from each dataset for topic diversity and coherence measurement. Topic diversity is measured by averaging the cosine distance between each pair of samples using the OpenAI embedding API. Coherence is scored by ChatGPT on a scale of 1-10.
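The two diversity measures in the caption can be reproduced with a short script. The sketch below assumes the `lexicalrichness` package for MTLD and precomputed embedding vectors (e.g., from the OpenAI embedding API); both choices follow the caption, but the exact settings used in the paper are not specified.

```python
from itertools import combinations

import numpy as np
from lexicalrichness import LexicalRichness  # pip install lexicalrichness


def lexical_diversity(utterances: list[str]) -> float:
    """Average MTLD over utterances, mirroring the caption's setup."""
    return float(np.mean([LexicalRichness(u).mtld() for u in utterances]))


def topic_diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance between per-sample embedding vectors."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    distances = [1.0 - float(normed[i] @ normed[j])
                 for i, j in combinations(range(len(normed)), 2)]
    return float(np.mean(distances))
```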
As shown in Table 5, the dataset scores high on all metrics, demonstrating the effectiveness of the construction process. It is worth noting that the dataset is almost free from harmful content like hate speech and discrimination. Furthermore, we observe two trade-offs in data quality; the first is between helpfulness and honesty.

Table 5: Human assessment results on 500 dialogues sampled from UltraChat.

Table 6: Evaluation results on 4 challenging benchmark datasets. All evaluation and metric calculations follow EleutherAI's lm-evaluation-harness (Gao et al., 2021). Both weighted and unweighted mean accuracy are reported for MMLU, as it contains 57 tasks. The overall average metric is obtained by averaging the second-column data for each benchmark dataset. More details about metric calculation can be found in Appendix A.3.

Table 7: Response comparison of UltraLM with other baselines on the curated evaluation set, evaluated by GPT-4. The independent overall scoring and segment scoring of each model on the curated evaluation set, on a scale of 1 to 10. Bold indicates the best score and underlined indicates the second best.