An Empirical Study of Instruction-tuning Large Language Models in Chinese

The success of ChatGPT validates the potential of large language models (LLMs) in artificial general intelligence (AGI). Subsequently, the release of LLMs has sparked the open-source community's interest in instruction-tuning, which is deemed to accelerate the replication of ChatGPT. However, research on instruction-tuning LLMs in Chinese, the world's most spoken language, is still in its early stages. Therefore, this paper makes an in-depth empirical study of instruction-tuning LLMs in Chinese, which can serve as a cookbook that provides valuable findings for effectively customizing LLMs that better respond to Chinese instructions. Specifically, we systematically explore the impact of LLM bases, parameter-efficient methods, and instruction data types, the three most important elements for instruction-tuning. Besides, we also conduct experiments to study the impact of other factors, e.g., chain-of-thought data and human-value alignment. We hope that this empirical study can make a modest contribution to an open Chinese version of ChatGPT. This paper also releases a powerful Chinese LLM that is comparable to ChatGLM. The code and data are available at https://github.com/PhoebusSi/Alpaca-CoT.


Introduction
The emergence of ChatGPT gives humanity a real sense of hope for AGI for the first time, and inspires researchers to realize the importance of LLM research. However, the closed source of LLMs (e.g., GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022)), coupled with the massive computing resources required to build an exclusive LLM, has deterred researchers from reaching the LLM training stage. Subsequently, a series of "API research" efforts based on GPT-3 and ChatGPT are constantly emerging, which stimulate specific capabilities of frozen LLMs (e.g., Chain-of-Thought (Wei et al., 2023; Wang et al., 2023; Kojima et al., 2023)) or guide them to complete specific tasks (Yang et al., 2022; Shen et al., 2023), by calling OpenAI interfaces and carefully designing prompts, without model training.
The unexpected disclosure of the pre-trained LLaMA (Touvron et al., 2023) model changes this situation, and has sparked a surge of excitement in the LLM research community: it is the first open LLM with competitive performance. Recently, Alpaca (Taori et al., 2023) uses self-instruct (Ouyang et al., 2022) and ChatGPT to generate 52K instructions, which enable LLaMA to respond to various human instructions like ChatGPT. This open project verifies the important role of instruction-tuning (Wei et al., 2022; Chung et al., 2022) open LLMs in replicating the ChatGPT process.
Given the open LLM LLaMA and Alpaca's high-quality instruction data, one challenge remains for researchers: even instruction-tuning a 7B model still requires high computational resources. To address this problem, Alpaca-LoRA extends the parameter-efficient method LoRA to LLaMA, which further reduces the computing cost of instruction-tuning. This has sparked extensive research in the open-source community on instruction-tuning for LLMs. On this basis, more LLMs (e.g., Bloom (Workshop, 2023), GPT-J (Wang and Komatsuzaki, 2021)) are shown to have significant improvements in instruction-following performance with instruction-tuning. On the other hand, more instruction data is constantly being proposed: e.g., Belle (Ji et al., 2023) constructs Chinese instructions in the same way, and ShareGPT collects a large number of real human-ChatGPT conversations.
However, research on instruction-tuning LLMs in Chinese, the world's most spoken language, is still in its early stages. LLM bases, parameter-efficient methods, and instruction data are three essential elements for customizing Chinese ChatGPT-like LLMs, yet there are no tutorials on them in the academic community. Some important questions have not yet been explored and answered: 1) "Which open LLM is more suitable as a foundation for Chinese instruction-tuning?", 2) "How do parameter-efficient methods other than LoRA affect LLMs?", and 3) "What is the impact of various types of instruction datasets?" To answer these questions, we collect a range of LLMs, parameter-efficient methods, and instruction datasets. Besides, we consider both the AGI (instruction-following) capability and the professional knowledge reserve (human exams) of models, and correspondingly select two benchmarks, Belle-eval (Ji et al., 2023) and MMCU (Zeng, 2023), for comprehensive evaluation.
We also conduct experiments to explore several other factors that may affect the final performance. Specifically, we find that tuning with Chain-of-Thought (CoT) data can improve the ability to respond to complex reasoning questions. Different LLMs may be suited to different language prompts (excluding the instruction parts) in instruction-tuning. Human-value alignment results in a slight performance drop. On the basis of the above findings, this paper carefully instruction-tunes a powerful Chinese LLM that is comparable to ChatGLM.
The contributions can be summarized as follows: (1) We are the first to systematically study instruction-tuning in Chinese through adequate experiments, which can serve as a cookbook that provides valuable findings for customizing a Chinese version of ChatGPT. (2) We release a powerful Chinese LLM that is comparable to ChatGLM.

Preliminaries
Problem Formulation. LLM bases m ∈ M, parameter-efficient methods p ∈ P, and instruction datasets d ∈ D are the three crucial elements in instruction-tuning. This section examines the impact of each element in the instruction-tuning triplet (m, p, d) on final performance. We traverse the target element to thoroughly explore its impact, while fixing the other two elements in the triplet to control the variables. For example, we analyze the impact of different types of instruction datasets by comparing the performance of {(m, p, d_i)}_{i=1}^{|D|}.
Benchmarks. We select two evaluation benchmarks, Belle-eval and MMCU, to comprehensively evaluate LLM competencies in Chinese. Belle-eval is constructed by self-instruct with ChatGPT, and has 1,000 diverse instructions across 10 categories covering common NLP tasks (e.g., QA) and challenging tasks (e.g., code and math). We use ChatGPT to rate the model responses based on the golden answers. This benchmark is considered an assessment of AGI (instruction-following) capability. MMCU is a collection of Chinese multiple-choice questions in four professional disciplines: medicine, law, psychology, and education (e.g., the Gaokao examination). It allows LLMs to take human-society exams in a multiple-choice test manner, making it suitable for evaluating the breadth and depth of knowledge of LLMs across multiple disciplines. More statistics and details are shown in Appendix A.1.
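As a concrete illustration, the controlled-variable protocol above can be sketched in a few lines of Python. All names here (MODELS, METHODS, DATASETS, and the evaluate stub) are hypothetical placeholders, not the paper's actual code:

```python
# Hypothetical sketch of the controlled-variable protocol: fix two elements of
# the triplet (m, p, d) and traverse the third.
MODELS = ["bloom-7b", "llama-7b", "moss-base"]
METHODS = ["lora", "prefix-tuning", "p-tuning"]
DATASETS = ["alpaca-gpt4", "belle", "coig-trans"]

def evaluate(m: str, p: str, d: str) -> float:
    """Stub: a real run would instruction-tune (m, p, d) and score the result
    on Belle-eval and MMCU."""
    return 0.0

def traverse_datasets(m: str, p: str) -> dict:
    # Vary only the dataset d, holding m and p fixed, i.e. compare the
    # performance of {(m, p, d_i)} for i = 1..|D|.
    return {d: evaluate(m, p, d) for d in DATASETS}

scores = traverse_datasets("bloom-7b", "lora")
```

The same loop shape applies when traversing models or parameter-efficient methods instead.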

Open Large Language Models
To answer "Which open LLM is more suitable as a foundation for Chinese instruction-tuning?", we collect the popular open LLMs, as shown in Table 1.

Evaluation of Existing LLMs
Performance on Belle-eval. LLMs still have significant room for improvement compared to ChatGPT.

Instruction-tuning Different LLMs
To determine the appropriateness of different LLMs as a foundation for instruction-tuning in Chinese, we fine-tune all the open LLMs with the same parameter-efficient method, LoRA, and the same instruction dataset, Alpaca-GPT4. The results are shown in Figure 1, where we find that: 1) On Belle-eval, the performance improvement brought by instruction-tuning is not as significant for sft LLMs as for base LLMs, except for Bloomz and Bloomz-mt. This is because the instructions of xP3, used for their supervised fine-tuning, are not diverse enough. 2) Vicuna and ChatGLM encounter performance drops after instruction-tuning, because Vicuna is trained from real human-ChatGPT conversations of better quality than Alpaca-GPT4, and ChatGLM adopts HFRL (Ouyang et al., 2022), which may be no longer suitable for further instruction-tuning. 3) On MMCU, most LLMs achieve performance boosts after instruction-tuning, with the exception of Bloomz and Bloomz-mt, whose performance unexpectedly and significantly decreases. This is because the original Bloomz and Bloomz-mt excel at multiple-choice questions, but after further instruction-tuning, they suffer catastrophic forgetting.
2 Full results are shown in Appendix B.1.
After instruction-tuning, Bloom has significant improvements and performs well on both benchmarks. Although ChatGLM beats Bloom consistently, it suffers performance drops during instruction-tuning. Therefore, among all open LLMs, Bloom is the most suitable foundation model for the subsequent Chinese instruction-tuning experiments.

Parameter-efficient Methods
For most researchers, parameter-efficient methods are essential for instruction-tuning due to limitations in computing resources. These methods tend to freeze the pre-trained model weights and inject trainable weights (adapters), which greatly reduces the number of trainable parameters. To answer "How do parameter-efficient methods other than LoRA affect LLMs?", we collect a range of parameter-efficient methods to instruction-tune Bloom.
Table 5: The "Type" column shows the data types. The "Source" column shows the source where the data was generated. "SI" and "COL" denote the self-instruct method and the collection of existing datasets, respectively. "MIX" denotes joint construction by humans and machines. "translated" denotes a translation from non-Chinese instructions. We filtered all datasets to remove incomplete instructions. More details of each dataset can be found in Appendix A.4.
Comparison of Parameter-efficient Methods.
From Table 4, several observations can be derived: 1) SadapterH performs the best among all parameter-efficient methods and can be used as an alternative to LoRA. 2) P-tuning and prompt-tuning underperform the others by large margins, indicating that only adding trainable layers at the embedding level is not enough to support LLMs on generation tasks. 3) Although AdaLoRA is an improvement over LoRA, its performance drops clearly, possibly because LoRA's trainable parameters for LLMs are not suitable for further reduction. 4) Comparing the upper and lower parts, it can be seen that increasing the number of trainable parameters for sequential adapters (i.e., SadapterP and SadapterH) does not bring gains, while the opposite is observed for parallel adapters (i.e., P-adapter). This may provide inspiration for the design of adapters for LLMs. Since LoRA is currently the most popular parameter-efficient method, unless otherwise specified, we adopt LoRA by default in the experiments.
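For intuition about why the embedding-only methods lag behind, the rough trainable-parameter budgets of the method families can be compared with back-of-the-envelope arithmetic. The sizes below are illustrative assumptions for a Bloom-7B-scale model, not the exact configurations used in the experiments:

```python
# Illustrative trainable-parameter counts; all sizes are assumptions.
d_model, n_layers, r, n_virtual = 4096, 30, 8, 20

# LoRA on the query/value projections: two rank-r factor pairs (A and B) per layer.
lora = n_layers * 2 * (2 * d_model * r)
# Prompt-tuning: soft-prompt embeddings at the input layer only.
prompt_tuning = n_virtual * d_model
# Prefix-tuning: virtual key/value prefixes injected into every layer.
prefix_tuning = n_layers * 2 * n_virtual * d_model

print(lora, prompt_tuning, prefix_tuning)  # 3932160 81920 4915200
```

Under these assumptions, prompt-tuning trains orders of magnitude fewer parameters than the per-layer methods, which is consistent with its large performance gap in Table 4.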
Training Loss. Figure 2 shows the training loss of different parameter-efficient methods. We find that: 1) Prompt-tuning and P-tuning converge the slowest and have the highest losses after convergence. This shows that embedding-only adapters are not suitable for instruction-tuning LLMs.
2) The initial loss of AdaLoRA is very high because it requires simultaneous learning of parameter budget allocation, which makes the model unable to fit the training data well.
3) The other methods can quickly converge on training data and fit it well.

Chinese Instructions Datasets
Alpaca (Taori et al., 2023) inspires researchers to further explore instruction data. To systematically explore "What is the impact of various types of instruction datasets?", we gather popular open Chinese instruction datasets (as shown in Table 5) to fine-tune Bloom with LoRA.
Performance on Belle-eval. As shown in the upper part of Table 6, it can be seen that: 1) The instruction data constructed by ChatGPT (e.g., using self-instruct methods or collecting real human-ChatGPT conversations) consistently enhances the instruction-following ability, with 3.1 ∼ 11-point score increases. 2) Among these datasets, Belle has the best performance due to its largest amount of instruction data. However, the performance of models trained on moss-sft-data, which contains more data built in a similar way, is unsatisfactory. This is because moss-sft-data's instructions sacrifice diversity to achieve the goals of helpfulness, honesty, and harmlessness.
3) The performance brought by the Alpaca-GPT4 instructions is the second best, with only 49K instructions being comparable to the 1.54M of Belle. This is because Alpaca-GPT4 uses the GPT-4 engine while Belle uses the text-davinci-003 engine, which further illustrates that improving data quality can reduce the demand for data volume. 4) Instinwild brings the least performance gains among them because the seed instructions it crawls from Twitter ("in the wild") are not as comprehensive as those (like Alpaca's) carefully designed by humans. 5) These ChatGPT-based data mainly have a significant improvement effect on open generation tasks such as Brain Storm and Generation, while there is a significant decrease on tasks that require high reading comprehension skills, such as Close QA and Extract, which must be completed based on given materials. This inspires researchers to consider reading-comprehension ability when building more comprehensive instruction datasets.
The lower part of Table 6 shows the results of models trained on dataset-based data, which is mainly constructed by collecting NLP or examination datasets. These instruction datasets damage the model's instruction-following ability, because the form and intent of each NLP or examination dataset are uniform and can easily be overfitted. Among them, COIG-trans performs the best because it involves over 2,000 different tasks with a wide variety of task instructions. In contrast, xP3 and COIG-ccmc have the worst negative impact on model performance. Both of them cover only a few types of tasks (translation and QA for the former, counterfactual correction conversations for the latter), which hardly cover the popular instructions and tasks for humans.
Performance on MMCU. Table 7 compares the performance on MMCU brought by different instruction datasets. 1) Instruction-tuning on each dataset always results in performance improvement. 2) Among the ChatGPT-based data shown in the upper part, ShareGPT-zh underperforms the others by large margins. This may be because real users rarely ask multiple-choice questions about academic topics. 3) Among the dataset-collection data shown in the lower part, HC3 and COIG-ccmc result in the lowest accuracy, because HC3 has only 13K unique questions and the task format of COIG-ccmc differs significantly from MMCU. 4) COIG-exam brings the greatest accuracy improvement, benefiting from a task format similar to MMCU.

Other Important Factors
Problem Formulation. In addition to the three essential elements (m, p, d) discussed above, there are many factors worth exploring, e.g., CoT. Unless otherwise specified, we use Bloom as the LLM base, LoRA as the parameter-efficient method, and Alpaca-GPT4 as the instruction data. On this basis, we explore the impact of each target factor by observing the performance changes after introducing it.
We collect 9 CoT datasets and their prompts from FLAN (Wei et al., 2022), and then translate them into Chinese using Google Translate. We compare the performance before and after adding CoT data during instruction-tuning in Table 8. "Alpaca-GPT4+CoT" outperforms "Alpaca-GPT4" in the Code and Math tasks that require strong reasoning ability. Besides, there is also a significant improvement on the MMCU education task, which is derived from Gaokao questions covering a range of subjects, e.g., math, physics, and history. The accuracy improvement across all subjects illustrates that CoT reasoning ability is generally required in various subjects. However, CoT training data cannot continue to bring benefits to all tasks; on the contrary, it causes slight performance degradation on most other tasks. The full results can be found in Appendix B.3.
Table 8: The impact of chain-of-thought data on complex tasks requiring reasoning. "*" denotes using the prompt "先思考，再决定" ("think step by step" in Chinese) during inference.
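The inference-time variant of this idea, appending a zero-shot trigger sentence to every instruction (discussed next), can be sketched as follows; the helper name is hypothetical:

```python
# Sketch (hypothetical helper) of appending the zero-shot CoT trigger sentence
# to an instruction before inference.
COT_TRIGGER = "先思考，再决定"  # glossed in the paper as "think step by step"

def add_cot_trigger(instruction: str) -> str:
    # Append the trigger to the end of the instruction, as described in the text.
    return f"{instruction.rstrip()}{COT_TRIGGER}"

prompt = add_cot_trigger("计算 12 + 35 的结果。")
assert prompt.endswith(COT_TRIGGER)
```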
Inspired by Kojima et al. (2023), we add the sentence "先思考，再决定" ("think step by step" in Chinese) at the end of each instruction, to induce the model to respond to instructions based on a chain of thought. As shown in the "Alpaca-GPT4+CoT*" line, this simple sentence can further improve performance on the reasoning tasks Code and Education, while the Math performance is slightly inferior to "Alpaca-GPT4+CoT". This may require us to further explore more robust prompts.
Expansion of Chinese Vocabulary. As shown in Figure 3, we find that the performance of "llama-voc" is severely inferior to "llama" on Belle-eval, and it is almost unable to respond correctly to MMCU's instructions. This indicates that it is not feasible to perform instruction-tuning without pre-training on vast data: the embeddings corresponding to the newly added Chinese tokens are random and meaningless, which leaves the model unable to understand the meaning of the instructions.
To make the newly added Chinese tokens meaningful, Cui et al. further pre-train LLaMA on 20B- and 100B-token Chinese corpora to obtain the "llama-voc-pre" and "llama-voc-pre-l" models. We use Alpaca-GPT4 to instruction-tune these models and find that pre-training on more Chinese corpus, together with the expansion of the Chinese vocabulary, is consistently helpful for instruction-following ability.
Counterintuitively, "llama-voc-pre-l" is inferior to "llama-voc-pre" on MMCU, which shows that pre-training on more data does not necessarily lead to higher performance on academic exams.
The Languages of Prompts. Popular open instruction-tuned LLMs, e.g., Alpaca and Vicuna, tend to use prompts in English. One intuitive question is: is instruction-tuning in Chinese more suitable for Chinese prompts? Figure 4 shows the results of using Chinese and English prompts based on LLaMA and Bloom. When instruction-tuning LLaMA, using Chinese prompts improves performance on both benchmarks compared to English prompts, while we observe the opposite phenomenon on Bloom. This demonstrates that using Chinese prompts for models with weaker Chinese abilities (e.g., LLaMA) can effectively help them respond in Chinese, while for models with good Chinese abilities (e.g., Bloom), prompts in English (the language they are better at) can better guide the model to understand the process of fine-tuning with instructions.
Human Value Alignment. To avoid LLMs generating toxic content, aligning them with human values is a crucial issue. We add the human-value alignment data built by COIG (see Appendix A.4 for details) into instruction-tuning to explore its impact. Figure 5 compares the results of instruction-tuning with and without human-value alignment, which shows that human-value alignment results in a slight performance drop. How to balance the harmlessness and performance of LLMs is a research direction worth exploring in the future.
Towards a Better Chinese LLM
Problem Formulation. The goal of this section is to find an optimal triplet (m, p, d) that maximizes the comprehensive capabilities:

(m*, p*, d*) = argmax_{(m, p, d)} Σ_{t ∈ T} E_t(f_d(m, p)),

where E_t denotes the evaluation of each generative ability t ∈ T from both Belle-eval and MMCU, and f_d(m, p) denotes the model obtained by instruction-tuning the frozen LLM m with parameter-efficient method p on instruction dataset d.
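This objective amounts to a discrete search over the triplet space. A minimal sketch, with a score table standing in for the summed per-ability evaluation and all values made up for illustration:

```python
from itertools import product

# Hypothetical sketch: pick the triplet (m, p, d) with the highest summed
# per-ability score. The score dict plays the role of sum_t E_t(f_d(m, p)).
def best_triplet(models, methods, datasets, score):
    return max(product(models, methods, datasets), key=lambda t: score[t])

score = {
    ("bloom", "lora", "alpaca-gpt4"): 0.71,
    ("bloom", "lora", "belle"): 0.69,
    ("llama", "lora", "alpaca-gpt4"): 0.55,
    ("llama", "lora", "belle"): 0.52,
}
winner = best_triplet(["bloom", "llama"], ["lora"], ["alpaca-gpt4", "belle"], score)
assert winner == ("bloom", "lora", "alpaca-gpt4")
```

In practice each score entry would require a full instruction-tuning run, so the paper prunes the space using the findings from the earlier sections rather than searching exhaustively.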
Our Instruction-tuned LLM.On the basis of the findings above, we carefully design the instruction-tuning process and publicly release a Bloom-based high-performance LLM, which is comparable to ChatGLM and far surpassing Moss.
In particular, we select a combination of datasets with significant gains on Belle-eval or MMCU to improve our model's comprehensive ability. Besides, we carefully design a suitable prompt to induce better-quality generation. The implementation details can be found in Appendix D.1. As shown in Table 2, our model is superior or comparable to ChatGLM in most categories on Belle-eval, except for the challenging Math and Extract tasks. Besides, our model slightly underperforms ChatGLM on MMCU and outperforms, by clear margins, the other LLMs that do well on Belle-eval. It is worth emphasizing that our model has far fewer trainable parameters (16M, based on LoRA) than ChatGLM, which adopts full-parameter fine-tuning (6B).

Conclusion
This paper is the first to conduct a thorough empirical study on instruction-tuning open large language models in Chinese, with a detailed discussion of a range of large language models, parameter-efficient methods, and Chinese instruction datasets. In addition, we explore several other important factors, including CoT, vocabulary, the language of prompts, and human-value alignment. Based on this empirical exploration, we publicly release an LLM that rivals ChatGLM, along with its implementation details.

Limitations
Most experimental results are based on parameter-efficient methods, which may differ from the results of full-parameter fine-tuning. However, we believe that the findings and conclusions in this paper are still applicable to full-parameter fine-tuning. In addition, instruction-tuning based on parameter-efficient methods has broader application and research scenarios.

Ethics Statement
The open LLMs used in this paper may be driven by certain biases in their training data and pose a risk of toxic generation. There may also exist harmful stereotypes in the open instruction datasets we discuss. There is still a long way to go in exploring the safety of LLMs.

A More Details about the Work Involved

A.1 Benchmarks
Belle-eval During this evaluation, ChatGPT is used to rate (from 0 to 1) the model response based on the ground-truth answer. A score of 0 indicates that the model response is completely unacceptable, while a score of 1 indicates that the response perfectly solves the input instruction. The prompts and instructions of the samples in each category are rich and varied. We consider the capability examined in this dataset to be AGI (instruction-following) capability.
MMCU MMCU (Zeng, 2023) is collected from online public resources, covering 11845 multiple choice questions in four professional disciplines.
There are several subtasks under the education and medicine disciplines. The average accuracy over all subtasks is taken as the discipline score. A generated answer is considered correct only when it completely matches the annotated ground-truth option number or option content. This evaluation is relatively rigid with respect to expected outputs. We consider the capability examined in this dataset to be the reserve of professional knowledge (to deal with human examinations). These two assessments complement each other to some extent. Table 9 shows the data statistics of these two benchmarks.
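The rigid matching rule can be sketched as follows; this is our reading of the rule, not the official MMCU scoring script, and the names are assumptions:

```python
# Sketch of the rigid MMCU matching rule: a generated answer counts as correct
# only if it exactly matches the ground-truth option number or option content.
def is_correct(generated: str, option_number: str, option_content: str) -> bool:
    g = generated.strip()
    return g == option_number or g == option_content

assert is_correct("B", "B", "光合作用")            # matches the option number
assert is_correct("光合作用", "B", "光合作用")      # matches the option content
assert not is_correct("答案可能是B", "B", "光合作用")  # extra text -> counted wrong
```

The last case shows why the evaluation is "relatively rigid": a verbose but substantively correct response is still scored as incorrect.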

A.2.1 Base LLMs
LLaMA LLaMA (Touvron et al., 2023) is a decoder-only language model based on the Transformer (Vaswani et al., 2017) architecture, and is trained on more tokens (1T, 1.4T) than what is typically used (Hoffmann et al., 2022). It ranges from 7B to 65B parameters and outperforms existing LLMs (e.g., GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022)) at smaller parameter scales. However, the vocabulary of its tokenizer contains few Chinese characters, which limits its expressive power in Chinese.
Bloom Bloom (Workshop, 2023) is a multilingual language model trained on the ROOTS dataset, involving 46 natural and 13 programming languages. The proportion of Chinese corpus in its pre-training data is second only to that of English. The largest version of Bloom has 175B parameters, while the most popular version is 7B.
Moss-moon-003-base Moss-moon-003-base (the base model of the MOSS series, moss-base for short) is initialized with CodeGen (Nijkamp et al., 2023) and further self-supervised pre-trained on high-quality Chinese (100B) and English (20B) corpora. The pre-training data contains about 700B words in total. It has 16B parameters.
A.2.2 Supervised Fine-tuned LLMs
Vicuna Vicuna (Chiang et al., 2023) is fine-tuned from LLaMA on 70K user-shared ChatGPT conversations gathered by ShareGPT (https://sharegpt.com/). Vicuna claims to have achieved 90% of ChatGPT's performance in a preliminary evaluation using GPT-4 as a judge, making it the most popular open-source LLM. However, further rigorous evaluation is needed, especially in Chinese scenarios.
Bloomz & Bloomz-mt Bloomz and Bloomz-mt are fine-tuned from Bloom on the crosslingual task mixtures xP3 (Muennighoff et al., 2023) and xP3mt, which contain 13 training tasks in 46 languages with prompts in English and in 20 languages, respectively. This supervised fine-tuning process aims to further boost performance on multilingual tasks.
ChatGLM ChatGLM-6B (Zeng et al., 2022) is an open bilingual LLM, supporting both Chinese and English. It first completes pre-training on about 1T tokens in Chinese and English, and then adds supervised fine-tuning and human feedback reinforcement learning (HFRL) (Ouyang et al., 2022) processes to make the model follow instructions.

A.3 Parameter-efficient Methods
LoRA Low-Rank Adaptation (LoRA) (Hu et al., 2021) injects trainable rank decomposition matrices into each attention layer of the Transformer architecture.
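A minimal numeric sketch of this low-rank update, with illustrative dimensions not tied to any particular model:

```python
import numpy as np

# Numeric sketch of the LoRA update: the frozen weight W is augmented with a
# trainable low-rank product B @ A, so the adapted layer computes (W + B A) x.
rng = np.random.default_rng(0)
d, r = 64, 4                        # hidden size and LoRA rank, r << d
W = rng.normal(size=(d, d))         # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized

x = rng.normal(size=d)
y = W @ x + B @ (A @ x)             # adapted forward pass

# With B = 0, the adapter is a no-op, so training starts exactly from W.
assert np.allclose(y, W @ x)
# Trainable parameters: 2 * d * r, far fewer than the d * d frozen weights.
assert A.size + B.size == 2 * d * r
```

Zero-initializing B so that training starts from the unmodified pre-trained weight follows the original LoRA design.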
AdaLoRA AdaLoRA (Zhang et al., 2023b) allocates the parameter budget adaptively to each layer's LoRA module according to their importance score.Specifically, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition, which allows it to prune the singular values of unimportant updates and reduce their parameter budget.
Prefix-tuning Inspired by discrete prompts for language models, prefix-tuning (Li and Liang, 2021) adds a sequence of continuous "virtual tokens" as a soft prompt (namely, a prefix) before the original sequence at each transformer layer. During training, the prefix weights are trainable while the other model parameters are frozen.
Prompt-tuning Similar to p-tuning, prompt-tuning (Lester et al., 2021) also involves training only the input prompt embeddings. Differently, it freezes all pre-trained weights.
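The soft-prompt mechanism shared by these methods can be sketched as follows; sizes are illustrative assumptions, and a real implementation would learn `soft_prompt` by backpropagation while keeping the model frozen:

```python
import numpy as np

# Sketch of prompt-tuning: a small matrix of trainable "virtual token"
# embeddings is prepended to the (frozen) input embeddings before the model.
n_virtual, seq_len, d_model = 20, 8, 64
soft_prompt = np.zeros((n_virtual, d_model))  # the only trainable tensor
input_embeds = np.ones((seq_len, d_model))    # frozen embedding-lookup output

model_input = np.concatenate([soft_prompt, input_embeds], axis=0)
assert model_input.shape == (n_virtual + seq_len, d_model)
```

Prefix-tuning differs in that such virtual vectors are injected at every transformer layer rather than only at the input.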
Parallel Adapter Parallel Adapter, namely Padapter (He et al., 2021) adds adapter layers in parallel with attention layers or MLP layers for each transformer layer.

A.4 Chinese Instruction Datasets
AlpacaGPT4 AlpacaGPT4 (Peng et al., 2023) is deemed an optimized version of the Alpaca (Taori et al., 2023) dataset. It first uses ChatGPT to translate Alpaca's prompts into Chinese, and then regenerates the instruction-following data with GPT-4 instead of text-davinci-003.
Belle Belle (Ji et al., 2023) uses the same method as Alpaca (Taori et al., 2023) to generate instruction data by text-davinci-003, except that Belle only generates Chinese instruction-following data and artificially filters low-quality data.It contains about 1.5M instruction-following data.
Moss-002-sft-data This is a multi-turn conversation dataset covering helpfulness, honesty, and harmlessness, which is also generated by self-instruct (Ouyang et al., 2022). We select the 0.59M Chinese conversations among them for the following experiments.
firefly Firefly (Yang, 2023) collects 23 Chinese datasets and manually writes several instruction templates for each dataset. It contains a total of 1.65M training samples, covering couplet, poem, essay, and other generation tasks from traditional literature, plus 0.5M Belle data for instruction diversity.
xP3 xP3 (Muennighoff et al., 2023) is a collection of 16 natural language processing datasets across 46 languages with prompts. We select the 3M Chinese instances among them.
instinwild Instruction in the Wild (instinwild) (Xue et al., 2023) no longer manually sets initial seed instructions like Alpaca, but crawls and filters 429 instructions from Twitter as the seed instructions, to avoid human involvement and cover more topics.Following self-instruct, it uses the seed instructions to generate more instructions and corresponding responses by text-davinci-003.The Chinese instructions in this dataset are about 52K.
HC3 HC3 (Guo et al.) is a corpus of human-ChatGPT comparisons that aims to investigate how close ChatGPT is to human experts. To this end, it collects questions from various public question-answering datasets (e.g., medicine, law, and finance QA) together with the corresponding human answers and ChatGPT answers. The Chinese portion of HC3 contains 13K questions, 22K human answers, and 17K ChatGPT answers.
COIG COIG (Zhang et al., 2023a) is a Chinese instruction collection, consisting of: Translated Instructions, which contains about 67K instructions translated from three datasets: 1.6K task descriptions in Super-NaturalInstructions (Wang et al., 2022) along with a single instance for each, 175 instructions of the seed tasks in Self-Instruct, and 66K instructions from Unnatural Instructions (Honovich et al., 2022). CCMC, namely Counterfactual Correction Multi-round Chat, contains about 68K rounds of conversations between students and teachers. This dataset is built by prompting two LLMs to generate conversations based on the entities of the knowledge-graph dataset CN-DBpedia (Xu et al., 2017) to alleviate hallucination and factual inconsistency. Exam Instructions contains 63K questions from the main Chinese commonsense tests, e.g., Gaokao and the Civil Servant Examination. These questions cover six main subjects: Chinese, English, Politics, Biology, History, and Geology. Human Value Alignment Instructions contains 34K Chinese samples that present human values shared in the Chinese-speaking world (3K) and regional-culture human values. Table 29 shows some examples from this dataset.
pCLUE pCLUE collects 9 Chinese tasks with a total of 73 different prompts and 1.2M samples. These tasks include news classification, natural language inference, semantic matching, keyword recognition, reading comprehension, etc.
Table 25 and Table 26 show representative examples of the above datasets.

B More Experimental Results
B.1 Full results of different LLMs after instruction-tuning.

B.4 Full results of LLaMA and its expanded vocabulary versions.
Table 17 and Table 18 show the full Belle-eval and MMCU results of LLaMA and its expanded vocabulary versions, respectively.
B.5 Full results of the comparison of using English and Chinese prompts.
Table 19 and Table 20 show the full Belle-eval and MMCU results of using Chinese and English prompts based on LLaMA and Bloom. In Table 25 and Table 26, we select a representative sample for each instruction dataset to better illustrate their respective characteristics.

B.6 Full results with human-value alignment
C.4 Comparison of the responses from models instruction-tuned on different instruction datasets.
To compare the characteristics of models trained on different instruction datasets more intuitively, we present in Table 27 the responses of models instruction-tuned on different datasets for the same question.

C.5 Comparison of the responses from LLaMA and its expanded vocabulary versions.
We present examples of responses from LLaMA and its expanded vocabulary versions in Table 28.
The response from "llama-voc" clearly does not understand the meaning of the instruction. Therefore, after expanding the vocabulary, pre-training should be conducted on a vast Chinese corpus before instruction-tuning.
C.6 Examples from human-value alignment dataset.
The samples of the human-value alignment dataset, built by COIG, are shown in Table 29. These samples are often related to topics such as "online violence" and "gender discrimination", and are designed to ensure that the model holds correct values when facing such topics.

D.1 Implementation Details
Our LLM is trained from Bloom with LoRA. We select a combination of datasets with significant gains on Belle-eval or MMCU, including 10 datasets: Alpaca-GPT4, Belle, ShareGPT-zh, moss-sft-data, instinwild, firefly, COIG-trans, pCLUE, and CoT data. To balance the capabilities of our model, we only select 1/3 of moss-sft-data and 1/5 of firefly and pCLUE. The model performs best with 1.3 epochs of instruction-tuning. For the specific prompt, we add the sentence "回答尽可能详细具体" ("Answer as detailed and specific as possible" in Chinese) at the end of the original prompts.

Figure 1: Performance gains (denoted by orange bars) of open LLMs on Belle-eval (Upper) and MMCU (Lower) from instruction-tuning. The instruction-tuned performance is denoted by blue bars and red numbers.

Figure 2: Training loss over steps for different parameter-efficient methods.
Chain-of-Thought Data. Chain-of-Thought (CoT) is a hot topic in LLM research. Existing works find that adding rationales or explanations to the inference prompts (Wei et al., 2023; Wang et al., 2023; Kojima et al., 2023) (based on APIs of GPT-3 and ChatGPT) or to the training corpus (Wei et al., 2022; Chung et al., 2022; Zhang et al., 2023c) (based on normal language models, e.g., T5 (Raffel et al., 2020) and FLAN-T5 (Wei et al., 2022)) can enhance the model's reasoning ability, which is useful for solving complex problems. However, extending CoT to open LLMs has not yet been thoroughly explored. Alpaca-CoT (Qingyi Si, 2023) uses several qualitative examples to demonstrate the effectiveness of CoT in reasoning, but a systematic evaluation is still necessary. To this end, this paper conducts experiments to analyze the impact of CoT data on LLMs.
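Adding rationales to the training corpus typically means inlining the reasoning chain before the final answer in each target text, so the model learns to emit the chain before concluding. The following minimal sketch illustrates the idea; the field names and the prompt template are assumptions for illustration, not Alpaca-CoT's exact format.

```python
def format_cot_example(instruction: str, rationale: str, answer: str) -> dict:
    """Build one (prompt, target) instruction-tuning pair with the
    rationale inlined before the final answer."""
    prompt = f"Instruction: {instruction}\nResponse:"
    target = f"{rationale} So the answer is {answer}."
    return {"prompt": prompt, "target": target}

# Example: a simple arithmetic question with its reasoning chain.
sample = format_cot_example(
    "What is 2 + 3?",
    "2 plus 3 equals 5.",
    "5",
)
```

At inference time the model trained on such targets tends to reproduce the rationale-then-answer pattern even for questions without an explicit CoT prompt.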

Figure 3: Performance comparison of LLaMA and its expanded vocabulary versions. "llama-voc", "llama-voc-pre" and "llama-voc-pre-l" denote instruction-tuning the models obtained by further pre-training LLaMA with an expanded vocabulary on 0B, 20B and 100B Chinese tokens, respectively.

Figure 4: Performance comparison of instruction-tuning with prompts in English and Chinese. The specific prompts used in our experiments can be found in Appendix D.2.
Unlike prefix-tuning, p-tuning (Liu et al., 2022) injects trainable continuous tokens into the embedding layer only, rather than into every layer, resulting in fewer parameters being updated. During training, it freezes part of the model parameters.
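The mechanism can be illustrated with a toy sketch: a small set of trainable continuous prompt vectors is prepended to the (frozen) token embeddings at the embedding layer only, and these vectors are the only new parameters p-tuning introduces. The dimensions, the toy vocabulary, and the embedding table below are assumptions for illustration.

```python
import random

EMB_DIM = 4            # toy embedding dimension
NUM_PROMPT_TOKENS = 3  # number of trainable continuous prompt tokens
VOCAB = {"你": 0, "好": 1}

rng = random.Random(0)
# Frozen token-embedding table (in practice, taken from the pre-trained model).
token_embeddings = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]
    for _ in range(len(VOCAB))
]
# Trainable continuous prompt vectors -- the only parameters p-tuning updates
# (initialized to zeros here purely for illustration).
prompt_embeddings = [[0.0] * EMB_DIM for _ in range(NUM_PROMPT_TOKENS)]

def embed_with_prompt(token_ids):
    """Prepend the trainable prompt vectors to the frozen token embeddings."""
    return prompt_embeddings + [token_embeddings[i] for i in token_ids]

# The transformer layers above the embedding layer see
# NUM_PROMPT_TOKENS + len(input) embedding vectors, unchanged otherwise.
seq = embed_with_prompt([VOCAB["你"], VOCAB["好"]])
```

By contrast, prefix-tuning would inject such trainable vectors into the key/value states of every transformer layer, which is why it updates more parameters.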

Table 1: Pre-training details of the popular open base LLMs (upper) and supervised fine-tuning (sft) details of the open sft LLMs (lower). "*" denotes closed source, i.e., ChatGLM only releases the supervised fine-tuned version. More details of each open LLM can be found in Appendix A.2.

Table 2: Performance of open LLMs on Belle-eval. "Ours" is our carefully designed instruction-tuned LLM, which is discussed in detail in Section 4. The best scores are in bold, the second-best scores are underlined. The results of ChatGPT are only for display and will not be compared.

Table 3: Performance of open LLMs on MMCU.
Full results are shown in Table 10 and 11 in App. B.1. "Layer" denotes the layers to which adapters are added. "-l" denotes the version with a large number of parameters. More details of each parameter-efficient method can be found in Appendix A.3.

Table 5: The details of existing Chinese instruction datasets. The "Con" column shows the dataset construction methods.

Table 6: Belle-eval performance of models instruction-tuned from Bloom on different instruction datasets.

Table 7: MMCU performance of models instruction-tuned from Bloom on different instruction datasets.

Table 9: Data statistics of Belle-eval and MMCU.

Table 12 and Table 13 show the full Belle-eval and MMCU results of instruction-tuning with different parameter-efficient methods, respectively.

Table 14 and 15 show the full Belle-eval and MMCU results of the models instruction-tuned without or with CoT data, respectively. Table 16 shows the detailed results on all subjects in the education discipline of MMCU.

Table 14: Belle results of Bloom instruction-tuned with and without CoT data.

Table 15: MMCU results of Bloom instruction-tuned with and without CoT data.

Table 21 and Table 22 show the full results of the models instruction-tuned with and without human-value alignment data.

Table 16: Results on all subjects in MMCU's education discipline of Bloom instruction-tuned with and without CoT data.

Table 18: Full MMCU results of LLaMA and its expanded vocabulary versions. "llama-voc", "llama-voc-pre" and "llama-voc-pre-l" denote instruction-tuning the models obtained by further pre-training LLaMA with an expanded vocabulary on 0B, 20B and 100B Chinese tokens, respectively.

Table 19: Belle results of using Chinese (denoted by "-zh") and English (denoted by "-en") prompts based on LLaMA and Bloom.

Table 20: MMCU results of using Chinese (denoted by "-zh") and English (denoted by "-en") prompts based on LLaMA and Bloom.

Table 22: MMCU results of Bloom instruction-tuned with and without the human-value alignment dataset.