Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback

A key technology for developing large language models (LLMs) is instruction tuning, which helps align the models' responses with human expectations to realize impressive learning abilities. Two major approaches to instruction tuning are supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), which are currently applied to produce the best commercial LLMs (e.g., ChatGPT). To improve the accessibility of LLMs for research and development, various instruction-tuned open-source LLMs have also been introduced recently, e.g., Alpaca and Vicuna, to name a few. However, existing open-source LLMs have only been instruction-tuned for English and a few popular languages, which hinders their impact and accessibility for many other languages in the world. Among the few very recent works exploring instruction tuning of LLMs in multiple languages, SFT has been the only approach used. This leaves a significant gap for RLHF-based fine-tuned LLMs in diverse languages and raises important questions about how RLHF can boost the performance of multilingual instruction tuning. To overcome this issue, we present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages. Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate experiments and the development of future multilingual LLM research. We also present benchmark datasets to enable the evaluation of generative LLMs in multiple languages. Our experiments demonstrate the advantages of RLHF for multilingual instruction tuning over SFT across different base models and datasets. Our framework and resources are released at https://github.com/nlp-uoregon/Okapi.


Introduction
Pre-trained on massive data, large language models (LLMs) with hundreds of billions of parameters, such as GPT-3 (Rae et al., 2021), can unlock new emergent abilities that cannot be achieved with smaller models (Wei et al., 2022; Choi et al., 2023; Jiao et al., 2023). However, as LLMs are trained with the autoregressive learning objective, they might exhibit unintended behaviors that deviate from human expectations (Tamkin et al., 2021; Weidinger et al., 2021; Kenton et al., 2021). To overcome this issue, instruction fine-tuning has been proposed as a prominent approach to improve LLMs' capabilities in following human instructions and align them with human intentions in conversations (Christiano et al., 2017; Stiennon et al., 2020; Sanh et al., 2021; Ouyang et al., 2022). The two major techniques for instruction tuning are supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), which are leveraged by the best commercial LLMs, such as ChatGPT and GPT-4, to deliver outstanding dialog performance.
Another issue with LLMs pertains to the massive scales and closed-source nature of commercial LLMs, which greatly restrict accessibility and the extent of interactions with the technology. To this end, there have been growing efforts from the open-source community to create more accessible LLMs with affordable scales while securing performance competitive with the proprietary LLMs, e.g., LLaMA (Touvron et al., 2023), StableLM (StabilityAI, 2023), Falcon (Almazrouei et al., 2023), and MPT (MosaicML, 2023). Instruction tuning has also been applied to these open-source LLMs to improve their abilities to engage with humans, and different instruction datasets have been collected to facilitate the process, e.g., Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), LaMini-LM (Wu et al., 2023), and Dolly (Conover et al., 2023).
However, the instruction-following abilities of existing open-source LLMs have been developed mainly for English and some popular languages (i.e., using instruction data for those languages), failing to support many other languages of the world and serve a broader population (Taori et al., 2023; Wu et al., 2023). To overcome this challenge, a few contemporary frameworks have explored instruction tuning of LLMs for multiple languages, i.e., Phoenix (Chen et al., 2023) and Bactrian-X (Li et al., 2023). However, their multilingual instruction tuning efforts are limited to supervised fine-tuning only, leaving unexplored whether reinforcement learning from human feedback (RLHF) can further boost the performance of multilingual LLMs.
To fill this gap, our work aims to develop Okapi, an open-source framework with RLHF-based instruction-tuned LLMs for multiple languages, to provide resources and shed light on their performance for multilingual LLM learning. Okapi emphasizes less studied languages and open-source LLMs to better democratize the benefits of instruction-tuned LLMs. In particular, an example in the instruction datasets involves an instruction, an input text, and a desired response output/demonstration. In SFT, the pre-trained LLMs are fine-tuned over the instruction triples (instruction, input, output) via supervised learning to promote their alignment with human expectations. In RLHF, generated outputs from the SFT-tuned LLMs are first ranked to provide training signals for the reward functions. Afterward, the SFT-tuned models are further optimized via reinforcement learning utilizing rewards from the trained reward models. As such, RLHF has been successfully employed to create effective commercial LLMs (e.g., InstructGPT, ChatGPT), owing to its ability to learn beyond positive examples associated with only desired demonstrations. By leveraging the reward models, RLHF can observe lower ranking scores for less accurate demonstrations to obtain richer training signals for LLMs. To our knowledge, Okapi is the first work to perform instruction tuning with RLHF for open-source LLMs over multiple languages.
To develop Okapi, we need to overcome the scarcity of instruction datasets in multiple languages to train and evaluate RLHF models. Motivated by the 52K instructions from Alpaca (Taori et al., 2023), we leverage Self-Instruct (Wang et al., 2023) to generate 106K additional instructions in English, introducing a larger dataset to facilitate RLHF evaluation. Afterward, we utilize ChatGPT to translate the instructions into a diverse set of 26 languages, including high-, medium-, and low-resource languages (e.g., Telugu, Ukrainian, Nepali, and Kannada), to offer comprehensive resources and insights for multilingual instruction tuning. In addition, we introduce a translation-based prompt for ChatGPT to produce rankings for multiple responses to the same instructions from the LLMs, which will be used to train the reward models for RLHF experiments. Finally, we obtain the multilingual evaluation datasets for our fine-tuned LLMs by translating three benchmark datasets for LLMs in the widely used HuggingFace Open LLM Leaderboard (HuggingFace, 2023; Gao et al., 2021), i.e., ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), and MMLU (Hendrycks et al., 2021), into the 26 languages.
Using BLOOM (Scao et al., 2022) and LLaMa (Touvron et al., 2023) as the base LLMs, our experiments illustrate that RLHF generally performs better than SFT for multilingual instruction tuning. We also highlight the greater challenges that low-resource languages pose for multilingual instruction tuning of LLMs, which future research can focus on. Finally, we release our framework with the created resources and fine-tuned RLHF models. We also provide scripts to interact with our models at https://github.com/nlp-uoregon/Okapi.

Data Preparation
A key requirement for our development of instruction-tuned LLMs with RLHF involves instruction, ranking, and evaluation datasets in multiple languages. To this end, we perform a comprehensive data collection process to prepare the necessary data for our multilingual framework Okapi in 26 languages, divided into four major steps: English instruction generation, instruction translation, ranking data production, and evaluation data creation.
English Instruction Generation: An instruction example to tune LLMs often has three components: an instruction to specify the task, an input text, and an associated output text (i.e., demonstration or label) (Ouyang et al., 2022). Current public instruction datasets for LLMs mainly cover English or some popular languages. We note that a few recent instruction datasets, such as xP3 (Muennighoff et al., 2022) and Flan (Chung et al., 2022; Longpre et al., 2023), include multilingual data; however, their instructions are still written in English. Additionally, these datasets tend to be converted from NLP datasets with template instructions, which cannot reflect the flexibility of human-written prompts (Wang et al., 2023). Consequently, our goal is to develop instruction datasets with instructions, inputs, and output texts in multiple languages, including low-resource ones, to better realize general prompts from humans.
To achieve this goal, our strategy is to first obtain English instructions and then translate them into other languages. The benefits of our approach involve consistent instruction content across languages to facilitate performance comparison, while taking advantage of translation systems to enable examination of more languages. To conveniently scale our data, we follow the instruction generation method in Alpaca, which in turn employs the Self-Instruct procedure in (Wang et al., 2023), to produce our English dataset.
Starting with a pool of 175 human-written seed instructions in English, at each step, Alpaca samples several instructions from the seeds to form an in-context example to prompt the text-davinci-003 model of OpenAI for new instruction generation. Overall, Alpaca releases 52K instructions for tuning LLMs. In this work, we apply the same Self-Instruct procedure as Alpaca to generate 106K additional English instructions, resulting in a larger combined dataset of 158K instructions for our RLHF-based models in Okapi. Notably, we condition our generation process on the 52K instructions from Alpaca so that a new instruction is only saved if it is different enough from Alpaca's and previously generated instructions per the ROUGE score criteria in Alpaca (Taori et al., 2023).
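The novelty filter described above can be sketched as a ROUGE-L check against the existing instruction pool. The helpers below are a minimal, self-contained re-implementation (not the exact library Alpaca uses), and 0.7 is the similarity threshold reported in the Alpaca codebase; a candidate is kept only if its ROUGE-L F1 against every previously accepted instruction stays below that threshold.

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(cand, ref):
    # Token-level ROUGE-L F1 between a candidate and a reference string.
    c, r = cand.lower().split(), ref.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def filter_novel(candidates, pool, threshold=0.7):
    # Keep a candidate only if it is dissimilar enough from everything seen so far.
    kept = list(pool)
    for cand in candidates:
        if all(rouge_l_f1(cand, prev) < threshold for prev in kept):
            kept.append(cand)
    return kept[len(pool):]

pool = ["Write a poem about the ocean."]
new = filter_novel(
    ["Write a poem about the sea.", "Explain how photosynthesis works."],
    pool,
)
# The near-duplicate first candidate is rejected; the second is kept.
```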
Instruction Translation: Given the 158K English instructions, we aim to translate them into multiple other languages to obtain data for our multilingual models in Okapi. Table 1 presents the 26 selected languages in our framework. Using the data ratios r of the languages in CommonCrawl to classify languages as in previous work (Bang et al., 2023; Lai et al., 2023), our study encompasses a diverse set of languages, including 8 high-resource languages (r > 1.0), 11 medium-resource languages (r > 0.1), and 7 low-resource languages (r < 0.1). Notably, several of our languages, such as Marathi, Gujarati, and Kannada, have received limited attention in NLP and instruction tuning.
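The grouping thresholds above amount to a simple mapping from a language's CommonCrawl data ratio (in percent) to a resource category. The sketch below makes the boundaries explicit; the ratios shown are illustrative placeholders, not the exact CommonCrawl figures.

```python
def resource_group(ratio_percent):
    # High: r > 1.0, Medium: 0.1 < r <= 1.0, Low: r <= 0.1 (percent of CommonCrawl).
    if ratio_percent > 1.0:
        return "high"
    if ratio_percent > 0.1:
        return "medium"
    return "low"

# Placeholder ratios for three languages in the study (not the real values).
languages = {"ru": 6.0, "vi": 0.5, "te": 0.02}
groups = {code: resource_group(r) for code, r in languages.items()}
```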
We utilize ChatGPT to translate the 158K English instructions into the 26 target languages for Okapi. Compared to traditional machine translation systems, an advantage of ChatGPT is the ability to use prompts to specify different expectations for the translated texts to accommodate diverse types of instructions. For example, we can instruct ChatGPT to preserve code in instruction examples about programming, as we expect code to be the same across natural languages. It is important to note that we directly translate the instruction, input text, and associated output in each English instruction of our data. This is in contrast to other multilingual instruction-tuning approaches (Li et al., 2023) that only translate instructions and input texts into a target language (using Google Translate) and then prompt ChatGPT to generate response outputs in the target language based on the translated instructions and inputs. The intuition for our approach concerns various potential issues of ChatGPT, e.g., hallucination, bias, mathematical reasoning, and toxic content (Bang et al., 2023; Borji, 2023), which can be exaggerated if ChatGPT is used to produce responses in non-English languages for different tasks (Lai et al., 2023). By generating the instructions and responses in English first, we aim to capitalize on the greater performance of LLMs for different NLP tasks in English to avoid these exaggeration issues and achieve higher quality instructions.
Ranking Data Production: To perform RLHF for an LLM, we need ranked response outputs from the model for the same instruction and input to train a reward model. Concretely, given an LLM M and a dataset S = {(inst_k, input_k)}_{k=1}^{N} with N pairs of instructions inst_k and input texts input_k for a target language, we first prompt M to generate T > 1 output responses output_k = {output_k^1, ..., output_k^T} for each pair (inst_k, input_k). Afterward, the responses in output_k are ranked according to their fitness and quality for the instruction inst_k and input text input_k. This ranking data {inst_k, input_k, output_k} can then be leveraged to train our reward models in Okapi.
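A ranked response set can be expanded into pairwise preference examples for reward-model training, following the standard construction in Ouyang et al. (2022): every response that outranks another yields one (chosen, rejected) pair. This sketch shows that expansion; it is an illustration of the data format, not necessarily the exact Okapi code.

```python
def pairs_from_ranking(prompt, responses, ranks):
    # ranks[i] is the overall rank of responses[i]; 1 = best, T = worst.
    pairs = []
    for i, yi in enumerate(responses):
        for j, yj in enumerate(responses):
            if ranks[i] < ranks[j]:  # yi is preferred over yj
                pairs.append({"prompt": prompt, "chosen": yi, "rejected": yj})
    return pairs

pairs = pairs_from_ranking(
    "Summarize the article.",
    ["resp_a", "resp_b", "resp_c", "resp_d"],
    [2, 1, 4, 3],
)
# T = 4 responses with distinct ranks give C(4, 2) = 6 preference pairs.
```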
We also employ ChatGPT to rank the response outputs for multilingual LLMs. Similar to the motivation for our translation-based approach to obtaining instruction data in multiple languages, our ranking strategy first asks ChatGPT to translate the instructions and responses {inst_k, input_k, output_k} of a target language into English; the ranking of the responses is then done over the translated English data to exploit the greater quality of ChatGPT for English (using the translation and ranking prompts in Figure 2). For each example {inst_k, input_k, output_k}, the translation and ranking prompts are wrapped in a two-turn dialog with ChatGPT to allow the ranking process to condition on the resulting translations. This also ensures the same output format for the ranking prompts for convenient parsing. Overall, we obtain ranked response outputs for 42K instructions from the 106K generated instructions for each language in Okapi.

• Turn 1: Translation Prompt: You will be given an instruction, an input for the instruction, and four possible responses for the instruction. The input can be empty, shown as <empty>. You need to translate the provided instruction, input, and responses into English. Instruction: ... Input: ... Response 1: ... Response 2: ... Response 3: ... Response 4: ...

• Turn 2: Ranking Prompt: Given the translated instruction, input, and responses, you will need to rank the responses according to three factors: correctness with respect to the instruction and input, coherence, and naturalness. You will need to provide an overall rank for each response when all three factors are considered. The overall rank for a response must be an integer between 1 and 4, where 1 is for the best response and 4 is for the worst response. You cannot assign the same rank to two different responses. The format of your output must be, for each response: "<Response r>: overall rank: <1/2/3/4>". The responses must be in original order. Do not include explanations in your output.

Evaluation Data Creation: We employ three datasets in the HuggingFace Open LLM Leaderboard (HuggingFace, 2023), i.e., ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), and MMLU (Hendrycks et al., 2021), to evaluate the model performance for our Okapi framework. All the datasets are organized as multiple-choice question-answering tasks, although they focus on different types of knowledge and reasoning aspects. ARC involves 1,170 grade-school science questions; HellaSwag provides 9,162 commonsense inference questions that are easy for humans but difficult for many state-of-the-art models; and MMLU assesses accuracy on 13,062 questions over various branches of knowledge (STEM, humanities, social sciences, and more). Although the LLM community has widely adopted the HuggingFace leaderboard for performance examination, the datasets are only provided in English, making it impossible to evaluate LLMs for the languages in our work. To this end, we translate the examples of the three datasets into our 26 selected languages using ChatGPT and the translation prompt in Figure 1. The translated datasets are then reserved for evaluating the LLMs in our Okapi framework.

Instruction-tuning with RLHF
We follow three steps to develop a fine-tuned LLM with RLHF for each target language in our Okapi framework: supervised fine-tuning, reward model training, and reinforcement learning.
Supervised Fine-tuning (SFT): Starting with a multilingual LLM as the base, e.g., BLOOM (Scao et al., 2022), we fine-tune the model with our instruction dataset for the target language using supervised learning with the autoregressive objective. Here, we fine-tune all parameters of the base LLM with SFT to accurately understand model performance in multilingual settings.
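For the supervised step, each (instruction, input, output) triple is typically serialized into a single token sequence, with the autoregressive loss applied only to the response tokens. The sketch below illustrates this; the Alpaca-style template and the byte-level stand-in tokenizer are assumptions for illustration, not the exact Okapi preprocessing.

```python
PROMPT = (
    "Below is an instruction that describes a task, paired with an input.\n"
    "### Instruction:\n{instruction}\n### Input:\n{input}\n### Response:\n"
)

def build_example(instruction, input_text, output, tokenize):
    prompt_ids = tokenize(PROMPT.format(instruction=instruction, input=input_text))
    response_ids = tokenize(output)
    input_ids = prompt_ids + response_ids
    # -100 is the conventional "ignore" label index: prompt tokens carry no loss,
    # so the autoregressive objective is computed only on the response tokens.
    labels = [-100] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

toy_tokenize = lambda text: list(text.encode())  # byte-level stand-in tokenizer
ids, labels = build_example("Add the numbers.", "2 and 3", "5", toy_tokenize)
```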
Reward Model Training: The goal of this step is to train a reward model for the target language that will compute reward signals for reinforcement learning to further optimize the SFT-tuned model from the previous step. For each pair of a prompt and a potential response, our reward model returns a scalar value to quantify the appropriateness of the response with respect to the instruction and input text in the prompt. We exploit the instructions with multiple ranked responses from the data collection step for this training step. An example to train our reward model for a language involves an instruction and an input text (forming a prompt x) along with two sampled responses y_c and y_r for x from our datasets. Based on the ranking information, we can assume one of the responses (i.e., y_c) is preferred over the other (i.e., y_r). The binary ranking loss (Ouyang et al., 2022) is then employed to train our reward model r, aiming to assign a higher score r(x, y_c) to the preferred response y_c than the score r(x, y_r) to y_r:

L_R = -E_{(x, y_c, y_r)}[log σ(r(x, y_c) - r(x, y_r))]

Reinforcement Learning (RL): With the reward model established for the target language, the SFT-tuned model undergoes additional fine-tuning through RL to align it with human preferences. For this purpose, we employ the Proximal Policy Optimization (PPO) algorithm (Ouyang et al., 2022), which maximizes the mean reward of the model via the objective:

max_ϕ E_{x ~ D_RL, y ~ π_ϕ(y|x)}[r(x, y) - β KL(x, y)]

Here, D_RL corresponds to the prompt distribution, and π_ϕ(y|x) denotes the policy or language model that requires optimization. π_ϕ(y|x) is initialized with the SFT-tuned model π_0(y|x). Also, KL(x, y) = D_KL(π_ϕ(y|x) || π_0(y|x)) is the Kullback-Leibler divergence that penalizes large deviations of π_ϕ from the initial SFT policy π_0, with β controlling the strength of the penalty.
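The two training signals above can be sketched numerically: the binary ranking loss -log σ(r(x, y_c) - r(x, y_r)) is small when the reward model already scores the preferred response higher, and the RL objective rewards high r(x, y) minus a KL penalty. This is a minimal scalar illustration of the formulas, not a training loop.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ranking_loss(score_chosen, score_rejected):
    # Binary ranking loss for reward-model training: -log(sigmoid(r_c - r_r)).
    return -math.log(sigmoid(score_chosen - score_rejected))

def penalized_reward(env_reward, kl, beta=0.1):
    # Per-sample RL objective: reward minus the KL penalty keeping the policy near SFT.
    return env_reward - beta * kl

good = ranking_loss(2.0, -1.0)   # reward model already prefers y_c -> small loss
bad = ranking_loss(-1.0, 2.0)    # reward model mis-orders the pair -> large loss
```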

Experiments
Our Okapi framework utilizes two multilingual LLMs, BLOOM (Scao et al., 2022) and LLaMA (Touvron et al., 2023), as the base models for the fine-tuning processes. We focus on their 7B-parameter versions to accommodate our computing resources and achieve a fairer comparison. For each base model and target language, we carry out both SFT-based and RLHF-based instruction tuning:

• SFT: The base model is fine-tuned over our entire set of 158K translated instructions for the target language in a supervised manner.
• RLHF: The base model is first fine-tuned with supervised training over the 52K translated instructions from Alpaca. Afterward, a reward model is trained using the 42K instructions with ranked responses obtained during data collection. Note that the ranked responses are sampled from the SFT-tuned base model from the previous step over the 52K Alpaca instructions. Finally, given the reward model, the SFT-tuned model is further optimized via reinforcement learning over the 64K remaining translated instructions from our generation set.
Following the HuggingFace Open LLM Leaderboard, the EleutherAI Language Model Evaluation Harness framework (Gao et al., 2021) is used to compute the model performance over the translated ARC, HellaSwag, and MMLU datasets for each language in our framework. As a reference, we also report the performance of the base models BLOOM and LLaMA in the experiments. Finally, for BLOOM, we further compare with BLOOMZ (Muennighoff et al., 2022), the fine-tuned version of BLOOM over the cross-lingual task mixture dataset xP3 with millions of multilingual instructions to achieve instruction-following ability.

Evaluation: Tables 2 and 3 present the performance of the models on ARC, HellaSwag, and MMLU when BLOOM and LLaMa are used as the base models (respectively). In the tables, for each language group (i.e., high-, medium-, and low-resource), we report the average performance over the languages in the group and the performance for two example languages from the group. We also include the average performance over all languages in Okapi. As some of our languages (especially the low-resource ones) are not supported by LLaMA, Table 3 omits those languages (see Table 1). Finally, Appendix A provides the performance of the models over all languages and datasets in Okapi.
Table 3: Performance of the models using LLaMa 7B.

The first observation from the tables is that RLHF is generally better than SFT for multilingual fine-tuning of LLMs over different datasets, base models, and language groups. It is also evident that the RLHF-tuned models can significantly improve the performance of the original base models (i.e., BLOOM and LLaMa) for almost all the language groups and datasets. In all, this highlights the quality of the generated instruction data and the effectiveness of RLHF in Okapi.
Comparing the performance across language groups, the models tend to achieve the highest performance for the high-resource languages, followed by the medium-resource and low-resource languages. The performance improvement of RLHF for low-resource languages is also the smallest (based on BLOOM). Interestingly, our fine-tuned BLOOM models with 158K generated instructions can significantly outperform BLOOMZ over almost all the languages on the ARC, HellaSwag, and MMLU datasets using either SFT or RLHF. As BLOOMZ fine-tuned BLOOM over more than 78M multilingual instructions converted from NLP datasets (Muennighoff et al., 2022), this demonstrates the higher quality of our generated instructions for multilingual instruction tuning of LLMs.

Related Work
The most advanced methods for NLP involve fine-tuning pre-trained language models (PLMs) on training data for the downstream tasks (Min et al., 2023). Instruction tuning can be considered a special type of fine-tuning for PLMs, where generative PLMs (e.g., GPT) are further trained with instruction data to acquire instruction-following abilities. SFT is the most popular instruction tuning approach and is leveraged by most existing LLMs, including ChatGPT, Alpaca (Taori et al., 2023), and Vicuna (Chiang et al., 2023). RLHF can also be used to further enhance LLMs (Wei et al., 2021; Ouyang et al., 2022), although it has been less explored by current open-source LLMs due to the challenges in obtaining ranking data for the reward models. For multilingual learning, instruction tuning has only been applied in the form of SFT for non-English languages using multilingual LLMs, e.g., BLOOM and LLaMA, in a few contemporary works (Chen et al., 2023; Li et al., 2023; Muennighoff et al., 2022).

Conclusion
We present Okapi, the first framework for instruction tuning of LLMs in multiple languages using RLHF. We introduce instruction, ranked response, and evaluation data in 26 diverse languages to enable the training of RLHF methods. Our results reveal the benefits of RLHF for multilingual fine-tuning of LLMs and the challenging problems that low-resource languages pose in this area.

Ethical Statement
Our framework utilizes the multilingual LLMs BLOOM-7B and LLaMa-7B to develop instruction-tuned models with reinforcement learning from human feedback. To obtain the necessary resources to train and evaluate our models, we also apply Self-Instruct (Taori et al., 2023) with GPT-3 to generate English instruction data, and use ChatGPT to translate and rank our response data in different languages. As such, the models in our framework might inherit potential issues of the underlying models BLOOM, LLaMa, GPT-3, and ChatGPT, such as hallucination, biases, and toxic content. Regrettably, the data required to train such LLMs, even for purportedly open-source models such as LLaMa and BLOOM, remains unreleased, preventing essential investigation into these matters for our models. Future research can explore open-source datasets, such as CulturaX (Nguyen et al., 2023) and RedPajama (Computer, 2023), to develop truly open LLMs, enabling deeper attribution of the problems and better understanding of the models' operations. To minimize the impacts of these issues in the current work, our framework will fully release the generated instruction, ranking, and evaluation data to enable comprehensive exploration and research of the techniques. We will also restrict the release of our models to research purposes, respecting the policies of the underlying models such as LLaMa and ChatGPT, to facilitate future research on LLMs while limiting potential ethical issues for society. Consequently, we do not believe our framework poses any greater societal risks than existing published research in this area for LLMs (Wang et al., 2023). Finally, we confirm that our work fully complies with the ACL Ethics Policy and, to the best of our knowledge, there are no other ethical issues associated with our work.

Figure 2: Prompts to translate and rank responses.

Table 1: List of 26 non-English languages in Okapi along with their codes, numbers of first and second speakers (the "Pop." column), data ratios in CommonCrawl (http://commoncrawl.org), and categories. The languages are grouped into categories based on their data ratios in CommonCrawl: High- (H, > 1%), Medium- (M, > 0.1%), and Low-Resource (L, > 0.01%). Columns "B" and "L" indicate whether a language is supported by the LLMs BLOOM and LLaMa (respectively) or not.
Translation Prompt: Translate the values in the following JSON object into <target language> language. You must keep the keys in the JSON object in English. If a value contains programming code, only translate the comments while preserving the code. Your translations must convey all the content in the original text and cannot involve explanations or other unnecessary information. Please ensure that the translated text is natural for native speakers with correct grammar and proper word choices. Your translation must also use exact terminology to provide accurate information even for experts in the related fields. Your output must only contain a JSON object with the translated text and cannot include explanations or other information.

Figure 1: Translation prompt for ChatGPT for multiple languages in Okapi. We organize our instruction examples into JSON objects with fields for translation prompts, instructions, inputs, and outputs sent to ChatGPT. <target language> is replaced with the selected languages in our dataset.

Table 2: Performance of the models using BLOOM 7B.