InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning

Large language models (LLMs) that are tuned with instructions have demonstrated remarkable capabilities in various tasks and languages. However, their ability to generalize to underrepresented languages is limited due to the scarcity of available data. Additionally, directly adapting instruction-tuned LLMs to new languages can result in catastrophic forgetting, which leads to the loss of multitasking ability. To address this issue, we propose InstructAlign, which uses continual crosslingual instruction tuning to enable LLMs to align new unseen languages with previously learned high-resource languages. Our results demonstrate the effectiveness of InstructAlign in enabling the model to understand low-resource languages with limited parallel data while preventing catastrophic forgetting. Our work contributes to the advancement of language adaptation methods, particularly for adapting instruction-tuned LLMs to underrepresented languages. Our code is released at https://github.com/HLTCHKUST/InstructAlign

Expanding the language repertoire of LLMs is essential for promoting inclusivity and diversity in Natural Language Processing (NLP) technology, particularly for languages that are underrepresented and low-resource. Recent studies, including Wilie et al. (2020); Cahyawijaya et al. (2021); Aji et al. (2022); Adelani et al. (2021, 2022); Kakwani et al. (2020); Kumar et al. (2022); Ebrahimi et al. (2022); Adilazuarda et al. (2022); Cahyawijaya et al. (2023b,a); Song et al. (2023), have emphasized the importance of this issue. To address this concern, previous research (Yong et al., 2022) has demonstrated that continual pre-training (Chau et al., 2020; Muller et al., 2021; Ebrahimi and Kann, 2021) and parameter-efficient fine-tuning (PEFT) methods, such as MAD-X (Pfeiffer et al., 2020) and (IA)³ (Liu et al., 2022), can be utilized to swiftly integrate knowledge of unseen languages into LLMs using monolingual corpora of the new languages by conducting masked language modeling (MLM) (Devlin et al., 2019). However, these methods become ineffective when applied directly to instruction-tuned LLMs due to catastrophic forgetting (French, 1999), which prevents them from solving general natural language tasks after the language adaptation phase (Yong et al., 2022). Moreover, adapter-based approaches, such as MAD-X (Pfeiffer et al., 2020), lose multilingual inference capability due to their modularity (Adilazuarda et al., 2023).

Figure 1: The number of languages supported by existing LLMs (green region) per language family.¹ Existing LLMs only support a fraction of the languages around the globe. Most of them are within the Indo-European language family, while most other language families are underrepresented or even unexplored.
To solve this problem, we introduce InstructAlign, a continual instruction tuning framework that seamlessly aligns newly adapted low-resource languages (L2) with the pre-trained high-resource languages (L1) of an instruction-tuned LLM through crosslingual alignment. InstructAlign compels LLMs to perform crosslingual alignment between pre-trained and novel languages through alignment-based crosslingual instruction tuning, enabling the model to grasp L2 with only a limited amount of parallel data. To further prevent catastrophic forgetting, InstructAlign incorporates experience replay (Chaudhry et al., 2019b; Rolnick et al., 2019), which mixes past data into the instruction tuning.
In summary, our work presents the following major contributions:
• We propose InstructAlign, a crosslingual continual instruction tuning method that allows instruction-tuned LLMs to understand L2 with minimal degradation on L1 while retaining their zero-shot prompting capability.
• We propose alignment-based crosslingual instruction tuning, which enables LLMs to align L2 to L1, allowing better L2 acquisition with only a limited amount of parallel data.
• We evaluate the effectiveness of InstructAlign on Indonesian local language datasets, and demonstrate that InstructAlign can significantly improve the performance on L2 by 5-10% weighted F1 while maintaining the original performance on L1 and the multitask capability.
• We analyze the correlation between the performance on L2 and other unseen languages (L3), suggesting the zero-shot generalization of InstructAlign to L3, particularly when the languages are related.²

¹ We gather the language and language family information from URIEL (Littell et al., 2017; Malaviya et al., 2017).
² We use the terms L1, L2, and L3 to denote first, second, and third language acquisition (Hammarberg, 2001, 2014). In our context, L1 denotes the pre-trained languages of the LLM, L2 denotes the newly adapted languages, and L3 denotes other languages that are not seen during tuning with InstructAlign and are only used in the evaluation.

2 Related Work

Crosslingual Alignment
Crosslingual alignment is a widely explored concept that allows language models (LMs) to align representations, commonly at the word or sentence level, across different languages. Crosslingual alignment allows models to perform crosslingual inference without requiring any tuning on the target task. Fung (1997, 1998) introduces a bilingual lexicon extraction method through word-to-word alignment from a word relation matrix. Fung and Cheung (2004) introduce a method for bilingual lexicon and parallel sentence extraction that aligns sentences from non-parallel data via bootstrapping and EM. Lample et al. (2018b) and Cao et al. (2020) introduce bilingual lexicon alignment methods that require no parallel data by performing embedding alignment across different languages, which is then utilized for unsupervised machine translation (Lample et al., 2018a). A crosslingual pre-training objective for building LMs, namely translation language modeling (TLM) (Conneau and Lample, 2019), has also been explored; it enforces token-level alignment between languages, allowing the model to learn aligned representations across multiple languages. In this work, we perform crosslingual alignment through instructions by introducing a bilingual denoising instruction, which is equivalent to the token-level alignment in TLM, and a translation instruction, which serves as sentence-level alignment across different languages.

Continual Learning for Language Models
Continual learning is a paradigm for learning various tasks gradually, allowing the model to acquire new knowledge over time (Delange et al., 2021). Using a naive fine-tuning approach for continual learning causes the model to suffer from catastrophic forgetting (CF) (French, 1999). Therefore, various methods have been introduced to prevent CF. Regularization-based methods (Kirkpatrick et al., 2017; Liu et al., 2018; Aljundi et al., 2018) add a regularization term to the loss function to prevent the model from being updated in a direction that causes CF. Replay-based methods (Rolnick et al., 2019; Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019a) incorporate samples from previous tasks while learning the new task, which helps regularize the model to avoid CF. Parameter-isolation methods (Aljundi et al., 2017; Serrà et al., 2018; Mallya and Lazebnik, 2018) prevent CF by learning new tasks with a new set of parameters while keeping the other parameters frozen during fine-tuning. In this work, we apply experience replay (Rolnick et al., 2019), a simple replay-based method that adds tasks from previously learned languages when training on new languages, without any loss modification.

Methodology
InstructAlign is a continual crosslingual instruction tuning framework that allows the model to align high- and low-resource languages through instruction tuning. InstructAlign introduces two components: 1) crosslingual alignment through instruction tuning, which allows the model to align pre-trained languages with new languages through crosslingual alignment, and 2) continual instruction tuning, which applies continual learning to instruction tuning to avoid catastrophic forgetting.

Crosslingual Alignment through Instruction
Given a parallel text pair (x, y) from two languages, the goal of crosslingual alignment is to learn a mapping function f(·) parameterized by θ such that f(x, θ) = f(y, θ). The (x, y) text pair commonly comes in the form of a word pair or a phrase pair (Lample et al., 2018b,a), but in theory the approach should generalize to a sentence pair or even a paragraph pair. With the goal of aligning two parallel texts from two different languages, InstructAlign defines a set of alignment-based crosslingual instructions by exploiting multiple alignment objectives that can be derived from a parallel sentence. Specifically, we explore three different objectives: bilingual denoising / translation language modeling (TLM), machine translation (MT), and crosslingual semantic similarity (XSS).
We first define a parallel sentence pair (X = {x_1, x_2, ..., x_m}, Y = {y_1, y_2, ..., y_n}), where x_i and y_i denote the i-th tokens of the sentences X and Y, respectively. For bilingual denoising (TLM), we model the problem as a conditional denoising task. InstructAlign first applies a perturbation function g_tlm(·) to the target sentence Y that masks out part of the tokens, producing Ỹ = g_tlm(Y). The pair (X, Ỹ) is then used to generate a prompt via h_tlm(X, Ỹ, T_tlm), resulting in an input-output data pair for prompting (h_tlm(X, Ỹ, T_tlm), Y), where h_tlm(·) denotes a bilingual denoising prompt generator and T_tlm denotes the prompt template.
For the machine translation (MT) objective, we define the input-output data pair as (h_mt(X, T_mt), Y), where h_mt(·) denotes a machine translation prompt generator and T_mt denotes a machine translation prompt template. For the crosslingual semantic similarity (XSS) objective, we model the problem as an inference task to predict whether two sentences X and Y are semantically similar. Specifically, we define the input-output data pair as (h_xss(X, Y, T_xss), l), where h_xss(·) is a semantic similarity prompt generator, T_xss denotes a semantic similarity prompt template, and l is the binary label indicating whether the sentences are semantically related. Examples of the crosslingual alignment objectives are shown in Figure 2.
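To make the three objectives concrete, the following minimal sketch shows how input-output pairs could be generated from a single parallel sentence pair. The templates `T_TLM`, `T_MT`, and `T_XSS` here are hypothetical stand-ins, not the actual templates used by InstructAlign (those are listed in Appendix A):

```python
import random

# Hypothetical English prompt templates for the three alignment objectives.
T_TLM = "Fill in the masked words.\nSource: {src}\nMasked target: {masked}\nAnswer:"
T_MT = "Translate the following sentence.\nSource: {src}\nTranslation:"
T_XSS = "Do these two sentences have the same meaning? Answer yes or no.\nSentence 1: {src}\nSentence 2: {tgt}\nAnswer:"

def g_tlm(sentence, mask_ratio=0.3, mask_token="<mask>", rng=random):
    """Perturbation function g_tlm(.): mask a fraction of tokens in the target sentence."""
    tokens = sentence.split()
    n_mask = max(1, int(len(tokens) * mask_ratio))
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = mask_token
    return " ".join(tokens)

def make_tlm_pair(x, y, rng=random):
    """Bilingual denoising (TLM): recover Y given (X, perturbed Y)."""
    y_masked = g_tlm(y, rng=rng)
    return T_TLM.format(src=x, masked=y_masked), y

def make_mt_pair(x, y):
    """Machine translation (MT): generate Y given X."""
    return T_MT.format(src=x), y

def make_xss_pair(x, y, is_parallel):
    """Crosslingual semantic similarity (XSS): binary prediction of whether X and Y are parallel."""
    return T_XSS.format(src=x, tgt=y), "yes" if is_parallel else "no"
```

Negative XSS examples would pair X with a randomly drawn non-parallel sentence and the label "no".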

Continual Instruction Tuning through Experience Replay
Within the continual instruction tuning phase of InstructAlign, experience replay (Rolnick et al., 2019) is employed to minimize the catastrophic forgetting problem. Experience replay works by storing some of the past training data and using it during the optimization steps on the new data. These past data serve as a regularization term that prevents the model from forgetting past knowledge when learning from the new data. The past data are collected from the instruction tuning data used to develop the corresponding instruction-tuned model, all of which are supervised.
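As a minimal sketch (not the authors' implementation), the replay mechanism can be illustrated as follows: a subset of r past instruction-tuning examples is stored, and each training batch interleaves past and new examples in equal proportion.

```python
import random

def sample_replay(d_old, r, rng=random):
    """Store r randomly drawn examples from the past instruction-tuning data D_old."""
    return rng.sample(d_old, min(r, len(d_old)))

def interleaved_batches(d_old_replay, d_cli, batch_size, num_steps, rng=random):
    """Yield batches that alternate replayed past examples (s_{D_old}) and new
    crosslingual instruction examples (s_{D_cli}), giving a balanced 50/50 mix."""
    for _ in range(num_steps):
        batch = []
        for _ in range(batch_size // 2):
            batch.append(rng.choice(d_old_replay))  # sample from D_old
            batch.append(rng.choice(d_cli))         # sample from D_cli
        yield batch
```

In practice, each batch would then be fed to a standard cross-entropy training step; the replayed examples act as the regularizer described above.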
During continual instruction tuning, InstructAlign takes only r randomly sampled examples from the past instruction tuning data. The sampled past data are used during continual instruction tuning with balanced sampling between the past data and the new data. More formally, we define a past dataset D_old and a newly generated crosslingual instruction dataset D_cli. On each optimization step, InstructAlign samples data in an interleaving manner, resulting in a batch B = {s_{D_old}, s_{D_cli}, ...}, where s_{D_old} and s_{D_cli} denote samples taken randomly from D_old and D_cli, respectively. Since the samples are all supervised, the optimization can be done by minimizing the cross-entropy loss (Good, 1952) over all the samples in the batch.

Experimental Settings

For the L2 languages, we utilize seven Indonesian local languages from the same language family group, i.e., Sundanese (sun), Javanese (jav), Balinese (ban), Minangkabau (min), Buginese (bug), Acehnese (ace), and Banjarese (bjn).
For the L1 languages, we utilize English (eng), as English covers the majority of the pre-training data in most LLMs, and Indonesian (ind), as the language is closely related to the target L2 languages. For the parallel data, we utilize the FLORES-200 dataset (Goyal et al., 2021; Team et al., 2022), combining the validation and test sets to produce a total of ∼2,000 parallel sentences for each language pair, which is orders of magnitude smaller than the data sizes used for language adaptation in prior works (Pfeiffer et al., 2020; Cahyawijaya et al., 2021; Alabi et al., 2022; Yong et al., 2022).

Models & Hyperparameters
We utilize BLOOMZ (Muennighoff et al., 2022) as the backbone model. Specifically, we explore InstructAlign on two model sizes, i.e., BLOOMZ-560M and BLOOMZ-1.1B. For InstructAlign, we evaluate three crosslingual alignment objectives, i.e., TLM, XSS, and MT. The list of prompts used for instruction tuning is described in Appendix A.
We use English prompts in all experiments. We run all experiments with an initial learning rate of 1e-5 with linear learning rate decay and a batch size of 32 for a fixed 50,000 optimization steps. We run InstructAlign on a single RTX 3090 GPU (24GB) using the AdamW optimizer (Loshchilov and Hutter, 2019) and mixed-precision training (Micikevicius et al., 2018). We use a fixed number of replay samples r = 100,000.

Evaluation Setting
After tuning with InstructAlign, the model is evaluated in a zero-shot crosslingual inference setting, in which the model has never seen the task in the target languages but might have seen the task in other, seen languages. To retrieve the classification label, we compute the joint probability of the prompt completed with each label in the dataset and pick the label whose prompt yields the highest joint probability. We consider 3 different English prompts for the zero-shot inference and take the average accuracy and weighted F1 scores as the evaluation metrics.
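The label-selection step can be sketched as follows, where `log_prob_fn` stands in for scoring a completed prompt under the LLM (in practice, summing the model's token log-probabilities). The toy scorer and template below are illustrative assumptions, not the paper's actual prompts:

```python
def choose_label(build_prompt, labels, log_prob_fn):
    """Zero-shot classification: verbalize the prompt with each candidate label and
    return the label whose completed prompt has the highest joint (log-)probability."""
    return max(labels, key=lambda label: log_prob_fn(build_prompt(label)))

# Toy stand-in for an LLM scorer; a real implementation would sum the model's
# token log-probabilities over the completed prompt.
def toy_log_prob(prompt):
    return 0.0 if "positive" in prompt else -1.0

template = lambda label: f"The sentiment of 'Filmnya bagus banget!' is {label}."
prediction = choose_label(template, ["positive", "negative"], toy_log_prob)
```

The same routine would be run once per prompt template, and the resulting accuracy and weighted F1 averaged over templates.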
The list of the prompts used in our evaluation is shown in Appendix A. We use a single RTX1080Ti GPU (11GB) to run the evaluation for all models.
To reduce the memory bottleneck during inference, we run the evaluations using 8-bit inference via LLM.int8() (Dettmers et al., 2022). We provide a performance comparison between 8-bit and 32-bit evaluation in Appendix B. We also evaluate on the Indonesian (ind) subset of NT-S. More details about each dataset can be found in Appendix C.

Baselines
For our baselines, we conduct zero-shot prompting using four different sizes of BLOOMZ, i.e., BLOOMZ-560M, BLOOMZ-1.1B, BLOOMZ-1.7B, and BLOOMZ-3B, without any additional language adaptation phase. In addition, to assess the effectiveness of the crosslingual alignment, we add continual instruction-tuned baselines that incorporate only monolingual denoising instructions, which is equivalent to performing language adaptation using MLM (Devlin et al., 2019).
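The monolingual denoising baseline can be sketched analogously to the bilingual objectives, except that no parallel source sentence is provided (the template is a hypothetical stand-in):

```python
import random

# Hypothetical template; only monolingual text in the new language is used.
T_MLM = "Fill in the masked words.\nText: {masked}\nAnswer:"

def make_mlm_pair(y, mask_ratio=0.3, mask_token="<mask>", rng=random):
    """Monolingual denoising: recover a sentence from its partially masked version,
    without any crosslingual signal."""
    tokens = y.split()
    n_mask = max(1, int(len(tokens) * mask_ratio))
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = mask_token
    return T_MLM.format(masked=" ".join(tokens)), y
```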

Experiment Result
Effectiveness of InstructAlign Table 2 shows the results of InstructAlign on both L1 and L2 languages. InstructAlign-tuned models with the MT, TLM, and XSS objectives significantly outperform the comparable-sized BLOOM and BLOOMZ baselines on L2 languages while retaining a similar performance level to the original BLOOMZ models on L1 languages. Surprisingly, InstructAlign with the MLM objective is also effective, yielding a similar performance on L2 languages compared to the crosslingual objectives. In §6.1, we show that this improvement only occurs after combining the MLM objective with experience replay, demonstrating the importance of continual instruction tuning during language adaptation. In the NusaParagraph emotion recognition (NP-E) and topic classification (NP-T) tasks, all baselines yield very low scores, suggesting that the ability to solve long text classification tasks does not emerge at this scale (Wei et al., 2022b). Interestingly, InstructAlign-tuned models show consistent, albeit marginal, improvements on these tasks, demonstrating that an early emergence of these abilities in L2 languages is possible through InstructAlign.

Effect of Model Scaling
As shown in Figure 3, scaling increases the zero-shot performance of BLOOMZ on both L1 and L2, but the same does not apply to BLOOM, suggesting the benefit of instruction tuning for better generalization to unseen tasks and languages. Moreover, applying InstructAlign to the larger BLOOMZ results in higher overall zero-shot performance on both L1 and L2. Specifically, InstructAlign-tuned models with 1.1 billion parameters yield ∼2% higher performance than the 560-million-parameter InstructAlign-tuned models and even perform competitively with the original 3-billion-parameter BLOOMZ model. This suggests that the scaling laws of language models (Kaplan et al., 2020; Hoffmann et al., 2022) also apply after InstructAlign, where larger models tend to perform better than their smaller counterparts. Detailed experiment results are described in Appendix D.
6 Analysis and Discussion

Alignment Objectives
To better understand the effectiveness of each alignment objective, we conduct experiments using a single objective, i.e., monolingual denoising (MLM), machine translation (MT), bilingual denoising (TLM), and crosslingual semantic similarity (XSS), as well as various combinations of multiple objectives. We also test zero-shot prompting without any additional language adaptation phase as a baseline for comparison. Note that continual instruction tuning through experience replay is not applied (r=0) in these experiments, since we focus on the effect of the alignment objectives.
As shown in Table 3, zero-shot BLOOMZ-560M performs better than the random baseline on L1 while achieving a lower score on L2, showing that BLOOMZ-560M cannot be directly applied to these L2 languages. For InstructAlign with a single objective, similar to the results from prior work (Yong et al., 2022), applying the MLM objective degrades the performance of the model. Similarly, using the MT objective also decreases the performance on both L1 and L2. Nevertheless, as shown in Table 2, this problem can be mitigated by applying continual learning. On the other hand, both TLM and XSS help improve the model on L2, indicating that these objectives are effective for aligning the L1 and L2 languages. Additionally, the performance on the L1 languages is also retained the most when using the TLM and XSS objectives.
When combining multiple objectives during InstructAlign, we observe the highest score when combining TLM and XSS. Interestingly, adding the MLM and MT objectives during InstructAlign consistently yields lower scores compared to the single TLM and XSS objectives for both the L2 and L1 languages. These facts suggest that crosslingual objectives such as XSS and TLM are effective for learning new languages through crosslingual instruction tuning with limited data.

Continual Instruction Tuning
To assess the effectiveness of continual instruction tuning through experience replay, we conduct an experiment exploring the effect of different numbers of replay samples r used in continual instruction tuning. Specifically, we explore 4 settings, i.e., r = [0, 1000, 10000, 100000]. Figure 4 shows the performance of the InstructAlign-tuned models across this range of replay examples r. When using no experience replay (r=0), the performance on the pre-trained languages drops significantly, and the performance on the novel languages also drops, which suggests that the multitask prompting capability is degraded (Yong et al., 2022). As r increases, a much smaller performance degradation is observed on the L1 languages. Interestingly, the performance on the novel languages also improves as r increases, which in turn increases the performance of the model across all languages. These facts demonstrate the importance of experience replay for avoiding catastrophic forgetting in continual instruction tuning.

Impact of InstructAlign on L3 Languages
We further assess the impact of aligning L2 languages through InstructAlign on other unseen Indonesian languages within the same language family group (L3). To assess the transferability from the L2 languages to the L3 languages, we compute the Pearson's correlation coefficient (Rodgers and Nicewander, 1988; Freedman et al., 2007) between the ∆ weighted F1 scores on the L2 and L3 languages for each model, relative to the corresponding baseline.

Figure 6: Per-language results of InstructAlign-tuned models on NusaX. Red denotes L1 languages, teal denotes L2 languages, and purple denotes L3 languages.
As shown in Figure 5, the correlation between the performance improvements on the L2 and L3 languages is high, with a Pearson's correlation coefficient of 0.96. This indicates the effectiveness of the InstructAlign approach not only for adapting to L2 languages but also for related L3 languages. Nevertheless, the improvement for unseen languages still depends on the language distances, as shown in Figure 6, where Toba Batak (bbc) and Buginese (bug) yield much lower scores compared to the other languages. This result aligns with the analysis from NusaX (Winata et al., 2022), which shows that the performances on Buginese (bug) and Toba Batak (bbc) are the lowest in both the multitask and zero-shot crosslingual settings due to the relatively low vocabulary overlap with other languages in NusaX. This suggests that by performing InstructAlign, the model can also understand unseen languages that are related to the newly adapted language, indicating the generalization of crosslingual transfer from pre-trained languages to novel and unseen languages.
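The correlation analysis boils down to computing Pearson's r over paired per-model improvements; a minimal sketch (the Δ weighted-F1 values below are illustrative placeholders, not the paper's actual numbers):

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model Δ weighted F1 (InstructAlign minus BLOOMZ baseline),
# one entry per model configuration; illustrative values only.
delta_f1_l2 = [5.2, 7.8, 9.1, 6.4]
delta_f1_l3 = [2.1, 3.5, 4.2, 2.8]
r = pearson_r(delta_f1_l2, delta_f1_l3)
```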

Conclusion
In this work, we address the challenge of increasing the language coverage of instruction-tuned LLMs by introducing a crosslingual continual instruction tuning method, InstructAlign. We demonstrate that InstructAlign allows an instruction-tuned LLM to effectively learn novel languages through alignment-based crosslingual instruction tuning objectives while retaining its existing multitask and multilingual abilities. Based on our experiment results on four Indonesian local language datasets, InstructAlign effectively improves the understanding of novel Indonesian local languages, improving language understanding performance on novel languages by ∼5-10% weighted F1 score, and also demonstrates better forward transfer to other unseen Indonesian local languages by a significant margin. In addition, we analyze various objectives of InstructAlign and demonstrate the effectiveness of alignment-based crosslingual instruction tuning objectives compared to traditional masked language modeling (MLM) for learning novel languages with a limited amount of data. Our work contributes to the advancement of language adaptation methods for instruction-tuned LLMs, especially for underrepresented languages.
7 Limitations and Future Work

Other Model Architectures
Despite the effectiveness of InstructAlign on BLOOMZ, its effectiveness has not been explored for other model architectures, such as encoder-decoder models. Due to the limited computing budget, we can only run the InstructAlign experiments on a decoder-only model, i.e., BLOOMZ. We encourage future works to explore other model architectures.

Scaling to Larger LLMs
As described in §5, we hypothesize that InstructAlign-tuned models follow the scaling laws of language models (Kaplan et al., 2020; Hoffmann et al., 2022). Nevertheless, we can only empirically show this scaling effect of InstructAlign on BLOOMZ-560M and BLOOMZ-1.1B due to the limited computing budget. We expect future works to expand the exploration to larger-scale models.

Other Continual Learning Methods
In terms of continual learning methods, we only explore a single approach, i.e., experience replay (Rolnick et al., 2019), due to its efficient memory requirements. Further analysis and examination of other continual learning approaches, such as A-GEM (Chaudhry et al., 2019a) and EWC (Liu et al., 2018), is a potential research direction for future works.

Underrepresented Languages from Other Language Families
There are many other underrepresented languages, such as the indigenous languages of the Americas (Ebrahimi et al., 2022), African languages (Adelani et al., 2021, 2022), Indic languages (Kakwani et al., 2020; Kumar et al., 2022), Austronesian languages (Winata et al., 2022; Cahyawijaya et al., 2023b), and many others around the world. In this work, we only explore InstructAlign for the Malayo-Polynesian language family group under the Austronesian language family, specifically for Indonesian local languages. For future work, we are eager to explore the generalization of InstructAlign and other language adaptation methods on other underrepresented and low-resource languages.

Ethical Consideration
Our work highlights the importance of inclusivity in LLM technology for underrepresented and extremely low-resource languages. Throughout our study, we are well aware of the ethical responsibility associated with language research and the potential impact it can have on communities. Our ultimate goal is to promote linguistic diversity and contribute to a more inclusive NLP landscape. We encourage further collaboration and engagement with underrepresented and low-resource language communities to ensure that their voices are heard and their needs are addressed in future language technology development. We remain committed to the principles of ethical research, diversity, inclusivity, and fairness, striving to mitigate biases and promote social good through our work in the field of NLP.

C Datasets Details
In this section, we describe the statistics of each dataset used in the experiments. Table 12 shows the statistics for the sentiment analysis task of NusaTranslation (Cahyawijaya et al., 2023a).
For the Indonesian subset, we take the first fold of the IndoLEM sentiment dataset (Koto et al., 2020), the Indonesian sentiment analysis dataset used as the source sentences in NusaTranslation (Cahyawijaya et al., 2023a). Table 13 shows the statistics for the sentiment analysis task of NusaX (Winata et al., 2022). Table 14 and Table 15 display the statistics for the emotion recognition and topic classification tasks of NusaParagraph (Cahyawijaya et al., 2023a), respectively.

D Detailed Experiment Results
In this section, we provide the complete experimental results per dataset. Table 16 shows the experiment results on the sentiment analysis task of NusaTranslation.

Figure 3: Average performance of various models across different model scales on the L1 and L2 language subsets of the NT-S and NX-S datasets.

Figure 4: ∆ weighted F1 of InstructAlign-tuned BLOOMZ-560M with (left) TLM and (right) XSS objectives under various continual instruction tuning approaches, compared to the original BLOOMZ-560M baseline. Negative scores indicate that the model performs worse than the baseline.

Figure 5: Correlation of ∆ weighted F1 from the InstructAlign-tuned models to the corresponding BLOOMZ backbone models on novel and unseen languages. R denotes the Pearson correlation coefficient.

Table 1: Statistics of all datasets used in the experiments. #Lang. denotes the number of languages in each dataset.

Table 2: Evaluation results of InstructAlign with BLOOMZ-560M and BLOOMZ-1.1B backbones. Compared to the BLOOM and BLOOMZ baselines, all InstructAlign-tuned models improve the zero-shot crosslingual performance on L2 while also retaining the performance on L1.

Table 3: Averaged weighted F1 scores from various InstructAlign objectives on the NT-S and NX-S datasets. We use BLOOMZ-560M as the backbone.

Table 19: Experiment results on the topic classification task of the NusaParagraph dataset.