Multilingual LLMs are Better Cross-lingual In-context Learners with Alignment

In-context learning (ICL) unfolds as large language models become capable of inferring test labels conditioned on a few labeled samples without any gradient update. ICL-enabled large language models provide a promising step forward toward bypassing recurrent annotation costs in a low-resource setting. Yet, only a handful of past studies have explored ICL in a cross-lingual setting, in which transferring label knowledge from a high-resource language to a low-resource one is crucial. To bridge the gap, we provide the first in-depth analysis of ICL for cross-lingual text classification. We find that the prevalent mode of selecting random input-label pairs to construct the prompt-context is severely limited in the case of cross-lingual ICL, primarily due to the lack of alignment in the input as well as the output spaces. To mitigate this, we propose a novel prompt construction strategy — Cross-lingual In-context Source Target Alignment (X-InSTA). With an injected coherence in the semantics of the input examples and a task-based alignment across the source and target languages, X-InSTA is able to outperform random prompt selection by a large margin across three different tasks using 44 different cross-lingual pairs.


Introduction
The emergence of large-scale, pretrained, Transformer-based language models (LLMs) has marked the commencement of an avant-garde era in NLP. Departing from the traditional methods of neural language learning with temporally separated training-testing phases for downstream tasks, pretrained LLMs have shown the ability to infer labels for test inputs conditioned on the training data within a single pass. This is known as in-context learning: an LLM is prompted with a few input-output pairs from the training data (commonly referred to as demonstrations) followed by the test input; for generative tasks (summarization, text-to-code, chain-of-thought reasoning, etc.), the LLM is then required to produce an output; for classification tasks, the probabilities of the next tokens predicted by the LLM are mapped to the label space. All of this is done without updating the parameters of the LLM. In-context learning is particularly promising in two different aspects. Firstly, it reduces the need for task-specific training data, and thus, the cost of human annotation. Secondly, while the LLM was trained in a compute-intensive environment, removing the need for task-specific gradient-based weight updates can significantly reduce the carbon footprint of automated NLP/NLU, since the inference-time compute requirements are orders of magnitude smaller than those of the training/fine-tuning phases.
Challenges in cross-lingual ICL: Given that there is an order-of-magnitude discrepancy in the availability of annotated data in a high-resource language vs. a low-resource one, the ability to learn from a high-resource source context to solve tasks in low-resource targets sounds enticing. Yet, the application of ICL in a cross-lingual setting remains largely unexplored. Previous attempts at multilingual ICL (Zhang et al., 2021; Winata et al., 2021) use randomly selected input-label pairs to construct the prompt-context. This limits the ability of an LLM to infer from the context. As Xie et al. (2022) suggested, ICL emerges as the ability to infer target labels from the pretraining distribution conditioned upon the context; each input-label pair in the prompt-context is, in turn, sampled from the prompt token distribution.

Figure 1: Working example of different ICL prompts explored in this work. In example #1, randomly selecting the prompt examples fails as it introduces irrelevant contradictions into the prompt, whereas semantic alignment succeeds as it builds the context from similar reviews. In example #2, semantic alignment fails; it extracts demonstrations about 'multiple pieces', but these are not helpful for the LLM, whereas a simple task aligner works. In the last example, it is a combination of semantic and task alignment that works.

Theoretically, as the number of examples in the prompt increases, the expected prediction error decreases. However, such infinitely long prompts are practically infeasible to attain. Xie et al. (2022) posited that distinguishability of the prompt-concept, shared across the prompt examples, from all other possible concepts is essential for an optimal predictor. A random sampling of prompt examples is unlikely to construct a prompt with distinguishable concepts. Furthermore, given $(x_i, y_i)$ and $(x_{i+1}, y_{i+1})$ as two consecutive input-label pairs in the prompt-context, the transition from $y_i$ to $x_{i+1}$ is a low-probability one under the pretraining distribution (Xie et al., 2022). The transition becomes even more improbable if we simply append a test example in a different language to the prompt-context. Consider the following example of ICL prompting for cross-lingual sentiment classification:

1. That movie was good. Positive
2. Depression is the new pandemic. Negative
3. Ella lo está haciendo bien ?
The text segments are concatenated from left-to-right and top-to-bottom; therefore, two English input-label pairs are followed by a Spanish test input. There are irremovable, token-level low-probability transitions from the labels to the next input sentences. On top of this, we have three completely unrelated sentences juxtaposed together with an abrupt change in language. Intuitively, it is less likely that an LLM will be able to map the third input to its correct label, positiva (positive in Spanish), following the very convoluted patterns presented in English.
Proposed approach: We seek to develop prompt-design strategies for ICL in a cross-lingual setting that can overcome the foregoing challenges. A two-way alignment of the source and target examples is proposed. We start by injecting semantic coherence into the prompt-context by selecting similar examples; this aligns the labeled demonstrations as well as the test inputs to share a set of common concepts. Next, we seek to enforce an alignment of task-level signals across languages. We introduce manually designed, task-specific mappings from the source language to the target language, thereby providing the LLM with a 'natural' transition from the former to the latter. Together, these two approaches constitute our proposed prompt-selection strategy, X-InSTA (Cross-lingual In-context Source-Target Alignment; see Figure 1 for working examples). X-InSTA shows a staggering 18% relative improvement over random prompt selection averaged across three different text classification tasks in multiple different languages with English as the source language. Careful perturbations to these alignment methods disclose the importance of the label space structure induced by LLMs for cross-lingual ICL.
Our contributions are summarized below:

1. We propose X-InSTA, a novel method of aligning prompt examples in a cross-lingual scenario. To the best of our knowledge, this is the first attempt to push prompt design techniques for ICL in cross-lingual settings beyond the trivial strategy of random example selection.

2. We present the first in-depth analysis of the role of semantic similarity between prompt examples for cross-lingual ICL.

3. A novel concept of task-based prompt alignment is presented. We show its efficacy with 44 different source-target language pairs and empirically relate it to the underlying structures of the multilingual representations of the LLM.

Prompting Techniques
In this section, we lay out a step-by-step approach to aligning semantic coherence and task-based signals across source-target examples for ICL prompts.

Preliminaries
Let $D_s = \{(x_s^i, y_s^i)\}_i$ be a monolingual labeled dataset in language $s$, realized as a collection of input examples and their labels, $x_s^i \in X_s$ and $y_s^i \in Y_s$, respectively. Here $Y_s$ is the natural language label space in language $s$. We have another dataset $D_t$ with examples in language $t$. One can define a cross-lingual text classification task with source and target languages $s$ and $t$ in the following manner. First, we select $k$ input-label pairs from $D_s$ to construct the prompt-context $C$:

$$C = x_s^1 \oplus y_s^1 \oplus [\text{sep}] \oplus x_s^2 \oplus y_s^2 \oplus [\text{sep}] \oplus \cdots \oplus x_s^k \oplus y_s^k \oplus [\text{sep}]$$

where $[\text{sep}]$ denotes a separator token (e.g., newlines), and $\oplus$ denotes the concatenation operator. The problem of in-context prediction then translates to inferring the label $y_t \in Y_t$, where $Y_t$ is the natural language label space in language $t$, corresponding to the test input $x_t \in D_t$ conditioned on the prompt-context $C$, as follows:

$$\hat{y}_t = \arg\max_{y \in Y_t} P_{\text{LLM}}(y \mid C \oplus x_t)$$

i.e., we select the maximum-probability label in the target label space generated by the model as the token following the test input $x_t$ appended to the context $C$. The source and target label spaces, $Y_s$ and $Y_t$, share a one-to-one mapping with each other in terms of translation from $s$ to $t$.
One of the most widely used methods of constructing the context $C$, which we will henceforth call random prompting, is to randomly select $(x_s^i, y_s^i)$ pairs from $D_s$ and concatenate them together. We explore this method in our analysis, and it serves as a baseline for our experiments.
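For concreteness, a minimal sketch of random prompting in Python is given below. The function name and the generic `f"{x} {y}"` verbalization template are illustrative assumptions; real templates are task-specific (e.g., "Review: ... Rating: ..." in Figure 1), and the paper does not prescribe an exact implementation.

```python
import random

def build_random_context(source_data, k, sep="\n"):
    """Random prompting baseline: sample k (input, label) pairs from the
    source-language dataset and concatenate them into a prompt-context C,
    joined by the separator token [sep] (here, a newline)."""
    demos = random.sample(source_data, k)
    return sep.join(f"{x} {y}" for x, y in demos) + sep
```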

Semantic Alignment
Chang et al. (2022) showed that multilingual models encode different languages in a shared embedding space while still preserving language-sensitive semantic information. Despite the language difference between the source and target inputs, $x_s$ and $x_t$, it is then likely that their semantic similarity will be reflected in the hidden representations constructed by the LLM. Therefore, we hypothesize that choosing semantically similar examples to construct the prompt-context would help the model perform in-context inference. That is, if $e_t$ is the embedding of the target and $e_s$ that of the source, the higher the similarity score between them, the better the sentence $x_s$ will serve as a demonstration for the target sentence $x_t$.
Inspired by Liu et al. (2022), we extract prompt examples directly dependent on the test input distribution. Here we utilize multilingual sentence-transformers (Reimers and Gurevych, 2020) to extract the sentence embeddings of the test input $x_t \in D_t$ and the source inputs $X_s$. Based on the cosine similarity between the target input $x_t$ and the source inputs $x_s^j \in X_s$, we then extract the top-$k$ demonstrations (see Algorithm 1). While the target input and the demonstrations differ in language, we hypothesize that by pairing semantically similar context demonstrations with the input sentence, the LLM is able to improve its reasoning ability and, subsequently, the final task performance (see Table 11 in Appendix D for examples of such aligned demonstrations).

Algorithm 1: Semantic Alignment
Input: An unlabeled target sentence $x_t$, source data $D_s$, a multilingual sentence encoder $\theta$, and the number of samples to extract, $k$.
Procedure: Compute $e_t = \theta(x_t)$ and $e_s^j = \theta(x_s^j)$ for all $x_s^j \in X_s$; return the $k$ source examples with the highest cosine similarity $\cos(e_t, e_s^j)$ as the prompt demonstrations.
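A sketch of Algorithm 1 using the sentence-transformers library is shown below. The specific checkpoint name is an assumption (the paper specifies only a multilingual sentence-transformer), and `top_k_similar` is a hypothetical helper name reused in later sketches.

```python
from sentence_transformers import SentenceTransformer, util

def top_k_similar(encoder, source_data, x_t, k):
    """Semantic alignment (Algorithm 1): return the k source-language
    (input, label) pairs whose inputs are most cosine-similar to the
    target-language test input x_t."""
    source_texts = [x for x, _ in source_data]
    src_emb = encoder.encode(source_texts, convert_to_tensor=True)
    tgt_emb = encoder.encode(x_t, convert_to_tensor=True)
    sims = util.cos_sim(tgt_emb, src_emb)[0]   # shape: (len(source_data),)
    top_idx = sims.topk(k).indices.tolist()
    return [source_data[i] for i in top_idx]

# The exact multilingual checkpoint is an assumption:
# encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
```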

Task-based Alignment
Despite the semantic coherence enforced within the prompt-context via the aforementioned method, the source and target label spaces, $Y_s$ and $Y_t$, remain superficially disconnected. For fine-tuning, techniques like meta-learning (Nooralahzadeh et al., 2020) and adapters (Parović et al., 2022) have been used to bridge this gap. For in-context prompting, in which the context matters the most, we propose to do so by adding a manually designed statement that gives the LLM task-specific information such as the target language and the target label space.
Task-based alignment is done by appending a manually designed statement, called the task aligner, to the context. This aligner is supposed to inform the LLM about the mapping from the source label space $Y_s$ to the target label space $Y_t$. We do task alignment by first manually creating $D_l = \{L_{s,t}\}$ for a given task and source-target language pair $s$ and $t$ as a collection of statements in the source language that emphasize what the target labels and language are. For example, when the source is English and the target is Spanish, "In Española bad means malo and good means bueno" is the task aligner conveying that the target language is Española (Spanish) and the target labels are malo and bueno (bad and good, respectively). Next, we construct the prompt-context by randomly selecting $k$ source-language examples, followed by the task aligner for this source-target pair from $D_l$ (see Algorithm 2). For examples of task-aligned prompt design, please refer to Tables 11 and 12 in Appendix D.

Algorithm 2: Task Alignment
Input: An unlabeled target sentence $x_t$, source dataset $D_s$, aligner $L_{s,t}$, and the number of samples to extract, $k$.
Procedure: Randomly select $k$ sentences from $D_s$; concatenate them into the context $C$; append the aligner $L_{s,t}$ followed by the test input $x_t$.
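The task-aligned prompt can be assembled as in the hedged sketch below, reusing the `build_random_context` helper from the earlier random-prompting sketch; the function name and newline separator are illustrative assumptions.

```python
def build_task_aligned_prompt(source_data, aligner, x_t, k, sep="\n"):
    """Task alignment (Algorithm 2): random source-language demonstrations,
    followed by the task aligner L_{s,t}, followed by the target test input.
    Example aligner: "In French bad means mal and good means bien." """
    context = build_random_context(source_data, k, sep)
    return context + aligner + sep + x_t
```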

X-InSTA
We finally move on to our proposed method, X-InSTA, which combines semantic alignment with task-based alignment. It first selects the source examples from $D_s$ with top-$k$ similarity scores, as described in Section 2.2. Additionally, we select the task aligner from $D_l$ depending on the source and target languages and the task. Finally, we construct the prompt-context by concatenating the selected examples followed by the task aligner. Label inference can then be described as

$$\hat{y}_t = \arg\max_{y \in Y_t} P_{\text{LLM}}(y \mid C \oplus L_{s,t} \oplus x_t)$$

where $C$ is the semantically aligned context and $L_{s,t} \in D_l$ is the task aligner for source and target languages $s$ and $t$ for the given task.
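Putting the pieces together, below is a hedged end-to-end sketch of X-InSTA inference with a HuggingFace causal LM. It reuses the hypothetical `top_k_similar` helper from the semantic-alignment sketch; scoring each label by its first subword token is a simplification (multi-token labels would need full sequence scoring), and the demonstration template is again an assumption rather than the paper's exact format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/xglm-7.5B")
llm = AutoModelForCausalLM.from_pretrained("facebook/xglm-7.5B")

def xinsta_predict(source_data, encoder, aligner, x_t, target_labels, k=4):
    # 1. Semantic alignment: the k most similar source demonstrations.
    demos = top_k_similar(encoder, source_data, x_t, k)
    context = "\n".join(f"{x} {y}" for x, y in demos)  # template is an assumption
    # 2. Task alignment: append the aligner L_{s,t}, then the test input.
    prompt = f"{context}\n{aligner}\n{x_t}"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = llm(input_ids=ids).logits[0, -1]  # next-token logits
    # 3. argmax over the target label space Y_t, scoring each label by its
    #    first subword token (a simplification for multi-token labels).
    scores = {
        y: next_logits[tok(" " + y, add_special_tokens=False).input_ids[0]].item()
        for y in target_labels
    }
    return max(scores, key=scores.get)
```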

Results and Analysis
We experiment on three datasets — the Multilingual Amazon Reviews Corpus (MARC) (Keung et al., 2020), Cross-language sentiment classification (CLS) (Prettenhofer and Stein, 2010), and HatEval (Basile et al., 2019) — spanning twelve language-task pairs and totalling 44 cross-lingual setups (refer to Appendix A for further description of the datasets). The results on MARC, CLS, and HatEval are shown in Tables 1, 2, and 3, respectively. For our main experiments, we make use of the XGLM (Lin et al., 2021) 7.5 billion parameter variant. We experiment with various models with random prompting and select XGLM 7.5B for its performance superiority on various tasks (refer to Table 8 in Appendix B). For further details on the experimental setup, please refer to Appendix C, and see Table 10 for the language abbreviations used.

Comparing Alignment Techniques
Semantic Alignment: The improvement introduced by semantic alignment of the prompt-context over randomly selected source examples is evident in Tables 1, 2, and 3. On the MARC dataset, we see a 14% improvement in macro-F1 scores averaged across different languages. This observation is consistent across target-source pairs on the other datasets as well — a gain of 10% on HatEval and 6% on CLS. This improvement over random example selection is consistent across all language pairs (except English-to-German in CLS) considered in this experiment. This is particularly noteworthy and might lead to the conclusion that dynamically selecting prompt examples based on semantic similarity aligns the LLM to become a better in-context learner irrespective of the task and the languages.

Task-based Alignment: Just by adding a task aligner, we not only outperform random prompts but also bring substantial improvements over similarity prompting, even though the aligner does not vary dynamically with the input sentences. The improvement is 18% on CLS, 8% on HatEval, and 15% on MARC, in terms of macro-F1 scores averaged over different language pairs. However, some languages, like German in MARC and English in HatEval, produce near-random predictions in all the setups we experimented with. This might be due to the model's inability to perform ICL on these tasks in a cross-lingual manner for these languages. Previous studies observed such phenomena in monolingual ICL (Webson and Pavlick, 2022; Lin et al., 2021); cross-lingual ICL has added nuances that make it even more difficult.
We also see a performance drop in the case of Mandarin on MARC (Table 1) when adding a task aligner. We investigate this performance drop, along with the near-random results on German, further below.
X-InSTA: This prompting mechanism inherits the benefits of both semantic and task-based prompting, hence giving the best results for most language pairs. But, similar to task-based alignment, X-InSTA also performs poorly on some target languages. The improvement is 23% on MARC, 22% on CLS, and 14% on HatEval. We also note that no single language emerges as the best source language.

Why does Task Alignment Work?
Next, we seek to validate the performance boost achieved via task-based aligners, along with an attempt to explain the drop in performance with Mandarin and German. We vary the task aligner and note its effect on the output. We do so in five different variations along with the original method (see Table 11 in Appendix D for detailed examples of each scenario):

1. No aligner prompt added: Same as random prompting.

2. Making the label space uniform: Across all source-target setups, we use the source-language labels as the output for the target as well, thereby attempting to remove the need for task alignment.

3. Only language information: Only giving the language information to the LLM, without providing any further label information. An example of such an aligner would be 'The following post is in French language', in a case when the source is English and the target is French.

4. Providing an aligner of a third, unrelated language: We use the aligner of a third language. For example, 'In Spanish bad means malo and good means bueno.', in a case when the source is English and the target is French.

5. Incorrect aligner: Making the aligner incorrect with respect to the label space. For example, 'In French bad means bien and good means mal.', in a case when the source is English and the target is French.

It's all about the label information: In Table 4, we note the importance of label space information.
Providing the model with language information does improve the performance; however, the improvement is minuscule compared to that achieved via task aligners. The label information, even when it belongs to an unrelated third language, still helps the model predict better. This might be because the model looks more rigorously at the label space for inference. This showcases the importance of label information when going cross-lingual.
Why the drop in some languages? It is noteworthy that in Table 4, the task aligner works best for all target languages except German and Mandarin. Both of these languages give the best results with a uniform label space, i.e., when $y_t$ is made the same as $y_s$. This points to the inability of the LLM to align the label spaces of different source languages to these target languages. In making the label space uniform, we lose certain language-specific signals, but this may also be seen as a way of reducing task alignment. Only for German and Mandarin do we see this trade-off as beneficial; in all other cases, the loss of language-specific features of $y_t$ leads to a drop in performance.

Role of Semantic Alignment
To understand the role of semantic alignment, we run an experiment in which, instead of choosing the $k$ nearest neighbors of $x_t$, we choose the most dissimilar sentences. Table 5 shows a sharp decrease in performance compared to random prompting for all languages, with German as an exception. The average drop is 8%, whereas using semantic alignment gives a gain of 10% w.r.t. random prompting.

Automated Aligner Generation
We also expand our analysis to automatically generating the aligner using mT5 (Xue et al., 2021). mT5 is trained on a span generation task, using sentences like 'Paris <MASK> France', where it learns to fill the mask token by generating spans like 'is capital of'. In our usage, mT5 fills the <MASK> between the target test input $x_t$ and the prompt-context $C$ in the source language in such a way as to align the semantics of both. We summarize our procedure for automated aligner generation in Algorithm 3.

Table 6: Comparing the performance of automated aligners generated by mT5 with the rest of the methods in terms of macro-F1. We use English as the source language for all three tasks in this experiment.

Algorithm 3: Automated Aligner Generation
Input: An unlabeled target sentence $x_t$, source dataset $D_s$, multilingual T5 model $mT5$, multilingual LLM $M$, and the number of samples to extract, $k$.
Procedure: Randomly select $k$ sentences from $D_s$; let $mT5$ generate the span between the context and $x_t$; append the generated aligner and $x_t$ to the context before querying $M$.

Due to the computational cost of generating the intermediate prompt for each source-target input pair, we experiment with English as the only source language on all three datasets. Table 6 summarizes the results of using an automated aligner. We note that the automated aligner leads to better results than random prompting and delivers results competitive with semantic prompting in some languages. However, it fails to incorporate any task-specific signals, therefore failing to beat task-based alignment. One can note the limitations of this approach in terms of the different pretraining distributions of the in-context learner and the aligner generator (XGLM and mT5, respectively, in this scenario). The hypothesized role of the aligner was to construct a 'natural' transition from the source context to the target input for a particular task. Since mT5 generates these aligners independently, without any access to the pretraining distribution of XGLM, the disparity manifests in sub-optimal results.
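A sketch of the span-filling step with HuggingFace mT5 is shown below. The checkpoint size and decoding settings are assumptions (the paper does not specify them), and T5's sentinel token <extra_id_0> plays the role of <MASK>.

```python
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("google/mt5-base")   # size is an assumption
mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

def generate_aligner(context, x_t, max_new_tokens=20):
    """Let mT5 fill the span between the source-language context and the
    target-language test input, yielding a candidate 'natural' transition."""
    text = f"{context} <extra_id_0> {x_t}"
    ids = tok(text, return_tensors="pt").input_ids
    out = mt5.generate(ids, max_new_tokens=max_new_tokens)
    # Decoding with skip_special_tokens drops the sentinel tokens,
    # leaving the generated span.
    return tok.decode(out[0], skip_special_tokens=True)
```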

Error Analysis
We present four examples in Table 7, highlighting the four major errors we notice while using X-InSTA, stemming from the following factors:

1. Static task aligner: In example #1, slurs are used in all the posts. In the context examples, they are used as hate speech, whereas in the target, the slur is not directed at any individual and thereby should not be identified as hate speech. However, the model labels it otherwise. Here, the apparent semantic similarity misdirects the model, and the static nature of the task aligners is not able to guide it to understand the nuances of the task.

2. Cultural differences: Neither of the alignment methods introduces common knowledge or cultural knowledge into the prompt. To classify the tweet in example #2, one must have a grasp of hate focused on migration.

3. Input length: Both the context prompt and the input sentence are simply too long in example #3. In this case, no matter how well we design the aligner, we cannot fit it within the maximum input length of 1,024 tokens. One cannot keep increasing the maximum length to accommodate this pitfall, as that would lead to higher computation costs. A possible solution may be found in the direction of Transformer architectures suitable for longer input sequences.

4. Lack of human-like commonsense: In example #4, aligning the semantics and the task constructed a good prompt, but the model predicted wrongly, confused by the sarcasm in the first demonstration. To address this pitfall, we need to bring more knowledge of humor or commonsense into the model to make it understand what is obvious to us.
It should be noted that the majority of these errors stem from the limitations of the LLM itself. Advancements in language model design may lead to improvements in future models.

Table 7: Error analysis of X-InSTA. Four examples represent the major error characteristics (discussed in Section 3.5). We omit most of the text in the test input of the 3rd example as it was too long.

Related Work
In-context learning: Brown et al. (2020) introduced in-context few-shot learning using the GPT-3 model. Subsequent efforts have been made to enhance the effectiveness of in-context learning. Hendrycks et al. (2020) evaluated the breadth and depth of model understanding to determine its weaknesses and strengths. Techniques such as selecting semantically similar examples, using differentiable soft prompts for backpropagation, and adjusting prompts to eliminate bias in predictions have been implemented to optimize the input prompt (Liu et al., 2022; Zhang et al., 2021; Zhao et al., 2021). These efforts have primarily been directed toward improving the performance of in-context learning in a monolingual setting.
Multiple recent studies have sought to explain the emergence of ICL by assigning different roles to the LLM. Xie et al. (2022) framed ICL as the LLM performing Bayesian inference conditioned upon the prompt-context to predict the test label. Our work is much in line with this hypothetical model, since alignment of the semantics and the task-based signals across languages is motivated by the quest for better alignment between the prompt and the pretraining distribution, and by warranting a shared, distinguishable concept, as Xie et al. (2022) argued. Additionally, von Oswald et al. (2022) identified LLMs as meta-optimizers performing gradient descent while learning in context. Li et al. (2023) described ICL as implicit model selection.
Multilingual models: Recent research on multilingual tasks has focused on creating multilingual versions of popular pretrained language models. These include mBERT (Devlin et al., 2018), mBART (Liu et al., 2020), XLM-R (Conneau et al., 2020), and mT5 (Xue et al., 2021), which are derived from models like BERT (Devlin et al., 2018), BART (Lewis et al., 2020), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2019). However, fine-tuning these large models for each task is infeasible due to computational limitations. While in-context learning has been attempted for cross-lingual downstream tasks, these methods only involve random sampling of demonstrations for prompt construction (Zhang et al., 2021; Winata et al., 2021). Shi et al. (2022) addressed the problem of cross-lingual text-to-SQL conversion using ICL. However, their method relies on translating the input text in the source language to the target language before generating the corresponding SQL code. To the best of our knowledge, there is no prior study on optimizing prompts for cross-lingual ICL.

Conclusion
In this work, we described the first attempt at cross-lingual prompt design for in-context learning. We found that a random selection of labeled training examples to construct the prompt-context limits the capability of a multilingual LLM to infer target labels. Instead, aligning the semantics as well as the task-specific textual signals across the source- and target-language inputs in the prompt demonstrates superior performance in cross-lingual text classification. Based on these findings, we introduced X-InSTA, a novel method of in-context prompt design for cross-lingual text classification. X-InSTA improves upon random prompt selection substantially across multiple cross-lingual tasks.
We found that the dynamic nature of similarity-based example selection guides the LLM to learn better in-context predictors irrespective of the language pair under consideration. On the other hand, language pairs with proper alignment in the label space benefit more from task-based alignment. These findings may serve as stepping stones toward better cross-lingual ICL methods that incorporate an automated, dynamic transition from the source to the target distribution.

Limitations
Since this work relies on the in-context learning ability of large language models, it inherits the computational challenges associated with loading an LLM. Due to resource constraints, we could not use larger or commercially available LLMs to validate whether the advantages of X-InSTA translate to those models as well.
As we observed in Section 3.5, the static nature of the aligners poses a limitation on X-InSTA. Moreover, these aligners are manually designed; therefore, task-specific, trial-and-error-style manual intervention is needed. We believe a better understanding of the pretraining distributions of multilingual LLMs can pave the way toward better automated alignment methods.
Multiple shortcomings of monolingual ICL carry over to its cross-lingual counterpart, and X-InSTA does not address them: issues like knowledge hallucination, limited commonsense reasoning, inconsistency in retrieving factual associations, etc.

Ethics statement
Our proposed method, X-InSTA, delivers improvements in cross-lingual in-context learning. Since in-context learning ability is emergent in language models over a billion parameters in size, this can cause potential discrimination in the usage of these methods based on access to computational resources. Research groups with limited access to computational resources will be handicapped, while resourceful groups will be able to investigate and advance the future directions of this research.
We did not use any private or sensitive information throughout this research. However, if any private information was leaked to an LLM during the pretraining stage, X-InSTA does not provide any privacy filtration. Therefore, privacy concerns of the underlying model can potentially manifest with the outputs provided by X-InSTA.
As we dissected the erroneous predictions in Section 3.5, the lack of knowledge of cultural differences among different languages is a serious challenge within the LLM, and this limits the performance of X-InSTA. Therefore, any potential deployment of our proposed method should be done under the lens of such considerations. This is even more delicate in the case of tasks like hate-speech classification, which was one of the tasks explored in this work. Wrongfully identifying hate speech as non-hate, or vice versa, in a low-resource target language based on culturally different language-usage cues present in the prompt-context in a high-resource language is a possibility; this may lead to unwarranted cultural appropriation and/or undemocratic gatekeeping.

B Model Variants
We experiment with multiple LMs in their base versions (i.e., with random prompting) to gauge their ability, namely XGLM 7.5B, XGLM 1.7B, and BLOOM 7.1B. Table 8 contains the performance of these models on a subset of the test data (namely, CLS and HatEval with English as the source language). As we can see, XGLM 7.5B outperforms the other models by a significant margin on multiple tasks and is hence used for the rest of the experiments.

C Hyperparameters
All code is written in PyTorch. We used the Huggingface library for loading the LLM and the sentence transformer for extracting semantic similarity. Sklearn was used for calculating the F1 score. Table 9 describes the values of different hyperparameters and the compute resources used.

D.1 Language Code
Refer to Table 10 for this information.

D.2 Prompt Examples
We show a few example prompts (demonstrations and test inputs) in Table 11. Additionally, Table 12 shows examples of the different types of task aligner.

Table 12: Examples of different types of task aligner. Blue text marks the task aligner. Since only the aligner varies while the demonstrations of the context prompt stay the same, the demonstrations are shortened. In the examples, English serves as the source language and Spanish as the target; hence $Y_t$ is {malo, bueno} and $Y_s$ is {bad, good}. In the second row, the labels are colored red to highlight that we have made $Y_t$ the same as $Y_s$, i.e., the input example is labeled using the label space {bad, good}, making the label space uniform. In the fourth row, the aligner of a third, unrelated language (French) is given.