Crosslingual Retrieval Augmented In-context Learning for Bangla

The promise of Large Language Models (LLMs) in Natural Language Processing has often been overshadowed by their limited performance in low-resource languages such as Bangla. To address this, our paper presents a pioneering approach that utilizes cross-lingual retrieval augmented in-context learning. By strategically sourcing semantically similar prompts from high-resource languages, we enable multilingual pretrained language models (MPLMs), especially the generative model BLOOMZ, to successfully boost performance on Bangla tasks. Our extensive evaluation highlights that cross-lingual retrieval augmented prompts bring steady improvements to MPLMs over their zero-shot performance.


Introduction
In recent years, the field of Natural Language Processing (NLP) has witnessed transformative advancements, especially with the advent of deep transformer techniques (Vaswani et al., 2017; Devlin et al., 2019; Radford et al., 2019). The introduction of Large Language Models (LLMs), such as GPT-3 (Brown et al., 2020b) and GPT-4 (OpenAI, 2023), has further revolutionized the landscape. These models showcase unparalleled prowess in tasks like text classification and generation, unified under the umbrella of in-context learning, and cater to a plethora of applications across diverse languages (Conneau et al., 2020; Raffel et al., 2020; Radford et al., 2019). While comprehensive benchmarks like XTREME (Hu et al., 2020) and BUFFET (Asai et al., 2023) underscore their capabilities, languages such as English remain the primary beneficiaries. In stark contrast, several low-resource languages, Bangla being a prime example, grapple with challenges, notably the scarcity of pretraining corpora (Artetxe and Schwenk, 2019; Hangya et al., 2022; Sazzed, 2020). Despite having a significant number of native speakers, Bangla remains underrepresented in the NLP arena due to linguistic intricacies, limited labeled datasets, and prevalent issues like data duplication (Das and Bandyopadhyay, 2010; Das and Gambäck, 2014). Although there have been commendable strides using conventional machine learning techniques in Bangla NLP tasks, the untapped potential of the latest LLMs is evident (Bhowmick and Jana, 2021; Wahid et al., 2019; Hoq et al., 2021).
In the evolving landscape of in-context learning with LLMs, the concept of retrieval augmentation, which emphasizes sourcing semantically rich prompts, has gained traction (Shi et al., 2023). However, when it comes to multilingual in-context learning, previous works like MEGA (Ahuja et al., 2023) often limit their scope to task instructions and lack deeper semantic insights due to their approach of random prompt selection. In contrast, strategies like PARC (Nie et al., 2023) pave the way for a more comprehensive methodology, fetching semantically aligned prompts from high-resource languages.
Our work draws inspiration from these methodologies but introduces novel perspectives. While MEGA offers task-level instructions, we infuse semantic understanding into our approach. Similar to PARC, our approach is cross-lingual, ensuring a broader application spectrum. Diverging from PARC's focus on masked language models like mBERT and XLM-R, as shown in Figure 1, we venture into uncharted territories by employing larger, decoder-only multilingual pretrained language models (MPLMs) -- BLOOM and BLOOMZ -- to tackle Bangla NLP tasks in a generative style (Muennighoff et al., 2023; Scao et al., 2022).
In this paper, we explore the application of cross-lingual retrieval augmented in-context learning for Bangla text classification and summarization tasks. Our main contributions encompass:
• An extensive evaluation of cross-lingual retrieval augmented in-context learning methods in Bangla, achieving steady improvements over the zero-shot performance of MPLMs.
• A pioneering exploration to extend PARC to the generative models, BLOOM and BLOOMZ, providing insights for a unified pipeline of cross-lingual retrieval augmented in-context learning.

Related Work
Bangla Natural Language Processing Bangla is a morphologically rich language with various dialects that belongs to the Indo-Aryan branch of the Indo-European language family. With roughly 270 million speakers concentrated in Bangladesh and some regions of India, Bangla is ranked as the 7th most widely spoken language in the world. However, Bangla is still considered a low-resource language in NLP research due to the scarcity of digital text resources and annotated corpora.
Research on Bangla NLP has covered a variety of common NLP subfields since the 1990s, such as POS tagging (Dandapat et al., 2004; Ekbal and Bandyopadhyay, 2008b), stemming and lemmatization (Islam et al., 2007; Paik and Parui, 2008), named entity recognition (Ekbal and Bandyopadhyay, 2007, 2008a), sentiment analysis (Das and Bandyopadhyay, 2010; Wahid et al., 2019), news categorization (Mansur, 2006; Mandal and Sen, 2014), etc. However, research in many areas of Bangla NLP still remains sparse. In the era of deep learning, further progress has been made in Bangla NLP, particularly in terms of the development of datasets (Rahman and Kumar Dey, 2018; Islam et al., 2021, 2023) and models (Tripto and Ali, 2018; Ashik et al., 2019; Karim et al., 2020). Pretrained language models have achieved decent performance in a large variety of NLP downstream tasks through finetuning. Against this background, Bhattacharjee et al. (2022) pretrained BanglaBERT, a BERT-based language understanding model pretrained on Bangla corpora. With the advent of large language models (LLMs), zero- and few-shot prompting methods have gradually gained prominence. Hasan et al. (2023) compared the zero- and few-shot prompting performance of LLMs with finetuned models for the Bangla sentiment analysis task. Our work explores the application of the retrieval-augmented prompting method to Bangla violence detection and sentiment analysis tasks.
Multilingual In-context Learning Brown et al. (2020a) demonstrated that LLMs like GPT-3 can acquire task-solving abilities by incorporating input-output pairs as context. The in-context learning approach involves concatenating the input with randomly selected examples from the training dataset, which is also called the prompting method. Recent research (Gao et al., 2021; Liu et al., 2022, 2023; Shi et al., 2023) has expanded on this idea by enhancing prompts for pretrained models through the inclusion of semantically similar examples. The effectiveness of prompting methods for English models extends to multilingual models in cross-lingual transfer learning as well. Zhao and Schütze (2021) and Huang et al. (2022) investigated prompt-based learning with multilingual PLMs. Nie et al. (2023) augmented the prompt with cross-lingual retrieval samples for multilingual understanding and proposed the PARC pipeline. Tanwar et al. (2023) augmented the prompt with not only cross-lingual semantic information but also additional task information. However, previous studies mainly concentrated on multilingual encoder or encoder-decoder models, while our work extends the PARC pipeline to multilingual LLMs.

Multilingual LLMs In the era of LLMs, BLOOMZ and mT0 (Muennighoff et al., 2023) are two representative newly emerging multilingual models. These two multilingual LLMs are finetuned on xP3, a multilingual multitask finetuning dataset, and are based on the pretrained models BLOOM (Scao et al., 2022) and mT5 (Xue et al., 2021), respectively. Six different sizes of BLOOMZ models are released, from 560M to 176B parameters, and five different sizes of mT0 models are released, from 300M to 13B. These multilingual LLMs open up the possibility of conducting few- and zero-shot cross-lingual in-context learning, as demonstrated by recent benchmarking efforts such as MEGA (Ahuja et al., 2023) and BUFFET (Asai et al., 2023).

Methodology
Our research extends the work of Nie et al. (2023) by focusing on improving multilingual pre-trained language models (MPLMs) for low-resource languages in a zero-shot setting, specifically using retrieved content from high-resource languages such as English.
The backbone of our research approach is a two-stage pipeline consisting of a cross-lingual retriever and a prompt engineering process, as shown in Figure 2. This pipeline aims to build on the strengths of MPLMs while mitigating their limitations, especially when dealing with low-resource languages.
The first stage of the pipeline uses a cross-lingual retriever that maps the input Bangla text q to a vector q_embed in a shared embedding space and uses it as a query. Based on semantic similarity with q_embed, the retriever returns the k most similar examples from the high-resource language corpus, either with or without their labels:

R = top-k_{i ∈ {1, ..., |d|}} sim(q_embed, d_embed_i)

where d_i denotes each document in the high-resource language corpus and |d| is the number of documents. If no label is available, a self-prediction step is used to assign one.
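The retrieval stage can be sketched in a few lines. The sketch below assumes sentence embeddings have already been computed (in practice by a multilingual sentence transformer); the function and field names are illustrative, not taken from an actual implementation:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_top_k(q_embed, hrl_pool, k=3):
    """Return the k HRL examples most similar to the query embedding.

    hrl_pool: list of dicts with keys 'embed', 'text', and 'label'
    ('label' is None in the unlabeled setting, triggering self-prediction).
    """
    ranked = sorted(hrl_pool,
                    key=lambda d: cosine_similarity(q_embed, d["embed"]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d embeddings standing in for 768-d sentence-transformer vectors.
pool = [
    {"embed": [1.0, 0.0], "text": "a", "label": "positive"},
    {"embed": [0.0, 1.0], "text": "b", "label": "negative"},
    {"embed": [0.9, 0.1], "text": "c", "label": "positive"},
]
top = retrieve_top_k([1.0, 0.05], pool, k=2)  # → examples "a" and "c"
```

In the real pipeline, q_embed and the pool embeddings would come from the same multilingual encoder, so Bangla queries and English candidates live in one shared space.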
The second stage of the pipeline is prompt engineering. The input Bangla text and the retrieved pattern are subjected to this process. A prefix prompt template P is used to reformulate the input to facilitate the model's prediction y:

y = MPLM(P(q, R))

The answer is then obtained according to the architecture of the chosen MPLM: decoder-only models generate the answer directly, while for encoder models the answer is obtained by first mapping each label to a predefined word using the verbalizer and then deducing the label word through mask token prediction.
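The prompt construction amounts to prepending the retrieved demonstrations, each followed by its label, to the templated query. A minimal sketch (the template wording here is illustrative, not the exact template used in the experiments):

```python
def build_prompt(query, retrieved, template):
    """Prepend retrieved (text, label) demonstrations to the templated query.

    template: function mapping a text to its task-instruction string.
    """
    parts = []
    for text, label in retrieved:
        parts.append(template(text) + " " + label)  # demonstration with answer
    parts.append(template(query))                   # query left unanswered
    return "\n".join(parts)

# Hypothetical sentiment template; an English demonstration precedes
# the Bangla query in the final prompt.
sentiment_template = lambda t: f"Text: {t} What is a possible sentiment for the text?"
prompt = build_prompt(
    "বাংলা ইনপুট",                            # Bangla query
    [("The movie was great.", "positive")],  # retrieved English example
    sentiment_template,
)
```

The decoder-only model then continues the prompt, producing the label for the final, unanswered query.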
By integrating cross-lingual content retrieval with prompt-guided prediction, we aim to improve the ability of MPLMs to handle low-resource languages.This synergy not only extracts rich linguistic insights from high-resource languages, but also uses them to improve performance on low-resource language tasks.

Experiments
In this study, we focus on classification and summarization tasks. We refer to our research approach, which uses k retrieved samples for cross-lingual augmented in-context learning, as the main method in the following sections.

Baselines
Zero-shot The template, when populated with the input sample, is fed directly into the MPLM for prediction.This process bypasses the use of cross-lingual context.

Lead64
The first 64 tokens of the input text are taken as the summary (for summarization tasks only).
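Under a simple whitespace notion of "token" (the experiments presumably use the model tokenizer, so this is only a sketch), the baseline reduces to:

```python
def lead_k(text, k=64):
    """Extractive baseline: the first k whitespace tokens as the summary."""
    return " ".join(text.split()[:k])

short = lead_k("one two three", k=64)  # texts shorter than k are returned whole
```

Despite its simplicity, this kind of lead baseline is a standard reference point in news summarization, since articles tend to front-load their key information.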

Classification
Vio-Lens The Vio-Lens dataset (Saha et al., 2023) contains YouTube comments related to violent incidents in the Bengal region, with the goal of highlighting potential threats that could incite further violence. The prompt templates for both the main method and the zero-shot baseline are defined as follows:
• BLOOMZ-3b and BLOOM-3b: Reflecting on the statement "{text}", which aggressive level does it resonate with: non-aggressive, slightly aggressive, or highly aggressive?
with the corresponding verbalizer.

We use ETHOS (onlinE haTe speecH detectiON dataSet) (Mollas et al., 2020) as the sentence pool in our experiments. This repository provides a dataset designed to identify hate speech on social media. We use the binary variant of the dataset, which contains 998 comments, each labeled for the presence or absence of hate speech. Since its labels are inconsistent with the Vio-Lens categories, we use the self-prediction method to predict the labels.
SentNoB Designed to capture the sentiment within text, SentNoB classifies content as positive, negative, or neutral (Islam et al., 2021). The prompt templates for both the main method and the zero-shot baseline are defined as follows:
• BLOOMZ-3b and BLOOM-3b: Text: {text} What is a possible sentiment for the text given the following options?
• mBERT: {text} Sentiment: [MASK]
with the corresponding verbalizer.

The English Sentiment Analysis dataset (Rosenthal et al., 2017), which consists of tweets annotated for sentiment on 2-, 3-, and 5-point scales with the labels positive, negative, and neutral, serves as the HRL corpus in our study. We use the labeled training set as our experimental sentence pool.

Summarization
XL-Sum is a large and varied dataset consisting of 1.35 million pairs of articles and their corresponding summaries (Hasan et al., 2021). These pairs were professionally annotated by the BBC and extracted through a series of carefully designed heuristics. The dataset covers 45 languages, from low- to high-resource, many of which previously had no publicly available summarization datasets. The prompt template is defined for all models as follows:
• Main method: {text} Generate a concise summary of the above text using the same language as the original text ({target_lang}):
• Zero-shot baseline: {text} Generate a concise summary of the given text:

Models
BLOOM is an autoregressive Large Language Model trained on a diverse corpus to generate text based on prompts (Scao et al., 2022). It is capable of generating coherent text in 46 languages.
BLOOMZ is derived from BLOOM through multitask instruction finetuning on xP3, a multilingual multitask dataset (Muennighoff et al., 2023). This finetuning enables the model to follow prompts across languages, allowing high-resource languages to benefit low-resource ones and effectively bridging the gap between languages with different levels of available resources.
mBERT is an early MPLM that extends the original BERT model (Devlin et al., 2018). It is pretrained on a corpus of 104 languages, using a shared WordPiece vocabulary and a unified architecture for all languages.
mT5, or Multilingual T5 (Xue et al., 2021), is an extension of the T5 (Text-to-Text Transfer Transformer) model (Raffel et al., 2020) designed specifically for multilingual capabilities. Pretrained on mC4, a large multilingual dataset, mT5 transforms input text sequences into output sequences.
Cross-Lingual Retriever We follow Nie et al. (2023) in using the multilingual sentence transformer "paraphrase-multilingual-mpnet-base-v2" (Reimers and Gurevych, 2019). This model maps sentences and paragraphs into a 768-dimensional dense vector space, which facilitates tasks such as clustering and semantic search. In our experiments, the number of retrieved samples k is 1 or 3 for the classification tasks and 1 for the summarization task.

Results of classification tasks
Table 1 provides an overview of the classification results. With k = 3 retrieval augmented English prompts, we enhance the F1 scores of Bloomz-3b on the two tasks by 5% and 10%, respectively. Bloom-3b, which lacks the instruction tuning of Bloomz-3b, cannot generate any meaningful result, suggesting that instruction tuning has a strong impact on retrieval augmented in-context learning. The traditional masked language model, mBERT, also gains improvements of 8% and 7%. To facilitate a comprehensive understanding of the performance and discrepancies associated with each task, we analyze the confusion matrices. Given the confusion matrix in Table 2, we find that: 1) across micro, macro, and weighted F1 scores, both Bloomz-3b and mBERT benefit from the retrieval prompts; 2) comparing the two models, Bloomz-3b's zero-shot setting tends to misclassify "non-violence" and "Neutral" and shows a reduced macro F1 compared to its weighted F1, while mBERT has a more balanced distribution of confusion between "non-violence" ("Neutral") and the other classes. This may indicate that, for classification tasks, text generation struggles more with minority classes than masked prediction does.
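The gap between macro and weighted F1 follows directly from their definitions: macro F1 averages per-class F1 uniformly, so a poorly handled minority class counts as much as the majority class, while weighted F1 averages by class support. A small self-contained illustration on toy labels (not the Vio-Lens data):

```python
from collections import Counter

def per_class_f1(y_true, y_pred, cls):
    """F1 for one class from true/predicted label lists."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_weighted_f1(y_true, y_pred):
    """Macro F1 (uniform class average) and weighted F1 (support-weighted)."""
    classes = sorted(set(y_true))
    support = Counter(y_true)
    f1s = {c: per_class_f1(y_true, y_pred, c) for c in classes}
    macro = sum(f1s.values()) / len(classes)
    weighted = sum(f1s[c] * support[c] for c in classes) / len(y_true)
    return macro, weighted

# A well-predicted majority class and a half-missed minority class
# drag macro F1 below weighted F1, as observed for Bloomz-3b.
y_true = ["neutral"] * 8 + ["violence"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "violence"]
macro, weighted = macro_weighted_f1(y_true, y_pred)  # macro < weighted
```

In practice one would use a library implementation (e.g. scikit-learn's f1_score with average="macro" or "weighted"); the hand-rolled version above just makes the averaging difference explicit.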

Results of summarisation task
Table 3 compares several models and methods on the summarization task.
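As a refresher on the metric, ROUGE-N measures n-gram overlap between a candidate and a reference summary. A minimal ROUGE-1/ROUGE-N F1 sketch over whitespace tokens (real evaluations typically use a ROUGE package with proper tokenization and, for English, stemming):

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F1 with clipped n-gram counts and whitespace tokenization."""
    def ngrams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # Counter & Counter clips to min counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# All candidate unigrams appear in the reference (precision 1.0),
# but only half of the reference is covered (recall 0.5).
score = rouge_n_f1("the cat sat", "the cat sat on the mat")  # → 2/3
```

R-1 and R-2 in Table 3 correspond to n=1 and n=2; R-L, based on the longest common subsequence, is computed differently and is not sketched here.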

LEAD-64
As an extractive method, it performs well across all metrics. This indicates that in many cases the first few sentences or tokens of a document provide a fairly informative summary. As expected, LEAD-64 outperforms the mt5-base model in the zero-shot setting, but is outperformed by the Bloomz models in the same scenario.
Zero-Shot Models mt5-base produces the lowest scores across all metrics, suggesting that it struggles to produce satisfactory summaries without domain-specific fine-tuning or data augmentation. Both bloomz-1b1 and bloomz-3b show significantly better performance, with bloomz-3b having a slight edge over bloomz-1b1, especially in bigram capture (R-2).
Retrieval augmentation with k=1 Retrieval augmentation drastically affects the performance of mt5-base, reducing its score considerably. This could be due to noise introduced by the retrieved sample or ineffective use of the additional information. For the Bloomz models, bloomz-1b1 still retains decent performance, although there's a drop compared to its zero-shot performance. Surprisingly, bloomz-3b shows a sharper drop, suggesting that the additional retrieval data may be more of a distraction than an advantage for this model configuration in the summarization task.

Analysis and Discussion
When examining the performance of different models on different tasks, several key observations emerge that are related to linguistic nuances, the underlying language models, and resource allocation.
For classification tasks, it's clear that models with a strong grasp of complex sentence structure and deeper semantics, such as Bloomz-3b, are more adept at distinguishing nuanced categories like "passive violence" or the more ambiguous "neutral" sentiment. This aptitude likely stems from their ability to understand context better than their simpler counterparts. In parallel, the critical role of zero-shot learning becomes apparent. The ability of a model to generalize to a task without specific fine-tuning speaks volumes about its robustness. For example, in our studies, models such as Bloomz-3b showed commendable performance in a zero-shot setting. Furthermore, as we varied k (the number of retrieved samples), it was instructive to see that a larger value didn't always translate into better performance. This underscores the nuanced ability of a model to sift through information and potentially eliminate noise.
Turning to the summarization task, coherence and relevance seem to be the pillars of excellence. Advanced models are more adept at weaving sentences that are not only structurally coherent but also rich in information. This finesse is evident in the superior Rouge scores of the models. The dichotomy between generative and extractive approaches is also evident. While generative models, including mt5-base and Bloomz-1b1, outperformed the extractive model (LEAD-64) in a zero-shot framework, they seemed somewhat sensitive when retrieval augmentation came into play.

Figure 3: Model performance differences between zero-shot (represented as '0' on the y-axis) and the main method with k=1 and k=3 demonstrations on the Vio-Lens test set using bloomz-3b (left) and mbert (right). The y-axis shows the deviations of the main method from the zero-shot values. The statistics are based on 8 and 6 templates, shown in Appendix Table 5 and Table 6, respectively.
Finally, when it comes to resource distribution, there's an undeniable correlation between performance and computational resources. The stellar performance of models like Bloomz-3b likely comes at the cost of intense computational demands. However, one must consider the cost-benefit ratio. In addition, the drop in performance of these models with retrieval augmentation at k=1 suggests a potential sensitivity to the balance or diversity of the dataset.
For the summarization task, an interesting observation is that more extensive models don't always outperform on all metrics, suggesting that we need to be more discriminating in our resource allocation.The significant performance drop with retrieval augmentation further supports this argument.
To conclude this analysis, while modern language models are capable of handling complex tasks, they require careful configuration and thoughtful resource distribution.Unraveling the complexity of these models can pave the way for optimized solutions in both classification and summarization.

The Stability across Templates
In our experiment on Vio-Lens, we compared the performance of Bloomz-3b and mBERT in terms of their ability to classify text samples into categories. To assess the effectiveness of the retrieval augmented prompting method against the zero-shot baseline, we conduct a statistical analysis across different templates.
For Bloomz-3b and mBERT, we test different prompt templates and create a boxplot (Figure 3) to visualize the difference in F1 scores between our main method and the zero-shot baseline across templates. With retrieval augmented English prompts under different templates, both models achieve a stable improvement over the Bangla zero-shot baseline. It's also clear that mBERT, on average, shows greater improvements in F1 scores when transitioning from the zero-shot baseline to retrieval augmented prompting, compared to Bloomz-3b.

Impact of Bangla and Hindi Prompt Template
Instead of English, we further explore using Bangla itself and its linguistically similar high-resource language Hindi as the language of the prompt template, as shown in Table 4.

Main method with English prompt: This configuration yields the highest macro average F1 score of all three prompt templates.
Hindi Prompt Template: While the Hindi prompt template leads to significant improvements in precision and recall for individual categories such as "Neutral", the macro average F1 score is still lower than that of the main method with the English prompt.
Bangla prompt template: The Bangla prompt template, while showing some improvements in precision for specific categories such as "positive", experiences a decrease in recall and overall accuracy. As a result, its macro average F1 score is the lowest of the three templates. This means that while the Bangla prompt template may improve performance for specific categories, it has an overall negative impact on the model's ability to generalize across all categories on the SentNoB test set. Conversely, the Hindi prompt template's improvements in precision and recall for individual categories don't translate into a higher macro average F1 score compared to the main method with the English prompt.
In summary, the macro average F1 score results show that the main method with the English prompt template remains the most effective overall. However, the choice of prompt template can significantly affect performance for specific categories, as demonstrated by the Hindi and Bangla templates. This nuanced understanding underscores the need to balance category-specific and overall performance when selecting prompt templates in cross-lingual retrieval augmentation.

Impact of Hindi sentence pool
Comparing the results in Table 7 with the previous experiments, we observe that the Hindi retrieval dataset generally improves the model's ability to retrieve "Neutral" content in the mBERT model. However, the model continues to struggle with the "Neutral" category, with low recall and F1 scores, regardless of the sentence pool used. This suggests that further refinements may be needed to improve retrieval accuracy for neutral sentiment sentences. The studies with Hindi retrieval data show that neither bloomz-3b nor mbert improves over the main method with the English prompt template. This suggests that while alternative retrieval datasets can improve performance for specific sentiment categories, the choice of retrieval data needs to be carefully considered to maximize overall performance across categories in cross-lingual sentiment analysis tasks.

Conclusion
In this paper, we have introduced a novel approach to address the challenges of applying Large Language Models to low-resource languages, with a focus on Bangla. Our methodology employs cross-lingual retrieval-augmented in-context learning, thereby enriching the capabilities of MPLMs, specifically BLOOM and BLOOMZ. We have extensively tested our approach on two classification tasks and one summarization task.
Our experimental results demonstrate the effectiveness of our approach in achieving superior F1 scores on the classification tasks.
Upon further analysis, the cross-lingual retrieval mechanism contributes significantly to the model's performance.
This work lays the foundation for further studies on the application of cross-lingual retrieval and in-context learning methods in low-resource languages. Future work could extend this approach to even more underrepresented languages and potentially adapt it to more complex NLP tasks such as question answering or machine translation.

Limitations
While our study has yielded promising results, it is not without limitations. The effectiveness of retrieval augmentation is tied to the model architecture, and its impact on different models remains largely unexplored. In addition, the availability of specific language datasets for sentence retrieval and resource constraints remain practical challenges. Further exploration of prompt design and consideration of external factors could improve our methodology. Acknowledging these limitations is essential for a full interpretation of our results and the direction of future research.

Figure 2 :
Figure 2: Detailed overview of the PARC pipeline for LRLs using cross-lingual retrieval: (a) An LRL input is used as a query for the cross-lingual retriever, which retrieves the most semantically similar HRL sample from the HRL corpus. The associated label is either taken directly from the corpus (labeled setting) or determined by self-prediction (unlabeled setting). (b) Next, this HRL sample, its label, and the original input are combined to create a retrieval-enhanced prompt for MPLM prediction.

Table 2 :
Confusion matrices of the main method on the Vio-Lens (top) and SentNoB (bottom) test sets for BLOOMZ-3b and mBERT.

Table 3 :
Rouge scores of Bangla summarization.

Table 4 :
Results of the Bangla and Hindi prompt templates for the main method on the SentNoB test set with bloomz-3b.