Zero-Shot Cross-Lingual Sentiment Classification under Distribution Shift: an Exploratory Study

The brittleness of finetuned language model performance on out-of-distribution (OOD) test samples from unseen domains has been well studied for English, yet remains unexplored for multilingual models. We therefore study generalization to OOD test data specifically in zero-shot cross-lingual transfer settings, analyzing the performance impact of both language and domain shifts between training and test data. We further assess the effectiveness of counterfactually augmented data (CAD) in improving OOD generalization in the cross-lingual setting, since CAD has been shown to help in a monolingual English setting. Finally, we propose two new approaches for OOD generalization that avoid the costly annotation process associated with CAD, by exploiting the power of recent large language models (LLMs). We experiment with three multilingual models (LaBSE, mBERT, and XLM-R) trained on English IMDb movie reviews, and evaluate on OOD test sets spanning 13 languages: Amazon product reviews, Tweets, and Restaurant reviews. Results echo the OOD performance decline observed in the monolingual English setting. Further, (i) counterfactuals from the original high-resource language do improve OOD generalization in the low-resource languages, and (ii) our newly proposed cost-effective approaches reach similar or up to +3.1% better accuracy than CAD for Amazon and Restaurant reviews.


Introduction
To solve Natural Language Processing (NLP) tasks in low-resource languages, using multilingual models is a widely adopted strategy (Devlin et al., 2019; Artetxe and Schwenk, 2019; Conneau and Lample, 2019; Feng et al., 2022). A particularly popular paradigm is zero-shot cross-lingual transfer (Ruder et al., 2019; Artetxe et al., 2020b; Hu et al., 2020; Lauscher et al., 2020): pre-trained multilingual models are finetuned on downstream tasks with training data solely from a high-resource language (e.g., English). The resulting finetuned model can then be applied to samples in a low-resource language, i.e., without requiring costly training data in that language.
In such zero-shot cross-lingual transfer, the linguistic discrepancy between training and test languages poses a challenge: typically, performance is subpar compared to monolingual models. Several works have looked into narrowing the performance gap stemming from such language-based distribution shift (Liu et al., 2021; Yu and Joty, 2021; Zheng et al., 2021; Artetxe et al., 2023).
Yet, besides the language-based shift, in real-world settings there may also be a domain shift between training and test samples, i.e., test samples may comprise out-of-distribution (OOD) data (Quiñonero-Candela et al., 2008). For example, a sentiment classifier that predicts positive/negative consumer appreciation may be trained on movie reviews but applied to product reviews or tweets, where the underlying sentiment features are assumed to be invariant (Arora et al., 2021).
In a monolingual (English) setting, several studies unsurprisingly found a performance degradation when evaluating on OOD test data rather than on in-distribution (ID) data (Kaushik et al., 2019, 2020; Gardner et al., 2020; Katakkar et al., 2022). One of the underlying causes for that performance drop was found to be the classifier's reliance on spurious features, i.e., patterns that from a human perspective should not be indicative of the classifier's label (Poliak et al., 2018; Gururangan et al., 2018; McCoy et al., 2019; Wang and Culotta, 2020; Joshi et al., 2022): e.g., Wang and Culotta (2020) found the occurrence of "Spielberg" to be important for a positive sentiment classification.
A strategy that has been shown to improve OOD generalization in the monolingual English setting is the use of counterfactually augmented data (CAD), where annotators minimally revise training data to flip their labels (Kaushik et al., 2019). Yet, constructing such annotations is costly: Kaushik et al. (2019) report 5 min/sample.
In this paper, we present an exploratory study of OOD generalization specifically in a cross-lingual context, since we found this not to be covered in related work (§2). Specifically, we (i) identify the impact of OOD data on zero-shot cross-lingual transfer performance, aiming to disentangle performance drops stemming from language vs. domain shifts between training and test data, and (ii) propose and analyze two new data augmentation strategies to improve OOD generalization that avoid the costly annotations associated with using counterfactuals. For both, we present an empirical study (§3) within the domain of binary sentiment analysis. We consider English IMDb reviews (Maas et al., 2011) as in-distribution training data, with out-of-distribution test data spanning 13 languages across the Amazon (Keung et al., 2020), Tweets (Barbieri et al., 2022), and Restaurants (Pontiki et al., 2016) datasets. We further experiment with the pre-trained multilingual models mBERT (Devlin et al., 2019), XLM-R (Conneau and Lample, 2019), and LaBSE (Feng et al., 2022).
For (i), we answer a first research question, (RQ1) How well do zero-shot cross-lingual methods trained with English sentiment data generalize to out-of-distribution samples in non-English languages? To this end, we finetune the multilingual models on the English IMDb sentiment data, and evaluate their performance on OOD test samples in non-English languages.
For (ii), we answer (RQ2) How can zero-shot cross-lingual transfer methods better generalize to out-of-distribution samples, including for non-English languages? We consider a CAD baseline as proposed by Kaushik et al. (2019), where annotators minimally revise training data to flip their labels, since training on both original and counterfactual data improves OOD generalization to unseen domains in the monolingual English setting. Specifically, we finetune the multilingual models on both the original English and the counterfactually revised English IMDb reviews, and evaluate whether the OOD generalization gains observed in the monolingual setting also translate to OOD test samples in non-English languages.
We then propose (§3.3) two cost-effective alternatives to CAD, using Large Language Models (LLMs): (1) domain transfer, and (2) summarization, as illustrated in the bottom two rows of Fig. 1. For (1), we prompt an LLM to minimally edit both ID training and OOD test samples to map them onto the same hypothetical domain, e.g., books. For (2), we prompt an LLM to abstractively summarize both ID training and OOD test data, since we hypothesize that summaries capture the core essence of samples while removing non-essential, potentially spurious, information.
Our results (§4) show that in the OOD test setting for non-English languages, the performance of zero-shot cross-lingual transfer substantially declines, in line with OOD generalization studies in a monolingual English setting. We further find that CAD improves OOD generalization for non-English samples, with gains of up to +14.8%, +4.7%, and +7.9% for LaBSE, mBERT, and XLM-R, respectively. Finally, our cost-effective domain transfer and summarization data augmentation methods similarly improve OOD generalization, on par with or even surpassing CAD for Amazon and Restaurants by up to +3.1% in accuracy.

Related Work
Zero-shot cross-lingual transfer: A large part of multilingual NLP research focuses on improving the transfer of multilingual models trained on high-resource language data to low-resource languages. This can be achieved either by (i) cross-lingual pre-training schemes that yield stronger multilingual models (Artetxe and Schwenk, 2019; Conneau and Lample, 2019; Conneau et al., 2020; Xue et al., 2021; Feng et al., 2022; Chi et al., 2022), or (ii) fine-tuning strategies that facilitate better cross-lingual transfer (Liu et al., 2021; Yu and Joty, 2021; Zheng et al., 2021). Recently, Artetxe et al. (2023) revisited the translate-test and translate-train baselines (Shi et al., 2010; Duh et al., 2011; Artetxe et al., 2020a), where test samples are translated into English prior to evaluation, or, respectively, the training samples are translated into the test languages for fine-tuning a multilingual model. Artetxe et al. found that using more recent machine translation systems, e.g., NLLB (Costa-jussà et al., 2022), further boosts performance and often surpasses strong zero-shot cross-lingual methods. Hence, we also experiment with translate-test and translate-train approaches.

Cross-lingual transfer under distribution shift:
The limited research on the robustness of multilingual models has primarily focused on robustness against specific types of noise, e.g., adversarial perturbations for Japanese Natural Language Inference (Yanaka and Mineshima, 2021), a combination of general and task-specific text transformations based on manipulating synonyms, antonyms, syntax, etc. (Wang et al., 2021), and introducing errors and noise through Wikipedia edits (Cooper Stickland et al., 2023). Unlike these works, we evaluate how well zero-shot cross-lingual transfer from English to non-English test samples generalizes in scenarios where there is a shift in domain from training to test data: the domain-specific features of test samples may change, whereas the semantic sentiment features remain invariant.
Counterfactually augmented data (CAD): For English sentiment analysis, CAD is widely adopted to mitigate the effect of spurious patterns. For example, Kaushik et al. (2019, 2020) recruited Mechanical Turk workers to construct counterfactually revised samples by flipping labels with minimal editing, helping classifiers to learn real associations between samples and labels, thereby improving OOD generalization to unseen test domains. Building upon the success of CAD, several works have also studied how to automatically generate counterfactuals for English sentiment analysis (Wang and Culotta, 2021; Yang et al., 2021; Dixit et al., 2022; Howard et al., 2022; De Raedt et al., 2022). We adopt this CAD idea for OOD generalization in a zero-shot cross-lingual setting, which to the best of our knowledge has not been studied yet.
We start by exploring whether augmenting the English training data with the manually constructed counterfactuals from Kaushik et al. (2019) also benefits OOD generalization for non-English test samples. Additionally, we propose two new LLM-based methods as alternatives to constructing counterfactuals, aiming to specifically improve zero-shot transfer to non-English OOD test samples. We benchmark our new LLM-based methods against a CAD setup following Kaushik et al. (2019), thus assessing whether we can achieve similar OOD performance while avoiding CAD's costly human annotations. We further contrast classifiers trained on data augmented by our two new LLM-based methods with those trained on counterfactuals generated by CORE (Dixit et al., 2022), the state-of-the-art method in automatic counterfactual generation. CORE first retrieves naturally occurring counterfactual edits from an unlabeled text corpus and then, based on these retrieved edits, instructs an LLM (GPT-3) to counterfactually revise training samples.

Experimental Setup
We describe the English ID training data and non-English OOD test data in §3.1. Next, we outline the pre-trained multilingual models and the transfer strategies we experiment with in §3.2. In §3.3, we present our LLM-based domain transfer and summarization data augmentation methods. We cover finetuning and evaluation in §3.4.

Datasets
In-distribution (ID) training data: We use the subset of 1,707 English reviews selected by Kaushik et al. (2019) from the IMDb sentiment dataset (Maas et al., 2011) as training data, as well as 245 English validation samples. To better assess the OOD generalization of cross-lingual transfer, we also report in-distribution results for all 13 considered languages on the IMDb test set of 488 samples. However, the test set of Kaushik et al. (2019) is English-only. Hence, we translate the 488 English test samples into each of the 12 other non-English languages, using OpenAI's ChatGPT-turbo (v0301) (Ouyang et al., 2022), as it achieves translation quality that is competitive with commercial machine translation tools (e.g., Google Translate or Microsoft Translation Suite) (Jiao et al., 2023; Hendy et al., 2023; Peng et al., 2023), while being more cost-effective. Since we aim to explore the benefits of English CAD for OOD generalization also to non-English test samples, we augment the respectively 1,707 and 488 original training and validation samples with their English counterfactually revised counterparts provided by Kaushik et al. (2019). All training, validation, and test sets are equally balanced between positive and negative samples.

[Table 1: example IMDb, Amazon, Tweet, and Restaurant samples, alongside their domain-transferred (to books) and summarized versions.]
Out-of-distribution (OOD) test data: Our OOD test data comprises three non-movie domains: product reviews, tweets, and restaurant feedback. We use the MARC dataset (Keung et al., 2020) for Amazon product reviews in six languages: English, German, French, Spanish, Japanese, and Chinese.
For tweets, we use the recent multilingual test sets provided by Barbieri et al. (2022), in eight languages: English, German, French, Spanish, Arabic, Hindi, Portuguese, and Italian. For restaurant reviews, we use the multilingual aspect-based sentiment classification dataset of the 2016 SemEval Task 5 (Pontiki et al., 2016), i.e., its restaurant domain data covering six languages: English, Dutch, French, Spanish, Russian, and Turkish. Since SemEval Task 5 concerns aspect-based sentiment, we apply a rule-based mapping to cast it as a binary classification task: included reviews are labeled either as positive (if all aspects are positive or a mix of neutral and positive) or negative (if all aspects are negative or a mix of neutral and negative). We undersample test examples from the majority sentiment to ensure that all test sets are balanced. Further dataset statistics are provided in Appendix A.
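The rule-based mapping and undersampling described above can be sketched as follows; this is a minimal illustration of the stated rules, with hypothetical helper names (the paper does not publish its implementation):

```python
from collections import Counter
import random

def to_binary_label(aspect_labels):
    """Map a review's aspect-level sentiment labels to one binary label.

    Returns "positive" if all aspects are positive or a mix of neutral and
    positive, "negative" if all are negative or a mix of neutral and
    negative, and None otherwise (the review is excluded).
    """
    labels = set(aspect_labels)
    if "positive" in labels and "negative" not in labels:
        return "positive"
    if "negative" in labels and "positive" not in labels:
        return "negative"
    return None  # mixed positive/negative, or all-neutral: excluded

def balance_by_undersampling(examples, seed=0):
    """Undersample the majority class so both labels are equally frequent.

    `examples` is a list of (text, label) pairs.
    """
    counts = Counter(label for _, label in examples)
    n = min(counts.values())
    rng = random.Random(seed)
    balanced = []
    for label in counts:
        subset = [ex for ex in examples if ex[1] == label]
        balanced.extend(rng.sample(subset, n))
    return balanced
```

A review tagged, e.g., `["positive", "neutral"]` maps to "positive", while `["positive", "negative"]` is excluded.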

Zero-shot cross-lingual transfer
Pre-trained multilingual models: We consider the base-cased versions of two multilingual language models pre-trained with masked language model (MLM) objectives: mBERT, i.e., a multilingual variant of BERT (Devlin et al., 2019), and XLM-R, a RoBERTa-based multilingual model (Conneau and Lample, 2019). Additionally, we use the pre-trained multilingual sentence encoder LaBSE (Feng et al., 2022), which maps sentences to single 768-dimensional vector representations.
Transfer strategies: To transfer from the English ID training data to non-English test samples, we use three widely adopted strategies (Fig. 1, top row): (1) zero-shot: finetunes the multilingual model on the English ID training and validation sets, followed by directly evaluating the OOD test samples in the non-English languages.
(2) translate-test: finetunes the multilingual model on the English ID training and validation datasets. However, prior to making predictions for OOD test samples, the samples are translated into English.
(3) translate-train: first translates the English ID training and validation datasets to the target OOD test language. Subsequently, the multilingual model is trained on this translated data to then make predictions for the original, untranslated OOD test samples in that non-English language.
Note that in the case where both translate-train and CAD are used, the English CAD training and validation data are translated into the target OOD test language. For both translate-test and translate-train, we use OpenAI's ChatGPT-turbo (v0301) (Ouyang et al., 2022) as the LLM to translate from English to non-English languages and vice versa. We adopt OpenAI's default parameter values. See Appendix A for the translation prompts.
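The three strategies can be sketched schematically as below; `finetune` and `translate` are placeholder callables (hypothetical, for illustration only), standing in for model finetuning and ChatGPT-based translation:

```python
def zero_shot(finetune, train_en, dev_en, test_xx):
    """Finetune on English ID data; predict directly on non-English OOD samples."""
    model = finetune(train_en, dev_en)
    return model(test_xx)

def translate_test(finetune, translate, train_en, dev_en, test_xx):
    """Finetune on English ID data; translate OOD test samples into English first."""
    model = finetune(train_en, dev_en)
    return model([translate(x, target="en") for x in test_xx])

def translate_train(finetune, translate, train_en, dev_en, test_xx, lang):
    """Translate the English ID data into the test language, then finetune
    and predict on the original, untranslated OOD samples."""
    train_xx = [translate(x, target=lang) for x in train_en]
    dev_xx = [translate(x, target=lang) for x in dev_en]
    model = finetune(train_xx, dev_xx)
    return model(test_xx)
```

Note that only translate-test and translate-train incur per-sample or per-dataset translation costs; zero-shot relies entirely on the multilingual encoder.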

LLM-based data augmentation
We explore whether data augmentation using an LLM, as a cost-effective alternative to CAD, can also boost OOD generalization. We propose two such alternatives: (1) domain transfer, and (2) summarization. Our focus is on augmenting data for translate-test, as recent work has shown it to be more effective than zero-shot and translate-train (Artetxe et al., 2023). The multilingual models are finetuned on both the original and the augmented English ID training samples, with predictions made solely on augmented test samples. Table 1 provides illustrations of both strategies.

Domain transfer:
We align the domains of the original ID training samples and the OOD test samples translated into English to a common hypothetical domain. To achieve this, we instruct ChatGPT-turbo (v0301) (Ouyang et al., 2022) to minimally change the samples so that they relate to the new hypothetical domain; we experiment with the domain of books. Note that rather than solely mapping OOD test samples to the ID training domain of movies, we use a hypothetical domain and transform both training and test samples with an LLM, to avoid introducing a new distribution shift caused by a mismatch between the original human-written training samples and LLM-generated test samples. See Appendix A for our domain transfer prompt.
Summarization: For our second augmentation strategy, we abstractively summarize both the original English training samples and the translated English OOD test samples. We hypothesize that such concise summaries retain essential information while omitting non-essential and potentially spurious features, e.g., specific syntactic structures and lexical choices, thereby a priori preventing classifiers from relying on such features for prediction. Furthermore, transforming text with language models, i.e., through summarization, may have the added benefit of normalizing background, non-sentiment-related features. Hence, summarizing the data can lead to more uniform syntax and word choice across test and training samples, potentially further narrowing the distribution mismatch between ID training and OOD test samples. Appendix A lists the exact prompt that we supply to ChatGPT-turbo (v0301) (Ouyang et al., 2022), using OpenAI's default parameter values.
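Prompt templates for the two strategies might look as follows. The wording below is an illustrative assumption, not the paper's actual prompts (which are listed in its Appendix A):

```python
def domain_transfer_prompt(review: str, domain: str = "books") -> str:
    """Ask the LLM to minimally edit a review onto a hypothetical domain.

    Hypothetical wording; the paper's exact prompt is in its Appendix A.
    """
    return (
        f"Minimally edit the following review so that it is about {domain}, "
        f"keeping its sentiment and overall structure unchanged.\n\n"
        f"Review: {review}"
    )

def summarization_prompt(review: str) -> str:
    """Ask the LLM for a short abstractive summary that preserves sentiment.

    Hypothetical wording; the paper's exact prompt is in its Appendix A.
    """
    return (
        "Summarize the following review in one or two sentences, "
        f"preserving its sentiment.\n\nReview: {review}"
    )
```

Either prompt would then be sent to ChatGPT-turbo (v0301) with OpenAI's default parameter values, as described above.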

Finetuning and evaluation
We finetune the MLM-based models, i.e., mBERT and XLM-R, by adding a classification head on the [CLS] token. We use the Hugging Face Transformers library (Wolf et al., 2020) and train on a single Tesla V100 GPU for 20 epochs, with a batch size of 38 and a learning rate of 5e-6. To select the optimal model, we employ early stopping on the validation loss, with a threshold of 0.01 and a patience of 10. Since we are also interested in measuring the performance of a more compute-efficient model, we freeze LaBSE's parameters and train, on CPU, a logistic regression model on LaBSE's sentence vectors through five-fold cross-validation. We use the scikit-learn library (Pedregosa et al., 2011) with lbfgs (Liu and Nocedal, 1989) as the solver, and set the maximum number of iterations to 5,000.
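The LaBSE pipeline amounts to logistic regression on frozen 768-dimensional sentence vectors. A minimal scikit-learn sketch with the stated hyperparameters is shown below; synthetic Gaussian vectors stand in for actual LaBSE embeddings, which would come from encoding the reviews:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for LaBSE's 768-dim sentence embeddings:
# two loosely separated clusters for positive vs. negative reviews.
rng = np.random.default_rng(0)
dim, n_per_class = 768, 100
pos = rng.normal(loc=+0.2, scale=1.0, size=(n_per_class, dim))
neg = rng.normal(loc=-0.2, scale=1.0, size=(n_per_class, dim))
X = np.vstack([pos, neg])
y = np.array([1] * n_per_class + [0] * n_per_class)

# Solver and iteration cap as reported in the text; LaBSE itself stays frozen,
# so only this linear classifier is trained (cheap enough for CPU).
clf = LogisticRegression(solver="lbfgs", max_iter=5000)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
mean_acc = scores.mean()
```

With real LaBSE vectors, `X` would be obtained by encoding the (translated) reviews once and reusing the embeddings across the five folds.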
To assess the performance of each transfer strategy, we report the mean accuracy over 5 training runs with different random initializations, i.e., randomly selected weights for mBERT/XLM-R and randomly selected cross-validation folds for LaBSE.
Note that classifiers trained on CAD, as well as on data augmented by our two strategies, use 1.7k extra manually constructed counterfactuals and 1.7k extra LLM-generated samples, respectively, in addition to the 1.7k original IMDb training samples. To ensure that the OOD generalization gains from CAD and our two augmentation strategies are not solely attributable to the increased number of training samples, we randomly sample an extra 1.7k original English IMDb reviews from the IMDb dataset of Maas et al. (2011) for the original-only strategy (i.e., models trained without counterfactuals or augmented data). As such, all considered strategies are trained on 3.4k samples.


Experimental Results and Discussion

Zero-shot cross-lingual out-of-distribution generalization
We first address (RQ1), assessing OOD generalization to non-English samples. Table 2 presents both ID and OOD accuracies of the original-only method, which trains solely on (translated) English IMDb movie reviews without data augmentation.
We see that, both for English and non-English, all models and transfer strategies decline in performance when evaluated on OOD rather than ID test samples. For example, the zero-shot strategy's drop from English ID to English OOD (ID EN → OOD EN) ranges from 8.7%-18.7% for LaBSE, 9.3%-13.6% for mBERT, and 6.1%-8.1% for XLM-R. Similarly, for non-English (ID NON-EN → OOD NON-EN), the performance drops for LaBSE, mBERT, and XLM-R vary within the ranges of 10.8%-17.1%, 8.6%-18%, and 3.4%-19.2%, respectively. These findings suggest that the performance decline on non-English OOD test samples is substantial, as was already known (and here confirmed again) for English. We do not, however, see a consistently stronger decline for non-English than for English, as one might intuitively expect. This is discussed in more detail in the next paragraph.
English vs. non-English OOD generalization: We assess whether multilingual models generalize better to English than to non-English OOD test data. Overall, the EN versus NON-EN scores in Table 2 reveal that the MLM-based models mBERT and XLM-R generalize less well to non-English than to English OOD test samples: the accuracies for non-English languages are lower in most cases. Surprisingly, the converse holds for LaBSE: it attains consistently better non-English OOD accuracies than English ones on Amazon and Restaurants. Note, however, the absolute performance of the three models: LaBSE appears to be the least accurate model on English in most cases. This is consistent with the fact that its encoder remains frozen during training on English, unlike the other encoders, whereas LaBSE's non-English performance is more on par with the other models.
While our results suggest that the performance decline on OOD test samples is substantial for both non-English and English, the disparity in OOD performance between non-English and English depends on (i) the pre-trained multilingual model and transfer strategy, and (ii) the type of OOD data.
Impact of the pre-trained multilingual model and transfer strategy: We compare the OOD generalization of LaBSE, mBERT, and XLM-R, and benchmark the translation-based transfer strategies' OOD generalization against the zero-shot approach.
The results in Table 2 reveal large OOD generalization gains for non-English languages using translate-test and mBERT, with accuracy gains between +5.6% and +9.3%. This supports our earlier observation that mBERT is more biased towards English. For LaBSE, translate-train is most effective on Amazon and Restaurants, with average accuracy boosts of +2.1% and +2.3% respectively, but not on Tweets (−1.8%). For XLM-R, Restaurants and Tweets benefit most from translation: translate-train (translate-test) surpasses zero-shot with respective gains of +3.8% (+2.3%) and +3.3% (+2.5%). In conclusion, while translation-based strategies can further boost the OOD generalization of zero-shot cross-lingual transfer, the benefits depend on the multilingual model and the OOD test data.

Out-of-distribution generalization with data augmentation
To address (RQ2) on achieving better OOD generalization, we first analyze the effect of augmenting training data with the manually constructed counterfactuals of Kaushik et al. (2019). These counterfactuals serve as an upper baseline against which we subsequently compare the performance of models trained on (i) counterfactuals generated by the state of the art in automatic counterfactual construction, i.e., CORE (Dixit et al., 2022), and (ii) our LLM domain-transferred and summarized augmented data.
Manually constructed counterfactuals: Comparing the original + CAD results in Table 3 against the original-only results confirms that CAD also improves OOD generalization for non-English test samples. The bold and underlined scores in Table 3 denote the top two results. Our summarization strategy achieves the best non-English OOD generalization on Amazon and Restaurants, on par with (or surpassing) models trained on CAD. On Tweets, while summarization still improves over models trained solely on the original data, training on CAD or CORE (for XLM-R) yields the best results.
These findings support the efficacy of cost-effective data augmentation as a viable alternative to manually constructed counterfactuals for non-English test data. It is worth noting that our summarization and domain transfer methods scale linearly, only requiring a single transformation of the training samples of each class. However, it is doubtful that CAD and CORE can be similarly expanded beyond binary sentiment classification due to their quadratic data complexity: counterfactuals have to be constructed among every pair of classes.
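One plausible way to count the required operations per sample (an illustrative assumption, not a formula from the paper) contrasts the two scaling behaviors: counterfactuals need a label flip for every ordered pair of distinct classes, while our transforms rewrite each class's samples once.

```python
def cad_directions(k: int) -> int:
    """Label-flip directions among k classes: each class to every other class,
    i.e., quadratic in k."""
    return k * (k - 1)

def transform_operations(k: int) -> int:
    """Domain transfer / summarization: one transformation per class,
    i.e., linear in k."""
    return k
```

For binary sentiment (k = 2) the two are comparable, but at, say, 10 classes the counterfactual approach needs 90 flip directions versus 10 transformations.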
Impact of LLM-based data augmentation on monolingual OOD generalization: Thus far, our analysis has primarily focused on the generalization from English ID training data to non-English OOD test data. Here, we investigate whether our summarization and domain transfer strategies can also help classifiers generalize in the well-studied monolingual setup, i.e., from English training data to English OOD test data. In this setup, the translate-test step is omitted: both the English ID training reviews from IMDb and the English OOD test samples are summarized or domain transferred, without any prior translation.
Comparing the EN scores across the different transfer strategies in Table 3 for each of LaBSE, mBERT, and XLM-R reveals findings similar to the OOD generalization to non-English languages. (i) For Amazon and Restaurants, all data augmentation approaches deliver classifiers that generalize better OOD than the original-only classifiers trained without augmented data. Our summarization strategy achieves the best overall results, surpassing both classifiers trained on CORE and on manually constructed counterfactuals (CAD), except for mBERT on Amazon, where CAD yields a minor accuracy gain of 0.6% over summarization. (ii) Surprisingly, for Tweets, only classifiers trained on manually constructed CAD show consistent OOD generalization improvements over original-only classifiers. This contrasts with the results observed for non-English, where CORE and our summarization augmentation approach were able to improve upon the original-only classifiers.
Overall, these results highlight that our summarization strategy can also benefit monolingual OOD generalization, surpassing classifiers augmented with either CAD or CORE-generated counterfactuals for Amazon and Restaurants.
Ablations: We provide ablations in Table 4 for our most effective strategy, i.e., summarization, and find that: (i) the benefit of translating test samples into English (translate-test) versus solely augmenting the training data with summaries (zero-shot) varies with the multilingual model and/or OOD test data: there are clear OOD improvements on non-English samples for mBERT and XLM-R, but results for LaBSE are mixed and comparable to the zero-shot strategy; (ii) more importantly, further summarizing the English-translated test samples improves OOD generalization more than solely translating them into English, consistently boosting accuracies by up to +5% for LaBSE and +4.3% for mBERT, across all datasets. For XLM-R, summarization slightly reduces accuracy compared to translation alone, e.g., −1.2% for non-English languages on Amazon and −1.9% on Tweets, yet still boosts OOD generalization on Restaurants by 3.1% over translate-test.

Cost-effectiveness of LLM-based augmentation:
To assess the cost-effectiveness of our LLM-based augmentation, we discuss the costs of our best approach, i.e., summarization, and compare them to the cost of employing human workers to manually construct counterfactuals. Kaushik et al. (2019) report that human workers spent an average of 5 minutes revising a single IMDb review, with each worker earning $0.65 per revised review. Therefore, manually revising 1.7K training reviews incurs a total cost of ≈$1,105 and ≈141 hours of labor.
In contrast, our summarization strategy costs $0.0003 on average to summarize a single training IMDb review, totaling $0.51 for all 1.7K training reviews. However, our best OOD generalization is achieved not only by summarizing training reviews, but also by using an LLM during inference to: (1) translate non-English test samples into English (translate-test), and (2) further summarize the English-translated test samples. For (1), the cost is $0.00015 per OOD sample. For (2), an additional cost of $0.00007 is required per OOD sample (summarizing OOD test samples is less costly than summarizing IMDb training reviews, as the test samples comprise fewer tokens). The reported costs per test sample are averages over all OOD test sets and non-English languages.

In conclusion, our summarization strategy costs $0.51 to summarize all 1.7K training samples, plus $0.00022 (=(1)+(2)) per inference. Thus, for the same cost as employing human workers for CAD creation (≈$1,105), our summarization strategy enables inference on 5M test samples. Note, however, that the best overall performance of classifiers augmented with CAD is achieved with translate-test. Therefore, if we also account for the translation costs of the CAD-augmented classifiers, our summarization method can perform inference on 15M test samples for the same cost as employing human workers for CAD creation. This demonstrates the cost-effectiveness of our summarization approach for up to 5M test samples as compared to zero-shot + CAD, and up to 15M when compared to translate-test + CAD. For future work, exploring open-source LLMs, or dedicated translation and summarization models, could prove valuable for reducing inference costs.
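As a sanity check, the break-even points above follow directly from the quoted per-sample costs:

```python
# Break-even arithmetic for the cost comparison (figures quoted in the text).
cad_fixed = 1_105.00         # manual CAD construction for 1.7K reviews ($)
summ_fixed = 0.51            # LLM summarization of 1.7K training reviews ($)
translate_per_test = 0.00015 # translate one OOD test sample to English ($)
summarize_per_test = 0.00007 # summarize one translated test sample ($)

# Scenario 1: CAD with zero-shot inference (no per-sample LLM cost) vs.
# summarization with translate-test (+$0.00022 per test sample).
n1 = (cad_fixed - summ_fixed) / (translate_per_test + summarize_per_test)

# Scenario 2: CAD also uses translate-test, so only the extra summarization
# step counts against our method.
n2 = (cad_fixed - summ_fixed) / summarize_per_test
```

This gives roughly 5M test samples for scenario 1 and roughly 15.8M for scenario 2, matching the 5M and 15M figures reported above.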

Conclusions
We explored the generalization of zero-shot cross-lingual transfer to out-of-distribution (OOD) test data, considering both language and domain shifts. Our experiments on binary sentiment classification, with the pre-trained multilingual models LaBSE, mBERT, and XLM-R finetuned on English IMDb movie reviews and evaluated on non-English test samples comprising Amazon product reviews, Restaurant feedback, and Tweets, demonstrate that model performance substantially degrades, in line with previous OOD generalization studies in a monolingual English setting. We also found that mBERT and XLM-R suffer stronger performance reductions on non-English OOD data than on English OOD data, while LaBSE's generalization strongly depends on the OOD dataset. Our experiments with models finetuned on original data augmented with manually constructed English counterfactual (CAD) IMDb reviews show that CAD's OOD generalization gains observed in a monolingual English setting translate well to a zero-shot cross-lingual setup. Finally, to avoid costly manually constructed counterfactuals, we proposed two new LLM-based data augmentation approaches for OOD generalization: (i) domain transfer, and (ii) summarization. Models trained on data augmented by our summarization strategy show substantial gains across all datasets and models, on Amazon and Restaurants surpassing models augmented with either (i) manually constructed CAD (Kaushik et al., 2019) or (ii) state-of-the-art CORE-generated counterfactuals (Dixit et al., 2022).

Limitations
Task domain: In this exploratory study, we only presented results for zero-shot cross-lingual binary sentiment classification. Further analysis is required to investigate whether our findings generalize beyond binary classification, and to other, non-classification tasks. Nevertheless, as mentioned in §4.2, our data augmentation approaches scale better to classification tasks with more than two classes, since they only require summarizing/transferring the training samples of each class once, whereas it is unclear how to scale counterfactuals to a larger number of classes.
Automatically translated in-distribution test data: Since we followed a setup similar to Kaushik et al. (2019), our experiments used the IMDb movie reviews as in-distribution sentiment data. While the main focus of our study is out-of-distribution generalization, the in-distribution test set was only available in English. Hence, we used translation tools to automatically translate the English IMDb test set into the considered non-English languages. This may have introduced annotation artifacts in the translated in-distribution test sets, making it unclear how well the reported in-distribution results for non-English languages reflect real-world test data in those languages.

Translate-test based on a multilingual model:
As our aim was to analyze the out-of-distribution generalization of multilingual models and compare their performance, we did not include results for translate-test based on a monolingual English model. We believe that using such a monolingual model could further boost the accuracy of translate-test, as well as of our summarization and domain transfer strategies. However, we leave this exploration for future work.
Applicability to low-resource languages: The effectiveness of the translate-test and translate-train approaches is highly dependent on the accuracy of the adopted machine translation system. In this study, we used ChatGPT-turbo (v0301) as our translation tool, and found it to produce high-quality translations for all languages considered in our experiments, i.e., boosting OOD generalization compared to the zero-shot strategy. However, such machine translation systems may not work well for low-resource languages that lack high-quality translation data.

Ethics Statement
Since our data augmentation methods use LLMs to generate summaries or create domain-transferred training (and test) samples, any biases present in the data used to train these LLMs could carry over to the augmented data. We should therefore be careful to ensure that such biases do not propagate when training models on the augmented data, to avoid models that could discriminate against and/or potentially harm certain demographics.

A Appendix
Datasets: Tables 5 and 6 summarize, respectively, the number of out-of-distribution test samples and the number of in-distribution train, validation, and test samples. Note that the numbers of samples for translate-train and translate-test exactly match those shown in the tables.
Prompts: Figs. 2 and 3 show the prompts we adopted to instruct ChatGPT-turbo to translate (i) non-English out-of-distribution test samples into English for translate-test, and (ii) English in-distribution training and validation samples into non-English for translate-train.
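As a hypothetical reconstruction of how such translation prompts can be assembled in the chat-message format ChatGPT expects (the function name, wording, and the German example are ours; the paper's exact prompts are those in Figs. 2 and 3):

```python
def build_translation_messages(text: str, source_lang: str, target_lang: str) -> list:
    """Build a chat-style prompt asking the model to translate a review.

    Hypothetical sketch in the spirit of Figs. 2 and 3; the actual prompt
    wording used in the paper may differ.
    """
    return [
        {"role": "system",
         "content": "You are a translation assistant. Preserve the sentiment "
                    "and meaning of the review."},
        {"role": "user",
         "content": f"Translate the following {source_lang} review into "
                    f"{target_lang}:\n\n{text}"},
    ]

# Translate-test direction: a non-English OOD test sample into English.
# ("Das Essen war ausgezeichnet!" = "The food was excellent!")
msgs = build_translation_messages("Das Essen war ausgezeichnet!", "German", "English")
print(msgs[1]["content"])
```

The same builder covers both directions: translate-train simply swaps the source and target languages so that English in-distribution training samples are mapped into each evaluation language.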

Detailed ID and OOD results per language:
The in-distribution and out-of-distribution results per language are presented in Tables 7 and 8.

Fig. 1 :
Fig. 1: Zero-shot cross-lingual transfer setup. Multiple transfer strategies are shown, including our newly proposed summarization and domain transfer methods for boosting OOD generalization.
IMDB: If you haven't read this book, it's terrible. It is pure trash. I read this about 17 years ago, and I'm still screwed up from it.

Table 1 :
LLM-based data augmentation. Top: original ID training and OOD test samples (including English translations). Middle: mapping of the diverse domain samples onto the hypothetical books domain. Bottom: how summarization retains essential information while removing potentially spurious elements.

Table 2 :
In-distribution vs. out-of-distribution test accuracies for the original only strategy trained solely on IMDb reviews (without CAD or data augmentation). Results are presented for English (EN) and non-English (NON-EN) test data, with the latter's accuracies averaged across all non-English languages per test set. Detailed results per language are provided in Appendix A. Note that for English, TTRAIN and TTEST do not involve any translation, hence their EN scores are equivalent to ZSHOT. Further, ID scores for TTEST are omitted as these would involve back-translating the non-English ID samples (originally translated from English ID test data per §3.1) to English, which would largely assess back-translation quality.

Impact of the transfer strategies: We assess the translate-train and translate-test strategies for

Table 3 :
Out-of-distribution generalization with data augmentation. Original only: baseline model trained solely on IMDb reviews, without CAD or data augmentation. +CAD: augments IMDb training samples with manually constructed counterfactuals. +CORE: augments training samples with automatically generated counterfactuals. +Domain transfer and +Summarization augment the training data with our newly proposed strategies. Best model in bold, runner-up underlined.

Table 4 :
Ablations of our best data augmentation strategy: summarization. ZSHOT: trains on the original English and summarized English IMDb reviews. +TTEST: additionally translates test samples to English. +SUM.: further summarizes the English translated test samples prior to inference.

Table 8 :
Out-of-distribution accuracies for LaBSE, mBERT, and XLM-R. Best model in bold, runner-up underlined. ♠: ablations. For English, TTRAIN and TTEST do not involve any translation, hence their EN scores are equivalent to ZSHOT. Highlighted rows show a one-on-one comparison between classifiers augmented with (i) our summarization strategy and (ii) the state-of-the-art generated CORE counterfactuals.