Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation

In this paper, we explore the challenging problem of performing a generative task in a target language when labeled data is only available in English, using summarization as a case study. We assume a strict setting with no access to parallel data or machine translation and find that common transfer learning approaches struggle in this setting, as a generative multilingual model fine-tuned purely on English catastrophically forgets how to generate non-English. Given the recent rise of parameter-efficient adaptation techniques, we conduct the first investigation into how one such method, prompt tuning (Lester et al., 2021), can overcome catastrophic forgetting to enable zero-shot cross-lingual generation. Our experiments show that parameter-efficient prompt tuning provides gains over standard fine-tuning when transferring between less-related languages, e.g., from English to Thai. However, a significant gap still remains between these methods and fully-supervised baselines. To improve cross-lingual transfer further, we explore several approaches, including: (1) mixing in unlabeled multilingual data, and (2) explicitly factoring prompts into recombinable language and task components. Our approaches can provide further quality gains, suggesting that robust zero-shot cross-lingual generation is within reach.


Introduction
Cross-lingual language understanding is an important area of ongoing research (Conneau et al., 2020; Hu et al., 2020; Ruder et al., 2021). With vastly differing amounts of data (both labeled and unlabeled) available across languages, there is significant value in developing techniques that can transfer knowledge from higher-resource languages to improve performance in lower-resource languages. Zero-shot cross-lingual benchmarks push on the limiting case where no labeled data is available in the target language.

Figure 1: A demonstration of WIKILINGUA-0, a challenging zero-shot cross-lingual generation (XGEN) task, which requires a model to learn a generative task from labeled data in one language (i.e., English), and then perform the equivalent task in another language at inference time. Training time: adapt a pretrained multilingual LM (mT5) to English summarization using prompt tuning or model tuning. Inference time: apply the resulting LM to summarize articles written in non-English languages (zero-shot cross-lingual).
Remarkable progress has been made on zero-shot cross-lingual tasks by scaling up the size of pre-trained multilingual models (Conneau et al., 2020; Xue et al., 2021). However, prior work has focused nearly exclusively on non-generative tasks (e.g., classification, extractive question answering, and sequence labeling).
In this paper, we turn our attention to zero-shot cross-lingual generation, or "XGEN", which requires a model to learn a generative task from labeled data in one language (typically English), and then perform the equivalent generative task in another language. This problem is particularly challenging because generative models trained on one language are known to exhibit catastrophic forgetting, losing the ability to generate coherent text in other languages (Xue et al., 2021; Maurya et al., 2021; Shakeri et al., 2021). In particular, we focus on the relatively under-explored task of zero-shot cross-lingual summarization. We construct a new zero-shot task, WIKILINGUA-0, from the WIKILINGUA dataset (Ladhak et al., 2020), allowing us to test XGEN capabilities across 18 languages. We motivate a new evaluation metric for our task, SP-ROUGE, and show that it correlates well with human judgments of summary quality.

Maurya et al. (2021) show improved performance on XGEN tasks by freezing model parameters in the input and output layers during fine-tuning. Inspired by recent parameter-efficient adaptation techniques (Houlsby et al., 2019; Zaken et al., 2021; Li and Liang, 2021; Lester et al., 2021), we take this approach further: can we overcome catastrophic forgetting by freezing all of the pre-trained model parameters and only tuning a much smaller set of task-specific parameters? Parameter-efficient tuning methods are particularly appealing for multilingual NLP, as they would enable reuse of a single frozen model across many combinations of task and language, reducing storage and serving costs.
To this end, we conduct the first investigation of the XGEN performance of PROMPTTUNING (Lester et al., 2021), a simple parameter-efficient adaptation technique that limits learned parameters to a set of virtual tokens prepended to the text input. We compare PROMPTTUNING with standard fine-tuning (or MODELTUNING, where all model weights are tuned) across different languages and model scales. We find that increasing model size and decreasing tunable parameter capacity are key for overcoming catastrophic forgetting. Despite its inferior performance on the training language (English), PROMPTTUNING with scale typically outperforms MODELTUNING when evaluated on non-English languages, especially on languages more distantly related to English, such as Thai. This corroborates previous findings (Li and Liang, 2021; Lester et al., 2021) that parameter-efficient methods are more robust to domain shifts between training and inference.

Motivated by our initial findings, we investigate two approaches to further improve the XGEN performance of PROMPTTUNING and MODELTUNING. Our first approach involves mixing unlabeled data in the target language into the supervised training stage. We show this dramatically alleviates catastrophic forgetting on WIKILINGUA-0. We also introduce a novel approach, "factorized prompts", which is specifically designed for PROMPTTUNING. We train prompts on a multi-task multilingual mixture, where each prompt is factorized into composable language and task modules: the first half of the prompt encodes language knowledge, while the second half captures language-agnostic task knowledge. During inference in the zero-shot cross-lingual setting, the source language module is replaced with the target language module, while the task module remains unchanged. We demonstrate that factorized prompts provide an effective means of improving XGEN performance.
In summary, we study zero-shot cross-lingual generation (XGEN), where a model is trained on a generative task in one language (typically English), and then asked to perform the equivalent task in another language during inference. We construct a novel zero-shot cross-lingual summarization task and show that state-of-the-art text-to-text models adapted using MODELTUNING and PROMPTTUNING techniques are not able to successfully perform our task. Our analysis reveals that both techniques suffer from catastrophic forgetting, causing them to often generate text in the wrong language.

Problem formulation
Defining WIKILINGUA-0 zero-shot cross-lingual summarization: We leverage the WIKILINGUA dataset (Ladhak et al., 2020; Gehrmann et al., 2021) to create a novel zero-shot cross-lingual summarization task, which we dub WIKILINGUA-0. While WIKILINGUA provides labeled training data in 18 languages (including English), we are interested in a more realistic experimental setup where no training data is provided in non-English languages, as it is less practical to obtain labeled data for real low-resource languages. As such, we discard all training data for non-English languages, with the exception of ablation experiments, and cast WIKILINGUA as training a model with English summarization data and feeding it non-English articles during zero-shot evaluation.

Defining SP-RG for multilingual summarization evaluation: ROUGE (Lin, 2004) has been the metric of choice for evaluating summarization systems. However, it assumes that the input text uses spaces to separate words, which is not the case for many languages (e.g., Chinese, Japanese, and Thai). One possible solution is to use language-specific tokenizers, as done in Conneau and Lample (2019). To avoid language-specific preprocessing, we use SentencePiece sub-word tokenization (Kudo and Richardson, 2018), which is data-driven and language-independent. We call our metric SP-ROUGE (SentencePiece-based ROUGE) or SP-RG for short, and report SP-RG-LSUM in our experiments. In Appendix B, we demonstrate that SP-ROUGE produces a similar correlation to human judgments as BLEURT (Sellam et al., 2020) while being significantly more computationally efficient.
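For concreteness, a minimal sketch of SP-ROUGE is shown below. It assumes the open-source sentencepiece and rouge-score Python packages and a hypothetical path to an mT5 SentencePiece model; it is an illustration of the idea rather than necessarily our exact implementation.

```python
import sentencepiece as spm
from rouge_score import rouge_scorer

# Hypothetical path to a SentencePiece model (e.g., the one shipped with mT5).
SPM_PATH = "/path/to/mt5/sentencepiece.model"

sp = spm.SentencePieceProcessor(model_file=SPM_PATH)
# rougeLsum corresponds to the ROUGE-LSUM variant reported in our experiments.
scorer = rouge_scorer.RougeScorer(["rougeLsum"], use_stemmer=False)


def sp_rouge(prediction: str, reference: str) -> float:
    """Compute SP-RG-LSUM: tokenize with SentencePiece, then apply ROUGE."""

    def to_pieces(text: str) -> str:
        # Re-join sub-word pieces with spaces so ROUGE's whitespace
        # tokenization operates on SentencePiece tokens rather than words.
        return " ".join(sp.encode(text, out_type=str))

    scores = scorer.score(to_pieces(reference), to_pieces(prediction))
    return scores["rougeLsum"].fmeasure


# Example usage; no language-specific word segmentation is needed.
print(sp_rouge("use calming background sound",
               "use calming sounds to mask the noise"))
```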

Baselines
In addition to vanilla MODELTUNING and PROMPTTUNING, we consider the following baselines: LEAD-64: This baseline simply copies the first 64 SentencePiece tokens from the input article.
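A minimal sketch of this baseline, under the same assumptions as the SP-ROUGE sketch above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="/path/to/mt5/sentencepiece.model")


def lead_64(article: str, n_tokens: int = 64) -> str:
    """Return the detokenized first 64 SentencePiece tokens of the article."""
    pieces = sp.encode(article, out_type=str)
    return sp.decode_pieces(pieces[:n_tokens])
```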

TRANS-TRAIN:
We perform MODELTUNING or PROMPTTUNING on WIKILINGUA-0 English summarization data that is translated into the target language using GOOGLE TRANSLATE.

TRANS-TEST:
We train on English summarization data and evaluate on validation data that is translated from the target language to English.

SUP & SUP-ALL:
To ablate the impact of using the labeled training data provided in the original WIKILINGUA dataset for all languages, we either train on supervised data for each individual target language (SUP) or on a mixture of supervised data from all languages (SUP-ALL).

Training and implementation details
We perform MODELTUNING and PROMPTTUNING on top of pretrained mT5 checkpoints (Xue et al., 2021) of all sizes: SMALL, BASE, LARGE, XL, and XXL, using T5X (Roberts et al., 2022). For PROMPTTUNING, we create an LM-adapted version of these checkpoints by further training them for 100K steps with the "prefix LM" objective (Raffel et al., 2020) using mC4 (Xue et al., 2021) data for all languages. Except for ablations, we use 100 prompt tokens and initialize the prompt by sampling from the vocabulary embeddings. Training inputs and targets are clipped to 1024 and 512 SentencePiece tokens, respectively. We always train for 100,000 steps for both MODELTUNING and PROMPTTUNING. We save a checkpoint every 5,000 steps and report results on the model checkpoint corresponding to the highest performance on a target language, using 250 validation examples for all languages.

Results and Discussion

WIKILINGUA-0 is challenging for both MODELTUNING and PROMPTTUNING: Our zero-shot evaluation results on WIKILINGUA-0 for French (FR), Vietnamese (VI), Russian (RU), and Thai (TH) are shown in Figure 2a. For comparison, we also include results on English. Overall, we find that zero-shot inference on an unseen language leads to a substantial performance drop for both model adaptation techniques, especially when feeding in articles in non-Latin script languages like Russian and Thai. Consistent with the findings in An et al. (2022) for other generative tasks, we find that PROMPTTUNING, even with scale, falls far below MODELTUNING on monolingual English summarization.

PROMPTTUNING is better on larger language shifts: Interestingly, PROMPTTUNING is competitive with or outperforms MODELTUNING when evaluated on other languages. For instance, at the XXL scale, PROMPTTUNING outperforms MODELTUNING by a large margin of +7.3 SP-ROUGE (37.4 vs. 30.1) on Thai. A closer look at these results reveals an interesting pattern: as model size increases, PROMPTTUNING usually produces better results than MODELTUNING when there is a significant language shift at inference time (e.g., from English to a non-Latin script language). This corroborates the view in Lester et al. (2021) that MODELTUNING may be over-parameterized and thus more prone to overfit the training task and less robust to domain shifts.
Both MODELTUNING and PROMPTTUNING suffer from catastrophic forgetting, and this effect is more pronounced for MODELTUNING: When performing zero-shot evaluation on non-English languages, we discover that both MODELTUNING and PROMPTTUNING often partially summarize non-English articles into English instead of the target language. This suggests that they suffer from overfitting on the training task. To probe more deeply into this problem, we evaluate performance for each saved checkpoint, and additionally measure: (i) LIDlang, the average confidence score given by cld3 when detecting the language lang, and (ii) ASCII, the average percentage of ASCII characters present in the model's predictions, with a higher value indicating a larger amount of English in the model's output for non-Latin script languages. Figure 3 shows our evaluation results as training progresses. For PROMPTTUNING, we observe a clear "deteriorating" trend: the longer the prompt is tuned on English, the more unwanted English is generated, and the lower summarization quality becomes for Russian and Thai. For MODELTUNING, even by the first checkpoint, the model has already heavily overfit to English, outputting >60% ASCII for Russian and Thai inputs. There is a modest recovery later in training, but quality as measured by SP-ROUGE remains low.
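As a rough illustration of these two diagnostics, the snippet below computes the ASCII percentage directly and obtains a cld3 confidence score via the gcld3 Python bindings; the byte limits and the averaging loop are illustrative choices, and the exact evaluation code used in our experiments may differ.

```python
import gcld3

# cld3-based language identifier.
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=2000)


def ascii_percentage(text: str) -> float:
    """Percentage of ASCII characters; high values signal unwanted English in
    predictions for non-Latin script target languages."""
    if not text:
        return 0.0
    return 100.0 * sum(ch.isascii() for ch in text) / len(text)


def lid_confidence(text: str, target_lang: str) -> float:
    """cld3 confidence that `text` is in `target_lang` (0 if another language
    is detected)."""
    result = detector.FindLanguage(text=text)
    return result.probability if result.language == target_lang else 0.0


predictions = ["..."]  # model outputs for, e.g., Thai validation articles
avg_ascii = sum(map(ascii_percentage, predictions)) / len(predictions)
avg_lid_th = sum(lid_confidence(p, "th") for p in predictions) / len(predictions)
```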
Bigger models are less prone to forget: In Figure 2b, we observe that moving to larger model sizes mitigates catastrophic forgetting to a large extent. This is true both for MODELTUNING (in line with the findings of Xue et al. (2021)) and for PROMPTTUNING. For example, at SMALL size, MODELTUNING and PROMPTTUNING only successfully generate Russian text 0.0% and 10.1% of the time, respectively, whereas at XXL size, these numbers jump to 57.5% and 84.4%.
Too much capacity is harmful: Figure 2c shows an interesting "paradox of capacity" with regard to the prompt length for PROMPTTUNING. On the one hand, greater capacity (in the form of longer prompts) clearly helps the model learn the summarization task. On the other hand, the greater the capacity to learn from English training data, the more the model forgets other languages. We observe that at the beginning of training, the small amount of English introduced in generated outputs is eclipsed by the improvement in summarization quality, which results in a better SP-ROUGE score. As training continues, however, the increased capacity becomes harmful: more and more English is introduced in the model's output, which outweighs the improvement in summarization quality and leads to lower SP-ROUGE. For each language and model size, we observe a critical point past which adding extra capacity becomes harmful. For instance, in Thai at the XXL size, increasing capacity from 1 to 10 prompt tokens improves summarization quality (SP-ROUGE +4.8) despite a drop in language accuracy (LIDTH −8.0), while increasing capacity further to 100 tokens hurts both metrics.

Significant headroom remains: The supervised baselines in Figure 4 highlight that significant headroom remains on this XGEN task. When tuning the XXL model directly on supervised training data in all languages, SP-ROUGE scores are between +5.8 (VI) and +12.8 points (TH) higher than our highest zero-shot results. We also note that for some languages, like Thai, the supervised baseline greatly exceeds any approach using machine translation. This highlights that machine translation quality is still low in some languages, so pursuing stronger zero-shot solutions is worthwhile.

Mitigating catastrophic forgetting
We have seen that increasing model scale and decreasing tunable parameter capacity are both effective in improving XGEN performance. Can we obtain further gains by devising methods that explicitly tackle catastrophic forgetting? Here, we investigate two approaches: mixing unlabeled training data with English supervised data, and factorizing the learned prompts into composable language and task modules. We show that both methods can provide substantially better results when there is severe catastrophic forgetting. Below, we describe each method and analyze our findings in detail.

Methods
Mixing in unlabeled training data: This approach involves multi-task learning by mixing an unsupervised training task (UNSUP) into the WIKILINGUA-0 data. Mixing is controlled by a mixing rate κ, resulting in a final mixture that is κ% UNSUP data and (100 − κ)% WIKILINGUA-0. As a data augmentation scheme, this method can be applied in all settings. We use the span corruption pretraining objective from T5 (Raffel et al., 2020) with mC4 data. We create separate multilingual datasets for each target language (MIX-UNSUP) as well as a single multilingual dataset that includes all of the WIKILINGUA-0 languages (MIX-UNSUP-ALL). Our goal is to encourage the model not to forget other languages while training on English summarization. In our experiments, we use κ = 1, which performed best among {1, 5, 10, 30, 50} in preliminary experiments; we conjecture that a value of κ > 1 would prevent the model from focusing on the main task of summarization as more unsupervised data is added.

An alternative approach is to perform model or prompt tuning on an intermediate task before tuning on WIKILINGUA-0. This intermediate tuning approach has been used to boost performance on English tasks for both MODELTUNING (Phang et al., 2019; Vu et al., 2020) and PROMPTTUNING (Vu et al., 2022), and has been successfully applied to the zero-shot cross-lingual transfer setting (Phang et al., 2020; Maurya et al., 2021) for MODELTUNING.
In Appendix F, we show that intermediate tuning does not give reliable gains for XGEN.
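As an illustration of the mixing scheme, here is a minimal sketch that interleaves unsupervised span-corruption examples into the WIKILINGUA-0 stream at rate κ; the function and argument names and the simple probabilistic sampler are our own assumptions, not the exact data pipeline (which is built with T5X mixtures).

```python
import random
from typing import Dict, Iterator


def mix_datasets(
    wikilingua_en: Iterator[Dict[str, str]],       # English summarization examples
    unsup_multilingual: Iterator[Dict[str, str]],  # span-corruption examples (mC4)
    kappa: float = 1.0,                            # mixing rate, in percent
    seed: int = 0,
) -> Iterator[Dict[str, str]]:
    """Yield a stream that is ~kappa% UNSUP data and (100 - kappa)% WIKILINGUA-0."""
    rng = random.Random(seed)
    while True:
        if rng.random() < kappa / 100.0:
            yield next(unsup_multilingual)
        else:
            yield next(wikilingua_en)

# Usage sketch: build the MIX-UNSUP stream for a single target language (e.g., Thai)
# by pairing English supervised data with Thai span-corruption data and kappa = 1.
```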
Factorized prompts: Inspired by the MAD-X (Pfeiffer et al., 2020) adapter-based framework, which learns modular language and task representations to adapt a multilingual model to arbitrary tasks and languages, we propose a novel method, dubbed "factorized prompts" (FP), specifically designed for PROMPTTUNING. We attempt to decompose a soft prompt into "task" and "language" components that can be recombined in novel pairings (see Figure 5), with the goal of learning soft prompts that consist of disentangled and interpretable components. Unlike MAD-X, which learns language and task adapters separately for each language and each task, we learn language and task sub-prompts jointly for all languages and tasks. While we do not actively incentivize disentanglement, our multi-task multilingual pretraining procedure encourages general language knowledge and task-specific knowledge to be stored in separate regions of the prompt. Intuitively, we vary languages while keeping the task sub-prompt fixed to train one side of the prompt, and vary tasks while keeping the language sub-prompt fixed to learn the other side. We use mC4 data for all 18 WIKILINGUA-0 languages to create 7 unsupervised tasks per language. We randomly initialize language and task sub-prompts, each 50 tokens long. For each training example in our multi-task multilingual mixture, the relevant task and language sub-prompts are concatenated to form a full 100-token prompt. This training yields a set of learned language and task sub-prompts. Next, we train a new task sub-prompt on WIKILINGUA-0 English summarization while using a frozen copy of the English language sub-prompt. Finally, when performing inference in another language, we replace the English sub-prompt with the target language sub-prompt, while continuing to use the learned summarization sub-prompt. To ablate the impact of the target language sub-prompt, we also report the performance using the English sub-prompt for all languages (FP-EN).
We use 7 unsupervised tasks per language, including: the PREFIX LM, SPAN CORRUPTION, and I.I.D. DENOISING tasks described in Raffel et al. (2020); LM, the causal left-to-right LM task with no context provided, i.e., the encoder's input is empty; MISSING PREFIX PREDICTION, predicting a missing prefix from the input; N-TOKEN PREFIX PREDICTION, copying the first n tokens of the input; and MISSING N-TOKEN PREFIX PREDICTION, predicting the missing n-token prefix of the input. When training on WIKILINGUA-0, we initialize the task sub-prompt with the learned SPAN CORRUPTION task sub-prompt.
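To make the mechanics concrete, the following is a minimal numpy sketch of how language and task sub-prompts could be concatenated during training and swapped at inference; the embedding width, language/task names, and helper function are illustrative assumptions rather than our actual T5X implementation.

```python
import numpy as np

EMBED_DIM = 512        # assumed embedding width
SUB_PROMPT_LEN = 50    # 50 language tokens + 50 task tokens = 100-token prompt

# One learnable sub-prompt per language and per task, trained jointly on the
# multi-task multilingual mixture (random stand-ins here).
rng = np.random.default_rng(0)
language_prompts = {lang: rng.normal(size=(SUB_PROMPT_LEN, EMBED_DIM))
                    for lang in ["en", "fr", "vi", "ru", "th"]}
task_prompts = {task: rng.normal(size=(SUB_PROMPT_LEN, EMBED_DIM))
                for task in ["span_corruption", "prefix_lm", "summarization"]}


def build_prompt(lang: str, task: str) -> np.ndarray:
    """Concatenate [language sub-prompt; task sub-prompt] into a 100-token prompt
    that is prepended to the frozen model's input embeddings."""
    return np.concatenate([language_prompts[lang], task_prompts[task]], axis=0)


# Training on WIKILINGUA-0: frozen English language sub-prompt plus a newly
# learned summarization task sub-prompt.
train_prompt = build_prompt("en", "summarization")

# Zero-shot inference in Thai: swap in the Thai language sub-prompt while
# keeping the learned summarization sub-prompt unchanged.
thai_inference_prompt = build_prompt("th", "summarization")
```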
To confirm that language-specific prompts trained in this way encode meaningful differences between languages, we visualize a clustered heatmap of the cosine similarities between prompts trained on a classic LM task for each language in mC4. We observe meaningful clusters reflecting both linguistic and geographical similarities across languages. See Appendix D for details.

Results and Discussion
Mixing in multilingual data prevents catastrophic forgetting: In Figure 6, we observe that mixing in unsupervised multilingual data helps prevent catastrophic forgetting in all conditions, increasing the likelihood of predicting text in the target language. With MODELTUNING, this improved language accuracy reliably translates into higher end-task performance (SP-ROUGE). For PROMPTTUNING, mixing provides gains for non-Latin script languages (RU and TH), where catastrophic forgetting is more severe; for Latin-script languages (FR and VI), mixing harms overall summarization quality, despite achieving higher language accuracy.
Mixing in multilingual data from all WIKILINGUA languages leads to similar results, with a marginal drop in performance. Thus, if the desired target language is known ahead of time, the simpler strategy of mixing in just that language should be preferred. However, in cases where the inference language is unknown, mixing many languages is also effective.
Factorized prompts are helpful for overcoming severe catastrophic forgetting: Factorized prompts are successful at improving target language accuracy in all conditions. However, this does not always translate to higher SP-ROUGE. When language accuracy is already relatively high (for Latin-script languages, and for XXL models), factorized prompts are not helpful. However, in settings where vanilla PROMPTTUNING shows the most severe forgetting (e.g., at BASE size, on non-Latin script languages), factorized prompts provide large gains, similar to or exceeding our mixing approach.

Qualitative Analysis
To better understand qualitative differences between the solutions reached by MODELTUNING and PROMPTTUNING, two authors who were native speakers of Vietnamese and Hindi inspected 50 predictions of each method at the XXL model size.
For both languages, we observed that the MODELTUNING predictions were much more likely to include "code-switching", alternating between English and the target language, sometimes several times within a single sentence, as seen in Table 1. By comparison, the PROMPTTUNING predictions were more likely to use a consistent language throughout, typically staying entirely within the target language but for some predictions resorting entirely to English. For both methods and both languages, we found code-switching predictions to generally be well-formed, in the sense that a bilingual speaker could extract the intended meaning and that it served as a reasonable summary. For Hindi, the PROMPTTUNING method showed lower mean SP-ROUGE scores than MODELTUNING (17.9 vs. 23.1), and had higher variance across runs (std: 5.1 vs. 0.7). Manual inspection showed that the lower-scoring PROMPTTUNING runs had far more predictions that were entirely English, explaining the lower SP-ROUGE scores.
For Vietnamese, PROMPTTUNING achieved higher SP-ROUGE than MODELTUNING (38.0 vs. 34.0), with low variance in both cases (std: ≤ 0.5). On inspection, we found that most PROMPTTUNING predictions were entirely in Vietnamese, whereas MODELTUNING predictions typically contained at least some English. The PROMPTTUNING summaries tended to be shorter, but were often judged to be as good as or better than the ground-truth summaries. The MODELTUNING summaries tended to be a bit longer; when mentally translating any English words back to Vietnamese, their quality was judged to be similar to the PROMPTTUNING summaries, suggesting that the lower SP-ROUGE score is primarily due to the presence of intervening English.

Related Work
Mixing unlabeled multilingual data into fine-tuning can be viewed as a form of rehearsal (Robins, 1995), commonly used to mitigate catastrophic forgetting; related work has used this kind of mixing as well (Xue et al., 2021; Shakeri et al., 2021). Previous work has also explored intermediate adaptation of pre-trained models, which has been shown to be effective for MODELTUNING (Howard and Ruder, 2018; Phang et al., 2019; Vu et al., 2020, 2021) and PROMPTTUNING (Vu et al., 2022). Phang et al. (2020) apply intermediate adaptation in the multilingual domain, but use English in the adaptation instead of the target language. Maurya et al. (2021) use a cross-lingual intermediate task; unlike our task, theirs is designed to closely match the downstream task. Several works use intermediate adaptation to create a model that is better in the zero- or few-shot settings (Wei et al., 2022; Sanh et al., 2022; Min et al., 2022), but these target generalization to new tasks, whereas we focus on generalizing to new languages within one task.
Other work explores cross-lingual transfer learning with parameter-efficient methods. Zhao and Schütze (2021) find that soft prompts can effectively be used in cross-lingual settings, but their work is constrained to classification. Pfeiffer et al. (2020) use adapters rather than prompts and leverage parameter-efficient learning to create separate language and task understanding modules that can be combined at inference time.
There has been recent interest in cross-lingual generation. Maurya et al. (2021) and Chi et al. (2020) evaluate their methods using cross-lingual generation, including summarization as we do. However, Chi et al. (2020) use parallel data during pre-training to "align" representations across languages, while our approach does not.

Conclusion
In this work, we explored how different adaptation methods fare on the challenging "XGEN" task of zero-shot cross-lingual summarization. While many methods struggled with catastrophic forgetting (outputting English rather than the target language), we observed that two factors helped to mitigate this problem: (1) increasing model scale, and (2) decreasing the number of parameters tuned during adaptation. When all of a model's weights are tuned on English (MODELTUNING), forgetting is quick and severe. By contrast, limiting the tunable parameters to a smaller soft prompt (PROMPTTUNING) helps to combat forgetting, though prompt size is an important variable to control.
To further close the gap with supervised methods, we explored two adaptation techniques: one entirely novel, and one that has been used before, but not in combination with parameter-efficient methods like PROMPTTUNING. We find that mixing in unsupervised multilingual data is always helpful. Our novel approach, "factorized prompts", is helpful at smaller model sizes, but has no benefit at larger sizes. We hope that future work will continue to explore XGEN tasks, including WIKILINGUA-0, and develop stronger zero-shot adaptation techniques that allow multilingual models to reliably generate coherent text in any target language.

Limitations
Our work focuses on a single XGEN task, WIKILINGUA-0 summarization. In future work, it would be valuable to see if our findings generalize to additional domains and tasks, including those beyond summarization.
WIKILINGUA-0 is not a traditional summarization task. Rather than news articles, the input documents are how-to guides, and the summaries are "headings" for each step, which may be more terse than a traditional summary. We observed some minor data quality issues in WIKILINGUA-0, including HTML code present in some target strings, and artifacts of machine translation evident in some non-English documents. Nevertheless, we believe that WIKILINGUA-0 is a meaningful and challenging XGEN task, with the notable advantage of covering a range of high- and low-resource languages from diverse language families and with diverse scripts.
In evaluating parameter-efficient methods, we focused on PROMPTTUNING due to its simplicity. There are a growing number of other parameter-efficient methods that could also be tested, including ADAPTERS (Rebuffi et al., 2017; Houlsby et al., 2019), BITFIT (Zaken et al., 2021), PREFIX-TUNING (Li and Liang, 2021), (IA)3 (Liu et al., 2022), and many more; see Liu et al. (2021), He et al. (2022), and Liu et al. (2022) for detailed discussion of the differences between these methods. We expect many of the benefits of tuning fewer parameters to persist across methods, but this remains to be explored.

B Measuring the correlation between SP-RG and human judgments
To evaluate how well our proposed SP-ROUGE metric correlates with human judgments, we use the MULTISUMM EVAL dataset introduced by Koto et al. (2021), a manually-annotated multilingual resource for summarization evaluation with 4,320 human annotations of FOCUS (precision) and COVERAGE (recall) between machine-generated summaries and ground-truth summaries. We compare SP-ROUGE to BLEURT (Sellam et al., 2020), a learned evaluation metric based on BERT (Devlin et al., 2019). Table 9 shows the Pearson correlation coefficient between these metrics and human judgments across 8 MULTISUMM EVAL languages: German (DE), English (EN), Spanish (ES), French (FR), Indonesian (ID), Russian (RU), Turkish (TR), and Mandarin Chinese (ZH). Overall, we found that the performance of SP-ROUGE and the more computationally expensive BLEURT metric were similar. Specifically, SP-ROUGE achieved an average FOCUS correlation of 0.68 and an average COVERAGE correlation of 0.65, whereas BLEURT achieved 0.68 and 0.70, respectively. Figure 7 demonstrates the linear relationship between SP-ROUGE-LSUM and FOCUS scores on French.
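For reference, the per-language correlation can be computed directly with scipy; the score lists below are placeholder values standing in for per-summary SP-ROUGE scores and human FOCUS annotations.

```python
from scipy.stats import pearsonr

# Parallel lists of per-summary metric values and human FOCUS annotations
# for one language (placeholder values).
sp_rouge_scores = [0.31, 0.45, 0.52, 0.28]
human_focus_scores = [2.0, 3.5, 4.0, 2.5]

r, p_value = pearsonr(sp_rouge_scores, human_focus_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```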
C Zero-shot evaluation results on WIKILINGUA-0

Our zero-shot evaluation results on WIKILINGUA-0 for French (FR), Vietnamese (VI), Russian (RU), and Thai (TH) are shown in Table 10. See Table 8 for results across all target languages. Our results suggest that WIKILINGUA-0 is a challenging task for both MODELTUNING and PROMPTTUNING. As model size increases, PROMPTTUNING usually produces better results than MODELTUNING when there is a significant language shift at inference time. Longer prompts help to better learn the English summarization task; however, the increased capacity leads the model to forget other languages.

D Language-Specific Prompt Clustering Analysis
To confirm that language-specific prompts trained on an LM task encode meaningful differences between languages, we train 107 prompts, one for each language in the mC4 corpus. Specifically, we train prompts for the mT5-BASE model, with a prompt length of 1, for 10K training steps, using a batch size of 32. The training task consists of classic causal language modeling, with an empty string fed as input to the encoder, and the document text passed as the target. Each prompt is trained exclusively on data from a single language bucket; however, we note that mC4 contains a non-trivial number of language ID errors, particularly for lower-resource languages (Kreutzer et al., 2022).
Figure 8 shows a clustered heatmap of the cosine similarities between the trained prompts. We observe a number of interpretable clusters that give us confidence that the learned prompts encode meaningful language representations. For example, the leftmost 25 languages form a visible cluster and are nearly all languages of Europe, with meaningful sub-clusters for different European regions: Northern (NO, SV, DA, NL), Central (CS, PL, SK, LT, SL), South-Western (ES, PT, FR, IT), and Eastern (KK, AZ, TR, BG, MK, BE, UK). Another prominent cluster covers languages of India, Pakistan, and Nepal (ML, TE, NE, KA, KN, GU, HI, SI, BN, TA), despite the fact that these languages cover different linguistic families and are written with different scripts. While geography seems to be the primary factor influencing prompt similarity, linguistic relationships also play a role. For instance, we observe that Finnish (FI) and Hungarian (HU), both Finno-Ugric languages, form a cluster despite their geographic distance. Similarly, Igbo (IG), spoken mainly in Nigeria, is clustered near Haitian Creole (HT), whose grammar derives from Igbo.
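A minimal sketch of this analysis, assuming the 107 trained prompt vectors and their language codes have been exported to numpy files (the file names are placeholders):

```python
import numpy as np
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

# prompts: array of shape (107, embed_dim), one trained LM prompt per mC4 language;
# lang_codes: the corresponding language codes. Both file names are placeholders.
prompts = np.load("lm_prompts.npy")
lang_codes = np.load("lang_codes.npy", allow_pickle=True)

# Pairwise cosine similarities between language prompts.
sim = cosine_similarity(prompts)

# Clustered heatmap: hierarchical clustering reorders rows/columns so that
# similar languages (e.g., European or Indic groups) appear as visible blocks.
g = sns.clustermap(sim, xticklabels=lang_codes, yticklabels=lang_codes, cmap="viridis")
g.savefig("prompt_similarity_clustermap.png")
```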

E Mitigating catastrophic forgetting
Table 11 shows our experimental results for the different approaches described in §3.1. As can be seen, mixing in unlabeled multilingual data (MIX-UNSUP/MIX-UNSUP-ALL) helps prevent catastrophic forgetting for MODELTUNING. Intermediate tuning (IT-GIGAWORD/IT-LM) does not result in reliable gains. Finally, factorized prompts (FP-EN/FP) lead to an improvement in target language accuracy, and an improvement in SP-RG in cases where vanilla PROMPTTUNING shows the worst performance.

F Intermediate tuning
As an adaptation step, we perform model or prompt tuning on an intermediate task before training on WIKILINGUA-0. Intermediate tuning has been used to boost performance on English tasks for both MODELTUNING (Phang et al., 2019; Vu et al., 2020) and PROMPTTUNING (Vu et al., 2022), and has been successfully applied to the zero-shot cross-lingual transfer setting (Phang et al., 2020; Maurya et al., 2021) for MODELTUNING. Maurya et al. (2021) show that intermediate tuning on an auxiliary unsupervised task from the target language is helpful in conjunction with freezing some model components for MODELTUNING. Previous work has used an auxiliary task designed to be close to the main task, while we simply use mC4 data. For each target language, we create a causal, left-to-right LM task by providing no context, i.e., the encoder's input is empty (IT-LM). To further explore the effect of continued training on English data, we include an additional experiment where the GIGAWORD (Graff et al., 2003) summarization dataset is used as the intermediate task (IT-GIGAWORD). We found that additional tuning was helpful for intermediate tuning on large datasets; as such, we performed 200,000 steps of tuning on the intermediate task and selected the best prompt checkpoint based on validation performance on that task.

Intermediate tuning does not give reliable gains: As can be seen in Table 11, intermediate tuning on English summarization (IT-GIGAWORD) improves English performance, but generally hurts XGEN capabilities. For MODELTUNING, it exacerbates catastrophic forgetting and harms overall performance across all model sizes. For PROMPTTUNING, English intermediate tuning provides small gains at BASE size, but is harmful at XXL size. Intermediate tuning on an LM task in the target language (IT-LM) has a neutral or negative effect in most cases, running somewhat counter to the findings of Maurya et al. (2021). Compared to directly mixing in unlabeled multilingual data, intermediate tuning has little benefit on language accuracy. This smaller effect is to be expected, given that the final stage of English-only training still provides ample opportunity to overfit on English and catastrophically forget other languages.

Example WIKILINGUA article/summary pair. English article: Mask the noise in your ears by turning on background music or other sounds. You can use tapes or CDs with "white noise" of the ocean, … English summary: Use calming background sound to drown out the noise. Listen to soothing sounds as you fall asleep … The corresponding Thai article and Thai summary (roughly, "Use calming ambient sound. Listen to soothing sounds as you fall asleep.") are shown alongside.

Figure 2: (a) Zero-shot XGEN summarization quality (SP-RG) and (b) target language accuracy (LIDXX) of PROMPTTUNING and MODELTUNING models across five model sizes and four target languages: French (FR), Vietnamese (VI), Russian (RU), and Thai (TH). English (EN) performance is provided as a point of comparison, but is no longer a zero-shot task. (c) The effect of prompt length on PROMPTTUNING performance at BASE and XXL model sizes.

Figure 4: SP-ROUGE scores of our baselines (LEAD-64, PROMPTTUNING, MODELTUNING) at the XXL model size, in the zero-shot XGEN setting. For comparison, we also show the headroom available if a machine translation system is used (TRANS-TRAIN, TRANS-TEST), or if gold data in target languages is used (SUP, SUP-ALL).

Figure 6: SP-ROUGE (top) and language accuracy (bottom) performance at BASE and XXL sizes of our proposed approaches: mixing unsupervised data (MIX), and factorized prompts (FP). See Appendix E for full results.

Table 2: Best validation accuracy per language on XNLI.

Table 3: Best validation F1 per language on XQUAD.

Table 4: Best validation F1 per language on MLQA.

Table 5: Best validation F1 per language on TYDIQA.

Table 6: Best validation accuracy per language on PAWS-X.

Table 9: SP-ROUGE correlates well with human judgments, providing a similar correlation to BLEURT while being significantly less computationally expensive.

Table 10: Summarization quality (SP-ROUGE) and language identification confidence scores (LID) across model sizes and methods (numbers in the subscript indicate the standard deviation across 3 random seeds). Our results suggest that WIKILINGUA-0 is a challenging task for both MODELTUNING and PROMPTTUNING. As model size increases, PROMPTTUNING usually produces better results than MODELTUNING when there is a significant language shift at inference time. Longer prompts help to better learn the English summarization task; however, the increased capacity leads the model to forget other languages.

Table 11: Summarization quality (SP-ROUGE) and language identification confidence scores (LID) across two model sizes (BASE and XXL) and methods (numbers in the subscript indicate the standard deviation across 3 random seeds). Mixing in unlabeled multilingual data (MIX-UNSUP/MIX-UNSUP-ALL) helps prevent catastrophic forgetting for MODELTUNING. Intermediate tuning (IT-GIGAWORD/IT-LM) does not result in reliable gains. Factorized prompts (FP-EN/FP) lead to an improvement in target language accuracy, and an improvement in SP-ROUGE in cases where vanilla PROMPTTUNING shows the worst performance.