On the Role of Parallel Data in Cross-lingual Transfer Learning

While prior work has established that the use of parallel data is conducive to cross-lingual learning, it is unclear whether the improvements come from the data itself or from the modeling of parallel interactions. To explore this, we examine the use of unsupervised machine translation to generate synthetic parallel data, and compare it to supervised machine translation and gold parallel data. We find that even model-generated parallel data can be useful for downstream tasks, in both a general setting (continued pretraining) and a task-specific setting (translate-train), although our best results are still obtained using real parallel data. Our findings suggest that existing multilingual models do not exploit the full potential of monolingual data, and prompt the community to reconsider the traditional categorization of cross-lingual learning approaches.


Introduction
Multilingual models have been shown to generalize across languages in a zero-shot fashion (Pires et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020; Kale et al., 2021). These models are usually pretrained on concatenated monolingual corpora in multiple languages using some form of language modeling or denoising objective. The models are then finetuned using labeled downstream data in the source language (typically English), which makes them capable of generalizing to the target language thanks to the aligned representations learned at pretraining.
While this paradigm does not require any data in the target language other than the monolingual pretraining corpus, prior work has reported improved results by incorporating parallel data into the pipeline, either at pretraining or finetuning time. During pretraining, parallel data has been incorporated through an auxiliary objective, such as Translation Language Modeling (TLM) in XLM (Conneau and Lample, 2019) or bitext denoising in PARADISE (Reid and Artetxe, 2022). Regarding finetuning, it is common to use Machine Translation (MT), which is itself trained on parallel data under the hood, to translate the downstream training data into the target language(s) (Conneau et al., 2020), which can be seen as a form of data augmentation.
Nevertheless, it is still unclear why parallel data is beneficial for cross-lingual transfer learning. Is it the data itself that matters, given the additional information that it contains? Or is it the explicit modeling of parallel interactions that is important? To answer this question, we systematically compare the use of parallel data from different sources: ground-truth parallel data, or synthetic parallel data generated by either supervised MT, unsupervised MT, or word-by-word translation. Most notably, our unsupervised MT variant relies on the exact same monolingual corpus used to pretrain the model, so any potential improvement can only come from the modeling side.
Our experiments on Natural Language Inference (NLI), Question Answering (QA) and Named Entity Recognition (NER) show that the explicit modeling of parallel interactions is indeed beneficial, demonstrating that existing pretraining and finetuning methods do not exploit the full potential of monolingual data. However, our best results are obtained using real parallel data, either directly or indirectly through supervised MT, showing that there is also some inherent value in it.
In light of these results, we argue that the traditional categorization of cross-lingual transfer approaches into zero-shot, translate-train and translate-test (Figure 1a) falls short of capturing the detail required for a fair comparison across different approaches. Given this, we encourage further research on understanding what the contribution of monolingual and parallel data is, and how to best leverage them (directly or indirectly through MT, and at different parts of the pipeline), which requires thinking beyond the boundaries of the existing categorization (Figure 1b).
Our finetuning incorporation experiments in §3.2 involve machine translating the training data into the target languages. For XNLI, we just translate the premise and hypothesis and leave the label unchanged. For XQuAD and WikiANN, which have token-level labels (as opposed to sequence-level), we translate the input text and project the answer spans using the awesome-align word aligner (Dou and Neubig, 2021), taking the aligned spans as the target labels.
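As a rough illustration of the span projection step, the sketch below maps a source token span onto the translated sentence given word alignments. The text does not specify the projection rule, so the heuristic used here (take the smallest contiguous target span covering all aligned tokens) is an assumption, as are the function name `project_span` and the toy alignment; a real pipeline would obtain the alignment pairs from the word aligner.

```python
from typing import List, Optional, Tuple

# Each pair is (source_token_index, target_token_index), as a word aligner might output.
Alignment = List[Tuple[int, int]]


def project_span(
    src_span: Tuple[int, int], alignment: Alignment
) -> Optional[Tuple[int, int]]:
    """Project an inclusive source token span onto the target sentence.

    Assumed heuristic (not taken from the paper): return the smallest
    contiguous target span covering every target token aligned to a source
    token inside the span, or None if no token in the span is aligned.
    """
    start, end = src_span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None
    return min(targets), max(targets)


# Toy example with reordering: "the red house" -> "la casa roja"
alignment = [(0, 0), (1, 2), (2, 1)]
projected = project_span((1, 2), alignment)  # span covering "red house"
```

Spans for which nothing is aligned return `None`; whether such examples are dropped or handled otherwise is a design decision the text leaves open.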

Model
We use XLM-R base (Conneau et al., 2020) for all of our experiments, which was trained through Masked Language Modeling (MLM) on CC-100 (a monolingual corpus covering 100 languages). For finetuning, we experiment with learning rates of 1e-5, 5e-5, and 1e-4 using the Adam optimizer. We train for up to 10 epochs and choose the checkpoint with the best validation performance averaged across the languages in consideration.
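The checkpoint selection rule can be sketched as follows. This is a minimal illustration, assuming one checkpoint per epoch and a single validation score per language; the helper name `select_checkpoint` and all numbers are hypothetical and are not results from the paper.

```python
from statistics import mean
from typing import Dict


def select_checkpoint(val_scores: Dict[int, Dict[str, float]]) -> int:
    """Return the epoch whose validation score, averaged over languages, is highest."""
    return max(val_scores, key=lambda epoch: mean(val_scores[epoch].values()))


# Hypothetical validation scores per epoch and language (illustrative only).
val_scores = {
    1: {"de": 70.0, "sw": 60.0},  # average 65.0
    2: {"de": 74.0, "sw": 62.0},  # average 68.0
    3: {"de": 75.0, "sw": 59.0},  # average 67.0
}
best_epoch = select_checkpoint(val_scores)
```

Averaging across languages means a checkpoint that is best for one language (epoch 3 for "de" in the toy data) is not necessarily the one selected.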

Parallel data sources
We compare the following sources of parallel data in our experiments:
Gold. Ground-truth parallel data generated by humans. We use the same parallel data as Reid and Artetxe (2022), which combines data from IWSLT, WMT, and other parallel corpora.
Supervised MT. Synthetic parallel data generated through a conventional MT system. The MT system is supervised, so this approach is also leveraging ground-truth parallel data indirectly. We use the 420M M2M-100 model (Fan et al., 2020).
Unsupervised MT. Synthetic parallel data generated through an unsupervised MT system (Artetxe et al., 2018; Conneau and Lample, 2019). The MT system is trained on a subset of the monolingual data used for pretraining, so this approach does not use any additional data, either directly or indirectly, other than the synthetically generated one. More concretely, we use XLM-R base to initialize our unsupervised MT model, and finetune it in 16 directions (en↔{ar,de,hi,fr,sw,ru,th,vi}) using the iterative denoising autoencoding and backtranslation approach proposed by Conneau and Lample (2019). We train for a total of 750k iterations using a batch size of 128k tokens. We use 200MB of text from CC-100 for each language, amounting to a total of 1.8GB of training data.
Dictionary. Synthetic parallel data generated through random word replacement with a dictionary. We use the same dictionaries as Reid and Artetxe (2022), which combine dictionaries from MUSE (Lample et al., 2018) and those extracted using word aligners (Östling and Tiedemann, 2016). Following Reid and Artetxe (2022), we replace words that are included in our dictionary with a probability of 0.4.
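The dictionary variant above can be illustrated with a small sketch. It assumes whitespace-tokenized input, lowercase dictionary keys, and a uniform choice among candidate translations; none of these details are stated in the text, and the function name `dictionary_translate` and the toy dictionary are hypothetical.

```python
import random
from typing import Dict, List, Optional


def dictionary_translate(
    tokens: List[str],
    dictionary: Dict[str, List[str]],
    p: float = 0.4,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Replace each token found in the dictionary with probability p.

    Tokens missing from the dictionary are always kept unchanged. Picking
    uniformly among candidate translations is an assumption of this sketch.
    """
    if rng is None:
        rng = random.Random()
    output = []
    for token in tokens:
        candidates = dictionary.get(token.lower())
        if candidates and rng.random() < p:
            output.append(rng.choice(candidates))
        else:
            output.append(token)
    return output


# Toy English-to-Spanish dictionary (illustrative only).
toy_dictionary = {"house": ["casa"], "red": ["roja", "rojo"]}
sentence = ["the", "house", "is", "red"]
mixed = dictionary_translate(sentence, toy_dictionary, p=0.4, rng=random.Random(0))
```

The output keeps the source word order and length, so the resulting "translation" is a code-mixed sentence rather than a fluent one.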

Experiments and results

Pretraining incorporation
In these experiments, we incorporate parallel data into the pretraining process. We take XLM-R as our starting point, which was trained on monolingual data through MLM, and continue pretraining it on both MLM and TLM for 70k steps, using a batch size of 64k tokens. We use a learning rate of 5e-5 with a linear warmup and cosine decay schedule. We use the MLM objective 70% of the time, and the TLM objective 30% of the time. The latter applies the same masking objective over concatenated parallel sentences, and we compare different sources of parallel data as detailed in §2.3. For parallel data generated through MT, we translate a random subset of CC-100 (keeping it consistent with the data used in pretraining). The model is then finetuned on the downstream tasks using the original training data in English, and zero-shot transferred to the target languages.
We report our results in Table 1. We find that all variants incorporating parallel data outperform the original XLM-R model (see footnote 2), and the improvements are consistent across all target languages. However, different from Reid and Artetxe (2022), we do not find any clear improvements on English. Regarding the source of parallel data, we find that supervised MT performs on par with gold data, even for less-resourced languages for which MT tends to suffer. Unsupervised MT lags behind them, but consistently outperforms the baseline.
These results suggest that the mere facilitation of parallel interactions is helpful even when not using any new data, but incorporating ground-truth parallel data brings further improvements. However, the way in which parallel data is incorporated (either directly or through MT) does not have any clear impact, as evidenced by the similar performance of supervised MT and gold.
Footnote 2: The skeptical reader might attribute this improvement to the additional training steps we perform, irrespective of the use of parallel data. However, we find strong evidence that the improvements are brought by the use of parallel data given that (i) XLM-R was trained until convergence using a huge amount of compute, and our continued training represents an insignificant fraction on top (96 GPU days, compared to 13k GPU days, or a relative 0.7% further), and (ii) we get improvements in all target languages but not in English, suggesting that the additional steps improve the cross-lingual capabilities of the model but not its general quality.

Finetuning incorporation
In these experiments, we incorporate parallel data into the finetuning process. We translate the downstream training data in English into the rest of the languages, and finetune XLM-R on the combined data in all languages. This is commonly referred to as translate-train-all in the literature.
We report our results in Table 2. Similar to the pretraining incorporation, we find that incorporating parallel data outperforms the baseline in all tasks and target languages for all data sources that we explore. Supervised MT obtains the best results, followed by unsupervised MT and word-by-word translation with dictionaries. Similar to the pretraining incorporation results, this suggests that synthetic parallel data can bring improvements even when generated exclusively through monolingual data, but using real parallel data brings further improvements. Finally, we find that even simplistic ways to incorporate parallel signals can bring improvements, as evidenced by the dictionary replacement results.
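To make the pretraining incorporation setup more concrete, the sketch below shows the two ingredients it describes: choosing between MLM and TLM with a 70/30 split, and building a TLM input by concatenating a sentence pair before masking. Several details are assumptions made for illustration and are not stated in the text: the 15% masking rate, the `</s>` separator and `<mask>` symbol, and the simplification that every selected token is replaced by the mask symbol. All function names are hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK = "<mask>"  # placeholder mask symbol (assumption)
SEP = "</s>"     # placeholder separator between the two sentences (assumption)


def sample_objective(rng: random.Random, tlm_prob: float = 0.3) -> str:
    """Pick the training objective for a step: TLM with probability tlm_prob, else MLM."""
    return "tlm" if rng.random() < tlm_prob else "mlm"


def build_tlm_input(src_tokens: List[str], tgt_tokens: List[str]) -> List[str]:
    """Concatenate a parallel sentence pair into a single TLM input sequence."""
    return src_tokens + [SEP] + tgt_tokens


def mask_tokens(
    tokens: List[str], rng: random.Random, mask_prob: float = 0.15
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask each non-separator token with probability mask_prob.

    Returns the masked sequence and per-position labels (the original token
    where a mask was applied, None elsewhere). The 15% rate is an assumption.
    """
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if token != SEP and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


rng = random.Random(0)
objective = sample_objective(rng)
pair = build_tlm_input(["the", "house"], ["la", "casa"])
masked_pair, pair_labels = mask_tokens(pair, rng)
```

Because both sentences share one sequence, a masked token on one side can be predicted from its translation on the other side, which is the parallel interaction that the synthetic data sources are meant to provide.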

Discussion
While prior work has reported strong results from incorporating parallel data for cross-lingual transfer learning, our results show that this improvement can partly, but not exclusively, be attributed to the explicit use of a parallel training signal, which can also be achieved through unsupervised MT without the need for any real parallel data. In fact, we find that the facilitation of parallel interactions is more important than the use of real parallel data in all tasks but XQuAD, where the latter has a larger impact. Despite the popularity of multilingual pretrained models, which predominantly rely on monolingual data both for pretraining and finetuning, this calls into question the extent to which existing approaches are able to exploit the full potential of such monolingual data. In addition, it is striking that we obtain similar results for both pretraining and finetuning incorporation, as well as for supervised MT and gold-standard parallel data. While further evidence is necessary to draw a more definitive conclusion, this suggests that parallel data brings similar improvements regardless of when (pretraining vs. finetuning) and how (directly vs. indirectly through MT) it is incorporated.

Reconsidering the categorization of cross-lingual learning approaches
As illustrated in Figure 1b, there are different data types that one can use (monolingual source corpora, monolingual target corpora and parallel corpora, in addition to downstream data), which can be incorporated at different stages of the pipeline (pretraining, finetuning, testing) and via different procedures (directly or indirectly through MT). We argue that research in cross-lingual learning should aim to understand how the variants in each dimension, as well as the interactions between them, impact downstream performance, which can require thinking beyond the boundaries of the three conventional categories. For instance, our variant using unsupervised MT to translate the downstream training data would fall within the definition of translate-train. However, this approach is more comparable to zero-shot in that it only uses monolingual data, and it would be unfair to compare it to conventional translate-train systems that rely on parallel data to train the MT system.

Related work
Prior work has explored the extent to which monolingual pretraining relies on knowledge transfer from unlabeled corpora by using synthetic data (Chiang and Lee, 2020; Krishna et al., 2021) or downstream data (Krishna et al., 2022) instead, and similar ideas have also been explored in computer vision (Kataoka et al., 2020; Asano et al., 2020). However, to the best of our knowledge, we are the first to examine whether cross-lingual learning also relies on knowledge transfer from parallel data. Our use of synthetic parallel corpora is also connected with back-translation, which is widely used in MT (Sennrich et al., 2016). However, conventional MT systems are trained on parallel data, and back-translation is usually motivated as a way to leverage additional (monolingual) data. In contrast, our unsupervised MT variant does not use any additional data compared to regular pretraining.

Conclusions
In this work, we show that even model-generated parallel data can be useful for cross-lingual learning, greatly expanding the possibilities for multilingual models to improve their performance by taking advantage of their own machine translation capabilities. Given this, we advocate for investigating the optimal way to leverage monolingual and/or parallel data for cross-lingual learning, which might require thinking beyond the boundaries of the conventional zero-shot, translate-train and translate-test categories.

Limitations
In this work, we only consider the pretrain-then-finetune paradigm, which assumes that model weights are tuned for adaptation to specific tasks. Once more capable multilingual LLMs are released, future work may also consider few-shot and in-context learning setups to accommodate more recent approaches to adaptation in NLP. Future work may also consider setups relevant to more diverse tasks (e.g. including webtext).

Figure 1: Cross-lingual transfer settings. Monolingual and parallel data can be used at different stages of the pipeline, either directly or indirectly through MT (b), but the traditional categorization falls short of capturing them (a).