How to Translate Your Samples and Choose Your Shots? Analyzing Translate-train & Few-shot Cross-lingual Transfer



Introduction
With the emergence of large-scale multilingual Pretrained Language Models (PLMs) like mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), a significant amount of research has gone into exploring the cross-lingual transfer capabilities of these models, enabling easier adaptation to a task in many different languages. This is achieved through a number of approaches.
Zero-shot cross-lingual transfer has become a research focus, e.g., the XTREME/XTREME-R benchmarks (Hu et al., 2020; Ruder et al., 2021). In this approach, transfer to new languages is done by fine-tuning a multilingual PLM on the task at hand using only an English corpus (the source language) and reporting the performance on multiple target languages. Few-shot cross-lingual transfer was recently shown to give an advantage over zero-shot cross-lingual transfer (Lauscher et al., 2020): fine-tuning the model on a small amount of target-language task data (few-shot) improves the performance, especially for low-resource languages. Translate-train is another common approach to improve the performance. Here, the full training dataset is machine translated to the target language and used for fine-tuning. Relatively good Machine Translation (MT) systems exist for the languages that are usually studied in the few-shot approach and could be used in translate-train.
In the following, we use few-shot to refer to fine-tuning on a small number of samples of high-quality professional manual translation. Translate-train refers to fine-tuning on lower-quality machine translation that can be scaled to a larger number of samples. Although some research has dealt with few-shot cross-lingual transfer and analyzed it (Lauscher et al., 2020; Zhao et al., 2021), no systematic study has compared it to translate-train. Given that both zero-shot and few-shot cross-lingual transfer assume the availability of a large-scale English corpus for source training, we hypothesize that the translate-train approach might have an advantage over few-shot given the scale of data that would be available, even if not at the best quality.
On the other hand, when few-shot cross-lingual transfer is needed for some task and some training samples must therefore be professionally translated, this entails significantly more effort and cost compared to MT. It is then important to find out which samples to manually translate, given the high variance in performance depending on the choice of samples, as shown by Zhao et al. (2021).
We investigate both of those research directions using 3 base models (mBERT-base, XLM-R-base, XLM-R-large) on 3 high-level semantic tasks and datasets: XNLI (Natural Language Inference), PAWS-X (Paraphrase Detection), and XQuAD (Question Answering), spanning 17 diverse languages. We investigate the following research questions: Q1. How does the performance of few-shot cross-lingual transfer compare to that of translate-train? We show that few-shot transfer has a performance advantage over translate-train given the same number of samples, but that this gap shrinks as the number of samples used for translate-train increases, and that using the full large-scale corpus in translate-train results in a clear advantage over few-shot. We show that at a scale of 10x-100x of machine translation to manual translation, quantity trumps quality, and in this case it is recommended to use translate-train if MT is available for the language. Few-shot transfer still has an advantage when less source data is available and it is therefore not possible to benefit from the scale gain of using MT.
Q2. Are there sets of samples that yield better few-shot performance when translated, and how can those sets be identified? We show that, when few-shot transfer is beneficial for the task, there are random sets of samples that perform better across most target languages and across different model initializations. We investigate using the performance on the English version of the samples and on their machine-translated version to choose the best candidates to manually translate and use for few-shot transfer. We show that there is a correlation between the performance of the same set of shots across languages, and that the few-shot samples that perform better on the source language, English, also perform better across languages. A similar observation holds for the MT of the samples. We further show empirically that choosing the sets of samples for few-shot transfer using those heuristics, or a model trained on such features of the samples, results in more bang for your shots.

Related Work
Cross-lingual transfer: The cross-lingual transfer capabilities of multilingual pretrained language models have led to major recent advances, and a growing number of such models have been introduced, e.g., mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), and mT5 (Xue et al., 2021). Cross-lingual transfer is usually exploited in a zero-shot setup, and benchmarks are built based on this assumption, e.g., XTREME/XTREME-R (Hu et al., 2020; Ruder et al., 2021).
Few-shot: There has recently been some focus on few-shot cross-lingual transfer and its analysis. Lauscher et al. (2020) show the effectiveness of few-shot compared to zero-shot cross-lingual transfer, especially for lower-resource and distant languages, where zero-shot is least effective and few-shot gives a large gain. Zhao et al. (2021) analyze few-shot cross-lingual transfer and emphasize that the choice of shots has a significant effect on the performance. Their experiments are conducted at a small scale of around 10 samples. In comparison, we conduct larger-scale few-shot experiments with up to hundreds of samples and focus on choosing the best-performing samples.
Translate-train: This approach is commonly used to boost the performance for a target language using a machine translation of the source corpus (Conneau et al., 2018; Lample and Conneau, 2019; Conneau et al., 2020; Hu et al., 2020). Xue et al. (2021) show that, similar to zero-shot, translate-train performance increases with the scale of the model. No systematic study has tested the effect of the scale of the translated data in comparison with few-shot to understand the interplay of data quality vs. quantity in this context.
Choosing samples: Two related areas are sample selection (Rousseeuw, 1984), which is used for robust training on noisy data (Song et al., 2019), and active learning (Cohn et al., 1994; Krogh and Vedelsby, 1994), which is used to choose the best potential samples to annotate (Siddhant and Lipton, 2018). Both assume access to the actual sample input (with or without labels). In contrast, this work investigates choosing samples while only having access to the source-language sample input/output.

Datasets
We focus on high-level tasks and conduct our experiments on 2 classification tasks and a question answering task (Table 1).

Experiments
Three main models are used: mBERT (base), XLM-R-B (base), and XLM-R-L (large). We report results on XLM-R-B unless specified otherwise, because it strikes a balance between good performance and efficient training. For each task, we fine-tune the model on the source-language (English) corpus for 5 epochs with early stopping on the loss on the English dev set. We then continue fine-tuning the model on the target language in either a few-shot or a translate-train setup, as explained in the following sections. Training details are in Appendix A.
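To make the pipeline concrete, the following is a minimal sketch of the two-stage fine-tuning using Hugging Face transformers/datasets. Hyperparameters follow those stated here and in Appendix A; the output directories, early-stopping patience, and the stand-in target data are illustrative assumptions, and API details may vary across library versions.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

def encode(batch):
    # XNLI pairs a premise with a hypothesis; max length 128 as in Appendix A.
    return tok(batch["premise"], batch["hypothesis"],
               truncation=True, padding="max_length", max_length=128)

en = load_dataset("xnli", "en").map(encode, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)

# Stage 1: fine-tune on the English corpus for 5 epochs,
# early-stopping on the English dev loss.
source_args = TrainingArguments(
    output_dir="source_model", num_train_epochs=5, learning_rate=3e-5,
    per_device_train_batch_size=32, evaluation_strategy="epoch",
    save_strategy="epoch", load_best_model_at_end=True,
    metric_for_best_model="eval_loss", greater_is_better=False)
trainer = Trainer(model=model, args=source_args,
                  train_dataset=en["train"], eval_dataset=en["validation"],
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=1)])
trainer.train()

# Stage 2: continue fine-tuning on target-language data for one epoch
# (few-shot samples or machine-translated data); a slice of the English
# dev set stands in for real target-language shots here.
target_data = en["validation"].select(range(100))
target_args = TrainingArguments(output_dir="target_model", num_train_epochs=1,
                                learning_rate=3e-5,
                                per_device_train_batch_size=32)
Trainer(model=trainer.model, args=target_args,
        train_dataset=target_data).train()
```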

Few-shot experiments
We use samples from the multilingual dev set as training samples. Few-shot fine-tuning is done as follows: for each language, we separately continue fine-tuning the source model for one epoch on n ∈ {10, 50, 100, 500, 1k} samples from the target-language corpus for the two classification tasks, and on n ∈ {10, 50, 100, 250} for the Question Answering task, given the smaller amount of data available for training in this case. We report the results on the test set for each target language. For each number of samples n, the performance is averaged across 5 different sets of random samples using 5 fine-tuned models with different random initializations, 25 runs in total. This ensures more robust results when measuring the gain over zero-shot, given the high variance across different sets of samples (Zhao et al., 2021) as well as the variance in zero-shot performance across random initializations (Keung et al., 2020). For comparing the performance across shots, we make sure to use the same set of parallel samples across languages (via the sample ids), so that we can compare how a set of samples performs when translated into different languages. This is possible due to our selection of tasks and datasets, which have parallel corpora for the various target languages.
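The evaluation protocol can be sketched as follows; `fewshot_score` is a hypothetical stand-in for the fine-tune-and-evaluate routine (simulated here so the sketch runs end to end), and the dev-set size is illustrative. The translate-train experiments in the next section reuse the same loop with the machine-translated training set and larger values of n.

```python
import numpy as np

rng = np.random.default_rng(0)

def fewshot_score(model_seed, sample_ids, lang):
    """Placeholder: continue fine-tuning the English-trained model
    (initialized with `model_seed`) on the target-language dev samples
    with the given ids, and return test-set accuracy. Simulated here."""
    sim = np.random.default_rng(
        abs(hash((model_seed, tuple(sample_ids), lang))) % 2**32)
    return 0.70 + 0.02 * sim.standard_normal()

DEV_SIZE = 2490  # per-language dev-set size (illustrative)
for n in (10, 50, 100, 500, 1000):
    # The same ids are reused for every language, so each set of shots
    # is a parallel (professionally translated) set across languages.
    sample_sets = [rng.choice(DEV_SIZE, size=n, replace=False)
                   for _ in range(5)]
    scores = [fewshot_score(seed, ids, "sw")
              for seed in range(5) for ids in sample_sets]
    print(f"n={n}: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
```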

Translate-train experiments
We train using MT of the source train set into each target language and adopt a setup similar to few-shot: for each language, we separately continue fine-tuning on n ∈ {10, 50, 100, 500, 1k, 10k, |dataset|} samples from the machine-translated train set and report the results on the test set of the target language. Using the full machine-translated training set helps in further narrowing the gap, as compared to the small scale of few-shot transfer. This results in translate-train having the best performance for all models across all datasets. The highest gain is seen for the model with the highest gap to English (mBERT) for both few-shot and translate-train. For XLM-R on XQuAD, the gain is low and negligible. Given that there is a significant gain for mBERT and the same experimental setup is used for all models, the lack of gain is probably not dataset-specific and possibly occurs only with some models.
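As a sketch of how the translate-train data could be produced, the snippet below machine-translates XNLI-style examples with an off-the-shelf OPUS-MT model; the paper does not specify its MT system, so this model choice is purely illustrative.

```python
from transformers import pipeline

# English-to-German translation model (illustrative choice).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def translate_examples(batch):
    # Translate premise and hypothesis independently; labels carry over.
    return {
        "premise": [t["translation_text"]
                    for t in translator(batch["premise"])],
        "hypothesis": [t["translation_text"]
                       for t in translator(batch["hypothesis"])],
    }

# e.g., mt_train = en["train"].map(translate_examples, batched=True),
# then subsample n examples and fine-tune as in the few-shot sketch.
```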
To see the effect of the available dataset size in each scenario, Figure 1 shows the average performance across languages for few-shot vs. translate-train across a varying number of samples. We can see an advantage of manual over machine translation, resulting in a clear performance gap between the two on XNLI for the same number of samples. This gap increases with the number of samples, as seen at 1k. The availability of manual translation for few-shot is limited, though, and starting from 10k-100k samples, the scale of translate-train has an advantage for all tasks (similar results for the other models are in Appendix Figure 7; the languages' properties are listed in Appendix Table 6). We can see, across all tasks and models, that European languages have a small gain compared to non-European languages, which show the largest gain, e.g., Swahili (Niger-Congo) in XNLI, Korean and Japanese in PAWS-X, and Turkish and Chinese in XQuAD. Those languages also tend to have a larger zero-shot performance gap to English and are more distant from it (the source language). These results are comparable to the few-shot results of Lauscher et al. (2020). The languages with the most gain differ between mBERT and XLM-R, mainly because XLM-R extends the pre-training corpus with CommonCrawl, providing more data from which less-represented languages especially benefit; e.g., Turkish zero-shot performance on XQuAD is low with mBERT compared to the XLM-R models, which results in more gain for Turkish with mBERT on XQuAD (detailed results on XQuAD are in Appendix Figures 14, 15, and 16). Appendix C contains the detailed performance gains of few-shot and translate-train over zero-shot for each language across varying sample sizes. Figure 3 shows the detailed results for XNLI as an example, where we see that once the full machine-translated training set is used, a clear advantage for translate-train appears across almost all languages and in all tasks. We can see that the gain for Urdu (ur) is the highest on XNLI up until 100k samples, when it starts decreasing. We think this might be due to lower-quality MT. The same effect is seen for Thai (th) on XQuAD, with a significant performance degradation when the full training dataset is used (details in Appendix Figure 15). This is also the reason for the degradation and high variance in performance seen at this point in Figure 19b. We investigate whether longer training would have changed the results, especially for few-shot, where longer training on the high-quality manual translations might be beneficial. We split the available set of samples into train/dev and train for 10 epochs with early stopping on dev. Although some languages benefit from this setup, it still yields comparable results, and translate-train retains a clear advantage (results in Appendix Figures 17 and 18).

How to choose your shots? Which samples to translate for few-shot?
Few-shot can still have an advantage over translate-train when the English dataset is not large enough to benefit from the scale effect of translate-train. It can also be necessary when adapting to a target language that has no machine translation system, or no good one. Creating few-shot samples in this case can be done by collecting and labeling new samples or by translating samples from the available English dataset. The latter is a common method: 4 out of the 7 non-retrieval datasets in XTREME use manual professional translation to create samples in the target languages (all of which are high-level semantic tasks). It is therefore beneficial to support selecting the samples with higher performance potential to translate and do few-shot training on.
To emphasize the significance of choosing the samples, we plot in Figure 4 the XNLI performance variance on different shots (using the same model initialization) across 20 sets of random few-shot samples varying in size from 10 to 1k samples. The performance varies, sometimes significantly, depending on the set of samples used. Zhao et al. (2021) base their observations on a smaller number of samples (around 10); we consider a larger size range that is more representative of the data size if a manual translation is conducted. The performance variance across shots decreases as the number of shots increases. This means that choosing the shots to translate is more important when a smaller number of samples is used (similar results on PAWS-X and XQuAD are in Appendix Figure 19, although for XQuAD the variance increases with the size). In the following, we focus mainly on XNLI as the task with the most few-shot gain. We investigate whether there are sets of samples with the potential for better performance across languages, and what could indicate that. For a set of shots, we consider two indicators: the performance of this set in another language, and the performance on the MT of the samples in the set.

Table 3: XNLI Pearson correlation between the performance of machine translation and manual translation.

Correlation between performance across languages
If the performance of a set of samples for one language can be an indication of its performance on another language, a high correlation between the performance for both languages is expected. To estimate this, we calculate the performance using the manual translations of the same set of training samples across languages. We then calculate the Pearson correlation of the performance across 5 random sets of samples (with varying sample-set sizes) using 5 models with different random initializations. As seen in Figure 5, there is a high positive correlation between the performance on XNLI for the various languages (using XLM-R-B). This is also the case, though to a lesser degree, for PAWS-X, as seen in Appendix Table 8. XQuAD, on the other hand, shows low and sometimes even negative correlations (Appendix Table 11), which might be due to the QA task being harder and requiring more data, and to the fact that we have less data in this case for both training and testing. It is also worth noting that the correlation is lower for both tasks, PAWS-X and XQuAD, which had low few-shot gain.
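A sketch of this correlation computation is shown below; `perf[lang]` would hold the scores of the same 25 runs (5 sample sets x 5 seeds) with the shots translated into each language, and the values here are random placeholders. The same routine, applied to the scores of the manually translated vs. machine-translated versions of each set, gives the correlations discussed in the next subsection.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder scores: 25 runs (5 sample sets x 5 seeds) per language.
rng = np.random.default_rng(0)
base = rng.normal(0.65, 0.03, size=25)        # shared "set quality" signal
perf = {"en": base + rng.normal(0, 0.01, 25),
        "de": base + rng.normal(0, 0.01, 25),
        "sw": base + rng.normal(0, 0.02, 25)}

langs = list(perf)
for i, a in enumerate(langs):
    for b in langs[i + 1:]:
        r, p = pearsonr(perf[a], perf[b])
        print(f"{a}-{b}: r={r:.2f} (p={p:.3f})")
```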

Correlation between manual and machine translation performance
Another possible indicator of the best-performing set of samples could be the performance of the samples in the set when they are machine translated into the target language. Artetxe et al. (2020a) have shown that subtle patterns in (machine or manually) translated samples can have a notable impact on the model performance, so it is important to empirically study the relation between the two. Similar to the above, we calculate the correlation between the performance of the manual and the machine translation of the same set of samples for each target language. As seen for XNLI in Table 3, there is an even higher correlation than with the English performance. A somewhat lower correlation is seen for PAWS-X in Appendix Table 10. The lower correlation might be a result of lower-quality MT, or of the different patterns introduced by MT, as mentioned before.

Gain from choosing shots
We show in Table 4 the few-shot performance gain resulting from choosing the shots with the highest English performance or the highest MT performance. Random samples are used for few-shot cross-lingual transfer in related work, so we compare to the average few-shot gain across the different shots, no choosing (avg), and also to the minimum, no choosing (min), because an important aspect of choosing shots is avoiding the worst-performing ones (comparing only to the average hides the fact that we might accidentally use a very bad set of shots). We can see a clear gain in most cases across all models when using en performance or mt performance. When there is no gain compared to no choosing (avg), the performance is still comparable, and the benefit of not choosing the worst-performing shots remains, as compared to no choosing (min). The few-shot gain with chosen shots is most significant at smaller numbers of samples, where the gain is almost double that of no choosing (avg).
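The selection and the two baselines can be sketched as follows; the score arrays are illustrative placeholders, and the target-language scores of all candidate sets are only known here because the evaluation uses datasets where manual translations of every candidate already exist.

```python
import numpy as np

en_scores     = np.array([0.81, 0.84, 0.79, 0.83, 0.80])  # 5 candidate sets, en dev
target_scores = np.array([0.62, 0.66, 0.60, 0.65, 0.61])  # same sets, target language
zero_shot     = 0.59

chosen = int(np.argmax(en_scores))   # translate only the set that is best on English
gain_chosen = target_scores[chosen] - zero_shot
gain_avg    = target_scores.mean() - zero_shot   # no choosing (avg)
gain_min    = target_scores.min()  - zero_shot   # no choosing (min)
print(gain_chosen, gain_avg, gain_min)
```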
Combining both En and MT performance when choosing the shots is expected to result in more gain, so we investigate feeding the performance values as features to a linear model that takes as input the performance of a set of samples and predicts the performance gain when this set is manually translated and used for few-shot. Predicting the performance gain is also helpful for avoiding translating any set of samples at all if every set is expected to result in a negative or low gain. We use the performance metrics as a dataset: we collect the En/MT performance of random sets of samples along with the performance of the actual manual translation. This is done using 5 different random sets of samples for 5 different XLM-R-B initializations with varying sample sizes. We also experiment with adding language features (Appendix Table 6); those features can help the model better use the English performance depending on the similarity between the language and English. The prediction error of the linear models is reported in Appendix Table 13. We can see in Table 4 that using the models improves the chosen-shots performance gain for XNLI, with the best result, as before, obtained using a combination of all features. This is not the case for PAWS-X, which could partially be due to having less performance data and fewer languages to train on (7 as compared to 15 for XNLI). The detailed results for the different languages are in Appendix Figure 21. Choosing the shots improves the few-shot performance on XNLI for all languages across almost all sample sizes. For PAWS-X, there are mixed gains/losses, but the improvement when using English performance at the maximum size is concentrated in the European languages.
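The gain-prediction model can be sketched as follows; the features and targets are synthetic placeholders standing in for the collected performance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Features per candidate set of shots: its en performance and mt performance
# (language features could be appended as extra columns). Synthetic data.
X = rng.normal(0.7, 0.05, size=(200, 2))
# Target: the few-shot gain observed when the set was manually translated.
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.01, 200)

model = LinearRegression().fit(X, y)

# Rank unseen candidate sets by predicted gain; translate the best one,
# or none if every predicted gain is negative or negligible.
candidates = rng.normal(0.7, 0.05, size=(5, 2))
print(model.predict(candidates).argmax())
```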

Conclusion and Future Work
This work conducted a systematic comparison between translate-train and few-shot cross-lingual transfer. It quantified the performance gain for each and showed that starting from 1k samples, MT data could be used to improve over zero-shot performance, and that at 10k-100k, there is an advantage for translate-train over few-shot.
For the tasks that benefit from few-shot, we show that there are random sets of samples that perform better across languages and that the English performance of the samples in those sets can help us identify them. The performance of the MT of the samples can be used as another indicator. When they do not yield a gain, both at least help avoid the worst-performing samples.
Further analysis in the future could help identify why some datasets do not benefit from few-shot transfer with certain models, and analyzing the samples might uncover interesting properties of the best/worst performing sets of samples.

Appendix A: Training Details
Hyperparameters: For the two classification tasks, we use a maximum sequence length of 128. We limit hyperparameter tuning to a search for the learning rate in {7e-6, 1e-5, 3e-5} and use a batch size of 32. For Question Answering, we use a maximum sequence length of 384 with a paragraph slide of 128, and train using a learning rate of 3e-5 and a batch size of 12 for 2 epochs.
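For reference, the same settings restated as config dicts (a convenience summary, not code released with the paper):

```python
CLASSIFICATION_CONFIG = {
    "max_seq_length": 128,
    "learning_rate_search": [7e-6, 1e-5, 3e-5],  # tuned on dev
    "batch_size": 32,
}
QA_CONFIG = {
    "max_seq_length": 384,
    "doc_stride": 128,       # the "paragraph slide" above
    "learning_rate": 3e-5,
    "batch_size": 12,
    "epochs": 2,
}
```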
The learning rate used for XLM-R-B, along with the dev performance of a model with seed=42, is reported in the Appendix.

Table 6 (language features): (1) properties taken from XTREME, (2) similarity calculated using lang2vec, (3) size is the number of Wikipedia articles in millions.

Figure 18: Detailed results on PAWS-X using a part of the available data as dev. The few-shot performance shows mixed gains: decreasing by ∼0.60% for 10 samples, increasing by ∼0.40% at 100, then decreasing again by ∼0.10%. Translate-train performance decreases until the full dataset is used, where it increases by ∼1%.

Table 7: XNLI Pearson correlation between the performance on English and the performance on other languages using the same set of samples.

Figure 21: Chosen-shots gain in performance, i.e., the gain of choosing shots over the average of no choosing (average over 5 random sets). The actual few-shot gain (compared to zero-shot) is shown in parentheses as: chosen-shots-gain (few-shot-gain). When the chosen-shots gain is positive (green), choosing the shots results in more gain; when negative (red), it hurts and results in less gain. Panel (h): PAWS-X chosen-shots gain using the (en + mt + lang features) model.

Figure 22: XQuAD chosen-shots gain in performance (no gain!), i.e., the gain of choosing shots over the average of no choosing (average over 5 random sets). The actual few-shot gain (compared to zero-shot) is shown in parentheses as chosen-shots-gain (few-shot-gain). There is no gain in choosing the shots; experiments with adding language features to the model further decrease the performance.