Zero-shot Cross-lingual Transfer is Under-specified Optimization

Pretrained multilingual encoders enable zero-shot cross-lingual transfer, but often produce unreliable models that exhibit high performance variance on the target language. We postulate that this high variance results from zero-shot cross-lingual transfer solving an under-specified optimization problem. We show that any linearly interpolated model between the source language monolingual model and the source + target bilingual model has equally low source language generalization error, yet the target language generalization error decreases smoothly and linearly as we move from the monolingual to the bilingual model, suggesting that the model struggles to identify good solutions for both source and target languages using the source language alone. Additionally, we show that the zero-shot solution lies in a non-flat region of the target language generalization error surface, causing the high variance.


Introduction
Pretrained multilingual encoders like Multilingual BERT (mBERT; Devlin et al., 2019) and XLM-RoBERTa (XLM-R; Conneau et al., 2020) facilitate zero-shot cross-lingual transfer (Wu and Dredze, 2019; Hu et al., 2020): training the model on one language and then using it on another language without additional task-specific training data. While the generalization performance on the source language has low variance, on the target language the variance is much higher with zero-shot cross-lingual transfer (Keung et al., 2020; Wu and Dredze, 2020), making it difficult to compare different models in the literature. Similarly, pretrained monolingual encoders also have unstable performance during fine-tuning (Devlin et al., 2019; Phang et al., 2018).
Why are these models so sensitive to the random seed? Many theories have been offered: catastrophic forgetting of the pretrained task (Phang et al., 2018; Lee et al., 2020; Keung et al., 2020), small data size (Devlin et al., 2019), impact of random seed on task-specific layer initialization and data ordering (Dodge et al., 2020), the Adam optimizer without bias correction (Mosbach et al., 2021; Zhang et al., 2021), and a different generalization error with similar training loss (Mosbach et al., 2021). However, none of these factors fully explain the high generalization error variance of zero-shot cross-lingual transfer on target language but low variance on source language. Code is available at https://github.com/shijie-wu/crosslingual-nlp.
We offer a new explanation for high variance in target language performance: the zero-shot cross-lingual transfer optimization problem is under-specified. Based on the well-established 1-dimensional linear interpolation plot and contour plot (Goodfellow et al., 2014; Li et al., 2018), we empirically show that any linearly interpolated model between the monolingual source model and the bilingual source and target model has equally low source language generalization error. Yet the target language generalization error surprisingly decreases smoothly and linearly as we move from the monolingual model to the bilingual model. To the best of our knowledge, no other paper documents this finding.
This result provides a new answer to our mystery: only a small subset of the solution space for the source language solves the target language on par with models that have actual target language supervision. Under the existing condition (no target language supervision), the optimization cannot find such a solution, hence an under-specified optimization problem. If target language supervision were available, as it is for the counterfactual bilingual model, the optimization would find this smaller subset. Comparing mBERT and XLM-R, we find that the generalization error surface of XLM-R is flatter than that of mBERT, contributing to its better performance. Thus, zero-shot cross-lingual transfer has high variance because the solution it finds lies in a non-flat region of the target language generalization error surface: a small perturbation in the parameter space leads to a large difference in generalization error.

Existing Hypotheses (Related Work)
Prior studies have observed fine-tuning variance with pretrained encoders and have offered various hypotheses to explain this behavior. Catastrophic forgetting, in which a neural network trained on one task forgets that task after training on a second task (McCloskey and Cohen, 1989; Kirkpatrick et al., 2017), has been credited as the source of high variance in both monolingual fine-tuning (Phang et al., 2018; Lee et al., 2020) and zero-shot cross-lingual transfer (Keung et al., 2020). Mosbach et al. (2021) question why preserving cloze capability would be important. However, in zero-shot cross-lingual transfer, deliberately preserving the multilingual cloze capability with regularization improves performance but does not eliminate the zero-shot transfer gap (Aghajanyan et al., 2021; Liu et al., 2021).
Small training sets often seem to produce higher variance in performance (Devlin et al., 2019), but Mosbach et al. (2021) found that, when the number of gradient updates is controlled, a smaller data size has similar variance to a larger data size.
In the pretraining-then-fine-tune paradigm, random seeds impact the initialization of task-specific layers and data ordering during fine-tuning. Dodge et al. (2020) show that development set performance has high variance with respect to seeds. Additionally, Adam without bias correction, an Adam (Kingma and Ba, 2014) variant inadvertently introduced by the implementation of Devlin et al. (2019), has been identified as the source of high variance during monolingual fine-tuning (Mosbach et al., 2021; Zhang et al., 2021). However, in zero-shot cross-lingual transfer, while different random seeds lead to high variance in target languages, the source language has much smaller variance in comparison even with standard Adam (Wu and Dredze, 2020).
Beyond optimizers, Mosbach et al. (2021) attribute high variance to generalization issues: despite having similar training loss, different models exhibit vastly different development set performance. However, in zero-shot cross-lingual transfer, the development or test performance variance is much smaller on the source language compared to the target language.
Figure 1: Zero-shot cross-lingual transfer is an under-specified optimization problem. With the existing condition, the optimization could not find the solution that we really want. Diagram labels: parameters with low generalization error on source language; parameters with low generalization error on target language; found by zero-shot optimization; found by bilingual optimization.

Under-specified Optimization
Existing hypotheses do not explain the defining pattern of zero-shot cross-lingual transfer: much higher variance in generalization error on the target language than on the source language. We propose a new explanation: zero-shot cross-lingual transfer is an under-specified optimization problem. As in Fig. 1, optimizing a multilingual model for a specific task using only source language annotation admits many solutions that are good in terms of source language generalization error. However, unbeknownst to the optimizer, these solutions have wildly different generalization errors on the target language. In fact, only a small subset has generalization error as low as that of models trained on the target language. Yet without the guidance of target data, the zero-shot cross-lingual optimization cannot find this smaller subset. As we will show in §5, the solution found by zero-shot transfer lies in a non-flat region of the target language generalization error surface, so a small perturbation in the parameter space causes a big difference in generalization error, hence the high variance.
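The argument can be illustrated with a deliberately artificial two-parameter example. This is a toy construction of ours for intuition only, not one of the experiments reported here: the error functions, the noise levels, and the names `source_error`, `target_error`, and `simulate` are all hypothetical. The source error constrains only the first coordinate, so every seed reaches low source error, while the unconstrained second coordinate drives the target error.

```python
import random
import statistics


def source_error(x, y):
    # Toy source-language error: depends only on x, so it is flat along y.
    return (x - 1.0) ** 2


def target_error(x, y):
    # Toy target-language error: depends on both coordinates.
    return (x - 1.0) ** 2 + (y - 2.0) ** 2


def simulate(n_seeds=200, noise=0.5, seed=0):
    """Return (variance of source error, variance of target error) over seeds."""
    rng = random.Random(seed)
    src, tgt = [], []
    for _ in range(n_seeds):
        # Assumption: source-only training pins x close to its optimum,
        # but leaves y to vary from seed to seed.
        x = 1.0 + rng.gauss(0.0, 0.01)
        y = rng.gauss(0.0, noise)
        src.append(source_error(x, y))
        tgt.append(target_error(x, y))
    return statistics.pvariance(src), statistics.pvariance(tgt)


src_var, tgt_var = simulate()
print(f"source error variance: {src_var:.2e}")
print(f"target error variance: {tgt_var:.2e}")
```

In this toy setting the source error variance is orders of magnitude smaller than the target error variance, mirroring the qualitative pattern described above.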

Linear Interpolation
We test this hypothesis via a linear interpolation between two models to explore the neural network parameter space. Consider three sets of neural network parameters, $\theta_{src}$, $\theta_{tgt}$, and $\theta_{\{src,tgt\}}$, for a model trained on task data for the source language only, the target language only, and both languages, respectively. This includes both task-specific layers and encoders. Note that all three models have the same initialization before fine-tuning, making the bilingual model a counterfactual setup: the model we would obtain if the corresponding target language supervision were available. We obtain the 1-dimensional (1D) linear interpolation of a monolingual (source) task-trained model and a bilingual task-trained model with
$$\theta(\alpha) = \alpha\,\theta_{\{src,tgt\}} + (1 - \alpha)\,\theta_{src} \tag{1}$$
or we could swap source and target by
$$\theta(\alpha) = \alpha\,\theta_{\{src,tgt\}} + (1 - \alpha)\,\theta_{tgt} \tag{2}$$
where $\alpha$ is a scalar mixing coefficient (Goodfellow et al., 2014). Additionally, we can compute a 2-dimensional linear interpolation as
$$\theta(\alpha_1, \alpha_2) = \theta_{\{src,tgt\}} + \alpha_1 \delta_{src} + \alpha_2 \delta_{tgt} \tag{3}$$
where $\delta_{src} = \theta_{src} - \theta_{\{src,tgt\}}$, $\delta_{tgt} = \theta_{tgt} - \theta_{\{src,tgt\}}$, and $\alpha_1$ and $\alpha_2$ are scalar mixing coefficients (Li et al., 2018). Finally, we can evaluate any interpolated model on the development sets of the source and target languages, testing the generalization error on the same language and across languages.
The performance of the interpolated model illuminates the behavior of the model's parameters. Take Eq. (1) as an example: if the linearly interpolated model performs consistently well on our task in the source language, it suggests that both models lie within the same local minimum of the source language generalization error surface. Additionally, if the linearly interpolated model performs vastly differently on the target language, it would support our hypothesis. On the other hand, if the linearly interpolated model's performance drops on the source language, it suggests that the two models lie in different local minima of the source language generalization error surface, and thus that zero-shot optimization is searching the wrong region.
Figure 2: [...] Eq. (1) and Eq. (2), with $\alpha = 0$ and $\alpha = 1$ representing the source language monolingual model and the source + target bilingual model, respectively. Each subfigure title indicates the source and target languages. Across all experiments, the source language dev performance stays consistently high (red and purple lines) during interpolation, while the target language dev performance starts low and increases smoothly and linearly as it moves towards the bilingual model (gray and blue lines). App. D breaks down this figure by task.
Figure 3: [...] (3), respectively. By comparing mBERT and XLM-R, we observe that XLM-R has a flatter target language generalization error surface compared to mBERT. Different language pair and task combinations show similar trends, and additional figures can be found in App. E.
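The interpolation itself is simple to sketch. The following is a minimal illustration, not the released implementation: it assumes each model is represented as a mapping from parameter names to values (plain floats here; objects such as arrays or tensors that support `*`, `+`, and `-` would work the same way), and the function names `interpolate_1d` and `interpolate_2d` are hypothetical. The 1D form mixes a monolingual and the bilingual model with a coefficient alpha, so that alpha = 0 gives the monolingual model and alpha = 1 the bilingual model; the 2D form adds scaled offsets of the two monolingual models to the bilingual model.

```python
def interpolate_1d(theta_mono, theta_bi, alpha):
    """Return alpha * theta_bi + (1 - alpha) * theta_mono, per parameter."""
    return {
        name: alpha * theta_bi[name] + (1.0 - alpha) * theta_mono[name]
        for name in theta_bi
    }


def interpolate_2d(theta_src, theta_tgt, theta_bi, alpha1, alpha2):
    """Return theta_bi + alpha1 * delta_src + alpha2 * delta_tgt, per parameter."""
    mixed = {}
    for name in theta_bi:
        # Offsets of each monolingual model from the bilingual model.
        delta_src = theta_src[name] - theta_bi[name]
        delta_tgt = theta_tgt[name] - theta_bi[name]
        mixed[name] = theta_bi[name] + alpha1 * delta_src + alpha2 * delta_tgt
    return mixed


# Toy parameters (hypothetical values, for illustration only).
theta_src = {"w": 0.0, "b": 2.0}
theta_tgt = {"w": 4.0, "b": 0.0}
theta_bi = {"w": 1.0, "b": 1.0}

midpoint = interpolate_1d(theta_src, theta_bi, alpha=0.5)
print(midpoint)
```

Because all three models share the same architecture and initialization, the parameter names line up, which is what makes a per-parameter mix meaningful. Mixing coefficients outside [0, 1] extrapolate beyond the two endpoints.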

Experiments
We consider four tasks: natural language inference (XNLI; Conneau et al., 2018), named entity recognition (NER; Pan et al., 2017), POS tagging, and dependency parsing (Zeman et al., 2020). We evaluate XNLI and POS tagging with accuracy (ACC), NER with span-level F1, and parsing with labeled attachment score (LAS). We consider two encoders: base mBERT and large XLM-R. For the task-specific layer, we use a linear classifier for XNLI, NER, and POS tagging, and the biaffine parser of Dozat and Manning (2017) for dependency parsing.
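For readers less familiar with the parsing metric, LAS is the percentage of words whose predicted head and dependency label both match the gold annotation. The sketch below is only an illustration under our own assumptions: the function name `labeled_attachment_score`, the `(head_index, label)` tuple representation, and the example sentence are hypothetical. The optional POS filter reflects the evaluation detail in App. A that words tagged SYM or PUNCT are ignored.

```python
def labeled_attachment_score(gold, predicted, pos_tags=None,
                             ignored_pos=("SYM", "PUNCT")):
    """Percentage of words whose (head, label) pair matches the gold pair.

    gold, predicted: lists of (head_index, relation_label), one per word.
    pos_tags: optional list of POS tags; words whose tag is in
    ignored_pos are skipped.
    """
    correct = 0
    total = 0
    for i, (gold_arc, pred_arc) in enumerate(zip(gold, predicted)):
        if pos_tags is not None and pos_tags[i] in ignored_pos:
            continue
        total += 1
        correct += int(gold_arc == pred_arc)
    return 100.0 * correct / total if total else 0.0


# Hypothetical four-word sentence.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
pos = ["NOUN", "VERB", "NOUN", "PUNCT"]

print(labeled_attachment_score(gold, pred, pos_tags=pos))
```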
To avoid English-centric experiments, we consider two source languages: English and Arabic. We choose 8 typologically diverse target languages: Arabic (used only when English is the source language), German, Spanish, French, Hindi, Russian, Vietnamese, and Chinese. We train source-language-only and target-language-only monolingual models as well as a source-target bilingual model.
We compute the linearly interpolated models as described in §3.1 and test them on both the source and target language development sets. We loop over $\{-0.5, -0.4, \ldots, 1.5\}$ for $\alpha$, $\alpha_1$, and $\alpha_2$. We report the mean and variance of three runs with different random seeds. We normalize both the mean and the variance of each interpolated model by the bilingual model performance, allowing us to aggregate across tasks and language pairs. Details of fine-tuning can be found in App. A.
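The evaluation loop can be sketched as follows. This is a minimal sketch under stated assumptions rather than the exact procedure: the text does not specify how the variance is normalized, so we assume each run's score is divided by the mean bilingual score before the mean and (population) variance are taken. The names `alpha_grid` and `normalized_stats` and the example scores are hypothetical.

```python
import statistics


def alpha_grid(start=-0.5, stop=1.5, step=0.1):
    """Mixing coefficients -0.5, -0.4, ..., 1.5 (21 values by default)."""
    n_steps = round((stop - start) / step)
    # Rounding removes floating-point drift such as 0.30000000000000004.
    return [round(start + i * step, 10) for i in range(n_steps + 1)]


def normalized_stats(scores_per_seed, bilingual_scores_per_seed):
    """Mean and variance of scores after dividing by the bilingual mean.

    Assumption: normalization divides every run's score by the mean
    bilingual score; the source text does not spell out this detail.
    """
    reference = statistics.mean(bilingual_scores_per_seed)
    normalized = [score / reference for score in scores_per_seed]
    return statistics.mean(normalized), statistics.pvariance(normalized)


grid = alpha_grid()
# Hypothetical dev scores from three seeds at one grid point.
mean, variance = normalized_stats([40.0, 50.0, 60.0], [80.0, 80.0, 80.0])
print(len(grid), grid[0], grid[-1], mean, variance)
```

Dividing by the bilingual reference puts tasks with different metrics (ACC, F1, LAS) on a common scale, which is what allows aggregation across tasks and language pairs.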

Results
In Fig. 2, we observe that interpolations between the source monolingual and bilingual model have consistently similar source language performance. In contrast, surprisingly, the target language performance improves smoothly and linearly as the interpolated model moves from the zero-shot model to the bilingual model. The only exception is mBERT, where the performance drops slightly around 0.1 and 0.9 locally. In contrast, XLM-R has a flatter slope and smoother interpolated models. Fig. 3 further demonstrates this finding with a 2D linear interpolation. The target language generalization error surface of XLM-R is much flatter than that of mBERT, which is perhaps the fundamental reason why XLM-R performs better than mBERT in zero-shot transfer, similar to findings in CV models (Li et al., 2018). As we discuss in §3, these two findings support our hypothesis that zero-shot cross-lingual transfer is an under-specified optimization problem. As Fig. 3 shows, the solution found by zero-shot transfer lies in a non-flat region of the target language generalization error surface, causing the high variance of zero-shot transfer on the target language. In contrast, the same solution lies in a flat region of the source language generalization error surface, causing the low variance on the source language.

Discussion
We have presented evidence that zero-shot cross-lingual transfer is an under-specified optimization problem, and that this is the cause of high variance on the target language but not on the source language during cross-lingual transfer. This finding holds across 4 tasks, 2 source languages, and 8 target languages. Training bigger encoders addresses this issue indirectly by producing encoders with flatter cross-lingual generalization error surfaces. However, a more robust solution may be found by introducing constraints into the optimization problem. There are a few potential solutions. Few-shot cross-lingual transfer is a potential way to further constrain the optimization problem. Zhao et al. (2021) find that it is important to first train on the source language and then fine-tune with the few-shot target language examples. Through the lens of our analysis, this finding is intuitive, since fine-tuning with a small amount of target data provides guidance (a gradient direction) that narrows down the solution space, leading to a potentially better solution for the target language. The initial fine-tuning with the source data is also important since it provides a good starting point. Additionally, Zhao et al. (2021) observe that the choice of shots matters. This is expected, as it significantly impacts the quality of the gradient direction.
Similarly, silver target data is a potential way to further constrain the optimization problem. While Yarmohammadi et al. (2021) find that jointly training with gold source data and silver target data benefits cross-lingual transfer, a pipeline fine-tuning approach like the one used in few-shot cross-lingual transfer is also worth exploring.
Unsupervised model selection like Chen and Ritter (2020) and optimization regularization like Aghajanyan et al. (2021) have been proposed in the literature to improve zero-shot cross-lingual transfer. Through the lens of our analysis, both solutions attempt to constrain the optimization problem.
As none of the existing techniques fully constrains the optimization, future work should study combinations of existing techniques and develop new techniques on top of them, instead of studying one technique at a time.

Acknowledgments
This research is supported in part by ODNI, IARPA, via the BETTER Program contract #2019-19051600005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

A Fine-tuning Experiments Detail
We follow the implementation and hyperparameters of Wu and Dredze (2020). We optimize with Adam (Kingma and Ba, 2014). The learning rate is 2e-5. The learning rate scheduler uses linear warmup over the first 10% of steps, then linear decay to 0. We train for 5 epochs with a batch size of 32. For token-level tasks, the task-specific layer takes the representation of the first subword of each word, following previous work (Devlin et al., 2019; Wu and Dredze, 2019). Model selection is done on the dev set corresponding to the training set. We fine-tune each model using a single Quadro RTX 6000, and it takes less than one hour except for XNLI.
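The learning rate schedule described above can be written as a small function. This is a sketch of a standard warmup-then-linear-decay schedule under our own assumptions, not code from the released implementation: the function name `learning_rate`, its parameters, and the rounding of the warmup length to whole steps are hypothetical choices.

```python
def learning_rate(step, total_steps, peak_lr=2e-5, warmup_fraction=0.1):
    """Linear warmup to peak_lr over the first 10% of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from peak_lr at the end of warmup to 0 at total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Example: a hypothetical run of 1,000 updates.
for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate(step, total_steps=1000))
```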
During fine-tuning, the maximum sequence length is 128. We use a sliding window of context to include subwords beyond the first 128 for NER and POS tagging. At test time, we use the same maximum sequence length with the exception of parsing, where the first 128 words (instead of subwords) of a sentence are used. We ignore words with POS tags of SYM and PUNCT during parsing evaluation. For NER, the BIO predictions are post-processed to ensure that valid spans are produced.
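The text does not specify the BIO post-processing rule. One common repair, shown here purely as an assumed example, converts any I- tag that does not continue an entity of the same type into a B- tag, so that every span starts with B-. The function name `repair_bio` and the tag sequence are hypothetical.

```python
def repair_bio(tags):
    """Turn an I-X tag into B-X when it does not continue an X entity.

    Assumption: this is one common repair rule, not necessarily the
    post-processing used in the experiments.
    """
    repaired = []
    previous_type = None  # Entity type of the previous tag; None after "O".
    for tag in tags:
        if tag == "O":
            repaired.append(tag)
            previous_type = None
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and previous_type != entity_type:
            tag = "B-" + entity_type
        repaired.append(tag)
        previous_type = entity_type
    return repaired


print(repair_bio(["O", "I-PER", "I-PER", "B-LOC", "I-ORG", "O"]))
```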

B Norm Ratio and Angle of $\delta_{src}$ and $\delta_{tgt}$
Fig. 4 plots the relationship between the norm ratio $\|\delta_{src}\| / \|\delta_{tgt}\|$ and the angle between $\delta_{src}$ and $\delta_{tgt}$. We observe that most $\delta_{src}$ and $\delta_{tgt}$ have similar norms, and the angle between them is around 55°.
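Both quantities follow directly from the two offset vectors. The sketch below assumes each offset has been flattened into a single list of floats (all parameters concatenated); the function name `norm_ratio_and_angle` and the two-dimensional example vectors are hypothetical and unrelated to the reported values.

```python
import math


def norm_ratio_and_angle(delta_src, delta_tgt):
    """Return (norm of delta_src / norm of delta_tgt, angle in degrees)."""
    dot = sum(a * b for a, b in zip(delta_src, delta_tgt))
    norm_src = math.sqrt(sum(a * a for a in delta_src))
    norm_tgt = math.sqrt(sum(b * b for b in delta_tgt))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_src * norm_tgt)))
    return norm_src / norm_tgt, math.degrees(math.acos(cosine))


ratio, angle = norm_ratio_and_angle([1.0, 0.0], [1.0, 1.0])
print(ratio, angle)
```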