Soft Layer Selection with Meta-Learning for Zero-Shot Cross-Lingual Transfer

Multilingual pre-trained contextual embedding models (Devlin et al., 2019) have achieved impressive performance on zero-shot cross-lingual transfer tasks. Finding the most effective strategy for fine-tuning these models on high-resource languages so that they transfer well to the zero-shot languages is a non-trivial task. In this paper, we propose a novel meta-optimizer to soft-select which layers of the pre-trained model to freeze during fine-tuning. We train the meta-optimizer by simulating the zero-shot transfer scenario. Results on cross-lingual natural language inference show that our approach improves over the simple fine-tuning baseline and X-MAML (Nooralahzadeh et al., 2020).


Introduction
Despite the impressive performance of neural models on a wide variety of NLP tasks, these models are extremely data hungry: training them requires a large amount of annotated data. As collecting such amounts of data for every language of interest is extremely expensive, cross-lingual transfer, which aims to transfer task knowledge from high-resource (source) languages, for which annotated data are more readily available, to low-resource (target) languages, is a promising direction. Cross-lingual transfer approaches using cross-lingual resources such as machine translation (MT) systems (Wan, 2009; Conneau et al., 2018) or bilingual dictionaries (Prettenhofer and Stein, 2010) have effectively reduced the amount of annotated data required to obtain reasonable performance on the target language. However, such cross-lingual resources are often limited for low-resource languages.
Recent advances in cross-lingual contextual embedding models have reduced the need for cross-lingual supervision (Devlin et al., 2019; Lample and Conneau, 2019). Wu and Dredze (2019) show that multilingual BERT (mBERT) (Devlin et al., 2019), a contextual embedding model pre-trained on the concatenated Wikipedia data from 104 languages without cross-lingual alignment, does surprisingly well on zero-shot cross-lingual transfer tasks, where the model is fine-tuned on annotated data from the source languages and evaluated on the target language. They also propose to freeze the bottom layers of mBERT during fine-tuning, which improves cross-lingual performance over the simple fine-tune-all-parameters strategy, as different layers of mBERT capture different linguistic information (Jawahar et al., 2019).
* Work done while interning at Amazon AI.
Selecting which layers to freeze for a downstream task is a non-trivial problem. In this paper, we propose a novel meta-learning algorithm for soft layer selection. Our meta-learning algorithm learns layer-wise update rates by simulating the zero-shot transfer scenario: at each round, we randomly split the source languages into a held-out language and the remaining training languages, fine-tune the model on the training languages, and update the meta-parameters based on the model performance on the held-out language. We build the meta-optimizer on top of a standard optimizer and learnable update rates, so that it generalizes well to large numbers of updates. Our method uses far fewer meta-parameters than the X-MAML approach (Nooralahzadeh et al., 2020), which adapts model-agnostic meta-learning (MAML) (Finn et al., 2017) to zero-shot cross-lingual transfer.
Experiments on zero-shot cross-lingual natural language inference show that our approach outperforms both the simple fine-tuning baseline and the X-MAML algorithm, and that it brings larger gains when transferring from multiple source languages. An ablation study shows that both the layer-wise update rate and cross-lingual meta-training are key to the success of our approach.

Meta-Learning for Zero-Shot Cross-lingual Transfer
The idea of transfer learning is to improve the performance on the target task T_0 by learning from a set of related source tasks {T_1, T_2, ..., T_K}.
In the context of cross-lingual transfer, we treat different languages as separate tasks, and our goal is to transfer the task knowledge from the source languages to the target language. In contrast to the transfer learning case where the inputs of the source and target tasks are from the same language, in cross-lingual transfer learning we need to handle inputs from different languages with different vocabularies and syntactic structures. To handle this issue, we use the pre-trained multilingual BERT (Devlin et al., 2019), a language model encoder trained on the concatenation of monolingual corpora from 104 languages. The most widely used approach to zero-shot cross-lingual transfer using multilingual BERT is to fine-tune the BERT model θ on the source language tasks T_1, ..., T_K with training objective L,

$$\theta^* = \mathrm{Learn}(\mathcal{L}, T_1, \ldots, T_K; \theta),$$

and then evaluate the fine-tuned model θ* on the target language task T_0. The gap between training and testing can lead to sub-optimal performance on the target language.
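To address the issue, we propose to train a meta-optimizer f_ϕ for fine-tuning so that the fine-tuned model generalizes better to unseen languages. We train the meta-optimizer by minimizing the loss on a held-out source language after fine-tuning on the remaining source languages:

$$\phi^* = \arg\min_{\phi} \mathcal{L}\big(T_k;\ \mathrm{Learn}(\mathcal{L}, T_{1 \ldots K} \setminus T_k;\ \theta, f_\phi)\big),$$

where T_k is a "surprise" language randomly selected from the source language tasks T_1, ..., T_K.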

Meta-Optimizer
Our meta-optimizer consists of a standard optimizer as the base optimizer and a set of meta-parameters to control the layer-wise update rates. An update step is formulated as:

$$\theta_t = \theta_{t-1} - \lambda \odot \Delta\theta_t \qquad (1)$$

where θ_t represents the parameters of the learner model at time step t, and Δθ_t is the update vector produced by the base optimizer f_opt given the gradients g_1, ..., g_t at the current and previous steps. The function f_opt is defined by the optimization algorithm and its hyper-parameters. For example, a typical gradient descent algorithm uses f_opt = α g_t, where α represents the learning rate. A standard optimization algorithm will update the model parameters by:

$$\theta_t = \theta_{t-1} - \Delta\theta_t$$

Our meta-optimizer is different in that we perform a gated update using parametric update rates λ, computed by λ = σ(ϕ), where ϕ represents the meta-parameters of the meta-optimizer f_ϕ. The sigmoid function ensures that the update rates are within the range (0, 1). Unlike Andrychowicz et al. (2016), in which the optimizer parameters are shared across all coordinates of the model, our meta-optimizer learns different update rates for different model layers, with each layer's rate applied to all parameters in that layer. This is based on the finding that different layers of the BERT encoder capture different linguistic information, with syntactic features in middle layers and semantic information in higher layers (Jawahar et al., 2019). Thus, different layers may generalize differently across languages.

Figure 1: Computational graph for the forward pass of the meta-optimizer. Each batch (X_t, Y_t) is from the training data D_train, and (X_test, Y_test) denotes the entire test set. The meta-learner comprises a base optimizer, which takes the history and current-step gradients as inputs and suggests an update Δθ_t, and the meta-parameters, which control the layer-wise update rates λ for the learner model θ. The dashed arrows indicate that we do not back-propagate the gradients through that step when updating the meta-parameters.

Figure 1 illustrates the computational graph for the forward pass when training the meta-optimizer. Note that as the losses L_t and gradients ∇_{θ_{t−1}} L_t depend on the parameters of the meta-optimizer, computing the gradients along the dashed edges would normally require taking second derivatives, which is computationally expensive. Following Andrychowicz et al. (2016), we drop the gradients along the dashed edges and only compute gradients along the solid edges.
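The gated update described above can be illustrated with a minimal sketch. It assumes a toy learner whose parameters are stored as plain Python lists keyed by layer name; the function `gated_update` and the layer names are hypothetical and are not taken from the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; maps a meta-parameter to an update rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_update(theta, delta, phi):
    """Apply theta_t = theta_{t-1} - sigmoid(phi) * delta_theta_t per layer.

    theta: dict mapping layer name -> list of parameter values
    delta: dict mapping layer name -> update vector from the base optimizer
    phi:   dict mapping layer name -> one scalar meta-parameter for that layer
    """
    new_theta = {}
    for name, params in theta.items():
        lam = sigmoid(phi[name])  # a single update rate shared by the layer
        new_theta[name] = [p - lam * d for p, d in zip(params, delta[name])]
    return new_theta


# Hypothetical two-layer learner: layer_0 is almost frozen, layer_1 moves
# by half of the update suggested by the base optimizer.
theta = {"layer_0": [1.0, 2.0], "layer_1": [3.0]}
delta = {"layer_0": [0.5, 0.5], "layer_1": [1.0]}
phi = {"layer_0": -20.0, "layer_1": 0.0}
updated = gated_update(theta, delta, phi)
```

A strongly negative meta-parameter drives the rate toward 0, which behaves like freezing the layer, while a strongly positive one approaches the ordinary update of the base optimizer. This is the sense in which the selection is "soft".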

Meta-Training
A good meta-optimizer will, given the training data in the source languages and the training objective, suggest an update rule for the learner model so that it performs well on the target language. Thus, we would like the training condition to match that of the test time. However, in zero-shot transfer we assume no access to the target language data, so we need to simulate the test scenario using only the training data on the source languages.
As shown in Algorithm 1, at each episode in the outer loop, we randomly choose a test language k to construct the test data D_test = D_k and use the remaining data as the training data D_train.
Then, we re-initialize the parameters of the learner model and start the training simulation. At each training step, we first use the base optimizer f_opt to compute the update vector Δθ_t based on the current and past gradients g_1, ..., g_t. We then perform the gated update using the meta-parameters ϕ_{s−1} with Eq. (1). The resulting model θ_t can be viewed as the output of a forward pass of the meta-optimizer. After every L model updates, we compute the gradient of the loss on the test data D_test with respect to the old meta-parameters ϕ_{s−1} and update the meta-parameters. Our meta-learning algorithm differs from X-MAML (Nooralahzadeh et al., 2020) in that 1) X-MAML is designed mainly for few-shot transfer while our algorithm is designed for zero-shot transfer, and 2) our algorithm uses far fewer meta-parameters than X-MAML, as it only trains one update rate per layer, whereas X-MAML meta-learns the initial parameters of the entire model.
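The episode structure above can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' Algorithm 1: each "language" is a quadratic loss with its own target, the base optimizer is plain gradient descent rather than Adam, the meta-parameters are updated with plain gradient descent once per episode, and all names (`meta_train`, `toy_targets`, and so on) are hypothetical. The meta-gradient follows our reading of the first-order approximation: if each Δθ_t is treated as a constant, then for a layer with rate λ the derivative of the final parameters with respect to λ is −Σ_t Δθ_t, and the chain rule through λ = σ(ϕ) contributes the factor λ(1 − λ).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def loss_grad(theta, target):
    """Per-layer gradient of the toy loss 0.5 * ||theta - target||^2."""
    return [[p - c for p, c in zip(layer, tgt)]
            for layer, tgt in zip(theta, target)]


def meta_train(targets, theta_init, alpha=0.1, meta_lr=0.5,
               episodes=200, steps=5, seed=0):
    """Learn one update rate per layer by holding out one language per episode."""
    rng = random.Random(seed)
    languages = sorted(targets)
    phi = [0.0] * len(theta_init)  # one meta-parameter per layer
    for _ in range(episodes):
        # Simulate zero-shot transfer: one source language plays the target.
        held_out = rng.choice(languages)
        train_langs = [lang for lang in languages if lang != held_out]
        theta = [list(layer) for layer in theta_init]  # re-initialize learner
        lam = [sigmoid(p) for p in phi]
        delta_sum = [[0.0] * len(layer) for layer in theta]
        for t in range(steps):
            grads = loss_grad(theta, targets[train_langs[t % len(train_langs)]])
            for i, layer in enumerate(theta):
                for j in range(len(layer)):
                    delta = alpha * grads[i][j]  # base optimizer: plain SGD
                    layer[j] -= lam[i] * delta   # gated update
                    delta_sum[i][j] += delta
        # First-order meta-gradient: each delta is treated as a constant, so
        # d theta_T / d lambda_i = -sum_t delta_t for the parameters of layer i.
        test_grads = loss_grad(theta, targets[held_out])
        for i in range(len(phi)):
            dloss_dlam = -sum(g * d for g, d in zip(test_grads[i], delta_sum[i]))
            phi[i] -= meta_lr * dloss_dlam * lam[i] * (1.0 - lam[i])
    return [sigmoid(p) for p in phi]


# Hypothetical data: layer 0 has the same optimum in every language, while the
# optimum of layer 1 is language-specific.
toy_targets = {"a": [[1.0], [1.0]], "b": [[1.0], [-1.0]], "c": [[1.0], [0.0]]}
toy_init = [[0.0], [0.0]]
learned_rates = meta_train(toy_targets, toy_init)
```

In this toy setting, updating the shared layer always helps the held-out language, so its rate rises above the initial 0.5, whereas updating the language-specific layer moves the model away from the held-out optimum, so its rate falls. The sketch performs a single meta-update after a short unroll in each episode; the procedure described in the text instead updates the meta-parameters after every L model updates within a longer training run.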

Experiments
We evaluate our meta-learning approach on natural language inference. Natural Language Inference (NLI) can be cast as a sequence-pair classification problem where, given a premise and a hypothesis sentence, the model needs to predict whether the premise entails the hypothesis, contradicts it, or neither (neutral). We use the Multi-Genre Natural Language Inference Corpus, which consists of 433k English sentence pairs labeled with textual entailment information, and the XNLI dataset (Conneau et al., 2018), which has 2.5k development and 5k test sentence pairs in 15 languages including English (en), French (fr), Spanish (es), German (de), Greek (el), Bulgarian (bg), Russian (ru), Turkish (tr), Arabic (ar), Vietnamese (vi), Thai (th), Chinese (zh), Hindi (hi), Swahili (sw), and Urdu (ur). We use this dataset to evaluate the effectiveness of our meta-learning algorithm when transferring from English and one or more low-resource auxiliary languages to the target language.

Table 1: Accuracy of our approach compared with baselines on the XNLI dataset (averaged over five runs). We compare our approach (Meta-Optimizer) with our fine-tuning baseline with one or two auxiliary languages, the fine-tuning results in Devlin et al. (2019), the highest scores (with a selected subset of layers fixed during fine-tuning) in Wu and Dredze (2019), and the best zero-shot results using X-MAML (Nooralahzadeh et al., 2020) with one auxiliary language. We boldface the highest scores within each auxiliary language setting.

Model and Training Configurations
Our model is based on the multilingual BERT (mBERT) (Devlin et al., 2019) implemented in GluonNLP (Guo et al., 2020). As in previous work (Devlin et al., 2019; Wu and Dredze, 2019), we tokenize the input sentences using WordPiece, concatenate them, feed the sequence to BERT, and use the hidden representation of the first token ([CLS]) for classification. The final output is computed by applying a linear projection and a softmax layer to the hidden representation. We use a dropout rate of 0.1 on the final encoder layer and fix the embedding layer during fine-tuning. Following Nooralahzadeh et al. (2020), we fine-tune mBERT in two steps: 1) we fine-tune mBERT on the English data for one epoch to obtain the initial model parameters, and 2) we continue fine-tuning the model on the other source languages for two epochs. We compare using the standard optimizer (fine-tuning baseline) and our meta-optimizer for Step 2. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 2 × 10^−5, β_1 = 0.9, and β_2 = 0.999 as both the standard optimizer and the base optimizer in our meta-optimizer. To train our meta-optimizer, we use Adam with a learning rate of 0.05 for N = 10 epochs with L = 15 training batches per iteration (Algorithm 1). Unlike Nooralahzadeh et al. (2020), who select for each target language the auxiliary languages that lead to the best transfer results, we simulate a more realistic scenario where only a limited set of auxiliary languages is available. We choose two distant auxiliary languages, Greek (Hellenic branch of the Indo-European language family) and Urdu (Indo-Aryan branch of the Indo-European language family), and evaluate the transfer performance on the other languages.
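For reference, the hyper-parameters stated above can be collected in a single configuration object. The sketch below only restates values given in the text; the class name `FineTuneConfig` and all field names are hypothetical and do not correspond to GluonNLP options or to the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyper-parameters as stated in the text; field names are illustrative."""

    # Learner model
    dropout: float = 0.1              # applied on the final encoder layer
    freeze_embeddings: bool = True    # embedding layer is fixed during fine-tuning

    # Two-step fine-tuning schedule
    english_epochs: int = 1           # Step 1: English data
    auxiliary_epochs: int = 2         # Step 2: other source languages
    auxiliary_languages: tuple = ("el", "ur")  # Greek and Urdu

    # Standard optimizer, also used as the base optimizer (Adam)
    base_lr: float = 2e-5
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999

    # Meta-optimizer training (Adam on the meta-parameters)
    meta_lr: float = 0.05
    meta_epochs: int = 10             # N in the text
    batches_per_meta_update: int = 15  # L in the text


config = FineTuneConfig()
```

Note that the learning rate for the meta-parameters (0.05) is several orders of magnitude larger than the learner's learning rate (2 × 10^−5); the two rates act on different quantities and are configured independently.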

Main Results
As shown in Table 1, we compare our meta-learning approach with the fine-tuning baseline and the zero-shot transfer results reported in prior work that uses mBERT. Our approach outperforms the fine-tuning methods in Devlin et al. (2019) by 1.6-8.5%. Compared with the best fine-tuning method in Wu and Dredze (2019), which freezes a selected subset of mBERT layers during fine-tuning, our approach achieves +0.4% higher accuracy on average. We compare our approach with a strong fine-tuning baseline which achieves accuracy scores competitive with the best X-MAML results (Nooralahzadeh et al., 2020) using a single auxiliary language, even though we limit our choice of the auxiliary language to Greek and Urdu, while Nooralahzadeh et al. (2020) select the best auxiliary language among all languages except for the target one. Overall, our approach outperforms the strong fine-tuning baseline on 10 out of 14 languages and by +0.2% accuracy on average.
Our approach brings larger gains when using two auxiliary languages: it outperforms the fine-tuning baseline on all languages and improves the average accuracy by +0.6%. This suggests that our meta-learning approach is more effective when transferring from multiple source languages.¹

Table 2: Ablation results on the XNLI dataset using Greek and Urdu as the auxiliary languages (averaged over five runs). Results show that ablating the layer-wise update rate or cross-lingual meta-training degrades accuracy on all target languages.

Ablation Study
Our approach differs from Andrychowicz et al. (2016) in that 1) it adopts layer-wise update rates, while the meta-parameters are shared across all model parameters in Andrychowicz et al. (2016), and 2) it trains the meta-parameters in a cross-lingual setting, while Andrychowicz et al. (2016) is designed for few-shot learning. We conduct ablation experiments on XNLI using Greek and Urdu as the auxiliary languages to understand how these two differences contribute to the model performance.

Impact of Layer-Wise Update Rate
We compare our approach with a variant that replaces the layer-wise update rates with a single update rate shared by all layers. Table 2 shows that our approach significantly outperforms this variant on all target languages, with an average margin of 2.0%. This suggests that the layer-wise update rate contributes greatly to the effectiveness of our approach.

Impact of Cross-Lingual Meta-Training
We measure the impact of cross-lingual meta-training by replacing the cross-lingual meta-training in our approach with joint training of the layer-wise update rates and the model parameters. As shown in Table 2, ablating the cross-lingual meta-training degrades accuracy significantly on all target languages, by 1.4% on average, which shows that our cross-lingual meta-training strategy is beneficial.

¹ Using two auxiliary languages improves over one auxiliary language the most on languages that are lower-resource in mBERT pre-training (such as Turkish and Hindi), but does not bring gains, or even hurts, on high-resource languages (such as French and German). This is consistent with the findings in prior work that the choice of the auxiliary languages is crucial in cross-lingual transfer. We leave further investigation of its impact on our meta-learning approach for future work.

Cross-lingual Transfer Learning
The idea of cross-lingual transfer is to use the annotated data in the source languages to improve the task performance on the target language with minimal or even zero target labeled data (i.e., zero-shot). There is a large body of work on using external cross-lingual resources such as bilingual word dictionaries (Prettenhofer and Stein, 2010; Schuster et al., 2019b; Liu et al., 2020a), MT systems (Wan, 2009), or parallel corpora (Eriguchi et al., 2018; Singla et al., 2018; Conneau et al., 2018) to bridge the gap between the source and target languages. Recent advances in unsupervised cross-lingual representations have paved the way for transfer learning without cross-lingual resources (Yang et al., 2017; Schuster et al., 2019a). Our work builds on Mulcaire et al. (2019), Lample and Conneau (2019), and Pires et al. (2019), who show that language models trained on monolingual text from multiple languages provide powerful multilingual representations that generalize across languages. Recent work has shown that more advanced techniques such as freezing the model's bottom layers (Wu and Dredze, 2019) or continual learning (Liu et al., 2020b) can further boost the cross-lingual performance on downstream tasks. In this paper, we explore meta-learning to softly select the layers to freeze during fine-tuning.

Meta Learning
A typical meta-learning algorithm consists of two loops of training: 1) an inner loop where the learner model is trained, and 2) an outer loop where, given a meta-objective, we optimize a set of meta-parameters which control aspects of the learning process in the inner loop. The goal is to find the optimal meta-parameters such that the inner loop performs well on the meta-objective. Existing meta-learning approaches differ in the choice of meta-parameters to be optimized and the meta-objective. Depending on the choice of meta-parameters, existing work can be divided into four categories: (a) neural architecture search (Stanley and Miikkulainen, 2002; Zoph and Le, 2016; Baker et al., 2016; Real et al., 2017; Zoph et al., 2018); (b) metric-based (Koch et al., 2015; Vinyals et al., 2016); (c) model-agnostic (MAML) (Finn et al., 2017; Ravi and Larochelle, 2016); (d) model-based (learning update rules) (Schmidhuber, 1987; Hochreiter et al., 2001; Maclaurin et al., 2015; Li and Malik, 2017).
In this paper, we focus on model-based meta-learning for zero-shot cross-lingual transfer. Early work introduces a type of network that can update its own weights (Schmidhuber, 1987, 1992, 1993). More recently, Andrychowicz et al. (2016) propose to model gradient-based update rules using an RNN and optimize it with gradient descent. However, as Wichrowska et al. (2017) point out, RNN-based meta-optimizers fail to make progress when run for large numbers of steps. They address the issue by incorporating features motivated by standard optimizers into the meta-optimizer. We instead base our meta-optimizer on a standard optimizer like Adam so that it generalizes better to large-scale training.
Meta-learning has been previously applied to few-shot cross-lingual named entity recognition, low-resource machine translation (Gu et al., 2018), and improving cross-domain generalization for semantic parsing (Wang et al., 2021).
For zero-shot cross-lingual transfer, Nooralahzadeh et al. (2020) introduce an optimization-based meta-learning algorithm called X-MAML, which meta-learns the initial model parameters on supervised data from low-resource languages. By contrast, our meta-learning algorithm requires far fewer meta-parameters and is thus simpler than X-MAML. Bansal et al. (2020) show that MAML combined with meta-learning for learning rates improves few-shot learning. Their approach learns layer-wise learning rates only for task-specific layers, which are specified as a hyper-parameter of the MAML algorithm. Our approach instead learns layer-wise learning rates for all layers, and we show that it is effective on zero-shot cross-lingual transfer without being combined with MAML.

Conclusion
We propose a novel meta-optimizer that learns to soft-select which layers to freeze when fine-tuning a pre-trained language model (mBERT) for zero-shot cross-lingual transfer. Our meta-optimizer learns the update rate for each layer by simulating the zero-shot transfer scenario where the model fine-tuned on the source languages is tested on an unseen language. Experiments show that our approach outperforms the simple fine-tuning baseline and the X-MAML algorithm on cross-lingual natural language inference.