Meta-Learning for Effective Multi-task and Multilingual Modelling

Natural language processing (NLP) tasks (e.g., question-answering in English) benefit from knowledge of other tasks (e.g., named entity recognition in English) and knowledge of other languages (e.g., question-answering in Spanish). Such shared representations are typically learned in isolation, either across tasks or across languages. In this work, we propose a meta-learning approach to learn the interactions between both tasks and languages. We also investigate the role of different sampling strategies used during meta-learning. We present experiments on five different tasks and six different languages from the XTREME multilingual benchmark dataset. Our meta-learned model clearly outperforms competitive baselines, including multi-task models. We also present zero-shot evaluations on unseen target languages to demonstrate the utility of our proposed model.


Introduction
Multi-task and multilingual learning are both problems of long-standing interest in natural language processing. Leveraging data from multiple tasks and/or additional languages to benefit a target task is of great appeal, especially when the target task has limited resources. When it comes to multiple tasks, it is well-known from prior work on multi-task learning (Liu et al., 2019b; Kendall et al., 2018; Liu et al., 2019a; Yang and Hospedales, 2017) that jointly learning a model across tasks can benefit the tasks mutually. For multiple languages, the ability of deep learning models to learn effective embeddings has led to their use for joint learning of models across languages (Conneau et al., 2020; Conneau and Lample, 2019; Artetxe and Schwenk, 2019); learning cross-lingual embeddings to aid languages in limited-resource settings is of growing interest (Kumar et al., 2019; Wang et al., 2017; Adams et al., 2017).

Let us say we had access to M tasks across N different languages; cf. Table 1, which outlines such a matrix of tasks and languages from the XTREME benchmark (Hu et al., 2020). How do we perform effective joint learning across tasks and languages? Are there specific tasks or languages that need to be sampled more frequently for effective joint training? Can such sampling strategies be learned from the data?
In this work, we adopt a meta-learning approach for efficiently learning parameters in a shared parameter space across multiple tasks and multiple languages. Our chosen tasks are question answering (QA), natural language inference (NLI), paraphrase identification (PA), part-of-speech tagging (POS) and named entity recognition (NER). The tasks were chosen to enable us to employ a gamut of different types of language representations needed to tackle problems in NLP. In Figure 1, we illustrate the different types of representations by drawing inspiration from the Vauquois Triangle (Vauquois, 1968), well-known for machine translation, and situating our chosen tasks within such a triangle. Here we see that POS and NER are relatively 'shallower' analysis tasks that are token-centric, while QA, NLI and PA are 'deeper' analysis tasks that would require deeper semantic representations. This representation suggests a strategy for effective parameter sharing. For the deeper tasks, the same task in different languages could have representations that are closer and hence benefit each other, while for the shallower tasks, keeping the language unchanged and exploring different tasks might be more beneficial. Interestingly, this is exactly what we find with our meta-learned model and is borne out in our experimental results. We also find that as the model progressively learns, the meta-learning based models for the tasks requiring deeper semantic analysis benefit more from joint learning compared to the shallower tasks.

With access to multiple tasks and languages during training, the question of how to sample effectively from different tasks and languages also becomes important to consider. We investigate different sampling strategies, including a parameterized sampling strategy, to assess the influence of sampling across tasks and languages on our meta-learned model.
Our main contributions in this work are threefold:
• We present a meta-learning approach that enables effective sharing of parameters across multiple tasks and multiple languages. This is the first work, to our knowledge, to explore the interplay between multiple tasks at different levels of abstraction and multiple languages using meta-learning. We show results on the recently-released XTREME benchmark and observe consistent improvements across different tasks and languages using our model. We also offer rules of thumb for effective meta-learning that could hold in larger settings involving additional tasks and languages.
• We investigate different sampling strategies that can be incorporated within our meta-learning approach and examine their benefits.
• We evaluate our meta-learned model in zero-shot settings for every task on target languages that never appear during training and show its superiority compared to competitive zero-shot baselines.

Related Work
We summarize three threads of related research that look at the transferability in models across different tasks and different languages: multi-task learning, meta-learning, and data sampling strategies for both multi-task learning and meta-learning. Multi-task learning (Caruana, 1993) has proven to be highly effective for transfer learning in a variety of NLP applications such as question answering, neural machine translation, etc. (McCann et al., 2018; Hashimoto et al., 2017; Chen et al., 2018; Kiperwasser and Ballesteros, 2018). Some multi-task learning approaches (Jawanpuria et al., 2015) have attempted to identify clusters (or groups) of related tasks based on end-to-end convex optimization formulations. Meta-learning algorithms (Nichol et al., 2018) are highly effective for fast adaptation and have recently been shown to be beneficial for several machine learning tasks (Santoro et al., 2016; Finn et al., 2017). Gu et al. (2018) use a meta-learning algorithm for machine translation to leverage information from high-resource languages. Dou et al. (2019) investigate multiple model-agnostic meta-learning algorithms for low-resource natural language understanding on the GLUE (Wang et al., 2018) benchmark. Data sampling strategies for multi-task learning and meta-learning form the third thread of related work. A good sampling strategy has to account for the imbalance in dataset sizes across tasks/languages and the similarity between tasks/languages. A simple heuristic-based solution to address the issue of data imbalance is to assign more weight to low-resource tasks or languages (Aharoni et al., 2019). Arivazhagan et al.
(2019) define a temperature parameter which controls how often one samples from low-resource tasks/languages. The MultiDDS algorithm, proposed by Wang et al. (2020b), actively learns a separate set of parameters for sampling batches from a given set of tasks such that the performance on a held-out set is maximized. We use a variant of MultiDDS as a sampling strategy in our meta-learned model. Nooralahzadeh et al. (2020) is the work most similar in spirit to ours, in that they study a cross-lingual and cross-task meta-learning architecture, but they focus only on zero-shot and few-shot transfer for two natural language understanding tasks, NLI and QA. In contrast, we study many tasks in many languages, in conjunction with sampling strategies, and offer concrete insights on how best to guide the meta-learning process when multiple tasks are in the picture.

Methodology
Our setting is pivoted on a grid of tasks and languages (with some missing entries, as shown in Table 1). Each row of the grid corresponds to a single task. A cell of the grid corresponds to a task-language pair, which we refer to as a TL pair (TLP). We denote by $q_i = |D_i^{train}| / \sum_{k=1}^{n} |D_k^{train}|$ the fraction of the overall training data contributed by the $i$th TLP, and by $P_D(i)$ the probability of sampling a batch from the $i$th TLP during meta training. The distribution over all TLPs is thus a multinomial distribution (say $\mathcal{M}$) over the $P_D(i)$'s.
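As a concrete illustration, the dataset-size fractions $q_i$ can be computed as below (a minimal sketch; the TLP grid and dataset sizes here are hypothetical, for illustration only):

```python
# Sketch: computing the dataset-size fractions q_i over task-language
# pairs (TLPs). The sizes below are hypothetical, not XTREME's actual sizes.

def tlp_proportions(train_sizes):
    """q_i = |D_i^train| / sum_k |D_k^train| for each TLP i."""
    total = sum(train_sizes.values())
    return {tlp: size / total for tlp, size in train_sizes.items()}

# Hypothetical TLP grid: (task, language) -> number of training examples.
sizes = {
    ("QA", "en"): 80000,
    ("NLI", "es"): 100000,
    ("POS", "de"): 15000,
    ("NER", "hi"): 5000,
}
q = tlp_proportions(sizes)
# With proportional sampling, P_D(i) = q_i, so batches during meta training
# are drawn from a multinomial over TLPs weighted by dataset size.
```

Any sampling strategy discussed later only re-weights these fractions; the multinomial structure over TLPs stays the same.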

Our Meta-learning Approach
The goal in the standard meta-learning setting is to obtain a model that generalizes well to new test/target tasks, given some distribution over training tasks. This can be achieved using optimization-based meta-learning algorithms that modify the learning procedure in order to learn a good initialization of the parameters, which can serve as a useful starting point for further fine-tuning on various tasks. Finn et al. (2017) proposed a general optimization algorithm called Model Agnostic Meta Learning (MAML) that can be trained using gradient descent. MAML aims to minimize the objective $\mathbb{E}_{T_i \sim \mathcal{M}}\left[L_i(U_i^k(\theta))\right]$, where $\mathcal{M}$ is the multinomial distribution over TLPs, $L_i$ is the loss and $U_i^k$ a function that returns $\theta$ after $k$ gradient updates, both calculated on batches sampled from $T_i$. Minimizing this objective using first-order methods involves computing gradients of the form $\frac{\partial}{\partial \theta} U_i^k(\theta)$, leading to the expensive computation of second-order derivatives. Nichol et al. (2018) proposed an alternative first-order meta-learning algorithm named "Reptile" with the simple update rule $\theta \leftarrow \theta + \epsilon\,(U_i^k(\theta) - \theta)$. Despite its simplicity, a recent study by Dou et al. (2019) showed that Reptile is at least as effective as MAML in terms of performance. We therefore employed Reptile for meta-learning in all our experiments.
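The Reptile update can be sketched on a toy objective as follows (a minimal NumPy illustration; the quadratic losses, step sizes, and alternating TLP schedule are made-up stand-ins, not the paper's actual mBERT setup):

```python
import numpy as np

def inner_updates(theta, grad_fn, k, lr):
    """U_i^k(theta): take k gradient steps on TLP i's loss, starting from theta."""
    phi = theta.copy()
    for _ in range(k):
        phi -= lr * grad_fn(phi)
    return phi

def reptile_step(theta, grad_fn, k=5, inner_lr=0.1, meta_lr=0.5):
    """Reptile update: theta <- theta + eps * (U_i^k(theta) - theta)."""
    phi = inner_updates(theta, grad_fn, k, inner_lr)
    return theta + meta_lr * (phi - theta)

# Two toy "TLPs", each a quadratic loss with a different optimum; the
# gradient of 0.5 * ||theta - o||^2 is simply (theta - o).
optima = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grads = [lambda th, o=o: th - o for o in optima]

theta = np.zeros(2)
for step in range(200):
    i = step % 2                 # stand-in for sampling a TLP from M
    theta = reptile_step(theta, grads[i])
# theta converges to a point between the two optima: an initialization
# from which either TLP can be reached with a few gradient steps.
```

Note that only first-order gradients appear here, which is what makes Reptile cheaper than MAML's second-order objective.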

Algorithm 1 Our Meta-learning Approach
Input: D_train, the set of TLPs for meta training (also D_dev for parameterized sampling); a sampling strategy (temperature-based / MultiDDS)
Output: The converged multi-task multilingual model parameters θ*
1: Initialize P_D(i) depending on the sampling strategy
2: while not converged do
      ⋮ (per-TLP gradient updates, Line 6, and the Reptile meta-step, Line 8)
      if Sampling Strategy = MultiDDS then
         ⋮ (update of the sampling parameters ψ, Line 15)
      end if
17: end while

Selection and Sampling Strategies

Selection
The choice of TLPs in meta-learning plays a vital role in influencing the model performance, as we will see in more detail in Section 5. Apart from the use of all TLPs across both tasks and languages during training, selecting all languages for a given task (Gu et al., 2018) and selecting all tasks for a given language (Dou et al., 2019) are two other logical choices. We refer to the last two settings as being Task-Limited and Lang-Limited, respectively.

Heuristic Sampling
Once the TLPs for meta training (denoted by D) have been selected, we need to sample TLPs from $\mathcal{M}$. We investigate temperature-based heuristic sampling (Arivazhagan et al., 2019), which defines the probability of sampling the $i$th TLP as a function of its dataset size: $P_D(i) = q_i^{1/\tau} / \sum_{k=1}^{n} q_k^{1/\tau}$, where $\tau$ is the temperature parameter. $\tau = 1$ reduces to sampling TLPs proportional to their dataset sizes and $\tau \to \infty$ reduces to sampling TLPs uniformly.
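The effect of the temperature can be seen in a short sketch (the dataset sizes are hypothetical, chosen only to show the behaviour at different values of τ):

```python
# Temperature-based sampling over TLPs: P_D(i) is proportional to
# q_i^(1/tau), where q_i is the dataset-size fraction of TLP i.

def temperature_probs(sizes, tau):
    total = sum(sizes)
    q = [s / total for s in sizes]            # q_i: dataset-size fractions
    w = [qi ** (1.0 / tau) for qi in q]       # unnormalized q_i^(1/tau)
    z = sum(w)
    return [wi / z for wi in w]               # normalize to get P_D(i)

sizes = [90000, 9000, 1000]                   # highly imbalanced TLPs
p1 = temperature_probs(sizes, tau=1)          # proportional to dataset size
p5 = temperature_probs(sizes, tau=5)          # flatter distribution
# As tau grows, the distribution approaches uniform, so low-resource TLPs
# are sampled more often than their raw size would suggest.
```

This captures the trade-off the paper explores with τ = 1, 2, 5 and ∞: larger τ protects low-resource TLPs at the cost of under-sampling the largest datasets.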

Parameterized Sampling
The temperature-based sampling strategy defined in the previous section remains constant throughout meta training and depends only on dataset sizes. Wang et al. (2020b) proposed a parameterized sampling technique called MultiDDS that builds on Differential Data Selection (DDS) (Wang et al., 2020a) for weighing multiple datasets. The $P_D(i)$ are parameterized using $\psi_i$ as $P_D(i) = e^{\psi_i} / \sum_j e^{\psi_j}$, with the initial value of $\psi$ satisfying $P_D(i) = q_i$. The optimization of $\psi$ and $\theta$ is performed in an alternating manner (Colson et al., 2007): $\theta$ is updated to minimize the training loss under the current sampling distribution, while $\psi$ is updated to minimize the loss over the development set(s). The reward function $R(x, y; \theta_t)$ measures how well the gradient of the loss on a sampled batch $(x, y)$ aligns with the gradient of the development loss, and the $\psi$'s are updated using the REINFORCE (Williams, 1992) algorithm.
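A schematic of the ψ update is sketched below (pure Python; the scalar rewards are stand-ins for MultiDDS's gradient-alignment reward, and the learning rate and number of rounds are arbitrary):

```python
import math

def softmax(psi):
    """P_D(i) = exp(psi_i) / sum_j exp(psi_j), computed stably."""
    m = max(psi)
    exps = [math.exp(p - m) for p in psi]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(psi, i, reward, lr=0.1):
    """One REINFORCE step on psi after TLP i was sampled and yielded `reward`.
    Uses d/d(psi_j) log P_D(i) = 1{j == i} - P_D(j)."""
    p = softmax(psi)
    return [psi_j + lr * reward * ((1.0 if j == i else 0.0) - p[j])
            for j, psi_j in enumerate(psi)]

q = [0.5, 0.3, 0.2]                   # dataset-size fractions q_i
psi = [math.log(qi) for qi in q]      # initialize so that P_D(i) = q_i
rewards = [1.0, 0.2, 0.2]             # stand-in rewards: TLP 0 helps dev most
for _ in range(50):
    for i in range(len(psi)):         # pretend each TLP was sampled once
        psi = reinforce_update(psi, i, rewards[i])
p = softmax(psi)                      # TLP 0's sampling share has grown
```

The log-of-q initialization reproduces the paper's condition that the initial $P_D(i) = q_i$, and TLPs that consistently earn higher rewards see their sampling probability rise over meta training.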

Evaluation Benchmark
The recently released XTREME dataset (Hu et al., 2020) is a multilingual, multi-task benchmark consisting of classification, structured prediction, QA and retrieval tasks. Each constituent task has associated datasets in multiple languages. The POS and NER datasets are sourced from the Universal Dependencies v2.5 treebanks (Nivre et al., 2020) and WikiAnn (Pan et al., 2017), respectively, with ground-truth labels available for each language. Regarding evaluation metrics, we report F1 scores for QA and accuracy scores for the other four tasks (PA, NLI, POS, NER).

Implementation Details
BERT (Devlin et al., 2019) models yield state-of-the-art performance for many NLP tasks. Since we are dealing with datasets in multiple languages, we build our meta-learning models on the mBERT (Pires et al., 2019; Wu and Dredze, 2019) base architecture, implemented by Wolf et al. (2020), with output layers specific to each task. In our experiments, we use the AdamW (Loshchilov and Hutter, 2017) optimizer to make gradient-based updates to the model's parameters using batches from a particular TLP (Alg. 1, Line 6). This optimizer is shared across all the TLPs. When performing the meta-step (Alg. 1, Line 8), we use vanilla stochastic gradient descent (SGD) (Robbins and Monro, 1951) updates. Similarly, in the case of parameterized sampling, the weights are updated (Alg. 1, Line 15) using vanilla SGD. Meta training involves sampling a set of m tasks, taking k gradient update steps from the initial parameters to arrive at θ_i^(k) for task T_i, and finally updating θ using the Reptile update rule.

Data Selection and Sampling Strategies
We experiment with three different configurations for the set of TLPs to be considered during meta-learning: (a) using all tasks for a given language (Lang-Limited), (b) using all languages for a given task (Task-Limited), and (c) using all tasks and all languages (All TLPs). Since the dataset size varies across tasks (as well as across languages), we use temperature sampling within each setting for τ = 1, 2, 5 and ∞. (In Table 4 of Appendix C in the supplementary material, we report results for different choices of TLP selection and different values of the temperature.) With respect to the Input in Algorithm 1, there are two sets of TLPs that need to be selected for parameterized sampling: D_train and D_dev. In order to analyse the effect of the choice of task and language, we experiment with the following four settings. The models (a), (b) are referred to as mDDS, and (c), (d) are called mDDS-Lang and mDDS-Task, respectively. Results for these four models are reported in Table 2 alongside temperature sampling for comparison.

Baselines
Our first baseline system for each TLP uses mBERT-based models trained on data specific to each TLP, which is either available as ground-truth or in a translated form. We follow the same hyperparameter settings as reported in XTREME. We also present three multi-task learning (MTL) baseline systems: task limited (Task-Limited), language limited (Lang-Limited), and the use of all TLPs during training (All TLPs MTL). During MTL training, we concatenate and shuffle the selected datasets. The model is trained for 5 epochs with a learning rate of 5e-5. We refer the reader to Appendix A for more training details.

Results and Analysis
Table 2 presents all our main results comparing different data selection and sampling strategies used for meta-learning. Each column corresponds to a target TLP; the best-performing meta-learned models for each target TLP within each data selection setting have been highlighted in colour.
(Light-to-dark gradation reflects improvements in performance.) From Table 2, we see that our meta-learned models outperform the baseline systems across all the TLPs corresponding to QA, NLI and PA. (POS and NER also mostly benefit from meta-learning, but the margins of improvement are much smaller compared to the other tasks, given the already high baseline scores.)
Table 2: SS=Temp refers to the temperature-based sampling strategy and SS=mDDS refers to the MultiDDS-based sampling strategy. mDDS-Task and mDDS-Lang refer to the use of a development set for MultiDDS that contains all languages for a task and all tasks for a language, respectively. The best result among the Baseline and the three MTL models is highlighted using orange. For each column we present the difference (positive or negative) of the meta models from the best baseline (highlighted in orange) of that column.

Task-Limited vs Lang-Limited models. For QA and NLI, we observe that the Task-Limited models are always better than the Lang-Limited models. This is in line with our intuition that tasks like QA and NLI (which require deeper semantic representations) benefit more from using data from different languages for the same task. The opposite seems to hold for POS and NER, where the Lang-Limited models are almost always better than the Task-Limited models. With POS and NER being relatively shallower tasks, it makes sense that they benefit more from language-specific training that relies on token embeddings shared across tasks.
Investigating Sampling Strategies. In Table 2, all the scores shown for the Temp sampling strategy are the best scores across the four values of τ (τ = 1, 2, 5, ∞). (The complete table is available in Appendix C in the supplementary material.) We also present comparisons with the mDDS, mDDS-Lang and mDDS-Task sampling strategies enforced within the Lang-Limited, Task-Limited and All TLPs models, respectively. For POS and NER, our best meta-learned models are mostly Lang-Limited with Temp sampling. It is intuitive that for these shallower tasks, mDDS, which allows sampling instances from other tasks, does not offer any benefits.
To better understand the effects of mDDS sampling, Figure 3 shows plots of the rewards and sampling probabilities (governed by the ψ's) as a function of training time for two deeper tasks, QA-en and NLI-es, along with a shallower task, POS-de. We note that initially all the TLPs in any mDDS setting start with similar rewards, thus leading the ψ's to converge towards the τ = ∞ state. We highlight the following three observations:
• We find that the mDDS strategy does not help NLI at all. This is because the NLI task occupies the largest proportion across tasks at the start, as shown in Figure 2, and the proportion of NLI decreases substantially over time (since all tasks start with similar rewards at the beginning of meta training). Thus, for tasks that are over-represented in the meta-learning phase, temperature-based sampling is likely to be sufficient.
• We observe that the rewards for both QA and NLI are consistently high, irrespective of the target TLP. This suggests that both QA and NLI are information-rich tasks and could benefit other tasks in meta-learning. This is also apparent from the accuracies for PA in Table 2, where all the best meta-learned models employ mDDS sampling.
• From the sampling probabilities for QA-en, we see that both QA and NLI are given almost equal weightage. However, from the F1 scores in Table 2, the best numbers for QA are in the Task-Limited setting, which suggests that QA does not benefit from any other task. One explanation could be that the input sequence length for NLI is 128 while the inputs for QA are of length 384, thus leaving less room for QA to benefit from NLI.
Zero-shot Evaluations. Zero-shot evaluation is performed on languages that were not part of the training (henceforth, we refer to them as external languages). In the case of QA, NLI and PA, we select all external languages for which datasets are available in XTREME. For NER and POS, the number of external languages is close to 35, so we choose a subset of these to report results. For evaluation, we compare models that are agnostic to the target language during meta training (Task-Limited, All TLPs and All TLPs mDDS-Task). Since Lang-Limited MTL is language-specific and does not offer a competitive baseline when applied to an external language, we compare against Task-Limited MTL and All TLPs MTL, which are more competitive.

Table 3: Results comparing zero-shot evaluations for several external languages with competitive MTL baselines. The best MTL model is highlighted using orange. Rows for meta models show the difference (positive or negative) of the meta model result from the best MTL setting (orange) for that column.

An interesting observation from the zero-shot results in Table 3 is that for every external language, on the 'shallower' NER and POS tasks, the Task-Limited variant of meta-learning performs better than both variants of MTL, viz., Task-Limited MTL and All TLPs MTL. In contrast, the 'deeper' tasks, viz., QA, NLI and PA, benefit more from meta-learning in the All TLPs setting, presumably because, as argued earlier, the deeper tasks tend to help each other more.

Conclusion
We present an effective use of meta-learning for capturing task and language interactions in multi-task, multilingual settings. This effective use involves appropriate strategies for sampling tasks and languages, as well as rough knowledge of the level of abstraction (deep vs. shallow representation) of each task. We present experiments on the XTREME multilingual benchmark dataset using five tasks and six languages. Our meta-learned model shows clear performance improvements over competitive baseline models. We observe that deeper tasks consistently benefit from meta-learning. Furthermore, shallower tasks benefit from deeper tasks when meta-learning is restricted to a single language. Finally, zero-shot evaluations on several external languages demonstrate the benefit of using meta-learning over two multi-task baselines, while also reinforcing the linguistic insight that tasks requiring deeper representations tend to collaborate better.

Appendices

Appendix A: Baseline Training Details
For QA, the learning rate is 3e-5, the sequence length is 384, and the model is trained for 2 epochs. For PA, NLI, POS and NER, the learning rate is 2e-5 and the sequence length is 128. NLI and PA models are trained for 5 epochs, while POS and NER models are trained for 10 epochs. The choice of hyperparameters was kept constant across different languages for the same task.

Appendix B: Finetuning Details
For finetuning, we kept the same number of epochs as the baseline of that task, i.e., 2 epochs for QA, 10 epochs for POS and NER, and 5 epochs for NLI and PA. For QA we finetune with learning rates 3e-5 and 3e-6, and for POS/NER with learning rates 2e-5 and 2e-6, selecting the better of the two models. For PA and NLI, the results for learning rate 2e-5 were consistently worse than for 2e-6, so we use lr = 2e-6 for PA and NLI.