Multilingual and cross-lingual document classification: A meta-learning approach

The great majority of the world's languages are considered under-resourced for the successful application of deep learning methods. In this work, we propose a meta-learning approach to document classification in limited-resource settings and demonstrate its effectiveness in two different settings: few-shot, cross-lingual adaptation to previously unseen languages, and multilingual joint training when limited target-language data is available during training. We conduct a systematic comparison of several meta-learning methods, investigate multiple settings in terms of data availability, and show that meta-learning thrives in settings with a heterogeneous task distribution. We propose a simple yet effective adjustment to existing meta-learning methods which allows for better and more stable learning, and we set a new state of the art on several languages while performing on par on others, using only a small amount of labeled data.


Introduction
There are more than 7000 languages around the world and, of these, around 6% account for 94% of the population. Even among this 6% of most spoken languages, very few possess adequate resources for natural language research and, when they do, resources across domains are highly imbalanced. Additionally, human language is dynamic in nature: new words and domains emerge continuously, and hence no model learned at a particular point in time will remain valid forever.
With the aim of extending the global reach of Natural Language Processing (NLP) technology, much recent research has focused on the development of multilingual models and methods to efficiently transfer knowledge across languages.
Among these advances are multilingual word vectors, which aim to give word-translation pairs a similar encoding in some embedding space (Mikolov et al., 2013a; Lample et al., 2017). There has also been a lot of work on multilingual sentence and word encoders that either explicitly utilize corpora of bi-texts (Artetxe and Schwenk, 2019; Lample and Conneau, 2019) or jointly train language models for many languages in one encoder (Devlin et al., 2018). Although great progress has been made in cross-lingual transfer learning, these methods either do not close the gap with performance in a single high-resource language (Artetxe and Schwenk, 2019), e.g., because of cultural differences between languages which are not accounted for, or are impractically expensive (Lai et al., 2019).
Meta-learning, or learning to learn (Schmidhuber, 1987; Bengio et al., 1990; Thrun and Pratt, 1998), is a learning paradigm which focuses on the quick adaptation of a learner to new tasks. The idea is that by training a learner to adapt quickly and from a few examples on a diverse set of training tasks, the learner can also generalize to unseen tasks at test time. Meta-learning has recently emerged as a promising technique for few-shot learning in a wide array of tasks (Finn et al., 2017; Koch et al., 2015; Ravi and Larochelle, 2017), including NLP (Dou et al., 2019; Gu et al., 2018). The only current study on meta-learning for cross-lingual few-shot learning is that of Nooralahzadeh et al. (2020), which focuses on natural language inference and multilingual question answering; there, the authors apply meta-learning to learn to adapt a monolingually trained classifier to new languages. To the best of our knowledge, no previous work has investigated meta-learning as a framework for multilingual and cross-lingual few-shot document classification. We propose such a framework and demonstrate its effectiveness on document classification tasks. In contrast to Nooralahzadeh et al. (2020), we show that, in many cases, it is more favourable not to initialize the meta-learning process from a monolingually trained classifier, but rather to reserve its respective training data for meta-learning instead.
Our contributions are as follows: 1) we propose a meta-learning approach to few-shot cross-lingual and multilingual adaptation and demonstrate its effectiveness on document classification tasks over traditional supervised learning; 2) we provide an extensive comparison of meta-learning methods on multilingual and cross-lingual few-shot learning and release our code (https://github.com/mrvoh/meta_learning_multilingual_doc_classification) to facilitate further research in the field; 3) we analyse the effectiveness of meta-learning under a number of different parameter initializations and multiple settings in terms of data availability, and show that meta-learning can effectively learn from few examples and diverse data distributions; 4) we introduce a simple yet effective modification to existing methods and empirically show that it stabilizes training and converges faster to better local optima; 5) we set a new state of the art on several languages and achieve on-par results on others using only a small amount of data.
Meta-learning methods

Meta-learning, or learning to learn, aims to create models that can learn new skills or adapt to new tasks rapidly from a few training examples. Unlike in traditional machine learning, the datasets used for training and testing, referred to as the meta-train and meta-test datasets, comprise many tasks sampled from a distribution of tasks p(D) rather than individual data points. Each task is associated with a dataset D which contains both feature vectors and ground-truth labels and is split into a support set and a query set, D = {S, Q}. The support set is used for fast adaptation and the query set is used to evaluate performance and compute a loss with respect to the model parameter initialization. Generally, some model f_θ parameterized by θ, often referred to as the base-learner, is considered. A cycle of fast adaptation on a support set followed by updating the parameter initialization of the base-learner based on the loss on the query set is called an episode. In the case of classification, the optimal parameters maximize the probability of the true labels across multiple batches Q ⊂ D:

θ* = arg max_θ E_{Q⊂D} [ Σ_{(x,y)∈Q} log p_θ(y | x) ]   (Eq. 1)

In few-shot classification/fast learning, the goal is to minimize the prediction error on data samples with unknown labels given a small support set for learning. Meta-training (Algorithm 1) consists of updating the parameters of the base-learner over many of the formerly described episodes, until some stopping criterion is reached.
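The episodic structure described above can be sketched as follows. This is an illustrative sketch, not our actual implementation: the helper names and the support/query sizes are hypothetical, and the adaptation and update steps are left as comments since they differ per meta-learning method.

```python
import random

def sample_episode(task_data, n_support, n_query):
    """Sample one episode (support set S, query set Q) from a task's labeled data.

    task_data: list of (features, label) pairs for one task (e.g. one language).
    The two sets are disjoint by construction.
    """
    examples = random.sample(task_data, n_support + n_query)
    return examples[:n_support], examples[n_support:]

def meta_train(tasks, n_episodes, n_support=16, n_query=16):
    """Yield episodes for meta-training of a base-learner."""
    for _ in range(n_episodes):
        task = random.choice(tasks)  # sample a task from p(D)
        support, query = sample_episode(task, n_support, n_query)
        # 1) fast-adapt the base-learner parameters on `support` (inner loop)
        # 2) evaluate on `query` and update the parameter initialization (MetaUpdate)
        yield support, query
```

In our multilingual setting, each task corresponds to the document classification problem in one auxiliary language.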
Following this procedure, the extended definition of the optimal parameters is given in Eq. 2 to include fast adaptation based on the support set; the dependence on S marks the difference between traditional supervised learning and meta-learning. The optimal parameters θ* are obtained by solving

θ* = arg max_θ E_{(S,Q)⊂D} [ Σ_{(x,y)∈Q} log p_{θ(S)}(y | x) ]   (Eq. 2)

where θ(S) denotes the parameters after fast adaptation on the support set S. In this work, we focus on metric- and optimization-based meta-learning algorithms. In the following sections, their respective characteristics and the update methods in Algorithm 1 are introduced.

Prototypical Networks
Prototypical Networks (Snell et al., 2017) belong to the metric-based family of meta-learning algorithms. Typically, they consist of an embedding network f_θ and a distance function d(x_1, x_2), such as the Euclidean distance. The embedding network is used to encode all samples in the support set S_c and compute a prototype µ_c per class c ∈ C as the mean of the sample encodings of that respective class:

µ_c = (1 / |S_c|) Σ_{x∈S_c} f_θ(x)   (Eq. 3)

Using the computed prototypes, Prototypical Networks classify a new sample x as

p(y = c | x) = exp(−d(f_θ(x), µ_c)) / Σ_{c'∈C} exp(−d(f_θ(x), µ_{c'}))   (Eq. 4)

Wang et al. (2019) show that, despite their simplicity, Prototypical Networks can perform on par with or better than other state-of-the-art meta-learning methods when all sample encodings are centered around the overall mean of all classes and subsequently L2-normalized. We also adopt this strategy.
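A minimal sketch of this classification procedure, including the centering-and-normalization trick of Wang et al. (2019), might look as follows. NumPy stands in for the actual embedding network here; the encodings are assumed to have already been produced by f_θ.

```python
import numpy as np

def prototypes(embeddings, labels, n_classes):
    """Mean support-set encoding per class (Eq. 3)."""
    return np.stack([embeddings[labels == c].mean(axis=0)
                     for c in range(n_classes)])

def center_and_normalize(x, mean):
    """Center encodings around the overall mean, then L2-normalize
    (the trick of Wang et al., 2019)."""
    x = x - mean
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify(query_emb, protos):
    """Assign each query sample to the class of the nearest prototype
    under Euclidean distance (the argmax of Eq. 4)."""
    d = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return d.argmin(axis=1)
```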

MAML
Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is an optimization-based method that minimizes the following objective:

min_θ Σ_l L(f_{θ_l^(k)}, Q_l)   (Eq. 5)

where L(f_{θ_l^(k)}, Q_l) is the loss on the query set after updating the base-learner for k steps on the support set. Hence, MAML directly optimizes the base-learner such that fast adaptation of θ, often referred to as inner-loop optimization, results in task-specific parameters θ_l^(k) which generalize well on the task. Setting B as the batch size, MAML implements its MetaUpdate, also referred to as outer-loop optimization, as

θ = θ − β (1/B) Σ_{l=1}^{B} ∇_θ L(f_{θ_l^(k)}, Q_l)   (Eq. 6)

Such a MetaUpdate requires computing second-order derivatives and, in turn, holding θ_l^(j) ∀j = 1, ..., k in memory. A first-order approximation of MAML (foMAML), which ignores second-order derivatives and instead takes the gradient with respect to the adapted parameters, can be used to bypass this problem:

θ = θ − β (1/B) Σ_{l=1}^{B} ∇_{θ_l^(k)} L(f_{θ_l^(k)}, Q_l)   (Eq. 7)

Following previous work (Antoniou et al., 2018), we also adopt the following improvements in our framework for all MAML-based methods:

Per-step Layer Normalization weights In standard MAML, layer normalization weights and biases are not updated in the inner loop. Sharing one set of weights and biases across inner-loop steps implicitly assumes that the feature distribution between layers stays the same at every step of the inner optimization; a separate set per inner-loop step relaxes this assumption.
Per-layer per-step learnable inner-loop learning rate Instead of using a shared learning rate for all parameters, the authors propose to initialize a learning rate per layer and per step and jointly learn their values in the MetaUpdate steps.
Cosine annealing of outer-loop learning rate Annealing the outer-loop learning rate with some annealing function has been shown to be crucial for model performance (Loshchilov and Hutter, 2016).
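The first-order MAML update (Eq. 7) can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in: the base-learner is a scalar linear model with a squared loss rather than our document classifier, and the function names are hypothetical.

```python
import numpy as np

def grad_loss(theta, data):
    """Gradient of a toy squared loss for the scalar model y = theta * x."""
    x, y = data
    return np.mean(2 * (theta * x - y) * x)

def inner_adapt(theta, support, alpha, k):
    """Inner loop: k gradient steps on the support loss, yielding theta_l^(k)."""
    for _ in range(k):
        theta = theta - alpha * grad_loss(theta, support)
    return theta

def fomaml_step(theta, task_batch, alpha=0.01, beta=0.1, k=5):
    """foMAML MetaUpdate (Eq. 7): the meta-gradient is the query-set
    gradient evaluated at the adapted parameters theta_l^(k), so no
    second-order derivatives are needed."""
    meta_grad = 0.0
    for support, query in task_batch:
        theta_k = inner_adapt(theta, support, alpha, k)
        meta_grad += grad_loss(theta_k, query)
    return theta - beta * meta_grad / len(task_batch)
```

Full MAML would instead differentiate the query loss through the k inner steps back to the initialization θ, which requires keeping all intermediate parameters in memory.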

Reptile
Reptile (Nichol et al., 2018) is a first-order, optimization-based meta-learning algorithm which is designed to move the weights towards a manifold of the weighted averages of the task-specific parameters θ_l^(k):

θ = θ + β (1/B) Σ_{l=1}^{B} (θ_l^(k) − θ)   (Eq. 8)

Despite its simplicity, it has shown competitive or superior performance against MAML, e.g., on Natural Language Understanding (Dou et al., 2019).
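The MetaUpdate of Eq. 8 is simple enough to state in a few lines. In this sketch the parameters are plain scalars for readability; in practice they are the full parameter tensors of the base-learner.

```python
def reptile_step(theta, adapted_params, beta):
    """Reptile MetaUpdate (Eq. 8): move theta a fraction beta toward the
    average of the task-specific parameters theta_l^(k) produced by
    inner-loop optimization. No gradients of the meta-objective are needed."""
    avg = sum(adapted_params) / len(adapted_params)
    return theta + beta * (avg - theta)
```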

ProtoMAML
Triantafillou et al. (2020) introduce ProtoMAML, a meta-learning method which combines the complementary strengths of Prototypical Networks and MAML by leveraging the inductive bias of prototypes instead of a random initialization of the final linear layer of the network. Snell et al. (2017) show that Prototypical Networks are equivalent to a linear model when the Euclidean distance is used. Using the definition of the prototypes µ_c as per Eq. 3, the weights w_c and bias b_c corresponding to class c can be computed as

w_c = 2µ_c,  b_c = −||µ_c||²   (Eq. 9)

ProtoMAML is defined as the adaptation of MAML where the final linear layer is parameterized as per Eq. 9 at the start of each episode using the support set. Due to this initialization, it allows modeling a varying number of classes per episode.
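The equivalence behind Eq. 9 follows from expanding the negative squared Euclidean distance: −||x − µ_c||² = 2µ_cᵀx − ||µ_c||² − ||x||², where the last term is class-independent and therefore drops out of the softmax. A small sketch of this prototype-based initialization, with NumPy standing in for the network's final layer:

```python
import numpy as np

def proto_linear_init(protos):
    """Initialize the final linear layer from class prototypes (Eq. 9):
    w_c = 2 * mu_c and b_c = -||mu_c||^2, so that w_c^T x + b_c equals
    -||x - mu_c||^2 up to a class-independent term."""
    W = 2.0 * protos
    b = -np.sum(protos ** 2, axis=-1)
    return W, b
```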
ProtoMAMLn Inspired by Wang et al. (2019), we propose a simple yet effective adaptation to ProtoMAML, referred to as ProtoMAMLn, which applies L2 normalization to the prototypes themselves; again, we use a first-order approximation (foProtoMAMLn). We demonstrate that doing so leads to a more stable, faster and more effective learning algorithm at only constant extra computational cost (O(1)).
We hypothesize the normalization to be particularly beneficial in the case of a relatively high-dimensional final feature space (in the case of BERT-like models, typically 768 dimensions). Let x be a sample and x̂ = f_θ(x) be the encoding of the sample in the final feature space. Since the final activation function is tanh, all entries of both x̂ and µ_c have values between -1 and 1. The pre-softmax activation for class c is computed as x̂ᵀµ_c. Due to the size of the vectors and the scale of their respective entries, this inner product can yield a wide range of values, which in turn results in relatively high loss values, making the inner-loop optimization unstable.
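The normalization itself is a one-liner; by the Cauchy-Schwarz inequality it bounds the magnitude of each pre-softmax activation x̂ᵀµ_c by ||x̂||, independently of the feature dimensionality. A minimal sketch:

```python
import numpy as np

def normalize_protos(protos, eps=1e-8):
    """ProtoMAMLn: L2-normalize each prototype before applying Eq. 9,
    bounding the scale of the pre-softmax activations x_hat^T mu_c."""
    norms = np.linalg.norm(protos, axis=-1, keepdims=True)
    return protos / np.maximum(norms, eps)
```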

Related work

Multilingual NLP
Just as the deep learning era for monolingual NLP started with the invention of dense, low-dimensional vector representations for words (Mikolov et al., 2013b), so did cross-lingual NLP with works like those of Mikolov et al. (2013a) and Faruqui et al. (2014). More recently, multilingual and/or cross-lingual NLP is approached by training one shared encoder for multiple languages at once, either by explicitly aligning representations with the use of parallel corpora (Artetxe and Schwenk, 2019; Lample and Conneau, 2019) or by jointly training on some monolingual language-model objective, such as the Masked Language Model (MLM) objective (Devlin et al., 2018), in multiple languages. These language models aim to create a shared embedding space for multiple languages, with the hope that fine-tuning in one language does not degrade performance in others. Lai et al. (2019) argue that just aligning languages is not sufficient to generalize performance to new languages due to the phenomenon they describe as domain drift. Domain drift accounts for all differences for the same task across languages which cannot be captured by a perfect translation system, such as differences in culture. They instead propose a multi-step approach which utilizes a multilingual teacher trained with Unsupervised Data Augmentation (UDA) (Xie et al., 2019) to create labels for a student model that is pretrained on large amounts of unlabeled data in the target language and domain using the MLM objective. With their method, the authors obtain state-of-the-art results on the MLDoc document classification task (Schwenk and Li, 2018) and the Amazon Sentiment Polarity Review task (Prettenhofer and Stein, 2010). A downside, however, is the high computational cost involved.
For every language and domain combination: 1) a machine translation system has to be run on a large amount of unlabeled samples; 2) the UDA method needs to be applied to obtain a teacher model that generates pseudo-labels on the unlabeled in-domain data; 3) a language model must be finetuned, which involves forward and backward computation of a softmax function over a large output space (e.g., 50k tokens for mBERT and 250k tokens for XLM-RoBERTa); and 4) the final classifier is then obtained by training the finetuned language model on the pseudo-labels generated by the teacher.

Meta-learning in NLP
Monolingual Bansal et al. (2019) apply meta-learning to a wide range of NLP tasks in a monolingual setting and show that it yields better parameter initializations than self-supervised pretraining and multi-task learning. Their method is an adaptation of MAML where a text encoder, BERT (Devlin et al., 2018), is coupled with a parameter generator that learns to generate task-dependent initializations of the classification head, such that meta-learning can be performed across tasks with disjoint label spaces. Obamuyide and Vlachos (2019b) apply meta-learning to the task of relation extraction; Obamuyide and Vlachos (2019a) apply lifelong meta-learning for relation extraction; and meta-learning has also been applied to few-shot missing link prediction in knowledge graphs.
Multilingual Gu et al. (2018) apply meta-learning to Neural Machine Translation (NMT) and show its advantage over strong baselines such as cross-lingual transfer learning. By viewing each language pair as a task, the authors apply MAML to obtain competitive NMT systems with as little as 600 parallel sentences. To the best of our knowledge, the only application of meta-learning to cross-lingual few-shot learning is that of Nooralahzadeh et al. (2020). The authors study the application of X-MAML, a MAML-based variant, to cross-lingual Natural Language Inference (XNLI) (Conneau et al., 2018) and Multilingual Question Answering (MLQA) (Lewis et al., 2019) in both a cross-domain and a cross-language setting. X-MAML works by pretraining some model M on a high-resource task h to obtain initial model parameters θ_mono. Subsequently, a set L of one or more auxiliary languages is taken, and MAML is applied to achieve fast adaptation of θ_mono for l ∈ L. In their experiments, the authors use either one or two auxiliary languages and evaluate their method in both a zero- and a few-shot setting. It should be noted that, in the few-shot setting, the full development set (2.5k instances) is used to finetune the model, which is not in line with other work on few-shot learning, such as Bansal et al. (2019). Also, there is a discrepancy between the training data used for the baselines and that used for their proposed method: all reported baselines are either zero-shot evaluations of θ_mono or of θ_mono finetuned on the development set of the target language, whereas their proposed method additionally uses the development set in either one or two auxiliary languages during meta-training.

Data
In this section, we give an overview of the datasets we use and the respective classification tasks.
MLDoc Schwenk and Li (2018) published an improved version of the Reuters Corpus Volume 2 (Lewis et al., 2004) with balanced class priors for all languages. MLDoc consists of news stories in 8 languages: English, German, Spanish, French, Italian, Russian, Japanese and Chinese. Each news story is manually classified into one of four groups: Corporate/Industrial, Economics, Government/Social and Markets. The training sets contain 10k samples per language, whereas the test sets contain 4k samples.
Amazon Sentiment Polarity Another widely used dataset for cross-lingual text classification is the Amazon Sentiment Analysis dataset (Prettenhofer and Stein, 2010). The dataset is a collection of product reviews in English, French, German and Japanese in three categories: books, DVDs and music. Each sample consists of the original review accompanied by metadata, such as the rating of the reviewed product expressed as an integer on a scale from one to five. In this work, we consider the sentiment polarity task, where we distinguish between positive (rating > 3) and negative (rating < 3) reviews. When all product categories are concatenated, the dataset consists of 6k samples per language for each of the train and test sets. We extend this with Chinese product reviews in the cosmetics domain from JD.com (Zhang et al., 2015), a large e-commerce website in China; here, the train and test sets contain 2k and 20k samples respectively.
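The rating-to-polarity mapping can be made explicit as follows. Note that the handling of the neutral rating of 3, which falls in neither class, is our assumption for this sketch; the source text only defines the two polar classes.

```python
def polarity_label(rating):
    """Map a 1-5 star rating to a binary sentiment polarity label.

    Ratings above 3 are positive, below 3 negative; the neutral rating
    of 3 belongs to neither class and is returned as None (assumed to
    be discarded).
    """
    if rating > 3:
        return "positive"
    if rating < 3:
        return "negative"
    return None
```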

Experiments
We use XLM-RoBERTa, a strong multilingual model, as the base-learner in all models. We quantify the strengths and weaknesses of meta-learning as opposed to traditional supervised learning in both a cross-lingual and a multilingual joint-training setting with limited resources.
Cross-lingual adaptation Here, the available data is split into multiple subsets: the auxiliary languages l aux which are used in meta-training, the validation language l dev which is used to monitor performance, and the target languages l tgt which are kept unseen until meta-testing. Two scenarios in terms of amounts of available data are considered. A small sample of the available training data of l aux is taken to create a limited-resource setting, whereas all available training data of l aux is used in a high-resource setting. The chosen training data per language is split evenly and stratified over two disjoint sets from which the meta-training support and query samples are sampled, respectively. For meta-testing, one batch (16 samples) is taken from the training data of each target language as support set, while we test on the whole test set per target language (i.e., the query set).
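The even, stratified split of each auxiliary language's training data into disjoint support and query pools can be sketched as follows (the function name and seed handling are illustrative, not our exact implementation):

```python
import random
from collections import defaultdict

def stratified_halves(examples, seed=0):
    """Split labeled examples evenly and stratified by class into two
    disjoint pools, from which meta-training support and query sets are
    later sampled.

    examples: list of (features, label) pairs for one language.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append((x, y))
    support_pool, query_pool = [], []
    for items in by_class.values():
        rng.shuffle(items)
        half = len(items) // 2
        support_pool += items[:half]
        query_pool += items[half:]
    return support_pool, query_pool
```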
Multilingual joint training We also investigate meta-learning as an approach to multilingual joint training in the same limited-resource setting as previously described for the cross-lingual experiments. The difference is that instead of learning to generalize to l_tgt ≠ l_aux from few examples, here l_tgt = l_aux. If we can show that many similar tasks across languages can be learned from few examples per language, using a total number of examples of the same order of magnitude as in "traditional" supervised learning for training a monolingual classifier, this might be an incentive to change data collection processes in practice.
For both experimental settings above, we examine the influence of additionally using all training data from a high-resource source language l_src (English) during meta-training.

Specifics per dataset
MLDoc As MLDoc has sufficient languages, we set l_src = English and l_dev = Spanish. The remaining languages are split into two groups: l_aux = {German, Italian, Japanese} and l_tgt = {French, Russian, Chinese}. In the limited-resource setting, we randomly sample 64 samples per language in l_aux for training. Apart from comparing the low- and high-resource settings, we also quantify the influence of augmenting the training set l_aux with the high-resource source language l_src, English.
Amazon Sentiment Polarity The fact that the Amazon dataset (augmented with Chinese) comprises only five languages has some implications for our experimental design. In the cross-lingual experiments, where l_aux, l_dev and l_tgt should be disjoint, only three languages, including English, remain for meta-training. As we consider two languages too little for meta-training, we do not experiment with leaving out the English data.
Hence, for meta-training, the data consists of l_src = English, as well as two languages in l_aux. We always keep one language unseen until meta-testing and alter l_aux such that we can meta-test on every language. We set l_dev = French in all cases except when French is used as the target language; then, l_dev = Chinese. In the limited-resource setting, a total of 128 samples per language in l_aux is used. For the multilingual joint-training experiments, there are enough languages available to quantify the influence of English during meta-training. When English is excluded, it is used for meta-validation. When included, we average results over two sets of experiments: one where l_dev = French and one where l_dev = Chinese.

Baselines
We introduce baselines trained in a standard supervised, non-episodic fashion. Again, we use XLM-RoBERTa-base as the base-learner in all models.
Zero-shot This baseline assumes sufficient training data for the task to be available in one language, l_src (English). The base-learner is trained in a non-episodic manner using mini-batch gradient descent with a cross-entropy loss. Performance is monitored during training on a held-out validation set in l_src; the model with the lowest loss is selected and then evaluated on the same task in the target languages.
Non-episodic The second baseline aims to quantify the exact impact of learning a model through the meta-learning paradigm versus standard supervised learning. The model learns from exactly the same data as the meta-learning algorithms, but in a non-episodic manner, i.e., merging the support and query sets in l_aux (and l_src when included) and training using mini-batch gradient descent with a cross-entropy loss. During testing, the trained model is independently finetuned for 5 steps on the support set (one mini-batch) of each target language in l_tgt.

Training setup and hyper-parameters
We use the Ranger optimizer, an adapted version of Adam (Kingma and Ba, 2014) with improved stability at the beginning of training, obtained by accounting for the variance in adaptive learning rates (Liu et al., 2019), and with improved robustness and convergence speed (Yong et al., 2020). We use a batch size of 16 and a learning rate of 3e-5, to which we apply cosine annealing. For meta-training, we perform 100 epochs of 100 episodes and perform evaluation with 5 different seeds on the meta-validation set after each epoch. One epoch consists of 100 update steps, where each update step consists of a batch of 4 episodes. Early stopping with a patience of 3 epochs is performed to avoid overfitting. For the non-episodic baselines, we train for 10 epochs on the auxiliary languages while validating after each epoch. All models are created using the PyTorch library (Paszke et al., 2017) and trained on a single 24 GB NVIDIA Titan RTX GPU.

We perform a grid search on MLDoc in order to determine the optimal hyper-parameters for the MetaUpdate methods. The hyper-parameters resulting in the lowest loss on l_dev = Spanish are used in all experiments: the number of update steps in the inner loop is 5; the (initial) learning rate of the inner loop is 1e-5 for MAML and ProtoMAML and 5e-5 for Reptile; the factor by which the learning rate of the classification head is multiplied is 10 for MAML and ProtoMAML and 1 for Reptile; and, when applicable, the learning rate with which the inner-loop optimizer is updated is 6e-5. See Table 1 for the considered grid.

Tables 2 and 3 show the accuracy scores on the target languages for MLDoc and Amazon respectively. We start by noting the strong multilingual capabilities of XLM-RoBERTa as our base-learner: adding the full training datasets in three extra languages (i.e., comparing the zero-shot with the non-episodic baseline in the high-resource, 'Included' setting) results in a mere 1.2 percentage-point increase in accuracy on average for MLDoc and 0.6 percentage points for Amazon.
Although the zero-shot and non-episodic baselines are strong, in the majority of cases a meta-learning approach improves performance. This holds especially for our version of ProtoMAML (ProtoMAMLn), which achieves the highest average accuracy in all considered settings.

Cross-lingual adaptation
The substantial improvements for Russian on MLDoc and Chinese on Amazon indicate that meta-learning is most advantageous when the considered task distribution is somewhat heterogeneous or, in other words, when domain drift (Lai et al., 2019) is present. For the Chinese data used in the sentiment polarity task, the presence of domain drift is obvious, as the data is collected from a different website and concerns different products than the other languages. For Russian in the MLDoc dataset, the non-episodic baseline has the smallest gain in performance when adding English data (l_src) in the limited-resource setting (a 0.2% absolute gain, as opposed to 5.7% on average for the remaining languages) and even shows a decrease of 2.4 percentage points when adding English data in the high-resource setting.
Especially for these languages with domain drift, our version of ProtoMAML (foProtoMAMLn) outperforms the non-episodic baselines by a relatively large margin. For instance, in Table 2, in the high-resource setting with English included during training, foProtoMAMLn improves over the non-episodic baseline by 9.1 percentage points, whereas the average gain over the remaining languages is 0.9 percentage points. A similar trend can be seen in Table 3, where, in the limited-resource setting, foProtoMAMLn outperforms the non-episodic baseline by 1.9 percentage points on Chinese, with comparatively smaller gains on average for the remaining languages.
Joint training In this setting, we achieve a new state of the art on MLDoc for German, Italian, Japanese and Russian using our method, foProtoMAMLn (Tables 2 and 3), while using a much less computationally expensive approach than previous state-of-the-art methods on MLDoc. Again, we use Russian in MLDoc to exemplify the difference between meta-learning and standard supervised learning. When comparing the difference in performance between excluding and including English meta-training episodes (l_src), opposite trends are noticeable: for standard supervised, non-episodic learning, performance drops slightly, by 0.3%, whereas all meta-learning algorithms gain between 2.2% and 6.7% in absolute accuracy. This confirms our earlier finding that meta-learning benefits from, and usefully exploits, heterogeneity in data distributions, which in contrast harms performance in the standard supervised-learning case.
Ablations

foProtoMAMLn Figure 1 shows the development of the validation accuracy during training for 25 epochs for the original foProtoMAML and our method, foProtoMAMLn. By applying L2 normalization to the prototypes, we obtain a more stable version of foProtoMAML which empirically converges faster. We furthermore re-run the high-resource experiments with English for both MLDoc and Amazon using the original foProtoMAML (Table 5) and find that it performs 4.3 and 1.7 accuracy points worse on average, respectively, further demonstrating the effectiveness of our approach.
Initializing from a monolingual classifier In our experiments, we often assume the presence of a source language (English). We now investigate (in the l_src = en 'Excluded' setting) whether it is beneficial to pre-train the base-learner in a standard supervised way on this source language and use the obtained checkpoint θ_mono as an initialization for meta-training (Table 6), rather than initializing from the transformer checkpoint. We observe that only ProtoNet consistently improves performance, whereas foProtoMAMLn suffers the most, with a decrease of 3.1% and 3.96% in accuracy in the low- and high-resource settings respectively. We surmise this difference is attributable to two factors. Intuitively, the monolingual classifier learns a transformation from the input space to the final feature space, from which the prototypes for ProtoNet and ProtoMAML are created, such that the learned classes are encoded in their own disjoint sub-spaces and a linear combination of these features can be used to correctly classify instances. ProtoNet aims to learn a similar transformation, but uses a nearest-neighbour approach to classify instances instead. ProtoMAML, on the other hand, benefits the most from prototypes which can still be used to classify instances after the inner-loop updates have been performed. This, in combination with the fact that the first-order approximation of ProtoMAML cannot differentiate through the creation of the prototypes, could explain the difference in performance gain with respect to ProtoNet.

Conclusion
We proposed a meta-learning framework for few-shot cross-lingual and multilingual joint learning for document classification tasks in different domains. We demonstrated that it leads to consistent gains over traditional supervised learning across a wide array of data availability and diversity settings, and showed that it thrives in settings with a heterogeneous task distribution. We presented an effective adaptation to ProtoMAML and, among other results, obtained a new state of the art for German, Italian, Japanese and Russian in the few-shot setting on MLDoc.