On the cross-lingual transferability of multilingual prototypical models across NLU tasks

Supervised deep learning-based approaches have been applied to task-oriented dialog and have proven to be effective for limited-domain and limited-language applications when a sufficient number of training examples are available. In practice, these approaches suffer from the drawbacks of domain-driven design and under-resourced languages. Domain and language models are supposed to grow and change as the problem space evolves. On the one hand, research on transfer learning has demonstrated the cross-lingual ability of multilingual Transformer-based models to learn semantically rich representations. On the other hand, meta-learning has enabled the development of task and language learning algorithms capable of far generalization. In this context, this article investigates the cross-lingual transferability obtained by combining few-shot learning with prototypical neural networks and multilingual Transformer-based models. Experiments in natural language understanding tasks on the MultiATIS++ corpus show that our approach substantially improves the observed transfer learning performances between low- and high-resource languages. More generally, our approach confirms that the meaningful latent space learned in a given language can be generalized to unseen and under-resourced ones using meta-learning.


Introduction
Traditionally, Natural Language Understanding (NLU) is an intermediate module between the user interface and the dialogue management module in a dialogue system. It aims to extract semantic information from a user's query or utterance to fill slots in a domain-specific semantic frame. Domain classification, intent detection and slot filling are three core components of NLU. They are in charge of determining the domain or service of a user's query, its underlying goal or intent, and of associating utterance segments with conceptual labels, called slots, similar to named entity recognition.
NLU is usually defined as a supervised learning problem, involving conventional machine learning models trained on massive amounts of annotated data, which are language dependent. This prerequisite has prevented its widespread adoption for poorly endowed languages and for small technology companies that do not benefit from millions of users to gather data. Besides requiring a large amount of annotated data, domains, intents and slots are language dependent. Consequently, in practice, the resulting systems are hardly adaptable to new languages.
As a solution to this problem, cross-lingual transfer approaches were developed to leverage the knowledge from well-resourced languages, with task-specific data available, for under-resourced languages with little or no data. Recent efforts focused on training Transformer models multilingually, such as the multilingual version of BERT (Devlin et al., 2019). While earlier work demonstrated the effectiveness of multilingual models at learning representations that are transferable across languages, they show limitations when applied to low-resource languages (Pires et al., 2019; Conneau et al., 2020). From another perspective, low-shot learning, such as few-shot and zero-shot learning, aims to transfer knowledge learned from one language to another when the training data is limited or is missing some task labels.
As a core contribution, we explore the potential for cross-lingual transferability of a multilingual Transformer-based model (Vaswani et al., 2017) (mBERT) combined with a few-shot learning algorithm based on prototypical representations. We also introduce a zero-shot scenario, where models are trained on multiple languages and evaluated on another. Our proposed approach relies on appending an mBERT encoder module to the prototypical neural network, a proven few-shot model originally designed for image classification. Our experimental results show that the resulting model, trained with a limited number of annotated training examples, outperforms the transfer learning based approach on the MultiATIS++ dataset (Xu et al., 2020; Upadhyay et al., 2018) and can be applied to unseen languages directly with decent performance.

Related work
The availability of large datasets has enabled deep learning methods to achieve great success in a variety of fields. However, most of these successes are based on supervised learning approaches, which require large amounts of labeled data to train. Most datasets are only available in English. Only a few other languages are supported, and most of them are considered under-resourced languages.
Recently, meta-learning approaches have enabled the development of task-agnostic learning algorithms capable of far generalizations (cross-domain or cross-lingual) in low-data regimes. Because the literature on low-shot learning is vast and diverse, only the approaches most relevant to this work are presented, and we refer the reader to Vanschoren (2019) for a survey of earlier work.

Low-shot learning
Humans manifest a capacity to learn new concepts from few stimuli quickly and efficiently by utilizing prior knowledge and experience. Inspired by this ability, there has been a resurgence of interest in designing specialized models to perform low-shot learning. An example of this form of learning is metric-based approaches, founded on the simple idea of learning a discriminative metric space in which similar samples are mapped close to each other and dissimilar ones far apart. Siamese (Koch, 2015), Matching (Vinyals et al., 2016) and Prototypical (Snell et al., 2017) networks belong to this category.

Supervised generalization
In recent years, several approaches have been introduced and refined to overcome the issue of the data-limited regime. As an example, Prototypical Neural Networks (PNNs), developed by Snell et al. (2017) originally for image classification, were used to extract representative characteristics of the data by mapping data points into an embedding space where each sample clusters around its respective prototype representation. Fort (2017) proposed to extend their work by adding a confidence region around prototypes with the help of Gaussian covariance models. With the aim of improving the generalization capacity of metric-based methods, Wang et al. (2018) proposed to enforce a large margin between the class prototypes by modifying the standard softmax loss function.

Semi-supervised generalization
Other approaches, closely related to the aforementioned ones, proposed to take advantage of labeled and unlabeled data. Among them, Boney and Ilin (2017) extended PNNs to address semi-supervised image classification problems. They applied a hard clustering to assign the class for the unlabeled examples within the latent space learned by the PNNs. A close method was developed by Ren et al. (2018) to refine the prototype generation process with clustering. The authors introduced distractor classes with the aim of handling unlabeled samples not belonging to any of the known classes.
Most of these approaches have mainly been explored in the field of computer vision, and only a few of them have been applied to NLP, such as Natural Language Understanding (NLU).

NLU using low-shot learning
A number of different deep learning approaches have been applied to the problem of language understanding in recent years. For a thorough overview of deep learning methods in conversational language understanding, we refer the readers to Gao et al. (2018). In the context of relying on limited training resources, few-shot learning has been used for NLU tasks. Yazdani and Henderson (2015) proposed a method to leverage unlabeled data in order to find the separating hyperplanes that divide the utterances with the same label from those with different labels. Sun et al. (2019) extended PNNs for intent classification using hierarchical attention mechanisms when generating the prototype representations.
Slot filling using few-shot models has also been explored. Ferreira et al. (2015) presented a zero-shot approach based on a knowledge base and on word representations learned from unlabeled data. Apart from Zhang et al. (2020), who focus on handling new and low-resource languages for machine translation, to the best of our knowledge there are no approaches that combine cross-lingual transfer and meta-learning methods for NLU tasks.

Approach
In this section, we present the design of a Prototypical Neural Network and its episodic training procedure before introducing our approach.

Prototypical Neural Networks
Prototypical Neural Networks (Snell et al., 2017), or PNNs, are based on the computation of distance measures between seen-class prototypes and unseen examples. More specifically, an embedding is generated for each example x ∈ R^D using a neural-network-based function f(·) parameterized by Θ. This function enhances the encoding process with better separability properties through a non-linear mapping f_Θ : R^D → R^M. The M-dimensional prototype of each class is formed as the centroid c_i of its embedded support points, as seen in Equation (1):

c_i = (1 / |S_i|) Σ_{(x_j, y_j) ∈ S_i} f_Θ(x_j)    (1)

where S_i represents the set of examples labeled with class i and y_j the corresponding label of x_j. Equation (2) shows how, given a query (that is, a new, unlabeled sample) q_i, the probability distribution over the prototypes is computed from d(·, ·), an arbitrary similarity measure such as the squared Euclidean distance or cosine similarity:

p_Θ(y = i | q_i) = exp(−d(f_Θ(q_i), c_i)) / Σ_{i'} exp(−d(f_Θ(q_i), c_{i'}))    (2)
Finally, the class with the highest probability is chosen by a softmax over the distances and, at optimization time, the negative log-probability J(Θ) = −log p_Θ(y_i | q_i) of the true class of each query point is minimized by stochastic gradient descent during an episodic learning process described in the next subsection.
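As an illustration of Equations (1) and (2), the prototype computation and the softmax over negative distances can be sketched in plain Python. The helper names and the dictionary-based data layout are ours, not the original implementation, and real embeddings would come from a trained encoder rather than be given directly:

```python
import math

def prototypes(support):
    """Equation (1): one centroid prototype per class from embedded support points.

    support: dict mapping class label -> list of embedding vectors (lists of floats).
    Returns a dict mapping each label to its centroid vector.
    """
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[d] for v in vectors) / len(vectors)
                         for d in range(dim)]
    return protos

def classify(query, protos):
    """Equation (2): softmax over negative squared Euclidean distances.

    query: embedding vector of an unlabeled sample.
    Returns a dict mapping each class label to its probability.
    """
    neg_dists = {c: -sum((q - p) ** 2 for q, p in zip(query, proto))
                 for c, proto in protos.items()}
    # subtract the max for numerical stability before exponentiating
    m = max(neg_dists.values())
    exps = {c: math.exp(v - m) for c, v in neg_dists.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}
```

A query is then assigned the label with the highest probability, and −log of the true class's probability gives the training loss for that query point.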

Episodic learning
With the aim of generalizing to unseen classes from zero to few training examples per class, PNNs are trained on a collection of N-way, k-shot classification tasks through an episodic training procedure (Vinyals et al., 2016). Specifically, each episode is one mini-batch consisting of k examples from each of N classes (both randomly sampled), used to form a labeled set (support S) and an unlabeled set of examples (query Q). The parameter k often takes a very small value, meaning we have zero-to-k labeled samples. During training, the model is fed with S to construct the class prototypes using Equation (1). Its parameters are learned in order to minimize the prototypical loss of its predictions for the examples in the given Q, according to Equation (3) of Section 3.1. The evaluation is done by averaging the classification performances on the query sets of many testing episodes.
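The episode-sampling step above can be sketched as follows. This is a minimal, illustrative version: the function and variable names are ours, and a real pipeline would draw utterances from the NLU corpus rather than a toy dictionary:

```python
import random

def sample_episode(dataset, n_way=5, k_shot=10, n_query=15, rng=None):
    """Sample one N-way k-shot episode: a support set S and a query set Q.

    dataset: dict mapping class label -> list of examples (utterances here).
    Returns (support, query) dicts over the same n_way sampled labels,
    with disjoint examples in each.
    """
    rng = rng or random.Random()
    labels = rng.sample(sorted(dataset), n_way)
    support, query = {}, {}
    for label in labels:
        # draw k_shot + n_query distinct examples, then split them
        examples = rng.sample(dataset[label], k_shot + n_query)
        support[label] = examples[:k_shot]
        query[label] = examples[k_shot:]
    return support, query
```

Each training step would then build prototypes from the support set and backpropagate the prototypical loss on the query set; evaluation averages accuracy over many such episodes.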

Transformer-based PNNs
Studies have demonstrated that contextualized representations produced by language models such as ELMo (Peters et al., 2018) or BERT (Devlin et al., 2019) give neural networks better training initializations. Rather than training the PNN encoder from scratch with feature extractors such as convolutional or recurrent networks, we build on the pre-trained multilingual BERT (mBERT) to test the distinctiveness of the representation of each class across languages. The embedding layer is initialized with the pre-trained mBERT embeddings and fine-tuned together with a dense linear layer that defines the embedding space in which the prototype-based classifier operates. This latent space is used to learn prototypes of each class by estimating their mean, and the chosen class is derived from the output layer of the network based on a softmax over distances to the class prototypes. The motivation behind fine-tuning the encoder with the prototypical loss is to induce better generalization properties at test-time to new class labels not seen during training, given only a few examples.

The cross-lingual way
As introduced earlier, even though recent works demonstrate the strong cross-lingual transfer capability of multilingual pre-trained BERT, it exhibits limitations when applied to low-resource languages (Pires et al., 2019; Conneau et al., 2020).
To enable cross-lingual transfer according to our few-shot scenario, we construct multiple episodic batches E. From the available data, we draw the task sets by sampling a subset of labels to form a support set from data in the high-resource languages and a query set from data in the low-resource languages to be evaluated. NLU data consists of utterances composed of sentence-level intent labels and sequences of slot labels annotated in BIO format (Ramshaw and Marcus, 1995) to define the boundary of slots. The N-way k-shot NLU task is then defined as follows: given an input query utterance in a new language q_i and a k-shot support set S as references, find the most appropriate intent label or slot label sequence y.
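Under our assumptions about the data layout (a nested language → label → utterances dictionary; the names are illustrative, not the authors' exact code), one cross-lingual episode can be drawn as:

```python
import random

def cross_lingual_episode(data, source_langs, target_lang,
                          n_way=5, k_shot=10, rng=None):
    """Build one cross-lingual episode: support drawn from high-resource
    source languages, query drawn from the low-resource target language.

    data: dict mapping language -> (dict mapping label -> list of utterances).
    Returns (support, query) dicts over n_way labels shared by both sides.
    """
    rng = rng or random.Random()
    # pool the source-language utterances per label
    source_pool = {}
    for lang in source_langs:
        for label, utts in data[lang].items():
            source_pool.setdefault(label, []).extend(utts)
    # only labels annotated in both the source pool and the target language
    shared = sorted(set(source_pool) & set(data[target_lang]))
    labels = rng.sample(shared, n_way)
    support = {l: rng.sample(source_pool[l], k_shot) for l in labels}
    query = {l: list(data[target_lang][l]) for l in labels}
    return support, query
```

Prototypes are then built from the source-language support set, and the target-language queries are classified against them, which is what makes the transfer cross-lingual.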

Experiments
Our NLU experiments on cross-lingual and few-shot learning for under-resourced languages are conducted on the MultiATIS++ corpus (Xu et al., 2020; Upadhyay et al., 2018), whose description follows.

Models
For all the baseline models built, we use the publicly available mBERT models pre-trained on over a hundred different languages (Devlin et al., 2019). We use the fine-tuning procedure (Devlin et al., 2019) of the original mBERT model as our baseline. In sequence-level and token-level classification tasks, it takes the final hidden states (the last-layer output of the multi-head Transformer) of the first [CLS] sequence token or of each individual token representation as input to the prediction layer to compute classification scores. Since we plan to use transfer learning in the context of PNNs, we fine-tune the pre-trained mBERT model together with a dense linear layer that defines the embedding space (Section 3.3).

Training configurations
We perform three sets of experiments: target only, multilingual and multilingual zero-shot.
• target only: this configuration consists of using only the target language data.

Table 2: Averaged intent accuracies obtained with PNNs on 5-way k-shot classification, k ∈ {1, 10} (best scores are marked in bold), and baseline results.
We also considered two cross-lingual classification tasks with a varying quantity of data between source and target languages to investigate the behaviour of different types of knowledge transfer.
• multilingual: where the training strategy aims to train a network on the concatenation of all of the nine languages and testing the model for each target language.
• multilingual zero-shot: where the training relies on the concatenation of all training datasets from all languages except the one we want to test.
This works only for the baseline approach (mBERT); with our PNN approach (mBERT+PNN), we perform few-shot learning. This means we use only a few training examples in the considered language (target only and multilingual configurations). For instance, when we evaluate our approach on the English task, we consider only a fraction of the English training dataset to train our mBERT+PNN model in the target only configuration. In the multilingual configuration, our few-shot approach (mBERT+PNN) is trained using only a fraction of the examples provided for each language.

Training details
For all the baseline models built, we use the publicly available mBERT models pre-trained on over a hundred different languages (Devlin et al., 2019). We trained them for 20 epochs, following Xu et al. (2020).
PNN training was run for 1,000 episodes using the Euclidean distance, as suggested by the original authors (Snell et al., 2017). We treated the number of shots k as a configuration parameter and tried 5-way k-shot intent classification with k ∈ {1, 10} (5w1s and 5w10s) and 5-way 10-shot slot filling.
For all approaches we use AdamW optimizer (Loshchilov and Hutter, 2017) using a learning rate of 5e-5 to apply gradients with respect to the loss and weight decay.
All results are reported as performances averaged over 30 runs for intent classification and over 5 runs for slot filling (fewer runs because of the higher training time).

Results
Our experimental findings are summarized in Tables 2 and 3 for the intent classification and the slot-filling tasks, respectively.

Intent classification results
In the target only configuration, the baseline obtains its best scores when applied to high-resource languages, e.g. English (en), French (fr) or German (de), reaching nearly identical high scores. We obtain the highest baseline scores with an accuracy of 98.8 on the French model, followed by the English model with an accuracy of 98.5. Unlike these mainstream languages, the baseline is less accurate on under-resourced languages, with a loss of 7 to 15 points in intent classification on Hindi (hi) and Turkish (tr) respectively.
In multilingual configuration, baseline models perform reasonably well over all the high-resource languages with a significant performance boost due to the availability of additional data. The mBERT + PNN (5w10s) models outperformed the baseline for all languages, except for the Turkish (tr) language.
When transferring from all languages to an unseen one (multilingual zero-shot configuration) we observe the best results for the mBERT model, except Portuguese (pt) and English (en) languages, in which the mBERT + PNN (5w10s) is 0.5 points better.
Finally, within the framework of the intent classification task, the mBERT + PNN (5w10s) model achieves better overall performances in the multilingual configuration, especially in the case of under-resourced languages with a gain up to 9 points of accuracy, compared to the target-only configuration and an average of one point compared to the best model in the multilingual zeroshot configuration.

Slot-filling results
In the target only configuration, slot-filling results are about one F1 point higher for the mBERT + PNN (5w10s) model than for the baseline model (mBERT). The mBERT + PNN (5w10s) model even outperformed the baseline by more than 4 F1 points on the Turkish task (tr).
We can observe the same trend in the multilingual configuration: our approach outperformed the baseline in all languages.
On the contrary, the mBERT + PNN (5w10s) fails in most of language tasks in the multilingual zero-shot configuration, except for the Hindi (hi) and the Turkish (tr) languages.
Finally, like the intent classification task, the mBERT + PNN (5w10s) model achieves better overall performance in the multilingual configuration for all languages.

Result analysis
First, our baseline results are on par with those obtained by Qin et al. (2019) and Xu et al. (2020) when they trained BERT-based models using only English training data (en), with intent accuracy scores of 97.5% and 96.08% while we obtain 98.5%. The same holds for our slot-filling experiment, in which they report 94.7 F1 points while we obtain 95.6. This difference may come from our results being averaged over 30 runs for intent classification and 5 runs for slot filling, while previous works only performed 5 runs. We also observe that, just like Xu et al. (2020), slot filling on Spanish (es) leads to lower results, similar to those obtained in our few-shot setting.
When transferring from all languages to an unseen one (multilingual zero-shot configuration in both Tables 2 and 3), we obtained lower scores than in the multilingual configuration. This means the multilingual representation captured by mBERT is effective when data is available in several languages and none is available in the considered target language. But, in both cases, the combination mBERT+PNN performs better when less data is available, using the few-shot approach (the multilingual configuration). This means that our approach quickly adapts to the considered target language with only a few examples available and enhances mBERT's multilingual transfer learning capabilities. This is especially true in the case of slot filling, with gains in F1-score ranging from 2 to 5 points.
Finally, using the mBERT baseline model, transfer learning to French or German has performance scores similar to English while using the Turkish (tr) or Hindi (hi) yielded significant loss. This leads us to the same conclusion as Xu et al. (2020): exploiting language interrelationships learnt with transfer learning improve the model performances. This may come from the fact that French, English and German are similar and share some vocabulary while Turkish or Hindi are dissimilar to European languages (Hock and Joseph, 2019).
A detailed inspection of the PNN results shows that, in the target only and multilingual configurations, there is an overall, substantial reduction in recall, which is balanced by an improvement in precision. A deeper analysis of the mislabeled examples shows that applying PNNs helps prevent the overlapping and annotation-mismatch cases that occur in the data.
We observed that the MultiATIS++ corpus is highly unbalanced, with the number of training examples per class varying from 1 to 3,300. This impacts model performance, and it could explain why we observe a lower recall and an improved precision with our approach, since it is based on reducing the amount of training data.

Conclusions
In this paper, we demonstrate the opportunities in leveraging mBERT-based modeling with few-shot learning for both intent classification and slot filling on under-resourced languages. We found that our approach is a highly effective technique for training models for low-resource languages. This illustrates the performance gains that can be achieved by exploiting language interrelationships learnt with transfer learning, a conclusion further emphasised by the fact that the multilingual results outperformed those of the other configurations (target only and multilingual zero-shot). From this work a new challenge naturally comes up, and a possible direction is to adapt the few-shot setting to a joint approach of intent detection and slot filling, as in Zhang and Wang (2016), Liu and Lane (2016) and Zhang et al. (2019), which demonstrate that performing these two tasks jointly improves the performance of both.