Nearest Neighbour Few-Shot Learning for Cross-lingual Classification

Even though large pre-trained multilingual models (e.g., mBERT, XLM-R) have led to significant performance gains on a wide range of cross-lingual NLP tasks, success on many downstream tasks still relies on the availability of sufficient annotated data. Traditional fine-tuning of pre-trained models using only a few target samples can cause over-fitting. This is quite limiting, as most languages in the world are under-resourced. In this work, we investigate cross-lingual adaptation using a simple nearest-neighbor few-shot (<15 samples) inference technique for classification tasks. We experiment with a total of 16 distinct languages across two NLP tasks: XNLI and PAWS-X. Our approach consistently improves on traditional fine-tuning using only a handful of labeled samples in target locales. We also demonstrate its generalization capability across tasks.


Introduction
The rise of massively pre-trained multilingual language models (LMs) 1 (Lample and Conneau, 2019; Conneau et al., 2020; Chi et al., 2020; Luo et al., 2020; Xue et al., 2020) has significantly improved cross-lingual generalization across many languages (Wu and Dredze, 2019; Pires et al., 2019; K et al., 2020; Keung et al., 2019). Recent work on zero-shot cross-lingual adaptation (Bari et al., 2020; Fang et al., 2020; Pfeiffer et al., 2020), in the absence of labeled target data, has also demonstrated impressive performance gains. Despite these successes, however, there still remains a sizeable gap between supervised and zero-shot performance. On the other hand, when limited target-language data are available (i.e., the few-shot setting), traditional fine-tuning of large pre-trained models can cause over-fitting (Perez and Wang, 2017).

One way to deal with the scarcity of annotated data is to augment synthetic data using techniques like paraphrasing (Gao et al., 2020; Du et al., 2020), word translation (Xie et al., 2018; Mohiuddin and Joty, 2020; Mohiuddin et al., 2020), machine translation (Sennrich et al., 2015), data augmentation (Ding et al., 2020; Liu et al., 2021; Laskar et al., 2020), and/or data diversification (Nguyen et al., 2019; Mohiuddin et al., 2021; Bari et al., 2021). Few-shot learning, on the other hand, deals with out-of-distribution (OOD) generalization using only a small amount of data (Koch, 2015; Vinyals et al., 2016; Snell et al., 2017; Santoro et al., 2017; Finn et al., 2017). In this setup, the model is evaluated over few-shot tasks, such that it learns to generalize to new data (the query set) using only a handful of labeled samples (the support set). In a cross-lingual few-shot setup, the model learns cross-lingual features to generalize to new languages.

* Work done while Saiful was interning at Amazon AI.
1 We loosely use the term LM to describe unsupervised pre-trained models, including masked LMs and causal LMs.
Recently, Nooralahzadeh et al. (2020) used meta-learning (Finn et al., 2017) for few-shot adaptation on several cross-lingual tasks. Their few-shot setup used the full development datasets of various target languages (the XNLI development set, for instance, has over 2K samples). In general, they showed the effectiveness of cross-lingual meta-training in the presence of a large quantity of OOD data. However, they did not provide any fine-tuning baseline. Lauscher et al. (2020), on the other hand, explored few-shot learning but did not explore beyond fine-tuning. To the best of our knowledge, there has been no prior work in cross-lingual NLP that uses only a handful of target samples (<15) and yet surpasses or matches traditional fine-tuning (on the same number of samples). Traditional fine-tuning (parametric) approaches require proper hyper-parameter tuning of the learning rate, scheduling, optimizer, batch size, and up-sampling of the few-shot support samples; failing to do so often leads to model over-fitting. It can also be expensive to update the parameters of a large model frequently for few-shot adaptation, each time there is a fresh batch of support samples. As models grow bigger, updating weights frequently for few-shot adaptation becomes almost unscalable: it takes a significant amount of time to compute gradient updates for a few samples and then perform inference. There have been previous successful attempts to inject external knowledge via non-parametric methods (Wang et al., 2017; Khandelwal et al., 2019, 2020). In this work, we explore a simple Nearest Neighbor Few-shot Inference (NNFS) approach for cross-lingual classification tasks. Our main objective is to utilize very few samples to perform adaptation to a given target language. To achieve this, we first fine-tune a multilingual LM on a high-resource source language (i.e., English), and then apply few-shot inference using a few support examples from the target language.
Unlike other popular meta-learning approaches that focus on improving the fine-tuning/training setup to achieve better generalization (Finn et al., 2017; Ravi and Larochelle, 2017b), our approach applies at the inference phase. Hence, we do not update the weights of the LM using target-language samples. This makes our approach complementary to other regularized fine-tuning based few-shot meta-learning approaches.
Our key contributions are as follows:
• We propose a simple method for cross-lingual few-shot adaptation on classification tasks during inference. Since our approach applies at inference time, it does not require updating the LM weights using target-language data.
• Using only a few labeled target support samples, we test our approach across 16 distinct languages belonging to two NLP tasks and achieve consistent improvements over traditional fine-tuning.
• We demonstrate that our proposed method generalizes well not only across languages but also across tasks.
• As the support sets are minimal in size, results obtained using them can suffer from high variability. We borrow the idea of episodic testing, widely used in computer vision few-shot tasks, to evaluate few-shot performance for NLP tasks (more details in Section 3.3).
We also open-source our implementation 2 .
The objective of few-shot learning is to adapt from a source distribution to a new target distribution using only a few samples. The traditional few-shot setup (Finn et al., 2017; Snell et al., 2017; Vinyals et al., 2016) involves adapting a model to the distribution of new classes. Similarly, in a cross-lingual setup, we adapt a pre-trained LM that has been fine-tuned on a high-resource language to a new target-language distribution (Lauscher et al., 2020; Nooralahzadeh et al., 2020).

Setup
We begin by fine-tuning a pre-trained model θ_lm (Conneau et al., 2020) on a specific task T_s using a high-resource (source) language dataset D_src = (X_src, Y_src), to get an adapted model θ^src_Ts. We use θ^src_Ts to perform few-shot adaptation. In our few-shot setup, we assume we possess very few labeled support samples D_s = (X_s, Y_s) from the target-language distribution. A support set covers C classes, where each class carries N samples. This is a standard C-way-N-shot few-shot learning setup. The objective of our proposed method is to classify the unlabeled query samples D_q = (X_q). We denote the latent representations of the support and query samples as X̄_s and X̄_q, respectively, where X̄_s = θ^src_Ts(X_s) and X̄_q = θ^src_Ts(X_q).
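The C-way-N-shot split described above can be sketched as follows. This is a minimal Python sketch; the function and variable names are ours, and the sampling is illustrative rather than the paper's exact data pipeline:

```python
import random
from collections import defaultdict

def sample_episode(examples, labels, n_way, k_shot, n_query, seed=None):
    """Sample one C-way-N-shot episode: a labeled support set D_s and a
    query set D_q, drawn from parallel lists of examples and labels."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    # pick n_way classes, then k_shot support + n_query query items per class
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(x, c) for x in picked[:k_shot]]
        query += [(x, c) for x in picked[k_shot:]]
    return support, query
```

A 3-way-5-shot episode, for example, yields 15 labeled support pairs plus the sampled queries.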

Nearest Neighbor Classification
Let |D_s| and |D_q| be the total numbers of support and query samples. For the query samples X_q, the feature representations X̄_q are obtained by forward propagation through the θ^src_Ts model. For each query representation x̄_q, we define a latent binary assignment vector y_q = [y_{q,1}, y_{q,2}, ..., y_{q,C}], where each y_{q,i} is a binary variable such that Σ_i y_{q,i} = 1. Let Y_q denote the R^{N_q × C} matrix in which each row is the y_q vector of one query. We compute the centroid m_c of each class by taking the mean of its support representations X̄_s:

    m_c = (1/N) Σ_{x̄_s ∈ class c} x̄_s    (1)

Next, we compute the distance between each x̄_q and m_c:

    d(x̄_q, m_c) = ||x̄_q − m_c||_2    (2)

Our loss function becomes

    L(Y_q) = Σ_q Σ_c y_{q,c} d(x̄_q, m_c),

which we minimize over the binary assignments. Finally, we assign each x̄_q the label of the class to which it has the minimum distance, using the following function:

    ŷ_q = argmin_c d(x̄_q, m_c)    (3)

Algorithm 1 Nearest Neighbor Few-shot Inference
Input: Model θ^src_Ts trained using the source language, support set D_s = (X_s, Y_s), query set D_q = (X_q), mean representation m_s of train/dev samples
Output: Distribution of the query labels, Y_q
1: /* feature representation normalization */
2: X̄_s, X̄_q = θ^src_Ts(X_s), θ^src_Ts(X_q)
3: X̄_s, X̄_q = X̄_s − m_s, X̄_q − m_s
...
/* calculate the mean representation of each class */
9: m_c = (1/N) Σ_{x̄_s ∈ class c} x̄_s

Traditional inductive inference handles each query sample (one at a time) independently of the other query samples. On the contrary, our proposed approach includes additional Normalization and Transduction steps. Algorithm 1 illustrates our approach. Here we discuss these additional steps in more detail.
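The nearest-centroid assignment described above can be sketched in NumPy. This is a minimal sketch with our own function name and shapes; feature extraction by the fine-tuned LM is assumed to have happened upstream:

```python
import numpy as np

def nearest_centroid_predict(x_support, y_support, x_query):
    """Assign each query the label of the nearest class centroid.

    x_support: (Ns, d) support features, y_support: (Ns,) int labels,
    x_query: (Nq, d) query features.
    """
    classes = np.unique(y_support)
    # m_c: mean of the support representations of each class
    centroids = np.stack([x_support[y_support == c].mean(axis=0) for c in classes])
    # Euclidean distance from every query to every centroid
    dists = np.linalg.norm(x_query[:, None, :] - centroids[None, :, :], axis=-1)
    # label of the minimum-distance centroid
    return classes[dists.argmin(axis=1)]
```

No gradients are involved, which is what makes the inference step cheap relative to fine-tuning.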
Norm. We measure the cross-lingual shift as the difference between the mean representations of the support set (target language) and the training set (en), m_s. We then perform cross-lingual shift correction on the query set. To achieve this, we first extract the latent representations of both support and query samples from θ^src_Ts. We then center the representations (Alg. 1, line 3) by subtracting the mean representation of the source-language train/dev data, followed by L2 normalization of both representations. Algorithm 1 (lines 2-7) details this step further.
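The centering and normalization step can be sketched as follows. This is a minimal sketch under our own naming; m_src stands for the mean representation of the English train/dev data:

```python
import numpy as np

def shift_correct(x_support, x_query, m_src):
    """Center support/query features by the source-language mean,
    then L2-normalize each row."""
    xs = x_support - m_src
    xq = x_query - m_src
    xs /= np.linalg.norm(xs, axis=1, keepdims=True)
    xq /= np.linalg.norm(xq, axis=1, keepdims=True)
    return xs, xq
```

After this step, distances between queries and centroids are computed in the shift-corrected space.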
Transduction. We apply prototypical rectification (proto-rect) (Liu et al., 2019) on the extracted features of the LM. In the rectification step, to compute m_c (in Alg. 1), we initially obtain the mean representation of each support class by taking a weighted combination of X̄_s and X̄_q. Finally, we calculate predictions on the query set using Equation 3. We also present our proposed NNFS inference in Figure 2 in the Appendix.
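A prototype-rectification step of this flavor can be sketched as follows: start from plain support centroids, softly assign queries to them, and recompute each prototype as a weighted mean of support and query features. The softmax weighting shown is one plausible instantiation, not the exact formula of Liu et al. (2019) or of our implementation:

```python
import numpy as np

def rectified_prototypes(x_support, y_support, x_query, temperature=1.0):
    """Transductive prototype rectification sketch: prototypes are a
    weighted combination of support features and softly-assigned queries."""
    classes = np.unique(y_support)
    protos = np.stack([x_support[y_support == c].mean(axis=0) for c in classes])
    # soft assignment of queries to the initial prototypes
    dists = np.linalg.norm(x_query[:, None, :] - protos[None, :, :], axis=-1)
    logits = -dists / temperature
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # (Nq, C) soft weights
    new_protos = []
    for i, c in enumerate(classes):
        num = x_support[y_support == c].sum(axis=0) + (w[:, i:i + 1] * x_query).sum(axis=0)
        den = (y_support == c).sum() + w[:, i].sum()
        new_protos.append(num / den)
    return classes, np.stack(new_protos)
```

Because the prototypes now depend on the whole query batch, predictions can vary with the query-set composition, which is exactly why episodic testing is needed.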

Data
We use two standard multilingual datasets -- XNLI (Williams et al., 2018) (15 languages) and PAWS-X (Zhang et al., 2019) (7 languages) -- to evaluate our proposed method. Additional details on the languages and the complexity of the tasks can be found in the Appendix. For few-shot inference, we use samples from the target-language development data to construct the support sets and the test data to construct the query sets.

Fine-tuning
We use XLM-R-large (Conneau et al., 2020) as our pre-trained language model θ_lm and perform standard fine-tuning using labeled English data to adapt it to the task model θ^src_Ts. We tune the hyper-parameters using the English development data and report results using the best-performing model (the optimal hyper-parameters are listed in the Appendix). We train our model with 5 different seeds and report average results across them. We use the same optimal hyper-parameters to fine-tune on the target languages. As baselines, we add two additional fine-tuning variants named head and full. Full fine-tuning means all the parameters of the model are updated; this is very unlikely to be practical in few-shot scenarios. Head fine-tuning means only the parameters of the last linear layer are updated.

Evaluation Setup
Nooralahzadeh et al. (2020) and Lauscher et al. (2020) used 10 and 5 different seeds, respectively, to measure few-shot performance. As few-shot learning involves randomly selecting small support sets, results may vary greatly from one experiment to the next, and hence may not be reliable (Le et al., 2020). Episodic testing is often used for evaluating few-shot experiments. Each episode is composed of small, randomly selected support and query sets. The model's performance on each episode is noted, and the average performance score, alongside the 95% confidence interval across all episodes, is reported. To the best of our knowledge, episodic testing has not been leveraged for cross-lingual few-shot learning in NLP. We evaluate our approach using 300 episodes per seed model θ^src_Ts, totaling 1,500 test episodes, and report their average scores. For each episode, we perform C-way-N-shot inference. For the 2-way-5-shot setting, for instance, we randomly select 15 query samples per class and 2 × 5 = 10 support samples. For XNLI and PAWS-X, we use 3 and 2 as the value of C, respectively. Our episodic testing approach is detailed further in the Episodic Algorithm of the Appendix.

Table 1: Few-shot XNLI accuracy results across 14 languages with average improvements for each of the methods. All confidence intervals in the experiments are below .07. "fs-3.5" means 3-way-5-shot learning.

Table 3: PAWS-X accuracy results for cross-task experiments across 6 languages. For this experiment, we fine-tuned the XLM-R LM on the XNLI task and then applied few-shot inference on the PAWS-X task.
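The episodic evaluation loop can be sketched as follows. This is a minimal sketch; `run_episode` stands in for sampling a support/query set and running few-shot inference, and the 1.96-standard-error interval is the usual normal-approximation 95% CI:

```python
import math
import random

def episodic_evaluate(run_episode, n_episodes=300, seed=0):
    """Run many random episodes and report the mean accuracy together
    with a 95% confidence interval (1.96 * standard error)."""
    rng = random.Random(seed)
    scores = [run_episode(rng) for _ in range(n_episodes)]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    ci95 = 1.96 * math.sqrt(var / len(scores))
    return mean, ci95
```

Averaging over hundreds of episodes is what makes scores comparable across support-set draws.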

Results and Analysis
After training the model with the source-language samples (i.e., labeled English data), we perform additional fine-tuning using C-way-5-shot target-language samples. Finally, we perform our proposed NNFS inference. The fine-tuning baseline using limited target-language samples results in small but non-significant improvements over the zero-shot baseline. The NNFS inference approach, however, yields performance gains using only 15 (3-way-5-shot) and 10 (2-way-5-shot) support examples for both the XNLI and PAWS-X tasks. When compared to the few-shot baseline, we get an average improvement of 0.6 on XNLI (Table 4) and 1.0 on PAWS-X (Table 5). At first we experimented with 3-shot support samples but did not observe any few-shot capability in the model. We also experimented with a 10-shot setup and found similar improvements of NNFS on top of the fine-tuning baseline (results are in the Appendix). Interestingly, in both cases, we observed higher performance gains on low-resource languages. To further evaluate the effectiveness of our method, we tested it in a cross-task setting. We first trained the model on XNLI (EN data) and then used NNFS inference on PAWS-X. Table 3 demonstrates an impressive average performance gain of +7.3 across all PAWS-X languages over the fine-tuning baseline. In addition, NNFS inference is fast: compared to zero-shot inference (1X), our approach takes only ≈1.36-1.7X the computation time, whereas fine-tuning takes ≈38-40X. Table 6 in the Appendix shows the inference time details for both tasks.

Conclusion
This paper proposes a nearest-neighbour-based few-shot adaptation algorithm accompanied by a necessary evaluation protocol. Our approach does not require updating the LM weights and thus avoids over-fitting to limited samples. We experiment on two classification tasks, and the results demonstrate consistent improvements over fine-tuning not only across languages but also across tasks.

A.1 Decision choice for Episodic Testing
In the traditional testing framework, we sample a batch from the dataset, calculate the batch's predictions, and finally accumulate all the predictions to compute the evaluation-metric score. However, few-shot experiments are quite unpredictable for the following two reasons:
• Support set: the per-class sampling of the support set is random. In a few-shot experiment, we perform inference on the test dataset utilizing the support samples, and for a different support set the predictions may vary drastically. Taking a few samples (i.e., 10 out of 2500 or 15 out of 2000) and repeating the experiment 5-10 times does not reflect the true potential of a few-shot algorithm.
• Transductive inference: in addition, few-shot algorithms often perform transductive inference, where predictions may vary based on the combination of query samples. Hence it is challenging to benchmark few-shot algorithms with the traditional testing framework.
In episodic testing, we randomly sample a query set and a support set from the dataset and perform the few-shot experiment. We repeat the experiments until we obtain a low (95%) confidence interval. In this way, we may iterate over the test dataset 5-10 times more, but the evaluation is not affected by the problems mentioned above and can benchmark any few-shot algorithm properly.

Challenges. The two datasets pose different challenges. The NLI task requires a rich, high-level factual understanding of the text. The PAWS task, on the other hand, contains pairs of sentences that usually have high lexical overlap and may or may not be paraphrases. We use accuracy as the evaluation metric for both datasets.
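The "repeat until the confidence interval is low" loop can be sketched as follows. The function name, the minimum episode count, and the stopping threshold are illustrative choices (the paper only reports CIs below 0.07):

```python
import math
import random

def run_until_tight_ci(run_episode, eps=0.07, min_episodes=30, max_episodes=5000, seed=0):
    """Keep sampling episodes until the 95% CI half-width drops below eps
    (or a hard episode cap is reached); return the mean and episode count."""
    rng = random.Random(seed)
    scores = []
    while len(scores) < max_episodes:
        scores.append(run_episode(rng))
        if len(scores) >= min_episodes:
            mean = sum(scores) / len(scores)
            var = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
            if 1.96 * math.sqrt(var / len(scores)) < eps:
                break
    return sum(scores) / len(scores), len(scores)
```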
10-shot results. For reference, we have added 10-shot experiments for the XNLI and PAWS-X datasets with the same setup as Table 1 and Table 2 of the main paper.

A.3 Hyperparameters and Resource Description
We used 8 V100 GPUs (Amazon p3.16xlarge) to run all experiments. The hyper-parameters of the best-performing model are listed in Table 6. For fine-tuning the pre-trained language model, we search over the boundary values (1e-5, 3e-5, 5e-5, 7.5e-6, 5e-6) for a proper learning rate.

Figure 2: In the Training Step, we train a language model θ_lm on the source-language (en) data (X_src, Y_src) to get θ^src_Ts. In the Few-Shot Inference Step, we apply forward propagation on the θ^src_Ts model using the support and query input samples X_s and X_q and get the latent representations X̄_s and X̄_q. Using X̄_s, we apply normalization and calculate m_c. We then use both X̄_s and X̄_q to compute the unary term a_q, which in turn gives the label distribution of the query samples (see Alg. Few-Shot Inference, lines 14-15).

Algorithm 2 Episodic Testing
Input: Model θ^src_Ts trained using the source language, transductive parameter λ, mean representation m_s of train/dev samples, a threshold value eps, a multiplier τ, input data (C-way-N-shot)
Output: Average score and the confidence interval

Table 4: 10-shot XNLI accuracy results across 14 languages with average improvements for each of the methods. All confidence intervals in the experiments are below .07.