Continual Few-Shot Learning for Text Classification

Natural Language Processing (NLP) is increasingly relying on general end-to-end systems that need to handle many different linguistic phenomena and nuances. For example, a Natural Language Inference (NLI) system has to recognize sentiment, handle numbers, perform coreference, etc. Our solutions to complex problems are still far from perfect, so it is important to create systems that can learn to correct mistakes quickly, incrementally, and with little training data. In this work, we propose a continual few-shot learning (CFL) task, in which a system is challenged with a difficult phenomenon and asked to learn to correct mistakes with only a few (10 to 15) training examples. To this end, we first create benchmarks based on previously annotated data: two NLI datasets (ANLI and SNLI) and one sentiment analysis dataset (IMDB). Next, we present various baselines from diverse paradigms (e.g., memory-aware synapses and Prototypical networks) and compare them on few-shot learning and continual few-shot learning setups. Our contributions are in creating a benchmark suite and evaluation protocol for continual few-shot learning on text classification tasks, and in making several interesting observations on the behavior of similarity-based methods. We hope that our work serves as a useful starting point for future work on this important topic.


Introduction
Large end-to-end neural models are becoming more pervasive in Computer Vision (CV) and Natural Language Processing (NLP). In NLP in particular, large language models such as BERT (Devlin et al., 2019), fine-tuned end-to-end for a task, have advanced the state of the art for many problems such as classification, Natural Language Inference (NLI), and Question Answering (QA) (Devlin et al., 2019; Liu et al., 2019; Wang et al., 2019). End-to-end models are conceptually simpler than the previously popular pipelined models, making them easier to deploy and maintain. However, because large end-to-end models are black boxes, it is difficult to correct the mistakes that they make. Practical, real-world applications of NLP require such mistakes to be corrected on the fly as the system operates. For example, when a translation system makes a harmful mistake (e.g., translates "EMNLP" to "ICML"), a phrase-based system can be corrected by finding and modifying the responsible entries in the phrase table (Zens et al., 2002), whereas there is no equivalent way to correct this in an end-to-end neural MT system. Similarly, systems have been shown to exhibit bias (e.g., gender or racial stereotypes) on certain text inputs, which we would also want to correct on the fly using a few examples.
Further, the examples that provide supervision to correct mistakes or learn a phenomenon are often hard or impossible to acquire (e.g., due to privacy or ethics issues) (Wang et al., 2020). Hence, it is important to effectively learn to correct mistakes using few extra training examples. Recent work has shown the generalization capability of large pre-trained models to handle multiple tasks with zero to few training examples (Schick and Schütze, 2021; Brown et al., 2020; Yin et al., 2020). For example, Yin et al. (2020) showed that a system trained for NLI can be used to perform new tasks zero-shot, i.e., without any task-specific training data. We believe that similar models can be used to rapidly learn to correct a phenomenon within the same task from a few (e.g., 10 or 15) training examples.
From a practical point of view, we need our trained systems to rapidly adapt to new phenomena (or correct their mistakes) using very few extra training examples, and to do so continually as new phenomena (or errors) are discovered over time.
Tackling this important setting, we take a fresh look at continual learning in NLP and formulate a new setting that bears similarity to both continual and few-shot learning, but also differs from both in important ways. We dub the new setting "continual few-shot learning" (CFL) and formulate the following two requirements: (1) models have to learn to correct classes of mistakes (or adapt to new domains) from only a few examples; and (2) they have to maintain performance on previous test sets.
To this end, we propose a benchmark suite and evaluation protocol for continual few-shot learning (CFL) on text classification tasks. Our benchmark suite consists of both existing and newly created datasets. More precisely, we use the dataset with several linguistic categories annotated by Williams et al. (2020) from ANLI Round-3 (Nie et al., 2020); we also provide two new datasets with linguistic categories that we annotated using the counterfactually augmented data provided by Kaushik et al. (2020) for the SNLI natural language inference dataset (Bowman et al., 2015) and the IMDB sentiment analysis dataset (Maas et al., 2011). We discuss several methods as important, promising baselines for CFL, borrowing from the few-shot learning and continual learning literature. We classify these baselines into parameter correction methods (e.g., MAS (Aljundi et al., 2018)) and non-parametric feature matching methods (e.g., Prototypical networks (PN) (Snell et al., 2017)). We compare these methods on our benchmark suite in a traditional few-shot setup and observe that non-parametric feature matching methods perform surprisingly well compared to the other methods. Next, we test the same methods in a continual few-shot setup and observe that a simple fine-tuning method performs better than other parameter correction methods like MAS. The non-parametric feature matching based PN performs well on the examples being corrected (the few-shot categories), but at the expense of the original performance. Further, we also observe a large performance improvement on the few-shot categories in this setup. Additionally, we provide interesting ablations to understand the usefulness and generalization capabilities of PN for few-shot linguistic categories. We compare models trained with the cross-entropy loss versus the Prototypical loss via empirical studies and t-SNE plots, and discuss their major differences in detail. We hope that our CFL benchmark suite and evaluation protocol will serve as a useful starting point and encourage substantial progress and future work by the community on this important practical setting.

Related Work
CFL bears similarity to few-shot learning, continual learning, and online learning. Below, we discuss these three paradigms and highlight their similarities to and differences from our approach.
Few-Shot Learning. The goal in few-shot learning is to learn a new task from only a few labeled examples. Few-shot learning problems have been studied in the image domain (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Ren et al., 2018; Sung et al., 2018), focusing mainly on two kinds of approaches: metric-based and optimization-based approaches. Metric-based approaches learn generalizable metrics and corresponding matching functions from multiple training tasks with limited labels (Vinyals et al., 2016). For example, Snell et al. (2017) propose to build a representation for each class from supporting examples and then compare test instances to these representations by Euclidean distance. Optimization-based approaches aim to learn to optimize model parameters based on the gradients computed from limited labeled examples (Ravi and Larochelle, 2017; Munkhdalai and Yu, 2017; Finn et al., 2017).
In the language domain, prior work proposed to use a weighted combination of multiple metrics obtained from meta-training tasks for inference on a newly-seen few-shot task. On the dataset side, Han et al. (2018) introduce a few-shot relation classification dataset. Recently, large-scale pre-trained language models have been used for few-shot learning of downstream tasks (Brown et al., 2020; Schick and Schütze, 2021). Yin et al. (2020) used a pre-trained entailment system to generalize across more domains or new tasks when only a handful of labeled examples are available.
All of the above-mentioned approaches focus on few-shot learning of new tasks. In contrast, we consider the same original task, but target examples that can be considered new because they require solving a linguistic phenomenon, an error category, or a new domain. Unlike few-shot learning, we also require models to maintain or improve performance on the existing data.
Continual Learning. Continual learning is a long-standing challenge for machine learning (French, 1999; Hassabis et al., 2017), defined as an adaptive system capable of learning from a continuous stream of information, where the information progressively increases over time and there is no predefined number of tasks to be learned. The majority of methods in continual learning focus on sequential training of various 'tasks' (not necessarily of the same kind) and address the catastrophic forgetting problem. These approaches can be broadly classified into: (1) architectural approaches that focus on altering the architecture of the network to reduce interference between the tasks without changing the objective function (Razavian et al., 2014; Donahue et al., 2014; Yosinski et al., 2014; Rusu et al., 2016); (2) functional approaches that focus on penalizing changes in the input-output function of the neural network (Jung et al., 2018; Li and Hoiem, 2017); and (3) structural approaches that introduce constraints on how much the parameters change when learning the new task, so that they remain close to their starting point (Kirkpatrick et al., 2017). Other notable works in recent years are based on using intelligent synapses to accumulate task-related information over time (Zenke et al., 2017), using online variational inference (Nguyen et al., 2018), and dynamically expanding network capacity based on incoming data (Yoon et al., 2018). Our setup is important for practical usage. The closest work to ours is from the vision community, which proposed a benchmark suite containing few-shot datasets for continual learning along with evaluation criteria (Antoniou et al., 2020). However, the major contrast is that our setup focuses on correcting errors specific to a linguistic phenomenon rather than learning new class labels from few examples.

Table 1: Examples of annotated few-shot categories from our benchmark suite.

Dataset: ANLI R3. Categories: Numerical, Reference.
Context: Police said that a 21-year-old man was discovered after he had been shot in South Jamaica on Aug. 18 and is in critical condition. Just before 9:30 p.m., police responded to a shooting at 104-46 164th St and discovered the victim, whose name has not been released, at the scene. The victim was shot in the thigh and transported to Jamaica Hospital, where he is currently listed in critical condition. No arrests have been made in the incident.
Hypothesis: The victim was less than a quarter century old.
Label: Entailment

Dataset: IMDB. Category: Negation.
Original Text: We know from other movies that the actors are good but they cannot save the movie. A waste of time. The premise was not too bad. But one workable idea (interaction between real bussinessmen and Russian mafia) is not followed by an intelligent script.
Revised Text: We know from other movies that the actors are good and they make the movie. Not at all a waste of time. The premise was not bad. One workable idea (interaction between real bussiness men and Russian mafia) is followed by an intelligent script.
Online Learning. Online learning algorithms learn to update models from data streams sequentially, where the task is the same but can exhibit concept drift (new patterns) (Zinkevich, 2003; Crammer et al., 2006; Sahoo et al., 2018; Jerfel et al., 2019; Javed and White, 2019). Our setup is different from online learning because we start with a model that is fully trained on a task (i.e., there are no large sequential data streams), and we only focus on correcting errors specific to linguistic phenomena given a few extra training examples.

Datasets
In this section, we describe all the English datasets, curated by us or borrowed from previous work, that constitute our benchmark suite for continual few-shot learning (CFL).

ANLI R3 Few-Shot Categories

Williams et al. (2020) annotated the development set of ANLI Round-3 (Nie et al., 2020) with several linguistic categories: (1) Numerical: require reasoning about numbers or quantities. (2) Basic: require reasoning based on lexical hyponymy, conjunction, and negation.
(3) Reference: noun or event references need to be resolved either within or between premise and hypothesis. (4) Tricky: require complex linguistic knowledge, e.g., pragmatics or syntactic verb argument structure. (5) Reasoning: require reasoning outside of the given premise and hypothesis pair.
(6) Imperfections: examples that contain spelling errors or foreign language content, or are ambiguous. We refer to Williams et al. (2020) for more details on each of these categories. We use these category annotations to create a CFL setup. Unlike previous few-shot learning setups, we focus on few-shot learning of linguistic phenomena (6 categories in this case), instead of new tasks, classes, or domains. We use the Round-3 (R3) development set and consider all 6 of the above categories as different few-shot learning cases (labeled ANLI R3 categories in the rest of the paper). In our framework, we consider two scenarios: (1) a few-shot learning setup; and (2) a continual few-shot learning setup.2 In the few-shot learning setup, for each category, we choose 5 disjoint training sets, with each set containing 5 examples from each class label. Table 2 presents the full statistics on all 6 categories.
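To make the split construction concrete, the following is a minimal sketch of how such balanced, disjoint few-shot training sets can be built; the dict-based example format and the "label" field are assumptions for illustration, not the exact code used for the benchmark.

```python
import random
from collections import defaultdict

def balanced_few_shot_sets(examples, num_sets=5, per_class=5, seed=0):
    """Build num_sets disjoint training sets, each containing per_class
    examples from every class label (sketch)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex["label"]].append(ex)
    for pool in by_class.values():
        rng.shuffle(pool)
    sets = []
    for i in range(num_sets):
        current = []
        for pool in by_class.values():
            # disjoint slices across sets guarantee no example reuse
            current.extend(pool[i * per_class:(i + 1) * per_class])
        sets.append(current)
    return sets
```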

SNLI Counterfactual Few-Shot Categories
The Stanford NLI (SNLI) dataset (Bowman et al., 2015) is a popular natural language inference dataset where, given a premise and a hypothesis, the task is to predict whether the hypothesis entails, contradicts, or is neutral with respect to the premise. Kaushik et al. (2020) collected counterfactually augmented data for SNLI, where annotators minimally revised examples to change the label. We annotated these revisions into several linguistic categories, e.g., using abstraction to change the label (man vs. person).3 A few examples did not fall into any of these categories; these are labeled as 'Other' and discarded. We follow data splits similar to those discussed for the ANLI R3 few-shot categories, except that we use only 3 training sets instead of 5 in the few-shot learning setup. We could not obtain enough balanced training sets for the negation, numerical changes, and using abstraction categories, hence we discarded them. The statistics of the remaining categories are presented in Table 3.

3 We refer to Sec. 3.4 for more details about the annotation.

IMDB Counterfactual Few-Shot Categories
Kaushik et al. (2020) also provide counterfactually augmented data for the IMDB sentiment analysis dataset (Maas et al., 2011), where annotators minimally revised reviews to flip the sentiment label. We annotated these revisions into the following categories: (1) inserting or replacing modifiers, (2) inserting phrases, (3) adding negations, (4) diminishing polarity via qualifiers, (5) changing ratings, and (6) suggesting sarcasm. We discarded a few examples that did not belong to any of these categories. We follow data splits similar to those discussed for the SNLI counterfactual few-shot categories. We could not obtain enough balanced training sets for any categories except inserting or replacing modifiers and adding negations, hence we discarded the remaining categories. Table 3 presents the statistics of these two categories.

Annotation (More details in Appendix A)
First, a single expert annotated both the SNLI and IMDB counterfactual examples, as both need a degree of expertise to correctly reason among various categories, with examples often falling into multiple categories. Previous NLU projects have also benefited from expert annotations (Basile et al., 2012; Bos et al., 2017; Warstadt et al., 2019; Williams et al., 2020). Next, since the annotations require complex reasoning and can sometimes be subjective, we further employed another annotator to annotate 100 examples from each dataset to calculate the inter-annotator agreement. We calculate the percentage agreement and Cohen's kappa (Cohen, 1960) for each category independently and report the average scores across all categories. The average percentage agreement scores for the SNLI and IMDB datasets are 86.4% and 90.5%, respectively, which is a high, acceptable level as per previous work (Toledo et al., 2012; Williams et al., 2020). The Cohen's kappa scores for the SNLI and IMDB datasets are 0.61 and 0.79, respectively, which indicates substantial agreement (Landis and Koch, 1977).

Results
In this section, we report the performance of the various baselines discussed in Sec. 4 on our benchmark suite. We refer to Appendix D for training details.

Results on Few-Shot Learning
ANLI R3 Categories. Table 4 shows the results on the 6 categories from Round-3 of the ANLI dataset. The base model is trained on the combined data of MNLI (Williams et al., 2018), ANLI Round-1 (R1), and ANLI Round-2 (R2). On average, we observe that using the few-shot training examples for each of the categories improves performance (comparing zero-shot vs. the rest of the models), while maintaining performance on the MNLI matched (MNLI-m) and mismatched (MNLI-mm) datasets. More importantly, we also observe that the non-parametric feature matching methods perform better than the parameter correction methods, although with high variance across training sets.

SNLI Categories. Table 5 presents the performance of various models on the 5 annotated categories of the SNLI dataset in a few-shot learning setup. We observe similar trends: few-shot examples improve performance (comparing zero-shot vs. other models in Table 5), and feature matching approaches perform consistently better than parameter correction approaches. As with the ANLI R3 categories, feature matching methods also exhibit high variance on the SNLI categories.
IMDB Categories. Table 6 presents the performance of various models on the 2 categories of the few-shot IMDB sentiment analysis setup. Again, a few examples improve performance in all categories (with the exception of k-NN), and the feature matching method (Prototypical Networks) outperforms the parameter correction methods by a large margin. Since IMDB is a 2-way classification dataset and the examples are curated via counterfactual edits, the feature matching methods essentially have to learn to flip the label; PN succeeds at this (which also explains its high scores), whereas k-NN does not. Further, the variance of the feature matching methods is notably lower on this dataset.

Results on Continual Few-Shot Learning
In this section, we discuss the continual few-shot learning setup on the ANLI R3, SNLI, and IMDB categories. We train the models on the categories sequentially, initializing each stage with the model parameters learned for the previous category, thus enabling continual few-shot learning. Evaluation is performed with the final model obtained after continually training on all categories, as sketched below.
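The following sketch illustrates this protocol; `train_few_shot` and `evaluate` are hypothetical helpers standing in for standard fine-tuning and accuracy evaluation.

```python
def continual_few_shot(base_model, categories, train_few_shot, evaluate):
    """Sequentially fine-tune on each few-shot category, warm-starting
    from the model of the previous category (sketch of the protocol)."""
    model = base_model
    for category in categories:
        model = train_few_shot(model, category.train_set)
    # only the final model is evaluated, on every category's test set
    # (and on the original test set, to measure forgetting)
    return {category.name: evaluate(model, category.test_set)
            for category in categories}
```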

Ablations and Analyses
Robustness of Prototypical Networks. To examine how Prototypical networks (PN) perform on the original data (e.g., MNLI, SNLI, or IMDB), we take the model trained with the cross-entropy loss and test it using PN with the original training data as the support set. Surprisingly, we observe that PN performs on par with the usual softmax-based prediction on all three datasets (see Table 10 row-1 vs. row-2, MNLI column; Table 11 row-1 vs. row-2, SNLI and IMDB columns). This is interesting because it means we can simply label an example by computing its Euclidean distance to the mean feature representations of the classes, as in the sketch below.
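A minimal PyTorch sketch of this prototype-based prediction, assuming `encode` is a hypothetical helper that returns the pre-softmax features f_θ(x) of the CE-trained model:

```python
import torch

@torch.no_grad()
def prototype_predict(encode, support_x, support_y, test_x, num_classes):
    """Label test examples by the Euclidean distance of their features to
    the mean class features of the support set (here: original train data)."""
    support_feats = encode(support_x)                     # [N, m] features
    protos = torch.stack([support_feats[support_y == k].mean(dim=0)
                          for k in range(num_classes)])   # [C, m] prototypes
    dists = torch.cdist(encode(test_x), protos)           # [T, C] distances
    return dists.argmin(dim=1)                            # nearest prototype
```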
Cross-Entropy vs. Prototypical Loss. We train a model with the Prototypical network (PN) loss (minimizing the distance between training examples and the approximated class representations) and compare it with the cross-entropy (CE) loss. Table 10 and Table 11 present the results. The model trained with the PN loss performs similarly to or slightly better than the one trained with the CE loss on the original test sets (see Table 10 row-1 vs. row-5; Table 11 row-1 vs. row-5). Further, models with the PN loss perform worse on average than the CE loss on the ANLI R3 categories, whereas the opposite is true for the counterfactual categories of SNLI and IMDB (Table 11). Finally, we use both the original dataset and the few-shot categories as the support set,9 and observe a performance drop on the ANLI R3 few-shot categories, though the results are still better than using just the original dataset (MNLI) as the support set (Table 10). This holds for both CE and PN losses. On the SNLI and IMDB categories, the performance again drops, but remains better than using the original dataset as the support set with the CE loss, and is almost the same with the PN loss.

Table 11: Comparison of the performance of various models with cross-entropy loss (CE-Loss) optimization or prototypical network loss (PN-Loss) optimization on NLI and sentiment analysis datasets. NLI models are trained on the SNLI dataset and tested on the test sets of SNLI and its counterfactual categories. IMDB models are trained on the full IMDB dataset and tested on the test sets of IMDB and its counterfactual categories. † represents SNLI/IMDB as the support set, ‡ represents SNLI or IMDB categories as the support set, and a third marker represents both SNLI/IMDB and their categories as the support set.

t-SNE Plot Visualizations. To further understand the differences between the cross-entropy (CE) loss and the Prototypical network (PN) loss, we present t-SNE plots10 of the examples from MNLI and the ANLI R3 categories (each example is represented in the feature space f_θ). In Figure 1, we highlight the ANLI R3 examples that were misclassified in the zero-shot results of Table 4; interestingly, most of these examples lie at the edge of the clusters. Further, there is a remarkable difference in the cluster patterns between the CE and PN loss models: the CE loss plots have dense clusters, whereas the PN loss plots have skewed (stretched) clusters. We also observe that clusters from the PN loss model have a higher average distance to their cluster center, and a higher average distance (with very high variance) between any two examples that belong to the same cluster, supporting the 2D t-SNE observations.

9 For a given test example, we assign the class label of the closest mean class feature from the pool of mean class features of the original training data and the categories' training data.

10 Using the sklearn library (https://scikit-learn.org/).
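For reference, a minimal sketch of how such plots can be produced with the sklearn library; the feature matrix and labels below are random placeholders standing in for the actual f_θ(x) features and gold classes.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

feats = np.random.randn(300, 1024)          # placeholder f_theta(x) features
labels = np.random.randint(0, 3, size=300)  # placeholder gold class labels

# project the m-dimensional features to 2D for visualization
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
plt.scatter(points[:, 0], points[:, 1], c=labels, s=8)
plt.savefig("tsne_features.png")
```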

Conclusion
We presented a benchmark suite and evaluation protocol for continual few-shot learning (CFL) on text classification tasks. We presented several methods as important baselines for our CFL setup. Further, we provided several interesting ablations to understand the use of non-parametric feature matching methods for CFL. We hope that our work will serve as a useful starting point to encourage future work on this important practical setting.

Broader Impact and Ethics Statement
We view CFL as a way to make real-world AI systems safe and reliable by enabling them to correct errors quickly. At the same time, we believe there is a lot more to be done to bring the CFL approach to practical scenarios, and we do not intend our benchmark suite to be directly employed off-the-shelf in any real system. Our benchmark suite serves only to compare various models and to encourage the community to build better models for this important practical setting. Moreover, since CFL deals with only a few training examples, the models might overfit to these examples, so any practical use of this setup should thoroughly consider the implications of overfitting. Further, our data collection methods for this research and the setup are not tuned for any specific real-world application. Hence, when applying our methods in a sensitive context, it is important to employ extensive quality control and robust testing before using them with real systems.

A Annotation Details
Annotation of both the SNLI and IMDB counterfactual examples needed a degree of expertise to correctly reason among various categories, with examples often falling into multiple categories. Hence, a single expert manually annotated both datasets in an attempt to ensure high quality. The annotation process was not done at scale, so this approach seemed safer. The ANLI categories discussed in Sec. 3.1 were also manually annotated by an expert (Williams et al., 2020). Further, various NLU projects have benefited from expert annotations (Basile et al., 2012; Bos et al., 2017; Warstadt et al., 2019). The expert annotated 1,422 and 234 examples in the SNLI and IMDB counterfactual datasets, respectively. It took roughly 15 hours to complete the annotations.
Inter-annotator Agreement. Since the annotations require complex reasoning and can sometimes be subjective, we further employed another annotator to annotate a subset of the examples to calculate the inter-annotator agreement. The new annotator first went over the definitions of the various categories and was then trained on a few examples. Finally, the new annotator annotated 100 examples each from the SNLI and IMDB datasets.
We calculate the inter-annotator agreement on these doubly-annotated examples using the percentage agreement and Cohen's kappa (Cohen, 1960) for each category independently, and report the average scores across all categories. For the SNLI counterfactual dataset, the average percentage agreement score between the two annotators is 86.4%, and the average kappa score is 0.62. Our inter-annotator percentage agreement is at an acceptable level compared to previous work reporting agreement scores on similar types of annotations (Toledo et al., 2012; Williams et al., 2020). Further, Cohen's kappa ranges from −1 to 1, and a score in the range of 0.61 to 0.80 is considered substantial agreement (Cohen, 1960; Landis and Koch, 1977). For the IMDB counterfactual dataset, the average percentage agreement score between the two annotators is 90.5%, and the corresponding Cohen's kappa score is 0.79, which is again substantial agreement.
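As an illustration, the per-category agreement statistics can be computed as follows (the annotator judgments below are toy data; 1 marks membership in the category):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y1 = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # annotator 1 (toy labels)
y2 = np.array([1, 0, 1, 0, 0, 0, 1, 1])  # annotator 2 (toy labels)

percent_agreement = (y1 == y2).mean() * 100   # raw percentage agreement
kappa = cohen_kappa_score(y1, y2)             # chance-corrected agreement
print(f"agreement: {percent_agreement:.1f}%, kappa: {kappa:.2f}")
```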

B More Details on Baselines
Memory-Aware Synapses. Aljundi et al. (2018) proposed an approach that estimates an importance weight for each parameter of the model, approximating the sensitivity of the learned function to a change in that parameter.
Let f be a function with parameters θ that represents the neural network model trained on the original full dataset, and let X, Y be the new examples from the few-shot setup. For a given data point x_k, the output of the network is f(x_k; θ). A small perturbation δ in the parameter space results in a change in the output function that can be approximated as

f(x_k; \theta + \delta) - f(x_k; \theta) \approx \sum_{i,j} g_{ij}(x_k) \, \delta_{ij}, \quad (1)

where g_{ij}(x_k) is the gradient of the learned function w.r.t. the parameter θ_{ij} and δ_{ij} is the change in the parameter θ_{ij}. The magnitude of the gradient g_{ij}(x_k) represents the importance of a parameter w.r.t. the input x_k; hence, the overall importance weight Ω_{ij} for a parameter θ_{ij} is defined as

\Omega_{ij} = \frac{1}{N} \sum_{k=1}^{N} \lVert g_{ij}(x_k) \rVert, \quad (2)

where N is the total number of few-shot examples. Aljundi et al. (2018) proposed to use the l2 norm of the function f to calculate g_{ij}, since this scalar value allows estimating g_{ij} with a single backpropagation pass. During training with the few-shot examples, the loss function is updated to consider the importance weights of the parameters through a regularizer. The final loss function is

L(\theta) = L_{few}(\theta) + \lambda \sum_{i,j} \Omega_{ij} \, (\theta_{ij} - \theta^{*}_{ij})^2, \quad (3)

where λ is the hyperparameter for the regularizer, L_{few} is the loss on the few-shot examples, and θ*_{ij} is the parameter learned on the original full dataset.
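A minimal PyTorch sketch of this importance estimation and regularizer, assuming `model(x)` returns the logits f(x; θ); this is an illustration of Eqns. 1-3 as reconstructed above, not the exact implementation.

```python
import torch

def mas_importance(model, inputs):
    """Estimate importance weights Omega (Eqn. 2): accumulate the absolute
    gradients of the squared l2 norm of the output over the few-shot inputs."""
    omega = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x in inputs:
        model.zero_grad()
        model(x).pow(2).sum().backward()  # l2 norm of f -> one backward pass
        for n, p in model.named_parameters():
            if p.grad is not None:
                omega[n] += p.grad.abs()
    return {n: v / len(inputs) for n, v in omega.items()}

def mas_penalty(model, omega, theta_star, lam):
    """Regularizer of Eqn. 3: penalize changes to important parameters."""
    penalty = sum((omega[n] * (p - theta_star[n]).pow(2)).sum()
                  for n, p in model.named_parameters())
    return lam * penalty
```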
Prototypical Networks. Snell et al. (2017) rely on an embedding function f_θ that computes an m-dimensional representation for each example and a prototype for each class. Let X, Y represent a set of few-shot examples; the class representation features are computed as

c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\theta(x_i),

where k indexes the k-th class and S_k represents all the few-shot examples that belong to the k-th class. Prototypical networks produce a class distribution for an example based on a softmax over distances to the prototypes, i.e., the mean class representations c_k:

p(y = k \mid x) = \frac{\exp(-d(f_\theta(x), c_k))}{\sum_{k'} \exp(-d(f_\theta(x), c_{k'}))}, \quad (4)

where d is the Euclidean distance. We use several different support sets to compute the class prototypes: the original training data; the few-shot training examples; or both. We use the model's output before the softmax layer as f_θ(x). For our initial experiments, we use the model trained with the cross-entropy loss on the original training data. We also experiment with a model trained with the Prototypical loss (results discussed in Sec. 6), where we randomly sample a support set from the training data during each mini-batch optimization step and try to minimize the distance between the mini-batch examples and the approximated class representations based on the support set. Distance minimization is done using Eqn. 4.
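A sketch of one such optimization step with the Prototypical loss, where `encode` stands in for f_θ and the support set is sampled separately from the mini-batch (illustrative, not the exact training code):

```python
import torch
import torch.nn.functional as F

def prototypical_loss_step(encode, support_x, support_y, batch_x, batch_y,
                           num_classes):
    """Build prototypes c_k from the sampled support set, then minimize the
    distance of mini-batch examples to their class prototype via Eqn. 4."""
    support_feats = encode(support_x)
    protos = torch.stack([support_feats[support_y == k].mean(dim=0)
                          for k in range(num_classes)])
    dists = torch.cdist(encode(batch_x), protos)   # Euclidean distances d
    # cross-entropy over -d is the negative log of the Eqn. 4 softmax
    return F.cross_entropy(-dists, batch_y)
```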
Supervised Contrastive Learning (SCL). Gunel et al. (2021) proposed supervised contrastive learning for better generalizability, where they jointly optimize the cross-entropy loss and a supervised contrastive loss that captures the similarity between examples belonging to the same class while contrasting them with examples from other classes. Let X, Y be the few-shot examples; the total loss and the supervised contrastive loss are defined as

L = (1 - \lambda) L_{CE} + \lambda L_{SCL}, \quad (5)

L_{SCL} = \sum_{i=1}^{N} \frac{-1}{N_{y_i} - 1} \sum_{j \in S_{y_i}, j \neq i} \log \frac{\exp(f_\theta(x_i) \cdot f_\theta(x_j) / \tau)}{\sum_{k \neq i} \exp(f_\theta(x_i) \cdot f_\theta(x_k) / \tau)}, \quad (6)

where λ is a hyperparameter to balance the two losses and τ is a hyperparameter to control the smoothness of the distribution. N_{y_i} represents the number of examples with class label y_i, and S_{y_i} represents the set of all examples belonging to class label y_i. In this work, we use the l2-normalized representation of the final encoder hidden layer before the softmax as f_θ.
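A PyTorch sketch of the contrastive term, consistent with Eqn. 6 as reconstructed above (the default `tau` value here is an arbitrary placeholder):

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, tau=0.3):
    """Supervised contrastive loss over l2-normalized features (Eqn. 6)."""
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t() / tau                       # pairwise similarities
    off_diag = ~torch.eye(len(labels), dtype=torch.bool, device=sim.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & off_diag
    # log softmax over all other examples (k != i) for each anchor i
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(~off_diag, float("-inf")), dim=1, keepdim=True)
    n_pos = pos.sum(dim=1).clamp(min=1)                 # N_{y_i} - 1
    return -((log_prob * pos).sum(dim=1) / n_pos).sum()
```

The total training loss then combines this term with the cross-entropy loss as in Eqn. 5.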

C Original Datasets Details
In our experiments, before training on our category datasets, we initially train our RoBERTa-Large models on the corresponding original datasets: MNLI11 combined with ANLI R1 and R2,12 SNLI,13 or IMDB. The SNLI and IMDB few-shot categories are derived from the counterfactually augmented data of Kaushik et al. (2020).14

11 https://gluebenchmark.com/tasks

12 https://github.com/facebookresearch/anli

13 https://nlp.stanford.edu/projects/snli/

14 https://github.com/acmi-lab/counterfactually-augmented-data

D Training Details
In all our experiments, we use the RoBERTa-Large classifier (356M parameters).15 We report accuracy for all of our models, and we select the best model during training based on accuracy on the development set. We do a minimal manual hyperparameter search in our experiments. While training on the original datasets (MNLI+ANLI R1+ANLI R2, SNLI, or IMDB), we use a learning rate of 2e-5. For training on the few-shot categories, we use a learning rate of 1e-5, which we initially tuned in the range [2e-5, 5e-6]. We keep the rest of the hyperparameters the same between training on the original dataset and training on the few-shot categories, e.g., we use a batch size of 32 and a maximum sequence length of 128 for training and 256 for testing. The average run time for training on the few-shot categories is less than five minutes (because of the very few training examples). We use 4 Nvidia GeForce GTX 1080 GPUs on an Ubuntu 16.04 system to train our models.

15 Based on the Transformers repository (https://github.com/huggingface/transformers).
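For concreteness, a minimal sketch of this fine-tuning setup using the Transformers library; the output path, epoch count, and the pre-tokenized `train_ds`/`dev_ds` dataset objects are assumptions for illustration, not the exact configuration.

```python
import numpy as np
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def accuracy_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

def finetune_few_shot(train_ds, dev_ds, num_labels=3):
    """Fine-tune RoBERTa-Large on one few-shot category (sketch; train_ds
    and dev_ds are assumed pre-tokenized HF datasets with labels)."""
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-large", num_labels=num_labels)  # num_labels=2 for IMDB
    args = TrainingArguments(
        output_dir="cfl-few-shot",          # illustrative path
        learning_rate=1e-5,                 # rate used for few-shot categories
        per_device_train_batch_size=32,
        num_train_epochs=10,                # assumption: not stated in the text
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,        # best model by dev accuracy
        metric_for_best_model="accuracy",
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                      eval_dataset=dev_ds, compute_metrics=accuracy_metrics)
    trainer.train()
    return model
```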

E.1 Effect of Few-Shot Learning on Domains
In order to better understand few-shot learning performance at the domain level, we use the ANLI R3 few-shot learning setting, where domain (genre) information is available. For example, the numerical category has 'Wikipedia' and 'RTE' domains. Table 13 presents the domain-specific performance of various categories, comparing a parameter correction approach (fine-tuning) and a non-parametric feature matching method (Prototypical Networks). We observe that the 'Legal' domain performed best on average for both methods. Furthermore, the feature matching method performed relatively better on the RTE domain, whereas the parameter correction method performed relatively worse on it.

E.2 Detailed Continual Learning Results

Table 12 presents the detailed continual learning results on the ANLI R3 categories using the fine-tuning method. First, we observe that the performance on MNLI drops as we add the categories, suggesting that it is affected by catastrophic forgetting. Next, we observe that the performance on all categories improves by the end of the continual training (w.r.t. the performance of the pre-trained model). Further, we also observe that some categories help improve other categories. For example, after continually training the model from the tricky category to the reasoning category, the performance on