MCML: A Novel Memory-based Contrastive Meta-Learning Method for Few Shot Slot Tagging

Meta-learning is widely used for few-shot slot tagging. The performance of existing methods is, however, seriously affected by the \textit{sample forgetting issue}, in which the model forgets the historically learned meta-training tasks and relies solely on support sets when adapting to new tasks. To overcome this predicament, we propose the \textbf{M}emory-based \textbf{C}ontrastive \textbf{M}eta-\textbf{L}earning (MCML) method, which includes \textit{learn-from-the-memory} and \textit{adaption-from-the-memory} modules that bridge the distribution gap between training episodes and between training and testing, respectively. Specifically, the former uses an explicit memory bank to keep track of the label representations of previously trained episodes, imposing a contrastive constraint between the label representations in the current episode and the historical ones stored in the memory. In addition, the \emph{adaption-from-memory} mechanism is introduced to learn more accurate and robust representations based on the shift between the same labels embedded in the testing episodes and in the memory. Experimental results show that MCML outperforms several state-of-the-art methods on both the SNIPS and NER datasets and demonstrates strong scalability, with consistent improvement as the number of shots grows.


Introduction
Slot tagging (Tur and De Mori, 2011) is a key part of natural language understanding and is usually modeled as a sequence labeling problem (Chen, Zhuo, and Wang, 2019) with the BIO format, as shown in Figure 1. However, rapid domain transfer and scarce labeled data in the target domain lead to new challenges (Bapna et al., 2017a; Zhang et al., 2020). For this purpose, few-shot techniques (Li Fei-Fei, Fergus, and Perona, 2006; Snell, Swersky, and Zemel, 2017; Vinyals et al., 2016) have been designed and have become a popular research topic; they aim to recognize a set of novel classes with only a few labeled samples (usually less than 50-shot) by transferring knowledge from a set of base classes with abundant samples.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
For optimization-based meta-learning, since it requires fine-tuning on top of pre-trained language models, the more labelled data available in the target domain, the better the performance. For this reason, this method is favorable for larger few-shot applications, e.g., 20-shot learning (Krone, Zhang, and Diab, 2020). For metric-based meta-learning, the catastrophic forgetting problem (McCloskey and Cohen, 1989) is inevitable and seriously affects learning performance. First, in the meta-training stage, when the model learns from the current episode, it does not take the previously trained episodes into account. For example, episode 1 and episode 2 in Figure 1 may have overlapping labels, but the model cannot reuse the representations already learned in episode 1 when processing episode 2. Furthermore, in the meta-testing stage, some labels also appear in the training data, yet metric-based meta-learning handles these "seen" labels with the same mechanism as "unseen" labels 1. For example, both the training data and the testing data in Figure 1 contain object type and object name, but the model must learn new representations for these two labels during meta-testing without considering the representations learned during meta-training. Secondly, when the number of shots increases, the upper bound of metric-based meta-learning saturates quickly (Cao, Law, and Fidler, 2019). As shown in Figure 2, the number of shots only affects the accuracy of the local cluster distribution, not the global distribution. Similar behavior was also observed by Ouali, Hudelot, and Tami (2020) and Doersch, Gupta, and Zisserman (2020). Besides that, Cao, Law, and Fidler (2019) provide a solid theoretical analysis of metric-learning based approaches.
To overcome these limitations, we propose the Memory-based Contrastive Meta-learning (MCML) method to capture more transferable and informative label representations. Specifically, we propose two mechanisms to alleviate catastrophic forgetting in meta-training and meta-testing, respectively. First, during meta-training, we use an explicit memory module to keep track of the label representations of previously trained episodes and propose a contrastive learning (Hadsell, Chopra, and LeCun, 2006) objective that pulls together semantically similar (i.e., positive) samples in the embedding space while pushing apart dissimilar (i.e., negative) samples. This is what we call the "learn-from-memory" technique. In this way, we can stabilize the training stage and learn more transferable representations (Ouali, Hudelot, and Tami, 2020). Secondly, during the meta-testing stage, we use the "adaption-from-memory" technique to determine the output label based on the contrast between the input labels embedded in the test episode and the label clusters in the memory. We also introduce an indicator to control how much information we want to acquire from the memory. To summarize, our contributions are: (1) For the first time, the catastrophic forgetting problem in the few-shot setting is formally tackled by modeling episode-level relationships in meta-learning based FSL. (2) We propose a novel Memory-based Contrastive Meta-learning (MCML) method, including two model-agnostic techniques, learn-from-memory and adaption-from-memory, to alleviate the catastrophic forgetting problem that arises in the meta-training and meta-testing of few-shot slot tagging. (3) Compared with metric-based meta-learning and optimization-based meta-learning, our method demonstrates superior performance in all 1-shot, 5-shot, 10-shot, and 20-shot scenarios. It is also more scalable and robust.

Related Work
Few-Shot Learning Few-shot learning was first proposed as a transfer method using a Bayesian approach on low-level visual features (Li Fei-Fei, Fergus, and Perona, 2006). Recent works in low-resource cross-domain natural language understanding (Bapna et al., 2017b; Lee and Jha, 2019; Fritzler, Logacheva, and Kretov, 2019; Shah et al., 2019; Hou et al., 2020; Zhu et al., 2020) developed alternative techniques for building domain-specific modules without the need for many labeled examples. Most recent work tries to solve this problem with meta-learning, which can be categorized into two classes: metric-based (Hou et al., 2020; Zhu et al., 2020) meta-learning and optimization-based (Bapna et al., 2017a; Shah et al., 2019) meta-learning (Yin, 2020). However, neither method achieves satisfactory performance, due to the limitations mentioned above, when the number of shots is small, especially below 20. Contrastive Learning Contrastive learning aims to learn effective representations by pulling semantically close neighbors together and pushing non-neighbors apart (Hadsell, Chopra, and LeCun, 2006). The method has become popular and attracted much attention by showing great potential in image classification (Ouali, Hudelot, and Tami, 2020). Another recent work in natural language processing applies it to unsupervised text clustering, demonstrating that contrastive learning can help to learn instance-level features. This characteristic can also benefit metric-based meta-learning. Episode-level Relationship Only a small amount of work focuses on relationships between different episodes, and most of it derives from computer vision (Li et al., 2019; Sun et al., 2019; Ouali, Hudelot, and Tami, 2020; Fei et al., 2021). There are two recent, similar works: (1) MELR (Fei et al., 2021) conducts meta-learning over two episodes deliberately and targets the poor sampling problem of the few-shot setting.
(2) The intra-episode spatial contrastive learning (SCL) (Ouali, Hudelot, and Tami, 2020) also plays a key role as an auxiliary pre-training objective to learn general-purpose visual embeddings for image classification. In this work, we consider inter-episode contrastive learning. Furthermore, our MCML is specifically designed to cope with catastrophic forgetting in few-shot slot tagging, and it can be easily integrated with other episodic-training based methods. Memory-Augmented Learning The memory mechanism has proven powerful and effective in many NLP tasks (Tang, Qin, and Liu, 2016; Das et al., 2017; Geng et al., 2020). Most researchers choose to store the encoded contextual information of each meta episode under the few-shot setting (Kaiser et al., 2017; Cai et al., 2018). Specifically, Geng et al. (2020) propose a dynamic memory induction network to solve the few-shot text classification problem. Similar to our work, MoCo (He et al., 2020) also utilizes an external memory module to build a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised visual representation learning.

Preliminaries
Before we introduce our proposed framework, we provide the problem definition and an illustration of the basic frameworks of metric-based and optimization-based meta-learning for few-shot slot tagging.

Problem Definition
We denote each sentence as $\mathbf{x} = (x_1, x_2, x_3, \ldots, x_p)$ and the corresponding label sequence as $\mathbf{y} = (y_1, y_2, y_3, \ldots, y_p)$. Usually, under the few-shot setting we are provided with plenty of labeled data (i.e., $(\mathbf{x}, \mathbf{y})$ pairs) from the source domains $D_s$, and few-shot (less than 50) labeled data as well as plenty of unlabeled data in the target domain $D_t$. We split the data into episodes $e = (S, Q)$, each consisting of a support set and a query set, to comply with the few-shot setting. The support set contains a few labelled samples, while the query-set labels are missing during meta-testing. Thus, the few-shot model is first trained on many episodes $E_{tr} = (e_1, e_2, e_3, \ldots, e_n)$; the trained model is then directly evaluated on the target-domain episodes $E_{te} = (e_1, e_2, \ldots, e_m)$. The objective is formulated as follows:
$$\theta^* = \arg\max_{\theta} \, p(\mathbf{y} \mid \mathbf{x}, S; \theta) \quad (1)$$
where $\theta$ denotes the parameters of the slot tagging model, the $(\mathbf{x}, \mathbf{y})$ pair and the support set $S$ come from the target domain, and $E_{tr}$ and $E_{te}$ represent the episodes used during meta-training and meta-testing, respectively.
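To make the episodic layout concrete, the following is a minimal sketch of the data structures described above. The class and variable names (`Episode`, `support`, `query`) are illustrative, not taken from any released implementation.

```python
# Hypothetical sketch of the episodic few-shot data layout.
class Episode:
    """One few-shot episode e = (S, Q): a support set and a query set."""
    def __init__(self, support, query):
        # Each element is an (x, y) pair: a token sequence and its BIO tags.
        self.support = support  # the few labelled samples (K-shot)
        self.query = query      # the samples the model must tag

# A toy 1-shot episode for a "playlist" slot.
support = [(["play", "jazz", "classics"], ["O", "B-playlist", "I-playlist"])]
query = [(["start", "rock", "hits"], ["O", "B-playlist", "I-playlist"])]
ep = Episode(support, query)

# Meta-training uses many source-domain episodes E_tr; meta-testing
# evaluates directly on target-domain episodes E_te, where only the
# support set carries labels.
meta_train = [ep]                       # E_tr = (e_1, ..., e_n)
meta_test = [Episode(support, query)]   # E_te = (e_1, ..., e_m)
```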

Metric-based Meta Learning
Given an episode consisting of a support-query set pair, the basic idea of metric-based meta-learning (Snell, Swersky, and Zemel, 2017; Vinyals et al., 2016; Zhu et al., 2020; Hou et al., 2020) is to classify an item (a sentence or token) in the query set based on its similarity with the representation of each label, which is learned from the few labeled samples of the support set. Representative works are the matching network (Vinyals et al., 2016) and the prototypical network (Snell, Swersky, and Zemel, 2017). More specifically, given an input episode $(S, Q)$, the model encodes the two parts to obtain the sample vectors and query vectors, respectively. After that, various models can be used to extract the label embedding $c_{y_i}$. Taking the prototypical network as an example, each prototype (label embedding) is defined as the average vector of the embedded samples that share the same label:
$$c_{y_i} = \frac{1}{\sum_j \mathbb{I}(y_i^j = y_i)} \sum_j \mathbb{I}(y_i^j = y_i)\, s_i^j$$
where $\mathbb{I}$ is an indicator function that equals True when $y_i^j = y_i$ and False otherwise, and $s_i^j$ is the corresponding sample vector from $S \in \mathbb{R}^{|S| \times n \times h}$.
Lastly, we calculate the distance between each label embedding and each sample vector from the query set. The most popular distance function is the dot product, $d(c_{y_i}, q) = c_{y_i}^{\top} q$. The label assigned to a query instance is the label whose embedding is closest to the instance vector, computed through a softmax layer. As observed by Ouali, Hudelot, and Tami (2020), we find that the learned representations lack general discriminative semantic features, because the cross-entropy loss tailors the embeddings narrowly to the classification problem (shown in Figure 2).
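The prototype-then-classify step above can be sketched as follows, using toy embeddings in place of a real encoder; the function names and the synthetic data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prototypes(support_emb, support_labels, num_labels):
    """Label embedding c_y = mean of the support vectors carrying label y."""
    protos = np.zeros((num_labels, support_emb.shape[1]))
    for y in range(num_labels):
        protos[y] = support_emb[support_labels == y].mean(axis=0)
    return protos

def classify(query_emb, protos):
    """Dot-product similarity to each prototype, followed by a softmax."""
    scores = query_emb @ protos.T                 # (n_query, n_labels)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# Two well-separated toy label clusters stand in for encoder output.
rng = np.random.default_rng(0)
support_emb = np.vstack([rng.normal(0, 0.1, (3, 4)),   # label 0 cluster
                         rng.normal(5, 0.1, (3, 4))])  # label 1 cluster
support_labels = np.array([0, 0, 0, 1, 1, 1])
protos = prototypes(support_emb, support_labels, 2)
pred, _ = classify(np.array([[5.0, 5.0, 5.0, 5.0]]), protos)
```

A query vector near the label-1 cluster is assigned label 1, mirroring the nearest-prototype rule described above.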

Optimization-based Meta Learning
Optimization-based meta-learning methods also show great potential since the introduction of pretrained language models such as BERT (Devlin et al., 2019) and RoBERTa. Most of these methods follow a pre-train then fine-tune scheme, which requires "seeing" the target labels during training. The details of this second family of methods are out of the scope of this paper, so we skip them for brevity.

Model
In this section, we first give an overview of our proposed framework, and then we discuss how to learn from memory and adapt from memory in Sections 4.2 and 4.3, respectively.

Framework
Under the few-shot setting, catastrophic forgetting leads the model to learn poor representations, which results in worse adaptability. To overcome this problem during meta-training and meta-testing, we propose the "learn-from-memory" and "adaption-from-memory" techniques, respectively, to reuse the learned features (Raghu et al., 2019) (as shown in Figure 3).

Figure 3: The overview of our framework, including three modules: 1) Memory, 2) Learn-from-memory, and 3) Adaption-from-memory.

Learn-from-memory: During the meta-training stage, the model continuously trains on different episodes, and we utilize an external memory module to store all label embeddings learned from the support sets. These embeddings form different clusters, each representing one label. When a newly seen label embedding appears, a contrastive loss is computed on these embeddings by attracting positive samples, which share the same class label, and by repelling negative samples. If a label has not been encountered before, we simply add it to the memory. Adaption-from-memory: During the meta-testing stage, we first learn an adaption layer using the labels that overlap between meta-training and meta-testing, and then we use the learned adaption layer to project the unseen labels from the testing space to the training space in order to capture a more general and informative representation. We use a skip connection to control how much information we want to acquire from the memory.

Learn from Memory
We explore contrastive learning as an auxiliary meta-training objective to learn general-purpose semantic representations that transfer better to the target domain. Specifically, starting from the first episode $e_1$ and proceeding to the last episode $e_n$ in $E_{tr}$, we store the label embeddings from the corresponding support set, $C_i = (c_1, c_2, \ldots, c_k)$, in the external memory module $M$, where $k$ is the number of labels in the current episode. $M$ grows as the episodes continue.
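A minimal sketch of such a memory bank is given below, assuming clusters are keyed by label name and every stored embedding is kept; the class and method names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

class MemoryBank:
    """External memory M: one growing cluster of embeddings per label."""
    def __init__(self):
        self.clusters = {}  # label name -> list of stored embeddings

    def add(self, label, embedding):
        # A never-seen label simply opens a new cluster; a seen label
        # extends its existing cluster, as described for learn-from-memory.
        self.clusters.setdefault(label, []).append(np.asarray(embedding, float))

    def centroid(self, label):
        # Prototype c_k of the k-th cluster: mean of its stored embeddings.
        return np.mean(self.clusters[label], axis=0)

M = MemoryBank()
# Two episodes both contain B-object_name, so its cluster grows across them.
M.add("B-object_name", [1.0, 0.0])
M.add("B-object_name", [0.0, 1.0])
M.add("B-object_type", [2.0, 2.0])
```

Computing centroids once per episode (rather than contrasting against every stored embedding) matches the resource-saving choice discussed next.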
The size of the memory is therefore on the order of $\bar{k} \cdot m$, where $\bar{k}$ is the average number of labels over all episodes and $m$ is the number of episodes. For the $i$-th episode, we first calculate the prototypical embedding of the seen label clusters from memory. Theoretically, this step is unnecessary, but we choose to do so to save computational resources 2.
We use $c_k$ to denote the centroid of the $k$-th cluster, and we define a distance function following Ding et al. (2021). After that, we impose a contrastive learning objective between each newly learned label embedding and these cluster prototypes (shown in Figure 4).
This objective effectively serves as a regularizer to learn more consistent and transferable label representations as they evolve during meta-training (Ding et al., 2021; He et al., 2020). It is worth noting that no model parameters are changed at this stage, and we do not need to modify the architecture of traditional metric-based meta-learning models. As such, the model can easily be trained together with other components in an end-to-end fashion.

Figure 4: The high-level overview of our "learn-from-memory".
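Since the exact distance function from Ding et al. (2021) is not reproduced here, the sketch below uses a common InfoNCE-style formulation with cosine similarity and a temperature as a stand-in: the new embedding of a seen label is pulled toward its memory centroid and pushed from the centroids of other clusters. All names and the temperature value are assumptions for illustration.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss over one anchor embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    # Similarity to the matching centroid first, then to the other clusters.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

anchor = np.array([1.0, 0.1])        # new embedding of a seen label
positive = np.array([0.9, 0.0])      # that label's centroid in memory
negatives = [np.array([-1.0, 0.5])]  # centroids of other label clusters
loss_close = info_nce(anchor, positive, negatives)
# Pairing the anchor with the wrong centroid yields a much larger loss.
loss_far = info_nce(anchor, negatives[0], [positive])
```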
We use one label embedding as an example for brevity, inspired by He et al. (2020).

Adaption from Memory

Adaption from memory can only be used when the meta-training data and meta-testing data have overlapping labels. In this case, an overlapping label has two different representations: one from the memory, the other from the current episode during meta-testing (as shown in Figure 3). In practice, this case is common; e.g., B-person and B-city appear in almost every slot tagging dataset. We use the labels that overlap between meta-training and meta-testing to learn the adaption function $f$, which minimizes the distribution gap between meta-training and meta-testing 3.
$$\mathrm{Train}_{seen} = f(\mathrm{Test}_{seen})$$
where $f$ can be implemented by a multilayer perceptron (MLP) or a single linear layer, trained with the following loss function.
We then use the learned adaption function to project the unseen labels into the training space, based on the assumption that the training space, which is built from more labeled data, should be more accurate than the testing space.
$$\mathrm{Train}_{unseen} = \alpha \cdot \mathrm{Test}_{unseen} + (1 - \alpha) \cdot f(\mathrm{Test}_{unseen}) \quad (12)$$
where $\alpha \in (0, 1)$ is a hyper-parameter that controls the proportion of information taken from the original testing space versus from the adaption.
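The two-step procedure can be sketched as follows. A closed-form least-squares fit stands in for the MLP or linear layer trained with the adaption loss, and the toy data (a test space that is a scaled copy of the training space) is purely illustrative.

```python
import numpy as np

def fit_adaption(test_seen, train_seen):
    """Fit a linear map W so that Test_seen @ W ~= Train_seen.
    Least squares is a stand-in for the trained adaption layer f."""
    W, *_ = np.linalg.lstsq(test_seen, train_seen, rcond=None)
    return W

def adapt(test_unseen, W, alpha=0.5):
    """Skip connection: alpha keeps part of the original test-space
    embedding, (1 - alpha) takes the projected embedding, as in Eq. 12."""
    return alpha * test_unseen + (1 - alpha) * (test_unseen @ W)

# Toy overlap: the test space is the training space scaled by 2,
# so the fitted map should roughly halve each embedding.
train_seen = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
test_seen = 2.0 * train_seen
W = fit_adaption(test_seen, train_seen)
projected = adapt(np.array([2.0, 4.0]), W, alpha=0.0)  # pure projection
```

With alpha between 0 and 1 the output interpolates between the raw test-space embedding and its projection, which is the role of the indicator described in the framework overview.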

Training Objective
The learning objective of our method is the sum of three parts; note that these losses are not optimized simultaneously:
$$\mathcal{L} = \mathcal{L}_{ner} + \mathcal{L}_{memory} + \mathcal{L}_{ada} \quad (15)$$
where $\mathcal{L}_{ner}$ represents the traditional cross-entropy loss of sequence labelling (see Eq. 14) and is optimized together with $\mathcal{L}_{memory}$ during training (see Eq. 9), while $\mathcal{L}_{ada}$ is optimized during testing (see Eq. 11).

3 We emphasize that this operation is conducted at the episode level to comply with the few-shot setting.

Experiments
We evaluate the proposed methods following the data split provided by Hou et al. (2020) on SNIPS (Coucke et al., 2018). It uses the episode data setting (Vinyals et al., 2016), where each episode contains a support set (1-shot or 5-shot) and a batch of labeled samples. For slot tagging, the SNIPS dataset consists of 7 domains with different label sets: Weather (We), Music (Mu), PlayList (Pl), Book (Bo), Search Screen (Se), Restaurant (Re), and Creative Work (Cr). We also extend our method to more shots (10-shot and 20-shot) to further demonstrate the effectiveness and robust generalization capability of our approach.

Baselines
SimBERT assigns labels to words according to the cosine similarity of the word embeddings of a fixed BERT. For each word x_i, SimBERT finds the most similar word x_k in the support set and assigns x_k's label to x_i. TransferBERT directly transfers knowledge from the source domain to the target domain by parameter sharing. We pretrain it on the source domain and select the best model on the same validation set as our model. Before evaluation, we fine-tune it on the support set of the target domain. L-TapNet+CDT+PWE (Hou et al., 2020) is one of the strong baselines for few-shot slot tagging; it enhances the WarmProtoZero (WPZ) (Fritzler, Logacheva, and Kretov, 2019) model with label name representations and incorporates it into the proposed CRF framework. L-ProtoNet+CDT+VPB is the current state-of-the-art metric-based meta-learning method, which investigates different distance functions and utilizes the VPB distance function to boost the performance of the model. Coach is the current state-of-the-art optimization-based meta-learning method, which incorporates a template regularization loss and slot description information. Note that our setting provides more labeled samples in the support set of the target domain. For fair peer comparison, we randomly choose one support set from the target domain to fine-tune the model.

Implementation Details
We take the pre-trained uncased BERT-Base (Devlin et al., 2019) as the encoder to embed words into contextual vectors. We use ADAM (Kingma and Ba, 2015) to train the model with a learning rate of 1e-5, a weight decay of 5e-5, and a batch size of 1. We set the distance function to VPB. To prevent the impact of randomness, we run each experiment 10 times with different random seeds, following Hou et al. (2020). For adaption from memory, we set the number of iterations to 1000, choose α from [0.1, 0.3, 0.5, 0.7, 0.9], and report the best result.

Table 2: Ablation study of "adaption-from-memory" and "learn-from-memory" on 1-shot and 5-shot.

Table 1 shows the results of both 1-shot and 5-shot slot tagging on the SNIPS dataset. Our method reaches results comparable to the state of the art, outperforming it in 3 out of 7 domains under the 1-shot setting and 6 under the 5-shot setting. In particular, on Creative Work, our method achieves almost 10-point and 6-point improvements under 1-shot and 5-shot, respectively. Comparing Coach with Hou et al. (2020) and Zhu et al. (2020), we find that optimization-based methods are not as competitive as metric-based methods when the number of shots is small.

Ablation Study Ablation Test on Adaption from Memory (A) and Learn from Memory (M)
We borrow the result from Zhu et al. (2020) (i.e., L-ProtoNet+CDT+VPB) as the baseline here, since it reaches the best performance of all baselines. Table 2 shows the ablation study of learning and adaption from memory. Comparing the results between 1-shot and 5-shot, we find that the "learn-from-memory" module becomes more important as the number of shots increases. We attribute this phenomenon to the more transferable representations enabled by the additional labeled data that more shots bring. However, the "adaption-from-memory" module cannot maintain a consistent improvement; we believe this is caused by noise introduced by the adaption layer. After combining these two modules, the model reaches the best performance, as reported in Table 1. Compared with the strongest baseline, the averaged F1 score is further improved (more analysis of "adaption-from-memory" can be found in the appendix). Table 3 shows the results of 10-shot and 20-shot on the SNIPS dataset, which is generated following the method proposed by Hou et al. (2020) 4. Note that there may be distribution variation, since we do not strictly require the 10-shot/20-shot dataset to contain the original 1-shot/5-shot data.

Result on More Shot
Comparing 10-shot with 20-shot, we find that all domains improve with the help of "learn-from-memory" as the number of shots increases, except "SearchCreativeWork". Since this is the only domain with 100% overlapping labels between meta-training and meta-testing, we attribute this phenomenon to the poor representations obtained in meta-testing without "adaption-from-memory".
Comparing 1-shot and 5-shot (less-shot) with 10-shot and 20-shot (more-shot), there are some interesting findings: 1) "learn-from-memory" boosts 6 out of 7 domains in the more-shot setting instead of 3 in the less-shot setting, which demonstrates the importance and effectiveness of this module as the number of shots grows; 2) "adaption-from-memory" shows exactly the same gains whether or not there are more shots. This is reasonable, since the number of shots affects neither the number of labels nor the accuracy of the adaption. We conclude that "learn-from-memory" is always worth trying, while "adaption-from-memory" highly depends on the specific domain (see appendix).

Table 4: Optimization-based meta-learning vs. metric-based meta-learning vs. our MCML; red indicates the worst performance and green indicates the best.

Optimization-based vs Metric-based vs MCML
We also rerun Coach, a representative work of optimization-based meta-learning, in the more-shot setting, and then compare the results with the metric-based methods and ours. As shown in Table 4, when the number of shots is less than 20, the optimization-based method usually performs worst, which demonstrates the minimum number of shots required by optimization-based meta-learning methods. Comparing 20-shot with 5-shot, we find that the performance of the original metric-based meta-learning declines in most domains even as the number of shots grows, but our method can effectively alleviate this problem thanks to "learn-from-memory". However, when the number of shots is 1, our MCML can only reach performance comparable to metric-based meta-learning. We attribute this to the weakness of "learn-from-memory" when there is little or no memory.

Conclusion
In this paper, we investigate the catastrophic forgetting problem during the meta-training and meta-testing of metric-based meta-learning. We propose two techniques, "learn-from-memory" and "adaption-from-memory", to alleviate this problem; they demonstrate different preferences and advantages for various applications. To the best of our knowledge, this is the first study to use contrastive learning to model episode-level relationships and learn more transferable representations for few-shot slot tagging. In addition, we conduct extensive experiments on the 1-shot, 5-shot, 10-shot, and 20-shot scenarios of the widely used SNIPS dataset. The experimental results demonstrate that our method is more scalable and robust than metric-based and optimization-based meta-learning. We leave more powerful "learn-from-memory" techniques for future work.