Don’t Miss the Labels: Label-semantic Augmented Meta-Learner for Few-Shot Text Classification

A growing number of studies leverage pre-trained language models and meta-learning frameworks to solve few-shot text classification problems. Most current studies build the meta-learner from the information in the input texts but ignore the abundant semantic information carried by class labels. In this work, we show that class-label information can be used to extract more discriminative feature representations of the input text from a pre-trained language model such as BERT, and that doing so boosts performance when samples are scarce. Building on this discovery, we propose a framework called Label-semantic Augmented Meta-Learner (LaSAML) to make full use of label semantics. We systematically investigate various factors in this framework and show that it can be plugged into existing few-shot text classification systems. Through extensive experiments, we demonstrate that few-shot text classification systems upgraded by LaSAML achieve significant performance improvements over their original counterparts.


Introduction
The remarkable capability of quickly learning new concepts from a few training samples is one of the advantages of the human learning system over current machine learning systems. Motivated by this gap, research in few-shot learning has received increasing attention in the past decade. Meta-learning (Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017), the dominant methodology in few-shot learning, tackles the problem by learning a mapping function from a few support samples to a classifier through a meta-training dataset. Most existing meta-learning systems (Snell et al., 2017; Sung et al., 2018) were developed, or at least evaluated, in the field of computer vision. More recently, few-shot learning has been introduced to the NLP field and in particular to text classification (Yu et al., 2018; Geng et al., 2019), as it is a fundamental task in natural language understanding. In parallel to few-shot learning, pre-trained language models (PLMs) (Devlin et al., 2019; Radford et al., 2019) have revolutionized the NLP field and show strong evidence of performing well in the low data regime when transferred to downstream tasks.
Despite the impressive progress of meta-learning and PLMs, most existing few-shot classification systems (Geng et al., 2019; Bao et al., 2020) ignore an important information source: the semantics of class labels. When the number of training samples is limited, using only the input texts of each class can lead to ambiguity in interpreting the definition of the class. Consider the two groups of examples in Figure 1, which shows four samples belonging to different intent classes: even humans cannot fully understand the semantic meaning of those samples if the definitions of the labels are not given. For example, it is hard to tell whether class 1 and class 2 differ in the type of bill (water or gas), or whether class 3 and class 4 differ in the destination of the travel (USA or Germany). However, this ambiguity can be easily resolved if the class definition, or simply the class name, is provided.
Just as understanding class names can help humans interpret the sentences of a given class, we observe that BERT extracts more discriminative features if we append the class name to the input sentence, and that this boosts classification performance in low-shot scenarios. Motivated by these observations, this work explores how to better leverage the semantic information carried by class names for few-shot learning. Our key idea is to use meta-learning to further strengthen the guidance of class-label semantics for few-shot classification. Specifically, we use meta-learning to encourage the features extracted from class-name-appended samples to be more class-relevant and compatible with the query features. Moreover, we systematically study how to extract the label-semantic guided feature representation from the support samples and how to make the query sample features compatible with the meta-learner generated from the support set. Our research leads to a framework that can be plugged into existing few-shot meta-learners, which we call Label-semantic Augmented Meta-Learner (LaSAML). To demonstrate the power of LaSAML, we use it to upgrade the Prototypical Network and create a new method called LaSAML-PN. Through extensive experimental studies, we show that LaSAML-PN achieves excellent few-shot learning performance and that LaSAML-upgraded meta-learners obtain superior performance over their original counterparts. Our code has been released at: https://github.com/luoqiaoyang/ACL2021-LaSAML.

Related work
This section discusses the related work from three aspects: few-shot learning, few-shot text classification, and low-shot learning with label information.
Few-shot learning. Meta-learning approaches have made substantial progress on few-shot learning (FSL) tasks. The focus of current meta-learning frameworks is how to construct the meta-learner. For example, a meta-learner could be constructed by learning a metric between samples and classes (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018), based on a differentiable learning process (Bao et al., 2020), or based on a few-shot gradient update (Mishra et al., 2018; Finn et al., 2017). A complete review of meta-learning is beyond the scope of this paper, and we refer readers to the recent survey (Hospedales et al., 2020).
Few-shot text classification. Few-shot text classification (FSTC) has also gained increasing attention in recent years. ROBUSTTC-FSL (Yu et al., 2018) uses an adaptive metric learning approach to select an optimal distance metric for different tasks. Induction Network (Geng et al., 2019) utilizes the dynamic routing algorithm (Sabour et al., 2017) to learn a generalized class-wise representation. Pre-trained language models have also been applied to few-shot text classification. LEOPARD (Bansal et al., 2020) uses BERT (Devlin et al., 2019) with an optimization-based meta-learning framework to achieve good performance on diverse NLP classification tasks. More recently, GPT-3 (Brown et al., 2020) shows that the language model itself can be used to perform few-shot text classification without meta-learning. Meanwhile, another recent work (Bao et al., 2020) points out that meta-learning for text classification may have different characteristics from those in computer vision, and proposes to use distributional signatures to enhance the generalization capability of the meta-learner. Our method is still a meta-learning-based few-shot text classification method. The key contribution of our work is the discovery that using label information together with BERT can lead to significantly better generalization performance.
Using label information for text classification. An increasing number of recent works have realized the value of label semantics. The matching between label information and text can naturally lead to zero-shot learning. For example, CDSSM (Chen et al., 2016) explores zero-shot intent classification based on class names. Prompt-based strategies (Puri and Catanzaro, 2019; Schick and Schütze, 2020) have been developed to implicitly match text against class names. In the context of few-shot learning, Hou et al. (2020) incorporate label semantics into TapNet (Yoon et al., 2019) for few-shot slot tagging tasks. Different from the above works, this paper explores both pre-trained language models and label semantics for few-shot learning. We only require the names of the classes, rather than manually constructed prompts or templates, to convey label semantics. TARS (Halder et al., 2020) also leverages pre-trained language models and label semantics, based on binary text classification. However, our method further strengthens generalization ability via a meta-learning framework, especially in cross-domain and fine-grained cases.
Table 1: Results of fine-tuning BERT with only 5 or 10 labeled samples per class on the AGNews dataset. None: the standard input format for the BERT classifier. Class name: appending the respective class name to each training sample.

Our method

The value of label information in the low data regime
As described in the introduction, label information is essential for humans to accurately interpret the meaning conveyed by a limited number of training samples. In this section, we demonstrate that label information is also useful for extracting discriminative features from a pre-trained language model. More specifically, we consider the following modification to the input of BERT for text classification: after each training sentence (for which we know the ground-truth class), we append a [SEP] token and the corresponding class name. In other words, we use the input format "[CLS] sentence [SEP] class name [SEP]", which mirrors the sentence-pair input of BERT's next sentence prediction (NSP) pre-training task. To perform well in the NSP task, BERT needs to extract from the first sentence the information that is most predictive of the next sentence. In our case, we replace the next sentence with the class name, and consequently we expect BERT to extract information that is relevant to the class name from the input sentence. We call this method label-semantic augmented feature extraction hereafter. From the experimental results shown in Table 1, we can clearly see that the classifier trained with label-semantic augmented feature extraction achieves better performance than the baseline approach. When only five samples are used per class, the improvement can be as large as 10%.
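The input modification described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `augment_with_label` and `build_training_inputs` are hypothetical, and the special tokens are written out literally, whereas in practice a BERT tokenizer would typically insert them when given a sentence pair.

```python
def augment_with_label(sentence: str, class_name: str) -> str:
    """Build the label-augmented input: [CLS] sentence [SEP] class name [SEP].

    Assumption: special tokens are spelled out as plain text for illustration.
    """
    return f"[CLS] {sentence.strip()} [SEP] {class_name.strip()} [SEP]"


def build_training_inputs(samples):
    """samples: iterable of (sentence, class_name) pairs with known labels."""
    return [augment_with_label(sentence, name) for sentence, name in samples]


if __name__ == "__main__":
    demo = [("Stocks rally as tech earnings beat forecasts", "Business")]
    print(build_training_inputs(demo)[0])
```

Only training (support) sentences can be augmented this way, since the ground-truth class name must be known.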
This motivating experiment shows the potential of incorporating class-label information. To further strengthen BERT's capability of leveraging class-label semantic information, we incorporate the above idea into the meta-learning framework, leading to a new meta-learning framework termed label-semantic augmented meta-learner (LaSAML). We expect that by fine-tuning a PLM through the meta-learning process, the network can find an optimal way of building meta-classifiers with the guidance of class labels.

The general framework of the label-semantic augmented meta-learner
We first present the proposed LaSAML in its general form and then go into the details of the framework. Formally, we consider the following problem setting. Our aim is to build a meta-learner that converts a set of support samples, denoted as $X_s = \{x_s, y_s, t_s\}$, into a classifier $\phi(\cdot; X_s)$, where $x_s$, $y_s$, and $t_s$ denote the input text, the class label, and the lexical definition of the class, i.e., the class name, respectively.² Applying $\phi(\cdot; X_s)$ to test data, i.e., a query sample $x_q$, we obtain the predicted class $\hat{y}_q = \phi(x_q; X_s)$. The meta-learner is trained on the meta-training set, from which one can randomly construct a support set $X_s^C = \{x_s^c, y_s^c, t_s^c\}$ and a query set with ground-truth class labels, $X_q^C = \{x_q^c, y_q^c\}$, for the C-way K-shot setting, where $c \in C$. Therefore, the performance of $\phi(\cdot; X_s^C)$, the classifier generated by the meta-learner, can be evaluated by comparing the predicted class against the ground-truth query label $\{y_q^c\}$. The key difference between a traditional meta-learner and the proposed LaSAML is that the lexical definition of the class name, $\{t_s^c\}$, is used for building the meta-learner.
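A C-way K-shot episode of the kind described above could be sampled as in the following sketch. The function `sample_episode`, its arguments, and the dictionary layout of the data are hypothetical choices made for illustration; they are not taken from the released code.

```python
import random


def sample_episode(data, n_way, k_shot, n_query, rng=None):
    """Sample one C-way K-shot episode.

    data: dict mapping class name -> list of texts (assumed layout).
    Returns (support, query):
      support items are (text, class_index, class_name),
      query items are (text, class_index); the class name is withheld.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(data), n_way)
    support, query = [], []
    for idx, name in enumerate(classes):
        # Draw K support and n_query query texts without overlap.
        texts = rng.sample(data[name], k_shot + n_query)
        support += [(t, idx, name) for t in texts[:k_shot]]
        query += [(t, idx) for t in texts[k_shot:]]
    return support, query
```

Support items keep the class name $t_s$ so that it can be fed to the feature extractor, while query items carry only the label used for evaluation.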
In particular, we consider meta-learners that can be written in the following form:
$$\{w_c\} = \psi\big(\{(f(x_s^c), y_s^c)\}\big), \qquad q_c = m\big(g(x_q), w_c\big), \qquad \hat{y}_q = \arg\max_c q_c, \tag{1}$$
where $f$ and $g$ denote the feature extractors, which convert the input text to a feature vector. For many meta-learning approaches, $f = g$. $\psi$ is a mapping function that maps the support set data to a set of class vectors $\{w_c\}$, one for each class. The classifier is then defined by a function $m(\cdot, \cdot)$ that measures the compatibility, $q_c$, between a query sample $x_q$ and the class vector $w_c$. The class with the highest $q_c$ is the predicted class.
The above formulation encompasses a wide range of meta-learning approaches. For example, for the Prototypical Network (Snell et al., 2017), $w_c$ is the c-th class mean vector calculated from the feature extractor and $m(\cdot, \cdot)$ is simply the negative Euclidean distance between $g(x_q)$ and $w_c$.
² In the following discussion, we slightly relax the distinction between "class label", "class name", and "class tag" and use them interchangeably when no confusion is caused.
The proposed LaSAML introduces label-semantic guidance to $f(\cdot)$ and $g(\cdot)$. In other words, the feature extractors $f$ and $g$ may take the class name as an additional input. Because the availability of class name information differs between support samples and query samples, i.e., we know the ground-truth class name for support samples but not for query samples, we may choose different ways of incorporating label information into the feature extractors, resulting in different implementations of $f$ and $g$.

Incorporating label information into feature extractors
This subsection discusses various options for incorporating label information into $f$ and $g$. We show the possible configurations in Figure 3. For the feature extractor of the support set, $f$, we append the corresponding ground-truth class name to each sentence. We then have different options for extracting sentence feature representations. In our study, we consider extracting sentence features from the [CLS] token, from the global average pooling (GAP) of the embeddings of the sentence tokens, from the GAP of the embeddings of the class name tokens (since a class name may contain multiple tokens), and from the average of them.
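The support-side pooling options can be sketched as follows, operating on the per-token output vectors of the encoder. The helper names, the `mode` strings, and the use of half-open index spans are assumptions for illustration; combined options are taken to be plain averages of the individual features, in line with "the average of them" above.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def support_feature(token_embs, sent_span, tag_span, mode="cls"):
    """Pool a support-sample feature from encoder outputs.

    token_embs: list of per-token vectors; position 0 is [CLS].
    sent_span, tag_span: (start, end) half-open index ranges of the
    sentence tokens and the class-name tokens (assumed to be known).
    """
    cls = token_embs[0]
    sent = mean_pool(token_embs[sent_span[0]:sent_span[1]])  # GAP over sentence
    tag = mean_pool(token_embs[tag_span[0]:tag_span[1]])     # GAP over class name
    options = {
        "cls": [cls],
        "sent": [sent],
        "tag": [tag],
        "cls+tag": [cls, tag],
        "all": [cls, sent, tag],
    }
    return mean_pool(options[mode])
```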
For the feature extractor of the query set, $g$, we consider three cases. First, the most straightforward way is not appending anything, since we do not know the ground-truth class name for query samples. Second, we can append all class names, as shown in Figure 2. Finally, we can append all class names but extract C features from the corresponding class name positions, one for each class. The c-th feature is then compared against the c-th class vector to calculate the matching score with $m$. The class corresponding to the highest matching score is the prediction. This scheme is visualized in Figure 2 as option 3 for the query. Formally, this process can be written as:
$$\hat{y}_q = \arg\max_c \, m\big(g(x_q, t_c), w_c\big), \tag{2}$$
where $g(x_q, t_c)$ denotes the feature extracted from the class name token $t_c$. We leave the detailed comparison results and discussion of those schemes to Section 4.2 and Section 4.3. Here, we report our major findings.
(1) For support set samples, extracting sentence features from different positions leads to similar performance. Extracting features from "[CLS]" and "[CLS]+Tag" is in general slightly better than the other options. (2) For query samples, not appending any class names leads to the overall best performance for our best-performing method. In the following, unless otherwise specified, we by default extract features from "[CLS]" and do not append class names to a query sample.
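The third query option, where the c-th query feature is matched only against the c-th class vector, can be sketched as below. The function names are hypothetical, and the compatibility function m is assumed here to be the negative squared Euclidean distance; other choices of m fit the same pattern.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_per_class_features(query_feats, class_vectors):
    """Predict with one query feature per candidate class.

    query_feats[c]: feature read from the c-th class-name position of the
    query input (all class names appended).
    class_vectors[c]: class vector w_c built from the support set.
    Returns (predicted class index, list of matching scores).
    """
    scores = [-sq_euclidean(g_c, w_c)
              for g_c, w_c in zip(query_feats, class_vectors)]
    pred = max(range(len(scores)), key=scores.__getitem__)
    return pred, scores
```

Unlike the first two options, each score here uses a different query feature, so the scores are not distances from a single point in feature space.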

Upgrade existing meta-learner with LaSAML
The proposed LaSAML can be incorporated into a variety of existing meta-learning frameworks. In our study, we mainly consider the Prototypical Network (Snell et al., 2017) as the meta-learning framework and upgrade it with LaSAML. The Prototypical Network is a metric-based meta-learning framework, which calculates the class vector by averaging the same-class features extracted from the support set. In its LaSAML-upgraded version (denoted as LaSAML-PN), we calculate the class vector $w_c$ by
$$w_c = \frac{1}{K} \sum_{x_s^c} f(x_s^c, t_s^c), \tag{3}$$
where the sum runs over the K support samples of class c and $f(x_s^c, t_s^c)$ indicates the feature extracted by incorporating class name information.
Then, we make the decision by comparing the feature extracted from a query sample against $\{w_c\}$:
$$\hat{y}_q = \arg\min_c \, d\big(g(x_q), w_c\big), \tag{4}$$
where $d(\cdot, \cdot)$ is the squared Euclidean distance.
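The two steps of LaSAML-PN, averaging label-augmented support features into class vectors and assigning a query to the nearest class vector, can be sketched on plain feature vectors as follows. The function names are hypothetical, and the features are assumed to have already been produced by the encoder.

```python
def prototypes(support_feats, support_labels, n_way):
    """Class vectors w_c: mean of the support features of each class.

    Assumes every class index in range(n_way) has at least one support sample.
    """
    dim = len(support_feats[0])
    sums = [[0.0] * dim for _ in range(n_way)]
    counts = [0] * n_way
    for feat, c in zip(support_feats, support_labels):
        counts[c] += 1
        for i, value in enumerate(feat):
            sums[c][i] += value
    return [[v / counts[c] for v in sums[c]] for c in range(n_way)]


def classify(query_feat, protos):
    """Index of the class vector with the smallest squared Euclidean distance."""
    dists = [sum((q - w) ** 2 for q, w in zip(query_feat, w_c))
             for w_c in protos]
    return min(range(len(dists)), key=dists.__getitem__)
```

In the default setting, the support features come from class-name-appended inputs while the query feature comes from the plain sentence, so the two feature extractors differ only in their input format.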

Experimental results
In this section, we conduct experiments to evaluate the performance of LaSAML. We first introduce our experimental setting. Then, we present the main results by comparing LaSAML against various existing few-shot text classification approaches. Finally, we provide ablation studies to investigate multiple factors in the proposed method.

Datasets
Three text classification datasets are used in our experiment.
HuffPost is a dataset covering a wide range of news topics. The dataset consists of 36,900 news headline samples, with 900 samples for each class. Following the settings of Bao et al. (2020), we use the same 20/5/16 classes for training, validation, and testing, respectively, for a fair comparison. Due to the limited number of classes, we only consider the 5-way 1-shot and 5-way 5-shot text classification tasks on this dataset.
Banking77, published by Casanueva et al. (2020), is a dataset for intent classification tasks. The dataset contains 13,083 samples covering 77 fine-grained intent classes in the banking domain. We construct few-shot tasks in the 10-way 1-shot, 10-way 5-shot, 15-way 1-shot, and 15-way 5-shot settings. The dataset is partitioned into training, validation, and testing sets, with 30, 15, and 32 classes sampled for the respective partitions.
Clinc150 is a cross-domain intent classification dataset originally proposed by Larson et al. (2019). It provides 22,500 in-scope queries and 150 intent classes from 10 domains. Each domain contains 15 intent classes, and there is no overlap between those classes. We use this dataset to evaluate the performance of the meta-learner under domain shift. We split the dataset into 4/1/5 domains for training, validation, and testing.

Comparing methods
We compare the proposed method against several commonly used few-shot learning approaches, which have shown promising results in both computer vision and natural language processing. For all the compared methods except Distributional Signature (Bao et al., 2020), which shows better performance without BERT, we re-implement them with the BERT encoder as the feature extractor. Note that this might lead to different (in most cases higher) performance than originally reported.
Prototypical Network (Snell et al., 2017) is a metric-based few-shot learning method. Our LaSAML-PN is an upgraded version of it. We use two implementations of PN. One extracts features from the [CLS] token, and the other applies a multi-layer perceptron (MLP) to the embedding of the [CLS] token. The latter was used in a recent study (Bao et al., 2020). We denote the original implementation of the Prototypical Network as PN and the implementation with an MLP as PN*.
Distributional Signature (Bao et al., 2020) learns how to use the statistical patterns of tokens to selectively attend to key information in the input text and build a meta-learner with better generalization performance. The method can be applied to a wide variety of meta-learning methods. In Bao et al. (2020), the combination with RRML shows the best performance (DS+RRML).
Table 3: Ablation study results of extracting support data features from various positions and their combinations on HuffPost and Clinc150 (cross-domain). According to the results of another ablation study in Table 5, we pick query settings 1 and 2 in Figure 2 for LaSAML-PN and LaSAML-RRML, respectively.

Implementation Details
In our methods, BERT-Base is employed as the feature encoder and meta-learner. We construct 100, 100, and 1000 randomly sampled tasks for each training, validation, and testing epoch, respectively.
Moreover, we use the Adam algorithm (Kingma and Ba, 2015) as the optimizer. For better training performance, we set different learning rates for the BERT encoder and the other modules, that is, 2e−5 for the BERT encoder and 1e−3 for other modules. Both Relation Network and Induction Network consist of a relation module, and we set the dense hidden layer dimension to 50 for the relation module. We follow other settings of Induction Network in the original paper (Geng et al., 2019).
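The two learning rates can be expressed as parameter groups, as in the sketch below. The helper `build_param_groups` and the assumption that encoder parameter names start with the prefix `bert.` are hypothetical. The returned list of dictionaries with `params` and `lr` keys follows the per-parameter-group layout accepted by PyTorch optimizers such as `torch.optim.Adam`, although this sketch itself does not import PyTorch.

```python
def build_param_groups(named_params, encoder_prefix="bert.",
                       encoder_lr=2e-5, other_lr=1e-3):
    """Split parameters into encoder and non-encoder groups.

    named_params: iterable of (name, parameter) pairs, e.g. what a model's
    named_parameters() would yield (assumed naming convention).
    """
    encoder, other = [], []
    for name, param in named_params:
        target = encoder if name.startswith(encoder_prefix) else other
        target.append(param)
    return [
        {"params": encoder, "lr": encoder_lr},  # pre-trained BERT weights
        {"params": other, "lr": other_lr},      # randomly initialized modules
    ]
```

A smaller step size for the pre-trained encoder and a larger one for newly initialized modules is a common fine-tuning choice, since the latter start far from a useful solution.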

Main results of LaSAML
The main experimental results are displayed in Table 2. From the results, we make the following observations: (1) The proposed LaSAML-PN achieves significant performance improvement over the original PN, especially in the one-shot classification setting, where the average improvement is around 6% to 16%. With more training data, the gap between LaSAML-PN and PN becomes smaller: on Banking77, LaSAML-PN and PN become comparable, but we still see a 3-5% improvement on Clinc150 and HuffPost. This is understandable because, with more samples, the class-related text patterns become more pronounced. However, this might be data-dependent. In general, if the difference between classes is more subtle, i.e., with fine-grained classes, more samples might be needed and, consequently, the guidance from the class name or definition will be more beneficial.
(2) We find that the original implementation of the Prototypical Network performs much better than the one used in Bao et al. (2020), which employs an additional MLP. The former achieves even higher performance than the method proposed in Bao et al. (2020), DS+R2D2, for which our re-implementation achieves almost the same performance as reported there. (3) Another surprising finding is that the Relation Network and the Induction Network do not perform better than the traditional Prototypical Network. From the above observations, we may conclude that using modules without the prior information of a language model, e.g., an MLP whose parameters are randomly initialized rather than pre-trained as in the PLM, leads to poor generalization performance. In contrast, methods that directly fine-tune parameters inside a PLM, e.g., PN, RRML, and our methods, tend to perform better. This observation is partly supported by the argument of Bao et al. (2020), who point out that in NLP, "the lexical features highly informative for one task may be insignificant for another." Thus, the weights learned by those randomly initialized modules may overfit the meta-training set and fail to generalize well to the target task. Bao et al. (2020) suggest a solution: building a meta-learner on the generalizable statistics of words. In our study, we find that this solution might not be stable in all cases. For example, DS+RRML does not perform well on Banking77 and Clinc150. Instead, our results suggest an alternative solution: building the meta-learner without introducing parameters beyond those of BERT, since the BERT parameters are pre-trained on a large corpus and tend to generalize better across tasks.

Ablation Study
In this section, we investigate LaSAML in depth by answering three questions. First, is LaSAML applicable to other meta-learning frameworks? Second, what is the impact of different ways of extracting features from BERT for support set samples? Third, how should class name information be leveraged for query samples? We conduct a series of experiments on the HuffPost and Clinc150 datasets to answer those questions.
LaSAML with other meta-learning frameworks. To further explore the potential of LaSAML, we incorporate it into the Ridge Regression Meta-learner (RRML) (Bertinetto et al., 2019), which is achieved by simply replacing the feature extractors $f$ and $g$ with the feature extractors used in LaSAML-PN. The results are shown in Table 4. As seen, using LaSAML leads to significant improvement in one-shot cases for RRML. For five-shot cases, the improvement becomes smaller on Clinc150 but is still significant on HuffPost. This result suggests that the proposed LaSAML has great potential to upgrade a wide variety of meta-learning approaches.
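For reference, a ridge-regression meta-learner fits the general meta-learner form with $\psi$ given by a closed-form ridge regression. The following is a sketch under stated assumptions rather than the exact parameterization used in the experiments: $F$ stacks the support features $f(x_s^c, t_s^c)$ row-wise, $Y$ is the one-hot label matrix of the support set, $\lambda$ is a regularization strength, and $\alpha$, $\beta$ are a scale and bias applied to the scores.

```latex
W = F^{\top}\left(F F^{\top} + \lambda I\right)^{-1} Y,
\qquad
q = \alpha\, W^{\top} g(x_q) + \beta,
\qquad
\hat{y}_q = \arg\max_c \, q_c
```

Here the columns of $W$ play the role of the class vectors $w_c$, and the compatibility function $m$ is a scaled inner product. Because $F F^{\top}$ has one row and column per support sample, the matrix to invert stays small in the few-shot setting.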
Comparing support set feature extraction strategies. In LaSAML, the input format of a sup- (L: All F: CLS), appending all class names but extracting features from the respective class (L: All F: Tag), and making a prediction by using Eq. 2. We make this comparison for both the LaSAML-upgraded PN and the LaSAML-upgraded RRML (LaSAML-RRML). The experimental results are shown in Table 5. From the results, we can see that the best strategy seems to be method-dependent. Appending all class names leads to better performance for LaSAML-RRML, but for LaSAML-PN, the best strategy is not appending any class names. Another observation is that extracting features from the respective class tag and comparing them against the respective class vector may lead to worse performance. Even so, extracting features from the respective class tag still achieves better performance than previous state-of-the-art methods (or comparable performance on 5-shot classification on the Clinc150 dataset).

What has been learned in LaSAML
In this section, we examine what has been learned by LaSAML-PN. We use an example in Figure 4 to highlight the difference between LaSAML-PN and the standard Prototypical Network. By investigating the attention weights with respect to the [CLS] token (we average the attention values across all heads in the last layer of BERT), we can see that the Prototypical Network fails to attend to the words relevant to the class. In contrast, LaSAML-PN successfully attends to the relevant keywords.
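The head-averaged [CLS] attention used for this inspection can be computed as in the sketch below. The function name and the nested-list layout `attn[head][from_token][to_token]` are assumptions for illustration; the attention weights themselves would come from the last layer of the encoder.

```python
def cls_attention(last_layer_attn):
    """Average over heads the attention paid by [CLS] to every token.

    last_layer_attn[h][i][j]: attention weight from token i to token j
    in head h of the last layer; [CLS] is assumed to be at position 0.
    Returns one weight per token position.
    """
    n_heads = len(last_layer_attn)
    cls_rows = [head[0] for head in last_layer_attn]  # row of [CLS] in each head
    return [sum(col) / n_heads for col in zip(*cls_rows)]
```

Pairing the returned weights with the input tokens and sorting them gives the kind of keyword ranking that a figure such as Figure 4 visualizes.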

Conclusion
In this paper, we systematically study the potential of using class name information for few-shot text classification tasks. We find that appending the class name to the sentence as the input to a BERT encoder can lead to more discriminative sentence features. By applying this scheme to meta-training, we propose a new meta-learning framework called LaSAML. Implementing this framework with the Prototypical Network (Snell et al., 2017), we achieve significant improvement over existing few-shot text classification methods.