Distinct Label Representations for Few-Shot Text Classification

Few-shot text classification aims to classify texts whose labels have only a few examples. Previous studies overlooked the semantic relevance between label representations, and thus their models are easily confused by semantically relevant labels. To address this problem, we propose a method that generates distinct label representations embedding information specific to each label. Our method is applicable to conventional few-shot classification models. Experimental results show that our method significantly improved the performance of few-shot text classification across models and datasets.


Introduction
Few-shot text classification (Ye and Ling, 2019; Gao et al., 2019; Bao et al., 2020) has been actively studied, aiming to classify texts whose labels have only a few examples. Such infrequent labels are pervasive in real-world datasets and pose a challenge for text classifiers because of the lack of training examples. Snell et al. (2017) showed that conventional text classifiers suffer from over-fitting when the distribution of labels in a dataset is skewed.
Few-shot classification has two approaches: metric-based and meta-learning based methods. Metric-based methods conduct classification based on distances estimated by a certain metric, e.g., cosine similarity (Vinyals et al., 2016), Euclidean distance (Snell et al., 2017), convolutional neural networks (Sung et al., 2018), and graph neural networks (Satorras and Estrach, 2018). Metric-based methods in natural language processing focus on generating representations suitable for few-shot classification, using attention mechanisms of various granularity, local and global matching of representations (Ye and Ling, 2019), and word co-occurrence patterns in attention mechanisms (Bao et al., 2020). In contrast, meta-learning based methods learn to learn for achieving higher accuracy, by learning parameter generation (Finn et al., 2017), learning rates and parameter updates (Antoniou et al., 2019), and parameter updates using gradients (Andrychowicz et al., 2016; Ravi and Larochelle, 2017; Li and Malik, 2017). All of these previous studies overlooked the effect of the semantic relevance between label representations, which confuses few-shot classifiers. As a result, the classifiers tend to fail to distinguish examples with semantically relevant labels. Table 1 shows examples with labels sampled from Huffpost (Misra, 2018). The label pair of TECH and BUSINESS is semantically relevant, and classifiers are easily confused by it.

Table 1: Examples from Huffpost with semantically relevant labels.
TECH: Apple confirms it slows down old iPhones as their batteries age
TECH: Self-driving cars may be coming sooner than you thought
BIZ: Apple apologizes for slowed iPhones, drops price of battery replacements
BIZ: Wall Street isn't too worried about first self-driving Tesla death
To address this problem, we propose a mechanism that compares label representations to derive distinctive representations. It learns semantic differences between labels and generates representations that embed information specific to each label. Our method can be easily applied to existing few-shot classification models.
We evaluated our method using the standard benchmarks of Huffpost and FewRel (Han et al., 2018). Experimental results showed that our method significantly improved the performance of previous few-shot classifiers across models and datasets, and achieved the state-of-the-art accuracy.

2 Few-Shot Text Classification
This section describes the problem definition and a general form of conventional few-shot classifiers.

Problem Definition
In few-shot text classification, sets of supports and queries are given as input. A support set S consists of pairs of a text x and its corresponding label y: S = {(x_i, y_i) | i ∈ {1, 2, ..., NK}}, where N is the number of label types in the support set and K is the number of samples per label type. A query set Q consists of M texts to be classified: Q = {q_j | j ∈ {1, 2, ..., M}}. Note that S and Q share the same set of label types. A few-shot text classifier aims to predict a label for each q_j.
In few-shot classification, training and evaluation are performed on subsets of a dataset called episodes (Vinyals et al., 2016). A setting with N = n and K = k is called n-way k-shot classification. A training episode is created by sampling n label types from the training set and k + m examples for each sampled label, and then dividing them into support and query sets, where m = M/n. An evaluation episode is created in the same manner from an evaluation set. Note that the label sets of training and evaluation episodes are exclusive, i.e., the classifier is required to predict labels it has not been exposed to during training. The performance of a model is measured by the macro-averaged accuracy over all episodes.
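As a rough illustration, episode construction under the definitions above can be sketched as follows; the data layout and helper name are ours for illustration, not part of the benchmarks:

```python
import random
from collections import defaultdict

def make_episode(dataset, n_way, k_shot, m_query):
    """Sample one n-way k-shot episode from (text, label) pairs.

    Hypothetical helper: samples n label types, then k + m examples
    per label, and splits them into support and query sets.
    """
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append(text)

    labels = random.sample(sorted(by_label), n_way)
    support, query = [], []
    for label in labels:
        examples = random.sample(by_label[label], k_shot + m_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Toy dataset: 50 texts over 5 label types.
data = [(f"text-{i}", f"label-{i % 5}") for i in range(50)]
support, query = make_episode(data, n_way=3, k_shot=2, m_query=4)
# |support| = n * k = 6 and |query| = n * m = 12
```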

General Form of Few-shot Text Classification Models
A classification model first converts the texts in the support and query sets into vector representations. We denote the subset S_l ⊂ S in which all texts have the same label l as S_l = {(x_l^p, y_l^p) | y_l^p = l, p ∈ {1, 2, ..., K}}. An encoder E(·) converts a support text x_l^p and a query q_j into vectors x_l^p ∈ R^d and q_j ∈ R^d (d is the dimension of the representations), respectively:

x_l^p = E(x_l^p),  q_j = E(q_j).  (1)

E(·) can be any text encoder, such as recurrent neural networks (Yang et al., 2016), convolutional neural networks (Kim, 2014), and pre-trained language models like BERT (Devlin et al., 2019).
Second, the classification model generates a label representation for l. Let C(·) be the function that generates the label representation l ∈ R^d:

l = C(x_l^1, x_l^2, ..., x_l^K).  (2)

C(·) is typically a pooling function, such as average pooling or max pooling. Finally, the model calculates the similarity between q_j and each label representation l_i (i ∈ {1, 2, ..., N}) using a function R(·), and predicts the label whose representation is most similar to that of the query. The probability of the i-th label is computed as:

p(y = i | q_j) = exp(R(q_j, l_i)) / Σ_{r=1}^{N} exp(R(q_j, l_r)).  (3)

R(·) can be any metric for estimating similarity.
In natural language processing, cosine similarity is a standard choice. As the loss function L_c, the negative log-likelihood is commonly used:

L_c = - Σ_{j=1}^{M} log p(y = y_j | q_j),  (4)

where y_j is the gold-standard label of q_j.
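A minimal sketch of this general form for a single query, instantiating C(·) with average pooling and R(·) with cosine similarity (a common choice, per the text); the encoder E(·) is assumed to have run already, so we use random vectors in its place:

```python
import torch
import torch.nn.functional as F

def classify(support_vecs, query_vec):
    """ProtoNet-style classification step for one query.

    support_vecs: (N, K, d) encoded support texts; query_vec: (d,).
    Returns a probability distribution over the N labels.
    """
    prototypes = support_vecs.mean(dim=1)  # C = average pooling -> (N, d)
    # R = cosine similarity between the query and each label representation.
    sims = F.cosine_similarity(prototypes, query_vec.unsqueeze(0), dim=-1)
    return F.softmax(sims, dim=-1)         # softmax over similarities

torch.manual_seed(0)
support_vecs = torch.randn(5, 3, 16)   # 5-way 3-shot episode, d = 16
query_vec = support_vecs[2].mean(0)    # a query identical to label 2's prototype
probs = classify(support_vecs, query_vec)
loss = -torch.log(probs[2])            # negative log-likelihood, gold label = 2
```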
3 Proposed Method

Figure 1 shows an overview of our method. It adds a mechanism for learning to generate distinctive label representations to conventional few-shot classification models by converting their training into multi-task learning. Our method adds a difference extractor (Section 3.1) and a loss function based on mutual information (Section 3.2) to an arbitrary few-shot classification model.

Difference Extractor
The difference extractor compares the N label representations l_i obtained by Equation (2) with each other and generates representations that retain only the information specific to each label. To do so, a label representation should depend on the query q_j, because classification is conducted based on the similarity between the query and the labels, as shown in Equation (3) (Ye and Ling, 2019). Hence, we model both the label and query representations simultaneously. Specifically, the label representations l_1, ..., l_N and the query representation q_j are transformed as:

H = MultiHeadAttention([l_1, ..., l_N, q_j]),  (5)

where MultiHeadAttention(·) is a self-attention mechanism (Vaswani et al., 2017) that outputs hidden representations H ∈ R^{d×(N+1)}. H_{l_i} ∈ R^d is the output of the self-attention corresponding to l_i, and similarly, H_{q_j} ∈ R^d is that of q_j. These hidden representations are further transformed by fully-connected layers with the activation function GELU(·) (Hendrycks et al., 2020).
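The difference extractor can be sketched as follows; the layer sizes and the exact shape of the feed-forward part are our assumptions for illustration:

```python
import torch
import torch.nn as nn

class DifferenceExtractor(nn.Module):
    """Sketch of the difference extractor.

    Runs self-attention jointly over the N label representations and the
    query representation, so each output can contrast with every other
    label, then applies GELU fully-connected layers.
    """
    def __init__(self, d, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, label_reps, query_rep):
        # Stack [l_1, ..., l_N, q_j] into one (1, N+1, d) sequence.
        h = torch.cat([label_reps, query_rep.unsqueeze(0)], dim=0).unsqueeze(0)
        h, _ = self.attn(h, h, h)      # H: hidden reps for all N+1 positions
        h = self.ffn(h).squeeze(0)     # fully-connected layers with GELU
        return h[:-1], h[-1]           # transformed labels, transformed query

torch.manual_seed(0)
de = DifferenceExtractor(d=16)
labels_hat, query_hat = de(torch.randn(5, 16), torch.randn(16))
```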

Design of Loss Function
We assume that an ideal representation l̂_i retaining only information specific to the i-th label satisfies I(l̂_i; l̂_r) = 0 for all l̂_r (i ≠ r), where I(·;·) computes mutual information (MI); that is, each label representation is independent of the others. Hence, we propose an MI-based loss function L̂, which constrains a label representation l̂_i to contain only information specific to the i-th label, by minimizing:

L̂ = Σ_{i=1}^{N} Σ_{r≠i} I(l̂_i; l̂_r).  (8)

Because the exact value of Equation (8) is difficult to calculate in practice, we minimize its upper bound following Cheng et al. (2020):

L̂ ≤ Σ_{i=1}^{N} Σ_{r≠i} [ (1/M) Σ_{j=1}^{M} log p_θ(l̂_i^j | l̂_r^j) − (1/M²) Σ_{j=1}^{M} Σ_{j'=1}^{M} log p_θ(l̂_i^j | l̂_r^{j'}) ],  (9)

where p_θ(·) is a neural network that approximates the conditional probability p(l̂_i^j | l̂_r^j). Finally, the overall loss function is:

L = L_c + α L̂,  (10)

where α (> 0) balances the effect of L̂.
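A CLUB-style variational upper bound on MI in the spirit of Cheng et al. (2020) can be sketched as below; the Gaussian parameterization of p_θ, the network shape, and the shuffle-based approximation of the marginal term are our assumptions:

```python
import torch
import torch.nn as nn

class CLUBUpperBound(nn.Module):
    """Sketch of a CLUB-style MI upper bound between two representations.

    A small network p_theta models p(a | b) as a diagonal Gaussian; the
    bound is E_joint[log p_theta(a|b)] - E_marginals[log p_theta(a|b)].
    """
    def __init__(self, d):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.logvar = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def log_prob(self, a, b):
        # Gaussian log-likelihood of a given b, up to constants.
        mu, logvar = self.mu(b), self.logvar(b)
        return (-((a - mu) ** 2) / logvar.exp() - logvar).sum(-1)

    def forward(self, a, b):
        # a, b: (n, d) paired samples of two label representations.
        joint = self.log_prob(a, b).mean()
        # Shuffle b to approximate the product of marginals.
        marginal = self.log_prob(a, b[torch.randperm(b.size(0))]).mean()
        return joint - marginal  # upper-bounds I(a; b) in expectation

torch.manual_seed(0)
club = CLUBUpperBound(d=16)
mi_loss = club(torch.randn(8, 16), torch.randn(8, 16))  # scalar loss term
```

In training, a term of this form would be summed over label pairs and added to the classification loss with the weight α.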

Experiment
We evaluated our method on different few-shot classification models using the standard benchmarks.

Benchmark Datasets
Following previous studies (Bao et al., 2020; Gao et al., 2019; Ye and Ling, 2019; Sun et al., 2019), we use Huffpost and FewRel as benchmarks. 1 Following these studies, we evaluated the performance of each model using 1,000 episodes. Because episode generation involves random sampling from a dataset, we repeated this process 10 times and computed the macro-averaged accuracy as the final score. Statistical significance was measured using a bootstrap significance test.
Huffpost This dataset consists of titles extracted from HuffPost 2 articles. The task is to predict the category of an article from its title. The training, validation, and test sets contain 20, 5, and 16 label types, respectively, with 900 examples per label.
FewRel The task is to predict the relation between two entities. The training, validation, and test sets contain 65, 5, and 10 label types, respectively, with 700 examples per label.

Compared Models and Training Settings
We applied our method to three few-shot classifiers to investigate its effects on different models. As the de-facto standards of metric-based and meta-learning based models, we employed ProtoNet (Snell et al., 2017) and MAML (Finn et al., 2017), respectively. In addition, we employed MLMAN (Ye and Ling, 2019), the state-of-the-art few-shot classification model on FewRel.
We also compared with Bao et al. (2020), which achieved the state-of-the-art on HuffPost.
As the encoder E(·) and pooling function C(·) for each model, we used BERT-base (uncased) 3 and average pooling, respectively, which have shown strong performance on various text classification tasks (Devlin et al., 2019). We used PyTorch and Huggingface Transformers (Wolf et al., 2020) for the implementation. 4 We applied our difference extractor and MI-loss function (denoted as "+ DE + L̂") to ProtoNet, MAML, and MLMAN. For the difference extractor, we used a 1-layer self-attention mechanism with 8 heads. As an ablation study, we also compared a variant that applies only the difference extractor (denoted as "+ DE"), which is trained only with the classification loss (Equation (4)).

Overall Results
As Table 2 shows, our method significantly improved all of the baseline models across datasets. 5 For MAML and MLMAN, the difference extractor alone always improved the performance of the original models. Combined with the MI-loss, the performance improved by 0.61 to 7.68 points. In contrast, applying only the difference extractor to ProtoNet, i.e., ProtoNet + DE, degraded the original performance on the FewRel dataset. These results confirm that both the difference extractor and the MI-loss are crucial for ProtoNet. Using both, ProtoNet + DE + L̂ consistently improved the baseline by 0.39 to 2.13 points.

Impact of DE and MI-loss on Baselines
The experimental results confirmed that the combination of our difference extractor and MI-loss function consistently improved the few-shot classification models. In particular, the MI loss is more effective for a simpler model, i.e., ProtoNet. MLMAN has an internal mechanism for comparing supports and queries, and MAML has a mechanism for updating the model parameters to accurately classify supports. These internal mechanisms allow the models to learn label representations that boost classification accuracy; hence, the functionality of the MI loss is partly achieved by them. On the other hand, ProtoNet has the simplest architecture and thus benefits most from the explicit guidance of the MI loss.

Table 3 shows the settings of α tuned on the development set. Overall, the values of α on FewRel are larger than those on Huffpost. Larger α values increase the influence of the MI loss on models, which is effective on datasets with a large number of labels like FewRel. Figure 2 shows the accuracy measured on the development set when varying α. The performance tends to decrease when α is set too large. We suspect that a too large α forces models to extract differences irrelevant to the classification task. For example, the second examples in Table 1 are both about self-driving cars, but only the BIZ example contains the named entities Wall Street and Tesla. This is a noticeable difference; however, it is unlikely to be useful for the classification task. Label representations with such spurious distinctiveness may degrade the classification performance.

Conclusion and Future Work
In this paper, we introduced a novel method that sheds light on the semantic relations between labels. Our method improved the classification accuracy of representative few-shot classifiers on both the Huffpost and FewRel datasets, confirming the wide applicability of the proposed method.
Technically, our method can be applied to other classification problems that handle semantic labels, such as image and entity classification. We will conduct evaluations to see its effects on various types of classification.