Making Pretrained Language Models Good Long-tailed Learners

Prompt-tuning has shown appealing performance in few-shot classification by virtue of its capability to effectively exploit pretrained knowledge. This motivates us to check the hypothesis that prompt-tuning is also a promising choice for long-tailed classification, since the tail classes are intuitively few-shot ones. To this end, we conduct empirical studies to examine the hypothesis. The results demonstrate that prompt-tuning makes pretrained language models at least good long-tailed learners. For intuitions on why prompt-tuning achieves good performance in long-tailed classification, we carry out in-depth analyses by progressively bridging the gap between prompt-tuning and commonly used finetuning. In summary, the classifier structure and parameterization form the key to making good long-tailed learners, whereas the input structure is less important. Finally, we verify the applicability of our finding to few-shot classification.


Introduction
Pretrained language models (PLMs) with CLS-tuning (i.e., finetuning a PLM by applying a classifier head over the [CLS] representation) have achieved strong performance in a wide range of downstream classification tasks (Devlin et al., 2019; Liu et al., 2019a; Raffel et al., 2020; Wang et al., 2019b). However, they have been less promising in the long-tailed scenario (Li et al., 2020). The long-tailed scenario differs from common scenarios in the long-tail phenomenon exhibited in the class distribution, as illustrated in Figure 1. The long-tailed class distribution prevents PLMs from achieving good performance, especially in tail classes that only allow learning with very few examples (dubbed the tail bottleneck).
Recent advances in Prompt-tuning have witnessed a surge of efforts to make PLMs better few-shot learners (Schick and Schütze, 2021; Gao et al., 2021; Scao and Rush, 2021). Prompt-tuning requires PLMs to perform classification in a cloze style, and is thus superior to CLS-tuning in two aspects: 1) it aligns the input structure with that of masked language modeling (MLM); and 2) it reuses the classifier structure and parameterization from the pretrained MLM head. These two merits equip PLMs to better exploit pretrained knowledge and hence attain better few-shot performance than CLS-tuning.
The success of Prompt-tuning in few-shot scenarios motivates us to hypothesize that Prompt-tuning can relieve the tail bottleneck and thus make PLMs good long-tailed learners. The reason we make this hypothesis is that the tail classes are intuitively few-shot ones. However, long-tailed classification differs from few-shot classification to a certain extent, as it allows the possibility of transferring knowledge from head classes to tail ones.
We empirically examine the hypothesis by conducting evaluations on five long-tailed classification datasets. The comparison results show that PLMs can be good long-tailed learners with Prompt-tuning, which outperforms PLMs with CLS-tuning by large margins. Prompt-tuning even exhibits better performance than CLS-tuning with appropriate calibrations (e.g., focal loss; Lin et al., 2017). The widely accepted decoupling property (Kang et al., 2020) suggests that a good long-tailed learner should enjoy a nearly uniform distribution of classification weight norms, whereas for a bad one the norms of head classes can be much larger than those of tail ones. It is therefore expected that the weights tuned with Prompt-tuning follow a flat distribution. We validate this property by visualizing the norms of trained classification weights across classes. With these compelling results, we conclude that our hypothesis is valid.
We also provide further intuitions by asking why Prompt-tuning can be so promising, as shown in the above empirical investigations. Through in-depth analyses, we uncover that re-using the classifier structure and parameterization from the MLM head is key to attaining good long-tailed performance, largely outweighing the importance of aligning the input structure with that of MLM. CLS-tuning, with the classifier structure derived and parameters partly initialized from the MLM head, approximates the performance of Prompt-tuning. We believe that this observation can also shed light on related work that aims to improve Prompt-tuning itself. As such, we finally present comparison results of the improved CLS-tuning and Prompt-tuning in the few-shot scenario, suggesting the applicability of the improved CLS-tuning to few-shot classification.

Long-tailed Classification
Long-tailed classification basically follows a classification setting. Given a dataset D = {(x_i, y_i)}_i in which (x, y) ∼ P(X, Y), a model M is required to learn to approximate P(Y | X) as accurately as possible so that it can correctly predict the label from the input. However, long-tailed classification differs from the common classification setting in that P(Y) is long-tailed, prohibiting M from achieving a good optimization, especially on tail classes.
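To make the long-tailed setting concrete, the per-class example counts can be simulated with an exponential decay from head to tail. This is an illustrative sketch; the function name and parameter values are assumptions for illustration, not taken from the datasets used here.

```python
def long_tailed_counts(num_classes: int, head_count: int, imbalance: float) -> list:
    """Per-class example counts decaying exponentially from head to tail.

    `imbalance` is the ratio between the largest and the smallest class.
    """
    decay = imbalance ** (-1.0 / max(num_classes - 1, 1))
    return [max(1, round(head_count * decay ** i)) for i in range(num_classes)]

counts = long_tailed_counts(num_classes=10, head_count=1000, imbalance=100)
# The head class has 100x the examples of the last (tail) class.
```

Under such a P(Y), the few examples of the last classes are what creates the tail bottleneck.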

Finetuning
CLS-tuning Pretrained with the special token [CLS] capturing overall semantics, PLMs can be finetuned with classifiers over the [CLS] representations for classification (Figure 2(a) left).
The optimization objective can be depicted as:

$\mathcal{L}_{\mathrm{CLS}} = -\frac{1}{|D|}\sum_{(x_i, y_i) \in D} \log \mathrm{softmax}\big(C(E(x_i))\big)_{y_i},$

which is exactly a cross-entropy loss. Here, M can be disassembled into a backbone E and a classifier C. While E is a PLM producing the [CLS] representation, C is a Tanh-activated MLP, typically consisting of two feed-forward linear layers. To be more specific, we name the classifier for CLS-tuning the CLS head (Figure 2(a) right), and, in general, the last layer of the classifier the predictor.
We also name the input as CLS input for brevity.
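For concreteness, the CLS head described above can be sketched in PyTorch as follows. This is an illustrative sketch, not the authors' exact implementation; the class and attribute names are our own.

```python
import torch
import torch.nn as nn

class CLSHead(nn.Module):
    """Two-layer, Tanh-activated MLP over the [CLS] representation."""

    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)  # first feed-forward layer
        self.activation = nn.Tanh()
        self.predictor = nn.Linear(hidden_size, num_classes)  # last layer: the predictor

    def forward(self, cls_hidden: torch.Tensor) -> torch.Tensor:
        return self.predictor(self.activation(self.dense(cls_hidden)))

# CLS-tuning minimizes cross entropy over the predictor logits.
head = CLSHead(hidden_size=768, num_classes=5)
logits = head(torch.randn(2, 768))
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 3]))
```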
Prompt-tuning For PLMs that are pretrained with an MLM objective, it is natural to finetune the PLMs in an MLM-like cloze style for better exploitation of the pretrained knowledge.
To reach the goal, a template T and a verbalizer V are introduced (Schick and Schütze, 2021). The template converts the original input to an input with one [MASK] token that should be recovered, in other words an MLM input. The verbalizer maps each label to its corresponding tokens, and the model should predict the token corresponding to the correct label. In particular, for a label that is mapped to multiple tokens, a single [MASK] faces the inability to complete multiple tokens. Inspired by the averaging strategy in Chen et al. (2021) and Hu et al. (2021), we treat the average of the logit values of the multiple tokens as the logit value of the label.
For example (Figure 2(b) left), the template for sentiment classification can be: "x. It was [MASK]." Accordingly, the verbalizer can be: {positive → great, negative → terrible}. Thereby, the optimization objective is described as:

$\mathcal{L}_{\mathrm{Prompt}} = -\frac{1}{|D|}\sum_{(x_i, y_i) \in D} \log p\big([\mathrm{MASK}] = V(y_i) \mid T(x_i)\big),$

where E in M generates the [MASK] representation, and C is the pretrained MLM head. The MLM head (Figure 2(b) right) is activated with a GELU (Hendrycks and Gimpel, 2016) and normalized with a layer normalization (Ba et al., 2016; Vaswani et al., 2017; Devlin et al., 2019). Note that E and C share a part of their parameters (i.e., the word embeddings in E and the predictor over the vocabulary in C).
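The multi-token averaging strategy for the verbalizer can be sketched as follows; the token ids and logit values below are hypothetical, for illustration only.

```python
def label_logits(mask_logits, verbalizer):
    """Average the [MASK]-position logits of each label's verbalizer tokens."""
    return {label: sum(mask_logits[i] for i in ids) / len(ids)
            for label, ids in verbalizer.items()}

# Hypothetical vocabulary-sized logit vector from the MLM head at the [MASK] position.
vocab_logits = [0.0] * 10
vocab_logits[3] = 4.0   # e.g., the single token of one label word
vocab_logits[5] = 1.0   # first token of a multi-token label word
vocab_logits[7] = 3.0   # second token of the same label word
scores = label_logits(vocab_logits, {"positive": [3], "negative": [5, 7]})
# scores == {"positive": 4.0, "negative": 2.0}
```

The label with the highest averaged logit is predicted, so multi-token labels compete on the same scale as single-token ones.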

Research Hypothesis
As discussed in Section 1, we observe that, in a long-tailed class distribution, each of the tail classes is provided with very few examples, typically fewer than one tenth of the number in a common class. This brings challenges to long-tailed classification. Meanwhile, Prompt-tuning has been demonstrated to make PLMs better few-shot learners by exploiting pretrained knowledge. We are therefore inspired to hypothesize that Prompt-tuning can make PLMs good long-tailed learners, as pretrained knowledge is intuitively learned from a long-tailed vocabulary. In the following, we present a series of empirical examinations to test whether our hypothesis is valid in Section 3, and why it is so in Section 4.

Setup
Datasets We conduct examinations on five long-tailed classification datasets, ranging from Chinese to English ones. The first is a medical intent question detection dataset (CMID) (Chen et al., 2020). The second is an application category classification dataset (IFLYTEK) maintained by CLUE (Xu et al., 2020). The third is a clinical trial criterion categorization dataset (CTC) (Zong et al., 2021). The fourth is an entity typing dataset (MSRA) originally released as a named entity recognition dataset (Levow, 2006). The last is a document topic classification dataset (R52) essentially derived from the Reuters-21578 dataset (Debole and Sebastiani, 2004).
For datasets that do not originally include a test set (e.g., IFLYTEK), we use the development set as the test set and randomly take 10% of the training set as the development set. The statistics of these datasets are listed in Table 1.
Templates and Verbalizers For Prompt-tuning, example templates for the five datasets are shown below:
• CMID: x? The intent of the question is [MASK].
• IFLYTEK: x. The mentioned application belongs to [MASK].
• CTC: x. The category of the criterion is [MASK].
• MSRA: x. The e in the sentence is [MASK].
• R52: x. This is [MASK].
where x denotes the input, and e denotes the mentioned entity in the sentence offered by MSRA. Here, necessary English translations of the Chinese templates are used, and the corresponding Chinese templates are listed in Appendix A.
Since there are many classes in each dataset, we leave the details on verbalizers to Appendix A. Basically, the verbalizers are deduced from class descriptions after removing some less meaningful tokens (e.g., punctuation marks).
Implementation Experiments are carried out on an Nvidia Tesla V100. All models are implemented with the PyTorch and Transformers libraries. We initialize models with the Google-released bert-base-chinese and bert-base-uncased checkpoints. For parameter settings, the batch size is 32, the learning rate is 1e-5, the weight decay is 0, and the gradient norm is constrained to 1. We train the models for 10 epochs with a patience of 2 epochs. To stabilize the training procedure, we add a linear warm-up for 1 epoch.
The maximum sequence length is set according to the dataset, specifically, 64 for CMID, 512 for IFLYTEK, 64 for CTC, 128 for MSRA, and 256 for R52.
Metrics Since we are more concerned with model performance across different classes, we use the macro F1 score as the main performance metric. We also report the macro F1 scores of head classes (Head scores) and tail classes (Tail scores) separately to gain a fine-grained understanding of model performance. To separate head classes from tail classes, we sort all classes in descending order of the number of examples within each class. According to the power law, the head classes should take up 80% of all examples. However, we find that some tail classes with very limited examples can be included in this manner, so we manually determine the percentage for each dataset: 55% for MSRA and R52, 65% for CMID and IFLYTEK, and 80% for CTC. In addition, we also report accuracy scores (Acc scores) for reference. We take average scores over 5 runs as the results, attached with variances.
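The head/tail separation described above can be sketched as follows; the percentage and the toy labels are illustrative.

```python
from collections import Counter

def split_head_tail(labels, head_pct=0.8):
    """Sort classes by frequency; head classes jointly cover `head_pct` of examples."""
    counts = Counter(labels)
    total = len(labels)
    head, covered = [], 0
    for cls, n in counts.most_common():
        if covered < head_pct * total:
            head.append(cls)
            covered += n
        else:
            break
    tail = [c for c in counts if c not in head]
    return head, tail

# Toy long-tailed label list: class "a" dominates, "d" is the rarest.
labels = ["a"] * 60 + ["b"] * 25 + ["c"] * 10 + ["d"] * 5
head, tail = split_head_tail(labels, head_pct=0.8)
# head == ["a", "b"], tail == ["c", "d"]
```

Macro F1 can then be computed separately over the head and tail partitions to obtain the Head and Tail scores.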

Comparison Results
To examine whether our hypothesis, that Prompt-tuning can make PLMs good long-tailed learners, is valid, we first conduct an evaluation on the long-tailed datasets. The results are given in Table 2.
A key finding from the comparison results is that Prompt-tuning outperforms CLS-tuning by large margins across datasets in terms of F1 scores. Prompt-tuning in fact also yields lower variances than CLS-tuning. Prompt-tuning even exhibits better F1 scores than calibrated CLS-tuning (e.g., with focal loss in our case). Besides, focal loss does not bring further improvement over Prompt-tuning, implying that Prompt-tuning is a sufficiently good long-tailed learner. Contrarily, calibration methods can unexpectedly degrade the performance of CLS-tuning (on MSRA) due to inadequate hyperparameter search. Further, the comparison of Head and Tail scores hints that Prompt-tuning improves long-tailed performance by keeping a better trade-off between the head and the tail, where Prompt-tuning achieves much better results on the tail. Yet, Prompt-tuning can slightly hurt Head scores.
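For reference, a minimal NumPy sketch of the focal loss used for calibration (Lin et al., 2017); the toy logits and targets below are our own.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def focal_loss(logits: np.ndarray, targets: np.ndarray, gamma: float = 2.0) -> float:
    """Cross entropy with well-classified examples down-weighted by (1 - p_t)^gamma."""
    pt = softmax(logits)[np.arange(len(targets)), targets]
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt)))

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    pt = softmax(logits)[np.arange(len(targets)), targets]
    return float(np.mean(-np.log(pt)))

logits = np.array([[4.0, 0.0], [0.0, 4.0]])
targets = np.array([0, 1])
# Confidently correct predictions contribute far less than under plain cross entropy,
# which shifts the effective training signal toward hard (often tail) examples.
```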
Overall, we conclude that our hypothesis is valid, indicating a positive effect of Prompt-tuning.

Weight Norm Visualization
For the sake of a deeper investigation of the hypothesis, we visualize the weight norms of different classes. The weights are derived from the predictor of the classifier C. The motivation behind the visualization is rooted in the widely accepted decoupling property (Kang et al., 2020), which claims that the learning of the backbone and the classifier is in fact decoupled. In other words, the long-tailed class distribution affects the classifier a lot, but might have little impact on the backbone. To this end, it is largely acknowledged in previous studies (Lin et al., 2017; Kang et al., 2020) that a good long-tailed learner should keep weight norms roughly on the same scale, while a bad one sees a weight norm decay from the head to the tail. The visualization is shown in Figure 3.
From the plot, we discover that CLS-tuning makes PLMs bad long-tailed learners, as it possesses large weight norms for the head but small ones for the tail. Applied to CLS-tuning, the focal loss slightly flattens the weight norm slope. As expected, Prompt-tuning gives PLMs a far flatter distribution. This may be the reason why focal loss does not bring improvement over Prompt-tuning. While one may wonder whether weight norm regularization could boost long-tailed performance, τ-norm, a calibration method that adjusts weight norms so that they fall on similar scales, shows only minor improvement from weight norm regularization.
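The per-class weight norms visualized in Figure 3 can be computed directly from the predictor's weight matrix; the toy weights below are hypothetical.

```python
import numpy as np

def classwise_weight_norms(predictor_weight: np.ndarray) -> np.ndarray:
    """L2 norm of each class's row in the predictor weight matrix.

    A flat profile from head to tail indicates a good long-tailed learner;
    a decaying profile indicates a bad one.
    """
    return np.linalg.norm(predictor_weight, axis=1)

# Toy predictor weights for 4 classes, rows ordered from head to tail.
W = np.array([[3.0, 4.0], [0.0, 2.0], [1.0, 0.0], [0.3, 0.4]])
norms = classwise_weight_norms(W)
# norms == [5.0, 2.0, 1.0, 0.5]: a decaying profile typical of plain CLS-tuning.
```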
The contrast indicates that Prompt-tuning makes PLMs good long-tailed learners.

Analyses
Although we have verified that our hypothesis is valid, we are eager to see why Prompt-tuning can be so promising and to provide further intuitions. To this end, we consider three research questions and carry out in-depth analyses to answer them. We also hope the analyses can shed light on the design of Prompt-tuning itself in related areas.
• RQ1: Does the shared embedding contribute to Prompt-tuning?
• RQ2: Does the input structure contribute to Prompt-tuning?
• RQ3: Do the classifier structure and parameterization (e.g., the layer normalization used in the MLM head) contribute to Prompt-tuning?

Impact of Shared Embedding
The first question that comes to mind is whether it is the parameter sharing between the classifier and the backbone that helps Prompt-tuning avoid collapsing. Hence, we decouple the parameters shared by the classifier and the backbone (i.e., without shared embeddings during optimization) and compare the results in Table 3 before and after the parameter decoupling. We observe that the decoupling has little impact on the performance (Prompt-tuning vs. Prompt-tuning w/ ed.). If anything, ed. even slightly degrades the performance of Prompt-tuning. The phenomenon gives a possibly negative response to RQ1.

Impact of Input Structure
We explore whether the input structure is a significant factor in long-tailed performance. To this end, we arm CLS-tuning with the MLM input so that CLS-tuning shares the input structure and classification representation with Prompt-tuning.
The results in Table 3 demonstrate that the MLM input somewhat hurts the performance of CLS-tuning, in terms of both Acc and F1 scores (CLS-tuning vs. CLS-tuning w/ prompt). We attribute the performance detriment to the mismatch between the CLS head and the MLM input. That is, PLMs are not pretrained in a way that the CLS head should decode the MLM input. The results naturally suggest a possibly negative response to RQ2.

Impact of Classifier Structure and Parameterization
We also investigate the impact of the classifier structure and parameterization, given the structural and parameter differences between the classifier heads used by CLS-tuning and Prompt-tuning. We aim to check whether it is the discrepancy between the classifiers that biases the learning. To study the impact, we replace the Tanh in CLS-tuning with ReLU. We use ReLU here as an alternative to GELU because ReLU is more prevalent in the finetuning stage. Then, by adding a layer normalization after the ReLU activation, we fill the structural gap between the two classifiers. We also perform a natural follow-on action, re-using the statistics of the layer normalization from the MLM head to further enhance the classifier. The results are presented in Table 3.
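An illustrative PyTorch sketch of this modified classifier (the class and argument names are our own); the optional `pretrained_ln` stands in for the MLM head's layer normalization.

```python
import torch
import torch.nn as nn

class CLSHeadWithLN(nn.Module):
    """CLS head with ReLU + LayerNorm, optionally re-using MLM-head LayerNorm parameters."""

    def __init__(self, hidden_size: int, num_classes: int, pretrained_ln: nn.LayerNorm = None):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.ReLU()
        self.layer_norm = nn.LayerNorm(hidden_size)  # re-locates and re-scales ReLU features
        if pretrained_ln is not None:
            # Copy the affine parameters (element-wise weight and bias) from the MLM head.
            self.layer_norm.load_state_dict(pretrained_ln.state_dict())
        self.predictor = nn.Linear(hidden_size, num_classes)

    def forward(self, cls_hidden: torch.Tensor) -> torch.Tensor:
        return self.predictor(self.layer_norm(self.activation(self.dense(cls_hidden))))

mlm_ln = nn.LayerNorm(768)  # stands in for the pretrained MLM head's LayerNorm
head = CLSHeadWithLN(hidden_size=768, num_classes=5, pretrained_ln=mlm_ln)
logits = head(torch.randn(2, 768))  # shape: (2, 5)
```

Without `pretrained_ln`, this corresponds to CLS-tuning w/ LN; with it, to CLS-tuning w/ pt. LN.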
It is revealed that the ReLU variant sometimes yields surprisingly deteriorated results compared to CLS-tuning with Tanh. However, when the ReLU variant is additionally armed with a succeeding layer normalization (CLS-tuning w/ LN), it surpasses the original CLS-tuning by certain margins. Notably, CLS-tuning w/ LN achieves better Acc scores than Prompt-tuning, potentially suggesting the balanced use of CLS-tuning w/ LN in real-world applications. Besides, by re-using the MLM layer normalization (CLS-tuning w/ pt. LN), CLS-tuning approximates Prompt-tuning at once. The results imply a clearly positive response to RQ3. We conjecture that the observation is underpinned by an information perspective on regulated features.
For the ReLU variant, negative features are zeroed out, leading to a loss of information. This information loss can be referred to as the "dying ReLU" problem (Lu et al., 2019; Kayser, 2017) when negative features take up a large portion. The information loss may be problematic for the head, and even more unfriendly to the tail, which is under-represented (i.e., with far fewer examples). As a consequence, under-represented classes can be represented with high bias (and potentially high variance). In contrast, Tanh manipulates features without any tailoring, but restricts values to a constant range (i.e., from -1 to 1). Despite the reduced learning burden, Tanh suffers from a large saturation area (Xu et al., 2016). Thereby, the information of some tail classes can be compressed detrimentally.
Existing literature (Girshick et al., 2014; He et al., 2015; Xu et al., 2015) calibrates the situation by relaxing negative features. On the other hand, the layer normalization can compensate for the information loss caused by ReLU. With learned affine parameters (i.e., element-wise weights and biases), the layer normalization can properly re-locate and re-scale the ReLU-activated features so that knowledge attained from the head can be transferred to the tail and the drawback of ReLU can be alleviated.
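The three behaviors above can be demonstrated numerically; the toy feature values are our own, chosen for illustration.

```python
import numpy as np

features = np.array([-2.0, -0.5, 0.3, 1.5])  # toy [CLS] features; negatives dominate

relu = np.maximum(features, 0.0)   # negatives are zeroed out ("dying ReLU")
tanh = np.tanh(features)           # squashed into (-1, 1); |x| >~ 2 saturates near +-1

def layer_norm(x, weight, bias, eps=1e-5):
    """Per-vector normalization followed by a learned element-wise affine transform."""
    return weight * (x - x.mean()) / np.sqrt(x.var() + eps) + bias

# LayerNorm after ReLU re-locates the dead (zeroed) features away from zero,
# and its affine parameters can further re-scale them.
normed = layer_norm(relu, weight=np.ones(4), bias=np.full(4, 0.5))
```

Here the two features zeroed by ReLU become nonzero again after the normalization, matching the "re-activation" effect discussed above.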
We add an illustration in Figure 4 for a more intuitive understanding of the above explanation. Taking examples from a sampled head class (commerce) and a sampled tail class (delivery) from IFLYTEK, we plot the distributions of the first 10 features in the feature vectors (i.e., the final hidden state vectors corresponding to [CLS] tokens) derived from these examples. From the plot, we can see that ReLU drops the information of quite a few features of the tail (6 out of 10), and Tanh forces features into the pre-defined range to involve them in optimization, both limiting the expressiveness of features. The randomly initialized layer normalization re-activates the dead features from the ReLU by re-locating them, but largely leaves them in fixed scales (6 out of 10). Furthermore, the pretrained layer normalization from the MLM head transfers knowledge from the head to the tail and makes the features of the tail diversely distributed by re-scaling them.

Applicability to Few-shot Classification
From the analyses above, we arrive at an alternative to CLS-tuning that is comparably effective to Prompt-tuning for long-tailed classification, by equipping CLS-tuning with the structure and the layer normalization affine parameters of the pretrained MLM head. With this much simpler surrogate classifier, we retrospectively explore how the classifier performs in the few-shot scenario, especially compared with Prompt-tuning. The intuition is that, while the layer normalization can no longer transfer knowledge from the head to the tail in few-shot classification, its innate friendliness to the tail can probably be applied to few-shot classes for generalization. Therefore, we conduct experiments on three few-shot datasets: ECOM from FewCLUE (Xu et al., 2021), and RTE and BOOLQ from SuperGLUE (Wang et al., 2019a; Schick and Schütze, 2021). ECOM is a review sentiment classification dataset, RTE (Wang et al., 2019b) is a two-way entailment dataset, and BOOLQ (Clark et al., 2019) is a yes-or-no question answering dataset.
While ECOM has already been designed for few-shot learning, the others have not. For the latter two, we treat the original development set as the test set following Gao et al. (2021), and randomly sample 32 examples uniformly from the original training set as the training set following Schick and Schütze (2021). To strictly follow a true few-shot setting (Perez et al., 2021), 32 examples that do not overlap with those in our training set are used to form the development set. The implementation is akin to the one we used for the long-tailed experiments. The batch size is re- Moreover, since few-shot experiments are sensitive to the choice of hyperparameters, we again take average accuracy scores over 5 runs as the results, attached with variances.
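The true few-shot split described above can be sketched as follows; the function name and seed are illustrative, and this sketch samples uniformly over examples rather than per label.

```python
import random

def true_few_shot_split(examples, k: int = 32, seed: int = 0):
    """Sample k training and k non-overlapping development examples (Perez et al., 2021)."""
    pool = list(examples)
    if len(pool) < 2 * k:
        raise ValueError("need at least 2 * k examples")
    random.Random(seed).shuffle(pool)
    return pool[:k], pool[k:2 * k]  # train and dev are disjoint by construction

train, dev = true_few_shot_split(range(1000), k=32)
# len(train) == len(dev) == 32, with no overlap between the two sets.
```

Fixing the seed makes the split reproducible across the 5 runs while keeping the development set strictly disjoint from the training set.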
We can observe from Table 5 that CLS-tuning with ReLU and pretrained layer normalization can perform better than the other CLS-tuning baselines, though the gains do not necessarily hold across datasets. The filled gap between CLS-tuning and Prompt-tuning shows the scalability of our finding to few-shot classification. However, we encourage future work to explore this regime for a more comprehensive understanding.

Related Work
PLMs have brought classification tasks to a brand-new stage where the solutions to these tasks are far simpler than ever (Devlin et al., 2019; Wang et al., 2019b). However, PLMs are still suboptimal in some corner cases, such as few-shot classification (Zhang et al., 2021) and long-tailed classification (Li et al., 2020).
For few-shot classification, Prompt-tuning, which finetunes models in a language modeling fashion, is increasingly taking a central role among mainstream methods (Liu et al., 2021). Since PLMs are mostly trained with language modeling objectives, Prompt-tuning becomes, as water is to fish, the key to unearthing the few-shot or even zero-shot learning capabilities of PLMs (Brown et al., 2020; Schick and Schütze, 2021; Scao and Rush, 2021). Following this success, the engineering of prompts has driven related research on prompt search/generation (Jiang et al., 2020; Shin et al., 2020; Gao et al., 2021) and various downstream applications such as text generation (Li and Liang, 2021), relation extraction (Chen et al., 2020), and entity typing (Ding et al., 2021). Recently, Prompt-tuning has also served as an alternative for parameter-efficient finetuning by only tuning the parameters of inserted continuous prompts (Lester et al., 2021; Ma et al., 2022), in place of the previously adopted adapter-tuning (Houlsby et al., 2019). The parameter efficiency brought by Prompt-tuning has blazed a trail for increasingly large language models.
In contrast, little work has investigated making PLMs good long-tailed learners. Intuitively, the tail classes are essentially few-shot ones. Thus, we speculate that Prompt-tuning is also a promising choice for making PLMs good long-tailed learners via transferring knowledge from head classes. As long-tailed classification is a long-standing problem in the general area of machine learning (Lin et al., 2017; Liu et al., 2019b; Zhou et al., 2020; Kang et al., 2020; Tang et al., 2020) and the long-tailed phenomenon also exists in the domain of natural language processing, we believe that a systematic exploration of whether and why Prompt-tuning can make PLMs good long-tailed learners will facilitate further advances in related areas.

Conclusions
Inspired by the success of Prompt-tuning in few-shot learning, we empirically examine whether Prompt-tuning can make PLMs good long-tailed learners. The results validate the hypothesis. We also conduct in-depth analyses of why Prompt-tuning benefits PLMs for long-tailed classification, from the perspectives of coupling, classifier, and input respectively, to offer further intuitions. Based on the analyses, we summarize that the classifier structure and parameterization are crucial for enhancing the long-tailed performance of PLMs, in contrast to other factors. Extended empirical evaluation results on few-shot classification show that our finding can shed light on related work that seeks to boost Prompt-tuning.

Limitations
Since prompt-tuning is shown to be sensitive to small variations in templates, it should be performed with reasonable templates. However, we do not study the impact of different templates, since our work is not concerned with finding a good template for long-tailed classification.

A Templates and Verbalizers for Long-tailed Datasets

The templates and verbalizers for the five long-tailed datasets are listed in Table 6, Table 7, Table 8, Table 9, and Table 10.

Figure 1: An example long-tailed class distribution from the IFLYTEK dataset (Xu et al., 2020), where we can distinguish tail classes from head classes.

Figure 3: Weight norm visualization. Classification weight norms of the model trained on IFLYTEK.

Figure 4: An illustration of sampled feature distributions, for both the head and tail.

Table 3: Analysis results. AVG denotes average results over all datasets. The best AVG scores are boldfaced. The content after CLS-tuning indicates the activation being used, where T is Tanh and R is ReLU. LN stands for layer normalization; pt. is short for pretrained, and ed. is short for embedding decoupling. The variances are attached as subscripts.

Table 4: Statistics of few-shot datasets.

Table 5: Applicability to few-shot classification results. AVG denotes average results over all datasets. The best AVG scores are boldfaced. The variances are attached as subscripts.