Distinguishability Calibration to In-Context Learning

Recent years have witnessed increasing interest in prompt-based learning, in which models can be trained on only a few annotated instances, making them suitable for low-resource settings. Prompt-based learning is even more challenging in fine-grained classification, as pre-trained language models tend to generate similar output embeddings, which are difficult for the prompt-based classifier to discriminate. In this work, we alleviate this information diffusion issue by proposing a calibration method based on a transformation that rotates the embedding features into a new metric space, where we adapt the ratio of each dimension towards a uniform distribution to guarantee the distinguishability of the learned embeddings. Furthermore, we take advantage of hyperbolic embeddings to capture the relations between dimensions via a coarse-to-fine metric learning strategy, enhancing interpretability. Extensive experiments on three datasets under various settings demonstrate the effectiveness of our approach.


Introduction
Large pre-trained language models (PLMs) (Devlin et al., 2019; Lan et al., 2020; Liu et al., 2019) have achieved state-of-the-art performance in many Natural Language Processing (NLP) downstream tasks. More recently, PLMs with prompt learning have demonstrated surprising capabilities in numerous NLP tasks.

Table 1: The prompt templates for emotion classification. The samples are from the GoEmotions (Demszky et al., 2020) dataset.
In the emotion classification task shown in Table 1, an input sentence X, followed by a prompt, "It was [MASK]", is fed to a PLM to predict the missing token at the [MASK] position. The predicted word can then be used to identify the emotion label of the input sentence. Such few-shot learning generates a probability distribution over the [MASK] token conditioned on the given prompt/context, which is regarded as in-context learning with language models.
However, as in-context learning does not require updating PLM parameters, there arises the problem of distribution mismatch between the data used for LM pre-training and the test samples used in in-context learning, which hinders the full exploitation of the knowledge encoded in PLMs (Xie et al., 2022; Zhao et al., 2021; Ge et al., 2022; Shin et al., 2022). To alleviate this context shift, existing methods rely on prior knowledge to increase the overlap between the two distributions. For example, PTR (Han et al., 2021) appends domain-agnostic tokens, such as "sports" and "politics", to prompts to discriminate between domains. Another line of studies designs sophisticated handcrafted verbalizers to map the test samples onto the label word space derived from PLMs (Schick and Schütze, 2021; Gao et al., 2021b). Although gradient-optimized verbalizers (Hu et al., 2022) have been proposed to reduce the human effort and can be adapted to different downstream tasks via training, they are still considered inferior to manual verbalizers, especially in the few-shot and zero-shot settings where training data are scarce.
In this paper, we first show that PLMs have an inherent information diffusion issue in their generated output token embeddings, which share a large proportion of similar information after going through a stack of transformer layers (Gao et al., 2019; Yan et al., 2022). Such token embeddings occupy a narrow cone, leading to largely overlapped output distributions when applied to in-context learning. Next, we elaborate that the overlapped output distributions violate the distinguishability condition (Xie et al., 2022) under in-context learning. To this end, we propose to flatten the singular value distributions of the output embeddings generated from PLMs, shaping the space spanned by the singular values into a desirable manifold. On the one hand, we apply orthogonal and scaling constraints to the weight matrix applied to the output embeddings, which avoids exploding and vanishing values in the feature matrix (Saxe et al., 2014) and leads to more discriminative features when training with limited labelled data. On the other hand, we leverage hyperbolic embeddings to capture the hierarchical relations among fine-grained class labels of training examples to further enhance the distinguishability of the output embeddings.
Our proposed framework is implemented on top of existing prompt-based few-shot learning methods and demonstrates an average 5.86% improvement in F1-measure on three classification tasks under 100-shot learning. We also verify that the improvement stems from a more balanced singular value distribution of the output features and the learnt hierarchical feature space.
In summary, our contributions include:
• We propose a transformation-based constraint on output embeddings, using rotation and ratio balancing, which guarantees the distinguishability of the learned embeddings.
• The proposed hyperbolic embedding-based metric learning strategy not only improves the performance of prompt learning but also captures the relations between different categories.
• Our method outperforms many strong baselines, and the visualisation illustrates that it projects the embeddings into a less overlapping distribution, improving the interpretability and distinguishability of the output. Specifically, across the three evaluated datasets, our method surpasses the state of the art by 9.60%, 5.11% and 2.87%, respectively, in the 100-shot setting.

Related Work
Information diffusion in PLMs. In a typical L-layer transformer-based PLM, assuming the prompt is a concatenation of a few training examples and a test input X_test, consisting of m tokens in total, the goal of in-context learning is to predict the output distribution over the masked token [MASK] at the t-th position. It is formally defined as
$$p(O_t \mid X_{\text{test}}) = p_\theta(O_t \mid h^L_t),$$
where $h^L_t$ denotes the last-layer hidden state corresponding to the [MASK] token of X_test, and $\theta$ denotes the parameters used in prompt-based learning.
Although we have limited knowledge of the output distribution p(O_t|X_test) over the token [MASK], many existing studies have analysed the geometric properties of the last-layer feature h^L and examined its effects on downstream tasks (Goyal et al., 2020; Zhou and Srikumar, 2022). Due to the softmax bottleneck (Yang et al., 2018) and the likelihood loss in language generation tasks (Gao et al., 2019), the output feature distribution in PLMs tends to be anisotropic and rank-deficient, which limits the expressiveness of the generated representations. Goyal et al. (2020) discussed the information diffusion issue among tokens within a sentence, showing that feeding tokens at different positions to the classifier only resulted in a 1.2% variance in classification accuracy. Gao et al. (2019) explored information diffusion across different sentences via singular value decomposition and found that the singular value distributions are skewed, especially in deeper PLM layers, i.e., larger singular values become more predominant compared to the smaller ones.
Context shift in in-context learning. Many researchers have studied the distribution shift (a.k.a. domain shift) between pre-training corpora and test samples, and proposed solutions to decrease the performance variance in prompt-based few-shot learning (Xie et al., 2022; Zhao et al., 2021; Hu et al., 2022; Zhou et al., 2022b; Shin et al., 2022). On the one hand, some in-context learning methods incorporate domain-specific words or learnable tokens in the prompt to discriminate between different contexts. Ben-David et al. (2022) proposed to first generate the name of the domain and then generate domain-related features (DRFs) conditioned on the domain in a supervised manner; both the generated domain name and the DRFs were used as the prompt fed to the model. On the other hand, sophisticated verbalizers contribute to minimising the distance between the two distributions (Schick et al., 2020; Schick and Schütze, 2021; Gao et al., 2021b; Hu et al., 2022). To broaden the coverage of the single-choice verbalizer, Knowledgeable Prompt-tuning (KPT) (Hu et al., 2022) uses a knowledge graph to extract more topic-related words as label words and then refines the label word candidates. To incorporate prior knowledge to calibrate the context shift, Xie et al. (2022) simplified a language model as a Hidden Markov Model, where the observed tokens are sampled from a family of concepts, and proposed the distinguishability condition, which measures context shift via the Kullback-Leibler (KL) divergence.

Contextual Calibration for Output Distribution
Many existing methods calibrate the probabilities of the generated tokens in a language model in order to improve generation quality. In prompt-based learning, we want to find out whether the output distribution p(O_t|X_test), or the output feature h_[MASK] (a part of the last-layer hidden representation of a PLM, h^L), suffers from the information diffusion issue and occupies a narrow cone. We take RoBERTa-based prompt learning as an example, derive the value of h_[MASK] from 1,500 randomly selected test samples from an emotion classification dataset, GoEmotions (Demszky et al., 2020), and visualise the results on a 2D plane in Figure 1(a). For comparison, we select the predicted token with the largest probability at each [MASK] and map their corresponding Word2Vec (Mikolov et al., 2013) vectors onto a 2D plane in Figure 1(b). It is clear that the word embeddings learned by Word2Vec have a more uniform distribution around the origin. In contrast, the representations derived by RoBERTa degenerate into a narrow cone, which implies limited expressiveness.
Inspired by the approach proposed in Yan et al. (2022), we display the singular value distribution of h_[MASK] and calculate the distribution statistics, i.e., the matrix moments and the average cosine similarity between every [MASK] pair, in Figure 1(c).
From the empirical results, we can see that the hidden representations of [MASK] in different samples share much similar information, with a token uniformity value (Yan et al., 2022) ("tokenuni" in Figure 1(c)) of 0.939. Moreover, most of the energy of h_[MASK] is concentrated in very few singular values, which implies a severe information diffusion issue.
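As an illustration of these diagnostics, the sketch below (ours, not the authors' code) computes the token uniformity value, i.e., the average pairwise cosine similarity among a set of representations, and the normalised singular value spectrum, contrasting a synthetic "narrow cone" feature matrix with an isotropic one.

```python
# Illustrative diagnostics for the information diffusion issue:
# token uniformity and the normalised singular value spectrum.
import numpy as np

def token_uniformity(H):
    """Average cosine similarity over all distinct row pairs of H (n, d)."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    sims = Hn @ Hn.T
    n = H.shape[0]
    return (sims.sum() - n) / (n * (n - 1))  # exclude the diagonal

def singular_spectrum(H):
    """Singular values of H, normalised to sum to one."""
    s = np.linalg.svd(H, compute_uv=False)
    return s / s.sum()

rng = np.random.default_rng(0)
# "Narrow cone": every row shares one dominant direction plus small noise
H_narrow = rng.normal(size=64) + 0.1 * rng.normal(size=(200, 64))
# Isotropic: rows spread evenly around the origin
H_iso = rng.normal(size=(200, 64))

print(token_uniformity(H_narrow))      # close to 1: severe diffusion
print(token_uniformity(H_iso))         # close to 0: well spread
print(singular_spectrum(H_narrow)[0])  # one singular value dominates
```

A high token uniformity value together with a skewed spectrum, as in the "narrow cone" case, mirrors the behaviour reported for the [MASK] representations above.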

Uniform Ratio-based Distinguishability
Although many calibration methods have been proposed, few of them focus on explicitly addressing the information diffusion issue in the prompt-based learning framework. One main challenge is that the unlabelled data used in language model pre-training is significantly larger than the labelled samples used for prompt tuning. Hence, the optimised distribution in prompt-based few-shot learning can be very different from the true distribution. To avoid inheriting the information diffusion issue from the pre-training phase, we propose a calibration method that reduces the skewness of the output token distributions, such that the output representations are evenly distributed in the embedding space. The idea is to rotate the original embedding space into an isotropic metric space via an inner product-based operator on a learnable basis.
For each dimension of the basis, we use the inner product to measure its relevance with a given input.The dimension-dependent relevance scores are sent to a Multi-layer Perceptron (MLP) decoder to generate the calibrated output embedding for final prediction.
The framework of the proposed calibration method is shown in Figure 2. In practice, due to the small number of training samples in prompt learning, the relevance scores might be dominated by very few dimensions. Therefore, inspired by Zhou et al. (2022a), who proposed a ratio estimator to balance the distribution across different label categories, we design a scaling matrix for our isotropic distribution scenario. That is, for both labelled and unlabelled data, the multi-class ratio between different dimensions should be similar. Concretely, assume we have $N$ labelled samples $\{y_j, x_j\}_{j=1}^{N}$ and $M$ unlabelled samples from pre-training $\{x_j\}_{j=N+1}^{N+M}$, where $x_j$ is the input sample, $y_j$ is the true label, and $M \gg N$. To simplify notation, in the rest of this paper we use $x_j$ to denote the feature from the last embedding layer and $h_j$ to denote our calibrated output feature. Then, for the representation of a masked token $x_j$, we assume there are $K$ isotropic directions in the metric space, and the corresponding inner product-based relevance score is
$$S_k(x_j) = \sigma(\langle W_k, x_j \rangle), \quad (1)$$
where $\sigma(\cdot)$ is the softmax activation function. Here, we can define a rotation matrix based on $W_k$, since Eq. (1) projects an input embedding onto a new metric space by rotation. To guarantee the orthogonality of the basis in the new metric space, we use the following regulariser during training:
$$\mathcal{L}_{orth} = \|WW^\top - I\|_F^2, \quad (2)$$
where $W$ is the stacking of $\{W_k\}_{k=1}^{K}$. Correspondingly, for each dimension $k$, we define a ratio score that separates the dimensions and avoids a skewed distribution by minimising the following loss:
$$\mathcal{L}_t = \sum_{k=1}^{K} \Big( \frac{1}{N}\sum_{j=1}^{N} R_k(x_j) - \frac{1}{M}\sum_{j=N+1}^{N+M} R_k(x_j) \Big)^2, \quad (3)$$
where $R_k(x_j)$ is an MLP-based estimator with a softmax activation:
$$R_k(x_j) = \mathrm{softmax}_k(\mathrm{MLP}(x_j)). \quad (4)$$
By minimising $\mathcal{L}_t$, even if an input sample $x_j$ is similar to a basis vector along a popular dimension $k$, it can still be assigned a low ratio score $R_k(x_j)$ if other samples are closer to the basis vector in dimension $k$.
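The relevance score of Eq. (1) and the orthogonality regulariser can be sketched as follows. This is a minimal numpy illustration, with shapes and initialisation chosen by us rather than taken from the paper.

```python
# Sketch of the inner product-based relevance scores and the
# orthogonality penalty on the learnable basis W (shapes are illustrative).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def relevance_scores(W, x):
    """Inner product of x with each of the K basis rows of W,
    normalised with a softmax over the K dimensions."""
    return softmax(W @ x)

def orth_penalty(W):
    """||W W^T - I||_F^2: zero iff the rows of W are orthonormal."""
    K = W.shape[0]
    return np.sum((W @ W.T - np.eye(K)) ** 2)

rng = np.random.default_rng(1)
d, K = 16, 4
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
W_orth = Q[:K]                                # K orthonormal rows
W_rand = rng.normal(size=(K, d))              # unconstrained basis

x = rng.normal(size=d)
S = relevance_scores(W_orth, x)
print(S.sum())               # softmax scores sum to 1
print(orth_penalty(W_orth))  # ~0 for an orthonormal basis
```

In training, the penalty would be added to the overall loss so that gradient descent keeps the learnable basis close to a rotation.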
In this way, we can balance the distribution after rotation. We define the stacking of $S_k$ as a scaling matrix which aims to distribute $x_j$ uniformly into $K$ clusters in the metric space. However, it is difficult to optimise the loss defined in Eq. (3), since the unlabelled pre-training data is much larger than the labelled data and is usually unseen by the downstream tasks. We instead define an alternative optimisation objective. First, according to Eq. (3), we need to ensure that for any two dimensions $k$ and $t$, the ratio $\sum_j R_k(x_j) / \sum_j R_t(x_j)$ is close to 1. By Jensen's inequality, a lower bound on this objective can be derived, and the bound is attained for any two independent dimensions when their summed ratio scores are equal. That is, for any two dimensions, the sum of their ratio scores should be similar. As such, Eq. (3) can be approximated by:
$$\mathcal{L}_t \approx \sum_{k \neq t} \Big( \sum_{j=1}^{N} R_k(x_j) - \sum_{j=1}^{N} R_t(x_j) \Big)^2. \quad (5)$$
Accordingly, we can define the distinguishability loss in a more general form using both the relevance score and the ratio score, without the need to sample from unlabelled data:
$$\mathcal{L}_{dis} = \mathcal{L}_{orth} + \sum_{k \neq t} \Big( \sum_{j=1}^{N} S_k(x_j) R_k(x_j) - \sum_{j=1}^{N} S_t(x_j) R_t(x_j) \Big)^2. \quad (6)$$
From our findings in Section 3, much of the information encoded in the output representations generated by the last layer of a PLM occupies a space spanned by very few singular value directions, which leads to the information diffusion issue. Our solution is therefore to re-project the output features onto a new hyperplane in which the information is more evenly distributed across directions, while at the same time deriving a ratio vector by aggregating the rotated components.
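A minimal sketch of the ratio-balancing idea: the penalty below compares the summed ratio scores of every pair of dimensions, so a batch whose ratio scores pile onto a few dimensions is penalised more than a balanced one. The exact form of the paper's loss may differ; this is our illustrative approximation.

```python
# Sketch of the dimension-balancing penalty: summed ratio scores
# should be similar across all K dimensions.
import numpy as np

def softmax_rows(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ratio_balance_loss(R):
    """R has shape (n_samples, K); each row is a softmax over K dimensions.
    Penalise mismatched per-dimension totals of the ratio scores."""
    totals = R.sum(axis=0)                    # summed ratio score per dimension
    diff = totals[:, None] - totals[None, :]  # all pairwise differences
    return np.sum(diff ** 2) / 2              # each unordered pair counted once

rng = np.random.default_rng(2)
R_balanced = softmax_rows(rng.normal(scale=0.01, size=(100, 5)))  # near-uniform rows
R_skewed = softmax_rows(rng.normal(scale=5.0, size=(100, 5)))     # peaked rows

print(ratio_balance_loss(R_balanced))  # small: dimensions used evenly
print(ratio_balance_loss(R_skewed))    # large: a few dimensions dominate
```

Minimising such a penalty discourages the "popular dimension" failure mode described above.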

Supervised Prompt Learning
With the proposed distinguishability loss-based learning in Section 3.1, an input embedding is separated into vectors along $K$ independent dimensions. Then, for the labelled data $\{x_j\}_{j=1}^{N}$, we propose to use $K$ independent decoders to produce the final prediction. The decoding result is based on the relevance score and the ratio score of each independent dimension:
$$h_j = \sum_{k=1}^{K} S_k(x_j)\, R_k(x_j)\, \mathrm{Decoder}_k(x_j), \quad (7)$$
where $\mathrm{Decoder}_k$ is the decoder for the $k$-th dimension. The representation $h_j$ is then used in the verbalizer $p_{\text{verbalizer}}(\hat{O} \mid h_j)$, where $\hat{O}$ is the predicted masked token. Finally, the cross-entropy loss $H$ is defined over the prediction $\hat{O}$ and the true label $y_j$:
$$\mathcal{L}_{cls}(x_j) = H(y_j, p_{\text{verbalizer}}(\hat{O} \mid h_j)). \quad (8)$$
By combining the uniform ratio-based distinguishability loss $\mathcal{L}_{dis}$ and the prompt-based classification loss $\mathcal{L}_{cls}$, we obtain our first model, named Transformation-based Adaptation for Ratio bAlanced (TARA) prompt learning, which minimises $\mathcal{L}_{TARA} = \mathcal{L}_{cls}(x_j) + \mathcal{L}_{dis}$. Note that $\mathcal{L}_{cls}(x_j)$ is the default loss term in all the baselines and our proposed methods.
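The per-dimension decoding step can be sketched as below, under our assumption that the decoder outputs are combined linearly, weighted by the relevance and ratio scores. The linear decoders here are stand-ins for whatever decoder architecture is actually used.

```python
# Sketch of combining K per-dimension decoders weighted by the
# relevance scores S and ratio scores R (decoders are illustrative).
import numpy as np

def calibrated_feature(x, S, R, decoders):
    """Mix each decoder's output, weighted by its relevance score S[k]
    and ratio score R[k], into one calibrated feature."""
    return sum(S[k] * R[k] * decoders[k](x) for k in range(len(decoders)))

rng = np.random.default_rng(3)
d, K = 8, 3
mats = [rng.normal(size=(d, d)) for _ in range(K)]
decoders = [lambda x, M=M: M @ x for M in mats]  # linear stand-in decoders

x = rng.normal(size=d)
S = np.full(K, 1.0 / K)  # uniform relevance scores (placeholder values)
R = np.full(K, 1.0 / K)  # uniform ratio scores (placeholder values)
h = calibrated_feature(x, S, R, decoders)
print(h.shape)  # (8,)
```

With a one-hot relevance/ratio pattern the mixture reduces to a single decoder, which is the degenerate "popular dimension" case that the balancing loss is meant to avoid.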

Dimension Rotation by Hyperbolic Embeddings
In Section 3.1, we projected the input mask embedding into a $K$-dimensional metric space to avoid skewed distributions. However, this ignores the potential class relations between dimensions. For example, in emotion classification, the emotions 'gratitude' and 'approval' both belong to the coarse positive class, but they are associated with different fine-grained labels in the GoEmotions dataset (Demszky et al., 2020). Hence, in this section, we consider only those positive pairs under the same coarse category to achieve better class disambiguation via proxy-based metric learning (Movshovitz-Attias et al., 2017; Yang et al., 2022), which uses an anchor vector to represent a category for metric loss optimisation and captures the hierarchical structure between coarse- and fine-grained labels in the hyperbolic space.
Strategies for Constructing Sample Pairs. Inspired by the hierarchical structure of coarse-to-fine emotion categories, we assume that a fine-grained emotion should be close to the coarse-grained emotion it belongs to. To implement this idea, we construct sample-anchor pairs $(h_j, z_i^+)$ for training, where $h_j$ is the representation used for prompt prediction and $z_i^+ \in \mathbb{R}^d$ is a learnable anchor representation for each coarse class.
Metric Learning in a Hyperbolic Space. To maximise the similarity within sample-anchor positive pairs, where the sample and the anchor share the same coarse-grained label, while minimising the similarity in negative pairs, we adopt the following metric learning objective:
$$\mathcal{L}_{metric}(h_j) = -\log \frac{\exp(-d(h_j, z^+_{p_j}))}{\sum_{i=1}^{C} \exp(-d(h_j, z_i))}, \quad (9)$$
where $\{(h_j, z_i^+)\}_{i=1}^{C}$ is the set of sample-anchor pairs constructed for each sample, $C$ denotes the number of anchors, $z^+_{p_j}$ is the positive pairing anchor of the $j$-th sample, and $d(\cdot)$ is the hyperbolic distance defined by the Poincaré ball model of hyperbolic space (Nickel and Kiela, 2017). In an $n$-dimensional hyperbolic space, all points fall inside the open unit ball $\mathcal{I}^n = \{u \in \mathbb{R}^n : \|u\| < 1\}$, where $\|\cdot\|$ denotes the Euclidean norm. The distance $d(\cdot)$ between two points $u, v \in \mathcal{I}^n$ is formulated as:
$$d(u, v) = \operatorname{arccosh}\Big(1 + \frac{2\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\Big). \quad (10)$$
The motivation for using $\mathcal{L}_{metric}(h_j)$ is to pull similar categories together in the metric space. Hence, we obtain our final learning objective by adding the tree-structured metric learning loss $\mathcal{L}_{metric}(h_j)$ to TARA:
$$\mathcal{L}_{final} = \mathcal{L}_{cls}(x_j) + \mathcal{L}_{dis} + \mathcal{L}_{metric}(h_j). \quad (11)$$
For comparison, we propose a variant called TML by keeping the learning architecture and simply adding $\mathcal{L}_{metric}(h_j)$ to the classification loss $\mathcal{L}_{cls}(x_j)$, but without the ratio-balancing term $\mathcal{L}_{dis}$, that is, $\mathcal{L}_{TML} = \mathcal{L}_{cls}(x_j) + \mathcal{L}_{metric}(h_j)$.
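The Poincaré distance of Eq. (10) translates directly into code. The metric loss below is a sketch assuming a standard proxy-NCA-style softmax over negated hyperbolic distances; the paper's exact normalisation may differ.

```python
# Poincaré ball distance and a proxy-style hyperbolic metric loss (sketch).
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """d(u, v) = arccosh(1 + 2||u-v||^2 / ((1-||u||^2)(1-||v||^2)))
    for points inside the open unit ball."""
    nu, nv = np.sum(u * u), np.sum(v * v)
    num = 2 * np.sum((u - v) ** 2)
    return np.arccosh(1 + num / ((1 - nu) * (1 - nv) + eps))

def metric_loss(h, anchors, pos_idx):
    """Pull h towards its positive anchor and push it from the others,
    via a softmax over negated hyperbolic distances."""
    d = np.array([poincare_distance(h, z) for z in anchors])
    logits = -d
    m = logits.max()
    log_prob = logits[pos_idx] - (m + np.log(np.sum(np.exp(logits - m))))
    return -log_prob

anchors = [np.array([0.5, 0.0]), np.array([-0.5, 0.0])]  # two coarse-class anchors
h_near = np.array([0.45, 0.0])   # sample close to its positive anchor 0
h_far = np.array([-0.45, 0.0])   # sample close to the wrong anchor
print(metric_loss(h_near, anchors, 0) < metric_loss(h_far, anchors, 0))
```

A useful property of the Poincaré ball is that distances grow rapidly near the boundary, which is what lets anchors near the origin act as coarse parents of fine-grained points pushed towards the rim.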

Experiments
Datasets We evaluate our proposed approach on three multi-class text classification datasets: the Emotion dataset (Saravia et al., 2018), an academic paper classification dataset, WOS (Kowsari et al., 2017), and a fine-grained emotion classification dataset, GoEmotions (Demszky et al., 2020). All of these datasets have hierarchical label structures. The dataset statistics are shown in Table 2. For all datasets, we remove punctuation, digits, and special characters that do not carry specific semantic meaning. For the Emotion dataset, which consists of tweets, we also remove user mentions.
Baselines We implement our proposed framework on top of commonly used prompt-based learning methods and compare it with existing approaches, including those that can be used for learning more discriminative representations:
• Prompt-baselines. Three commonly used prompt-based methods are selected: Soft Prompts (Brown et al., 2020), Prompt-Tuning (Lester et al., 2021) and PTR (Han et al., 2021). The best-performing method is used as the default prompt-based training method for the following three comparison models, and is denoted as Prompt-baseline.
• KPT (Hu et al., 2022). It uses a knowledge graph to incorporate topic-related label words to increase the coverage of the verbaliser.
• Context Calibration (Zhao et al., 2021). This method calibrates the output representations via a one-layer linear transformation, whose weight matrix is optimised to be diagonal.
• Proxy-NCA (Movshovitz-Attias et al., 2017). It creates a proxy for each class and uses the Neighbourhood Component Analysis (NCA) loss to pull samples closer to their assigned proxies while pushing negative samples away.
Prompt Settings As the performance of prompt-based methods heavily relies on prompt templates and verbalisers, we use the same template and verbaliser for all models for a fair comparison. The prompt templates are shown in Table 3. The original class labels are used as label words in the verbaliser, as in (Schick and Schütze, 2021).

Table 3: Prompt templates used in the three datasets.

Few-shot Learning on Three Datasets
We randomly select k different training samples for few-shot learning and show the results across the three datasets in Table 4.
For metric learning, Proxy-NCA with a contrastive loss leads to performance degradation compared to the Prompt-baseline, with more significant drops on the GoEmotions dataset, which has the largest number of label categories. By contrast, TML gives better results than both the Prompt-baseline and Proxy-NCA, showing its effectiveness in encoding the hierarchical relations between coarse- and fine-grained labels. This is further demonstrated in Figure 3, which shows the similarity matrix of the fine-grained classes.
For the calibration methods, Context Calibration and TARA are overall better than the Prompt-baseline. This shows that even a simple linear transformation of the output representations can greatly improve the performance of prompt-based learning. The superior performance of TARA over Context Calibration demonstrates the benefit of our proposed rotation and scaling transformations. Combining TML with TARA, our full model achieves the best performance, and the improvements are more pronounced when k is larger. In the 100-shot setting, our method surpasses the state-of-the-art method, Context Calibration, by 9.6% on Emotion, 5.1% on WOS, and 2.9% on GoEmotions, respectively, verifying its superiority in few-shot text classification.

Information Diffusion Alleviation
In addition to the classification results, we also examine the characteristics of the generated output representations to check whether the information diffusion issue has been addressed. Figure 4 shows the PCA projections of all the [MASK] representations, i.e., h_[MASK], in the test samples, colour-coded according to the class labels assigned by the model. It is clear that our method generates more widely distributed [MASK] representations, thereby better reducing the overlap of features from different class labels. For example, on the Emotion dataset, the output features from the baseline model mostly reside along the horizontal direction, while ours are distributed more evenly across different directions.
We also calculate the summary statistics of the singular value distribution of the output features, as well as the average similarity between every two [MASK] representations. The results are shown in Table 5. The average cosine similarity (CosSim) between every token pair is used as a proxy measure of the degree of information diffusion. We observe that the CosSim value of the output representations generated by our model is significantly lower than those of the other baselines. We also observe an increase in the median and a decrease in the variance of the singular value distribution of our model outputs in comparison with the prompt learning baseline. These results show that our model produces output representations with a more balanced singular value distribution. The smaller skewness value further verifies that our proposed model generates isotropic representations in which the embedding dimensions are uncorrelated.

Ablation Study
To study the effect of the different components of our proposed distinguishability loss, i.e., the constraints applied to the transformation operation for ratio balancing, we remove one of them at a time and compare the performance changes in Table 6. Here, L_orth is applied on W in Eq. (2), L_t is applied on S_k (from Eq. (4) and Eq. (5)), and l_2 is the weight of the L2 regularisation term on all the other learnable parameters. The L_orth and l_2 constraints have similar effects on the overall performance, as they both act as axis transformations, while the constraint L_t applied on S_k plays a more important role: its removal leads to the largest performance drop among all the settings. This partly demonstrates the importance of the balancing ratio vector after the rotation transformation.

Conclusion
In this paper, to address the information diffusion issue in prompt-based few-shot learning, we propose a calibration method based on feature transformation, which first rotates the output embeddings into a new metric space and then scales the ratio of each dimension towards a uniform distribution to guarantee the distinguishability of the transformed embeddings. In addition, we utilise hyperbolic embeddings to capture the hierarchical relations between class labels to guide the metric learning strategy and enhance the interpretability of the learned output embeddings. Extensive experiments on three multi-class classification tasks under various settings demonstrate the effectiveness of our approach, with an average 5.9% improvement in F1-measure.

Limitations
In this work, we only focus on the multi-class classification task with hierarchical class labels. Future work could explore extending our idea to other tasks, such as controllable text generation, which suffers from a similar information diffusion issue. Another potential direction is to learn a prior distribution rather than simply using the uniform distribution in ratio balancing. Since uniform distribution-based ratio balancing is a strong assumption, it might not be suitable for some tasks in real-world applications. One could use a VAE or VQ-VAE to learn a distribution that could subsequently be used to regularise the optimisation of the feature transformation.

Figure 1: (a) The mapping results of 1,500 [MASK] tokens randomly sampled from the GoEmotions dataset. Each red dot is the output representation derived from prompt-based learning for the [MASK] token of an input example, which is used to predict the masked token in the corresponding position. (b) Each blue dot is the static word representation of the predicted token with the largest probability on [MASK] for one of the 1,500 samples in (a). (c) Singular value distribution (after normalisation) of the output representations of the 1,500 randomly selected [MASK] tokens. It is clear that the representations are dominated by very few singular values.

Figure 2: Our proposed calibration method is applied to the output embeddings from the last layer of a PLM. After being transformed with a rotation matrix through a Multi-Layer Perceptron (MLP), the resulting output feature is assumed to have a more balanced singular value distribution across the basis directions. Moreover, as the vector norm along each projected direction changes in the new basis, we derive a ratio vector to balance the distribution along the rotated directions.
the text is [MASK].
GoEmotions: <X> The emotional aspect of this text is [MASK].

Figure 3: Heatmap of the pairwise cosine similarity of fine-grained classes on GoEmotions. (a) Label representations from the PLM without fine-tuning. (b) Fine-tuned label representations by the classification module only. (c) Fine-tuned label representations with the proposed constraint but based on Euclidean distance, i.e., Proxy-NCA. (d) Fine-tuned label representations by TML.

Figure 4: The PCA projection of the output representations belonging to different classes. In each sub-figure, the left panel is the Prompt-baseline, while the right panel is our method. It is clear that our method distributes the output representations more evenly in the embedding space, while the output representations from the baseline appear more concentrated.

Table 4: Weighted F1 scores on the three datasets. The proposed TML is better than Proxy-NCA. Our full method (TML+TARA) achieves the best performance among all the settings.

Table 5: Statistics of the singular value distribution of the output features, as well as the average cosine similarity over all [MASK] token pairs.

Table 6: Ablation study of the various loss terms in the distinguishability loss objective.