Parameter-free Automatic Prompting: A Latent Pseudo Label Mapping Model for Prompt-based Learning



Introduction
With the advent of powerful pre-trained language models (PLMs) such as GPT-3 (Brown et al., 2020), prompt-based learning has flourished in recent years because it can effectively bridge the gap between pre-training tasks and downstream tasks (Liu et al., 2021). Several corresponding studies report better performance than traditional fine-tuning methods on few-shot learning tasks (Cui et al., 2022), where few-shot learning is generally framed as N-way K-shot learning: each task consists of N classes with K instances per class.
The pipeline of vanilla prompt-based learning for classification tasks is composed of a prompting phase and a prediction phase. In the prompting phase, the input sentence is reconstructed into a cloze-type sentence with a pre-defined template containing a <MASK> slot. In the prediction phase, the PLM tries to fill the <MASK> slot with words from the vocabulary based on the context. One challenge in this pipeline is mapping the fill-in word to the expected class label. To address this challenge, label mapping is introduced into the pipeline, assigning a class representative to each class to associate the PLM output with the expected class label. Typically, each class representative is one or a set of human-chosen word(s) that should be highly relevant to the class. Based on the word-occurrence probability, the model can calculate the occurrence probability of the class representatives. Finally, the label of the input sentence is inferred from the class representative with the highest occurrence probability.
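As a concrete illustration, the prediction phase with a one-word label mapping can be sketched as follows (a minimal numpy sketch; the vocabulary, probabilities, and mapping are made up for illustration, not taken from the paper):

```python
import numpy as np

# Toy vocabulary and a hypothetical distribution a PLM might output at the
# <MASK> slot for the prompt "The movie was wonderful. It was <MASK>."
vocab = ["great", "terrible", "good", "bad", "film"]
mask_probs = np.array([0.40, 0.05, 0.30, 0.10, 0.15])

# Manual label mapping: one human-chosen representative word per class.
label_mapping = {"positive": "great", "negative": "terrible"}

def predict(mask_probs, vocab, label_mapping):
    """Infer the label whose class representative has the highest
    occurrence probability at the <MASK> slot."""
    scores = {label: mask_probs[vocab.index(word)]
              for label, word in label_mapping.items()}
    return max(scores, key=scores.get)

print(predict(mask_probs, vocab, label_mapping))  # positive
```

The quality of the chosen representative words directly determines how well these scores separate the classes, which is exactly the problem label mapping research targets.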
As label mapping plays an important role in prompt-based learning (Liu et al., 2021), designing more effective label mapping becomes significant. Recent label mapping methods can be divided into two branches: manual label mapping (MLM) and automatic label mapping (ALM). The MLM method manually selects the class representative for each class, and it is proven to be very powerful in many tasks (Liu et al., 2021). However, this method is time-consuming and labor-intensive: humans need to thoroughly understand the downstream tasks first and then select appropriate words for each class representative. Additionally, relying heavily on human subjectivity can cause various problems. Thus, ALM methods are proposed to eliminate the effects of human involvement. They can be further divided into soft label mapping, search-based label mapping, and prototypical label mapping. Soft label mapping (Hambardzumyan et al., 2021) replaces concrete tokens with trainable tokens to generate class representatives. Search-based label mapping (Gao et al., 2021) aims to find a subset of candidate words from the vocabulary using algorithms or additional networks with additional parameters. Prototypical label mapping (Cui et al., 2022) projects the <MASK> vector onto a new embedding space with additional linear networks and fine-tunes the model with contrastive learning. However, these ALM methods involve extra parameters that require a large number of instances to train, whereas in practice the available data is scarce in few-shot scenarios, making it difficult to train the model to find optimal class representatives. Consequently, inexpressive class representatives may degrade the final classification performance.
To address the above problem, we propose a Latent Pseudo Label Mapping (LPLM) model. LPLM first uses a Majority Voter that shares the same parameters with the PLM to automatically generate class representatives. As the class representatives produced this way may be noisy, we build a latent variable model and introduce an EM-style algorithm to gradually reduce the noise and enhance the expressiveness of each class representative, as shown in Figure 1.
The novel prompt-based learning pipeline with LPLM has the advantage that no additional parameters or human knowledge are required to generate class representatives. In the E-step, the distributions of the latent variables are updated with a Majority Voter and the class representatives are re-selected based on the parameters optimized in the previous M-step. In the M-step, predictions are made according to the class representatives generated in the E-step, and the parameters of the PLM are updated based on the prediction results. The two steps alternate to obtain increasingly optimal distributions for the class representatives and to improve the overall performance. Moreover, we introduce two strategies for selecting the keywords in each class representative, which are discussed in Section 3.3.
To verify the effectiveness of LPLM, we conduct a series of experiments on widely-used classification datasets. Since our method can be applied to any classification task, we deliberately choose three different types of datasets: a sentiment classification dataset SST-2 (2-way), a well-known text classification dataset AG's News (4-way), and a popular entity typing dataset Few-NERD (66-way). The experimental results show that LPLM significantly outperforms the other three ALMs and even outperforms MLM. The contributions of this paper can be summarized as follows: • We introduce the latent variable model into prompt-based learning and propose LPLM, which optimizes the label mapping without human knowledge or extra parameters.
• We propose LPLM to automatically generate the class representative for each class using the majority voting mechanism, and alternately optimize the parameters of the PLM and the distributions of latent variables with the EM-style algorithm.
• We conduct a series of experiments to compare the performance of LPLM with other baseline label mappings, which demonstrates that LPLM outperforms not only all other ALMs, including the SOTA, but also MLM.
Related Work

Prompt-based Learning
As the demand for effectively fine-tuning large-scale language models such as GPT-3 grows, the popularity of prompt-based learning (in-context learning) has also increased. To fine-tune the model with prompt-based learning, a template with a <MASK> slot is used to reconstruct the input texts, like '<TEXT>. The category is <MASK>.'. In recent works, prompt-based learning has achieved impressive performance in many downstream tasks, such as text classification (Gao et al., 2021), knowledge probing (Petroni et al., 2019), and data construction (Choenni et al., 2021).
Although prompt-based learning consists of two parts, namely template designing and label mapping, most existing works focus on the former due to its simplicity. However, as discussed in Section 1, label mapping is also an important part that determines the performance of prompt-based learning, as it builds a bridge between the word-occurrence probability and the class labels. To accomplish this more effectively, in this paper, we propose a novel method for ALM.

Mainstream Label Mappings
While label mapping has a crucial impact on the performance of prompt-based learning (Gao et al., 2021), each of the four mainstream label mapping approaches has its shortcomings.
Manual label mapping engages human involvement to select the class representative for each class (Schick and Schütze, 2021). As this method is heavily dependent on human knowledge, side effects may occur. Contrary to expectations, the level of human knowledge may not always be guaranteed, as each task requires different background knowledge. If the selected class representatives are not expressive enough, the overall performance can fluctuate greatly (Gao et al., 2021).
Soft label mapping directly uses trainable continuous tokens as class representatives and aims at optimizing them at the fine-tuning stage (Hambardzumyan et al., 2021). However, to achieve this, enough data is required for optimization, which is not realistic in practical few-shot scenarios.
Search-based label mapping obtains candidate words from the entire vocabulary and uses the validation set to select the best class representatives for fine-tuning the PLM (Gao et al., 2021). However, this method faces the same difficulty as soft label mapping: it is hard to optimize the model to find suitable candidates from a large vocabulary when only a few instances are available.
Prototypical label mapping takes inspiration from contrastive learning and introduces an additional contrastive loss into the fine-tuning process (Cui et al., 2022). Specifically, after projecting the instances into a new embedding space, instances belonging to the same class are drawn together and those belonging to different classes are separated. Nonetheless, this method also requires training additional randomly initialized parameters for the projection function. When only a few instances are available (e.g., K=1), it is difficult to train the projection networks, and the different classes may hardly be distinguished.

Latent Pseudo Label Mapping
In order to remedy the shortcomings of the methods mentioned in Section 2.2, we introduce LPLM to effectively exploit the knowledge provided by the limited instances. It is built upon a latent variable model and optimized with the EM-style algorithm. In this section, we first define the necessary notations for few-shot classification tasks and introduce the overall architecture of LPLM. Then, we explain the detailed mechanism of the E-step and the M-step with the mathematical derivation and the learning process of the EM-style algorithm.

Task Definition
In the N-way K-shot setting, a set of labeled instances is defined as X, which has size |X| = N * K as it consists of K instances for each of the N classes. The corresponding set of ground-truth labels is defined as Y, which also contains |Y| = N * K elements, where each y_i ∈ {1, . . ., N}. The proposed LPLM focuses on few-shot classification tasks and aims at predicting the label z_i for the input instance x_i ∈ X. The optimization target is to maximize

$$\sum_{i=1}^{N \times K} \log p_{M_\theta}(z_i \mid x_i, y_i; \theta, T(\cdot)), \tag{1}$$

where $p_{M_\theta}(z_i \mid x_i, y_i; \theta, T(\cdot))$ is the probability that z_i is predicted to be y_i given the task-specific template T(·) and the pre-trained language model $M_\theta$ with parameter θ. In total, N * K few-shot instances are used by the model to optimize its parameter θ.

LPLM Architecture
The architecture of LPLM is designed on the basis of the classic principle-based EM algorithm (Dempster et al., 1977).
In prompt-based learning, $M_\theta$ only provides the probability of each word in the vocabulary V occurring at the <MASK> slot. To bridge the gap between the word-occurrence probability and the expected label z_i, we introduce latent variables W = {w_1, . . ., w_N}, where each element denotes a class representative. The distribution of W and of each w_i ∈ W is updated in each E-step. The objective function is extended to

$$\sum_{i} \log \sum_{W} p(z_i \mid W, x_i, y_i, X, Y; \theta, T(\cdot))\, p(W \mid x_i, y_i, X, Y; \theta, T(\cdot)), \tag{2}$$

where $p(W \mid x_i, y_i, X, Y; \theta, T(\cdot))$ represents the probability of choosing W as the class representatives.
E-step: Each input instance is first wrapped with a template (with a <MASK> slot) and fed into the PLM $M_\theta$ after the parameters are updated in the previous M-step. $M_\theta$ outputs the distribution over the whole vocabulary at the <MASK> slot. Since $M_\theta$ contains prior knowledge and a task-specific template is used, the distribution is highly related to the contextual semantics. After averaging the distributions of all instances in the same class, LPLM updates each class representative $w_i$ with the keywords of the Top-k highest probabilities. The weights of the keywords are calculated by normalizing their corresponding probabilities.
M-step: For each wrapped instance, LPLM obtains the distribution at the <MASK> slot by feeding it into $M_\theta$. The probability of each label is calculated by a weighted average of the word-occurrence probabilities of the keywords in each class representative. The parameters of $M_\theta$ are updated to increase the probability of predicting the correct class label for each instance.

EM-style iteration:
The EM algorithm iteratively performs E-steps and M-steps to gradually mine the class-specific semantics in few-shot instances with $M_\theta$. In the E-step of round t, the distributions of the latent variables are optimized with $\theta_t$, which was updated in the previous M-step. The class representatives $W_{t-1}$ from round t−1 are updated to $W_t$ by $M_{\theta_t}$. Then, in the M-step, based on the classification results predicted with $W_t$, $\theta_t$ is updated to $\theta_{t+1}$ and $M_{\theta_t}$ to $M_{\theta_{t+1}}$.
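The alternation above can be sketched as a toy loop (a minimal numpy sketch: `theta` and the softmax are stand-ins for the PLM and its <MASK> distributions, and the M-step is a gradient-free nudge rather than real back-propagation, so everything here is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, V, TOPK, ROUNDS = 2, 2, 8, 3, 3    # toy sizes: classes, shots, vocab, Top-k, rounds

labels = np.repeat(np.arange(N), K)      # ground-truth labels Y
theta = rng.normal(size=(N * K, V))      # toy stand-in for the PLM's parameters

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for t in range(ROUNDS):
    probs = softmax(theta)               # <MASK> distributions from M_{theta_t}
    # E-step: update W_{t-1} -> W_t by majority voting with theta_t
    reps = np.zeros((N, V))
    for c in range(N):
        avg = probs[labels == c].mean(axis=0)
        top = np.argsort(avg)[-TOPK:]    # Top-k keywords for class c
        reps[c, top] = avg[top]
        reps[c] /= reps[c].sum()
    # M-step: theta_t -> theta_{t+1} using the predictions under W_t
    # (a toy nudge; the real model back-propagates a loss through M_theta)
    theta = theta + reps[labels]
```

Each round sharpens the class representatives with the newly updated parameters, which is the evolutionary behavior the EM-style iteration relies on.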

E-step Details
In the E-step, LPLM generates the class representatives, namely the latent variables, after optimizing their distributions based on the parameter θ updated in the previous M-step. The core idea of the generation is that each instance votes for the class representative of its own class.

Majority Voter
The Majority Voter $V_\theta$ is the main component of the E-step: it determines the class representative for each label, where θ is shared with $M_\theta$ since the voter also uses the PLM $M_\theta$ as its basis. The distribution of class representatives is obtained by the voter as

$$\bar{p}_i = \frac{1}{K} \sum_{j:\, y_j = i} V_\theta(T(x_j)), \tag{3}$$

i.e., the averaged vote of the instances in class i. For each wrapped instance $x_j \in X$, $V_\theta$ outputs a |V|-dimensional distribution at the <MASK> slot, where V is the vocabulary of $M_\theta$. This distribution is the likelihood of each word in the vocabulary representing the instance's class, namely the vote from $x_j$. LPLM averages the distributions (votes) of all instances in the same class according to Y and selects the keywords with the highest probabilities using a specific selection strategy to form W. In practice, each class representative $w_i \in W$ is a |V|-dimensional vector that represents a set of highly related keywords that best express the corresponding class. Since each dimension in the averaged distribution corresponds to one keyword in V, LPLM sets the dimensions of the unselected keywords to 0 and normalizes the distribution to get $w_i$.
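The voting-and-selection procedure can be sketched as follows (a minimal numpy sketch; the probabilities are made up and `build_representatives` is a hypothetical helper, not code from the paper):

```python
import numpy as np

def build_representatives(mask_probs, labels, n_classes, k):
    """Majority voting: average each class's <MASK> distributions (its
    'votes'), keep the Top-k words, zero the rest, and renormalize.

    mask_probs: (num_instances, |V|) word-occurrence probabilities from V_theta
    labels:     (num_instances,) ground-truth class ids
    """
    V = mask_probs.shape[1]
    reps = np.zeros((n_classes, V))
    for c in range(n_classes):
        avg = mask_probs[labels == c].mean(axis=0)   # the class's averaged vote
        top = np.argsort(avg)[-k:]                   # Top-k keywords
        w = np.zeros(V)
        w[top] = avg[top]                            # unselected dims -> 0
        reps[c] = w / w.sum()                        # normalize the weights
    return reps

# toy example: 4 instances, 2 classes, |V| = 6
probs = np.array([[.5, .2, .1, .1, .05, .05],
                  [.4, .3, .1, .1, .05, .05],
                  [.05, .05, .1, .1, .3, .4],
                  [.05, .05, .1, .1, .2, .5]])
labels = np.array([0, 0, 1, 1])
reps = build_representatives(probs, labels, n_classes=2, k=2)
```

With k=2, each row of `reps` keeps only the two strongest keywords for its class, matching the zero-then-normalize construction of $w_i$ described above.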

Selection Strategy
After obtaining the averaged distribution with $V_\theta$, LPLM selects the highly relevant keyword(s) as the class representative for each class. To determine the class representatives, we introduce two selection strategies: Champion selection and Top-k selection.
Champion Selection The Champion selection strategy simply picks the one keyword with the highest probability in the averaged distribution as the class representative. In the next E-step, the class representatives are re-selected based on the distribution updated by the parameter-updated $V_\theta$.
Although the Champion selection strategy can effectively mine the hidden knowledge in the PLM, it does not work well when different classes have similar semantics. In this situation, LPLM is likely to select the same word as the representative of different classes, which further leads to an equal probability of predicting these class labels. This disturbs the model's ability to distinguish these classes due to the unique prediction process of prompt-based learning mentioned in Section 1, and eventually results in a limited improvement of the overall performance. Moreover, the Champion selection strategy only considers the semantics contained in one word. Intuitively, if multiple words are selected together, LPLM can extract more semantic information, which is more beneficial for the prediction.

Latent Variable for Each Instance
Similar to the traditional EM algorithm, LPLM assigns each instance a latent variable according to its corresponding class via an arrangement function F(·). Therefore, the instances from the same class share an identical latent variable, namely $w_{y_i}$. This voting system is consistent with the most basic prediction of the PLM: calculating the probability of each word filling the <MASK> slot. In other words, they play the same role in the internal process, and this is why $V_\theta$ can directly use $M_\theta$ as its backbone and share θ with the PLM $M_\theta$.

Distribution of Class Representatives
Every $w_i$ in W has a distribution over a probability space R. For the Champion selection strategy, R for each $w_i$ is the same as V, so the averaged vote distribution $\bar{p}_i$ of $V_\theta$ for class i can be directly treated as the distribution of the latent variable:

$$p(w_i = v) = \bar{p}_i(v), \quad v \in V.$$

Therefore, the distribution for W is

$$p(W) = \prod_{i=1}^{N} p(w_i).$$

Similarly, for the Top-k selection strategy, the size of R is determined by $\binom{|V|}{k}$, where k = 2, 3, . . ., |V|. The occurrence probability of each element in R can likewise be derived from the majority voting result and assigned to all instances in each class, and the distribution for W again factorizes as the product of the per-class distributions.

M-step Details
The target of the M-step is to make predictions based on the class representatives, namely the latent variables $w_i$. In this prediction phase, the above weights are applied to calculate the weighted average of the word-occurrence probabilities of the selected words, so as to obtain the probability of classifying the input sentence $x_i$ as class $z_i$:

$$p(z_i \mid W, x_i; \theta, T(\cdot)) \propto \sum_{v \in V} w_{z_i}(v)\, p(v \mid h_M), \tag{5}$$

where $h_M$ is the last layer's hidden state at the <MASK> slot after LPLM feeds the wrapped instance into $M_\theta$:

$$h_M = M_\theta(T(x_i)). \tag{6}$$
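A minimal sketch of this weighted-average prediction (made-up numbers; `class_probabilities` is a hypothetical helper, and the real M-step would additionally back-propagate a classification loss through $M_\theta$):

```python
import numpy as np

def class_probabilities(mask_probs, reps):
    """Weighted average of word-occurrence probabilities under each class
    representative: score[c] = sum_v reps[c, v] * p(v at <MASK>).
    As a matrix product this is reps @ mask_probs, then normalized."""
    scores = reps @ mask_probs          # one score per class
    return scores / scores.sum()

# toy: 2 class representatives over a 4-word vocabulary
reps = np.array([[0.7, 0.3, 0.0, 0.0],    # class 0 keywords and weights
                 [0.0, 0.0, 0.5, 0.5]])   # class 1 keywords and weights
mask_probs = np.array([0.5, 0.2, 0.2, 0.1])  # hypothetical PLM output
p = class_probabilities(mask_probs, reps)
print(p.argmax())  # 0
```

Because the representatives are normalized weight vectors over the vocabulary, the prediction reduces to a single matrix-vector product over the <MASK> distribution.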

Learning with the EM-style algorithm
With the introduced latent variable W, the objective function is extended as

$$\sum_{i} \log \sum_{W} p(z_i \mid W, x_i, y_i, X, Y; \theta, T(\cdot))\, p(W \mid x_i, y_i, X, Y; \theta, T(\cdot)), \tag{7}$$

where $p(z_i \mid W, x_i, y_i, X, Y; \theta, T(\cdot))$ represents the probability that the predicted $z_i$ matches the ground-truth label $y_i$ of the input instance $x_i$ with the help of W, while $p(W \mid x_i, y_i, X, Y; \theta, T(\cdot))$ represents the probability of choosing W as the class representatives.
Next, the Q-function $Q_i(W; \theta)$ is introduced to represent the posterior probability of the latent variable W after θ is updated based on the prediction $z_i$ of the previous M-step. Equation 7 is then lower-bounded via Jensen's inequality, yielding Equation 8. So far, the original objective has been transformed into raising the lower bound of the inequality in Equation 8. Next, we elaborate on the E-step and M-step in detail.
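The Jensen's-inequality step behind the lower bound in Equation 8 can be reconstructed as follows (our reconstruction of the standard EM derivation; the original notation may differ slightly):

```latex
\log \sum_{W} p(z_i, W \mid x_i, y_i, X, Y; \theta, T(\cdot))
  = \log \sum_{W} Q_i(W;\theta)\,
      \frac{p(z_i, W \mid x_i, y_i, X, Y; \theta, T(\cdot))}{Q_i(W;\theta)}
  \ge \sum_{W} Q_i(W;\theta)\,
      \log \frac{p(z_i, W \mid x_i, y_i, X, Y; \theta, T(\cdot))}{Q_i(W;\theta)}
```

Equality holds when $Q_i(W;\theta)$ equals the true posterior of W, which is why the E-step sets the Q-function to the posterior before the M-step maximizes the bound.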
E-step: In the E-step, based on the updated parameter θ, the posterior probability distribution of W is updated. We follow previous work (Chen et al., 2019) for the calculation and expand the Q-function as

$$Q_i(W; \theta) = p(W \mid z_i, x_i, y_i, X, Y; \theta, T(\cdot)),$$

i.e., the posterior of W under the current parameters. M-step: Given the updated latent variables and the Q-function computed in the E-step, the parameter θ is then optimized in the M-step to maximize the lower bound of the inequality in Equation 8.

Experiments
We conduct a series of experiments in few-shot scenarios to demonstrate the effectiveness of LPLM. In this section, we first introduce the experimental setups, then present and discuss the experimental results.

Datasets and Implementation Details
We select three well-known classification datasets: SST-2 for 2-way sentiment classification tasks, AG's News for 4-way topic classification tasks, and Few-NERD for 66-way entity typing tasks. The training, validation, and test sets of each dataset are non-overlapping to ensure basic fairness.
In addition, we keep the size of the validation sets consistent with the training sets to further ensure fairness in true few-shot settings (Perez et al., 2021), as shown in Table 1.
Experiments are conducted with K=1/2/4/8/16 for K-shot learning tasks. For the evaluation metric, we report average accuracy over three randomly picked seeds. To ensure fairness, we use a fixed template for each task to highlight the performance of the different label mapping methods. For the same reason, we uniformly use RoBERTa-large as the pre-trained model for all tasks with a fixed learning rate lr = 3e-5. The EM-style algorithm is performed over a total of 10 fine-tuning epochs.

Baselines
As introduced, we mainly compare LPLM with the four mainstream label mapping methods. Among them, MLM exploits prior human knowledge; we highlight results that rely on human knowledge in italics, e.g., the Manual method in Table 2. For more convincing results, we implement the four label mapping methods with OpenPrompt (Ding et al., 2022) using the PyTorch framework. For the PLM, we use the interface provided by HuggingFace (Wolf et al., 2020), optimized with the AdamW optimizer (Loshchilov and Hutter, 2019).

Template Setting
The core of prompt-based learning is to fine-tune the PLM using templates and label mappings to find the best class label for each instance. For all compared label mapping models in our experiments, we use the identically initialized PLM with the same template for each dataset. In this setting, the improvements from different label mapping methods can be clearly observed. The templates are selected from the OpenPrompt platform (ACL 2022 Best Demo), as shown in Table 3.

Value Selection
To address the shortcomings of the Champion selection strategy mentioned in Section 3.3.2, we use the Top-k selection strategy in our experiments. To explore the best value of k, we conduct a series of experiments with different values of k on all datasets for K=1/2/4/8/16. As shown in Figure 3, as k grows, the classification performance increases, peaks when k is around 50 to 500, then gradually decreases. To explain this, first note that we expect to mine more of the semantics carried in the majority voting results of the instances. For instance, a class representative containing four words carries about twice as much semantic information as a two-word one. However, a larger k is not always better: when k exceeds a certain threshold, many identical words appear in the class representatives of different classes, which blurs the distinction between them. This tendency may be related to the interpretability of the PLM, but it is not the focus of this paper. Therefore, we use the values that show the best performance as a reference for k.
These experimental results show that our proposed LPLM successfully finds class representatives that are rich in semantic information for the corresponding classes. The discriminativeness of the class representatives generated by LPLM not only outperforms other ALM methods but also MLM, which relies on human knowledge. Intuitively, assigning one word to each class is the most common approach in MLM. However, in practice, class representatives can also be obtained by combining multiple words.
In order to eliminate the effect of different numbers of considered words in MLM, we further carry out experiments on Few-NERD and compare the performance of MLM and LPLM under the same setting, with both considering multiple words in each class representative. As shown in Table 4, MLM(Multi) only has the advantage when K=1 and is surpassed by LPLM when K is larger than 1. This shows that as long as there are two or more instances per class, LPLM can extract more semantic information than the human-specified way. Notably, for MLM(Multi), we manually design combinations of one to five words for each class representative, while for LPLM, each class representative considers the semantics of 500 concrete words. The fairness of the comparison is further discussed in Section 5.4.

Analysis
In this section, we further analyze the details of LPLM.For convenience, the following experiments are performed on AG's News dataset.

LPLM vs. Search-based Label Mapping
While the voting process in LPLM may seem similar to the process of finding a subset of candidate words in search-based label mapping, the core motivations of the two mappings differ greatly.
Search-based label mapping contains two separate parts: one optimizes the class representatives from random initialization, while the other uses them to fine-tune the PLM.
Yet, LPLM operates as an iterative process. The initial class representatives in the E-step are obtained by the PLM with θ. After the class representatives help the PLM optimization in the M-step, in the next round of the E-step, since θ has already been updated, the new class representatives become more distinctive and more accurate. In summary, even though both methods aim to optimize θ, search-based label mapping has to be a blocking process $[\text{Search}]^n \to \theta$, while LPLM can be abstracted into a smooth evolutionary chain, e.g., $[\text{LPLM} \to \theta]^n$.

Ablation Study
To show the effectiveness of the Top-k selection strategy, we conduct an ablation study. In this subsection, three model settings are compared: the original LPLM, LPLM without the Top-k strategy, and LPLM without both Top-k and the EM-style algorithm. For the model without Top-k, the Champion selection strategy is used instead, denoted LPLM−Topk. For LPLM without the EM-style algorithm, it is in fact difficult to cleanly exclude the EM-style algorithm, because the E-step and the M-step together form a tightly coupled process in the model. Though, as discussed in Section 5.1, it may not be a perfect substitute, search-based label mapping can be used as an alternative. This model is referred to as LPLM−Topk−EM, where the Top-k strategy is also not applied.
The experimental results of the ablation study are summarized in Table 5. It can be observed that when K is relatively small, the Top-k selection strategy greatly improves performance, and when K is larger, the EM-style algorithm becomes more helpful for performance improvement.

Evolutionary Process of the EM-style algorithm
To further demonstrate the effectiveness of the EM-style algorithm, we show the evolution of class representatives during optimization. As shown in Table 6, at the beginning, all initialized class representatives are noisy because they contain the same word 'news'. After three rounds of the E-step and M-step process, the dissimilarity between class representatives of different classes gradually increases. After all rounds are performed, each class representative further enhances its distinctiveness and carries more prominent semantics. This demonstrates that our EM-style algorithm can effectively find a distinct class representative with unique semantics for each class.

Fairness of Top-k and Breakthrough Point of Label Mapping
The class representative of each class in MLM(Multi) is selected by combining one to five manually-selected words. In LPLM, the number of selected words for each class representative can reach 500 or more. This may raise concerns that the comparison between MLM and LPLM is unfair. However, the ultimate goal of ALM is to improve the overall few-shot classification performance by making the model automatically generate N class representatives for N classes without using human prior knowledge. Therefore, the focus should lie on how differentiated the class representatives are and how much semantic information they can contain. MLM gives the N class representatives strong semantics from the beginning by introducing human prior information, while LPLM extracts the discriminative semantic information of each class by selecting the Top-k words and integrating them into each class representative. The SOTA ALM (Cui et al., 2022) also incorporates the semantics of all words in the vocabulary into its 'class prototypes' when calculating class representatives. Therefore, the core of label mapping research is essentially to find a semantically rich class representative for each class. In conclusion, compressing more words with different semantics into one class representative is of great significance, especially on few-shot tasks. Moreover, while human ability is limited, LPLM has another advantage: it effectively summarizes a large number of words. That is, the semantic information of hundreds of words can be automatically integrated into the class representatives by the Top-k strategy, which is beyond the reach of humans. As shown in Figure 4, for all settings of K, LPLM always has a range of k values over which it outperforms MLM. Therefore, in practice, it is not necessary to traverse all possible k values. Instead, simply choosing a value of k between 50 and 1000 is sufficient to obtain better performance than MLM.

Conclusion
In this paper, we propose a novel automatic label mapping model that removes the reliance on human knowledge found in manual label mapping methods by automatically generating a set of keywords as each class representative. To increase the distinction between class representatives, we further introduce an EM-style algorithm that optimizes the distributions of the latent variables, namely the class representatives, to discover better class representatives and improve the overall classification performance.

Limitations
Although LPLM achieves remarkable results in our experiments on 2-way, 4-way, and 66-way datasets without using prior human knowledge or additional parameters, it may not be the most appropriate model for datasets where the representatives have a large probability space R or for tasks where N*K is large. According to Equation 10, we need to compute $Q_i$ for each instance, which requires O(N * K) computations, and computing each $Q_i$ needs to sum over all possible W, with a computation complexity of $O(\binom{|V|}{k}^N)$, resulting in a total of $O(N \cdot K \cdot \binom{|V|}{k}^N)$ computations in the E-step. This property of the EM-style algorithm may degrade the efficiency of LPLM. In our implementation, we simplify the calculation by incorporating the distribution update process of the latent variables into $p_{M_\theta}(W \mid x_i, y_i, X, Y; \theta, T(\cdot))$ and adopting an approximation of the Q-function that assigns value 1 to the W consistent with the voting result and 0 to all other W, so that the computation complexity is reduced to O(N * K).
In brief, how to maintain efficiency with a large representative selection space or a large number of instances is an interesting direction for future work.

Figure 1: The evolutionary process of the selected class representative and the improvement of the overall performance by adopting the EM-style algorithm.

Figure 2: Pipeline of prompt-based learning with LPLM based on the EM-style algorithm. 'repre' denotes the selected class representatives.
Top-k Selection We propose the Top-k selection strategy to solve the potential issue of the Champion selection strategy. The Top-k strategy selects the set of keywords with the Top-k highest probabilities as the class representative (k = 2, 3, . . ., |V|). After setting the unselected dimensions to 0 and normalizing the averaged distribution, LPLM obtains the class representative $w_i$ for each class.

Figure 3: The fluctuation of performance when choosing different k values for the Top-k selection strategy on AG's News (left) and Few-NERD (right).

Figure 4: Comparison between MLM and LPLM with different values of k in Top-k on AG's News, where the solid lines represent the performance of LPLM and the dashed lines represent the performance of MLM.

Table 1: Size of the datasets used in the classification experiments.

Table 2: The overall performance. Manual, Soft, Search, and Proto represent the four baselines: Manual Label Mapping, Soft Label Mapping, Search-based Label Mapping, and Prototypical Label Mapping, respectively.

Table 3: The selected templates for each dataset.

Table 4: Comparison with MLM, which artificially selects one or multiple words for each of the 66 classes.

Table 6: Evolutionary process based on the EM-style algorithm for the K=16 task on AG's News. The Top-2 words are shown.