DynaMaR: Dynamic Prompt with Mask Token Representation

Recent research has shown that large language models pretrained using unsupervised approaches can achieve significant performance improvement on many downstream tasks. Typically when adapting these language models to downstream tasks, like a classification or regression task, we employ a fine-tuning paradigm in which the sentence representation from the language model is input to a task-specific head; the model is then fine-tuned end-to-end. However, with the emergence of models like GPT-3, prompt-based fine-tuning has been proven to be a successful approach for few-shot tasks. Inspired by this work, we study discrete prompt technologies in practice. There are two issues that arise with the standard prompt approach. First, it can overfit on the prompt template. Second, it requires manual effort to formulate the downstream task as a language model problem. In this paper, we propose an improvement to prompt-based fine-tuning that addresses these two issues. We refer to our approach as DynaMaR -- Dynamic Prompt with Mask Token Representation. Results show that DynaMaR can achieve an average improvement of 10% in few-shot settings and improvement of 3.7% in data-rich settings over the standard fine-tuning approach on four e-commerce applications.


Introduction
Unsupervised pre-trained Language Models (LMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have achieved state-of-the-art performance on many natural language understanding tasks. In general, these models are fine-tuned for different tasks through the addition of a task-specific head on top of the [CLS] token representation (Scao and Rush, 2021).
An alternative method for applying LMs to downstream tasks is through discrete prompts. A discrete prompt is an additional text phrase inserted along with the original input text that encapsulates the task of interest. By adding the prompt, we convert the downstream task into a masked language modeling (MLM) problem. For example, to classify the sentiment of a movie review, "I hate this movie.", we can append a prompt to the input to get "I hate this movie. It was [MASK]". The pre-trained language model is thus prompted to identify the sentiment of the input statement and classify the [MASK] token as "terrible" instead of "great" (Liu et al., 2021). In this paper, we call a function that includes a prompt and its position information a prompt template.
Prompt-based approaches have shown success in low-data regimes (Petroni et al., 2019; Schick and Schütze, 2021; Jiang et al., 2020; Gao et al., 2021; Lester et al., 2021). Prompt-based fine-tuning is beneficial in few-shot learning because it provides extra task information to the model through the prompt text (Schick and Schütze, 2021). However, when we explore this technique in practice, two issues arise. First, the trained model can overfit on words or phrases within the prompt and on the position of the [MASK] token in the prompt (Zhong et al., 2021). For example, in movie review sentiment analysis, when we append the prompt, "Does the user like the movie? [MASK]", to a negative review, "This is a bad movie.", the trained model is inclined to predict the positive class, because the word "like" frequently appears in positive reviews and the masked language model places greater attention on the words and phrases that are closer to the mask token, as shown in Figure 1. We call this issue prompt-related overfitting in this work.
We tackle prompt-related overfitting by introducing a dynamic prompt approach. In this approach, we create a prompt pool consisting of multiple prompt templates. To construct this pool, we generate a set of prompt candidates and filter them by a score we propose, called the pairwise prompt dissimilarity score (detailed in Section 3). We then introduce the dynamic component of the algorithm by randomly selecting a prompt template from the pool and applying it to the input at each training step. For example, in the movie review sentiment analysis task, the trained model will randomly see either "does the user like the movie? [MASK]" or "does the user dislike the movie? [MASK]" appended to the original input. This prevents the model from overfitting on spurious correlations between words in the prompt and the class label.
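The dynamic selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the `PromptTemplate` class, the "prefix"/"suffix" position encoding, and the `dynamic_prompt_batches` helper are hypothetical names, and only the two example prompts come from the text.

```python
import random
from dataclasses import dataclass

MASK = "[MASK]"


@dataclass(frozen=True)
class PromptTemplate:
    """A prompt plus its position information (hypothetical structure)."""
    prompt: str
    position: str  # assumed encoding: "prefix" or "suffix"

    def apply(self, text: str) -> str:
        # Prepend or append the prompt to the original input.
        if self.position == "prefix":
            return f"{self.prompt} {text}"
        return f"{text} {self.prompt}"


# The two example prompts from the text, both appended to the input.
PROMPT_POOL = [
    PromptTemplate(f"Does the user like the movie? {MASK}", "suffix"),
    PromptTemplate(f"Does the user dislike the movie? {MASK}", "suffix"),
]


def dynamic_prompt_batches(batches, pool, seed=0):
    """Draw one template at random per training step and apply it to the batch."""
    rng = random.Random(seed)
    for batch in batches:
        template = rng.choice(pool)
        yield [template.apply(text) for text in batch]


batches = [["This is a bad movie.", "I love the movie."]] * 3
prompted = list(dynamic_prompt_batches(batches, PROMPT_POOL))
for step, batch in enumerate(prompted):
    print(step, batch)
```

Drawing the template once per step (rather than once per example) follows the wording "for each training step"; sampling per example would be an equally simple variation.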
Second, the standard prompt-based fine-tuning setup can be inefficient. It requires significant input and answer engineering to reformulate the downstream tasks as MLM problems (Liu et al., 2021). This process is time-consuming, especially for tasks with large numbers of classes. Another disadvantage of the standard setup is that it cannot be directly applied to regression problems, as they cannot be easily converted to MLM problems. To simplify this process, we fine-tune the model by feeding the mask token representation into a task-specific classifier/predictor head instead of the pre-trained MLM head, which avoids the answer engineering process, as shown in Figure 2. We refer to our prompt-based approach with these two improvements as Dynamic Prompt with Mask Token Representation (DynaMaR). We apply DynaMaR to both few-shot and data-rich settings and, for the first time, show gains across four tasks not only in few-shot settings but also in data-rich settings.
Our contributions include: (1) proposing DynaMaR, which can be applied without reformulating downstream tasks into language problems and is robust to prompt-related overfitting, (2) showing DynaMaR can achieve improvements in both few-shot and data-rich settings, (3) proposing a prompt dissimilarity score to evaluate the degree of dissimilarity between two prompt templates and to help construct a diverse dynamic prompt pool, (4) demonstrating that a larger dynamic prompt pool achieves better performance on downstream tasks.

Related Work
Our work can be divided into three components: language model fine-tuning, prompt generation, and the design of the prompt template.
Language Model Fine-tuning is the main focus of our work. Recently, a large amount of research has focused on improved language model fine-tuning methods (Howard and Ruder, 2018; Dodge et al., 2020; Lee et al., 2020; Zhang et al., 2021). These works mainly focus on optimization and regularization techniques to stabilize fine-tuning. In contrast to these works, Gao et al. (2021) describe the concept of prompt-based fine-tuning for language models. We adapt and simplify the core ideas from this work to create a simple yet efficient prompt-based fine-tuning approach.
Prompt Generation is a key process in prompt-based fine-tuning. The choice of prompt significantly influences performance. The most natural way to generate prompts is through manual design. Petroni et al. (2019) employ manually generated prompts with ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) models. They evaluate on the LAMA (LAnguage Model Analysis) benchmark (Bordes et al., 2013; Nickel et al., 2016) without fine-tuning and conclude that the model is able to recall knowledge learned from the pre-training tasks. While manually crafting prompts is intuitive, creating effective prompts through manual effort requires time, experience, and expertise. To address this issue, a number of automatic prompt searching methods have been proposed. For example, Jiang et al. (2020) propose a data mining-based method that searches for a prompt based on the shortest path between the original inputs and answers. They also propose paraphrasing-based methods that take a seed prompt and paraphrase it into several semantically similar expressions. Gao et al. (2021) treat prompt generation as a text generation task and utilize T5, a sequence-to-sequence pretrained model, in the template search process. They generate templates by specifying the position to insert a prompt template and then inputting samples into T5 to decode the templates. These automatic approaches achieve performance comparable to that of manually designed prompts. In addition, Logan IV et al. (2021) propose the null prompt method. Instead of generating prompts, they concatenate a [MASK] token with the original inputs, which performs competitively with manually designed prompts. In our experiments, we utilize the prompt generation methods to create candidates for the dynamic prompt pool, while also including the null prompt approach as one of the baselines.
Prompt Template Design Factors are the factors that we take into consideration to create a metric that informs how prompts are selected for the dynamic prompt pool. Numerous previous works analyze prompt template design factors and the impact of prompt design on performance. Liu et al. (2021) summarize the factors that influence the application of prompt-related technologies in language models. Logan IV et al. (2021) note that the order in which the original input and the [MASK] token are concatenated is an important consideration. Zhong et al. (2021) propose to unify the prompts into a question-answering format. These previous works indicate that prompt construction impacts performance. To this end, we hypothesize that diversity in the set of prompt templates is an important factor in the performance of the model and propose a prompt dissimilarity score for measuring diversity.

Our Method: DynaMaR
In this section, we describe the details of our approach, DynaMaR. Before explaining the training process, we define two concepts: the dynamic prompt pool and the inference prompt.
Dynamic Prompt Pool is a pool of prompt templates from which a prompt template will be randomly selected and applied to the input during training.
Inference Prompt is the prompt template used during inference. It is selected from the set of templates in the dynamic prompt pool. In general, it is the prompt template among those in the dynamic prompt pool that can achieve the highest performance in a fixed prompt setting.
We generate the candidates for the dynamic prompt pool and inference prompt through manual generation and the paraphrasing-based methods proposed by Jiang et al. (2020). However, we do not include all candidates in the dynamic prompt pool. We want to ensure the prompts within a pool are sufficiently diverse so that the model will not overfit on any of them. Therefore, we introduce a prompt dissimilarity score to measure the level of dissimilarity between these candidates. We consider three factors in developing this metric: (1) prompt position, or whether to append or prepend the prompt to the input or even insert it into the middle of pairwise inputs, (2) prompt wording, or the prompt word selection, and (3) prompt format, or whether to create prompts in statement format or in the question-answering format proposed by Zhong et al. (2021). To define the prompt dissimilarity score, we first introduce the normalized Hamming distance and the normalized Levenshtein distance.
Normalized Hamming Distance is equal to the number of differing bits between two binary representations divided by the length of the binary representations (Norouzi et al., 2012). Let $HD(b_i, b_j)$ be the Hamming distance between binary representations $b_i$ and $b_j$ of equal length $K$. The normalized Hamming distance $NHD(b_i, b_j)$ is then:

$$NHD(b_i, b_j) = \frac{HD(b_i, b_j)}{K}$$

Normalized Levenshtein Distance is equal to the minimum number of operations (substitution, insertion, and deletion) required to transform a given string into another string, divided by the length of the longer string, and is calculated in a recursive fashion (Yujian and Bo, 2007). Let $LD(s_i, s_j)$ be the Levenshtein distance between strings $s_i$ and $s_j$. Let $|s_i|$ and $|s_j|$ be the lengths of prompt strings $s_i$ and $s_j$, respectively. Let $t(x)$ be a function that returns all but the first character of string $x$, and let $x[0]$ denote the first character of $x$. The normalized Levenshtein distance $NLD(s_i, s_j)$ is then:

$$NLD(s_i, s_j) = \frac{LD(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

$$LD(s_i, s_j) = \begin{cases} |s_i|, & \text{if } |s_j| = 0, \\ |s_j|, & \text{if } |s_i| = 0, \\ LD(t(s_i), t(s_j)), & \text{if } s_i[0] = s_j[0], \\ 1 + \min\{LD(t(s_i), s_j),\ LD(s_i, t(s_j)),\ LD(t(s_i), t(s_j))\}, & \text{otherwise.} \end{cases} \tag{3}$$

Suppose we generate $N$ prompt templates. Let $p_i$ and $p_j$ be two prompt templates with prompt strings $s_i$ and $s_j$, respectively, where $i \neq j$ and $i, j \in \{1, 2, \ldots, N\}$. We treat the prompt position and format information as categorical variables and convert them into binary representations $b_i$, $b_j$. Let $PDS(p_i, p_j)$ denote the prompt dissimilarity score between prompt templates $p_i$ and $p_j$. The prompt dissimilarity score is defined in Equation (5) in terms of $NHD(b_i, b_j)$ and $NLD(s_i, s_j)$.

In our experiments, we use 0.5 as the pairwise prompt dissimilarity score threshold. We add to the dynamic prompt pool the prompt templates whose prompt dissimilarity scores with respect to the other templates are larger than the threshold. During the training process, we randomly pick one prompt template from the pool for each training step and apply it to the original input. We treat the mask token representation from the modified input as the sentence embedding and train the model by directly feeding it into a task-specific predictor head.
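As a concrete reading of the two distances defined in this section, the sketch below computes them in Python. It is illustrative only: the function names are hypothetical, and the Levenshtein distance is computed with the standard iterative dynamic program, which returns the same value as the recursive definition but avoids its exponential cost.

```python
def normalized_hamming(b_i: str, b_j: str) -> float:
    """Fraction of differing positions between two equal-length bit strings."""
    if len(b_i) != len(b_j):
        raise ValueError("binary representations must have equal length K")
    if not b_i:
        return 0.0
    return sum(x != y for x, y in zip(b_i, b_j)) / len(b_i)


def levenshtein(s_i: str, s_j: str) -> int:
    """Minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(s_j) + 1))  # distances from the empty prefix of s_i
    for row, a in enumerate(s_i, start=1):
        curr = [row]
        for col, b in enumerate(s_j, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[col] + 1,          # deletion
                            curr[col - 1] + 1,      # insertion
                            prev[col - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(s_i: str, s_j: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    longest = max(len(s_i), len(s_j))
    return levenshtein(s_i, s_j) / longest if longest else 0.0


a = "does the user like the movie?"
b = "does the user dislike the movie?"
print(levenshtein(a, b), normalized_levenshtein(a, b))
print(normalized_hamming("1011", "1001"))
```

The two example prompts differ only by the inserted "dis", so their wording alone is nearly identical under the normalized Levenshtein distance; this is why position and format information also enter the score.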

Data
In this experiment, we use four proprietary e-commerce datasets: (1) Variation Elimination (VE), (2) Music Match (MM), (3) Music Genre (MG), and (4) Price Prediction (PP). VE is a binary classification problem with pairwise-document inputs where the label identifies whether two items are variations of the same product or not. For example, similar shirts (from the same producer and brand) in different sizes or colors are considered to be variations. MM is a binary classification problem with pairwise-document inputs that identifies whether two music tracks from different sources are the same or not. MG is a 303-way classification problem with single-document inputs that classifies music tracks into genres. PP is a regression problem with single-document inputs that aims to estimate the sales price based on the product information. It should be noted that the percentages of inputs with more than 512 tokens in VE, MM, MG, and PP are 90%, 75%, 82%, and 1%, respectively.
For each task, we split the dataset into three parts: (1) train, (2) validation, and (3) test. We use the full training dataset for the data-rich settings. We also sample multiple few-shot training datasets for the few-shot learning settings. In few-shot learning, each classification dataset contains roughly 20 samples for each class. For the regression task (i.e., PP), we randomly sample 1% of the full training dataset as a few-shot training dataset.
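The few-shot sampling described above might look like the following sketch. The helper names, the `(text, label)` tuple layout, and the use of a fixed seed per sampled dataset are assumptions for illustration; the source only states the target sizes (roughly 20 samples per class, or 1% of the training data for regression).

```python
import random
from collections import defaultdict


def sample_few_shot_classification(examples, per_class=20, seed=0):
    """Sample up to `per_class` (text, label) pairs for every label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    subset = []
    for items in by_label.values():
        subset.extend(rng.sample(items, min(per_class, len(items))))
    return subset


def sample_few_shot_regression(examples, fraction=0.01, seed=0):
    """Sample a fixed fraction of the training examples (at least one)."""
    rng = random.Random(seed)
    k = max(1, round(len(examples) * fraction))
    return rng.sample(examples, k)


# Toy data standing in for the proprietary datasets.
clf_train = [(f"item {i}", i % 2) for i in range(200)]
reg_train = [(f"product {i}", float(i)) for i in range(1000)]

clf_few = sample_few_shot_classification(clf_train)
reg_few = sample_few_shot_regression(reg_train)
print(len(clf_few), len(reg_few))
```

Varying `seed` yields the multiple few-shot datasets over which results are averaged.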

Model and Tokenizer Setup
For training the tokenizer, we collect an English product catalog dataset with text features including title, description, and detail bullet points. We train a 32K BPE vocabulary on this dataset using the SentencePiece library (Kudo and Richardson, 2018).
We create a 500M parameter transformer encoder-only model, with 38 hidden layers, 1024 embedding size, 16 attention heads, and maximum sequence length of 512. We train the model using the LANS optimizer (Zheng et al., 2020) with a batch size of 8192 and a learning rate of $10^{-4}$ on the product catalog dataset.
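As a rough sanity check on the stated model size, the arithmetic below estimates the parameter count from the listed hyperparameters. Several values are assumptions not given in the text: a feed-forward inner size of 4 times the hidden size, a vocabulary of exactly 32,000 entries, learned position embeddings, and biases and layer-norm weights being ignored.

```python
HIDDEN = 1024        # embedding size (from the text)
LAYERS = 38          # hidden layers (from the text)
MAX_LEN = 512        # maximum sequence length (from the text)
VOCAB = 32_000       # assumed exact size of the "32K" vocabulary
FFN = 4 * HIDDEN     # assumed feed-forward inner size

attention = 4 * HIDDEN * HIDDEN   # query, key, value, and output projections
feed_forward = 2 * HIDDEN * FFN   # up and down projections
per_layer = attention + feed_forward

token_embeddings = VOCAB * HIDDEN
position_embeddings = MAX_LEN * HIDDEN

total = LAYERS * per_layer + token_embeddings + position_embeddings
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the estimate lands at roughly 0.51 billion weights, which is consistent with the "500M parameter" description.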

Prompt Generation and Selection
To create the dynamic prompt pool for our tasks, we first generate 20 prompt templates for each task and select 5 of them using the prompt dissimilarity score. Specifically, for each task, we first manually design 10 prompt templates. By treating prompt template generation as a paraphrase generation task (Jiang et al., 2020), we use these 10 prompt templates as seeds to generate another 10 templates per task by leveraging the public T5 paraphrase generation model from Hugging Face. Afterwards, we use the prompt dissimilarity score to select 5 prompt templates out of the 20 based on the method discussed at the end of Section 3. The selected prompt templates are used as each task's dynamic prompt pool. For inference, we evaluate each template in the dynamic prompt pool through the evaluation process discussed in Section 4.5, and select the prompt template that produces the best performance on each task. Table 5 shows the dynamic prompt templates as well as the inference prompt selected for each task.
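One way to picture the selection step is a greedy filter that keeps a candidate only if it is sufficiently dissimilar from every template already kept. The sketch below is an assumption about the procedure, not the authors' code: the greedy order, the `select_pool` helper, and the candidate prompts are hypothetical, and the scoring function is a stand-in based on Python's `difflib`, not the prompt dissimilarity score from Section 3.

```python
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_dissimilarity(p: str, q: str) -> float:
    """Stand-in score in [0, 1]; NOT the paper's prompt dissimilarity score."""
    return 1.0 - SequenceMatcher(None, p, q).ratio()


def select_pool(candidates, score, threshold=0.5, pool_size=5):
    """Greedily keep candidates whose score against every kept one exceeds the threshold."""
    pool = []
    for candidate in candidates:
        if all(score(candidate, kept) > threshold for kept in pool):
            pool.append(candidate)
        if len(pool) == pool_size:
            break
    return pool


# Hypothetical candidates for a sentiment task.
candidates = [
    "Does the user like the movie? [MASK]",
    "Does the user dislike the movie? [MASK]",
    "[MASK] is how the reviewer feels.",
    "Overall sentiment: [MASK]",
    "It was [MASK]",
]
pool = select_pool(candidates, stand_in_dissimilarity)
print(pool)
```

Passing the scoring function as a parameter keeps the filter independent of how dissimilarity is measured, so the actual score can be substituted for the stand-in.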

Fine-tuning (Ft) Methods
We compare DynaMaR with the following approaches:
• Promptless Fine-tuning - CLS (PFt-CLS) is our baseline approach where we fine-tune the model by feeding the [CLS] token representation into a predictor head.
• Promptless Fine-tuning - Average Pooling (PFt-Avg) fine-tunes the model by using the average of the sequence output for prediction.
• Null Prompt - Prefix (NP-Prefix) prepends the [MASK] token to the original inputs and fine-tunes the model by feeding the [MASK] token representation into a predictor head. This approach avoids the issue where the model overfits the prompt template since it does not require a template.
• Null Prompt - Suffix (NP-Suffix) is the same as the above approach except that the [MASK] token is appended to the inputs instead of being prepended.
• Fixed Prompt with Mask Token Representation (FiTeR) utilizes a static prompt template in both the training and inference stages and fine-tunes the model by feeding the [MASK] token representation into a predictor head.
Note that we use a task-specific predictor head in combination with all of the above approaches, including the prompt-based fine-tuning methods, which typically use the pre-trained MLM head for prediction. The reason is that we have a regression task as one of our evaluation datasets, and as already discussed in Section 1, it is not straightforward to convert regression tasks into MLM tasks.
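The approaches above differ mainly in which vector from the encoder output is passed to the predictor head. The sketch below shows the three choices on a toy sequence using plain Python lists; the function names and the hidden-state values are made up for illustration, and a real implementation would operate on the encoder's output tensors.

```python
MASK = "[MASK]"


def cls_representation(tokens, hidden):
    """PFt-CLS: use the vector at the first ([CLS]) position."""
    return hidden[0]


def average_representation(tokens, hidden):
    """PFt-Avg: average the vectors over all sequence positions."""
    dim = len(hidden[0])
    return [sum(vec[d] for vec in hidden) / len(hidden) for d in range(dim)]


def mask_representation(tokens, hidden):
    """NP-*, FiTeR, DynaMaR: use the vector at the [MASK] position."""
    return hidden[tokens.index(MASK)]


# Toy sequence with one 2-dimensional hidden vector per token.
tokens = ["[CLS]", "great", "movie", MASK]
hidden = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0], [5.0, 4.0]]

print(cls_representation(tokens, hidden))
print(average_representation(tokens, hidden))
print(mask_representation(tokens, hidden))
```

In every case the selected vector feeds the same kind of task-specific head, which is what allows classification and regression tasks to share one setup.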

Model Training and Evaluation Setup
As mentioned in Section 1, we measure the performance in both few-shot and data-rich settings. For both VE and MM, we use Area Under the Precision-Recall Curve (PRAUC) as the evaluation metric. For MG, we use classification accuracy as the evaluation metric. For PP, we use Root Mean Square Error (RMSE) as the evaluation metric. We validate the performance every 2 training steps in the few-shot settings and every 100 steps in the data-rich settings. We use early stopping with a patience of 3 validation steps to select the best model for each task. We then evaluate the best models on the test datasets. For few-shot learning, we report the average performance across multiple few-shot datasets per task to reduce the variation in performance. In Table 1 and Table 2, we calculate and report the improvement percentage, which is the ratio of positive change relative to the PFt-CLS baseline performance.
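The early-stopping rule and the improvement percentage can be made concrete with a short sketch. Both helpers are hypothetical, and the treatment of RMSE is an assumption: a decrease relative to the baseline is counted as a positive change, since lower RMSE is better.

```python
def improvement_pct(baseline, value, higher_is_better=True):
    """Positive change relative to the baseline, as a percentage."""
    change = (value - baseline) if higher_is_better else (baseline - value)
    return 100.0 * change / baseline


def best_validation_index(scores, patience=3, higher_is_better=True):
    """Index of the best validation score, stopping after `patience` misses in a row."""
    best, best_idx, misses = None, None, 0
    for idx, score in enumerate(scores):
        improved = best is None or (
            score > best if higher_is_better else score < best
        )
        if improved:
            best, best_idx, misses = score, idx, 0
        else:
            misses += 1
            if misses >= patience:
                break  # later validation steps are never evaluated
    return best_idx


# Validation PRAUC per validation step (made-up numbers).
val_scores = [0.50, 0.60, 0.58, 0.59, 0.57, 0.90]
print(best_validation_index(val_scores))
print(improvement_pct(2.0, 1.5, higher_is_better=False))
```

With a patience of 3, training stops after the third consecutive validation step without improvement, so the final score in the made-up list is never reached.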

Results
Tables 1 and 2 show the performance results for the few-shot and data-rich settings, respectively. In both settings, PFt-Avg shows degradation in average performance compared to PFt-CLS. This shows that average pooling generates worse sentence representations than taking the [CLS] token representation.
In contrast, both null prompt approaches show improvement in average performance compared to PFt-CLS in both few-shot and data-rich settings. The improvement could be a result of aligning the format of the downstream tasks with that of the pre-training task. By changing the input format to be similar to that of the MLM task, we reduce the amount of data required to teach the model the new task.
Also, there is a difference in the performance of NP-Suffix and NP-Prefix. This is likely due to the positional differences of the [MASK] token in the two methods. For example, suppose we want to perform sentiment analysis on a sentence like "I love the movie". Prepending or appending the [MASK] token would result in different distances between [MASK] and the word "love", which holds the key information for classification. Such positional differences could lead to different performance even though the two methods are very similar in spirit.
Another observation is that FiTeR shows higher improvement in average performance compared to the null prompt approaches. Recall that FiTeR introduces task information through the prompt templates, while the null prompt approaches do not and thereby supposedly avoid the issue where the model overfits the prompt templates. Hence, the results show that the benefits of adding the extra task information outweigh the possible performance loss caused by the prompt-related overfitting issue.
Finally, DynaMaR outperforms FiTeR on all tasks in both settings, with the only exception being MG in the data-rich setting. This indicates that increasing the diversity of prompt templates used during training improves model generalization. We also observe that DynaMaR does not show significant improvement over PFt-CLS on either MG or VE. This is because both tasks contain a large number of documents longer than 512 tokens, as mentioned in Section 4.1. As a result, we need to truncate more of the original inputs for these tasks in order to insert prompts, which can lead to information loss. Thus, DynaMaR is less effective on problems with long documents.

Analysis
Larger dynamic prompt pool, better performance. The size of the dynamic prompt pool influences the performance. We compare the average improvement percentage across four tasks with dynamic prompt pool sizes of 1, 3, and 5 (prompt information can be found in Appendix A). From Figure 3, we can see that performance improves as the dynamic prompt pool is made larger.

Limitations and Future Directions
As mentioned in Section 4.6, our method does not show substantial improvement on tasks involving long documents. In addition, the threshold of the prompt dissimilarity score can be treated as a parameter, and this work lacks a study on the effect of this threshold. Finally, we focus on e-commerce related English classification and regression tasks in this work; the performance of our method in other natural language processing use cases remains unexplored. As a next step, we will conduct additional studies on these three topics.

Conclusion
In this work, we discuss methods for generating prompts and propose a way to select prompt templates to include in the dynamic prompt pool. Also, we show that using the mask representation of a prompt either equals or improves upon the performance of standard fine-tuning on four e-commerce applications in both few-shot and data-rich settings. In addition, we find DynaMaR outperforms the fixed prompt approach in both settings. Further, we show that a larger dynamic prompt pool leads to improved model performance when employing DynaMaR.
Ruiqi Zhong, Kristy Lee, Zheng Zhang, and Dan Klein. 2021. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

A Dynamic Prompt Pool with Different Sizes
We need to define two prompt-related parameters while using DynaMaR: the dynamic prompt pool and the inference prompt. The list of prompts in the pool and the inference prompt selected for dynamic prompt pool sizes of 1, 3, and 5 can be found in Table 3, Table 4, and Table 5, respectively.

Figure 1: BERT Attention Distribution. The figure shows that the MLM model puts greater attention on the prompt than the original input.

Table 1: Few-shot Learning Performance Comparison.