MPrompt: Exploring Multi-level Prompt Tuning for Machine Reading Comprehension

Large language models have achieved superior performance on various natural language tasks. One major drawback of such approaches is that they are resource-intensive when fine-tuned on new datasets. Soft-prompt tuning presents a resource-efficient solution for adapting pre-trained language models (PLMs) while keeping their weights frozen. Existing soft-prompt methods mainly focus on designing input-independent prompts that steer the model to fit the domain of the new dataset. These methods often ignore fine-grained information about the task and the context of the text. In this paper, we propose a multi-level prompt tuning (MPrompt) method for machine reading comprehension. It utilizes prompts at the task-specific, domain-specific, and context-specific levels to enhance the comprehension of input semantics at different granularities. We also propose an independence constraint to steer each domain-specific prompt to focus on information within its domain, avoiding redundancy. Moreover, we present a prompt generator that incorporates context-related knowledge into prompt generation to enhance contextual relevance. We conducted extensive experiments on 12 benchmarks covering various QA formats and achieved an average improvement of 1.94% over the state-of-the-art methods.


Introduction
In recent years, pre-trained language models (PLMs) have been widely applied in question-answering tasks (Pandya and Bhatt, 2021), particularly in machine reading comprehension (Baradaran et al., 2022), and have achieved remarkable success through the pretrain-then-finetune paradigm (Roberts et al., 2020; Khashabi et al., 2020b). Despite the excellent performance, due to the explosive growth of parameter sizes in PLMs, the fine-tuning paradigm has become resource-intensive.
Recently, soft-prompt tuning has been widely explored as a parameter-efficient approach to addressing the aforementioned issues (Liu et al., 2023). For example, Li and Liang (2021) proposed Prefix-tuning, which prepends a sequence of optimizable prefixes to each transformer layer while keeping the parameters of PLMs frozen. Prefix-tuning provides a lightweight alternative to fine-tuning and has achieved comparable performance with fewer trainable parameters. Lester et al. (2021) proposed Prompt-tuning, which only prepends optimizable prompt vectors to the input sequence and uses fewer parameters than Prefix-tuning. Ma et al. (2022) discovered negative tokens in Prompt-tuning that have a detrimental effect on downstream tasks and proposed XPrompt to mask these negative tokens, resulting in improved performance. However, the aforementioned methods are input-independent, i.e., they assign a uniform prompt to all inputs of a given task, which under-utilizes the input semantics for answer generation in machine reading comprehension.
There is a growing trend towards designing input-dependent prompts (a.k.a. dynamic prompts) for various tasks (Gu et al., 2021; Clive et al., 2022; Tang et al., 2022). For example, Gu et al. (2021) proposed DialogPrompt for dialog systems, which dynamically generates prompt vectors according to the input dialogue context. Tang et al. (2022) extract input-related information from BERT (Devlin et al., 2018) as contextualized prompts for natural language generation (Lewis et al., 2019; Raffel et al., 2020), which improves the relevance between the generated text and the input text. However, to the best of our knowledge, there has been little research exploring input-dependent prompt methods for question-answering tasks, especially for machine reading comprehension. It is challenging to apply input-independent methods to machine reading comprehension, where the answer is context-sensitive.
To address the above issues, we propose MPrompt, a novel Multi-level Prompt tuning approach for machine reading comprehension. Our method utilizes the dataset and the context information to create three levels of prompts: task-specific, domain-specific, and context-specific. The task-specific prompts are input-independent and generate a prompt based on the task. The domain-specific prompts utilize the domain knowledge derived from the dataset, while the context-specific prompts rely on the input context. These multi-level prompts endow PLMs with multiple fine-grained considerations of the input semantics. To further enhance the domain-specific prompts and avoid information redundancy, we propose an independence constraint to steer each prompt to focus on knowledge within its domain rather than cross-domain knowledge. Furthermore, we extract context-related knowledge from a small-scale PLM, such as T5-small (Raffel et al., 2020), and integrate it into the prompt generation process to enrich the context sensitivity of the prompts. With the help of these three levels of prompts, we achieve an average improvement of 1.94% over the state-of-the-art methods on 12 benchmark datasets.
Our main contributions are as follows:
• We propose a novel multi-level prompt tuning (MPrompt) method for machine reading comprehension, which generates prompts at the task-specific, domain-specific, and context-specific levels to improve answer generation.
• We propose an independence constraint to steer each domain-specific prompt to focus on intra-domain information, avoiding information redundancy while enriching the domain-related semantics.
• We propose a prompt generator based on a small-scale PLM to integrate context-related knowledge into prompt generation, which enriches the context awareness and sensitivity of the generated prompts.
Related Work

Machine Reading Comprehension
Machine Reading Comprehension (MRC) is a challenging task and a hot topic in Question Answering (QA) (Pandya and Bhatt, 2021; Baradaran et al., 2022). It aims to comprehend contexts and provide answers to corresponding questions. In recent years, the focus of MRC research has shifted from extractive question answering (Seo et al., 2016; Wang et al., 2017; Tan et al., 2018) to generative question answering (Izacard and Grave, 2020; Khashabi et al., 2020b, 2022; Jiang et al., 2022). For example, Lewis et al. (2020) explored a retrieval-augmented generation scheme that combines pre-trained retrieval models to enhance the performance of generative question answering models. Khashabi et al. (2020b, 2022) unified the input format of different QA tasks and fine-tuned generative models (Raffel et al., 2020) for question answering. However, with the explosive growth in the parameter size of PLMs, the fine-tuning process becomes exponentially more resource-intensive. One way to relax this computational requirement is through prompt learning (Li and Liang, 2021; Liu et al., 2023).

Prompt Learning
With the success of GPT-3 (Brown et al., 2020), prompt learning (Liu et al., 2023) has provided another efficient way to utilize PLMs and has attracted widespread attention. Prompts can take the form of human-readable natural language (discrete prompts) (Shin et al., 2020; Schick and Schütze, 2020) or embedding vectors (continuous prompts) (Lester et al., 2021; Li and Liang, 2021; Liu et al., 2021a,b; Ma et al., 2022). Continuous prompts provide a more flexible solution that encodes information into trainable embeddings, which present the information to a pre-trained model more efficiently. For example, Lester et al. (2021) proposed Prompt-tuning, which achieves competitive performance by prepending trainable prompts to input sequences, and Ma et al. (2022) further improved Prompt-tuning by pruning the negative prompt tokens. The aforementioned approaches did not sufficiently exploit the input semantics and applied the same prompt to all examples in a dataset, which potentially limits what the language models can deliver. Therefore, Tang et al. (2022) extract contextualized prompts based on the input text from external PLMs, resulting in better performance in natural language generation. Clive et al. (2022) propose to combine task-specific prompts with dynamic prompts, enabling the model to exert finer-grained control over the generated text.
However, there has been little research exploring input-dependent prompt learning in question answering. In contrast to natural language generation, question-answering tasks emphasize understanding of the given question and context. Therefore, the lack of input-dependent prompts may lead to under-utilization of the context information that accompanies the questions, particularly in machine reading comprehension tasks.

Methodology
Our proposed multi-level prompt tuning (MPrompt) framework is illustrated in Figure 1. The framework consists of a prompt generator and a generative question answering model, where the former is built on a smaller encoder-decoder architecture. The prompt generator produces the domain-specific and context-specific prompts and elicits context-related knowledge from a small-scale PLM, injecting it into the generation process.

Task-specific Prompt
Many previous works (Li and Liang, 2021; Lester et al., 2021) have demonstrated that shareable prompt parameters learned from particular tasks can effectively enhance the performance of pre-trained language models on downstream tasks. Therefore, following Li and Liang (2021), we construct task-specific prompts that share common prompt information within the task.
We prepend a prefix P ∈ R^{t×d} to the different attention classes in the pre-trained language model, where t is the length of the task-specific prompt and d is the embedding dimension of the generative QA model. For each attention class, the key-value prefixes T = {T_1, T_2, ..., T_L}, with T_l = (T_{l,K}, T_{l,V}), are learned through an MLP, T = MLP(P), where L denotes the number of layers in the generative QA model.
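The reparameterization above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the shapes and the two-layer Tanh MLP follow the text (and the MLP hidden size of 512 from the appendix), but the random weights and the exact reshape layout into per-layer key/value pairs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

t, d, L = 10, 768, 12  # prompt length, model dim, number of layers (illustrative)
hidden = 512           # MLP hidden size with Tanh activation, per the appendix

# Trainable prefix P (t x d), reparameterized by a two-layer MLP into
# per-layer key/value prefixes T_1..T_L; weights here are random stand-ins.
P = rng.normal(size=(t, d))
W1 = rng.normal(scale=0.02, size=(d, hidden))
W2 = rng.normal(scale=0.02, size=(hidden, 2 * d * L))

def mlp_prefix(P):
    h = np.tanh(P @ W1)                 # (t, hidden)
    out = (h @ W2).reshape(t, L, 2, d)  # split into L layers x (key, value)
    return [(out[:, l, 0, :], out[:, l, 1, :]) for l in range(L)]

T = mlp_prefix(P)  # T[l] = (T_{l,K}, T_{l,V}), each of shape (t, d)
```

In training, only P and the MLP weights would be updated while the backbone stays frozen.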

Domain-specific Prompt
In question answering scenarios, especially in machine reading comprehension, the context plays a crucial role as it contains the answer or the evidence supporting the answer. Meanwhile, the contexts in QA datasets can often be divided into several domains. For example, in NewsQA (Trischler et al., 2016), the contexts can be grouped into domains such as politics, economics, and society. To improve the semantic understanding of the context, contexts from different domains should utilize different prompts, and each domain-specific prompt should capture the specific knowledge shared within its domain.
However, most QA datasets do not provide explicit information about the domain of each context. To avoid additional annotation costs, we cluster the contexts C in an unsupervised manner to obtain the domains D = {D_1, ..., D_n}, where n denotes the number of domains and each context belongs to exactly one domain. Each domain has its own shared prompt, giving the domain-specific prompts D = {D_1, ..., D_n}, where D_i ∈ R^{ρ×d_p} for all i ∈ {1, ..., n}, D_i denotes the prompt shared within domain D_i, ρ denotes the length of the domain-specific prompts, and d_p denotes the embedding dimension of the prompt generator.
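The unsupervised domain assignment can be sketched as below. The paper clusters Sentence-Transformers (all-mpnet-base-v2) embeddings with K-means and k = 3; this toy NumPy loop and the synthetic "context embeddings" are illustrative stand-ins for that pipeline.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Minimal Lloyd's k-means: assign each context embedding to one domain.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # squared Euclidean distance of every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# toy "context embeddings": three well-separated blobs of 20 points each
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(loc=c, scale=0.1, size=(20, 8))
                    for c in (-5.0, 0.0, 5.0)])
domains = kmeans(X, k=3)  # one domain label per context
```

Each context's label then selects which shared domain-specific prompt D_i it uses.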
Intuitively, domain-specific prompts should encapsulate the information of their respective domains. Therefore, we introduce an independence constraint to steer D_i to focus on the information within domain D_i. Focusing on the knowledge specific to each domain can enhance contextual understanding, as confirmed by subsequent experiments. Specifically, for any pair D_a, D_b ∈ D, we introduce the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005; Song et al., 2007) to measure the independence between the prompts of the two domains:

HSIC(D_a, D_b) = (ρ-1)^{-2} tr(K H K' H),  (1)

where K_{ij} = ϕ(d_a^i, d_a^j), K'_{ij} = ψ(d_b^i, d_b^j), H = I - (1/ρ)11^T is the centering matrix, and ϕ and ψ denote the kernel functions. HSIC = 0 indicates independence when ϕ and ψ are universal kernels. However, HSIC is not invariant to isotropic scaling, which can be addressed by normalizing HSIC, known as Centered Kernel Alignment (CKA) (Nguyen et al., 2020; Raghu et al., 2021; Chen et al., 2023):

CKA(D_a, D_b) = HSIC(D_a, D_b) / sqrt(HSIC(D_a, D_a) · HSIC(D_b, D_b)),  (2)

where CKA ∈ [0, 1], and CKA = 0 implies independence.
Computing the pair-wise independence over all domains requires n(n-1)/2 comparisons, which is slow for large n. To reduce the computational cost, we randomly sample m pairs of domains as Θ to calculate the constraint L_idp in each training iteration:

L_idp = (1/m) Σ_{(a,b)∈Θ} CKA(D_a, D_b).  (3)
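A minimal sketch of the sampled independence constraint, using a linear kernel for both ϕ and ψ (an assumption for illustration; the paper does not restrict the kernel choice here):

```python
import numpy as np

def linear_hsic(X, Y):
    # HSIC with linear kernels: K = X X^T, K' = Y Y^T, centered by H
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, Kp = X @ X.T, Y @ Y.T
    return np.trace(K @ H @ Kp @ H) / (n - 1) ** 2

def cka(X, Y):
    # Centered Kernel Alignment: 1 = identical structure, 0 = independent
    return linear_hsic(X, Y) / np.sqrt(linear_hsic(X, X) * linear_hsic(Y, Y))

def l_idp(prompts, m, rng):
    # sample m pairs of distinct domains (the set Theta) and average their CKA
    n = len(prompts)
    total = 0.0
    for _ in range(m):
        a, b = rng.choice(n, size=2, replace=False)
        total += cka(prompts[a], prompts[b])
    return total / m

rng = np.random.default_rng(0)
D = [rng.normal(size=(10, 32)) for _ in range(6)]  # 6 domain prompts, rho=10, d_p=32
loss = l_idp(D, m=3, rng=rng)
```

Minimizing this loss pushes the sampled pairs of domain-specific prompts toward mutual independence.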

Context-specific Prompt
The domain-specific prompts provide shared intra-domain information, which is finer-grained than the knowledge carried by the task-specific prompts. However, there is still diversity among contexts within the same domain, and utilizing such diverse information is critical for answering questions accurately. Therefore, we construct context-specific prompts to enhance the understanding of each context, providing finer-grained knowledge than the domain-specific prompts. Specifically, all contexts share a context-specific prompt C ∈ R^{κ×d_p}, where κ denotes the length of the context-specific prompt. Furthermore, we propose a prompt generator to ensure that C yields different prompts for different contexts, especially for contexts unseen in the training data; we discuss its other roles in the next section.

Prompt Generator
In general, task-specific prompts relate to the task of a specific dataset, while domain-specific and context-specific prompts are both closely related to the context. To better leverage the domain-specific and context-specific prompts to enhance PLMs' understanding of the context semantics, we introduce a small-scale PLM to encode the contexts and integrate them into the prompt generation process.
Consider a context c_i that belongs to domain D_j. The encoder of the prompt generator takes the context c_i as its input, while the concatenation of the domain-specific prompt D_j and the context-specific prompt C serves as the input X to the decoder, where X ∈ R^{(ρ+κ)×d_p}. Note that we remove the original decoder embedding layer. The output of the prompt generator is mapped to key-value pairs P = {P_1, ..., P_L} through an MLP, where P ∈ R^{(ρ+κ)×2dL}, P_l = (P_{l,K}, P_{l,V}), P_{l,K} and P_{l,V} ∈ R^{(ρ+κ)×d}, and L denotes the number of layers in the generative QA model. Intuitively, knowledge related to the context c_i is elicited from the encoder of the PLM and then integrated into the prompt generation process in the decoder. In this way, our approach learns the semantics between prompt and context better than previous work (Li and Liang, 2021; Lester et al., 2021; Ma et al., 2022), since both the domain-specific prompt and the context-specific prompt are closely related to the context.
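The tensor flow through the prompt generator can be traced shape-wise as below. This is only a shape-level NumPy sketch: the random linear map stands in for the frozen encoder-decoder (UnifiedQA-Small in the paper), whose decoder would actually attend to the encoded context c_i.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, kappa, d_p, d, L = 10, 10, 512, 768, 12  # illustrative sizes

D_j = rng.normal(size=(rho, d_p))     # domain-specific prompt for domain D_j
C = rng.normal(size=(kappa, d_p))     # shared context-specific prompt
X = np.concatenate([D_j, C], axis=0)  # decoder input, (rho+kappa, d_p)

# Stand-in for the frozen encoder-decoder: a random map; in the real method
# the decoder output is conditioned on the encoded context.
W_dec = rng.normal(scale=0.02, size=(d_p, d_p))
H = np.tanh(X @ W_dec)                # (rho+kappa, d_p)

# MLP maps the generator output to per-layer key/value prompt pairs
W_mlp = rng.normal(scale=0.02, size=(d_p, 2 * d * L))
P = (H @ W_mlp).reshape(rho + kappa, L, 2, d)
P_layers = [(P[:, l, 0, :], P[:, l, 1, :]) for l in range(L)]  # (P_{l,K}, P_{l,V})
```

Because X always contains D_j and C while the encoder sees c_i, the resulting P differs per context even for contexts unseen during training.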

Applying Multi-level Prompts
Overall, P contains the information of the domain-specific and context-specific prompts as well as context-related knowledge from the PLM, while T contains the information shared within the task. To exploit the multi-level prompt information for question answering, we integrate these different levels of prompts into the generative QA model. Specifically, for the self-attention computation of layer l in the encoder of the generative QA model, the original K_l and V_l are augmented as:

K'_l = [T_{l,K}; P_{l,K}; K_l],  (4)
V'_l = [T_{l,V}; P_{l,V}; V_l],  (5)

where K'_l and V'_l ∈ R^{(t+ρ+κ+M)×d} and M denotes the length of the input sequence. For the self-attention and cross-attention computation of layer l in the decoder, K_l and V_l are augmented as:

K'_l = [T_{l,K}; K_l],  (6)
V'_l = [T_{l,V}; V_l],  (7)

where K'_l and V'_l ∈ R^{(t+M)×d}. To train the multi-level prompts, the loss function is a weighted sum of two loss terms:

L = L_NLL + λ L_idp,  (8)

where λ is a hyperparameter controlling the independence constraint and L_NLL is the text generation loss:

L_NLL = -Σ_t log p(y_t | y_<t, x),  (9)

where y_t denotes the t-th element of the target sequence and x represents the input sequence. It is worth noting that, guided by Equation 8, we only update the MLPs and the task-specific, domain-specific, and context-specific prompts, while keeping all other parameters frozen.
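The key/value augmentation described above amounts to simple prepending along the sequence axis. A minimal NumPy sketch (in a real implementation these concatenations happen inside each attention layer of the frozen model, with the usual head splitting):

```python
import numpy as np

rng = np.random.default_rng(0)
t, rho, kappa, M, d = 10, 10, 10, 128, 64  # prefix/prompt/sequence lengths, dim

def augment_encoder_kv(K_l, V_l, T_lK, T_lV, P_lK, P_lV):
    # encoder self-attention: prepend task prefix T and generated prompts P
    K_aug = np.concatenate([T_lK, P_lK, K_l], axis=0)
    V_aug = np.concatenate([T_lV, P_lV, V_l], axis=0)
    return K_aug, V_aug

def augment_decoder_kv(K_l, V_l, T_lK, T_lV):
    # decoder self- and cross-attention: only the task prefix is prepended
    return (np.concatenate([T_lK, K_l], axis=0),
            np.concatenate([T_lV, V_l], axis=0))

K_l, V_l = rng.normal(size=(M, d)), rng.normal(size=(M, d))
T_lK, T_lV = rng.normal(size=(t, d)), rng.normal(size=(t, d))
P_lK, P_lV = rng.normal(size=(rho + kappa, d)), rng.normal(size=(rho + kappa, d))

K_enc, V_enc = augment_encoder_kv(K_l, V_l, T_lK, T_lV, P_lK, P_lV)
K_dec, V_dec = augment_decoder_kv(K_l, V_l, T_lK, T_lV)
```

The queries are left untouched, so the prompts act purely as extra attendable positions.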

Implementation
We convert each dataset into a unified text-to-text format to suit generative question answering models, following Khashabi et al. (2020b, 2022). Our MPrompt is based on three scales of pre-trained UnifiedQA (Khashabi et al., 2020b), a T5 model for question-answering tasks: Base, Large, and XL with 220M, 770M, and 3B parameters, respectively. For the prompt generator, we utilize UnifiedQA-Small with 60M parameters to avoid excessive GPU memory demands.
In all experiments, we employ the AdamW optimizer (Loshchilov and Hutter, 2017) with β_1 = 0.9, β_2 = 0.999, and a weight decay of 0.01. We train our method with a learning rate of 5e-5, a 10% warmup ratio, λ = 1e-4, and 50 epochs, and record the model with the best performance on the validation set. To ensure a fair comparison, we fix the length of the task-specific prompts to 10 and adjust the lengths of the domain-specific and context-specific prompts over {5, 10, 15, 20, 30, 40, 50, 60}. We use K-means (MacQueen, 1967) and Sentence-Transformers (all-mpnet-base-v2) (Reimers and Gurevych, 2019) to cluster the contexts and fix the number of clusters to 3 to obtain the domain information D. The visualization of the clustering results with t-SNE (Van der Maaten and Hinton, 2008) is deferred to Appendix A.2. For all baselines, the hyperparameter settings follow the values reported in the original papers to achieve optimal results. Our method is implemented with the PyTorch (Paszke et al., 2019) and Transformers (Wolf et al., 2020) libraries, and experiments are conducted on Ubuntu 22.04 systems with NVIDIA RTX A100 or 4090 GPUs. Other implementation details and the optimal hyperparameters are deferred to Appendix A.3.

Performance Comparison
Table 2 displays the main experimental results of the different methods on the 12 benchmark datasets. We conduct a comprehensive comparison between MPrompt and state-of-the-art methods, including Prompt-tuning (Lester et al., 2021), Prefix-tuning (Li and Liang, 2021), and XPrompt (Ma et al., 2022), across different parameter sizes of PLMs. The datasets cover a wide range of question-answering scenarios, which is beneficial for a comprehensive evaluation of the different methods.
We observe that: (1) Our method MPrompt outperforms the other soft-prompt methods by a large margin across all tasks and model scales. For example, MPrompt achieves absolute improvements of 2.17%, 1.85%, and 1.82% over Prefix-tuning on UnifiedQA-Base, Large, and XL, respectively. This is because input-independent prompt learning methods apply a uniform prompt to all inputs of a given task, which evidently under-utilizes the input semantics in answer generation, whereas MPrompt enhances the contextual comprehension of the PLMs with multiple levels of prompts. (2) Prefix-tuning and XPrompt have comparable performance at the same model size, and both outperform Prompt-tuning on the NewsQA, DROP, OBQA, QASC, and BoolQ-NP datasets. This is because Prefix-tuning provides deeper prompts, while XPrompt removes the negative prompts in Prompt-tuning. However, MPrompt achieves higher performance than Prefix-tuning and XPrompt at the same model sizes, demonstrating its effectiveness. (3) Owing to the high computational cost and full-weight update scheme of full fine-tuning, there is still a significant performance gap between soft-prompt tuning and full fine-tuning in general. However, as shown in Table 2, MPrompt matches the fine-tuning performance on all tasks and even outperforms fine-tuning with UnifiedQA-Base and XL on most tasks. Specifically, for UnifiedQA-Base, MPrompt achieves the best performance on SQuAD2, NewsQA, NarQA, MCTest, ARC (easy), RACE, and BoolQ, with improvements of +0.69%, +0.62%, +0.24%, +1.31%, +0.78%, +0.21%, and +0.25% over fine-tuning, respectively. We attribute this to incorporating context knowledge from other PLMs (UnifiedQA-Small in this paper) into prompt generation to enrich the semantics.
In summary, our method achieves excellent performance compared to state-of-the-art soft-prompt methods, closing and in some cases surpassing the gap to fine-tuning. This demonstrates that MPrompt effectively enhances the contextual comprehension and enriches the semantics of the PLMs, which significantly improves performance on downstream question-answering tasks.

Ablation Analysis
In this part, we perform an ablation study on the various components of MPrompt, as shown in Figure 2. Firstly, we observe a decrease in performance when removing the domain-specific or context-specific prompts. These prompts are constructed from inputs at different granularities, which enhances the semantic comprehension of the input. Secondly, when removing the independence constraint, there is a significant decrease in performance. The independence constraint steers the domain-specific prompts to focus on intra-domain rather than inter-domain information, which effectively avoids information redundancy. Furthermore, performance decreases when the prompt generator is removed.
The prompt generator ensures that context-specific prompts are generated differently for different contexts, even those that never appear in the training data, which enhances the semantic understanding of the input context.Moreover, the prompt generator elicits context-related knowledge from PLM and incorporates it into the prompt generation process, which helps improve the context awareness of the prompts.

Sensitivity Analyses
In this part, we conduct comprehensive sensitivity analyses of our proposed method, covering the length of the prompts, the weight λ of the loss L_idp, different clustering results D, different scales of PLMs in the prompt generator, and the number of sampled domain pairs m.

The Length of Prompts
In MPrompt, the length of the prompts is a key factor affecting model performance. Here, we investigate how the lengths of the domain-specific and context-specific prompts impact the final performance. We fix the length of one prompt to 10 and vary the other over {5, 10, 15, 20, 30, 40, 50, 60}. As shown in Figure 3, in most cases MPrompt shows stable performance across the lengths of the domain-specific and context-specific prompts. Moreover, since DROP and OBQA require reasoning ability (Roberts et al., 2020), they are more sensitive to the prompt length than the other datasets.

The Weight of Loss L idp
We investigate the impact of the loss weight λ on the results, as shown in Table 3. We find that the weighting has a minor impact on the SQuAD2 dataset, and there is an optimal weight of 0.0001 for the DROP, OBQA, and BoolQ-NP datasets. Since L_idp takes values in [0, 1], too large a λ means that the model no longer focuses on generating answers as its primary goal, while an extremely small λ would make the domain-specific prompts lose their focus on unique intra-domain information.

Different Scales of Prompt Generator
In general, increasing the number of parameters of PLMs brings more abundant semantic knowledge. Therefore, we investigate the impact of PLMs of different scales on performance, as shown in Figure 4. The prompt generator delivers significant performance improvements. Our evaluation shows that larger-scale PLMs tend to yield better results but require more computational resources. To balance cost and performance, UnifiedQA-Small (60M parameters) already delivers satisfactory performance gains with a small computational overhead.

Number of sampled domain pairs
We investigate the impact of the number of sampled domain pairs on the results. The number of clusters is set to 6, which would require 15 comparisons per batch. We evaluate the number of sampled pairs m in {1, 3, 5,

Conclusion
In this paper, we propose a novel multi-level prompt tuning (MPrompt) method for machine reading comprehension. Our method strengthens PLMs' utilization of the input semantics through three levels of prompts: task-specific, domain-specific, and context-specific. The task-specific prompts are input-independent and generate prompts specific to a task. The domain-specific prompts utilize the domain knowledge derived from the dataset, while the context-specific prompts rely on the input context. Our experiments show that the combination of the three levels of prompts improves answer generation performance across different sizes of PLMs and 12 benchmark datasets. In future work, we will extend our method to more tasks such as summarization, translation, and sentiment analysis.

Limitations
In our method, the length of the prompts is the most critical parameter affecting performance. In our experiments, we observe that MPrompt is sensitive to the prompt length on some challenging datasets, so obtaining the optimal hyperparameter combination inevitably requires a grid search over the prompt lengths. Our model is designed for encoder-decoder architectures, so decoder-only architectures such as LLaMA, GPT, or BLOOM are not directly applicable. Our method also requires access to the model's parameters, so it cannot be applied to black-box models.

A Appendix
A.1 Datasets: Details We evaluate our method on 12 datasets covering a wide range of QA tasks. Since some datasets (such as ARC, OpenBookQA, and QASC) lack contexts, following Khashabi et al. (2020b, 2022), we use versions of these datasets that contain retrieved contexts. Due to limited test access for some datasets, such as SQuAD2, NewsQA, DROP, QASC, BoolQ, and BoolQ-NP, we use the validation set as the test set and re-randomize an equal number of samples from the training set as the validation set. For MCTest, we use the combination of mc160 and mc500.
For RACE, we use RACE-middle, which consists of English reading comprehension questions designed for Chinese middle school students. The datasets will be made available with our code.

A.2 Visualization of context clustering results with Kmeans
In the paper, we cluster the contexts with K-means and fix the number of clusters to 3, since we do not have access to gold-standard clustering labels for each dataset. To inspect the clustering results, we visualize them using t-SNE (Van der Maaten and Hinton, 2008), as shown in Figure 5. Most of the datasets present better clustering results when the number of clusters is 3, which provides better domain information.

A.3 Implementation details
In Table 6, we report the hyperparameters used for training the models in the experimental section. For model inference (answer generation), we set num_beams to 2, min_length to 1, and early_stopping to True. For the MLP, we set the hidden layer dimension to 512 and use the Tanh activation function. For the domain-specific and context-specific prompts, we initialize each prompt token with an embedding vector drawn from the prompt generator's vocabulary, as done by Lester et al. (2021).

Figure 3 :
Figure 3: Evaluation of prompt length on the UnifiedQA-Base model. ρ and κ denote the lengths of the domain-specific and context-specific prompts, respectively.

Table 2 :
Comparison of state-of-the-art algorithms on different datasets. The unit for all metrics is percentage (%). The numbers in blue indicate the performance gain (↑) of our method compared to Prefix-tuning.

Table 5 :
Evaluation of different numbers of sampled domain pairs per batch, m.

Table 6 :
Hyperparameter settings for our method. "Task len" indicates the token length of the task-specific prompts, "bsz" indicates the batch size, and "max_ans_length" indicates the maximum length of generated answers during inference.