Fixed Input Parameterization for Efficient Prompting

Recent works have shown that attaching prompts to the input is effective at conditioning Language Models (LMs) to perform specific tasks. However, prompts are always included in the input text during inference, even when they are fixed, incurring substantial computational and memory overhead. Also, there is currently no straightforward way to utilize prompts that are longer than the maximum input length of the LM without incurring additional costs during inference. We formally define the Fixed Input Parameterization (FIP) problem, which focuses on injecting a fixed prompt into the parameters of an LM as an efficient alternative to attaching fixed prompts to the input. We show that in scenarios with long fixed prompts, FIP can be up to 280 times more efficient in terms of total FLOPs than previous approaches. We further explore methodologies for FIP and show promising results in persona-dependent conversation, semantic parsing, and zero-shot learning with task instructions. Through these explorations, we show that FIP can be a promising direction for conditioning language models in scenarios with long and fixed prompts.


Introduction
Contemporary works on Language Models (LMs) (Raffel et al., 2020; Brown et al., 2020; Sanh et al., 2022; Thoppilan et al., 2022) have shown that attaching prompts to the input is effective at conditioning LMs to perform specific tasks. Note that the prompt in this work refers to a broader notion that includes both prompts used to induce specific behavior and prompts used to provide contextual knowledge, such as a persona for dialogue agents. LMs are trained to condition on the given prompts in hopes of generalizing to unseen prompts during inference. Unseen prompts can be a persona for persona-dependent conversation (Zhang et al., 2018; Xu et al., 2022), a database schema for semantic parsing (Yu et al., 2018; Hazoom et al., 2021), or a task instruction for zero-shot learning with task instructions (Wei et al., 2022; Sanh et al., 2022). In these tasks, a fixed prompt is attached to the input at every inference. For instance, in persona-dependent conversation, a persona description is appended to the dialogue history, so that the LM can always be conditioned on the persona. As another example, in semantic parsing, the LM is conditioned on the database schema as well as the natural language question to generalize to a new database. Lastly, zero-shot learning with task instructions involves adding natural language instructions to the inputs for adapting LMs to novel tasks.
However, concatenating prompts to input sequences for prompt-dependent inference has two major limitations. (1) During inference, prompts are always included in the input text and thus incur computational and memory overhead (Liu et al., 2022). (2) It is challenging to fit a long text, such as the detailed description of a persona, as a prompt into Transformer-based models whose input lengths are often fixed (Tay et al., 2022). For instance, in persona-dependent conversation, the model constantly refers to the persona description along with the dialogue history (Wolf et al., 2019; Roller et al., 2021), as shown in the left side of Figure 1. Moreover, in real-world scenarios, a persona may consist of a long, detailed text description of a character or person, not just a few profile sentences. Naively concatenating long prompts to the input sequences is challenging due to the quadratic cost in time and memory of Transformer-based architectures with regard to the input sequence length. Other approaches specialized for processing long inputs (Beltagy et al., 2020; Katharopoulos et al., 2020; Izacard and Grave, 2021), or those that augment the LM with a retrieval mechanism (Han et al., 2022), may be used but still come with increased overall memory and computation, ultimately leading to a delay in generating responses. This problem becomes critical in situations where LMs are deployed and fast inference speed is required.
In this work, we formally define the Fixed Input Parameterization (FIP) problem, where we focus on injecting a given fixed prompt into the parameters of an LM to address the two limitations mentioned above. With FIP, LMs can produce prompt-dependent outputs without the computational overhead of appending fixed prompts at inference time (the right side of Figure 1), and it also enables the injection of longer prompts in a holistic way.
More specifically, we first show that Fixed Input Parameterization (FIP) is much more efficient (up to 280 times) in terms of total FLOPs compared to previous approaches that may be used for handling long prompts, such as Fusion-in-Decoder (Izacard and Grave, 2021) or Linear Transformer (Katharopoulos et al., 2020). Next, we explore different methodologies as baselines for FIP, including a continued pre-training approach on the prompt as well as a novel distillation approach called Pseudo-INput Generation (PING). We apply these FIP methods to three different tasks with fixed prompts: persona-dependent conversation, semantic parsing, and zero-shot learning with instructions. We compare the methods against LMs with explicit prompts as the upper bound as well as the LM without both the prompt and FIP as the lower bound. Experimental results show meaningful improvements with respect to the lower bound, but also exhibit a non-trivial gap with the upper bound. Despite the performance and efficiency trade-off, we still believe that FIP is a direction worth exploring considering its computational benefit, especially when inference costs are critical in real-world applications.
In sum, our main contributions are threefold:
• We formally define the Fixed Input Parameterization (FIP) problem and demonstrate its necessity in terms of computation and memory efficiency in scenarios with long prompts.
• We explore baseline approaches for FIP, showing that performance can approach the upper bound (unconstrained) performance in some cases.
• We show that the injection of long prompts (e.g., a detailed description of a persona) can be achieved through FIP and demonstrate its efficiency in comparison with previous methods: up to 280 times more efficient during inference.
Related Work

With the help of appropriate prompts, one can exploit knowledge learned by a pre-trained LM and manipulate the LM's behavior. However, in the in-context learning scenario, processing prompts that involve many training examples at each inference incurs substantial computational and memory overhead (Liu et al., 2022). Given training data, Liu et al. (2022) replace in-context learning with fine-tuning a small set of parameters to tackle this issue. We tackle the same issue but assume a stricter scenario in which there is no training data for the given prompt.
Efficient Transformers One can consider using efficient Transformer-based (Vaswani et al., 2017) architectures for handling long input sequences (Tay et al., 2022). The main challenge of using a vanilla Transformer architecture is the quadratic cost in time and memory with regard to the input sequence length due to the self-attention operation. There has been a surge of recent works addressing this problem (Dai et al., 2019; Beltagy et al., 2020; Katharopoulos et al., 2020; Zhu et al., 2021; Guo et al., 2021). They are primarily dedicated to improving either the efficiency of the self-attention mechanism or the general efficiency of the Transformer architecture through sparse models. Also, there has been an attempt to distill a unique prompt to handle long inputs (Askell et al., 2021). Our Fixed Input Parameterization (FIP) approach tackles the efficiency problem of performing prompt-dependent tasks by keeping the input sequences short (without prompts), bounding the time and memory complexity to a constant that is invariant to the length of the prompt. In contrast to Askell et al. (2021), our work formally frames the problem in a more general and realistic setting, since we aim to inject new prompts with no corresponding training data rather than a single prompt with corresponding training data.
Persona-dependent Conversation Endowing a chatbot with a persona (Zhang et al., 2018; Xu et al., 2022) is challenging, but it enables the chatbot to deliver more personal, specific, consistent, and engaging conversations (Zhang et al., 2018) and gain user trust (Liu et al., 2020; Song et al., 2019; Qian et al., 2018). To achieve this, previous works have attached a persona to the dialogue history at every inference, so that the model can always be conditioned on the persona. However, when given a long persona description or a long conversation history as a persona, this approach brings the critical problem of increased overall memory and computation, resulting in delayed response generation.
FIP allows a dialogue agent to generate responses without a persona description as the explicit input once the persona is injected.
Semantic Parsing Semantic parsing is the task of mapping a natural language query into a SQL query executable on a database. Specifically, cross-domain (cross-database) semantic parsing, where models are trained and tested on different domains (databases) (Yu et al., 2018), introduces many generalization challenges (Hazoom et al., 2021). Previous works concatenate the natural language query with the serialized database schema as the input to address the problem (Suhr et al., 2020; Deng et al., 2021; Xie et al., 2022). With FIP, the model is adapted to a new database schema in advance, so that it can map natural language queries to SQL queries on the new database without explicitly referring to the schema during inference.
Zero-shot Learning with Task Instructions Recent works (Sanh et al., 2022; Wei et al., 2022) have addressed zero-shot generalization to new tasks (Brown et al., 2020; Kim et al., 2021) by multi-task prompted training. With multi-task prompted training, the models learn to use task instructions as prompts to generalize to unseen tasks. It has been demonstrated that this approach improves generalization ability to novel tasks and offers an effective substitute for unsupervised language model pre-training. Through FIP, the LM can be aware of a novel task instruction before performing the task and thus does not require the instruction, which can be lengthy, to make predictions.

Fixed Input Parameterization
In this section, we formally define Fixed Input Parameterization (FIP) as a task and describe the benefits of the formulation. Prompt-dependent generation is the task of generating an output sequence y that is a proper response to the input sequence x and coherent with the prompt z. Utilizing the prompt during inference, the generated sentence is obtained by y = f(z, x), where f denotes an LM such as T5 or GPT-2. Fixed Input Parameterization (FIP), i.e., parameterization of prompts, allows LMs to perform prompt-dependent generation without using prompts during inference. To achieve this, we need to design a FIP method H that injects a prompt z into an LM f. The process of FIP can be represented as

f_z = H(f, z),    (1)

where f_z denotes an LM injected with the prompt.
Then the prompt-dependent output sequence can be obtained by y = f_z(x). FIP can also be applied to long prompts whose length exceeds the LM's input sequence length. Given a long prompt z, we decompose it into multiple sub-prompts {z_i}, each of which fits the LM's input length, i.e., z = z_{1:n} = [z_1; z_2; ...; z_n]. Then the FIP process can be executed iteratively, injecting each sub-prompt sequentially while the LM is aware of the previous sub-prompts:

f_{z_1} = H(f, z_1),    (2)
f_{z_{1:2}} = H(f_{z_1}, z_2),    (3)
...
f_{z_{1:n}} = H(f_{z_{1:n-1}}, z_n).    (4)
The above formulation can be seen as a high-level abstraction of the iterative FIP that we aim to approximate. In practice, in order to fully inject z_{1:n}, we repeat (2)-(4) multiple times (i.e., multiple epochs).
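
The iterative procedure above can be sketched as follows. This is a minimal sketch, not the paper's implementation: chunk_prompt and iterative_fip are our names, and inject_fn is a hypothetical stand-in for the FIP method H (e.g., one round of continued pre-training or distillation on a single sub-prompt).

```python
def chunk_prompt(prompt_tokens, max_len):
    """Decompose a long prompt z into sub-prompts z_1..z_n, each fitting the LM input."""
    return [prompt_tokens[i:i + max_len] for i in range(0, len(prompt_tokens), max_len)]

def iterative_fip(model, prompt_tokens, inject_fn, max_len=512, epochs=3):
    """Approximate f_{z_{1:n}}: inject each sub-prompt in order, for several epochs."""
    sub_prompts = chunk_prompt(prompt_tokens, max_len)
    for _ in range(epochs):          # in practice, (2)-(4) are repeated multiple times
        for z_i in sub_prompts:      # f_{z_{1:i}} = H(f_{z_{1:i-1}}, z_i)
            model = inject_fn(model, z_i)
    return model
```

Because each call to inject_fn starts from the model produced by the previous one, the LM is aware of earlier sub-prompts when a new sub-prompt is injected.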
Why is Fixed Input Parameterization necessary? FIP brings advantages in terms of efficiency when applied to prompt-dependent tasks.
The previous approach of appending prompts to the input sequences has the drawback that the model repeatedly refers to the prompt at each inference. This becomes critical in scenarios requiring long prompts, as the Transformer architecture has quadratic computational and memory costs due to the self-attention operation. We propose FIP as a solution to this computation bottleneck. Once a prompt is injected into the LM in advance, the LM no longer needs to refer to the prompt during inference. As a result, the model's input length remains independent of the length of the prompt, and the model is able to utilize prompts of any length efficiently. We discuss the efficiency gain of FIP in Section 6.1.
Evaluation Metric for FIP FIP can be evaluated with the evaluation metric of the fixed prompt-dependent task at hand. We also introduce a metric called the FIP score to measure the degree of injection. The metric is agnostic of the target task, as it compares the results with those of an LM given actual prompts during inference. Let X_w/ prompt denote the LM's task score with the prompt as an additional input (upper bound) and X_w/o prompt denote the LM's task score without the prompt (lower bound). We define the FIP score as the min-max scaling of X_FIP, where X_FIP represents the score of the LM on the target task after FIP, i.e., FIP score = max(0, X_FIP − X_w/o prompt) / (X_w/ prompt − X_w/o prompt). We use FIP only in situations where X_w/ prompt > X_w/o prompt, because there is no reason to inject a prompt if task performance degrades when using the prompt. Even if the range of individual task scores varies from task to task, the FIP score represents the overall injection effectiveness of FIP methods, agnostic of the individual task score range.
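
For metrics where higher is better, the FIP score defined above amounts to a clipped min-max scaling; a minimal sketch (the function name is ours):

```python
def fip_score(x_fip, x_with_prompt, x_without_prompt):
    """Degree of injection: 0 matches the no-prompt lower bound, 1 the with-prompt upper bound."""
    # FIP is only used when the prompt actually helps (upper bound > lower bound).
    assert x_with_prompt > x_without_prompt
    return max(0.0, x_fip - x_without_prompt) / (x_with_prompt - x_without_prompt)
```

For example, a task score of 15 after FIP, against a lower bound of 10 and an upper bound of 20, gives a FIP score of 0.5.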

Methods for Fixed Input Parameterization
In this section, we explore methods of Fixed Input Parameterization (FIP) that can address prompt-dependent tasks without accessing the prompt during inference. To achieve this, the model should be trained to store the prompt in its parameters. This can be seen as parameterizing the prompt into the model instead of feeding the prompt explicitly to the model. This is challenging, as the prompt is unseen to the model and has no corresponding training data. In Section 4.1, a baseline method based on continued pre-training is introduced, followed by a method for improving the baseline with curriculum learning. Section 4.2 presents a novel distillation-based method called Pseudo-INput Generation (PING) that learns to generate pseudo-inputs to inject novel prompts.

Continued Pre-training
We establish the Continued Pre-training method as a straightforward baseline for FIP. This method injects prompts into the parameters of an LM by continuing the pre-training objective of the LM on the target prompt. The pre-training objective is a straightforward option as it works in an unsupervised manner. In our experiments, we leverage the pre-trained T5 model (Raffel et al., 2020) and thus use the masked language modeling objective, which is the pre-training objective of T5. Following Raffel et al. (2020), we randomly replace 15% of a given prompt with special mask tokens; the model is then trained to predict the sequence of masked tokens. In this process, the model learns about the prompt in the same way it learns knowledge during the pre-training stage.
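
A simplified sketch of the masking step is shown below. Note that this is an illustration under our own simplifications: T5's actual objective corrupts contiguous spans and uses its tokenizer's sentinel tokens, whereas this sketch (function name ours) masks individual tokens.

```python
import random

def mask_prompt(tokens, mask_ratio=0.15, seed=0):
    """Replace ~mask_ratio of the prompt tokens with sentinels; the training
    target is the sequence of sentinels paired with the masked-out tokens."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    inputs, targets = list(tokens), []
    for k, pos in enumerate(positions):
        sentinel = f"<extra_id_{k}>"
        targets += [sentinel, inputs[pos]]
        inputs[pos] = sentinel
    return inputs, targets
```

The (inputs, targets) pairs would then be fed to the LM with its standard sequence-to-sequence loss, so no labeled task data is needed.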

Curriculum learning
We further improve the baseline method by leveraging curriculum learning (Bengio et al., 2009; Campos, 2021) during continued pre-training. We set the mask ratio as the difficulty criterion (Wettig et al., 2022) and gradually increase the ratio throughout Continued Pre-training. As the mask ratio increases, the model must predict more masked tokens given less context. With curriculum learning, we expect the LM to gradually adapt better to the prompt, improving its prompt-dependent task performance. Throughout the experiments, we increase the mask ratio from 15% to 30%, 50%, and 70% and report the best score.
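
One way to realize such a curriculum is a staged ramp over the injection steps; a sketch under our own assumption of four equal-length stages (the exact schedule used in the experiments may differ):

```python
def mask_ratio_schedule(step, total_steps, ratios=(0.15, 0.30, 0.50, 0.70)):
    """Advance through increasing mask ratios as continued pre-training progresses."""
    stage = min(len(ratios) - 1, step * len(ratios) // max(1, total_steps))
    return ratios[stage]
```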

Pseudo-INput Generation (PING)
The purpose of FIP is to inject a prompt into the parameters of an LM, which can also be done indirectly through distillation. In this subsection, we propose a novel distillation-based method called Pseudo-INput Generation (PING) that distills a novel prompt into a student LM that does not have access to the prompt, using a teacher LM that does. For distillation, pseudo-inputs are needed, since we assume a scenario where the prompt to be injected has never been seen during training and has no separate training data. An overview of PING is illustrated in Figure 2. As shown in the figure, during Phase 1, an input generator is trained with the task-specific training data. Given a prompt of the task as the input, the generator is expected to generate task inputs that correspond to the prompt. During Phase 2, the input generator is frozen and is used to generate pseudo-inputs from the unseen prompt, which are then given to the teacher together with the prompt, while only the pseudo-inputs are given to the student. In this way, the student learns to follow the teacher and is able to learn about the prompt indirectly.
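
Phase 2 of PING can be sketched as the loop below. This is a schematic, not the paper's code: input_generator, teacher, and distill_step are hypothetical stand-ins for the frozen generator from Phase 1, the prompt-conditioned teacher LM, and one student training step (e.g., minimizing KL-divergence on the decoder's output distribution).

```python
def ping_phase2(prompt, input_generator, teacher, distill_step, num_steps=100):
    """Distill a prompt the student has never seen, via generated pseudo-inputs."""
    for _ in range(num_steps):
        pseudo_x = input_generator(prompt)       # generator is frozen after Phase 1
        teacher_out = teacher(prompt, pseudo_x)  # teacher conditions on the prompt
        distill_step(pseudo_x, teacher_out)      # student sees only the pseudo-input
```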

Experimental Setup
In this section, we explain the experimental setups in detail. Experiments are performed with the T5-base (Raffel et al., 2020) model (220M parameters) unless noted otherwise.

Prompt-dependent tasks
In order to evaluate the effectiveness of Fixed Input Parameterization (FIP) methods, we select three prompt-dependent tasks: persona-dependent conversation, semantic parsing, and zero-shot learning with task instructions; all of these tasks require fixed prompts during inference. Fixed prompts come in the form of a persona in persona-dependent conversation, a database schema in semantic parsing, and a task instruction in zero-shot learning with task instructions. As described in the introduction and Section 3, applying FIP to these tasks offers apparent benefits in real-world scenarios. With these tasks, we not only evaluate the performance of the baseline FIP methods but also highlight the significance of FIP through comparison with the (unconstrained) previous approaches that concatenate prompts to the input.

Datasets
The following datasets for the prompt-dependent tasks mentioned in Section 5.1 are used to evaluate Fixed Input Parameterization (FIP).
PERSONA-CHAT / MSC PERSONA-CHAT (Zhang et al., 2018) is a crowd-sourced dataset intended for training agents to perform engaging and personal chit-chat by grounding dialogues on specific personas. For each dialogue, two speakers have a 6-8 turn conversation conditioned on a given persona. Based on PERSONA-CHAT, Multi Session Chat (MSC) (Xu et al., 2022) is a dialogue dataset of long-term conversations, each consisting of 5 continuing but distinct chat sessions. In this work, we consider both the persona and the dialogue history of the first two sessions as a prompt in MSC to incorporate long-term conversational context. Performance on both tasks is measured via perplexity (PPL). We randomly select 100 dialogues from the respective validation sets as the persona-dependent conversation benchmark for testing our method. The persona descriptions are 60 tokens long on average in PERSONA-CHAT, and the combined prompts average 811 tokens in MSC.
Spider Spider (Yu et al., 2018) is a large cross-domain semantic parsing and text-to-SQL dataset for developing natural language interfaces to cross-domain databases. Models must generalize to new database schemas as well as new queries to perform well on it. Evaluation metrics include Exact Matching (EM) and Execution Accuracy (EA). We utilize the development set, containing 20 databases with about 50 questions per database, as a semantic parsing benchmark for FIP. The database schemas range from 55 to 430 tokens in length.
WSC / RTE / COPA For zero-shot task generalization, Sanh et al. (2022) train the LM on a diverse set of tasks and evaluate on a held-out group of tasks to measure generalization performance. We choose coreference resolution, natural language inference, and sentence completion, three of their four held-out tasks, and test FIP on the WSC, RTE, and COPA datasets (Wang et al., 2019). We utilize the task instructions (prompts) provided by Sanh et al. (2022) and report the average task score across them. The task instructions comprise about 20-30 tokens.

Implementation Details
For the Continued Pre-training method (Section 4.1), we use the Adam optimizer (Kingma and Ba, 2015) with a constant learning rate of 1e-4 and a batch size of 8. We perform 5-20 steps of injection. For PING (Section 4.2), input generators are trained on each task for 1-2 epochs. We use KL-divergence for distilling the last layer's output of the decoder and perform 10-100 steps of injection. For T5-base, we use a single 16GB T4 GPU, and for the larger models we use 4 32GB V100 GPUs.
For injection and for comparison with the upper-bound (W/ PROMPT) and lower-bound (W/O PROMPT) performance, we first need two different versions of the LM adapted to the given task. For persona-dependent conversation and semantic parsing, the W/ PROMPT model is fine-tuned together with prompts, since prompts are explicitly used during inference, while the W/O PROMPT model is fine-tuned on the task without being given the prompt. We perform FIP on the W/O PROMPT model, since we assume no access to prompts during inference.
For zero-shot learning, we modify the prompts developed by Sanh et al. (2022) into the form of a fixed prompt. We replace the placeholders in their prompts with fixed words, then append the actual content to the prompt in a key-value format. For example, if the original is If {Premise} is true, is it also true that {Hypothesis}?, then the converted prompt is If "Premise" is true, is it also true that "Hypothesis"? Premise: {Premise} Hypothesis: {Hypothesis}. This ensures that the prefix is fixed and can therefore be injected with FIP. We use the T0-3B LM checkpoint for zero-shot generalization.
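
The conversion step can be sketched as follows (the function name is ours, and this is an illustrative simplification, not the paper's preprocessing code): each placeholder in the template becomes a fixed quoted field name, and the actual content follows in key-value form.

```python
def to_fixed_prompt(template, example):
    """Turn a templated instruction into a fixed prefix plus key-value content."""
    fixed = template
    for field in example:
        # Replace {Field} with the literal quoted field name, so the prefix
        # no longer depends on the example and stays identical at every inference.
        fixed = fixed.replace("{" + field + "}", '"' + field + '"')
    content = " ".join(f"{k}: {v}" for k, v in example.items())
    return fixed + " " + content
```

Only the fixed prefix is injected with FIP; the key-value content still varies per example and remains in the input.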

Experimental Results
In this section, we first explore the inference efficiency of models performing prompt-dependent tasks and show that Fixed Input Parameterization (FIP) leads to a meaningful gain in computational efficiency. Then the baseline and proposed methods are tested and compared on the datasets discussed in Section 5.2. The results indicate that the Pseudo-INput Generation (PING) method achieves the best performance among FIP methods, sometimes even outperforming the upper bound, which uses explicit prompts during inference. In Section 6.3, we provide a concrete instance of injecting a real persona description into a conversational model, demonstrating the feasibility of long prompt injection.

Inference Efficiency
The comparison of the inference efficiency of a model with FIP, a baseline model that naively concatenates the prompt to the input, Fusion-in-Decoder (FiD) (Izacard and Grave, 2021), and Linear Transformer (Katharopoulos et al., 2020) is shown in Table 1. We consider FiD as one of the options for processing long inputs because it encodes chunks of the input sequence separately, reducing the quadratic complexity to linear. Linear Transformer also reduces the complexity to linear by linearizing the attention mechanism. We measure FLOPs and forward propagation latency via the DeepSpeed FLOPs profiler using a single 16GB T4 GPU.

Table 1: Inference efficiency of different models that can be used for performing prompt-dependent inference. We depict how many times more efficient FIP is in comparison with the other approaches inside the parentheses. When there is an out-of-memory (OOM) error using the 16GB T4 GPU, we estimate the FLOPs in italics, assuming a linear correlation between prompt length and FLOPs.
As shown in Table 1, T5 W/ FIP is much more efficient than the other models, especially as we assume a longer prompt length. This is because the cost of FIP remains the same regardless of the prompt length, while the costs of the others increase at least linearly. Specifically, when the prompt length is 8 times the model's maximum input sequence length, one can achieve 80× computational efficiency in terms of FLOPs by applying FIP. Furthermore, in a scenario where the prompt length is 28× the model's maximum input sequence length (shown in Section 6.3, where we utilize a long persona that is over 13,000 tokens long), previous approaches run out of memory (OOM) on the 16GB T4 GPU, meaning it is impossible to utilize such long prompts. FIP is estimated to be 280× more efficient in terms of total FLOPs if the GPU RAM were hypothetically large enough.
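
As a back-of-the-envelope illustration of why the gap grows with prompt length, consider only the self-attention matmuls. This is a toy count under our own assumptions (attention term only, single layer); the ratios reported in the paper come from profiling full models, so the number below is not the paper's figure.

```python
def self_attention_flops(seq_len, d_model=768):
    # Toy count: the QK^T and attention-weighted-V matmuls each cost roughly
    # seq_len * seq_len * d_model multiply-accumulates per layer.
    return 2 * seq_len * seq_len * d_model

max_input = 512
with_prompt = self_attention_flops(max_input * 9)  # input plus an 8x-length prompt
with_fip = self_attention_flops(max_input)         # FIP: the prompt is already injected
speedup = with_prompt / with_fip                   # (9 * 512)^2 / 512^2 = 81
```

The attention cost with the prompt attached grows quadratically in the total length, while the FIP cost is a constant set by the input alone.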

Task Performance
We report the task performance obtained by applying different FIP methods on the three prompt-dependent tasks in Table 2.
We find that the performance of the different methods depends on the complexity of the input sequence structure. We believe that PING achieves good performance on PERSONA-CHAT, MSC, Spider, and WSC because those datasets have relatively simple input sequences, such as a short utterance or a simple query. On datasets with many components or multiple complex sentences (e.g., COPA and RTE), the low quality of the generated pseudo-inputs degrades the performance of PING. On the other hand, CP and CP W/ CURR perform better on datasets with complex structure. These findings encourage the community to explore a more integral FIP method that can cover different datasets.

Long Prompts Injection
To demonstrate the effectiveness of FIP for injecting long prompts into LMs, we show how the method works with a real-world example. We pick a Wikipedia page (Elon Musk), considering it as a long persona description, and inject the entire article (over 13,000 tokens) into an LM trained on PERSONA-CHAT. Here, we use T5-large as the base model and apply PING. Figure 4 shows an actual instance of interactions with the LM that underwent FIP through PING. The responses show the LM successfully reflecting the description of the person on the Wikipedia page without having the description appended to the input. Moreover, inference with FIP is 280× more computationally efficient in terms of FLOPs than the baseline, as shown in Section 6.1.

Conclusion
In this paper, we formally define the Fixed Input Parameterization (FIP) problem, which focuses on injecting the prompt into the parameters of an LM as an efficient alternative to attaching fixed prompts to the inputs for prompt-dependent tasks. Through experiments, we show that FIP is much more computationally efficient (up to 280 times) in terms of total FLOPs for handling long prompts compared to the previous alternatives. We further explore baseline methodologies for FIP and find that Pseudo-INput Generation (PING), a distillation-based approach, shows promising results in persona-dependent conversation, semantic parsing, and zero-shot learning with task instructions. Through these explorations, we show that FIP can be a promising direction for conditioning language models efficiently in scenarios with long and fixed prompts.
Limitations

While Fixed Input Parameterization (FIP) enables performing prompt-dependent tasks efficiently, there are limitations that need to be addressed in future work. In particular, the current FIP methods cause task performance degradation. Moreover, the computational cost of injecting prompts and the storage required for the parameters of every injected model have not been extensively considered. For example, when considering previous conversation history as the prompt to be injected in a long-term conversation setting, fast injection may also be a requirement for real-world application. Updating or adding a relatively small number of parameters (Hu et al., 2021; Wang et al., 2021) may be a potential avenue for addressing these problems.

Figure 1: Fixed Input Parameterization (FIP) example on a persona-dependent conversation. The left side presents an inference procedure of a previous approach, where the persona (prompt) is concatenated to every input. The right side describes FIP, where the persona is injected into the model in advance, so that the model is able to generate responses without constantly referring to the persona description. Thus, the FIP approach takes less time to generate responses than the previous method.

Figure 2: Illustration of Pseudo-INput Generation (PING). During Phase 1, an input generator is trained with the task-specific training data. The inputs are prompts of a task, and the outputs are task inputs corresponding to the prompt. Input and output examples applied to semantic parsing are shown. During Phase 2, the input generator generates pseudo-inputs from the given target prompt, which are used to distill knowledge from the teacher to the student. Blue square boxes indicate frozen parameters; yellow rounded boxes indicate unfrozen parameters.

Figure 3: FIP scores on PERSONA-CHAT as we scale the size of the LM. There is a consistent trend of improved injection performance across FIP methods as the LM scales.
Figure 4: A real-world example of Fixed Input Parameterization with a long prompt. (Top) The process of injecting a Wikipedia article describing a person (Elon Musk) into a model with FIP. The article is more than 13,000 tokens long. (Bottom) An actual conversation between the persona-injected model and a human (cherry-picked).

This sheds light on the possibility that FIP may be able to reach the upper bound performance. However, the results also show that there is still a gap between the performance of the FIP methods and the upper bound W/ PROMPT that needs to be bridged in future work.

Table 2: Fixed Input Parameterization performance on three prompt-dependent tasks. W/ PROMPT stands for the upper-bound (unconstrained) method, which uses the prompt during inference by appending it to the input. W/O PROMPT denotes the lower-bound method of not utilizing the prompts at all. Lastly, we show three W/ FIP methods: CP and CP W/ CURR stand for Continued Pre-training (the baseline) and Continued Pre-training with curriculum learning, respectively, as explained in Section 4.1; PING denotes our novel proposed method utilizing distillation.

Table A1: Fixed Input Parameterization performance in PERSONA-CHAT as model size increases. There is a consistent trend of improved injection performance across FIP methods as the model scales, and CP tends to increase more rapidly.