Prefix-Tuning: Optimizing Continuous Prompts for Generation

Fine-tuning is the de facto way of leveraging large pretrained language models for downstream tasks. However, fine-tuning modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, which we call the prefix. Prefix-tuning draws inspiration from prompting for language models, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”. We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We show that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics that are unseen during training.


Introduction
Fine-tuning is the prevalent paradigm for using large pretrained language models (LMs) (Radford et al., 2019; Devlin et al., 2019) to perform downstream tasks (e.g., summarization), but it requires updating and storing all the parameters of the LM. Consequently, to build and deploy NLP systems that rely on large pretrained LMs, one currently needs to store a modified copy of all the LM parameters for each task. This can be prohibitively expensive given the size of current LMs; for example, GPT-2 has 774M parameters (Radford et al., 2019) and GPT-3 has 175B parameters (Brown et al., 2020).
A natural approach to this problem is lightweight fine-tuning, which freezes most of the pretrained parameters and only tunes a smaller set of parameters. For example, adapter-tuning (Rebuffi et al., 2017; Houlsby et al., 2019) inserts additional task-specific layers between the layers of pretrained language models. Adapter-tuning has promising performance on natural language understanding and generation benchmarks, attaining comparable performance with fine-tuning while adding only around 2-4% task-specific parameters (Houlsby et al., 2019; Lin et al., 2020).

Figure 1: Fine-tuning (top) updates all LM parameters (the red Transformer box) and requires storing a full model copy for each task. We propose prefix-tuning (bottom), which freezes the LM parameters and only optimizes the prefix (the red prefix blocks). Consequently, we only need to store the prefix for each task, making prefix-tuning modular and space-efficient. Note that each vertical block denotes transformer activations at one time step.
At the limit, GPT-3 (Brown et al., 2020) can be deployed using in-context learning, which is a form of prompting, without modifying any LM parameters. In in-context learning, Brown et al. (2020) prepend a natural language task instruction (e.g., TL;DR for summarization) and a few examples to the task input, and then generate the task output from the LM. However, since Transformers can only condition on a bounded-length context (e.g., 2048 tokens for GPT-3), in-context learning is restricted to very small training sets.
In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation (NLG) tasks, inspired by prompting. Consider the task of generating a textual description of a data table, as shown in Figure 1, where the task input is a linearized table (e.g., "name: Starbucks | type: coffee shop") and the output is a textual description (e.g., "Starbucks serves coffee."). Prefix-tuning prepends a sequence of continuous task-specific vectors to the input, which we call a prefix, depicted by red blocks in Figure 1 (bottom). To generate each token, the LM can attend to the prefix as if it were a sequence of "virtual tokens", but unlike prompting, the prefix consists entirely of free parameters which do not correspond to real tokens. In contrast to fine-tuning in Figure 1 (top), which updates all LM parameters and thus requires storing a tuned copy of the model for each task, prefix-tuning only optimizes the prefix. Consequently, we only need to store one copy of the large LM and a learned task-specific prefix, yielding a very small overhead for each additional task (e.g., 250K parameters for table-to-text).
In contrast to full fine-tuning, prefix-tuning is also modular: we train an upstream prefix which steers an unmodified LM, and therefore a single LM can support many tasks at once. In the context of personalization, where the tasks correspond to users (Shokri and Shmatikov, 2015; McMahan et al., 2016), we would have a separate prefix for each user, trained only on that user's data, thereby avoiding data cross-contamination. Moreover, the prefix-based architecture even enables us to process examples from multiple users/tasks in a single batch, something that is not possible with other lightweight fine-tuning approaches like adapter-tuning.
We evaluate prefix-tuning on table-to-text generation using GPT-2 and abstractive summarization using BART. In terms of storage, prefix-tuning stores 1000x fewer parameters than full fine-tuning. In terms of performance when trained on full datasets, prefix-tuning and fine-tuning are comparable for table-to-text ( §6.1), while prefix-tuning suffers a small degradation for summarization ( §6.2). In low-data settings, prefix-tuning outperforms fine-tuning on both tasks ( §6.3). Prefix-tuning also extrapolates better to tables (for table-to-text) and articles (for summarization) with unseen topics ( §6.4).

Related Work
Fine-tuning for natural language generation. Current state-of-the-art systems for natural language generation (NLG) are based on fine-tuning pretrained LMs. For table-to-text generation, Kale (2020) fine-tunes a sequence-to-sequence model (T5; Raffel et al., 2020). For extractive and abstractive summarization, researchers fine-tune masked language models (e.g., BERT; Devlin et al., 2019) and encoder-decoder models (e.g., BART; Lewis et al., 2020), respectively (Zhong et al., 2020; Liu and Lapata, 2019; Raffel et al., 2020). For other conditional NLG tasks such as machine translation and dialogue generation, fine-tuning is also the prevalent paradigm (Zhang et al., 2020c; Stickland et al., 2020; Zhu et al., 2020; Liu et al., 2020). In this paper, we focus on table-to-text using GPT-2 and summarization using BART, but prefix-tuning can in principle be applied to other generation tasks and pretrained models, such as masked LMs.
Lightweight fine-tuning. Prefix-tuning falls under the broad class of lightweight fine-tuning methods, which freeze most of the pretrained parameters and only tune a smaller set of parameters. The key question is how to augment the LM architecture and decide which subset of pretrained parameters to tune. One line of research learns a task-specific parameter mask (Zhao et al., 2020;Radiya-Dixit and Wang, 2020). Another line of research inserts new modules with trainable parameters. For example, Zhang et al. (2020a) trains a "side" network that is fused with the pretrained model via summation; adapter-tuning inserts task-specific layers (adapters) between each layer of the pretrained LM (Houlsby et al., 2019;Lin et al., 2020;Rebuffi et al., 2017;Pfeiffer et al., 2020). Compared to this line of work, which tunes around 3.6% of the LM parameters, our method obtains a further 30x reduction in task-specific parameters, tuning only 0.1% while maintaining comparable performance on table-to-text tasks.
Prompting. Prompting is a way of leveraging a pretrained LM by prepending instructions and a few examples to the task input and generating the task output from the LM. For autoregressive LMs, the most successful form of prompting is GPT-3's in-context learning (Brown et al., 2020), which uses manually designed prompts to adapt its generation for different tasks in few-shot settings. For masked LMs like BERT and RoBERTa (Liu et al., 2019), prompt engineering has been explored for natural language understanding tasks (Jiang et al., 2020;Schick and Schütze, 2020). For example, AutoPrompt (Shin et al., 2020) searches for a sequence of discrete trigger words and concatenates it with each input to elicit sentiment or factual knowledge from BERT and RoBERTa. In contrast with AutoPrompt, our method optimizes continuous prefixes, which are more expressive ( §7.2); moreover, we focus on language generation tasks.
Continuous vectors have been used to steer LMs; for example, Subramani et al. (2020) showed that a pretrained LSTM language model can reconstruct arbitrary sentences by optimizing a continuous vector for each sentence, making the vector input-specific. In contrast, prefix-tuning optimizes a task-specific prefix that applies to all instances of that task. As a result, unlike the previous work, whose application is limited to sentence reconstruction, prefix-tuning can be applied to NLG tasks.
Controllable generation. Controllable generation aims to steer a pretrained language model to match a sentence-level attribute (e.g., positive sentiment or sports). Such control can happen at training time: Keskar et al. (2019) pretrains the language model (CTRL) to condition on metadata such as keywords or URLs. The control can also happen at decoding time, by weighted decoding (GeDi; Krause et al., 2020) or by iteratively updating the past activations (PPLM; Dathathri et al., 2020). However, there is no straightforward way to apply these controllable generation techniques to enforce fine-grained control over generated content, as demanded by tasks like table-to-text and summarization.
P*-tuning. Prefix-tuning is an instance of a new class of methods that has emerged, which we call p*-tuning (since the other prominent instances, p-tuning and prompt-tuning, also start with p), all based on the idea of optimizing a continuous prefix or prompt. Concurrent with our work, Qin and Eisner (2021) learn mixtures of soft fill-in-the-blank prompts to elicit knowledge from LMs such as BERT and BART. Hambardzumyan et al. (2021) learn task-specific embeddings that adapt BERT for sentiment classification. Both works show that tuning soft prompts outperforms previous work that optimizes over discrete prompts. P-tuning (Liu et al., 2021) shows that jointly updating the prompt embeddings and LM parameters improves GPT-2's performance on natural language understanding tasks, in both few-shot and full data settings. In followup work, prompt-tuning (Lester et al., 2021) simplifies our approach and applies it to T5 (Raffel et al., 2020), demonstrating that the performance gap between fine-tuning and p*-tuning vanishes as the model size grows.

Problem Statement
Consider a conditional generation task where the input x is a context and the output y is a sequence of tokens. We focus on two tasks, shown in Figure 2 (right): In table-to-text, x corresponds to a linearized data table and y is a textual description; in summarization, x is an article and y is a summary.
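For concreteness, the following minimal sketch (ours, not part of the paper's systems) shows one way to linearize a table record into the input string x; the exact delimiter format used for each dataset is an assumption here.

```python
# Minimal sketch of linearizing a table into the task input x,
# e.g. "name: Starbucks | type: coffee shop".
# The exact field/value delimiter format per dataset is an assumption.
def linearize_table(record: dict) -> str:
    return " | ".join(f"{field}: {value}" for field, value in record.items())

print(linearize_table({"name": "Starbucks", "type": "coffee shop"}))
# -> name: Starbucks | type: coffee shop
```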

Autoregressive LM
Assume we have an autoregressive neural language model p φ (y | x) parametrized by φ (e.g., GPT-2; Radford et al., 2019). As shown in Figure 2 (top), let z = [x; y] be the concatenation of x and y; let X idx denote the sequence of indices that corresponds to x, and Y idx denote the same for y.
The activation vector at time step i is h_i = [h_i^(1); ... ; h_i^(n)], a concatenation of the activations of all layers at this time step, where h_i^(j) is the activation vector of the j-th Transformer layer at time step i. An autoregressive neural LM computes h_i as a function of z_i and the past activations in its left context:

h_i = LM_φ(z_i, h_{<i})    (1)

The last layer of h_i is used to compute the distribution for the next token: p_φ(z_{i+1} | h_{≤i}) = softmax(W_φ h_i^(n)), where W_φ is a pretrained matrix that maps h_i^(n) to logits over the vocabulary.
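As an illustration (not the paper's code), the sketch below inspects the per-layer activations h_i^(j) and the next-token distribution of a pretrained GPT-2 using the HuggingFace transformers library.

```python
# Illustrative sketch: per-layer activations h_i^(j) and the next-token distribution
# for a pretrained GPT-2, via HuggingFace transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

z = tokenizer("name: Starbucks | type: coffee shop", return_tensors="pt")
with torch.no_grad():
    out = model(**z, output_hidden_states=True)

# out.hidden_states is a tuple with the embedding layer plus one tensor per Transformer
# layer, each of shape (batch, seq_len, hidden_dim); stacking them gives h_i per time step.
h = torch.stack(out.hidden_states, dim=2)   # (batch, seq_len, n_layers + 1, hidden_dim)

# The LM head applies W_phi to the last layer's activation to produce next-token logits.
next_token_probs = torch.softmax(out.logits[0, -1], dim=-1)
```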

Encoder-Decoder Architecture
We can also use an encoder-decoder architecture (e.g., BART; Lewis et al., 2020) to model p φ (y | x), where x is encoded by the bidirectional encoder and the decoder predicts y autoregressively (conditioned on the encoded x and its left context). We use the same indexing and activation notation, as shown in Figure 2 (bottom): each h i for i ∈ X idx is computed by the bidirectional encoder; each h i for i ∈ Y idx is computed by the autoregressive decoder using the same equation (1).

Fine-tuning
In the full fine-tuning framework, we initialize with the pretrained parameters φ. Here p_φ is a trainable language model distribution and we perform gradient updates on the following log-likelihood objective:

max_φ log p_φ(y | x) = Σ_{i ∈ Y_idx} log p_φ(z_i | h_{<i})    (2)
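A minimal sketch of this objective (ours, with hypothetical helper and variable names): the loss sums log p_φ(z_i | h_<i) only over the output indices Y_idx, which we implement by masking out the positions whose targets are tokens of x.

```python
# Minimal sketch of the objective in equation (2): sum log p_phi(z_i | h_<i) over Y_idx only.
# Variable names are hypothetical; any causal LM exposing .logits would work here.
import torch
import torch.nn.functional as F

def generation_loss(model, input_ids, x_len):
    """input_ids holds z = [x; y]; x_len is the number of tokens in x."""
    logits = model(input_ids).logits[:, :-1]        # position i predicts token i+1
    targets = input_ids[:, 1:].clone()
    targets[:, : x_len - 1] = -100                  # ignore positions whose target is in x
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100
    )
```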

Prefix-Tuning
We propose prefix-tuning as an alternative to full fine-tuning for conditional generation tasks. We first provide intuition in §4.1 before defining our method formally in §4.2.

Intuition
Prompting has demonstrated that conditioning on a proper context can steer the LM without changing its parameters. For example, if we want the LM to generate a word (e.g., Obama), we can prepend its common collocations as context (e.g., Barack), and the LM will assign a much higher probability to the desired word. Extending this intuition beyond generating a single word or sentence, we want to find a context that steers the LM to solve an NLG task. Intuitively, the context could influence the encoding of the task input x by guiding what to extract from x, and it could influence the generation of the task output y by steering the next-token distribution.
However, it is non-obvious whether such a context exists. Using natural language task instructions (e.g., "summarize the following table in one sentence") as the context might guide a human to solve the task, but this fails for moderately-sized pretrained LMs. Optimizing over discrete instructions might help, but discrete optimization is computationally challenging.
Instead of optimizing over discrete tokens, we can optimize the instruction as continuous word embeddings, whose effects will be propagated upward to all Transformer activation layers and rightward to subsequent tokens. This is strictly more expressive than a discrete prompt, which is constrained to the embeddings of real words. Prefix-tuning goes one step further in increasing expressivity by optimizing the activations of all the layers, not just the embedding layer. As another benefit, prefix-tuning can directly modify representations deeper in the network, thereby avoiding long computation paths across the depth of the network.

Method
Prefix-tuning prepends a prefix for an autoregressive LM to obtain z = [PREFIX; x; y], or prepends prefixes for both the encoder and the decoder to obtain z = [PREFIX; x; PREFIX′; y], as shown in Figure 2. Here, P idx denotes the sequence of prefix indices, and we use |P idx | to denote the length of the prefix.
We follow the recurrence relation in equation (1), except that the activations of the prefix indices are free parameters, given by a matrix P θ (parametrized by θ) of dimension |P idx | × dim(h i ).
The training objective is the same as equation (2), but the set of trainable parameters changes: the language model parameters φ are fixed and the prefix parameters θ are the only trainable parameters.
Here, each h i is a function of the trainable P θ . When i ∈ P idx , this is clear because h i copies directly from P θ . When i ∉ P idx , h i still depends on P θ , because the prefix activations are always in the left context and will therefore affect any activations to the right.
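The paper treats the prefix abstractly as free activation vectors prepended at every layer. As a concrete (and hedged) illustration, one practical way to realize "activations in the left context" for a frozen GPT-2 is to feed per-layer key/value pairs through HuggingFace's past_key_values interface; the sketch below assumes that realization and is not necessarily the authors' exact implementation.

```python
# Sketch: prepend |P_idx| "virtual token" activations to every layer of a frozen GPT-2
# by supplying them as past key/value pairs. This is one way to realize equation (1)
# with free prefix activations; it is an assumption, not the paper's exact code.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False                       # the LM parameters phi stay frozen

cfg = model.config
prefix_len, n_layers, n_heads = 10, cfg.n_layer, cfg.n_head
head_dim = cfg.n_embd // n_heads

# Free parameters P_theta: a key and a value vector per layer and per prefix position.
P_theta = torch.nn.Parameter(
    0.02 * torch.randn(n_layers, 2, n_heads, prefix_len, head_dim)
)

def prefix_past(batch_size):
    """Expand P_theta into the per-layer (key, value) tuples GPT-2 expects."""
    past = []
    for layer in P_theta:                         # (2, n_heads, prefix_len, head_dim)
        k = layer[0].unsqueeze(0).expand(batch_size, -1, -1, -1)
        v = layer[1].unsqueeze(0).expand(batch_size, -1, -1, -1)
        past.append((k, v))
    return tuple(past)

input_ids = torch.tensor([[50256]])               # a tokenized [x; y] would go here
out = model(input_ids, past_key_values=prefix_past(input_ids.size(0)))
```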

Parametrization of P θ
Empirically, directly updating the P θ parameters leads to unstable optimization and a slight drop in performance. So we reparametrize the matrix P θ [i, :] = MLP θ (P′ θ [i, :]) by a smaller matrix (P′ θ ) composed with a large feedforward neural network (MLP θ ). Now the trainable parameters include P′ θ and the parameters of MLP θ . Note that P θ and P′ θ have the same number of rows (i.e., the prefix length), but different numbers of columns. Once training is complete, these reparametrization parameters can be dropped, and only the prefix (P θ ) needs to be saved.
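A sketch of this reparametrization (ours): a small embedding row P′_θ[i, :] per prefix position is mapped by a feedforward network up to the full dimension of the concatenated activations. The layer sizes below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the reparametrization P_theta[i, :] = MLP_theta(P'_theta[i, :]).
# P'_theta has one (small) row per prefix position; the MLP maps each row up to dim(h_i),
# here taken to be the concatenated key/value size across all layers. Sizes are assumptions.
import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    def __init__(self, prefix_len=10, small_dim=512, mid_dim=512,
                 n_layers=12, n_heads=12, head_dim=64):
        super().__init__()
        full_dim = n_layers * 2 * n_heads * head_dim        # keys and values for every layer
        self.prefix_ids = torch.arange(prefix_len)
        self.P_prime = nn.Embedding(prefix_len, small_dim)  # the smaller matrix P'_theta
        self.mlp = nn.Sequential(                           # MLP_theta
            nn.Linear(small_dim, mid_dim), nn.Tanh(), nn.Linear(mid_dim, full_dim)
        )

    def forward(self):
        return self.mlp(self.P_prime(self.prefix_ids))      # P_theta: (prefix_len, full_dim)

# After training, only P_theta = encoder() needs to be saved; P'_theta and MLP_theta
# can be discarded.
encoder = PrefixEncoder()
P_theta = encoder()
```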

Datasets and Metrics
We evaluate on three standard neural generation datasets for the table-to-text task: E2E (Novikova et al., 2017), WebNLG (Gardent et al., 2017), and DART (Radev et al., 2020), as shown in Table 1. The datasets are ordered by increasing complexity and size. E2E only has 1 domain (i.e. restaurant reviews); WebNLG has 14 domains, and DART is open-domain, using open-domain tables from Wikipedia. For evaluation, we report the metrics using the official evaluation scripts (see details in Appendix A.1).
For the summarization task, we use the XSUM (Narayan et al., 2018) dataset, which is an abstractive summarization dataset on news articles. We report ROUGE-1, ROUGE-2 and ROUGE-L.

Methods
For table-to-text, we compare prefix-tuning with three other methods: full fine-tuning (FT-FULL), fine-tuning only the top 2 layers (FT-TOP2), and adapter-tuning (ADAPTER). We also report the current state-of-the-art results on these datasets: On E2E, Shen et al. (2019) uses a pragmatically informed model without pretraining. On WebNLG, Kale (2020) fine-tunes T5-large. On DART, no official models trained on this dataset version are released. For summarization, we compare against fine-tuning BART (Lewis et al., 2020).

Architectures and Hyperparameters
At training time, we use the AdamW optimizer (Loshchilov and Hutter, 2019) and a linear learning rate scheduler, as suggested by the Hugging Face default setup. The hyperparameters we tune include the number of epochs, batch size, learning rate, and prefix length; details are in the appendix. The default setting is 10 epochs, batch size 5, learning rate 5e-5, and prefix length 10. The table-to-text models are trained on TITAN Xp or GeForce GTX TITAN X machines. Prefix-tuning takes 0.2 hours per epoch to train on 22K examples, whereas fine-tuning takes around 0.3 hours per epoch. The summarization models are trained on Tesla V100 machines, taking 1.25 hours per epoch on the XSUM dataset. In terms of time efficiency, prefix-tuning is around 30% faster than fine-tuning. In terms of GPU memory efficiency, prefix-tuning with batch size 1 takes 18% of the total GPU memory, whereas fine-tuning takes 50%.
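A sketch of this training setup (ours): AdamW over only the prefix parameters with a linear learning-rate schedule, using the stated default learning rate; the helper name is hypothetical.

```python
# Sketch of the described setup: AdamW with a linear learning-rate schedule
# (default learning rate 5e-5). Only the prefix parameters are optimized;
# the LM parameters stay frozen. Helper name is hypothetical.
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer(prefix_params, num_training_steps, lr=5e-5, warmup_steps=0):
    optimizer = torch.optim.AdamW(prefix_params, lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=num_training_steps
    )
    return optimizer, scheduler
```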
At decoding time, for table-to-text, we use beam search with beam size 5. For summarization, we use beam size 6 and length normalization 0.8. Decoding takes 1.2 seconds per sentence (without batching) for table-to-text, and 2.6 seconds per batch (using a batch size of 10) for summarization.
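For reference, a hedged sketch of such decoding settings with HuggingFace's generate(); treating length_penalty as the length-normalization knob, and the bart-large checkpoint name, are assumptions made for illustration.

```python
# Sketch of beam-search decoding for summarization: beam size 6, length penalty 0.8
# (table-to-text would analogously use num_beams=5). Checkpoint choice and the use of
# length_penalty for length normalization are illustrative assumptions.
from transformers import BartForConditionalGeneration, BartTokenizer

bart_tok = BartTokenizer.from_pretrained("facebook/bart-large")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

article = bart_tok("The Eagle is a cheap coffee shop near Burger King.", return_tensors="pt")
summary_ids = bart.generate(article.input_ids, num_beams=6, length_penalty=0.8, max_length=60)
print(bart_tok.decode(summary_ids[0], skip_special_tokens=True))
```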

Main Results
Table-to-text Generation
We find that by updating only 0.1% task-specific parameters, prefix-tuning is effective in table-to-text generation: it outperforms the other lightweight baselines (ADAPTER and FT-TOP2) even while updating 30x fewer parameters, and it achieves performance comparable to (full) fine-tuning. This trend holds for all three datasets: E2E, WebNLG, and DART.
If we match the number of parameters for prefix-tuning and adapter-tuning to 0.1%, Table 2 shows that prefix-tuning is significantly better than ADAPTER (0.1%), attaining a 4.1 BLEU improvement per dataset on average. Even compared with fine-tuning (100%) and adapter-tuning (3.0%), which update significantly more parameters, prefix-tuning achieves results comparable to or better than those two systems. This demonstrates that prefix-tuning is more Pareto-efficient than adapter-tuning, significantly reducing parameters while improving generation quality.
Additionally, attaining good performance on DART suggests that prefix-tuning can generalize to tables with diverse domains and a large number of relations. We will delve deeper into extrapolation performance (i.e., generalization to unseen categories or topics) in §6.4.
In summary, prefix-tuning is an effective and space-efficient method for adapting GPT-2 to table-to-text generation. It also maintains its performance gains when scaling up to GPT-2 Large, suggesting it has the potential to scale to even larger models with a similar architecture, like GPT-3.

Summarization
As shown in Table 3, with 2% of the parameters, prefix-tuning obtains slightly lower performance than fine-tuning (36.05 vs. 37.25 in ROUGE-L). With only 0.1% of the parameters, prefix-tuning underperforms full fine-tuning (35.05 vs. 37.25). There are several differences between XSUM and the three table-to-text datasets which could account for why prefix-tuning has a comparative advantage in table-to-text: (1) XSUM contains 4x more examples than the three table-to-text datasets on average; (2) the input articles are 17x longer than the linearized table inputs on average; (3) summarization is more complex than table-to-text because it requires selecting key content from an article.

Table 3: Performance of methods on the XSUM summarization dataset. Prefix-tuning slightly underperforms fine-tuning in the full-data regime.

Low-data Setting
Based on the results from table-to-text ( §6.1) and summarization ( §6.2), we observe that prefix-tuning has a comparative advantage when the number of training examples is smaller. To explore the low-data setting more systematically, we subsample the full dataset (E2E for table-to-text and XSUM for summarization) to obtain small datasets of size {50, 100, 200, 500}. For each size, we sample 5 different datasets and average over 2 training random seeds; thus, we average over 10 models for each low-data setting. Figure 3 (right) shows that prefix-tuning outperforms fine-tuning in low-data regimes by 2.9 BLEU on average, in addition to requiring many fewer parameters, but the gap narrows as the dataset size increases.
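A sketch of this subsampling protocol (ours, with a hypothetical train_and_evaluate helper standing in for the full pipeline):

```python
# Sketch of the low-data protocol: for each size in {50, 100, 200, 500}, sample 5 different
# subsets and train with 2 random seeds each, averaging over the resulting 10 models.
# train_and_evaluate is a hypothetical helper standing in for the full training pipeline.
import random
from statistics import mean

def low_data_results(full_dataset, train_and_evaluate, sizes=(50, 100, 200, 500)):
    results = {}
    for size in sizes:
        scores = []
        for subset_id in range(5):
            subset = random.Random(subset_id).sample(full_dataset, size)
            for seed in range(2):
                scores.append(train_and_evaluate(subset, seed=seed))
        results[size] = mean(scores)      # average over 5 subsets x 2 seeds = 10 runs
    return results
```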
Qualitatively, Figure 3 (left) shows 8 examples generated by both prefix-tuning and fine-tuning models trained on different data levels. While both methods tend to undergenerate (i.e., miss table contents) in low-data regimes, prefix-tuning tends to be more faithful than fine-tuning. For example, fine-tuning (100, 200) falsely claims a low customer rating while the true rating is average, whereas prefix-tuning (100, 200) generates a description that is faithful to the table.

Extrapolation
We now investigate extrapolation performance to unseen topics for both table-to-text and summarization. In order to construct an extrapolation setting, we split the existing datasets so that training and test cover different topics. For table-to-text, we use WebNLG, whose test set contains table topics (DBpedia categories) that are unseen during training (see §A.1). For summarization, we construct two data splits of XSUM: news-to-sports and within-news. In news-to-sports, we train on news articles and test on sports articles. In within-news, we train on {world, UK, business} news and test on the remaining news categories (e.g., health, tech). On both table-to-text and summarization, prefix-tuning extrapolates better than fine-tuning under all metrics, as shown in Table 4 and the 'U' (unseen) columns of Table 2 (middle).
We also find that adapter-tuning achieves good extrapolation performance, comparable with prefix-tuning, as shown in Table 2. This shared trend suggests that preserving LM parameters indeed has a positive impact on extrapolation. However, how prefix-tuning improves extrapolation is an open question and we will discuss this further in §8.

Intrinsic Evaluation
We compare different variants of prefix-tuning to study the impact of various design decisions. §7.1 studies the impact of the prefix length. §7.2 studies tuning only the embedding layer, which is more akin to tuning a discrete prompt. §7.3 compares prefixing and infixing, which inserts trainable activations between x and y. §7.4 studies the impact of various prefix initialization strategies. §7.5 further studies the data efficiency of prefix-tuning.

Prefix Length
A longer prefix means more trainable parameters, and therefore more expressive power. Figure 4 shows that performance increases as the prefix length increases up to a threshold (200 for summarization, 10 for table-to-text), after which a slight performance drop occurs. Prefixes longer than the threshold lead to lower training loss but slightly worse test performance, suggesting that they tend to overfit the training data.

Full vs Embedding-only
Recall in §4.1, we discussed optimizing the continuous embeddings of the "virtual tokens." We instantiate that idea and call it embedding-only. The word embeddings are free parameters, and the remaining activation layers are computed by the Transformer.
Table 5 (top) shows that the performance drops significantly, suggesting that tuning only the embedding layer is not sufficiently expressive. Embedding-only tuning upper-bounds the performance of discrete prompt optimization (Shin et al., 2020), because discrete prompting restricts the embedding layer to exactly match the embedding of a real word. Consequently, we have this chain of increasing expressive power: discrete prompting < embedding-only < prefix-tuning.

Prefix-tuning vs Infix-tuning
We also investigate how the position of the trainable activations in the sequence affects performance. In prefix-tuning, we place them at the beginning: [PREFIX; x; y]. We can also place the trainable activations between x and y (i.e., [x; INFIX; y]) and call this infix-tuning. Table 5 (bottom) shows that infix-tuning slightly underperforms prefix-tuning. We believe this is because prefix-tuning can affect the activations of both x and y, whereas infix-tuning can only influence the activations of y.

Initialization
We find that how the prefix is initialized has a large impact in low-data settings. Random initialization leads to low performance with high variance. Initializing the prefix with activations of real words significantly improves generation, as shown in Figure 5. In particular, initializing with task-relevant words such as "summarization" and "table-to-text" obtains slightly better performance than task-irrelevant words such as "elephant" and "divide", but using real words is still better than random initialization. Moreover, in full data settings, the initialization trick has no impact, and random initialization leads to equally good performance. Since we initialize the prefix with activations of real words computed by the LM, this initialization strategy is concordant with prefix-tuning's philosophy, which preserves the pretrained LM as much as possible.
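A sketch of this initialization strategy (ours): run the frozen LM on the chosen string, record its per-layer key/value activations, and use them as the starting value of the prefix parameters. This assumes the past_key_values realization of the prefix sketched in §4 and is not the paper's exact code.

```python
# Sketch: initialize the prefix from the activations of real words. Run the frozen LM on a
# chosen string, collect its per-layer key/value activations, and copy them into P_theta.
# Assumes the past_key_values realization sketched earlier; not the paper's exact code.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

init_text = "table-to-text:"                      # prefix length = number of BPE tokens here
ids = tokenizer(init_text, return_tensors="pt").input_ids
with torch.no_grad():
    past = model(ids, use_cache=True).past_key_values

layers = [past[i] for i in range(model.config.n_layer)]   # per-layer (key, value) pairs
P_theta = torch.nn.Parameter(
    torch.stack([torch.stack([k[0], v[0]]) for k, v in layers])
)   # shape: (n_layers, 2, n_heads, prefix_len, head_dim)
```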

Data Efficiency
We also investigate the data efficiency of prefix-tuning (without the initialization trick, i.e., with random initialization) and full fine-tuning by comparing their performance on 5 different data scales of the E2E task (10%, 20%, 40%, 60%, and 80%). Figure 6 shows that prefix-tuning has better performance than fine-tuning when using more than 20% of the data. At the 10% data scale, prefix-tuning with random initialization yields comparable or slightly lower performance than full fine-tuning, necessitating the initialization trick ( §6.3) to improve the performance in this low-data regime.

Figure 6: Data efficiency curves: percentage of training set vs. performance on table-to-text (E2E). Prefix-tuning (without the initialization trick) is more data-efficient than fine-tuning when using more than 20% of the data.

Discussion
We will discuss several favorable properties of prefix-tuning and some open problems.
Personalization. As we note in §1, prefix-tuning is advantageous when there are a large number of tasks that need to be trained independently. One practical setting is user privacy (Shokri and Shmatikov, 2015; McMahan et al., 2016). In order to preserve user privacy, each user's data needs to be kept separate and a personalized model needs to be trained independently for each user. Consequently, each user can be regarded as an independent task. If there are millions of users, prefix-tuning can scale to this setting and maintain modularity, enabling flexible addition or deletion of users by adding or deleting their prefixes, without cross-contamination.
Batching across users. Under the same personalization setting, prefix-tuning allows batching different users' queries even though they are backed by different prefixes. When multiple users query a cloud GPU device with their inputs, it is computationally efficient to put these users in the same batch. Prefix-tuning keeps the shared LM intact; consequently, batching requires a simple step of prepending the personalized prefix to user input, and all the remaining computation is unchanged. In contrast, we can't batch across different users in adapter-tuning, which has personalized adapters between shared Transformer layers. This batching benefit could also help create efficient ensembles of multiple prefixes trained on the same task (Lester et al., 2021).
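A sketch of batching across users (ours, with hypothetical names): because the LM is shared, we stack each user's prefix along the batch dimension and run a single forward pass; this continues the past_key_values realization sketched in §4.

```python
# Sketch: batch queries from different users behind one shared LM by stacking their
# personalized prefixes along the batch dimension. user_prefixes maps user id -> tensor of
# shape (n_layers, 2, n_heads, prefix_len, head_dim); names are hypothetical.
import torch

def batched_past(user_prefixes, user_ids):
    stacked = torch.stack([user_prefixes[u] for u in user_ids])   # (batch, n_layers, 2, ...)
    past = []
    for layer in range(stacked.size(1)):
        k = stacked[:, layer, 0]          # (batch, n_heads, prefix_len, head_dim)
        v = stacked[:, layer, 1]
        past.append((k, v))
    return tuple(past)

# outputs = shared_model(input_ids, past_key_values=batched_past(user_prefixes, batch_users))
```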
Inductive bias of prefix-tuning. Recall that fine-tuning updates all pretrained parameters, whereas prefix-tuning and adapter-tuning preserve them.
Since the language models are pretrained on general-purpose corpora, preserving the LM parameters might help generalization to domains unseen during training. In concordance with this intuition, we observe that both prefix-tuning and adapter-tuning achieve significant performance gains in extrapolation settings ( §6.4); however, how these methods improve extrapolation is an open question.
While prefix-tuning and adapter-tuning both freeze the pretrained parameters, they tune different sets of parameters to affect the activation layers of the Transformer. Recall that prefix-tuning keeps the LM intact and uses the prefix and the pretrained attention blocks to affect the subsequent activations, whereas adapter-tuning inserts trainable modules between LM layers, which directly add residual vectors to the activations. Moreover, we observe that prefix-tuning requires vastly fewer parameters than adapter-tuning while maintaining comparable performance. We think this gain in parameter efficiency is because prefix-tuning keeps the pretrained LM intact as much as possible, and therefore exploits the LM more than adapter-tuning.
Recent work by Aghajanyan et al. (2020) uses intrinsic dimension to show that there exists a low-dimensional reparameterization that is as effective for fine-tuning as the full parametrization. This explains why good accuracy on downstream tasks can be obtained by updating only a small number of parameters. Our work echoes this finding by showing that good generation performance can also be attained by updating a very small prefix. However, prefix-tuning is not just about the size of the trainable parameters, but more importantly, about which subset of parameters to modify. Therefore, it would be interesting future work to explore other lightweight fine-tuning methods that achieve an even better accuracy-size tradeoff.

A.1 Datasets and Metrics
The WebNLG (Gardent et al., 2017) dataset consists of 22K examples, and the input x is a sequence of (subject, property, object) triples. The average output length is 22.5. In the training and validation splits, the input describes entities from 9 distinct DBpedia categories (e.g., Monument). The test split consists of two parts: the first half contains DB categories seen in training data, and the second half contains 5 unseen categories. These unseen categories are used to evaluate extrapolation. We use the official evaluation script, which reports BLEU, METEOR and TER (Snover et al., 2006).
For the summarization task, we use the XSUM (Narayan et al., 2018) dataset, which is an abstractive summarization dataset on news articles. There are 225K examples. The average length of the articles is 431 words and the average length of the summaries is 23.3. We report ROUGE-1, ROUGE-2 and ROUGE-L, computed by the python package rouge-score.
Extrapolation data splits. We construct two extrapolation data splits, news-to-sports and within-news, from the original XSUM dataset. The XSUM dataset is drawn from BBC news, and we identify the topic of each article based on its URL. Since "news" and "sports" are the two domains with the most articles, we use them to create our first train/test split. Additionally, "news" has subdomains such as "UK", "world", and "technology". Consequently, we create a second data split, using the top 3 news subdomains (i.e., {world, UK, business}) as training data and the rest as test data.

A.2 Hyperparameters
In Table 6, we report the hyperparameters used to train the best-performing models documented in the experiment section.
As for the search range of each hyperparameter: the learning rate is selected from {1e-05, 5e-05, 8e-05}; the number of epochs is selected from {5, 10} for table-to-text and {5, 25, 30} for summarization. We select the largest batch size that can fit into GPU memory and did not explicitly tune for an optimal batch size. The prefix length is selected from {1, 5, 10, 20, 40} for table-to-text and {1, 10, 20, 50, 80, 100, 200, 300} for summarization. We use perplexity and automatic generation metrics on the validation set to select the best-performing models.
For table-to-text in the low-data settings, we use a learning rate of 5e-5 and a batch size of 10. We use a prefix length of 6, since we apply the initialization trick and initialize the prefix with "table-to-text:", which contains 6 BPE tokens. Instead of tuning the number of epochs, we tune the maximum number of update steps in {100, 200, 400, 600}, as shown in Table 8. We apply early stopping based on performance on the validation set, whose size is 30% of the training set size.
For summarization in the low-data settings, we use a learning rate of 5e-5 and 100 warmup steps. We use a batch size of 5 for prefix-tuning and 6 for fine-tuning. We apply the initialization trick and use the word "summarize" to initialize the prefix, resulting in a prefix length of 1. We tune the number of epochs in {3, 5, 10, 20, 30}, as shown in Table 8. We also apply early stopping based on validation performance.
For the extrapolation setting, the hyperparameters for our table-to-text models are the same as those for WebNLG. The hyperparameters for summarization are shown in the last block of Table 6.

A.3 Validation Performance
Table 9 shows the validation performance on the three table-to-text datasets.

A.4 Additional Results for Low-data Settings
Figure 7 supplements the low-data performance curves in Figure 3 by plotting the relationship between training size and generation metrics for both prefix-tuning and fine-tuning.
Figure 7: Prefix-tuning (orange) outperforms fine-tuning (blue) in low-data regimes in addition to requiring many fewer parameters. The top three plots correspond to summarization, measured by ROUGE-1, ROUGE-2, and ROUGE-L. The bottom three plots correspond to table-to-text, measured by NIST, METEOR, and CIDEr. The x-axis is the training size and the y-axis is the evaluation metric (higher is better).

A.5 Additional Results for the Initialization Experiment
Figure 8 supplements Figure 5 by plotting additional metrics for our initialization technique ( §7.4). It validates that random initialization (from a uniform (0,1) distribution) significantly underperforms initializing with real words; additionally, initializing with task-relevant words (e.g., "summarization" and "table-to-text") attains slightly better generation scores than initializing with task-irrelevant words (e.g., "elephant" and "banana").
Figure 8: Initializing the prefix with activations of real words significantly outperforms random initialization, in a low-data setting with 100 training examples.

A.6 Qualitative Examples for Extrapolation