Decoder Tuning: Efficient Language Understanding as Decoding

With the ever-growing sizes of pre-trained models (PTMs), it has become an emerging practice to provide only inference APIs to users, namely the model-as-a-service (MaaS) setting. To adapt PTMs with model parameters frozen, most current approaches focus on the input side, seeking powerful prompts to stimulate models for correct answers. However, we argue that input-side adaptation could be arduous due to the lack of gradient signals, and such methods usually require thousands of API queries, resulting in high computation and time costs. In light of this, we present Decoder Tuning (DecT), which instead optimizes task-specific decoder networks on the output side. Specifically, DecT first extracts prompt-stimulated output scores for initial predictions. On top of that, we train an additional decoder network on the output representations to incorporate posterior data knowledge. By gradient-based optimization, DecT can be trained within several seconds and requires only one PTM query per sample. Empirically, we conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a 200x speed-up. Our code is available at https://github.com/thunlp/DecT.


Introduction
Recent advances in pre-trained models (PTMs) demonstrate the power of the "pre-training-fine-tuning" paradigm, which empowers broad downstream NLP tasks with a single backbone model (Devlin et al., 2019; Raffel et al., 2020; Radford et al., 2019). Given the million- or even billion-scale models, model-as-a-service (MaaS) has become an emerging practice in deploying massive PTMs, where users can only get access to model inference APIs (Brown et al., 2020). Under such a scenario, PTMs' parameters are frozen, and users cannot fine-tune the model on downstream tasks for adaptation. To find an alternative way, researchers have studied MaaS PTM adaptation methods extensively. Most existing approaches in this line are based on prompts, which modify inputs with specific patterns. By wrapping inputs into cloze-style questions or prepending inputs with a few demonstrative examples, PTMs can produce the right outputs directly and show strong "in-context" learning abilities (Petroni et al., 2019; Brown et al., 2020) without any parameter update. Besides heuristic prompt design, some recent works try to optimize the input prompts without gradients. Among them, Black-box Tuning (BBT) and BBTv2 (Sun et al., 2022a) apply an evolutionary algorithm (Hansen and Ostermeier, 2001) to continuous prompt tokens, while RLPrompt (Deng et al., 2022) adopts reinforcement learning to find discrete prompt tokens. Nevertheless, gradient-free optimization is rather difficult, and these input-side methods need to query the PTMs thousands of times for optimization, which leads to huge inference costs in terms of time and computation resources. Moreover, their final performance is not satisfying either.
Given the flaws of input-side adaptation, we turn to output-side adaptation, which builds tunable decoder networks on model outputs. Comparatively, output-side adaptation enjoys two major advantages: (1) We can directly tune decoder networks on top of model outputs with back-propagation rather than arduous alternatives. (2) We can reduce thousands of model queries to only once per sample. However, designing decoder networks is not straightforward. Past studies have shown that merely tuning an MLP or LSTM (Hochreiter and Schmidhuber, 1997) over output features cannot provide satisfying results (Sun et al., 2022a,b), leaving this path underexplored.
In this work, we aim to solve the performance issue for output-side adaptation, and we argue that there are two critical reasons behind it: (1) Simply utilizing PTMs as feature extractors ignores the infilling ability of PTMs, which is a strong prior for adaptation. (2) MLP and LSTM are not proper networks especially when training data is not sufficient.
Based on these findings, we present Decoder Tuning (DecT), an enhanced output-side adaptation method. Specifically, DecT has two crucial design choices to address the above issues. First, DecT queries the PTM with prompts and adopts model output scores as the initial predictions, which takes advantage of internal model knowledge. Second, on top of the output representations, we select a Prototypical Network (ProtoNet) (Snell et al., 2017) as the decoder network and train it to fit the training data, which is more suitable for few-shot learning. In this way, DecT modifies the initial model scores with subsequent training data, thus achieving better performance.
Through few-shot learning experiments on ten language understanding datasets, we highlight three advantages of DecT (see Figure 1). (1) DecT achieves a 5% absolute accuracy improvement on average, greatly outperforming previous works. (2) DecT is highly efficient. Compared with major prompt engineering baselines, DecT dramatically reduces the average adaptation time from over 16,000 seconds (BBTv2) to 3 seconds. (3) DecT only requires one PTM query for each example, while other methods need about 10^4 calls. This advantage is vital when PTM calls are not free. In addition, we also conduct extensive ablation studies and validate the impact of each component of DecT.

Preliminaries
Given a set of training data and a PTM M, we need to predict the label y ∈ {1, . . . , K} for each sample x, where K is the number of classes. We assume that each class has the same number n of training samples.
In the MaaS setting, M is a black-box inference API with fixed parameters. Therefore, we can only query the model with an input x and get the corresponding outputs. To better utilize PTMs, it has become a common practice to wrap input samples into prompts. Specifically, we enclose each input x into a template T with a [MASK] token (here we assume a masked language model). Then, we query M with T(x) and get the final-layer hidden state h at the [MASK] position and the scores s = S_M(x) ∈ R^K over a set of label words V. Take sentiment analysis as an example: we can use a cloze-style template with V = {bad, great} as label words for negative and positive sentiment respectively. The output scores on these label words then correspond to the classes.
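As a minimal sketch of this querying step (the template text, label words, and the mocked logits below are illustrative stand-ins, not the exact ones used in the paper; in the MaaS setting the logits and hidden state would come from one API call):

```python
import numpy as np

# Illustrative label words for binary sentiment analysis: V = {bad, great}.
LABEL_WORDS = ["bad", "great"]  # index 0: negative, index 1: positive

def wrap_template(x: str) -> str:
    """Enclose input x into a cloze-style template T with a [MASK] token."""
    return f"{x} It was [MASK]."

def label_word_scores(mask_logits: dict, label_words=LABEL_WORDS) -> np.ndarray:
    """Project the masked-LM logits at the [MASK] position onto the K label
    words, giving s = S_M(x) in R^K."""
    return np.array([mask_logits[w] for w in label_words])

prompt = wrap_template("A moving and heartfelt film.")
# Mocked vocabulary logits at the [MASK] position, for illustration only.
mock_logits = {"bad": -1.2, "great": 2.3, "okay": 0.1}
s = label_word_scores(mock_logits)  # scores over the K = 2 classes
```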

Methodology
In this section, we elaborate on our proposed Decoder Tuning (DecT) method for the classification task. We start with reviewing current input-side adaptation methods, then give an overview of DecT and finally detail it step-by-step.

Input-side Adaptation
Previous MaaS adaptation methods seek optimal prompts that stimulate PTMs to output correct answers. Without loss of generality, we formulate these methods with a transformation function f(·) which pre-processes the input x. f(·) can be specialized by adding demonstrations (Brown et al., 2020), discrete prompt tokens (Deng et al., 2022) or soft ones (Sun et al., 2022a,b). Denoting the final score as q(x) and the probability as P(y|x) = Softmax(q(x)), these methods define q(x) = S_M(f(x)) and optimize f(·) for correct predictions. Although optimizing f(·) without model gradients is possible, we argue that it is highly burdensome. Forwarding through a large "black-box" model M, it is rather challenging to find the inputs corresponding to specific outputs without the guidance of gradient signals. As a result, users may get suboptimal performance at expensive query costs. We validate this empirically in our experiments.
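For concreteness, one instance of such a transformation f(·) is prepending demonstrations, as in in-context learning. The sketch below uses a mock stand-in for the black-box scorer S_M, since the real scorer is an opaque API:

```python
import numpy as np

def softmax(q: np.ndarray) -> np.ndarray:
    e = np.exp(q - q.max())
    return e / e.sum()

def f_demonstrations(x: str, demos: list) -> str:
    """One possible f(.): prepend labeled examples to the input (ICL-style)."""
    return "\n".join(demos + [x])

def predict(score_fn, x: str, demos: list) -> np.ndarray:
    """q(x) = S_M(f(x)); P(y|x) = Softmax(q(x)). score_fn plays S_M."""
    return softmax(score_fn(f_demonstrations(x, demos)))

# Mock black-box scorer: favors class 1 when the prompt contains "great".
mock_S_M = lambda prompt: np.array([0.0, 2.0]) if "great" in prompt else np.zeros(2)
p = predict(mock_S_M, "This film is great.", ["Boring plot. -> bad"])
```

Because S_M is only reachable through the API, improving f(·) means repeatedly re-querying the model, which is exactly the cost DecT avoids.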

Overview of DecT
For more effective and efficient PTM adaptation, we turn to output-side adaptation rather than input-side. Overall, output-side adaptation can be viewed as post-processing of the model outputs: another function g(·) processes the model outputs to give the final scores q(x) = g(S_M(x)). Different from the input side, output-side adaptation is easy to optimize with gradient descent, and for each sample, we only need to query the PTM once.
For DecT, as shown in Figure 2, we model the post-processing as decoding, which refers to a post-modification of the initial model predictions. Specifically, we first query the PTM with prompt-enclosed inputs to get the model outputs, including the scores for each class and the hidden states. Intuitively, the output scores contain prior knowledge inside the PTM, so we retain them as part of the final scores. Then, we tune an additional decoder function on the hidden states to fit the training data and make the final predictions. Next, we describe how we query the model and then specify the implementation of the score function.

Querying with Prompts
To get model outputs, we simply follow the procedure in Section 2 and query the model with manual template-wrapped inputs. We then process the scores by calibration.
Calibration. As observed in prior work, PTMs tend to assign higher probabilities to frequent label words, leading to biased output scores. To eliminate this prediction bias, we further calibrate the output scores with an empty input x_c = "". Querying the model with x_c, we obtain the calibration scores s_c and normalize them by s_c / mean(s_c). Then we calibrate s by

ŝ = diag(s_c / mean(s_c))^{-1} s. (1)

After that, the calibrated scores ŝ are balanced over classes.
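Eq. 1 amounts to a per-class rescaling; a small numpy sketch (with made-up numbers) shows the effect:

```python
import numpy as np

def calibrate(s: np.ndarray, s_c: np.ndarray) -> np.ndarray:
    """Eq. 1: s_hat = diag(s_c / mean(s_c))^{-1} s.

    s   : raw scores over label words for a real input  (shape [K])
    s_c : scores for the empty input x_c = ""           (shape [K])
    """
    w = s_c / s_c.mean()   # normalized per-class bias, mean 1
    return s / w           # same as inv(diag(w)) @ s, since diag(w) is diagonal

# A label word the PTM favors even on empty input gets scaled down.
s = np.array([2.0, 2.0])       # raw scores: a tie between the two classes
s_c = np.array([3.0, 1.0])     # the model is biased toward class 0
s_hat = calibrate(s, s_c)      # after calibration, class 1 wins
```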

Tuning the Outputs
After getting the hidden states and calibrated scores, we perform DecT outside the PTM to modify the output scores to fit the training data. Denoting the final score on class k as q(x, k), we calculate it by

q(x, k) = Dec(h)_k + λ ŝ_k, (2)

where Dec(·) is a trainable decoder function, λ is a hyperparameter controlling the weight of the PTM scores, and ŝ_k is the k-th logit in ŝ. By tuning Dec(·), the final predictions incorporate training data on top of the PTM outputs, combining both sources of knowledge effectively.
The design choice of Dec(·) is fairly flexible. In practice, we select Prototypical Networks (ProtoNet) (Snell et al., 2017) due to their simplicity and remarkable performance in few-shot learning and prompt-based tuning (Cui et al., 2022). For this, we project the hidden states with a linear layer parameterized by W and get the sample representation

v = Wh.

For the prototypes, classical approaches model them as points in the embedding space, which overlooks different class characteristics. Inspired by Ding et al. (2022a), we model prototypes as hyperspheres with an additional radius parameter. Concretely, the prototype for class k contains two parameters: a center position vector z_k and a radius scalar r_k. We randomly initialize z_k and initialize r_k as the average distance between z_k and the instances in class k:

r_k = (1/n) Σ_{i: y_i = k} ||v_i − z_k||_2.

As for the score function, we calculate the Euclidean distances between instances and prototypes.
According to Eq. 2, the final logit is

q(x, k) = r_k + λ ŝ_k − ||v − z_k||_2.

From a geometric view, the score function calculates the distance from instance x to the "surface" of the prototype, where r_k + λ ŝ_k is the whole radius acting as a bias term. With the scores, we calculate the predicted probability with the Softmax function:

P(y = k | x) = exp(q(x, k)) / Σ_{k'} exp(q(x, k')),

and we optimize W and r_k with the cross-entropy loss

L = −(1/N) Σ_i log P(y_i | x_i).
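Putting the score, probability, and loss together, a forward pass of the decoder can be sketched in a few lines of numpy (the toy dimensions are our assumptions; in practice W and r_k would be trained with an autodiff library):

```python
import numpy as np

def dect_logits(h, W, Z, r, s_hat, lam):
    """q(x,k) = r_k + lam * s_hat_k - ||W h - z_k||_2.

    h     : [MASK] hidden state from the PTM   (shape [d])
    W     : trainable linear projection        (shape [p, d])
    Z     : prototype centers z_k, stacked     (shape [K, p])
    r     : prototype radii r_k                (shape [K])
    s_hat : calibrated PTM scores              (shape [K])
    lam   : weight lambda on the PTM scores
    """
    v = W @ h                                  # sample representation v = Wh
    dist = np.linalg.norm(Z - v, axis=1)       # Euclidean distance to each center
    return r + lam * s_hat - dist

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

def cross_entropy(q, y):
    """Loss used to optimize W and r_k (forward value only)."""
    return -np.log(softmax(q)[y])

# Toy numbers: d = 4, p = 3, K = 2.
h = np.array([1.0, 0.0, 0.0, 0.0])
W = np.eye(3, 4)                               # projects onto the first 3 dims
Z = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
r = np.array([0.5, 0.5])
s_hat = np.array([1.0, 0.0])
q = dect_logits(h, W, Z, r, s_hat, lam=1.0)    # class 0 is closer and favored
```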

Experiments
In this section, we first introduce the experimental settings (Section 4.1), then discuss the results for few-shot experiments (Section 4.2), efficiency comparison (Section 4.3), and experiment results for more training data (Section 4.4).
Splits. We randomly sample n = 1, 4, 16 data instances per class from the training set for few-shot learning, and sample the same amount of data for validation.

Table 1: Main few-shot learning results. We run experiments over 5 random seeds and report average accuracy and standard deviation (%). Best results are in bold.
We set λ to decrease as the amount of training data increases. On MNLI and FewNERD, we tune λ on the validation set and select λ = 1 and λ = 1/16 respectively. In Appendix A.3, we give the templates and label words.

Main Results

Table 1 presents the main few-shot learning results, from which we make the following observations. (1) Overall, DecT outperforms the state-of-the-art baseline methods by a large margin (more than 5% on average), showing its superior performance. Across tasks, DecT and the baselines obtain similar results on some easy sentiment analysis and topic classification tasks, but we highlight that DecT is much more favorable on difficult datasets such as Yahoo and FewNERD. While other baseline methods struggle to optimize well, DecT surpasses them significantly (by about 10% on Yahoo and 20% on FewNERD under the 16-shot setting compared with BBTv2 and ICL).
(2) On stability, DecT also has consistently low variance, whereas some baselines (ICL and RLPrompt) are highly unstable. Given the difficulty of few-shot PTM adaptation, robustness to random seeds is of great significance.
(3) On baselines, optimization-free methods, i.e., the zero-shot prompt and ICL, are strong. As shown in the table, ICL gives the best results in the 1-shot setting, but it can hardly improve with more training data due to the input length restriction. Compared with them, merely optimizing the input prompts (BBT and RLPrompt) shows marginal improvements, revealing the limitation of black-box prompt optimization. In contrast, BBTv2, which inserts additional learnable prompt tokens inside the PTM, is more powerful. Given the superior results of DecT, we argue that output-side optimization is a promising direction for MaaS PTM adaptation.

Efficiency Comparison
Despite its superior performance, another major advantage of DecT is its high efficiency. Figure 1 compares DecT with prompt optimization methods in terms of training time and the number of model queries. For BBT, BBTv2, and RLPrompt, users have to query the model nearly 10^4 times and spend several hours for sufficient optimization even in the few-shot scenario. When the inference API is not free, such as the OpenAI API 2, using these methods becomes expensive, and this further burdens their usage in scenarios with rich data and large models.

Table 3: Experiment results for more training data. We run all experiments over 5 random seeds and report the average accuracy and standard deviation (%). †: Updates model parameters.
In terms of tunable parameters, DecT demands only 130K additional parameters for the linear projection layer, less than 0.04% of RoBERTa-large (355M), which largely saves storage space.
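As a rough sanity check on that figure (the 128-dim projection width is our assumption; the paper states only the total parameter count):

```python
# RoBERTa-large has hidden size 1024; a 128-dim projection is one width
# consistent with the reported ~130K parameters (assumed here).
hidden, proj = 1024, 128
dect_params = hidden * proj            # 131,072 for a bias-free linear layer
backbone_params = 355_000_000          # RoBERTa-large parameter count
ratio = dect_params / backbone_params  # well under 0.04%
```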

Beyond Few-shot
As shown in Section 4.3, the simple architecture and high efficiency enable DecT to scale to more training data, while baseline methods struggle to finish training within acceptable time limits. To explore the scalability of DecT beyond the few-shot setting, we conduct experiments with increased training data (n = 64 and 256). For reference, we compare DecT with fine-tuning, the strongest baseline, which updates the full model parameters.
The detailed results are presented in Figure 1 and Table 3 and we have the following conclusions.
(1) DecT continually improves its performance with more training data at a low cost: average accuracy gains 6% from 16-shot to 256-shot while the average training time stays under 100 seconds. (2) DecT is on par with fine-tuning in the 64-shot scenario and gradually falls behind in the 256-shot setting, which is reasonable as we only tune a small portion of parameters outside the model. Through further task-level observation, we find DecT still performs well on sentiment analysis and topic classification, but cannot catch up with fine-tuning on NLI and entity typing, which are harder tasks requiring complex reasoning or fine-grained semantic understanding. (3) In experiments, we find fine-tuning is more sensitive to random seeds in the few-shot setting due to the huge number of trainable parameters and relatively few loss signals, as evidenced by its high variance in the 64-shot setting. The stability advantage of DecT is thus verified again.

2 https://openai.com/api/

Table 4: Ablation study of model scores s and radius parameter r. We run each experiment over 5 random seeds and report average accuracy and standard deviation (%). Best results are in bold.
To conclude, we take the first step to applying MaaS methods beyond few-shot learning. The results show that DecT is competitive against finetuning on regular classification tasks, but is limited on difficult tasks. How to adapt PTMs on challenging tasks without parameter updates still needs further exploration.

Analysis
In addition to the main experiments, we provide more analytical experiments for understanding DecT. We conduct an ablation study on several components in Section 5.1. Then we evaluate the impact of the hyperparameter λ (Section 5.2), PTMs (Section 5.3), and templates (Section 5.4) respectively.

Ablation Study
To validate each component of our proposed DecT, especially the effect of the model scores s, the radius parameter r, and the ProtoNet decoder, we conduct extensive ablation studies. We present the results in Table 4 and Figure 3.

Table 5: Experiment results with T5 (Raffel et al., 2020) at different scales. We run each experiment over 5 random seeds and report average accuracy and standard deviation (%).
Ablating model scores. Apparently, model scores contribute largely to the few-shot performance of DecT, especially when the training data is extremely scarce (1-shot), which illustrates that model scores contain beneficial prior model knowledge for language understanding. Also, incorporating training data reduces the variance. When there are more training data, model scores bring less enhancement, which is reasonable as the relative weights of model and ProtoNet scores change.
Ablating radius. Meanwhile, the radius is also helpful under low-shot scenarios, which characterizes the difference across classes. But as the number of training data increases, ProtoNet dominates model predictions and the impact of r is weakened as well.
Ablating the decoder. As stated previously, the design choice of the decoder function is flexible. We replace ProtoNet with a two-layer MLP and evaluate the performance. In Figure 3 we can see that ProtoNet significantly outperforms the MLP in the 1-shot setting, which matches the advantages of ProtoNet in few-shot learning. In the 4-shot and 16-shot experiments, ProtoNet still gets higher scores, but with smaller margins. On stability, ProtoNet achieves consistently lower standard deviations, which serves as another advantage. Overall, we find ProtoNet is a vital component of DecT, and simply replacing it with an MLP worsens the performance.
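For reference, the two-layer MLP used as the ablation baseline can be sketched as follows (the widths and ReLU activation are assumptions; it replaces only the Dec(·) term, keeping the λ ŝ_k part unchanged):

```python
import numpy as np

def mlp_decoder(h, W1, b1, W2, b2):
    """Two-layer MLP ablation: Dec(h) = W2 ReLU(W1 h + b1) + b2, giving K logits."""
    return W2 @ np.maximum(W1 @ h + b1, 0.0) + b2

# Toy shapes: d = 4 -> hidden 8 -> K = 2 classes.
rng = np.random.default_rng(0)
h = rng.normal(size=4)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)
logits = mlp_decoder(h, W1, b1, W2, b2)
```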

Impact of λ
As a hyperparameter, λ controls the relative importance of model scores and prototype scores. Here we examine its impact on AG's News and SST2.
In Figure 4, we can observe that: (1) λ largely affects DecT in the 1-shot setting. As λ increases, DecT gradually performs better and more stably, which illustrates the importance of model knowledge in this case.
(2) As the shot number increases, the impact of varying λ weakens, and the best-performing λ values become smaller. These observations validate our selection strategy in Section 4.1, which effectively balances model and data knowledge.

Impact of PTMs
In this section, we explore how DecT applies to PTMs of different architectures and scales. We select T5 (Raffel et al., 2020), an encoder-decoder PTM, at different scales, from T5-Small and T5-Base to T5-Large and T5-3B.

Impact of Templates
Although DecT is an output-side adaptation method, the choice of templates also affects the final performance. To assess the influence of templates, we conduct experiments on AG's News and SST2 and show the results in Table 6. Overall, DecT does not rely much on templates. While different templates may induce fluctuating zero-shot performance, DecT largely moderates the gaps between them. Additionally, we try two templates searched by RLPrompt (Deng et al., 2022), and they both achieve satisfying results. On SST2, the template from RLPrompt is even better than the manually designed ones. Therefore, we highlight that DecT is complementary to input-side adaptation algorithms, and they can work together for better performance.

Related Work
Our work explores how to efficiently adapt large PTMs. In this section, we review three lines of research for prompt-based tuning (data efficiency), parameter-efficient tuning (parameter efficiency), and MaaS adaptation methods respectively.

Prompt-based Tuning
The major practice in prompt-based tuning is wrapping text pieces into human-designed templates. By this means, prompt-based tuning converts downstream tasks into pre-training tasks (e.g. masked language modeling) and greatly enhances the zero/few-shot learning ability of PTMs. First applied in knowledge probing (Petroni et al., 2019), it has been adopted broadly in NLP (Schick and Schütze, 2021; Ding et al., 2021a; Han et al., 2021; Cui et al., 2022). Other works also investigate automatic or learnable prompts (Shin et al., 2020; Hambardzumyan et al., 2021; Schick et al., 2020), but the optimization of prompts is a non-trivial problem. In our work, we adopt manual templates to stimulate model knowledge and enable data-efficient model adaptation.

Parameter-efficient Tuning
Another line of work explores tuning a small fraction of model parameters to reduce computation and storage budgets, namely parameter-efficient tuning (PET) (Ding et al., 2022c). Typical PET methods include inserting tunable modules (Houlsby et al., 2019; Li and Liang, 2021; Hu et al., 2022a), adding soft prompt tokens (Lester et al., 2021), and tuning only specified parameters (Zaken et al., 2022). Although PET methods achieve remarkable performance with few parameter updates, they still require model gradients, which are not available in the MaaS setting.

MaaS Adaptation
With inference-only APIs, there are also works that adapt models without tuning any model parameters. Brown et al. (2020) present in-context learning, which concatenates test inputs with several exemplars. Although elegant and easy to use, in-context learning has been found to be unstable (Lu et al., 2022) and limited by input length. Other approaches try to optimize prompts with either black-box optimization methods (Sun et al., 2022a,b) or reinforcement learning (Deng et al., 2022). However, due to the lack of gradient signals, they need thousands of model queries, resulting in high costs when the model is large and API calls are not free. Different from the above methods, we adapt models on the output side, which does not require optimizing the input prompts through the distant black-box model. We need only one API call per training sample and achieve better results across tasks.

Conclusion
In this paper, we present DecT, which performs both data- and parameter-efficient adaptation with off-the-shelf PTMs. By fusing prior model knowledge and posterior data knowledge, DecT achieves superior performance on ten language understanding tasks. Meanwhile, DecT improves over existing baselines by three orders of magnitude in terms of training time and the number of queries, highlighting its advantages in real-world deployment. In future work, we are eager to explore how to combine input- and output-side adaptation methods for better PTM adaptation, and how to extend this line of research to more challenging scenarios.