When Gradient Descent Meets Derivative-Free Optimization: A Match Made in Black-Box Scenario

Large pre-trained language models (PLMs) have garnered significant attention for their versatility and potential for solving a wide spectrum of natural language processing (NLP) tasks. However, the cost of running these PLMs may be prohibitive. Furthermore, PLMs such as GPT-3 may not be open-sourced due to commercial considerations and potential risks of misuse. In this scenario, the parameters and gradients of PLMs are unavailable. To address this issue, black-box tuning has been proposed, which uses derivative-free optimization (DFO) instead of gradient descent to train task-specific continuous prompts. However, these gradient-free methods still exhibit a significant performance gap compared to gradient-based methods. In this paper, we introduce gradient descent into the black-box tuning scenario through knowledge distillation. Furthermore, we propose a novel method, GDFO, which integrates gradient descent and derivative-free optimization to optimize task-specific continuous prompts in a harmonized manner. Experimental results show that GDFO achieves significant performance gains over previous state-of-the-art methods.


Introduction
Large pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019; Raffel et al., 2020) have attracted considerable attention for their versatility and potential for solving a wide spectrum of Natural Language Processing (NLP) tasks. In particular, through prompt-based learning (PL) (Liu et al., 2021a; Gu et al., 2022), PLMs have consistently demonstrated impressive performance on various downstream tasks with a few labeled samples. However, it is a challenge to extend the benefits of these large PLMs to a broader audience. For users, the cost of running these models may be prohibitive; for service providers, they may not open-source the model parameters due to commercial considerations and potential risks of misuse. One possible solution is to deploy PLMs as a service, enabling users to access the advanced capabilities of PLMs through their inference APIs, such as GPT-3 (Brown et al., 2020), ERNIE (Sun et al., 2021) and Yuan (Wu et al., 2021b).
In this scenario, the large pre-trained language model provided by the server is considered a black box. To perform various downstream tasks, users are required to construct task-specific prompts or select training samples (Brown et al., 2020) to input into the black box. Discrete prompts can be constructed manually; they are simple and effective but may not fully utilize the training data, potentially resulting in suboptimal performance on some tasks. Instead of designing hand-crafted discrete prompts, an increasing number of studies focus on continuous prompt tuning (Lester et al., 2021; Liu et al., 2021a; Ding et al., 2022), which aims to train continuous prompts and add them to the original samples. Trainable continuous prompts have also shown remarkable success on various tasks, but most existing methods optimize the continuous prompts through back-propagation, which is unavailable in the black-box scenario. To address this issue, Sun et al. (2022b) recently proposed Black-Box Tuning (BBT), which utilizes random projection matrices and derivative-free optimization (DFO) (Kolda et al., 2003; Conn et al., 2009; Rios and Sahinidis, 2013), instead of gradient descent, to train continuous prompts in the black-box scenario. Built upon BBT, BBTv2 (Sun et al., 2022a) prepends continuous prompts to each layer of the PLM and further presents a divide-and-conquer gradient-free algorithm to alternately optimize the prompts at different layers. Both BBT and BBTv2 have shown their superiority against other gradient-free methods. Despite this success, there remains a significant gap compared to gradient-based methods on certain tasks. For example, Adapter (Houlsby et al., 2019), a gradient-based method, leads BBTv2 by 4.35% on the DBPedia dataset (as shown in Figure 1). Therefore, we posit that incorporating gradient descent into the black-box scenario may enhance model performance.
Based on the insights discussed above, in this paper, we introduce gradient descent into the black-box scenario through knowledge distillation techniques. In particular, we propose a novel approach named GDFO to combine Gradient descent with Derivative-Free Optimization, allowing them to jointly optimize task-specific continuous prompts. First, we adopt the technique of knowledge distillation, where a student model is trained to emulate the knowledge of the black-box model, referred to as the teacher model. Then, a prompt generator is trained by gradient descent through the student model, while derivative-free optimization simultaneously optimizes continuous task-specific prompts. The continuous prompts generated by the prompt generator and the prompts optimized by the derivative-free algorithm are then integrated to serve as the final prompts. Finally, we perform extensive experiments on seven benchmark datasets to show that GDFO achieves significant performance gains over other state-of-the-art methods. The main contributions of the paper are summarized as follows:

• To the best of our knowledge, we are the first to utilize gradient descent to optimize task-specific continuous prompts in the black-box scenario through knowledge distillation.
• We propose a novel method GDFO, which integrates gradient descent and derivative-free optimization to optimize task-specific continuous prompts in a harmonized manner.
• We conduct comprehensive experiments on seven benchmark datasets under the black-box scenario. Empirical results demonstrate the superiority of GDFO over other competitors.
2 Related Work

Prompt-based Learning
Prompt-based learning, in which the PLM is adapted to various tasks by task-specific prompts, has emerged as a promising framework. Brown et al. (2020) show that PLMs can perform excellently in few-shot learning by using manual prompts concatenated with samples. However, designing prompts in a hand-crafted fashion requires substantial time and experience, and may not find the optimal prompts (Jiang et al., 2020; Shin et al., 2021). To solve this problem, researchers have attempted to use automated prompts. LM-BFF (Gao et al., 2021) uses prompt-based fine-tuning with automatically searched prompts and generates task demonstrations to be part of the input context. P-tuning (Liu et al., 2021b) optimizes continuous prompts using gradient descent as an alternative to discrete prompt searching. P-tuning v2 (Liu et al., 2021a) adopts continuous prompts for each layer of the PLM to improve model performance. Prefix-tuning (Li and Liang, 2021) optimizes continuous task-specific vectors and prepends them to the input texts. Input-tuning (An et al., 2022) fine-tunes both the continuous prompts and the input representations, leading to a more effective way to adapt unfamiliar inputs to frozen PLMs.

Black-box Tuning
Due to commercial considerations, large PLMs such as GPT-3 (Brown et al., 2020) are only provided as a service in the cloud, rendering the parameters and gradients of the PLMs inaccessible. To tackle this issue, BBT (Sun et al., 2022b; Diao et al., 2022) has been proposed to optimize continuous prompts via derivative-free optimization (DFO).
As an improved version of BBT, BBTv2 (Sun et al., 2022a) inserts prompts into each layer of the PLM instead of optimizing the prompt merely in the input layer. Furthermore, GrIPS (Prasad et al., 2022) proposes a gradient-free search approach to generate discrete prompts. RLPrompt (Deng et al., 2022) optimizes discrete prompts through reinforcement learning and utilizes a highly parameter-efficient continuous policy network to generate prompts. PALP (Cho et al., 2022) combines linear models and in-context learning (Brown et al., 2020) to augment training samples with templates for better contextualization. To improve computational efficiency, PromptBoosting (Hou et al., 2022) constructs a pool of prompts via a gradient-free approach and ensembles many weak learners using the ADABOOST algorithm to enhance model performance. Despite the success of the above approaches, none of them optimizes continuous prompts through gradient descent (GD) in the black-box scenario. Our method introduces GD to this scenario through knowledge distillation and combines GD and DFO to jointly optimize continuous prompts, providing a novel insight for future black-box tuning approaches.

Knowledge Distillation
As a representative method of model compression, knowledge distillation transfers knowledge from a larger deep neural network (the teacher) to a smaller network (the student) (Hinton et al., 2015; Kim and Rush, 2016). Various distillation algorithms have been proposed to handle more complex settings of knowledge transfer, including adversarial distillation (Ma et al., 2020; Wang et al., 2022), multi-teacher distillation (Guan et al., 2020; Yuan et al., 2021) and data-free distillation (Fang et al., 2022; Binici et al., 2022). Furthermore, the success of PLMs has spurred researchers to distill PLMs into smaller models while retaining performance. DistilBERT (Sanh et al., 2019) introduces a triple loss combining language modeling, distillation and cosine-distance losses to leverage the inductive biases learned by large models during pre-training. TinyBERT (Jiao et al., 2020) performs Transformer distillation at both the pre-training and task-specific learning stages.
NewsBERT (Wu et al., 2021a) designs a collaborative learning framework where the student model can learn from the experience of the teacher model.
In this paper, we consider knowledge distillation to transfer knowledge from a black-box teacher to a student, which is used for training a prompt generator by gradient descent.

Method
In this section, we describe our approach, GDFO, an overview of which is illustrated in Figure 2. GDFO first trains a student model by aligning its outputs with those of the teacher model (i.e., the black-box model). Then, GDFO trains the prompt generator by gradient descent while simultaneously optimizing the continuous prompts via DFO. Finally, the final prompts are obtained by integrating the prompts generated by the prompt generator with those optimized by DFO; these are fed into the black-box model together with query instances to obtain predictions. Next, we describe each component of GDFO in detail.

Knowledge Distillation
Given a student model S and a teacher model T, the objective of knowledge distillation (KD) is to enhance the performance of S by aligning its outputs with those of T, which is accomplished by reducing the divergence between the probability distributions generated by S and T. In the black-box scenario, the black-box model is considered as T, and we utilize T's outputs as soft targets for S to learn. Given a training instance, we randomly select n tokens from the PLM vocabulary to construct a random prompt p_r, which is concatenated to the beginning of the instance. Additionally, a hand-crafted template (see Table 1 for template details) is appended to the end of the instance. We use the concatenated sentence as the input x. We denote S(x) and T(x) as the output logits of S and T for input x, respectively. KD is conducted by minimizing the Kullback-Leibler (KL) divergence between the student and teacher predictions:

$$\mathcal{L}_{\mathrm{KL}} = \mathrm{KL}\big(\sigma(S(x)/\tau) \,\|\, \sigma(T(x)/\tau)\big), \quad (1)$$

where σ(·) denotes the softmax function and τ is a temperature hyper-parameter. The student parameters are updated according to L_KL and the cross-entropy loss L_CE over the ground-truth label y:

$$\mathcal{L} = \lambda \mathcal{L}_{\mathrm{KL}} + (1 - \lambda)\,\mathcal{L}_{\mathrm{CE}}, \quad (2)$$

where λ is a balancing weight and L_CE is defined as:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c} y_c \log \sigma\big(S(x)\big)_c. \quad (3)$$
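To make the objective concrete, below is a minimal PyTorch sketch of this distillation loss. It assumes only that the student and teacher outputs are logit tensors of shape (batch, num_classes); the function and variable names are illustrative placeholders, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, tau=1.0, lam=0.5):
    """Combined KD objective: lam * L_KL + (1 - lam) * L_CE, as in Equation (2)."""
    # L_KL: KL divergence between temperature-softened distributions (Equation 1).
    # F.kl_div expects log-probabilities as input and probabilities as target.
    l_kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    )
    # L_CE: cross-entropy against the ground-truth labels (Equation 3).
    l_ce = F.cross_entropy(student_logits, labels)
    return lam * l_kl + (1.0 - lam) * l_ce

# Toy usage: 4 samples, 3 classes.
s = torch.randn(4, 3, requires_grad=True)  # student logits
t = torch.randn(4, 3)                      # teacher (black-box) logits
y = torch.tensor([0, 2, 1, 0])
loss = distillation_loss(s, t, y)
loss.backward()  # gradients flow into the student side only
```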

Prompt Generator
Upon the completion of training S via knowledge distillation, the student parameters are frozen and a prompt generator is optimized by gradient descent with the purpose of generating continuous prompts p_GD ∈ R^D for given samples. Meanwhile, following BBT (Sun et al., 2022b), we optimize an intermediate vector z ∈ R^d through CMA-ES (Covariance Matrix Adaptation Evolution Strategy) (Hansen and Ostermeier, 2001; Hansen et al., 2003), a widely used evolutionary algorithm for non-convex black-box optimization in continuous domains. A random projection matrix A ∈ R^{D×d} is then utilized to project z into the high-dimensional space. Finally, we randomly sample n tokens from the PLM vocabulary as the initial prompt p_0 and obtain the final continuous prompt p ∈ R^D:

$$p = \alpha \, p_{GD} + (1 - \alpha)\,(p_0 + Az), \quad (4)$$

where α is a balancing weight. Further information regarding the initialization of A and the specifics of the CMA-ES optimization procedure can be found in Sun et al. (2022b). Given a training instance, p is concatenated to the beginning of it and a hand-crafted template (see Table 1) is appended to the end of it. The concatenated sample is fed into S and T. The output logits are then obtained and used to compute L_CE, which is utilized to update the parameters of the prompt generator and to optimize z through CMA-ES. The overall training procedure of GDFO is summarized in Algorithm 1.
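The following NumPy sketch illustrates Equation (4) together with a few ask/tell rounds of CMA-ES via the `cma` package. The quadratic `black_box_loss` is a stand-in for querying the black-box PLM, and the dimensions are toy values (in the paper, d = 500 and D is the full dimensionality of the n = 50 prompt token embeddings); all names are assumptions for illustration.

```python
import numpy as np
import cma  # pip install cma

# Toy dimensions for illustration only.
d, D, alpha = 20, 256, 0.5

rng = np.random.default_rng(0)
A = rng.normal(size=(D, d))      # fixed random projection matrix
p0 = rng.normal(size=D)          # initial prompt (n random token embeddings)
p_gd = rng.normal(size=D)        # stand-in for the prompt generator's output

def black_box_loss(p):
    # Stand-in for querying the black-box PLM with prompt p and computing
    # the loss from its returned logits.
    return float(np.sum((p - 1.0) ** 2))

def final_prompt(z):
    # Equation (4): blend the gradient-descent prompt with the DFO prompt.
    return alpha * p_gd + (1.0 - alpha) * (p0 + A @ z)

# CMA-ES over the low-dimensional z (population size 20, as in the paper).
es = cma.CMAEvolutionStrategy(d * [0.0], 0.5, {"popsize": 20, "seed": 1})
for _ in range(10):  # a few ask/tell rounds suffice for this toy loss
    candidates = es.ask()
    losses = [black_box_loss(final_prompt(np.asarray(z))) for z in candidates]
    es.tell(candidates, losses)

z_best = np.asarray(es.result.xbest)
print("best toy loss:", black_box_loss(final_prompt(z_best)))
```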

Inference
During the inference stage, given a query instance, we first input it into the prompt generator to generate p_GD. Subsequently, we combine p_GD, p_0, and Az (optimized through CMA-ES) to obtain the final continuous prompt p via Equation (4). Next, as in the training stage, we concatenate p to the front of the query instance and append the hand-crafted template (see Table 1) to the end of it. Finally, we input the concatenated sample to the black-box model to obtain the prediction. The overall inference procedure is shown in Figure 3.
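A minimal sketch of this inference path is shown below, with a hypothetical `black_box_api` standing in for the PLM service and a single linear layer as the prompt generator (per the implementation details); the input representation for the generator and all names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

emb_dim, D, alpha = 1024, 1024, 0.5  # toy sizes; D matches the generator output

# The prompt generator is a single fully connected layer; here it is assumed
# to map an encoding of the query instance to a prompt vector.
prompt_generator = nn.Linear(emb_dim, D)

def infer(query_embedding, p0, Az, black_box_api):
    """Mirror of the inference procedure in Figure 3 (hypothetical helper)."""
    with torch.no_grad():
        p_gd = prompt_generator(query_embedding)     # generator prompt
        p = alpha * p_gd + (1 - alpha) * (p0 + Az)   # Equation (4)
    # Prepend p to the query, append the hand-crafted template, and query
    # the black-box model for its prediction.
    return black_box_api(prompt=p, query=query_embedding)

# Toy usage with a stub API that returns a fixed label.
stub_api = lambda prompt, query: "positive"
x = torch.randn(emb_dim)
print(infer(x, torch.zeros(D), torch.zeros(D), stub_api))
```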

Experiments
In this section, we perform comprehensive experiments to compare our proposed model with twelve competitive baselines on seven downstream tasks.

Datasets
We perform experiments on a variety of language understanding tasks, including sentiment analysis, topic classification, natural language inference (NLI), and paraphrasing. Statistics of these datasets are given in Table 1. Specifically, we utilize the following datasets:

Sentiment analysis: SST-2 (Socher et al., 2013) and Yelp polarity (Zhang et al., 2015) consist of text samples with assigned sentiment labels (e.g., positive or negative).
Topic classification: AG's News (Zhang et al., 2015) and DBPedia (Zhang et al., 2015) contain text samples with pre-defined topics.

NLI: SNLI (Bowman et al., 2015) and RTE (Wang et al., 2018) are composed of sentence pairs, and the objective is to determine the relationship between the two sentences, such as entailment, contradiction, or neutral.
Paraphrase: MRPC (Dolan and Brockett, 2005) contains sentence pairs and the goal is to recognize semantic equivalence between the two sentences.

Baselines
We compare GDFO with twelve competitive methods, which can be grouped into two categories: gradient-based methods and gradient-free methods.
For gradient-based methods, we consider six baselines: (1) Model Tuning fine-tunes the entire PLM on the training data. (2) Adapter (Houlsby et al., 2019) adds new modules between the layers of a PLM; the parameters of the original network remain fixed, yielding a high degree of parameter sharing. (3) BitFit (Zaken et al., 2022) is a sparse-finetuning method where most of the network parameters are frozen and only the bias terms of the model (or a subset of them) are modified. (4) LoRA (Hu et al., 2021), an efficient adaptation strategy, indirectly trains some dense layers of a neural network by optimizing rank decomposition matrices of the dense layers' change, while keeping the pre-trained weights frozen. (5) Prompt Tuning (Lester et al., 2021) freezes the entire PLM and only allows additional tunable tokens to be prepended to the input text. (6) P-Tuning v2 (Liu et al., 2021a) applies continuous prompts to every layer of the PLM instead of merely the input layer.

For gradient-free methods, we also consider six baselines: (1) Manual Prompt conducts subsequent experiments using hand-crafted prompts following the pre-defined templates in Table 1. (2) In-Context Learning (Brown et al., 2020) provides a few training examples for the model to improve its few-shot learning capability. (3) Feature-MLP trains a two-layered MLP classifier on embeddings encoded by the PLM. (4) Feature-BiLSTM trains a bidirectional LSTM on the word representations and connects it to a classifier. (5) BBT (Sun et al., 2022b) optimizes continuous prompts via derivative-free optimization. (6) BBTv2 (Sun et al., 2022a) extends BBT by inserting prompts into every layer of the PLM.

Implementation
Few-shot setting. We adopt the same procedure as described in previous studies (Zhang et al., 2020; Sun et al., 2022a) to construct the few-shot splits.

Experimental settings. To compare with BBTv2 (Sun et al., 2022a), we mainly use RoBERTa-LARGE (Liu et al., 2019) as the black-box model. For hyper-parameters, we use grid search to find the best values for our model. For knowledge distillation, we use BERT-LARGE (Devlin et al., 2019) as our student model. We set the temperature τ to 1 and the balancing weight λ to 0.5. We fine-tune the student model for 2,000 epochs with a learning rate of 1e-4. For the prompt generator, we use a fully connected layer and set its dimensionality to 1024; the learning rate of the prompt generator is 1e-5. For CMA-ES, following Sun et al. (2022b), we set the prompt length n to 50. The dimensionality of z is set to 500 and the population size of CMA-ES is set to 20. The balancing weight α is set to 0.5. We train our prompt generator and run CMA-ES for 8,000 API calls. All baseline results are taken from Sun et al. (2022a). We run all experiments on a single NVIDIA V100 GPU.
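For convenience, the settings above can be gathered into a single configuration; this is merely a restatement of the reported hyper-parameters, not a released config file.

```python
# Hyper-parameters as reported in the experimental settings
# (RoBERTa-LARGE as the black-box model).
GDFO_CONFIG = {
    "student_model": "bert-large",   # distilled from the black-box teacher
    "kd_temperature": 1.0,           # tau in Equation (1)
    "kd_weight": 0.5,                # lambda in Equation (2)
    "student_epochs": 2000,
    "student_lr": 1e-4,
    "generator_hidden_dim": 1024,    # single fully connected layer
    "generator_lr": 1e-5,
    "prompt_length_n": 50,           # tokens, following BBT
    "subspace_dim_d": 500,           # dimensionality of z
    "cma_es_popsize": 20,
    "balance_alpha": 0.5,            # alpha in Equation (4)
    "api_call_budget": 8000,
}
```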

Main Results
The results of the 16-shot setting on various downstream tasks are shown in Table 2. From the table, GDFO consistently outperforms all baselines on average performance. Specifically, GDFO achieves an average accuracy of 81.85%, outperforming LoRA, the best-performing gradient-based baseline, by a notable 3.84%. When compared against gradient-free methods, GDFO leads BBTv2 by 5.26% and 3.89% on the SNLI and RTE datasets, respectively. Our model generates a continuous prompt for each sample, rather than using a single optimized continuous prompt for all samples as in BBT and BBTv2. Furthermore, incorporating both DFO and gradient descent during the training stage allows GDFO to train continuous prompts more comprehensively and efficiently, resulting in a notable improvement in model performance.

Table 2: Results (%) of 16-shot setting on various downstream tasks. Following Sun et al. (2022a), we report mean and standard deviation of performance over 3 different splits. We highlight the best results in bold.

Ablation Study
We conduct an ablation study to investigate the contributions of the main components of GDFO. As illustrated in Figure 4, the results demonstrate that GDFO outperforms GDFO-w/o-KD. For instance, on the SNLI dataset, the accuracy of GDFO is 62.53%, whereas that of GDFO-w/o-KD is only 58.51%. This indicates that the knowledge distillation module, which transfers the knowledge of the teacher model to the student model by aligning the outputs of the student model with those of the teacher model, effectively improves model performance. Additionally, when removing derivative-free optimization, a significant decline is observed across all datasets, with an average decrease of 6.5%. This demonstrates the effectiveness of incorporating derivative-free optimization in the black-box scenario. It is worth noting that when the prompt generator is removed, the student model no longer serves any function, meaning that gradient descent is eliminated; in this case, our method degrades to the gradient-free method BBT. The results in Table 2 show that GDFO consistently outperforms BBT.
Effect of Student Models. We observe that student models whose architectures are similar to the black-box model tend to exhibit superior performance. For instance, when both the black-box model and the student model are RoBERTa-LARGE (Liu et al., 2019), GDFO achieves the best performance. When comparing student models with similar architectures, such as BART-LARGE (Lewis et al., 2020) and T5-LARGE (Raffel et al., 2020), T5 exhibits superior performance, which may be because the T5 model has twice the number of parameters of the BART model; the increased capacity allows the T5 model to better capture and represent the relationships within the input data, resulting in improved performance.

Effect of Balancing Weight. The balancing weight α plays a crucial role in determining model performance by controlling the influence of p_GD and Az. As the value of α increases, the influence of p_GD becomes more prominent; conversely, as the value of α decreases, the influence of Az becomes more pronounced. As illustrated in Figure 6, when α is set to an extreme value, either too large or too small, it tends to have a negative impact on model performance. We observe that the average performance of the model across three datasets is optimal when α is set to 0.5, further emphasizing the importance of combining derivative-free optimization and gradient descent to improve model performance.

Conclusion
In this paper, we introduced gradient descent into the black-box tuning scenario through knowledge distillation for the first time, providing a novel insight for future black-box tuning approaches. Furthermore, we proposed a novel method, GDFO, which integrates gradient descent and derivative-free optimization to jointly train continuous prompts. GDFO first trains a student model by aligning its outputs with those of the teacher model (i.e., the black-box model). After that, GDFO trains a prompt generator using gradient descent while simultaneously optimizing a continuous prompt using a DFO algorithm. Experimental results on various datasets show that GDFO achieves significant performance gains over other gradient-free and gradient-based methods.

Limitations
We summarize the limitations of this work as follows: (1) We conduct experiments on 7 language understanding tasks across 4 types (i.e., sentiment analysis, topic classification, natural language inference and paraphrasing); however, the effectiveness of GDFO on tasks such as sequence labeling and generation has yet to be fully examined. (2) Our proposed method uses a student model and a prompt generator, resulting in a higher computational resource requirement compared to gradient-free methods. It may therefore not be suitable for deployment on certain edge devices, but it is appropriate for personal or enterprise users who have access to a certain degree of computational resources and have stringent requirements for model performance. (3) We focus only on the few-shot setting in this paper. It is possible to extend our work to other scenarios such as semi-supervised learning, which we will explore in future research.

Ethics Statement
The proposed method has no obvious potential risks. All the scientific artifacts used or created in this work are properly cited and licensed, and their usage is consistent with their intended use.

Figure 1: Accuracy (%) on the AG's News and DBPedia datasets. The experimental setup is detailed in Section 4.3. Note that prior gradient-based approaches, such as Adapter (Houlsby et al., 2019) and LoRA (Hu et al., 2021), cannot be used in black-box scenarios. GDFO is the first to introduce gradient descent in the black-box tuning scenario.

Figure 2: The overall architecture of GDFO. The details of the model are described in Section 3. The training procedure is shown in Algorithm 1.

Figure 4: Ablation study: Results (%) of 16-shot problems over seven datasets. w/o KD denotes removing knowledge distillation and w/o DFO denotes removing derivative-free optimization. When removing the prompt generator, our method degrades to BBT (Sun et al., 2022b). The comparison of GDFO and BBT is shown in Table 2. The detailed analysis is described in Section 4.5.

Figure 5: Accuracy (%) on different black-box models. We report mean and standard deviation of performance over 3 different splits. The results of BBT and BBTv2 are reported in (Sun et al., 2022a).

Figure 6: Effect of the balancing weight α on three datasets. We report mean and standard deviation of performance over 3 different splits.
