Black-Box Tuning of Vision-Language Models with Effective Gradient Approximation

Parameter-efficient fine-tuning (PEFT) methods provide an effective way to adapt large vision-language models to specific tasks or scenarios. Typically, they learn a very small number of parameters for pre-trained models in a white-box formulation, which assumes that model architectures are known and parameters are accessible. However, large models are often not open-source, whether to prevent abuse or for commercial reasons, which poses a barrier to the deployment of white-box PEFT methods. To alleviate the dependence on model accessibility, we introduce collaborative black-box tuning (CBBT) for both textual prompt optimization and output feature adaptation for black-box models. First, since backpropagated gradients are unavailable, we approximate the gradients of textual prompts by analyzing the predictions with perturbed prompts. Second, a lightweight adapter is deployed over the output feature of the inaccessible model, further facilitating the model adaptation process. With these designs, CBBT is extensively evaluated on eleven downstream benchmarks and achieves remarkable improvements compared to existing black-box VL adaptation methods. Code is released at https://github.com/guozix/cbbt.


Introduction
Large-scale vision-language (VL) models (Radford et al., 2021; Jia et al., 2021; Li et al., 2021; Yao et al., 2021; Alayrac et al., 2022; Yuan et al., 2021) have demonstrated remarkable performance in a wide range of applications. Various model fine-tuning methods have been proposed to exploit the potential of pre-trained VL models for downstream vision (Zhou et al., 2022b; Lu et al., 2022b; Wang et al., 2022; Sun et al., 2022c; Zhang et al., 2022; Wortsman et al., 2022; Li et al., 2023) and natural language processing (Lu et al., 2022a; Yan et al., 2022) tasks. Most existing methods conduct parameter-efficient fine-tuning (PEFT (Houlsby et al., 2019)), which updates a tiny fraction of the model parameters or introduces a small number of extra parameters for tuning, in order to transfer pre-trained knowledge in a computation- and data-efficient manner.
Although impressive improvements have been achieved, standard PEFT methods need to pass signals forward and backward through the entire pre-trained model to update the parameters, which relies on the availability of the architecture, parameters, and even the inference source code of the model. Nevertheless, the trend of building machine learning models as a service has led to many proprietary services that only provide an API for model inference, e.g., ChatGPT, Bard, and GPT-4, where the parameters and inference code of the models are not open-source due to commercial or safety considerations. Under such black-box circumstances, existing PEFT methods can hardly be adopted. Thus, it is worthwhile to develop methods that can tune pre-trained VL models in a black-box setting. Moreover, in the era of large foundation models, running very large pre-trained models on local devices can be costly, as the scale of pre-trained models has constantly increased. Although existing PEFT methods restrict learnable parameters to a fairly small scale, accommodating models with billions of parameters on limited computing resources is still a burden for most users.
To tackle the problem of tuning black-box VL models, there exist a few very recent efforts. For instance, BlackVIP (Oh et al., 2023) pioneered black-box prompting for VL models by learning an asymmetric autoencoder-style coordinator with zeroth-order optimization to modify visual prompts in the pixel space. However, modifying prompts in the large pixel space is inefficient, and the method requires up to 9k parameters in the coordinator to achieve the goal. Besides, the performance of its visual prompts depends on the diverse semantic features of a well-trained generative self-supervised learning model. Even so, the method demonstrates limited performance improvements after prompting, showing that prompt tuning in the black-box setting is very challenging.
In this paper, we propose a collaborative black-box tuning method dubbed CBBT for tuning pre-trained VL models and adapting them to downstream tasks. Unlike in BlackVIP (Oh et al., 2023), we learn the prompt for the textual input instead of images, and we adapt the visual features using an adapter. The basic idea is illustrated in Fig. 1.
A query-efficient approximation method (Wierstra et al., 2014) is used to estimate the gradients and optimize the textual prompt with the black-box pre-trained VL model, from which true gradients are not accessible. Specifically, we query the model with randomly perturbed prompts and then summarize the change in model prediction loss to estimate the gradient of learnable parameters (i.e., the prompts). We equip single-step gradient optimization with information from history updates via a momentum strategy, which leads to faster convergence and better results.
Under the circumstance where the output features are available for the pre-trained VL models, we further adapt the visual features by introducing a lightweight adapter module. As demonstrated in Fig. 1, the visual adapter can be learned effortlessly by supervised learning, without having knowledge of the pre-trained VL backbone.
With the joint optimization of the textual prompt and the visual adapter, our CBBT achieves significant model adaptation performance. To evaluate its effectiveness, we conduct extensive experiments on eleven downstream benchmarks, showing superior performance compared to existing black-box VL adaptation methods.
The main contributions of this work can be summarized as follows:
• We advocate textual prompting for adapting pre-trained black-box VL models to downstream tasks. Satisfactory prompt tuning results are obtained with an effective gradient approximation algorithm.
• We expedite the tuning process by utilizing history updates as beneficial information for each optimization step, which brings about accelerated convergence and better results.
• We adapt the visual features jointly with the textual prompt when output features are available.
The comprehensive comparison shows that our method achieves state-of-the-art performance compared to other black-box tuning approaches.

Related Work
Black-box Prompt Tuning for Large Language Models. BBT (Sun et al., 2022b) adopts derivative-free optimization using the covariance matrix adaptation evolution strategy (CMA-ES) (Hansen et al., 2003) to optimize the prompt in a low-dimensional intrinsic subspace. With this method, the adaptation of large language models works well on natural language tasks, surpassing even the white-box prompting performance. BBTv2 (Sun et al., 2022a) further enhances the capacity of BBT by using deep prompt tuning. BDPL (Diao et al., 2022) tunes a set of discrete prompts for language models by modeling the choice of words in the prompt as a reinforcement learning policy, and a variance-reduced policy gradient estimator (Williams, 1992; Dong et al., 2020; Zhou et al., 2021) is used to optimize the discrete prompt based on the loss value.
Black-box Adaptation for VL Models. To the best of our knowledge, BlackVIP (Oh et al., 2023) is the first work to tackle the black-box tuning problem of pre-trained VL models. It designs an asymmetric autoencoder-style coordinator to generate input-dependent image-shaped visual prompts and optimizes the coordinator by zeroth-order optimization using simultaneous perturbation stochastic approximation (SPSA) (Spall, 1992, 1998, 1997). However, the improvement brought by this method (after visual prompting) is relatively limited compared to the baseline, i.e., the pre-trained CLIP (Radford et al., 2021). LFA (Ouali et al., 2023) relaxes the black-box setting by assuming that pre-computed features from pre-trained backbones are accessible. It optimizes a projection layer for better alignment between pre-computed image features and class prototypes through a multi-stage procedure: it first solves the orthogonal Procrustes problem (Schönemann, 1966) by singular value decomposition (SVD) and then refines the projection matrix using an adaptive reranking loss. Although superior adaptation performance is obtained, we argue that the multi-stage optimization can be replaced by end-to-end supervised learning with a lightweight adapter, which effortlessly provides comparable results given labeled image features.

Method

PEFT in the Black-box Framework
Here we introduce the general forms of prompt tuning and adapter learning, and the dilemma that arises when they are applied to black-box VL models.
Prompt tuning for VL models. Given a pre-trained VL model, e.g., CLIP (Radford et al., 2021), existing soft prompt tuning approaches (Zhou et al., 2022b,a; Sun et al., 2022c) for classification tasks typically prepend learnable embeddings to the class names of the target dataset:
$$\phi(c_i) = [v_1, \ldots, v_M, c_i], \tag{1}$$
where $i \in \{1, \ldots, C\}$ denotes the class index and $c_i$ denotes the word embedding of the $i$-th class name. For $j \in \{1, \ldots, M\}$, $v_j$ is a learnable word embedding whose dimension is the same as that of normal word embeddings in the vocabulary. The prediction for an input image $x$ is obtained by computing similarities between the image feature $f$ and the prompted textual class features $\{t_i\}_{i=1}^{C}$:
$$p(\hat{y} = i \mid x) = \frac{\exp(\langle t_i, f \rangle / \tau)}{\sum_{j=1}^{C} \exp(\langle t_j, f \rangle / \tau)}, \tag{2}$$
where the image feature is encoded by the pre-trained image encoder, $f = \mathrm{Enc}_I(x)$, and the textual class embeddings are generated by the text encoder, $t_i = \mathrm{Enc}_T(\phi(c_i))$. $\langle \cdot, \cdot \rangle$ calculates the cosine similarity and $\tau$ is a temperature parameter. The objective of the prompt module $\phi$ is to maximize the classification probability of the ground-truth class of few-shot image samples:
$$L(y, x, \phi) = -\log p(\hat{y} = y \mid x). \tag{3}$$
Given a white-box model, it is straightforward to calculate the gradient of $L$ with respect to the prompt, and the prompt can be optimized via gradient descent:
$$\phi_{t+1} = \phi_t - \eta \nabla_{\phi} L(y, x, \phi), \tag{4}$$
where $\eta$ is the learning rate. Unfortunately, in the black-box setting, gradients cannot be backpropagated through the pre-trained black-box $\mathrm{Enc}_I$ and $\mathrm{Enc}_T$ via the chain rule, and the term $\nabla_{\phi} L(y, x, \phi)$ cannot be directly obtained. Thus, current gradient-based prompt tuning methods are not feasible in this situation.
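As a minimal sketch of the prediction rule and loss described above, the following NumPy code computes a softmax over temperature-scaled cosine similarities and the negative log-likelihood of the ground-truth class. The encoders are replaced by random feature vectors, and the function names and the value tau = 0.01 are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def class_probabilities(image_feature, text_features, tau=0.01):
    """Softmax over cosine similarities <t_i, f> / tau (tau is an assumed value)."""
    f = l2_normalize(image_feature)      # shape (d,)
    t = l2_normalize(text_features)      # shape (C, d)
    logits = t @ f / tau                 # shape (C,)
    logits = logits - logits.max()       # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def nll_loss(image_feature, text_features, label, tau=0.01):
    """Negative log-probability of the ground-truth class."""
    probs = class_probabilities(image_feature, text_features, tau)
    return -np.log(np.clip(probs[label], 1e-12, None))

# Random stand-ins for encoder outputs (hypothetical data, not CLIP features).
rng = np.random.default_rng(0)
text_features = rng.normal(size=(5, 512))
image_feature = text_features[2] + 0.1 * rng.normal(size=512)

probs = class_probabilities(image_feature, text_features)
loss = nll_loss(image_feature, text_features, label=2)
```

In the black-box setting only the value of such a loss is observable for a given prompt; its gradient with respect to the prompt is not.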
Adapter learning for VL models. Adapter learning methods (Gao et al., 2021; Zhang et al., 2022) for VL models usually manipulate the output features of pre-trained models for adaptation to target tasks. For instance, an adapter module $\psi$ can be introduced to transfer the visual features to new domains with $\hat{f} = \psi(f)$, and then the prediction is obtained by:
$$p(\hat{y} = i \mid x) = \frac{\exp(\langle t_i, \hat{f} \rangle / \tau)}{\sum_{j=1}^{C} \exp(\langle t_j, \hat{f} \rangle / \tau)}. \tag{5}$$
Learning such an adapter module by minimizing $L(y, \hat{f}, \psi)$ does not require back-propagation through the entire pre-trained VL model, which makes adaptation convenient without knowing the details of the backbone model. However, access to the output features of the pre-trained model is required to construct and optimize the adapter module (Zhang et al., 2022; Ouali et al., 2023).
Further Analyses of the Black-box PEFT. Given a black-box pre-trained model, the unavailability of gradients sets a barrier to prompt tuning. A natural idea is therefore to optimize the prompt by estimating gradients. Input gradient approximation has been explored in black-box model attacks (Ilyas et al., 2018b,a) and black-box model reprogramming (Tsai et al., 2020). We employ a perturbation-based gradient approximation method to estimate the gradient of the learnable parameters in the prompt. The estimated gradient serves as an effective guide for the tuning of the prompt.
Although the gradient approximation technique provides workable guidance for optimization, it is still suboptimal compared to the real gradients. Merely conducting single-step gradient descent based on the estimated gradient leads to inefficient training. Inspired by the design of existing optimizers, we expedite the optimization based on the estimated gradient with momentum. The basic idea is that information from previous updates is useful for the current step, and accumulated gradients possibly provide more promising exploration directions. We empirically find that equipping gradient approximation with the momentum strategy brings expedited convergence and a remarkable gain in adaptation performance.
Although we have no access to the internal variables of typical black-box models, under the circumstance where output features of the pre-trained VL backbone are available, post-processing adapter modules can be directly learned by labeled samples for PEFT.
Motivated by the above analyses, we propose to adapt black-box VL models with a collaborative PEFT consisting of optimization from two perspectives. Firstly, we tune a textual prompt under the guidance of the estimated gradient. Perturbation-based gradient approximation and an effective optimization strategy are used to facilitate the training. Secondly, we learn a lightweight adapter to transfer pre-trained visual features. Joint optimization of the prompt and adapter brings superior adaptation performance. The overview of the proposed model is illustrated in Fig. 1.
In the following, we begin by presenting the perturbation-based gradient approximation method in Section 3.2. Then, we explain how to expedite the tuning process by leveraging information from previous updates to achieve a better optimization in Section 3.3. Finally, we introduce the adapter module and joint training schedule in Section 3.4.

Perturbation Based Gradient Approximation
Suppose the prompt module $\phi$ has parameter $\theta$ with dimension $D$. Let $f(\theta)$ be the loss function defined in Eq. (3). To approximate the gradient of the loss function with respect to $\theta$, one possible avenue is to add a small increment to each dimension of $\theta$ and sum up the slopes of all dimensions:
$$\nabla f(\theta) \approx \sum_{i=1}^{D} \frac{f(\theta + \beta e_i) - f(\theta)}{\beta} \, e_i, \tag{6}$$
where $\beta$ is a small increment and $e_i$ is a one-hot vector whose $i$-th element is equal to 1. Such an approximation may work well for low-dimensional parameters but is not suitable for problems where $D$ might be large. For example, the dimension of each word embedding of pre-trained CLIP is 512, i.e., $\theta \in \mathbb{R}^{M \times 512}$. Thus $M \times 512$ independent API calls to the black-box model are needed to obtain the complete estimated gradient of parameter $\theta$, which is inefficient.
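To make the query cost concrete, the sketch below implements the coordinate-wise forward difference on a toy quadratic loss and counts loss evaluations. The names `coordinate_gradient` and `toy_loss`, the step size, and the three-dimensional parameter are illustrative assumptions; in the black-box setting each call to `loss_fn` would correspond to one query to the model API.

```python
import numpy as np

def coordinate_gradient(loss_fn, theta, beta=1e-4):
    """Forward-difference gradient estimate, one perturbed query per dimension."""
    base = loss_fn(theta)                 # baseline loss at the current parameter
    grad = np.zeros_like(theta)
    queries = 1                           # counts every loss evaluation
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i.flat[i] = 1.0                 # one-hot direction for dimension i
        grad.flat[i] = (loss_fn(theta + beta * e_i) - base) / beta
        queries += 1
    return grad, queries

def toy_loss(theta):
    """Hypothetical stand-in for the black-box loss: a simple quadratic."""
    return float(np.sum(theta ** 2))

theta = np.array([1.0, -2.0, 0.5])
grad, queries = coordinate_gradient(toy_loss, theta)
```

The loop issues one perturbed query per dimension on top of the baseline evaluation, so the cost grows linearly with the size of the prompt.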
To alleviate the cost of the above gradient estimation method, we adopt a stochastic perturbation-based gradient estimation technique formulated as:
$$g_i = b \cdot \frac{f(\theta + \beta \epsilon_i) - f(\theta)}{\beta} \cdot \epsilon_i, \tag{7}$$
where $g_i$ is the slope of the loss function along the direction of the perturbation, $\epsilon_i$ is a vector randomly drawn from the unit sphere, i.e., with an L2-norm of 1, $\beta$ is a small value controlling the scale of perturbations, and $b$ is a scaling factor balancing the bias-variance trade-off of the estimator.
To mitigate noise in the estimated gradients, we sample the random perturbation $\epsilon_i$ $q$ times, and the gradient of $\theta$ is approximated by averaging the slopes of the $q$ directions (Wierstra et al., 2014; Ilyas et al., 2018a; Tu et al., 2019):
$$g = \frac{1}{q} \sum_{i=1}^{q} g_i. \tag{8}$$
The upper bound of the error of the estimate $g$ with respect to the true gradient is given in Eq. (9). Setting a smaller $\beta$ can reduce the last error term in Eq. (9) but may cause an increase in noise due to numerical precision. Increasing the number of samples $q$ reduces the first error term but consumes more queries to the model API.
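The random-direction estimator described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: `estimate_gradient` is a hypothetical name, the quadratic `loss_fn` stands in for a black-box loss, and the defaults b = D and beta = 1/D follow the settings reported in the implementation details.

```python
import numpy as np

def estimate_gradient(loss_fn, theta, q=256, beta=None, b=None, rng=None):
    """Average of q random-direction slope estimates of the gradient."""
    rng = np.random.default_rng() if rng is None else rng
    D = theta.size
    beta = 1.0 / D if beta is None else beta     # perturbation scale
    b = float(D) if b is None else b             # scaling factor
    base = loss_fn(theta)                        # one query at the current point
    grad = np.zeros_like(theta)
    for _ in range(q):
        eps = rng.normal(size=theta.shape)
        eps /= np.linalg.norm(eps)               # random direction on the unit sphere
        slope = (loss_fn(theta + beta * eps) - base) / beta
        grad += b * slope * eps                  # single-direction estimate
    return grad / q                              # average over the q directions

# Toy check against a loss whose gradient is known (hypothetical example).
rng = np.random.default_rng(0)
target = np.linspace(-1.0, 1.0, 20)
theta = np.zeros(20)
loss_fn = lambda th: float(np.sum((th - target) ** 2))

g = estimate_gradient(loss_fn, theta, q=2000, rng=rng)
true_grad = 2.0 * (theta - target)
cosine = float(g @ true_grad / (np.linalg.norm(g) * np.linalg.norm(true_grad)))
```

Each call uses q perturbed evaluations plus one baseline evaluation, independent of the dimension D, which is the source of the query savings relative to the coordinate-wise scheme.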

Effective Optimization Based on Estimated Gradient
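To expedite the optimization based on the estimated gradient, we facilitate the tuning process by leveraging the momentum strategy. Specifically, we estimate the first-order moments of the parameters' gradient by
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \tag{10}$$
where $g_t$ is the estimated gradient at step $t$ and $\beta_1$ is a decay rate (distinct from the perturbation scale $\beta$). The first-order moments accelerate the optimization and reduce the noise in the gradient of each step. We obtain the adaptive estimation of the second-order moment by
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \tag{11}$$
which is used to adjust the learning rate of each dimension adaptively.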
In our experiments, we use optimizers that integrate momentum as a practical implementation. To analyze the optimization results of different optimizers, we illustrate the trend of the normalized loss value $|L(\theta^*) - L(\theta)| / |L(\theta^*) - L(\theta_0)|$ in Fig. 2. Adam (Kingma and Ba, 2014) shows fast and steady convergence and satisfactory final results. We have also tried more advanced techniques, e.g., LAMB (You et al., 2019), but no significant improvement in performance is observed. Empirical results show that optimizing the prompt with the Adam optimizer based on the estimated gradient provides expedited convergence and superior adaptation performance.
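To illustrate how moment estimates smooth a noisy gradient signal, the sketch below applies a standard Adam update (Kingma and Ba, 2014) to a toy problem. The learning rate and decay rates are common defaults chosen for illustration, and `noisy_gradient` is a hypothetical stand-in that adds Gaussian noise to a known gradient; it is not the perturbation-based estimator itself.

```python
import numpy as np

class Adam:
    """Standard Adam update with bias-corrected moment estimates."""
    def __init__(self, shape, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(shape)   # first-order moment
        self.v = np.zeros(shape)   # second-order moment
        self.t = 0

    def step(self, theta, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)   # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return theta - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

def noisy_gradient(theta, target, rng, noise=1.0):
    """Assumed stand-in for an estimated gradient: true gradient plus noise."""
    return 2.0 * (theta - target) + noise * rng.normal(size=theta.shape)

rng = np.random.default_rng(0)
target = np.ones(8)
theta = np.zeros(8)
optimizer = Adam(theta.shape)

initial_loss = float(np.sum((theta - target) ** 2))
for _ in range(500):
    theta = optimizer.step(theta, noisy_gradient(theta, target, rng))
final_loss = float(np.sum((theta - target) ** 2))
```

In an actual black-box run, the `noisy_gradient` call would be replaced by the averaged random-direction estimate, with the optimizer state carried across iterations.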

Visual Adapter Module
The pre-trained VL models can be effectively adapted to downstream tasks through the black-box prompt tuning method described above. Meanwhile, under the assumption that the output features of the black-box model are accessible (Ouali et al., 2023), a lightweight adapter module can be directly learned from labeled few-shot samples.
Adapter modules (Houlsby et al., 2019;Gao et al., 2021;Zhang et al., 2022) have been proven to be effective in the adaptation of VL models.During the training process of the adapter, the gradients do not need to be back-propagated through the entire pre-trained model, making it possible to equip the adapter module with black box models of which only the output features are available.
The text features have been adapted in our method by tuning the learnable prompt. Thus, we introduce an adapter module only for the visual features to achieve a collaborative adaptation. Specifically, we add an adapter module to the output of the visual encoder of the pre-trained VL model. Access to computed image features and labels allows the adapter to be learned easily through direct supervised learning. During training, the visual adapter module and text prompts are optimized in turn to achieve a joint adaptation.
In our experiments, we try two simple but effective adapter designs, CLIP-Adapter (Gao et al., 2021) and Tip-Adapter (Zhang et al., 2022), both of which are well suited to manipulating image features for better adaptation.
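A forward pass of a lightweight visual adapter can be sketched as follows. The bottleneck width of one quarter of the feature dimension follows the implementation details reported in the paper; the residual blending ratio, the ReLU activations, the random initialization, and the class name `VisualAdapter` are assumptions made for illustration, loosely modeled on our understanding of CLIP-Adapter. Training code is omitted.

```python
import numpy as np

class VisualAdapter:
    """Two-layer bottleneck MLP over image features with a residual blend."""
    def __init__(self, dim=512, reduction=4, ratio=0.2, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        hidden = dim // reduction                  # a quarter of the feature dim
        self.w1 = rng.normal(scale=dim ** -0.5, size=(dim, hidden))
        self.w2 = rng.normal(scale=hidden ** -0.5, size=(hidden, dim))
        self.ratio = ratio                         # assumed blending weight

    def __call__(self, features):
        hidden = np.maximum(features @ self.w1, 0.0)    # ReLU
        adapted = np.maximum(hidden @ self.w2, 0.0)     # ReLU
        # Blend adapted features with the original pre-computed features.
        return self.ratio * adapted + (1.0 - self.ratio) * features

# Random stand-ins for pre-computed image features (hypothetical data).
features = np.random.default_rng(1).normal(size=(4, 512))
adapter = VisualAdapter()
adapted = adapter(features)
```

Because the adapter only consumes the output features, its weights can be trained with ordinary backpropagation without touching the backbone.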

Implementation Details
Datasets. We perform few-shot adaptation on black-box pre-trained CLIP (Radford et al., 2021) for image classification tasks following the general protocol in existing methods (Zhou et al., 2022b; Ouali et al., 2023; Oh et al., 2023). In particular, we adopt 11 commonly used datasets to evaluate our method, including ImageNet (Deng et al., 2009), Caltech101 (Li et al., 2004), OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback and Zisserman, 2008), Food101 (Bossard et al., 2014), FGVCAircraft (Maji et al., 2013), SUN397 (Xiao et al., 2010), UCF101 (Soomro et al., 2012), DTD (Cim-
Learnable Prompts. The learnable prompts are shared across all classes in the target dataset. By default, the length of the prompt is set to M = 1, which reduces the number of parameters in the learnable prompt. A small parameter optimization space helps maintain the quality of the estimated gradients with limited resources for exploration, resulting in effective tuning. The effect of different prompt sizes is analyzed in Sec. 4.4. To initialize prompts of different lengths, we use "a", "a photo", "a photo of a", and "a photo of a a photo of a" for M = 1, 2, 4, 8, respectively.
Adapter Module. Following CLIP-Adapter (Gao et al., 2021), our adapter module adopts a two-layer MLP that follows the pre-trained visual encoder. The input and output dimensions are the same as the dimension of the CLIP image feature, and the number of hidden units is a quarter of that dimension. Following Tip-Adapter (Zhang et al., 2022), we use the averaged feature of randomly augmented training images from 10 epochs as the initialization of the cache to construct the projection layer.
Training Details. We employ the official CLIP model to evaluate our proposed method. For a comprehensive comparison, we conduct experiments with different visual backbones, i.e., ResNet50 and ViT/B16. The query number q is set to 256 by default, and its effect is discussed in Sec. 4.4.
The hyperparameters b and β in Eq. (7) are set to D and 1/D, respectively, where D is the dimension of the parameter in the prompt.
From Table 1, our black-box prompt tuning method (ViT/B16 backbone) surpasses the previous work of Oh et al. (2023) by an average accuracy margin of 7.3% across 11 datasets, demonstrating the effectiveness of our black-box textual prompting for the adaptation of the VL model. Furthermore, when the context length of the prompt is fixed at M = 1, our black-box prompt tuning method performs comparably to the white-box prompt method, i.e., CoOp (1 ctx), with a slight difference of less than 2%. By assuming pre-computed features are available, LFA (Ouali et al., 2023) optimizes a projection layer in a multi-stage procedure as introduced in Section 2. We argue that end-to-end learning of adapter methods (Gao et al., 2021; Zhang et al., 2022) provides a much simpler route while still giving satisfactory performance. As shown in Table 1, optimizing the adapter module from CLIP-Adapter and Tip-Adapter achieves performance comparable to LFA. Thus, we integrate our black-box prompt tuning method with these more flexible adapter modules. From Table 1, the collaborative adaptation of black-box prompting and the adapter module brings remarkable performance and achieves a new state-of-the-art result.

Comparison with Black-Box Optimizers
Existing black-box prompt tuning methods have explored various effective optimization techniques for when the gradient is unavailable. Here we compare our method with two other optimization algorithms based on our implementation. In particular, the CMA-ES algorithm (Hansen et al., 2003) is considered state-of-the-art in evolutionary computation and was previously used to optimize the prompt for large language models (Sun et al., 2022b,a). SPSA-GC was proposed by BlackVIP (Oh et al., 2023) to learn a visual prompt for the adaptation of pre-trained CLIP.
For a fair comparison, we unify the number of API calls per iteration for all competitors to 10. This is achieved by setting the population size of CMA-ES to 10, setting the number of repeated two-sided estimations of SPSA-GC to 5, and setting the number of samplings of our perturbation-based gradient approximation to q = 10. The experiments are conducted on the CLIP ResNet50 model, and the prompt length is set to 1. All optimizers are trained for 750 iterations until convergence, and the results are listed in Table 2. From the table, our method outperforms the SPSA-GC algorithm, which is also based on gradient estimation. Although CMA-ES exhibits faster convergence, noticeable fluctuations are observed even in the later stages of training. Our perturbation-based gradient approximation method is more suitable for the adaptation of the VL model.

Ablation Study
Ablation studies are performed to evaluate the effect of various factors, including the number of queries, the prompt length, the number of few-shot samples, and the collaborative training schedule. The experiments are mainly conducted on the CLIP ResNet50 model.
Effect of the number of queries q. The number of samplings q controls how many times the black-box model is queried in each iteration. It has a significant impact on the number of API calls required for learning the prompt. Fig. 3 illustrates the adaptation performance with different q values. Generally, larger values of q yield more reliable gradients but also require more time and API calls for the black-box model. To trade off performance against computational cost, we use q = 256 for the results presented in Section 4.2.
Effect of prompt length. We further investigate the effect of the prompt length M. For comparison, all the experiments are conducted under 16-shot training data, with the same number of samplings (q = 256) and iterations. The results are illustrated in Fig. 4. One can see that the trend of performance on different tasks varies as the context length of the prompt changes. For white-box prompt tuning, longer prompts usually lead to better adaptation to downstream datasets, e.g., DTD and EuroSAT. However, blindly lengthening the context (e.g., M = 16) does not result in continuously rising performance. Increasing the length of the context brings little improvement for OxfordPets. We attribute these results to the varying degrees of data diversity among different tasks.
But in the case of black-box models, the behavior changes due to the influence of gradient approximation. Lengthening the context of the prompt brings trivial benefits and may even result in noticeable performance degradation. The expanded parameter space of a long context leads to practical difficulties in gradient estimation, so the optimization may reach a suboptimal result. Increasing the number of samplings q may improve the reliability of estimated gradients, but scaling up q in proportion to the size of the prompt leads to severe inefficiency. Thus, we use a prompt length of 1 as a trade-off.
Effect of the number of few-shot samples. The number of few-shot samples determines the amount of training data used to adapt the pre-trained VL model. To demonstrate its effect, we keep the default configuration and vary the number of samples used for prompt tuning. Both black-box and white-box models undergo the same number of iterations. As shown in Fig. 5, increasing the number of samples clearly leads to better adaptation results. Moreover, we observe that in extremely data-scarce scenarios with only one sample per class (1-shot), tuning the prompt based on the estimated gradient outperforms white-box tuning on all three datasets. One possible explanation is that optimizing with true gradients can lead to overfitting when the amount of data is too small. In contrast, gradient approximation provides a more robust optimization direction. As the amount of data increases, the advantages of direct white-box learning become more obvious.
Effect of the collaborative training schedule. In our experiments, the prompt and the adapter module are optimized jointly to maximize their collaborative performance. During training, we alternately update the prompt and the adapter module at different epochs. To assess the effectiveness of this joint optimization schedule, we conducted experiments using three different ways of training: (i) tuning the prompt until convergence and then optimizing the adapter module (P-A); (ii) tuning the adapter module until convergence and then optimizing the prompt (A-P); (iii) our collaborative training schedule (ALT). We train "Ours (CLIP-Adapter)" under the above three schedules, and the results are shown in Table 3. As shown in the table, alternately updating the prompt and the adapter (ALT) achieves superior collaborative adaptation performance, demonstrating its effectiveness.
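The three training schedules compared above can be written down as a small helper that decides which module is updated in each epoch. This is only a sketch under assumptions: the function name `training_schedule` is hypothetical, the alternation granularity of one epoch per module is assumed, and "until convergence" is approximated by splitting the epoch budget in half.

```python
def training_schedule(num_epochs, mode="ALT"):
    """Return, for each epoch, which module ("prompt" or "adapter") is updated."""
    half = num_epochs // 2
    if mode == "P-A":
        # Prompt first, then adapter (convergence approximated by a half split).
        return ["prompt"] * half + ["adapter"] * (num_epochs - half)
    if mode == "A-P":
        # Adapter first, then prompt.
        return ["adapter"] * half + ["prompt"] * (num_epochs - half)
    if mode == "ALT":
        # Alternate between the two modules; one epoch each is an assumption.
        return ["prompt" if epoch % 2 == 0 else "adapter"
                for epoch in range(num_epochs)]
    raise ValueError(f"unknown schedule: {mode}")

alt_plan = training_schedule(4, "ALT")
pa_plan = training_schedule(4, "P-A")
ap_plan = training_schedule(4, "A-P")
```

In a training loop, epochs marked "prompt" would use the estimated gradient to update the prompt, while epochs marked "adapter" would update the adapter weights by ordinary supervised learning on the output features.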

Conclusion
In this paper, we present CBBT, a black-box adaptation approach for VL models. We effectively tune a soft prompt for the text encoder by gradient approximation and jointly learn a lightweight adapter module to transfer the visual features of the pre-trained backbone. Equipped with the textual prompt and the visual adapter, our method achieves a collaborative adaptation for both modalities. Experiments on various datasets show that our CBBT performs favorably against the state-of-the-art methods.

Limitations
We optimize the prompt in the original high-dimensional prompt embedding space, which leads to unsatisfactory optimization results for prompts with a long context, as shown in Section 4.4. The high-dimensional parameter in the prompt also makes the gradient approximation more difficult.
We have tried to optimize the prompt in a smaller subspace following the approach in BBT (Sun et al., 2022b). However, the adaptation performance decreased considerably even when only a small proportion of the original dimensions was removed. The intrinsic dimensionality property (Aghajanyan et al., 2020; Qin et al., 2021) of vision-language pre-trained models needs further investigation.
Besides, we optimize a continuous prompt, which requires access to the token embedding layer of pre-trained models. Learning a discrete prompt for the adaptation of VL models is worthy of exploration, considering that a discrete text prompt provides an explicit explanation, and discrete text inputs are more suitable for invoking the latest pre-trained model APIs with natural language inputs and/or outputs.

A Generalization Ability of Black-Box Prompt
To evaluate the generalization ability of our method, we conducted experiments on the extensively evaluated domain shift benchmarks and base-to-new setting (training on samples from base classes, testing on samples from new classes) commonly used in studies for adaptation of CLIP.
Generalization to other domains. Following CoOp (Zhou et al., 2022b) and CoCoOp (Zhou et al., 2022a), we evaluate the transferability of the prompt learned from ImageNet to the three specially designed datasets. The results are shown in Table 4. Given the high variance inherent in these trials, the results are averaged over three random re-runs to ensure reliable comparisons.
Our prompt learned by black-box optimization performs better than CoOp by a clear margin. Moreover, compared to CoCoOp, which relies on input-conditioned prompts generated by a meta-network, our vanilla prompt demonstrates superior performance on two of the three benchmarks.
Generalization from base to new classes. Following CoCoOp (Zhou et al., 2022a), we split the classes of the target dataset into two sets. In the base-to-new setting, the methods are trained using data from base classes and tested separately on base and new classes to evaluate the generalization ability to classes unseen in training. The results are shown in Table 5.
While CoOp improves pre-trained CLIP on base classes, it fails badly on novel classes. CoCoOp optimizes for each instance to gain more generalization over an entire task. Our method achieves comparable results to CoCoOp by tuning a single prompt with the black-box optimizer. Optimizing the prompt by estimated gradient avoids the tendency to overfit to training samples, which accounts for the better generalization of our method relative to white-box prompt tuning.

B More Results with Longer Prompt
In Fig. 4 of our paper, we optimize prompts with different lengths under a fixed training time budget by setting the same number of samplings q = 256 for gradient approximation. Such a setting ensures training efficiency but may lead to suboptimal results, and hence a performance drop, for longer prompts. To demonstrate this, we have conducted experiments in which the value of q is scaled proportionately to the size of the prompt, and the results are reported in Table 6.
From the table, with sufficient training time available, proportionately scaling the samplings for tuning longer prompts achieves stable convergence and clear improvements (especially on EuroSAT). Nonetheless, our optimized prompts consistently outperform hand-crafted hard prompts of any length.

C Computational Time Budget
The added computation burden of our method compared to white-box prompting methods lies within the multiple samplings required by the gradient approximation.We provide the training duration linked to the tuning methods presented in Table 1 on the EuroSAT dataset in Table 7.All training procedures are conducted on a single 3090 GPU.We record the minutes used for complete training and divide the time by the number of trained epochs to ascertain the time per epoch.While the sampling process inevitably elongates the training period, the overall consumed time is acceptable.

D Analysis of the Error in Gradient Estimation
The upper bound of the error of gradient approximation is $4\|\nabla f(\theta)\|_2^2$ according to Eq. (9). It is a theoretical value obtained through multiple bounding steps in the proof. The actual estimation error of the gradient during training is much lower than the theoretical upper bound since the experiments are conducted on reasonably annotated datasets with pre-trained CLIP and properly initialized prompts. As the training proceeds, the value of the true gradient becomes small, making the error of the estimated gradient, bounded by the true gradient, become small simultaneously. Thus, the results of "Ours (w/o adapter)" are closely comparable to "CoOp (1 ctx)" in Table 1.

E Applying to Larger Black-Box Models
It is promising to apply our method to larger black-box models. In fact, there exist closed-source model APIs, e.g., GPT-3, that provide a feature extraction function. It is possible to adapt pre-trained models of this kind by transferring the extracted features. Additionally, inspired by recent discrete prompt tuning approaches in Maus et al. (2023) and Wen et al. (2023), it is practically feasible to discretize the learned prompts by projecting the continuous embedding to the discrete token space, in order to support a broader range of black-box models that only allow discrete input, e.g., ChatGPT and Bard. Our research will continue to explore more practical adaptation techniques for vision-language models.

Figure 1: Overview of our proposed method. We collaboratively optimize the textual prompt and the image feature adapter for the adaptation of black-box pre-trained VL models. The prompt is optimized by estimated gradients since backpropagation cannot be applied to the black-box model. The visual adapter module is learned by direct supervised learning given output features from the pre-trained model.

Figure 2: Trend of loss during training on EuroSAT. We adopt the Adam optimizer for expedited convergence and superior adaptation performance.

Figure 4: Ablation results of "Ours (w/o Adapter)" with different context lengths of the prompt.

Figure 5: Ablation results of "Ours (w/o Adapter)" with different quantities of few-shot training data.

Table 1: Few-shot adaptation performance on 11 image classification tasks. Black-box methods are indicated with gray shadows.

Table 2: Comparison of different black-box optimizers.

Table 3: Ablation study on the training schedule.

Table 4: Comparison of manual and learned prompts in domain generalization. The prompts are learned on 16-shot data from ImageNet.

Table 6: More results with longer prompts and varying samplings q. "ctx" denotes the length of the prompt.

Table 7: Comparison of training time budget.