Prompt Tuning for Unified Multimodal Pretrained Models



Introduction
Recent years have witnessed the great success of large-scale pretraining based on large models and big data in natural language processing (NLP) (Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020) and computer vision (Chen et al., 2020b,a,c; Chen and He, 2021; Bao et al., 2021; He et al., 2021b). Inspired by the success of BERT-like models (Devlin et al., 2019), researchers have found that pretraining can level up the downstream performance of cross-modal representation learning algorithms by a large margin (Chen et al., 2020d; Lu et al., 2019; Su et al., 2020; Tan and Bansal, 2019; Wang et al., 2021).
Following this line of research, unified multimodal pretrained models have gradually attracted much attention, and very recently, a series of such models based on the sequence-to-sequence learning framework have unified both cross-modal understanding and generation tasks and even achieved state-of-the-art performance (Li et al., 2022; Wang et al., 2022a; Yu et al., 2022; Alayrac et al., 2022; Wang et al., 2022b; Chen et al., 2022). Furthermore, the scale of unified multimodal pretrained models has been growing rapidly, mirroring the trend of development in large language models (Raffel et al., 2020; Brown et al., 2020; Chowdhery et al., 2022).
Despite the great success of large-scale pretrained models across multiple domains, training such models requires large computation costs. Conventional finetuning, though effective in achieving high performance, suffers from low training efficiency, especially when the pretrained model is large. There is a strong necessity for parameter-efficient transfer learning methods in the applications of large-scale foundation models. The most popular method in this field is prompt tuning (Liu et al., 2021a), which has demonstrated success in natural language processing (Li and Liang, 2021; Liu et al., 2021c; Lester et al., 2021; Liu et al., 2021b; He et al., 2021a; Gu et al., 2022) and computer vision (Jia et al., 2022; Du et al., 2022; Zhou et al., 2021, 2022). In comparison with finetuning, prompt tuning updates only a trivial proportion of parameters (e.g., 1%): it freezes most parameters of the pretrained model and only tunes several prompt embeddings, as well as the output layer if necessary. Recent advances have shown that prompt tuning can help pretrained models achieve performance comparable with finetuning across different downstream tasks, including natural language understanding and generation, image classification, etc. However, studies on parameter-efficient transfer methods for multimodal pretrained models, especially unified multimodal pretrained models, are still scarce. Furthermore, along with the trend of model scaling in unified multimodal pretrained models, how to tune such models cost-effectively should be a significant topic of research in multimodal pretraining.
This work fills in the void and takes the lead in exploring prompt tuning for unified multimodal pretrained models. We propose OFA-PT, an implementation of prompt tuning based on the recently open-sourced unified multimodal pretrained model OFA (Wang et al., 2022a). To be more specific, in the stage of downstream transfer, we insert a sequence of learnable embeddings into each layer of the encoder and decoder, and only tune those embeddings while keeping the parameters of the pretrained model frozen. For the rest of the setup, we follow the same procedure as finetuning, transforming the data into the sequence-to-sequence format and training the model with maximum likelihood estimation. In comparison with finetuning, the number of tunable parameters for prompt tuning (around 1% of the total) is much smaller, leading to lower computation costs, e.g., memory.
Through extensive experiments, we observe that parameter-efficient prompt tuning helps the pretrained model achieve performance comparable with finetuning across 4 multimodal downstream tasks, spanning from understanding to generation. To analyze the differences between finetuning and prompt tuning, we follow the assumption that prompt tuning, with most parameters of the pretrained model frozen, should induce model robustness. We experiment on both tuning methods with adversarial attack and observe phenomena consistent with this hypothesis. To take a step further, this study delves into the implementation details and investigates whether experimental factors, e.g., the prompt length, prompt depth, and reparameterization, can saliently influence the final downstream performance. We find that in general a longer prompt (more than 20 tokens) is preferable, and our experiments show that a length of 64 should be favored in most cases, as an even longer prompt sequence not only increases the computation costs but may also incur performance degradation. Also, we show that reparameterization with additional trainable parameters does not introduce significant improvements in downstream performance.

Method
This section introduces the details of our proposed method, namely the implementation of prompt tuning on a unified multimodal pretrained model. The overall framework is illustrated in Figure 1.

Preliminaries
We select the unified sequence-to-sequence framework as it unifies understanding and generation tasks, and we specifically implement prompt tuning on the recently open-sourced state-of-the-art model OFA* (Wang et al., 2022a). In brief, it is built on a Transformer-based (Vaswani et al., 2017) encoder-decoder framework.
Both the encoder and decoder consist of Transformer layers. To be more specific, an encoder layer consists of a multi-head self-attention and a point-wise Feed-Forward Network (FFN). To build a connection between the encoder and decoder, the Transformer decoder layer additionally contains a cross-attention module in comparison with the encoder layer. The cross-attention is essentially multi-head attention where the keys K and values V are transformations of the encoder output states rather than of the decoder inputs. Such an architecture can handle tasks whose inputs can be cast into the sequence-to-sequence format.
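To make the role of cross-attention concrete, the following minimal PyTorch sketch shows a decoder-side attention module whose keys and values are projections of the encoder output states; the class and argument names are illustrative rather than taken from the OFA codebase.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Multi-head attention in which K and V come from the encoder outputs."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads)

    def forward(self, decoder_states: torch.Tensor, encoder_states: torch.Tensor) -> torch.Tensor:
        # Shapes follow the PyTorch default: (seq_len, batch, hidden).
        # Queries come from the decoder; keys/values are transformations of
        # the encoder output states, which connects the two modules.
        out, _ = self.attn(query=decoder_states, key=encoder_states, value=encoder_states)
        return out
```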
In this work, we focus on prompt tuning for the transfer of the multimodal pretrained model. We leave prompt learning in the pretraining stage to future work.

Prompt Tuning for Multimodal Pretrained Models

In the following, we introduce the implementation details of prompt tuning on the sequence-to-sequence multimodal pretrained model. Note that our method can extend to other generative multimodal pretrained models, e.g., BERT-like models.
Basic Implementation We focus on implementing prefix tuning (Li and Liang, 2021; Liu et al., 2021b) based on its outstanding performance in both natural language understanding and generation. In comparison with other prompt tuning methods, e.g., P-Tuning (Liu et al., 2021c), Prompt Tuning (Lester et al., 2021), and PPT (Gu et al., 2022), adding soft prompt embeddings to each layer demonstrates enhanced training stability and improved downstream performance even on relatively small models. Specifically, for the encoder and decoder, we add tunable prompt embeddings to each layer. Formally, we denote the pretrained model by a function M(·) and the generator of the prompt embeddings by G(·).
The formulation is demonstrated below:

$$y = M\left(x;\ \{p^{(i)}\}_{i=1}^{L}\right), \qquad p^{(i)} = G(i) \in \mathbb{R}^{l \times h},$$

where x refers to the multimodal inputs, L refers to the number of layers, and l refers to the prompt length, which is predefined as a hyperparameter. At each layer, we prefix the soft prompt embeddings p^(i) to the input hidden states h^(i), i.e., h^(i) ← [p^(i); h^(i)]. Note that we only prefix prompt embeddings at Transformer layers. In the simplest practice, the prompt generator G is a sparse embedding matrix in R^{L×l×h}, and we select the corresponding embedding at the j-th position of the i-th layer as the prompt embedding. Below we provide an illustration of some more complex implementations, which we compare in this study.
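As a concrete illustration of the simplest implementation described above, the sketch below keeps a learnable lookup table of shape L x l x h and prefixes the selected prompt embeddings to the hidden states at each layer; all names are hypothetical and the snippet is not taken from the OFA codebase.

```python
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """Simplest prompt generator G: a lookup table of shape (L, l, h).

    L = number of Transformer layers, l = prompt length, h = hidden size.
    """

    def __init__(self, num_layers: int, prompt_length: int, hidden_size: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_layers, prompt_length, hidden_size) * 0.02)

    def forward(self, layer_idx: int, batch_size: int) -> torch.Tensor:
        # Select the prompt embeddings for this layer and expand over the batch.
        p = self.prompts[layer_idx]                        # (l, h)
        return p.unsqueeze(0).expand(batch_size, -1, -1)   # (B, l, h)


def prefix_prompts(hidden_states: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
    """Prefix soft prompts p^(i) to the layer input h^(i)."""
    return torch.cat([prompts, hidden_states], dim=1)      # (B, l + seq_len, h)
```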
In the downstream tuning process, we only tune the newly added prompt embeddings at each layer and keep the parameters of the large pretrained model frozen. Therefore, as only a small proportion of parameters needs to be updated, e.g., 1%, the computation costs, e.g., memory, are far lower than those of finetuning.
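A minimal sketch of how this freezing step might be realized in PyTorch, assuming the prompt parameters can be identified by a naming convention (the substring "prompt" here is a hypothetical choice):

```python
import torch.nn as nn

def mark_only_prompts_trainable(model: nn.Module, prompt_keyword: str = "prompt") -> None:
    """Freeze the pretrained weights; keep only prompt parameters trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = prompt_keyword in name

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable} / {total} ({trainable / total:.2%})")
```

The optimizer is then built only over the parameters that still require gradients.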
Reparameterization Besides the simplest implementation of adding a sparse embedding matrix at each layer, a more complex alternative is to add an encoder, e.g., an MLP, to reparameterize the prompt embeddings. We also investigate the influence of reparameterization in this context.
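A sketch of what such a reparameterization might look like, assuming a shared low-dimensional embedding mapped to per-layer prompts by a two-layer MLP (sizes and names are illustrative, not the exact configuration used in our experiments):

```python
import torch
import torch.nn as nn

class ReparameterizedPrompt(nn.Module):
    """Generate prompts through an MLP instead of using raw embeddings directly."""

    def __init__(self, num_layers: int, prompt_length: int, hidden_size: int, bottleneck: int = 512):
        super().__init__()
        self.embed = nn.Parameter(torch.randn(num_layers, prompt_length, bottleneck) * 0.02)
        self.mlp = nn.Sequential(
            nn.Linear(bottleneck, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, hidden_size),
        )

    def forward(self, layer_idx: int) -> torch.Tensor:
        # Map the low-dimensional embedding of this layer to prompt embeddings of size (l, h).
        return self.mlp(self.embed[layer_idx])
```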
Prompt Length Similar to previous studies (Li and Liang, 2021; Liu et al., 2021b), we find that the length of the prompt embeddings makes a great difference across downstream tasks. In this study, we investigate how this factor influences model performance on different downstream tasks.
Prompt Depth To investigate the impact of where prompt embeddings are inserted, we delve into the issue of prompt depth. Specifically, we simplify it to adding prompt embeddings to the encoder only, to the decoder only, or to both modules.
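The three settings can be parameterized with a small configuration object; the sketch below uses hypothetical names and simply decides, per layer, whether to prefix prompts.

```python
from dataclasses import dataclass

@dataclass
class PromptDepthConfig:
    """Controls where prompts are inserted (encoder only, decoder only, or both)."""
    use_encoder_prompts: bool = True   # insert prompts at every encoder layer
    use_decoder_prompts: bool = True   # insert prompts at every decoder layer

def should_insert_prompt(cfg: PromptDepthConfig, is_decoder_layer: bool) -> bool:
    """Per-layer switch matching the three settings compared in this study."""
    return cfg.use_decoder_prompts if is_decoder_layer else cfg.use_encoder_prompts
```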

Experiments
To validate the effectiveness of prompt tuning for multimodal pretrained models, we conduct experiments on 4 cross-modal tasks. Specifically, we experiment on cross-modal generation tasks, including referring expression comprehension and image captioning, and cross-modal understanding tasks, including visual entailment and visual question answering (VQA). We use the commonly used base-size and large-size models for the experiments, whose sizes are around 180M and 470M parameters respectively. We provide more details about the experimental setups in Appendix A.1.

Datasets & Metrics
Referring Expression Comprehension We conduct experiments on the 3 subtasks of referring expression comprehension, namely RefCOCO, RefCOCO+, and RefCOCOg (Yu et al., 2016; Mao et al., 2016). This task requires the model to generate a correct bounding box that answers the given text query on a provided image. We use Acc@0.5 as the evaluation metric.
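For reference, Acc@0.5 counts a predicted box as correct when its IoU with the ground-truth box is at least 0.5; a minimal implementation might look as follows (boxes in (x1, y1, x2, y2) format are assumed).

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)


def acc_at_05(predictions, ground_truths):
    """Acc@0.5: fraction of predicted boxes whose IoU with the ground truth is >= 0.5."""
    hits = sum(box_iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / max(len(predictions), 1)
```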
Image Captioning We evaluate the image captioning capability of our method on the Microsoft COCO Image Captioning dataset (Chen et al., 2015). In this task, the model should generate a description that corresponds to the information of the given image. We use BLEU@4 (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016) as the evaluation metrics.
Visual Entailment To evaluate the performance on visual entailment, we conduct experiments on SNLI-VE (Xie et al., 2019). Given an image and a text, the model should determine their relation, i.e., entailment, contradiction, or neutrality. We follow the setup in Wang et al. (2022a) and add the given premise to the input. We use accuracy as the evaluation metric.
VQA We conduct our experiments on VQA 2.0 (Antol et al., 2015; Goyal et al., 2017). This task requires the model to generate the correct answer based on an image and a question about certain information in the image. Following Wang et al. (2022a), we use all-candidate evaluation, which requires the model to generate a probability for each candidate among the 3,129 most frequent answers. We use accuracy as the evaluation metric.
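A sketch of the all-candidate protocol: each candidate answer is scored by the model and the highest-scoring one is taken as the prediction. The scoring interface `log_prob_fn` is a hypothetical stand-in for the model's sequence log-probability, not an actual OFA API.

```python
import torch

@torch.no_grad()
def rank_candidates(log_prob_fn, image, question, candidates):
    """Return the highest-scoring candidate answer and its normalized probability."""
    scores = torch.tensor([log_prob_fn(image, question, answer) for answer in candidates])
    probs = scores.softmax(dim=-1)          # probability over the candidate set
    best = int(scores.argmax())
    return candidates[best], probs[best].item()
```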

Experimental Results
Below we provide detailed experimental results, including the comparison of prompt tuning with finetuning, as well as with other parameter-efficient tuning methods.

Comparison with Finetuning
We demonstrate the experimental results on the 4 tasks in Table 1 and Table 2. In general, for the base-size model, OFA-PT underperforms the original finetuned OFA by significant margins, but for the large-size model, OFA-PT is able to achieve comparable performance. To be more specific, in the evaluation of referring expression comprehension, for the base-size model, prompt tuning significantly underperforms finetuning, lagging behind by a large margin of 5.64 on average across RefCOCO, RefCOCO+, and RefCOCOg, but for the large-size model, prompt tuning only slightly underperforms finetuning by a small margin of 0.59. In the evaluation of image captioning, for the base-size model, OFA-PT underperforms the finetuned OFA by a margin of 4.0, but for the large-size model, the performance gap is only 0.8. In the evaluation of visual entailment, the gap between the two methods is smaller, around 0.17. In the evaluation of VQA, the performance gap between prompt tuning and finetuning is 3.63 for the base-size model and 2.17 for the large-size model on the test-std set. Different from the other tasks, even in the experiments on the large-size model, the gap is still significant. We hypothesize that it is still necessary to search for a better hyperparameter setup for this task due to the sensitivity of prompt tuning to hyperparameters.
Comparison with Other Parameter-Efficient Tuning Methods We additionally compare with two parameter-efficient tuning methods, namely Adapter (Houlsby et al., 2019) and BitFit (Zaken et al., 2022), to test whether prompt tuning is the best solution for lightweight transfer. Tables 3 and 4 demonstrate the results of the different lightweight tuning methods on the aforementioned datasets. In all the downstream tasks, OFA-PT surpasses the performance of OFA with Adapter or BitFit. The results reflect the advantage of the simple yet effective prompt tuning over the other parameter-efficient tuning baselines. We suppose that changing biases and adding intermediate layers might conflict with the complex architectural designs of the unified multimodal pretrained model, whereas the simple prepended learnable prefixes have separate components, e.g., weights, positional embeddings, etc., which can result in easier training with less human effort on hyperparameter tuning.

Analyses
In this section, we move on to analyzing prompt tuning in multimodal pretraining. Specifically, we examine the robustness of prompt tuning based on the assumption that keeping most parameters of the pretrained model frozen should lead to improved robustness to adversarial attack. Also, we evaluate how different setups of prompt tuning, namely the prompt length, prompt depth, and reparameterization, influence the downstream performance, and try to provide a recommended setup for consistently better performance.
Robustness Analysis To test whether the multimodal pretrained model with prompt tuning for downstream transfer is robust, we conduct adversarial attack experiments. Adversarial attack was first proposed in computer vision, where it revealed the vulnerability of deep learning models. The most common adversarial attack methods in computer vision are gradient-based, such as FGSM (Goodfellow et al., 2014), PGD (Madry et al., 2017), MIM (Dong et al., 2017), and SI (Lin et al., 2019), and most typical unimodal adversarial attacks on downstream tasks are likewise gradient-based. Among them, we select FGSM, which requires only one step of gradient computation on the text and image embeddings (a minimal sketch is given after the prompt-length analysis below). Experimental results are demonstrated in Figure 2. OFA-PT consistently demonstrates better robustness than the finetuned OFA across all tasks. This confirms our hypothesis and also shows a significant advantage of prompt tuning that is not reflected in the standard evaluation. In practice, if model vulnerability is an issue that matters, we recommend the application of prompt tuning or the robust prefix tuning framework (Yang and Liu, 2022).

Prompt Length To investigate how the prompt length influences downstream performance, we evaluate prompt tuning on the downstream tasks with a prompt length selected from {10, 16, 30, 64, 100, 120}. As shown in Figure 3, the general tendency is that a longer prompt with more parameters to tune encourages improvements in downstream performance across the tasks. However, we observe diminishing marginal utility, and a prompt that is too long may even negatively impact the performance. Although the best prompt length differs across tasks, we empirically advise that a length of 64 tokens achieves better performance on average. See Appendix A.2 for more details.
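The FGSM perturbation used in the robustness analysis above can be sketched as a single signed-gradient step on the input embeddings; the function below is a generic illustration, not the exact attack configuration used in our experiments.

```python
import torch

def fgsm_perturb(embeddings: torch.Tensor, loss: torch.Tensor, epsilon: float = 1e-3) -> torch.Tensor:
    """One-step FGSM on (text or image) input embeddings.

    `embeddings` must require gradients and `loss` must be computed from them;
    the returned tensor is the adversarially perturbed copy.
    """
    grad, = torch.autograd.grad(loss, embeddings)
    return (embeddings + epsilon * grad.sign()).detach()
```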
Prompt Depth As we base our implementation on the encoder-decoder model, we intuitively assume that where to insert prompt embeddings affects the performance. To simplify this issue, in our practice, we evaluate the performance of inserting prompts into the encoder only, into the decoder only, or into both the encoder and decoder. Experimental results are demonstrated in Tables 5 and 6. We find that it is best to insert prompts into every layer of the whole Transformer model, though compared with the other alternatives it is less computation-efficient. In the comparison between insertion into the encoder only and into the decoder only, we observe that the former leads to significantly better results across multiple downstream tasks. This suggests that the insertion of prompts into the bottom layers might contribute more to the success of downstream transfer.
Figure 3: Analysis of prompt lengths on multimodal downstream tasks. We observe that increasing the prompt length can generally bring performance improvements, yet this does not extend to all scenarios, and the gains may saturate. Based on the experimental results, we recommend a prompt length of 64, as it helps the model achieve the best average results across tasks.

Reparameterization Empirically, directly updating the trainable embeddings leads to unstable optimization and a slight drop in performance.
Prior work usually leveraged an encoder, e.g., an MLP (Li and Liang, 2021), to reparameterize the trainable embeddings. We evaluate the effects of reparameterization and demonstrate the experimental results in Tables 7 and 8. For generation tasks, e.g., RefCOCO and RefCOCOg, the MLP brings consistent improvements. For understanding tasks, e.g., SNLI-VE and VQA, the MLP leads to relatively negative impacts. Thus we cannot conclude which option is preferable. To achieve better performance on a specific dataset, it is still necessary to try both methods.

Related Work
In this section, we review related work on multimodal pretraining as well as prompt tuning.

Multimodal Pretraining
The rise of vision & language pretraining started from the transfer of BERT (Devlin et al., 2019) to cross-modal representation learning. A series of studies (Lu et al., 2019; Su et al., 2020; Tan and Bansal, 2019; Chen et al., 2020d; Li et al., 2019) introduced BERT to multimodal pretraining.
The key idea of such transfer is that the powerful Transformer model can handle visual and linguistic information simultaneously. Taking a step forward, recent studies have turned their focus to the encoder-decoder framework, which is suitable for both cross-modal understanding and generation, and a series of encoder-decoder-based models or similar models capable of sequence-to-sequence learning (Dong et al., 2019) have achieved new state-of-the-art performance across the downstream tasks (Wang et al., 2021; Li et al., 2022; Wang et al., 2022a; Yu et al., 2022; Wang et al., 2022b; Chen et al., 2022). Furthermore, these recent state-of-the-art models have unified different tasks concerning multiple modality combinations into a single framework and pretrained model. Also, as in large language models, consistently scaling unified multimodal pretrained models has led to predictable performance improvements (Wang et al., 2022a,b; Chen et al., 2022). This indicates that prompt tuning should be a perfect fit for the recent unified multimodal pretrained models, as it can unleash the power of large-scale pretrained models at lower computation costs than conventional finetuning.

Prompt-based Learning
Brown et al. (2020) illustrated that large-scale pretrained models can learn from the context and perform few-shot and zero-shot learning with prompts consisting of a task instruction or a few task examples. This new paradigm drew researchers' attention to how to leverage pretrained models without tuning all their parameters, which is expensive in computation costs. Instead of using handcrafted hard prompts, Li and Liang (2021) demonstrated that only tuning soft prompt embeddings at each layer is sufficient for the pretrained model to achieve competitive performance in natural language generation, and later a number of studies showed that prompt tuning can be especially effective in low-resource scenarios (Liu et al., 2021c; Gu et al., 2022; Sun et al., 2022b) and can even achieve performance comparable with finetuning (Lester et al., 2021; Liu et al., 2021b). Following this trend, a series of modifications to prompts and adapters (Hu et al., 2022; He et al., 2021a; Jiang et al., 2022; Sun et al., 2022a) for improved performance or training efficiency have emerged, making prompt tuning a heated topic in the whole NLP community.
Recent prompt tuning methods for multimodal pretrained models mostly serve CLIP-like models (Zhou et al., 2021, 2022; Rao et al., 2021). Similarly, researchers have tried to incorporate adapters into CLIP and also achieved satisfactory performance (Gao et al., 2021; Zhang et al., 2021). Besides prompt tuning for CLIP-like models, another line of work explored visual prompts for frozen language models. Tsimpoukelli et al. (2021) showed that, given a powerful large pretrained language model, a visual encoder for prompt tuning is sufficient for multimodal few-shot learning. Taking a step forward, Alayrac et al. (2022) proposed Flamingo, a colossal multimodal model that enables in-context learning. It achieves state-of-the-art performance on a series of cross-modal downstream tasks in both few-shot and full-shot learning scenarios. Such tremendous success indicates the strong potential of prompt tuning in multimodal pretraining.

Conclusion
In this work, we explore prompt tuning for unified multimodal pretrained models. Specifically, we propose OFA-PT, an implementation of prefix tuning, a simple but effective prompt tuning method, on the recently open-sourced SoTA model OFA. Through extensive experiments, we demonstrate that the unified multimodal pretrained model with parameter-efficient prompt tuning can achieve performance comparable with the finetuned model, but with far fewer parameters to tune (e.g., 1%), and that prompt tuning can surpass other lightweight tuning methods, e.g., Adapter and BitFit. Through our analysis, we identify a significant advantage of prompt tuning, namely its robustness against adversarial attack. Furthermore, we provide a comprehensive analysis of the influence of prompt tuning setups, including the prompt length, prompt depth, and reparameterization. Prompt tuning can potentially be an alternative to finetuning, but there are still some salient limitations in this method, e.g., slow convergence and training instabilities. We hope that future studies in this field can alleviate the aforementioned problems and thus promote the application of prompt tuning.

Limitations
This section discusses the limitations of prompt tuning for the unified multimodal pretrained models and points out some directions for future work.
One limitation of prompt tuning in this setup is the sensitivity to hyperparameter tuning. It is difficult to search for a suitable hyperparameter setup, and experience with hyperparameter tuning for finetuning does not transfer to prompt tuning. Fortunately, we find that prompt tuning for generative multimodal pretrained models is not as sensitive to hyperparameters as prompt tuning for pretrained language models. We provide details of the hyperparameter setups in Appendix A.1.
Another limitation of prompt tuning in this setup is slow convergence. Though prompt tuning has noticeable advantages in per-step training efficiency, it takes at least 40 epochs for prompt tuning to reach nearly its best performance on some datasets (e.g., RefCOCO). A larger number of training epochs may incur more computation costs, even with the per-step advantage over finetuning. We demonstrate more details in Appendix A.2. This indicates that finding a better solution for fast and stable convergence is also important, besides reaching performance comparable with or even better than conventional finetuning.
Despite the aforementioned limitations, prompt tuning demonstrates significantly better robustness against adversarial attack.In the future, we should pay more attention to this merit and find ways to leverage it.

Ethics Statement
We base our method on an existing multimodal pretrained model, which is capable of vision-language understanding and generation. Thus, there exist potential risks in AI-generated contents. Additionally, as our method finetunes only a small proportion of the parameters of the pretrained model, we lack control over the output model, which may generate harmful contents. These risks may be attributed to noise in the pretraining data. In future research, it is essential to study how to increase the controllability of generation when most parameters of the output model originate from the pretrained model.

A Appendix
A.1 Experimental Setups

Referring Expression Comprehension Referring expression comprehension requires models to locate an image region described by a language query. We perform experiments on RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), and RefCOCOg (Mao et al., 2016). We report the standard metric Acc@0.5 on the validation and test sets. For prompt tuning, the batch size is set to 128, the learning rate is set to 0.03, and the prompt length varies from 10 to 120. For Adapter, the batch size is set to 128 and the learning rate is set to 5e-5. For BitFit, the batch size is set to 128 and the learning rate is set to 0.001.

Visual Entailment Visual entailment requires the model to evaluate the semantic relation between the given image and text, i.e., entailment, neutrality, or contradiction. We perform experiments on the SNLI-VE (Xie et al., 2019) dataset. We report accuracy on both the dev and test sets. The model is tuned with a learning rate of 0.03 and a batch size of 128. The prompt length varies from 10 to 120. For Adapter, the batch size is set to 128 and the learning rate is set to 5e-5. For BitFit, the batch size is set to 128 and the learning rate is set to 0.001.

Image Captioning Image captioning is a standard vision & language task that requires models to generate an appropriate and fluent caption for an image. We report BLEU@4 (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016) scores on the Karpathy test split. We tune the model with a learning rate of 0.03, a batch size of 256, and a prompt length varying from 10 to 120. For Adapter, the batch size is set to 128 and the learning rate is set to 5e-5. For BitFit, the batch size is set to 128 and the learning rate is set to 0.001.
We only finetune the model with cross-entropy loss, without further CIDEr optimization.
Visual Question Answering Visual question answering (Antol et al., 2015; Goyal et al., 2017) is a cross-modal task that requires the model to answer a question given an image. We conduct experiments on VQA 2.0 and report the score on the test-std set. For prompt tuning, the batch size is set to 256 and the learning rate is set to 0.03. Exponential Moving Average (EMA) with a decay rate of 0.9999 is employed during tuning. The prompt length varies from 10 to 120. For Adapter, the batch size is set to 128 and the learning rate is set to 5e-5. For BitFit, the batch size is set to 128 and the learning rate is set to 0.001.
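For completeness, the exponential moving average mentioned above maintains a shadow copy of the model weights; a minimal sketch, assuming the standard formulation rather than the exact implementation used in our runs, is given below.

```python
import copy
import torch

class EMA:
    """Exponential moving average of model parameters (e.g., decay 0.9999)."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * current
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```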

A.2 Additional Experimental Results
In this section, we provide more experimental results for a comprehensive understanding of the performance of prompt tuning. Below we summarize the detailed performance of prompt tuning on the downstream tasks under different prompt lengths. See Table 10. A prompt length of 64 achieves the best average performance across the downstream tasks.
To evaluate the training efficiency of different methods, we experiment on OFA models of different sizes, spanning from 93M to 930M parameters. Figure 4 compares their training efficiency by measuring the time spent processing 100 samples. We find that prompt tuning consistently outperforms finetuning in training efficiency. For the huge-size model, it runs around 2 times faster than finetuning. However, based on our observation, the advantage in training efficiency does not necessarily lead to lower overall computation resource consumption. Table 9 lists the detailed computation resource consumption of both finetuning and prompt tuning, measured as the GPU hours spent on each task. We find that for image captioning and VQA, prompt tuning consumes fewer resources, but for the other tasks it adversely consumes more. This reflects that for tasks similar to the pretraining tasks, especially those with more data in the pretraining stage, prompt tuning can outperform finetuning in overall cost, but for others, prompt tuning even incurs a larger carbon footprint. This indicates that the real computation resource consumption of downstream transfer is an important issue in the field of prompt tuning, and solving it can further the development of its applications.

A.3 Experimental Configuration
The experiments are conducted on Linux servers equipped with an Intel(R) Xeon(R) Platinum CPU @ 2.90GHz, 1024GB RAM, and 8 NVIDIA A100-80GB GPUs. We run our experiments on 32 A100 GPUs. All models are implemented in PyTorch 1.8.1 and Python 3.7.4.

Figure 1 :
Figure 1: Model overview. An illustration of our multimodal prompt tuning architecture. Specifically, for the encoder and decoder, we add tunable prompt embeddings to each layer.

Figure 4 :
Figure 4: Efficiency of different tuning methods. We report the time spent per 100 samples for finetuning and prompt tuning on RefCOCO.

Table 2 :
Experimental results of methods on multimodal understanding benchmark datasets, SNLI-VE and VQA.

Table 3 :
Evaluation of different parameter-efficient tuning methods using large-size models on multimodal generation tasks. We find that OFA-PT can generally outperform OFA with BitFit and Adapter.

Table 4 :
Evaluation of different parameter-efficient tuning methods using large-size models on multimodal understanding tasks. OFA-PT outperforms the baselines significantly.

Table 5 :
Experimental results on adversarial attack using large-size models. We discover that under adversarial attack, prompt tuning suffers from lower performance degradation across the tasks. Evaluation of different prompt insertion methods on multimodal understanding tasks. We specifically evaluate the performance of prompt tuning with prompts inserted into the encoder only, the decoder only, or both the encoder and decoder.

Table 7 :
Ablation study results of multimodal generation tasks on reparameterization using large-size models.

Table 8 :
Ablation study results of multimodal understanding tasks on reparameterization using large-size models.

Table 10 :
Average performance of prompt tuning on the downstream tasks with different prompt lengths.