Deeply Coupled Cross-Modal Prompt Learning

Recent advancements in multimodal foundation models (e.g., CLIP) have excelled in zero-shot generalization. Prompt tuning, which transfers knowledge from foundation models to downstream tasks, has gained significant attention recently. Existing prompt-tuning methods in cross-modal learning, however, either focus solely on the language branch or learn the vision-language interaction through a shallow mechanism. In this context, we propose a Deeply coupled Cross-modal Prompt learning (DCP) method based on CLIP. DCP flexibly accommodates the interplay between vision and language with a Cross-Modal Prompt Attention (CMPA) mechanism, which enables the mutual exchange of the respective representations through a well-connected multi-head attention module, progressively and strongly. We then conduct comprehensive few-shot learning experiments on 11 image classification datasets and also analyze the robustness to domain shift. Thorough experimental analysis demonstrates the strong few-shot generalization and domain adaptation capacity of a well-executed DCP. The code can be found at https://github.com/GingL/CMPA.


Introduction
Large foundation models pre-trained on web-scale image-text pairs, such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), have shown promising performance on zero-shot image classification. Research has repeatedly shown that the general knowledge learned by the foundation models can also be transferred to diverse downstream tasks, such as few-shot image classification (Zhou et al., 2022b,a), visual grounding (Subramanian et al., 2022), visual question answering (Liu et al., 2022) and so on. They have exhibited significant potential in open-vocabulary scenarios. Thus, the challenge of how to efficiently and effectively adapt large pre-trained models to downstream tasks has garnered increasing attention, especially in low-resource training scenarios.
+ Work was done during internship at SenseTime Research. * Corresponding author.
Directly fine-tuning the foundation model is infeasible due to the massive number of trainable parameters and the catastrophic forgetting caused by overfitting (Kirkpatrick et al., 2016). In contrast, the parameter-efficient prompt tuning approach explored in natural language processing has yielded significant success (Lester et al., 2021), leading to an increased examination of this technique within the realm of multi-modality, especially in the language branch of CLIP. For example, CoOp (Zhou et al., 2022b) and ProDA (Lu et al., 2022b) explore vanilla few-shot learning based on CLIP by adjusting the embedding or distribution of the text prompt. CoCoOp (Zhou et al., 2022a) and ProGrad (Zhu et al., 2022) focus more on unseen classes. They either contextualize the text prompt under the supervision of visual clues or tweak the gradient direction to improve the generalization ability of the model.
The aforementioned approaches, however, only adjust the text embedding of CLIP and neglect the visual branch. The success of VPT (Jia et al., 2022) demonstrates the effectiveness of visual prompt learning. Inspired by this work, UPT (Zang et al., 2022) and MaPLe (Khattak et al., 2022) synergize the visual and textual prompts. Specifically, UPT improves the few-shot learning ability by generating visual and text prompts initially. MaPLe achieves better performance in the classification of unseen classes. They uncover the underlying rationale and limitations of dual-branch prompt tuning.
Concretely, the dual-branch CLIP learns the visual and language synergy based only on contrastive learning, and the two branches lack mutual communication at the early stages of the network. Multi-modal prompt learning techniques, such as MaPLe and UPT, incorporate language-vision interactions into the network and achieve substantially improved performance, highlighting the significance of cross-modal interactions. However, previous studies have leveraged language-vision interactions only at a superficial level. For example, UPT generates visual and text prompts before they are fed into the corresponding encoders. MaPLe generates visual prompts conditioned on their language counterparts by a mapping function. Many studies (Dosovitskiy et al., 2021; Wang et al., 2022a) have shown that neural networks, especially transformer-based models, can leverage the deep fusion of information from multiple views to improve their performance. Such deep fusion remains less explored in multi-modal few-shot learning. To this end, we design Deeply coupled Cross-modal Prompt learning (DCP) to enhance the language-vision interaction. Specifically, DCP is built upon CLIP, with additional text and visual prompts across multiple layers. Different from previous methods with deep prompt tuning (Jia et al., 2022; Zang et al., 2022; Khattak et al., 2022), DCP only initializes the first layer of visual and text prompts randomly. The subsequent prompts are generated by the Cross-Modal Prompt Attention (CMPA) module, which integrates the prompts from the preceding cross-modal layer. CMPA is characterized by stronger connections in two respects: Depth and Breadth.
1) Depth means that CMPA intensifies the correlation of the prompts among different layers. 2) Breadth means that CMPA amplifies the interaction between the visual and language modalities. CMPA is the core module that realizes the deep coupling between the two modalities. Essentially, DCP empowered by CMPA combines the uni-branch and dual-branch multi-modal pre-training paradigms, in an attempt to bridge the discrepancy between visual and textual knowledge without introducing too much overhead.
To conclude, the contributions of this work are as follows:
• We develop a deeply coupled cross-modal prompt learning (DCP) method with a core module, cross-modal prompt attention (CMPA). CMPA can reinforce the interaction between the visual and language modalities across different layers.
• We benchmark our method on 11 image classification datasets consisting of generic objects, scenes, actions and fine-grained categories. Our method surpasses visual prompt tuning, text prompt tuning and existing competitive multi-modal prompt tuning methods under the few-shot setting.
• We conduct experiments on domain adaptation tasks. Our method achieves comparable performance to the state-of-the-art methods, indicating the robustness of our method to domain shift.
Related Work

Vision-language Pre-trained Models
The advent of the Transformer (Vaswani et al., 2017) has accelerated the development of large-scale pre-training. The application of the Transformer to multi-modal learning is divided into two schools of thought: one is the single-stream model, in which language and vision information are fused at the beginning and fed directly into the encoder together; the other is the dual-stream model, in which language and vision information first pass through two separate encoder modules, and then the information from the different modalities is fused through a cross-modal Transformer. At the outset, the basic architecture of some contemporaneous work is BERT. The images are processed by Faster-RCNN (Ren et al., 2015) to obtain region features, and these image region features are fed into BERT along with the text information to align the text and image information. Following the same process as BERT, these methods first pre-train and then fine-tune on the corresponding tasks. Single-stream networks (Li et al., 2019; Alberti et al., 2019; Chen et al., 2019; Li et al., 2020; Su et al., 2020; Zhou et al., 2020; Qi et al., 2020; Lu et al., 2020) fuse information from different modalities directly through an encoder. The dual-stream models (Lu et al., 2019; Tan and Bansal, 2019) integrate information from different modalities through a cross-modal Transformer. Empirically, single-stream networks fuse information more thoroughly, while dual-stream networks can be more efficient to train due to fewer trainable parameters. In the design of our method, we aim to combine the advantages of the single-stream and dual-stream models, so as to enhance the cross-modal integration without introducing many trainable parameters.
Recent cross-modal large-scale pre-training models have made greater breakthroughs in training data scale and tasks by devising various model architectures and training objectives, and have achieved impressive performance in many downstream tasks. CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) obtained remarkable zero-shot results after being pre-trained on millions or billions of (image, text) pairs collected from the internet. CoCa (Yu et al., 2022) combined the advantages of the contrastive learning method (Radford et al., 2021) and the generative model SimVLM (Wang et al., 2022b) by adding a caption loss to the contrastive loss of CLIP. OFA (Wang et al., 2022a), Unified-IO (Lu et al., 2022a) and Florence (Yuan et al., 2021) unified vision, language and multi-modal tasks by pre-training on both cross-modal and uni-modal data. These methods have achieved state-of-the-art results in many downstream tasks. Some methods are dedicated to improving the performance of certain specific tasks. UniTAB (Yang et al., 2022) focused on grounded vision-language tasks such as grounded captioning and visual grounding. GLIP (Li et al., 2022) unified object detection and phrase grounding for pre-training. Pre-training has opened up a regime in which deep learning models scale and improve in tandem, marking a major breakthrough in artificial intelligence and deep learning.

Prompt Learning
For a long time, first pre-training then fine-tuning was the dominant approach to applying large foundation models to downstream tasks. However, fine-tuning large models is inefficient and may cause catastrophic forgetting (Kirkpatrick et al., 2016). Prompt learning is proposed to address the above problems. The prompt is usually a series of trainable parameters inserted into the input. The success of prompt learning in NLP (Lester et al., 2021) has inspired its application in other modalities. VPT (Jia et al., 2022) is a typical successful application of prompt learning in computer vision. Prompt learning has attracted more attention and made great progress in cross-modal learning.
SoftCPT (Ding et al., 2022) and CPL (He et al., 2022) applied prompt tuning to different vision and language tasks and outperformed single-task prompt tuning methods. CoOp (Zhou et al., 2022b), ProDA (Lu et al., 2022b) and UPT (Zang et al., 2022) adapted prompt learning to traditional few-shot visual recognition with CLIP as the backbone. CoCoOp (Zhou et al., 2022a), ProGrad (Zhu et al., 2022) and MaPLe (Khattak et al., 2022) improved the classification performance of pre-trained models on novel categories by prompt learning. Different from previous methods, our approach brings a stronger connection between modalities and layers with the proposed cross-modal prompt attention. The stronger interaction between vision and language enables our method to achieve state-of-the-art performance in few-shot learning.

Method
In this section, we first introduce the preliminaries, including CLIP (Radford et al., 2021), CoOp (Zhou et al., 2022b) and VPT (Jia et al., 2022). Then, we describe our deeply coupled prompt learning (DCP) and detail its underlying module CMPA.

Preliminaries
CLIP is a dual-encoder pre-trained model which consists of a text encoder and an image encoder. The text and image are independently encoded by the corresponding encoder, then projected to the same embedding space by a projection layer. Specifically, the backbone of the image encoder is ResNet (He et al., 2016) (d=256) or ViT (d=512), which maps the high-dimensional image into a low-dimensional embedding. The text encoder is built on the decoder of the Transformer (Vaswani et al., 2017), which is also known as GPT (Brown et al., 2020), to generate a vectorized representation for a sequence of words. The model uses a contrastive loss to align the two modalities during the training stage. The training objective is to maximize the cosine similarity of the matched image-text pairs and minimize that of the unmatched ones.
In zero-shot image recognition, the image encoder of CLIP encodes the image into a feature representation $x$. The input text is usually in the form of "a photo of a {class}." (discrete prompt), where the "{class}" token is the name of each category. For each dataset containing $K$ categories, a set of text prompt representations $\{w_i\}_{i=1}^{K}$ is generated by the text encoder. The prediction probability is computed as
$$p(y = i \mid x) = \frac{\exp(\cos(w_i, x)/\tau)}{\sum_{j=1}^{K} \exp(\cos(w_j, x)/\tau)},$$
where $\tau$ is a temperature parameter.
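The zero-shot prediction rule described above can be sketched in a few lines. This is a minimal illustration, assuming the standard CLIP form (a softmax over temperature-scaled cosine similarities); the function names, the toy feature vectors and the value `tau=0.01` are assumptions made for the example, not outputs of the actual model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(x, class_text_features, tau=0.01):
    """Softmax over temperature-scaled cosine similarities (tau is assumed)."""
    logits = [cosine(w, x) / tau for w in class_text_features]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-d features for K = 3 classes (made-up numbers, not CLIP outputs).
x = [0.9, 0.1, 0.0]
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = zero_shot_probs(x, w)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

A small temperature sharpens the distribution, so the class whose text feature is closest in angle to the image feature receives almost all of the probability mass.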
CoOp adapts CLIP to downstream tasks with prompt tuning. Specifically, CoOp learns the prompt embedding (continuous prompt) during few-shot training to avoid manual prompts. The prompt fed into the text encoder is designed as $t = [V]_1 [V]_2 \dots [V]_M [\text{CLASS}]$, where $[V]_m$ ($m \in \{1, \dots, M\}$) is initialized with the same dimension as word embeddings. The parameters of the CLIP model are frozen while the prompt is trainable. The prediction probability of CoOp is
$$p(y = i \mid x) = \frac{\exp(\cos(g(t_i), x)/\tau)}{\sum_{j=1}^{K} \exp(\cos(g(t_j), x)/\tau)},$$
where $t_i$ is the prompt for the $i$-th class and $g(\cdot)$ denotes the text encoder.
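To make the continuous prompt concrete, the sketch below builds one prompt per class from a single shared set of context vectors. It is only an illustration under stated assumptions: the sizes `M` and `DIM`, the initialization scale 0.02, and the class embeddings are hypothetical toy values, and no text encoder is involved.

```python
import random

M = 4    # number of learnable context tokens (toy value)
DIM = 8  # toy embedding width; real word embeddings are wider

random.seed(0)
# Shared learnable context [V]_1 ... [V]_M: the only trainable parameters.
# The 0.02 standard deviation is an assumed initialization scale.
context = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(M)]

# Frozen class-name embeddings (hypothetical values for illustration).
class_embeddings = {
    "cat": [[0.1] * DIM],
    "dog": [[0.2] * DIM],
}

def build_prompt(context, class_tokens):
    """Concatenate the shared context with one class's token embeddings."""
    return context + class_tokens

prompts = {
    name: build_prompt(context, tokens)
    for name, tokens in class_embeddings.items()
}
```

Because every class prompt reuses the same context vectors, a gradient step on the context changes all class prompts at once while the class-name embeddings stay fixed.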
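VPT is an efficient and effective way to adapt large-scale Transformer models in vision with only a small number of trainable parameters. The deep VPT can be formulated as
$$[x_{\text{cls}}^{l}, \_, x^{l}] = \Phi_l([x_{\text{cls}}^{l-1}, p^{l-1}, x^{l-1}]), \quad l = 1, \dots, L,$$
$$y = \mathrm{Head}(x_{\text{cls}}^{L}),$$
where $\Phi_l$ is the $l$-th Transformer layer, $p^{l-1}$ is the prompt inserted at its input, $L$ denotes the number of Transformer layers and Head is the classification head. Only the prompts and the classification head are learnt during training. VPT achieves impressive performance on 24 downstream recognition tasks.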
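The deep variant of visual prompt tuning can be illustrated with a toy forward pass: fresh prompts are inserted before every layer, and the outputs at the prompt positions are dropped before the next layer. The sketch assumes a stand-in `toy_layer` in place of a frozen Transformer layer and uses `sum` as a placeholder classification head; both are hypothetical and chosen only so the example runs without any dependency.

```python
def toy_layer(tokens):
    """Stand-in for a frozen Transformer layer: mixes each token with the mean."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]

def vpt_deep_forward(cls_token, patches, layer_prompts, head):
    """Insert new prompts before every layer and discard their outputs."""
    n_prompts = len(layer_prompts[0])
    cls, x = cls_token, patches
    for prompts in layer_prompts:             # one prompt set per layer
        out = toy_layer([cls] + prompts + x)  # layer input is [x_cls, p, x]
        cls = out[0]
        x = out[1 + n_prompts:]               # prompt outputs are dropped
    return head(cls)

# Toy sizes: 2 layers, 2 prompts per layer, 3 patch tokens, width 2.
layer_prompts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
patches = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5]]
score = vpt_deep_forward([0.0, 0.0], patches, layer_prompts, head=sum)
```

Passing a single prompt set in `layer_prompts` would insert prompts at the first layer only, which corresponds to the shallow variant.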

Cross-modal Prompt Attention
Inspired by the advances of prompt learning in vision and language, recent studies have started to explore multi-modal prompt learning (Zang et al., 2022; Khattak et al., 2022). These methods update the visual and text prompts simultaneously to achieve balance in the learning of visual and text embeddings. Although the visual and text embeddings are adapted to the few-shot data, the interaction between the visual and text branches is still insufficient. Hence we propose deeply coupled cross-modal prompt learning (DCP), which can enhance the communication between prompts across different layers and modalities. The essential module of DCP is cross-modal prompt attention, which fuses visual and text information with multi-head cross-modal attention. Figure 1 depicts the pipeline of DCP and the detailed architecture of cross-modal prompt attention (CMPA).
Our method follows the implementation of CLIP and is likewise a dual-encoder model. In contrast to CLIP, we add prompts to each branch and enable information fusion between vision and language during training through CMPA. Specifically, CMPA is a multi-head attention module that takes the visual and text prompts as inputs. The language prompts of the first layer are initialized with the pre-trained CLIP word embeddings of the template 'a photo of a <class>', whereas the visual prompts inserted into the first layer are randomly initialized from a normal distribution. Then, the prompts of the next layer are generated by CMPA based on the prompts from the preceding layer. Formally, CMPA maps the prompt pair $(P_t^{l}, P_v^{l})$ to the pair $(P_t^{l+1}, P_v^{l+1})$ through scaled dot-product multi-head attention, where $P_t^{l}$ and $P_v^{l}$ denote the text prompt and visual prompt at the $l$-th layer of each encoder, respectively. $N$ is the depth of CMPA, which is smaller than the number of layers of the text and visual encoders. $d_k$ is the dimension of the keys.
Different from previous methods, only the prompts from the first layer are randomly generated. The subsequent prompts are conditioned on the prompts from both the visual and language modalities. CMPA enables information communication between vision and text through the corresponding prompts. Overall, CMPA brings stronger feature fusion from two aspects: layers and modalities. Note that CMPA shares parameters across different layers, so only a small number of additional trainable parameters is introduced.
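The layer-by-layer prompt generation can be sketched as follows. This is a rough illustration and not the released implementation: it uses a single attention head, gives both modalities the same width, replaces the learned projections with identity matrices, and assumes that each modality acts as the query while the other supplies keys and values. All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over another modality."""
    q = matmul(queries, w_q)
    k = matmul(keys_values, w_k)
    v = matmul(keys_values, w_v)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)

def cmpa_prompts(p_text, p_visual, weights, depth):
    """Generate prompts for layers 2..depth from the first-layer prompts."""
    text_layers, visual_layers = [p_text], [p_visual]
    for _ in range(depth - 1):
        t, v = text_layers[-1], visual_layers[-1]
        # Assumption: each modality queries the other, and the same
        # weights are reused at every layer (parameter sharing).
        text_layers.append(cross_attention(t, v, *weights))
        visual_layers.append(cross_attention(v, t, *weights))
    return text_layers, visual_layers

identity = [[1.0, 0.0], [0.0, 1.0]]
weights = (identity, identity, identity)    # toy stand-ins for learned projections
p_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 text prompt tokens, width 2
p_v = [[0.5, 0.5], [0.2, 0.8]]              # 2 visual prompt tokens, width 2
text_layers, visual_layers = cmpa_prompts(p_t, p_v, weights, depth=3)
```

In this arrangement the number of prompt tokens per modality stays fixed across layers, and only the first-layer prompts and the shared projection weights would need to be trained.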

Experiments
In this section, we conduct experiments to evaluate the effectiveness of our method under two settings. One is few-shot visual recognition on 11 different datasets covering generic objects, scenes, actions and fine-grained categories. The other is domain adaptation, where we train our model on ImageNet and evaluate it on four other datasets.

Implementation Details
We use the pre-trained ViT-B/16 CLIP model as our backbone. The lengths of the prompt tokens for the visual and textual context are both 16. The prompt depth is 9 as a trade-off between accuracy and training efficiency. We set the batch size to 4 and use the SGD optimizer with a learning rate of 0.0035. We use 20 epochs for most datasets, except ImageNet, SUN397 and Food101. A 5-epoch setting works for the various shot settings of Food101, the 1/2/4-shot settings of ImageNet and the 1/2-shot settings of SUN397.
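For reference, the settings above can be collected into a small configuration sketch. The values are taken from the text; the key names and the helper `epochs_for` are hypothetical, "various shots" of Food101 is read as all five shot counts (an assumption), and shot counts whose epoch budget the text does not state return `None` rather than a guess.

```python
# Hyperparameters as stated in the text; the key names are hypothetical.
dcp_config = {
    "backbone": "ViT-B/16",
    "visual_prompt_length": 16,
    "text_prompt_length": 16,
    "prompt_depth": 9,
    "batch_size": 4,
    "learning_rate": 0.0035,
    "optimizer": "SGD",
    "default_epochs": 20,
}

# Dataset/shot pairs for which the text reports a 5-epoch schedule.
FIVE_EPOCH_SHOTS = {
    "Food101": {1, 2, 4, 8, 16},  # assumed to mean every shot count
    "ImageNet": {1, 2, 4},
    "SUN397": {1, 2},
}

def epochs_for(dataset, shots):
    """Return the epoch budget, or None where the text does not specify it."""
    if dataset in FIVE_EPOCH_SHOTS:
        return 5 if shots in FIVE_EPOCH_SHOTS[dataset] else None
    return dcp_config["default_epochs"]
```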

Main Results
Baseline Methods. We compare our method with the original zero-shot CLIP, text prompt learning (CoOp), visual prompt learning (VPT) and multi-modal prompt learning (MaPLe), which all have ViT-B/16 as the visual backbone. Basically, we follow the implementation of MaPLe (Khattak et al., 2022). The prompt length of CoOp is set to 16. VPT uses a prompt length of 8, and the visual and text prompt length of MaPLe is 2. The number of training epochs is 10 for CoOp and 5 for VPT and MaPLe. We use the deep variant of VPT in few-shot experiments. The prompt depth of MaPLe is 9, as in their original setting.
Performance Analysis. Figure 2 compares our results with those of other methods. The top-left sub-figure shows the average performance of the four methods. We make the following observations. 1) Overall, cross-modal prompt learning (DCP and MaPLe) obtains a large performance gain over single-modal prompt learning methods (VPT and CoOp). VPT and CoOp achieve comparable performance on different shots. These results demonstrate the superiority of cross-modal prompt learning over uni-modal prompt learning. 2) Although both are multi-modal prompt learning methods, our method still outperforms MaPLe on the 1/2/4/8/16-shot settings by 1.72/3.18/3.19/2.20/2.76 (%). MaPLe uses a linear layer to generate visual prompts from text prompts. Our proposed DCP enhances the interaction between vision and language with cross-modal prompt attention, which not only guides visual embedding learning through text prompts but also influences the language embedding with visual prompts. 3) Compared with 2/4/8/16 shots, our approach achieves a lower performance gain on one shot. We can also find that on separate datasets, our method achieves the best performance in almost all 16-shot cases (except for Food101). This phenomenon indicates that our method is more effective when the number of shots is relatively large. This is probably because aligning different modalities is more challenging when the number of samples per category is small.
For individual datasets, we find that our approach has significant performance improvements on Flowers102, StanfordCars, FGVCAircraft, and EuroSAT. However, on datasets of general categories such as ImageNet and Caltech101, our method does not achieve satisfactory performance when the number of shots is less than 16. We can conclude that our method is more robust for fine-grained classification datasets, and that more shots are needed for general category classification. On Food101, our method performs slightly worse than MaPLe. We also find that, on this dataset, all methods underperform zero-shot CLIP in the 1-shot setting. We suppose this phenomenon comes from the noisy training data of Food101 (Bossard et al., 2014).

Ablation Study
There are two important settings in CMPA: the feature fusion method for the prompts and the parameter sharing of CMPA across different layers. We conduct the corresponding ablation experiments in this section to find the optimal setting.
Feature Fusion in Prompts. Before the visual and text prompts are fed into CMPA, their batch dimensions must be made consistent. The defined batch size only affects the visual prompt, while the batch size of the text prompts is actually the number of categories in the dataset due to the implementation of CLIP. The dimension transformation of visual and text prompts is shown in Figure 3. We experiment with three settings to align the batch size of visual and text prompts. Figure 4 reports the average accuracy over three runs on different shots (1/2/4/8/16) of 10 datasets (without ImageNet for time efficiency). 'Avg' means that we use the average of the visual and text prompts across the batch dimension. 'Max' stands for using the features with the highest response across the batch dimension as the visual and text prompt. 'First' means that we select the first embedding across the batch dimension of the visual and text prompts to feed into CMPA. Overall, the 'avg' setting of feature fusion achieves better performance than 'max' and 'first'.
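The three fusion settings can be illustrated on a toy batch. The sketch below is an interpretation rather than the authors' code: the function name `fuse` is hypothetical, and 'max' is assumed to mean an element-wise maximum across the batch dimension, which the text does not spell out.

```python
def fuse(batch_prompts, mode="avg"):
    """Collapse prompts shaped [batch][token][dim] into one [token][dim]."""
    if mode == "first":
        return [list(token) for token in batch_prompts[0]]
    n_tokens = len(batch_prompts[0])
    dim = len(batch_prompts[0][0])
    # 'max' is taken element-wise across the batch (an assumption).
    reduce_fn = {"avg": lambda vals: sum(vals) / len(vals), "max": max}[mode]
    return [
        [reduce_fn([sample[t][d] for sample in batch_prompts]) for d in range(dim)]
        for t in range(n_tokens)
    ]

# Batch of 2 samples, 2 prompt tokens each, width 2 (toy numbers).
batch = [
    [[1.0, 4.0], [0.0, 2.0]],
    [[3.0, 0.0], [2.0, 2.0]],
]
avg_prompt = fuse(batch, "avg")      # [[2.0, 2.0], [1.0, 2.0]]
max_prompt = fuse(batch, "max")      # [[3.0, 4.0], [2.0, 2.0]]
first_prompt = fuse(batch, "first")  # [[1.0, 4.0], [0.0, 2.0]]
```

Whichever mode is used, the result no longer depends on the batch size, so a visual prompt (batch of images) and a text prompt (batch of categories) end up with matching shapes.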
Parameter Sharing. We intend to learn as few parameters as possible when transferring large-scale pre-trained models to downstream tasks. Setting the prompt depth to 9 means that there are 9 CMPA modules, which greatly increases the number of trainable parameters of the model. Hence we conduct an experiment in which the parameters of CMPA are shared across different layers. Table 1 shows the average results of different shots on 11 datasets. 'PS' is short for 'parameter sharing'. It can be observed that on most shots (except for 8 shots) the performance with parameter sharing is higher than that of the non-sharing setting.
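A back-of-the-envelope count shows why sharing matters. The sketch assumes that each CMPA module is a standard attention block with four square projections plus biases and that the prompt width is 512; neither figure is given in the text, so the absolute numbers are illustrative and only the ratio follows from the stated depth of 9.

```python
def attention_params(dim):
    """Weights and biases of the query, key, value and output projections."""
    return 4 * (dim * dim + dim)

def cmpa_params(dim, depth, shared):
    """Trainable attention parameters with or without sharing across layers."""
    modules = 1 if shared else depth
    return modules * attention_params(dim)

DIM = 512  # assumed prompt width, for illustration only
DEPTH = 9  # prompt depth used in the experiments

shared = cmpa_params(DIM, DEPTH, shared=True)
unshared = cmpa_params(DIM, DEPTH, shared=False)
ratio = unshared / shared
```

Under these assumptions the unshared variant carries nine times the attention parameters of the shared one, regardless of the width chosen.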

Domain Generalization
After prompt tuning on specific datasets, we do not want to lose the general knowledge of the pretrained large model.In this section, we conduct domain adaptation experiments to evaluate the generalization ability of our model DCP.

Main Results
Table 2 compares our method DCP with other prompt learning methods on cross-domain tasks. The compared methods include zero-shot CLIP, uni-modal prompt learning methods (CoOp, CoCoOp and VPT-Deep) and multi-modal prompt learning methods (MaPLe and UPT). The best results on different datasets are in bold, and the second best results are underlined. We can observe that 1) prompt learning does not corrupt the generalization ability of pre-trained large models; 2) multi-modal prompt learning methods outperform uni-modal prompt learning methods in generalization performance; 3) our method achieves performance comparable to that of the state-of-the-art methods.

Discussion and Conclusion
This paper proposes a deeply coupled cross-modal prompt learning method, with a core module, cross-modal prompt attention. Our method focuses on optimizing the interaction across different modalities and layers to address the alignment between vision and language. Experiments on few-shot image classification and domain adaptation show that our method can transfer the general knowledge learned by pre-trained foundation models to downstream tasks without sacrificing the original generalization ability. Our method provides a strong baseline for few-shot image classification. The deep fusion between visual and language information may give our approach greater potential for complex cross-modal tasks, such as referring expression comprehension (Subramanian et al., 2022), image retrieval (Baldrati et al., 2022) and visual question answering (Liu et al., 2022). We will apply our method to such complicated cross-modal tasks to evaluate its effectiveness in future work.

Limitations
We discover that for datasets with a relatively large number of categories, our method requires a more delicate setting of the number of epochs under different shots. Figure 5 shows the average results on SUN397 and ImageNet for different numbers of epochs. It can be observed that for datasets with a large number of categories (such as SUN397 and ImageNet), as the number of shots decreases, the performance deteriorates with an increase in the number of epochs, which is not evident on datasets with a small number of categories. We will delve further into this problem to find the reason and a solution.

Figure 1: The architecture of deeply coupled prompt learning and cross-modal prompt attention module.

Figure 2: Main results of few-shot image classification on 11 datasets. The accuracy (%) is the average over three runs on 1/2/4/8/16 shots. Overall, our DCP (red line) outperforms other methods by a large margin on the average results of 11 datasets.

Figure 3: The illustration of feature fusion (FF).The left branch represents the text prompt, and the right shows the visual prompt.

Figure 4: The comparison of different feature fusion methods on 10 datasets without ImageNet.

Datasets and Implementation Details
Following (Zhou et al., 2022b), we use ImageNet (Deng et al., 2009) as the source domain, and ImageNet V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b) and ImageNet-R (Hendrycks et al., 2021a) as target domains. We train our model on 16-shot ImageNet and test it on the four other datasets. Different from the settings in the few-shot task, the number of training epochs on 16-shot ImageNet in the cross-domain task is set to 5. We also decrease the prompt length to 8.

Figure 5: Accuracy comparison of different epochs on SUN397 and ImageNet.
The backbone of VPT is ViT, which is the same as the image encoder of CLIP. There are two variants of VPT: VPT-Shallow and VPT-Deep. VPT-Shallow only inserts prompts into the first layer of the Transformer. The visual prompt can be defined as $p = [P]_1 [P]_2 \dots [P]_N$, where $[P]_n$ ($n \in \{1, \dots, N\}$) keeps the same dimension as the image embedding. The input of VPT-Shallow is $[x_{\text{cls}}, p, x]$, where $x_{\text{cls}}$ is the classification token [CLS]. VPT-Deep introduces visual prompts at every Transformer layer.

Table 1: The performance comparison with and without parameter sharing. The results are the average accuracy on 11 datasets of different shots.
Shots     1        2          4        8        16
PS        [...]    [...].56   75.69    78.42    80.55
w/o PS    67.42    71.34      75.27    78.49    80.53

Table 2: Domain generalization comparison of DCP with existing approaches. The winners and runners-up are marked in bold font and underlined, respectively.