CPL: Counterfactual Prompt Learning for Vision and Language Models

Prompt tuning is a new few-shot transfer learning technique that only tunes the learnable prompt for pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models, which simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. In particular, CPL constructs counterfactuals by identifying the minimal non-spurious feature change between semantically similar positive and negative samples that causes a concept change, and learns a more generalizable prompt representation from both factual and counterfactual examples via contrastive learning. Extensive experiments demonstrate that CPL obtains better few-shot performance than previous prompt tuning methods on CLIP across different vision and language tasks. On image classification, we achieve a 3.55% average relative improvement on unseen classes across seven datasets; on image-text retrieval and visual question answering, we gain up to 4.09% and 25.08% relative improvements, respectively, across three few-shot scenarios on unseen test sets.


Introduction
Pre-trained vision and language foundation models (Radford et al., 2021; Jia et al., 2021) have shown encouraging results toward open-domain visual-concept matching. Benefiting from prompt engineering (Song et al., 2022a; Liu et al., 2022), where free-form text prompts are designed for specific task goals, those foundation models can be easily transferred to a wide array of tasks under zero-shot and few-shot scenarios, including image classification (Deng et al., 2009), visual question answering (Shen et al., 2021), image-text retrieval (Jia et al., 2021), etc. But manually constructing prompts for vision and language models such as CLIP is a tedious, time-consuming process, which usually requires prior domain knowledge and leads to suboptimal solutions.
Our code is released at https://github.com/eric-ai-lab/CPL.
Prompt tuning (Lester et al., 2021), on the other hand, liberates us from manual prompt engineering and automates this process. Prompt tuning methods (Ju et al., 2021; Lin et al., 2014; Zhou et al., 2022) have been proposed to effectively transfer CLIP to image recognition tasks after tuning a learnable prompt with a few examples of the classes. However, those methods purely conduct empirical risk minimization (ERM) and optimize for predictive accuracy, which often produces spurious, inefficient, or entangled representations (Wang and Jordan, 2021). Therefore, the generalization ability of existing prompt tuning methods for vision and language models is limited, and they often fail to transfer well to unseen classes or concepts. For example, the image classification performance of the SOTA method CoCoOp (Zhou et al., 2022) is similar to, or even worse than, that of zero-shot CLIP on unseen classes.
Learning non-spurious representations for better generalization requires disentangling the features that causally determine the prompts. One solution is counterfactual reasoning. Counterfactual ("counter to the facts") is a concept that describes the human capacity to learn from limited prior experiences by imagining the outcome of an alternative action that could have been taken. So we can do counterfactual intervention by asking "what if ..." questions in prompt learning. For example, as shown in Figure 1, a change in the visual feature of the barn would cause the label to change (if we view the two prompts as two labels).
Therefore, we introduce a new causality-based approach, Counterfactual Prompt Learning (CPL), for non-spurious and efficient prompt learning. First, we introduce a text-based negative sampling strategy to discover the most semantically similar negative sample based on text similarity. Then we generate a counterfactual example by identifying the minimal non-spurious feature change between semantically similar positive and negative samples that causes the prompt to change. Finally, we adopt contrastive learning in the joint optimization framework (with counterfactual construction) to tune the learnable prompts using both factual and counterfactual examples. The causally fine-tuned prompts will eventually guide vision-and-language foundation models to distinguish images from unseen concepts, thereby improving the generalization ability of prompt learning.
We extensively evaluate CPL using seven standard datasets for image classification, two for image-text retrieval, and one for visual question answering (VQA). We show that CPL outperforms the baseline on all three tasks: on image classification, our method achieves a 3.55% average relative improvement on unseen classes across the seven datasets in terms of accuracy; on image-text retrieval, our method improves the most (4.09% relative improvement in terms of Recall@1) when using 0.5% of total training instances on MSCOCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015); on VQA, we gain up to 25.08% relative improvement on the VQAv2 (Goyal et al., 2017a) dataset.
Our main contributions are summarized below:
• We introduce Counterfactual Prompt Learning (CPL), a task-agnostic causality-based prompt learning method to effectively transfer CLIP to unseen concepts for different downstream tasks.
• We propose a text-based negative sampling strategy, where we compute BERTScore (Zhang et al., 2019) between text prompts, based on which we sample the most semantically similar negative images.
• We introduce a joint optimization framework that simultaneously constructs counterfactuals by identifying the minimal non-spurious feature change, and learns the generalized prompt representation from both factual and counterfactual examples.
• We conduct extensive experiments on image classification, image-text retrieval, and visual question answering, and validate the superiority of CPL over existing prompt tuning methods in transferring effectively to unseen concepts.

Related Work
Vision-and-Language Models. Vision-and-language models pre-trained on large-scale image-text pairs have demonstrated great potential in multimodal representation learning (Jia et al., 2021; Yao et al., 2021; Yuan et al., 2021). Among them, the representative CLIP (Radford et al., 2021) benefits from 400M curated data and defines various prompt templates to carry out zero-shot image classification. However, those prompts still require hand-crafted designs. In this work, we automatically learn task-agnostic and task-relevant prompts without human priors. In addition, by considering the counterfactual examples, we can further improve various vision-and-language tasks, including visual question answering and image-text retrieval in a few-shot scenario.
Prompt Tuning. Many works focus on learning from discrete natural language prompts, e.g., AutoPrompt (Shin et al., 2020) elicits knowledge from language models with automatically generated discrete prompts. Lately, many other works (Zhou et al., 2021, 2022) directly tune prompts in continuous vector forms. Guo et al. (2021) introduce Q-Learning to optimize the soft prompt. P-Tuning v2 (Liu et al., 2021) shows that continuous prompt tuning achieves the same performance as fine-tuning in various settings. Prompt tuning also receives great interest in the computer vision domain. For example, CoOp proposes a continuous prompt optimization strategy to avoid prompt design. CoCoOp (Zhou et al., 2022) extends CoOp by further learning an instance-conditional network to generate an input-conditional token for each image. However, these methods, trained with empirical risk minimization (ERM), may learn to rely on correlations between class labels and spurious attributes by minimizing average training error (Zhang et al., 2022). They usually learn spurious, inefficient, and entangled representations, and lack the ability to generalize to unseen scenarios.
Counterfactual Reasoning. A number of recent works have investigated generating counterfactual images (Besserve et al., 2020), or counterfactual text in specific language domains (e.g., court view (Wu et al., 2020), dialogue generation (Zhu et al., 2020), natural language inference (Kaushik et al., 2019; Gokhale et al., 2021), named entity recognition (Zeng et al., 2020)). On the vision end, Zhang et al. (2021) propose to add intervention over the changed domain on images during the data-generation process and steer the generative model to produce counterfactual features to augment the training process. Agarwal et al. (2020) use automated semantic image manipulations to generate synthetic data to make models more robust against spurious correlations. On the vision and language end, Chen et al. (2020) propose to generate counterfactual VQA samples by masking critical objects in images or words in questions to augment the training data, and gain a huge improvement on the VQAv2 dataset. Gokhale et al. (2020) propose template-based counterfactual image augmentation methods. Fu et al. (2020) propose a novel training strategy for visual language navigation that dynamically generates counterfactuals to account for unseen scenarios. To the best of our knowledge, CPL is the first to apply counterfactual generation to prompt-based few-shot learning for vision and language models.
Few-shot Learning. Recently, several efficient few-shot learners for vision (He et al., 2022) and language (Brown et al., 2020) tasks were proposed, including CLIP. GPT (Brown et al., 2020), as a strong few-shot learner, is capable of performing a new language task by learning from only a few training instances. Frozen (Tsimpoukelli et al., 2021) is developed based on GPT and made into a multimodal few-shot learner by expanding the soft prompting to include a collection of images and text. Their method demonstrates strong few-shot capabilities on visual question answering and image classification tasks. Similarly, CoCa (Yu et al., 2022) is pre-trained from scratch and end-to-end using both web-scale data and annotated images by considering all labels as text, therefore unifying supervision for learning representations through natural language. It can achieve state-of-the-art performance with few-shot transfer or with minimal task-specific adaptation on a wide range of downstream vision-and-language tasks, including visual recognition, multimodal understanding, cross-modal retrieval, and image captioning. SimVLM (Wang et al., 2021b) is pre-trained with prefix language modeling on datasets with weak supervision, and is effective on few-shot captioning tasks. Even though all the models mentioned above can already achieve improvement on some few-shot tasks, how to exploit their few-shot reasoning ability using limited training examples still deserves effort. In this work, we study this direction through the lens of prompt learning, using CLIP as a starting point.

Problem Formulation
Our goal is to learn a generalizable prompt representation with limited data. The prompt in CLIP is divided into two parts: a task-agnostic prompt $p$ and a task-relevant prompt $h$. The task-agnostic prompt $p$ is learned end-to-end automatically. The set of task-relevant prompts $H = \{h_0, h_1, \ldots, h_C\}$ is mapped from the label space $Y$ with some predefined rules hinging on the task type, where $C$ is the total number of classes. The final prompt $t_c$ is the concatenation of the task-agnostic prompt and the task-relevant prompt fed into CLIP's text encoder:
$$t_c = [p, h_c].$$
Existing works on this problem (Zhou et al., 2021, 2022) first extract the visual feature $v$ of each input image by feeding it into CLIP's vision encoder $F$; text embeddings are generated by feeding $\{t_c\}_{c=1}^{C}$ into CLIP's text encoder $G$. The probability of the $i$-th class is computed as
$$p(t_i \mid x) = \frac{\exp\left(\langle G(t_i), v \rangle / \tau\right)}{\sum_{c=1}^{C} \exp\left(\langle G(t_c), v \rangle / \tau\right)}, \tag{1}$$
where $\tau$ is the temperature parameter and $\langle \cdot, \cdot \rangle$ denotes the cosine similarity. The cross-entropy loss is then minimized, and the gradients can be back-propagated via the text encoder $G$ to update the learnable prompt representation $p$. During training, the weights of CLIP always remain frozen. During inference, Eq. 1 is used to compute the probability for each class.
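As a minimal sketch of the class-probability computation described above (a softmax over temperature-scaled cosine similarities between the image feature and each prompt embedding), the following uses plain Python lists in place of CLIP features. The function names, the toy 3-dimensional vectors, and the value of `tau` are illustrative assumptions, not part of the released code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def class_probabilities(v, text_embeddings, tau=0.1):
    """Softmax over cosine(v, G(t_c)) / tau for every class prompt t_c."""
    logits = [cosine(v, g) / tau for g in text_embeddings]
    shift = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins for the image feature v and the text embeddings G(t_c).
v = [1.0, 0.0, 0.0]
text_embeddings = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
probs = class_probabilities(v, text_embeddings, tau=0.1)
predicted_class = probs.index(max(probs))
```

In this toy setup the first prompt embedding is closest to `v` in direction, so it receives the largest probability; a smaller `tau` would sharpen the distribution further.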

Method Overview
An overview of the Counterfactual Prompt Learning (CPL) framework is shown in Figure 2. For pre-processing, we construct task-relevant prompts for all training samples. The goal is to optimize the task-agnostic prompt p. During training, given a positive image-prompt pair, we first perform text-based negative sampling to find the most semantically similar negative sample based on text similarity scores. Then we adopt a controllable counterfactual generation strategy to construct the counterfactual from the positive and negative samples in the visual feature space. Finally, we perform contrastive learning using both generated counterfactual image features and factual image features in a joint optimization framework to fine-tune the task-agnostic prompt p, allowing the model to understand non-spurious semantic information and learn generalized prompt representations.

Controllable Counterfactual Generation
By viewing the image feature $v$ as a potential cause of the label, a non-spurious feature should be a sufficient cause of the label. So we would like to generate counterfactuals by identifying the minimal non-spurious feature change that causes the label change. The counterfactual construction process is illustrated in Figure 3. Given positive image features $v$ and negative image features $v^{-}$, we can generate negative counterfactual image features $v'$ as below:
$$v' = (1 - u) \circ v + u \circ v^{-}, \tag{2}$$
where $\circ$ is the element-wise multiplication and $u$ is the parameter controlling the amount of negative image feature that replaces the positive image feature. The negative image features are extracted from those images similar to the original image at the semantic level, which we will introduce in Section 3.4.
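The mixing step can be sketched as follows, assuming the counterfactual feature is the element-wise combination v' = (1 - u) * v + u * v_neg with each entry of u in [0, 1]. The function name and the toy vectors are hypothetical; in CPL, u is a learned parameter rather than a fixed list.

```python
def mix_counterfactual(v, v_neg, u):
    """Element-wise mix: entries of u near 1 take the negative feature,
    entries near 0 keep the positive (factual) feature."""
    if not (len(v) == len(v_neg) == len(u)):
        raise ValueError("v, v_neg and u must have the same length")
    return [(1.0 - ui) * vi + ui * ni for vi, ni, ui in zip(v, v_neg, u)]

# Toy features: u keeps dimension 0, blends dimension 1, replaces dimension 2.
v = [1.0, 2.0, 3.0]
v_neg = [5.0, 2.0, -1.0]
u = [0.0, 0.5, 1.0]
v_cf = mix_counterfactual(v, v_neg, u)  # [1.0, 2.0, -1.0]
```

With u equal to all zeros the counterfactual equals the factual feature, which is why a sparse u corresponds to a minimal change.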
To capture the non-spuriousness, we would like to construct counterfactuals by replacing essential non-spurious features only. This can be achieved by minimizing the amount of feature change $u^{*}$ to the original image that can causally incur label change:
$$\min_{u^{*}} \lVert u^{*} \rVert_{1} \quad \text{s.t.} \quad u^{*} = \arg\max_{u} D_{c^{-}}(v'), \tag{3}$$
where $D_{c^{-}}(v')$ is the probability that the discriminator $D$ assigns the negative label $c^{-}$ to $v'$.
Given the factual and counterfactual features $v$ and $v'$, we aim to learn the prompt that can help CLIP better align visual features $v$ and textual features $G(t)$ with the same semantic meanings. This can be achieved by maximizing the mutual information (MI) between $v$ and $G(t)$. By minimizing the InfoNCE loss (Hjelm et al., 2018), we can maximize the lower bound on $\mathrm{MI}(v, G(t))$. To this end, we define the contrastive objective function based on the InfoNCE estimator following Khosla et al. (2020):
$$\mathcal{L}_{CL}(p, u^{*}) = -\log\left(\frac{e^{S(v, G(t))/\tau}}{e^{S(v, G(t))/\tau} + e^{S(v', G(t))/\tau}}\right), \tag{4}$$
where $S(\cdot, \cdot)$ is normally the cosine similarity function and $\tau$ is the temperature value.
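A minimal sketch of an InfoNCE-style contrastive term with one positive (the factual feature) and one negative (the counterfactual feature) is given below. It assumes the loss takes the form -log(e^{S(v,G(t))/tau} / (e^{S(v,G(t))/tau} + e^{S(v',G(t))/tau})) with S the cosine similarity; the function names, the toy 2-dimensional vectors, and `tau=0.1` are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))

def contrastive_loss(v, v_cf, text_emb, tau=0.1):
    """-log(e^pos / (e^pos + e^neg)), rewritten as softplus(neg - pos)."""
    pos = cosine(v, text_emb) / tau      # factual feature vs. prompt embedding
    neg = cosine(v_cf, text_emb) / tau   # counterfactual feature vs. prompt embedding
    return softplus(neg - pos)

text_emb = [1.0, 0.0]
v = [0.9, 0.1]       # factual feature, well aligned with the prompt
v_cf = [0.1, 0.9]    # counterfactual feature, poorly aligned with the prompt
loss_separated = contrastive_loss(v, v_cf, text_emb)
loss_identical = contrastive_loss(v, v, text_emb)  # equals log(2)
```

When the counterfactual is indistinguishable from the factual feature the loss equals log 2, and it approaches zero as the prompt embedding aligns with the factual feature only.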

Text-based Negative Sampling
We then discuss how to perform negative sampling for constructing counterfactual features. As suggested in Robinson et al. (2020), good negative samples have different labels and are difficult to distinguish from an anchor point, while their semantic representations are close (Suresh and Ong, 2021). Since not all negative samples can serve as useful negatives (Chuang et al., 2020), using them indiscriminately may harm model robustness and algorithm efficiency. Therefore, during training, in each batch, we only utilize the most semantically similar one to generate counterfactual image features. Other image samples are filtered out.
Semantic concepts may be highly complex in the visual representations, and thus it is hard to directly measure semantic similarity in the visual space. Language, in contrast, is more expressive and naturally preserves semantic meanings. Therefore, we propose a text-based negative sampling method. We first measure the text similarity between prompts with BERTScore (Zhang et al., 2019), which computes pairwise cosine similarity between reference sentences and candidate sentences using BERT contextual embeddings (Devlin et al., 2019). We compute a similarity matrix with the value of each element being:
$$\mathrm{sim}(i, j) = \mathrm{BERTScore}(h_i, h_j). \tag{5}$$
Denote $B$ as the collection of sampled instances.
During training, each prompt $h_c \in B$ ($1 \le c \le C$, where $C$ is the size of sampled instances) can be treated as a query. Given a query prompt $h_q$, its most semantically similar prompt (the one with the highest BERTScore) $h_k$ is searched from $B$.
Then we use the CLIP vision encoder to obtain the features of the corresponding positive and negative images $v$ and $v^{-}$.
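The search for the most similar negative prompt can be sketched as below. To stay self-contained, this sketch swaps in a simple token-overlap (Jaccard) score as a stand-in for BERTScore; that substitution, the function names, and the rule of skipping prompts identical to the query are assumptions made for illustration only.

```python
def token_jaccard(a, b):
    """Stand-in similarity (NOT BERTScore): Jaccard overlap of lowercase tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def similarity_matrix(prompts, sim_fn):
    """Pairwise similarity between all prompts in the sampled batch."""
    n = len(prompts)
    return [[sim_fn(prompts[i], prompts[j]) for j in range(n)] for i in range(n)]

def most_similar_negative(sim, prompts, q):
    """Index k != q with a different prompt and the highest sim[q][k], or None."""
    candidates = [k for k in range(len(prompts))
                  if k != q and prompts[k] != prompts[q]]
    if not candidates:
        return None
    return max(candidates, key=lambda k: sim[q][k])

prompts = [
    "a large long train on a steel track",
    "a large long train on a steel track near a barn",
    "a bowl of fresh fruit on a table",
]
sim = similarity_matrix(prompts, token_jaccard)
k = most_similar_negative(sim, prompts, q=0)  # 1: the "near a barn" prompt
```

The image paired with prompt `k` would then supply the negative feature for the query image; replacing `token_jaccard` with a real BERTScore call changes only the `sim_fn` argument.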

Joint Optimization
In addition to the contrastive learning loss introduced in Eq. 4, we also adopt the standard cross-entropy loss for training:
$$\mathcal{L}_{CE}(p) = -\sum_{c} y_c \log p(t_c \mid x), \tag{6}$$
where $y_c$ denotes the one-hot ground-truth annotation of the label and $p(t_c \mid x)$ is the class probability from Eq. 1. We treat all downstream tasks in this work as classification tasks, where the model predicts whether the image and text prompt pair is matched or not.
Then the task-agnostic prompt $p$ is learned by minimizing the weighted combination of the contrastive learning loss and the cross-entropy loss:
$$\mathcal{L} = \mathcal{L}_{CE}(p) + \lambda \cdot \mathcal{L}_{CL}(p, u^{*}), \tag{7}$$
where $\lambda$ determines the weight of $\mathcal{L}_{CL}$.
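The weighted combination can be illustrated with a small worked example, assuming the total loss is the cross-entropy term plus lambda times the contrastive term. The probabilities and the contrastive value below are made-up numbers, and the function names are hypothetical.

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot target: -log of the true-class probability."""
    return -math.log(probs[target_index])

def joint_loss(loss_ce, loss_cl, lam=1.0):
    """Weighted sum of the cross-entropy and contrastive terms."""
    return loss_ce + lam * loss_cl

probs = [0.7, 0.2, 0.1]          # made-up class probabilities, true class is 0
loss_ce = cross_entropy(probs, 0)
loss_cl = 0.3                    # made-up contrastive loss value
total = joint_loss(loss_ce, loss_cl, lam=1.0)
ce_only = joint_loss(loss_ce, loss_cl, lam=0.0)  # reduces to plain cross-entropy
```

Setting `lam=0.0` removes the contrastive contribution entirely, which matches the lambda = 0 case discussed in the ablation.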
In fact, we can put Eq. 3 and Eq. 7 in a single-stage optimization framework. The intuition is that we generate counterfactual image features with minimal feature change that can maximize the negative prediction probability, and at the same time, utilize contrastive learning to learn the prompt that can guide CLIP to explicitly distinguish between factual images and counterfactual images. Putting all pieces together, we have:
$$\min_{p, u^{*}} \; \mathcal{L}_{CE}(p) + \lambda \cdot \mathcal{L}_{CL}(p, u^{*}) + \lVert u^{*} \rVert_{1} \quad \text{s.t.} \quad u^{*} = \arg\max_{u} D_{c^{-}}(v'), \quad \text{where } v' = (1 - u) \circ v + u \circ v^{-}. \tag{8}$$
In Eq. 8, the gradients can be back-propagated all the way through the text encoder $G$ to the task-agnostic prompt, making use of the rich knowledge encoded in the pre-trained CLIP model to optimize the prompt.

Algorithm 1: Counterfactual Prompt Learning
    X: image space
    Y: label space
    h_c: task-relevant prompt for the c-th class
    H: the set of task-relevant prompts
    p: the task-agnostic prompt
    v: image features
    v⁻: negative image features
    u: parameter that controls the generation of counterfactual image features
    function CPL(X, Y)
        H ← Y
        t_c ← [p, h_c]
        for each i, j do
            sim(i, j) = BERTScore(h_i, h_j)    ▷ Eq. 5
        end for
        for q in the batch do
            v ← v_q
            Find the index k that maximizes sim(q, k) for the given index q
            v⁻ ← v_k
            Generate counterfactual image features    ▷ Eq. 2
            L_CE ← cross-entropy loss    ▷ Eq. 6
            L_CL ← contrastive loss    ▷ Eq. 4
            Update p and u with the joint optimization loss    ▷ Eq. 7
        end for
    end function

Algorithm 1 presents the learning algorithm of CPL. In summary, given a few input training samples {(x_1, y_1), ..., (x_n, y_n)}, CPL consists of three main steps: (1) compute the similarity matrix between different text prompts within the sampled batch; (2) generate counterfactual image features; (3) optimize p and u with the contrastive learning loss and the cross-entropy loss.

Task-relevant Prompt Construction
We construct task-relevant prompts H for image classification, image-text retrieval, and visual question answering, respectively. For image classification, the prompts are class labels for each task; for image-text retrieval, captions for each image are adopted as prompts; for visual question answering, we first use a pre-trained generative T5 model (Raffel et al., 2019) to convert the question-answer pairs into declarative sentences, following the VQA prompt generation method proposed in Song et al. (2022b). Then, motivated by Wei et al. (2022), we add category information to the prompt generated from templates, based on the question type, to help the model perform intermediate reasoning steps. Specifically, we add "The question is asking about others" for Other questions before the generated declarative sentence. In a similar vein, "The question is asking about yes or no" and "The question is asking about numbers" are added for Yes/No and Number questions.
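The category-prefixing step for VQA prompts can be sketched as follows. The three prefix strings are quoted from the text above; the function name, the lowercase question-type keys, and the choice to join prefix and sentence with a period and a space are assumptions. The declarative sentence is taken as given here, since producing it requires the T5 conversion step.

```python
# Prefix strings quoted from the text; the dictionary keys are an assumption.
CATEGORY_PREFIX = {
    "other": "The question is asking about others",
    "yes/no": "The question is asking about yes or no",
    "number": "The question is asking about numbers",
}

def build_vqa_prompt(question_type, declarative_sentence):
    """Prepend the category sentence to an already-generated declarative sentence."""
    prefix = CATEGORY_PREFIX[question_type.lower()]
    # Joining with ". " is an illustrative choice, not confirmed by the source.
    return f"{prefix}. {declarative_sentence}"

prompt = build_vqa_prompt("Number", "There are two dogs in the picture.")
```

An unknown question type raises `KeyError`, which seems preferable to silently emitting a prompt without category information.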

Tasks and Datasets
Image Classification. We employ seven publicly available image classification datasets used in CLIP: SUN397 (Xiao et al., 2010), Caltech101 (Griffin et al., 2007), ImageNet (Deng et al., 2009), OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback and Zisserman, 2008), and Food101 (Bossard et al., 2014). These datasets constitute a comprehensive benchmark, which covers a diverse set of vision tasks including the classification of generic objects, fine-grained image recognition, action classification, etc. To evaluate the generalization ability of methods, we split those datasets into seen and unseen classes. Only images in the seen classes are used for training. The setting follows the few-shot evaluation protocol in CLIP, where we use 16 shots for training and full test sets for testing.
Image-Text Retrieval. We consider two datasets for image-text retrieval: MSCOCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015). We adopt the widely used Karpathy split (Karpathy and Fei-Fei, 2015) for both the MSCOCO and Flickr30K datasets, where MSCOCO contains 113K/5K/5K images for train/validation/test and Flickr30K contains 29K/1K/1K images for train/validation/test. We construct few-shot subsets for both CoCoOp and CPL by taking 0.5%, 1%, and 3% of training instances. We train the model with the subsets and evaluate its performance on the complete test set. We use Recall at 1 (R@1) as the default evaluation metric.
Table 1: Result comparison with CoCoOp (Zhou et al., 2022) on seen and unseen classes across seven image classification datasets in terms of accuracy (%) under the few-shot setting. The relative difference (%) compared with CLIP is reported in color.
Table 3: Result comparison on the VQAv2 dataset (Goyal et al., 2017a) in terms of accuracy (%). The relative improvements over CLIP are reported in color. Incorporating category information into task-relevant prompts can further improve the performance.
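For reference, Recall at 1 can be computed as the fraction of queries whose top-ranked candidate is a correct match. The sketch below assumes a precomputed query-by-candidate similarity matrix and a set of gold candidate indices per query; the function name and the toy scores are hypothetical.

```python
def recall_at_1(similarity, gold):
    """Fraction of queries whose highest-scoring candidate is a gold match.

    similarity[q][j] is the score between query q and candidate j;
    gold[q] is the set of correct candidate indices for query q.
    """
    hits = 0
    for q, row in enumerate(similarity):
        top = max(range(len(row)), key=lambda j: row[j])
        if top in gold[q]:
            hits += 1
    return hits / len(similarity)

# Toy example: query 0 retrieves its gold item, query 1 does not.
similarity = [[0.9, 0.1],
              [0.6, 0.4]]
gold = [{0}, {1}]
r1 = recall_at_1(similarity, gold)  # 0.5
```

Using a set for `gold[q]` accommodates datasets where an image has several reference captions.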
Visual Question Answering. VQAv2 (Goyal et al., 2017b) is an extended dataset from the VQA (Antol et al., 2015) dataset. The questions are categorized into three types: Number, Yes/No, and Other. We set up the experiments following Anderson et al. (2018), which treats visual question answering as a classification problem: for each question, the model picks the corresponding answer from a given set of predefined most frequent candidate answers and matches it with the image.
The questions are first converted into a masked template using the pre-trained T5 model and predefined rules. The infilled template along with the questions is turned into prompts that naturally connect questions and answers. The model predicts whether the given prompt and image pairs are matched. We construct the few-shot setting by taking 0.5%, 1%, and 3% of instances for training.

Implementation Details
Baselines. We mainly compare CPL with CoCoOp (Zhou et al., 2022), one of the earliest prompt tuning methods proposed for vision-and-language pre-trained models. CoCoOp considers each input image and injects the learnable instance-aware tokens into the context vectors as the final prompt. For a fair comparison, both CPL and CoCoOp adopt CLIP (Radford et al., 2021) as the pre-trained vision-and-language backbone and are compared with respect to their relative improvements over zero-shot CLIP.
Prompt Tuning. The task-agnostic prompt is randomly initialized from a zero-mean Gaussian distribution with standard deviation 0.02, where we set length L = 4 by default. For vision and language tasks, in contrast to image classification, where an image is labeled by a category, the task-relevant prompts comprise more fine-grained details, usually a sentence. We similarly tokenize the whole sentence using the CLIP word embedding (Radford et al., 2021), and feed the tokenized results to the text encoder together with the task-agnostic prompt vectors to generate the language embedding for each prompt. In both image-text retrieval and visual question answering, all data in the test set can be treated as belonging to unseen classes.
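The initialization described above can be sketched with the standard library alone. The length of 4 and standard deviation of 0.02 come from the text; the embedding width of 512, the seed handling, and the function name are assumptions made for illustration.

```python
import random

def init_task_agnostic_prompt(length=4, dim=512, std=0.02, seed=0):
    """Return `length` prompt vectors drawn from a zero-mean Gaussian.

    `dim` (the token embedding width) is an assumed value, not stated in the text.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(length)]

prompt_vectors = init_task_agnostic_prompt()
flat = [x for row in prompt_vectors for x in row]
mean = sum(flat) / len(flat)
```

In an actual training setup these vectors would be registered as learnable parameters and prepended to the token embeddings of each task-relevant prompt.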

Main Results
Image Classification. The experimental results for image classification are shown in Table 1. With better prompts learned from counterfactual examples, our CPL method achieves clear advantages over CoCoOp for both seen and unseen classes across almost all datasets. Particularly on unseen classes, we gain an average relative improvement of 3.55%.
Meanwhile, CoCoOp shows poor generalization ability. Specifically, we found that CoCoOp performs worse than CLIP on StanfordCars on both seen and unseen classes, and on Caltech101 and Flowers102 on unseen classes, indicating that it tends to learn and leverage spurious relations and cannot generalize well to unseen classes in some cases. We believe these results are sufficient evidence that the main idea of CPL, that learning a non-spurious prompt representation helps CLIP adapt at test time, is practical.
Image-Text Retrieval. Table 2 reports results on image-text retrieval on the unseen test set. CPL beats zero-shot CLIP consistently across the three different settings, demonstrating that CPL can also learn a better prompt representation and more effectively exploit the limited amount of data on image-text retrieval. Meanwhile, CoCoOp performs even worse than CLIP on Flickr30K using 0.5% of the training data, which suggests that a tiny quantity of training data for image-text retrieval can lead to a spurious prompt representation when using a naïve instance-conditional prompt tuning method.
Visual Question Answering. For visual question answering, the results are shown in Table 3. As can be seen, CPL surpasses the baseline CoCoOp with a relative improvement of up to 25.08% when using 1% of instances for training. This shows that CPL can be effective on more complicated vision-and-language tasks. In fact, visual question answering is more challenging for zero-shot CLIP, which is pre-trained for image-text matching. During pre-training, CLIP mostly sees sentences similar to captions in image-text retrieval, and those captions can be directly used as prompts; for VQA, in contrast, question-answer pairs have to be adapted into declarative prompts. Therefore, zero-shot CLIP has poor performance on VQA, but few-shot prompt tuning via CPL can help reduce the prompt domain gap significantly. Apart from the vanilla CPL method, we examined another variant of CPL where we do not add category information to the prompt (denoted as CPL w/o Category Information). The results indicate that constructing task-relevant prompts by adding categorical information contributes to the improvement.
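For readers who want to reproduce the reporting convention, a relative improvement is usually computed as the percentage change of a new score over a reference score. The sketch below assumes that convention; the function name and the accuracy values are made up for illustration and are not results from the tables.

```python
def relative_improvement(reference, new):
    """Percentage change of `new` relative to `reference`."""
    if reference == 0:
        raise ValueError("reference score must be non-zero")
    return 100.0 * (new - reference) / reference

# Made-up accuracies (%), not taken from the paper's tables.
gain = relative_improvement(50.0, 55.0)   # 10.0
drop = relative_improvement(80.0, 76.0)   # -5.0
```

A negative value indicates that the new method scores below the reference, as happens when a tuned prompt underperforms zero-shot CLIP.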

Ablation Analysis
Negative Sampling. We compare random sampling with BERTScore sampling on ImageNet for image classification, MSCOCO for image-text retrieval, and VQAv2 for visual question answering in Table 4. With more challenging negative examples, BERTScore sampling leads to more effective prompt tuning and outperforms random sampling on all three tasks. Qualitative visualizations of the two sampling strategies are shown in Figure 4, from which it can be seen that BERTScore-sampled images are much more semantically similar to the original images.
Non-spurious Feature Visualization. We visualize the heatmap of the learned non-spurious feature weights at the image level in Figure 4. The weights are mainly concentrated on the semantically meaningful regions that are aligned with the text prompts.
Number of Shots in Image Classification. We then study the effects of the number of shots on CPL for image classification. Following the few-shot evaluation protocol adopted in CLIP, we use 4, 8, and 16 shots for training on ImageNet. As shown in Figure 5, increasing the number of shots keeps improving the performance of both methods.
Contribution of Contrastive Learning. In Section 3, we use the coefficient λ to weigh the contrastive learning loss and combine it with the cross-entropy loss. We observe that the scale of the contrastive learning loss is smaller, hence we try a larger λ to balance the two loss terms. Figure 6 shows the average accuracy across seen and unseen classes on the SUN397 dataset under four different λ values. Note that when λ is zero, there is no contribution from the contrastive loss and the method learns the prompt using the standard cross-entropy loss only. From the experimental results obtained on the SUN397 dataset, we observe that using λ = 1 leads to the best performance.

Conclusion
In this paper, we propose a Counterfactual Prompt Learning (CPL) framework to avoid time-consuming prompt engineering and learn a more generalizable prompt representation for vision and language models. We conduct extensive experiments on seven widely used image classification datasets, two image-text retrieval datasets, and one visual question answering dataset. Our proposed CPL method outperforms the previous prompt tuning baseline and zero-shot CLIP across the three tasks. In the future, we plan to develop more sophisticated methods based on CPL and extend CPL to other vision and language tasks.

Limitations
There are fairness issues in large pre-trained vision and language models such as CLIP. The prompt learning method proposed in this study learns the prompt automatically and does not address those issues in the pre-trained model. Since the method is proposed for the few-shot setting, careful inspection and tuning are also needed when testing our method on other biased datasets. The methodologies proposed in Booth et al. (2021) and Wang et al. (2021a) could possibly be paired with CPL to address these issues. Another limitation is the absence of explainability in CPL, which is a common problem with existing soft prompt tuning methods. Back-mapping the tuned soft prompt representation to natural language is one way to interpret it; however, due to the limited size of the vocabulary used by CLIP during training, prior methods such as searching for the nearest words in the embedding space cannot accurately match the vectors to natural language. Expanding the dictionary size for CLIP embeddings or developing more advanced back-mapping techniques could possibly address this limitation.

Prompt A: A large long train on a steel track.
Prompt B: A large long train on a steel track near a barn.
What if we add a barn to image A (or remove the barn from image B)? Will the prompt be changed?

Figure 1: A conceptual overview of counterfactual prompt learning. CPL constructs counterfactuals by identifying the non-spurious feature change that causes the prompt change. In this case, the "barn" feature is the essential cause of the difference between Prompt A and Prompt B.

Figure 2: The counterfactual prompt learning framework. We freeze the vision encoder F and the text encoder G, and only optimize the task-agnostic prompts and the instance-conditioned net M (blue blocks). Please refer to Section 3.2 for the explanation.

Figure 3: Counterfactual generation process. v and c are the positive image feature and label, while v⁻ and c⁻ are the negative image feature and label. ∘ is element-wise multiplication. By mixing v and v⁻, the counterfactual image feature v′ is predicted as the negative label c⁻ by the discriminator D. u is minimized so that a minimal change to the positive image feature is captured that causally changes the label.

Figure 4: Visualization of the weights of the controller parameter u on images. The first column shows the original positive examples; the second column shows BERTScore-sampled negative examples; the third column shows randomly sampled negative examples for comparison. The BERTScore values between the text prompts of the positive examples and the sampled examples are shown at the bottom.

Figure 5: Accuracy comparison on ImageNet (Deng et al., 2009) unseen classes under three different shots. CPL performs better than CoCoOp consistently and has lower standard errors.

Figure 6: Ablation of four different λ values on the SUN397 dataset in terms of average accuracy (%). The performance of CPL peaks at λ = 1.

Table 3: Result comparison on the VQAv2 dataset

Table 4: Random sampling vs. BERTScore sampling for CPL over three tasks. On ImageNet, we measure the average accuracy across seen and unseen classes. On both MSCOCO and VQAv2, we use 1% of instances for few-shot learning.