XtremeCLIP: Extremely Parameter-efficient Tuning for Low-resource Vision Language Understanding

Recently, Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable capability in various Visual Language Understanding (VLU) tasks. Yet, most CLIP-based methods require task-specific designs and sufficient training data. In this paper, we introduce a simple yet efficient paradigm for low-resource VLU named XtremeCLIP, which involves very few trainable parameters to improve the generalization ability of the trained models. In our XtremeCLIP framework, we reformulate a series of VLU tasks as a unified open-book affinity-matching problem. Furthermore, to handle the insufficient supervised signals in small datasets, we adopt contrastive learning to utilize the implicit sorting information of ground-truth labels to provide more supervised cues. Extensive experiments over multiple datasets on visual entailment, visual question answering, and image classification show that XtremeCLIP consistently outperforms existing baselines in low-resource settings.


Introduction
Pre-trained Visual-Language models such as X-VLM (Zeng et al., 2021) and CLIP (Radford et al., 2021) have been proposed to unify visual and textual representations in the same embedding space and have shown great potential for Visual Language Understanding (VLU). Conventional fine-tuning approaches (Clark et al., 2020; Lee et al., 2020; Wang et al., 2023) heavily depend on the time-consuming and labor-intensive process of data annotation, which is bothersome in low-resource scenarios. In the literature, Ben Zaken et al. (2022); Song et al. (2022) propose partial-parameter fine-tuning to preserve the pre-trained knowledge of these models. Yao et al. (2021); Song et al. (2022); Tsimpoukelli et al. (2021) reformulate visual grounding and visual question answering as a "fill-in-the-blank" problem via hand-crafted prompts. Gao et al. (2021); Zhang et al. (2022) utilize lightweight adapters (Houlsby et al., 2019) to retain the knowledge of CLIP. Besides, Zhou et al. (2022b,a); Zhu et al. (2022) address image classification tasks by utilizing textual representations describing image categories.
Despite their success, we suggest there are still some drawbacks in existing works. i) The discrete prompt paradigm requires labor-intensive prompt engineering, while the soft template paradigm results in an unstable training process. ii) Adapter-based or partial-parameter fine-tuning methods may underperform due to their relatively large number of tunable parameters, requiring additional training data to achieve satisfactory results. iii) The aforementioned methods are task-specific in design, implying that their effectiveness may be derived from task-specific architectures. Hence, it is vital to design a more unified parameter-efficient tuning approach to solve various VLU tasks.
In this paper, we present XtremeCLIP, an extremely parameter-efficient tuning method for solving various VLU tasks based on CLIP (Radford et al., 2021). XtremeCLIP reformulates a series of VLU tasks uniformly into an open-book affinity-matching problem. Here, we adopt a knowledge-base prototype matrix to record the salient characteristics of each class through visual-textual fusion features, then perform affinity matching between image-text pairs and the prototypes of each class. We further utilize the implicit sorting information of ground-truth labels via contrastive learning to provide more supervised cues from low-resource training sets. During model training, all parameters of the textual and visual encoders in CLIP are fixed. Hence, XtremeCLIP is extremely parameter-efficient. We conduct extensive experiments on a visual entailment (VE) benchmark (i.e., SNLI-VE), a visual question answering (VQA) benchmark (i.e., VQA v2), and several image classification (IC) benchmarks (i.e., EuroSAT, DTD, and FGVC).

XtremeCLIP: The Proposed Method

The model architecture and training procedure of XtremeCLIP are shown in Figure 1. First, a knowledge-base prototype matrix is constructed (Snell et al., 2017) by combining visual and textual features, designed to serve as a repository of the key characteristics of each class. Then, open-book affinity matching is performed between the image-text instance and the prototypes for each class.

Prototype Matrix Construction
Given a set of $N$ image-text training instances $\{(img_i, txt_i, l_i)\}_{i=1}^{N}$, where $l_i$ denotes the ground-truth label and $txt_i$ denotes the corresponding textual description of the image $img_i$, each image-text pair is encoded using the visual encoder $\mathcal{V}$ and textual encoder $\mathcal{T}$ of CLIP:

$$v_i = \mathcal{V}(img_i), \quad t_i = \mathcal{T}(txt_i), \quad v_i, t_i \in \mathbb{R}^{d}. \tag{1}$$

A fusion function $\mathcal{F}$ is employed to obtain uniform image-text representations that capture the interactions between visual and textual information:

$$\mathcal{F}(v_1, v_2) = [v_1; v_2; v_1 + v_2; v_1 - v_2; v_1 \odot v_2], \tag{2}$$

where $\mathcal{F}(v_1, v_2) \in \mathbb{R}^{5d}$, $[\cdot]$ denotes the concatenation operator, and $\odot$ denotes element-wise multiplication. These fusion features are used to construct the knowledge-base prototype matrix $W_P = [M_1; \cdots; M_C] \in \mathbb{R}^{C \times 5d}$ by averaging them per their ground-truth labels:

$$M_c = \frac{\sum_{i=1}^{N} \mathbb{I}(l_i = c)\, \mathcal{F}(v_i, t_i)}{\sum_{i=1}^{N} \mathbb{I}(l_i = c)}, \tag{3}$$

where $C$ denotes the number of classes, $M_c$ denotes the prototype of the $c$-th class with $c \in 1 \cdots C$, and $\mathbb{I}(\cdot)$ denotes the indicator function.
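As a concrete illustration, the following PyTorch sketch builds the prototype matrix from pre-computed, frozen CLIP features. The function names are ours, and the concatenation-based form of the fusion follows our reading of Eq. 2, so this is a sketch rather than the authors' code:

```python
import torch

def fuse(v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Fusion function F (Eq. 2): concatenate interaction features into a 5d vector per pair."""
    return torch.cat([v, t, v + t, v - t, v * t], dim=-1)

def build_prototype_matrix(v_feats, t_feats, labels, num_classes):
    """Average fused features per ground-truth class (Eq. 3).

    v_feats, t_feats: (N, d) frozen CLIP visual/textual features.
    labels: (N,) ground-truth class indices (int64).
    Returns W_P of shape (num_classes, 5 * d).
    """
    fused = fuse(v_feats, t_feats)                            # (N, 5d)
    W_P = torch.zeros(num_classes, fused.size(-1))
    counts = torch.zeros(num_classes, 1)
    W_P.index_add_(0, labels, fused)                          # per-class sum of fused features
    counts.index_add_(0, labels, torch.ones(len(labels), 1))  # per-class instance counts
    return W_P / counts.clamp(min=1)                          # per-class mean = class prototype
```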

Open-book Matching
Prototype Matching for VE and VQA. In VE or VQA, affinity matching is performed between the fusion feature of a given image-text pair and the prototypes for each class:

$$P_i = \mathrm{softmax}\left(W_P\, \mathcal{F}(v_i, t_i)\right), \tag{4}$$

where $P_{i,c}$ denotes the matching probability of the $i$-th instance w.r.t. the $c$-th class.

Prototype Matching for IC. In traditional IC tasks, only images are provided, without corresponding textual descriptions. We obtain textual descriptions (prompts) for all classes, following Radford et al. (2021). Given an image $img_i$ and the textual descriptions of all image categories $\{t_c \mid c = 1 \cdots C\}$, the predicted probability (denoted as $P_{i,c}$) of $img_i$ w.r.t. the $c$-th image category is as follows:

$$P_{i,c} = \frac{\exp\left(M_c^\top \mathcal{F}(\mathcal{V}(img_i), \mathcal{T}(t_c))\right)}{\sum_{c'=1}^{C} \exp\left(M_{c'}^\top \mathcal{F}(\mathcal{V}(img_i), \mathcal{T}(t_{c'}))\right)}. \tag{5}$$

Thus, the entire probabilistic distribution is $P_i = (P_{i,1}, \cdots, P_{i,C})$.
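Both matching modes can be sketched as follows, reusing the `fuse` function from the previous sketch and assuming dot-product affinities followed by a softmax (the paper's exact affinity may differ):

```python
import torch
import torch.nn.functional as F

def match_ve_vqa(W_P, v, t):
    """VE/VQA (Eq. 4): affinity between one pair's fused feature and every class prototype."""
    return F.softmax(W_P @ fuse(v, t), dim=-1)              # (C,) distribution over classes

def match_ic(W_P, img_feat, class_text_feats):
    """IC (Eq. 5): fuse the image with each class prompt, match against that class's prototype."""
    fused = fuse(img_feat.expand_as(class_text_feats), class_text_feats)  # (C, 5d)
    affinities = (W_P * fused).sum(dim=-1)                  # one affinity per class
    return F.softmax(affinities, dim=-1)
```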

Training Paradigm
XtremeCLIP has only one set of tunable parameters, namely the prototype matrix $W_P$. Its fusion function, visual encoder, and textual encoder are solely utilized for constructing the prototype matrix, with all their parameters frozen during the training phase. In XtremeCLIP, the model is trained using the Cross-Entropy (CE) loss given $P_i$. The sample-wise CE loss is defined as follows:

$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} l_{i,c} \log P_{i,c}, \tag{6}$$

where $l_{i,c}$ denotes the ground-truth label w.r.t. the $c$-th class (1 if $c$ is the ground-truth class of the $i$-th instance, 0 otherwise). However, the model can hardly achieve satisfactory performance with only the supervised signals from CE in low-resource tasks. Given that an instance's affinity with its ground-truth class should be ranked higher than its affinities with other classes, this implicit sorting information can be utilized to guide the model to recognize instances' ground-truth classes via contrastive learning (Zhong et al., 2020). We define the affinity of the ground-truth category (i.e., the prototype matching probability, denoted as $P_{i,l_i}$) as the positive sample and the other affinities in $P_i$ as negative samples. Following Liu et al. (2022); Liu and Liu (2021), the sample-wise Contrastive Learning (CL) loss is computed as:

$$\mathcal{L}_{CL} = \sum_{c \neq l_i} \max\left(0,\; P_{i,c} - P_{i,l_i} + \gamma\right), \tag{7}$$

where $\gamma$ is a margin hyperparameter. The total loss function for XtremeCLIP, namely $\mathcal{L}$, is defined as:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda\, \mathcal{L}_{CL}, \tag{8}$$

where $\lambda$ balances the two terms.

Experiments
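A sketch of the combined objective follows; the margin and weight are assumed hyperparameters, since the excerpt does not give their values, and Eq. 7 is our margin-based reading of the contrastive term:

```python
import torch

def xtremeclip_loss(P, labels, margin=0.1, lam=1.0):
    """Cross-entropy plus a margin-based contrastive term over matching probabilities.

    P: (B, C) prototype-matching probabilities; labels: (B,) ground-truth class indices.
    margin and lam are assumed hyperparameters, not values from the paper.
    """
    idx = torch.arange(len(labels))
    pos = P[idx, labels].unsqueeze(1)                 # (B, 1) ground-truth affinities
    ce = -torch.log(pos.squeeze(1) + 1e-12).mean()    # Eq. 6, averaged over the batch
    # Eq. 7 (as reconstructed): each negative affinity should trail the positive by `margin`.
    viol = (P - pos + margin).clamp(min=0)
    viol = viol.scatter(1, labels.unsqueeze(1), 0.0)  # zero out the positive's own term
    cl = viol.sum(dim=1).mean()
    return ce + lam * cl                              # Eq. 8
```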

Experimental Settings
We briefly describe the experimental settings here and leave more details to the Appendix.
Baselines. In our work, we compare XtremeCLIP with zero-shot CLIP (Radford et al., 2021); fine-tuning paradigms including standard fine-tuning, mixout (Lee et al., 2020), pre-trained weight decay (weight decay) (Lee et al., 2020), and Layer-wise Learning Rate Decay (LLRD) (Clark et al., 2020); partial-parameter fine-tuning paradigms including BitFit (Ben Zaken et al., 2022) and BiNor (Song et al., 2022); and adapter-based methods including CLIP-Adapter (Gao et al., 2021) and Tip-Adapter (Zhang et al., 2022).
Backbone. For a fair comparison, all baselines and our approach adopt the ViT-B/16 (ViT-Base with patch size 16 × 16) version of CLIP. We also experiment with other versions of CLIP.

Experimental Results
VE&VQA results in low-resource settings. Table 1 presents the results of XtremeCLIP and baselines in low-resource VE and VQA. The fine-tuning paradigms perform worse than the partial fine-tuning paradigms in all settings, which demonstrates that conventional fine-tuning paradigms are data-hungry and not suitable for low-resource VLU tasks. XtremeCLIP consistently outperforms partial fine-tuning and adapter-based methods, showing that reformulating VLU tasks as prototype affinity matching can efficiently utilize visual-textual information with much fewer trainable parameters.
Few-shot IC. Table 2 presents the performance of XtremeCLIP and baselines in few-shot IC.
Data-scale study. Figure 2 presents the influence of the number of training instances on XtremeCLIP. As the amount of training data increases, the accuracy of XtremeCLIP on VE and FGVC significantly increases while that of full fine-tuning only slightly increases, which demonstrates that XtremeCLIP uses training data more efficiently.

Conclusion
We propose XtremeCLIP, a simple and efficient paradigm that reformulates VLU tasks as a prototype affinity matching problem. We adopt contrastive learning to leverage implicit sorting information from ground-truth labels, providing more supervised cues to handle insufficient supervised signals in small datasets. Experimental results demonstrate that XtremeCLIP consistently outperforms all baselines in low-resource scenarios.

Limitations
In this paper, the proposed XtremeCLIP framework is mainly focused on CLIP-based discriminative VLU tasks. In future work, we will extend XtremeCLIP to other pre-trained Vision-Language models and apply XtremeCLIP to generative tasks such as image captioning, visual grounding, and visual relation extraction.

B Experimental Details

B.1 Training Corpora
We collect the pre-processed IC training corpora (i.e., FGVC (Maji et al., 2013), EuroSAT (Helber et al., 2019), and DTD (Cimpoi et al., 2014)) from the open-sourced project of Zhang et al. (2022) on GitHub. The hand-crafted prompt templates that describe the category names for EuroSAT, DTD, and FGVC are listed in Table 5. During model training, we randomly select 8 and 16 images of each category for few-shot IC.
For the visual entailment and visual question answering tasks, we download the pre-processed SNLI-VE (Xie et al., 2018) and VQA v2 (Goyal et al., 2017) datasets from the open-sourced project X-VLM (Zeng et al., 2021) on GitHub and randomly select 2000, 5000, and 10000 samples from each dataset for the low-resource VLU tasks.
The statistics are listed in Table 6.

B.2 Experimental Details of Our Approach
We employ ViT-B/16 from OpenAI CLIP as the default underlying model. We train XtremeCLIP with the AdamW algorithm (β1 = 0.9, β2 = 0.999, ε = 1e-4). Training is performed on an NVIDIA Tesla A100 GPU. We run XtremeCLIP for 20 epochs on VE and VQA with a batch size of 16, which takes around 20 minutes, and for 100 epochs on IC with a batch size of 16, which takes around 60 minutes.
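For concreteness, a minimal optimizer setup matching the stated hyperparameters is sketched below; the learning rate is not given in this excerpt, so the value used here is a placeholder:

```python
import torch

C, d = 3, 512   # e.g., SNLI-VE has 3 classes; CLIP ViT-B/16 features are 512-dimensional
# The prototype matrix is the only tunable parameter; in practice it would be initialized
# from the class-averaged fusion features (Eq. 3) rather than zeros.
W_P = torch.zeros(C, 5 * d, requires_grad=True)
optimizer = torch.optim.AdamW([W_P], lr=1e-3,  # lr is a placeholder; the excerpt does not state it
                              betas=(0.9, 0.999), eps=1e-4)
```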
For CLIP-Adapter (Gao et al., 2021) and Tip-Adapter (Zhang et al., 2022), we directly use their open-sourced code on GitHub. Though Tip-Adapter is proposed for few-shot IC only, it can be directly utilized for other VLU tasks as well by replacing the image features with the visual-textual fusion features of the input image-text pairs when constructing the instance retrieval matrix.
To adapt CLIP-Adapter to VE and VQA, we respectively apply a visual adapter and a textual adapter to the visual encoder V and textual encoder T of CLIP to learn adaptive visual and textual features, then compute a weighted sum of the adaptive features and the original visual and textual features from CLIP, following the original paper. Thereafter, we obtain the visual-textual fusion representations via the fusion function F. Finally, we perform image-text pair classification with a classification head, as in Gao et al. (2021), and as sketched below.
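A minimal sketch of this adaptation, assuming the residual-style mixing of the original CLIP-Adapter and the five-way fusion from Section 2; the hidden size and mixing coefficient `alpha` are illustrative, not values from the paper:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter as in CLIP-Adapter: down-project, ReLU, up-project."""
    def __init__(self, d, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def forward(self, x):
        return self.net(x)

class ClipAdapterForVLU(nn.Module):
    """Adapted visual/textual features, fused and classified; CLIP encoders stay frozen."""
    def __init__(self, d, num_classes, alpha=0.2):
        super().__init__()
        self.vis_adapter, self.txt_adapter = Adapter(d), Adapter(d)
        self.alpha = alpha
        self.head = nn.Linear(5 * d, num_classes)   # classification head over fused features

    def forward(self, v, t):
        v = self.alpha * self.vis_adapter(v) + (1 - self.alpha) * v  # weighted sum with original
        t = self.alpha * self.txt_adapter(t) + (1 - self.alpha) * t
        fused = torch.cat([v, t, v + t, v - t, v * t], dim=-1)       # fusion function F
        return self.head(fused)
```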

B.3 Case Study

Figure 3 presents the probability distributions of several images before and after fine-tuning with our approach. The constructed knowledge-base prototype matrix indeed captures the salient characteristics of the categories; based on this knowledge, images can be correctly classified even in zero-shot learning. After fine-tuning, the performance of XtremeCLIP is further boosted.

B.4 Fusion Function Ablation
Table 7 shows the results of XtremeCLIP with various fusion functions, including traditional higher-order and element-wise exponential operations. The results indicate that the selected fusion function, namely F in Eq. 2, is both simple and highly effective, outperforming the others.
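For illustration, plausible stand-ins for such variants are sketched below; the exact forms evaluated in Table 7 are not given in this excerpt, so these are hypothetical examples only:

```python
import torch

# Illustrative stand-ins for "higher-order" and "element-wise exponential" fusions,
# kept at the same 5d output width as Eq. 2; not the forms from Table 7.
def fuse_higher_order(v, t):
    return torch.cat([v, t, v * t, v * v, t * t], dim=-1)            # second-order terms

def fuse_exponential(v, t):
    return torch.cat([v, t, torch.exp(v - t), torch.exp(t - v), v * t], dim=-1)
```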

B.5 Additional Comparison

Figure 2:
Accuracy (%) of XtremeCLIP and full fine-tuning with various numbers of training samples on VE and FGVC.

Figure 3:
Probability distributions before and after fine-tuning for the few-shot IC task.

Table 1:
Accuracy (%) on Visual Entailment and Visual Question Answering tasks with 2000, 5000, and 10000 training samples. #Params. denotes the number of tunable parameters. Best results are in bold.

Table 5:
The hard prompt templates for image classification datasets.{} denotes the position of the category names to be filled in.

Table 6:
Statistics of experimental datasets. #Class: the number of task categories. #Test: the number of test instances.
Task | Dataset | #Class | #Test
IC   | EuroSAT (Helber et al., 2019) | 10  | 8100
IC   | DTD (Cimpoi et al., 2014)     | 47  | 1692
IC   | FGVC (Maji et al., 2013)      | 102 | 3333
VE   | SNLI-VE (Xie et al., 2018)    | 3   | 17901
VQA  | VQA v2 (Goyal et al., 2017)   | 2   | 80541
Table 8 presents the image classification results of XtremeCLIP and the baseline methods CoOp (Zhou et al., 2022b) and Linear Probe (Gao et al., 2021), which are solely designed for image classification. The results demonstrate that reformulating image classification as an open-book matching paradigm indeed helps XtremeCLIP consistently outperform CoOp and Linear Probe.