XtremeCLIP: Extremely Parameter-efficient Tuning for Low-resource Vision Language Understanding
Moming Tang, Chengyu Wang, Jianing Wang, Chuanqi Tan, Songfang Huang, Cen Chen, Weining Qian
Findings of the Association for Computational Linguistics: ACL 2023
Recently, Contrastive Visual-Language Pre-training (CLIP) has demonstrated remarkable capability in various Visual Language Understanding (VLU) tasks. Yet, most CLIP-based methods require task-specific designs and sufficient training data. In this paper, we introduce a simple yet efficient paradigm for low-resource VLU named XtremeCLIP, which involves very few trainable parameters to improve the generalization ability of the trained models. In our XtremeCLIP framework, we reformulate a series of VLU tasks as a unified open-book affinity-matching problem. Furthermore, to handle the insufficient supervision signals in small datasets, we adopt contrastive learning to exploit the implicit ranking information of ground-truth labels and provide additional supervisory cues. Extensive experiments over multiple datasets on visual entailment, visual question answering, and image classification show that XtremeCLIP consistently outperforms existing baselines in low-resource settings.
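The abstract describes the approach only at a high level. The sketch below is an illustration of the general idea of affinity matching over frozen CLIP embeddings trained with a contrastive objective; the dimensions, the tiny trainable projection, and the temperature value are assumptions for demonstration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: match frozen image embeddings against candidate
# label-text embeddings and train with a contrastive (InfoNCE-style) loss
# so the ground-truth label receives the highest affinity.
# In practice the embeddings would come from a frozen CLIP model
# (e.g., get_image_features / get_text_features with gradients disabled).

dim, num_labels, batch = 512, 4, 8

image_feats = torch.randn(batch, dim)        # stand-in for frozen image embeddings
label_feats = torch.randn(num_labels, dim)   # stand-in for frozen label-text embeddings
targets = torch.randint(0, num_labels, (batch,))

# The only trainable parameters: a small linear projection on the image side.
proj = torch.nn.Linear(dim, dim)
optimizer = torch.optim.AdamW(proj.parameters(), lr=1e-3)

for _ in range(10):
    img = F.normalize(proj(image_feats), dim=-1)
    txt = F.normalize(label_feats, dim=-1)
    affinity = img @ txt.t() / 0.07            # cosine affinities with temperature
    loss = F.cross_entropy(affinity, targets)  # contrastive objective over candidate labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```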