MAP: Low-data Regime Multimodal Learning with Adapter-based Pre-training and Prompting

Wenyan Li, Dong Li, Wanjing Li, Yuanjie Wang, Hai Jie, Yiran Zhong


Abstract
Pretrained vision-language (VL) models have recently shown impressive results on various multimodal downstream tasks. Many of the benchmark models build on pretrained causal language models (LMs), leveraging the few-shot learning and generalization capabilities that LMs acquire from training on large text corpora. However, these models are often gigantic and require large-scale image and text data, at high computational cost, to train. This paper introduces a moderate-size model called MAP for efficient VL transfer learning through adapter-based pretraining and prompting. We aim to answer the question of how much can be accomplished through VL pretraining in the low-data regime while maximizing the efficiency of transferring knowledge from a moderate-size frozen LM. Our experiments demonstrate that MAP achieves substantially better zero-shot and few-shot performance on downstream VL tasks with only 10% of the pretraining data and a 30x lighter pretrained LM backbone compared to Frozen. MAP also outperforms fully trained models of comparable size at retaining its transfer learning ability as the amount of training data decreases.
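The general recipe the abstract describes, keeping a pretrained LM frozen and training only lightweight adapter modules, can be illustrated with a minimal PyTorch sketch. This is not MAP's actual implementation: the layer type, hidden size, and bottleneck width below are illustrative assumptions standing in for whatever pretrained causal LM and adapter configuration the paper uses.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, hidden: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen LM's representation intact
        # when the adapter is near-identity at initialization.
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Wraps a frozen transformer block with a trainable adapter on its output."""

    def __init__(self, block: nn.Module, hidden: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # the pretrained LM backbone stays frozen
        self.adapter = Adapter(hidden)  # only these parameters are updated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))


if __name__ == "__main__":
    hidden = 256
    # Stand-in for one pretrained LM layer; a real setup would wrap each
    # layer of an actual pretrained causal LM rather than a fresh block.
    layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
    adapted = AdaptedBlock(layer, hidden)

    x = torch.randn(2, 10, hidden)  # (batch, sequence, hidden)
    out = adapted(x)

    trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
    total = sum(p.numel() for p in adapted.parameters())
    print(f"trainable params: {trainable} / {total}")  # only the adapter trains
```

Training only the adapters (and, in a VL setting, a visual prompt or prefix fed to the frozen LM) is what keeps both the pretraining data requirement and the compute budget small relative to fully finetuned models.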
Anthology ID:
2023.clasp-1.19
Volume:
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)
Month:
September
Year:
2023
Address:
Gothenburg, Sweden
Editors:
Ellen Breitholtz, Shalom Lappin, Sharid Loaiciga, Nikolai Ilinykh, Simon Dobnik
Venue:
CLASP
SIG:
SIGSEM
Publisher:
Association for Computational Linguistics
Pages:
185–190
URL:
https://aclanthology.org/2023.clasp-1.19
Cite (ACL):
Wenyan Li, Dong Li, Wanjing Li, Yuanjie Wang, Hai Jie, and Yiran Zhong. 2023. MAP: Low-data Regime Multimodal Learning with Adapter-based Pre-training and Prompting. In Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD), pages 185–190, Gothenburg, Sweden. Association for Computational Linguistics.
Cite (Informal):
MAP: Low-data Regime Multimodal Learning with Adapter-based Pre-training and Prompting (Li et al., CLASP 2023)
PDF:
https://aclanthology.org/2023.clasp-1.19.pdf