Compressing and Debiasing Vision-Language Pre-Trained Models for Visual Question Answering

Qingyi Si, Yuanxin Liu, Zheng Lin, Peng Fu, Yanan Cao, Weiping Wang


Abstract
Despite the excellent performance of vision-language pre-trained models (VLPs) on the conventional VQA task, they still suffer from two problems: First, VLPs tend to rely on language biases in datasets and fail to generalize to out-of-distribution (OOD) data. Second, they are inefficient in terms of memory footprint and computation. Although promising progress has been made on both problems, most existing works tackle them independently. To facilitate the application of VLPs to VQA tasks, it is imperative to jointly study VLP compression and OOD robustness, which, however, has not yet been explored. This paper investigates whether a VLP can be compressed and debiased simultaneously by searching for sparse and robust subnetworks. To this end, we systematically study the design of a training and compression pipeline to search for the subnetworks, as well as the assignment of sparsity to different modality-specific modules. Our experiments involve 2 VLPs, 2 compression methods, 4 training methods, 2 datasets and a range of sparsity levels. Our results show that there indeed exist sparse and robust subnetworks, which are competitive with the debiased full VLP and clearly outperform state-of-the-art debiasing methods with fewer parameters on the OOD datasets VQA-CP v2 and VQA-VS. The code can be found at https://github.com/PhoebusSi/Compress-Robust-VQA.
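The subnetwork search described above relies on compression methods that keep only a fraction of a model's weights. As a minimal illustration of one such method, the sketch below implements unstructured magnitude pruning over a flat list of weights; the function name and interface are illustrative and not taken from the paper, which applies such techniques to full VLP parameter matrices.

```python
def magnitude_prune(weights, sparsity):
    """Return a binary mask that zeroes out the smallest-magnitude weights.

    weights:  flat list of floats (stand-in for a model's parameters)
    sparsity: fraction of weights to prune, in [0, 1]

    A hypothetical, minimal sketch of unstructured magnitude pruning;
    real pipelines operate per-layer on tensors and often retrain or
    fine-tune the surviving weights afterwards.
    """
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Rank indices by absolute weight value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0 if i in pruned else 1 for i in range(len(weights))]


weights = [0.9, -0.05, 0.4, -0.7, 0.01, 0.3]
mask = magnitude_prune(weights, 0.5)  # keep the 3 largest-magnitude weights
sparse_weights = [w * m for w, m in zip(weights, mask)]
```

Applying the mask elementwise yields the sparse subnetwork; the paper's contribution is showing that such subnetworks can be selected and trained so that they are also robust to language biases, not just small.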
Anthology ID:
2023.emnlp-main.34
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
513–529
URL:
https://aclanthology.org/2023.emnlp-main.34
DOI:
10.18653/v1/2023.emnlp-main.34
Cite (ACL):
Qingyi Si, Yuanxin Liu, Zheng Lin, Peng Fu, Yanan Cao, and Weiping Wang. 2023. Compressing and Debiasing Vision-Language Pre-Trained Models for Visual Question Answering. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 513–529, Singapore. Association for Computational Linguistics.
Cite (Informal):
Compressing and Debiasing Vision-Language Pre-Trained Models for Visual Question Answering (Si et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.34.pdf
Video:
https://aclanthology.org/2023.emnlp-main.34.mp4