Vision-Language Pretraining: Current Trends and the Future

Aishwarya Agrawal, Damien Teney, Aida Nematzadeh


Abstract
In the last few years, there has been increased interest in building multimodal (vision-language) models that are pretrained on larger but noisier datasets where the two modalities (e.g., image and text) loosely correspond to each other (e.g., Lu et al., 2019; Radford et al., 2021). Given a task (such as visual question answering), these models are then often fine-tuned on task-specific supervised datasets (e.g., Lu et al., 2019; Chen et al., 2020; Tan and Bansal, 2019; Li et al., 2020a,b). In addition to the larger pretraining datasets, the transformer architecture (Vaswani et al., 2017), and in particular self-attention applied to two modalities, is responsible for the impressive performance of recent pretrained models on downstream tasks (Hendricks et al., 2021). In this tutorial, we focus on recent vision-language pretraining paradigms. Our goal is to first provide the background on image–language datasets, benchmarks, and modeling innovations that predate the multimodal pretraining era. Next, we discuss the different families of models used for vision-language pretraining, highlighting their strengths and shortcomings. Finally, we discuss the limits of vision-language pretraining through statistical learning, and the need for alternative approaches such as causal representation learning.
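
To make "self-attention applied to two modalities" concrete, here is a minimal sketch (not from the tutorial itself) of the single-stream design used by several of the cited models (cf. Chen et al., 2020): text-token and image-region features are projected into a shared space, concatenated into one sequence, and processed by standard transformer self-attention, so every position can attend across modalities. All names, dimensions, and the PyTorch framework choice are illustrative assumptions; positional embeddings are omitted for brevity.

# Hypothetical single-stream vision-language encoder sketch (PyTorch).
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    def __init__(self, text_vocab=30522, region_dim=2048, d_model=768,
                 n_heads=12, n_layers=2):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.region_proj = nn.Linear(region_dim, d_model)  # project visual region features
        # learned embeddings marking which modality each position belongs to
        self.modality_embed = nn.Embedding(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids, region_feats):
        # token_ids: (B, T) word-piece ids; region_feats: (B, R, region_dim)
        txt = self.text_embed(token_ids)       # (B, T, d_model)
        img = self.region_proj(region_feats)   # (B, R, d_model)
        txt = txt + self.modality_embed.weight[0]
        img = img + self.modality_embed.weight[1]
        x = torch.cat([txt, img], dim=1)       # one joint sequence: text ++ image
        return self.encoder(x)                 # self-attention spans both modalities

model = CrossModalEncoder()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
print(out.shape)  # torch.Size([2, 52, 768])

In pretraining, such an encoder is typically trained with objectives like masked language modeling conditioned on the image and image-text matching; the key architectural point is simply that the attention pattern is not restricted within a modality.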
Anthology ID: 2022.acl-tutorials.7
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Luciana Benotti, Naoaki Okazaki, Yves Scherrer, Marcos Zampieri
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 38–43
URL: https://aclanthology.org/2022.acl-tutorials.7
DOI: 10.18653/v1/2022.acl-tutorials.7
Cite (ACL): Aishwarya Agrawal, Damien Teney, and Aida Nematzadeh. 2022. Vision-Language Pretraining: Current Trends and the Future. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 38–43, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Vision-Language Pretraining: Current Trends and the Future (Agrawal et al., ACL 2022)
PDF: https://aclanthology.org/2022.acl-tutorials.7.pdf
Data: Visual Question Answering