Aishwarya Agrawal


pdf bib
Vision-Language Pretraining: Current Trends and the Future
Aishwarya Agrawal | Damien Teney | Aida Nematzadeh
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

In the last few years, there has been an increased interest in building multimodal (vision-language) models that are pretrained on larger but noisier datasets where the two modalities (e.g., image and text) loosely correspond to each other (e.g., Lu et al., 2019; Radford et al., 2021). Given a task (such as visual question answering), these models are then often fine-tuned on task-specific supervised datasets. (e.g., Lu et al., 2019; Chen et al.,2020; Tan and Bansal, 2019; Li et al., 2020a,b). In addition to the larger pretraining datasets, the transformer architecture (Vaswani et al., 2017) and in particular self-attention applied to two modalities are responsible for the impressive performance of the recent pretrained models on downstream tasks (Hendricks et al., 2021). In this tutorial, we focus on recent vision-language pretraining paradigms. Our goal is to first provide the background on image–language datasets, benchmarks, and modeling innovations before the multimodal pretraining area. Next we discuss the different family of models used for vision-language pretraining, highlighting their strengths and shortcomings. Finally, we discuss the limits of vision-language pretraining through statistical learning, and the need for alternative approaches such as causal representation learning.


pdf bib
Visual Storytelling
Ting-Hao Kenneth Huang | Francis Ferraro | Nasrin Mostafazadeh | Ishan Misra | Aishwarya Agrawal | Jacob Devlin | Ross Girshick | Xiaodong He | Pushmeet Kohli | Dhruv Batra | C. Lawrence Zitnick | Devi Parikh | Lucy Vanderwende | Michel Galley | Margaret Mitchell
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Resolving Language and Vision Ambiguities Together: Joint Segmentation & Prepositional Attachment Resolution in Captioned Scenes
Gordon Christie | Ankit Laddha | Aishwarya Agrawal | Stanislaw Antol | Yash Goyal | Kevin Kochersberger | Dhruv Batra
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Analyzing the Behavior of Visual Question Answering Models
Aishwarya Agrawal | Dhruv Batra | Devi Parikh
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing