End-to-End Unsupervised Vision-and-Language Pre-training with Referring Expression Matching

Chi Chen, Peng Li, Maosong Sun, Yang Liu


Abstract
Recently, there has been growing interest in unsupervised vision-and-language pre-training (VLP), which learns multimodal representations without parallel image-caption data. These pioneering works significantly reduce the data-collection cost of VLP and achieve promising results compared to supervised VLP. However, existing unsupervised VLP methods take as input region-based visual features pre-extracted by external object detectors, which both limits flexibility and reduces computational efficiency. In this paper, we explore end-to-end unsupervised VLP with a vision encoder that directly encodes images. The vision encoder is pre-trained on image-only data and jointly optimized during multimodal pre-training. To further enhance the learned cross-modal features, we propose a novel pre-training task that predicts, from the encoded visual features, which patches contain an object referred to in natural language. Extensive experiments on four vision-and-language tasks show that our approach outperforms previous unsupervised VLP methods and obtains new state-of-the-art results.
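The referring expression matching objective described in the abstract can be pictured as patch-level binary classification: given patch features from the jointly trained vision encoder and an embedding of a referring expression, predict for each patch whether it overlaps the referred object. The snippet below is a minimal sketch of that idea under assumed module names, feature dimensions, and a binary cross-entropy loss; it is not the authors' implementation.

```python
import torch
import torch.nn as nn


class ReferringExpressionMatchingHead(nn.Module):
    """Illustrative head: scores each image patch for whether it contains
    the object mentioned in a referring expression (a sketch, not the paper's code)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, 1),
        )

    def forward(self, patch_feats: torch.Tensor, expr_emb: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, D) patch features from the vision encoder
        # expr_emb:    (B, D)    pooled embedding of the referring expression
        expr = expr_emb.unsqueeze(1).expand(-1, patch_feats.size(1), -1)
        logits = self.scorer(torch.cat([patch_feats, expr], dim=-1)).squeeze(-1)
        return logits  # (B, N): per-patch "contains referred object" scores


# Hypothetical usage: labels mark the patches overlapping the referred object's region.
if __name__ == "__main__":
    B, N, D = 2, 196, 768
    head = ReferringExpressionMatchingHead(D)
    patch_feats = torch.randn(B, N, D)
    expr_emb = torch.randn(B, D)
    labels = torch.randint(0, 2, (B, N)).float()
    loss = nn.BCEWithLogitsLoss()(head(patch_feats, expr_emb), labels)
    print(loss.item())
```

The design choice assumed here (concatenating the expression embedding with every patch feature before a small MLP scorer) is one common way to fuse a text query with dense visual features; the paper's actual fusion and supervision details are not specified in this excerpt.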
Anthology ID:
2022.emnlp-main.742
Original:
2022.emnlp-main.742v1
Version 2:
2022.emnlp-main.742v2
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
10799–10810
URL:
https://aclanthology.org/2022.emnlp-main.742
DOI:
10.18653/v1/2022.emnlp-main.742
Cite (ACL):
Chi Chen, Peng Li, Maosong Sun, and Yang Liu. 2022. End-to-End Unsupervised Vision-and-Language Pre-training with Referring Expression Matching. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10799–10810, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
End-to-End Unsupervised Vision-and-Language Pre-training with Referring Expression Matching (Chen et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.742.pdf