Curriculum Masking in Vision-Language Pretraining to Maximize Cross Modal Interaction

Kraig Tou, Zijun Sun


Abstract
Many leading methods in vision-and-language (V+L) pretraining use masked language modeling (MLM) as a standard pretraining component, with the expectation that reconstructing masked text tokens will require reference to the corresponding image context via cross/self attention and thus promote representation fusion. However, we observe that minimizing the MLM loss in earlier training stages can depend disproportionately on local text signals, leading to poor training efficiency and inconsistency with the goal of representation fusion. The extent of this lack of cross-modal interaction depends strongly on which token(s) are masked. To address this issue, we propose a curriculum masking scheme as a replacement for random masking. Tokens are selected to be masked at a frequency proportional to the expected level of cross-modal interaction necessary to reconstruct them. This is achieved using a parallel mask selection agent that measures the cross-modal flow of information and treats it as a reward to be maximized. By additionally masking contiguous spans that include key objects and their relations, we also achieve better relational understanding, which has been shown to be lacking in many SOTA models. Our experiments on a wide range of V+L tasks show that we trail closely behind state-of-the-art methods despite pretraining on 300x to 1000x less data, and we also achieve either top or runner-up performance on tasks from the ARO benchmark, which tests compositional relationships. Finally, we demonstrate the potential of our method to scale to larger pretraining data.
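As a rough illustration of the idea described in the abstract, the sketch below samples tokens to mask with probability proportional to a per-token cross-modal interaction score (for example, attention mass flowing from each text token to image tokens). The function names, the scoring heuristic, and the ratio/temperature parameters are illustrative assumptions and not the paper's actual mask selection agent.

# Minimal sketch (not the authors' implementation): convert per-token
# cross-modal interaction scores into masking probabilities, so that
# tokens whose reconstruction is expected to require image context
# are masked more often than tokens recoverable from local text alone.
import torch

def masking_probabilities(cross_modal_scores: torch.Tensor,
                          mask_ratio: float = 0.15,
                          temperature: float = 1.0) -> torch.Tensor:
    """cross_modal_scores: (seq_len,) nonnegative per-token scores,
    e.g. attention mass flowing from each text token to image tokens.
    Returns per-token masking probabilities whose sum is approximately
    mask_ratio * seq_len (each probability clamped to at most 1)."""
    weights = torch.softmax(cross_modal_scores / temperature, dim=-1)
    probs = weights * mask_ratio * cross_modal_scores.numel()
    return probs.clamp(max=1.0)

def sample_mask(cross_modal_scores: torch.Tensor,
                mask_ratio: float = 0.15) -> torch.Tensor:
    """Bernoulli-sample a boolean mask biased toward high-interaction tokens."""
    probs = masking_probabilities(cross_modal_scores, mask_ratio)
    return torch.bernoulli(probs).bool()

if __name__ == "__main__":
    # Toy example: 8 text tokens; higher scores mean the token is
    # expected to need more image context to be reconstructed.
    scores = torch.tensor([0.1, 0.8, 0.2, 0.9, 0.05, 0.4, 0.7, 0.1])
    print(sample_mask(scores, mask_ratio=0.25))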
Anthology ID:
2024.naacl-long.203
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
3672–3688
URL:
https://aclanthology.org/2024.naacl-long.203
Cite (ACL):
Kraig Tou and Zijun Sun. 2024. Curriculum Masking in Vision-Language Pretraining to Maximize Cross Modal Interaction. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3672–3688, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Curriculum Masking in Vision-Language Pretraining to Maximize Cross Modal Interaction (Tou & Sun, NAACL 2024)
PDF:
https://aclanthology.org/2024.naacl-long.203.pdf
Copyright:
2024.naacl-long.203.copyright.pdf