Contrastive Learning with Expectation-Maximization for Weakly Supervised Phrase Grounding

Keqin Chen, Richong Zhang, Samuel Mensah, Yongyi Mao
Abstract
Weakly supervised phrase grounding aims to learn the alignment between phrases in a caption and objects in the corresponding image using only caption-image annotations, i.e., without phrase-object annotations. Previous methods typically supervise this alignment only indirectly, through a caption-image contrastive loss, which fails to fully exploit the intrinsic structure of the multimodal data and leads to unsatisfactory performance. In this work, we instead optimize a phrase-object contrastive loss directly, even though no positive phrase-object annotations are available in the first place. Specifically, we propose a novel contrastive learning framework based on the expectation-maximization (EM) algorithm that adaptively refines the target prediction. Experiments on two widely used benchmarks, Flickr30K Entities and RefCOCO+, demonstrate the effectiveness of our framework: we obtain 63.05% top-1 accuracy on Flickr30K Entities and 59.51%/43.46% on RefCOCO+ TestA/TestB, outperforming previous methods by a large margin and even surpassing a previous state of the art that uses a pre-trained vision-language model. Furthermore, we provide a theoretical analysis of our method from the perspective of maximum likelihood estimation with latent variables.
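The core mechanism the abstract describes, alternating between estimating the latent phrase-object alignment (E-step) and optimizing a contrastive objective against that estimate (M-step), can be illustrated with a minimal sketch. The function name, tensor shapes, and single-image setup below are illustrative assumptions, not the authors' released implementation; a full version would also draw contrastive negatives from objects in other images of the batch.

```python
import torch
import torch.nn.functional as F

def em_contrastive_step(phrase_emb, object_emb, tau=0.07):
    """One EM-style training step (hypothetical sketch).

    phrase_emb: (P, d) embeddings of the P phrases in one caption
    object_emb: (O, d) embeddings of the O objects in the paired image
    """
    # Similarity between every phrase and every candidate object.
    sim = phrase_emb @ object_emb.t() / tau            # (P, O)

    # E-step: estimate the latent alignment as a posterior over
    # objects, detached so it serves as a fixed soft target.
    with torch.no_grad():
        target = F.softmax(sim, dim=-1)                # (P, O)

    # M-step: maximize the expected log-likelihood, i.e. minimize a
    # soft cross-entropy between predicted and estimated alignments.
    loss = -(target * F.log_softmax(sim, dim=-1)).sum(-1).mean()
    return loss

# Example: 4 phrases, 10 candidate objects, 256-d embeddings.
loss = em_contrastive_step(torch.randn(4, 256, requires_grad=True),
                           torch.randn(10, 256, requires_grad=True))
loss.backward()
```

The stop-gradient on the E-step target is what makes this an EM-style fixed-point update, mirroring the separation between estimating the latent variables and maximizing the resulting expected likelihood noted in the theoretical analysis.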
Anthology ID:
2022.emnlp-main.586
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8549–8559
URL:
https://aclanthology.org/2022.emnlp-main.586
DOI:
10.18653/v1/2022.emnlp-main.586
Cite (ACL):
Keqin Chen, Richong Zhang, Samuel Mensah, and Yongyi Mao. 2022. Contrastive Learning with Expectation-Maximization for Weakly Supervised Phrase Grounding. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8549–8559, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Contrastive Learning with Expectation-Maximization for Weakly Supervised Phrase Grounding (Chen et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.586.pdf