ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

Sanjay Subramanian; William Merrill; Trevor Darrell; Matt Gardner; Sameer Singh; Anna Rohrbach

doi:10.18653/v1/2022.acl-long.357

Abstract

Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP’s contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. Thus, the second component of ReCLIP is a spatial relation resolver that handles several types of spatial relations. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game imagery), ReCLIP’s relative improvement over supervised ReC models trained on real images is 8%.
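The region-scoring idea in the abstract (isolate each object proposal by cropping and by blurring, then score each isolated view) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the allenai/reclip repository. It assumes a grayscale image stored as nested lists, a simple 3x3 mean blur, and a hypothetical `score` callable that stands in for CLIP's similarity between an image and the referring expression. Summing the crop and blur scores is also an assumption made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) with exclusive upper bounds


def crop(image: Image, box: Box) -> Image:
    """Isolate a proposal by keeping only the pixels inside the box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def blur_outside(image: Image, box: Box) -> Image:
    """Isolate a proposal by blurring everything outside the box (3x3 mean)."""
    x0, y0, x1, y1 = box
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if x0 <= x < x1 and y0 <= y < y1:
                continue  # pixels inside the proposal stay sharp
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(window) / len(window)
    return out


def pick_box(image: Image, boxes: Sequence[Box],
             score: Callable[[Image], float]) -> int:
    """Return the index of the proposal whose isolated views score highest.

    `score` is a hypothetical stand-in for an image-text similarity model;
    adding the crop and blur scores is an illustrative choice.
    """
    totals = [score(crop(image, b)) + score(blur_outside(image, b))
              for b in boxes]
    return max(range(len(boxes)), key=totals.__getitem__)


def mean_brightness(image: Image) -> float:
    """Toy scorer (assumption): pretends the expression is 'the bright region'."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


# 4x4 image with a bright 2x2 patch in the top-left corner.
toy_image = [[1.0, 1.0, 0.0, 0.0],
             [1.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]
proposals = [(0, 0, 2, 2), (2, 2, 4, 4)]
best = pick_box(toy_image, proposals, mean_brightness)
```

In a real pipeline, the toy scorer would be replaced by a model that embeds each isolated view and the referring expression and compares them. The selection step stays the same: score every proposal and take the highest-scoring one.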

Anthology ID: 2022.acl-long.357
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 5198–5215
URL: https://aclanthology.org/2022.acl-long.357
DOI: 10.18653/v1/2022.acl-long.357
Cite (ACL): Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, and Anna Rohrbach. 2022. ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5198–5215, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension (Subramanian et al., ACL 2022)
PDF: https://aclanthology.org/2022.acl-long.357.pdf
Video: https://aclanthology.org/2022.acl-long.357.mp4
Code: allenai/reclip + additional community code
Data: CLEVR, MS COCO, RefCOCO
