Prompting Vision-Language Models For Aspect-Controlled Generation of Referring Expressions

Danfeng Guo; Sanchit Agarwal; Arpit Gupta; Jiun-Yu Kao; Emre Barut; Tagyoung Chung; Jing Huang; Mohit Bansal

doi:10.18653/v1/2024.findings-naacl.178

Abstract

Referring Expression Generation (REG) is the task of generating a description that unambiguously identifies a given target in a scene. Unlike Image Captioning (IC), REG requires learning fine-grained characteristics of not only the scene objects but also their surrounding context. Referring expressions are usually not unique; an object can often be referenced unambiguously in numerous ways, for instance, by color, by location, or by relationship with other objects. Most prior works, however, have not explored this ‘aspect-based multiplicity’ of referring expressions. Hence, in this work, we focus on the Aspect-Controlled REG task, which requires generating a referring expression conditioned on the input aspect(s), where an aspect captures a style of reference. By changing the input aspect (such as color, location, or action), one can generate multiple distinct expressions per target region. To solve this new task, we first modify BLIP to align image regions with text expressions. We achieve this through a novel way of feeding the input: we draw a bounding box around the target image region and prompt the model to generate the referring expression. Our base REG model already outperforms all prior works in CIDEr score. To tackle Aspect-Controlled REG, we append ‘aspect tokens’ to the prompt and show that distinct expressions can be generated just by changing the prompt. Finally, to demonstrate the quality and diversity of the data generated by our proposed aspect-controlled REG model, we also perform a data-augmentation-based evaluation on the downstream Referring Expression Comprehension (REC) task. With just half of the real data, augmented with the generated synthetic data, we achieve performance comparable to training with 100% of the real data, using a state-of-the-art REC model.
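
The input construction described in the abstract (a bounding box drawn on the image, plus a prompt with appended aspect tokens) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the image is a plain nested list of RGB tuples, the token spelling `<color>` and the base prompt string are hypothetical, and the helper names `draw_bounding_box` and `build_prompt` are invented. The actual model input format may differ.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def draw_bounding_box(image: Image, box: Tuple[int, int, int, int],
                      color: Pixel = (255, 0, 0)) -> Image:
    """Return a copy of `image` with a 1-pixel box outline drawn on it.

    `box` is (left, top, right, bottom) in inclusive pixel coordinates.
    Hypothetical helper: stands in for marking the target region.
    """
    left, top, right, bottom = box
    out = [row[:] for row in image]  # copy rows so the input stays unchanged
    for x in range(left, right + 1):  # top and bottom edges
        out[top][x] = color
        out[bottom][x] = color
    for y in range(top, bottom + 1):  # left and right edges
        out[y][left] = color
        out[y][right] = color
    return out


def build_prompt(aspects: Sequence[str],
                 base: str = "describe the object in the box") -> str:
    """Append aspect tokens (assumed spelling: `<aspect>`) to a base prompt."""
    tokens = " ".join(f"<{a}>" for a in aspects)
    return f"{base} {tokens}".strip()


# One marked image, several prompts: changing only the aspect token is what
# would yield different referring expressions for the same target region.
blank = [[(0, 0, 0)] * 6 for _ in range(5)]
marked = draw_bounding_box(blank, (1, 1, 4, 3))
prompts = [build_prompt([a]) for a in ("color", "location")]
```

In this sketch the same marked image would be paired with each prompt in `prompts` and passed to a vision-language model; the model call itself is omitted because its interface is not described in the abstract.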

Anthology ID: 2024.findings-naacl.178
Volume: Findings of the Association for Computational Linguistics: NAACL 2024
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 2793–2807
URL: https://aclanthology.org/2024.findings-naacl.178/
DOI: 10.18653/v1/2024.findings-naacl.178
Cite (ACL): Danfeng Guo, Sanchit Agarwal, Arpit Gupta, Jiun-Yu Kao, Emre Barut, Tagyoung Chung, Jing Huang, and Mohit Bansal. 2024. Prompting Vision-Language Models For Aspect-Controlled Generation of Referring Expressions. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2793–2807, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Prompting Vision-Language Models For Aspect-Controlled Generation of Referring Expressions (Guo et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-naacl.178.pdf
Video: https://aclanthology.org/2024.findings-naacl.178.mp4
