Resilience through Scene Context in Visual Referring Expression Generation

Simeon Junker, Sina Zarrieß


Abstract
Scene context is well known to facilitate humans’ perception of visible objects. In this paper, we investigate the role of context in Referring Expression Generation (REG) for objects in images, where existing research has often focused on distractor contexts that exert pressure on the generator. We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and object types in particular. We train and test Transformer-based REG models with target representations that have been artificially obscured with noise to varying degrees. We evaluate how properties of the models’ visual context affect their processing and performance. Our results show that even simple scene contexts make models surprisingly resilient to perturbations, to the extent that they can identify referent types even when visual information about the target is completely missing.
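To make the perturbation setup described in the abstract concrete, the sketch below blends a target region's visual feature vector with Gaussian noise at a configurable level, so that higher levels progressively obscure the target up to complete removal. The function name, the linear interpolation scheme, and the choice of noise distribution are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def obscure_target_features(target_feats: np.ndarray,
                            noise_level: float,
                            rng: np.random.Generator = None) -> np.ndarray:
    """Blend a target's visual feature vector with Gaussian noise.

    noise_level=0.0 leaves the features intact; noise_level=1.0 replaces
    them entirely with noise, simulating a target whose visual
    information is completely missing.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    rng = rng or np.random.default_rng()
    # Noise is matched to the feature statistics so the perturbed vector
    # stays in a plausible range (an assumption for this sketch).
    noise = rng.normal(loc=target_feats.mean(),
                       scale=target_feats.std() + 1e-8,
                       size=target_feats.shape)
    return (1.0 - noise_level) * target_feats + noise_level * noise

# Example: fully obscure a 2048-dim region feature vector.
feats = np.random.default_rng(0).normal(size=2048)
fully_obscured = obscure_target_features(feats, noise_level=1.0)
```

Under this setup, a REG model receiving `fully_obscured` can only rely on the surrounding scene context to infer the referent's type, which is the condition the paper probes.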
Anthology ID: 2024.inlg-main.29
Volume: Proceedings of the 17th International Natural Language Generation Conference
Month: September
Year: 2024
Address: Tokyo, Japan
Editors: Saad Mahamood, Nguyen Le Minh, Daphne Ippolito
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 344–357
URL: https://aclanthology.org/2024.inlg-main.29
Cite (ACL): Simeon Junker and Sina Zarrieß. 2024. Resilience through Scene Context in Visual Referring Expression Generation. In Proceedings of the 17th International Natural Language Generation Conference, pages 344–357, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal): Resilience through Scene Context in Visual Referring Expression Generation (Junker & Zarrieß, INLG 2024)
PDF: https://aclanthology.org/2024.inlg-main.29.pdf