Resolving References in Visually-Grounded Dialogue via Text Generation

Bram Willemsen; Livia Qian; Gabriel Skantze

doi:10.18653/v1/2023.sigdial-1.43

Abstract

Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge. Consequently, if we want to use VLMs for reference resolution in visually-grounded dialogue, the discourse processing capabilities of these models need to be augmented. To address this issue, we propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references. We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot. We evaluate our approach on a manually annotated dataset of visually-grounded dialogues and achieve results that, on average, exceed the performance of the baselines we compare against. Furthermore, we find that using referent descriptions based on larger context windows has the potential to yield higher returns.
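
The two-stage idea in the abstract (generate a definite description from the dialogue context, then pick the referent that best matches that description) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: `generate_description` is a hypothetical keyword-collecting stand-in for the fine-tuned LLM, `cosine` over bag-of-words vectors is a stand-in for the pretrained VLM's text-image similarity, and the candidate "images" are represented by invented short text captions.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, used only by the toy description generator.
STOPWORDS = {"a", "an", "the", "i", "is", "it", "on", "in",
             "that", "one", "yes", "see", "of"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def generate_description(context, mention):
    """Stand-in for the fine-tuned LLM (assumption): collect the content
    words of the dialogue context and the mention into one description."""
    seen = []
    for utterance in context + [mention]:
        for tok in tokenize(utterance):
            if tok not in STOPWORDS and tok not in seen:
                seen.append(tok)
    return "the " + " ".join(seen)


def cosine(a, b):
    """Stand-in for VLM similarity (assumption): bag-of-words cosine."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def resolve_reference(context, mention, candidates):
    """Return the best-scoring candidate id and the generated description."""
    description = generate_description(context, mention)
    scores = {cid: cosine(description, cap) for cid, cap in candidates.items()}
    return max(scores, key=scores.get), description


# Invented example dialogue and candidate set.
context = ["I see a dog wearing a red hat", "is it sitting on a couch?"]
mention = "yes, that one"
candidates = {
    "img_a": "dog red hat couch",
    "img_b": "cat on a table",
    "img_c": "dog in a park",
}
best_id, description = resolve_reference(context, mention, candidates)
```

The mention "that one" carries no descriptive content by itself, so the toy selector only succeeds because the description draws on earlier turns. This is the motivation the abstract gives for summarizing coreferential context before retrieval.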

Anthology ID: 2023.sigdial-1.43
Volume: Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month: September
Year: 2023
Address: Prague, Czechia
Editors: Svetlana Stoyanchev, Shafiq Joty, David Schlangen, Ondrej Dusek, Casey Kennington, Malihe Alikhani
Venue: SIGDIAL
SIG: SIGDIAL
Publisher: Association for Computational Linguistics
Pages: 457–469
URL: https://aclanthology.org/2023.sigdial-1.43
DOI: 10.18653/v1/2023.sigdial-1.43
Cite (ACL): Bram Willemsen, Livia Qian, and Gabriel Skantze. 2023. Resolving References in Visually-Grounded Dialogue via Text Generation. In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 457–469, Prague, Czechia. Association for Computational Linguistics.
Cite (Informal): Resolving References in Visually-Grounded Dialogue via Text Generation (Willemsen et al., SIGDIAL 2023)
PDF: https://aclanthology.org/2023.sigdial-1.43.pdf