ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue

Haoqin Tu, Yitong Li, Fei Mi, Zhongliang Yang


Abstract
Incorporating visual knowledge into text-only dialogue systems has become a potential direction to imitate the way humans think, imagine, and communicate. However, existing multimodal dialogue systems are either confined by the scale and quality of available datasets or by a coarse notion of visual knowledge. To address these issues, we provide a new paradigm for constructing multimodal dialogues, along with two datasets extended from text-only dialogues under this paradigm (ReSee-WoW, ReSee-DD). We propose to explicitly split visual knowledge into finer granularity (“turn-level” and “entity-level”). To further boost the accuracy and diversity of the augmented visual information, we retrieve it from the Internet or a large image dataset. To demonstrate the superiority and universality of the provided visual knowledge, we propose a simple yet effective framework, ReSee, that adds visual representations into vanilla dialogue models via modality concatenation. We also conduct extensive experiments and ablations across different model configurations and visual knowledge settings. Encouraging empirical results not only demonstrate the effectiveness of introducing visual knowledge at both the entity and turn level, but also verify that the proposed ReSee model outperforms several state-of-the-art methods on automatic and human evaluations. By leveraging text and vision knowledge, ReSee can produce informative responses grounded in real-world visual concepts. Our code is available at https://github.com/ImKeTT/ReSee.
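
The fusion step the abstract describes, prepending projected visual features to a dialogue model's token embeddings (“modality concatenation”), can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the module and argument names (vis_proj, turn_feats, entity_feats), the assumption of CLIP-style fixed-size feature vectors, and the HuggingFace-style text model interface are all ours.

    import torch
    import torch.nn as nn

    class VisualConcatDialogue(nn.Module):
        # Sketch of modality concatenation (hypothetical, not the paper's API):
        # project retrieved visual features into the text embedding space and
        # prepend them to the token embeddings before running the dialogue model.
        def __init__(self, text_model, vis_dim=512):
            super().__init__()
            self.text_model = text_model  # any HuggingFace-style model accepting inputs_embeds
            hidden = text_model.get_input_embeddings().embedding_dim
            self.vis_proj = nn.Linear(vis_dim, hidden)  # visual -> text embedding space

        def forward(self, input_ids, turn_feats, entity_feats, **kwargs):
            # turn_feats:   (batch, n_turn_images, vis_dim)  turn-level visual knowledge
            # entity_feats: (batch, n_entities,    vis_dim)  entity-level visual knowledge
            tok_emb = self.text_model.get_input_embeddings()(input_ids)
            vis_emb = self.vis_proj(torch.cat([turn_feats, entity_feats], dim=1))
            fused = torch.cat([vis_emb, tok_emb], dim=1)  # visual "tokens" first
            # note: the attention mask must be extended to cover the visual positions
            return self.text_model(inputs_embeds=fused, **kwargs)

Under these assumptions, the text model treats the projected visual features as extra leading positions in the input sequence, which matches the abstract's claim that visual knowledge can be added to a vanilla dialogue model without architectural surgery.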
Anthology ID:
2023.emnlp-main.479
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7720–7735
URL:
https://aclanthology.org/2023.emnlp-main.479
DOI:
10.18653/v1/2023.emnlp-main.479
Cite (ACL):
Haoqin Tu, Yitong Li, Fei Mi, and Zhongliang Yang. 2023. ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7720–7735, Singapore. Association for Computational Linguistics.
Cite (Informal):
ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue (Tu et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.479.pdf
Video:
https://aclanthology.org/2023.emnlp-main.479.mp4