Going beyond Imagination! Enhancing Multi-modal Dialogue Agents with Synthetic Visual Descriptions

Haolan Zhan, Sameen Maruf, Ingrid Zukerman, Gholamreza Haffari


Abstract
Building a dialogue agent that can seamlessly interact with humans in multi-modal regimes requires two fundamental abilities: (1) understanding emotion and dialogue acts within situated user scenarios, and (2) grounding perceived visual cues to dialogue contexts. However, recent works have uncovered shortcomings of existing dialogue agents in understanding emotions and dialogue acts, and in grounding visual cues effectively. In this work, we investigate whether additional dialogue data containing only visual descriptions can help dialogue agents effectively align visual and textual features, and enhance their ability to ground perceived visual cues to dialogue contexts. To this end, in the absence of a suitable dataset, we propose a synthetic visual description generation pipeline, and contribute a large-scale synthetic visual description dataset. In addition, we propose a general training procedure for effectively leveraging these synthetic data. We conduct comprehensive analyses to evaluate the impact of synthetic data on two benchmarks: MELD and IEMOCAP. Our findings suggest that synthetic visual descriptions can serve as an effective means of enhancing a dialogue agent's grounding ability, and that the training scheme affects the extent to which these descriptions improve the agent's performance.
Anthology ID:
2024.sigdial-1.36
Volume:
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month:
September
Year:
2024
Address:
Kyoto, Japan
Editors:
Tatsuya Kawahara, Vera Demberg, Stefan Ultes, Koji Inoue, Shikib Mehri, David Howcroft, Kazunori Komatani
Venue:
SIGDIAL
SIG:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
420–427
URL:
https://aclanthology.org/2024.sigdial-1.36
Cite (ACL):
Haolan Zhan, Sameen Maruf, Ingrid Zukerman, and Gholamreza Haffari. 2024. Going beyond Imagination! Enhancing Multi-modal Dialogue Agents with Synthetic Visual Descriptions. In Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 420–427, Kyoto, Japan. Association for Computational Linguistics.
Cite (Informal):
Going beyond Imagination! Enhancing Multi-modal Dialogue Agents with Synthetic Visual Descriptions (Zhan et al., SIGDIAL 2024)
PDF:
https://aclanthology.org/2024.sigdial-1.36.pdf