Visual-Textual Alignment for Graph Inference in Visual Dialog

Tianling Jiang, Yi Ji, Chunping Liu, Hailin Shao


Abstract
As a conversational intelligence task, visual dialog entails answering a series of questions grounded in an image, using the dialog history as context. To generate correct answers, it is critical to comprehend the semantic dependencies among implicit visual and textual contents. Prior work has usually ignored these underlying relations and failed to infer them reasonably. In this paper, we propose a Visual-Textual Alignment for Graph Inference (VTAGI) network. Compared with other approaches, it makes up for the lack of structural inference in visual dialog. The whole system consists of two modules, Visual and Textual Alignment (VTA) and Visual Graph Attended by Text (VGAT). Specifically, the VTA module aims at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. The VGAT module views the visual features with semantic information as observed nodes, and each node learns its relationship with the others in the visual graph. We also qualitatively and quantitatively evaluate the model on the VisDial v1.0 dataset, showing that our VTAGI outperforms previous state-of-the-art models.
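The two-module design described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not specify how alignment or graph attention is computed, so the concatenation-based alignment, the multiplicative text gate, the scaled dot-product scoring, and all names (`align_regions_with_concepts`, `text_attended_graph_step`, `project`, the random weights and dimensions) are assumptions chosen only to show the general idea of text-conditioned attention over a graph of region nodes.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def project(rows, weight):
    """Multiply each row vector by a weight matrix of shape (in_dim, out_dim)."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns] for row in rows]


def align_regions_with_concepts(regions, concepts):
    """Assumed VTA-like step: pair each region feature with its concept embedding."""
    return [region + concept for region, concept in zip(regions, concepts)]


def text_attended_graph_step(nodes, text, w_node, w_text):
    """Assumed VGAT-like step: one round of text-gated attention between all nodes."""
    hidden = project(nodes, w_node)        # (n, k) node features
    gate = project([text], w_text)[0]      # (k,) text-derived gate
    k = len(gate)
    adjacency = []
    for h_i in hidden:
        keyed = [a * b for a, b in zip(h_i, gate)]  # text-conditioned query
        scores = [sum(x * y for x, y in zip(keyed, h_j)) / math.sqrt(k) for h_j in hidden]
        adjacency.append(softmax(scores))  # soft edges from node i to all nodes
    updated = [
        [sum(a * h_j[d] for a, h_j in zip(row, hidden)) for d in range(k)]
        for row in adjacency
    ]
    return adjacency, updated


def rand_matrix(rows, cols):
    """Random stand-in for learned weights or extracted features (illustrative only)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
regions = rand_matrix(5, 8)    # 5 hypothetical region features
concepts = rand_matrix(5, 4)   # 5 hypothetical concept embeddings
text = rand_matrix(1, 6)[0]    # hypothetical question/history encoding
nodes = align_regions_with_concepts(regions, concepts)
adjacency, updated = text_attended_graph_step(
    nodes, text, rand_matrix(12, 16), rand_matrix(6, 16)
)
```

In this sketch each row of `adjacency` is a probability distribution over the other nodes, which is one simple way to let "each node learn the relationship with others" under textual guidance; the paper itself should be consulted for the actual formulation.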
Anthology ID:
2020.coling-main.170
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
1874–1885
URL:
https://aclanthology.org/2020.coling-main.170
DOI:
10.18653/v1/2020.coling-main.170
Cite (ACL):
Tianling Jiang, Yi Ji, Chunping Liu, and Hailin Shao. 2020. Visual-Textual Alignment for Graph Inference in Visual Dialog. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1874–1885, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
Visual-Textual Alignment for Graph Inference in Visual Dialog (Jiang et al., COLING 2020)
PDF:
https://aclanthology.org/2020.coling-main.170.pdf
Data
VisDial