Multi-grained Attention with Object-level Grounding for Visual Question Answering

Pingping Huang, Jianhui Huang, Yuqing Guo, Min Qiao, Yong Zhu


Abstract
Attention mechanisms are widely used in Visual Question Answering (VQA) to search for visual clues related to the question. Most approaches train attention models from a coarse-grained association between sentences and images, which tends to fail on small objects or uncommon concepts. To address this problem, this paper proposes a multi-grained attention method. It learns explicit word-object correspondence via two types of word-level attention that complement the sentence-image association. Evaluated on the VQA benchmark, the multi-grained attention model achieves competitive performance with state-of-the-art models. The visualized attention maps further demonstrate that adding object-level groundings leads to a better understanding of the images and locates the attended objects more precisely.
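
The abstract describes combining a coarse-grained sentence-image attention with fine-grained word-level attention over object features. The sketch below (PyTorch) illustrates one way such a multi-grained attention could be wired up; all module and variable names are hypothetical illustrations for the general idea, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGrainedAttention(nn.Module):
    """Hypothetical sketch: fuse sentence-level and word-level attention over objects."""

    def __init__(self, word_dim, sent_dim, obj_dim, hidden_dim):
        super().__init__()
        # Coarse-grained branch: the sentence embedding attends over object/region features.
        self.sent_proj = nn.Linear(sent_dim, hidden_dim)
        self.obj_proj_s = nn.Linear(obj_dim, hidden_dim)
        # Fine-grained branch: each question word attends over object features.
        self.word_proj = nn.Linear(word_dim, hidden_dim)
        self.obj_proj_w = nn.Linear(obj_dim, hidden_dim)

    def forward(self, word_emb, sent_emb, obj_feats):
        # word_emb:  (batch, n_words, word_dim)
        # sent_emb:  (batch, sent_dim)
        # obj_feats: (batch, n_objects, obj_dim)

        # Sentence-level attention scores over objects.
        s = torch.bmm(self.obj_proj_s(obj_feats),
                      self.sent_proj(sent_emb).unsqueeze(2)).squeeze(2)  # (batch, n_objects)
        sent_att = F.softmax(s, dim=1)

        # Word-level attention: every word scores every object, then keep the
        # strongest word-object link for each object.
        w = torch.bmm(self.word_proj(word_emb),
                      self.obj_proj_w(obj_feats).transpose(1, 2))        # (batch, n_words, n_objects)
        word_att = F.softmax(w, dim=2).max(dim=1).values                 # (batch, n_objects)

        # Fuse the two granularities into a single attention distribution
        # and pool the object features with it.
        fused = F.softmax(sent_att + word_att, dim=1)                    # (batch, n_objects)
        attended = torch.bmm(fused.unsqueeze(1), obj_feats).squeeze(1)   # (batch, obj_dim)
        return attended, fused
```

The fused distribution lets small or uncommon objects that match a specific question word receive attention even when the sentence-level association alone would miss them, which is the intuition stated in the abstract.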
Anthology ID:
P19-1349
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
3595–3600
URL:
https://aclanthology.org/P19-1349/
DOI:
10.18653/v1/P19-1349
Bibkey:
Cite (ACL):
Pingping Huang, Jianhui Huang, Yuqing Guo, Min Qiao, and Yong Zhu. 2019. Multi-grained Attention with Object-level Grounding for Visual Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3595–3600, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Multi-grained Attention with Object-level Grounding for Visual Question Answering (Huang et al., ACL 2019)
PDF:
https://aclanthology.org/P19-1349.pdf
Video:
https://aclanthology.org/P19-1349.mp4
Data
Visual Question Answering, Visual Question Answering v2.0