Cascaded Mutual Modulation for Visual Reasoning

Yiqun Yao, Jiaming Xu, Feng Wang, Bo Xu


Abstract
Visual reasoning is a special visual question answering problem that is multi-step and compositional by nature and also requires intensive text-vision interactions. We propose CMM: Cascaded Mutual Modulation as a novel end-to-end visual reasoning model. CMM includes a multi-step comprehension process for both the question and the image. In each step, we use a Feature-wise Linear Modulation (FiLM) technique to enable the textual and visual pipelines to mutually control each other. Experiments show that CMM significantly outperforms most related models and reaches state-of-the-art performance on two visual reasoning benchmarks, CLEVR and NLVR, collected from synthetic and natural languages respectively. Ablation studies confirm the effectiveness of CMM in comprehending natural language logics under the guidance of images. Our code is available at https://github.com/FlamingHorizon/CMM-VR.
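The FiLM operation the abstract refers to can be sketched as follows. This is a minimal NumPy illustration of feature-wise linear modulation, not the authors' implementation: the linear maps (`W_gamma`, `W_beta`) from a conditioning embedding to per-channel scale and shift parameters are hypothetical stand-ins for whatever conditioning network the model actually uses.

```python
import numpy as np

def film_modulate(features, gamma, beta):
    """Feature-wise Linear Modulation (FiLM).

    Scales and shifts each channel of a feature map with parameters
    produced by another modality (e.g. a question embedding).

    features: array of shape (C, H, W) -- visual feature maps
    gamma, beta: arrays of shape (C,)  -- per-channel scale and shift
    returns: modulated features of shape (C, H, W)
    """
    return gamma[:, None, None] * features + beta[:, None, None]

# Hypothetical conditioning step: project a question embedding to (gamma, beta).
rng = np.random.default_rng(0)
C, H, W, D = 4, 3, 3, 8
features = rng.standard_normal((C, H, W))
q_embed = rng.standard_normal(D)           # stand-in question representation
W_gamma = rng.standard_normal((C, D))      # hypothetical learned projections
W_beta = rng.standard_normal((C, D))
gamma, beta = W_gamma @ q_embed, W_beta @ q_embed

out = film_modulate(features, gamma, beta)
assert out.shape == (C, H, W)
```

In CMM this modulation is applied in both directions across cascaded steps, so the textual and visual pipelines each condition the other's features rather than conditioning flowing one way only.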
Anthology ID:
D18-1118
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
975–980
URL:
https://aclanthology.org/D18-1118
DOI:
10.18653/v1/D18-1118
Cite (ACL):
Yiqun Yao, Jiaming Xu, Feng Wang, and Bo Xu. 2018. Cascaded Mutual Modulation for Visual Reasoning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 975–980, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Cascaded Mutual Modulation for Visual Reasoning (Yao et al., EMNLP 2018)
PDF:
https://aclanthology.org/D18-1118.pdf
Attachment:
 D18-1118.Attachment.zip
Video:
 https://aclanthology.org/D18-1118.mp4
Code
 FlamingHorizon/CMM-VR
Data
CLEVR, NLVR, Visual Question Answering