Mind the Context: The Impact of Contextualization in Neural Module Networks for Grounding Visual Referring Expressions

Arjun Akula, Spandana Gella, Keze Wang, Song-Chun Zhu, Siva Reddy


Abstract
Neural module networks (NMNs) are a popular approach for grounding visual referring expressions. Prior implementations of NMNs use pre-defined, fixed textual inputs to instantiate their modules. This necessitates a large number of modules, since modules cannot share weights or exploit associations between similar textual contexts (e.g. “dark cube on the left” vs. “black cube on the left”). In this work, we address these limitations and evaluate the impact of contextual cues on the performance of NMN models. First, we address the problem of fixed textual inputs by parameterizing the module arguments. This substantially reduces the number of modules in the NMN, by up to 75%, without any loss in performance. Next, we propose a method to contextualize our parameterized model, enhancing the modules’ capacity to exploit visiolinguistic associations. Our model outperforms the state-of-the-art NMN model on the CLEVR-Ref+ dataset, with a +8.1% improvement in accuracy on the single-referent test set and +4.3% on the full test set. Additionally, contextualization yields +11.2% and +1.7% accuracy improvements over prior NMN models on CLOSURE and NLVR2, respectively. We further evaluate the impact of our contextualization by constructing a contrast set for CLEVR-Ref+, which we call CC-Ref+. We outperform the baselines by as much as +10.4% absolute accuracy on CC-Ref+, illustrating the generalization ability of our approach.
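
To make the abstract's key mechanism concrete, the sketch below contrasts a fixed-argument module (one set of weights per textual argument) with a single parameterized module conditioned on an embedded argument, so that similar arguments ("dark"/"black") can share weights. This is a minimal illustrative sketch, not the authors' implementation: the class names, dimensions, and the multiplicative gating scheme are hypothetical assumptions, and the paper's additional contextualization step is omitted here.

# Minimal sketch (assumed details, not the paper's code): fixed vs.
# parameterized Find modules in an NMN.
import torch
import torch.nn as nn

class FixedFindModule(nn.Module):
    # Fixed-input style: one module, and hence one set of weights,
    # per textual argument (e.g. separate Find['dark'], Find['black']).
    def __init__(self, img_dim: int):
        super().__init__()
        self.score = nn.Conv2d(img_dim, 1, kernel_size=1)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, C, H, W) -> attention map (B, 1, H, W)
        return torch.sigmoid(self.score(img_feats))

class ParameterizedFindModule(nn.Module):
    # Parameterized style: a single shared module that receives the
    # textual argument as an embedding, so similar arguments share
    # weights and can exploit visiolinguistic associations.
    def __init__(self, img_dim: int, txt_dim: int):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, img_dim)
        self.score = nn.Conv2d(img_dim, 1, kernel_size=1)

    def forward(self, img_feats: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, C, H, W); txt_emb: (B, T) embedding of the argument
        gate = self.txt_proj(txt_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return torch.sigmoid(self.score(img_feats * gate))

# Usage: one module serves every argument, instead of one module per argument.
img = torch.randn(2, 64, 14, 14)          # image feature map
arg = torch.randn(2, 300)                  # e.g. word embedding of "black"
find = ParameterizedFindModule(img_dim=64, txt_dim=300)
attn = find(img, arg)                      # (2, 1, 14, 14) attention map
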
Anthology ID:
2021.emnlp-main.516
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
6398–6416
URL:
https://aclanthology.org/2021.emnlp-main.516
DOI:
10.18653/v1/2021.emnlp-main.516
Cite (ACL):
Arjun Akula, Spandana Gella, Keze Wang, Song-Chun Zhu, and Siva Reddy. 2021. Mind the Context: The Impact of Contextualization in Neural Module Networks for Grounding Visual Referring Expressions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6398–6416, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Mind the Context: The Impact of Contextualization in Neural Module Networks for Grounding Visual Referring Expressions (Akula et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.516.pdf
Video:
https://aclanthology.org/2021.emnlp-main.516.mp4
Data
CLEVR, CLEVR-Ref+