Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge

Violetta Shevchenko, Damien Teney, Anthony Dick, Anton van den Hengel


Abstract
The limits of applicability of vision-and-language models are defined by the coverage of their training data. Tasks like visual question answering (VQA) often require commonsense and factual information beyond what can be learned from task-specific datasets. This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers. We use an auxiliary training objective that encourages the learned representations to align with graph embeddings of matching entities in a KB. We empirically study the relevance of various KBs to multiple tasks and benchmarks. The technique brings clear benefits to knowledge-demanding question answering tasks (OK-VQA, FVQA) by capturing semantic and relational knowledge absent from existing models. More surprisingly, the technique also benefits visual reasoning tasks (NLVR2, SNLI-VE). We perform probing experiments and show that the injection of additional knowledge regularizes the space of embeddings, which improves the representation of lexical and semantic similarities. The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
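The auxiliary objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the distance function, the weighting, or the matching procedure, so the cosine distance, the `weight` parameter, and all function names below are assumptions chosen for illustration. The sketch also assumes the model representations have already been projected to the same dimension as the KB graph embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(token_reprs, kb_embeddings, matches):
    """Mean cosine distance between matched representations and KB embeddings.

    token_reprs:   list of vectors, one per input token (hypothetical layout).
    kb_embeddings: dict mapping an entity id to its graph embedding.
    matches:       list of (token_index, entity_id) pairs; how tokens are
                   matched to KB entities is outside this sketch.
    """
    if not matches:
        return 0.0
    total = 0.0
    for token_index, entity_id in matches:
        similarity = cosine_similarity(token_reprs[token_index], kb_embeddings[entity_id])
        total += 1.0 - similarity
    return total / len(matches)


def total_loss(task_loss, token_reprs, kb_embeddings, matches, weight=0.1):
    """Task loss plus the weighted auxiliary alignment term (weight is an assumption)."""
    return task_loss + weight * alignment_loss(token_reprs, kb_embeddings, matches)
```

Because the alignment term is only added to the task loss, a sketch like this leaves the underlying model architecture untouched, which is consistent with the abstract's description of the technique as model-agnostic.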
Anthology ID:
2021.lantern-1.1
Volume:
Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
Month:
April
Year:
2021
Address:
Kyiv, Ukraine
Editors:
Marius Mosbach, Michael A. Hedderich, Sandro Pezzelle, Aditya Mogadala, Dietrich Klakow, Marie-Francine Moens, Zeynep Akata
Venue:
LANTERN
Publisher:
Association for Computational Linguistics
Pages:
1–18
URL:
https://aclanthology.org/2021.lantern-1.1
Cite (ACL):
Violetta Shevchenko, Damien Teney, Anthony Dick, and Anton van den Hengel. 2021. Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge. In Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN), pages 1–18, Kyiv, Ukraine. Association for Computational Linguistics.
Cite (Informal):
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge (Shevchenko et al., LANTERN 2021)
PDF:
https://aclanthology.org/2021.lantern-1.1.pdf
Data:
COCO Captions, ConceptNet, GQA, MS COCO, OK-VQA, SNLI-VE, Visual Genome, Visual Question Answering v2.0