Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

Marius Mosbach, Michael A. Hedderich, Sandro Pezzelle, Aditya Mogadala, Dietrich Klakow, Marie-Francine Moens, Zeynep Akata (Editors)

Anthology ID:
Kyiv, Ukraine
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
Marius Mosbach | Michael A. Hedderich | Sandro Pezzelle | Aditya Mogadala | Dietrich Klakow | Marie-Francine Moens | Zeynep Akata

pdf bib
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
Violetta Shevchenko | Damien Teney | Anthony Dick | Anton van den Hengel

The limits of applicability of vision-and language models are defined by the coverage of their training data. Tasks like vision question answering (VQA) often require commonsense and factual information beyond what can be learned from task-specific datasets. This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers. We use an auxiliary training objective that encourages the learned representations to align with graph embeddings of matching entities in a KB. We empirically study the relevance of various KBs to multiple tasks and benchmarks. The technique brings clear benefits to knowledge-demanding question answering tasks (OK-VQA, FVQA) by capturing semantic and relational knowledge absent from existing models. More surprisingly, the technique also benefits visual reasoning tasks (NLVR2, SNLI-VE). We perform probing experiments and show that the injection of additional knowledge regularizes the space of embeddings, which improves the representation of lexical and semantic similarities. The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.

pdf bib
Visual Grounding Strategies for Text-Only Natural Language Processing
Damien Sileo

Visual grounding is a promising path toward more robust and accurate Natural Language Processing (NLP) models. Many multimodal extensions of BERT (e.g., VideoBERT, LXMERT, VL-BERT) allow a joint modeling of texts and images that lead to state-of-the-art results on multimodal tasks such as Visual Question Answering. Here, we leverage multimodal modeling for purely textual tasks (language modeling and classification) with the expectation that the multimodal pretraining provides a grounding that can improve text processing accuracy. We propose possible strategies in this respect. A first type of strategy, referred to as transferred grounding consists in applying multimodal models to text-only tasks using a placeholder to replace image input. The second one, which we call associative grounding, harnesses image retrieval to match texts with related images during both pretraining and text-only downstream tasks. We draw further distinctions into both strategies and then compare them according to their impact on language modeling and commonsense-related downstream tasks, showing improvement over text-only baselines.

pdf bib
Exploiting Image–Text Synergy for Contextual Image Captioning
Sreyasi Nag Chowdhury | Rajarshi Bhowmik | Hareesh Ravi | Gerard de Melo | Simon Razniewski | Gerhard Weikum

Modern web content - news articles, blog posts, educational resources, marketing brochures - is predominantly multimodal. A notable trait is the inclusion of media such as images placed at meaningful locations within a textual narrative. Most often, such images are accompanied by captions - either factual or stylistic (humorous, metaphorical, etc.) - making the narrative more engaging to the reader. While standalone image captioning has been extensively studied, captioning an image based on external knowledge such as its surrounding text remains under-explored. In this paper, we study this new task: given an image and an associated unstructured knowledge snippet, the goal is to generate a contextual caption for the image.

pdf bib
Large-Scale Zero-Shot Image Classification from Rich and Diverse Textual Descriptions
Sebastian Bujwid | Josephine Sullivan

We study the impact of using rich and diverse textual descriptions of classes for zero-shot learning (ZSL) on ImageNet. We create a new dataset ImageNet-Wiki that matches each ImageNet class to its corresponding Wikipedia article. We show that merely employing these Wikipedia articles as class descriptions yields much higher ZSL performance than prior works. Even a simple model using this type of auxiliary data outperforms state-of-the-art models that rely on standard features of word embedding encodings of class names. These results highlight the usefulness and importance of textual descriptions for ZSL, as well as the relative importance of auxiliary data type compared to the algorithmic progress. Our experimental results also show that standard zero-shot learning approaches generalize poorly across categories of classes.

pdf bib
What Did This Castle Look like before? Exploring Referential Relations in Naturally Occurring Multimodal Texts
Ronja Utescher | Sina Zarrieß

Multi-modal texts are abundant and diverse in structure, yet Language & Vision research of these naturally occurring texts has mostly focused on genres that are comparatively light on text, like tweets. In this paper, we discuss the challenges and potential benefits of a L&V framework that explicitly models referential relations, taking Wikipedia articles about buildings as an example. We briefly survey existing related tasks in L&V and propose multi-modal information extraction as a general direction for future research.