Marcus Rohrbach

2025

Predicting Implicit Arguments in Procedural Video Instructions
Anil Batra | Laura Sevilla-Lara | Marcus Rohrbach | Frank Keller
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Procedural texts help AI enhance reasoning about context and action sequences. Transforming these into Semantic Role Labeling (SRL) improves understanding of individual steps by identifying predicate-argument structure like verb,what,where/with. Procedural instructions are highly elliptic, for instance, (i) add cucumber to the bowl and (ii) add sliced tomatoes, the second step’s where argument is inferred from the context, referring to where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models’ contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent multimodal LLMs and reveal that they struggle to predict implicit arguments of what and where/with from multi-modal procedural data given the verb. Lastly, we propose iSRL-Qwen2-VL, which achieves a 17% relative improvement in F1-score for what-implicit and a 14.7% for where/with-implicit semantic roles over GPT-4o.

2024

pdf bib abs

InFact: A Strong Baseline for Automated Fact-Checking
Mark Rothermel | Tobias Braun | Marcus Rohrbach | Anna Rohrbach
Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)

The spread of disinformation poses a global threat to democratic societies, necessitating robust and scalable Automated Fact-Checking (AFC) systems. The AVeriTeC Shared Task Challenge 2024 offers a realistic benchmark for text-based fact-checking methods. This paper presents Information-Retrieving Fact-Checker (InFact), an LLM-based approach that breaks down the task of claim verification into a 6-stage process, including evidence retrieval. When using GPT-4o as the backbone, InFact achieves an AVeriTeC score of 63% on the test set, outperforming all other 20 teams competing in the challenge, and establishing a new strong baseline for future text-only AFC systems. Qualitative analysis of mislabeled instances reveals that InFact often yields a more accurate conclusion than AVeriTeC’s human-annotated ground truth.

2021

pdf bib

2019

pdf bib abs

CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
Satwik Kottur | José M. F. Moura | Devi Parikh | Dhruv Batra | Marcus Rohrbach
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image (using the conversation history as context). It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible as it requires prohibitively-expensive complete annotation of the ‘state’ of all images and dialogs. We develop CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling to 4.25M question-answer pairs. We use CLEVR-Dialog to benchmark performance of standard visual dialog models; in particular, on visual coreference resolution (as a function of the coreference distance). This is the first analysis of its kind for visual dialog models that was not possible without this dataset. We hope the findings from CLEVR-Dialog will help inform the development of future models for visual dialog. Our code and dataset are publicly available.

pdf bib abs

In this work, we propose a goal-driven collaborative task that combines language, perception, and action. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate with each other using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of ~138K messages exchanged between human players. We define protocols and metrics to evaluate learned agents in this testbed, highlighting the need for a novel “crosstalk” evaluation condition which pairs agents trained independently on disjoint subsets of the training data. We present models for our task and benchmark them using both fully automated evaluation and by having them play the game live with humans.

2018

pdf bib abs

A Dataset for Telling the Stories of Social Media Videos
Spandana Gella | Mike Lewis | Marcus Rohrbach
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Video content on social media platforms constitutes a major part of the communication between people, as it allows everyone to share their stories. However, if someone is unable to consume video, either due to a disability or network bandwidth, this severely limits their participation and communication. Automatically telling the stories using multi-sentence descriptions of videos would allow bridging this gap. To learn and evaluate such models, we introduce VideoStory a new large-scale dataset for video description as a new challenge for multi-sentence video description. Our VideoStory captions dataset is complementary to prior work and contains 20k videos posted publicly on a social media platform amounting to 396 hours of video with 123k sentences, temporally aligned to the video.

Recent work has shown that the integration of visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used. In this paper, we consider the problem of grounding sentences describing actions in visual information extracted from videos. We present a general purpose corpus that aligns high quality videos with multiple natural language descriptions of the actions portrayed in the videos, together with an annotation of how similar the action descriptions are to each other. Experimental results demonstrate that a text-based model of similarity between actions improves substantially when combined with visual information from videos depicting the described actions.

Co-authors

Venues

tacl1

ws1

Fix author

Marcus Rohrbach

2025

2024

2021

2019

2018

2016

2015

2013

Co-authors

Venues