Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

Yifu Qiu; Yftah Ziser; Anna Korhonen; Shay B. Cohen; Edoardo Maria Ponti

Abstract

Can unified vision–language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), which effectively amounts to captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. First, IDP can annotate actions for unlabelled pairs of video frame observations, expanding the training data available for FDP. Second, IDP can assign rewards to multiple FDP samples to score them, effectively guiding search at inference time. We evaluate the FDP models resulting from both strategies on the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves performance competitive with state-of-the-art image editing models, improving on them by a margin of 15% on real-world subsets according to GPT-4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
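The two bootstrapping strategies described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: states are stand-in integers rather than images, actions are strings such as "+5", and every name here (annotate_pairs, best_of_n, toy_idp, toy_idp_score, make_toy_fdp) is hypothetical and introduced only for illustration.

```python
import itertools
from typing import Callable, List, Tuple

# Toy stand-ins (assumptions): a "state" is an int, an "action" is a signed
# integer string like "+5". Real systems would use images and text instead.


def annotate_pairs(
    pairs: List[Tuple[int, int]],
    idp: Callable[[int, int], str],
) -> List[Tuple[int, str, int]]:
    """Strategy 1: label unlabelled (before, after) pairs with IDP-predicted
    actions, producing (before, action, after) triples to train FDP on."""
    return [(before, idp(before, after), after) for before, after in pairs]


def best_of_n(
    before: int,
    action: str,
    fdp_sample: Callable[[int, str], int],
    idp_score: Callable[[int, int, str], float],
    n: int = 4,
) -> int:
    """Strategy 2: draw n candidate future states from FDP and keep the one
    that the IDP-based scorer rates as most consistent with the action."""
    candidates = [fdp_sample(before, action) for _ in range(n)]
    return max(candidates, key=lambda after: idp_score(before, after, action))


def toy_idp(before: int, after: int) -> str:
    """Hypothetical inverse dynamics: describe the change as a signed delta."""
    return f"{after - before:+d}"


def toy_idp_score(before: int, after: int, action: str) -> float:
    """Hypothetical reward: higher when the observed delta matches the action."""
    return -abs((after - before) - int(action))


def make_toy_fdp(offsets: List[int]) -> Callable[[int, str], int]:
    """Hypothetical noisy forward model: applies the action plus a cycling error."""
    noise = itertools.cycle(offsets)

    def sample(before: int, action: str) -> int:
        return before + int(action) + next(noise)

    return sample


if __name__ == "__main__":
    print(annotate_pairs([(1, 3), (5, 4)], toy_idp))
    fdp = make_toy_fdp([-2, 1, 0, 3])
    print(best_of_n(10, "+5", fdp, toy_idp_score, n=4))
```

In this toy setting the verifier picks the candidate whose change matches the instruction exactly. With real models, the scorer would more plausibly be a likelihood that the IDP model assigns to the instruction given the two frames, but that detail is an assumption and is not stated in the abstract.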

Anthology ID: 2026.findings-acl.1772
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 35579–35600
URL: https://aclanthology.org/2026.findings-acl.1772/
Cite (ACL): Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B Cohen, and Edoardo Ponti. 2026. Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics. In Findings of the Association for Computational Linguistics: ACL 2026, pages 35579–35600, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics (Qiu et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1772.pdf
Checklist: 2026.findings-acl.1772.checklist.pdf
