@inproceedings{zhao-etal-2025-urbanvideo,
title = "{U}rban{V}ideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces",
author = "Zhao, Baining and
Fang, Jianjie and
Dai, Zichao and
Wang, Ziyou and
Zha, Jirong and
Zhang, Weichen and
Gao, Chen and
Wang, Yue and
Cui, Jinqiang and
Chen, Xinlei and
Li, Yong",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.1558/",
doi = "10.18653/v1/2025.acl-long.1558",
pages = "32400--32423",
ISBN = "979-8-89176-251-0",
abstract = "Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban aerial spaces remain to be explored. We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We have manually control drones to collect 3D embodied motion video data from real-world cities and simulated environments, resulting in 1.5k video clips. Then we design a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely-used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between different tasks, showing that causal reasoning has a strong correlation with recall, perception, and navigation, while the abilities for counterfactual and associative reasoning exhibit lower correlation with other tasks. We also validate the potential for Sim-to-Real transfer in urban embodiment through fine-tuning."
}
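
To consume the BibTeX record above programmatically rather than paste it into a .bib file, a parser makes the fields available as a plain dictionary. Below is a minimal sketch, assuming Python with the bibtexparser package (v1 API); the truncated record inside the string is a stand-in for the full entry above.

```python
# Minimal sketch: parse the BibTeX record above and read a few fields.
# Assumes the bibtexparser package, v1 API (pip install "bibtexparser<2").
import bibtexparser

# Stand-in for the full @inproceedings record above; in practice, paste
# the whole entry (or read your .bib file) here.
BIBTEX = r"""
@inproceedings{zhao-etal-2025-urbanvideo,
    year = "2025",
    pages = "32400--32423",
    doi = "10.18653/v1/2025.acl-long.1558"
}
"""

db = bibtexparser.loads(BIBTEX)   # parse the string into a BibDatabase
entry = db.entries[0]             # each entry is a dict with lowercase field keys
print(entry["ID"])                # citekey: zhao-etal-2025-urbanvideo
print(entry["ENTRYTYPE"])         # inproceedings
print(entry["year"], entry["pages"], entry["doi"])
```
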
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="zhao-etal-2025-urbanvideo">
<titleInfo>
<title>UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces</title>
</titleInfo>
<name type="personal">
<namePart type="given">Baining</namePart>
<namePart type="family">Zhao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jianjie</namePart>
<namePart type="family">Fang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zichao</namePart>
<namePart type="family">Dai</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ziyou</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jirong</namePart>
<namePart type="family">Zha</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Weichen</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chen</namePart>
<namePart type="family">Gao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yue</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jinqiang</namePart>
<namePart type="family">Cui</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xinlei</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yong</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-251-0</identifier>
</relatedItem>
    <abstract>Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban aerial spaces remain to be explored. We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We manually control drones to collect 3D embodied motion video data from real-world cities and simulated environments, resulting in 1.5k video clips. We then design a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely-used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between different tasks, showing that causal reasoning correlates strongly with recall, perception, and navigation, while counterfactual and associative reasoning correlate less strongly with the other tasks. We also validate the potential for Sim-to-Real transfer in urban embodiment through fine-tuning.</abstract>
<identifier type="citekey">zhao-etal-2025-urbanvideo</identifier>
<identifier type="doi">10.18653/v1/2025.acl-long.1558</identifier>
<location>
<url>https://aclanthology.org/2025.acl-long.1558/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>32400</start>
<end>32423</end>
</extent>
</part>
</mods>
</modsCollection>
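
The MODS record above is namespace-qualified, so element lookups need the http://www.loc.gov/mods/v3 namespace mapping. Here is a minimal sketch using only Python's standard library; the filename is a hypothetical stand-in for wherever the XML above is saved.

```python
# Minimal sketch: extract the title and author names from the MODS record
# above using the standard library's ElementTree.
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/mods/v3"}  # namespace declared on <modsCollection>

tree = ET.parse("zhao-etal-2025-urbanvideo.xml")  # hypothetical filename
mods = tree.getroot().find("m:mods", NS)

title = mods.find("m:titleInfo/m:title", NS).text
# Only direct children of <mods> are the paper's authors; the volume's
# editors sit deeper, under <relatedItem>, so findall() skips them here.
authors = [
    " ".join(part.text for part in name.findall("m:namePart", NS))
    for name in mods.findall("m:name", NS)
]
print(title)
print(", ".join(authors))  # "Baining Zhao, Jianjie Fang, ..."
```
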
%0 Conference Proceedings
%T UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
%A Zhao, Baining
%A Fang, Jianjie
%A Dai, Zichao
%A Wang, Ziyou
%A Zha, Jirong
%A Zhang, Weichen
%A Gao, Chen
%A Wang, Yue
%A Cui, Jinqiang
%A Chen, Xinlei
%A Li, Yong
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-251-0
%F zhao-etal-2025-urbanvideo
%X Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban aerial spaces remain to be explored. We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We manually control drones to collect 3D embodied motion video data from real-world cities and simulated environments, resulting in 1.5k video clips. We then design a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely-used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between different tasks, showing that causal reasoning correlates strongly with recall, perception, and navigation, while counterfactual and associative reasoning correlate less strongly with the other tasks. We also validate the potential for Sim-to-Real transfer in urban embodiment through fine-tuning.
%R 10.18653/v1/2025.acl-long.1558
%U https://aclanthology.org/2025.acl-long.1558/
%U https://doi.org/10.18653/v1/2025.acl-long.1558
%P 32400-32423
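
The %-tagged block above is the EndNote/refer export: each line is a two-character tag, a space, and a value, and tags such as %A, %Y, and %U repeat. A minimal parser sketch follows; the filename is hypothetical.

```python
# Minimal sketch: parse a %-tagged (EndNote/refer) record into a dict that
# maps each tag to the list of values it carries.
from collections import defaultdict

def parse_refer(text: str) -> dict[str, list[str]]:
    fields: dict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("%") and len(line) >= 4:
            tag, value = line[:2], line[3:]  # e.g. "%T", "UrbanVideo-Bench: ..."
            fields[tag].append(value)
    return dict(fields)

record = parse_refer(open("zhao-etal-2025-urbanvideo.txt").read())  # hypothetical file
print(record["%T"][0])    # the title
print(len(record["%A"]))  # 11 authors
print(record["%U"])       # both URLs (anthology page and DOI)
```
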
Markdown (Informal)

[UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces](https://aclanthology.org/2025.acl-long.1558/) (Zhao et al., ACL 2025)

ACL

Baining Zhao, Jianjie Fang, Zichao Dai, Ziyou Wang, Jirong Zha, Weichen Zhang, Chen Gao, Yue Wang, Jinqiang Cui, Xinlei Chen, and Yong Li. 2025. UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32400–32423, Vienna, Austria. Association for Computational Linguistics.