CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space

Yong Zhao; Kai Xu; Zhengqiu Zhu; Yue Hu (胡月); Zhiheng Zheng; Yingfeng Chen; Yatai Ji; Chen Gao; Yong Li; Jincai Huang

doi:10.18653/v1/2025.emnlp-main.630

Abstract

Embodied Question Answering (EQA) has primarily focused on indoor environments, leaving the complexities of urban settings, which span environment, action, and perception, largely unexplored. To bridge this gap, we introduce CityEQA, a new task in which an embodied agent answers open-vocabulary questions through active exploration in dynamic city spaces. To support this task, we present CityEQA-EC, the first benchmark dataset featuring 1,412 human-annotated tasks across six categories, grounded in a realistic 3D urban simulator. Moreover, we propose Planner-Manager-Actor (PMA), a novel agent tailored for CityEQA. PMA enables long-horizon planning and hierarchical task execution: the Planner breaks question answering down into sub-tasks, the Manager maintains an object-centric cognitive map for spatial reasoning during process control, and specialized Actors handle navigation, exploration, and collection sub-tasks. Experiments demonstrate that PMA achieves 60.7% of human-level answering accuracy, significantly outperforming frontier-based baselines. While promising, the remaining gap to human performance highlights the need for enhanced visual reasoning in CityEQA. This work paves the way for future advances in urban spatial intelligence. Dataset and code are available at https://github.com/tsinghua-fib-lab/CityEQA.git.
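The abstract describes PMA only at the level of roles. The toy Python sketch below illustrates one way the three roles could fit together: a Planner emits a navigate, explore, collect plan; a Manager dispatches each sub-task to an Actor while keeping an object-centric cognitive map; and the Actors move, observe, and read off an answer. Everything in it is a hypothetical assumption, not taken from the CityEQA code: the fixed plan, the 2-D coordinates, the `World` simulator, the view radius, the probe offsets, and all function names. In the actual agent these roles are LLM-driven rather than hard-coded rules.

```python
import math
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple

Position = Tuple[float, float]


class Kind(Enum):
    NAVIGATE = "navigation"
    EXPLORE = "exploration"
    COLLECT = "collection"


@dataclass
class SubTask:
    kind: Kind
    target: str  # landmark or object label the sub-task refers to


@dataclass
class CognitiveMap:
    # Object-centric map kept by the Manager: label -> 2-D position (hypothetical).
    objects: Dict[str, Position] = field(default_factory=dict)


@dataclass
class World:
    # Hypothetical toy simulator standing in for the 3D urban environment.
    positions: Dict[str, Position]
    attributes: Dict[str, str]
    view_radius: float = 50.0


# Hypothetical exploration pattern: probe the current spot, then four offsets.
PROBE_OFFSETS = [(0.0, 0.0), (60.0, 0.0), (0.0, 60.0), (-60.0, 0.0), (0.0, -60.0)]


def planner(landmark: str, target: str) -> List[SubTask]:
    """Stand-in for the LLM Planner: a fixed navigate -> explore -> collect plan."""
    return [SubTask(Kind.NAVIGATE, landmark),
            SubTask(Kind.EXPLORE, target),
            SubTask(Kind.COLLECT, target)]


def observe(world: World, agent: Position, cmap: CognitiveMap) -> None:
    """Record every object within view_radius of the agent in the cognitive map."""
    for label, pos in world.positions.items():
        if math.dist(agent, pos) <= world.view_radius:
            cmap.objects[label] = pos


def manager(plan: List[SubTask], world: World, start: Position,
            cmap: CognitiveMap) -> Optional[str]:
    """Dispatch each sub-task to a toy Actor, updating the map; None means failure."""
    agent = start
    observe(world, agent, cmap)
    answer: Optional[str] = None
    for task in plan:
        if task.kind is Kind.NAVIGATE:
            # Navigation Actor: only possible if the landmark is already on the map.
            if task.target not in cmap.objects:
                return None
            agent = cmap.objects[task.target]
            observe(world, agent, cmap)
        elif task.kind is Kind.EXPLORE:
            # Exploration Actor: probe nearby viewpoints until the target is mapped.
            for dx, dy in PROBE_OFFSETS:
                probe = (agent[0] + dx, agent[1] + dy)
                observe(world, probe, cmap)
                if task.target in cmap.objects:
                    agent = probe
                    break
            else:
                return None  # target never seen; a real Manager might replan here
        elif task.kind is Kind.COLLECT:
            # Collection Actor: read the target's attribute (e.g. its colour).
            if task.target not in cmap.objects:
                return None
            answer = world.attributes.get(task.target)
    return answer


if __name__ == "__main__":
    world = World(positions={"bank": (40.0, 0.0), "car": (130.0, 0.0)},
                  attributes={"car": "red"})
    print(manager(planner("bank", "car"), world, (0.0, 0.0), CognitiveMap()))  # red
```

In this toy run the car is out of view from the start and from the bank, so the Exploration Actor has to probe a second viewpoint before the Collection Actor can answer. Keeping the map in the Manager rather than in each Actor mirrors the abstract's division of labour: Actors only act and observe, while spatial state lives in one place.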

Anthology ID: 2025.emnlp-main.630
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 12465–12480
URL: https://aclanthology.org/2025.emnlp-main.630/
DOI: 10.18653/v1/2025.emnlp-main.630
Cite (ACL): Yong Zhao, Kai Xu, Zhengqiu Zhu, Yue Hu, Zhiheng Zheng, Yingfeng Chen, Yatai Ji, Chen Gao, Yong Li, and Jincai Huang. 2025. CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12465–12480, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space (Zhao et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.630.pdf
Checklist: 2025.emnlp-main.630.checklist.pdf
