Knowing More, Acting Better: Hierarchical Representation for Embodied Decision-Making

Chunhui Zhang; Zhongyu Ouyang; Xingjian Diao; Zheyuan Liu; Soroush Vosoughi

Abstract

Modern embodied AI uses multimodal large language models (MLLMs) as policy models, predicting actions from final-layer hidden states. This widely adopted approach assumes that monolithic last-layer representations suffice for decision-making, a structural simplification at odds with decades of cognitive science highlighting the importance of distributed, hierarchical processing for perception and action. To address this mismatch, we introduce a hierarchical action probing method that explicitly aggregates representations from all layers, mirroring the brain’s multi-level organization. Experiments reveal that early layers facilitate spatial grounding, middle layers support contextual integration, and later layers enable abstract generalization, showing that MLLMs inherently encode distributed action-relevant structures. A lightweight probe integrates these layer-wise features for spatial reasoning and contextual understanding, without costly backbone fine-tuning. This hierarchical solution significantly improves over standard last-layer embodied models in physical simulators, achieving a 46.6% success rate and a 62.5% gain in spatial reasoning tasks. These findings challenge conventional assumptions in embodied AI, establishing hierarchical probing as a principled alternative grounded in both cognitive theory and empirical evidence.
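As a rough illustration of what aggregating representations from all layers with a lightweight probe could look like, here is a minimal NumPy sketch. It is not the authors' implementation: the softmax-weighted layer mixing, the linear action head, the class name `LayerwiseProbe`, and all dimensions (24 layers, 64 hidden units, 8 actions) are assumptions chosen for illustration, and random vectors stand in for the hidden states of a frozen MLLM backbone.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


class LayerwiseProbe:
    """Hypothetical probe: mixes per-layer features, then scores actions.

    Only the probe parameters (layer_logits, W, b) would be trained;
    the backbone that produces the layer states stays frozen.
    """

    def __init__(self, num_layers, hidden_dim, num_actions, seed=0):
        rng = np.random.default_rng(seed)
        # One mixing logit per layer; zeros give a uniform mix at init.
        self.layer_logits = np.zeros(num_layers)
        # Linear action head (assumed; the paper's head may differ).
        self.W = rng.normal(scale=0.02, size=(hidden_dim, num_actions))
        self.b = np.zeros(num_actions)

    def forward(self, layer_states):
        # layer_states: (num_layers, hidden_dim), one pooled vector per layer.
        weights = softmax(self.layer_logits)   # (num_layers,)
        mixed = weights @ layer_states         # (hidden_dim,)
        return mixed @ self.W + self.b         # (num_actions,)


# Random stand-in for the hidden states of a frozen 24-layer backbone.
states = np.random.default_rng(1).normal(size=(24, 64))
probe = LayerwiseProbe(num_layers=24, hidden_dim=64, num_actions=8)
logits = probe.forward(states)
action = int(np.argmax(logits))
```

A last-layer baseline corresponds to putting all mixing weight on the final layer, so the sketch makes the contrast in the abstract concrete: the probe can draw on early, middle, and late layers instead of only the last one.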

Anthology ID: 2025.findings-emnlp.1042
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19144–19155
URL: https://aclanthology.org/2025.findings-emnlp.1042/
Cite (ACL): Chunhui Zhang, Zhongyu Ouyang, Xingjian Diao, Zheyuan Liu, and Soroush Vosoughi. 2025. Knowing More, Acting Better: Hierarchical Representation for Embodied Decision-Making. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 19144–19155, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Knowing More, Acting Better: Hierarchical Representation for Embodied Decision-Making (Zhang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.1042.pdf
Checklist: 2025.findings-emnlp.1042.checklist.pdf
