How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

Zhen Yang; Ping Jian (鉴萍); Zhongbin Guo; Zuming Zhang; Chengzhi Li; Yonghong Deng; Xinyue Zhang; Wenpeng Lu

How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

Zhen Yang, Ping Jian, Zhongbin Guo, Zuming Zhang, Chengzhi Li, Yonghong Deng, Xinyue Zhang, Wenpeng Lu

Abstract

Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual-spatial intelligence, where models have access to visuospatial information from visual inputs. However, in the absence of visual information, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence, and how models perform relevant tasks with text-only inputs still remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment given textual description of viewpoint rotation and observation over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed dataset while human can easily achieve 100% accuracy, indicating a substantial gap between current model capabilities and the requirements of spatial intelligence. To uncover the underlying mechanisms, we conduct a layer-wise probing analysis and head-wise causal intervention. Our findings reveal that although models encode viewpoint information in the hidden states, they appear to struggle to bind the viewpoint position with corresponding observation, resulting in a hallucination in final layers. Finally, we selectively fine-tune the key attention heads identified by causal intervention to improve VRU performance. Experimental results demonstrate that such selective fine-tuning achieves improved VRU performance while avoiding catastrophic forgetting of generic abilities.

Anthology ID:: 2026.acl-long.496
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 10842–10864
Language:
URL:: https://aclanthology.org/2026.acl-long.496/
DOI:
Bibkey:
Cite (ACL):: Zhen Yang, Ping Jian, Zhongbin Guo, Zuming Zhang, Chengzhi Li, Yonghong Deng, Xinyue Zhang, and Wenpeng Lu. 2026. How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10842–10864, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study (Yang et al., ACL 2026)
Copy Citation:
PDF:: https://aclanthology.org/2026.acl-long.496.pdf
Checklist:: 2026.acl-long.496.checklist.pdf

PDF Cite Search Checklist Fix data