VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft

Honghao Fu; Junlong Ren; Qi Chai; Deheng Ye; Yujun Cai; Hao Wang

doi:10.18653/v1/2025.emnlp-main.1111

Abstract

Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.

Anthology ID: 2025.emnlp-main.1111
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 21884–21898
URL: https://aclanthology.org/2025.emnlp-main.1111/
DOI: 10.18653/v1/2025.emnlp-main.1111
Cite (ACL): Honghao Fu, Junlong Ren, Qi Chai, Deheng Ye, Yujun Cai, and Hao Wang. 2025. VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21884–21898, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft (Fu et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1111.pdf
Checklist: 2025.emnlp-main.1111.checklist.pdf
