TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making

Kechen Jiao; Zhirui Fang; Jiahao Liu; Bei Li; Qifan Wang; Xinyu Liu; Junhao Ruan; Zhongjian Qiao; Yifan Zhu; Yaxin Xu; Jingang Wang; Xiu Li

doi:10.18653/v1/2025.emnlp-main.484

Abstract

Effectively using the generalization capabilities of vision-language models (VLMs) in context-specific, dynamic tasks for embodied artificial intelligence remains a significant challenge. Although models trained with supervised fine-tuning (SFT) align better with the real physical world, they still exhibit sluggish responses and hallucinations in dynamically changing environments, necessitating further alignment. Existing post-SFT methods, which rely on reinforcement learning and chain-of-thought (CoT) approaches, are constrained by sparse rewards and action-only optimization, resulting in low sample efficiency, poor consistency, and model degradation. To address these issues, this paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. Specifically, TCPO introduces a stepwise preference-based optimization approach that transforms sparse reward signals into richer step-level sample pairs. It emphasizes aligning the model’s intermediate reasoning process, which mitigates model degradation. Moreover, by incorporating an Action Policy Consistency Constraint (APC), it further imposes consistency constraints on the model output. Experiments in the ALFWorld environment demonstrate an average success rate of **26.67%**, a **6%** improvement over RL4VLM, validating the effectiveness of our approach in mitigating model degradation after fine-tuning. These results highlight the potential of integrating preference-based learning techniques with CoT processes to enhance the decision-making capabilities of vision-language models in embodied agents.
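
The abstract does not state the training objective, so the following is only a rough illustration of how step-level sample pairs can feed a preference objective. It uses the standard DPO pairwise loss as a stand-in; the choice of DPO, the function names, and the `beta` value are all assumptions made for this sketch and are not taken from the paper. The APC constraint is not modeled here.

```python
import math


def dpo_step_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair at a single step.

    Inputs are sequence log-probabilities of the preferred and dispreferred
    outputs under the trained policy and a frozen reference model.
    Hypothetical helper: not the paper's objective.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_step_loss(pairs, beta=0.1):
    """Average the pairwise loss over all step-level pairs of a trajectory."""
    return sum(dpo_step_loss(*pair, beta=beta) for pair in pairs) / len(pairs)
```

Under these assumptions, every step of a trajectory contributes its own loss term, which is one way a single sparse episode-level reward could be turned into a denser training signal.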

Anthology ID: 2025.emnlp-main.484
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 9574–9588
URL: https://aclanthology.org/2025.emnlp-main.484/
DOI: 10.18653/v1/2025.emnlp-main.484
Cite (ACL): Kechen Jiao, Zhirui Fang, Jiahao Liu, Bei Li, Qifan Wang, Xinyu Liu, Junhao Ruan, Zhongjian Qiao, Yifan Zhu, Yaxin Xu, Jingang Wang, and Xiu Li. 2025. TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9574–9588, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making (Jiao et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.484.pdf
Checklist: 2025.emnlp-main.484.checklist.pdf
