Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing

Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy Chen, Shafiq Joty


Abstract
Large Language Models (LLMs) have demonstrated significant potential in handling complex reasoning tasks through step-by-step rationale generation. However, recent studies have raised concerns regarding hallucinations and flaws in their reasoning process. Substantial efforts are being made to improve the reliability and faithfulness of the generated rationales. Some approaches model reasoning as planning, while others focus on annotating data for process supervision. Nevertheless, the planning-based search process often results in high latency due to the frequent assessment of intermediate reasoning states and the extensive exploration space. Additionally, supervising the reasoning process with human annotation is costly and challenging to scale for LLM training. To address these issues, we propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories, which are ranked according to synthesized process rewards. Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework, showing that our 7B model can surpass strong counterparts such as GPT-3.5-Turbo.
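The core idea in the abstract, turning reward-ranked trajectories into preference pairs and optimizing them with DPO, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `margin` threshold, the `beta` value, and the example rewards are all assumptions made for illustration, and the abstract does not say how the process rewards are synthesized. Only the per-pair loss follows the standard DPO objective, computed here on scalar sequence log-probabilities.

```python
import math
from itertools import combinations


def build_preference_pairs(trajectories, margin=0.0):
    """Turn (text, reward) trajectories into (chosen, rejected) pairs.

    Hypothetical helper: a pair is kept only when the reward gap
    exceeds `margin`; ties and near-ties are skipped.
    """
    pairs = []
    for (text_a, r_a), (text_b, r_b) in combinations(trajectories, 2):
        if r_a - r_b > margin:
            pairs.append((text_a, text_b))
        elif r_b - r_a > margin:
            pairs.append((text_b, text_a))
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    """
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Illustrative rewards only; real values would come from the reward synthesis step.
trajectories = [("a", 0.9), ("b", 0.2), ("c", 0.2)]
pairs = build_preference_pairs(trajectories)
loss_at_init = dpo_loss(0.0, 0.0, 0.0, 0.0)
loss_improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In this sketch the loss starts at log 2 when the policy equals the reference model and drops as the policy assigns relatively more probability to the higher-reward trajectory. Because training on such pairs happens offline, the search cost is paid during trajectory collection rather than at inference time, which matches the latency motivation stated in the abstract.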
Anthology ID:
2024.emnlp-main.20
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
334–350
URL:
https://aclanthology.org/2024.emnlp-main.20
Cite (ACL):
Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy Chen, and Shafiq Joty. 2024. Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 334–350, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing (Jiao et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.20.pdf