DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller Language Models

Chengcheng Han, Xiaowei Du, Che Zhang, Yixin Lian, Xiang Li, Ming Gao, Baoyuan Wang


Abstract
Chain-of-Thought (CoT) prompting has successfully enhanced the reasoning capabilities of Large Language Models (LLMs) with at least 100 billion parameters. However, it is ineffective, or even detrimental, on reasoning tasks for Smaller Language Models (SLMs) with fewer than 10 billion parameters. In this paper, we propose Dialogue-guided Chain-of-Thought (DialCoT), which improves the reasoning capabilities of SLMs by generating intermediate reasoning steps in a dialogue format that guides the model to the final answer. Furthermore, we optimize the model to choose the optimal reasoning path through the Proximal Policy Optimization (PPO) algorithm, further enhancing its reasoning capabilities. Compared to previous methods, our approach has two advantages: 1) We transform the process of solving a complex reasoning problem into decomposing the problem and solving a series of simpler sub-questions, which significantly reduces task difficulty and makes the task more suitable for SLMs. 2) We optimize the model to choose the optimal reasoning path through the PPO algorithm. Comprehensive experiments on four arithmetic reasoning datasets show that our method achieves significant performance gains over state-of-the-art competitors.
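The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `solve_by_dialogue`, `answer_fn`, and the toy sub-questions are hypothetical names and data introduced only for illustration, and a lookup table stands in for the language model. The second function shows the standard per-sample clipped surrogate term from PPO; how the paper defines rewards and advantages is not stated in the abstract, so those values are left as plain inputs.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def solve_by_dialogue(
    question: str,
    sub_questions: List[str],
    answer_fn: Callable[[List[Turn], str], str],
) -> Tuple[List[Turn], str]:
    """Answer sub-questions one turn at a time, feeding the history back in."""
    history: List[Turn] = [("user", question)]
    answer = ""
    for sub_q in sub_questions:
        # answer_fn is a hypothetical stand-in for a call to a small language model.
        answer = answer_fn(history, sub_q)
        history.append(("user", sub_q))
        history.append(("assistant", answer))
    # Assumption: the answer to the last sub-question is the final answer.
    return history, answer


def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Toy answerer: a lookup table instead of a language model.
    canned = {
        "How many apples does Tom have after buying 3 more than his 2?": "5",
        "How many apples remain after he eats 1 of the 5?": "4",
    }
    history, final = solve_by_dialogue(
        "Tom has 2 apples, buys 3 more, then eats 1. How many are left?",
        list(canned),
        lambda hist, q: canned[q],
    )
    print(final)
```

Passing the whole `history` to `answer_fn` at every turn is the point of the dialogue format in this sketch: each sub-question is answered with the earlier sub-answers in context, so no single turn has to solve the full problem.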
Anthology ID:
2023.emnlp-main.501
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8055–8068
URL:
https://aclanthology.org/2023.emnlp-main.501
DOI:
10.18653/v1/2023.emnlp-main.501
Cite (ACL):
Chengcheng Han, Xiaowei Du, Che Zhang, Yixin Lian, Xiang Li, Ming Gao, and Baoyuan Wang. 2023. DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8055–8068, Singapore. Association for Computational Linguistics.
Cite (Informal):
DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller Language Models (Han et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.501.pdf
Video:
https://aclanthology.org/2023.emnlp-main.501.mp4