Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Tianduo Wang, Shichen Li, Wei Lu


Abstract
Teaching small-scale language models to perform math reasoning is a valuable yet challenging task. Besides obtaining labeled data from human experts, one of the most common ways to collect high-quality data is to sample from a larger, more powerful language model. Although previous works have demonstrated the effectiveness of this method, such a knowledge distillation paradigm can be costly and unstable, especially considering that many large language models, such as GPT-4, are closed-source, proprietary, and unpredictable in their behavior. In this work, to avoid relying on outputs from large models, we demonstrate that the reasoning abilities of small-scale language models can be enhanced through self-training, i.e., training models on their own outputs. We also show that vanilla self-training can be further augmented with an alignment algorithm, direct preference optimization (DPO). We empirically find that models trained with the DPO objective produce better generations that substantially benefit multi-turn self-training. Our experiments show that our models outperform state-of-the-art models of comparable size on a series of downstream math reasoning tasks with minimal resource requirements.
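For background, the DPO objective that the abstract refers to (introduced by Rafailov et al., 2023; the notation below is the standard DPO formulation, not taken from this paper's text) can be sketched as

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],
\]

where \(\pi_\theta\) is the model being trained, \(\pi_{\mathrm{ref}}\) is a frozen reference model, \(\beta\) controls the strength of the implicit KL regularization, and \((y_w, y_l)\) is a preferred/dispreferred pair of responses to prompt \(x\). In a self-training setting such as the one described above, these pairs would presumably be formed from the model's own sampled chain-of-thought solutions (e.g., ranked by answer correctness), though the abstract itself does not specify this. Because DPO optimizes this contrastive log-likelihood directly, it requires no separately trained reward model, which makes it a natural fit for a lightweight, multi-turn self-training loop.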
Anthology ID: 2024.acl-long.643
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 11917–11928
URL: https://aclanthology.org/2024.acl-long.643
DOI: 10.18653/v1/2024.acl-long.643
Cite (ACL): Tianduo Wang, Shichen Li, and Wei Lu. 2024. Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11917–11928, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning (Wang et al., ACL 2024)
PDF: https://aclanthology.org/2024.acl-long.643.pdf