CoTD-PO: Chain-of-Thought Distillation with Preference Optimization
Lujie Niu | Haochen Sun | Fangkun Zhao | Sheng Chen | Zimeng Bai | Jiawei Zhang | Caixia Yuan | Xiaojie Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Chain-of-Thought (CoT) distillation has emerged as a promising paradigm to enhance the reasoning ability of small language models by imitating the reasoning and outputs of larger teacher models. However, existing approaches suffer from a critical limitation: a distribution mismatch between teacher-generated training trajectories and the student model’s own generative distribution. This mismatch leads to exposure bias during inference and often induces mode collapse or mode averaging, thereby degrading the student model’s generative diversity and robustness. To address these issues, we propose CoTD-PO (Chain-of-Thought Distillation with Preference Optimization), a reinforcement learning framework that shifts the training paradigm from passive imitation to active trajectory exploration. Instead of forcing the student to imitate exact teacher traces, our method enables the student to sample its own answer paths. To support training with non-open-source teacher models, we approximate the teacher’s output distribution through preference-based scoring. Furthermore, we adopt an offline iterative training procedure that enables stable and efficient optimization. Experiments on diverse open-ended generation tasks demonstrate that CoTD-PO significantly outperforms standard CoT distillation baselines, achieving higher output quality while mitigating mode collapse and preserving semantic diversity.
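The abstract does not spell out the training objective, so the sketch below is only illustrative of the kind of loop it describes: the student samples its own answer trajectories, the (possibly closed-source) teacher is queried only for pairwise preferences to approximate its output distribution, and a DPO-style pairwise loss is accumulated offline per round. All function names (`sample_trajectory`, `teacher_prefers`, etc.) are hypothetical placeholders, not the authors' implementation.

```python
import math
import random

# Hypothetical stand-ins for the student policy and the (closed) teacher.
# In the paper these would be language models; here they are placeholders
# so the loop structure is runnable end to end.

def sample_trajectory(prompt):
    """Placeholder: sample a CoT trajectory (reasoning + answer) from the student
    and return (text, sum of token log-probs under the current student)."""
    text = f"{prompt} :: sampled reasoning {random.random():.3f}"
    return text, -random.uniform(5.0, 50.0)

def reference_logprob(text):
    """Placeholder: log-prob of the trajectory under a frozen reference student."""
    return -random.uniform(5.0, 50.0)

def teacher_prefers(traj_a, traj_b):
    """Placeholder: preference-based scoring by the teacher, e.g. prompting the
    closed-source teacher to pick the better of two student trajectories."""
    return random.random() < 0.5

def dpo_pair_loss(lp_w, lp_l, lp_ref_w, lp_ref_l, beta=0.1):
    """DPO-style pairwise loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    margin = beta * ((lp_w - lp_ref_w) - (lp_l - lp_ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def offline_iteration(prompts, pairs_per_prompt=1):
    """One offline round: sample student trajectories, rank them with teacher
    preferences, and accumulate the pairwise loss (the gradient step is omitted)."""
    total, n = 0.0, 0
    for prompt in prompts:
        for _ in range(pairs_per_prompt):
            (ta, lpa), (tb, lpb) = sample_trajectory(prompt), sample_trajectory(prompt)
            if teacher_prefers(ta, tb):
                (tw, lpw), (tl, lpl) = (ta, lpa), (tb, lpb)
            else:
                (tw, lpw), (tl, lpl) = (tb, lpb), (ta, lpa)
            total += dpo_pair_loss(lpw, lpl,
                                    reference_logprob(tw),
                                    reference_logprob(tl))
            n += 1
    return total / max(n, 1)

print(offline_iteration(["Explain why the sky is blue."]))
```

Because the student learns from preferences over its own samples rather than from exact teacher traces, this kind of loop avoids the exposure bias of pure imitation; the exact loss and teacher-querying scheme used by CoTD-PO may differ from this sketch.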