Yishuo Cai


2024

pdf bib
Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu | Yishuo Cai | Zhenhong Zhou | Renjie Gu | Haiqin Weng | Liu Yan | Tianwei Zhang | Wei Xu | Han Qiu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

The risk of harmful contents generated by large language models (LLMs) becomes a critical concern. This paper systematically evaluates and enhances LLMs’ capability to perform course-correction, , the model can steer away from generating harmful content autonomously. First, we introduce the C2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction.To improve, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C2-Syn, a synthetic C2-Syn with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven learning.Experiments on Llama2-Chat 7B and Qwen2 7B show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs’ safety, particularly in resisting jailbreak attacks.