Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu | Yishuo Cai | Zhenhong Zhou | Renjie Gu | Haiqin Weng | Liu Yan | Tianwei Zhang | Wei Xu | Han Qiu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper systematically evaluates and enhances LLMs' capability to perform course-correction, i.e., to steer away from generating harmful content autonomously. First, we introduce the C2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing the varying proficiency of current safety-tuned LLMs in course-correction. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C2-Syn, a synthetic dataset of 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven learning. Experiments on Llama2-Chat 7B and Qwen2 7B show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it improves LLMs' safety, particularly their resistance to jailbreak attacks.