RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

Authors: Saeed Khaki, JinJin Li, Lan Ma, Liu Yang, Prathap Ramachandra
Published: 2024-06
Venue: Findings of the Association for Computational Linguistics: NAACL 2024 (conference publication)
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Publisher: Association for Computational Linguistics
Location: Mexico City, Mexico
Pages: 1665-1680
DOI: 10.18653/v1/2024.findings-naacl.108
URL: https://aclanthology.org/2024.findings-naacl.108/
Citation key: khaki-etal-2024-rs