Pulkit Pattnaik
2024
Enhancing Alignment using Curriculum Learning & Ranked Preferences
Pulkit Pattnaik
|
Rishabh Maheshwary
|
Kelechi Ogueji
|
Vikas Yadav
|
Sathwik Tejaswi Madhusudhan
Findings of the Association for Computational Linguistics: EMNLP 2024
Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data (one chosen and rejected response per prompt) to align LLMs to human preferences. In practice, multiple responses could exist for a given prompt with varying quality relative to each other. We propose to utilize these responses to create multiple preference pairs for a given prompt. Our work focuses on aligning LLMs by systematically curating multiple preference pairs and presenting them in a meaningful manner facilitating curriculum learning to enhance the prominent DPO technique. We order multiple preference pairs from easy to hard, according to various criteria thus emulating curriculum learning. Our method, which is referred to as Curri-DPO consistently shows increased performance gains on MTbench, Vicuna bench, WizardLM, highlighting its effectiveness over standard DPO setting that utilizes single preference pair. More specifically, Curri-DPO achieves a score of 7.43 on MTbench with Zephyr-7B, outperforming majority of existing LLMs with similar parameter size. Curri-DPO also achieves the highest win rates on Vicuna, WizardLM, and UltraFeedback test sets (90.7%, 87.1%, and 87.9% respectively) in our experiments, with notable gains of up to 7.5% when compared to standard DPO. We release the preference pairs used in alignment at: https://huggingface.co/datasets/ServiceNow-AI/Curriculum_DPO_preferences.
Search