Causal Direct Preference Optimization for Language Model Alignment
Uyen Le | Thin Nguyen | Toan Nguyen | Toan Doan | Trung Le | Bac Le
Findings of the Association for Computational Linguistics: EACL 2026
Direct Preference Optimization (DPO) is a powerful approach for aligning large language models (LLMs) with human preferences: it formulates preference learning as a supervised classification problem over pairwise human-labeled outputs, thereby enabling stable and efficient training. We show that DPO inherits bias from confounders (e.g., topic, style, user objectives) that shape data generation and persist through training, hindering recovery of true human preferences. We address this problem from a causal perspective, proposing Causal Direct Preference Optimization (CDPO), a general framework that incorporates causal inference principles to mitigate the influence of confounders and sharpen the signal of genuine human preferences. Our approach preserves the tractability of direct optimization while enhancing robustness to spurious correlations and annotation biases. Empirical evaluations on benchmark datasets show that CDPO surpasses DPO-based baselines, achieving unbiased fine-tuning through causal reasoning and confirming the effectiveness of confounder-aware preference optimization.
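For context, the supervised classification formulation the abstract refers to is the standard DPO objective of Rafailov et al. (2023); the notation below (policy π_θ, frozen reference policy π_ref, temperature β, preferred/dispreferred responses y_w/y_l) is the usual DPO convention rather than anything defined in this abstract, and the abstract does not specify how CDPO modifies this loss:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

Here σ is the logistic function and D is the pairwise preference dataset. Because the labels (y_w, y_l) in D reflect whatever confounders shaped data generation, this supervised signal absorbs those biases, which is the failure mode the paper's confounder-aware objective targets.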