DiffPO: Diffusion-styled Preference Optimization for Inference Time Alignment of Large Language Models

Ruizhe Chen; Wenhao Chai; Zhifei Yang; Xiaotian Zhang; Ziyang Wang; Tony Quek; Joey Tianyi Zhou; Soujanya Poria; Zuozhu Liu

doi:10.18653/v1/2025.acl-long.926

Abstract

Inference-time alignment provides an efficient alternative for aligning LLMs with humans. However, these approaches still face challenges, such as limited scalability due to policy-specific value functions and latency during the inference phase. In this paper, we propose a novel approach, Diffusion-styled Preference Optimization (DiffPO), which provides an efficient and policy-agnostic solution for aligning LLMs with humans. By performing alignment directly at the sentence level, DiffPO avoids the latency associated with token-level generation. Designed as a plug-and-play module, DiffPO can be seamlessly integrated with various base models to enhance their alignment. Extensive experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings, with a favorable trade-off between alignment quality and inference-time latency. Furthermore, DiffPO demonstrates model-agnostic scalability, significantly improving the performance of large models such as Llama-3-70B.

Anthology ID: 2025.acl-long.926
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 18910–18925
URL: https://aclanthology.org/2025.acl-long.926/
DOI: 10.18653/v1/2025.acl-long.926
Cite (ACL): Ruizhe Chen, Wenhao Chai, Zhifei Yang, Xiaotian Zhang, Ziyang Wang, Tony Quek, Joey Tianyi Zhou, Soujanya Poria, and Zuozhu Liu. 2025. DiffPO: Diffusion-styled Preference Optimization for Inference Time Alignment of Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18910–18925, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): DiffPO: Diffusion-styled Preference Optimization for Inference Time Alignment of Large Language Models (Chen et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.926.pdf