Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner

Bolian Li; Yanran Wu; Xinyu Luo; Ruqi Zhang

doi:10.18653/v1/2025.emnlp-main.578

Abstract

Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. To address this efficiency bottleneck, we draw inspiration from speculative sampling, an acceleration technique that uses a small draft model to efficiently predict future tokens. We introduce the reward-shifted speculative sampling (SSS) algorithm, in which the draft model is aligned with human preferences, while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
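For orientation, the verification step of standard speculative sampling, which the abstract builds on, can be sketched as follows. This is the widely known baseline rule and not the paper's reward-shifted variant: the abstract states that SSS modifies the acceptance criterion and the bonus token distribution but does not give their form, so they are not reproduced here. The function name `speculative_step` and the toy distributions are illustrative assumptions.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """One draft-then-verify step of standard speculative sampling.

    p_target, q_draft: probability lists over the same toy vocabulary
    (hypothetical inputs, not taken from the paper).
    Returns (token, accepted_flag).
    """
    vocab = range(len(q_draft))
    # The small draft model proposes a token.
    x = rng.choices(vocab, weights=q_draft, k=1)[0]
    # Baseline acceptance rule: accept with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    x = rng.choices(vocab, weights=residual, k=1)[0]
    return x, False


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.6, 0.3, 0.1]  # toy target distribution (assumption)
    q = [0.2, 0.3, 0.5]  # toy draft distribution (assumption)
    n = 20000
    counts = [0] * len(p)
    for _ in range(n):
        token, _ = speculative_step(p, q, rng)
        counts[token] += 1
    # Empirical frequencies should be close to p, not q.
    print([round(c / n, 3) for c in counts])
```

Under the baseline rule, the emitted tokens follow the target distribution exactly, whatever the draft model proposes. According to the abstract, SSS changes this rule so that an aligned draft and an unaligned target together recover the RLHF optimal solution instead.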

Anthology ID: 2025.emnlp-main.578
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 11468–11478
URL: https://aclanthology.org/2025.emnlp-main.578/
DOI: 10.18653/v1/2025.emnlp-main.578
Cite (ACL): Bolian Li, Yanran Wu, Xinyu Luo, and Ruqi Zhang. 2025. Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11468–11478, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner (Li et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.578.pdf
Checklist: 2025.emnlp-main.578.checklist.pdf
