@inproceedings{yunfanxie-etal-2025-weak,
title = "Weak-to-Strong Honesty Alignment via Learning-to-Rank Supervision",
author = "Xie, Yunfan and
Zou, Lixin and
Luo, Dan and
Tang, Min and
Li, Chenliang",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.529/",
doi = "10.18653/v1/2025.findings-acl.529",
pages = "10154--10168",
ISBN = "979-8-89176-256-5",
abstract = "Honest alignment refers to the ability of a language model to truthfully convey its knowledge limitations by appropriately refusing to answer questions when it lacks sufficient information. Existing solutions, such as prompt engineering and fine-tuning, face limitations: the former provides only marginal improvements, while the latter struggles to enhance honesty when annotated data is scarce.To overcome the above limitations, we propose , a novel framework that enhances honesty through weak-to-strong generalization. Specifically, we train the strong LLMs under weak model supervision to improve their honesty. For the weak model, we employ a learning-to-rank strategy to train a ``honest head'', which learns to select the most honest response among model{'}s outputs generated through beam search. For the strong LLM, we leverage the self-labeled dataset to update its parameters. Our proposal requires only minimal training data to train the weak honest model, yet achieve decent performance for labeling data. In addition, it enables the strong LLMs to have the capabilities to generalize even facing with the flawed label data. Extensive experiments show significantly boosts honest alignment in large models even with limited labeled data. Our code is available at \url{https://github.com/zewanfaan/WHAT_Honesty}."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="yunfanxie-etal-2025-weak">
<titleInfo>
<title>Weak-to-Strong Honesty Alignment via Learning-to-Rank Supervision</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yunfan</namePart>
<namePart type="family">Xie</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lixin</namePart>
<namePart type="family">Zou</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dan</namePart>
<namePart type="family">Luo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Min</namePart>
<namePart type="family">Tang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chenliang</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-256-5</identifier>
</relatedItem>
<abstract>Honest alignment refers to the ability of a language model to truthfully convey its knowledge limitations by appropriately refusing to answer questions when it lacks sufficient information. Existing solutions, such as prompt engineering and fine-tuning, face limitations: the former provides only marginal improvements, while the latter struggles to enhance honesty when annotated data is scarce. To overcome these limitations, we propose a novel framework that enhances honesty through weak-to-strong generalization. Specifically, we train strong LLMs under weak-model supervision to improve their honesty. For the weak model, we employ a learning-to-rank strategy to train an “honest head”, which learns to select the most honest response among the model’s outputs generated through beam search. For the strong LLM, we leverage the self-labeled dataset to update its parameters. Our proposal requires only minimal training data to train the weak honest model, yet achieves decent performance for labeling data. In addition, it enables the strong LLMs to generalize even when facing flawed label data. Extensive experiments show that our framework significantly boosts honest alignment in large models even with limited labeled data. Our code is available at https://github.com/zewanfaan/WHAT_Honesty.</abstract>
<identifier type="citekey">yunfanxie-etal-2025-weak</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.529</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.529/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>10154</start>
<end>10168</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Weak-to-Strong Honesty Alignment via Learning-to-Rank Supervision
%A Xie, Yunfan
%A Zou, Lixin
%A Luo, Dan
%A Tang, Min
%A Li, Chenliang
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F yunfanxie-etal-2025-weak
%X Honest alignment refers to the ability of a language model to truthfully convey its knowledge limitations by appropriately refusing to answer questions when it lacks sufficient information. Existing solutions, such as prompt engineering and fine-tuning, face limitations: the former provides only marginal improvements, while the latter struggles to enhance honesty when annotated data is scarce. To overcome these limitations, we propose a novel framework that enhances honesty through weak-to-strong generalization. Specifically, we train strong LLMs under weak-model supervision to improve their honesty. For the weak model, we employ a learning-to-rank strategy to train an “honest head”, which learns to select the most honest response among the model’s outputs generated through beam search. For the strong LLM, we leverage the self-labeled dataset to update its parameters. Our proposal requires only minimal training data to train the weak honest model, yet achieves decent performance for labeling data. In addition, it enables the strong LLMs to generalize even when facing flawed label data. Extensive experiments show that our framework significantly boosts honest alignment in large models even with limited labeled data. Our code is available at https://github.com/zewanfaan/WHAT_Honesty.
%R 10.18653/v1/2025.findings-acl.529
%U https://aclanthology.org/2025.findings-acl.529/
%U https://doi.org/10.18653/v1/2025.findings-acl.529
%P 10154-10168
Markdown (Informal)
[Weak-to-Strong Honesty Alignment via Learning-to-Rank Supervision](https://aclanthology.org/2025.findings-acl.529/) (Xie et al., Findings 2025)
ACL
Yunfan Xie, Lixin Zou, Dan Luo, Min Tang, and Chenliang Li. 2025. Weak-to-Strong Honesty Alignment via Learning-to-Rank Supervision. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10154–10168, Vienna, Austria. Association for Computational Linguistics.