Supervised Optimism Correction: Be Confident When LLMs Are Sure

Junjie Zhang; Rushuai Yang; Shunyu Liu; Ting-En Lin; Fei Huang; Yi Chen; Yongbin Li; Dacheng Tao

doi:10.18653/v1/2025.findings-acl.463

Abstract

In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit Q-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated Q-value estimations of suboptimal steps. To address this limitation, we propose **S**upervised **O**ptimism **C**orrection (SOC), which introduces a simple yet effective auxiliary loss for token-level Q-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, showcase the superiority of the proposed SOC with beam search across a series of open-source models.
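The abstract does not spell out how a language model's outputs can be read as an implicit Q-function. One common reading, assumed here for illustration and not necessarily the paper's exact formulation, is the maximum-entropy view: treat the logits at a decoding step as Q(s, a), so that the soft state value is V(s) = logsumexp Q(s, ·) and the token log-probability is Q(s, a) − V(s). The sketch below uses hypothetical logits and helper names (`soft_value`, `log_policy`) to show that the cross-entropy target only sees Q − V, so the scale of V is left unconstrained by the standard supervised fine-tuning loss. It does not implement the SOC auxiliary loss.

```python
import math


def soft_value(logits):
    """Soft state value V(s) = logsumexp over Q(s, .), computed stably."""
    m = max(logits)
    return m + math.log(sum(math.exp(q - m) for q in logits))


def log_policy(logits):
    """Token log-probabilities: log pi(a|s) = Q(s, a) - V(s)."""
    v = soft_value(logits)
    return [q - v for q in logits]


# Hypothetical logits for one decoding step over a 3-token vocabulary
# (illustrative numbers only, not taken from the paper).
logits = [2.0, 0.5, -1.0]

# Adding a constant to every logit leaves the policy unchanged ...
shifted = [q + 3.0 for q in logits]
same_policy = all(
    abs(a - b) < 1e-9 for a, b in zip(log_policy(logits), log_policy(shifted))
)

# ... but moves the implicit value estimate by exactly that constant.
value_gap = soft_value(shifted) - soft_value(logits)

print("policy unchanged:", same_policy)
print("value shift:", round(value_gap, 6))
```

Under this assumed reading, a loss term that acts on the implicit value in addition to the token log-probabilities is a natural place to regularize, which is the general role the abstract assigns to the auxiliary loss.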

Anthology ID: 2025.findings-acl.463
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8867–8880
URL: https://aclanthology.org/2025.findings-acl.463/
DOI: 10.18653/v1/2025.findings-acl.463
Cite (ACL): Junjie Zhang, Rushuai Yang, Shunyu Liu, Ting-En Lin, Fei Huang, Yi Chen, Yongbin Li, and Dacheng Tao. 2025. Supervised Optimism Correction: Be Confident When LLMs Are Sure. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8867–8880, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Supervised Optimism Correction: Be Confident When LLMs Are Sure (Zhang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.463.pdf
