@inproceedings{luan-etal-2025-bose,
title = "{BOSE}: A Systematic Evaluation Method Optimized for Base Models",
author = "Luan, Hongzhi and
Tian, Changxin and
Huan, Zhaoxin and
Zhang, Xiaolu and
Chen, Kunlong and
Zhang, Zhiqiang and
Zhou, Jun",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.830/",
doi = "10.18653/v1/2025.findings-acl.830",
pages = "16147--16158",
ISBN = "979-8-89176-256-5",
abstract = "This paper poses two critical issues in evaluating base models (without post-training): (1) Unstable evaluation during training: in the early stages of pre-training, the models lack the capability to answer questions as required, leading to unstable evaluation results. This instability makes it difficult to provide solid conclusions to guide the training, especially for key experiments such as data ablation and scaling law. (2) Inconsistency between base and instruct models: base models generally exhibit poorer evaluation performance compared to corresponding instruct models. This gap poses a challenge for assessing whether a base model with better evaluation can truly lead to a better instruct model. To address these issues, we propose **B**ase model **O**riented **S**ystematic **E**valuation (**BOSE**), a method specifically designed to optimize the evaluation of base models. Specifically, BOSE introduces two key innovations: In-Context Light-instruction Prompt (**ICLiP**) for open-ended tasks and **Blank-ppl** for multi-choice tasks with candidate options, which transforms the standard perplexity (ppl) metric into a fill-in-the-blank format to mitigate early-stage evaluation fluctuations. Furthermore, we are the first to propose Kendall{'}s rank correlation to quantitatively measure the evaluation stability and consistency. Experimental results demonstrate that BOSE significantly enhances both the stability of evaluations during pre-training and the consistency between base and instruct models, thereby providing more reliable guidance for the LLMs' training."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="luan-etal-2025-bose">
<titleInfo>
<title>BOSE: A Systematic Evaluation Method Optimized for Base Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Hongzhi</namePart>
<namePart type="family">Luan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Changxin</namePart>
<namePart type="family">Tian</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhaoxin</namePart>
<namePart type="family">Huan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xiaolu</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kunlong</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhiqiang</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jun</namePart>
<namePart type="family">Zhou</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-256-5</identifier>
</relatedItem>
<abstract>This paper poses two critical issues in evaluating base models (without post-training): (1) Unstable evaluation during training: in the early stages of pre-training, the models lack the capability to answer questions as required, leading to unstable evaluation results. This instability makes it difficult to provide solid conclusions to guide the training, especially for key experiments such as data ablation and scaling law. (2) Inconsistency between base and instruct models: base models generally exhibit poorer evaluation performance compared to corresponding instruct models. This gap poses a challenge for assessing whether a base model with better evaluation can truly lead to a better instruct model. To address these issues, we propose **B**ase model **O**riented **S**ystematic **E**valuation (**BOSE**), a method specifically designed to optimize the evaluation of base models. Specifically, BOSE introduces two key innovations: In-Context Light-instruction Prompt (**ICLiP**) for open-ended tasks and **Blank-ppl** for multi-choice tasks with candidate options, which transforms the standard perplexity (ppl) metric into a fill-in-the-blank format to mitigate early-stage evaluation fluctuations. Furthermore, we are the first to propose Kendall’s rank correlation to quantitatively measure the evaluation stability and consistency. Experimental results demonstrate that BOSE significantly enhances both the stability of evaluations during pre-training and the consistency between base and instruct models, thereby providing more reliable guidance for the LLMs’ training.</abstract>
<identifier type="citekey">luan-etal-2025-bose</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.830</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.830/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>16147</start>
<end>16158</end>
</extent>
</part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T BOSE: A Systematic Evaluation Method Optimized for Base Models
%A Luan, Hongzhi
%A Tian, Changxin
%A Huan, Zhaoxin
%A Zhang, Xiaolu
%A Chen, Kunlong
%A Zhang, Zhiqiang
%A Zhou, Jun
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F luan-etal-2025-bose
%X This paper poses two critical issues in evaluating base models (without post-training): (1) Unstable evaluation during training: in the early stages of pre-training, the models lack the capability to answer questions as required, leading to unstable evaluation results. This instability makes it difficult to provide solid conclusions to guide the training, especially for key experiments such as data ablation and scaling law. (2) Inconsistency between base and instruct models: base models generally exhibit poorer evaluation performance compared to corresponding instruct models. This gap poses a challenge for assessing whether a base model with better evaluation can truly lead to a better instruct model. To address these issues, we propose **B**ase model **O**riented **S**ystematic **E**valuation (**BOSE**), a method specifically designed to optimize the evaluation of base models. Specifically, BOSE introduces two key innovations: In-Context Light-instruction Prompt (**ICLiP**) for open-ended tasks and **Blank-ppl** for multi-choice tasks with candidate options, which transforms the standard perplexity (ppl) metric into a fill-in-the-blank format to mitigate early-stage evaluation fluctuations. Furthermore, we are the first to propose Kendall’s rank correlation to quantitatively measure the evaluation stability and consistency. Experimental results demonstrate that BOSE significantly enhances both the stability of evaluations during pre-training and the consistency between base and instruct models, thereby providing more reliable guidance for the LLMs’ training.
%R 10.18653/v1/2025.findings-acl.830
%U https://aclanthology.org/2025.findings-acl.830/
%U https://doi.org/10.18653/v1/2025.findings-acl.830
%P 16147-16158

Markdown (Informal)

[BOSE: A Systematic Evaluation Method Optimized for Base Models](https://aclanthology.org/2025.findings-acl.830/) (Luan et al., Findings 2025)

ACL

- Hongzhi Luan, Changxin Tian, Zhaoxin Huan, Xiaolu Zhang, Kunlong Chen, Zhiqiang Zhang, and Jun Zhou. 2025. BOSE: A Systematic Evaluation Method Optimized for Base Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16147–16158, Vienna, Austria. Association for Computational Linguistics.
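
The abstract describes Blank-ppl as recasting multiple-choice scoring from standard perplexity into a fill-in-the-blank format. As a rough illustration of that idea only, here is a minimal Python sketch: the template string, the placeholder model `gpt2`, and the helper names `option_ppl` and `pick_answer` are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of fill-in-the-blank perplexity scoring for a
# multiple-choice question, in the spirit of Blank-ppl as summarized in the
# abstract. The template and scoring details are assumptions, not the
# paper's exact method.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute the base checkpoint under evaluation
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_ppl(question: str, option: str) -> float:
    """Perplexity of `option` filling the blank after the question."""
    # Hypothetical fill-in-the-blank template (an assumption). Assumes the
    # prompt's tokenization is a prefix of the full sequence's tokenization,
    # which holds for typical BPE tokenizers with this spacing.
    prompt = f"{question} Answer:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100  # score only the option tokens
    with torch.no_grad():
        # Hugging Face causal LMs return the mean NLL over unmasked labels.
        loss = model(full_ids, labels=labels).loss
    return math.exp(loss.item())

def pick_answer(question: str, options: list[str]) -> str:
    # The option with the lowest fill-in-the-blank perplexity wins.
    return min(options, key=lambda o: option_ppl(question, o))
```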
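The abstract also proposes Kendall's rank correlation to quantify evaluation stability and consistency, e.g. whether ranking base models by benchmark score predicts the ranking of the corresponding instruct models. Below is a minimal sketch of that measurement using `scipy.stats.kendalltau`; the scores are made-up illustration data, not results from the paper.

```python
# Kendall's tau between a base-model ranking and an instruct-model ranking,
# as a consistency measure in the spirit of the abstract. The scores are
# fabricated for illustration only.
from scipy.stats import kendalltau

# Hypothetical benchmark scores for four base models and their instruct versions.
base_scores = [42.1, 55.3, 48.7, 61.0]
instruct_scores = [58.4, 66.2, 63.9, 71.5]

tau, p_value = kendalltau(base_scores, instruct_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
# tau near 1 means the base-model ranking reliably predicts the
# instruct-model ranking; the paper reports that BOSE improves this
# consistency relative to standard base-model evaluation.
```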