DavIR: Data Selection via Implicit Reward for Large Language Models

Haotian Zhou; Tingkai Liu; Qianli Ma; Yufeng Zhang; Jianbo Yuan; Pengfei Liu; Yang You; Hongxia Yang

doi:10.18653/v1/2025.acl-long.452

Abstract

We introduce DavIR, a model-based data selection method for post-training Large Language Models. DavIR generalizes Reducible Holdout Loss to the core-set selection problem of causal language modeling, and quantifies the learnability of a given datum with respect to a pre-trained LLM based on the relative reduction in loss during fine-tuning, a metric we show to be closely related to the implicit reward model described in Direct Preference Optimization (DPO). We show that 6% of the Alpaca dataset selected with DavIR can steer both the LLaMA and Gemma model families to produce superior performance compared to the same models trained on the full 52K dataset. We also show that the Alpaca dataset compressed with DavIR can be combined with the GSM8K dataset to effectively balance open-domain freeform QA and mathematical reasoning capabilities. Finally, we apply the DavIR objective to DPO and develop a normalized DavIR-DPO objective which improves the alignment performance of the Zephyr-7B-SFT model by 8% (relative) on AlpacaEval, compared against training on the vanilla DPO objective.
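The selection idea in the abstract (score each datum by its relative reduction in loss during fine-tuning, then keep the top-scoring fraction) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `relative_loss_reduction` and `select_top_fraction`, the exact normalization `(base - tuned) / base`, and the assumption that per-example losses have already been computed with the pre-trained and fine-tuned models are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def relative_loss_reduction(
    base_losses: Sequence[float],
    tuned_losses: Sequence[float],
    eps: float = 1e-8,
) -> List[float]:
    """Score each example by (base - tuned) / base.

    base_losses:  per-example loss under the pre-trained model (assumed given).
    tuned_losses: per-example loss under the fine-tuned model (assumed given).
    The normalization used here is an assumption, not taken from the paper.
    """
    if len(base_losses) != len(tuned_losses):
        raise ValueError("loss sequences must have the same length")
    # eps guards against division by zero when a base loss is 0.
    return [(b - t) / max(b, eps) for b, t in zip(base_losses, tuned_losses)]


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Return the indices of the highest-scoring `fraction` of examples."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in their original dataset order.
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy losses for four examples (made-up numbers, for illustration only).
    base = [2.0, 1.0, 4.0, 3.0]
    tuned = [1.0, 0.9, 1.0, 3.0]
    scores = relative_loss_reduction(base, tuned)
    print(select_top_fraction(scores, fraction=0.5))
```

Under this sketch, a call such as `select_top_fraction(scores, 0.06)` would correspond to keeping 6% of a dataset, the compression ratio reported for Alpaca in the abstract. How the losses are aggregated per example (summed or averaged over tokens) is left open here.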

Anthology ID: 2025.acl-long.452
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 9220–9237
URL: https://aclanthology.org/2025.acl-long.452/
DOI: 10.18653/v1/2025.acl-long.452
Cite (ACL): Haotian Zhou, Tingkai Liu, Qianli Ma, Yufeng Zhang, Jianbo Yuan, Pengfei Liu, Yang You, and Hongxia Yang. 2025. DavIR: Data Selection via Implicit Reward for Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9220–9237, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): DavIR: Data Selection via Implicit Reward for Large Language Models (Zhou et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.452.pdf
