From Individual to Common: An Early Exploration of Consensus in Non-verifiable Data for Balanced Preference Optimization

Shangjian Yin (尹商鉴); Zhouxing Shi

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective at improving the objective performance (e.g., reasoning) of Large Language Models (LLMs) through rule-based, on-policy self-improvement strategies. However, optimizing LLMs for subjective capabilities and alignment with human preferences remains challenging because these qualities are non-verifiable. Most prior work uses datasets of response pairs with substantial quality gaps, labeled by a strong external judge. While effective for preference metrics, this paradigm often incurs an “alignment tax”, where the model’s objective performance on downstream tasks degrades as it overfits to subjective preferences. In this work, we introduce Donkey, a high-quality, non-verifiable dataset in which response pairs differ only by subtle nuances. We find that LLMs optimized on Donkey via preference learning outperform those trained on data with explicit quality gaps, while maintaining their objective capabilities. Furthermore, we observe that preference signals on Donkey can be decomposed into consensus preferences and individual preferences. Our analysis reveals that distilling consensus preferences provides a significantly more data-efficient signal for preference optimization. These findings underscore the importance of leveraging nuanced preference signals and the consensus of multiple judges for advancing subjective LLM alignment. Our code and data will be available at https://github.com/SJY8460/Donkey.

Anthology ID: 2026.acl-long.1598
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 34612–34630
URL: https://aclanthology.org/2026.acl-long.1598/
Cite (ACL): Shangjian Yin and Zhouxing Shi. 2026. From Individual to Common: An Early Exploration of Consensus in Non-verifiable Data for Balanced Preference Optimization. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34612–34630, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): From Individual to Common: An Early Exploration of Consensus in Non-verifiable Data for Balanced Preference Optimization (Yin & Shi, ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1598.pdf
Checklist: 2026.acl-long.1598.checklist.pdf
