Scaling Unverifiable Rewards: A Case Study on Visual Insights

Shuyu Gan; James Mooney; Pan Hao; Renxiang Wang; Mingyi Hong; Qianwen Wang; Dongyeop Kang

doi:10.18653/v1/2026.findings-acl.1724

Abstract

Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), an iterative refinement process guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to error accumulation across stages. We propose Selective TTS, a process-based refinement framework that scales inference across the stages of a multi-agent pipeline, instead of repeatedly refining a single output over time as in prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in a data science workflow, we build an end-to-end multi-agent pipeline for generating visually insightful reports from a given dataset, and design a reliable LLM-based judge model that aligns with human experts (Kendall’s 𝜏=0.55) to evaluate them. Selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 (baseline) to 65.86 while reducing variance. We hope our findings serve as a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery. Our code and generated reports are publicly available at https://minnesotanlp.github.io/insight-scaling-webpage.
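
The stage-wise pruning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is essentially a beam search with one scoring function per stage, and the names `selective_tts`, `branch`, `count_a`, and the `keep` parameter are hypothetical, chosen only for illustration. The toy stage and judge stand in for LLM agents and LLM-based judges.

```python
from typing import Callable, List, Sequence

Candidate = str
Stage = Callable[[Candidate], List[Candidate]]  # expands one candidate into several
Judge = Callable[[Candidate], float]            # scores a candidate for one stage


def selective_tts(
    seed: Candidate,
    stages: Sequence[Stage],
    judges: Sequence[Judge],
    keep: int,
) -> List[Candidate]:
    """Expand candidates stage by stage, keeping the top `keep` after each stage."""
    frontier = [seed]
    for expand, judge in zip(stages, judges):
        expanded = [child for parent in frontier for child in expand(parent)]
        # Prune early: only the highest-scoring branches reach the next stage.
        frontier = sorted(expanded, key=judge, reverse=True)[:keep]
    return frontier


# Toy stand-ins (assumptions): a stage that branches twice, a judge that counts "a".
def branch(candidate: Candidate) -> List[Candidate]:
    return [candidate + "a", candidate + "b"]


def count_a(candidate: Candidate) -> float:
    return float(candidate.count("a"))


best = selective_tts("", [branch, branch], [count_a, count_a], keep=1)
top_two = selective_tts("", [branch, branch], [count_a, count_a], keep=2)
```

With `keep=1`, only one branch survives each stage, so the second stage expands two candidates instead of four; this is the sense in which pruning early saves compute for later stages.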

Anthology ID: 2026.findings-acl.1724
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 34537–34569
URL: https://aclanthology.org/2026.findings-acl.1724/
DOI: 10.18653/v1/2026.findings-acl.1724
Cite (ACL): Shuyu Gan, James Mooney, Pan Hao, Renxiang Wang, Mingyi Hong, Qianwen Wang, and Dongyeop Kang. 2026. Scaling Unverifiable Rewards: A Case Study on Visual Insights. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34537–34569, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Scaling Unverifiable Rewards: A Case Study on Visual Insights (Gan et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1724.pdf
Checklist: 2026.findings-acl.1724.checklist.pdf
