BibTeX
@inproceedings{lattimer-etal-2025-sparse,
title = "Sparse Rewards Can Self-Train Dialogue Agents",
author = "Lattimer, Barrett Martin and
Gangal, Varun Prashant and
McDonald, Ryan and
Yang, Yi",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.1302/",
doi = "10.18653/v1/2025.findings-acl.1302",
pages = "25395--25413",
ISBN = "979-8-89176-256-5",
abstract = "Recent advancements in state-of-the-art (SOTA) Large Language Model (LLM) agents, especially in multi-turn dialogue tasks, have been primarily driven by supervised fine-tuning and high-quality human feedback. However, as base LLM models continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback-driven methods impractical. In this paper, we introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self-alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ. We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks. Our code and data are publicly available on GitHub."
}

MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="lattimer-etal-2025-sparse">
    <titleInfo>
      <title>Sparse Rewards Can Self-Train Dialogue Agents</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Barrett</namePart>
      <namePart type="given">Martin</namePart>
      <namePart type="family">Lattimer</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Varun</namePart>
      <namePart type="given">Prashant</namePart>
      <namePart type="family">Gangal</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Ryan</namePart>
      <namePart type="family">McDonald</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Yi</namePart>
      <namePart type="family">Yang</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-07</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Findings of the Association for Computational Linguistics: ACL 2025</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Wanxiang</namePart>
        <namePart type="family">Che</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Joyce</namePart>
        <namePart type="family">Nabende</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Ekaterina</namePart>
        <namePart type="family">Shutova</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Mohammad</namePart>
        <namePart type="given">Taher</namePart>
        <namePart type="family">Pilehvar</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Vienna, Austria</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-256-5</identifier>
    </relatedItem>
    <abstract>Recent advancements in state-of-the-art (SOTA) Large Language Model (LLM) agents, especially in multi-turn dialogue tasks, have been primarily driven by supervised fine-tuning and high-quality human feedback. However, as base LLMs continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback-driven methods impractical. In this paper, we introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self-alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ. We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks. Our code and data are publicly available on GitHub.</abstract>
<identifier type="citekey">lattimer-etal-2025-sparse</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.1302</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.1302/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>25395</start>
<end>25413</end>
</extent>
</part>
</mods>
</modsCollection>

Endnote
%0 Conference Proceedings
%T Sparse Rewards Can Self-Train Dialogue Agents
%A Lattimer, Barrett Martin
%A Gangal, Varun Prashant
%A McDonald, Ryan
%A Yang, Yi
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F lattimer-etal-2025-sparse
%X Recent advancements in state-of-the-art (SOTA) Large Language Model (LLM) agents, especially in multi-turn dialogue tasks, have been primarily driven by supervised fine-tuning and high-quality human feedback. However, as base LLMs continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback-driven methods impractical. In this paper, we introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self-alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ. We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks. Our code and data are publicly available on GitHub.
%R 10.18653/v1/2025.findings-acl.1302
%U https://aclanthology.org/2025.findings-acl.1302/
%U https://doi.org/10.18653/v1/2025.findings-acl.1302
%P 25395-25413

Markdown (Informal)
[Sparse Rewards Can Self-Train Dialogue Agents](https://aclanthology.org/2025.findings-acl.1302/) (Lattimer et al., Findings 2025)

ACL
Barrett Martin Lattimer, Varun Prashant Gangal, Ryan McDonald, and Yi Yang. 2025. Sparse Rewards Can Self-Train Dialogue Agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25395–25413, Vienna, Austria. Association for Computational Linguistics.
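
The abstract describes JOSH only at a high level: sample dialogues in a simulation environment that pays a sparse reward, extract the behaviors that earned it, and train the model on its own successful outputs. As a reading aid, here is a minimal sketch of that kind of loop. It is an assumption-laden illustration, not the paper's implementation: the `agent`/`env` interfaces (`act`, `finetune_on`, `reset`, `step`) and the `rollout` and `josh_style_self_train` helpers are hypothetical names.

```python
# Hypothetical sketch of a sparse-reward self-training loop in the spirit
# of the JOSH description above. Every interface here is an assumed stand-in.

def rollout(agent, env, max_turns=8):
    """Run one simulated dialogue; return (turns, reward).

    `env` is assumed to expose reset()/step() and to pay a sparse reward
    (e.g., 1.0) only when the dialogue actually reaches its goal.
    """
    state = env.reset()
    turns = []
    for _ in range(max_turns):
        action = agent.act(state)            # agent proposes the next turn / tool call
        state, reward, done = env.step(action)
        turns.append((state, action))
        if done:
            return turns, reward
    return turns, 0.0                        # goal never reached: no reward


def josh_style_self_train(agent, env, n_rollouts=64, n_rounds=3):
    """Harvest rewarded trajectories and fine-tune the agent on them."""
    for _ in range(n_rounds):
        winning_turns = []
        for _ in range(n_rollouts):
            turns, reward = rollout(agent, env)
            if reward > 0:                   # keep only turns from rewarded paths
                winning_turns.extend(turns)
        agent.finetune_on(winning_turns)     # train the model on its own good outputs
    return agent
```

The key design point, as far as the abstract states it, is that supervision comes from the environment's sparse reward rather than from human feedback: successful rollouts are filtered in (a rejection-sampling-style harvest) and everything else is discarded before training.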