ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Ofer Meshi; Krisztian Balog; Sally Goldman; Avi Caciularu; Guy Tennenholtz; Jihwan Jeong; Amir Globerson; Craig Boutilier

ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz, Jihwan Jeong, Amir Globerson, Craig Boutilier

Abstract

The promise of *LLM-based user simulators* to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce *ConvApparel*, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol, using both "good" and "bad" recommenders, enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction.We propose a comprehensive validation framework that combines *statistical alignment*, a *human-likeness score*, and *counterfactual validation* to test for generalization.Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.

Anthology ID:: 2026.eacl-long.244
Volume:: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: March
Year:: 2026
Address:: Rabat, Morocco
Editors:: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:: EACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 5270–5304
Language:
URL:: https://aclanthology.org/2026.eacl-long.244/
DOI:
Bibkey:
Cite (ACL):: Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz, Jihwan Jeong, Amir Globerson, and Craig Boutilier. 2026. ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5270–5304, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):: ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders (Meshi et al., EACL 2026)
Copy Citation:
PDF:: https://aclanthology.org/2026.eacl-long.244.pdf
Checklist:: 2026.eacl-long.244.checklist.pdf

PDF Cite Search Checklist Fix data