Weixin Cai
2025
SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
Yao Dou | Michel Galley | Baolin Peng | Chris Kedzie | Weixin Cai | Alan Ritter | Chris Quirk | Wei Xu | Jianfeng Gao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) are increasingly used in interactive applications, and human evaluation remains the gold standard for assessing their performance in multi-turn conversations. Since human studies are costly, time-consuming, and hard to reproduce, recent work explores using LLMs to simulate users for automatic assistant evaluation. However, there is no benchmark or systematic study to evaluate whether these simulated users are reliable stand-ins for real users. To address this, we introduce SimulatorArena, a benchmark of 909 annotated human–LLM conversations on two interactive tasks—math tutoring and document creation. SimulatorArena evaluates simulators based on how closely their messages match human behavior and how well their assistant ratings align with human judgments. Experiments on various simulator methods show that simulators conditioned on user profiles, capturing traits like background and message styles, align closely with human judgments. They reach Spearman’s 𝜌 of 0.7 on both tasks, providing a practical, scalable alternative to human evaluation. Using the best simulator for each task, we benchmark 18 assistants, including the latest LLMs such as GPT-5, Claude 4.1 Opus, and Gemini 2.5 Pro.
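As a rough illustration of the alignment metric mentioned in the abstract above, the sketch below computes Spearman's ρ between human and simulator ratings of a set of assistants. The assistant names and scores are invented for illustration and are not SimulatorArena data or code.

```python
# Minimal sketch: Spearman's rho between assistant ratings from human users and
# ratings produced when a simulated user replaces the human.
# All names and numbers are illustrative assumptions.
from scipy.stats import spearmanr

# Hypothetical per-assistant average ratings on one task (e.g., math tutoring).
human_ratings = {"assistant_a": 4.2, "assistant_b": 3.1, "assistant_c": 4.6, "assistant_d": 2.8}
simulator_ratings = {"assistant_a": 4.0, "assistant_b": 3.4, "assistant_c": 4.5, "assistant_d": 2.9}

assistants = sorted(human_ratings)
rho, p_value = spearmanr(
    [human_ratings[a] for a in assistants],
    [simulator_ratings[a] for a in assistants],
)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")  # higher rho = closer alignment with human judgments
```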
2023
Interactive Text Generation
Felix Faltings | Michel Galley | Kianté Brantley | Baolin Peng | Weixin Cai | Yizhe Zhang | Jianfeng Gao | Bill Dolan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Users interact with text, image, code, or other editors on a daily basis. However, machine learning models are rarely trained in the settings that reflect the interactivity between users and their editor. This is understandable as training AI models with real users is not only slow and costly, but what these models learn may be specific to user interface design choices. Unfortunately, this means most of the research on text, code, and image generation has focused on non-interactive settings, whereby the model is expected to get everything right without accounting for any input from a user who may be willing to help. We introduce a new Interactive Text Generation task that allows training generation models interactively without the costs of involving real users, by using user simulators that provide edits that guide the model towards a given target text. We train our interactive models using Imitation Learning, and our experiments against competitive non-interactive generation models show that models trained interactively are superior to their non-interactive counterparts, even when all models are given the same budget of user inputs or edits.
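The toy sketch below illustrates the interactive loop described above, under the assumption that one "user edit" is a single inserted or corrected word. None of the function names come from the paper; a real system would use a trained editor model in place of `toy_model` and richer edits from the simulator.

```python
# Toy sketch of interactive text generation with a user simulator (assumptions,
# not the paper's implementation): the simulator compares the model's draft to a
# target text and supplies one word-level edit per turn; the resulting
# (state, target) pairs can be used as imitation-learning supervision.

def simulated_user_edit(draft: list[str], target: list[str]) -> tuple[int, str] | None:
    """Return the position and word of the first mismatch between draft and target."""
    for i, word in enumerate(target):
        if i >= len(draft) or draft[i] != word:
            return i, word
    return None  # draft already equals the target


def toy_model(draft: list[str]) -> list[str]:
    """Stand-in generator: a real model would propose a full revised draft here."""
    return list(draft)


target = "the quick brown fox jumps".split()
draft: list[str] = []
imitation_data = []  # (input state, expert target) pairs for imitation learning

for _ in range(len(target)):
    edit = simulated_user_edit(draft, target)
    if edit is None:
        break
    pos, word = edit
    draft = draft[:pos] + [word] + draft[pos:]    # apply the simulated user's edit
    imitation_data.append((list(draft), target))  # supervise the model toward the target
    draft = toy_model(draft)                      # model refines the draft (no-op in this toy)

print(" ".join(draft))  # -> "the quick brown fox jumps"
print(len(imitation_data), "training pairs collected without a real user")
```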