Valerie Chen
2026
Coding Agents with Multimodal Browsing are Generalist Problem Solvers
Aditya Bharat Soni | Boxuan Li | Xingyao Wang | Valerie Chen | Graham Neubig
Findings of the Association for Computational Linguistics: EACL 2026
Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. Similarly, specialized AI agents with task-specific tools or architectures often fail to generalize beyond their intended scope. In this work, we ask: *can agents achieve generalizability across diverse domains with a small but well-chosen set of general tools?* We propose OpenHands-Versa, a single-agent system equipped with a modest number of general tools — code execution, a search engine, a web browser, and a multimodal file viewer — for three practical domains: software engineering, deep research, and web browsing. Notably, OpenHands-Versa demonstrates superior or competitive performance over task-specific specialized agents on three challenging benchmarks: SWE-Bench Multimodal, GAIA, and TheAgentCompany, with absolute improvements in success rate of **9.1**, **1.3**, and **9.1** points, respectively. Thus, our *single-agent* system can achieve strong generalization, indicating that specialist agents for these domains provide no practical benefit. Furthermore, we find that specialist multi-agent systems do not generalize beyond their intended scope. These findings establish OpenHands-Versa as a strong baseline for future research.
2025
When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback
Jane Pan | Ryan Shar | Jacob Pfau | Ameet Talwalkar | He He | Valerie Chen
Findings of the Association for Computational Linguistics: ACL 2025
Programming with a coding assistant is a fundamentally interactive process, yet existing static benchmarks fail to capture key features of model-user collaboration. We introduce an interactive evaluation pipeline to examine how LLMs incorporate different types of feedback in a collaborative setting, in which we obfuscate the input of static coding benchmarks so that the code model must interact with a simulated user. Across 10 models and 3 datasets, the relative rankings of models often shift substantially between static and interactive settings, despite models being fairly robust to feedback that contains errors. We also observe that feedback types with similar overall effectiveness differ in how models respond to higher- vs. lower-quality feedback. Moreover, feedback type impacts the degree to which models make aesthetic or behavioral edits to their output. Our work aims to “re-evaluate” model coding capabilities through an interactive lens, bridging the gap between existing evaluations and real-world usage.
2024
Do LLMs Exhibit Human-like Response Biases? A Case Study in Survey Design
Lindia Tjuatja | Valerie Chen | Tongshuang Wu | Ameet Talwalkar | Graham Neubig
Transactions of the Association for Computational Linguistics, Volume 12
One widely cited barrier to the adoption of LLMs as proxies for humans in subjective tasks is their sensitivity to prompt wording — but interestingly, humans also display sensitivities to instruction changes in the form of response biases. We investigate the extent to which LLMs reflect human response biases, if at all. We look to survey design, where human response biases caused by changes in the wordings of “prompts” have been extensively explored in the social psychology literature. Drawing from these works, we design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior, particularly in models that have undergone RLHF. Furthermore, even if a model shows a significant change in the same direction as humans, it can be sensitive to perturbations that do not elicit significant changes in humans. These results highlight the pitfalls of using LLMs as human proxies and underscore the need for finer-grained characterizations of model behavior.