AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Ori Yoran; Samuel Joseph Amouyal; Chaitanya Malaviya; Ben Bogin; Ofir Press; Jonathan Berant

doi:10.18653/v1/2024.emnlp-main.505

Abstract

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well in terms of accuracy, they exhibit low precision and tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that open web navigation remains a major challenge.

Anthology ID: 2024.emnlp-main.505
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 8938–8968
URL: https://aclanthology.org/2024.emnlp-main.505/
DOI: 10.18653/v1/2024.emnlp-main.505
Cite (ACL): Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. 2024. AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8938–8968, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? (Yoran et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.505.pdf
