You Only Look at Screens: Multimodal Chain-of-Action Agents

Zhuosheng Zhang, Aston Zhang


Abstract
Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models (LLMs) for effective engagement in diverse environments. To align with the input-output requirement of LLMs, most existing approaches are developed under a sandbox setting where they rely on external tools and application-specific APIs to parse the environment into textual elements and interpret the predicted actions. Consequently, those approaches often grapple with inference inefficiency and error propagation risks. To mitigate the challenges, we introduce Auto-GUI, a multimodal solution that directly interacts with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique—leveraging a series of intermediate previous action histories and future action plans—to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark AITW with 30K unique instructions, spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-GUI achieves state-of-the-art performance with an action type prediction accuracy of 90% and an overall action success rate of 74%. Code is publicly available at https://github.com/cooelf/Auto-GUI.
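As a rough illustration of the chain-of-action idea described in the abstract, the sketch below keeps a record of previously executed actions together with a plan of future actions and renders both as text that could accompany a screenshot. It is not the authors' implementation: the class name `ChainOfAction`, its fields, the `history_window` parameter, and the action strings are all hypothetical, and the screen image and the model itself are omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChainOfAction:
    """Hypothetical container: actions already taken plus actions planned next."""
    goal: str
    previous_actions: List[str] = field(default_factory=list)
    future_plan: List[str] = field(default_factory=list)

    def to_prompt(self, history_window: int = 8) -> str:
        # Keep only the most recent actions (assumed truncation policy).
        start = max(0, len(self.previous_actions) - history_window)
        recent = self.previous_actions[start:]
        history_text = "; ".join(recent) or "none"
        plan_text = "; ".join(self.future_plan) or "none"
        return (
            f"Goal: {self.goal}\n"
            f"Previous actions: {history_text}\n"
            f"Action plan: {plan_text}"
        )

    def step(self, executed: str, new_plan: List[str]) -> None:
        # Record the action that was just executed and replace the plan
        # with the newly predicted one.
        self.previous_actions.append(executed)
        self.future_plan = list(new_plan)


# Usage: the action strings are made-up placeholders, not the AITW action format.
chain = ChainOfAction(goal="Turn on Wi-Fi")
chain.step("click(settings_icon)", ["scroll(down)", "click(wifi_toggle)"])
prompt_text = chain.to_prompt()
```

In a full agent, text like `prompt_text` would be paired with the current screenshot at every step, so that the next action is chosen with both the history and the plan in view.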
Anthology ID: 2024.findings-acl.186
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3132–3149
URL: https://aclanthology.org/2024.findings-acl.186
DOI: 10.18653/v1/2024.findings-acl.186
Cite (ACL): Zhuosheng Zhang and Aston Zhang. 2024. You Only Look at Screens: Multimodal Chain-of-Action Agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3132–3149, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): You Only Look at Screens: Multimodal Chain-of-Action Agents (Zhang & Zhang, Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.186.pdf