ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents

Jakub Hoscilowicz, Artur Janicki


Abstract
With the growing reliance on digital devices with graphical user interfaces (GUIs), such as computers and smartphones, the demand for smart voice assistants has grown significantly. While multimodal large language models (MLLMs) like GPT-4V excel in many areas, they struggle with GUI interactions, limiting their effectiveness in automating everyday tasks. In this work, we introduce ClickAgent, a novel framework for building autonomous agents. ClickAgent combines MLLM-driven reasoning and action planning with a separate UI location model that identifies relevant UI elements on the screen. This approach addresses a key limitation of current MLLMs: their inability to accurately locate UI elements. Evaluations conducted using both an Android emulator and a real smartphone show that ClickAgent outperforms other autonomous agents (DigiRL, CogAgent, AppAgent) on the AITW benchmark.
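The two-model architecture described above can be sketched as a simple agent loop: a planner (the MLLM's role) proposes the next action in natural language, and a separate UI location model resolves the target element to screen coordinates before the action is executed. The sketch below is purely illustrative; the class and function names are assumptions, not ClickAgent's actual API, and both models are replaced with stand-in stubs.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "tap", "type", "done"
    target: str = ""   # natural-language description of the UI element
    x: int = -1        # screen coordinates, filled in by the locator
    y: int = -1

def plan_next_action(task: str, screenshot: bytes, history: list) -> Action:
    """Stand-in for the MLLM planner: decides what to do next.
    A real planner would reason over the task, screenshot, and history."""
    if any(a.kind == "tap" for a in history):
        return Action(kind="done")
    return Action(kind="tap", target="search button")

def locate(description: str, screenshot: bytes) -> tuple:
    """Stand-in for the dedicated UI location model: maps a textual
    element description to (x, y) screen coordinates."""
    known = {"search button": (540, 120)}  # hypothetical lookup
    return known.get(description, (0, 0))

def run_agent(task: str, screenshot: bytes, max_steps: int = 10) -> list:
    """Agent loop: plan, locate, (execute), repeat until done."""
    history = []
    for _ in range(max_steps):
        action = plan_next_action(task, screenshot, history)
        if action.kind == "done":
            break
        if action.kind == "tap":
            action.x, action.y = locate(action.target, screenshot)
        history.append(action)  # a real agent would also tap the device here
    return history

trace = run_agent("open the search page", b"")
```

The key design point this sketch captures is the separation of concerns: the planner never emits raw coordinates, so the locator can be a smaller model specialized for grounding UI descriptions on the screen.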
Anthology ID:
2025.sigdial-1.38
Volume:
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month:
August
Year:
2025
Address:
Avignon, France
Editors:
Frédéric Béchet, Fabrice Lefèvre, Nicholas Asher, Seokhwan Kim, Teva Merlin
Venue:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
471–476
URL:
https://aclanthology.org/2025.sigdial-1.38/
Cite (ACL):
Jakub Hoscilowicz and Artur Janicki. 2025. ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents. In Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 471–476, Avignon, France. Association for Computational Linguistics.
Cite (Informal):
ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents (Hoscilowicz & Janicki, SIGDIAL 2025)
PDF:
https://aclanthology.org/2025.sigdial-1.38.pdf