Artur Janicki


2025

ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents
Jakub Hoscilowicz | Artur Janicki
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

With the growing reliance on digital devices with graphical user interfaces (GUIs), such as computers and smartphones, the demand for smart voice assistants has risen significantly. While multimodal large language models (MLLMs) like GPT-4V excel in many areas, they struggle with GUI interactions, which limits their effectiveness in automating everyday tasks. In this work, we introduce ClickAgent, a novel framework for building autonomous agents. ClickAgent combines MLLM-driven reasoning and action planning with a separate UI location model that identifies relevant UI elements on the screen. This approach addresses a key limitation of current MLLMs: their inability to accurately locate UI elements. Evaluations conducted using both an Android emulator and a real smartphone show that ClickAgent outperforms other autonomous agents (DigiRL, CogAgent, AppAgent) on the AITW benchmark.
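
The division of labour described in the abstract (an MLLM that plans the next action, and a separate model that locates the target UI element on screen) can be illustrated with a minimal sketch. This is not the authors' implementation: the names Action, run_episode, planner, locator, get_screenshot, and execute are all hypothetical, and the planner and locator are passed in as plain callables so that no particular model or device API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Coords = Tuple[int, int]


@dataclass
class Action:
    """Hypothetical action format produced by the planning model."""
    kind: str                     # "click", "type", or "done"
    target: Optional[str] = None  # natural-language description of a UI element
    text: Optional[str] = None    # text to enter for "type" actions


def run_episode(
    task: str,
    get_screenshot: Callable[[], bytes],
    planner: Callable[[str, bytes, list], Action],  # stands in for the MLLM
    locator: Callable[[bytes, str], Coords],        # stands in for the UI location model
    execute: Callable[[Action, Optional[Coords]], None],
    max_steps: int = 10,
) -> List[Tuple[Action, Optional[Coords]]]:
    """Plan with one model, locate with another, then act on the device."""
    history: List[Tuple[Action, Optional[Coords]]] = []
    for _ in range(max_steps):
        screen = get_screenshot()
        # Step 1: the planner decides WHAT to do next.
        action = planner(task, screen, history)
        if action.kind == "done":
            break
        # Step 2: the locator decides WHERE on the screen to do it.
        coords = None
        if action.kind == "click" and action.target is not None:
            coords = locator(screen, action.target)
        # Step 3: carry out the action (emulator or real device).
        execute(action, coords)
        history.append((action, coords))
    return history
```

Keeping the locator behind its own interface means the planner only has to describe the element in words (for example "Settings icon"), which is the step the abstract says current MLLMs handle poorly when asked to output coordinates themselves.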