UI-Hawk: Unleashing the Screen Stream Understanding for Mobile GUI Agents

Jiwen Zhang (张霁雯); Ya-Qi Yu; Minghui Liao; Wentao Li; Jihao Wu; Zhongyu Wei (魏忠钰)

Abstract

Graphical User Interface (GUI) agents are expected to operate precisely on the screens of digital devices. Existing GUI agents depend only on the current visual observation and a plain-text action history, ignoring the significance of historical screens. To mitigate this issue, we propose **UI-Hawk**, a multi-modal GUI agent specially designed to process the screen streams encountered during GUI navigation. UI-Hawk incorporates a history-aware visual encoder to handle screen sequences. To acquire a better understanding of screen streams, we select four fundamental tasks: UI grounding, UI referring, screen question answering, and screen summarization. We further propose a curriculum learning strategy that progressively guides the model from fundamental tasks to advanced screen-stream comprehension. Alongside these efforts, we have also created a benchmark, FunUI, to quantitatively evaluate the fundamental screen understanding ability of multimodal large language models (MLLMs). Extensive experiments on FunUI and GUI navigation benchmarks consistently validate that screen stream understanding is essential for GUI tasks. Our code and data are available at https://github.com/IMNearth/UIHawk.

Anthology ID: 2025.emnlp-main.920
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 18228–18247
URL: https://aclanthology.org/2025.emnlp-main.920/
Cite (ACL): Jiwen Zhang, Ya-Qi Yu, Minghui Liao, WenTao Li, Jihao Wu, and Zhongyu Wei. 2025. UI-Hawk: Unleashing the Screen Stream Understanding for Mobile GUI Agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18228–18247, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): UI-Hawk: Unleashing the Screen Stream Understanding for Mobile GUI Agents (Zhang et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.920.pdf
Checklist: 2025.emnlp-main.920.checklist.pdf