MUG: Interactive Multimodal Grounding on User Interfaces

Tao Li, Gang Li, Jingjie Zheng, Purple Wang, Yang Li


Abstract
We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen. Prior work modeled multimodal UI grounding in one round: the user gives a command and the agent responds to the command. Yet, in a realistic scenario, a user command can be ambiguous when the target action is inherently difficult to articulate in natural language. MUG allows multiple rounds of interaction such that, upon seeing the agent's responses, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performance in real-world use cases. To investigate the problem, we create a new dataset that consists of 77,820 sequences of human user-agent interaction on mobile interfaces, of which 20% involve multiple rounds of interaction. To establish a benchmark, we experiment with a range of modeling variants and evaluation strategies, including both offline and online evaluation; the online strategy consists of both human evaluation and automatic evaluation with simulators. Our experiments show that iterative interaction significantly improves the absolute task completion by 18% over the entire test set and 31% over the challenging split. Our results lay the foundation for further investigation of the problem.
Anthology ID:
2024.findings-eacl.17
Volume:
Findings of the Association for Computational Linguistics: EACL 2024
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
231–251
URL:
https://aclanthology.org/2024.findings-eacl.17
Cite (ACL):
Tao Li, Gang Li, Jingjie Zheng, Purple Wang, and Yang Li. 2024. MUG: Interactive Multimodal Grounding on User Interfaces. In Findings of the Association for Computational Linguistics: EACL 2024, pages 231–251, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
MUG: Interactive Multimodal Grounding on User Interfaces (Li et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-eacl.17.pdf